You may have noticed that every one of our replication reports includes a “Study Diagram” near the top. Our Study Diagrams lay out the hypotheses, exactly what participants did in the study, the key findings, and whether those findings replicated.
Why create a Study Diagram?
We create a Study Diagram for each of our reports because we believe that readers should be presented with the key points of the hypotheses, methods, and results in a consistent format that can be understood at a glance. We do this because clear communication is essential to the scientific process functioning well.
Too often in the research literature key pieces of information are spread throughout the text of a paper, making it more time-consuming and difficult to get a clear overall picture of a study. Sometimes the paper itself doesn’t include all of the necessary information, and readers have to refer to supplemental materials to understand what was actually done. This makes it harder for people to find relevant studies, evaluate their claims, and put the information in them to use.
In contrast, imagine a world in which all published empirical research had a Study Diagram. Understanding the gist of a paper would be faster because the Study Diagram takes much less time to review than the whole paper, while also being more standardized and informative than a typical abstract. The Study Diagram would improve the clarity of published research, making it easier to evaluate how well the claims made in the paper correspond to what is being done in the study itself. This would make it easier to identify possible overclaiming or validity issues that can be signs of Importance Hacking. Finally, it would become much easier to sort through literature to find studies that are relevant to your question. At a glance you would be able to compare key features of studies, like their sample size, exclusion criteria, and whether participants were randomly assigned to conditions.
Our goal at Transparent Replications is to incentivize practices that improve quality and robustness of psychology research. We see Study Diagrams as one of those best practices, and would like to see them become widely adopted in the field.
If you’d like to include a Study Diagram in your research, the sections below walk you through how to create one.
A Study Diagram has three parts:
Hypotheses – a few sentences in plain language explaining the main hypotheses being tested in the study.
Flowchart of the study – the core of the diagram including information about participants, conditions, study tasks, and exclusions.
Table of findings – a list of results for the key findings.
Making the flowchart of the study
Participants
The first box includes the number of participants, type of participants, and how they participated.
Although this is typically straightforward, here are two things to pay attention to when reporting on participants:
Sample criteria filtering and stratifying – If a sample is limited by certain characteristics, this box is where that information belongs. If an eligibility filter is being used to only collect data from certain subgroups of the population, or to collect a certain number of participants in certain categories, that information also belongs here.
Completed vs. started – Depending on the task and the method being used for data collection it may make sense to report only the number of participants who completed the task, or all of the participants who started the task whether they completed it or not. Either option can be reasonable, but make sure to pay attention here so that the number you are reporting is accurate.
Here’s an example from Report #10 for a study with only one experimental condition, but with more complex requirements for participants:
Study tasks
The next section outlines the tasks that participants did in the study. This section might be one box or a few boxes depending on the complexity of the experimental design. The example diagrams above are for a study with simple randomization to two experimental conditions, and a study with a single condition. The example below is from Report #6 for a study with more complex randomization to multiple conditions:
This section always starts with any initial parts of the experiment that all participants see or complete. Then it goes into the main task which, for studies with multiple conditions, is represented by boxes side-by-side showing what participants in each condition see and do. Finally, if there are parts of the experiment that all participants see or do after the main task, those are presented.
Exclusions
This is the final box of the flowchart, and it reports the number of participants whose data were included in the analysis. It also indicates why other participants were excluded. If participants who completed the study are reported in the first box, then the only exclusions reported here are people who completed the study but whose data were not used for some other reason, such as not meeting eligibility criteria. If all participants who started the study are reported in the first box, the number of people who started the study but didn’t complete it would also be reported here.
Making the table of findings
The final section of the Study Diagram is the table of findings. The purpose of this table is to allow the reader to see at a glance what the study tested, and whether the results matched those expectations or not.
Determining what to include in this table can be a bit nuanced. Often there are more results calculated and reported in a paper than would be considered main findings, and including those additional results in this table can make it more difficult for readers to get the high-level overview that the Study Diagram is meant to provide. For example, results related to a manipulation check probably shouldn’t be included in this table. Additionally, if there are multiple statistical tests that pertain to the same claim, reporting those as part of a single row might make sense.
The first column lists each main claim that was tested, and the later columns present the findings in a simplified way. Typically those findings should be represented with a single word (like “more,” “less,” or “equivalent”) or with a single symbol such as +, -, or 0 to indicate a positive, negative, or null result. In our replication studies, we focus on whether the result from the original study replicated, so the table is designed to make it easy to see whether the original-result column and the replication-result column match. For a study that isn’t a replication but has pre-registered hypotheses, the table would have a column for the prediction that was made before data collection and a column for the result. If there were no predictions made in advance, the table would just report the main findings.
What isn’t included in the diagram
You may have noticed that the Study Diagram doesn’t include information about how the statistical tests were conducted. The diagram also doesn’t include actual numerical findings. When we were developing this tool, we determined that it was simply not feasible to include that information while keeping the diagram manageable and understandable. The Study Diagram is not meant to be a replacement for the entire paper.
The Study Diagram gives the reader a quick overview of what participants did and what claims were tested on that basis. The body of the paper is a better place for the level of detail required to explain the statistical methods used, and provide the detailed numerical results.
This means that the Study Diagram is a good starting point for evaluating a study, but determining whether one should have confidence in the reported findings will, of course, continue to require going beyond this tool.
*Note: Lack of replication is likely due to an experimental design decision that made the study less sensitive to detecting the effect than was anticipated when the sample size was determined.
We ran a replication of the experiment from this paper, which found that as women were exposed to more images of thin bodies, they became more likely to consider ambiguous bodies to be overweight. The finding was not replicated in our study, but this isn’t necessarily evidence against the hypothesis itself.
The study asked participants to make many rapid judgments of pictures of bodies. The bodies varied in body mass index (BMI), with a range from emaciated to very overweight. Each body was judged by participants as either “overweight” or “not overweight”. Participants were randomized into two conditions: “increasing prevalence” and “stable prevalence”. Participants in the increasing prevalence condition saw more and more thin bodies as the experiment progressed, while stable prevalence participants saw a consistent mixture of thin and overweight bodies throughout the experiment. The original study found support for its first hypothesis: compared to participants in the stable prevalence condition, participants in the increasing prevalence condition became more likely to judge ambiguous bodies as “overweight” as the experiment continued. The original paper also examined two additional hypotheses about body self-image judgements but did not find support for them; we did not include these in our replication.
The original study received a high transparency rating due to being pre-registered and having publicly available data, experimental materials, and analysis code, but could have benefitted from more robust documentation of its exclusion criteria. The primary result from the original study failed to replicate; however, this failure to replicate is likely due to an experimental design decision that made the study less sensitive to detecting the effect than anticipated. The images with BMIs in the range where the effect was most likely to occur were shown very infrequently in the increasing prevalence condition. As such, it may not constitute substantial evidence against the hypothesis itself. The clarity rating could have been improved by discussing the implications of hypotheses 2 and 3 having non-significant results for the paper’s overall claims. Clarity could also have been improved by giving the reader more information about the BMIs of the body images shown to participants and the implications of that for the experiment.
We ran a replication of the main study from: Devine, S., Germain, N., Ehrlich, S., & Eppinger, B. (2022). Changes in the prevalence of thin bodies bias young women’s judgments about body size. Psychological Science, 33(8), 1212-1225. https://doi.org/10.1177/09567976221082941
How to cite this replication report: Transparent Replications by Clearer Thinking. (2024). Report #11: Replication of a study from “Changes in the prevalence of thin bodies bias young women’s judgments about body size” (Psychological Science | Devine et al., 2022) https://replications.clearerthinking.org/replication-2022psci33-8
Key Links
Our Research Box for this replication report includes the pre-registration, de-identified data, and analysis files.
Our GitLab repository for this replication report includes the code for running the experiment.
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
All materials were publicly available or provided upon request, but some exclusion criteria deviated between pre-registration and publication.
Replicability: to what extent were we able to replicate the findings of the original study?
The original finding did not replicate. Our analysis found that the key three-way interaction between condition, trial number, and size was not statistically significant. In this case, the lack of replication is likely due to an experimental design decision that made the study less sensitive to detecting the effect than was anticipated, rather than evidence against the hypothesis itself. This means that the original simulation-based power analysis underestimated the sample size needed to detect the effect with this testing procedure.
Clarity: how unlikely is it that the study will be misinterpreted?
The discussion accurately summarizes the findings related to hypothesis 1 but does not discuss the potential implications of the lack of support for hypotheses 2 and 3. It is also easy to misinterpret how the spectrum of stimuli used in the original experiment relates to the relative body mass indexes of the images shown to participants. Graphical representations of the original data only include labels for the minimum and maximum model sizes, making it difficult to interpret the relationship between judgements and stimuli. Because readers would have difficulty determining the thin/overweight cutoff value, and the range of stimuli for which judgements were ambiguous, from the information presented in the paper, they could be left with misunderstandings about the study’s methods and results.
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
All materials are publicly available. There were some inconsistencies between the exclusion criteria reported in the paper, supplemental materials, and analysis code provided. We were able to determine the exact methods and rationale for the exclusion criteria through communication with the original authors.
2. Analysis Transparency:
Analysis code is publicly available.
3. Data availability:
The data are publicly available.
4. Preregistration:
We noted two minor deviations from the pre-registered exclusion criteria. The preregistration indicated that participants would be excluded if they recorded 5 or more trial responses where the time between the stimulus display and the participant’s response input was greater than 7 seconds. This criterion diverges slightly from both the final supplemental exclusion report and the exclusions as executed in the analysis script. Additionally, the preregistration indicated that participants with greater than 90% similar judgements across their trials would be excluded. One participant who met this criterion was included in the final analysis. Overall, these inconsistencies are minor and likely had no bearing on the results of the original study.
Summary of Study and Results
Both the original study (n = 419) and our replication (n = 201) tested for evidence of the cognitive mechanism prevalence-induced concept change as an explanation for shifting body type ideals in the context of women’s body image judgments.
The original paper tested 3 main hypotheses, but only found support for the first hypothesis. Since the original study didn’t find support for hypotheses 2 or 3, our replication focused on testing hypothesis 1: “…if the prevalence of thin bodies in the environment increases, women will be more likely to judge other women’s bodies as overweight than if this shift did not occur.” Our pre-registration of our analysis plan is available here.
Prevalence-induced concept change happens when a person starts seeing more and more cases of a specific conceptual category. For example, consider hair color. Red hair and brown hair are two different conceptual categories of hair color. Some people have hair that is obviously red or obviously brown, but there are many cases where it could go either way. We might call these in-between cases “auburn” or “reddish-brown” or even “brownish-red”. If a person starts seeing many, many other people with obviously red hair, they might start thinking of auburn hair as brown. Their conceptual category of “red hair” has shrunk to exclude the ambiguous cases.
To test prevalence-induced concept change in women’s body image, both the original study and our replication showed participants computer-generated images of women’s bodies and asked participants to judge whether they thought any given body was “overweight” or “not overweight”. The image library included 61 images, ranging from a minimum BMI of 13.19 to a maximum of 120.29. Each participant was randomly assigned to one of two conditions: stable-prevalence or increasing-prevalence. Stable-prevalence participants saw an equal 50/50 split of images of bodies with BMIs above 19.79 (the “overweight” category)[1] and images of bodies with BMIs below 19.79 (the “thin” category). Increasing-prevalence participants saw a greater and greater proportion of bodies with BMIs below 19.79 as the experiment proceeded. If participants in the increasing-prevalence condition became more likely to judge thin or ambiguous bodies as overweight in the later trials of the experiment, compared to participants in the stable-prevalence condition, that would be evidence supporting the hypothesis that prevalence-induced concept change affects body image judgements.
Overview of Main Task
During the task, participants were shown an image of a human body (all images can be found here). The body image stimulus was displayed on screen for 500 milliseconds (half of a second), followed by 500 milliseconds of a blank screen, and finally a question mark indicating to participants that it was time to input a response. Participants then recorded a binary judgment by pressing the “L” key on their keyboard to indicate “overweight” or the “A” key to indicate “not overweight”. Judgements were meant to be made quickly, between 100 and 7000 milliseconds after the stimulus, on each trial. This process was repeated for 16 blocks of 50 trials each, meaning that each participant recorded 800 responses.
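To make the trial structure concrete, here is a rough, PsychoPy-flavored Python sketch of a single trial. This is illustrative only: the actual experiment was hosted on Pavlovia (the real experiment code is in our GitLab repository), and details such as window settings, image handling, and timing implementation here are assumptions.

```python
from psychopy import core, event, visual  # illustrative desktop sketch; not the hosted experiment code

win = visual.Window(color="grey", units="height")

def run_trial(image_path):
    """One trial: 500 ms stimulus, 500 ms blank, then a '?' prompt until an A/L keypress (max 7 s)."""
    visual.ImageStim(win, image=image_path).draw()
    win.flip()
    core.wait(0.5)                      # stimulus on screen for 500 ms

    win.flip()                          # blank screen for 500 ms
    core.wait(0.5)

    visual.TextStim(win, text="?").draw()
    win.flip()                          # prompt participant to respond
    keys = event.waitKeys(maxWait=7.0, keyList=["a", "l"], timeStamped=core.Clock())
    if not keys:
        return None, None               # no response within 7 seconds
    key, rt = keys[0]
    return ("overweight" if key == "l" else "not overweight"), rt

# 16 blocks x 50 trials = 800 responses per participant
```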
Additionally, participants completed a self-assessment once before and once after the main task. For this assessment, participants chose a body image from the stimulus set that most closely resembled their own body. Participants were asked to judge the self-representative body from their first self-assessment as “overweight” or “not overweight” before completing their second and final self-assessment. These self-assessments were used for testing hypothesis 2, hypothesis 3, and the exploratory analyses in the original paper. Because we focused on hypothesis 1, we did not include self-assessment data in our analysis.
Figure 1: Example frames from the task. (A) Introduction; (B) example instruction frame; (C) block 1 start; (D) stimulus display (500 ms); (E) prompt to respond.
Results
The original study found a significant three-way interaction between condition, trial, and size (β = 3.85, SE = 0.38, p = 1.09 × 10−23), meaning that participants were more likely to judge ambiguous bodies as “overweight” as they were exposed to more thin bodies over the course of their trials. Our replication, however, did not find this interaction to be significant (β = 0.53, SE = 1.81, p = 0.26). Although it was not significant, the effect was in the correct direction, and the lack of significance may be due to an experimental design decision resulting in lower-than-estimated statistical power.
Study and Results in Detail
Main Task in Detail
Both the original study and our replication began with a demographic questionnaire. In our replication, the demographic questionnaire from the original study was pared down to include only the questions relevant to the exclusion criteria and to a potential supplemental analysis regarding the original hypothesis 3. The retained questions are listed below.
What is your gender?
Options: Female, Male, Transgender, Non-Binary
What is your age in years?
For statistical purposes, what is your weight?
For statistical purposes, what is your height?
Please enter your date of birth.
Please enter your (first) native language.
We included an additional screening question to ensure recruited participants were able to complete the task.
Are you currently using a device with a full keyboard?
Options: “Yes, I am using a full keyboard”, “No”
The exact proportion of bodies under 19.79 BMI, out of the total bodies presented per block in each condition, is detailed in Figure 2. Condition manipulations relative to stimulus BMIs can be seen in Figure 3. (A sketch of how a trial schedule with these proportions might be constructed follows Figure 2.)
Figure 2: Stimuli Prevalence by Condition and Block

Proportion of thin body image stimuli per block:

| Block # | Increasing Prevalence | Stable Prevalence |
|---|---|---|
| 1 | 0.50 | 0.50 |
| 2 | 0.50 | 0.50 |
| 3 | 0.50 | 0.50 |
| 4 | 0.50 | 0.50 |
| 5 | 0.60 | 0.50 |
| 6 | 0.72 | 0.50 |
| 7 | 0.86 | 0.50 |
| 8 | 0.94 | 0.50 |
| 9 | 0.94 | 0.50 |
| 10 | 0.94 | 0.50 |
| 11 | 0.94 | 0.50 |
| 12 | 0.94 | 0.50 |
| 13 | 0.94 | 0.50 |
| 14 | 0.94 | 0.50 |
| 15 | 0.94 | 0.50 |
| 16 | 0.94 | 0.50 |
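As an illustration of how a schedule with these proportions could be constructed, here is a rough Python sketch. The per-block proportions come from Figure 2 above; whether the original task sampled images with or without replacement within a block, and how it rounded trial counts, are assumptions rather than details taken from the original experiment code.

```python
import random

TRIALS_PER_BLOCK = 50
# Per-block proportion of "thin" images (BMI below 19.79), from Figure 2.
INCREASING = [0.50, 0.50, 0.50, 0.50, 0.60, 0.72, 0.86] + [0.94] * 9
STABLE = [0.50] * 16

def build_schedule(proportions, thin_images, overweight_images, seed=0):
    """Return one shuffled 50-trial list of image labels per block, matching the target thin proportions."""
    rng = random.Random(seed)
    blocks = []
    for p in proportions:
        n_thin = round(p * TRIALS_PER_BLOCK)
        trials = (rng.choices(thin_images, k=n_thin)
                  + rng.choices(overweight_images, k=TRIALS_PER_BLOCK - n_thin))
        rng.shuffle(trials)
        blocks.append(trials)
    return blocks
```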
Figure 3: Estimated BMI of Stimuli and World Health Organization Categories
Stimulus
Categorization for Study Conditions
BMI
WHO Classification**
T300
Thin
13.19
Thin
T290
Thin
13.38
Thin
T280
Thin
13.47
Thin
T270
Thin
13.77
Thin
T260
Thin
13.86
Thin
T250
Thin
14.10
Thin
T240
Thin
14.28
Thin
T230
Thin
14.46
Thin
T220
Thin
14.65
Thin
T210
Thin
14.87
Thin
T200
Thin
15.06
Thin
T190
Thin
15.24
Thin
T180
Thin
15.49
Thin
T170
Thin
15.67
Thin
T160
Thin
15.74
Thin
T150
Thin
16.12
Thin
T140
Thin
16.40
Thin
T130
Thin
16.64
Thin
T120
Thin
16.81
Thin
T110
Thin
17.08
Thin
T100
Thin
17.28
Thin
T090
Thin
17.56
Thin
T080
Thin
17.77
Thin
T070
Thin
18.01
Thin
T060
Thin
18.26
Thin
T050
Thin
18.50
Normal Range
T040
Thin
18.77
Normal Range
T030
Thin
19.1
Normal Range
T020
Thin
19.3
Normal Range
T010
Thin
19.61
Normal Range
N000
NA*
19.79
Normal Range
H010
Overweight
21.55
Normal Range
H020
Overweight
23.35
Normal Range
H030
Overweight
25.37
Overweight
H040
Overweight
27.37
Overweight
H050
Overweight
29.57
Overweight
H060
Overweight
31.84
Overweight
H070
Overweight
34.13
Overweight
H080
Overweight
36.58
Overweight
H090
Overweight
39.10
Overweight
H100
Overweight
41.76
Overweight
H110
Overweight
44.55
Overweight
H120
Overweight
47.37
Overweight
H130
Overweight
50.23
Overweight
H140
Overweight
53.21
Overweight
H150
Overweight
56.26
Overweight
H160
Overweight
59.31
Overweight
H170
Overweight
62.64
Overweight
H180
Overweight
66.04
Overweight
H190
Overweight
69.56
Overweight
H200
Overweight
73.30
Overweight
H210
Overweight
76.95
Overweight
H220
Overweight
80.98
Overweight
H230
Overweight
85.49
Overweight
H240
Overweight
89.89
Overweight
H250
Overweight
94.40
Overweight
H260
Overweight
99.27
Overweight
H270
Overweight
104.4
Overweight
H280
Overweight
109.45
Overweight
H290
Overweight
114.82
Overweight
H300
Overweight
120.29
Overweight
* N000 was not included in either the original study or our replication. ** Labels for BMI categories defined by WHO guidelines. (WHO, 1995)
Data Collection
Data were collected using the Positly recruitment platform and the Pavlovia experiment hosting platform. Data collection began on the 15th of May, 2024 and ended on the 5th of August, 2024.
In consultation with the original authors, we determined that a sample size of 200 participants after exclusions would provide adequate statistical power for this replication effort. In the simulations for the original study, the authors determined that 140 participants would provide 89% power to detect their expected effect size for hypothesis 1. For replications we typically aim for a 90% chance to detect an effect that is 75% of the size of the original effect. To emulate that standard for this study, we decided on a sample of 200 participants. It is important to note that the original study had 419 participants after exclusions. That final sample size was set by simulation-based power analyses for hypotheses 2 and 3, which required roughly 400 participants for adequate statistical power. Because our replication did not test hypotheses 2 and 3 (they weren’t supported in the original analysis), we did not need to power the study based on those hypotheses.
While a sample size of 200 subjects was justified at the time, we later learned that the original simulation-based power analysis relied on faulty assumptions, which could only be determined from the empirical data in the original sample. The sample size needed to provide adequate statistical power for hypothesis 1 was underestimated. Because the original study used a larger sample size to power hypotheses 2 and 3, the underestimate of the sample size needed for hypothesis 1 wasn’t detected. As a result, our replication study may have been underpowered.
Excluding Participants and/or Observations
For participants to be eligible to take part in the study, they had to be:
Female
Aged 18-30
English speaking
After data collection, participants were excluded from the analysis under the following circumstances (a sketch of how these exclusions might be applied in code follows the list):
Participants who took longer than 7 seconds to respond in more than 10 trials.
Participants who demonstrated obviously erratic behavior e.g. repeated similar responses across long stretches of trials despite variation in stimuli (see Additional Information about the Exclusion Criteria appendix section).
Participants who did not complete the full 800 trials.
Participants who did not meet the eligibility criteria.
Additionally, at the suggestion of the original authors, we excluded any observations in which the response was given more than 7 seconds after the display of the stimulus.
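Here is a rough sketch of how the rules above might be applied to a trial-level data frame. The column names (“participant”, “gender”, “age”, “rt”) are hypothetical, the language-eligibility check and the manual “erratic behavior” review are omitted, and our actual exclusion code is in the analysis files linked above.

```python
import pandas as pd

def apply_exclusions(trials: pd.DataFrame) -> pd.DataFrame:
    """Sketch of participant-level and observation-level exclusions (hypothetical column names)."""
    per_person = trials.groupby("participant").agg(
        n_trials=("rt", "size"),
        n_slow=("rt", lambda rt: (rt > 7.0).sum()),
        gender=("gender", "first"),
        age=("age", "first"),
    )
    keep = per_person[
        (per_person["gender"] == "Female")
        & per_person["age"].between(18, 30)
        & (per_person["n_trials"] == 800)   # completed all 800 trials
        & (per_person["n_slow"] <= 10)      # at most 10 responses slower than 7 s
    ].index

    kept = trials[trials["participant"].isin(keep)]
    # Observation-level exclusion: drop individual responses slower than 7 s.
    return kept[kept["rt"] <= 7.0]
```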
249 participants completed the main task. 8 participants did not have their data written due to technical malfunctions (these participants were still compensated as usual). 8 participants were excluded for reporting anything other than “Female” for their gender on the questionnaire. 23 participants were excluded for being over 30 years old. 6 participants were excluded for taking longer than 7 seconds to respond on more than 10 trials. 4 participants were excluded for obviously erratic behavior. Note that some participants fell into two or more of these exclusion categories, so the sum of exclusions listed above is greater than the total number of excluded participants.
Analysis
Both the original paper and our replication utilized a logistic multilevel model to assess the data:
where:
Y_ij is the log odds of participant j making an “overweight” judgment on trial i.
Size is the ordinal relative BMI of the computer-generated model images; that is, the degree to which each body image stimulus is thin or overweight.
U_0j are random intercepts per participant, and U_1j are random slopes per participant.
The γ terms represent the fixed effects.
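Putting these terms together, the model has roughly the following form. This is our sketch based on the predictors reported in the results tables below, not the authors’ exact specification; in particular, the random-effects structure shown (a random intercept plus a random slope on trial per participant) is an assumption, and the precise model is documented in the publicly available analysis code.

```latex
Y_{ij} \;=\; \log\frac{P(\text{Overweight}_{ij}=1)}{1-P(\text{Overweight}_{ij}=1)}
 \;=\; \gamma_{00}
 + \gamma_{10}\,\text{Condition}_{j}
 + \gamma_{20}\,\text{Trial}_{ij}
 + \gamma_{30}\,\text{Size}_{ij}
 + \gamma_{40}\,\text{Condition}_{j} \times \text{Trial}_{ij}
 + \gamma_{50}\,\text{Condition}_{j} \times \text{Size}_{ij}
 + \gamma_{60}\,\text{Trial}_{ij} \times \text{Size}_{ij}
 + \gamma_{70}\,\text{Condition}_{j} \times \text{Trial}_{ij} \times \text{Size}_{ij}
 + U_{0j}
 + U_{1j}\,\text{Trial}_{ij}
```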
Results in Detail
The original study found a significant three-way interaction between condition, trial number, and size (β = 3.85, SE = 0.38, p = 1.09 × 10−23), indicating that as the prevalence of thin bodies in the environment increased, participants were more likely to judge ambiguous bodies (not obviously overweight and not obviously underweight) as overweight. The authors note that this effect is restricted to judgements of “thin and average-size stimuli” due to the increasing-prevalence condition requiring a low frequency of “overweight” stimuli.
Figure 4: Original Results Table
| Predictors | Log Odds | 95% CI | p |
|---|---|---|---|
| Intercept | -1.90 | -2.01 – -1.78 | <0.001 |
| Condition | 0.08 | -0.04 – 0.20 | 0.173 |
| Trial0 | -0.62 | -0.77 – -0.47 | <0.001 |
| Size0 | 21.21 | 20.82 – 21.59 | <0.001 |
| Condition x Trial0 | -0.65 | -0.81 – -0.50 | <0.001 |
| Condition x Size0 | -0.48 | -0.85 – -0.11 | 0.011 |
| Trial x Size0 | 2.05 | 1.26 – 2.85 | <0.001 |
| Condition x Trial0 x Size0 | 3.85 | 3.10 – 4.61 | <0.001 |
Figure 5: Original Results Data Representations
From “Changes in the prevalence of thin bodies bias young women’s judgments about body size,” by Devine, S., Germain, N., Ehrlich, S., & Eppinger, B., 2022, Psychological Science, 33(8), 1212-1225.
Figure 6: Replication Results Table
| Predictors | Log Odds | 95% CI | p |
|---|---|---|---|
| Intercept | -1.59 | -1.73 – -1.45 | <0.001 |
| Condition | 0.15 | 0.2 – 0.29 | 0.028 |
| Trial0 | -0.80 | -0.99 – -0.61 | <0.001 |
| Size0 | 20.01 | 19.50 – 20.51 | <0.001 |
| Condition x Trial0 | -0.68 | -0.87 – -0.49 | <0.001 |
| Condition x Size0 | -0.43 | -0.91 – 0.06 | 0.084 |
| Trial x Size0 | -0.24 | -1.19 – 0.71 | 0.626 |
| Condition x Trial0 x Size0 | 0.53 | -0.38 – 1.43 | 0.255 |
Figure 7: Replication Results Data Representation
Interpreting the Results
The failure of this result to replicate is likely due to characteristics of the study design that made the experiment a less sensitive test of the hypothesis. For that reason, the failure of this study to replicate should not be taken as strong evidence against the original hypothesis that prevalence-induced concept change occurs for body images.
The main study design issue that could possibly account for the non-replication of the results is the categorization of “thin” and “overweight” images for the condition manipulation: “thin” images were 19.61 BMI and below, and “overweight” images were 21.55 BMI and above. This low threshold means that participants in the increasing prevalence condition would have seen a very small number of images of bodies that were in the ambiguous or normal range of BMI in which prevalence induced concept change is most likely to occur. Unfortunately, we did not notice this issue with the BMI cutoff between the thin and overweight groups until after we had collected the replication data. This means that our replication, while having the benefit of being faithful to the original study, has the drawback of being affected by the same study design issue.
We presented this issue to the authors after determining that it may explain the lack of replication. The authors explained their rationale for setting the image cutoff at the baseline image:
“In designing the study, we anticipated the most “ambiguous” stimuli to be those near the reference image (BMI of 19.79; the base model). This was based on two factors. First, WHO guidelines suggest that a “normal” BMI lies between 18.5 and 24.9—hence a BMI of 19.79 fell nicely within this range and, as mentioned, allowed for a clean division of the stimulus set into two equal categories. Second, irrespective of the objective BMI, we anticipated participants would judge the reference image as maximally ambiguous in the context of the stimulus set, owing to the range available to participants’ judgements when completing the experiment. Accordingly, the power analysis we conducted was based on this assumption that responses most sensitive to PICC would be those to images near in size to the reference image. But this turned out not to be the case when we acquired the data from our sample. As you point out, increased sensitivity to PICC was at a slightly higher (and evidently under-sampled) range of size (BMI 23.35 – 31.84). As such, the sample size required to detect effects in these ranges with sufficient power may be higher than previously thought.” (Devine, email communication 9/11/24)
Understanding the Categorization Used
It took us some time to recognize this issue because the original paper does not clearly explain how the “thin” and “overweight” image categories correspond to BMI values of the images, and none of the figures in the original paper show BMI values along the axes representing image sizes. From the paper alone it is not possible for a reader to determine what BMI values the stimuli presented correspond to, with the exception of the endpoints. The paper says:
Specifically, the proportion of thin to overweight bodies had the following order across each block in the increasing-prevalence condition: .50, .50, .50, .50, .60, .72, .86, .94, .94, .94, .94, .94, .94, .94, .94, .94. In the stable-prevalence condition, the proportion of overweight and thin bodies in the environment did not change; it was always .50 (see Fig. 1b). Bodies were categorized as objectively thin or overweight by Moussally et al. (2017) according to World Health Organization (1995) guidelines. Body mass index (BMI) across all bodies ranged from 13.19 (severely underweight) to 120.29 (very severely obese). (Devine et al, 2022) [Bold italics added for emphasis]
From the information provided in the paper, a reader would be likely to assume that the images in the “overweight” category had BMIs of greater than 25, because a BMI of 25 is the dividing line between “healthy/normal” and “overweight” according to the WHO. Another possible interpretation of this text in the paper would suggest that the bodies that were categorized as thin and/or median in the Moussally et al. (2017) stimulus validation paper were the ones increasing in prevalence in that condition, and those categorized as overweight in the validation study were diminishing in prevalence. Either of these likely reader assumptions would also be supported by the presentation of the results in the original paper:
Most importantly, we found a three-way interaction between condition, trial, and size (β = 3.85, SE = 0.38, p = 1.09 × 10−23). As seen in Figure 2a, this result shows that when the prevalence of thin bodies in the environment increased over the course of the task, participants judged more ambiguous bodies (average bodies) as overweight than when the prevalence remained fixed. We emphasize here that this effect is restricted to judgments of thin and average-size stimuli because the nature of our manipulation reduced the number of overweight stimuli seen by participants in the increasing-prevalence condition (as reflected by larger error bars for larger body sizes in Fig. 2a). (Devine et al, 2022) [Bold italics added for emphasis]
Moussally et al. developed the stimuli that were used in this study by using 3D modeling software. They started with a default female model (corresponding to a BMI of 19.79 according to their analysis), scaling down from that default model in 30 increments of the modeling software’s “thin/heavy” dimension to get lower BMIs (down to a low of 13.19), and then scaling up from that default model by 30 increments to get higher BMIs (up to a high of 120.29). They then validated the image set by asking participants to rate images on a 9-point Likert scale where 1 = “fat” and 9 = “thin”. Based on those ratings they established three categories for body shape: “thin, median, or fat.”
Figure 8: Ratings of Body Shape for all Stimuli from Moussally et al. (2017)
| Stimulus | BMI | Mean Rating (Likert 1-9) | Validation Study Classification |
|---|---|---|---|
| T300 | 13.19 | 8.94 | Thin |
| T290 | 13.38 | 8.95 | Thin |
| T280 | 13.47 | 8.97 | Thin |
| T270 | 13.77 | 8.88 | Thin |
| T260 | 13.86 | 8.91 | Thin |
| T250 | 14.10 | 8.86 | Thin |
| T240 | 14.28 | 8.77 | Thin |
| T230 | 14.46 | 8.70 | Thin |
| T220 | 14.65 | 8.63 | Thin |
| T210 | 14.87 | 8.67 | Thin |
| T200 | 15.06 | 8.59 | Thin |
| T190 | 15.24 | 8.56 | Thin |
| T180 | 15.49 | 8.37 | Thin |
| T170 | 15.67 | 8.18 | Thin |
| T160 | 15.74 | 8.22 | Thin |
| T150 | 16.12 | 8.11 | Thin |
| T140 | 16.40 | 8.12 | Thin |
| T130 | 16.64 | 8.05 | Thin |
| T120 | 16.81 | 7.95 | Thin |
| T110 | 17.08 | 7.90 | Thin |
| T100 | 17.28 | 7.79 | Thin |
| T090 | 17.56 | 7.90 | Thin |
| T080 | 17.77 | 7.79 | Thin |
| T070 | 18.01 | 7.88 | Thin |
| T060 | 18.26 | 7.74 | Thin |
| T050 | 18.50 | 7.84 | Thin |
| T040 | 18.77 | 7.76 | Thin |
| T030 | 19.1 | 7.74 | Thin |
| T020 | 19.3 | 7.78 | Thin |
| T010 | 19.61 | 7.50 | Thin |
| N000 | 19.79 | 7.63 | Thin |
| H010 | 21.55 | 7.28 | Thin |
| H020 | 23.35 | 6.21 | Median |
| H030 | 25.37 | 5.65 | Median |
| H040 | 27.37 | 5.26 | Median |
| H050 | 29.57 | 4.85 | Median |
| H060 | 31.84 | 4.28 | Median |
| H070 | 34.13 | 3.63 | Fat |
| H080 | 36.58 | 3.62 | Fat |
| H090 | 39.10 | 3.10 | Fat |
| H100 | 41.76 | 2.78 | Fat |
| H110 | 44.55 | 2.65 | Fat |
| H120 | 47.37 | 2.45 | Fat |
| H130 | 50.23 | 2.32 | Fat |
| H140 | 53.21 | 2.02 | Fat |
| H150 | 56.26 | 1.95 | Fat |
| H160 | 59.31 | 1.68 | Fat |
| H170 | 62.64 | 1.56 | Fat |
| H180 | 66.04 | 1.59 | Fat |
| H190 | 69.56 | 1.44 | Fat |
| H200 | 73.30 | 1.45 | Fat |
| H210 | 76.95 | 1.30 | Fat |
| H220 | 80.98 | 1.23 | Fat |
| H230 | 85.49 | 1.17 | Fat |
| H240 | 89.89 | 1.16 | Fat |
| H250 | 94.40 | 1.11 | Fat |
| H260 | 99.27 | 1.06 | Fat |
| H270 | 104.4 | 1.09 | Fat |
| H280 | 109.45 | 1.10 | Fat |
| H290 | 114.82 | 1.06 | Fat |
| H300 | 120.29 | 1.05 | Fat |
* “Median” defined by Moussally et al. (2017) as stimuli whose average rating across participants on a scale from 1 to 9 (1 = fat, 9 = thin) was within ±1.5 of the mean of ratings for the entire dimension. Stimuli with average ratings above this range were categorized as “thin”, and stimuli with average ratings below the range were categorized as “fat”.
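As an illustration of that categorization rule, here is a small Python sketch. This is our reading of the rule, not code from Moussally et al.; the ratings mapping below is a stand-in for the full set of per-stimulus mean ratings in Figure 8 (the rule is defined over the entire dimension, not a subset).

```python
def classify(ratings):
    """Classify stimuli as thin/median/fat from mean ratings (1 = fat, 9 = thin)."""
    grand_mean = sum(ratings.values()) / len(ratings)
    lo, hi = grand_mean - 1.5, grand_mean + 1.5
    categories = {}
    for stim, rating in ratings.items():
        if rating > hi:
            categories[stim] = "thin"    # rated toward the "thin" end of the scale
        elif rating < lo:
            categories[stim] = "fat"
        else:
            categories[stim] = "median"
    return categories

# Toy example with a few of the Figure 8 values (stand-in for the full 61-stimulus set):
print(classify({"T010": 7.50, "H020": 6.21, "H060": 4.28, "H070": 3.63}))
```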
The “median” images according to the judgements reported in Moussally et al. (2017) ranged from a BMI of 23.35 to 31.84; however, neither of those cutoffs nor the commonly used WHO BMI guideline of 25 and above as “overweight” was used to set the cutoff between the groups of “thin” and “overweight” images in the experiment we replicated. From looking at the study code itself, this study used the 30 images scaled down from the baseline image of 19.79 BMI as the “thin” group and the 30 images scaled up from the baseline as the “overweight” group. The 19.79 BMI image was not included in either group, so it was not presented to participants in the experiment. That means that the “thin” images that were increasing in prevalence ranged from a BMI of 13.19 to 19.61, and the “overweight” images that were decreasing in prevalence ranged from a BMI of 21.55 to 120.29. The 21.55 BMI image was categorized as “thin” in the Moussally et al. (2017) validation study, and is well within the normal/healthy weight range according to the WHO, and yet it was categorized with the “overweight” images in this study. This 21.55 BMI image was judged as “not overweight” in 96% of trials in the original dataset for the present study, further suggesting that the experiment’s cutoff between “thin” and “overweight” was placed at too low a BMI to adequately capture data for ambiguous body images.
Implications of the Categorization
Figure 2b in the original paper presents the results for a BMI of 23.35, which is within the “normal/healthy” range according to the WHO, and is the lowest BMI “median” image according to the validation study. This is clearly meant to be one of the normal or ambiguous body sizes for which prevalence induced concept change would be most expected. The inclusion of this image in the “overweight” grouping for which the prevalence was decreasing means this image would not have been shown to participants very often. The caveat in the results section that “this effect is restricted to judgments of thin and average-size stimuli because the nature of our manipulation reduced the number of overweight stimuli seen by participants in the increasing-prevalence condition,” applies to the 23.35 BMI image that is presented in the paper as a demonstration of the effect.
In the last 200 trials in the increasing prevalence condition only 6% of the images presented would have been from the set of 30 “overweight” images. That means that each participant only saw 12 presentations of “overweight” images in the last 200 trials. Each individual subject in the increasing prevalence condition would only have had an approximately 33% chance of seeing the BMI 23.35 image at least once during the last 200 trials. Ideally, this image–and others in the ambiguous range–should be shown much more frequently in order to capture possible effects of prevalence induced concept change.
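The 33% figure can be checked with a quick back-of-the-envelope calculation, assuming each “overweight” presentation is drawn roughly uniformly from the 30 images in that set:

```python
# ~6% of the last 200 trials are "overweight" images -> about 12 presentations per participant.
n_overweight_presentations = round(0.06 * 200)            # 12
p_specific_image_per_draw = 1 / 30                        # 30 images in the "overweight" set
p_seen_at_least_once = 1 - (1 - p_specific_image_per_draw) ** n_overweight_presentations
print(f"{p_seen_at_least_once:.1%}")                      # roughly 33%
```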
In the original study, looking at the last 200 trials, the 23.35 BMI image was presented only 80 times out of 42,600 image presentations to the increasing prevalence condition. In the replication study, looking at the last 200 trials, that image was only presented 51 times out of 20,000 image presentations to the increasing prevalence condition.
Figure 9 below shows how many times stimuli with BMI values from 18.77 to 31.84 were presented, and what percentage of those presentations were judged as “overweight”, in the last 200 trials in each condition across all subjects, for the original dataset and the replication dataset. Rows with BMI values of 21.55 and above are from the “overweight” image group.
Figure 9: Data Presentation Frequency and % “Overweight” judgements in last 200 trials
A (Original Data): Increasing condition N = 213 (42,600 presentations in the last 200 trials); Stable condition N = 206 (41,200 presentations)

| Stimulus (BMI) | Increasing: # presentations | Increasing: % judged “Overweight” | Stable: # presentations | Stable: % judged “Overweight” |
|---|---|---|---|---|
| 18.77 | 1337 | 2.99% | 701 | 2.28% |
| 19.1 | 1365 | 4.03% | 673 | 2.23% |
| 19.3 | 1288 | 3.80% | 718 | 1.39% |
| 19.61 | 1334 | 3.97% | 698 | 2.72% |
| 21.55 | 73 | 8.22% | 676 | 3.40% |
| 23.35 | 80 | 25.00% | 685 | 9.78% |
| 25.37 | 81 | 28.40% | 687 | 21.83% |
| 27.37 | 96 | 43.75% | 683 | 34.85% |
| 29.57 | 93 | 53.76% | 726 | 52.07% |
| 31.84 | 83 | 61.45% | 710 | 56.90% |

B (Replication Data): Increasing condition N = 100 (20,000 presentations in the last 200 trials); Stable condition N = 101 (20,100 presentations)

| Stimulus (BMI) | Increasing: # presentations | Increasing: % judged “Overweight” | Stable: # presentations | Stable: % judged “Overweight” |
|---|---|---|---|---|
| 18.77 | 611 | 0.82% | 315 | 2.22% |
| 19.1 | 590 | 2.03% | 333 | 2.40% |
| 19.3 | 634 | 2.05% | 314 | 2.55% |
| 19.61 | 631 | 2.06% | 319 | 2.19% |
| 21.55 | 41 | 7.32% | 331 | 4.53% |
| 23.35 | 51 | 15.69% | 336 | 10.71% |
| 25.37 | 41 | 34.15% | 357 | 24.93% |
| 27.37 | 35 | 42.86% | 315 | 35.56% |
| 29.57 | 35 | 68.57% | 346 | 52.89% |
| 31.84 | 42 | 59.52% | 385 | 58.70% |
From looking at these tables, it’s easy to see that in both conditions only a small percentage of the stimuli from 18.77 to 19.61 BMI were judged to be overweight. There is much more variation in judgment in the 21.55 to 31.84 BMI images, but the number of times those were presented in the increasing prevalence condition was very small. The fact that the most important stimuli for demonstrating the proposed effect were presented extremely infrequently in the study likely undermines the reliability of this test of the prevalence induced concept change hypothesis by making it much less sensitive to detecting whether the effect is present.
Implications of Nonreplication for the Prevalence Induced Concept Change Hypothesis
If we look more closely at the results for the range of BMI values for which there is ambiguity in both the original data and the replication data we can see that the pattern of results for those values looks similar.
Figure 10: Data for the last 200 trials. (A) Original data; (B) replication data.
Figure 10 above shows that only one datapoint in the replication data has results that are clearly outside of the margin of error (BMI = 29.57), but the pattern looks similar to what we see in the original data. This suggests that despite the issues with the experimental design, the original study may have been able to detect an effect because it was much more highly powered than should have been necessary to test this hypothesis due to the need for a higher statistical power for hypotheses 2 and 3 in the original paper. In the replication study, which was powered appropriately according to the original study’s simulation analysis, the effective power was lower than what was simulated due to the miscategorization of the ambiguous images into the overweight group.
Proposed Experimental Design Changes
In our view, a better threshold between the “thin” and “overweight” images for testing this hypothesis would be a BMI of 31.84 (the high end of the “median” range reported in the Moussally et al. (2017) paper). This threshold would ensure that participants are presented with many opportunities to judge the images that are in the ambiguous range where prevalence-induced concept change is most likely to be observed. Shifting to this threshold would make this experiment better suited to detecting the hypothesized effect.
Additionally, this experiment would benefit from having more stimuli that are in the ambiguous range of values – i.e. more stimuli with BMIs between 23.35 and 31.84. In this study only 5 of the images (23.35, 25.37, 27.37, 29.57, 31.84) are in the range Moussally et al. determine to be “median.” A larger set of stimuli in the ambiguous range would provide more data points in the relevant range for testing the hypothesis. We recognize that this change would require developing and validating additional stimuli, which would be labor-intensive.
Comparing the stimuli used in this study to those used in the Levari et al. (2018) experiment–on which this study is based–provides an illustration that helps explain why this would be important for testing this hypothesis. Levari et al. tested prevalence induced concept change using images of 100 dots that ranged in color from purple to blue. When they decreased the prevalence of blue dots, they found that people were more likely to consider ambiguous dots to be blue. Stimuli from Levari et al.’s paper can be seen in Figure 11c, where there are 18-19 stimuli at color values in between each of the dots shown. From looking at these representative stimuli it’s clear that there were many examples of different stimuli in the range of values that were ambiguous.
Figure 11: Levari et al. (2018) Colored Dots Study 1. (A–B) Results visualization; (C) color spectrum stimuli examples.
Prevalence-induced concept change should be observable mainly in ambiguous stimuli. We expect this effect to be non-existent for the extreme exemplars of the relevant conceptual category. That is, the bluest dots will always be identified as blue, but judgements of ambiguous dots should be susceptible to the effect. Looking at Figures 11a-11b, a substantial fraction of the 100 different dot images were ambiguous (identified as blue some of the time, rather than 100% or 0% of the time). A wide range of ambiguous stimuli makes this effect easier to capture. Additionally, these ambiguous dots were clustered on the purple half of the color spectrum. This is important because Levari et al.’s manipulation increased the frequency of the purple-spectrum dots. So, their data contained many observations of ambiguous dots despite the condition manipulation decreasing the frequency of blue-spectrum dots. Compare Figures 11a-11b from Levari et al. to Figure 12 below, generated from the original body image study data:
Figure 12: PICCBI Original Results Visualization
It’s not possible to see the curve shift in the increasing prevalence condition here (Figure 12), despite the model having a significant result. This is likely because there are many fewer observations in the ambiguous range of stimuli. This makes the model more sensitive to noise at the extreme values. Looking at these figures for the replication data in Figure 13, we see that noise in the infrequently presented larger BMI images shapes the divergence between the curves in a way that’s not consistent with the hypothesis:
Taking more measurements in the ambiguous range by having more stimulus images with BMI values in that range would improve the ability of this experiment to reliably detect whether prevalence induced concept change occurs for body images.
It’s also worth noting that this issue with the study design was somewhat obscured by the design of the figures presenting the data in the paper. Instead of using the curves above like the Levari et al. (2018) paper used, the data for this study was presented by showing the percentage of overweight ratings for the first 200 trials subtracted from the last 200 trials, as seen in Figure 5. This method highlights the relevant change from the early trials to the later trials, but has the downside of not clearly presenting the actual values. Many of those values didn’t change from the early to the late trials because they were near the ceiling or the floor (almost all judgements were one-way). It was not possible to tell what the actual percentages of overweight judgements were from the information presented in the paper, which meant it was not clear which stimuli had overweight judgements near the ceiling or floor and which were ambiguous. Being able to tell where the ambiguous values were would have been useful to readers attempting to interpret the results of this study.
By incorporating these changes, a new version of this study would shed a lot of light on the question of whether prevalence induced concept change can be reliably detected for body images.
Conclusion
The results of the original paper failed to replicate, which we suspect was due to the experiment being less sensitive to the effect than anticipated. For this reason, we emphasize that our findings do not provide strong evidence against the original hypothesis. Prevalence-induced concept change may affect women’s body image judgements, but the present experiment, at this sample size, was a less sensitive test of the effect than previously believed. The design could be improved by raising the BMI cutoff between “thin” and “overweight” images for the prevalence manipulation and/or by including additional stimuli within the range of ambiguous body sizes (BMI 23.35 to 31.84) to increase the frequency of ambiguous stimuli, which are the most important for demonstrating a change in concept.
The clarity rating of 2.5 stars was due to two factors. The original discussion section did not address the potential implications of the lack of support for hypotheses 2 and 3. Since hypotheses 2 and 3 related to people applying these changes in the concept of thinness to their own bodies, the lack of support for those hypotheses may limit the claims that should be made about potential real world effects of prevalence induced concept change for body image. Additionally, the difficulty of determining the stimulus BMI values, the thin/overweight cutoff value, and the range of results for which judgements were ambiguous from the information presented in the paper could leave readers with misunderstandings about the study’s methods and results.
The study had a high transparency rating of 4.5 stars because all of the original data, experiment/analysis code, and pre-registration materials were publicly available. There were minor discrepancies in exclusion criteria based on reaction times between the pre-registration and the analysis, and some documentation for exclusion criteria and code for evaluating participant quality wasn’t publicly posted. However, the undocumented code was provided upon request, and the inconsistency in exclusion criteria was subtle and likely had no bearing on the results.
Author Acknowledgements
We would like to thank the authors of “Changes in the Prevalence of Thin Bodies Bias Young Women’s Judgements About Body Size”: Sean Devine, Nathalie Germain, Stefan Ehrlich, and Ben Eppinger for everything they’ve done to make this replication report possible. We thank them for their original research and for making their data, experiment code, analysis, and other materials publicly available. The original authors provided prompt, supportive correspondence, and this report was greatly improved by their input.
Thank you to Isaac Handley-Miner for your consultation on multilevel modeling for our analysis. Your expertise was invaluable.
Thank you to Soundar and Nathan from the Positly team for your technical support with the data collection.
Thank you to Spencer Greenberg for your guidance and feedback throughout the project.
Last, but certainly not least, thank you to all 249 individuals who participated in the replication experiment.
Response from the Original Authors
The original paper’s authorship team offers this response (PDF) to our report. We are grateful for their thoughtful engagement with our report.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research. We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional Information about the Exclusion Criteria
249 participants completed the main task
8 participants were excluded due to technical malfunctions.
5 of these participants did not have their data written due to terminating their connection to Pavlovia before the data saving operations could complete. These participants were compensated for completion of the full task.
3 of these participants were excluded for incomplete data sets. These 3 exclusions stand out as unexplained data-writing malfunctions. These participants were compensated for completion of the full task, despite the partial datasets.
8 participants were excluded for reporting anything other than “Female” for their gender on the questionnaire.
23 participants were excluded for being over 30 years old.
6 participants were excluded for taking longer than 7 seconds to respond on more than 10 trials.
4 participants were excluded for obviously erratic behavior.
The “erratic behavior” exclusions were determined by generating graphical representations of individual subject judgements over time and manually reviewing them for signs of unreasonable behavior. The code for generating these individual subject graphs was provided by the original authors, and we consulted with the original authors on their assessment of the graphs. The generation code and a complete set of graphics can be found in our GitLab repository. Figure 14a is an example of expected behavior from a participant: they tended to judge very thin stimuli as “not overweight” and very overweight stimuli as “overweight”, with some variance, especially around ambiguous stimuli closer to the middle of the spectrum. Figures 14b-14e are the subjects we excluded based on their curves. The participant in 14b made judgments exactly opposite to the expected pattern for their first 200 trials, which suggests that they were confused about which key on their keyboard corresponded to which judgment. In 14c, the participant’s judgements in the last 200 trials appear essentially random; they likely stopped paying attention at some point during the task. Because this criterion is somewhat subjective, only the most obviously invalid data were excluded. Any participants with questionable but ambiguous curves had their data included to avoid the possibility of biased exclusions.
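For illustration, the kind of per-subject summary behind these curves can be produced with a few lines of Python. The column names here are hypothetical, and this is not the generation code provided by the original authors (which is in our GitLab repository).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical columns: "subject", "trial", "size" (stimulus rank from thin to overweight),
# and "judged_overweight" (0/1).
def plot_subject_curves(trials: pd.DataFrame, subject_id):
    """Plot % 'overweight' judgments by stimulus size for one subject, for the first and last 200 trials."""
    sub = trials[trials["subject"] == subject_id].sort_values("trial")
    for label, chunk in [("first 200 trials", sub.head(200)), ("last 200 trials", sub.tail(200))]:
        curve = chunk.groupby("size")["judged_overweight"].mean() * 100
        plt.plot(curve.index, curve.values, marker="o", label=label)
    plt.xlabel("Stimulus size (thin to overweight)")
    plt.ylabel('% judged "overweight"')
    plt.title(f"Subject {subject_id}")
    plt.legend()
    plt.show()
```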
Figure 14: Individual Subject Curves. (A) Example of a typical participant’s curve; (B–E) curves for the excluded participants.
References
Devine, S., Germain, N., Ehrlich, S., & Eppinger, B. (2022). Changes in the prevalence of thin bodies bias young women’s judgments about body size. Psychological Science, 33(8), 1212-1225. https://doi.org/10.1177/09567976221082941
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.
Levari, D. E., Gilbert, D. T., Wilson, T. D., Sievers, B., Amodio, D. M., & Wheatley, T. (2018). Prevalence-induced concept change in human judgment. Science, 360(6396), 1465-1467. https://doi.org/10.1126/science.aap8731
Moussally, J. M., Rochat, L., Posada, A., & Van der Linden, M. (2017). A database of body-only computer-generated pictures of women for body-image studies: Development and preliminary validation. Behavior Research Methods, 49(1), 172-183. https://doi.org/10.3758/s13428-016-0703-7
World Health Organization. (1995). Physical status: The use of and interpretation of anthropometry. Report of a WHO expert committee. https://apps.who.int/iris/handle/10665/37003
[1] We are using the category labels “thin” and “overweight” because they were used in the original paper. These labels do not necessarily correspond to what they would mean in everyday usage, and they should not be taken as objective measures of health or perception, or as reflecting the opinions of the researchers. More information on the decisions behind the categorization can be found in the Understanding the Categorization Used section.
A significant and pretty common problem I see when reading papers in social science (and psychology in particular) is that they present a fancy analysis but don’t show the results of what we have named the “Simplest Valid Analysis” – which is the simplest possible way of analyzing the data that is still a valid test of the hypothesis in question.
This creates two potentially serious problems that make me less confident in the reported results:
Fancy analyses impress people (including reviewers), but they are often harder to interpret than simple analyses. And it’s much less likely the reader really understands the fancy analysis, including its limitations, assumptions, and gotchas. So, the fancy analysis can easily be misinterpreted, and is sometimes even invalid for subtle reasons that reviewers, readers (and perhaps the researchers themselves) don’t realize. As a mathematician, I am deeply unimpressed when someone shows me a complex mathematical method when a simple one would have sufficed, but a lot of people fear or are impressed by fancy math, so complex analyses can be a shield that people hide behind.
Fancy analyses typically have more “researcher degrees of freedom.” This means that there is more wiggle room for researchers to choose an analysis that makes the results look the way the researcher would prefer they turn out. These choices can be all too easy to justify for many reasons including confirmation bias, wishful thinking, and a “publish or perish” mentality. In contrast, the Simplest Valid Analysis is often very constrained, with few (if any) choices left to the researcher. This makes it less prone to both unconscious and conscious biases.
When a paper doesn’t include the Simplest Valid Analysis, I think it is wise to downgrade your trust in the result at least a little bit. It doesn’t mean the results are wrong, but it does mean that they are harder to interpret.
I also think it’s fine and even good for researchers to include more sophisticated (valid) analyses and to explain why they believe those are better than the Simplest Valid Analysis, as long as the Simplest Valid Analysis is also included. Fancy methods sometimes are indeed better than simpler ones, but that’s not a good reason to exclude the simpler analysis.
Here are some real-world examples where I’ve seen a fancier analysis used while failing to report the Simplest Valid Analysis:
Running a linear regression with lots of control variables when there is no need to control for all of these variables (or no need to control for more than one or two of the variables)
Use of ANOVA with lots of variables when really the hypothesis only requires a simple comparison of two means
Use of a custom statistical algorithm when a very simple standard algorithm can also test the hypothesis
Use of fancy machine learning when simple regression algorithms may perform just as well
Combining lots of tests into one using fancy methods rather than performing each test one at a time in a simple way
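To make the first example above concrete, here is a small, purely hypothetical sketch in R: the “fancy” version buries the treatment effect in a regression with control variables, while the Simplest Valid Analysis is just a comparison of two group means.

```r
set.seed(1)
d <- data.frame(
  outcome   = rnorm(200),
  condition = rep(c("control", "treatment"), each = 100),
  age       = sample(18:69, 200, replace = TRUE),
  income    = rnorm(200, 50000, 15000)
)

# Fancier analysis: regression with (unneeded) control variables
summary(lm(outcome ~ condition + age + income, data = d))

# Simplest Valid Analysis: do the two group means differ?
t.test(outcome ~ condition, data = d)
```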
The problems that can occur when the results of the Simplest Valid Analysis aren’t reported were one of the reasons we decided to include a Clarity Criterion in our evaluation of studies for Transparent Replications. As part of evaluating a study’s Clarity, if it does not present the results of the Simplest Valid Analysis, we determine what that analysis would be, and we pre-register and conduct it on both the original data and the new data we collect for the replication. Usually it is fairly easy to determine what the Simplest Valid Analysis would be for a research question, but not always. When there are multiple candidates, we select the one we believe is most likely to be informative, and we make that selection prior to running analyses on the original data and prior to collecting the replication data.
In my view, while it is very important that a study replicates, replication alone does not guarantee that a study’s results reflect something real in the world. For that to be the case, we also have to be confident that the results obtained are from valid tests of the hypotheses. One way to increase the likelihood of that being the case is to report the results from the Simplest Valid Analysis.
My advice is that, when you’re reading scientific results, look for the Simplest Valid Analysis, and if it’s not there, downgrade your trust in the results at least a little bit. If you’re a researcher, remember to report the Simplest Valid Analysis to help your work be trusted and to help avoid mistakes (I aspire always to do so, though there have likely been times I have forgotten to). And if you’re a peer reviewer or journal editor, ask authors to report the Simplest Valid Analysis in their papers in order to reduce the risk that the results have been misinterpreted.
We ran a replication of Study 5b from this paper. This study tested whether people believe that morality is declining over time.
The paper noted that people encounter disproportionately negative information about current-day people (e.g., via the media) and people often have weaker emotional responses to negative events from the past. As such, the authors hypothesized that participants would think people are less moral today than people used to be, but that this perception of moral decline would diminish when comparing timepoints before participants were born.
To test these hypotheses, the study asked each participant to rate how “kind, honest, nice, and good” they thought people are today and were at four previous timepoints corresponding, approximately, to when participants were 20 years old, when they were born, 20 years before they were born, and 40 years before they were born.
The results from the original study confirmed the authors’ predictions: Participants perceived moral decline during their lifetime, but there was no evidence of perceived moral decline for the time periods before participants were born.
Our replication found the same pattern of results.
The study received a transparency rating of 4.25 stars because its materials, data, and code were publicly available, but it was not pre-registered. The paper received a replicability rating of 5 stars because all of its primary findings replicated. The study received a clarity rating of 5 stars because the claims were well-calibrated to the study design and statistical results.
We ran a replication of Study 5b from: Mastroianni, A. M., & Gilbert, D. T. (2023). The illusion of moral decline. Nature, 618, 782–789. https://doi.org/10.1038/s41586-023-06137-x
Our Research Box for this replication report includes the pre-registration, study materials, de-identified data, and analysis files.
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
All materials, analysis code, and data were publicly available. The study was not pre-registered.
Replicability: to what extent were we able to replicate the findings of the original study?
All primary findings from the original study replicated.
Clarity: how unlikely is it that the study will be misinterpreted?
This study is explained clearly, the statistics used for the main analyses are straightforward and interpreted correctly, and the claims were well-calibrated to the study design and statistical results.
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
The materials were publicly available and complete.
2. Analysis Transparency:
The analysis code was publicly available and complete. We successfully reproduced the results in the original paper from the publicly available code and data.
3. Data availability:
The raw data were publicly available and complete.
4. Preregistration:
The study was not pre-registered.
Summary of Study and Results
Summary of the hypotheses
The original study made two key predictions:
For time periods during study participants’ lifetimes, participants would perceive moral decline. In other words, they would believe people are morally worse today than people were in the past.
For time periods before participants were born, participants’ perceptions of moral decline would diminish, disappear, or reverse (relative to the time periods during their lifetimes).
The original paper argues that these results are predicted by the two features that the authors hypothesize produce perceptions of moral decline: (a) a biased exposure effect whereby people see more negative information than positive information about current-day people (e.g., via the media); (b) a biased memory effect whereby people are less likely to have strong negative emotional responses to negative events from the past.
Summary of the methods
The original study (N=387) and our replication (N=533) examined participants’ perceptions of how moral other people were at different points in time.
Participants from the following age groups were recruited to participate in the study:
18–24
25–29
30–34
35–39
40–44
45–49
50–54
55–59
60–64
65–69
After answering a few pre-study questions (see “Study and Results in Detail” section), participants were told, “In this study, we’ll ask you how kind, honest, nice, and good people were at various points in time. If you’re not sure or you weren’t alive at that time, that’s okay, just give your best guess.”
Participants then completed the five primary questions of interest for this study, reporting how “kind, honest, nice, and good” people were at five different timepoints:
today (“today”)
around the year the participant turned 20 (“20 years after birth”)
around the year the participant was born (“birth year”)
around 20 years before the participant was born (“20 years before birth”)
around 40 years before the participant was born (“40 years before birth”)
Going forward, we will use the terms in parentheses as shorthand for each of these timepoints. But please note that the timepoints asked about were approximate—for example, “birth year” is not the exact year each participant was born, but it is within a 5-year range of each participant’s birth year.
Figure 1 shows the versions of the primary questions that a 50-54 year-old participant would receive. Each question was asked on a separate survey page. Participants in other age groups saw the same general questions, but the number of “years ago” in questions 2-5 was adjusted to their age group. Participants aged 18-24 did not receive the second question because today and 20 years after birth were the same period of time for participants in this age group.
After completing the primary questions of interest, participants completed a consistency-check question, attention-check question, and demographic questionnaire (see “Study and Results in Detail” section).
Summary of the primary results
The original paper compared participants’ average ratings of how “kind, honest, nice, and good” people were between each adjacent timepoint. They found that:
Participants rated people as less kind, honest, nice, and good today vs 20 years after birth.
Participants rated people as less kind, honest, nice, and good 20 years after birth vs birth year.
Participants rated people as equivalently kind, honest, nice, and good at birth year vs 20 years before birth.
There was no statistically significant evidence of either a difference or equivalence between participants’ ratings of how kind, honest, nice, and good people were 20 years before birth vs 40 years before birth. (However, if anything, participants’ ratings were lower at 40 years before birth, which was consistent with the original paper’s hypotheses.)
See “Study and Results in Detail” section for details on the statistical analyses and model results.
When the original authors reviewed our pre-registration prior to replication data being collected, Dr. Mastroianni offered insights about what results they would be more or less surprised by if we found them in our replication data. Because his comments are from prior to the collection of new data, we and the original authors both thought they added useful context to our report:
As for what constitutes a replication, it’s an interesting question. We ran our studies to answer a question rather than to prove a point, so the way I think about this is, “what kinds of results would make me believe the answer to the question is different from the one I believe now?”
If Contrast 1 was not significant, this would be very surprising, as it would contradict basically every other study in the paper, as well as the hundreds of surveys we review in Study 1.
If Contrast 2 was not significant, this would be mildly surprising. Contrast 2 is a direct replication of a significant contrast we also saw in Study 2c (as is Contrast 1, for that matter). But this difference was fairly small both times, so it wouldn’t be completely crazy if it didn’t show up sometimes.
Contrasts 3 and 4 were pretty flat in the original paper. It would be very surprising if those were large effects in the replication. If they’re significant but very small in either direction, it wouldn’t be that surprising.
Basically, it would be very surprising if people perceive moral decline at both points before their birth, but they perceive moral improvement at both points after their birth. That would really make us scratch our heads. It would be surprising in general if there was more decline in Contrasts 3 & 4 than in 1 & 2.
Dr. Adam Mastroianni in email to Transparent Replications team, 2/29/2024.
Summary of replication results
When we analyzed our data, the results of our replication aligned extremely closely with the results of the original study (compare Figure 2 below to Figure 4 in the original paper).
The only minor difference in the statistical results between the original study and our replication was that our replication found statistically significant evidence of equivalence between participants’ ratings of how kind, honest, nice, and good people were at 20 years before birth versus 40 years before birth. As specified in our preregistration, we still consider this a replication of the original results because it is consistent with the paper’s hypothesis (and subsequent claims) that perceptions of moral decline diminish, disappear, or reverse if people rate time periods before they were born.
Here is a summary of the findings in the original study compared to the replication study:
| Morality ratings in original study | Morality ratings in replication study | Replicated? |
|---|---|---|
| today < 20 years after birth | today < 20 years after birth | ✅ |
| 20 years after birth < birth year | 20 years after birth < birth year | ✅ |
| birth year = 20 years before birth | birth year = 20 years before birth | ✅ |
| 20 years before birth ? 40 years before birth | 20 years before birth = 40 years before birth | ✅ |
Study and Results in Detail
Methods in detail
Preliminary survey questions
Before completing the primary questions of interest in the survey, participants indicated which of the following age groups they belonged to:
18–24
25–29
30–34
35–39
40–44
45–49
50–54
55–59
60–64
65–69
70+
Participants who selected 70+ were screened out from completing the full survey. The original study recruited nearly equal numbers of participants for each of the other 10 age groups. Our replication attempted to do the same, but did not perfectly recruit equal numbers from each age group (see Appendix for more information).
Participants also completed three questions that, according to the original paper, were designed to test “English proficiency and knowledge of US American culture”:
Which of the following are not a type of footwear?
Sneakers
Slippers
Flip-flops
High heels
Bell bottoms
Which of the following would be most likely to require an RSVP?
A wedding invitation
A restaurant bill
A diploma
A thank-you note
A diary
Which of the following would be most likely to have a sign that says “out of order”?
An elevator
A person
A pizza
A book
An umbrella
Consistency check
After completing the five primary questions of interest described in the “Summary of Study and Results” section above, participants answered the following consistency check question:
Please choose the option below that best represents your opinion:
People are MORE kind, honest, nice, and good today compared to about [X] years ago
People are LESS kind, honest, nice, and good today compared to about [X] years ago
People are equally kind, honest, nice, and good today compared to about [X] years ago
“[X]” took on the same value as the final timepoint—around 40 years before the participant was born. This question was designed to ensure that participants were providing consistent ratings in the survey.
Demographics and attention check
After completing the consistency check question, participants reported their age, gender, race/ethnicity, household income, educational attainment, parental status, and political ideology.
Embedded among these demographic questions was the following attention-check question:
Some people are extroverted, and some people are introverted. Please select the option “other” and type in the word “apple”.
Extroverted
Introverted
Neither extroverted nor introverted
Other _______
Exclusion criteria
Participants’ responses were excluded from the data if any of the following applied:
They did not complete the study
They reported being in the 70+ age group
They failed any of the three English proficiency questions
They failed the attention check question
Their answer to the consistency check question was inconsistent with their ratings for today and 40 years before birth
Their reported age in the demographics section was inconsistent with the age group they selected at the beginning of the study
Of the 721 participants who took the survey, 533 passed all exclusion criteria and were thus included in our analyses.
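A minimal sketch of how these exclusions might be applied in R (all column names are assumptions about the raw data, not the actual variable names):

```r
library(dplyr)

included <- raw %>%
  filter(
    completed_study,                 # completed the study
    age_group != "70+",              # within the eligible age range
    passed_english_checks,           # answered all three proficiency questions correctly
    passed_attention_check,          # selected "other" and typed "apple"
    consistency_check_ok,            # consistency question matched their ratings
    reported_age_matches_group       # demographic age matched the selected age group
  )

nrow(included)   # 533 of the 721 who took the survey
```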
Primary analyses: detailed results
As pre-registered, we ran the same statistical analyses as the original paper.
To analyze the primary questions of interest, we ran a linear mixed effects model, with random intercepts for participants, testing whether participants’ morality ratings differed by timepoint (using the lmer function from the lme4 package in R).
We then tested four specific contrasts between the five timepoints using a Holm-Bonferroni correction for multiple comparisons (using the emmeans package in R):
today vs 20 years after birth
20 years after birth vs birth year
birth year vs 20 years before birth
20 years before birth vs 40 years before birth
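Here is a minimal sketch of what this analysis looks like in R (column names such as rating, timepoint, and id are assumptions, not the original variable names):

```r
library(lme4)
library(emmeans)

# Long-format data: one row per participant x timepoint;
# timepoint is a factor ordered from "today" to "40 years before birth"
m <- lmer(rating ~ timepoint + (1 | id), data = dat)

# Estimated marginal means per timepoint, then the four adjacent-timepoint
# contrasts with a Holm correction for multiple comparisons
emm <- emmeans(m, ~ timepoint)
contrast(emm, method = "consec", adjust = "holm")
```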
Here are the results of these contrasts:
| Contrast | Estimate | SE | df | t-value | p-value |
|---|---|---|---|---|---|
| today vs 20 years after birth | -0.727 | 0.052 | 2094 | -13.915 | <0.001*** |
| 20 years after birth vs birth year | -0.314 | 0.052 | 2094 | -6.015 | <0.001*** |
| birth year vs 20 years before birth | -0.036 | 0.051 | 2088 | -0.699 | 0.485 |
| 20 years before birth vs 40 years before birth | 0.088 | 0.051 | 2088 | 1.729 | 0.168 |
Bold numbers are statistically significant at the level indicated by the number of asterisks: *p < 0.05, **p < 0.01, ***p < 0.001.
There were statistically significant differences between today and 20 years after birth and between 20 years after birth and birth year, but not between birth year and 20 years before birth or between 20 years before birth and 40 years before birth—the same pattern as the original study results.
Next, we conducted equivalence tests (using the parameters package in R), for the two comparisons that were not statistically significant. Here are the results:
| Contrast | ROPE | 90% Confidence Interval | SGPV | Equivalence | p-value |
|---|---|---|---|---|---|
| birth year vs 20 years before birth | [-0.13, 0.13] | [-0.09, 0.02] | > .999 | Accepted | 0.003** |
| 20 years before birth vs 40 years before birth | [-0.14, 0.14] | [0.04, 0.14] | > .999 | Accepted | 0.034* |
ROPE = region of practical equivalence. SGPV = second generation p-value (the proportion of the confidence interval range that is inside the ROPE). Bold numbers are statistically significant at the level indicated by the number of asterisks: *p < 0.05, **p < 0.01, ***p < 0.001.
These tests found that, for both contrasts, 100% of the confidence interval range was inside the region of practical equivalence (ROPE). (See the Appendix for a brief discussion of how the ROPE was determined.) Thus, there was statistically significant evidence that birth year and 20 years before birth were equivalent, and that 20 years before birth and 40 years before birth were equivalent. (You can read about how to interpret equivalence test results from the parameters package in its documentation.)
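For readers who want a rough sense of how such a test is invoked, here is a minimal sketch using the parameters package, continuing from the hypothetical mixed model m in the earlier sketch. Note that this sketch runs the test on the model’s coefficients with the package’s default ROPE; the reported analysis applied the test to the two non-significant timepoint contrasts specifically.

```r
library(parameters)

# Frequentist equivalence test with the package's default ROPE.
# A simplified sketch on the fitted mixed model; the report's tests
# were run on the two non-significant timepoint contrasts.
equivalence_test(m, range = "default")
```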
In the original study, birth year and 20 years before birth were found to be equivalent, but there was not statistically significant evidence for equivalence between 20 years before birth and 40 years before birth. As mentioned earlier, we consider equivalence between 20 years before birth and 40 years before birth to be a successful replication of the original study’s findings because it is in line with the claims in the paper that perceptions of moral decline diminish, disappear, or reverse when people are asked about time periods before they were born.
Secondary analyses
As in the original paper, we also tested for relationships between participants’ morality ratings and various demographic variables. Since this analysis was not central to the paper’s claims, we preregistered that these results would not count towards the replicability rating for this paper.
Following the analytical approach in the original paper, we ran a linear regression predicting the difference in participants’ morality ratings between today and birth year by all of the following demographic variables:
Age
Political ideology
Parental status
Gender
Race/ethnicity
Educational attainment
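A minimal sketch of this regression (one row per participant; all column names are assumptions, and the outcome is the today rating minus the birth-year rating, so more negative values indicate more perceived decline):

```r
# White as the comparison group for race, Female for gender
dat_wide$race   <- relevel(factor(dat_wide$race), ref = "White")
dat_wide$gender <- relevel(factor(dat_wide$gender), ref = "Female")

fit <- lm(decline_today_vs_birth ~ age + ideology + parent + gender + race + education,
          data = dat_wide)
summary(fit)
```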
Here are the statistical results from this analysis:
| Variable | Original Results (R² = 0.129) | Replication Results (R² = 0.128) |
|---|---|---|
| Age | -0.014** (0.005) | -0.003 (0.005) |
| Political ideology | -0.335*** (0.058) | -0.307*** (0.048) |
| Parental status | 0.131 (0.150) | 0.345** (0.123) |
| Gender: Male vs Female | 0.137 (0.139) | 0.046 (0.117) |
| Gender: Other vs Female | 0.750 (0.764) | 1.610* (0.761) |
| Race: American Indian or Alaska Native vs White | n/a | 1.635 (0.928) |
| Race: Asian vs White | 0.061 (0.212) | -0.044 (0.208) |
| Race: Black or African-American vs White | -0.289 (0.327) | -0.500 (0.271) |
| Race: Hawaiian or Pacific Islander vs White | -2.039 (1.305) | n/a |
| Race: Hispanic or Latino Origin vs White | 0.006 (0.367) | 0.036 (0.265) |
| Race: More than 1 of the above vs White | 0.546 (0.496) | 0.219 (0.344) |
| Education | -0.012 (0.045) | 0.063 (0.037) |
Each cell shows the coefficient value from the linear regression, with its standard error in parentheses. Bold numbers are statistically significant at the level indicated by the number of asterisks: *p < 0.05, **p < 0.01, ***p < 0.001. Cells with “n/a” indicate that there were no participants of that identity in the dataset.
Note: in the analysis code for the original study, R defaulted to using Asian as the comparison group for race (i.e., each other race category was compared against Asian). We thought the results would be more informative if the comparison group was White (the majority group in the U.S.), so the values in the Original Results column display the results when we re-run the model in the original analysis code with White as the comparison group.
We explain the results for each demographic variable below:
Age
The original study found a statistically significant effect of age such that older people perceived more moral decline (i.e., a larger negative difference between today and birth year morality ratings). However, the original paper argued that this was because the number of years between today and birth year was larger for older participants.
Our replication did not find a statistically significant effect of age.
Political ideology
Participants could choose any of the following options for political ideology:
Very liberal
Somewhat liberal
Neither liberal nor conservative
Somewhat conservative
Very conservative
We converted this to a numeric variable ranging from -2 (very liberal) to 2 (very conservative).
The original study found a statistically significant effect of political ideology such that more conservative participants perceived more moral decline. Our replication found the same result.
Following the original study, we ran a one-sample t-test to determine whether participants who identified as “very liberal” or “somewhat liberal” still perceived moral decline, on average. These participants had an average score of less than zero (mean difference = -0.76, t(295) = -9.6252, p < 2.2e-16), meaning that they did, on average, perceive moral decline.
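A minimal sketch of that test (variable names are assumptions; negative scores indicate perceived decline):

```r
liberals <- subset(dat_wide, ideology_label %in% c("Very liberal", "Somewhat liberal"))
t.test(liberals$decline_today_vs_birth, mu = 0)   # mean below 0 implies perceived decline
```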
Parental status
Participants reported how many children they had. We converted this into a binary variable representing whether or not each participant is a parent.
The original study did not find a statistically significant effect of parental status. However, our replication found a significant effect such that parents perceived more moral decline than non-parents.
Gender
Participants could choose any of the following options for gender:
Male
Female
Other
The original study did not find a statistically significant effect of gender. Our replication, on the other hand, found a significant effect of gender such that participants who selected “Other” did not perceive moral decline, on average. However, we do not recommend giving much credence to this statistical difference because only 3 out of the 533 participants selected “Other.” We think conclusions should not be drawn in either direction with such a small sample size for that category.
Race/ethnicity
Participants could choose any of the following options for race/ethnicity:
American Indian or Alaska Native
Asian
Black or African-American
Hispanic or Latino Origin
Hawaiian or Pacific Islander
White
Other
More than 1 of the above
Neither the original study nor our replication found a statistically significant effect of race/ethnicity when the variable is dummy coded with White as the comparison group.
Education
Participants could choose any of the following options for education:
Did not complete high school
High school diploma
Some college
Associate’s degree
Four-year college degree
Some graduate school
Graduate school
We converted this to a numeric variable ranging from 0 (did not complete high school) to 6 (graduate school).
Neither the original study nor our replication found a statistically significant effect of education.
Interpreting the Results
All of the primary original-study results replicated in the data we collected, according to the replication criteria we pre-registered.
It is worth highlighting that there was one minor statistical discrepancy between the primary results for the two datasets. The original study did not find statistical evidence for either a difference or equivalence between 20 years before birth and 40 years before birth. Our replication also found no statistical evidence for a difference between these timepoints, but it did find evidence for equivalence between the timepoints. We specified in advance that this pattern of results would qualify as a successful replication because it supports the original paper’s hypothesis that perceptions of moral decline diminish, disappear, or reverse when people are asked about time periods before they were born.
Among the secondary analyses, which tested the relationship between perceptions of moral decline and various demographic factors, our replication results differed from the original study results for a few variables. The original study found that only political ideology and age were statistically significant predictors of participants’ perceptions of moral decline. Our replication found similar results for political ideology, but it did not find age to be a significant predictor. Additionally, our replication found parental status and gender to be significant predictors. However, we caution against placing much weight on the gender result. It was driven by the fact that the gender response option “Other” had a substantially different average moral decline rating from the response options “Male” and “Female,” but only 3 out of 533 participants comprised the “Other” category (see Figure 5). We consider this too small of a subgroup sample size to draw meaningful conclusions from. As we pre-registered, the secondary analyses were not considered in our replication ratings because they were not central to the paper’s hypotheses and the authors did not strongly interpret or theorize about the demographic-level findings.
Finally, the paper was careful to note that its findings are not direct evidence for the biased exposure and biased memory effects that it postulates as causes of the perception of moral decline:
“The illusion of moral decline is a robust phenomenon that surely has several causes, and no one can say which of them produced the illusion that our studies have documented. Studies 5a and 5b do not directly implicate the BEAM mechanism in that production but they do make it a viable candidate for future research.” (p. 787)
We would like to reiterate this interpretation: the observed result is what one would expect if the biased exposure effect and biased memory effect gave rise to perceptions of moral decline, but this study does not provide causal evidence for either of these mechanisms.
Conclusion
Overall, we successfully replicated all of the primary findings from the original study. Collectively, these findings suggest that people in the U.S. (aged 18-69), on average, perceive moral decline for time periods during their lifetimes, but not for time periods before they were born. The study received 5 stars for replicability.
All of the study’s data, materials, and analysis code were publicly available and well-documented, which made this replication straightforward to conduct. We also successfully reproduced the results in the original paper using the provided data and analysis code. The one area for improvement on the transparency front is preregistration: this study was not pre-registered, even though it was very similar to a previous study in this paper (Study 2c). The study received 4.25 stars for transparency.
Generally, the study’s analyses were appropriate and its claims were well-calibrated to its study design and results. The study received 5 stars for clarity.
Acknowledgements
We want to thank the authors of the original paper for making their data, analysis code, and materials publicly available, and for their quick and helpful correspondence throughout the replication process. Any errors or issues that may remain in this replication effort are the responsibility of the Transparent Replications team.
We also owe a big thank you to our 533 research participants who made this study possible.
Finally, I am extremely grateful to Amanda Metskas and the rest of the Transparent Replications team for their advice and guidance throughout the project.
Author Response
The authors of the original study shared the following response to this report:
“We are pleased to see these effects replicate, and we are grateful to the Transparent Replications team for their work.”
Dr. Adam Mastroianni via email 7/5/2024
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional Information about the Methods
Recruitment
Both the original study and our replication recruited a sample of participants stratified by age. However, the original study and our replication used slightly different methods for doing so, which resulted in small differences in age-group proportions between the two studies.
In the original study, participants were first asked to report their age. A quota system was set up inside the survey software such that, in theory, only 50 participants from each of the following age groups would be allowed to participate: 18–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65–69. If participants indicated that they were 70 or older, or if they were not among the first 50 participants from their age group to take the study, they were not allowed to participate (the original study did not achieve a perfect split by age, but it was quite close to 50 per group; see the table below). After completing the age question, participants completed the three English proficiency and knowledge of US American culture questions. If they failed any of the proficiency questions, they were not allowed to participate in the study.
In order to ensure that all participants were paid for the time they spent on the study, we did not use the same pre-screening process used in the original study. In the original study, if the age quota for a participant’s age group was already reached, or if a participant didn’t pass the screening questions, they were not paid for the initial screening questions they completed. In order to avoid asking participants to answer questions for which they wouldn’t be paid, we used age quotas within Positly to recruit participants in approximately equal proportions for each age group. Participants still indicated their age in the first part of the survey, but they were no longer screened out by a built-in age quota. This process did not achieve perfectly equal recruitment numbers by age group. We expect that this is because some participants reported an age in our experiment that differed from their listed age in the recruitment platform’s records. This could be for a variety of reasons including that some members of a household might share an account.
Although our recruitment strategy did not achieve perfect stratification by age group, the two studies had relatively similar age-group breakdowns. The table below shows the pre-exclusion and post-exclusion stratification by age group for both studies.
We also want to note a minor deviation from our pre-registered recruitment strategy. In our pre-registration we said:
“We will have 600 participants complete the study. If we do not have 520 or more participants remaining after we apply the exclusion criteria, then we will collect additional participants in batches of 20 until we reach 520 post-exclusion participants. We will not conduct any analyses until data collection is complete. When collecting data, we will apply age-group quotas by collecting 60 participants from each of the following ten age groups: 18–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65–69. If we need to recruit additional participants, we will apply the age-group quotas in such a way as to seek balance between the age groups.”
Because recruiting participants from the youngest age group (18-24) and the oldest age group (65-69) turned out to be extremely slow, we decided not to “apply the age-group quotas in such a way as to seek balance between the age groups” when we recruited participants beyond the original 600. (Note: We did not look at the dependent variables in the data until we had fully finished data collection, so this small deviation from the preregistration was not influenced by the data itself.)
It’s also worth noting that the total number of participants we recruited was not a multiple of 20 despite our stated recruitment approach. This is because each time one collects data from an online crowdsourcing platform like Positly, a few more participants may complete the study than the recruitment target. For example, sometimes participants complete the study in the survey software but do not indicate to the crowdsourcing platform that they completed it. Because we had many rounds of recruitment for this study, each round had the opportunity to collect slightly more participants than the targeted number.
| Age group | Before exclusions: Original study (n=499) | Before exclusions: Replication study (n=721) | After exclusions: Original study (n=387) | After exclusions: Replication study (n=533) |
|---|---|---|---|---|
| 18–24 | 10.0% | 7.9% | 9.8% | 7.5% |
| 25–29 | 10.4% | 11.2% | 8.8% | 10.7% |
| 30–34 | 10.4% | 12.1% | 10.3% | 12.0% |
| 35–39 | 10.8% | 12.6% | 11.6% | 13.3% |
| 40–44 | 10.2% | 9.8% | 11.4% | 10.1% |
| 45–49 | 10.0% | 9.7% | 10.1% | 9.6% |
| 50–54 | 10.0% | 9.4% | 10.1% | 10.5% |
| 55–59 | 10.0% | 9.7% | 10.9% | 9.4% |
| 60–64 | 8.2% | 8.8% | 8.5% | 9.4% |
| 65–69 | 9.8% | 7.8% | 8.5% | 7.5% |
| 70+ | 0% | 0.8% | 0% | 0% |
We also want to note one change we made in how subjects were recruited during our data collection. In the early portion of our data collection the recruited subjects first completed a pre-screener that asked the three English proficiency and knowledge of US American culture questions and confirmed that they were within the eligible age range for the study. All participants were paid for the pre-screener, and those who passed it were invited to continue on to take the main study. 146 participants passed the pre-screener and went on to take the main study.
We found that the pre-screening process was slowing down recruitment, so we incorporated the screening questions into the main study and allowed recruited participants to complete and be paid for the study even if they failed the screening. We excluded participants who failed the screening from our data analysis. 575 participants took the study after this modification was made.
Finally, it’s important to note that our pre-exclusion sample size of n=721 is the number of participants who provided consent to participate in our study; the number of participants in our replication who passed the screening criteria of being between ages 18-69 and correctly answering the three English proficiency and knowledge of US American culture questions was n=703.
Additional Information about the Results
Corrections for multiple comparisons
For the primary analysis in which participants’ morality ratings are compared between timepoints, we followed the analytical approach used in the original paper and used a Holm-Bonferroni correction for multiple comparisons across the four contrasts that were tested. However, we think a correction for multiple comparisons is unnecessary in this situation. As argued by Rubin (2024), such a correction would only be needed if the authors would have considered their hypothesis confirmed when at least one of the contrasts returned the hypothesized result. Instead, the authors needed each of the four contrasts to match their expected pattern in order to confirm their hypothesis. As such, we argue that correcting for multiple comparisons is overly conservative in this study. That said, not correcting for multiple comparisons on our replication data does not change the statistical significance of any of the findings.
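As an aside for readers unfamiliar with the Holm-Bonferroni procedure, here is a small illustration using base R’s p.adjust with four hypothetical unadjusted p-values (placeholders, not the study’s actual values):

```r
p_raw <- c(0.0004, 0.0009, 0.485, 0.168)   # hypothetical unadjusted p-values
p.adjust(p_raw, method = "holm")
#> 0.0016 0.0027 0.4850 0.3360
```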
Region of practical equivalence (ROPE) for equivalence tests
It’s important to note that when conducting equivalence tests, evidence for equivalence depends on what one sets as the region of practical equivalence (ROPE). The original authors chose to use the default calculation of the ROPE in the parameters package in R (see the package documentation for more information). Given that the original study was not pre-registered, we think this is a reasonable decision; after knowing the study results, it could be difficult to justify a particular ROPE without being biased by how this would affect the findings. To make our results comparable to the original study, we also used the default calculation of the ROPE. However, we want to note that this is not a theoretical justification for the specific ROPE used in this study; other researchers might reasonably argue for a wider or narrower ROPE.
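For reference, the default ROPE can be inspected directly. A minimal sketch, using the hypothetical model object m from the earlier sketches; for linear models the package default is roughly plus or minus 0.1 standard deviations of the response variable.

```r
library(bayestestR)

# Default region of practical equivalence for a (frequentist) linear model:
# approximately +/- 0.1 standard deviations of the response variable.
rope_range(m)
```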
References
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.
Rubin, M. (2024). Inconsistent multiple testing corrections: The fallacy of using family-based error rates to make inferences about individual hypotheses. Methods in Psychology, 10, 100140. https://doi.org/10.1016/j.metip.2024.100140
We ran a replication of Study 2 from this paper, which found that participants place greater value on information in situations where they’ve been specifically assigned or “endowed with” that information compared to when they are not endowed with that information. This is the case even if that information is not of any particular use (i.e., people exhibit the endowment effect for noninstrumental information). This finding was replicated in our study.
The original study randomized participants into two conditions: endowed and nonendowed. In the endowed condition, participants were told that they were on course to learn a specific bundle of three facts and were then offered the option to learn a separate bundle of four facts instead. In the nonendowed condition, participants were simply offered a choice between learning a bundle of three or a separate bundle of four facts, with the bundles shown in randomized order. Results of a chi-square goodness-of-fit test indicated that participants in the endowed condition were more likely to express a preference for learning three (versus four) facts than participants in the nonendowed condition. This supported the original researchers’ hypothesis that individuals exhibit the endowment effect for non-instrumental information. This finding was replicated in our study.
We simultaneously ran a second experiment to investigate the possibility that order effects could have contributed to the results of the original study. Our second experiment found that (even when controlling for order effects) there was still evidence of the endowment effect for noninstrumental information.
The original study (Study 2) received a replicability rating of five stars as its findings replicated in our replication analysis. It received a transparency rating of 4.25 stars. The methods, data, and analysis code were publicly available. Study 2 (unlike the others in the paper) was not pre-registered. The study received a clarity rating of 3 stars as its methods, results, and discussion were presented clearly and the claims made were well-supported by the evidence provided; however, the randomization and the implications of choice order for participants in the nonendowed condition were not clearly described in the study materials. Although randomization was mentioned in the supplemental materials, the implications of this randomization and the way it could influence the interpretation of results were not explored in either the paper or the supplemental materials.
We ran a replication of Study 2 from: Litovsky, Y., Loewenstein, G., Horn, S., & Olivola, C. Y. (2022). Loss aversion, the endowment effect, and gain-loss framing shape preferences for noninstrumental information. Proceedings of the National Academy of Sciences, 119(34). https://doi.org/10.1073/pnas.2202700119
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
The methods, data, and analysis code were publicly available. The study (unlike the others in the paper) was not pre-registered.
Replicability: to what extent were we able to replicate the findings of the original study?
The original finding replicated.
Clarity: how unlikely is it that the study will be misinterpreted?
Methods, results, and discussion were presented clearly and all claims were well-supported by the evidence provided; however, the paper did not control for order effects or discuss the implications of choice order for participants in the nonendowed condition.
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
The materials are publicly available and complete.
2. Analysis Transparency:
The analyses were described clearly and accurately.
3. Data availability:
The cleaned data was publicly available; the deidentified raw data was not publicly available.
4. Preregistration:
Study 2 (unlike the others in the paper) was not pre-registered.
Summary of Original Study and Results
The endowment effect describes “an asymmetry in preferences for acquiring versus giving up objects” (Litovsky, Loewenstein, Horn & Olivola, 2022). Building on seminal work by Daniel Kahneman and colleagues (e.g., Kahneman, Knetsch, & Thaler, 1990, 1991), Litovsky and colleagues found that the endowment effect affects preferences for “owning” noninstrumental information.
Results of a chi-square goodness-of-fit test in the original study indicated that participants in the endowed condition (each of whom was “endowed” with a 3-fact bundle) were more likely to express a preference for learning three (as opposed to four) facts (68%) than participants in the nonendowed condition (46%) (χ2(1, n = 146) = 7.03, P = 0.008, Φ = 0.219). This led the researchers to confirm their hypothesis that individuals exhibit the endowment effect for noninstrumental information.
Study and Results in Detail
Methods
In the original study, participants were randomly assigned to one of two conditions: endowed or nonendowed. Illustrations of these conditions are shown below in Figures 1 and 2. In the endowed condition, participants were told that they were on course to learn a specific bundle of three facts and were then offered the option to learn a different bundle of four facts instead. In the nonendowed condition, participants were shown two options that they could freely choose between: the 3-fact bundle and the 4-fact one. The choice order was randomized, so the 3-fact bundle was on top half the time and the 4-fact bundle was on top the other half of the time.
None of the facts presented were of objectively greater utility or interest than any of the others. The facts related to, for example, the behavior of a particular animal, or the fact that the unicorn is the national animal of a country. Furthermore, each time the researchers ran the experiment, they randomized which facts appeared in which order across both bundles. Because of this randomization, the subjective utility of any given fact would not be expected to affect the experimental results.
Figure 1: Endowed Condition
Figure 2: Nonendowed Conditions
In the original experiment, two variables varied across conditions: endowment and the order in which the two bundles were presented. Within the nonendowed condition, option order was randomized, so the 3-fact bundle appeared on top for half of the participants and the 4-fact bundle appeared on top for the other half. In the endowed condition, by contrast, option order was not randomized: the 3-fact bundle was always shown on top. To control for ordering effects, we increased our sample size to 1.5 times the original planned size and split the nonendowed condition (now double the size it would otherwise have been) into two separate conditions: Conditions 2 and 3.
Our participants were randomized into one of three conditions, as described below:
Condition 1: Endowed – Participants were told that they were on course to learn a specific bundle of three facts and were then offered the option to learn four different facts instead.
Condition 2: Nonendowed with 3-fact bundle displayed on top – Participants were offered a choice between learning three facts or four facts, with the bundle of 3 facts appearing as the top option.
Condition 3: Nonendowed with 4-fact bundle displayed on top – Participants were offered a choice between learning three facts or four facts, with the bundle of 4 facts appearing as the top option
When Conditions 2 and 3 are pooled together, they are equivalent to the original study’s single nonendowed condition, which presented the 3- and 4-fact bundles in randomized order. In keeping with the original experiment, we compared the key outcome variable, preference for learning three (rather than four) facts, between the endowed condition and this pooled nonendowed condition. However, we also considered an additional comparison not made in the original study: the proportion choosing the 3-fact bundle in Condition 1 versus Condition 2 alone.
The original study included 146 adult participants from Prolific. Our replication included 609 adult participants (after 22 were excluded from the 631 who finished it) from MTurk via Positly.com.
Data and Analysis
Data Collection
Data were collected using the Positly platform over a two-week period in February and March 2024. Following the original study’s power analysis, a sample size of 391 would be required to detect an effect as small as 75% of the original effect size with 90% power; however, in order to complete the additional analysis, we doubled the number of participants in the nonendowed condition (before dividing that single condition into Conditions 2 and 3). This required data to be collected from at least 578 participants, after accounting for excluded participants, as described below.
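As a rough cross-check of that target, the pwr package can approximate the same calculation. This is a sketch, not the original G*Power computation, and it assumes the original effect size of Φ = 0.219 reported above.

```r
library(pwr)

# Sample size to detect 75% of the original effect size (phi = 0.219)
# in a 1-df chi-square test with 90% power and alpha = .05
pwr.chisq.test(w = 0.75 * 0.219, df = 1, sig.level = 0.05, power = 0.90)
# N comes out to roughly 390, in line with the reported target of 391
```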
Excluding Observations
Any responses with missing data were not included in our analyses. We also excluded participants who reported that they had completed a similar study on Prolific in the past (N = 22). This was assessed via the final question in the experiment: “Have you done this study (or one very similar to it) on Prolific in the past?” Answer options were: (1) Yes, I have definitely done this study (or one very similar to it) on Prolific before; (2) I think I’ve done this study (or one very similar to it) on Prolific before, though I’m not sure; (3) I don’t think I’ve done this study (or one very similar to it) on Prolific before, though I’m not sure; and (4) No, I definitely haven’t done this study or anything like it before. Our main analysis included all participants who selected options 3 or 4 (total N = 609). Our (pre-planned) supplementary analyses only included participants who selected option 4 (N = 578).
Analysis
To evaluate the replicability of the original study, we ran a chi-square goodness-of-fit test to evaluate differences in preference for learning three facts between participants in the endowed versus the pooled nonendowed conditions. As stated in the pre-registration, our policy was to consider the study to have replicated if this test yielded a statistically significant result, with the difference in the same direction as the original finding (i.e., with a higher proportion of participants selecting the 3-fact bundle in the endowed compared to the pooled nonendowed conditions).
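A minimal sketch of that test in R (the column names condition and choice are assumptions about the data layout, not the original variable names):

```r
# condition: "endowed" vs "nonendowed" (Conditions 2 and 3 pooled)
# choice: "3 facts" vs "4 facts"
tab <- table(dat$condition, dat$choice)
chisq.test(tab, correct = FALSE)  # set correct = TRUE for Yates' continuity correction
```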
Results
Main Analyses
As per our pre-registration, our main analysis included all participants who completed the study and who reported that they believed that they had not completed this study, or one similar to it, in the past. We found that participants in the endowed condition selected the 3-fact bundle more frequently than participants in the nonendowed condition (71% vs. 44%, respectively) (χ2 (1, n = 609) = 40.122, p < 0.001).
We also conducted another analysis to evaluate the design features of the original study using these same inclusion criteria. Using a chi-square goodness-of-fit test, we compared the proportion choosing the 3-fact bundle in Condition 1 versus Condition 2 alone, finding again that the proportion of participants choosing the 3-fact bundle in Condition 1 (71%) to be significantly greater than the proportion of participants choosing the 3-fact bundle in Condition 2 (43%) (χ2 (1, n = 410) = 33.596, p < 0.001).
Supplementary Analyses
Using only those participants who reported that they definitely had not completed this study (or one similar to it) in the past, we again found that participants in the endowed condition selected the 3-fact bundle more frequently than participants in the nonendowed condition (71% vs. 43%, respectively) (χ2 (1, n = 578) = 39.625, p < 0.001, Φ = 0.26) and that the proportion of participants choosing the 3-fact bundle in Condition 1 (71%) was significantly greater than the proportion of participants choosing the 3-fact bundle in Condition 2 (42%) (χ2 (1, n = 391) = 31.716, p < 0.001, Φ = 0.285).
Interpreting the Results
The label “noninstrumental information” was used in this report to follow the language present in the original study. It should be noted, however, that some individuals might consider discovering new and potentially fun or interesting information to carry some instrumental value as it enables them to act on curiosity, learn something new, amuse themselves, or possibly share novel information with others.
We note that the proportion of participants who chose 3 facts in the nonendowed condition closely mirrored the proportion of the total facts represented by 3 (i.e. 3/7 = 43%). This is consistent with an interpretation that people might be drawn to what they believe to be the single most interesting fact and might make their choice (in the nonendowed condition, at least) based on which bundle contains the fact they perceive to be most interesting.
Conclusion
We replicated the original study results and confirmed they held when controlling for an alternative explanation we had identified. Participants displayed the endowment effect for noninstrumental information based on their preference for choosing to learn a random 3-fact bundle that they had been endowed with, rather than a 4-fact bundle presented as an alternative option. The original study received a replicability rating of five stars as its findings replicated in all replication analyses. It received a transparency rating of four stars, reflecting the public availability of the methods, data, and analysis code alongside the lack of a preregistration. The study received a clarity rating of 3 stars as its methods, results, and discussion were presented clearly and the claims made were well-supported by the evidence provided; however, the randomization and the implications of choice order for participants in the nonendowed condition were not clearly described in the paper or study materials.
Acknowledgements
We would like to acknowledge the authors of the original paper. Their experimental materials and data were shared transparently, and their team was very communicative and cooperative. In particular, we thank them for their thoughtful feedback on our materials over several review rounds.
We would also like to acknowledge the late Daniel Kahneman, a motivating force behind the original study, and his many contributions to the fields of psychology and behavioral economics.
We would like to thank the Transparent Replications team, especially Amanda Metskas and Spencer Greenberg for their support through this process, including their feedback on our idea to add an extension arm to the study to control for the order effects we had identified as a potential alternative (or contributing/confounding) explanation for the original study’s findings. We are very grateful to our Independent Ethics Evaluator, who made an astute observation regarding our early sample size planning (in light of our additional study arm having been introduced after our initial power analysis) that resulted in us reviewing and improving our plans for the extension arm of the study. Last but not least, we are grateful to all the study participants for their time and attention.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
References
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.
Kahneman, D., Knetsch, J. L., & Thaler, R. H. (1990). Experimental tests of the endowment effect and the Coase theorem. Journal of Political Economy, 98(6), 1325-1348
Kahneman, D., Knetsch, J. L., & Thaler, R. H. (1991). Anomalies: The endowment effect, loss aversion, and status quo bias. Journal of Economic Perspectives, 5, 193–206.
Litovsky, Y., Loewenstein, G., Horn, S., & Olivola, C. Y. (2022). Loss aversion, the endowment effect, and gain-loss framing shape preferences for noninstrumental information. Proceedings of the National Academy of Sciences, 119(34). https://doi.org/10.1073/pnas.2202700119
He talks about the state of replication in psychology, incentives in academic research, statistical methods, and how Transparent Replications is working to improve the reliability of research. Check it out!
Transparent Replications presented our project and preliminary results at the Year of Open Science Culminating Conference. This virtual conference was a collaboration between the Open Science Foundation and NASA and was attended by over 1,000 people. Now you can see our presentation too!
The Transparent Replications presentation is the first fifteen minutes of this video. After our presentation the session continues with two more organizations presenting their work followed by a brief Q&A.
We really appreciated the opportunity to share what we are working on. If you have any feedback for us, or want to get involved, please don’t hesitate to contact us!
We ran a replication of Study 1 from this paper, which tested whether a series of popular logos and characters (e.g., Apple logo, Bluetooth symbol, Mr. Monopoly) showed a “Visual Mandela Effect”—a phenomenon where people hold “specific and consistent visual false memories for certain images in popular culture.” For example, many people on the internet remember Mr. Monopoly as having a monocle when, in fact, the character has never had a monocle. The original study found that 7 of the 40 images it tested showed evidence of a Visual Mandela Effect: C-3PO, Fruit of the Loom logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen logo, and Waldo (from Where’s Waldo). These results fully replicated in our study.
In the study, participants evaluated one popular logo or character image at a time. For each image, participants saw three different versions. One of these versions was the original, while the other two versions had subtle differences, such as a missing feature, an added feature, or a change in color. Participants were asked to select which of these three versions was the correct version. Participants then rated how confident they felt in their choice, how familiar they were with the image, and how many times they had seen the image before.
If people chose one particular incorrect version of an image statistically significantly more often than they chose the correct version of an image, that was considered evidence of a possible Visual Mandela Effect for that image.
The study received a transparency rating of 3.5 stars because its materials and data were publicly available, but it was not pre-registered and there were insufficient details about some of its analyses. The paper received a replicability rating of 5 stars because all of its primary findings replicated. The study received a clarity rating of 2.5 stars due to errors and misinterpretations in some of the original analyses.
We ran a replication of Study 1 from: Prasad, D., & Bainbridge, W. A. (2022). The Visual Mandela Effect as Evidence for Shared and Specific False Memories Across People. Psychological Science, 33(12), 1971–1988. https://doi.org/10.1177/09567976221108944
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
Study materials and data are publicly available. The study was not pre-registered. Analysis code is not publicly available and some analyses were described in insufficient detail to reproduce.
Replicability: to what extent were we able to replicate the findings of the original study?
All of the study’s main findings replicated.
Clarity: how unlikely is it that the study will be misinterpreted?
The analyses, results, and interpretations are stated clearly. However, there is an error in one of the primary analyses and a misinterpretation of another primary analysis. First, the χ2 test was conducted incorrectly. Second, the split-half consistency analysis does not seem to add reliably diagnostic information to the assessment of whether images show a VME (as we demonstrate with simulated data). That said, correcting for these errors and misinterpretations with the original study’s data results in similar conclusions for 6 out of the 7 images identified in the original study as showing the Visual Mandela Effect. The seventh image dropped below significance when the corrected analysis was run on the original data; however, we evaluated that image as part of the replication since it was claimed as a finding in the paper, and we found a significant result in our replication dataset.
Detailed Transparency Ratings
Overall Transparency Rating: 3.5 stars
1. Methods Transparency:
The materials are publicly available and complete.
2. Analysis Transparency:
The analysis code is not publicly available. Some of the analyses (the χ2 test and the Wilcoxon Rank-Sum Test) were described in insufficient detail to easily reproduce the results reported in the paper. The paper would benefit from publicly available analysis code and supplemental materials that describe the analyses and results in greater detail.
3. Data availability:
The cleaned data was publicly available; the deidentified raw data was not publicly available.
4. Preregistration:
Study 1 was not pre-registered; however, it was transparently described as an exploratory analysis.
Summary of Study and Results
The original study (N=100) and our replication (N=389) tested whether a series of 40 popular logo and character images show evidence of a Visual Mandela Effect (VME). The Mandela Effect is a false memory shared by a large number of people. The name of the effect refers to an instance of this phenomenon where many people remember Nelson Mandela dying in prison during the Apartheid regime in South Africa, despite this not being the case. This article examines a similar effect occurring for specific images. The authors specified five criteria that images would need to meet in order to show a VME:
(a) the image must have low identification accuracy
(b) there must be a specific incorrect version of the image falsely recognized
(c) these incorrect responses have to be highly consistent across people
(d) the image shows low accuracy even when it is rated as being familiar
(e) the responses on the image are given with high confidence even though they are incorrect
(Prasad & Bainbridge, 2022, p. 1974)
To test for VME images, participants saw three versions of each image concept. One version was the correct version. The other two versions were altered using one of five possible manipulations: adding a feature; subtracting a feature; changing a feature; adjusting the position or orientation of a feature; changing the color of a feature. For example, for the Mr. Monopoly image, one altered version added a monocle over one eye and the other altered version added glasses.
For each image, participants did the following:
Chose the version of the image they believed to be the correct (i.e., canonical) version
Rated how confident they felt in their choice (1 = not at all confident; 5 = extremely confident)
Rated how familiar they were with the image (1 = not at all familiar; 5 = extremely familiar)
Rated how many times they had seen the image before (0; 1-10; 11-50; 51-100; 101-1000; 1000+)
Figure 1 shows what this process looked like for participants, using the Mr. Monopoly image as an example.
Assessing criteria (a) and (b)
Following the general approach used in the original paper, we tested whether each of the 40 images met criteria (a) and (b) by assessing whether one version of the image was chosen more commonly than the other versions. If one incorrect version was chosen more often than both the correct version and the other incorrect version, this was considered evidence of low identification accuracy and evidence that a specific incorrect version of the image was falsely recognized. The original study identified 7 images meeting criteria (a) and (b). Upon reproducing these results with the original data, we noticed an error in the original analysis (see Study and Results in Detail and the Appendix for more information). When we corrected this error, 6 images in the original data met these criteria. In the new data we collected for our replication, 8 images met these criteria, including the 7 identified in the original paper.
Table 1. Original and replication results for VME criteria (a) and (b)
Test of criteria (a) and (b): For each image, is a specific, incorrect version chosen more frequently than the correct version?

Image                   | Original result | Replication result
C-3PO                   | +               | +
Curious George          | +               | +
Fruit of the Loom Logo  | +               | +
Mr. Monopoly            | +               | +
Pikachu                 | +               | +
Tom (Tom & Jerry)       | 0               | +
Volkswagen Logo         | +               | +
Waldo (Where’s Waldo?)  | +*              | +
The other 32 images     | –               | –
Note: ‘+’ refers to statistically significant evidence that a specific, incorrect version of the image was chosen more often than the correct version. ‘-’ refers to statistically significant evidence that the correct version of the image was chosen more often than a specific, incorrect version. ‘0’ refers to a non-statistically significant (null) result in either direction. *The original paper reports finding that a specific, incorrect version of Waldo was chosen more often than the correct version. However, the analysis used to arrive at this conclusion was flawed. When we re-analyzed the original data using the correct analysis, this finding was not statistically significant.
Assessing criterion (c)
We did not run a separate analysis to test whether each image met criterion (c). After conducting simulations of the split-half consistency analysis used in the original study to assess criterion (c), we concluded that this analysis does not contribute any additional reliable information to test whether incorrect responses are highly consistent across people beyond what is already present in the histogram of the data. Moreover, we argue that if an image meets criteria (a) and (b), it should also meet (c). (See Study and Results in Detail and the Appendix for more information.)
Assessing criteria (d) and (e)
Following the general approach used in the original paper, we tested whether each image met criteria (d) and (e) by running a series of permutation tests to assess the strength of three different correlations when a set of images was excluded from the data. Specifically, we tested whether the following three correlations were stronger when the 8 images that met criteria (a) and (b) were excluded compared to when other random sets of 8 images were excluded:
The correlation between familiarity and confidence
The correlation between familiarity and accuracy
The correlation between confidence and accuracy
In line with the authors’ expectations, there was no evidence in either the original data or in our replication data that the correlation between familiarity and confidence changed when the VME-apparent images were excluded compared to excluding other images. By contrast, when examining correlations with accuracy, there was evidence that excluding the VME-apparent images strengthened correlations compared to excluding other images.
The original study found that the positive correlation between familiarity and accuracy was higher when the specific images that met criteria (a) and (b) were excluded, suggesting that those images did not have the strong positive relationship between familiarity and accuracy observed among the other images. Similarly, the original study also found that the positive correlation between confidence and accuracy was higher when the specific images that met criteria (a) and (b) were excluded, suggesting that those images did not have the strong positive relationship between confidence and accuracy observed among the other images. In our replication data, we found the same pattern of results for these correlations.
Table 2. Original and replication results for VME criteria (d) and (e)
Test of criteria (d) and (e): Is the correlation of interest higher when the images that meet criteria (a) and (b) are dropped from the sample?

Correlation                                     | Original result | Replication result
Correlation between confidence and familiarity  | 0               | 0
Correlation between familiarity and accuracy    | +               | +
Correlation between confidence and accuracy     | +               | +
Note: ‘+’ refers to statistically significant evidence that a correlation was higher when the images that met criteria (a) and (b) were excluded compared to when other random sets of images were excluded. ‘0’ refers to a non-statistically significant (null) result.
Study and Results in Detail
This section goes into greater technical detail about the analyses and results used to assess the five Visual Mandela Effect (VME) criteria the authors specified:
(a) the image must have low identification accuracy
(b) there must be a specific incorrect version of the image falsely recognized
(c) these incorrect responses have to be highly consistent across people
(d) the image shows low accuracy even when it is rated as being familiar
(e) the responses on the image are given with high confidence even though they are incorrect
(Prasad & Bainbridge, 2022, p. 1974)
Evaluating images on criteria (a) and (b)
To assess whether each of the 40 images met VME-criteria (a) and (b), we first calculated the proportion of participants who chose each of the three image versions (see Figure 2). Image choices were labeled as follows:
“Correct” = the canonical version of the image
“Manipulation 1” = the more commonly chosen version of the two non-canonical versions
“Manipulation 2” = the less commonly chosen version of the two non-canonical versions
We then ran a χ2 goodness-of-fit test to assess whether, for each image, the Manipulation 1 version was chosen statistically significantly more often than the Correct version. The test revealed that, for 8 of the 40 images, the Manipulation 1 version was chosen statistically significantly more often than the Correct version.
There were no images for which the Manipulation 2 version was chosen more often than the Correct version, so we did not need to formally test whether the Manipulation 1 version was also chosen more often than the Manipulation 2 version for these 8 images. All 7 of the images identified in the original paper as meeting criteria (a) and (b) were among the 8 images we identified (see Table 1).
It is important to note that, in the original study, this analysis was conducted using a χ2 test of independence rather than a χ2 goodness-of-fit test. However, using a χ2 test of independence in this situation violates one of the core assumptions of the χ2 test of independence—that the observations are independent. Because participants could only choose one option for each image concept, whether a participant chose the Manipulation 1 image was necessarily dependent on whether they chose the correct image. The way the χ2 test of independence was run in the original study led to an incorrect inflation of the χ2 values. Thus, per our pre-registration, we ran a χ2 goodness-of-fit test (rather than a χ2 test of independence) to assess whether a specific incorrect version of each image was falsely identified as the correct version. For a more thorough explanation of the issues with the original analytical technique, see the Appendix.
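To make the pre-registered test concrete, here is a minimal sketch in Python. The counts are hypothetical (not taken from either dataset), and it assumes scipy is available; it tests the observed split between the Manipulation 1 and Correct choices against a 50/50 chance split, with participants who chose the other incorrect version excluded from the comparison.

```python
from scipy.stats import chisquare

# Hypothetical counts for one image concept (illustrative only): how many
# participants chose the most commonly chosen incorrect version versus the
# correct version. Participants who chose the other incorrect version are
# excluded from this comparison.
manipulation_1_count = 120
correct_count = 80
n = manipulation_1_count + correct_count

# Goodness-of-fit test against an expected 50/50 split between the two versions.
# A significant result with manipulation_1_count > correct_count is evidence
# for criteria (a) and (b).
chi2, p = chisquare(f_obs=[manipulation_1_count, correct_count],
                    f_exp=[n / 2, n / 2])
print(f"chi2(1, N={n}) = {chi2:.2f}, p = {p:.3g}")
```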
In the original study, which used the χ2 test of independence, 7 of the 40 images were classified as meeting criteria (a) and (b). When we reanalyzed the original data using a χ2 goodness-of-fit test, 1 of those 7 images (Waldo from Where’s Waldo) was no longer statistically significant. In our replication data, all 7 of these images (including Waldo) were statistically significant, as was 1 additional image (Tom from Tom & Jerry). Table 3 summarizes these findings.
Table 3. Reported, reproduced, and replicated results for criteria (a) and (b) for each of the images found to be VME-apparent
Image                   | Reported results*: χ2 test of independence (incorrect statistical test) on original data | Reproduced results: χ2 goodness-of-fit test (correct statistical test) on original data | Replicated results: χ2 goodness-of-fit test (correct statistical test) on replication data
C-3PO                   | χ2 (1, N=194) = 62.61, p = 2.519e-15 | χ2 (1, N=91) = 33.24, p = 8.138e-09 | χ2 (1, N=359) = 99.50, p = 1.960e-23
Curious George          | χ2 (1, N=194) = 45.62, p = 1.433e-11 | χ2 (1, N=93) = 23.75, p = 1.095e-06 | χ2 (1, N=384) = 70.04, p = 5.806e-17
Fruit of the Loom Logo  | χ2 (1, N=190) = 6.95, p = 0.008      | χ2 (1, N=82) = 3.95, p = 0.047      | χ2 (1, N=369) = 10.08, p = 0.001
Mr. Monopoly            | χ2 (1, N=196) = 20.08, p = 7.416e-06 | χ2 (1, N=83) = 11.58, p = 6.673e-04 | χ2 (1, N=378) = 4.67, p = 0.031
Pikachu                 | χ2 (1, N=194) = 12.46, p = 4.157e-04 | χ2 (1, N=76) = 7.58, p = 0.006      | χ2 (1, N=304) = 39.80, p = 2.810e-10
Tom (Tom & Jerry)       | χ2 (1, N=194) = 2.51, p = 0.113      | χ2 (1, N=89) = 1.36, p = 0.244      | χ2 (1, N=367) = 23.57, p = 1.207e-06
Volkswagen Logo         | χ2 (1, N=198) = 30.93, p = 2.676e-08 | χ2 (1, N=91) = 16.71, p = 4.345e-05 | χ2 (1, N=362) = 54.14, p = 1.864e-13
Waldo (Where’s Waldo?)  | χ2 (1, N=196) = 6.71, p = 0.010      | χ2 (1, N=86) = 3.77, p = 0.052      | χ2 (1, N=351) = 26.81, p = 2.249e-07
Note: Findings with p > .05 are statistically nonsignificant (Tom in the reported-results and reproduced-results columns, and Waldo in the reproduced-results column). *The only statistics the paper reported for the χ2 test were as follows: “Of the 40 image concepts, 39 showed independence (all χ2s ≥ 6.089; all ps < .014)” (Prasad & Bainbridge, 2022, p. 1974). We analyzed the original data using a χ2 test in various ways until we were able to reproduce the specific statistics reported in the paper. So, while the statistics shown in the “Reported results” column were not, in fact, reported in the paper, they are the results the test reported in the paper would have found. Note that the Ns reported in this column are more than double the actual values for N in the original dataset because of the way the original incorrect test reported in the paper inflated the N values as part of its calculation method.
Evaluating images on criterion (c)
To evaluate images on the VME-criterion of “(c) these incorrect responses have to be highly consistent across people,” the original study employed a split-half consistency analysis. After running simulations with this analysis, we concluded that the analytical technique employed in the original study does not contribute reliable information towards evaluating this criterion beyond what is already shown in the histogram of the data. You can see a detailed explanation of this in the Appendix.
Additionally, whether an image meets criterion (c) is, arguably, already assessed in the tests used to evaluate criteria (a) and (b). When discussing criterion (c), the authors state, “VME is also defined by its consistency; it is a shared specific false memory” (p. 1974). If an image already meets criterion (a) by having low identification accuracy and criterion (b) by having a specific incorrect version of the image be falsely recognized as the canonical version, that seems like evidence of a specific false memory that is consistent across people. This is because in order for some images in the study to meet both of those criteria, a large percentage of the participants would need to select the same incorrect response as each other for those images.
As such, we did not pre-register an analysis to assess criterion (c), and the split-half consistency analysis is not considered in our replication rating for this study.
Evaluating images on criteria (d) and (e)
To evaluate images on the VME-criteria of “(d) the image shows low accuracy even when it is rated as being familiar” and “(e) the responses on the image are given with high confidence even though they are incorrect,” the original study used a series of permutation tests to assess the relationships between accuracy (i.e., the proportion of people who chose the correct image), familiarity ratings, and confidence ratings.
Here’s how the permutation tests worked in the original study, using the permutation test assessing the correlation between confidence ratings and familiarity ratings as an example:
7 images were selected at random and dropped from the dataset (this number corresponds to the number of images identified as meeting criteria (a) & (b))
For the remaining 33 images, the average confidence rating and average familiarity rating of each image were correlated
Steps 1-2 were repeated for a total of 1,000 permutations
The specific 7 images that met criteria (a) and (b) were dropped from the dataset (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen Logo, Waldo)
The average confidence rating and average familiarity rating of each of the 33 remaining images were correlated for this specific permutation
The correlation calculated in Step 5 was compared to the 1,000 correlations calculated in Steps 1-3
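The sketch below illustrates this permutation logic in Python. It is not the original analysis code (which is not public); the DataFrame of image-level means, the column names, the use of Pearson correlations, and the "at least as high" rule for the p-value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def permutation_test(image_means: pd.DataFrame, vme_images, var_x: str, var_y: str,
                     n_perm: int = 1000, seed: int = 0) -> float:
    """Drop random sets of images (same size as the VME set), correlate the
    image-level means of var_x and var_y each time, and return the proportion
    of random drops whose correlation is at least as high as the correlation
    obtained when the actual VME images are dropped."""
    rng = np.random.default_rng(seed)
    all_images = image_means.index.to_numpy()
    k = len(vme_images)

    def corr_without(dropped):
        kept = image_means.drop(index=list(dropped))
        return kept[var_x].corr(kept[var_y])  # Pearson r across the remaining images

    null_corrs = np.array([
        corr_without(rng.choice(all_images, size=k, replace=False))
        for _ in range(n_perm)
    ])
    observed = corr_without(vme_images)
    return float(np.mean(null_corrs >= observed))

# image_means would be a 40-row DataFrame indexed by image name, with columns
# such as "familiarity", "confidence", and "accuracy" holding image-level averages.
```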
The original study used the same permutation test two more times to assess the correlation between average confidence ratings and accuracy and the correlation between average familiarity ratings and accuracy.
The original study found that the correlation between confidence and accuracy and the correlation between familiarity and accuracy were both higher when the 7 specific images that met criteria (a) and (b) were dropped. Additionally, in line with the authors’ predictions for VME images, the original study did not find evidence that the correlation between familiarity and confidence was different when the 7 specific images were dropped.
As noted earlier, when the correct analysis (χ2 goodness-of-fit test) is used to evaluate criteria (a) and (b) on the original data, there is no longer statistically significant evidence that Waldo meets criteria (a) and (b). As such, we re-ran these three permutation tests on the original data, but only dropped the 6 images that met criteria (a) and (b) when using the correct analysis (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen Logo). We found similar results to those obtained when the 7 images were dropped. See the Appendix for the specific findings from this re-analysis.
With our replication data, we conducted the same three permutation tests, with a few minor differences:
We ran 10,000 permutations (without replacement) rather than 1,000. The additional permutations give the test greater precision.
We dropped 8 images (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Tom, Volkswagen Logo, Waldo), which correspond to the images that met criteria (a) and (b) in our replication data.
We found the same pattern of results as that reported in the original study.
Table 4. Reported results and replicated results for criteria (d) and (e)
Permutation test                                | Reported results (1,000 permutations): dropping 7 images (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen Logo, Waldo (Where’s Waldo)) | Replicated results (10,000 permutations): dropping 8 images (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Tom (Tom & Jerry), Volkswagen Logo, Waldo (Where’s Waldo))
Correlation between confidence and familiarity  | p = 0.540 | p = 0.320
Correlation between familiarity and accuracy    | p = 0.044 | p = 0.003
Correlation between confidence and accuracy     | p = 0.001 | p = 0.000
Note: The distributions represent the number of permutations with the correlation value specified on the x-axis. The red line corresponds to the correlation when no images are dropped. The green line corresponds to the correlation when the specific images that met criteria (a) and (b) were dropped. In order to create the plots shown in the reported results column, we reproduced the permutation tests using the original data and then plotted the distribution of the 1,000 permutations the test generated. Because the analysis randomly creates permutations, the permutations we generated with the original data inevitably differed from those in the original paper. As such, the p-values we found when we reproduced this analysis, which correspond to the distributions shown in the reported results column, were slightly different from (but directionally consistent with) those reported in the original paper. The p-values shown in the table are the values reported in the paper. The p-values that correspond exactly to the figures shown in the reported results column are: p = 0.506 for confidence and familiarity; p = 0.040 for familiarity and accuracy; and p = 0.000 for confidence and accuracy.
Interpreting the Results
All of the primary findings from the original study that we attempted to replicate did indeed replicate.
Interestingly, even the reported finding that Waldo (from Where’s Waldo) showed evidence of a VME replicated, despite the fact that this claim was based on an incorrect analysis in the original paper. It is worth noting that, even though there is not statistically significant evidence that Waldo shows a VME when the correct analysis is performed on the original data, the raw proportions of which versions of Waldo were chosen are directionally consistent with a VME. In other words, even in the original data, more people chose a specific, incorrect version of the image than chose the correct version (but not enough for it to be statistically significant). This, coupled with the fact that we find a statistically significant result for Waldo in the replication data, suggests that the original study did not have enough statistical power to detect this effect.
A similar thing likely happened with Tom (Tom & Jerry). There was not statistically significant evidence that Tom showed a VME in the original data. Nevertheless, even in the original data, more people chose a specific, incorrect version of Tom than chose the correct version. In our replication data, we found statistically significant evidence that Tom showed a VME.
So, even though Waldo and Tom were not statistically significant when using the correct analysis on the original data but were statistically significant in our replication data, we do not view this as a major discrepancy between the findings of the two studies.
We would also like to note one important limitation of the permutation tests. The way these tests were conducted in the original paper, the correlations between confidence, familiarity, and accuracy were conducted on the average values of confidence, familiarity, and accuracy for each image. Averaging at the image level can obscure important individual-level patterns. Thus, we argue that a better version of this analysis would be to correlate these variables across each individual data point, rather than across the average values for each image. That said, when we ran the individual-level version of this analysis on both the original data and our replication data, we found that the results were all directionally consistent with the results of this test conducted on the image-level averages. See the Appendix for a more thorough explanation of the limitation of using image-level averages and to see the results when using an individual-level analytical approach.
Finally, it’s worth noting that the original paper reports one more analysis that we have not discussed yet in this report. The original study reports a Wilcoxon Rank Sum test to assess whether there was a difference in the number of times participants had seen the images that met the VME criteria versus the images that did not meet the VME criteria. The original paper reports a null result (z = 0.64, p = 0.523). We were unable to reproduce this result using the original data. We ran this test in seven different ways, including trying both Wilcoxon Rank Sum tests (which assume independent samples) and Wilcoxon Signed Rank tests (which assume paired samples) and running the test without aggregating the data and with aggregating the data in various ways. (See the Appendix for the full description and results for these analyses.) It is possible that none of these seven ways of running the test matched how the test was run in the original study. Without access to the original analysis code, we cannot be sure why we get different results. However, because this test was not critical to any of the five VME criteria, we did not pre-register and run this analysis for our replication study. Moreover, our inability to reproduce the result did not influence the study’s replicability rating.
Conclusion
The original paper specifies five criteria (a-e) that images should meet in order to show evidence of a Visual Mandela Effect (VME). Based on these five criteria, the original paper reports that 7 out of the 40 images they tested show a VME.
When we attempted to reproduce the paper’s results using the original data, we noticed an error in the analysis used to assess criteria (a) and (b). When we corrected this error, only 6 of the 40 images met the VME criteria. Additionally, we argued that the analysis for criterion (c) was misinterpreted, and should not serve as evidence for criterion (c). However, we also argued that criterion (c) was sufficiently tested by the analyses used to test criteria (a) and (b), and thus did not require its own analysis.
As such, with our replication data, we ran similar analyses to those run in the original paper to test criteria (a), (b), (d), and (e), with the error in the criteria (a) and (b) analysis fixed. In our replication data, we found that 8 images, including the 7 claimed in the original paper, show a VME. Thus, despite the analysis errors we uncovered in the original study, we successfully replicated the primary findings from Study 1 of Prasad & Bainbridge (2022).
The study received a replicability rating of 5 stars, a transparency rating of 3.5 stars, and a clarity rating of 2.5 stars.
Acknowledgements
We want to thank the authors of the original paper for making their data and materials publicly available, and for their quick and helpful correspondence throughout the replication process. Any errors or issues that may remain in this replication effort are the responsibility of the Transparent Replications team.
We also owe a big thank you to our 393 research participants who made this study possible.
Finally, we are extremely grateful to the rest of the Transparent Replications team, as well as Mika Asaba and Eric Huff, for their advice and guidance throughout the project.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional Information about the Methods
Error in analysis for criteria (a) and (b): χ2 test of independence
The original study tested whether one version of an image was chosen more commonly than the other versions by using a χ2 test of independence.
In principle, using a χ2 test of independence in this study violates one of the core assumptions of the χ2 test of independence—that the observations are independent. Because participants could only choose one option for each image concept, whether a participant chose the Manipulation 1 image was necessarily dependent on whether they chose the correct image. A χ2 goodness-of-fit test is the appropriate test to run when observations are not independent.
Moreover, in order to run a χ2 test of independence on this data, the authors appear to have restructured their data in a way that led to an incorrect inflation of the data, which in turn inflated the χ2 values. We will use the responses for the Apple Logo image as an example to explain why the results reported in the paper were incorrect.
In the original data, among the 100 participants who evaluated the Apple Logo, 80 participants chose the correct image, 17 chose the Manipulation 1 image, and 3 chose the Manipulation 2 image. The goal of the χ2 analysis was to assess whether participants chose the Manipulation 1 image at a higher rate than the correct image. So, one way to do this analysis correctly would be to compare the proportion of participants who chose the correct image (80 out of 97) and the proportion of participants who chose the Manipulation 1 image (17 out of 97) to the proportions expected by chance (48.5 out of 97). The contingency table for this analysis should look like:
Response        | Number of participants
Correct         | 80
Manipulation 1  | 17
However, because the MatLab function that was used in the paper to conduct the χ2 test of independence required data input for two independent variables, the contingency table that their analysis relied on looked like this:
Response        | Number of participants who provided this response | Number of participants who did not provide this response
Correct         | 80 | 20
Manipulation 1  | 17 | 83
In other words, the original study seems to have added another column that represented the total sample size minus the number of participants who selected a particular option. When structured this way, the χ2 test of independence treats this data as if it were coming from two different variables: one variable that could take the values of “Correct” or “Manipulation 1”; another variable that could take the values of “Number of participants who provided this response” or “Number of participants who did not provide this response.” However, these are, in reality, the same variable: the image choice participants made. This is problematic because the test treats all of these cells as representing distinct groups of participants. However, the 80 participants in column 2, row 2 are in fact 80 of the 83 participants in column 3, row 3. In essence, much of the data is counted twice, which then inflates the χ2 test statistics. As mentioned earlier, the correct test to run in this situation is the χ2 goodness-of-fit test, which examines observations on one variable by comparing them to a distribution and determining if they can be statistically distinguished from the expected distribution. In this case, the test determines if the responses are different from a random chance selection between the correct and manipulation 1 responses.
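The following Python sketch illustrates the inflation numerically using the Apple Logo counts above. The exact options used in the original MatLab call (for example, whether a continuity correction was applied) are not known, so the numbers are illustrative rather than an exact reproduction of Table 3.

```python
from scipy.stats import chi2_contingency, chisquare

correct, manip_1, total = 80, 17, 100   # Apple Logo responses described above

# Doubled-up table implied by the independence-test approach: each response
# category is paired with a "did not provide this response" count, so the
# same participants end up counted in more than one cell.
inflated_table = [[correct, total - correct],   # [80, 20]
                  [manip_1, total - manip_1]]   # [17, 83]
chi2_indep, p_indep, _, _ = chi2_contingency(inflated_table, correction=False)

# Goodness-of-fit test: each participant is counted once, and the observed
# split between the two versions is compared to a 50/50 chance split.
n = correct + manip_1
chi2_gof, p_gof = chisquare([manip_1, correct], f_exp=[n / 2, n / 2])

print(f"Independence test on the doubled-up table: chi2 = {chi2_indep:.2f}")
print(f"Goodness-of-fit test on the actual counts: chi2 = {chi2_gof:.2f}")
```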
Misinterpretation of analysis for criterion (c): Split-half consistency analysis
The original study said that a split-half consistency analysis was used “to determine whether people were consistent in the image choices they made” (p. 1974-1975). Here’s how the original paper described the analysis and the results:
Participants were randomly split into two halves; for each half, the proportion of correct responses and Manipulation 1 responses was calculated for each image. We then calculated the Spearman rank correlation for each response type between the participant halves, across 10,000 random shuffles of participant halves. The mean Spearman rank correlation across the iterations for the proportion of correct responses was 0.92 (p < .0001; Fig. 3b). The mean correlation across the iterations for the proportion of Manipulation 1 responses was 0.88 (p < .0001; Fig. 3b). This suggests that people are highly consistent in what images they respond correctly and incorrectly to. In other words, just as people correctly remember the same images as each other, they also have false memories of the same images as each other.
(Prasad & Bainbridge, 2022, p. 1975)
The paper also presents the results visually using the figures below:
The intention the paper expressed for this analysis was to assess the consistency of responses across participants, but the analysis that was conducted does not seem to us to provide any reliable information about that consistency beyond what is already presented in the basic histogram of the entire sample of results (Figure 2 in our report; Figure 3a in the original paper). The split-half analysis seems to us to be both unnecessary and not reliably diagnostic.
In order to understand why, it may help to ask what the data would look like if respondents were not consistent with each other on which images they got correct and incorrect.
Imagine that you have two datasets of 100 people each, and in each dataset all participants got 7 out of 40 answers incorrect. This could happen in two very different ways at the extreme. In one version, each person answered 7 random questions incorrectly out of the 40 questions. In another version there were 7 questions that everyone answered wrong and 33 questions that everyone answered correctly. It seems like the paper is attempting to use this test to show that the VME dataset is more like the second version in this example where people were getting the same 7 questions wrong, rather than everyone getting a random set of questions wrong. The point is that a generalized low level of accuracy across a set of images isn’t enough. People need to be getting the same specific images wrong in the same specific way by choosing one specific wrong answer.
This is a reasonable conceptual point about what it takes for an image to be a VME image, but the split-half analysis is not necessary to provide that evidence, because the way it’s constructed means that it doesn’t add information beyond what is already contained in the histogram.
Going back to the example above illustrates this point. Here’s what the histogram would look like if everyone answered 7 questions wrong, but those questions weren’t the same as the questions that other people answered wrong:
In the above case the questions themselves do not explain anything about the pattern of the results, since each question generates exactly the same performance. You could also get this pattern of results if 18 people answered every question wrong, and 82 people answered all of them correctly. In that case as well though, the results are driven by the characteristics of the people, not characteristics of the questions.
In the other extreme where everyone answered the exact same questions wrong, the histogram would look like this:
In this case you don’t need to know anything about the participants, because the entirety of the results is explained by characteristics of the questions.
This extreme example illustrates the point that real data on a question like this is driven by two factors – characteristics of the questions and characteristics of the participants. When the number of correct responses for some questions differs substantially from the number of correct responses for other questions we can infer that there is something about the questions themselves that is driving at least some of that difference.
This point, that people perform systematically differently on some of these image sets than others, seems to be what the paper is focusing on when it talks about the performance of participants being consistent with each other across images. And looking at the histogram from the original study we can see that there is a lot of variation from image to image in how many people answered each image correctly:
If we sort that histogram we can more easily see how this shows a pattern of responses where people were getting the same questions correct and the same questions wrong as other people:
From the results in this histogram alone we can see that people are answering the same questions right and wrong as each other. If that wasn’t the case the bars would be much more similar in height across the graph than they are. This is enough to demonstrate that this dataset meets criterion c.
The split-half consistency analysis is an attempt to demonstrate this in a way that generates a statistical result, rather than looking at it by eye, but because of how the analysis is done it doesn’t offer reliably diagnostic answers.
What is the split-half analysis doing?
What the split-half consistency analysis is doing is essentially creating a sorted histogram like the one above for each half of the dataset separately and then comparing the two halves to see how similar the ordering of the images is between them using Spearman’s Rank Correlation. This procedure is done 10,000 times, and the average of the 10,000 values for Spearman’s Rank Correlation is the reported result.
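As an illustration of this behavior, here is a minimal simulation sketch in Python. The per-image accuracy rates are invented for the example, and the procedure is our reconstruction of the split-half analysis described in the paper, not the original code.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_participants, n_images, n_shuffles = 100, 40, 10_000

# Invented per-image probabilities of a correct response: most images are easy,
# a handful are answered mostly incorrectly (roughly mimicking the VME pattern).
p_correct = np.concatenate([rng.uniform(0.6, 0.95, size=33),
                            rng.uniform(0.1, 0.35, size=7)])

# Simulate each participant's correct/incorrect response to each image.
responses = rng.random((n_participants, n_images)) < p_correct

split_half_corrs = []
for _ in range(n_shuffles):
    order = rng.permutation(n_participants)
    half_1, half_2 = order[:n_participants // 2], order[n_participants // 2:]
    acc_1 = responses[half_1].mean(axis=0)  # proportion correct per image, half 1
    acc_2 = responses[half_2].mean(axis=0)  # proportion correct per image, half 2
    split_half_corrs.append(spearmanr(acc_1, acc_2)[0])

# With per-image accuracy rates this spread out and 50 people per half, the mean
# split-half Spearman correlation comes out very high, regardless of whether any
# image meets the VME criteria.
print(f"Mean split-half Spearman correlation: {np.mean(split_half_corrs):.2f}")
```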
This procedure essentially takes random draws from the same sample distribution and compares them to each other to see if they match. A major problem with this approach is that, as long as the sample size is reasonably large, this procedure will almost always result in split-halves that are quite similar to each other. This is because if the halves are randomly drawn from the whole sample, at reasonably large sample sizes, the results from the halves will be similar to the results from the sample as a whole, and thus they will be similar to each other as well. Since the split-halves approximate the dataset as a whole, the split-half procedure isn’t contributing information beyond what is already present in the histogram of the dataset as a whole. This is the same principle that allows us to draw a random sample from a population and confidently infer things about the population the random sample was drawn from.
In this case, since the two halves of 50 are each drawn and compared 10,000 times, it shouldn’t be at all surprising that on average comparing the results for each image in each half of the sample generates extremely similar results. The halves are drawn from the same larger sample of responses, and by drawing the halves 10,000 times and taking the average, the effects of any individual random draw happening to be disproportionate are minimized.
If the sample was too small, then we wouldn’t expect the two halves of the sample to reliably look similar to each other or similar to the sample as a whole because there would not be enough data points for the noise in the data to be reliably canceled out.
With a large enough sample size the correlation between the two halves will be extremely strong, even if there is barely any difference in the proportions of responses for each image set, because the strength of that correlation is based on the consistency of the ordering, not on any measure of the size of the differences in accuracy between the images. As the noise is reduced by increasing the sample size, the likelihood of the ordering remaining consistent between the two halves, even at very small levels of difference between the images, increases.
The strength of the correlation coming from the consistency of the ordering is due to the way that Spearman’s Rank Correlation works. Spearman’s Rank Correlation is made to deal with ordinal data, meaning data where the sizes of the differences between the values isn’t meaningful information, only the order of the values is meaningful. It accomplishes this by rank ordering two lists of data based on another variable, and then checking to see how consistent the order of the items is between the two lists. The size of the difference between the items doesn’t factor into the strength of the correlation, only the consistency of the order. In the case of this split-half analysis the rank ordering was made by lining up the images from highest proportion correct to lowest proportion correct for each half of the respondents, and then comparing those rankings between the two halves.
Split-half analysis is not diagnostic for the hypothesis it is being used to test
Because increasing the sample size drives the split-half analysis towards always having high correlations, a high correlation is not a meaningful result for showing that the pattern of results obtained is being driven by important differences between the image sets. With a large sample size the rank ordering can remain consistent even if the differences in accuracy between the images are extremely small.
In addition to the test potentially generating extremely high correlations for results that don’t include anything that meaningfully points to VME, the test could also generate much weaker correlations in the presence of a strong VME effect under some conditions. To think about how this could happen, imagine that the histogram of the data looks like this:
If the data looked like this we could interpret it as meaning that there are 2 groups of images – regular images that most people know, and VME images where people consistently choose a specific wrong answer.
At modest sample sizes, noise would make the images within each group difficult to reliably rank order relative to each other when you split the data in half. That would result in a lower Spearman’s Rank Correlation for data fitting this pattern compared to the real data, even though this data doesn’t present weaker evidence of VME than the real data. The mean Spearman’s Rank Correlation for split-half analysis on this simulated dataset run 10,001 times is 0.439, which is less than half of the 0.92 correlation reported on the real data.
The evidence that people respond in ways that are consistent with other people in the ways that are actually relevant to the hypothesis is no weaker in this simulated data than it is in the real data. The evidence that there are properties of the images that are resulting in different response patterns is just as strong here as it is in the actual data. Despite this, the split-half consistency test would (when the sample size isn’t too large) give a result that was substantially weaker than the result on the actual data.
These features of this split-half consistency analysis make it non-diagnostic for the criterion that the paper used it to attempt to examine. The results it gives do not offer information beyond what is already present in the histogram, and the results also do not reliably correspond with the level of confidence we should have about whether criterion c is being met.
It is important to note though that this split-half analysis is also not necessary to establish that the data the paper reports meets the criteria for showing a VME in certain images. The histogram of results, chi-squared goodness of fit tests, and permutation tests establish that these images meet the paper’s 5 criteria for demonstrating a Visual Mandela Effect.
Unclear labeling of the Split-Half Figures
In the process of reproducing the analysis and running simulations of this analysis, we also realized that the graphs presenting the split-half figures in the paper are likely mislabeled. The image below is a graph of the proportion of correct responses in the original data split in half with a Spearman’s Rank Correlation of 0.92 between the halves:
In this figure the images are ordered based on their ranking in the first half of the data, and then the second half of the data is compared to see how well the rankings match. This is a reasonable visual representation of what Spearman’s Rank Correlation is doing.
The figure above looks quite different from the figure presented in the paper, with much more noise. It seems likely the X axis on the figure in the original paper doesn’t represent images numbered consistently between the two halves (meaning Image 1 refers to the same image in both half one and half two), but rather represents the ranks from each half, meaning that the top ranked image from each half is first, then the second, then the third, which is not the same image in each half. The figure below shows the same data plotted that way:
We did not draw the exact same set of 10,000 split-halves that was drawn in the original analysis, so this figure is not exactly the same as the figure in the original paper, but the pattern it shows is very similar to the original figure. This figure doesn’t seem to be as useful a representation of what is being done in the Spearman’s Rank Correlation calculation because the ranks of the images between the two halves cannot be compared in this figure.
This may seem like a minor point, but we consider it worth noting in the appendix because a reader looking at the figure in the paper will most likely come away with the wrong impression of what the figure is showing. The figure in the original paper is labeled “Images” rather than being labeled “Rankings,” which would lead a reader to believe that it shows the images following the same ordering in both halves, when that is not the case.
Additional Information about the Results
Conducting permutation tests for criteria (d) and (e) with 6 images vs 7 images
As mentioned in the body of the report and detailed in the “Error in analysis for criteria (a) and (b): χ2 test of independence” section in the Appendix, the original paper conducted the χ2 test incorrectly. When the correct χ2 test is conducted on the original data, only 6 of the 7 images reported to show a VME remain statistically significant (Waldo from Where’s Waldo is no longer statistically significant). As such, we ran the permutation tests used to assess criteria (d) and (e) with these 6 images to ensure that the permutation test results reported in the original study held when using only images that show statistically significant evidence of a VME.
We used the original data and followed the same procedures detailed in the “Study and Results in Detail” section. The only difference is that, when running the permutation tests, we dropped 6 images (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen Logo) instead of 7 images (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen Logo, Waldo).
Here are the results:
Table 5. Results for criteria (d) and (e) in the original data when dropping 7 images versus 6 images
Permutation test                                | Reported results (1,000 permutations): dropping 7 images (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen Logo, Waldo (Where’s Waldo)) | Reproduced results (1,000 permutations): dropping 6 images (C-3PO, Fruit of the Loom Logo, Curious George, Mr. Monopoly, Pikachu, Volkswagen Logo)
Correlation between confidence and familiarity  | p = 0.540 | p = 0.539
Correlation between familiarity and accuracy    | p = 0.044 | p = 0.051
Correlation between confidence and accuracy     | p = 0.001 | p = 0.002
Note: The distributions represent the number of permutations with the correlation value specified on the x-axis. The red line corresponds to the correlation when no images are dropped. The green line corresponds to the correlation when the specific images that met criteria (a) and (b) were dropped. In order to create the plots shown in the reported results column, we reproduced the permutation tests using the original data and then plotted the distribution of the 1,000 permutations the test generated. Because the analysis randomly creates permutations, the permutations we generated with the original data inevitably differed from those in the original paper. As such, the p-values we found when we reproduced this analysis, which correspond to the distributions shown in the reported results column, were slightly different from (but directionally consistent with) those reported in the original paper. The p-values we found are: p = 0.506 for confidence and familiarity; p = 0.040 for familiarity and accuracy; and p = 0.000 for confidence and accuracy.
Overall, the results are extremely similar when the 7 VME images identified in the paper are dropped versus when the 6 VME images identified in our reproduced analyses are dropped. The one notable difference is that the familiarity-accuracy permutation test goes from statistically significant to non-statistically significant. However, the p-values are quite similar: p = 0.044 and p = 0.051. In other words, the familiarity-accuracy permutation test goes from having a borderline significant p-value to a borderline non-significant p-value. We don’t consider this to be a particularly meaningful difference, especially since our replication found a strong, significant result for the familiarity-accuracy permutation test (p = 0.003).
Another way of thinking about the difference between p = 0.044 and p = 0.051 is to understand how the p-value is calculated for these permutation tests. The p-value for these tests was equal to the proportion of permutations that had a higher correlation than the correlation for the specific permutation in which the VME images were dropped. So, since 1,000 permutations were run, a p-value of 0.044 means that 44 of the 1,000 random permutations had higher correlations than the correlation when all of the VME images were dropped. A p-value of 0.051 means that 51 of the 1,000 random permutations had higher correlations. Thus, the difference between p = 0.044 and p = 0.051 is a difference of 7 more random permutations having a higher correlation than the correlation when all of the VME images are dropped.
Understanding how the p-value is calculated also explains why running more permutations gives the test more precision. Running more permutations affords the test a larger sample of all the possible permutations to compare against the specific permutation of interest—which, in this case, is when all of the VME images are dropped. This is why we pre-registered and ran 10,000 permutations on our replication data, rather than the 1,000 that were run in the original study.
Correlations between accuracy, confidence, and familiarity
As discussed earlier in this report, two of the criteria the original paper used to evaluate whether images showed evidence of a VME were:
(d) the image shows low accuracy even when it is rated as being familiar
(e) the responses on the image are given with high confidence even though they are incorrect
(Prasad & Bainbridge, 2022, p. 1974)
To test criterion (d), the original paper used a permutation test to assess whether the correlation between accuracy and familiarity was higher when the 7 images that met criteria (a) and (b) were excluded compared to when other random sets of 7 images were excluded. Similarly, to test criterion (e), the original paper used a permutation test to assess whether the correlation between accuracy and confidence was higher when the 7 images that met criteria (a) and (b) were excluded compared to when other random sets of 7 images were excluded.
In order to calculate the correlations between accuracy and familiarity and between accuracy and confidence, the original paper first calculated the average familiarity rating, confidence rating, and accuracy for each of the 40 images. The correlation of interest was then calculated using these average ratings for the 40 images. In other words, each correlation tested 40 data points. We will refer to this as the image-level approach.
Another way of running this correlation would be to use each rating from each participant as a single data point in the correlation. For every image, participants made a correct or incorrect choice, and they rated their confidence in the choice and their familiarity with the image. Thus, the correlation could have been run using each of these sets of ratings. 100 participants completed the original study, rating 40 images each, which means each correlation would have tested close to 4,000 data points (it wouldn’t have been exactly 4,000 data points because a few participants did not rate all 40 images.)
While the image-level approach is not necessarily incorrect, we argue that it sacrifices granularity in a way that could, in principle, be misleading. Here’s an extreme example to demonstrate this:
Imagine you run the VME study twice (Study A and Study B), and in each study, you only have 2 participants (participants 1-4). For the Mr. Monopoly image in Study A, participant 1 chooses the incorrect image (accuracy = 0) and gives the lowest possible confidence rating (confidence = 1). Meanwhile, participant 2 in Study A chooses the correct image (accuracy = 1) for Mr. Monopoly and gives the highest possible confidence rating (confidence = 5). If you take the image-level average for each of these variables, you will have an accuracy rating of 0.5 and a confidence rating of 3 for Mr. Monopoly in Study A. Now, in Study B, participant 3 chooses the incorrect image (accuracy = 0) for Mr. Monopoly, but gives the highest possible confidence rating (confidence = 5), and participant 4 chooses the correct image (accuracy = 1) for Mr. Monopoly, but gives the lowest possible confidence rating (confidence = 1). If you take the image-level average for each of these variables, you will have the exact same scores for Mr. Monopoly as you did in Study A: an accuracy rating of 0.5 and a confidence rating of 3. However, these two studies have the exact opposite pattern of results (see Table 6). Averaging at the image level before correlating these ratings across the 40 images means that such differences are impossible to detect in the correlation. However, if each individual set of ratings is included in the correlation, the analysis can account for these differences. Although it’s unlikely, it is possible that the image-level approach could give the same correlation to two different datasets that would have correlations in opposite directions if the individual-level approach was used.
Table 6. Hypothetical scores to demonstrate that averaging at the image-level before running a correlation could mask important individual-level differences
Hypothetical Study A
Participant    | Accuracy | Confidence
1              | 0        | 1
2              | 1        | 5
Average score  | 0.5      | 3

Hypothetical Study B
Participant    | Accuracy | Confidence
3              | 0        | 5
4              | 1        | 1
Average score  | 0.5      | 3
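To see the point numerically, the short Python check below uses the hypothetical values from Table 6 and shows that the two studies produce identical image-level averages but perfectly opposite individual-level relationships.

```python
import numpy as np

# Hypothetical Study A: the incorrect choice comes with low confidence.
acc_a, conf_a = np.array([0, 1]), np.array([1, 5])
# Hypothetical Study B: the incorrect choice comes with high confidence.
acc_b, conf_b = np.array([0, 1]), np.array([5, 1])

print(acc_a.mean(), conf_a.mean())        # 0.5 3.0 -> identical image-level averages
print(acc_b.mean(), conf_b.mean())        # 0.5 3.0
print(np.corrcoef(acc_a, conf_a)[0, 1])   #  1.0 -> opposite individual-level patterns
print(np.corrcoef(acc_b, conf_b)[0, 1])   # -1.0
```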
Given this limitation of the image-level approach, we decided to re-run the correlations and permutation tests using the individual-level approach. We did so for both the original data and our replication data. To account for the repeated-measures nature of the individual-level data (each participant provided multiple responses), we ran repeated-measures correlations (Bakdash & Marusich, 2017) rather than Pearson correlations. You can see the results from these analyses in Table 7.
One important thing to note is that the data for the original paper was structured such that it was not possible to know, with 100% certainty, which ratings were from the same participants. The original data was formatted as 4 separate .csv files—one file for each measure (image choice, confidence rating, familiarity rating, times-seen rating)—with no participant-ID variable. In order to conduct this analysis, we had to assume that participants’ data were included in the same row in each of these files. For example, we assumed the data in row 1 in each of the four files came from the same participant. This was a big limitation of the format in which the data was shared. However, the pattern of differences between the image-level results and the individual-level results is very similar in the original data and the replication data. This suggests that we were correct to assume that each row in the original data files came from the same participant.
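Below is a minimal sketch of how such a repeated-measures correlation can be computed. It uses the pingouin package's rm_corr function (one implementation of the Bakdash & Marusich method) on randomly generated placeholder data, since the real long-format data and its column names are not reproduced here.

```python
import numpy as np
import pandas as pd
import pingouin as pg  # pingouin.rm_corr implements the Bakdash & Marusich (2017) method

rng = np.random.default_rng(0)
n_participants, n_images = 100, 40

# Placeholder long-format data: one row per participant x image. In the real
# analysis, participant IDs had to be reconstructed by assuming that the same
# row in each of the four original .csv files belonged to the same person.
long_df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), n_images),
    "confidence": rng.integers(1, 6, size=n_participants * n_images),
    "accuracy": rng.integers(0, 2, size=n_participants * n_images),
})

# Repeated-measures correlation between confidence and accuracy, accounting for
# each participant contributing 40 non-independent rows.
result = pg.rm_corr(data=long_df, x="confidence", y="accuracy", subject="participant")
print(result[["r", "dof", "pval"]])
```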
Table 7. Comparison of correlations for the image-level approach versus the individual-level approach
Fortunately, the differences between the image-level results and the individual-level results were either minimal or directionally consistent. For example, the difference between the confidence-accuracy correlation in the original data when calculated with these two methods is fairly large: r = 0.59 vs r = 0.25. However, these correlations are directionally consistent (both positive), and the results for the confidence-accuracy permutation tests in the original data are very similar for the two methods: p = 0.001 and p = 0.007.
Table 8. Comparison of permutation test results for the image-level approach versus the individual-level approach
There was also no significant difference between the number of times that participants had seen VME-apparent images and the number of times they had seen the images that were correctly identified (Wilcoxon rank sum; z = 0.64, p = .523), supporting the idea that there is no difference in prior exposure between VME-apparent images that induce false memory and images that do not.
(Prasad & Bainbridge, 2022, p. 1977)
We attempted to reproduce this analysis using the original data, but were unable to find the same results. Below, we describe the seven different ways we tried running this test.
Two pieces of context are important for understanding the different ways we ran this analysis. First, in the original data file, the responses on the Times Seen measure have values of 0, 1, 11, 51, 101, or 1000 (the response options participants saw for this measure were 0; 1-10; 11-50; 51-100; 101-1000; 1000+). Second, a Wilcoxon Rank Sum test technically assumes independent samples, while a Wilcoxon Signed Rank test assumes paired samples (e.g., repeated measures). As shown in the quote above, the paper reports that a Wilcoxon Rank Sum test was run.
1. Individual level (Wilcoxon Rank Sum)
We ran a Wilcoxon Rank Sum test on the individual level data (i.e., the data was not aggregated in any way) comparing the Times Seen ratings on VME images versus non-VME images. The results were: W = 1266471, p = 2.421e-07.
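As an illustration (the original analysis used the ranksum function in MATLAB; this sketch uses R and hypothetical column names):

```r
# Sketch: `d` is long-format data with one Times Seen response per
# participant-image pair and a two-level grouping column `image_type`
# ("VME" vs. "non-VME").
wilcox.test(times_seen ~ image_type, data = d)  # Wilcoxon Rank Sum (unpaired)
```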
2. Image level – recoded (Wilcoxon Rank Sum)
Another way to analyze this data, consistent with the other analyses the paper reported, is to first calculate the average rating for each image, and then run the test with these image-level values. One potential hiccup in this case is that the values for the Times Seen measure in the original data files were 0, 1, 11, 51, 101, or 1000. It seems problematic to simply average these numbers since the actual options participants chose from were ranges (e.g., 101-1000). If a participant selected 101-1000, they could have seen the image 150 times or 950 times. Treating this response as always having a value of 101 seems incorrect. So, we reasoned that perhaps the authors had first recoded these values to simply be 0, 1, 2, 3, 4, and 5 rather than 0, 1, 11, 51, 101, and 1000.
Thus, we first recoded the variable to have values of 0, 1, 2, 3, 4, and 5 rather than 0, 1, 11, 51, 101, and 1000. We then calculated the average rating for each image. We then ran a Wilcoxon Rank Sum test with these image-level values comparing the Times Seen ratings on VME images versus non-VME images. The results were: W = 163.5, p = 0.091.
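A sketch of this recode-then-aggregate approach, again with hypothetical column names (the recoding scheme is the one described above):

```r
# Recode the stored values 0, 1, 11, 51, 101, 1000 to ordinal categories 0-5
d$times_seen_recoded <- match(d$times_seen, c(0, 1, 11, 51, 101, 1000)) - 1

# Average the recoded ratings for each image, then compare VME vs. non-VME images
img <- aggregate(times_seen_recoded ~ image + image_type, data = d, FUN = mean)
wilcox.test(times_seen_recoded ~ image_type, data = img)  # image-level Rank Sum
```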
3. Image level – not recoded (Wilcoxon Rank Sum)
It also seemed plausible that the authors had not recoded these values before calculating the average rating for each image. So, we also tried this approach by calculating the average rating for each image (without recoding the values), and then running a Wilcoxon Rank Sum test comparing the Times Seen ratings on VME images versus non-VME images. The results were: W = 164, p = 0.088.
Another way of aggregating the data is to calculate an average value of the Times Seen measure for the VME images and the non-VME images within each participant’s data. In other words, each participant would have an average Times Seen rating for the 7 VME images and an average Times Seen rating for the 33 non-VME images. As with aggregating at the image-level, this raises the question of whether the data should be recoded first.
4. Within-individual aggregation – recoded (Wilcoxon Rank Sum)
In this analysis, we first recoded the variable to have values of 0, 1, 2, 3, 4, and 5 rather than 0, 1, 11, 51, 101, and 1000. We then calculated the average Times Seen rating for VME-images for each participant and the average Times Seen rating for non-VME-images for each participant. We then ran a Wilcoxon Rank Sum test with these within-individual aggregated values comparing the Times Seen ratings on VME images versus non-VME images. The results were: W = 5854.5, p = 0.037.
5. Within-individual aggregation – recoded (Wilcoxon Signed Rank)
Because of the structure of the within-individual aggregated data described in test #4, it was also possible to run a Wilcoxon Signed Rank test rather than a Wilcoxon Rank Sum test. We reasoned that it was possible that the original paper used a Wilcoxon Signed Rank test, but it was mislabeled as a Wilcoxon Rank Sum test.
In this analysis, we followed the same steps as in test #4, but we ran a Wilcoxon Signed Rank test rather than a Wilcoxon Rank Sum test—in other words, we treated the data as paired samples, rather than independent samples. (In the case of this study, treating the data as paired samples is actually correct since participants rated both VME images and non-VME images.) The results were: V = 4002.5, p = 3.805e-07.
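A sketch of the within-individual aggregation used in tests #4 and #5, continuing from the recoded variable above (hypothetical column names):

```r
# Average each participant's recoded Times Seen ratings separately for
# VME and non-VME images
per_p <- aggregate(times_seen_recoded ~ participant + image_type, data = d, FUN = mean)
per_p <- per_p[order(per_p$image_type, per_p$participant), ]  # align rows by participant

vme_means     <- per_p$times_seen_recoded[per_p$image_type == "VME"]
non_vme_means <- per_p$times_seen_recoded[per_p$image_type == "non-VME"]

wilcox.test(vme_means, non_vme_means)                 # test #4: Rank Sum (unpaired)
wilcox.test(vme_means, non_vme_means, paired = TRUE)  # test #5: Signed Rank (paired)
```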
6. Within-individual aggregation – not recoded (Wilcoxon Rank Sum)
We also attempted test #4 without recoding the values. The results were: W = 6078, p = 0.008.
7. Within-individual aggregation – not recoded (Wilcoxon Signed Rank)
We also attempted test #5 without recoding the values. The results were: V = 4030, p = 2.304e-07.
As you can see by comparing the p-values, we were not able to reproduce the specific results reported in the paper using the original data. The original paper found a null result on this test. Two versions of our analysis also found null results (although with much smaller p-values than what was reported in the paper). These two versions used image-level averages of the Times Seen rating. If image-level averages were used for conducting this test, that would have the same flaw as the permutation test analyses: averaging at the image level before conducting these analyses sacrifices granularity in a way that could, in principle, be misleading (see the “Correlations between accuracy, confidence, and familiarity” section in the Appendix for more information).
We tried running the test in several ways in an attempt to reproduce the original result. Given that we were unable to reproduce that result, it seems likely that none of the seven ways we attempted to run the test matched how the test was run in the original study. The authors reported that they used the ranksum function in MATLAB to run the test, but we were unable to determine how the data were structured as input to this function. Without access to the original analysis code or a more detailed description in the paper, we cannot be sure why we were unable to reproduce the original results.
References
Bakdash, J. Z. & Marusich, L. R. (2017). Repeated measures correlation. Frontiers in Psychology, 8, 456. https://doi.org/10.3389/fpsyg.2017.00456
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.
Prasad, D., & Bainbridge, W. A. (2022). The Visual Mandela Effect as Evidence for Shared and Specific False Memories Across People. Psychological Science, 33(12), 1971–1988. https://doi.org/10.1177/09567976221108944
We ran a replication of study 4a from this paper, which found that people underestimate how much their acquaintances would appreciate it if they reached out to them. This finding was replicated in our study.
The study asked participants to think of an acquaintance with whom they have pleasant interactions, and then randomized the participants into two conditions – Initiator and Responder. In the Initiator condition, participants answered questions about how much they believe their acquaintance would appreciate it if the participant reached out to them. In the Responder condition, participants answered questions about how much they would appreciate it if their acquaintance reached out to them. Participants in the Responder condition reported that they would feel a higher level of appreciation if they were reached out to than participants in the Initiator condition expected their acquaintance would feel if the participant reached out to them. Appreciation was measured as the average of four questions asking how much the reach-out would be appreciated by the recipient, and how thankful, grateful, and pleased it would make the recipient feel.
The original study received a high transparency rating because it followed a pre-registered plan, and the methods, data, and analysis code were publicly available. The original study reported one main finding, and that finding replicated in our study. The study’s clarity could have been improved if the paper had not stated that the reaching out in the study was through a “brief message,” because in the actual study, the nature of the outreach was not specified. Apart from that relatively minor issue, the study’s methods, results, and discussion were presented clearly and the claims made were well-supported by the evidence provided.
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
All of the materials were publicly available. The study was pre-registered, and the pre-registration was followed.
Replicability: to what extent were we able to replicate the findings of the original study?
This study had one main finding, and that result replicated.
Clarity: how unlikely is it that the study will be misinterpreted?
This study was mostly clear and easy to interpret. The one area where clarity could have been improved is in the description of the type of reaching out in the study as a “brief message” in the paper, when the type of reaching out was not specified in the study itself.
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
The experimental materials are publicly available.
2. Analysis Transparency:
The analysis code is publicly available.
3. Data availability:
The data are publicly available.
4. Preregistration:
The study was pre-registered, and the pre-registration was followed.
Summary of Study and Results
In both the original study and our replication study, participants were asked to provide the initials of a person who they typically have pleasant encounters with who they consider to be a “weak-tie” acquaintance. Participants were then randomly assigned to either the “Initiator” condition or the “Responder” condition.
In the Initiator condition, participants were told to imagine that they happened to be thinking of the person whose initials they provided, and that they hadn't spent time with that person in a while. They were told to imagine that they were considering reaching out to this person. Then they were asked four questions:
If you reached out to {Initials}, to what extent would {Initials}…
appreciate it?
feel grateful?
feel thankful?
feel pleased?
In the Responder condition, participants were told to imagine that the person whose initials they provided happened to be thinking of them, and that they hadn't spent time with that person in a while. They were told to imagine that this person reached out to them. Then they were asked four questions:
If {Initials} reached out to you, to what extent would you…
appreciate it?
feel grateful?
feel thankful?
feel pleased?
In both conditions, responses to these four questions were on a Likert scale from 1-7, with 1 labeled “not at all” and 7 labeled “to a great extent.” For both conditions, the responses to these four questions were averaged to form the dependent variable, the “appreciation index.”
The key hypothesis being tested in this experiment and the other experiments in this paper is that when people consider reaching out to someone else, they underestimate the degree to which the person receiving the outreach would appreciate it.
In this study, that hypothesis was tested with an independent-samples t-test comparing the appreciation index between the two experimental conditions.
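As an illustration of this analysis in R (the column and item names below are placeholders; this is a sketch, not the study's analysis code):

```r
# Sketch: `d` has one row per participant, with condition ("Initiator" or
# "Responder") and the four appreciation items (hypothetical column names).
d$appreciation <- rowMeans(d[, c("appreciate", "grateful", "thankful", "pleased")])

# Independent-samples (Student's) t-test comparing the appreciation index
t.test(appreciation ~ condition, data = d, var.equal = TRUE)
```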
The original study found a statistically significant difference between the appreciation index in the two groups, with the Responder group’s appreciation index being higher. We found the same result in our replication study.
| Hypothesis | Original results | Our results | Replicated? |
|---|---|---|---|
| Initiators underestimate the degree to which responders appreciate being reached out to. | + | + | ✅ |
Study and Results in Detail
The table below shows the t-test results for the original study and our replication study.
| Original results (n = 201) | Our results (n = 742) | Replicated? |
|---|---|---|
| M_initiator = 4.36, SD = 1.31; M_responder = 4.81, SD = 1.53; M_difference = −.45, 95% CI [−.84, −.05]; t(199) = −2.23; p = .027; Cohen's d = .32 | M_initiator = 4.46, SD = 1.17; M_responder = 4.77, SD = 1.28; M_difference = −.30, 95% CI [−.48, −.13]; t(740) = −3.37; p < .001; Cohen's d = .25 | ✅ |
Additional test conducted due to assumption checks
When we reproduced the results from the original study data, we conducted a Shapiro-Wilk Test of Normality, and noticed that the pattern of responses in the dependent variable for the Responder group deviated from a normal distribution. The t-test assumes that the means of the distributions being compared follow normal distributions, so we also ran a Mann-Whitney U test on the original data and found that test also produced statistically significant results consistent with the t-test results reported in the paper. (Note: some would argue that we did not need to conduct this additional test because the observations themselves do not need to be normally distributed, and for large sample sizes, the normality of the means can be assumed due to the central limit theorem.)
After noticing the non-normality in the original data, we included in our pre-registration a plan to also conduct a Mann-Whitney U test on the replication data if the assumption of normality was violated. We used the Shapiro-Wilk Test of Normality, and found that the data in both groups deviated from normality in our replication data. As was the case with the original data, we found that the Mann-Whitney U results on the replication data were also statistically significant and consistent with the t-test results.
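A sketch of these assumption checks and the follow-up test, using the same hypothetical column names as above:

```r
# Shapiro-Wilk test of normality within each condition
by(d$appreciation, d$condition, shapiro.test)

# Mann-Whitney U test (Wilcoxon rank sum) comparing the two conditions
wilcox.test(appreciation ~ condition, data = d)
```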
| Mann-Whitney U – original data | Mann-Whitney U – replication data | Replicated? |
|---|---|---|
| Test statistic = 4025.500; p = .013; effect size = .203 (rank biserial correlation) | Test statistic = 58408.000; p < .001; effect size = .151 (rank biserial correlation) | ✅ |
We also ran Levene’s test for equality of variances on both the original data and the replication data, since equality of variances is an additional assumption of Student’s t. We found that both the original data and the replication data met the assumption of equality of variances. The variance around the mean was not statistically significantly different between the two experimental conditions in either dataset.
Effect sizes and statistical power
The original study reported an effect size of d = 0.32, which the authors noted in the paper was smaller than the effect size required for the study to have 80% power. This statistical power concern was presented clearly in the paper, and the authors also mentioned it when we contacted them about conducting a replication of this study. We dramatically increased the sample size for our replication study in order to provide more statistical power.
We set our sample size so that we would have a 90% chance to detect an effect size as small as 75% of the effect size detected in the original study. Using G*Power, we determined that to have 90% power to detect d = 0.24 (75% of 0.32), we needed a sample size of 732 (366 for each of the two conditions). Due to the data collection process in the Positly platform, we ended up with 10 more responses than the minimum number we needed to collect. We had 742 participants (370 in the Initiator condition, and 372 in the Responder condition).
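The same calculation can be reproduced outside of G*Power; for example, the pwr package in R gives an equivalent result (this is an illustration, not the original power analysis file):

```r
library(pwr)

# 90% power to detect d = 0.24 (75% of the original d = 0.32), two-tailed alpha = .05
pwr.t.test(d = 0.24, power = 0.90, sig.level = 0.05, type = "two.sample")
# returns n per group (approximately 366), i.e. roughly 732 participants in total
```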
The effect size in the replication study was d = 0.25. Our study design was adequately powered for the effect size that we obtained.
Interpreting the Results
The fact that this finding replicated with a larger sample and higher statistical power increases our confidence that the result is not due to a statistical artifact. The key hypothesis, that recipients of reaching out appreciate it more than initiators predict they will, is supported by our replication findings.
There is a possible alternative explanation for the results of this study: participants may think of acquaintances they would really like to hear from, while being unsure that this particular acquaintance would be as interested in hearing from them. Although this study design does not rule out that explanation, this is only one of the 13 studies reported on in this paper. Studies 1, 2, and 3 have different designs and test the same key hypothesis. Study 1 uses a recall design in which people are assigned to remember either a time they reached out to someone or a time that someone reached out to them, and then answer questions about how much the outreach was appreciated. Study 1 also included control questions about the type of outreach, how long ago it happened, and the current level of closeness of the relationship. The Initiator group and the Responder group in that study did not differ significantly on those control questions, suggesting that the kinds of reaching-out situations people were recalling were similar in the two groups.

Since the recall paradigm presents its own potential for alternative explanations, the authors also ran two field studies in which college student participants wrote a short message (Study 2) or wrote a short message and included a small gift (Study 3) reaching out to a friend on campus who they hadn't spoken to in a while. The student participants predicted how much the recipients would appreciate them reaching out. The experimenters then contacted those recipients, passed along the messages and gifts, and asked the recipients how much they appreciated being reached out to by their friend. Studies 2 and 3 used paired-samples t-tests to compare across initiator-recipient dyads, and found that the recipients appreciated the outreach more than the initiators predicted they would. Studies 4a (replicated here) and 4b use a scenario paradigm to create greater experimental control than the field studies allowed. The authors found consistent results across the recall, dyadic field, and scenario studies, which allowed them to provide clearer evidence supporting their hypothesis over possible alternative explanations.
Later studies in this paper test the authors' proposed mechanism – that people are surprised when someone reaches out to them. The authors propose that the pleasant surprise experienced by the recipients increases their appreciation for being reached out to, but initiators don't take the surprise factor into account when attempting to predict how much their reaching out will be appreciated. Studies 5a-b, 6, 7, and supplemental studies S2, S3, and S4 all test aspects of this proposed mechanism. Testing these claims was beyond the scope of this replication effort, but understanding the mechanism the authors propose is useful for interpreting the results of the replication study we conducted.
The one issue with the way that study 4a is described in the paper is that the authors describe the study as involving reaching out with a “brief message,” but the study itself does not specify the nature of the outreach or its content. When describing studies 4a and 4b the authors say, “We then controlled the content of initiators’ reach-out by having initiators and responders imagine the same reach-out made by initiators.” While this is true for study 4b, in which the reach-out is described as one of a few specific small gifts, it is not the case for study 4a, which simply asks participants to imagine either that they are considering reaching out, or that their acquaintance has reached out to them. The description of the study in the paper is likely to lead readers to a mistaken understanding of what participants were told in the study itself. This reduced the clarity of this study; however, the issue is mitigated somewhat by the fact that there is another study in the paper (study S1) with similar results that does involve a specified brief message as the reach-out.
In interpreting the results of this study it is also important to recognize that although the finding is statistically significant, the effect size is small. When drawing substantive conclusions about these results it is important to keep the effect size in mind.
Conclusion
This pre-registered study provided a simple and clear test of its key hypothesis, and the finding replicated on a larger sample size in our replication. The study materials, data, and code were all provided publicly, and this transparency made the replication easy to conduct. The one area in which the clarity of the study could have been improved is that the paper should not have described the type of reaching out being studied as a “brief message,” because the type of reaching out was not specified in the study itself. Apart from this minor issue, the methods, results and discussion of the study were clear and the claims made were supported by the evidence provided.
Acknowledgements
We are grateful to the authors for their responsiveness in answering questions about this replication, and for making their methods, analysis, and materials publicly available. Their commitment to open science practices made this replication much easier to conduct than it would have been otherwise.
I want to thank Clare Harris and Spencer Greenberg at Transparent Replications for their feedback on this replication and report. Also, thank you to the Ethics Evaluator who reviewed our study plan.
Lastly, thank you to the 742 participants in this study, without whose time and attention this work wouldn’t be possible.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
References
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.
Liu, P.J., Rim, S., Min, L., Min K.E. (2023). The Surprise of Reaching Out: Appreciated More Than We Think. Journal of Personality and Social Psychology, 124(4), 754–771. https://doi.org/10.1037/pspi0000402
We ran a replication of study 4 from this paper, which found that people’s perceptions of an artwork as sacred are shaped by collective transcendence beliefs (“beliefs that an object links the collective to something larger and more important than the self, spanning space and time”).
In the study, participants viewed an image of a painting and read a paragraph about it. All participants saw the same painting, but depending on the experimental condition, the paragraph was designed to make it seem spiritually significant, historically significant, both, or neither. Participants then answered questions about how they perceived the artwork.
Most of the original study’s methods and data were shared transparently, but the exclusion procedures and related data were only partially available. Most (90%) of the original study’s findings replicated. In both the original study and our replication, “collective meaning” (i.e., the perception that the artwork has a “deeper meaning to a vast number of people”) was found to mediate the relationships between all the experimental conditions and the perceived sacredness of the artwork. The original study’s discussion was partly contradicted by its mediation results table, and the control condition, which was meant to control for uniqueness, did not do so; the original paper would have been clearer if it had addressed these points.
The Metaculus prediction page about this study attracted a total of 23 predictions from 11 participants. The median prediction was that 5 of the 10 findings would replicate. However, participants also commented that they struggled with the forecasting instructions.
Request a PDF of the original paper from the authors.
The data and pre-registration for the original study can be found on the Open Science Framework (OSF) site.
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
Most of the study materials and data were shared transparently, except for the exclusion-related procedures and data. There were some minor deviations from the pre-registered study plan.
Replicability: to what extent were we able to replicate the findings of the original study?
9 of the 10 original findings replicated (90%).
Clarity: how unlikely is it that the study will be misinterpreted?
The discussion of “uniqueness” as an alternative mediator is not presented consistently between the text and the results tables, and the failure of the control condition to successfully manipulate uniqueness is not acknowledged clearly in the discussion.
Detailed Transparency Ratings
Overall Transparency Rating
1. Methods transparency:
The publicly available materials were almost complete. Not all of the remaining materials could be provided on request, because the research assistants had been trained in person by the lead author, but this did not significantly impact our ability to replicate the study. We were able to locate, or were provided with, all materials required to run the study and process the data, except for the exclusion procedures, which were only partially replicable: we requested the specific instructions given to the hypothesis-blinded coders for exclusion criterion #3 (see appendices), but those materials were not available.
2. Analysis transparency
Some of the analyses were standard analyses that were described fully enough in the paper to replicate without shared code. The description of the conditional process analysis in the paper was almost complete, and the remaining details were provided on request. However, the original random seed had not been saved.
3. Data availability:
The data were publicly available and partially complete, but the remaining data (the data for the free-text questions that were used to include/exclude participants in the original study) were not accessible.
4. Pre-registration:
The study was pre-registered and was carried out with minor deviations, but those deviations were not acknowledged in the paper.
In the pre-registration, the authors had said they would conduct more mediation analyses than they reported on in the paper (see the appendices). There were also some minor wording changes (e.g., typo corrections) that the authors made between the pre-registration and running the study. While these would be unlikely to impact the results, ideally they would have been noted.
Summary of Study and Results
Summary of the study
In the original study and in our replication, U.S. adults on MTurk were invited to participate in a “study about art.” After completing an informed consent form, all participants were shown an image of an artwork called “The Lotus,” accompanied by some text. The text content was determined by the condition to which they were randomized. In the original study, participants were randomized to one of four conditions (minimum number of participants per condition after exclusions: 193).
Depending on the condition, participants read that the artwork was…
…both historically and spiritually significant (this condition combined elements from conditions 2 and 3 [described in the following lines]);
…over 900 years old (according to radiocarbon dating) and “serves as a record of human history;”
…aiming to depict key spiritual aspects of Buddhism, a religion that helps people to connect to a “higher power of the universe;” or:
…unique because it was painted in 10 minutes by a talented artist and because of aspects of its artistic style.
In our replication, we had at least 294 participants per condition (after exclusions). Participants were randomized to one of five conditions. Four of the conditions were replications of the four conditions described above, and the fifth condition was included for the purposes of additional analyses. The fifth condition does not affect the replication of the study (as the participants randomized to the other four conditions are not affected by the additional fifth condition). In the fifth condition, participants read that the artwork was unique because it was created by a child prodigy using one-of-a-kind paints created specifically for the artwork that would not be created again.
All participants answered a series of questions about the artwork. They were asked to “indicate how well you think the following statements describe your feelings and beliefs about this piece of art:” (on a scale from “Strongly disagree (1)” to “Strongly agree (7)”). The questions captured participants’ views on the artwork’s sacredness, collective meaning, uniqueness, and usefulness, as well as participants’ positive or negative emotional response to the artwork. Sacredness in this context was defined as the perception that the artwork was “absolute and uncompromisable,” and unable to be “thought of in cost–benefit terms.” A complete list of the questions is in the “Study and Results in Detail” section.
Summary of the results
The original study tested 10 hypotheses (which we refer to here as Hypothesis 1 to Hypothesis 10, or H1 to H10 for short). They are listed in the table below, along with the original results and our replication results. (Please see the Study and Results in Detail section for an explanation of how the hypotheses were tested, as well as an explanation of the specific results.)
| Hypotheses | Original results | Our results | Replicated? |
|---|---|---|---|
| H1: Art with higher historical significance and collective spirituality will be rated as more collectively meaningful, compared to a control condition. | + (Positive finding) | + (Positive finding) | ✅ |
| H2: Art with higher historical significance will be rated as more collectively meaningful, compared to a control condition. | + | + | ✅ |
| H3: Art with higher collective spirituality will be rated as more collectively meaningful, compared to a control condition. | + | + | ✅ |
| H4: Art with higher historical significance and collective spirituality will be rated as more sacred, compared to a control condition. | + | + | ✅ |
| H5: Art with higher historical significance will be rated as more sacred, compared to a control condition. | + | + | ✅ |
| H6: Art with higher collective spirituality will be rated as more sacred, compared to a control condition. | + | + | ✅ |
| H7: H4 will be mediated by H1. | + | + | ✅ |
| H8: H5 will be mediated by H2. | + | + | ✅ |
| H9: H6 will be mediated by H3. | + | + | ✅ |
| H10: H4, H5, and H6 will not be mediated by other alternative mediators, including positivity, negativity, personal meaning, and utility of the painting. | – (Partially contradicted) | — (Mostly contradicted) | ❌ |
Study and Results in Detail
The questions included in the study are listed below. We used the same questions, including the same (in some cases unusual) punctuation and formatting, as the original study.
Manipulation check questions:
I believe, for many people this work of art evokes something profoundly spiritual.
I believe, this work of art is a reflection of the past – a record of history.
I believe, this piece of art is unique.
Alternative mediator questions:
Usefulness questions:
This piece of art is useful for everyday use.
You can use this piece of art in everyday life in a lot of different ways.
This piece of art is functional for everyday use.
I believe, this piece of art is unique.
This piece of art makes me feel positive.
This piece of art makes me feel negative.
I personally find deep meaning in this piece of art that is related to my own life.
Collective meaning questions:
It represents something beyond the work itself – this work of art has deeper meaning to a vast number of people.
A lot of people find deep meaning in this work of art– something more than what is shown in the painting.
For many people this work of art represents something much more meaningful than the art itself.
Sacredness questions:
This piece of art is sacred.
I revere this piece of art.
This piece of art should not be compromised, no matter the benefits (money or otherwise).
Although incredibly valuable, it would be inappropriate to put a price on this piece of art.
Participants answered questions about: (1) manipulation checks, (2) sacredness, (3) alternative mediators, (4) collective meaning
Both our study and the original randomized each participant to one of the two order sequences above. In contrast to the original study, we also randomized the order of presentation of the questions within each set of questions.
In both the original study and in our replication, participants were excluded if any of the following conditions applied:
They had missing data on any of the variables of interest
They failed to report “the Lotus” (with or without a capital, and with or without “the”) when asked to provide the name of the artwork that they had been presented with
They either failed to provide any information or provided random information that was irrelevant to details about the painting (as judged by two coders blinded to the study hypotheses, with the first author making the final decision in cases where the two coders disagreed). Please see the appendices for additional information about this.
They report having seen or read about the artwork prior to completing the study (in response to the question, “Prior to this study, did you know anything about the artwork that you read about in this study? If so, what was your prior knowledge?”)
Testing Hypotheses 1-6
To test Hypotheses 1-6, both the original study and our replication used one-way analyses of variance (ANOVAs) with experimental condition as the between-subjects factor and with each measured variable (in turn) as the dependent variable. This was followed up with independent-samples t-tests comparing the collective meaning and sacredness ratings in each treatment condition to the control condition. We performed our analyses for Hypotheses 1-6 in JASP (JASP Team, 2020; JASP Team, 2023).
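A sketch of this analysis structure in R (our Hypotheses 1-6 analyses were actually run in JASP; the variable names below are hypothetical placeholders):

```r
# Sketch: `d` has one row per participant, with a condition factor
# ("control", "historical", "spiritual", "combined") and the averaged
# collective_meaning and sacredness scores.
summary(aov(collective_meaning ~ condition, data = d))  # one-way ANOVA (H1-H3)
summary(aov(sacredness ~ condition, data = d))          # one-way ANOVA (H4-H6)

# Follow-up: independent-samples t-test for one treatment condition vs. control
t.test(sacredness ~ condition, var.equal = TRUE,
       data = droplevels(subset(d, condition %in% c("control", "combined"))))
```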
Tables showing all of the t-test results are available in the appendix. The t-test results for the collective meaning-related hypotheses (1-3), and the sacredness-related hypotheses (4-6) are shown below.
Results for Hypotheses 1-6
| Collective Meaning Hypotheses | Original results | Our results | Replicated? |
|---|---|---|---|
| H1: Art with higher historical significance and collective spirituality will be rated as more collectively meaningful, compared to a control condition. | M_control (SD) = 4.50 (1.31); M_combined (SD) = 5.66 (.98); t = 10.65; p < 0.001; Cohen's d = 1.06 | M_control (SD) = 4.19 (1.45); M_combined (SD) = 5.75 (1.09); t = 15.1; p < 0.001; Cohen's d = 1.22 | ✅ |
| H2: Art with higher historical significance will be rated as more collectively meaningful, compared to a control condition. | M_control (SD) = 4.50 (1.31); M_historical (SD) = 5.22 (1.08); t = 6.38; p < 0.001; Cohen's d = 0.64 | M_control (SD) = 4.19 (1.45); M_historical (SD) = 5.37 (1.22); t = 11.03; p < 0.001; Cohen's d = 0.89 | ✅ |
| H3: Art with higher collective spirituality will be rated as more collectively meaningful, compared to a control condition. | M_control (SD) = 4.50 (1.31); M_spiritual (SD) = 5.78 (1.06); t = 11.6; p < 0.001; Cohen's d = 1.16 | M_control (SD) = 4.19 (1.45); M_spiritual (SD) = 5.73 (1.16); t = 14.46; p < 0.001; Cohen's d = 1.17 | ✅ |
| Sacredness Hypotheses | Original results | Our results | Replicated? |
|---|---|---|---|
| H4: Art with higher historical significance and collective spirituality will be rated as more sacred, compared to a control condition. | M_control (SD) = 3.49 (1.13); M_combined (SD) = 4.71 (1.03); t = 11.33; p < 0.001; Cohen's d = 1.13 | M_control (SD) = 3.08 (1.16); M_combined (SD) = 4.72 (1.30); t = 16.41; p < 0.001; Cohen's d = 1.33 | ✅ |
| H5: Art with higher historical significance will be rated as more sacred, compared to a control condition. | M_control (SD) = 3.49 (1.13); M_historical (SD) = 4.55 (1.08); t = 9.59; p < 0.001; Cohen's d = 0.96 | M_control (SD) = 3.08 (1.16); M_historical (SD) = 4.69 (1.28); t = 16.37; p < 0.001; Cohen's d = 1.31 | ✅ |
| H6: Art with higher collective spirituality will be rated as more sacred, compared to a control condition. | M_control (SD) = 3.49 (1.13); M_spiritual (SD) = 4.13 (1.18); t = 5.85; p < 0.001; Cohen's d = 0.59 | M_control (SD) = 3.08 (1.16); M_spiritual (SD) = 3.90 (1.30); t = 8.15; p < 0.001; Cohen's d = 0.66 | ✅ |
Conditional Process Analysis
Hypotheses 7-10 were assessed using a particular kind of mediation analysis known as multicategorical conditional process analysis, following Andrew Hayes’ PROCESS model. It is described in his book Introduction to Mediation, Moderation, and Conditional Process Analysis: A Regression-based Approach. If you aren’t familiar with the terminology in this section, please check the Glossary of Terms.
The mediation analysis for Hypotheses 7-10 in the original study was conducted using model 4 of Andrew Hayes’ PROCESS macro in SPSS. We used the same model in the R version (R Core Team, 2022) of PROCESS. Model 4 includes an independent variable, an outcome or dependent variable, and a mediator variable, which are illustrated below in the case of this experiment.
In the model used in this study and illustrated above, there is:
An independent variable (which can be categorical, as in this study),
A dependent variable, and
A mediator variable (that mediates the relationship between the independent and the dependent variable)
These variables are shown below, along with the names that are traditionally given to the different “paths” in the model.
In the diagram above…
The “a” paths (from the independent variables to the mediator variable) are quantified by finding the coefficient of the independent variable in a linear regression predicting the mediator variable.
The “b” and “c’ ” paths are quantified by finding the coefficients of the mediator and independent variables (respectively) in a regression involving the dependent variable as the outcome variable and all other relevant variables (in this case: the independent variable and the mediator variable) as the predictor variables.
In Hayes' book, he states that mediation can be said to be occurring as long as the indirect effect – the product of the a and b coefficients – is different from zero. In other words, as long as the effect size of a*b (i.e., the path from the independent variable to the dependent variable via the mediator variable) is different from zero, the variable in the middle of the diagram above is said to be a significant mediator of the relationship between the independent and dependent variables. PROCESS uses bootstrapping (by default, with 10,000 resamples) to estimate the lower and upper bounds of the 95% confidence interval for the size of the ab path. If the confidence interval does not include 0, the indirect effect is said to be statistically significant.
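The original and replication analyses used the PROCESS macro (model 4) itself; purely as an illustration of the underlying logic, here is a minimal bootstrap of the indirect effect for a single treatment-versus-control contrast, with hypothetical variable names.

```r
# Sketch: `d` has columns condition (0 = control, 1 = treatment),
# collective_meaning (the mediator), and sacredness (the outcome).
set.seed(1)  # any fixed seed makes the bootstrap reproducible

indirect_effect <- function(data, idx) {
  b <- data[idx, ]
  a_path <- coef(lm(collective_meaning ~ condition, data = b))["condition"]
  b_path <- coef(lm(sacredness ~ condition + collective_meaning, data = b))["collective_meaning"]
  a_path * b_path  # indirect effect a*b for this bootstrap resample
}

boots <- replicate(10000, indirect_effect(d, sample(nrow(d), replace = TRUE)))
quantile(boots, c(0.025, 0.975))  # 95% percentile CI; excludes 0 => significant mediation
```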
The original random seed (used by the original authors in SPSS) was not saved. We corresponded with Andrew Hayes (the creator of PROCESS) about this and have included notes from that correspondence in the Appendices. In our replication, we used a set seed to allow other teams to reproduce and/or replicate our results in R.
Mediation results in more detail
Like the original paper, we found that collective meaning was a statistically significant mediator (with 95% confidence intervals excluding 0) of the relationship between each experimental condition and perceived sacredness.
In the table below, please note that “All conditions” refers to the mediation results when all experimental conditions were coded as “1”s and treated as the independent variable (with the control condition being coded as “0”).
| Mediator: Collective Meaning | Original Results | Replication Results | Replicated? |
|---|---|---|---|
| Combined vs. Control (H7) | [0.06, 0.15] | [0.6035, 0.9017] | ✅ |
| Historical vs. Control (H8) | [0.06, 0.16] | [0.4643, 0.7171] | ✅ |
| Spiritual vs. Control (H9) | [0.26, 0.54] | [0.5603, 0.8298] | ✅ |
| All Conditions | [0.05, 0.11] | [0.5456, 0.7699] | ✅ |
Results where the 95% confidence interval excludes 0 appear in bold
Mediation results for Hypothesis 10
For Hypothesis 10, the original study tested a variable as a potential mediator of the relationship between experimental condition and sacredness if there was a statistically significant difference in a particular variable across conditions when running the one-way ANOVAs. We followed the same procedure. See the notes on the mediation analysis plan in the appendices for more information about this.
When testing the alternative mediators of uniqueness and usefulness, the original study authors found that uniqueness was (and usefulness was not) a statistically significant mediator of the relationships between each of the experimental conditions and perceived sacredness. We replicated the results with respect to uniqueness, except in the case of the relationship between the spirituality condition and perceived sacredness, for which uniqueness was not a statistically significant mediator.
Insofar as we did not find that usefulness was a positive mediator of the relationship between experimental conditions and perceived sacredness, our results were consistent with the original study’s conceptual argument. However, unlike the original study authors, we found that usefulness was a statistically significant negative mediator (with an indirect effect significantly below zero) of the relationships between two of the experimental conditions (the historical condition and the combined condition) and perceived sacredness.
| Alternative Mediator: Uniqueness (H10) | Original Results | Replication Results | Replicated? |
|---|---|---|---|
| Combined vs. Control | [0.02, 0.06] | [0.1809, 0.3753] | ✅ |
| History vs. Control | [0.02, 0.10] | [0.1984, 0.3902] | ✅ |
| Spirituality vs. Control | [0.01, 0.13] | [-0.1082, 0.0705] | ❌ |
| All Conditions | [0.03, 0.07] | [0.1212, 0.2942] | ✅ |
Results where the 95% confidence interval excludes 0 appear in bold
| Alternative Mediator: Usefulness (H10) | Original Results | Replication Results | Replicated? |
|---|---|---|---|
| Combined vs. Control | [−.02, 0.01] | [-0.1216, -0.0128] | ❌ |
| History vs. Control | [−0.06, −0.01] | [-0.1689, -0.0549] | ✅ |
| Spirituality vs. Control | [−.02, 0.09] | [-0.0364, 0.0692] | ✅ |
| All Conditions | [−.02, 0.00] | [-0.0860, -0.0152] | ❌ |
Results where the 95% confidence interval excludes 0 appear in bold
In our replication, unlike in the original study, the one-way ANOVAs revealed statistically significant differences across conditions with respect to: personal meaning (F(3, 1251) = 11.35, p = 2.40E-7), positive emotions (F(3, 1251) = 7.13, p = 4.35E-3), and negative emotions (F(3, 1251) = 3.78, p = 0.01).
As seen in the tables below, each of these variables was found to be a statistically significant mediator of the effect of all conditions (combined) on sacredness when entered into the mediation analysis. Except for positive emotions, which was not a statistically significant mediator of the effect of the spirituality condition (versus control) on sacredness, the listed variables were also statistically significant mediators of the effects of each of the other experimental conditions (both combined and individually) on sacredness.
| Alternative Mediator: Personal Meaning (H10) | Original Results | Replication Results |
|---|---|---|
| Combined vs. Control | Not tested due to non-significant ANOVA results for these variables | [-0.1216, -0.0128] |
| History vs. Control | Not tested | [0.1605, 0.3995] |
| Spirituality vs. Control | Not tested | [0.0787, 0.3144] |
| All Conditions | Not tested | [0.1684, 0.3619] |
Results where the 95% confidence interval excludes 0 appear in bold
| Alternative Mediator: Positive Emotions (H10) | Original Results | Replication Results |
|---|---|---|
| Combined vs. Control | Not tested due to non-significant ANOVA results for these variables | [0.0305, 0.2353] |
| History vs. Control | Not tested | [0.0457, 0.2555] |
| Spirituality vs. Control | Not tested | [-0.0914, 0.1265] |
| All Conditions | Not tested | [0.0144, 0.2003] |
Results where the 95% confidence interval excludes 0 appear in bold
| Alternative Mediator: Negative Emotions (H10) | Original Results | Replication Results |
|---|---|---|
| Combined vs. Control | Not tested due to non-significant ANOVA results for these variables | [0.0080, 0.0900] |
| History vs. Control | Not tested | [0.0208, 0.1118] |
| Spirituality vs. Control | Not tested | [0.0000, 0.0762] |
| All Conditions | Not tested | [0.0186, 0.1019] |
Results where the 95% confidence interval excludes 0 appear in bold
There were 24 Buddhists in the sample. As in the original study, analyses were performed both with and without the Buddhists in the sample, and the results without the Buddhists were consistent with the results with them included. All findings that were statistically significant with the Buddhists included were also statistically significant (with effects in the same direction) when the Buddhists were excluded, with one exception: when Buddhists were included in the sample (as in the tables above), usefulness did not mediate the relationship between the spiritual significance condition (versus control) and sacredness (95% confidence interval: [-0.0364, 0.0692]), whereas in the Buddhist-free dataset, usefulness was a statistically significant (and negative) mediator of that relationship (95% confidence interval: [-0.1732, -0.0546]).
Interpreting the results
Most of the findings in the original study were replicated in our study. However, our results diverged from the original paper’s results when it came to several of the subcomponents of Hypothesis 10. Some of the alternative mediators included in the original study questions weren’t entered into mediation analyses in the original paper because the ANOVAs had not demonstrated statistically significant differences in those variables across conditions. However, we found significant differences for all of these variables in the ANOVAs that we ran on the replication dataset, so we tested them in the mediation analyses.
In the original study, uniqueness was a significant mediator of the relationship between experimental condition and perceived sacredness, which partially contradicted Hypothesis 10. In our replication study, not only was uniqueness a significant mediator of this relationship, but so were personal meaning and negative emotions, and (except for the relationship between spiritual significance and sacredness) so were usefulness and positive emotions. Thus, our study contradicted most of the sub-hypotheses in Hypothesis 10.
Despite the fact that multiple alternative mediators were found to be significant in this study, when these alternative mediators were included as covariates, collective meaning continued to be a significant mediator of the relationship between experimental condition and perceived sacredness. This means that even when alternative mediators are considered, the main finding (that collective meaning influences sacredness judgments) holds in both the original study and our replication.
We had concerns about the interpretation of the study results that are reflected only in the Clarity Rating. These revolve around (1) the manipulation of uniqueness and the way in which this was reported in the study and (2) the degree to which alternative explanations can be ruled out by the study’s design and the results table.
Manipulating Uniqueness
The control condition in the original study did not manipulate uniqueness in the way it was intended to manipulate it.
In the original study, the control condition was introduced for the following reason:
“By manipulating how historic the artwork was [in our previous study], we may have inadvertently affected perceptions of how unique the artwork was, since old things are typically rare, and there may be an inherent link between scarcity and sacredness…to help ensure that collective transcendence beliefs, and not these alternative mechanisms, are driving our effects, in Study 4 we employed a more stringent control condition that …emphasized the uniqueness of the art without highlighting its historical significance or its importance to collective spirituality”
In other words, the original authors intended for uniqueness to be ruled out as an explanation for higher ratings of sacredness observed in the experimental conditions. Throughout their pre-registration, the authors referred to “a control condition manipulating the art’s uniqueness” as their point of comparison for both collective meaning and sacredness judgments of artwork across the different experimental conditions.
Unfortunately, however, their control condition did not successfully induce perceptions of uniqueness in the way that the authors intended. The artwork in the control condition was rated as significantly less unique than in each of the experimental conditions, whereas to serve the purpose it had been designed for, it should have been perceived to be at least as unique as the artwork in the experimental conditions.
Although the paper did mention this finding, it did not label it as a failed manipulation check for the control condition. We think this is one important area in which the paper could have been written more clearly. When introducing study 4, they emphasized the intention to rule out uniqueness as an explanation for the different sacredness ratings. In the discussion paragraph associated with study 4, they again talk about their findings as if they have ruled out the uniqueness of the artwork as an explanation. However, as explained above, the uniqueness of the artwork was not ruled out by the study design (nor by the study findings).
For clarity in this report and our pre-registration, we refer to the fourth condition as simply “a control condition.” In addition, in an attempt to address these concerns regarding the interpretation of the study’s findings, we included a fifth condition that sought to manipulate the uniqueness of the artwork in such a way that the perceived uniqueness of the artwork in that condition was at least as high on the Likert scale as the uniqueness of the artwork in the experimental conditions. Please note that this fifth condition was only considered in our assessment of the Clarity rating for the current paper, not the Replicability rating. Please see the appendix for the results related to this additional condition.
The paper’s discussion of alternative mediators
The claim that “Study 4’s design also helped rule out a number of alternative explanations” is easily misinterpreted.
In the discussion following study 4, the original authors claim that:
“Study 4’s design also helped rule out a number of alternative explanations, including the uniqueness of the artwork, positivity and negativity felt toward the art, and the personal meaning of the work.”
The fact that this explanation includes the word “helped” is key – if the claim had been that the study design “definitively ruled out” alternative explanations, it would have been false. This is because, in the absence of support for the alternative mediators that were tested, the most one could say is that the experiment failed to support those explanations, but due to the nature of null hypothesis significance testing (NHST), those alternative explanations cannot be definitively “ruled out.” In order to estimate the probability of the null hypothesis directly, the paper would have needed to use Bayesian methods rather than only relying on frequentist methods.
Especially in the context of NHST, it is not surprising that Hypothesis 10 was far less supported (i.e., more extensively contradicted) by our results than by the results of the original study, because of our larger sample size. The claim quoted above could be misinterpreted if readers under-emphasized the word “helped” or if they focused on the idea of “ruling out” the mediators with null results.
Another way in which this part of the discussion of study 4 in the paper is less than optimally clear is in the discrepancy between the discussion and the mediation results table. Rather than showing that the uniqueness of the artwork was not a likely explanation, the original results only showed that it was not the only explanation. The authors recorded these findings in a table (Table 9 in the original paper), but the discussion did not address the implications of the finding in that table that uniqueness was also a significant mediator of the effects of the spiritual, historical, and combined historical-and-spiritual conditions on the perceived sacredness of the artwork.
Interestingly, however, when we ran a mediation analysis on the original paper’s data, and entered uniqueness as a mediator, with collective meaning and usefulness as covariates, we found that uniqueness was, indeed, not a statistically significant mediator (using the same random seed as throughout this write-up, the 95% confidence interval included 0: [-0.0129, 0.0915]). This aligns with the claim in the discussion that the original study had not found evidence in favor of it being a mediator. However, such results do not appear to be included in the paper; instead, Table 9 in the original paper shows mediation results for each individual mediator variable on their own (without the other variables entered as covariates), and in that table, uniqueness is a significant mediator (which is contrary to what the discussion section implies).
Our study also replicated the finding that uniqueness was a significant mediator between experimental condition and perceived sacredness (when entered into the model without any covariates), except in the case of the spiritual condition versus control. Additionally, in our study, we found several more mediators that were statistically significant by the same standards of statistical significance used by the original authors (again, when entered into the model without any covariates).
The overall claim that collective meaning remains a mediator above and beyond the other mediators considered in the study remained true when the other variables that appeared relevant (uniqueness and usefulness) were entered as covariates in the original study data. The claim was also true for our dataset, including when all the considered mediators were entered as covariates.
Conclusion
We replicated 90% of the results reported in study 4 from the paper, “Collective transcendence beliefs shape the sacredness of objects: the case of art.” The original study’s methods and data were mostly recorded and shared transparently, but the exclusion procedures were only partially shared and the related free-text data were not shared; there were also some minor deviations from the pre-registration. The original paper would have benefited from clearer explanations of the study’s results and implications. In particular, we suggest that it would have been preferable if the discussion section for study 4 in the original paper had acknowledged that the experiment had not controlled for uniqueness in the way that had been originally planned, and if the table of results and discussion had been consistent with each other.
Acknowledgements
We would like to thank the team who ran the original study for generously reviewing our materials, sending us their original study materials, helping us to make this replication a faithful one, and providing helpful, timely feedback on our report. (As with all our reports, though, the responsibility for the contents of the report remains with the author and the rest of the Transparent Replications team.)
Many thanks to Amanda Metskas for her extensive involvement throughout the planning, running, and reporting of this replication study. Amanda had a central role in the observations we made about the original study’s control condition, and she also highlighted the subsequent necessity of including an alternative control condition in our study. Many thanks also go to Spencer Greenberg for his helpful feedback throughout, to our hypothesis-blinded coders, Alexandria Riley and Mike Riggs, for their help in excluding participants according to our exclusion protocol, and to our Ethics Evaluator. Thank you to the forecasters on Metaculus who engaged with our study description and made predictions about it. Last but certainly not least, many thanks to our participants for their time and attention.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional information about the pre-registration
In cases of discrepancies between a paper and a pre-registration, we take note of the differences, as this is relevant to the transparency of the paper, but we replicate what the authors described actually doing in the paper.
There were differences between the pre-registered analysis plan and what was actually done (explained in the “Notes on mediation analysis plans in the pre-registration” section below). In addition to this, there were subtle wording and formatting differences between the text in the pre-registration and the text used in the conditions in the actual study. Having said this, none of the wording discrepancies altered the meaning of the conditions.
The PDF of the study questions that the team shared with us also included bold or underlined text in some questions, and this formatting was not mentioned in the pre-registration. However, we realize that bold or underlined text entered into the text fields of an OSF pre-registration template does not display as bold or underlined text once the pre-registration is saved.
Additional information about the exclusion criteria
In preparation for replicating the process undertaken to implement exclusion criterion #3, we requested a copy of the written instructions given to the hypothesis-blinded coders in the original study. The original authorship team responded with the following:
“I had a meeting/training session with my coders before letting them code everything. Like ask them to code 10% to see if they have high agreement, if not we discuss how to reach agreement. For example the response has to contain at least two critical informations about the artwork etc. so the instructions may vary depending on participants’ responses.”
We wanted our exclusion procedures to be as replicable as possible, so instead of providing a training session, we provided a written guidelines document for our coders. See here for the guidelines document we provided to our coders. There was moderate agreement between the two hypothesis-blinded coders (ICC1 = 0.58, ICC2 = 0.59) and all disagreements were resolved by the first author.
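For readers who want to reproduce this kind of agreement statistic, a minimal sketch is below, assuming the psych package in R and simulated ratings; it is not our analysis code, and the rating scale and column names are placeholders.

```r
# Minimal sketch: intraclass correlations (ICC1, ICC2) for two coders' ratings,
# assuming the psych package is installed. The ratings below are simulated and
# are not the coders' actual data.
library(psych)

set.seed(4)
coder1 <- sample(1:5, 100, replace = TRUE)
coder2 <- pmin(pmax(coder1 + sample(-1:1, 100, replace = TRUE), 1), 5)  # mostly agrees with coder 1

ICC(data.frame(coder1, coder2))  # output includes ICC1, ICC2, ICC3 and their confidence intervals
```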
Notes on corresponding with the original authors
There were some cases where the original authorship team’s advice was not consistent with what they had recorded in their pre-registration and/or paper (which we attributed to the fact that the study was conducted some time ago). In those cases, we adhered to the methods in the pre-registration and paper.
Full t-test results
Notes on Andrew Hayes’ PROCESS models
The original authorship team had used the PROCESS macro in SPSS and did not record a random seed. So the current author emailed Andrew Hayes about our replication and asked whether there is a default random seed that is used in SPSS, so that we could specify the same random seed in R. His response was as follows:
If the seed option is not used in the SPSS version, there is no way of recovering the seed for the random number generator. Internally, SPSS seeds it with a random number (probably based on the value of the internal clock) if the user doesn’t specify as seed.
SPSS and R use different random number generators so the same seed will produce different bootstrap samples. Since the user can change the random number generator, and the default random number generator varies across time, there really is no way of knowing for certain whether using the same seed will reproduce results.
Likewise, if you sort the data file rows differently than when the original analysis was conducted, the same seed will not reproduce the results because the resampling is performed at the level of the rows. This is true for all versions.
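To illustrate why a recorded seed matters, the sketch below shows a seeded percentile-bootstrap estimate of an indirect effect in base R. It is not the PROCESS macro or the original analysis; the variable names (condition, mediator, sacredness) and the simulated data are placeholders.

```r
# Minimal sketch of a seeded percentile-bootstrap test of an indirect effect.
# Recording the seed (and the row order of the data) makes the bootstrap CI
# reproducible in R.
set.seed(42)

boot_indirect <- function(data, n_boot = 5000) {
  replicate(n_boot, {
    d <- data[sample(nrow(data), replace = TRUE), ]
    a <- coef(lm(mediator ~ condition, data = d))["condition"]              # path a
    b <- coef(lm(sacredness ~ mediator + condition, data = d))["mediator"]  # path b
    a * b                                                                   # indirect effect
  })
}

# Example usage with simulated data:
dat <- data.frame(condition = rbinom(200, 1, 0.5))
dat$mediator   <- 0.5 * dat$condition + rnorm(200)
dat$sacredness <- 0.4 * dat$mediator + rnorm(200)
quantile(boot_indirect(dat), c(0.025, 0.975))  # 95% percentile bootstrap CI
```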
Notes on mediation analysis plans in the pre-registration
In their pre-registration, the authors had said: “We will also conduct this same series of mediation analyses for each of the alternative mediators described above. If any of the alternative mediators turn out to be significant, we will include these significant alternative mediators in a series of simultaneous mediation analyses (following the same steps as described above) entering them along with collective meaning.” In contrast, in their paper, they only reported mediation analyses for the variables that had significant ANOVA results. And when they found a significant mediator, in the paper they described rerunning ANOVAs while controlling for those mediators, whereas in the pre-registration they had described rerunning mediation analyses with the additional variables as covariates.
Notes on the mediators considered in the original study design
In their set of considered explanations for the perceived sacredness of art, the authors considered the effects of individual meaningfulness in addition to collective meaning, and they considered the effects of individual positive emotions, but they did not consider the effects of collective positive emotions.
The original study authors included a question addressing the individual meaningfulness of the artwork, as they acknowledged that the finding about collective meaning was more noteworthy if it existed above and beyond the effects of individual meaningfulness. They also included a question addressing individual positive emotions so that they could examine the impact of this variable on sacredness. In the context of this study, it seems like another relevant question to include would relate to the effects of collective positive emotions (as the collective counterpart to the question about individual positive emotions). One might argue that this is somewhat relevant to the clarity of this paper: ideally, the authors would have explained the concepts in such a way as to make the inclusion of a question about collective positive emotions an obvious consideration.
We therefore included a question addressing collective positive emotions. (We did not include multiple questions, despite the fact that there were multiple questions addressing collective meaningfulness, because we wanted to minimize the degree to which we increased the length of the study.) The additional question was included as the very last question in our study, so that it had no impact on the assessment of the replicability of the original study (as the replication component was complete by the time participants reached the additional question).
Results from extensions to the study
The t-test results table above includes a row of results (pertaining to the effect of collective positive emotions) that were outside the scope of the replication component of this project.
We also conducted an extra uniqueness versus control comparison which is somewhat relevant to the clarity rating of the study, but represents an extension to the study rather than a part of the replicability assessment.
Our newly-introduced fifth condition was designed to be perceived as unique. It was rated as more unique than the original study’s control condition, and this difference was statistically significant (Mnew_condition = 5.26; Mcontrol = 4.87; Student’s t-test: t(611) = 3.24, p = 1.26e-3; d = 0.26). It was also rated as more unique than the spiritual significance condition; however, it was rated as less unique than the individual historical significance condition and the combined historical and spiritual significance condition.
In addition to being rated as more unique than the original control, the fifth condition was also rated as more historically significant than the control condition, and this difference was also statistically significant (Mnew_condition = 3.52; Mcontrol = 3.21; Student’s t-test: t(611) = 2.4, p = 0.02; d = 0.19). Having said this, the degree of perceived historical significance was still statistically significantly lower than the perceived historical significance in each of the (other) experimental conditions.
In summary, our results suggest that our fifth condition provides a more effective manipulation of the level of uniqueness of the artwork (in terms of its effect on uniqueness ratings) compared to the original control. However, the historically significant conditions were both still rated as more unique than the fifth condition. This means that the study design has been unable to eliminate differences in perceived uniqueness between the control and experimental conditions. Since more than one variable is varying across the conditions in the study, it is difficult to draw definitive conclusions from this study. It would be premature to say that uniqueness is not contributing to the differences in perceived sacredness observed between conditions.
So, once again, like the original authors, we did not have a condition in the experiment that completely isolated the effects of collective meaning, because our control condition did not serve its purpose: it was meant to have the same level of uniqueness as the experimental conditions while lacking historical and spiritual significance, but instead it was perceived as less unique than two of the other conditions and as more historically significant than the original control.
If future studies sought to isolate the effects of collective meaning as distinct from uniqueness, teams might instead consider reducing the uniqueness of an already spiritually meaningful or historically significant artwork, for example by describing the artwork in some conditions as one of many copies. Comparisons could then be made across conditions that have different levels of uniqueness but identical levels of historical and spiritual meaningfulness. This might be preferable to trying to create a scenario featuring a unique artwork that lacks historical significance (potentially as a direct consequence of its uniqueness).
The table below provides t-test results pertaining to our replication dataset, comparing the control condition with the alternative control condition that we developed.
Replication Analysis Extension

| Variable | Control (n = 294), Mean (SD) | Alternative control (n = 319), Mean (SD) | t value | p | Cohen’s d |
|---|---|---|---|---|---|
| Historical significance | 3.21 (1.64) | 3.52 (1.55) | 2.4 | 0.02 | 0.19 |
| Collective spirituality | 3.97 (1.55) | 4.16 (1.48) | 1.56 | 0.12 | 0.13 |
| Uniqueness | 4.87 (1.57) | 5.26 (1.38) | 3.24 | 1.26e-3 | 0.26 |
| Sacredness | 3.08 (1.16) | 3.36 (1.30) | 2.79 | 5.49e-3 | 0.23 |
| Personal meaning | 2.96 (1.58) | 3.03 (1.51) | 0.55 | 0.58 | 0.04 |
| Collective meaning | 4.19 (1.45) | 4.50 (1.37) | 2.75 | 6.20e-3 | 0.22 |
| Usefulness | 3.90 (1.61) | 3.50 (1.61) | -3.09 | 2.08e-3 | -0.25 |
| Positive emotions | 5.13 (1.31) | 5.09 (1.18) | -0.41 | 0.68 | -0.03 |
| Collective positive emotions | 5.00 (1.24) | 4.99 (1.07) | 0.1 | 0.92 | 7.91e-3 |
| Negative emotions | 1.95 (1.18) | 1.92 (1.14) | 0.26 | 0.8 | 0.02 |
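For illustration, the sketch below shows how one row of the table above (for example, the uniqueness comparison) could be computed in R using simulated ratings; it uses the pooled-SD convention for Cohen’s d and is not the actual replication analysis code.

```r
# Minimal sketch of a Student's t-test plus Cohen's d for one row of the table.
# The data are simulated with the means and SDs shown in the table; they are
# not the replication dataset.
ttest_row <- function(x_control, x_alt) {
  tt <- t.test(x_alt, x_control, var.equal = TRUE)   # Student's (equal-variance) t-test
  pooled_sd <- sqrt(((length(x_control) - 1) * var(x_control) +
                     (length(x_alt) - 1) * var(x_alt)) /
                    (length(x_control) + length(x_alt) - 2))
  d <- (mean(x_alt) - mean(x_control)) / pooled_sd   # Cohen's d (pooled-SD convention)
  c(t = unname(tt$statistic), p = tt$p.value, d = d)
}

# Example with simulated uniqueness ratings:
set.seed(1)
control <- rnorm(294, mean = 4.87, sd = 1.57)
alt     <- rnorm(319, mean = 5.26, sd = 1.38)
ttest_row(control, alt)
```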
In addition to investigating an alternative control condition, we included one additional potential mediator: collective positive emotions. The reasoning for this was explained above. Our results suggest that perceived collective positive emotions could also mediate the relationship between experimental conditions and the perceived sacredness of artwork. It may be difficult to disentangle the effects of collective meaning and collective positive emotions, since both of these varied significantly across experimental conditions and since there was a moderate positive correlation between them (Pearson’s r = 0.59).
The additional variable that we collected, perceived collective positive emotions, was a statistically significant mediator of the relationship between all of the experimental conditions and perceived sacredness.
Alternative Mediator: Collective Positive Emotions (extension to H10)

| Comparison | Results (95% CI) |
|---|---|
| Combined vs. Control | [0.2172, 0.4374] |
| History vs. Control | [0.1402, 0.3476] |
| Spirituality vs. Control | [0.1765, 0.3795] |
| All Conditions | [0.2068, 0.3897] |
Glossary of terms
Please skip this section if you are already familiar with the terms. If this is the first time you are reading about any of these concepts, please note that the definitions given are (sometimes over-)simplifications.
Independent variable (a.k.a. predictor variable): a variable in an experiment or study that is altered or measured, and which affects other (dependent) variables. [In many studies, including this one, we don’t know whether an independent variable is actually influencing the dependent variables, so calling it a “predictor” variable may not be warranted, but many models implicitly assume that this is the case. The term “predictor” variable is used here because it may be more familiar to readers.]
Dependent variable (a.k.a. outcome variable): a variable that is influenced by an independent variable. [In many studies, including this one, we don’t know whether a dependent variable is actually being causally influenced by the independent variables, but many models implicitly assume that this is the case.]
Null Hypothesis: in studies investigating the possibility of a relationship between given pairs/sets of variables, the Null Hypothesis assumes that there is no relationship between those variables.
P-values: the p-value of a result quantifies the probability that a result at least as extreme as that result would have been observed if the Null Hypothesis were true. All p-values fall in the range (0, 1].
Statistical significance: by convention, a result is deemed to be statistically significant if the p-value is below 0.05, meaning that there is less than a 5% chance that a result at least as extreme would have occurred if the Null Hypothesis were true.
The more statistical tests conducted in a particular study, the more likely it is that some results will be statistically significant due to chance. So, when multiple statistical tests are performed in the same study, many argue that one should correct for multiple comparisons.
Statistical significance also does not necessarily translate into real-world/clinical/practical significance – to evaluate that, you need to know about the effect size as well.
Linear regression: this is a process for predicting levels of a dependent/outcome variable (often called a y variable) based on different levels of an independent/predictor variable (often called an x variable), using an equation of the form y = mx + c (where m is the rate at which the dependent/outcome variable changes as a function of changes in the independent/predictor variable, and c describes the level of the dependent variable that would be expected if the independent/predictor variable, x, was set to a level of 0).
Mediator variable: a variable which (at least partly) explains the relationship between a predictor variable and an outcome variable. [Definitions of mediation vary, but Andrew Hayes defines it as occurring any time an indirect effect (i.e., the effect of a predictor variable on the outcome variable via the mediator variable) is statistically significantly different from zero.]
Moderator variable: a variable which changes the strength or direction of a relationship between a predictor variable and an outcome variable.
Categorical variables: these are variables described in terms of categories (as opposed to being described in terms of a continuous scale).
References
Chen, S., Ruttan, R. L., & Feinberg, M. (2022). Collective transcendence beliefs shape the sacredness of objects: The case of art. Journal of Personality and Social Psychology. 124(3), 521–543. https://doi.org/10.1037/pspa0000319
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.
Hayes, A. F. (2022). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach (Third edition). The Guilford Press.
JASP Team (2020). JASP (Version 0.14.1) [Computer software].
JASP Team (2023). JASP (Version 0.17.3) [Computer software].
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
We ran a replication of study 1 from this paper, which found that the variety in a person’s social interactions predicts greater well-being, even when controlling for their amount of in-person social interaction. This finding was replicated in our study.
In this study participants were asked about their well-being over the last 24 hours, and then asked about their activities the previous day, including how many in-person interactions they had, and the kinds of relationships they have with the people in those interactions (e.g. spouse or partner, other family, friends, coworkers, etc.). The variety of interactions, called “social portfolio diversity,” had a positive association with well-being, above and beyond the positive effects due to the amount of social interaction.
Although this finding replicated, this paper has serious weaknesses in transparency and clarity. The pre-registered hypothesis differed from the hypothesis tested, and the authors do not acknowledge this in the paper. The main independent variable, social portfolio diversity, is described as being calculated in three conflicting ways in different parts of the paper and in the pre-registration. The findings reported in the paper are based on what we believe to be incorrect calculations of their primary independent and control variables (i.e., calculations that contradict the variable definitions given in the paper), and their paper misreports the sample size for their main analysis because the calculation error in the primary independent variable removed 76 cases from their analysis. Unfortunately, the authors did not respond to emails seeking clarification about their analysis.
Despite the flaws in the analysis, when these flaws were corrected, we found that we were indeed able to replicate the original claims of the paper – so the main hypothesis itself held up to scrutiny, despite inconsistencies and seeming mistakes with how calculations were performed.
(Update: On 8/9/2023 the authors wrote to us that they will be requesting that the journal update their article with a clarification.)
The supporting materials for the original paper can be found on OSF.
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
This study provided data and experimental materials through OSF, which were strong points in its transparency. Analysis transparency was a weakness, as no analysis code was provided, and the authors did not respond to inquiries about a key analysis question that was left unanswered in the paper and supplemental materials. The study was pre-registered; however, the authors inaccurately claimed that their main hypothesis was pre-registered, when the pre-registered hypothesis did not include their control variable.
Replicability: to what extent were we able to replicate the findings of the original study?
The main finding of this study replicated when the control variable was calculated the way the authors described calculating it, but not when the control variable was calculated the way the authors actually did calculate it in the original paper. Despite this issue, we award the study 5 stars for replication because their key finding met the criteria for replication that we outlined in our pre-registration.
Clarity: how unlikely is it that the study will be misinterpreted?
Although the analysis used in this study is simple, and reasonable, there are several areas where the clarity in this study could be improved. The study does not report an R2 value for its regression analyses, which obscures the small amount of the variance in the dependent variable that is explained by their overall model and by their independent variable specifically. Additionally, the computation of the key independent variable is described inconsistently, and is conducted in a way that seems to be incorrect in an important respect. The sample size reported in the paper for the study is incorrect due to excluded cases based on the miscalculation of the key independent variable. The calculation of the control variable is not conducted the way it is described in the paper, and appears to be miscalculated.
(Update: On 8/9/2023 the authors wrote to us that they will be requesting that the journal update their article with a clarification.)
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
A pdf showing the study as participants saw it was available on OSF.
2. Analysis Transparency:
Analysis code was not available and authors did not respond to emails asking questions about the analysis. A significant decision about how a variable is calculated was not clear from the paper, and we did not get an answer when we asked. Descriptions of how variables were calculated in the text of the paper and pre-registration were inconsistent with each other and inconsistent with what was actually done.
3. Data availability:
Data were available on OSF.
4. Pre-registration:
The study was pre-registered; however, the pre-registered hypothesis did not include the control variable that was used in the main analysis reported in the paper. The text of the paper stated that the pre-registered hypothesis included this control variable. The pre-registered uncontrolled analysis was also conducted by the authors, but the result was only reported in the supplementals and not the paper itself, and is not presented as a main result. Additionally, the pre-registration incorrectly describes the calculation method for the key independent variable.
Summary of Study and Results
Both the original study (N = 578) and our replication study (N = 961) examined whether the diversity of relationship types represented in someone’s in-person interactions in a given day predicts greater self-reported well-being the next day, beyond the effect of the total amount of in-person interaction in that day.
In the experiment, participants filled out a diary about their activities on the previous day, reporting 3 to 9 episodes from their day. For each episode they reported, participants were then asked about whether they were interacting with anyone in person, were interacting with anyone virtually, or were alone. For episodes where people reported in-person interactions, they were asked to check all of the checkboxes indicating the relationship types they had with the people in the interaction. The options were: spouse/significant other, adult children, young children or grandchildren, family (other than spouse/child), friends, co-workers, and other people not listed.
For each participant, we calculated their “social portfolio diversity” using the equation on p. 2 of the original study. More information about the computation of this variable is in the detailed results section. There were 971 participants who completed the study. We excluded 6 participants who failed the attention check from the data analysis, and 4 due to data quality issues, leaving N = 961. More details about the data exclusions are available in the appendix.
The dataset was analyzed using linear regression. The main analysis included social portfolio diversity as the main independent variable, the proportion of activities reported in the day that included in-person social interaction as a control variable, and the average of the two well-being questions as the dependent variable. The original study reported a statistically significant positive relationship between the social portfolio diversity variable and well-being in this analysis (β = 0.13, b = 0.54, 95% CI [0.15, 0.92], P = 0.007, n = 576), but please see the detailed results section for clarifications and corrections to these reported results.
In our replication, we found that this result replicated both when the social portfolio diversity variable was calculated as 0 for subjects with no reported in-person interactions (β = 0.095, b = 0.410, 95% CI [0.085, 0.735], P = 0.014, n = 961) and when the 116 subjects with no in-person interactions reported are dropped due to calculating their social portfolio diversity as “NaN” (β = 0.097, b = 0.407, 95% CI [0.089, 0.725], P = 0.012, n = 845). Note that calculating the control variable the way the original authors calculated it in their dataset, rather than the way they described it in the paper, resulted in non-significant results. Based on our pre-registered plan to calculate that variable the way it is described in the paper, we conclude that their main finding replicated. We are nonetheless concerned about the sensitivity of this finding to this small change in calculating the control variable.
Detailed Results
Computing the Social Portfolio Diversity and Amount of Social Interaction variables
(Update: On 8/9/2023 the authors wrote to us that they will be requesting that the journal update their article with a clarification.)
The “social portfolio diversity” equation is how the authors construct their primary independent variable. This equation involves computing, for each of the relationship categories a person reported having interactions with, the proportion of their total interactions that this category represented (which the authors call “pi”). For each category of relationship, this proportion is multiplied by its natural logarithm. Finally, all these products are summed together and multiplied by negative one so the result is a positive number. The original authors chose this formula in order to make the “social portfolio diversity” variable resemble Shannon’s biodiversity index.
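As a rough illustration of that calculation, the sketch below computes the index from a vector of per-category interaction counts for a single participant; it is not the authors’ code, and the input format is an assumption.

```r
# Minimal sketch of the social portfolio diversity calculation described above.
# 'counts' is assumed to be a vector giving, for one participant, the number of
# interactions reported in each relationship category.
social_portfolio_diversity <- function(counts) {
  counts <- counts[counts > 0]          # only categories actually reported
  if (length(counts) == 0) return(0)    # empty sum resolves to 0 (the original
                                        # analysis instead produced NaN here; see below)
  p <- counts / sum(counts)             # p_i: proportion of total interactions
  -sum(p * log(p))                      # sum of p_i * ln(p_i), multiplied by -1
}

# Worked example: 3 spouse interactions and 2 child interactions
social_portfolio_diversity(c(spouse = 3, children = 2))  # p_i values of 3/5 and 2/5
```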
How is pi calculated in the Social Portfolio Diversity variable?
The computation of the “social portfolio diversity” variable is described by the authors in three conflicting ways. From analyzing the data from their original study (as described in the section below on reproducing the original results), we were able to determine how this variable was actually calculated.
In the original paper the authors present an equation for social portfolio diversity (a Shannon-style index: the sum of pi × ln pi across the s reported relationship categories, multiplied by -1) and describe the calculation as follows:
where s represents the total number of relationship categories (e.g., family member, coworker, close friend, stranger) an individual has reported interacting with, and pi represents the proportion of total interactions (or proportion of total amount of time spent interacting) reported by a participant that falls into the ith relationship category (out of s total relationship categories reported). The diversity measure captures the number of relationship categories that an individual has interacted with (richness) as well as the relative abundance of interactions (or amount of time spent interacting) across the different relationship categories that make up an individual’s social portfolio (evenness) over a certain time period (e.g., yesterday). We multiply this value by -1, so higher portfolio diversity values represent a more diverse set of interaction partners across relationship categories (see Fig. 1). [italicized and bolded for emphasis]
This description explains how the authors calculated the pi variable. It’s important to note that here the “proportion of total interactions” is calculated by using the sum of the number of interaction types checked off for each episode, not the total number of episodes of in-person interaction. For example, if a person reported 3 episodes of their day with in-person interactions, and in all 3 they interacted with their spouse, and in 2 of those they also interacted with their kids, the pi for spouse interactions is 3/5, because they had 3 spouse interactions out of 5 total interactions (the spouse interactions plus the child interactions), not 3 spouse interactions out of 3 total episodes with in-person interactions in the day. The description of how this variable is calculated in the “Materials and Methods” section of the paper describes this variable as being constructed using the second of these two methods rather than the first. Here is that text from the paper:
Social portfolio diversity was calculated as follows: 1) dividing the number of episodes yesterday for which an individual reported interacting with someone in a specific social category by the total number of episodes they reported interacting with someone in any of the categories, giving us pi; 2) multiplying this proportion by its natural log (pi × ln pi); 3) repeating this for each of the seven social categories; and 4) summing all of the (pi × ln pi) products and multiplying the total by -1. [italicized and bolded for emphasis]
In the pre-registration the calculation of this variable is described as:
From these questions, we will calculate our primary DV: Convodiversity
To calculate convodiversity, we will:
• Divide the number of times an individual interacted with someone in a certain social category in a day (e.g., spouse, friend, coworker) by the total number of people they interacted with that day, which gives us pi.
• Multiply this proportion by its natural log (pi X ln pi).
• Repeat this for each specific social category assessed, and
• Sum all of the (pi X ln pi) products and multiple the total by -1.
This would be yet a third possible way of calculating pi, which would result in 3 spouse interactions out of 2 people interacted with in the day for the example above.
It seems like the way that pi was actually calculated is more likely to be both correct and consistent with the authors’ intent than the other two possible ways they describe calculating pi. We calculated Social Portfolio Diversity consistently with the way they actually calculated it for our analyses. Note that our experiment was coded to compute the Social Portfolio Diversity variable automatically during data collection, but this code was calculating the variable the way the authors described in the “Materials and Methods” section of their paper, prior to us noticing the inconsistency. We did not use this variable, and instead re-constructed the Social Portfolio Diversity variable consistently with the authors’ actual method.
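To make the difference concrete, the sketch below expresses the worked example above (3 episodes with the spouse, 2 of which also included the kids) under two of the p_i definitions; these are illustrative calculations only, not the authors’ code.

```r
# Minimal sketch contrasting two p_i definitions for the worked example above.
spouse_interactions <- 3
child_interactions  <- 2
inperson_episodes   <- 3

# As actually calculated (denominator = total interaction-type count):
p_spouse_actual <- spouse_interactions / (spouse_interactions + child_interactions)  # 3/5

# As described in the Materials and Methods section (denominator = episodes
# with any in-person interaction):
p_spouse_mm <- spouse_interactions / inperson_episodes                                # 3/3
```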
How are people with no in-person episodes handled by the Social Portfolio Diversity equation?
The other problem we ran into in the calculation of the social portfolio diversity variable is what should be done when participants have 0 in-person social interactions reported for the day. Looking at the description of how the variable is calculated and the summation notation, it seems like in this case s, the total relationship categories reported, would be 0. This causes the equation to contain the empty sum, which resolves to 0, making the entire Social Portfolio Diversity equation for participants with no in-person social interactions resolve to 0.
That is not how the equation was resolved in the data analysis in the original paper. In the dataset released by the authors, the participants with no in-person social interactions reported for the day have a value of “NaN” (meaning Not a Number) for the Social Portfolio Diversity variable, and in the analyses that include that variable, these participants are excluded for having a missing value on this variable.
Because we did not hear back from the authors when we reached out to them about their intentions with this calculation, we decided to run the analysis with this variable computed both ways, and we pre-registered that plan.
How is the control variable calculated?
When we re-analyzed the authors’ original data, we used the perc_time_social variable that they included in their dataset as their control variable representing the total amount of in-person interaction in the day. Using that variable resulted in reproducing the authors’ reported results on their data; however, after doing that re-analysis, it later became clear that the “perc_time_social” variable that the authors computed was not computed the way they described in their paper. We were not aware of this issue at the time of filing our pre-registration, and we pre-registered that we would calculate this variable the way it was described in the paper, as “the proportion of episodes that participants spent socializing yesterday.” We interpreted this to mean the number of episodes involving in-person interactions out of the total number of episodes that the participant reported for their day. For example, imagine that a participant reported 9 total episodes for their day, and 7 of those episodes involved in-person interaction. This would result in a proportion of 7/9 for this variable, regardless of how many types of people were present at each episode involving in-person interaction.
When we examined the authors’ dataset more closely it became clear that their perc_time_social variable was not calculated that way. This variable was actually calculated by using the total number of interaction types for each episode added together, rather than the total episodes with in-person interaction, as the numerator. This is the same number that would be the denominator in the pi calculation for the Social Portfolio Diversity variable. They then constructed the denominator by adding to that numerator 1 for each episode of the day that was reported that didn’t include in-person interactions. If we return to the example above, imagine that in the 7 episodes with in-person interaction, the participant reported interacting with their friends in 3, their spouse in 5, and their coworkers in 2. That would make the numerator of this proportion 10, and then for the denominator we’d add 2 for the two episodes with no in-person interaction, resulting in 10/12 as the proportion for this variable.
It is possible that this is what the authors actually intended for this variable, despite the description of it in the paper, because in the introduction to the paper they also describe this as controlling for the “total number of social interactions,” which could mean that they are thinking of this dyadically, rather than as episodes. This seems unlikely, though, because calculating it this way incorporates aspects of social portfolio diversity into their control variable. It’s also a strange proportion to use, because a single episode of in-person interaction could count for up to 7 in this equation, depending on the number of interaction types in it, while an episode without in-person interaction can only count as 1. The control variable seems intended to account for the amount of the day spent having in-person interactions, regardless of the particular people who were present. This is accomplished more simply and effectively by looking at the proportion of episodes, rather than incorporating the interaction types into this variable.
Despite this issue, these two methods of calculating this variable are highly correlated with each other (Pearson’s r = 0.96, p < .001 in their original data, and Pearson’s r = 0.989, p < .001 in our replication dataset).
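As an illustration, the sketch below computes the control variable both ways for the worked example above (9 episodes, 7 involving in-person interaction, with 3 friend, 5 spouse, and 2 coworker interactions across those episodes); these are illustrative calculations, not the authors’ code.

```r
# Minimal sketch of the two ways of computing the control variable.
n_episodes              <- 9
n_inperson_episodes     <- 7
interaction_type_counts <- c(friends = 3, spouse = 5, coworkers = 2)

# As described in the paper (and as we pre-registered): proportion of episodes
# that involved any in-person interaction.
prop_inperson_episodes <- n_inperson_episodes / n_episodes                   # 7/9

# As actually computed in the released dataset (perc_time_social): total
# interaction-type count over that count plus one per episode without any
# in-person interaction.
n_no_interaction_episodes <- n_episodes - n_inperson_episodes
perc_time_social <- sum(interaction_type_counts) /
  (sum(interaction_type_counts) + n_no_interaction_episodes)                 # 10/12
```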
Reproducing the original results
Due to the original dataset evaluating participants with no reported in-person interactions as “NaN” for the Social Portfolio Diversity variable, it appears that the N the authors report for their Model 1 and Model 3 regressions is incorrect. They report an N of 577 for Model 1 and 576 for Model 3. The actual N for Models 1 and 3, with the cases with an “NaN” for Social Portfolio Diversity excluded, is 500.
In their dataset of N = 578, their variable “ConvoDiv” (their Social Portfolio Diversity variable) is given as “NaN” in 78 cases. The regression results that are most consistent with the results they report are the results from N = 500 participants where “ConvoDiv” is reported as a number. If we modify their dataset and assign a 0 to the “ConvoDiv” variable for the 76 cases where a participant completed the survey but had no in-person social interaction the previous day, we get results that differ somewhat from their reported results. See the table below to see their reported results, and our two attempts to reproduce their results from their data. We attempted to clarify this by reaching out to the authors, but they did not respond to our inquiries.
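For clarity, the recoding behind the “reanalyzed” results below amounts to something like the following sketch; “ConvoDiv” matches the variable name in the released dataset, but the example data frame here is simulated, not the original data.

```r
# Minimal sketch: participants with no in-person interactions get a social
# portfolio diversity of 0 instead of a missing value.
orig <- data.frame(ConvoDiv = c(0.64, NaN, 0.35, NaN))   # simulated example values
orig$ConvoDiv[!is.finite(orig$ConvoDiv)] <- 0
orig
```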
The three columns of the original table are: reported results (from Table S1 in the supplemental materials), reproduced results (Social Portfolio Diversity set to NA for people with no in-person episodes), and reanalyzed results (Social Portfolio Diversity set to 0 for people with no in-person episodes).

Model 1 – Soc. Portfolio Div. only (IV only, no control)
- Reported: N = 577; Soc. Portfolio Div. β = 0.21, b = 0.84, 95% CI [0.50, 1.17], p < .001; R² not reported
- Reproduced (NA): N = 500; Soc. Portfolio Div. β = 0.216, b = 0.835, 95% CI [0.504, 1.167], p < .001; R² = 0.047, Adj. R² = 0.045
- Reanalyzed (0): N = 576; Soc. Portfolio Div. β = 0.241, b = 0.966, 95% CI [0.647, 1.285], p < .001; R² = 0.058, Adj. R² = 0.056

Model 3 – Both Soc. Portfolio Div. (IV) and Prop. Inter. Social (control)
- Reported: N = 576; Soc. Portfolio Div. β = 0.13, b = 0.54, 95% CI [0.15, 0.92], p = .007; Prop. Inter. Social β = 0.17, b = 0.99, 95% CI [0.32, 1.66], p = .004; R² not reported
- Reproduced (NA): N = 500; Soc. Portfolio Div. β = 0.139, b = 0.537, 95% CI [0.150, 0.923], p = .007; Prop. Inter. Social β = 0.148, b = 0.992, 95% CI [0.321, 1.663], p = .004; R² = 0.063, Adj. R² = 0.059
- Reanalyzed (0): N = 576; Soc. Portfolio Div. β = 0.133, b = 0.534, 95% CI [0.140, 0.927], p = .008; Prop. Inter. Social β = 0.180, b = 1.053, 95% CI [0.480, 1.626], p < .001; R² = 0.079, Adj. R² = 0.076

Potentially misreported values from original paper highlighted in light grey.
Fortunately, the differences in the results between the two methods are small, and both methods result in a significant positive effect of Social Portfolio Diversity on well-being. We decided to analyze the data for our replication using both approaches to calculating the Social Portfolio Diversity variable, because we wanted both to replicate exactly what the authors did to achieve the results they reported in their paper, and to resolve the equation in the way we believe the authors intended to evaluate it (given the equation they gave for social portfolio diversity and their reported N = 576).
After determining that their calculation of the perc_time_social variable wasn’t as they described in the paper, and may not have been what they intended, we re-computed that variable as they described it, and re-ran their analyses on their data with that change (column 3 in the table below).
The three columns of the original table are: reported results (from Table S1 in the supplemental materials), reproduced results (using the perc_time_social variable from the original dataset), and reanalyzed results (using the proportion of in-person episodes out of total episodes).

Model 2 – Control only
- Reported: N = 576; Prop. Inter. Social β = 0.26, b = 1.53, 95% CI [1.07, 1.99], p < .001; R² not reported
- Reproduced (perc_time_social): N = 577; perc_time_social β = 0.262, b = 1.528, 95% CI [1.068, 1.989], p < .001; R² = 0.069, Adj. R² = 0.067
- Reanalyzed (prop. in-person episodes): N = 578; Prop. ep. in-person β = 0.241, b = 1.493, 95% CI [1.000, 1.985], p < .001; R² = 0.058, Adj. R² = 0.056

Model 3 – IV & Control (IV: NA for no interaction)
- Reported: N = 576; Soc. Portfolio Div. β = 0.13, b = 0.54, 95% CI [0.15, 0.92], p = .007; Prop. Inter. Social β = 0.17, b = 0.99, 95% CI [0.32, 1.66], p = .004; R² not reported
- Reproduced (perc_time_social): N = 500; Soc. Portfolio Div. β = 0.139, b = 0.537, 95% CI [0.150, 0.923], p = .007; perc_time_social β = 0.148, b = 0.992, 95% CI [0.321, 1.663], p = .004; R² = 0.063, Adj. R² = 0.059
- Reanalyzed (prop. in-person episodes): N = 500; Soc. Portfolio Div. β = 0.157, b = 0.606, 95% CI [0.234, 0.978], p = .001; Prop. ep. in-person β = 0.129, b = 0.89, 95% CI [0.223, 1.558], p = .009; R² = 0.060, Adj. R² = 0.056

Model 3 – IV & Control (IV: 0 for no interaction)
- Reproduced (perc_time_social): N = 576; Social Portfolio Div. β = 0.133, b = 0.534, 95% CI [0.140, 0.927], p = .008; perc_time_social β = 0.180, b = 1.053, 95% CI [0.480, 1.626], p < .001; R² = 0.079, Adj. R² = 0.076
- Reanalyzed (prop. in-person episodes): N = 578; Social Portfolio Div. β = 0.152, b = 0.610, 95% CI [0.229, 0.990], p = .002; Prop. ep. in-person β = 0.157, b = 0.972, 95% CI [0.384, 1.559], p = .001; R² = 0.074, Adj. R² = 0.071

Potentially misreported values from original paper highlighted in light grey.
We found that the coefficients for Social Portfolio Diversity are slightly stronger with the control variable calculated as the proportion of episodes reported that involve in-person interaction. In Model 2, using only the control variable, we found that, when calculated as the proportion of episodes reported that involve in-person interaction, the control variable explains slightly less of the variance than when it is calculated the way the authors calculated it. The R2 for that model with the re-calculated control variable is 0.058. It was 0.069 using the perc_time_social variable as calculated by the authors.
We included the analysis files and data for these reanalyses in our Research Box for this report. The codebook for the data files marks variables that we constructed as “Added.” The other columns are from the dataset made available by the authors on OSF.
Our Replication Results
We analyzed the replication data using both methods for calculating Social Portfolio Diversity, as discussed in our pre-registration. We also analyzed the data using both the way the control variable was described as being calculated (the way we said we would calculate it in our pre-registration), and the way the authors actually calculated it. We did this both ways because we wanted to conduct the study as we said we would in our pre-registration, which was consistent with how we believed the authors conducted it from the paper, and we also wanted to be able to compare their reported results to comparable results using the same variable calculations as the ones they actually used.
As with the original results, the two methods of calculating social portfolio diversity (dropping those people with no in-person social interactions or recording those participants as having a social portfolio diversity of zero) did not make a substantive difference in our results.
Unlike the original results, we found that there was a substantive difference in the results depending on how the control variable was calculated. When the control variable is calculated the way the authors calculated it in their original analyses, we find that the results do not replicate. When the control variable is calculated as the authors described in the paper (and how we pre-registered), we find that their results replicate. This difference held for both methods of calculating the social portfolio diversity variable.
This was surprising given that the two versions of the control variable were correlated with each other at r = 0.989 in our data.
Model 3 results using proportion of episodes as control variable
Model 3 – IV & Control (IV: NA for no interaction; Control: prop. episodes in-person)
- Reanalyzed original results: N = 500; Social Portfolio Div. β = 0.157, b = 0.606, 95% CI [0.234, 0.978], p = .001; Prop. ep. in-person β = 0.129, b = 0.89, 95% CI [0.223, 1.558], p = .009; R² = 0.060, Adj. R² = 0.056
- Replication results: N = 845; Social Portfolio Div. β = 0.097, b = 0.407, 95% CI [0.089, 0.725], p = .012; Prop. ep. in-person β = 0.228, b = 1.556, 95% CI [1.042, 2.070], p < .001; R² = 0.084, Adj. R² = 0.082
- Replicated? ✅

Model 3 – IV & Control (IV: 0 for no interaction; Control: prop. episodes in-person)
- Reanalyzed original results: N = 578; Social Portfolio Div. β = 0.152, b = 0.610, 95% CI [0.229, 0.990], p = .002; Prop. ep. in-person β = 0.157, b = 0.972, 95% CI [0.384, 1.559], p = .001; R² = 0.074, Adj. R² = 0.071
- Replication results: N = 961; Social Portfolio Div. β = 0.095, b = 0.410, 95% CI [0.085, 0.725], p = .014; Prop. ep. in-person β = 0.263, b = 1.617, 95% CI [1.154, 2.080], p < .001; R² = 0.108, Adj. R² = 0.106
- Replicated? ✅

Main finding used to determine replication presented in bold
Model 3 results using proportion of interactions as in original analysis as control variable
Model 3 – IV & Control (IV: NA for no interaction; Control: perc_time_social as in original paper)
- Reanalyzed original results: N = 500; Social Portfolio Div. β = 0.139, b = 0.537, 95% CI [0.150, 0.923], p = .007; perc_time_social β = 0.148, b = 0.992, 95% CI [0.321, 1.663], p = .004; R² = 0.063, Adj. R² = 0.059
- Replication results: N = 845; Social Portfolio Div. β = 0.057, b = 0.242, 95% CI [-0.102, 0.586], p = .168; propSocialAsInOrigPaper β = 0.256, b = 1.691, 95% CI [1.151, 2.231], p < .001; R² = 0.087, Adj. R² = 0.085
- Replicated? ❌

Model 3 – IV & Control (IV: 0 for no interaction; Control: perc_time_social as in original paper)
- Reanalyzed original results: N = 576; Social Portfolio Div. β = 0.133, b = 0.534, 95% CI [0.140, 0.927], p = .008; perc_time_social β = 0.180, b = 1.053, 95% CI [0.480, 1.626], p < .001; R² = 0.079, Adj. R² = 0.076
- Replication results: N = 961; Social Portfolio Div. β = 0.055, b = 0.240, 95% CI [-0.112, 0.592], p = .182; propSocialAsInOrigPaper β = 0.292, b = 1.724, 95% CI [1.244, 2.205], p < .001; R² = 0.111, Adj. R² = 0.109
- Replicated? ❌

Non-significant p-values on main IV in replication highlighted in light grey.
Because we pre-registered calculating the control variable as “the number of episodes that involved in-person interaction over the total number of episodes the participant reported on,” because we believe this is a sounder method for calculating the variable, and because it is consistent with how the authors described the variable in the text of their paper, we consider the main finding of this paper to have replicated, even though this is not the case if the control variable is calculated the way the authors actually calculated it in their reported results. Results for the Model 1 and Model 2 regressions are available in the appendix, as they were not the main findings on which replication of this study was evaluated.
Interpreting the Results
Despite the fact that these results replicated, we would urge caution in the interpretation of the results of this study. It is concerning that a small change in the calculation of the control variable to the method actually used by the authors in their original data analysis is enough to make the main finding no longer replicate. Additionally, the change in model R2 accounted for by the addition of the social portfolio diversity variable to a model containing the control variable is very small (in our replication data the change in R2 is 0.006 or 0.007 depending on how the social portfolio diversity variable is calculated). As mentioned earlier, the authors did not report the model R2 anywhere in their paper or supplementary materials.
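For readers who want to see how that incremental R² is obtained, the sketch below compares nested regression models in R on simulated data; the column names (wellbeing, prop_inperson, spd) are placeholders, not the replication dataset’s variable names.

```r
# Minimal sketch: change in R^2 when social portfolio diversity is added to a
# model that already contains the control variable. Simulated data only.
delta_r2 <- function(data) {
  m_control <- lm(wellbeing ~ prop_inperson, data = data)
  m_full    <- lm(wellbeing ~ prop_inperson + spd, data = data)
  summary(m_full)$r.squared - summary(m_control)$r.squared
}

# Example usage with simulated data:
set.seed(2)
dat <- data.frame(prop_inperson = runif(961), spd = runif(961, 0, 1.5))
dat$wellbeing <- 1.6 * dat$prop_inperson + 0.4 * dat$spd + rnorm(961)
delta_r2(dat)
```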
Conclusion
The errors and inconsistencies in the computation and reporting of the results were a major concern for us in evaluating this study, and resulted in a low clarity rating despite the simplicity and appropriateness of the analysis described in the paper. The claim in the paper that the main hypothesis was pre-registered, when the pre-registered hypothesis differed from what was reported in the paper, along with the lack of response from the authors to emails requesting clarification about their social portfolio diversity variable, reduced the transparency rating we were able to give this study, despite the publicly accessible experimental materials and data. Despite these issues, we did find that the main finding replicated.
(Update: On 8/9/2023 the authors wrote to us that they will be requesting that the journal update their article with a clarification.)
Acknowledgements
We are grateful to the authors for making their study materials and data available so that this replication could be conducted.
We provided a draft copy of this report to the authors for review on June 22, 2023.
Thank you to Clare Harris at Transparent Replications who provided valuable feedback on this replication and report throughout the process. We appreciate the people who made predictions about the results of this study on Manifold Markets and on Metaculus. Thank you to the Ethics Evaluator for their review, and to the participants for their time and attention.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional Information about the Methods
Exclusion Criteria
We collected 971 complete responses to this study, and analyzed data from 961 subjects. The following table explains our data inclusion and exclusion choices.
- Attention check (6 subjects, excluded): Responded incorrectly to the attention check question by checking boxes other than “None of the above.”
- Attention check (7 subjects, included): Did not check any boxes in response to the attention check question. One subject reported in feedback on the study that they were not sure if they were supposed to select the option labeled “None of the above” for the attention check, or not select any of the checkboxes. Reanalyzing the data with these 7 subjects excluded does not change the results in any substantive way. These subjects are marked with a 1 in the column labeled AttnCheckLeftBlank.
- Data quality (2 subjects, excluded): A visual inspection of the diary entries revealed two subjects who entered random numbers for their episode descriptions and spent less than 2 minutes completing the study. All other subjects provided episode descriptions in words that were prima facie plausible. These two subjects were excluded due to a high likelihood that their responses were low quality, despite them passing the attention check question.
- Data quality (2 subjects, excluded): Due to inconsistencies created when subjects edited diary entries, 2 subjects reported more than 9 episodes for the day. Reducing those episodes to the maximum of 9 would have required making decisions about whether to eliminate episodes involving in-person interaction or episodes not involving interaction, which would have impacted the results, so these two subjects’ responses were excluded.
- Data quality (10 subjects, included): Due to inconsistencies created when subjects entered or edited their diary entries, 10 subjects’ numbers reported for total episodes or for in-person episodes were incorrect. These subjects’ data could be corrected using the saved diary information, without the need to make judgment calls that would impact the results, so these subjects’ data were included in the analysis. Reanalyzing the data with these 10 subjects excluded does not change the results in any substantive way. These subjects are marked with a 1 in the column labeled Corrected.
Additional information about the results
Model 1 results comparing original data and replication data
Model 1 – IV only (IV: NA for no interaction)
- Reanalyzed original results: N = 500; Social Portfolio Div. β = 0.216, b = 0.835, 95% CI [0.504, 1.167], p < .001; R² = 0.047, Adj. R² = 0.045
- Replication results: N = 845; Social Portfolio Div. β = 0.214, b = 0.901, 95% CI [0.623, 1.179], p < .001; R² = 0.046, Adj. R² = 0.045
- Replicated? ✅

Model 1 – IV only (IV: 0 for no interaction)
- Reanalyzed original results: N = 576; Social Portfolio Div. β = 0.241, b = 0.966, 95% CI [0.647, 1.285], p < .001; R² = 0.058, Adj. R² = 0.056
- Replication results: N = 962; Social Portfolio Div. β = 0.254, b = 1.098, 95% CI [0.833, 1.363], p < .001; R² = 0.064, Adj. R² = 0.063
- Replicated? ✅
Model 2 results comparing original data and replication data
Model 2 – Control only (Control: perc_time_social as in original paper)
- Reanalyzed original results: N = 577; perc_time_social β = 0.262, b = 1.528, 95% CI [1.068, 1.989], p < .001; R² = 0.069, Adj. R² = 0.067
- Replication results: N = 961; propSocialAsInOrigPaper β = 0.330, b = 1.946, 95% CI [1.593, 2.299], p < .001; R² = 0.109, Adj. R² = 0.108
- Replicated? ✅

Model 2 – Control only (Control: prop. episodes in-person)
- Reanalyzed original results: N = 578; Prop. ep. in-person β = 0.241, b = 1.493, 95% CI [1.000, 1.985], p < .001; R² = 0.058, Adj. R² = 0.056
- Replication results: N = 961; propInPersonEpisodes β = 0.320, b = 1.970, 95% CI [1.601, 2.339], p < .001; R² = 0.103, Adj. R² = 0.102
- Replicated? ✅
References
Collins, H. K., Hagerty, S. F., Quoidbach, J., Norton, M. I., & Brooks, A. W. (2022). Relational diversity in social portfolios predicts well-being. Proceedings of the National Academy of Sciences, 119(43), e2120668119. https://doi.org/10.1073/pnas.2120668119
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160.
We ran replications of studies three (3) and four (4) from this paper. These studies found that:
People have less support for behavioral nudges (such as sending reminders about appointment times) to prevent failures to appear in court than to address other kinds of missed appointments
People view missing court as more likely to be intentional, and less likely to be due to forgetting, compared to other kinds of missed appointments
The belief that skipping court is intentional drives people to support behavioral nudges less than if they believed it was unintentional
We successfully replicated the results of studies 3 and 4. Transparency was strong due to study materials and data being publicly available, but neither study being pre-registered was a weakness. Overall the studies were clear in their analysis choices and explanations, but clarity could have benefited from more discussion of alternative explanations and the potential for results to change over time.
Go to the Research Box for this report to view our pre-registrations, experimental materials, de-identified data, and analysis files.
See the predictions made about this study:
See the Manifold Markets prediction markets for this study:
For study 3 – 7.8% probability given to both claims replicating (corrected to exclude market subsidy percentage)
For study 4 – 21.4% probability given to all 3 claims replicating (corrected to exclude market subsidy percentage)
See the Metaculus prediction pages for this study
For study 3 – Community prediction was 49% “yes” for both claims replicating (note: some forecasters selected “yes” for more than one possible outcome)
For study 4 – Community prediction was 35% “yes” for all three claims replicating (note: some forecasters selected “yes” for more than one possible outcome)
Overall Ratings
Our Replicability and Clarity Ratings are single-criterion ratings. Our Transparency Ratings are derived from averaging four sub-criteria (and you can see the breakdown of these in the second table).
Transparency: how transparent was the original study?
Replicability: to what extent were we able to replicate the findings of the original study?
All five findings from the original studies replicated (100%).
Clarity: how unlikely is it that the study will be misinterpreted?
Results were communicated clearly. Some alternative interpretations of the results could have been provided and the potential for the results to change over time could have been addressed.
Detailed Transparency Ratings
For an explanation of how the ratings work, see here.
Overall Transparency Rating:
1. Methods Transparency:
The materials were publicly available and almost complete, and remaining materials were provided on request.
2. Analysis Transparency:
The analyses for both studies 3 and 4 were commonly-completed analyses that were described in enough detail for us to be able to reproduce the same results on the original dataset.
3. Data availability:
The data were publicly available and complete.
4. Pre-registration:
Neither study 3 nor study 4 was pre-registered.
Note: the Overall Transparency rating is rounded up from 3.875 stars to 4 stars
Summary of Study and Results
We replicated the results from laboratory experiments 3 and 4 from the original paper. The original paper conducted 5 laboratory experiments in addition to a field study, but we chose to focus on studies 3 and 4.
We judged study 3 to be directly relevant to the main conclusions of the paper and chose to focus on that first. Following communication with the original authorship team, who were concerned that study 3’s results may be affected by potential changes in public attitudes to the judicial system over time, we decided to expand our focus to include another study whose findings would be less impacted by changes in public attitudes (if those had occurred). We selected study 4 as it included an experimental manipulation. When considering one of the significant differences observed in study 4 between two of its experimental conditions (between the “mistake” and “control” conditions, explained below), we thought that this difference would only cease to be significant if the hypothetical changes in public opinion (since the time of the original study) had been very dramatic.
Our replication study of study 3 (N = 394) and study 4 (N=657) examined:
Whether participants have lower support for using behavioral nudges to reduce failures to appear in court than for using behavioral nudges to reduce failures to complete other actions (study 3),
Whether participants rate failures to attend court as being less likely to be due to forgetting and more likely to be intentional, compared to failures to attend non-court appointments (study 3), and
The proportions of participants selecting behavioral nudges across three different experimental conditions (study 4).
The next sections provide a more detailed summary, methods, results, and interpretation for each study separately.
Study 3 summary
In the original study, participants were less likely to support using behavioral nudges (as opposed to harsher penalties) to reduce failures to appear in court than to reduce failures to attend other kinds of appointments. They also rated failures to attend court as being less likely to be due to forgetting and more likely to be intentional, compared to failures to attend non-court appointments.
Study 3 methods
In the original study and in our replication, participants read five scenarios (presented in a randomized order) about people failing to take a required action: failing to appear for court, failing to pay an overdue bill, failing to show up for a doctor’s appointment, failing to turn in paperwork for an educational program, and failing to complete a vehicle emissions test.
For each scenario, participants rated the likelihood that the person missed their appointment because they did not pay enough attention to the scheduled date or because they simply forgot. Participants also rated how likely it was that the person deliberately and intentionally decided to skip their appointment.
Finally, participants were asked what they thought should be done to ensure that other people attend their appointments. They had to choose one of three options (shown in the following order in the original study, but shown in randomized order in our study):
(1) Increase the penalty for failing to show up
(2) Send reminders to people about their appointments, or
(3) Make sure that appointment dates are easy to notice on any paperwork
The original study included 301 participants recruited from MTurk. Our replication included 394 participants (which meant we had 90% power to detect an effect size as low as 75% of the original effect size) recruited from MTurk via Positly.
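For readers who want to see how a target sample size like this can be derived, here is a rough sketch using statsmodels. It assumes the calculation is driven by the smallest original Study 3 effect, a two-sided alpha of 0.05, and a paired-samples (one-sample-on-differences) t-test; these are our assumptions, not a description of the replication team’s actual power analysis.

```python
# Rough power-analysis sketch (not the replication team's actual calculation).
# Assumption: base the sample size on the smallest original Study 3 effect,
# treat the true effect as 75% of it, and require 90% power at alpha = 0.05.
import math
from statsmodels.stats.power import TTestPower

# Smallest original effect: paired t(300) = 3.79  ->  Cohen's d = t / sqrt(N)
d_original = 3.79 / math.sqrt(301)          # ~0.218
d_target = 0.75 * d_original                # ~0.164

# A paired-samples t-test is a one-sample t-test on the difference scores
n_required = TTestPower().solve_power(
    effect_size=d_target, power=0.90, alpha=0.05, alternative="two-sided"
)
print(math.ceil(n_required))                # ~392 participants, close to the 394 recruited
```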
Study 3 results
Hypothesis 1: Participants have lower support for using behavioral nudges to reduce failures to appear for court (described in a hypothetical scenario) than for using behavioral nudges to reduce failures to attend other kinds of appointments (described in four other scenarios). Original result: + | Our result: + | Replicated? ✅
Hypothesis 2: Participants rate failures to attend court as being: (1) less likely to be due to forgetting and (2) more likely to be intentional, compared to failures to attend non-court appointments (captured in four different scenarios). Original result: + | Our result: + | Replicated? ✅
+ indicates that the hypothesis was supported.
In the original study, participants were less likely to support behavioral nudges to reduce failures to appear in court compared to failures to attend other appointments (depicted in four different scenarios) (Mcourt = 43%, SD = 50; Mother actions = 65%, SD = 34; paired t test, t(300) = 8.13, p < 0.001). Compared to failures to attend other kinds of appointments, participants rated failures to attend court as being less likely to be due to forgetting (Mcourt = 3.86, SD = 2.06; Mother actions = 4.24, SD = 1.45; paired t test, t(300) = 3.79, p < 0.001) and more likely to be intentional (Mcourt = 5.17, SD = 1.75; Mother actions = 4.82, SD = 1.29; paired t test, t(300) = 3.92, p < 0.001).
We found that these results replicated. There was a significantly lower level of support for behavioral nudges to reduce failures to appear for court (Mcourt = 42%, SD = 50) compared to using behavioral nudges to reduce failures to complete other actions (Mother actions = 72%, SD = 32; paired t test, t(393) = 12.776, p = 1.669E-31).
Participants rated failures to attend court as being less likely to be due to forgetting (Mcourt = 3.234, SD = 1.848) compared to failures to attend non-court appointments (Mother actions = 3.743, SD = 1.433); and this difference was statistically significant: t(393) = 7.057, p = 7.63E-12.
Consistent with this, participants also rated failures to attend court as being more likely to be intentional (Mcourt = 4.972, SD = 1.804) compared to failures to attend non-court appointments (Mother actions = 4.492, SD = 1.408); this difference was statistically significant: t(393) = 6.295, p = 8.246E-10.
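To make the analysis concrete, here is an illustrative sketch of how these paired comparisons can be run with scipy. The file and column names are hypothetical placeholders, not the names used in our materials.

```python
# Illustrative paired t-tests for the Study 3 comparisons. File and column names
# are hypothetical; each row of the dataframe is one participant.
import pandas as pd
from scipy import stats

df = pd.read_csv("study3_replication.csv")  # hypothetical file name

# Each outcome: court scenario vs. the mean of the four non-court scenarios
for outcome in ["nudge_support", "forgetting_rating", "intentionality_rating"]:
    t, p = stats.ttest_rel(df[f"{outcome}_court"], df[f"{outcome}_other_mean"])
    print(f"{outcome}: paired t = {t:.3f}, p = {p:.3g}")
```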
The authors make the case that people generally ascribe “greater intentionality to failures to appear.” They also argue that it is these beliefs that contribute to the stance that harsher penalties are more effective than behavioral nudges for reducing failures to appear.
We are generally inclined to believe that the authors’ interpretation is a fair representation of the results. However, there was still room for the original authors to be clearer about how they interpret their results, particularly with respect to the degree to which they expected the results to change over time.
When our team first reached out to the original authors about replicating study 3 from their paper, they were concerned that the results might have changed over time due to documented shifts in the public’s attitudes toward the judicial system since the studies were completed. On reflection, we agreed that it was an open question whether the results of study 3 would change over time in response to major events like the murder of George Floyd. The original paper, however, did not mention the authors’ view that the results were sensitive to (potentially changing) public opinion rather than reflecting more stable patterns of belief.
Study 4 summary
In the original study, participants read a scenario about failures to appear in court and were then randomized into one of three groups: an “intentional” condition, in which participants were asked to write one reason why someone would intentionally skip court; a “mistake” condition, in which they were asked to write one reason why someone would miss court unintentionally; and a “control” condition, which asked neither question. All participants were then asked what should be done to reduce failures to appear. Participants in the “intentional” and “control” conditions chose to increase penalties at similar rates, while participants in the “mistake” condition were significantly more likely than those in either of the other conditions to support behavioral nudges (as opposed to imposing harsher penalties for failing to appear).
Study 4 methods
In the original study and in our replication, all participants read background information on summonses and failure-to-appear rates in New York City. This was followed by a short experiment (described below), and at the end, all participants selected which of the following they thought should be done to reduce the failure-to-appear rate: (1) increase the penalty for failing to show up, (2) send reminders to people about their court dates, or (3) make sure that court dates are easy to notice on the summonses. (These options were shown in the order listed in the original study, but in randomized order in our replication.)
Prior to being asked the main question of the study (the “policy choice” question), participants were randomly assigned to one of three conditions.
In the control condition, participants made their policy choice immediately after reading the background information.
In the intentional condition, after reading the background information, participants were asked to type out one reason why someone might purposely skip their court appearance, and then they made their policy choice.
In the mistake condition, participants were asked to type out one reason why someone might accidentally miss their court appearance, and then they made their policy choice.
The original study included 304 participants recruited from MTurk. Our replication included 657 participants (which meant we had 90% power to detect an effect size as low as 75% of the original effect size) recruited from MTurk via Positly.
Study 4 results
Hypothesis 1: Participants are no less likely to support behavioral nudges in the “control” condition compared to in the “intentional” condition. Original result: 0 | Our result: 0 | Replicated? ✅
Hypothesis 2: Participants are more likely to support behavioral nudges in the “mistake” condition than they are in the “control” condition. Original result: + | Our result: + | Replicated? ✅
Hypothesis 3: Participants are more likely to support behavioral nudges in the “mistake” condition than they are in the “intentional” condition. Original result: + | Our result: + | Replicated? ✅
0 indicates no difference between the conditions, + indicates a positive result.
In the original study, there was no statistically significant difference between the proportion of participants selecting behavioral nudges in the control versus the intentional condition (control: 63% supported nudges; intentional: 61% supported nudges; χ2(1, N = 304) = 0.09; p = 0.76).
On the other hand, 82% of participants selected behavioral nudges in the mistake condition (which was a significantly larger proportion than both the control condition [χ2(1, N = 304) = 9.08; p = 0.003] and the intentional condition [χ2(1, N = 304) = 10.53; p = 0.001]).
In our replication, we assessed whether, similar to the original study, (1) participants in the “control” condition and the “intentional” condition do not significantly differ in their support for behavioral nudges; (2) Participants are more likely to support behavioral nudges in the “mistake” condition than they are in the “control” condition; and (3) Participants are more likely to support behavioral nudges in the “mistake” condition than they are in the “intentional” condition. We found that these results replicated:
(1) χ2 (1, N = 440) = 1.090, p = 0.296. Participants’ support for behavioral nudges in the control condition (where 64.3% selected behavioral nudges) was not statistically significantly different from their support for behavioral nudges in the intentional condition (where 69.0% selected behavioral nudges).
(2) χ2 (1, N = 441) = 34.001, p = 5.509E-9. Participants were more likely to support behavioral nudges in the mistake condition (where 88.0% selected behavioral nudges) than in the control condition (where 64.3% selected behavioral nudges).
(3) χ2 (1, N = 433) = 23.261, p = 1.414E-6. Participants were more likely to support behavioral nudges in the mistake condition (where 88.0% selected behavioral nudges) than in the intentional condition (where 69.0% selected behavioral nudges).
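As an illustration of what these contingency tests involve, here is a sketch of the mistake-versus-control comparison using scipy. The cell counts are approximate reconstructions from the percentages reported above (assuming roughly equal numbers of participants per condition), so this is an example rather than the exact replication analysis.

```python
# Illustrative 2x2 chi-square test (mistake vs. control, nudge vs. penalty).
# Cell counts are approximate reconstructions from the reported percentages,
# assuming ~220 participants per condition out of the N = 441 in this comparison.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([
    [194, 26],   # mistake condition: ~88% chose a behavioral nudge
    [142, 79],   # control condition: ~64% chose a behavioral nudge
])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2({dof}, N = {table.sum()}) = {chi2:.2f}, p = {p:.2g}")  # ~35, close to the reported 34.0
```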
The original authors make the case that participants are more supportive of behavioral nudges (instead of stiffer punishments) when they think that people missed their appointments by mistake. The original authors noted that support for nudges to reduce failures to appear for court was higher in Study 4 (63% in the control arm) than in Study 3 (43%). In the control arm of our replication of Study 4, we also found higher support for nudges to reduce failures to appear (64%) than in our replication of Study 3 (42%). The original authors attribute the difference to the background information (e.g., the baseline failure-to-appear rate) that was provided to participants in Study 4. Our results are consistent with their interpretation.
We saw that participants assigned to the control and intentional conditions behaved similarly, which is consistent with the original authors’ hypothesis that people default to thinking that failures to appear for court are intentional. In the original study, however, there was a possible alternative interpretation. The original authors argued that the similar behavior of the control and intentional conditions was evidence that participants support penalties by default, but because the penalty option was always displayed at the top of the list, an ordering effect could have produced the same finding: the mistake condition may have pushed participants toward choosing one of the behavioral nudges, while neither the intentional nor the control condition dissuaded people from simply selecting the first option they saw (increasing penalties). In our replication, we randomized the order of the options, which rules out ordering effects as an explanation for these results.
Although this was not mentioned in the original paper, certain biases may have contributed to some of the findings. One potential bias is demand bias, which is when participants change their behaviors or views because of presumed or real knowledge of the research agenda. With additional background information (compared to study 3), there may have been more of a tendency for participants to answer in a way that they believed that the researchers wanted them to. In the mistake condition, in particular, since participants were asked about how to reduce failures to appear immediately after being asked why someone would forget to attend, they may have guessed that the behavioral nudges were the responses that the experimenters wanted them to choose. The higher rate of behavioral nudge support in the mistake condition could also be at least partly attributable to social desirability bias. Social desirability bias occurs when respondents conceal their true opinion on a subject in order to make themselves look good to others (or to themselves). Participants in the mistake condition may have been reminded of the possibility of people not attending court due to forgetting, and may have selected a behavioral nudge in order to appear more compassionate or forgiving of forgetfulness (even if they did not support behavioral nudges in reality).
Conclusion
We gave studies 3 and 4 of this paper a 4/5 Transparency Rating (rounded up from 3.875 stars). The results from both studies replicated completely, supporting the original authors’ main conclusions. We think that there was room for the authors to be clearer about other interpretations of their data, including the possible influence of social desirability and demand bias, as well as their belief that their results may change over time.
Acknowledgments
We would like to thank the original paper’s authorship team for their generous help in providing all necessary materials and providing insights prior to the replication. It was truly a pleasure to work with them. (However, the responsibility for the contents of the report remains with the author and the rest of the Transparent Replications team.)
I also want to especially thank Clare Harris for providing support during all parts of this process. Your support and guidance were integral to running a successful replication, and you have been an excellent partner. I want to thank Amanda Metskas for her strategic insights, guidance, and feedback to ensure I was always on the right path. Finally, I want to thank Spencer Greenberg for giving me the opportunity to work with the team!
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Study 3 full table of results
Hypothesis 1: Participants have lower support for using behavioral nudges to reduce failures to appear for court (Mcourt) (described in a hypothetical scenario) than for using behavioral nudges to reduce failures to attend other kinds of appointments (Mother) (described in four other scenarios).
Original results: Mcourt = 43%, SD = 50; Mother = 65%, SD = 34; paired t-test, t(300) = 8.13, p = 1.141E-14
Our results: Mcourt = 42%, SD = 50; Mother = 72%, SD = 32; paired t-test, t(393) = 12.776, p = 1.669E-31
Replicated? ✅
Hypothesis 2: Participants rate failures to attend court as being: (1) less likely to be due to forgetting and (2) more likely to be intentional, compared to failures to attend non-court appointments (captured in four different scenarios).
Original results: (forgetting) Mcourt = 3.86, SD = 2.06; Mother = 4.24, SD = 1.45; paired t-test, t(300) = 3.79, p < 0.001. (intentionality) Mcourt = 5.17, SD = 1.75; Mother = 4.82, SD = 1.29; paired t-test, t(300) = 3.92, p < 0.001
Our results: (forgetting) Mcourt = 3.234, SD = 1.848; Mother = 3.743, SD = 1.433; paired t-test, t(393) = 7.057, p = 7.63E-12. (intentionality) Mcourt = 4.972, SD = 1.804; Mother = 4.492, SD = 1.408; paired t-test, t(393) = 6.295, p = 8.246E-10
Replicated? ✅
The original study used the convention of reporting p < 0.001 for very small values. We use exponential notation in the table above to report those p-values where exact values are available.
Study 4 full table of results
Hypothesis 1: Participants are no less likely to support behavioral nudges in the “control” condition compared to in the “intentional” condition.
Original results: control, 63% supported nudges; intentional, 61% supported nudges; χ2(1, N = 304) = 0.09, p = 0.76
Our results: control, 64% supported nudges; intentional, 69% supported nudges; χ2(1, N = 440) = 1.090, p = 0.296
Replicated? ✅
Hypothesis 2: Participants are more likely to support behavioral nudges in the “mistake” condition than they are in the “control” condition.
Original results: control, 63% supported nudges; mistake, 82% supported nudges; χ2(1, N = 304) = 9.08, p = 0.003
Our results: control, 64% supported nudges; mistake, 88% supported nudges; χ2(1, N = 441) = 34.001, p = 5.509E-9
Replicated? ✅
Hypothesis 3: Participants are more likely to support behavioral nudges in the “mistake” condition than they are in the “intentional” condition.
Original results: intentional, 61% supported nudges; mistake, 82% supported nudges; χ2(1, N = 304) = 10.53, p = 0.001
Our results: intentional, 69% supported nudges; mistake, 88% supported nudges; χ2(1, N = 433) = 23.261, p = 1.414E-6
Replicated? ✅
The original study used the convention of reporting p < 0.001 for very small values. We use exponential notation in the table above to report those p-values.
At Transparent Replications we have introduced a study rating criterion that, to our knowledge, has not been used before. We call it our “clarity” rating, and it is an assessment of how unlikely it would be for a reader to misinterpret the results of the study based on the discussion in the paper.
When a study replicates, it is natural to assume that the claims the paper makes based on that study are likely to be true. Unfortunately, this is not necessarily the case, as there may be a substantial gap between what a study actually demonstrated and what the paper claims was demonstrated. All a replication shows is that with new data you can get the same statistical result; it doesn’t show that the claims made based on the statistical result are correct. The clarity rating helps address this, by evaluating the size of the gap between what was shown and what was claimed.
It’s important that papers have a high level of “clarity.” When they don’t, readers may conclude that studies support conclusions that aren’t actually demonstrated by the tests that were conducted. This causes unproven claims to make their way into future research agendas, policymaking, and individual decision-making.
We acknowledge that making a paper clear is a difficult task, and we ourselves often have room for being clearer in our explanations of results. We also acknowledge that most authors strive to make their papers as clear and accurate as possible. A perfect “clarity” rating is very difficult to obtain, and when a paper loses points on this criterion, we are in no way assuming that the authors have intentionally introduced opportunities for misinterpretation – on the contrary, it seems to us that most potential misinterpretations are easier for an outside research team to see, and most original authorship teams would prefer to avoid misinterpretations of their results.
We hope that evaluating clarity will also serve to detect and disincentivize Importance Hacking. Importance Hacking is a new term we’ve introduced to refer to when the importance, novelty, utility, or beauty of a result is exaggerated using subtle enough methods that it goes undetected by peer reviewers. This can (and probably often does) happen unintentionally. A variety of types of Importance Hacking exist, and they can co-occur. Each type involves exaggerating one of the following:
Conclusions that were drawn: papers may make it seem like a study’s results support some interesting finding X when they really support something else (X′) which sounds similar to X but is much less interesting or important.
Novelty: papers may discuss something in a way that makes it seem more novel or unintuitive than it is. Perhaps the result is already well-known or is something that almost everyone would already know based on common sense.
Usefulness: papers may overstate how useful a result will be for the world.
Beauty: papers may make a result seem clean and beautiful when in fact, it’s messy or hard to interpret.
Given that attention, money, and time for scientific research are limited, Importance Hacked studies use up space, attention, and resources that could be directed to more valuable work.
While we believe that evaluating the clarity of papers is important, it does have the drawback that it is more subjective to evaluate than other criteria, such as replicability. We try to be as objective as possible by focusing first on whether a study’s results directly support the claims made in the paper about the meaning of those results. This approach brings into focus any gap between the results and the authors’ conclusions. We also consider the completeness of the discussion – if there were study results that would have meaningfully changed the interpretation of the findings, but that were left out of the paper’s discussion, that would lower the clarity rating.
When replicating studies, we aim to pre-register not only the original analyses, but also the simplest valid statistical test(s) that could address a paper’s research questions, even if these were not reported in the original paper. Sometimes more complex analyses obscure important information. In such cases, it is useful to report on the simple analyses so that we (and, importantly, our readers) can see the answer to the original research questions in the most straightforward way possible. If our redone analysis using the simplest valid method is consistent with the original result, that lends it further support.
We would encourage other projects aiming to evaluate the quality of papers to use our clarity criterion as well. Transparency and replicability are necessary, but not sufficient, for quality research. Even if a study has been conducted transparently and is replicable, this does not necessarily imply that the results are best interpreted in exactly the way that the original authors interpreted them.
To understand our entire ratings system, read more about our transparency and replicability ratings.
Transparent Replications rates the studies that we replicate on three main criteria: transparency, replicability, and clarity. You can read more about our transparency ratings here.
The replicability rating is our evaluation of the degree of consistency between the findings that we obtained in our replication study and the findings in the original study. Our goal with this rating is to give readers an at-a-glance understanding of how closely our results matched the results of the original study. We report this as the number of findings that replicated out of the total number of findings reported in the original study. We also convert this to a star rating (out of 5 stars). So if 50% of the findings replicated we would give the paper 2.5 stars, and if 100% of the findings replicated we would give 5 stars.
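In code form, the conversion is simply the replicated fraction scaled to five stars (a toy illustration, not our rating software):

```python
# Toy illustration of the replicability star conversion described above.
def replicability_stars(findings_replicated: int, findings_total: int) -> float:
    return 5 * findings_replicated / findings_total

print(replicability_stars(5, 5))   # 5.0 stars (all findings replicated)
print(replicability_stars(1, 2))   # 2.5 stars (half of the findings replicated)
```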
That initially sounds simple, but we had to make a few key decisions about what counts and what doesn’t when it comes to assessing whether findings replicate.
Studies often examine several questions. Sometimes a table with many results will be presented, but only some of those results pertain to hypotheses that the researchers are testing. Should all of the presented results be considered when assessing how well a study replicates? If not, then how does one choose which results to consider?
Our answer to this question is that the results we consider to be the study’s findings are the ones that pertain to the primary hypotheses the researchers present in their paper.
This means that if a table of results shows a significant relationship between certain variables, but that relationship isn’t part of the theory that the paper is testing, we don’t consider whether that result is significant in our replication study when assigning our replicability rating. For example, a study using a linear regression model may include socioeconomic status as a control variable, and in the original regression, socioeconomic status may have a significant relationship with the dependent variable. In the replication, maybe there isn’t a significant relationship between socioeconomic status and the dependent variable. If the original study doesn’t have any hypotheses proposing a relationship between socioeconomic status and their dependent variable, then that relationship not being present in the replication results would not impact the replicability rating of the study.
This also means that typically if a result is null in the original paper, whether it turns out to be null in the replication is only relevant to our ratings if the authors of the original paper hypothesized about the result being null for reasons driven by the claims they make in the paper.
We use this approach to determine which findings to evaluate because we want our replication to be fair to the original study, and we want our ratings to communicate clearly about what we found. If our rating included an evaluation of results that the authors are not making claims about, it would not be a fair assessment of how well the study replicated. And if a reader glances at the main claims of a study and then glances at our replicability rating, the reader should get an accurate impression of whether our results were consistent with the authors’ main claims.
In our replication pre-registrations, we list which findings are included when evaluating replicability. In some cases this will involve judgment calls. For instance, if a statistic is somewhat but not very related to a key hypothesis in the paper, we have to decide if it is closely related enough to include. It’s important that we make this decision in advance of collecting data. This ensures that the findings that comprise the rating are determined before we know the replication results.
What counts as getting the same results?
When evaluating the replication results, we need to know in advance what statistical result would count as a finding replicating and what wouldn’t. Typically, for a result to be considered the same as in the original study, it must be in the same direction as the original result, and it must meet a standard for statistical significance.
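As a toy sketch of this decision rule (same direction plus significance at the pre-registered threshold), using made-up numbers:

```python
# Toy version of the replication criterion: same direction as the original effect
# and statistically significant at the pre-registered alpha.
def finding_replicated(original_effect: float, replication_effect: float,
                       replication_p: float, alpha: float = 0.05) -> bool:
    same_direction = (original_effect > 0) == (replication_effect > 0)
    return same_direction and replication_p < alpha

print(finding_replicated(0.32, 0.29, 0.004))   # True: same sign and significant
print(finding_replicated(0.32, 0.01, 0.924))   # False: not significant
```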
There may be cases where the original study does not find support for one of the authors’ original hypotheses, but we find a significant result supporting the hypothesis in our replication study. Although this result is different from the results obtained in the original study, it is a result in support of the original authors’ hypothesis. We would discuss this result in our report, but it would not be included in the replicability rating because the original study’s null result is not considered one of the original study’s findings being tested in the replication (as explained above).
There have been many critiques of the way statistical significance is used to inform one’s understanding of results in the social sciences, and some researchers have started using alternative methods to assess whether a result should be considered evidence of a real effect rather than random chance. The way we determine statistical significance in our replication studies will typically be consistent with the method used in the original paper, since we are attempting to see if the results as presented in the original study can be reproduced on their own terms. If we have concerns about how the statistical significance of the results is established in the original study, those concerns will be addressed in our report, and may inform the study’s clarity rating. In such cases, we may also conduct extra analyses (in addition to those performed in the original paper) and compare them to the original paper’s results as well.
In some replication projects with very large sample sizes, such as Many Labs, a minimum effect size might also need to be established to determine whether a finding has replicated because the extremely high statistical power will mean that even tiny effects are statistically significant. In our case this isn’t likely to be necessary because, unless the original study was underpowered, the statistical power of our studies will not be dramatically larger than that of the original study.
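A quick numeric illustration of why very high-powered projects sometimes add a minimum effect size (computed with statsmodels; the numbers are ours, not Many Labs’):

```python
# With very large samples, even tiny effects reach statistical significance.
# Here, a negligible effect of d = 0.05 is detected ~94% of the time with
# 10,000 participants per group.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.05, nobs1=10_000, alpha=0.05, ratio=1.0)
print(f"{power:.2f}")   # ~0.94
```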
In our replication pre-registrations, we define what statistical results we would count as replicating the original findings.
What does the replicability rating mean?
With this understanding of how the replicability rating is constructed, how should it be interpreted?
If a study has a high replicability rating, that means that conducting the same experiment and performing the same analyses on the newly collected data generated results that were largely consistent with the findings of the original paper.
If a study has a low replicability rating, it means that a large number of the results in the replication study were not consistent with the findings reported in the original study. This means that the level of confidence a person should have that the original hypotheses are correct should be reduced.
A low replicability rating for a study does not mean that the original researchers did something wrong. A study that is well-designed, properly conducted, and correctly analyzed will sometimes generate results that do not replicate. Even the best research has some chance of being a false positive. When that happens, researchers have the opportunity to incorporate the results from the replication into their understanding of the questions under study and to use those results to aid in their future investigations. It’s also possible that we will obtain a false negative result in a replication study (no study has 100% power to detect an effect).
The replicability rating also does not evaluate the validity of the original experiment as a test of the theory being presented, or whether the analyses chosen were the best analyses to test the hypotheses. Questions about the validity, generalizability, and appropriateness of the analyses are addressed in our “clarity” rating, not our “replicability” rating.
For these reasons, we encourage looking at the replicability rating in the context of the transparency and clarity ratings to get a more complete picture of the study being evaluated. For example, if a study with a high replicability rating received a low transparency rating, then the study didn’t use open science best practices, which means that we may not have had access to all the original materials needed to replicate the study accurately. Or in the case of a study with a high replicability rating, but a low clarity rating, we can infer that the experimental protocol generates consistent results, but that there may be questions about what those results should be understood to mean.
As we conduct more replications, we expect to learn from the process. Hence, our procedures (including those mentioned in this article) may change over time as we discover flaws in our process and improve it.
By rigorously evaluating studies using these three criteria (“transparency,” “replicability,” and “clarity”), we aim to encourage and reward the publication of reliable research results that people can be highly confident in applying or building on in future research.
Transparent Replications is an initiative that celebrates and encourages openness, replicability, and clear communication in psychological science. We do this by regularly placing new papers from high-impact journals under the microscope. We select studies from them, run replications, and rate the replicated studies using three key criteria. Each of the three criteria represents a concept that we see as critical for good science,¹ and by rating papers based on these criteria, we hope that we will incentivize our target journals to value each concept more highly than they have done to date. In this series of posts, we explain the contents and rationale underlying each criterion.
This post explains the first of our three criteria: transparency. This is a broad concept, so we have broken it down into subcriteria (explained below). We’ve designed the subcriteria with the aim to:
Highlight and celebrate highly transparent studies, and
Encourage research teams who aren’t already maximally transparent to be more transparent in the future.
The sections below give a breakdown of our current criteria (as of January, 2023) for evaluating the transparency of studies. Of course, we are open to changing these criteria if doing so would enable us to better meet the goals listed above. If you believe we are missing anything, if you think we should be placing more emphasis on some criteria than on others, or if you have any other alterations you’d like to suggest, please don’t hesitate to get in touch!
The components of transparency, why they’re important, and how we rate them
Methods and Analysis Transparency
Research teams need to be transparent about their research methods and analyses so that future teams are able to replicate those studies and analyses.
1. Our Methods Transparency Ratings (edited in May 2023²):
Did the description of the study methods and associated materials (potentially including OSF files or other publicly-accessible materials describing the administration of the study) give enough detail for people to be able to replicate the original study accurately?
5 = The materials were publicly available and were complete.
4.5 = The materials were publicly available and almost complete, and remaining materials were provided on request.
4 = The materials were publicly available and almost complete; not all the remaining materials were provided on request, but this did not significantly impact our ability to replicate the study.
3 = The materials were not publicly available, but the complete materials were provided on request (at no cost).
2.5 = The materials were not publicly available, but some materials were provided on request. The remaining materials could be accessed by paying to access them.
2 = The materials were not publicly available, but some materials were provided on request. Other materials were not accessible.
1.5 = The materials were not publicly available, and were not provided on request. Some materials could be accessed by paying to access them.
1 = We couldn’t access materials.
2. Our Analysis Transparency Ratings (edited in April 2023³):
Was the analysis code available?
5 = The analysis code was publicly available and complete.
4 = Either: (a) the analysis was a commonly-completed analysis that was described fully enough in the paper to be able to replicate without sharing code; OR (b) the analysis code was publicly available and almost complete – and the remaining details or remaining parts of the code were given on request.
3.5 = The analysis code or analysis explanation was publicly available and almost complete, but the remaining details or remaining code were not given on request.
3 = The analysis code or the explanation of the analysis was not publicly available (or a large proportion of it was missing), but the complete analysis code was given on request.
2 = The analysis code was not publicly available or the explanation was not clear enough to allow for replication. An incomplete copy of the analysis code was given on request.
1 = We couldn’t access the analysis code and the analysis was not explained adequately. No further materials were provided by the study team, despite being requested.
Data Transparency
Datasets need to be available so that other teams can verify that the findings are reproducible (i.e., so that others can verify that the same results are obtained when the original analyses are conducted on the original data). Publishing datasets also allows other teams the opportunity to derive further insights that the original team might not have discovered.
3. Our Data Availability Ratings (as of January 2023):
Were the data (including explanations of data) available?
5 = The data were already publicly available and complete.
4.5 = The data were publicly available and almost complete, and authors gave remaining data on request.
4 = The data were publicly available and partially complete, but the remaining data were not given on request.
3 = The data were not publicly available, but the complete dataset was given on request.
2 = The data were not publicly available, and an incomplete dataset was given on request.
1 = We couldn’t access the data.
Pre-registration
Pre-registration involves the production of a time-stamped document outlining how a study will be conducted and analyzed. A pre-registration document is written before the research is conducted and should make it possible for readers to evaluate which parts of the study and analyses eventually undertaken were planned in advance and which were not. This increases the transparency of the planning process behind the research and analyses. Distinguishing between pre-planned and exploratory analyses is especially helpful because exploratory analyses can (at least in theory) give rise to higher rates of type 1 errors (i.e., false positives) due to the possibility that some researchers will continue conducting exploratory analyses until they find a positive or noteworthy result (a form of p-hacking). Pre-registration can also disincentivize hypothesizing after the results are known (HARKing).
The fact that a team pre-registered a study is not sufficient grounds for that study to receive a high Pre-registration Rating when we evaluate a study’s transparency. For a perfect score, the pre-registration should be adhered to. If there are deviations from it, it is important that these are clearly acknowledged. If a study is pre-registered but the authors deviate from the pre-registration in significant ways and fail to acknowledge they have done so, this can give a false impression of rigor without actually increasing the robustness of the study. (We consider this a worse scenario than having no pre-registration at all, because it creates a false impression that the study and analyses were done in ways that aligned with previous plans.)
4. Our Pre-registration Ratings (as of January 2023):
Was the study pre-registered, and did the research team adhere to the pre-registration?
5 = The study was pre-registered and the pre-registration was adhered to.
4 = The study was pre-registered and was carried out with only minor deviations, all of which were acknowledged by the research team.
3.5 = The study was pre-registered and was carried out with only minor deviations, but only some of these were acknowledged by the research team.
3 = The study was pre-registered but was carried out with major deviations, all of which were acknowledged by the research team.
2.5 = The study was pre-registered but was carried out with major deviations, only some of which were acknowledged, or there were significant parts of the experiment or analyses that were not mentioned in the preregistration.
2 = The study was not pre-registered.
1 = The study was pre-registered, but the pre-registration was not followed, and the fact that the preregistration wasn’t followed was not acknowledged by the authors.
Open Access
Another factor which we believe contributes to transparency, but which we do not currently consider when rating studies, is free availability. Papers that are not freely available tend to be accessible only by certain library users or paid subscribers. We do not rate studies based on their free availability because we do not think authors have enough power over this aspect of their papers. If you disagree with this, and think we should be rating studies on this, please get in touch.
Are there circumstances in which it’s unfair to rate a study for its transparency?
We acknowledge that there are some circumstances in which it would be inappropriate for a study to be transparent. Here are some of the main ones:
Information hazards might make it unsafe to share some research. If the dissemination of true information has the potential to cause harm to others, or to enable someone to cause harm, then the risk created through sharing that information is an information hazard, or infohazard. We expect that serious infohazards would arise relatively infrequently in psychological research studies. (Instead, they tend to arise in research disciplines more known for their dual-use research, such as biorisk research.)
There may be privacy-related or ethical reasons for not sharing certain datasets. For example, certain datasets may only have been obtained on the condition that they would not be shared openly.
Certain studies may be exploratory in nature, which may make pre-registration less relevant. If a research team chose to conduct an exploratory study, they may not pre-register it. One could argue that exploratory studies should be followed up with pre-registered confirmatory studies prior to a finding being published. However, a team may wish to share their exploratory findings prior to conducting confirmatory follow-up studies.
If a study we evaluate has a good reason to not be fully transparent, we will take note of this and will consider not rating them on certain subcriteria. Of the reasons listed above, we expect that almost all the legitimate reasons for a lack of transparency will fall into the second and third categories. The first class of reasons – serious infohazards – are not expected to arise in the studies we replicate, because if we thought that a given psychology study was likely to harm others (either directly or through its results), we would not replicate it in the first place. On the other hand, the other two reasons seem relatively more likely to apply: we could end up replicating some studies that use datasets which cannot be shared, while other studies we replicate may be exploratory in nature and may not have been pre-registered. In such cases, depending on the details of the study, we may abstain from rating data transparency, or we may abstain from rating pre-registration (but only if the authors made it very clear in their paper that the study was exploratory in nature).
Transparency sheds light on our other criteria
A study’s transparency tends to have a direct impact on our interpretation of its replicability ratings. The more transparent a study is, the more easily our team can replicate it faithfully (and the more likely it is that the findings will be consistent with the original study, all else being equal). Conversely, the less transparent the original study, the more likely it is that we end up having to conduct a conceptual replication instead of a direct replication. These two different types of replications have different interpretations.
Transparency also affects our Clarity Ratings. At Transparent Replications, when we talk about transparency, we are referring to the degree to which a team has publicly shared their study’s methods, analyses, data, and (through pre-registration) planning steps. There is another criterion which we initially discussed as a component of our Transparency Ratings (but which we eventually placed in its own separate criterion): whether the description and discussion of the results in the original paper match with what the results actually show. We consider it very important that teams describe and discuss their results accurately; they should also document their reasoning process transparently and soundly. However, we consider this aspect of transparency to be conceptually distinct enough that it belongs in a separate criterion: our Clarity criterion, which will be discussed in another post. To assess this kind of clarity, we first need the paper under examination to be transparent in its methods, analyses, and data. Consequently, a paper that has a high score in our Transparency Ratings is more likely to have an accurate rating in its Clarity criterion.
Summary
Wherever possible, psychology studies should transparently share details of their planning process (through pre-registration), methods, analyses, and data. This allows other researchers, including our team, to assess the reproducibility and replicability of the original results, as well as the degree to which the original team’s conclusions are supported by their data. If a study receives a high rating on all our Transparency Ratings criteria, we can be more confident that our Replicability and Clarity Ratings are accurate. And if a study performs well on all three criteria, we can be more confident in the conclusions derived from it.
Acknowledgements
Many thanks to Travis Manuel, Spencer Greenberg, and Amanda Metskas for helpful comments and edits on earlier drafts of this piece.
Footnotes
1. We don’t think that our criteria (transparency, replicability, and clarity) are the only things that matter in psychological science. We also think that psychological science should be focusing on research questions that will have a robustly positive impact on the world. However, in this project, we are focusing on the quality of studies and their write-ups, rather than how likely it is that answering a given research question will improve the world. An example of a project that promotes similar values to those that our initiative focuses on, as well as promoting research with a large positive impact on the world, is The Unjournal. (Note that we are not currently affiliated with them.)
2. We edited our Methods Transparency Ratings following discussions within our team from April to May 2023. The previous Methods Transparency Ratings had been divided into two sub-criteria, labeled (a) and (b). Sub-criterion (a) rated the transparency of materials other than psychological scales, and sub-criterion (b) rated the accessibility of any psychological scales used in a given study. Between April and May 2023, we decided to merge these two sub-criteria into a single rating.
3. We added details to our Analysis Transparency Ratings in April 2023 to cover cases where analysis code is not provided but the analysis method is simple enough to replicate faithfully without the code. For example, if the authors of a study present results from paired t-tests and provide enough information for us to reproduce their results, the study would be given a four-star rating for Analysis Transparency, even if the authors did not specify which programming language or software they used to perform the t-tests.
If you’d like us to let you know when we publish replication reports, subscribe to our email list. We promise that we’ll only message a few times per month.
We would also love to get your feedback on our work.
We ran a replication of study 2A from this paper, which tested whether knowing additional information about another person changed what participants thought the other person would know about them. The primary result in the original study failed to replicate. There was no relationship between whether participants were given information about their ‘partner’ and how likely the participants thought their ‘partner’ would be to detect a lie the participant told.
The supporting materials for the original paper can be found on OSF.
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
Between information provided on OSF and responsive communication from the authors, it was easy to conduct a replication of this study; however, the authors did not pre-register the 9 laboratory experiments in this paper.
Replicability: to what extent were we able to replicate the findings of the original study?
The main finding did not replicate. Participants who were given information about another person were not more likely to believe that the other person could detect their lie, either in the entire sample or in the analysis restricted to those who passed the manipulation check. The finding that participants said they knew another person better if they were given information about them replicated in both the entire sample and the subsample that passed the manipulation check, indicating that the manipulation did have some impact on participants. Whether the mediation analysis replicated is a more complicated question, given that the main finding did not replicate.
Clarity: how unlikely is it that the study will be misinterpreted?
The explanation of this study in the paper is clear, and the statistics used for the main analysis are straightforward and easy to interpret.
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
The code used to program the study materials was provided on OSF. Authors were responsive to any remaining questions after reviewing the provided code.
2. Analysis Transparency:
Analysis code was not available because the analysis was conducted using SPSS. Authors were responsive to questions. Analyses were described clearly, and the analyses used were not needlessly complex or esoteric. The results reported in the paper could be reproduced easily using the data provided online by the authors.
3. Data availability:
Data were available on OSF.
4. Pre-registration:
No pre-registration was submitted for Study 2A or the other 8 lab studies conducted between 2015-2021 in the paper. The field study was pre-registered.
Please note that the ‘Replicability’ and ‘Clarity’ ratings are single-criterion ratings, which is why no ratings breakdown is provided.
Summary of Study and Results
Study Summary
Our replication study (N = 475) examined whether people assigned a higher probability to the chance of another person detecting their lie if they were given information about that other person than if they were not. We found that the main result from the original study did not hold in our replication.
In the experiment, participants wrote 5 statements about themselves, 4 truths and 1 lie, and were told those statements would be shared with another person who would guess which one was the lie. Participants were either given 4 true statements about their ‘partner’ (information condition), or they were given no information about their ‘partner’ (no information condition). Participants were asked to assign a percentage chance to how likely their ‘partner’ would be to detect their lie after either being given this information or not. Note that participants in the study were not actually connected to another person, so for clarity we put the term ‘partner’ in single quotes in this report.
We collected data from 481 participants using the Positly platform. We excluded 4 participants who were missing demographic data. We also excluded 2 participants who submitted nonsensical single word answers to the four truths and a lie prompt. Participants could not proceed in the experiment if they left any of those statements blank, but there was no automated check on the content of what was submitted. The authors of the original study did not remove any subjects from their analysis, but they recommended that we do this quality check in our replication.
The data were analyzed primarily using two-tailed independent samples t-tests. The main analysis asked whether participants in the information condition assigned a different probability to the chance of their ‘partner’ detecting their lie than participants in the no information condition. We found that this main result did not replicate (Minfo = 33.19% (30.49–35.89%), n = 236 / Mno info = 33.00% (30.15–35.85%), n = 239; Welch’s t: t(472.00) = 0.095; p = 0.924; Effect size: d = 0.009).
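For concreteness, here is a sketch of this main comparison using scipy; the file and column names are hypothetical placeholders, not the names from our materials.

```python
# Illustrative Welch's t-test for the main analysis: detection-probability estimates
# in the information vs. no-information conditions. File and column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("study2a_replication.csv")              # hypothetical file name
info = df.loc[df["condition"] == "information", "detection_pct"]
no_info = df.loc[df["condition"] == "no_information", "detection_pct"]

t, p = stats.ttest_ind(info, no_info, equal_var=False)   # equal_var=False gives Welch's t
print(f"Welch's t = {t:.3f}, p = {p:.3f}")
```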
Detailed Results
Primary Analyses
Table 1: Results – Entire Sample
H1: Participants in the information condition will report a significantly higher percentage chance of lie detection by their ‘partner’ than participants in the no information condition. (entire sample)
Original study result: Minfo = 41.06% (37.76–44.35%), n = 228; Mno info = 33.29% (30.34–36.24%), n = 234; Welch’s t: t(453.20) = 3.44, p < 0.001; effect size: d = 0.32
Our replication result: Minfo = 33.19% (30.49–35.89%), n = 236; Mno info = 33.00% (30.15–35.85%), n = 239; Welch’s t: t(472.00) = 0.095, p = 0.924; effect size: d = 0.009
Result replicated? No
H2: Participants in the information condition will report significantly higher responses to how well they believe they know their ‘partner’. (entire sample)
Original study result: Minfo = 3.04, 95% CI = 2.83–3.25, n = 228; Mno info = 1.89, 95% CI = 1.69–2.09, n = 234; Student’s t: t(460) = 7.73, p < 0.001; effect size: d = 0.72
Our replication result: Minfo = 2.65, 95% CI = 2.47–2.84, n = 236; Mno info = 1.61, 95% CI = 1.45–1.77, n = 239; Student’s t: t(473.00) = 8.387; Welch’s t: t(464.53) = 8.381; p < 0.001 for both; effect size: d = 0.770 (Student’s), d = 0.769 (Welch’s)
Result replicated? Yes
H3: Knowledge of the ‘partner’ mediates the relationship between the condition participants were assigned to and their assessment of the percentage chance that their ‘partner’ will detect their lie. (entire sample)
Original study result: indirect effect = 3.83; bias-corrected 95% CI = 1.91–5.99
Our replication result: indirect effect = 2.83; bias-corrected 95% CI = 1.24–4.89
Result replicated? See Discussion
Contingency Test
In the original study, the authors found that participants in the information condition were more likely to believe that they were connected to another person during the experiment than participants in the no information condition. Original study results: (58.3% (information condition) versus 40.6% (no information condition), χ2 = 14.53, p < 0.001, Cramer’s V = 0.18). Due to this issue, they ran their analyses again on only those participants who passed the manipulation check.
We performed the same contingency test as part of our replication study, and we did not have the same issue with our sample. Replication study results: (59.3% (information condition) versus 54.4% (no information condition), χ2 = 1.176, p = 0.278, Cramer’s V = 0.05). Despite not having this difference in our sample, we ran the same three tests on the subjects who passed the manipulation check (n = 270), as they did in the original study. These results are consistent with the results we obtained on our entire sample.
H4: Participants in the information condition will report a significantly higher percentage chance of lie detection by their ‘partner’ than participants in the no information condition. (manipulation check passed only)
Original study result: Minfo = 44.69% (40.29–49.09%), n = 133; Mno info = 35.60% (30.73–40.47%), n = 95; Student’s t: t(226) = 2.69, p = 0.008; effect size: d = 0.36
Our replication result: Minfo = 33.91% (30.41–37.42%), n = 140; Mno info = 34.09% (30.11–38.05%), n = 130; Student’s t: t(268) = -0.64; Welch’s t: t(261.24) = -0.063; p = 0.95 for each test; effect size: d = -0.008 for both
Result replicated? No
H5: Participants in the information condition will report significantly higher responses to how well they believe they know their ‘partner’. (manipulation check passed only)
Original study result: Minfo = 3.44, 95% CI = [3.15, 3.73], n = 133; Mno info = 2.53, 95% CI = [2.14, 2.92], n = 95; Welch’s t: t(185.48) = 3.67, p < 0.001; effect size: d = 0.50
Our replication result: Minfo = 2.93, 95% CI = [2.68, 3.18], n = 140; Mno info = 1.89, 95% CI = [1.62, 2.15], n = 130; Welch’s t: t(266.05) = 5.66, p < 0.001; effect size: d = 0.689
Result replicated? Yes
H6: Knowledge of the ‘partner’ mediates the relationship between the condition participants were assigned to and their assessment of the percentage chance that their ‘partner’ will detect their lie. (manipulation check passed only)
Original study result: indirect effect = 4.18; bias-corrected 95% CI = [1.64, 7.35]
Our replication result: indirect effect = 3.25; bias-corrected 95% CI = [1.25, 5.8]
Result replicated? See Discussion
Additional Analyses
We had a concern that participants who were not carefully reading the experimental materials may not have understood which information of theirs was being shared with their ‘partner’ in the study. To address that concern, we reminded participants that their ‘partner’ would not be told which of the 5 statements they shared was a lie. We also added a comprehension check question at the end of the experiment after all of the questions from the original experiment were asked. We found that 45 of 475 participants (9%) failed the comprehension check, which was a 4 option multiple choice question. Re-running the analyses excluding those who failed the comprehension check did not substantively change any of the results. (See Appendix for the specific language used in the reminder, and for the full table of these results.)
Interpreting the Results
Is Mediation Analysis appropriate without a significant total effect?
There is debate about whether it is appropriate to conduct a mediation analysis when there is no significant total effect. Early approaches to mediation analysis used a causal steps procedure: first test the relationship between X and Y, and only test for mediation if that total effect is significant (Baron & Kenny, 1986). More recently, approaches to mediation analysis have been developed that do not rely on this sequence, and the developers of these more modern methods have argued that it can be appropriate to run a mediation analysis even when there is no significant X–Y relationship (Rucker et al., 2011; Hayes, 2013).
Some recent research attempts to outline the conditions under which it is appropriate to conduct a mediation analysis in the absence of a significant total effect (Agler & De Boeck, 2017; Loeys, Moerkerke & Vansteelandt, 2015). Broadly, this step is appropriate when there is an a priori hypothesis that the mediated relationship is the important path to examine. Such a hypothesis could account for one of two situations in which an indirect effect might exist despite no significant total effect:
The direct effect and the indirect effect are hypothesized to have opposite signs. In this case, the total effect could be non-significant because the direct and the indirect effects cancel.
There is hypothesized complete mediation (all of the effect in the total effects model is coming from the indirect rather than the direct path), and the statistical power of the total effects model is low. In this case the indirect effects model offers more statistical power, so it can detect the indirect relationship that exists even though a Type II error causes the total effects model to incorrectly fail to reject the null hypothesis.
Agler & De Boeck (2017) and Loeys, Moerkerke & Vansteelandt (2015) recommend against conducting a mediation analysis when there is no significant total effect unless there is an a priori hypothesis that justifies that analysis, for the following reasons:
Mediation analysis without a significant total effect greatly increases the chances for a Type I error on the indirect path, inflating the chances of finding a statistically significant indirect effect, when no real indirect effect exists.
Mediation analysis can result in false positives on the indirect path that are caused by uncontrolled additional variables that influence both the mediator variable and the outcome variable. In a controlled experiment where the predictor variable is the randomized control, a total effects model of X → Y is not subject to the problem of uncontrolled additional variables, but once the mediator is introduced that problem re-emerges on the M → Y path.
Figure 1 from Loeys, Moerkerke & Vansteelandt, 2015 illustrates this issue.
It is difficult to tell from the original study whether the mediation analysis was hypothesized a priori, because no pre-registration was filed for the study. The way the results are presented in the paper, the strongly significant relationship the authors found between the experimental condition and the main dependent variable, the prediction of lie detection, is given as the main finding (it is what is presented in the main table of results). The mediation analysis is described in the text as something done subsequently that supports the theorized mechanism connecting the experimental condition and the main dependent variable. There is no reason to expect from the paper that the authors believed there would be a canceling effect between the direct and indirect effects; in fact, that would be contrary to their hypothesized mechanism. And with 462 participants, their study doesn’t seem likely to have been underpowered, although they did not conduct a power analysis in advance.
How should the Mediation Analysis results be understood?
We carried out the mediation analysis, despite the debate in the literature over its appropriateness in this circumstance, because we did not specify in the pre-registration that we would only conduct this analysis if the total effect was significant.
The mediation analysis (see tables 1 and 2 above) does show a significant result for the indirect path:
condition → knowThem → percentLieDetect
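To make the structure of this test concrete, here is a minimal sketch of how the indirect effect a × b for this path can be estimated. It is our illustration only: it uses ordinary least squares and a simple percentile bootstrap rather than the bias-corrected bootstrap reported in the tables above, and the column names and fake data are assumed.

```python
import numpy as np
import pandas as pd

def indirect_effect(df):
    """a*b for the path condition -> knowThem -> percentLieDetect."""
    # a path: knowThem regressed on condition (0 = no information, 1 = information)
    a = np.polyfit(df["condition"], df["knowThem"], 1)[0]
    # b path: percentLieDetect regressed on knowThem, controlling for condition
    X = np.column_stack([np.ones(len(df)), df["condition"], df["knowThem"]])
    coefs, *_ = np.linalg.lstsq(X, df["percentLieDetect"].to_numpy(), rcond=None)
    return a * coefs[2]

def bootstrap_ci(df, n_boot=5000, seed=0):
    """Simple percentile-bootstrap CI for the indirect effect."""
    rng = np.random.default_rng(seed)
    draws = [indirect_effect(df.iloc[rng.integers(0, len(df), len(df))])
             for _ in range(n_boot)]
    return np.percentile(draws, [2.5, 97.5])

# Fake data just to show the call pattern (not the study data)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "condition": rng.integers(0, 2, 475),
    "knowThem": rng.integers(1, 8, 475),
    "percentLieDetect": rng.integers(0, 101, 475),
})
print(indirect_effect(df), bootstrap_ci(df, n_boot=1000))
```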
Digging into this result a bit more, we can identify a possible uncontrolled additional variable influencing both the mediator variable and the outcome variable that could account for the significant result on path b knowThem → percentLieDetect. First, here is the correlation between knowThem and percentLieDetect for the sample as a whole:
The troubling pattern we find is that random assignment to one condition or the other results in a distinct difference in whether participants’ responses to how well they know their ‘partner’ correlates with their assessment of how likely their ‘partner’ is to detect their lie. In the no information condition, there is a significant correlation between how well participants say they know their ‘partner’ and how high a percentage they assign to their ‘partner’ detecting their lie.
This relationship does not exist in the information condition. This means that, if a participant is given information about their ‘partner’, there is no relationship between how well they say they know their ‘partner’ and the percent chance they assign to their ‘partner’ detecting their lie.
Examining the scatter plot of the relationship between the two variables in the two conditions as well as the distribution of each variable in both conditions can help shed some light on why this might be.
Why might this relationship exist in the no information condition, but not the information condition? One possible explanation is that the participants in the no information condition have a large cluster of responses at one point – an answer of ‘1’ on the knowThem question, and an answer of 20% on the percentLieDetect question. In our sample just over 25% of respondents in the no information condition gave this pair of responses. That response is the floor value on the knowThem question, and it’s at the low end on the percent question, where responses could range from 0-100.
It is not surprising that a large number of respondents in a condition where they have no information about their ‘partner’ would answer that they don’t know their partner at all, an answer of 1 on the 1-7 scale for the knowThem question. It is also understandable that a large portion of these respondents would also give an answer of 20% on the question of how likely they think their ‘partner’ would be to detect their lie, because that answer is the random chance that the one lie would be selected from five total statements. This pattern of responding suggests a group of participants in the no information condition who correctly understand that they don’t know anything about their ‘partner’ and their ‘partner’ doesn’t know anything about them.
Because the point that these 25% of respondents’ answers clustered at was near the floor on both variables, a statistically significant correlation is likely to occur even if the rest of the responses are random noise. We conducted a simulation which demonstrates this.
We constructed simulated data the size of the sample of our no information condition (N = 239). The simulated data contained a fraction of responses at 1 for knowThem and 20% for percentLieDetect (the signal fraction), and the remaining fraction was assigned randomly to values from 1-7 for knowThem and 0-100% for percentLieDetect (the noise fraction). We then looked at the correlation coefficient for the simulated data. We ran this simulation 10,000 times at each of 3 different noise fractions. The graph shows the probability density of a correlation coefficient being generated by the simulations.
In the yellow curve, 25% of simulated respondents are in the signal fraction (at 1 on knowThem and 20% on percentLieDetect) and 75% are noise. That is similar to the percentage of respondents who answered 1 and 20% in the no information group in the replication. When this pattern of 75% noise responses and 25% responses at 1 and 20% is simulated 10,000 times, it typically results in a correlation between 0.25 and 0.3. The correlation in our actual data is 0.26.
Note that as the percentage of respondents anchored at the one point increases, from 10% in the green to 25% in the yellow to 90% in the blue, the strength of the correlation increases, as long as there are at least some random noise responses to create other points for the correlation line to be drawn through.
The python code used to run this simulation and generate this graph is available in the appendix.
This result suggests that the significant result in the indirect path of the mediation analysis in our replication could be the result of a statistical artifact in the no information condition in the relationship between the mediator variable knowThem and the dependent variable percentLieDetect. In the absence of a significant total effects relationship between the experimental condition and the main dependent variable, and given this potential cause of the knowThem→percentLieDetect relationship on the indirect path, the significant effect in the indirect path in the mediation analysis cannot be considered strong evidence.
Conclusion
The big question that this pattern of results drives us to ask is: why did the authors get such a strongly significant result in their sample if there is really no relationship between the experimental condition and their main DV? Since we were surprised to go from a result with p < 0.001 in the initial paper to p > 0.90 in the replication, we ran several checks to make sure that there were no coding errors in our data or other explanations for our results.
One possible explanation for the large difference between the replication results and the results in the initial study could be the confounding of the success of the manipulation check with the experimental condition reported in the original study. In the original study data, fewer people in the no information condition (only 40%) believed that they had been connected to another person in the study, while 58% of the participants in the information condition believed this. The authors reported this in their contingency test. The authors’ attempt to resolve the problem by running their analyses again on only those who passed the manipulation check may have created a selection bias, since whether participants passed or failed the manipulation check was not necessarily random. It is also possible that other sample differences could account for this difference in results.
A potential lesson from the failure of this study to replicate is that sample oddities, like the confounding between the success of the manipulation and the experimental condition in this paper, may have deeper implications for the results than are easily recognized. In this case, much to the authors’ credit, the authors did the contingency test that revealed this oddity in their sample data, they reported the potential issue posed by this result, and they conducted a subsequent analysis to attempt to address this issue. What they did seemed like a very reasonable solution to the oddity in their sample, but upon replication we learned that it may not be an adequate solution.
Author Acknowledgement
We are grateful to Dr. Anuj K. Shah and Dr. Michael LaForest for the feedback provided on the design and execution of this replication. Any errors or issues that may remain in this replication effort are the responsibility of the Transparent Replications by Clearer Thinking team.
We provided a draft copy of this report to the authors for review on October 17, 2022.
We appreciate Dr. Shah and Dr. LaForest for their commitment to replicability in science, and for their transparency about their methods that made this replication effort possible.
Thank you to Spencer Greenberg and Clare Harris at Transparent Replications who provided valuable feedback on this replication and report throughout the process. Thank you also to Eric Huff for assistance with the simulation, and Greg Lopez for reviewing the report and analyses. Finally, thanks to the Ethics Evaluator for their review, and to the participants for their time and attention.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional Information about the Study
The wording in our replication study was the same as that of the original study, with the exception that we added a clarifying reminder to participants that their ‘partner’ would not be told which of their 5 statements was a lie. In the course of suggesting revisions to our replication study materials, the original author team reviewed the reminder language and did not express any concerns about it.
In the information condition, the original study wording was, “We have connected you to another person on the server and showed them your five statements.” Our wording in the information condition was, “We have connected you to another person on the server. We showed them all five of your statements and we did NOT tell them which ones were true.” For both the original study and our study, participants in the information condition then saw four true statements about their ‘partner.’ The statements used were the same in the original study and our replication.
In the no information condition, the original study wording was, “We have connected you to another person on the server and showed them your five statements.” Our wording in the no information condition was, “We have connected you to another person on the server. While we didn’t show you any information about the other person, we showed them all five of your statements and we did NOT tell them which ones were true.”
Additional Analyses
Detailed Results excluding participants who failed a comprehension check
Table 3: Results – Replication Sample with Exclusions
Participants in the information condition will report a significantly higher percentage chance of lie detection by their ‘partner’ than participants in the no information condition.
Entire replication sample (after exclusions): M_info = 32.81% (30.03–35.58%), n = 211; M_no-info = 32.44% (29.46–35.42%), n = 219; Welch’s t: t(426.80) = 0.175, p = 0.861; effect size d = 0.017
Manipulation check passed (after exclusions): M_info = 33.58% (29.98–37.17%), n = 125; M_no-info = 33.73% (29.60–37.86%), n = 119; Welch’s t: t(235.78) = -0.056, p = 0.956; effect size d = -0.007
Replicated? No

Participants in the information condition will report significantly higher responses to how well they believe they know their ‘partner’.
Entire replication sample (after exclusions): M_info = 2.60, 95% CI = [2.41, 2.79], n = 211; M_no-info = 1.54, 95% CI = [1.39, 1.70], n = 219; Student’s t: t(428) = 8.54; Welch’s t: t(406.93) = 8.51; p < 0.001 for both; effect size d = 0.824 (Student’s), 0.822 (Welch’s)
Manipulation check passed (after exclusions): M_info = 2.81, 95% CI = [2.54, 3.07], n = 125; M_no-info = 1.77, 95% CI = [1.52, 2.02], n = 119; Student’s t: t(242) = 5.55; Welch’s t: t(241.85) = 5.56; p < 0.001 for both; effect size d = 0.711 (Student’s), 0.712 (Welch’s)
Replicated? Yes

Knowledge of the ‘partner’ mediates the relationship between the condition participants were assigned to and their assessment of the percentage chance that their ‘partner’ will detect their lie.
Entire replication sample (after exclusions): indirect effect = 2.39, bias-corrected 95% CI = 0.50–4.66
Manipulation check passed (after exclusions): indirect effect = 2.92, bias-corrected 95% CI = 0.84–6.00
Replicated? See Discussion
Analysis Code
Python Code for Simulation
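The sketch below is our reconstruction of the core of the simulation described in the Interpreting the Results section; the exact parameter names and plotting details are assumptions (the published version also generated the probability density plot of the 10,000 simulated correlation coefficients discussed above).

```python
import numpy as np

def simulate_correlation(n=239, signal_fraction=0.25, rng=None):
    """One simulated 'no information' sample: a signal fraction of respondents
    answers knowThem = 1 and percentLieDetect = 20%; the rest answer at random
    (knowThem uniform on 1-7, percentLieDetect uniform on 0-100)."""
    rng = rng or np.random.default_rng()
    n_signal = round(signal_fraction * n)
    n_noise = n - n_signal
    know_them = np.concatenate([np.full(n_signal, 1),
                                rng.integers(1, 8, n_noise)])
    percent_lie = np.concatenate([np.full(n_signal, 20.0),
                                  rng.uniform(0, 100, n_noise)])
    return np.corrcoef(know_them, percent_lie)[0, 1]

rng = np.random.default_rng(0)
for frac in (0.10, 0.25, 0.90):
    rs = [simulate_correlation(signal_fraction=frac, rng=rng) for _ in range(10_000)]
    print(f"signal fraction {frac:.0%}: median simulated r = {np.median(rs):.2f}")
```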
References
Agler, R., & De Boeck, P. (2017). On the Interpretation and Use of Mediation: Multiple Perspectives on Mediation Analysis. Frontiers in Psychology, 8, 1984. https://doi.org/10.3389/fpsyg.2017.01984
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173–1182. https://doi.org/10.1037/0022-3514.51.6.1173
Hayes, A. F. (2013). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach. Guilford Press.
Loeys, T., Moerkerke, B., & Vansteelandt, S. (2015). A cautionary note on the power of the test for the indirect effect in mediation analysis. Frontiers in Psychology, 5, 1549. https://doi.org/10.3389/fpsyg.2014.01549
Rucker, D. D., Preacher, K. J., Tormala, Z. L., & Petty, R. E. (2011). Mediation Analysis in Social Psychology: Current Practices and New Recommendations. Social and Personality Psychology Compass, 5, 359–371. https://doi.org/10.1111/j.1751-9004.2011.00355.x
We ran a replication of study 3 from this paper, which assessed men’s and women’s beliefs about hypothetical scenarios in which they imagined applying to and working for technology companies with different ratios of men and women on staff (either 3:1 or 1:1). In the original study, when a company had a male:female staff ratio of 3:1 (even if its promotional materials displayed an equal balance of men and women), it was perceived as not being sincerely interested in increasing gender diversity, and women (but not men) were more likely to have identity threat concerns about working there (e.g., concerns about not having their contributions valued due to their gender). Also, both men and women (but especially women) tended to be less interested in working for that organization. These effects were mediated by the perception that the company was not sincerely interested in increasing gender diversity. Our findings were mostly consistent with those of the original study (see full report), except that we did not find that gender moderated the indirect effects of company diversity on interest in working for the company.
How to cite this replication report: Transparent Replications by Clearer Thinking. (2022). Report #2: Replication of a study from “Counterfeit Diversity: How Strategically Misrepresenting Gender Diversity Dampens Organizations’ Perceived Sincerity and Elevates Women’s Identity Threat Concerns” (JPSP | Kroeper, Williams & Murphy 2022). https://replications.clearerthinking.org/staging/2202/replication-2022jpsp122-3 (Preprint DOI: https://doi.org/10.31234/osf.io/uy2xt)
See the (inaccurate) predictions made about this study:
See the Manifold Markets prediction market for this study – in that market, the community assigned an equal probability to 5, 6, 7, 8, 9, 10, 11, 12, or 13 findings replicating (of the 17 findings being considered), and assigned each of those values a 13.3 times higher probability than values outside that range. With 18 possible outcomes (0–17 findings) and the 9 in-range values each weighted 13.3 times as heavily as the 9 values outside the range, each in-range value receives 13.3 / (9 × 13.3 + 9) ≈ 10.3% of the probability – so this works out to about a 10.3% chance of exactly 13 findings replicating according to Manifold.
See the Metaculus prediction page for this study – Metaculus predicted that 7.5 of the 17 findings would replicate. According to Metaculus, there was about a 3% chance of 13 findings (12.5-13.5 findings) replicating.
View supporting materials for the original study on OSF
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
Apart from aspects of the pre-registration process, this study had almost perfect ratings on all Transparency Ratings criteria.
Replicability: to what extent were we able to replicate the findings of the original study?
17 statistically significant findings were identified as most relevant (to the key hypotheses) among the findings recorded in the two results figures in the original study. 13 of those 17 findings replicated (76.5%).
Clarity: how unlikely is it that the study will be misinterpreted?
The methods and/or results could be misinterpreted if readers don’t consult (i) the textbook describing the analysis method and/or (ii) the supplementary materials.
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
Publicly-accessible materials described the administration of the study in enough detail for us to be able to replicate the original study accurately. The scales used were publicly available and were easy to find within the OSF materials.
2. Analysis Transparency:
The authors were very transparent about the analysis methods they used and readily communicated with us about them in response to our questions. Please see Appendices for details.
3. Data availability:
All data were publicly available and were easy to find on the OSF project site.
4. Pre-registration:
The authors pre-registered the study, but there were some deviations from this pre-registration, as well as a set of analyses (that formed the main focus of the discussion and conclusions for this study) that were not mentioned in the pre-registration. Please see Appendices for details.
Summary of Study and Results
The study assessed men’s and women’s beliefs about working for technology companies with different ratios of men and women (either 3:1 or 1:1) among their staff. Participants reacted to a hypothetical scenario in which they considered applying for, obtaining, then commencing a project management position in the tech industry.
For an explanation of the statistical terms and analysis used in this write-up, please see the Explanations of statistical terms in the Appendix.
The study’s findings were as follows. When a tech company had a male:female staff ratio of 3:1 (even if its promotional materials displayed an equal balance of men and women), it was perceived as not being sincerely interested in increasing gender diversity, and women (but not men) were more likely to have identity threat concerns about working there (e.g., concerns about being left out or stereotyped, or not having their contributions valued due to their gender). Also, both men and women (but especially women) tended to be less interested in working for that organization. These effects were mediated by the perception that the company was not sincerely interested in increasing gender diversity, and these indirect effects were moderated by participant gender.
Our findings were mostly consistent with those of the original study (see details below), except that we did not find that gender moderated the indirect effects of company diversity on interest in working for the company via the perception of the company’s sincere interest in increasing gender diversity. Instead, we found similarly significant indirect effects of company diversity on interest in working for the company, via the perception of the company’s sincere interest in increasing gender diversity, for both men and women. In their original paper, the authorship team had highlighted how experiments 1 and 2 had not shown this moderation relationship, while experiments 3 and 4 had.
Study Summary
This study assessed men’s and women’s interest in and hypothetical reactions to working for tech companies with different male:female staff ratios (either 3:1 or 1:1). Participants were asked to imagine applying for, obtaining, then commencing a project management position in the tech industry. At the application stage, they were shown recruitment materials that contained images of male and female staff in either a 3:1 or a 1:1 ratio (depending on which condition they had been randomized to).
Later, when participants imagined starting the project management role, they were told that the on-the-ground (actual) staff ratio that they witnessed on their first day at work was either a 3:1 or a 1:1 male:female staff ratio (again depending on which condition they had been randomized to).
The researchers assessed the perceived sincerity of the organization by asking participants two questions about the perceived sincerity of the company’s interest in improving gender diversity. They assessed identity threat by averaging the responses from six questions that asked participants the degree to which they would be concerned about being left out or stereotyped, not respected, or not having their opinion or contributions valued due to their gender.
The researchers then used multicategorical conditional process analysis (explained below) to show that:
The perceived sincerity (of a company’s interest in increasing gender diversity) mediates the relationship between on-the-ground gender diversity and identity threat concerns – and this mediation relationship is moderated by participant gender; and
The perceived sincerity (of a company’s interest in increasing gender diversity) also mediates the relationship between on-the-ground diversity and company interest measured at the second time point – and this mediation relationship is also moderated by participant gender.
What participation involved
To see what the study involved, you can preview it. In summary, once a given participant provided informed consent:
They were randomized into one of four different conditions. The four different conditions are listed in the next section.
They were shown three company site images about a project manager position in the technology industry. The content of the images depended on the condition to which they were assigned. Some participants saw a company that looked “gender diverse,” with a 50:50 gender split; others saw a company that appeared to have a 3:1 male:female staff ratio.
They were asked their level of interest in the project manager position at the company and were asked a series of questions about the images they reviewed. Questions associated with this part of the experiment were labeled as “T1” variables.
They were asked to imagine obtaining and starting the project manager role at the technology company. They were told about the ratio of men to women observed during their first day on the job. Depending on the condition to which they had been randomized, some participants were told that the actual ratio of men to women observed on their first day was 1:1, while others were instead told that the ratio was 3:1.
They were again asked their level of interest in the project manager position at the company and were asked a series of questions about the gender ratio that they had just been told about.
Participants were also asked how “sincerely interested” in gender diversity the company seems to be. They were then presented with a series of identity threat questions, an attention check, and a question about their gender.
Perceived sincerity
The authors included this variable because they suspected that it would mediate the relationships between experimental conditions and both identity threat and company interest. The authors defined “perceived sincerity” as the average of the responses to the following two questions:
To what extent do you think Harrison Technologies is sincerely interested in increasing gender diversity in their workforce? [Rated from “Not at all sincere”, 1, to “Extremely sincere”, 5]
How believable is Harrison Technologies’ interest in increasing gender diversity in their workforce? [Rated from “Not at all believable”, 1, to “Extremely believable”, 5]
Identity threat
This was one of the key outcome variables in the experiment. The authors defined identity threat concerns as the average of the responses to the following six questions (which were rated from “Not at all”, 1, to “An extreme amount”, 5):
How much might you worry that you won’t belong at the company?
How much might you worry that you cannot be your true self at the company?
How much might you worry about being left out or marginalized at the company?
How much might you worry about being stereotyped because of your gender at the company?
How much might you worry that others will not respect you at the company?
How much might you worry that others will not value your opinion or contributions at the company?
Company/position interest
Participants’ interest in the hypothetical project manager position after they found out about the ratio of male to female staff on their first day at work (“Interest_T2”) was one of the key outcome variables in the experiment.
The authors defined Interest_T1 as the answer to the following question (which was asked after participants saw the company ad):
Imagine that you are looking for a project manager position in the tech industry and you encountered the job advertisement on the Harrison Technologies’ website. How interested would you be in the project manager position at Harrison Technologies? [Rated from “Not at all,” 1, to “Extremely interested,” 5]
The authors defined Interest_T2 as the answer to the following question (which was asked after participants had been told about their hypothetical first day at work):
After your first day on the job, how interested would you be in the project manager position at Harrison Technologies? [Rated from “Not at all,” 1, to “Extremely interested,” 5]
Diversity expectations
Diversity expectations were used for a manipulation check. The authors defined the diversity expectation variable (“diversityExpectation”) at time point 1 (“xDiversityExpecationT1”) as the average of the responses to the following two statements (which were rated from “Strongly Disagree”, 1, to “Strongly Agree”, 7):
I expect Harrison Technologies to be *gender diverse.*
I expect to find a *predominantly male* workforce at Harrison Technologies. [Scoring for this response was reversed.]
The authors defined the diversity expectation variable at time point 2 (“xDiversityExpecationT2”) as the average of the responses to the following two statements (which were rated from “Strongly Disagree”, 1, to “Strongly Agree”, 7); a brief scoring sketch follows this list:
After my first day of work at Harrison Technologies, I learned the company is *gender diverse.*
After my first day of work at Harrison Technologies, I learned the company has a *predominantly male* workforce. [Scoring for this response was reversed.]
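As a small illustration of this scoring, the composite can be computed by reverse-scoring the second item (on a 1–7 scale, a reversed response is scored as 8 minus the response) and averaging. The column names below are hypothetical, not the names used in the original dataset.

```python
import pandas as pd

def diversity_expectation(df, diverse_item, male_item):
    """Average of the 'gender diverse' item and the reverse-scored
    'predominantly male' item (both rated 1-7; reversed score = 8 - response)."""
    return (df[diverse_item] + (8 - df[male_item])) / 2

# Hypothetical responses for two participants (column names are made up)
df = pd.DataFrame({"expectDiverseT1": [6, 2], "expectMaleT1": [2, 6]})
print(diversity_expectation(df, "expectDiverseT1", "expectMaleT1"))  # 6.0 and 2.0
```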
Conditional Process Analysis
For an explanation of the terms used in this section, please see the Explanations of statistical terms in the appendices. The analysis used in both the original study and our replication is a conditional process analysis, following Andrew Hayes’ PROCESS approach, as described in his book Introduction to Mediation, Moderation, and Conditional Process Analysis: A Regression-Based Approach. Hayes lays out various ways in which moderation and mediation can occur in the same model.
Here is a brief summary of the particular model that the original study authors tested (known as “model 14”). In this model, there is:
An independent variable (which can be categorical, as in this study),
A dependent variable,
A mediator variable (that mediates the relationship between the independent and the dependent variable), and
A moderator variable (that, in this particular model, moderates the relationship between the mediator variable and the dependent variable).
These variables are shown below, along with the names that are traditionally given to the different “paths” in the model.
In the diagram below…
The “a” path (from the independent variables to the mediator variable) is quantified by finding the coefficient of the independent variable in a linear regression predicting the mediator variable.
The “b” and “c’ ” paths are quantified by finding the coefficients of the mediator and independent variables (respectively) in a regression involving the dependent variable as the outcome variable and all other relevant variables (the independent variable, the mediator variable, the moderator variable, and a mediator-moderator interaction term) as the predictor variables.
In Hayes’ book, he states that mediation can be said to be occurring (within a given level of the moderator variable) as long as the indirect effect is different from zero – i.e., as long as the effect size of ab (the path from the independent variable to the dependent variable via the mediator variable) is different from zero. He states that the significance of the a and b paths on their own is not important, and that it is the product of the paths (ab) that determines whether the indirect effect can be said to be significant.
The term “multicategorical” used in the current study refers to the fact that the independent variable is categorical (in this case, the categories consisted of different contrasts between experimental conditions).
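As a concrete illustration of the pieces involved, here is a minimal sketch for a single condition contrast with a binary moderator. This is our own simplified version using ordinary least squares and fake data; the original analyses used Hayes’ PROCESS macro (model 14) with a multicategorical independent variable and bootstrap confidence intervals.

```python
import numpy as np

def conditional_indirect_effects(x, m, w, y):
    """Indirect effect of x on y through m at each level of a binary moderator w
    (a simplified, single-contrast version of a 'model 14'-style analysis)."""
    # a path: m regressed on x
    a = np.polyfit(x, m, 1)[0]
    # b and c' paths: y regressed on x, m, w, and the m*w interaction
    design = np.column_stack([np.ones(len(y)), x, m, w, m * w])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    b_m, b_mw = coefs[2], coefs[4]
    # conditional indirect effect a * b(w); with a 0/1 moderator, a * b_mw is the
    # index of moderated mediation for this contrast
    return {level: a * (b_m + b_mw * level) for level in (0, 1)}

# Fake data just to show the call pattern: x = condition contrast, m = perceived
# sincerity, w = gender (0/1), y = identity threat (not the study data)
rng = np.random.default_rng(0)
n = 500
x = rng.integers(0, 2, n)
w = rng.integers(0, 2, n)
m = 3 - x + rng.normal(0, 1, n)
y = 2 - 0.5 * m * w + rng.normal(0, 1, n)
print(conditional_indirect_effects(x, m, w, y))
```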
Results from the Conditional Process Analysis
As mentioned above, in the original study, the researchers used multicategorical conditional process analysis to show that:
The perceived sincerity (of a company’s interest in increasing gender diversity) mediated the relationship between actual on-the-ground gender diversity and identity threat concerns – and this mediation relationship was moderated by participant gender.
The perceived sincerity (of a company’s interest in increasing gender diversity) also mediated the relationship between on-the-ground diversity and company interest (measured at the end) – and this mediation relationship was also moderated by participant gender.
Our replication
To replicate this study, we used the same methods described above, and undertook the same analyses as those described above. Many thanks to the original study team for reviewing our replication materials prior to the study being run. As per our pre-registration document, our main aim here was to see if we could reproduce our own version of the original study’s results figures (labeled as Figures 8 and 9 in the original paper), but as we explain later, these were not the only (and arguably were not the most important) results relevant to our replication attempt.
We ran our experiment via GuidedTrack.com and recruited study participants on Positly.com. The original study had a total of 505 U.S. adults (after accounting for exclusions) and our study had a similar total (523 U.S. adults after accounting for exclusions). In both the original and our replication, all participants were either male or female (and approximately 50% were female; those who were non-binary or who did not reveal their gender were excluded).
To experience our replication study as study participants saw it, click here. The images and scenario that you are given will change across multiple repetitions of the preview.
Experimental conditions
As in the original experiment, participants were randomly assigned to one of four conditions, listed below (with a probability of 0.25 of going to any one of the four conditions).
Condition 0 = Authentically Diverse: participants in this condition were…
Shown company site images with a 50:50 gender split (i.e., they saw an equal number of men and women featured on the Harrison Technologies website)
Told that the gender split on the ground on their first day was again 50:50
Condition 1 = Aspirational Diversity: participants in this condition were…
Shown company site images with a 3:1 male:female gender ratio
Told that the gender split on the ground on their first day was 3:1 men:women
Given a statement from top company executives stating that the company isn’t yet where it wants to be in terms of gender diversity, but that it is working toward increasing gender diversity in the future
Condition 2 = Authentic NonDiversity: participants in this condition were…
Shown company site images with a 3:1 male:female gender ratio
Told that the gender split on the ground on their first day was 3:1 men:women
Condition 3 = Counterfeit Diversity: participants in this condition were…
Shown company site images with a 50:50 gender split
Told that the gender split on the ground on their first day was 3:1 men:women
Detailed Results
The results are summarized below, but you can find a more detailed overview in the appendices. The findings that we aimed to reproduce are shown in Figures 8 and 9 in the original paper (copied below).
Figure 8 in the original paper illustrated how identity threat concerns were affected by the different diversity conditions (listed above) and perceived “sincerity” levels (as measured in this survey). Below is a copy of the original figure, with the numbers we derived from our data added in colored (green and dark red) writing beside the original study’s numbers.
Figure 9 in the original paper illustrated how the reported level of interest in a project manager position at a hypothetical tech company was affected by the different diversity conditions (explained above) and perceived “sincerity” levels (as measured in this study). Furthermore, in the original study, the relationship between “sincerity” and the aforementioned interest levels was moderated by gender, but this was not the case in our replication. Below is a copy of the original figure, with the numbers we derived from our data added in colored (green and dark red) writing beside the original study’s numbers.
Across Figures 8 and 9 above, there are a total of 13 significant results (marked with asterisks) along the “a” and “b” paths combined (the c’ path is not the focus here), plus four significant results relating to the effects of gender. This gives a total of 17 significant results in the parts of the diagrams that are most relevant to the authors’ hypotheses. Of these 17 findings, 13 of them (76.5%) were replicated in our study.
The findings from the figures above are described in written form in the appendices.
Indirect Effects Results
One could argue that the results figures (above) do not show the most relevant results. According to the textbook that the authors cite (and that forms the main source of information on this analysis method):
“You absolutely should focus on the signs of a and b when talking about the indirect effect. Just don’t worry so much about p-values for these, because you care about ab, not a and b.”
Depending on how one interprets this, the results recorded in supplementary tables S11 and S12 (in the supplementary material for the original paper) were arguably more important than the results shown in the figures, at least according to the textbook on conditional process analysis quoted above. (It may even be that Figures 8 and 9 could have potentially been relegated to the supplementary materials if needed.)
Indirect effects of experimental conditions on identity threat via “perceived sincerity”
In the original study, among female participants, the authors found significant indirect effects of each of the condition contrasts on identity threat concerns via “perceived sincerity.” We replicated all of those findings except for one: unlike the original study, we found no significant indirect effects of authentic non-diversity (compared to counterfeit diversity) on identity threat concerns via perceived sincerity.
Note that the original authorship team had also observed and reported on differences across studies regarding whether there were differences in the effects of authentic non-diversity compared to counterfeit diversity on identity threat concerns. More specifically, although they found a difference between these conditions in Study 3 (the focus of this replication), in Study 2 of their paper, they had found no such difference. They highlighted this in their paper. In the conclusion of their paper, they wrote: “Consistent with an active, dynamic construal process of situational cues, we found that authentically diverse companies were perceived to be the most sincerely interested in gender diversity, followed by aspirational diversity companies, and then followed by counterfeit diversity and authentic nondiversity companies—which usually did not differ from each other in engendering threat and lowering interest.”
Within female participants, almost all experimental conditions had indirect effects on identity threat concerns via “perceived sincerity”…
| Originally, there were significant indirect effects of… | on… | via… | Did we replicate this finding? |
| --- | --- | --- | --- |
| Aspirational Diversity compared to Authentic Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Authentic Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Counterfeit Diversity compared to Authentic Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Aspirational Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Counterfeit Diversity compared to Aspirational Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Counterfeit Diversity | identity threat concerns | “perceived sincerity” | ❌ |
In the original study, the authors also found that gender significantly moderated the indirect effects of each of the condition contrasts on identity threat concerns via perceived sincerity. We again replicated all of those findings except one: in our data, gender did not significantly moderate the indirect effects of authentic non-diversity (compared to counterfeit diversity) on identity threat concerns via perceived sincerity (which is unsurprising, given these indirect effects weren’t significant in the first place).
Within all participants…gender moderated experimental conditions’ indirect effects on identity threat concerns via “perceived sincerity”…
| Originally, gender moderated the indirect effects of… | on… | via… | Did we replicate this finding? |
| --- | --- | --- | --- |
| Aspirational Diversity compared to Authentic Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Authentic Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Counterfeit Diversity compared to Authentic Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Aspirational Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Counterfeit Diversity compared to Aspirational Diversity | identity threat concerns | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Counterfeit Diversity | identity threat concerns | “perceived sincerity” | ❌ |
Indirect effects of experimental conditions on job interest via “perceived sincerity”
In the original study, there were significant indirect effects of each condition contrast on job interest level at time point 2 via “perceived sincerity” (with job interest at time point 1 included as a covariate in this analysis). We replicated all of these findings, with one exception: unlike the original study, we found no significant indirect effects of authentic non-diversity (compared to counterfeit diversity) on company interest via perceived sincerity. Once again, however, note that the original authorship team had also observed and reported on differences across studies in the effects of authentic non-diversity compared to counterfeit diversity on company interest. As mentioned previously, in their conclusion, they wrote, “counterfeit diversity and authentic nondiversity companies… usually did not differ from each other in engendering threat and lowering interest.”
Within both male and female participants, almost all experimental conditions had indirect effects on interest at time point 2 (with interest at time point 1 entered as a covariate) via “perceived sincerity”…
| Originally, there were significant indirect effects of… | on… | via… | Did we replicate this finding? |
| --- | --- | --- | --- |
| Aspirational Diversity compared to Authentic Diversity | company interest | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Authentic Diversity | company interest | “perceived sincerity” | ✅ |
| Counterfeit Diversity compared to Authentic Diversity | company interest | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Aspirational Diversity | company interest | “perceived sincerity” | ✅ |
| Counterfeit Diversity compared to Aspirational Diversity | company interest | “perceived sincerity” | ✅ |
| Authentic Non-Diversity compared to Counterfeit Diversity | company interest | “perceived sincerity” | ❌ |
In the original study, the authors also found that gender significantly moderated the indirect effects of each of the condition contrasts on job interest at time point 2 via perceived sincerity. Unlike the original study, we found no evidence of gender moderating the indirect effects of diversity condition on company interest via sincerity perceptions (i.e., men and women did not differ in the degree to which the impact of diversity condition on company interest was mediated by “perceived sincerity” – the index of moderated mediation was not different from zero).
Within all participants, in the original study, gender moderated the experimental conditions’ indirect effects on interest at time point 2 (with interest at time point 1 entered as a covariate) via “perceived sincerity” – but in our replication, we found no such moderation by participant gender…

| Originally, gender moderated the indirect effects of… | on… | via… | Did we replicate this finding? |
| --- | --- | --- | --- |
| Aspirational Diversity compared to Authentic Diversity | company interest | “perceived sincerity” | ❌ |
| Authentic Non-Diversity compared to Authentic Diversity | company interest | “perceived sincerity” | ❌ |
| Counterfeit Diversity compared to Authentic Diversity | company interest | “perceived sincerity” | ❌ |
| Authentic Non-Diversity compared to Aspirational Diversity | company interest | “perceived sincerity” | ❌ |
| Counterfeit Diversity compared to Aspirational Diversity | company interest | “perceived sincerity” | ❌ |
| Authentic Non-Diversity compared to Counterfeit Diversity | company interest | “perceived sincerity” | ❌ |
Note that, in their original paper, the authorship team had highlighted how experiments 1 and 2 had not shown that gender moderated the indirect effects of diversity condition on company interest via perceived sincerity, while experiments 3 and 4 had. In their correspondence with us prior to data collection, the original authorship team again flagged this discrepancy between studies with us, and had correctly predicted that this moderation relationship might be less likely to replicate than others.
Summary of additional analyses
Manipulation check
As planned in our pre-registration, we also conducted a manipulation check (a repeated-measures two-way analysis of variance [ANOVA] examining the effects of diversity condition and time point on the diversityExpectation variable), the results of which were significant (consistent with the manipulation having been successful) – see the appendices for details. We note that, in both the original dataset and in ours, the diversityExpectation variable had kurtosis values exceeding 1 in magnitude. Because non-normally distributed data presents a problem for ANOVAs, the original study authors had stated in their pre-registration that skew or kurtosis values exceeding 1 in magnitude would lead them to transform the variable before conducting analyses, but they do not appear to have done that in their final paper.
Correlation between “perceived sincerity” and identity threat among women
As additional analyses outside of our replication, we also showed that, among women, “perceived sincerity” (with respect to interest in increasing gender diversity) was statistically significantly negatively correlated with identity threat concerns (Pearson’s r = -0.65, p = 1.78E-32).
Correlation between “perceived sincerity” and company interest
We also found that there was a statistically significant positive correlation between “perceived sincerity” (with respect to interest in increasing gender diversity) and interest in working for the company at the second time point, for both men (Pearson’s r = 0.51 , p = 1.4E-18) and women (Pearson’s r = 0.57, p = 7.2E-24).
We also conducted exploratory analyses – see the appendices for details of the additional analyses we conducted.
Interpreting the Results
The methods and results were explained quite transparently, but there would still be room for readers to misinterpret certain things. Areas where possible misinterpretations could arise are briefly described under headings below.
Interpretation of the study methods
Although the authors list their study methods and cite where further information can be found, readers would need to consult those external information sources in order to be sure to understand what the results are showing. The method chosen – conditional process analysis – is described in only a few places outside of the definitive textbook on the topic, which may limit the accessibility of methodological explanations for many readers. (In fact, this textbook, now in its third edition, appears to us to be the only definitive textbook describing the analysis method employed in this study. We were fortunate to have library access to the textbook to refer to for our study, but many potential readers would not have this.)
We acknowledge that it is common practice for authors to mention (or to very briefly describe) an analysis method without fully explaining it and/or by referring readers to other sources. However, we do think this point is nevertheless worth mentioning because it leaves room for readers to be more likely to misinterpret the findings of the study.
Interpretation of the relative importance of different results
The only results figures for this study were Figures 8 and 9, which were shown earlier. However, as discussed above, according to the textbook on conditional process analysis, the combined indirect effect size (ab) is more important than the individual effect sizes along the a and b paths. So, in order to stay aligned with the recommendations of the creator of the analysis method they used, it might have been advisable to display those results figures in the supplementary materials rather than in the main body of the text. Placing them in the main body may lead readers to believe that those findings are among the most important ones of the study.
Interpretation of the “sincerity” variable
It could be argued that the “sincerity” variable could have been labeled more precisely. If a reader were only to read the abstract or were only to read Figures 8 and 9, for example, they may not realize that “sincerity” was not referring to the perceived sincerity of the company in general, but was instead referring to the average of the responses to two questions that both related to the company’s sincere interest in increasing gender diversity.
Sincerity, broadly construed, would not usually be assumed to mean sincere interest in increasing gender diversity. Consequently, some readers may be at risk of misinterpreting the mediation variable due to the broad label given to the “sincerity” variable. It would be unfortunate if some readers incorrectly interpreted the current study’s findings as being related to a more broadly-construed concept of sincerity (as opposed to the concept of perceived sincerity as defined in this particular paper).
Interpretation of “gender diversity”
Readers may infer that participants in the study knew what was meant by “increasing gender diversity” by the time they were asked how sincerely interested the company was in doing this, but this is debatable. Participants may have inferred the meaning of this term from the context in which they were reading it, but if they did not, some may have wondered whether “gender diversity” was referring to a diverse range of different genders in the company, including non-binary genders (which is recognized elsewhere as a valid, though less common, interpretation of the phrase).
Such an interpretation might give a (very) tentative explanation as to why male participants also appeared to report slightly more identity threat concerns in the less diverse workplace scenarios than in the diverse workplace scenarios (rather than only female participants exhibiting this). Perhaps some participants assumed (based on certain stereotypes surrounding ideas of “bro”/“dude culture”) that a workplace with predominantly men would be less understanding of different sexualities, and/or of non-conformity to traditional gender norms (and with respect to this latter point, it may be worth noting that even those who do identify as either male or female may not conform to traditional expectations in some ways).
Apart from those in the “Aspirational Diversity” condition, who were told that there was a statement from top company executives “about gender diversity” and were then given a statement that talked about increasing the representation of women at the company, no other arms of the experiment mentioned “gender diversity” until the questions were shown asking about the company’s sincere interest in increasing it. (This may not be a problem, but would complicate the interpretation of results if it turned out that participants in the other experiment arms did not know what was meant by “gender diversity.”)
Interpretation of “increasing gender diversity”
To gauge “sincerity,” participants were asked the degree to which Harrison Technologies is sincerely interested in increasing gender diversity in their workforce. Even if we assume that all participants had the same understanding of the phrase “gender diversity” (discussed above), the meaning of the phrase “increasing gender diversity” still leaves room for multiple possible interpretations. It could be argued that a workplace that already demonstrates a 50:50 gender split (within the subset of people who identify as having either one of the binary genders) cannot be more diverse than it already is (since any change in the proportion of men and women would then be making the workplace either predominantly male or predominantly female, and neither of those outcomes would be more “gender diverse” than a 50:50 split). This makes it difficult to interpret the meaning of “increasing gender diversity.”
As alluded to earlier, other participants might have been imagining that “increasing gender diversity” would involve increasing the proportion of people in the workplace who identify as neither male nor female. If that was the interpretation, participants’ responses may have been shaped not only by the balance of men and women at the company, but also by whether they believed that balance gave a clue as to whether the workplace would try to hire more people who identify as neither male nor female.
This potential for different interpretations on the part of participants also translates into a potential interpretation difficulty for readers of this study. If participants in some conditions had varying ideas of what a sincere interest in “increasing gender diversity” entailed, then readers of this study would need to interpret the results accordingly. More specifically, if participants interpreted the idea of “increasing gender diversity” differently from how it was intended, this would complicate our interpretation of all of the mediation relationships found in this study.
Conclusion
We randomly selected a paper published in JPSP in March, and within that paper we focused on Study 3 because its findings appeared to be non-obvious in addition to being key to the authors’ overall conclusions. The study was described transparently in the paper and was feasible for us to accurately replicate using only publicly-available materials. This is a testament to the open science practices of the authors and JPSP. There were some minor points which required clarification prior to us running our replication study, and the authors were very helpful in answering our questions and ensuring that our study was a faithful replication of theirs. Our replication study had findings that were mostly consistent with the original study. One interesting difference was that, in our study, the indirect effects of diversity condition on company interest via “perceived sincerity” were not moderated by participant gender (unlike in the original study).
Notwithstanding the transparency and replicability of the original study, there were several aspects of the write-up that could have increased the probability that readers would misinterpret what was being shown or said. The main aspects we identified as potentially problematic were as follows:
The analysis methods were clearly identified in the paper, but they were not explained. Instead, the authors referred readers to a textbook, which we later found out is the only definitive resource on the analysis method employed in this study.
We acknowledge that it is common practice for authors to mention an analysis method without fully explaining it, instead referring readers to other sources. However, we think it is worth mentioning because it leaves more room for readers to misinterpret the findings of the study.
Several terms were used when describing results that could reasonably be interpreted as meaning something different to what they actually meant, and readers would only have identified this problem if they had read the scales used (by referring to the supplementary materials).
If participants in the study had understood the idea of “increasing gender diversity” in a different way to how it was intended to be understood, this would complicate our interpretation of all of the mediation relationships found in this study.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Author Acknowledgements
The original study team gave us prompt and helpful feedback which greatly improved the quality of our replication. The authors also provided helpful feedback on an earlier draft of this report. (However, the responsibility for the contents of the report remains with the author and the rest of the Transparent Replications team.)
Many thanks also go to Spencer Greenberg and Amanda Metskas for their feedback throughout the study process and for their input on earlier drafts of this report. Thank you to Nikos Bosse for helping to post about the study on Metaculus, and to our Ethics Evaluator for their time reviewing the study before we ran it. Last but certainly not least, many thanks to our participants for their time and attention.
Appendices
Additional Information about the Ratings
Expanding on the Transparency Ratings
1. Methods transparency (5 stars):
1-a: The methods and publicly-accessible materials described the administration of the study in enough detail for us to be able to replicate the original study accurately. Consequently, we gave the highest possible rating for this sub-criterion.
1-b: The scales used were publicly available and were easy to find within the OSF materials. Consequently, we gave the highest possible rating for this sub-criterion.
2. Analysis Transparency (4.5 stars):
The authors were very transparent about the analysis methods they used and readily communicated with us about them in response to our questions. They lost half a star because one of the SPSS files on the OSF site for the project listed an incorrect model number which would have resulted in different results to those shown in the manuscript. However, this was considered to be a relatively minor oversight – it was easy for us to find because the model had been recorded accurately in the body of the paper.
3. Data availability (5 stars):
All data were publicly available and were easy to find on the OSF project site. Consequently, we gave the highest possible rating for this criterion.
4. Pre-registration (2.5 stars):
In summary, the authors pre-registered the study, but there were some deviations from this pre-registration, as well as a set of analyses (that formed the main focus of the discussion and conclusions for this study) that were not mentioned in the pre-registration.
In the body of the paper, the “identity threat composite” score was calculated differently to how it had been planned in the pre-registration, but this deviation was acknowledged in a footnote, and the pre-registered version of the score was still calculated in the supplementary materials.
However, there were also deviations that were not acknowledged in the paper:
In the paper, perceptions of company sincerity were measured by averaging two items together: “To what extent do you think Harrison Technologies is sincerely interested in increasing gender diversity in their workforce?” and “How believable is Harrison Technologies’ interest in increasing gender diversity in their workforce?”
In the pre-registration, the plan had been to average the response to three questions instead of two (the third one being: “How committed do you think the company is to increasing gender diversity in their workforce?”) but this was not mentioned in the body of the paper.
In the paper, multicategorical conditional process analysis was the main analysis method chosen and reported upon for Study 3; its findings formed the basis of the discussion and conclusions for this study.
In the pre-registration, however, multicategorical conditional process analysis was not mentioned in either the main analysis section or the exploratory analysis section.
In the pre-registration, the "main analyses" that had been planned were a series of repeated-measures two-way ANOVAs. These were replaced with conditional process analysis in the final paper, but this decision was not explicitly mentioned or explained in the paper.
The manipulation check that was reported upon in the paper was listed as one of these two-way ANOVAs. However, they had also listed the following point about their ANOVAs in their pre-registration (but did not report acting on this in their paper):
“If our data are non-normally distributed, we will conduct either a square-root, logarithmic, or inverse transformation—depending on the severity of the non-normality. If these transformations do not improve normality, we will use equivalent tests that do not require data to be normally distributed.”
Explanations of statistical terms
The analysis conducted in the paper was a multicategorical conditional process analysis. This glossary is designed to help you navigate the explanations in the event that there are any terms that are unfamiliar to you.
Glossary of terms
Please skip this section if you are already familiar with the terms. If this is the first time you are reading about any of these concepts, please note that the definitions given are (sometimes over-)simplified.
Independent variable (a.k.a. predictor variable): a variable in an experiment or study that is altered or measured, and which affects other (dependent) variables. [In many studies, including this one, we don’t know whether an independent variable is actually influencing the dependent variables, so calling it a “predictor” variable may not be warranted, but many models implicitly assume that this is the case. The term “predictor” variable is used here because it may be more familiar to readers.]
Dependent variable (a.k.a. outcome variable): a variable that is influenced by an independent variable. [In many studies, including this one, we don’t know whether a dependent variable is actually being causally influenced by the independent variables, but many models implicitly assume that this is the case.]
Null Hypothesis: in studies investigating the possibility of a relationship between given pairs/sets of variables, the Null Hypothesis assumes that there is no relationship between those variables.
P-values: the p-value of a result quantifies the probability that a result at least as extreme as that result would have been observed if the Null Hypothesis were true. All p-values fall in the range (0, 1].
Statistical significance: by convention, a result is deemed to be statistically significant if the p-value is below 0.05, meaning that there would have been less than a 5% chance of observing a result at least as extreme if the Null Hypothesis were true.
The more statistical tests conducted in a particular study, the more likely it is that some results will be statistically significant due to chance. So, when multiple statistical tests are performed in the same study, many argue that one should correct for multiple comparisons.
Statistical significance also does not necessarily translate into real-world/clinical/practical significance – to evaluate that, you need to know about the effect size as well.
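To illustrate the multiple-comparisons point above, here is a minimal Python sketch (not part of the original study; the p-values are made up) showing how a set of p-values can be adjusted using standard corrections from the statsmodels library:

```python
# Illustrative sketch: correcting hypothetical p-values for multiple comparisons.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.030, 0.049, 0.20]  # hypothetical p-values

for method in ("bonferroni", "holm"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, list(zip([round(p, 3) for p in p_adjusted], list(reject))))
```

Note how some raw p-values just below 0.05 are no longer significant after correction, which is exactly the concern raised above.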
Linear regression: this is a process for predicting levels of a dependent/outcome variable (often called a y variable) based on different levels of an independent/predictor variable (often called an x variable), using an equation of the form y = mx + c (where m is the rate at which the dependent/outcome variable changes as a function of changes in the independent/predictor variable, and c describes the level of the dependent variable that would be expected if the independent/predictor variable, x, was set to a level of 0).
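As a concrete (and purely illustrative) example of the y = mx + c idea, the following sketch fits a simple linear regression to made-up data using SciPy:

```python
# Minimal sketch of a simple linear regression (y = mx + c); the data are made up.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)   # independent/predictor variable
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])    # dependent/outcome variable

result = stats.linregress(x, y)
print(f"slope m = {result.slope:.2f}")           # change in y per unit change in x
print(f"intercept c = {result.intercept:.2f}")   # expected y when x = 0
print(f"p-value for slope = {result.pvalue:.4f}")
```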
Mediator variable: a variable which (at least partly) explains the relationship between a predictor variable and an outcome variable. [Definitions of mediation vary, but Andrew Hayes defines it as occurring any time an indirect effect – i.e., the effect of a predictor variable on the outcome variable via the mediator variable – is statistically significantly different from zero.]
Moderator variable: a variable which changes the strength or direction of a relationship between a predictor variable and an outcome variable.
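To make the idea of a mediator and an indirect effect more concrete, here is a hedged sketch (using simulated data, not the study's data) that estimates the indirect effect a×b in a simple X → M → Y mediation model and bootstraps a confidence interval for it – a simplified analogue of the bootstrapping approach Hayes describes:

```python
# Hedged sketch: indirect effect a*b in a simple mediation model, with a
# bootstrap confidence interval. Data are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)                       # predictor
m = 0.5 * x + rng.normal(size=n)             # mediator (true path a ~ 0.5)
y = 0.4 * m + 0.1 * x + rng.normal(size=n)   # outcome (true path b ~ 0.4)

def indirect_effect(x, m, y):
    # Path a: regress M on X; path b: regress Y on M, controlling for X.
    a = np.polyfit(x, m, 1)[0]
    design = np.column_stack([np.ones_like(x), x, m])
    b = np.linalg.lstsq(design, y, rcond=None)[0][2]
    return a * b

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)              # resample participants with replacement
    boot.append(indirect_effect(x[idx], m[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect ab = {indirect_effect(x, m, y):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```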
Categorical variables: these are variables described in terms of categories (as opposed to being described in terms of a continuous scale).
Reference category for multicategorical x variables in regressions: this is the category against which the effects of other categories are compared. The reference category is not included as one of the predictor variables – instead, all the other categories are included as predictor variables (and their effects are compared against the one that is left out).
In order to model the effects of a categorical variable on an outcome variable, you need something to compare each category to. When there are only two mutually exclusive categories (i.e., when you are working with a dichotomous predictor variable), this is relatively easy – you just model the effects of one category in comparison to the absence of that category (which equates to comparing one category to the other). The category you are comparing against is called the reference category. If you want to model the effects of the category you used as the reference, you just switch things around so that the other category becomes the reference category.
For categorical variables with more than two categories (e.g., let's say you have three categories, called I, II, and III), you end up needing to run multiple regressions before you can quantify the effects of all of the categories in comparison to all the others. You first choose one category as the reference or comparison category (e.g., category I), then you can quantify the effects of the others in comparison to that reference category (e.g., the effects of categories II and III in comparison to category I). To quantify all the effects of all the categories, you then need to switch things around so that you also run regressions with each other category (in turn) as the reference (e.g., quantifying the effects of categories I and III with category II as the reference, then quantifying the effects of categories I and II with category III as the reference).
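The following hedged sketch (with simulated data; the condition names echo this study, but this is not the authors' code) shows how a reference category is specified in a regression and how switching the reference yields the remaining contrasts:

```python
# Hedged sketch: entering a multicategorical predictor with a chosen reference
# category, then switching the reference to obtain the remaining contrasts.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
conditions = ["authentic_diversity", "aspirational_diversity",
              "counterfeit_diversity", "authentic_nondiversity"]
df = pd.DataFrame({"condition": rng.choice(conditions, size=400)})
# Simulated outcome: slightly higher "sincerity" in the authentic diversity condition.
df["sincerity"] = rng.normal(size=len(df)) + df["condition"].eq("authentic_diversity")

# Model 1: authentic diversity as the reference category.
m1 = smf.ols("sincerity ~ C(condition, Treatment(reference='authentic_diversity'))",
             data=df).fit()
print(m1.params)  # each coefficient is a contrast vs. authentic diversity

# Model 2: switch the reference to counterfeit diversity to obtain the remaining
# contrasts (e.g., authentic non-diversity vs. counterfeit diversity).
m2 = smf.ols("sincerity ~ C(condition, Treatment(reference='counterfeit_diversity'))",
             data=df).fit()
print(m2.params)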
Additional Information about the Results
Figures 8 and 9 from the original study described in sentences
Here is a list of the original study’s significant results in Figure 8 above, but this time in word format:
If we use a linear regression to predict perceived sincerity (the y variable) using three categorical x variables (all with authentic diversity set as the reference category) – aspirational diversity, authentic non-diversity, and counterfeit diversity – then…
…the coefficient of the aspirational diversity (versus authentic diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -0.74.
…the coefficient of the authentic non-diversity (versus authentic diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -1.56.
…the coefficient of the counterfeit diversity (versus authentic diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -1.30.
If we use a linear regression to predict perceived sincerity (the y variable) using three categorical x variables (all with aspirational diversity set as the reference category) – authentic diversity, authentic non-diversity, and counterfeit diversity – then…
…the coefficient of the authentic non-diversity (versus aspirational diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -0.82.
…the coefficient of the counterfeit diversity (versus aspirational diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -0.56.
If we use a linear regression to predict perceived sincerity (the y variable) using three categorical x variables (all with counterfeit diversity set as the reference category) – aspirational diversity, authentic non-diversity, and authentic diversity – then…
…the coefficient of the authentic non-diversity (versus counterfeit diversity) contrast is statistically significantly different from 0 (p <0.05), at a value of -0.26.
If we use a linear regression to predict identity threat concerns (the y variable) using perceived sincerity as one of the predictors, gender as another predictor, the interaction between gender and sincerity as another predictor, and three categorical x variables (all with authentic diversity set as the reference category) – aspirational diversity, authentic non-diversity, and counterfeit diversity – as categorical predictors, then…
…the coefficient of the aspirational diversity (versus authentic diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of 0.45.
…the coefficient of gender (with female gender coded as 1) is statistically significantly different from 0 (p <0.001), at a value of -2.09.
…the coefficient of the gender by sincerity interaction is statistically significantly different from 0 (p <0.001), at a value of -0.51.
If we use a linear regression to predict identity threat concerns (the y variable) using perceived sincerity as one of the predictors, gender as another predictor, the interaction between gender and sincerity as another predictor, and three categorical x variables (all with aspirational diversity set as the reference category) – authentic diversity, authentic non-diversity, and counterfeit diversity – as categorical predictors, then…
…the coefficient of the authentic non-diversity (versus aspirational diversity) contrast is statistically significantly different from 0 (p <0.05), at a value of -0.27.
…the coefficient of the counterfeit diversity (versus aspirational diversity) contrast is statistically significantly different from 0 (p <0.01), at a value of -0.32.
Here is a list of the original study’s significant results in Figure 9 above, but this time in word format:
If we use a linear regression to predict perceived sincerity (the y variable) using three categorical x variables (all with authentic diversity set as the reference category) – aspirational diversity, authentic non-diversity, and counterfeit diversity – and using baseline interest level as a covariate – then…
…the coefficient of the aspirational diversity (versus authentic diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -0.77.
…the coefficient of the authentic non-diversity (versus authentic diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -1.57.
…the coefficient of the counterfeit diversity (versus authentic diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -1.32.
If we use a linear regression to predict perceived sincerity (the y variable) using three categorical x variables (all with aspirational diversity set as the reference category) – authentic diversity, authentic non-diversity, and counterfeit diversity – then…
…the coefficient of the authentic non-diversity (versus aspirational diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -0.80.
…the coefficient of the counterfeit diversity (versus aspirational diversity) contrast is statistically significantly different from 0 (p <0.001), at a value of -0.55.
If we use a linear regression to predict perceived sincerity (the y variable) using three categorical x variables (all with counterfeit diversity set as the reference category) – aspirational diversity, authentic non-diversity, and authentic diversity – then…
…the coefficient of the authentic non-diversity (versus counterfeit diversity) contrast is statistically significantly different from 0 (p <0.05), at a value of -0.24.
If we use a linear regression to predict company interest (the y variable) using perceived sincerity as one of the predictors, gender as another predictor, the interaction between gender and sincerity as another predictor, and three categorical x variables (all with authentic diversity set as the reference category) – aspirational diversity, authentic non-diversity, and counterfeit diversity – as categorical predictors, and using baseline interest level as a covariate, then…
…the coefficient of perceived sincerity is statistically significantly different from 0 (p <0.001), at a value of 0.25.
…the coefficient of gender (with female gender coded as 1) is statistically significantly different from 0 (p <0.01), at a value of -0.61.
…the coefficient of the gender by sincerity interaction is statistically significantly different from 0 (p <0.05), at a value of 0.17.
How we defined the “percentage of findings that replicated” in this study
Our current policy for calculating the percentage of findings that replicate in a given study is as follows. (This policy may change over time, but the policy below is what applied when we replicated this particular study.)
We currently limit ourselves to the findings that are reported upon in the body of a paper. (In other words, we do not base our calculations on supplementary or other findings that aren’t recorded in the body of the paper.)
Within the findings in the paper, we select the ones that were presented by the authors as being key results of the study that we are replicating.
If there is a key results table or figure, we include that in the set of main results to consider.
If a manipulation check is included in the study results, we also conduct it, but we do not count it toward the denominator of "total number of findings" when calculating the percentage of findings that replicate.
We pre-register the set of hypotheses that we consider to be the “main” ones we are testing.
Within the set of findings that we focus on, we only count the ones that were reported to be statistically significant in the original paper. That is, we do not count a null result in the original paper as a finding that contributes to the denominator (when calculating the percentage that replicate).
In this paper, we were focusing on Study 3, and the main findings for that study (as presented in the body of the paper) are shown in Figures 8 and 9. Other findings are also recorded, but these related to the manipulation check and so were only pre-registered as secondary analyses and were not the main focus of our analyses (nor did they contribute to the denominator when calculating the percentage of findings that replicated).
Within Figures 8 and 9, we focused on paths a and b, plus the gender-related interaction terms, as these were most relevant to the authors' hypotheses. However, we did not count findings that were non-significant in the original study; instead we focused on the significant findings among the a and b paths and gender effects. There were a total of 17 such significant results (along the a and b paths and gender effects, across Figures 8 and 9).
A possible problem with how we’re calculating the replication rate in this paper
We are continuing to follow our pre-registered plan, but it seems worth highlighting a potential problem with this in the case of this particular study (also noted in the body of our write-up). According to the textbook that the authors cite (and that forms the main source of information on this analysis method):
“You absolutely should focus on the signs of a and b when talking about the indirect effect. Just don’t worry so much about p-values for these, because you care about ab, not a and b.”
Depending on how one interprets this, it may be that the supplementary tables of indirect effects and indices of moderated mediation would have been well-placed in the main body of the paper (with Figures 8 and 9 being potentially relegated to the supplementary materials if needed).
We may have done some things differently if we weren’t aiming for an exact replication of findings reported in the body of the paper
As noted above, we probably would have reported on the main results differently, relegating Figures 8 and 9 to the supplementary materials. In addition, we probably would not have run the ANOVAs given the non-normally distributed data we observed (unless we had first applied some kind of transformation to the data).
Conditional process analysis results in more detail
Reproduction of Figure 8 – with commentary
Below is Figure 8 from the original paper, with our findings written in dark green and red font alongside the original findings.
There are a few different ways to quantify the replication rate of the findings in this paper. As explained above, we have chosen to focus on the findings in the diagram that were most relevant to the original authors’ hypotheses and that were significant in the original paper. This translated into counting the significant findings in the diagram except for the c’ paths (which were not as relevant to the hypotheses the authors were investigating, given that they were using Hayes’ conditional process analysis to investigate them – Hayes explicitly states in his textbook that mediation can be said to be occurring even if the c’ path is significant). Of the eight significant results (excluding the c’ paths) in this diagram, seven of them replicated in our study (87.5%). Here are some of the other ways we could quantify the replication rate:
Out of the 15 numbers here, the number that successfully replicated (in the sense that our result matched the direction and significance [or non-significance] of their original finding) was 13 (~87%). (There was one finding they had that was significant which didn’t replicate, and one finding they had that was non-significant which was significant in ours – these are shown as dark red numbers in the image below.)
If we ignore the b path (which was non-significant in the first instance and then significant in our replication), of the 14 remaining numbers in the diagram, 13 of them replicated (~93%).
Reproduction of Figure 9 – with commentary
Below is Figure 9 from the original paper, with our findings written in dark green and red font alongside the original findings. Of the nine significant results (excluding the c’ paths) in this diagram, six of them replicated in our study (66.7%). The differences in findings were as follows:
In our study, it appears the effects of counterfeit diversity and authentic non-diversity were very similar to each other (whereas in the original, authentic non-diversity had appeared to be perceived as less sincerely interested in increasing gender diversity than counterfeit diversity).
We found no evidence of gender influencing company interest or interacting with perceived sincerity.
Here is another way we could quantify the replication rate:
Out of the 15 numbers here (including the c’ paths), the number that successfully replicated (in the sense that our result matched the direction and significance [or non-significance] of their original finding) was 12 (80%).
Unlike the original study, we found no significant indirect effects of authentic non-diversity (compared to counterfeit diversity) on identity threat concerns via perceived sincerity. Other findings, however, were successfully replicated in our study.
Indirect effects for female participants – from the original study and our replication
In the table below, findings that we replicated are displayed in green font, and the finding that we did not replicate is displayed in dark red font.
Indirect effects for male participants – from the original study and our replication
In the table below, the (null) finding that we replicated is displayed in green font, and the findings that were non-significant in the original study, but significant in our study, are displayed in dark orange font.
Index of moderated mediation
Unlike the original study, we found that gender did not appear to significantly moderate the indirect effects of authentic non-diversity (compared to counterfeit diversity) on identity threat concerns via perceived sincerity (which is unsurprising, given these indirect effects weren’t significant in the first place, as shown in the previous table). That non-replicated finding is displayed in dark red font. Other findings, however (shown in green font), were successfully replicated in our study.
Unlike the original study, we found no significant indirect effects of authentic non-diversity (compared to counterfeit diversity) on company interest via perceived sincerity. Other findings, however, were successfully replicated in our study.
Indirect effects for female participants – from the original study and our replication
Indirect effects for male participants – from the original study and our replication
Index of moderated mediation
Unlike the original study, we found no evidence of gender moderating the indirect effects of diversity condition on company interest via sincerity perceptions (i.e., men and women did not differ in the degree to which the impact of diversity condition on company interest was mediated by perceived sincerity – the index of moderated mediation was not different from zero).
For comparison: Original S12
Manipulation check details
As per our pre-registered plan, a two-way analysis of variance (ANOVA) was performed to assess the effects of diversity condition and time point on diversity expectations. This was performed in JASP (which is worth noting because results can differ between JASP and SPSS: JASP runs on R code, and ANOVAs have been observed to return different results in R versus SPSS, at least in previous years).
Note that the data are kurtotic in both the original data set (-1.25) and in our replication (-1.27). The study authors had originally planned to do a transformation on the dataset if this occurred, but they did not report having done so in their paper. We aimed to replicate their methods exactly, and had pre-registered our intention to do this manipulation check in the way outlined above, so we did not perform a transformation on the dataset either.
We found that the kurtosis of the diversityExpectation variable (with time point 1 and time point 2 pooled) was -1.25 (standard error: 0.15) for the original dataset. This was also evident on visual inspection of the original diversityExpectation data (pooled across time points 1 and 2), which are clearly not normally distributed, as shown below. (Confirming our visual observations, the Shapiro-Wilk test was significant; p = 9.02E-23.)
Similar to the original data, our diversityExpectation data was also kurtotic (kurtosis -1.27 [standard error: 0.15]). However, we still employed this method because (1) this is what we pre-registered and (2) we were aiming to replicate the methods of the original study exactly.
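For readers who want to run similar checks themselves, here is a minimal sketch (using simulated ratings rather than the real data) of computing excess kurtosis and a Shapiro-Wilk normality test in Python:

```python
# Minimal sketch of the normality checks described above, on simulated 1-7 ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ratings = rng.integers(1, 8, size=500).astype(float)  # hypothetical 1-7 ratings

print("excess kurtosis:", stats.kurtosis(ratings))    # 0 for a normal distribution
w, p = stats.shapiro(ratings)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.2e}")        # small p => non-normal
```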
Like in the original study, the results were significant (i.e., were consistent with the manipulation check having worked). More specifically, in a repeated measures ANOVA, with diversity condition and gender as between-subjects factors and diversityExpectation as the repeated-measures factor, there were statistically significant main effects of time (F(1,515) = 784.39, p = 1.44E-105) and diversity condition (F(3,515) = 357.78, p = 3.06E-129), as well as a significant interaction between time and diversity condition (F(3,515) = 299.72, p = 1.57E-112).
Additional Analyses
The first two analyses discussed below were also included in the body of the text, but are included again here for those who want to review all the additional, non-replication-related analyses in one place.
Pre-registered additional analyses
Correlation between “perceived sincerity” and identity threat among women
As additional analyses outside of our replication, we also showed that, among women, “perceived sincerity” (with respect to interest in increasing gender diversity) was statistically significantly negatively correlated with identity threat concerns (Pearson’s r = -0.65, p = 1.78E-32). (This was also the case for men, but we did not pre-register this, and the identity threat concerns among men were lower across all conditions than they were for women.)
Correlation between “perceived sincerity” and company interest
We also found that there was a statistically significant positive correlation between “perceived sincerity” (with respect to interest in increasing gender diversity) and interest in working for the company at the second time point, for both men (Pearson’s r = 0.51 , p = 1.4E-18) and women (Pearson’s r = 0.57, p = 7.2E-24).
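As a small illustration of the kind of correlation reported above, the following sketch (with simulated ratings, not our data) computes a Pearson correlation and its p-value:

```python
# Illustrative sketch of a Pearson correlation between two simulated ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sincerity = rng.normal(3, 1, 250)                    # hypothetical sincerity ratings
interest = 0.5 * sincerity + rng.normal(0, 1, 250)   # hypothetical interest ratings

r, p = stats.pearsonr(sincerity, interest)
print(f"Pearson r = {r:.2f}, p = {p:.1e}")
```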
Comment on correlations
One might argue that including the correlations above could have highlighted some of the conclusions described in the paper in a way that would have been more accessible and intuitive for a wider range of audiences. However, these simpler analyses do not show that "perceived sincerity" mediated the relationship between experimental conditions and the two main outcome variables, so they would have been insufficient on their own for demonstrating the findings from this paper.
Exploratory additional analyses
As we mentioned in our pre-registration, we also planned to conduct exploratory analyses. Our exploratory analyses are reported upon below. For anyone reading this, if you have any suggestions for additional exploratory analyses for us to run, please let us know.
For a company with a 3:1 male:female staff ratio, it probably doesn't actually harm "perceived sincerity" to misrepresent gender diversity in company ads (compared to showing ads with a 3:1 ratio and saying nothing about it), but it appears even better to show ads with a 3:1 ratio and to follow up with a statement about diversity (as in the "aspirational diversity" condition in this experiment)
Comparing “perceived sincerity” between counterfeit diversity and authentic non-diversity
You might ask, if your company has a 3:1 ratio of men to women, is it worse (in terms of the "perceived sincerity" outcome of this experiment) to present your ads with a 50:50 gender split, compared to just showing ads with a 3:1 ratio (i.e., is it worse to make it look like you're more diverse than you are, rather than just showing things as they are, if your goal is to convince the audience that you are sincerely interested in increasing gender diversity in your workplace)? The answer appears to be no, at least according to this Mann-Whitney U test (which we performed instead of a Student's t-test because the data were non-normally distributed). The mean "perceived sincerity" in the counterfeit diversity condition (2.22) was no different from the mean in the authentically non-diverse condition (2.22; Mann-Whitney U = 8380, n1 = 143, n2 = 119, p = 0.83).
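Here is a minimal sketch of this kind of comparison (with simulated ratings in place of the real "perceived sincerity" scores) using SciPy's Mann-Whitney U test:

```python
# Hedged sketch of a Mann-Whitney U comparison between two conditions;
# the ratings are simulated, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
counterfeit = rng.integers(1, 8, size=143).astype(float)            # hypothetical ratings
authentic_nondiverse = rng.integers(1, 8, size=119).astype(float)   # hypothetical ratings

u, p = stats.mannwhitneyu(counterfeit, authentic_nondiverse, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.3f}")
```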
Comparing “perceived sincerity” between counterfeit diversity and aspirational diversity
You might ask, if your company has a 3:1 ratio of men to women, can you get better results (in terms of the “perceived sincerity” outcome of this experiment) by showing ads with a 3:1 ratio if you also include a statement about the importance of increasing gender diversity, compared to if you showed ads with 50:50 split without addressing the gender ratio? In other words, if your goal is to convince the audience that you are sincerely interested in increasing gender diversity in your workplace, would you be better off presenting things as they are plus writing a statement about your intentions to improve gender diversity (as opposed to being better off presenting ads with a 50:50 gender split)? The answer here appears to be yes – it seems to be better to present things as they are while highlighting how important it is to the company executives to improve the company’s gender diversity (at least compared to simply showing a 50:50 image split without any accompanying statement about diversity). The mean “perceived sincerity” for the aspirational diversity condition (3.04) was significantly greater than the mean for the counterfeit diversity condition (2.22; Mann-Whitney U = 11369, n1 = 107, n2 = 143, p = 1.78E-11).
Comments on "perceived sincerity" in the conditions above
Taking the above two results together, if someone was trying to design promotional materials for a tech company with a 3:1 male:female staff ratio, and if their goal was to convince their audience that their workplace was sincerely interested in increasing gender diversity, presenting images with a 50:50 split appears to be no better (and no worse) than presenting things as they are; the better option (with respect to their stated goal) would be to include a realistic 3:1 split in the images but to also present the audience with a statement from company executives explaining that they recognize the problem and that they aspire to increase gender diversity.
For companies with a 3:1 male:female staff ratio, it probably doesn’t cause more identity threat concerns among women if they misrepresent gender diversity in company ads – there are likely going to be similar levels of identity threat concerns in that scenario compared to the other two tested presentations in this experiment
Comparing identity threat concerns between counterfeit diversity and authentic non-diversity – for women participants
You might ask, if your company has a 3:1 ratio of men to women, is it worse (in terms of the identity threats reported by women in this experiment after day 1 at work) to present your ads with a 50:50 gender split, compared to just showing ads with a 3:1 ratio? In other words, is it worse to make it look like you’re more diverse than you are, rather than just showing things as they are, if your goal is to minimize identity threat concerns experienced by women after day 1 at your workplace? The answer appears to be no. The mean level of identity threat concerns reported by women in the counterfeit diversity condition (2.75) was no different to the mean in the authentically non-diverse condition (2.74; Mann-Whitney U = 2279, n1 = 71, n2 = 65, p = 0.96).
Comparing identity threat concerns between counterfeit diversity and aspirational diversity – for women participants
You might ask, if your company has a 3:1 ratio of men to women, can you get better results (i.e., fewer identity threats reported by women in this experiment after day 1 at work) by showing ads with a 3:1 ratio if you also include a statement about the importance of increasing gender diversity, compared to if you showed ads with 50:50 split without addressing the gender ratio? In other words, if your goal is to minimize identity threat concerns experienced by women working at your organization after their first day of work, would you be better off presenting things as they are plus writing a statement about your intentions to improve gender diversity (as opposed to being better off presenting ads with a 50:50 gender split)? The answer is probably no. The mean level of identity threat concerns reported by women in the aspirational diversity condition (2.63) was not significantly smaller than the mean in the counterfeit diversity condition (2.75; Mann-Whitney U = 1525, n1 = 47, n2 = 71, p = 0.43).
Comments on identity threat concerns among women in the conditions above
Taking the above two results together, if someone was trying to design promotional materials for a tech company with a 3:1 male:female staff ratio, and if their goal was to minimize the extent to which new women employees experienced identity threat concerns, neither of the approaches explored in this experiment (presenting a 50:50 gender split in the images, or presenting a 3:1 split accompanied by a company statement about the importance of gender diversity) appears to be helpful in reducing identity threat concerns.
Comparing company interest between counterfeit diversity and authentic non-diversity – for all participants
You might ask, if your company has a 3:1 ratio of men to women, is it worse (in terms of the level of interest that people have in continuing to work for your organization after day 1) to present your ads with a 50:50 gender split, compared to just showing ads with a 3:1 ratio? In other words, is it worse to make it look like you’re more diverse than you are, rather than just showing things as they are, if your goal is to have people interested in continuing to work for you after day 1 of work? The answer appears to be no. The mean level of interest at time point 2 in the counterfeit diversity condition (3.51) was not significantly higher than the mean in the authentically non-diverse condition (3.40; Mann-Whitney U = 7989.5, n1 = 143, n2 = 119, p = 0.38).
Comparing company interest between aspirational diversity and counterfeit diversity – for all participants
You might ask, if your company has a 3:1 ratio of men to women, can you get better results (in terms of the level of interest that people have in continuing to work for your organization after day 1) by showing ads with a 3:1 ratio if you also include a statement about the importance of increasing gender diversity, compared to if you showed ads with 50:50 split without addressing the gender ratio? In other words, if your goal is to maximize the level of interest that people have in continuing to work for your organization after day 1, would you be better off presenting things as they are plus writing a statement about your intentions to improve gender diversity (as opposed to being better off presenting ads with a 50:50 gender split)? The answer looks like a no (although there was a trend toward a yes). The mean level of interest in the aspirational diversity condition (3.75) was not statistically significantly greater than the mean in the counterfeit diversity condition (3.51; Mann-Whitney U = 8605.5, n1 = 107, n2 = 143, p = 0.07).
Comments on company interest in the conditions above
Taking these results together, it appears that a company with a 3:1 ratio of men to women won’t be able to significantly increase the interest people have in continuing to work there simply by creating ads with a 50:50 gender split or by having a statement about the importance of improving gender diversity in their workplace (although the latter showed a non-significant trend toward being useful).
A condition not included in the experiment
Something that has not been addressed by this experiment is the possible effects of presenting ads (for non-diverse companies) with a 50:50 gender split, in addition to a statement by company executives about how the company is actually not where they want to be in terms of gender balance and about how much the company executives prioritize the goal of increasing the company’s gender diversity. It would be interesting to see if it would be helpful (in terms of identity threat concerns and in terms of company interest) to show a 50:50 gender split in company ads, then to also show a statement about the company’s commitment to improving the actual gender diversity among their staff (similar to the “aspirational diversity” condition, except in this case preceded by ads with a 50:50 gender split).
References
Hayes, A. F. (2022). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach (3rd ed.). The Guilford Press.
Kroeper, K. M., Williams, H. E., & Murphy, M. C. (2022). Counterfeit diversity: How strategically misrepresenting gender diversity dampens organizations’ perceived sincerity and elevates women’s identity threat concerns. Journal of Personality and Social Psychology, 122(3), 399-426. https://doi.org/10.1037/pspi0000348
We ran a replication of study 2 from this paper, which assessed three sets of beliefs (each measured by averaging responses to three self-report questions) about what causes variation in financial well-being. The original authors predicted that people’s agreement with a given set of beliefs would be more positively associated with support for government goals that are compatible with those beliefs than with support for the other government goals in the study. The original authors’ findings were mostly consistent with their predictions, and 10 of their 12 findings replicated in our study. However, some readers might misinterpret some of the paper’s conclusions (the correlations between each of the three sets of beliefs and support for each of the three government goals differ from what a reader might expect).
View the supplemental materials for the original study at OSF
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study?
This study had perfect ratings on all Transparency Ratings criteria.
Replicability: to what extent were we able to replicate the findings of the original study?
Ten of the original study’s 12 findings replicated (83%).
Clarity: how unlikely is it that the study will be misinterpreted?
The methods are explained clearly, but the text-based descriptions of Study 2 could allow readers to come away with a misinterpretation of what the findings actually showed.
Detailed Transparency Ratings
Overall Transparency Rating:
1. Methods Transparency:
Publicly-accessible materials described the administration of the study in enough detail for us to be able to replicate the original study accurately. The scales used were publicly available and were easy to find within the original paper.
2. Analysis Transparency:
The authors were very transparent about the analysis methods they used.
3. Data availability:
All data were publicly available and were easy to find on the OSF project site.
4. Preregistration:
The authors pre-registered the study and conducted the study according to their pre-registered plan.
Please note that the “Replicability” and “Clarity” ratings are single-criterion ratings, which is why no ratings breakdown is provided.
Study Summary and Results
Study Summary
In Study 2 of this paper, researchers assessed three sets of beliefs (each measured by averaging responses to three self-report questions) about what causes changes in an individual’s financial well-being from one year to the next. They found that people’s agreement with a given set of beliefs is more positively associated with support for government goals that are compatible with those beliefs than with support for the other government goals in the study.
To measure views on what causes changes in an individual’s financial well-being, the researchers asked 1,227 participants to rate how true each of the following statements was (on a 7-point scale from “not at all” to “very much”):
“Rewarding:” Agreement levels with these statements were averaged to get the “Rewarding” subscale
A person's change in financial well-being from one year to the next…
• is the result of how hard the person works.
• tends to improve with the person's resourcefulness and problem-solving ability.
• is predictable if you know the person's skills and talents.
“Random:” Agreement levels with these statements were averaged to get the “Random” subscale
A person's change in financial well-being from one year to the next…
• is something that has an element of randomness.
• is determined by inherently unpredictable life events (e.g., getting robbed or winning the lottery).
• is determined by chance factors.
“Rigged:” Agreement levels with these statements were averaged to get the “Rigged” subscale
A person's change in financial well-being from one year to the next…
• depends on how much discrimination or favoritism the person faces.
• is predictable because some groups will always be favored over others.
• depends on the person's initial status and wealth (i.e., rich tend to get richer and poor tend to get poorer).
To measure support for government goals, the researchers asked participants to indicate "to what extent you think that this is an important or unimportant goal for the U.S. government to pursue" for three different government goals, rating each on a 7-point scale ranging from "Not important at all" to "Extremely important." The goals they rated were:
Incentivizing:
“The government should use resources to incentivize and enable people to pull themselves out of financial hardship and realize their full potential.”
Risk-pooling:
“The government should pool resources to support people when they happen to experience unforeseeable financial hardship.”
Redistributing:
“The government should allocate resources to individuals belonging to disadvantaged groups that routinely experience financial hardship.”
Participants were also asked to rate how liberal or conservative they were (on a seven-point scale from “strongly liberal,” 1, to “strongly conservative,” 7).
In the original study, there were also some other questions following the ones described above, but those were not used to create the main results table from the study (which is labeled “Table 10” in the original paper).
To produce the main results table for Study 2 in the original paper, the researchers first created a version of the dataset where each participant's rating of support for each of the three government goals was treated as a separate observation (i.e., there were three rows of data per participant). This is known as "long data." They then ran two linear mixed-effects models predicting goal support ratings (they ran two models in order to cover different reference levels for goals – one model had support for one goal as the reference level, and the other had another goal as the reference level) with participants as random effects (meaning that the relationships between the independent variables and goal support were allowed to differ between participants).
The fixed effects independent variables included in the first pair of models were as follows: the scores on the three belief subscales, the type of government goal being considered (both of the non-reference-level goals were individually compared to the goal that was set as the reference level), and nine different interaction terms (one for each possible pairing of a subscale with a goal support rating). Finally, the researchers also ran a second pair of linear mixed-effects models with exactly the same variables as outlined above, but this time also including conservatism plus three interaction terms (between conservatism and each of the government goals) as independent variables in the model; this pair of models allowed them to assess the set of hypotheses while controlling for conservatism.
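To make the modelling approach above more concrete, here is a hedged sketch (simulated data, illustrative column names, and not the authors' actual code) of reshaping to long format and fitting a linear mixed-effects model with subscale-by-goal interactions and a random intercept per participant:

```python
# Hedged sketch of the general analysis approach: wide -> long reshape, then a
# linear mixed-effects model with subscale-by-goal interactions. All data and
# column names here are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
wide = pd.DataFrame({
    "participant": np.arange(n),
    "rewarding": rng.normal(4, 1, n),
    "random_beliefs": rng.normal(4, 1, n),
    "rigged": rng.normal(4, 1, n),
    "incentivizing": rng.normal(5, 1, n),
    "risk_pooling": rng.normal(5, 1, n),
    "redistributing": rng.normal(5, 1, n),
})

# Long format: three rows per participant, one per government goal.
long = wide.melt(
    id_vars=["participant", "rewarding", "random_beliefs", "rigged"],
    value_vars=["incentivizing", "risk_pooling", "redistributing"],
    var_name="goal", value_name="support",
)

# Goal contrasts use a chosen reference level; the interaction terms test whether
# a subscale's association with support differs between goals.
model = smf.mixedlm(
    "support ~ (rewarding + random_beliefs + rigged) * "
    "C(goal, Treatment(reference='incentivizing'))",
    data=long, groups=long["participant"],
).fit()
print(model.summary())
```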
Our replication
We aimed to see if we could reproduce our own version of the original paper’s results table, so we asked the same sets of questions as those described above (N = 1,221 recruited using the Positly platform), and undertook the same analyses as those described above. Many thanks to the original study team for reviewing our replication materials prior to the study being run.
To experience our replication study as study participants saw it, click here. Note that half the participants answered the CAFU items first and the other half answered them second; within each scale, the order of the questions was randomized. The question order you see will change across multiple repetitions of the preview.
Results Summary
There were six main hypotheses (labeled a through f) being tested in the main results table for Study 2 (which we were replicating), each of which was tested twice – once without controlling for conservatism (labeled H1a-f) and once controlling for conservatism (labeled H2a-f). Across the six pairs of results, five hypotheses had been supported and one had not been supported in the original study. We replicated those findings, with one exception: hypothesis "d" (both H1d and H2d) was supported in the original study but was not supported in our replication (though it did show a trend toward an effect in the same direction: for H1d, p=0.16, and for H2d, p=0.20).
Overall, we confirmed that – in most cases – people’s agreement with a given set of beliefs is more positively associated with support for government goals that are compatible with those beliefs than with support for the other government goals in the study. However, we also caution against misinterpreting these results – and explain exactly what these results do not imply – in a later section of this write-up.
Detailed Results
We aimed to replicate the main results table from Study 2 of the original paper (labeled as Table 10 in the original paper), which showed that, regardless of people’s self-reported levels of political conservatism:
✅ Higher scores on the Rewarding financial belief subscale were more positively associated with support for the Incentivizing goal than with support for the Risk-pooling or the “Redistributing” goals.
❌ Higher scores on the Random financial belief subscale were more positively associated with support for the “Risk-pooling” goal than with support for the Incentivizing goal.
✅ However, higher scores on the Random financial belief subscale were not more positively associated with support for the “Risk-pooling” goal than with support for the “Redistributing” goal.
✅ Higher scores on the Rigged financial belief subscale were more positively associated with support for the “Redistributing” goal than with support for the Risk-pooling or the Incentivizing goals.
Of the results listed above, the only conclusion that didn’t replicate is the one shown italicized above (preceded by the ❌ ). In our study, higher scores on the Random financial belief subscale were not more positively associated with support for the “Risk-pooling” goal than with support for the Incentivizing goal. All the other findings listed above replicated in our study. This applied to both the findings with and without controlling for conservatism.
Tabular View of Detailed Results
Hypotheses and their levels of support
In brief: In the original study, H1a-e were supported and H1f was not. In our replication, H1a-c and H1e were supported; H1d and H1f were not.
How the hypotheses were tested:
These hypotheses were tested via a series of linear mixed-effects models, each of which had government goal support as the dependent variable (DV), and each of which allowed for random intercepts for each goal at the subject level. Each hypothesis was represented as a separate interaction term in the model (the interaction between a given subscale score and the goal being a particular type, in comparison to another type of goal); if the interaction term was significant, then the hypothesis was supported, whereas if it was not significant, the hypothesis was not supported.
H1a:
The effect of “Rigged” scores on support is more positive for the “Redistributing” goal than for the “Incentivizing” goal.
Result:
✅ Supported in original study. Replicated in ours.
H2a:
H1a also holds when controlling for conservatism.
Result:
✅ Supported in original study. Replicated in ours.
H1b:
The effect of “Rewarding” scores on support is more positive for the “Incentivizing” goal than for the “Redistributing” goal.
Result:
✅ Supported in original study. Replicated in ours.
H2b:
H1b also holds when controlling for conservatism.
Result:
✅ Supported in original study. Replicated in ours.
H1c:
The effect of “Rewarding” scores on support is more positive for the “Incentivizing” goal than for the “Risk-pooling” goal.
Result:
✅ Supported in original study. Replicated in ours.
H2c:
H1c also holds when controlling for conservatism.
Result:
✅ Supported in original study. Replicated in ours.
H1d:
The effect of "Random" scores on support is more positive for the "Risk-pooling" goal than for the "Incentivizing" goal (i.e., there is an interaction between "Random" scores and the goal being risk-pooling).
Result:
❌ Supported in original study. Effect was in the ✅ same direction but was non-significant in ours (p=0.16).
H2d:
H1d also holds when controlling for conservatism.
Result:
❌ Supported in original study. Effect was in the ✅ same direction but was non-significant in ours (p=0.20).
H1e:
The effect of "Rigged" scores on support is more positive for the "Redistributing" goal than for the "Risk-pooling" goal (i.e., there is an interaction between "Rigged" scores and the goal being redistribution).
Result:
✅ Supported in original study. Replicated in ours.
H2e:
H1e also holds when controlling for conservatism.
Result:
✅ Supported in original study. Replicated in ours.
H1f:
The effect of “Random” scores is more positive for the “Risk-pooling” goal than for the “Redistributing” goal. (We expected that H1f would *not* be supported, as it was not supported in the original study.)
Result:
✅ Not supported in original study. This lack of support was replicated in ours.
H2f:
H1f also holds when controlling for conservatism.
Result:
✅ Not supported in original study. This lack of support was replicated in ours.
Summary of additional analyses
As planned in our preregistration document, we also checked the correlations between each of the scales and the subjective importance ratings of each goal (both with and without controlling for conservatism). Although these analyses were not done in the original paper, we chose them to see whether they shed light on the original findings. They are much simpler than the original statistical analyses, but they also provide extra information about the relevant variables.
Correlations between each subscale and different goal types using our replication data (not controlling for conservatism) – 95% confidence intervals are shown in parentheses.
Beliefs subscale
“Incentivizing” goal support
“Risk-pooling” goal support
“Redistributing” goal support
Rewarding
-0.03 (-0.08 to 0.03) p = 0.3664
-0.23 (-0.29 to -0.18) p < 0.0001
-0.28 (-0.33 to -0.23) p < 0.0001
Random
0.16 (0.11 to 0.22) p < 0.0001
0.30 (0.25 to 0.35) p < 0.0001
0.31 (0.26 to 0.36) p < 0.0001
Rigged
0.29 (0.24 to 0.34) p < 0.0001
0.48 (0.43 to 0.52) p < 0.0001
0.55 (0.52 to 0.59) p < 0.0001
Partial correlations between each subscale and different goal types using our replication data (all correlations in this table are partial correlations controlling for conservatism; all are statistically significant)
Beliefs subscale
“Incentivizing” goal support
“Risk-pooling” goal support
“Redistributing” goal support
Rewarding
0.06 (0.00 to 0.11) p = 0.0434
-0.11 (-0.16 to -0.05) p = 0.0001
-0.15 (-0.20 to -0.09) p < 0.0001
Random
0.11 (0.05 to 0.16) p = 0.0002
0.21 (0.16 to 0.27) p < 0.0001
0.22 (0.16 to 0.27) p < 0.0001
Rigged
0.20 (0.14 to 0.25) p < 0.0001
0.32 (0.27 to 0.37) p < 0.0001
0.40 (0.35 to 0.45) p < 0.0001
For comparison, here are the same analyses done on the original data:
Correlations in original study data between each subscale and different goal types (not controlling for conservatism)
| Beliefs subscale | “Incentivizing” goal support | “Risk-pooling” goal support | “Redistributing” goal support |
| --- | --- | --- | --- |
| Rewarding | 0.01 (-0.04 to 0.07), p = 0.66897 | -0.17 (-0.22 to -0.11), p < 0.0001 | -0.22 (-0.27 to -0.16), p < 0.0001 |
| Random | 0.09 (0.04 to 0.15), p = 0.00115 | 0.23 (0.18 to 0.28), p < 0.0001 | 0.24 (0.19 to 0.30), p < 0.0001 |
| Rigged | 0.26 (0.20 to 0.31), p < 0.0001 | 0.41 (0.36 to 0.45), p < 0.0001 | 0.50 (0.45 to 0.54), p < 0.0001 |
Partial correlations between each subscale and different goal types in the original study data (all correlations in this table are partial correlations controlling for conservatism; all but one are statistically significant)
| Beliefs subscale | “Incentivizing” goal support | “Risk-pooling” goal support | “Redistributing” goal support |
| --- | --- | --- | --- |
| Rewarding | 0.09 (0.04 to 0.15), p = 0.0011 | -0.05 (-0.11 to 0.00), p = 0.0551 | -0.08 (-0.14 to -0.03), p = 0.00428 |
| Random | 0.06 (0.00 to 0.11), p = 0.04856 | 0.18 (0.13 to 0.24), p < 0.0001 | 0.19 (0.14 to 0.25), p < 0.0001 |
| Rigged | 0.19 (0.13 to 0.24), p < 0.0001 | 0.31 (0.26 to 0.36), p < 0.0001 | 0.39 (0.34 to 0.44), p < 0.0001 |
Interpreting the Results
Commentary on the correlations tables
In all four tables above, there isn’t an appreciable difference between the Random–Risk-pooling correlation and the Random–Redistributing correlation. This is unsurprising, given that hypothesis “f” (which posited that the Random subscale would be more positively associated with the “Risk-pooling” goal than with the “Redistributing” goal) was not supported in either the original study or our replication.
We now turn to the correlations in the “Rewarding” rows of all four tables. In both our data and the original study, the Rewarding–Incentivizing correlation is very small, and it isn’t significant in the correlations that don’t control for conservatism. However, that correlation is still more positive than the correlations between the Rewarding subscale and the other two government goals. This row of the tables therefore illustrates something noteworthy about the method chosen for testing hypotheses a through f: the relationship between a subscale and support for a particular goal can be more positive than the relationships between that subscale and the other goals, even when the relationship in question is itself negligible. In this case, the Rewarding–Incentivizing correlation is negligible (in some analyses, not even statistically significant), but because the Rewarding subscale is negatively related to support for the other two goals, it is still MORE positively related to support for the Incentivizing goal than to support for the other goals.
What the Study Results Do and Do Not Show
The original study’s results mostly replicated here, which suggests that a given set of beliefs does tend to be more positively associated with support for its most belief-compatible government goal than with support for the other government goals.
Something the original experiment did not examine is the opposite question: which sets of goal-congruent beliefs are most positively associated with support for a given government goal? The original study showed that a given subscale is more positively associated with support for the goal most compatible with that subscale than support for the other goals, but it did not show that a given goal is most supported by the theoretically-most-compatible subscale. (The latter statement is instead contradicted by the data from both the original and replication experiments.)
Although this was not the focus of the original study (and, consequently, the following point is separate from the replication findings), an interesting pattern emerged in the correlations from both the original and replication datasets: the subscale most correlated with support for each government goal was the same across all three goals. Specifically, scores on the Rigged subscale were the most correlated with support for all three government goals. This suggests that the set of beliefs captured by the Rigged subscale is more congruent with support for each of the government goals than either of the other two sets of beliefs. Readers of the original paper who had interpreted the findings as implying that the best predictor of support for each goal was the corresponding subscale hypothesized to be related to it may be surprised by this result.
A comment on the different subscales
As mentioned above, in this study the Rigged subscale had a higher correlation with support for all three government goals than either of the other two subscales did. This suggests that the Rigged subscale is the most predictive of support for government-directed financial assistance across all three of the mechanisms represented by the goals.
The government goals were designed to be compatible with specific sets of beliefs, but it seems possible that the wording of the “Incentivizing” government goal is not explicit enough about its intended outcome. Here is the “Incentivizing” goal again:
”The government should use resources to incentivize and enable people to pull themselves out of financial hardship and realize their full potential.”
Although it does mention the word “incentivize,” it does not explicitly state what exactly that word means or how it would be enacted. Also, it could be argued that the inclusion of the phrase “realize their full potential” might evoke particular connotations associated with the political left, which (we speculate) might have weakened the relationship between Rewarding subscale scores and levels of support for this government goal. Additionally, it is not certain that study participants interpreted this goal as being more about the government incentivizing people than about other potential ways of “enab[ling] people to pull themselves out of financial hardship.”
How surprising would it be if readers misinterpreted the results?
The authors gave clear and precise explanations of their hypotheses and of the way they tested them (by examining the significance of predefined interaction terms in a series of mixed-effects linear regressions). A knowledgeable reader well-versed in the relevant methods who read these methodological sections carefully would be unlikely to misinterpret them.
However, we are not giving this paper a perfect rating for this criterion because two sections of the paper offer text-based summaries of Study 2 that, if read in isolation, could leave readers at risk of misinterpreting what Study 2 actually examined and showed. These are highlighted below.
In the “Overview of Studies” section, the authors state:
“Next, we leverage these insights to test our predictions that policy messages highlighting Incentivizing, Redistributing, and Risk-pooling are more persuasive to individuals with lay theories that are high on the Rewarding, Rigged, and Random dimensions, respectively. In particular, we examine how beliefs about changes in financial well-being are associated with rated importance of different goals that a government may pursue (Study 2).”
To a reader, it could sound as if Study 2 is testing whether policy messages highlighting Incentivizing goals are more persuasive to people high on the Rewarding subscale (and whether similar things apply for the other goal-subscale pairs: i.e., “Redistributing” goals are more persuasive to people high on the Rigged subscale, and “Risk-pooling” goals are more persuasive to people high on the Random subscale).
In the summary above, the paper does not clarify which of three possible interpretations actually applies in the case of this study. A reader could interpret the quote above in any of the following ways (please note that the interpretations are not mutually exclusive – readers may also think that all explanations apply, for example):
Policy messages highlighting Incentivizing goals are more supported by people high on the Rewarding subscale compared to how supported they are by people high on the other subscales. (If a reader had interpreted it this way and had not gone on to read and interpret the methodological details of Study 2, they would have come away with a misconstrual of what the results actually showed);
Policy messages highlighting Incentivizing goals are more supported by people high on the Rewarding subscale compared to how supported the other goals are by people high on the Rewarding subscale. (If a reader had interpreted it this way, they would have come away with the correct impression of what Study 2 was doing); and/or
There is a positive and non-negligible relationship between each belief subscale and support for its most compatible government goal, both with and without controlling for conservatism. (If a reader had interpreted it this way, they would have been correct about the relationship between the Rigged subscale and support for the Redistributing government goal, as well as the relationship between the Random subscale and support for the Risk-pooling goal, but they would have been incorrect in the case of the Rewarding subscale’s correlation with support for the Incentivizing goal when conservatism is not controlled for. And when conservatism is controlled for, although there is a statistically significant positive correlation between the Rewarding subscale and support for the Incentivizing goal, this correlation is very small [0.09], suggesting that one’s score on the Rewarding subscale is not helpful in predicting one’s support for the Incentivizing goal.)
Interpretations of what it means to “uniquely” predict the rated importance of different goals
In the text immediately before Study 2, the authors state:
“We begin in Study 2 by examining how the Rewarding, Rigged, and Random dimensions uniquely predict rated importance of different goals that a government may pursue when allocating resources.”
Similarly, in the Discussion, the authors state:
“Study 2 shows that Rewarding, Rigged, and Random beliefs uniquely predict rated importance of Incentivizing, Redistributing, and “Risk-pooling” goals for social welfare policy, respectively.”
Readers could interpret these statements in multiple ways. If readers interpret these statements to mean that the Rewarding subscale is a unique predictor of support for the Incentivizing goal after controlling for the other subscales (e.g., by entering them all into a linear regression), this would probably constitute a misinterpretation, at least with respect to the original dataset. In a linear regression predicting support for the “Incentivizing” goal, the Rewarding subscale was not a statistically significant predictor, unless conservatism was also included as a predictor (see the appendices). However, in the case of the Rigged and Random subscales, these did uniquely predict support for the Redistributing and “Risk-pooling” goals respectively (in that they were significant predictors in a linear regression predicting support for those goals despite the other subscales also being included in those regressions), so readers would only be at risk of misinterpreting the statements above in relation to the Rewarding subscale. (In contrast, the Rewarding subscale was a unique predictor of support for the Incentivizing goal in our dataset, as described in the appendices.)
There is an alternative way in which readers might interpret the above statements which would be inaccurate for both the original and replication datasets. Upon reading the statements above, some readers might think that the Rewarding subscale is either the single best (or perhaps the only) predictor of support for the Incentivizing goal (and similarly for the other goals – the Rigged subscale is the best predictor of support for the “Redistributing” goal and the Random subscale is the best predictor of the “Risk-pooling” goal). However, this is not the case for either the Rewarding subscale or the Random subscale (in either the original or our replication dataset). Instead, the most important subscale predicting the level of support for all three goals was the Rigged subscale (in both the original and our replication dataset).
Only in the case of the Rigged subscale is it actually true that it is the subscale most highly correlated with support for its corresponding goal (i.e., it is more highly correlated with support for the “Redistributing” goal than either of the other subscales is). The same cannot be said of the Rewarding subscale (of the three subscales, it is actually the least correlated with support for the Incentivizing goal) or of the Random subscale (it is not as highly correlated with support for the “Risk-pooling” goal as the Rigged subscale is).
The paper does not show any indication of deliberately obscuring these observations – however, we are highlighting them here because it seems that even a thoughtful reader might not make these observations on their first reading of the paper. It is also possible that, even if they read the paper in full, readers may not realize that the Rewarding subscale has a small to negligible correlation with support for the Incentivizing goal. If readers had been able to review all the correlations data (as presented here) alongside the findings of the original paper, they may have been less likely to misinterpret the findings.
Conclusion
We give Study 2 of this paper a 5/5 Transparency Rating. We also found that the results mostly replicated in our separate sample, and the original authors’ main conclusions held up. However, we think that careful readers might misinterpret parts of the paper if those parts are read in isolation.
Author Acknowledgements
We would like to thank the original paper’s authorship team for their generous help in reviewing our materials and providing feedback prior to the replication, as well as for their thoughts on our results and write-up. (However, the responsibility for the contents of the report remains with the author and the rest of the Transparent Replications team.)
Many thanks also go to Spencer Greenberg for his feedback before and after the study was run, and to both him and Amanda Metskas for their input on earlier drafts of this report. Thank you to the predictors on the Manifold Markets prediction market for this study (which opened after data collection had been completed). Last but certainly not least, many thanks to our participants for their time and attention.
Response from the Original Authors
The original paper’s authorship team offers this response (PDF) to our report. We are grateful for their thoughtful engagement with our report.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional Information about the Study
What did participants do?
Participants…
Provided informed consent.
Answered two sets of questions (half of the participants answered the CAFU items first, and the other half answered them second; within each scale, the order of the questions was randomized):
Causal Attributions of Financial Uncertainty (CAFU) scale (explained below)
Government goal support (policy preferences) questions (explained below)
Stated their political orientation, from strongly liberal (coded as 1) to strongly conservative (coded as 7).
What did we do?
Both the original team and our team…
Excluded participants who failed to give responses for all key variables.
Created a version of the dataset where each participant’s set of three government goal support levels was split into three rows, with one row per government goal. This was done so that each individual rating could be treated as its own observation in the regressions (described in the next step).
Ran some mixed-effects linear regression models, with government goal support as the dependent variable (DV), allowing for random intercepts for each goal at the subject level, and with independent variables that included the following: CAFU subscales, goal categories (with one goal category selected as the reference class, so that the effects of the other categories were estimated relative to that reference), and interaction terms. (A minimal code sketch of the reshaping and regression steps appears after this list and the note that follows it.)
Checked to see whether the following coefficients were significant (as they had been hypothesized to be by the original study team):
“Rigged” ✕ (“Redistributing” goal vs. “Incentivizing” goal contrast)
“Rewarding” ✕ (“Incentivizing” goal vs. “Redistributing” goal contrast)
“Rewarding” ✕ (“Incentivizing” goal vs. “Risk-pooling” goal contrast)
“Rigged” ✕ (“Redistributing” goal vs. “Risk-pooling” goal contrast)
“Random” ✕ (“Risk-pooling” goal vs. “Redistributing” goal contrast)
Ran the same mixed-effects linear regression models and did the same checks described above, but this time also including political conservatism, and interactions between this variable and the goal categories, among the independent variables in the regressions.
Note that, in the original study, these analyses were only some of the analyses that were done. In our replication, as recorded in the preregistration, we only sought to check whether the findings in the main results table of the original study would replicate, so we only replicated the steps relevant to that table.
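To make the reshaping and regression steps above more concrete, here is a minimal Python sketch. It is not the code used for the original or replication analyses; the column names, synthetic data, and exact model specification are hypothetical placeholders that simply illustrate a long-format reshape followed by a mixed-effects regression with subject-level random intercepts and subscale-by-goal interaction terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical wide-format data standing in for the real dataset: one row per
# participant, with CAFU subscale scores, conservatism, and one support rating
# per government goal. Column names are placeholders, not the originals.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "id": np.arange(n),
    "rewarding": rng.normal(4, 1, n),
    "rigged": rng.normal(4, 1, n),
    "random_beliefs": rng.normal(4, 1, n),
    "conservatism": rng.integers(1, 8, n),
})
df["support_incentivizing"] = rng.normal(4 + 0.2 * df["rewarding"], 1)
df["support_risk_pooling"] = rng.normal(3 + 0.4 * df["random_beliefs"], 1)
df["support_redistributing"] = rng.normal(3 + 0.5 * df["rigged"], 1)

# Step 1: reshape so each participant's three goal ratings become three rows.
long_df = df.melt(
    id_vars=["id", "rewarding", "rigged", "random_beliefs", "conservatism"],
    value_vars=["support_incentivizing", "support_risk_pooling", "support_redistributing"],
    var_name="goal",
    value_name="support",
)
long_df["goal"] = long_df["goal"].str.replace("support_", "", regex=False)

# Step 2: mixed-effects regression with random intercepts per participant and
# subscale-by-goal interaction terms. The treatment-coded reference level of
# `goal` determines which goal contrasts the interaction coefficients represent;
# refitting with a different reference level yields the other contrasts listed above.
model = smf.mixedlm(
    "support ~ (rewarding + rigged + random_beliefs)"
    " * C(goal, Treatment('incentivizing'))",
    data=long_df,
    groups=long_df["id"],
)
print(model.fit().summary())
```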
What were the sections about?
The Causal Attributions of Financial Uncertainty (CAFU) scale
In their paper, Krijnen et al. (2022) introduce a set of items designed to measure people’s beliefs about what causes financial well-being. In their studies, they introduce this concept as follows:
“Consider the level of financial well-being of any individual – that is, their capacity to meet financial obligations and/or the financial freedom to make choices to enjoy life. Naturally, a person’s financial well-being may change from one year to the next. Take a moment to think about how the financial well-being of any individual may change from one year to the next.”
In their first study (not the focus of this replication), they developed the “Causal Attributions of Financial Uncertainty (CAFU)” scale, which measures the degree to which people think the following three distinct factors influence changes in financial well-being across time and/or across populations of people (note that participants weren’t given these names for the factors, so as to avoid creating unnecessary social desirability effects). Below, we list the three factors, what a high score on each would mean, and the specific questions that were asked to derive these scores:
If someone has a high score on the Random subscale, they tend to believe financial well-being is unpredictable, random, or (to put it another way) determined by chance.
A person’s change in financial well-being from one year to the next…
…is something that has an element of randomness.
…is determined by inherently unpredictable life events (e.g., getting robbed or winning the lottery).
…is determined by chance factors.
If someone has a high score on the Rigged subscale, they tend to believe that financial well-being is determined by their initial status and wealth, or by the degree to which they or a group to which they belong tend to experience discrimination or favoritism in society.
A person’s change in financial well-being from one year to the next…
…depends on how much discrimination or favoritism the person faces.
…is predictable because some groups will always be favored over others.
…depends on the person’s initial status and wealth (i.e., rich tend to get richer and poor tend to get poorer).
If someone has a high score on the Rewarding subscale, they tend to believe that financial well-being is a result of the degree to which someone works hard, is resourceful, skillful, or talented, and is able to solve problems when they arise.
A person’s change in financial well-being from one year to the next…
…is the result of how hard the person works.
…tends to improve with the person’s resourcefulness and problem-solving ability.
…is predictable if you know the person’s skills and talents.
For each CAFU item, participants rate the extent to which the statement applies, from 1 (= not at all) to 7 (= very much); the intervening options are displayed as numbers only. Here is an example of one such question, as displayed in our replication:
Government goal support/policy preferences
In Study 2, in addition to being asked about the CAFU items, participants were shown the following:
“People differ in their beliefs about what the appropriate role(s) of the government should be. Below we briefly describe three distinct goals that the government might pursue.
For each statement, indicate to what extent you think that this is an important or unimportant goal for the U.S. government to pursue.”
Here are the government goals that participants were presented with. Note that participants were not provided with the labels (such as “redistribution”) associated with each goal (in case this influenced the social desirability of the goals). In each case, they were asked how important the goal was, from “Not important at all, 1” to “Extremely important, 7.”
“Redistributing” goal:
The government should allocate resources to individuals belonging to disadvantaged groups that routinely experience financial hardship.
“Risk-pooling” goal:
The government should pool resources to support people when they happen to experience unforeseeable financial hardship.
“Incentivizing” goal:
The government should use resources to incentivize and enable people to pull themselves out of financial hardship and realize their full potential.
The relationship between CAFU subscales and government goal support
The original research team found (and our replication confirmed) that people’s scores on a given CAFU subscale appear to be more positively associated with support for the government goal that is compatible with the beliefs captured by that subscale than with support for the government goals that are more compatible with the other subscales. In other words, higher scores on the “Rigged” subscale (for example) will probably be more positively associated with support for a government goal that tries to redistribute wealth (to counteract the forces in the “rigged” system) than with support for a goal that is focused elsewhere (like incentivizing people to create their own wealth).
They also found that these patterns (a given subscale being more positively associated with support for its compatible goal than with support for the goals compatible with other beliefs) still held when controlling for participants’ reported political orientation (by including it in the regression model).
Additional Information about the Results
Results Key
The key below explains how to interpret the results columns in the Study Diagram.
Results Tables
The table below is taken from the original paper. Rows highlighted in green are those that replicated in our study. The yellow row did not replicate in our study.
Here is our equivalent of the left side of the main results table in the study (labeled as Table 10 in the original paper), generated using the data from our replication:
Here is our equivalent of the right side of the main results table in the study (labeled as Table 10 in the original paper), generated using the data from our replication:
Additional Analyses
Linear regressions
We ran some simpler linear regressions predicting support for each of the three government goals on its own, to investigate one possible interpretation of the statements about the subscales being “unique predictors” of the government goals. The results for these regressions using our replication dataset can be viewed in this Jasp file. In the linear regressions predicting support for each of the three government goals from the three CAFU subscales (both with and without conservatism among the predictor variables), each subscale was a statistically significant predictor of support for the goal most aligned with it. However, the Rewarding and Random subscales were not the most important predictors of support for the Incentivizing and Risk-pooling goals, respectively; instead, the most important subscale for predicting support for all three goals was the Rigged subscale.
The results for these regressions using the original dataset can be viewed in this Jasp file. As mentioned in the body of the text, in a linear regression predicting support for the Incentivizing goal, the Rewarding subscale was not a statistically significant predictor, unless conservatism was also included as a predictor. However, in the case of the Rigged and Random subscales, these did uniquely predict support for the Redistributing and Risk-pooling goals respectively (in that they were significant predictors in a linear regression predicting support for those goals despite the other subscales also being included in those regressions). Once again, though, as was the case for our replication dataset, the Rewarding and Random subscales were not the most important respective predictors of the levels of support for the Incentivizing and Risk-pooling goals – instead, the most important subscale predicting the level of support for all three goals was the Rigged subscale.
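For readers who want to see the general shape of these simpler regressions, here is a minimal Python sketch (the analyses reported above were run in JASP, as linked; the column names and synthetic data here are hypothetical placeholders). Each regression predicts support for one goal from the three CAFU subscales, first without and then with conservatism as an additional predictor.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data; column names are placeholders, not the originals.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "rewarding": rng.normal(4, 1, n),
    "rigged": rng.normal(4, 1, n),
    "random_beliefs": rng.normal(4, 1, n),
    "conservatism": rng.integers(1, 8, n).astype(float),
})
df["support_incentivizing"] = rng.normal(3 + 0.1 * df["rewarding"] + 0.3 * df["rigged"], 1)

# Support for one goal predicted from the three subscales, without and with conservatism.
without_conservatism = smf.ols(
    "support_incentivizing ~ rewarding + rigged + random_beliefs", data=df
).fit()
with_conservatism = smf.ols(
    "support_incentivizing ~ rewarding + rigged + random_beliefs + conservatism", data=df
).fit()

# The coefficient tables show whether each subscale remains a significant predictor
# once the other subscales (and, in the second model, conservatism) are controlled for.
print(without_conservatism.summary())
print(with_conservatism.summary())
```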
Sensitivity Analysis
Both the original study and our study had more than 3600 observations* that contributed to the regressions reported in the main results table in the original study (labeled as Table 10 in the original paper). *(This is because each individual rating was counted as a separate observation, and each person did three ratings.)
The original team did a post-hoc sensitivity analysis for a single coefficient in a multiple regression analysis with 11 predictors. We performed the same post-hoc sensitivity analysis in G*Power 3.1 (Faul et al., 2009) and found the same minimum detectable effect size as the original team did. The minimum detectable effect with N = 3600 observations (i.e., for both the original and our experiment), α = .05, and 95% power is f² = .007.
References
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160.
Krijnen, J. M., Ülkümen, G., Bogard, J. E., & Fox, C. R. (2022). Lay theories of financial well-being predict political and policy message preferences. Journal of Personality and Social Psychology, 122(2), 310.