Executive Summary
Transparency | Replicability | Clarity |
---|---|---|
| 0 of 1 findings replicated* | |
We ran a replication of the experiment from this paper, which found that as women were exposed to more images of thin bodies, they became more likely to consider ambiguous bodies to be overweight. The finding did not replicate in our study, but this is not necessarily evidence against the hypothesis itself.
The study asked participants to make many rapid judgments of pictures of bodies. The bodies varied in body mass index (BMI) with a range from emaciated to very overweight. Each body was judged by participants as either “overweight” or “not overweight”. Participants were randomized into two conditions: “increasing prevalence” and “stable prevalence”. The increasing prevalence condition saw more and more thin bodies as the experiment progressed. Meanwhile, stable prevalence participants saw a consistent mixture of thin and overweight bodies throughout the experiment. The original study found support for their first hypothesis; compared to participants in the stable prevalence condition, participants in the increasing prevalence condition became more likely to judge ambiguous bodies as “overweight” as the experiment continued. The original paper also examined two additional hypotheses about body self-image judgements, but did not find support for them – we did not include these in our replication.
The original study received a high transparency rating due to being pre-registered and having publicly available data, experimental materials, and analysis code, but could have benefitted from more robust documentation of its exclusion criteria. The primary result from the original study failed to replicate; however, this failure to replicate is likely due to an experimental design decision that made the study less sensitive to detecting the effect than anticipated. The images with BMIs in the range where the effect was most likely to occur were shown very infrequently in the increasing prevalence condition. As such, it may not constitute substantial evidence against the hypothesis itself. The clarity rating could have been improved by discussing the implications of hypotheses 2 and 3 having non-significant results for the paper’s overall claims. Clarity could also have been improved by giving the reader more information about the BMIs of the body images shown to participants and the implications of that for the experiment.
Full Report
Study Diagram
Replication Conducted
We ran a replication of the main study from: Devine, S., Germain, N., Ehrlich, S., & Eppinger, B. (2022). Changes in the prevalence of thin bodies bias young women’s judgments about body size. Psychological Science, 33(8), 1212-1225. https://doi.org/10.1177/09567976221082941
How to cite this replication report: Transparent Replications by Clearer Thinking. (2024). Report #11: Replication of a study from “Changes in the prevalence of thin bodies bias young women’s judgments about body size” (Psychological Science | Devine et al., 2022) https://replications.clearerthinking.org/replication-2022psci33-8
Key Links
- Our Research Box for this replication report includes the pre-registration, de-identified data, and analysis files.
- Our GitLab repository for this replication report includes the code for running the experiment.
Overall Ratings
To what degree was the original study transparent, replicable, and clear?
Transparency: how transparent was the original study? | All materials were publicly available or provided upon request, but some exclusion criteria deviated between pre-registration and publication. |
Replicability: to what extent were we able to replicate the findings of the original study? | The original finding did not replicate. Our analysis found that the key three-way interaction between condition, trial number, and size was not statistically significant. The lack of replication is likely due to an experimental design decision that made the study less sensitive to detecting the effect than was anticipated, rather than evidence against the hypothesis itself; the original simulation-based power analysis underestimated the sample size needed to detect the effect with this testing procedure. |
Clarity: how unlikely is it that the study will be misinterpreted? | The discussion accurately summarizes the findings related to hypothesis 1 but does not discuss the potential implications of the lack of support for hypotheses 2 and 3. It is easy to misinterpret the presentation of the spectrum of stimuli used in the original experiment as it relates to the relative body mass indexes of the images shown to participants. Graphical representations of the original data label only the minimum and maximum model sizes, making it difficult to interpret the relationship between judgments and stimuli. Because readers would have difficulty determining the thin/overweight cutoff value, and the range of stimuli for which judgments were ambiguous, from the information presented in the paper, they could be left with misunderstandings about the study’s methods and results. |
Detailed Transparency Ratings
Overall Transparency Rating: | |
---|---|
1. Methods Transparency: | All materials are publicly available. There were some inconsistencies between the exclusion criteria reported in the paper, supplemental materials, and analysis code provided. We were able to determine the exact methods and rationale for the exclusion criteria through communication with the original authors. |
2. Analysis Transparency: | Analysis code is publicly available. |
3. Data availability: | The data are publicly available. |
4. Preregistration: | We noted two minor deviations from the pre-registered exclusion criteria. The preregistration indicated that participants would be excluded if they recorded 5 or more trial responses where the time between the stimulus display and the participant’s response input was greater than 7 seconds; this criterion diverges slightly from both the final supplemental exclusion report and the exclusions as executed in the analysis script. Additionally, the preregistration indicated that participants with greater than 90% similar judgments across their trials would be excluded, but one participant who met this criterion was included in the final analysis. Overall, these inconsistencies are minor and likely had no bearing on the results of the original study. |
Summary of Study and Results
Both the original study (n = 419) and our replication (n = 201) tested for evidence of the cognitive mechanism prevalence-induced concept change as an explanation for shifting body type ideals in the context of women’s body image judgments.
The original paper tested 3 main hypotheses but only found support for the first. Since the original study didn’t find support for hypotheses 2 or 3, our replication focused on testing hypothesis 1: “…if the prevalence of thin bodies in the environment increases, women will be more likely to judge other women’s bodies as overweight than if this shift did not occur.” Our pre-registered analysis plan is available here.
Prevalence-induced concept change happens when a person starts seeing more and more cases of a specific conceptual category. For example, consider hair color. Red hair and brown hair are two different conceptual categories of hair color. Some people have hair that is obviously red or obviously brown, but there are many cases where it could go either way. We might call these in-between cases “auburn,” “reddish-brown,” or even “brownish-red.” If a person starts seeing many other people with obviously red hair, they might start thinking of auburn hair as obviously brown. Their conceptual class of “red hair” has shrunk to exclude the ambiguous cases.
To test prevalence-induced concept change in women’s body image, both the original study and our replication showed participants computer-generated images of women’s bodies and asked participants to judge whether they thought any given body was “overweight” or “not overweight”. The image library included 61 images, ranging from a minimum BMI of 13.19 to a maximum of 120.29. Each participant was randomly assigned to one of two conditions: stable-prevalence or increasing-prevalence. Stable-prevalence participants saw an equal 50/50 split of images of bodies with BMIs above 19.79 (the “overweight” category)1 and images of bodies with BMIs below 19.79 (the “thin” category). Increasing-prevalence participants saw a greater and greater proportion of bodies with BMIs below 19.79 as the experiment proceeded. If participants in the increasing-prevalence condition became more likely to judge thin or ambiguous bodies as overweight in the later trials of the experiment, compared to participants in the stable-prevalence condition, that would be evidence that prevalence-induced concept change affects body image judgments.
Overview of Main Task
During the task, participants were shown an image of a human body (all images can be found here). The body image stimulus displayed on screen for 500 milliseconds (half of a second), followed by 500 milliseconds of a blank screen and finally a question mark, indicating to participants that it was time to input a response. Participants then recorded a binary judgment by pressing the “L” key on their keyboard to indicate “overweight” or the “A” key to indicate “not overweight”. Judgments were meant to be made quickly, between 100 and 7000 milliseconds, on each trial. This process was repeated for 16 blocks of 50 trials each, meaning that each participant recorded 800 responses.
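The task structure just described can be summarized in a few constants (variable names are ours; values are taken from the description above):

```python
# Task structure as described in the text; names are illustrative.
STIMULUS_MS = 500        # body image shown for half a second
BLANK_MS = 500           # blank screen before the response prompt
RESPONSE_MIN_MS = 100    # judgments faster than this were not considered valid
RESPONSE_MAX_MS = 7000   # judgments slower than this counted toward exclusion
BLOCKS = 16
TRIALS_PER_BLOCK = 50

total_trials = BLOCKS * TRIALS_PER_BLOCK
print(total_trials)  # 800 responses per participant
```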
Additionally, participants completed a self-assessment once before and once after the main task. For this assessment, participants chose the body image from the stimulus set that most closely resembled their own body. Participants were asked to judge the self-representative body from their first self-assessment as “overweight” or “not overweight” before completing their second and final self-assessment. These self-assessments were used for testing hypothesis 2, hypothesis 3, and the exploratory analyses in the original paper. We focused on hypothesis 1, so we did not include self-assessment data in our analysis.
Figure 1: Example frames from the task
A (Introduction) |
B (Example instruction frame) |
C (Block 1 start) |
D (Stimulus display [500ms]) |
E (Prompt to respond) |
Results
The original study found a significant three-way interaction between condition, trial, and size (β = 3.85, SE = 0.38, p = 1.09 × 10−23), meaning that participants became more likely to judge ambiguous bodies as “overweight” as they were exposed to more thin bodies over the course of their trials. Our replication, however, did not find this interaction to be significant (β = 0.53, SE = 1.81, p = 0.26). Although not significant, the effect was in the predicted direction, and the lack of significance may be due to an experimental design decision that resulted in lower-than-estimated statistical power.
Study and Results in Detail
Main Task in Detail
Both the original study and our replication began with a demographic questionnaire. In our replication, the demographic questionnaire from the original study was pared down to include only the questions relevant to the exclusion criteria and to a potential supplemental analysis regarding the original hypothesis 3. The retained questions are listed below.
- What is your gender?
- Options: Female, Male, Transgender, Non-Binary
- What is your age in years?
- For statistical purposes, what is your weight?
- For statistical purposes, what is your height?
- Please enter your date of birth.
- Please enter your (first) native language.
We included an additional screening question to ensure recruited participants were able to complete the task.
- Are you currently using a device with a full keyboard?
- Options: “Yes, I am using a full keyboard”, “No”
The exact proportions of bodies under 19.79 BMI presented, out of the total bodies per block in each condition, are detailed in Figure 2. Condition manipulations relative to stimulus BMI can be seen in Figure 3.
Figure 2: Stimuli Prevalence by Condition and Block, Table
Proportion of thin body image stimuli | ||
Block # | Increasing Prevalence | Stable Prevalence |
1 | 0.50 | 0.50 |
2 | 0.50 | 0.50 |
3 | 0.50 | 0.50 |
4 | 0.50 | 0.50 |
5 | 0.60 | 0.50 |
6 | 0.72 | 0.50 |
7 | 0.86 | 0.50 |
8 | 0.94 | 0.50 |
9 | 0.94 | 0.50 |
10 | 0.94 | 0.50 |
11 | 0.94 | 0.50 |
12 | 0.94 | 0.50 |
13 | 0.94 | 0.50 |
14 | 0.94 | 0.50 |
15 | 0.94 | 0.50 |
16 | 0.94 | 0.50 |
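A quick way to see what these proportions imply per participant: with 50 trials per block, the increasing-prevalence schedule leaves only a handful of “overweight” presentations per block once the proportion reaches .94. A minimal sketch (the per-block trial count of 50 comes from the task description; the real experiment’s exact image draws may differ):

```python
# Thin-image proportions per block from Figure 2 (increasing-prevalence condition).
increasing = [0.50] * 4 + [0.60, 0.72, 0.86] + [0.94] * 9
TRIALS_PER_BLOCK = 50

for block, p_thin in enumerate(increasing, start=1):
    n_thin = round(p_thin * TRIALS_PER_BLOCK)
    print(f"Block {block:2d}: {n_thin} thin, {TRIALS_PER_BLOCK - n_thin} overweight")

# In the last four blocks (the final 200 trials), that is only
# 4 * 3 = 12 overweight-image presentations per participant.
```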
Figure 3: Estimated BMI of Stimuli and World Health Organization Categories
Stimulus | Categorization for Study Conditions | BMI | WHO Classification** |
---|---|---|---|
T300 | Thin | 13.19 | Thin |
T290 | Thin | 13.38 | Thin |
T280 | Thin | 13.47 | Thin |
T270 | Thin | 13.77 | Thin |
T260 | Thin | 13.86 | Thin |
T250 | Thin | 14.10 | Thin |
T240 | Thin | 14.28 | Thin |
T230 | Thin | 14.46 | Thin |
T220 | Thin | 14.65 | Thin |
T210 | Thin | 14.87 | Thin |
T200 | Thin | 15.06 | Thin |
T190 | Thin | 15.24 | Thin |
T180 | Thin | 15.49 | Thin |
T170 | Thin | 15.67 | Thin |
T160 | Thin | 15.74 | Thin |
T150 | Thin | 16.12 | Thin |
T140 | Thin | 16.40 | Thin |
T130 | Thin | 16.64 | Thin |
T120 | Thin | 16.81 | Thin |
T110 | Thin | 17.08 | Thin |
T100 | Thin | 17.28 | Thin |
T090 | Thin | 17.56 | Thin |
T080 | Thin | 17.77 | Thin |
T070 | Thin | 18.01 | Thin |
T060 | Thin | 18.26 | Thin |
T050 | Thin | 18.50 | Normal Range |
T040 | Thin | 18.77 | Normal Range |
T030 | Thin | 19.1 | Normal Range |
T020 | Thin | 19.3 | Normal Range |
T010 | Thin | 19.61 | Normal Range |
N000 | NA* | 19.79 | Normal Range |
H010 | Overweight | 21.55 | Normal Range |
H020 | Overweight | 23.35 | Normal Range |
H030 | Overweight | 25.37 | Overweight |
H040 | Overweight | 27.37 | Overweight |
H050 | Overweight | 29.57 | Overweight |
H060 | Overweight | 31.84 | Overweight |
H070 | Overweight | 34.13 | Overweight |
H080 | Overweight | 36.58 | Overweight |
H090 | Overweight | 39.10 | Overweight |
H100 | Overweight | 41.76 | Overweight |
H110 | Overweight | 44.55 | Overweight |
H120 | Overweight | 47.37 | Overweight |
H130 | Overweight | 50.23 | Overweight |
H140 | Overweight | 53.21 | Overweight |
H150 | Overweight | 56.26 | Overweight |
H160 | Overweight | 59.31 | Overweight |
H170 | Overweight | 62.64 | Overweight |
H180 | Overweight | 66.04 | Overweight |
H190 | Overweight | 69.56 | Overweight |
H200 | Overweight | 73.30 | Overweight |
H210 | Overweight | 76.95 | Overweight |
H220 | Overweight | 80.98 | Overweight |
H230 | Overweight | 85.49 | Overweight |
H240 | Overweight | 89.89 | Overweight |
H250 | Overweight | 94.40 | Overweight |
H260 | Overweight | 99.27 | Overweight |
H270 | Overweight | 104.4 | Overweight |
H280 | Overweight | 109.45 | Overweight |
H290 | Overweight | 114.82 | Overweight |
H300 | Overweight | 120.29 | Overweight |
** Labels for BMI categories defined by WHO guidelines. (WHO, 1995)
Data Collection
Data were collected using the Positly recruitment platform and the Pavlovia experiment hosting platform. Data collection began on the 15th of May, 2024 and ended on the 5th of August, 2024.
In consultation with the original authors, we determined that a sample size of 200 participants after exclusions would provide adequate statistical power for this replication effort. In the simulations for the original study, the authors determined that 140 participants would provide 89% power to detect their expected effect size for hypothesis 1. Typically for replications we aim for a 90% chance to detect an effect that is 75% of the size of the original effect. To emulate that standard for this study, we decided on a sample of 200 participants. It is important to note that the original study had 419 participants after exclusions; this final sample size was set by simulation-based power analyses for hypotheses 2 and 3, which required roughly 400 participants for adequate statistical power. Because our replication did not test hypotheses 2 and 3 (they were not supported in the original analysis), we did not need to power the study based on those hypotheses.
While a sample size of 200 subjects was justified at the time, we later learned that the original simulation-based power analysis relied on assumptions that turned out to be faulty, something that could only be determined from the empirical data in the original sample. The sample size needed to provide adequate statistical power for hypothesis 1 was underestimated. Because the original study used a larger sample size to power hypotheses 2 and 3, the underestimate for hypothesis 1 went undetected. As a result, our replication study may have been underpowered.
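To illustrate what a simulation-based power analysis involves, here is a deliberately simplified sketch. It replaces the paper’s logistic multilevel model with a crude proxy (comparing per-participant proportions of “overweight” judgments on ambiguous stimuli between conditions with a two-sample z-test), and the effect sizes are invented for illustration; this is not a reconstruction of the original authors’ simulation.

```python
import math
import random

def two_sample_z_p(xs, ys):
    """Two-sided p-value from a two-sample z-test on group means
    (normal approximation; adequate for illustration only)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (my - mx) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def simulated_power(n_per_group, p_stable, p_increasing,
                    n_trials=40, nsims=500, alpha=0.05):
    """Fraction of simulated experiments in which the condition difference
    is detected at the alpha level. Each simulated participant contributes
    their proportion of 'overweight' judgments on ambiguous stimuli."""
    def group(p):
        return [sum(random.random() < p for _ in range(n_trials)) / n_trials
                for _ in range(n_per_group)]
    hits = sum(two_sample_z_p(group(p_stable), group(p_increasing)) < alpha
               for _ in range(nsims))
    return hits / nsims

random.seed(1)
# Hypothetical effect sizes, purely for illustration:
print(simulated_power(100, 0.10, 0.20))  # large effect: power near 1
print(simulated_power(100, 0.10, 0.11))  # small effect: power much lower
```

The key point the sketch demonstrates: if the assumed effect size is too optimistic (as turned out to be the case here), the simulation reports high power for a sample size that is in fact too small.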
Excluding Participants and/or Observations
For participants to be eligible to take part in the study, they had to be:
- Female
- Aged 18-30
- English speaking
After data collection, participants were excluded from the analysis under the following circumstances:
- Participants who took longer than 7 seconds to respond in more than 10 trials.
- Participants who demonstrated obviously erratic behavior, e.g., repeated similar responses across long stretches of trials despite variation in stimuli (see the Additional Information about the Exclusion Criteria appendix section).
- Participants who did not complete the full 800 trials.
- Participants who did not meet the eligibility criteria.
Additionally, at the suggestion of the original authors, we excluded any observations in which the response was given more than 7 seconds after the display of the stimulus.
249 participants completed the main task. 8 participants did not have their data written due to technical malfunctions (these participants were still compensated as usual). 8 participants were excluded for reporting anything other than “Female” for their gender on the questionnaire. 23 participants were excluded for being over 30 years old. 6 participants were excluded for taking longer than 7 seconds to respond on more than 10 trials. 4 participants were excluded for obviously erratic behavior. Note that some participants fell into two or more of these exclusion categories, so the sum of exclusions listed above is greater than the total number of excluded participants.
Analysis
Both the original paper and our replication utilized a logistic multilevel model to assess the data:
Yij = β0j + β1jTrialij + β2jSizeij + β3j(Trialij × Sizeij)
β0j = Ɣ00 + Ɣ01Conditionj + U0j
β1j = Ɣ10 + Ɣ11Conditionj + U1j
β2j = Ɣ20 + Ɣ21Conditionj
β3j = Ɣ30 + Ɣ31Conditionj
Where Size is the ordinal relative BMI of computerized model images. That is, the degree to which each body image stimulus is thin or overweight.
Yij represents the log odds of participant j making an “overweight” judgment for trial i.
U0j are random intercepts per participant, and U1j are random slopes (on trial) per participant. The Ɣ terms represent fixed effects.
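Substituting the level-2 equations into the level-1 equation gives the model in reduced form (a restatement of the equations above, not a new analysis), which makes explicit that the reported three-way coefficient is Ɣ31:

```latex
\operatorname{logit} P(Y_{ij} = 1) =
    \gamma_{00}
  + \gamma_{01}\,\mathrm{Condition}_j
  + \gamma_{10}\,\mathrm{Trial}_{ij}
  + \gamma_{20}\,\mathrm{Size}_{ij}
  + \gamma_{11}\,\mathrm{Condition}_j \times \mathrm{Trial}_{ij}
  + \gamma_{21}\,\mathrm{Condition}_j \times \mathrm{Size}_{ij}
  + \gamma_{30}\,\mathrm{Trial}_{ij} \times \mathrm{Size}_{ij}
  + \gamma_{31}\,\mathrm{Condition}_j \times \mathrm{Trial}_{ij} \times \mathrm{Size}_{ij}
  + U_{0j} + U_{1j}\,\mathrm{Trial}_{ij}
```

The Condition × Trial × Size coefficient Ɣ31 is the quantity reported as β = 3.85 in the original study and β = 0.53 in our replication.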
Results in Detail
The original study found a significant three-way interaction between condition, trial number, and size (β = 3.85, SE = 0.38, p = 1.09 × 10−23), indicating that as the prevalence of thin bodies in the environment increased, participants were more likely to judge ambiguous bodies (not obviously overweight and not obviously underweight) as overweight. The authors note that this effect is restricted to judgements of “thin and average-size stimuli” due to the increasing-prevalence condition requiring a low frequency of “overweight” stimuli.
Figure 4: Original Results Table
Predictors | Log Odds | 95% CI | p |
---|---|---|---|
Intercept | -1.90 | -2.01 – -1.78 | <0.001 |
Condition | 0.08 | -0.04 – 0.20 | 0.173 |
Trial0 | -0.62 | -0.77 – -0.47 | <0.001 |
Size0 | 21.21 | 20.82 – 21.59 | <0.001 |
Condition x Trial0 | -0.65 | -0.81 – -0.50 | <0.001 |
Condition x Size0 | -0.48 | -0.85 – -0.11 | 0.011 |
Trial x Size0 | 2.05 | 1.26 – 2.85 | <0.001 |
Condition x Trial0 x Size0 | 3.85 | 3.10 – 4.61 | <0.001 |
Figure 5: Original Results Data Representations
From “Changes in the prevalence of thin bodies bias young women’s judgments about body size,” by Devine, S., Germain, N., Ehrlich, S., & Eppinger, B., 2022, Psychological Science, 33(8), 1212-1225.
Figure 6: Replication Results Table
Predictors | Log Odds | 95% CI | p |
---|---|---|---|
Intercept | -1.59 | -1.73 – -1.45 | <0.001 |
Condition | 0.15 | 0.02 – 0.29 | 0.028 |
Trial0 | -0.80 | -0.99 – -0.61 | <0.001 |
Size0 | 20.01 | 19.50 – 20.51 | <0.001 |
Condition x Trial0 | -0.68 | -0.87 – -0.49 | <0.001 |
Condition x Size0 | -0.43 | -0.91 – 0.06 | 0.084 |
Trial x Size0 | -0.24 | -1.19 – 0.71 | 0.626 |
Condition x Trial0 x Size0 | 0.53 | -0.38 – 1.43 | 0.255 |
Figure 7: Replication Results Data Representation
Interpreting the Results
The failure of this result to replicate is likely due to characteristics of the study design that made the experiment a less sensitive test of the hypothesis. For that reason, the failure to replicate should not be taken as strong evidence against the original hypothesis that prevalence-induced concept change occurs for body images.
The main study design issue that could account for the non-replication is the categorization of “thin” and “overweight” images for the condition manipulation: “thin” images were 19.61 BMI and below, and “overweight” images were 21.55 BMI and above. This low threshold means that participants in the increasing-prevalence condition saw very few images of bodies in the ambiguous or normal BMI range, where prevalence-induced concept change is most likely to occur. Unfortunately, we did not notice this issue with the BMI cutoff between the thin and overweight groups until after we had collected the replication data. This means that our replication, while having the benefit of being faithful to the original study, has the drawback of being affected by the same design issue.
We presented this issue to the authors after determining that it may explain the lack of replication. The authors explained their rationale for setting the image cutoff at the baseline image:
“In designing the study, we anticipated the most “ambiguous” stimuli to be those near the reference image (BMI of 19.79; the base model). This was based on two factors. First, WHO guidelines suggest that a “normal” BMI lies between 18.5 and 24.9—hence a BMI of 19.79 fell nicely within this range and, as mentioned, allowed for a clean division of the stimulus set into two equal categories. Second, irrespective of the objective BMI, we anticipated participants would judge the reference image as maximally ambiguous in the context of the stimulus set, owing to the range available to participants’ judgements when completing the experiment. Accordingly, the power analysis we conducted was based on this assumption that responses most sensitive to PICC would be those to images near in size to the reference image. But this turned out not to be the case when we acquired the data from our sample. As you point out, increased sensitivity to PICC was at a slightly higher (and evidently under-sampled) range of size (BMI 23.35 – 31.84). As such, the sample size required to detect effects in these ranges with sufficient power may be higher than previously thought.” (Devine, email communication 9/11/24)
Understanding the Categorization Used
It took us some time to recognize this issue because the original paper does not clearly explain how the “thin” and “overweight” image categories correspond to BMI values of the images, and none of the figures in the original paper show BMI values along the axes representing image sizes. From the paper alone it is not possible for a reader to determine what BMI values the stimuli presented correspond to, with the exception of the endpoints. The paper says:
Specifically, the proportion of thin to overweight bodies had the following order across each block in the increasing-prevalence condition: .50, .50, .50, .50, .60, .72, .86, .94, .94, .94, .94, .94, .94, .94, .94, .94. In the stable-prevalence condition, the proportion of overweight and thin bodies in the environment did not change; it was always .50 (see Fig. 1b). Bodies were categorized as objectively thin or overweight by Moussally et al. (2017) according to World Health Organization (1995) guidelines. Body mass index (BMI) across all bodies ranged from 13.19 (severely underweight) to 120.29 (very severely obese). (Devine et al, 2022) [Bold italics added for emphasis]
From the information provided in the paper, a reader would be likely to assume that the images in the “overweight” category had BMIs of greater than 25, because a BMI of 25 is the dividing line between “healthy/normal” and “overweight” according to the WHO. Another possible interpretation of this text in the paper would suggest that the bodies that were categorized as thin and/or median in the Moussally et al. (2017) stimulus validation paper were the ones increasing in prevalence in that condition, and those categorized as overweight in the validation study were diminishing in prevalence. Either of these likely reader assumptions would also be supported by the presentation of the results in the original paper:
Most importantly, we found a three-way interaction between condition, trial, and size (β = 3.85, SE = 0.38, p = 1.09 × 10−23). As seen in Figure 2a, this result shows that when the prevalence of thin bodies in the environment increased over the course of the task, participants judged more ambiguous bodies (average bodies) as overweight than when the prevalence remained fixed. We emphasize here that this effect is restricted to judgments of thin and average-size stimuli because the nature of our manipulation reduced the number of overweight stimuli seen by participants in the increasing-prevalence condition (as reflected by larger error bars for larger body sizes in Fig. 2a). (Devine et al, 2022) [Bold italics added for emphasis]
Moussally et al. developed the stimuli used in this study with 3D modeling software. They started with a default female model (corresponding to a BMI of 19.79 according to their analysis), scaled down from that default model in 30 increments of the software’s “thin/heavy” dimension to get lower BMIs (down to a low of 13.19), and then scaled up from the default model by 30 increments to get higher BMIs (up to a high of 120.29). They then validated the image set by asking participants to rate images on a 9-point Likert scale where 1 = “fat” and 9 = “thin”. Based on those ratings, they established three categories of body shape: “thin, median, or fat.”
Figure 8: Ratings of Body Shape for all Stimuli from Moussally et al. (2017)
Stimulus | BMI | Mean Rating (Likert 1-9) | Validation Study Classification |
---|---|---|---|
T300 | 13.19 | 8.94 | Thin |
T290 | 13.38 | 8.95 | Thin |
T280 | 13.47 | 8.97 | Thin |
T270 | 13.77 | 8.88 | Thin |
T260 | 13.86 | 8.91 | Thin |
T250 | 14.10 | 8.86 | Thin |
T240 | 14.28 | 8.77 | Thin |
T230 | 14.46 | 8.70 | Thin |
T220 | 14.65 | 8.63 | Thin |
T210 | 14.87 | 8.67 | Thin |
T200 | 15.06 | 8.59 | Thin |
T190 | 15.24 | 8.56 | Thin |
T180 | 15.49 | 8.37 | Thin |
T170 | 15.67 | 8.18 | Thin |
T160 | 15.74 | 8.22 | Thin |
T150 | 16.12 | 8.11 | Thin |
T140 | 16.40 | 8.12 | Thin |
T130 | 16.64 | 8.05 | Thin |
T120 | 16.81 | 7.95 | Thin |
T110 | 17.08 | 7.90 | Thin |
T100 | 17.28 | 7.79 | Thin |
T090 | 17.56 | 7.90 | Thin |
T080 | 17.77 | 7.79 | Thin |
T070 | 18.01 | 7.88 | Thin |
T060 | 18.26 | 7.74 | Thin |
T050 | 18.50 | 7.84 | Thin |
T040 | 18.77 | 7.76 | Thin |
T030 | 19.1 | 7.74 | Thin |
T020 | 19.3 | 7.78 | Thin |
T010 | 19.61 | 7.50 | Thin |
N000 | 19.79 | 7.63 | Thin |
H010 | 21.55 | 7.28 | Thin |
H020 | 23.35 | 6.21 | Median |
H030 | 25.37 | 5.65 | Median |
H040 | 27.37 | 5.26 | Median |
H050 | 29.57 | 4.85 | Median |
H060 | 31.84 | 4.28 | Median |
H070 | 34.13 | 3.63 | Fat |
H080 | 36.58 | 3.62 | Fat |
H090 | 39.10 | 3.10 | Fat |
H100 | 41.76 | 2.78 | Fat |
H110 | 44.55 | 2.65 | Fat |
H120 | 47.37 | 2.45 | Fat |
H130 | 50.23 | 2.32 | Fat |
H140 | 53.21 | 2.02 | Fat |
H150 | 56.26 | 1.95 | Fat |
H160 | 59.31 | 1.68 | Fat |
H170 | 62.64 | 1.56 | Fat |
H180 | 66.04 | 1.59 | Fat |
H190 | 69.56 | 1.44 | Fat |
H200 | 73.30 | 1.45 | Fat |
H210 | 76.95 | 1.30 | Fat |
H220 | 80.98 | 1.23 | Fat |
H230 | 85.49 | 1.17 | Fat |
H240 | 89.89 | 1.16 | Fat |
H250 | 94.40 | 1.11 | Fat |
H260 | 99.27 | 1.06 | Fat |
H270 | 104.4 | 1.09 | Fat |
H280 | 109.45 | 1.10 | Fat |
H290 | 114.82 | 1.06 | Fat |
H300 | 120.29 | 1.05 | Fat |
The “median” images according to the judgments reported in Moussally et al. (2017) ranged from a BMI of 23.35 to 31.84; however, neither of those cutoffs nor the commonly used WHO guideline of BMI 25 and above as “overweight” was used to set the cutoff between the groups of “thin” and “overweight” images in the experiment we replicated. The study code itself shows that the 30 images scaled down from the baseline image of 19.79 BMI served as the “thin” group and the 30 images scaled up from the baseline served as the “overweight” group. The 19.79 BMI image was not included in either group, so it was not presented to participants in the experiment. That means the “thin” images that were increasing in prevalence ranged from a BMI of 13.19 to 19.61, and the “overweight” images that were decreasing in prevalence ranged from a BMI of 21.55 to 120.29. The 21.55 BMI image was categorized as “thin” in the Moussally et al. (2017) validation study and is well within the normal/healthy weight range according to the WHO, and yet it was grouped with the “overweight” images in this study. This 21.55 BMI image was judged “not overweight” in 96% of trials in the original dataset, further suggesting that the experiment’s cutoff between “thin” and “overweight” was placed at too low a BMI to adequately capture data for ambiguous body images.
Implications of the Categorization
Figure 2b in the original paper presents the results for a BMI of 23.35, which is within the “normal/healthy” range according to the WHO and is the lowest-BMI “median” image according to the validation study. This is clearly meant to be one of the normal or ambiguous body sizes for which prevalence-induced concept change would be most expected. The inclusion of this image in the “overweight” grouping, whose prevalence was decreasing, means it would not have been shown to participants very often. The caveat in the results section that “this effect is restricted to judgments of thin and average-size stimuli because the nature of our manipulation reduced the number of overweight stimuli seen by participants in the increasing-prevalence condition” applies to the 23.35 BMI image that is presented in the paper as a demonstration of the effect.
In the last 200 trials of the increasing-prevalence condition, only 6% of the images presented came from the set of 30 “overweight” images. That means each participant saw only 12 presentations of “overweight” images in the last 200 trials, and each individual subject in the increasing-prevalence condition had only an approximately 33% chance of seeing the BMI 23.35 image at least once during those trials. Ideally, this image (and others in the ambiguous range) should be shown much more frequently in order to capture possible effects of prevalence-induced concept change.
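The arithmetic behind those figures, assuming each overweight-image presentation draws uniformly from the 30 “overweight” stimuli (a simplifying assumption for illustration):

```python
# 6% of the last 200 trials come from the overweight image set.
overweight_presentations = 0.06 * 200            # 12 per participant
# Chance a participant sees the BMI-23.35 image at least once in those 12 draws,
# assuming draws are uniform over the 30 overweight images:
p_at_least_once = 1 - (29 / 30) ** overweight_presentations
print(round(overweight_presentations))  # 12
print(round(p_at_least_once, 3))        # 0.334, i.e., roughly a 33% chance
```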
In the original study, looking at the last 200 trials, the 23.35 BMI image was presented only 80 times out of 42,600 image presentations to the increasing prevalence condition. In the replication study, looking at the last 200 trials, that image was only presented 51 times out of 20,000 image presentations to the increasing prevalence condition.
Figure 9 below shows how many times stimuli of BMI values 18.77 – 31.84 were presented and what percentage of them were judged as “overweight” in the last 200 trials in each condition across all subjects for the original dataset and the replication dataset. The rows that are color coded by condition and have BMI values in bold are from the “overweight” group.
Figure 9: Data Presentation Frequency and % “Overweight” judgements in last 200 trials
A (Original Data)
| Stimulus (BMI) | Increasing (N = 213): presentations (out of 42,600) | Increasing: % judged “Overweight” | Stable (N = 206): presentations (out of 41,200) | Stable: % judged “Overweight” |
|---|---|---|---|---|
| 18.77 | 1337 | 2.99% | 701 | 2.28% |
| 19.1 | 1365 | 4.03% | 673 | 2.23% |
| 19.3 | 1288 | 3.80% | 718 | 1.39% |
| 19.61 | 1334 | 3.97% | 698 | 2.72% |
| **21.55** | 73 | 8.22% | 676 | 3.40% |
| **23.35** | 80 | 25.00% | 685 | 9.78% |
| **25.37** | 81 | 28.40% | 687 | 21.83% |
| **27.37** | 96 | 43.75% | 683 | 34.85% |
| **29.57** | 93 | 53.76% | 726 | 52.07% |
| **31.84** | 83 | 61.45% | 710 | 56.90% |
B (Replication Data)
| Stimulus (BMI) | Increasing (N = 100): presentations (out of 20,000) | Increasing: % judged “Overweight” | Stable (N = 101): presentations (out of 20,100) | Stable: % judged “Overweight” |
|---|---|---|---|---|
| 18.77 | 611 | 0.82% | 315 | 2.22% |
| 19.1 | 590 | 2.03% | 333 | 2.40% |
| 19.3 | 634 | 2.05% | 314 | 2.55% |
| 19.61 | 631 | 2.06% | 319 | 2.19% |
| **21.55** | 41 | 7.32% | 331 | 4.53% |
| **23.35** | 51 | 15.69% | 336 | 10.71% |
| **25.37** | 41 | 34.15% | 357 | 24.93% |
| **27.37** | 35 | 42.86% | 315 | 35.56% |
| **29.57** | 35 | 68.57% | 346 | 52.89% |
| **31.84** | 42 | 59.52% | 385 | 58.70% |
These tables show that, in both conditions, only a small percentage of the stimuli from 18.77 to 19.61 BMI were judged to be overweight. There is much more variation in judgments of the 21.55 to 31.84 BMI images, but those images were presented very rarely in the increasing prevalence condition. Because the stimuli most important for demonstrating the proposed effect were presented so infrequently, this test of the prevalence induced concept change hypothesis was much less sensitive to detecting whether the effect is present.
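The presentation counts in Figure 9 line up with what the design implies. As a rough sanity check, here is a sketch of the expected number of presentations per “overweight” image over the last 200 trials, assuming 6% prevalence and a uniform draw over the 30 “overweight” images (participant counts taken from Figure 9):

```python
# Expected presentations per "overweight" image in the last 200 trials of
# the increasing prevalence condition, assuming 6% of trials show
# "overweight" stimuli drawn uniformly from 30 images.
def expected_presentations(n_subjects, trials=200, prevalence=0.06, n_images=30):
    return n_subjects * trials * prevalence / n_images

original = expected_presentations(213)     # original study: about 85 per image
replication = expected_presentations(100)  # replication: about 40 per image
```

These expectations match the observed ranges in Figure 9 (73–96 presentations per “overweight” image in the original data, 35–51 in the replication), compared with roughly 680 and 330 presentations per image in the stable condition.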
Implications of Nonreplication for the Prevalence Induced Concept Change Hypothesis
If we look more closely at the results for the range of BMI values where judgments were ambiguous, we can see that the pattern of results looks similar in the original data and the replication data.
Figure 10: Data for the last 200 trials
A (Original Data)
B (Replication Data)
Figure 10 above shows that only one datapoint in the replication data falls clearly outside the margin of error (BMI = 29.57), but the overall pattern looks similar to the original data. This suggests that, despite the issues with the experimental design, the original study may have detected an effect because it was far more highly powered than hypothesis 1 alone required; its sample size was driven by the higher statistical power needed for hypotheses 2 and 3. The replication study was powered appropriately according to the original study’s simulation analysis, but its effective power was lower than simulated because the ambiguous images were miscategorized into the overweight group.
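The wide margins of error in the increasing prevalence condition follow directly from the small presentation counts. As an illustration (not the analysis used in either paper), here is a normal-approximation 95% confidence interval for a judged-overweight proportion, using the replication counts for the BMI 23.35 image from Figure 9B:

```python
import math

def prop_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# BMI 23.35 image, replication data, last 200 trials (Figure 9B):
# increasing condition: 15.69% "overweight" out of 51 presentations (8 of 51)
# stable condition:     10.71% "overweight" out of 336 presentations (36 of 336)
lo_inc, hi_inc = prop_ci(8, 51)
lo_sta, hi_sta = prop_ci(36, 336)

print(f"increasing: [{lo_inc:.3f}, {hi_inc:.3f}]")  # roughly [0.057, 0.257]
print(f"stable:     [{lo_sta:.3f}, {hi_sta:.3f}]")  # roughly [0.074, 0.140]
```

With only 51 presentations, the increasing-condition interval spans about 20 percentage points, roughly three times the width of the stable condition’s interval, so per-image comparisons in the increasing prevalence condition are dominated by sampling noise.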
Proposed Experimental Design Changes
In our view, a better threshold between the “thin” and “overweight” images for testing this hypothesis would be a BMI of 31.84 (the high end of the “median” range reported by Moussally et al. (2017)). This threshold would ensure that participants have many opportunities to judge the images in the ambiguous range, where prevalence induced concept change is most likely to be observed, making the experiment better suited to detecting the hypothesized effect.
Additionally, this experiment would benefit from having more stimuli in the ambiguous range of values, i.e., more stimuli with BMIs between 23.35 and 31.84. In this study only 5 of the images (BMIs 23.35, 25.37, 27.37, 29.57, and 31.84) fall in the range Moussally et al. determined to be “median.” A larger set of stimuli in the ambiguous range would provide more data points in the range most relevant for testing the hypothesis. We recognize that this change would require developing and validating additional stimuli, which would be labor-intensive.
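One simple way to implement this recommendation would be to space additional stimulus BMI targets evenly across the ambiguous range, analogous to the dense sampling of the color continuum in Levari et al. (2018). The sketch below is purely illustrative: the endpoints come from the Moussally et al. (2017) “median” range, and the count of 15 stimuli is a hypothetical choice, not a recommendation from either paper:

```python
# Illustrative, evenly spaced BMI targets spanning the ambiguous range.
# Endpoints (23.35, 31.84) come from the Moussally et al. (2017) "median"
# range; the number of stimuli (15) is an arbitrary illustrative choice.
def ambiguous_bmi_targets(low=23.35, high=31.84, n=15):
    step = (high - low) / (n - 1)
    return [round(low + i * step, 2) for i in range(n)]

targets = ambiguous_bmi_targets()  # 15 values from 23.35 to 31.84
```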
Comparing the stimuli used in this study to those used in the Levari et al. (2018) experiment, on which this study is based, helps explain why this matters for testing the hypothesis. Levari et al. tested prevalence induced concept change using images of 100 dots that ranged in color from purple to blue. When they decreased the prevalence of blue dots, they found that people were more likely to consider ambiguous dots to be blue. Stimuli from Levari et al.’s paper can be seen in Figure 11c, where there are 18-19 stimuli at color values in between each of the dots shown. These representative stimuli make clear that there were many different stimuli in the ambiguous range of values.
Figure 11: Levari et al. (2018) Colored Dots Study 1
A-B (Results visualization)
C (Color spectrum stimuli examples)
Prevalence-induced concept change should be observable mainly in ambiguous stimuli. We expect this effect to be non-existent for the extreme exemplars of the relevant conceptual category. That is, the bluest dots will always be identified as blue, but judgements of ambiguous dots should be susceptible to the effect. Looking at Figures 11a-11b, a substantial fraction of the 100 different dot images were ambiguous (identified as blue some of the time, rather than 100% or 0% of the time). A wide range of ambiguous stimuli makes this effect easier to capture. Additionally, these ambiguous dots were clustered on the purple half of the color spectrum. This is important because Levari et al.’s manipulation increased the frequency of the purple-spectrum dots. So, their data contained many observations of ambiguous dots despite the condition manipulation decreasing the frequency of blue-spectrum dots. Compare the above Figures 11a-11b from Levari et al. to the below Figure 12 generated from the original body image study data:
Figure 12: PICCBI Original Results Visualization
It’s not possible to see the curve shift in the increasing prevalence condition here (Figure 12), despite the model having a significant result. This is likely because there are many fewer observations in the ambiguous range of stimuli, which makes the model more sensitive to noise at the extreme values. In the corresponding curves for the replication data (Figure 13), noise in the infrequently presented larger-BMI images shapes the divergence between the curves in a way that is not consistent with the hypothesis:
Figure 13: PICCBI Replication Results Visualization
Taking more measurements in the ambiguous range by having more stimulus images with BMI values in that range would improve the ability of this experiment to reliably detect whether prevalence induced concept change occurs for body images.
It’s also worth noting that this issue with the study design was somewhat obscured by the way the paper’s figures presented the data. Instead of using curves like those in the Levari et al. (2018) paper, the data for this study were presented as the percentage of overweight ratings for the first 200 trials subtracted from the last 200 trials, as seen in Figure 5. This method highlights the relevant change from the early trials to the later trials, but it does not clearly present the actual values. Many of those values did not change from the early to the late trials because they were near the ceiling or the floor (almost all judgements went one way). Because the paper did not report the actual percentages of overweight judgements, readers could not tell which stimuli were near the ceiling or floor and which were ambiguous, information that would have been useful for interpreting the results of this study.
By incorporating these changes, a new version of this study would shed a lot of light on the question of whether prevalence induced concept change can be reliably detected for body images.
Conclusion
The results of the original paper failed to replicate, which we suspect was due to the experiment being less sensitive to the effect than anticipated. For this reason we emphasize that our findings do not provide strong evidence against the original hypothesis. Prevalence-induced concept change may affect women’s body image judgements, but at the current sample size the present experiment was less sensitive to this effect than previously believed. The design could be improved by raising the BMI cutoff between “thin” and “overweight” images for the prevalence manipulation and/or including additional stimuli within the range of ambiguous body sizes (BMI 23.35 – 31.84) to increase the frequency of ambiguous stimuli, which are the most important for demonstrating a change in concept.
The clarity rating of 2.5 stars was due to two factors. The original discussion section did not address the potential implications of the lack of support for hypotheses 2 and 3. Since hypotheses 2 and 3 related to people applying these changes in the concept of thinness to their own bodies, the lack of support for those hypotheses may limit the claims that should be made about potential real world effects of prevalence induced concept change for body image. Additionally, the difficulty of determining the stimulus BMI values, the thin/overweight cutoff value, and the range of results for which judgements were ambiguous from the information presented in the paper could leave readers with misunderstandings about the study’s methods and results.
The study had a high transparency rating of 4.5 stars because all of the original data, experiment/analysis code, and pre-registration materials were publicly available. There were minor discrepancies in exclusion criteria based on reaction times between the pre-registration and the analysis, and some documentation for exclusion criteria and code for evaluating participant quality wasn’t publicly posted. However, the undocumented code was provided upon request, and the inconsistency in exclusion criteria was subtle and likely had no bearing on the results.
Author Acknowledgements
We would like to thank the authors of “Changes in the Prevalence of Thin Bodies Bias Young Women’s Judgements About Body Size”: Sean Devine, Nathalie Germain, Stefan Ehrlich, and Ben Eppinger for everything they’ve done to make this replication report possible. We thank them for their original research and for making their data, experiment code, analysis, and other materials publicly available. The original authors provided feedback with expedient, supportive correspondence and this report was greatly improved by their input.
Thank you to Isaac Handley-Miner for your consultation on multilevel modeling for our analysis. Your expertise was invaluable.
Thank you to Soundar and Nathan from the Positly team for your technical support with the data collection.
Thank you to Spencer Greenberg for your guidance and feedback throughout the project.
Last, but certainly not least, thank you to all 249 individuals who participated in the replication experiment.
Response from the Original Authors
The original paper’s authorship team offers this response (PDF) to our report. We are grateful for their thoughtful engagement.
Purpose of Transparent Replications by Clearer Thinking
Transparent Replications conducts replications and evaluates the transparency of randomly-selected, recently-published psychology papers in prestigious journals, with the overall aim of rewarding best practices and shifting incentives in social science toward more replicable research.
We welcome reader feedback on this report, and input on this project overall.
Appendices
Additional Information about the Exclusion Criteria
249 participants completed the main task
- 8 participants were excluded due to technical malfunctions.
- 5 of these participants did not have their data written due to terminating their connection to Pavlovia before the data saving operations could complete. These participants were compensated for completion of the full task.
- 3 of these participants were excluded for incomplete data sets. These 3 exclusions stand out as unexplained data-writing malfunctions. These participants were compensated for completion of the full task, despite the partial datasets.
- 8 participants were excluded for reporting anything other than “Female” for their gender on the questionnaire.
- 23 participants were excluded for being over 30 years old.
- 6 participants were excluded for taking longer than 7 seconds to respond on more than 10 trials.
- 4 participants were excluded for obviously erratic behavior.
The “erratic behavior” exclusions were determined by generating graphical representations of individual subject judgements over time and manually reviewing them for signs of unreasonable behavior. The code for generating these individual subject graphs was provided by the original authors, and we consulted with the original authors on their assessment of the graphs. The generation code and a complete set of graphics can be found in our GitLab repository. Figure 14a is an example of expected behavior from a participant: they tended to judge very thin stimuli as “not overweight” and very overweight stimuli as “overweight,” with some variance, especially around ambiguous stimuli closer to the middle of the spectrum. Figures 14b-14e show the subjects we excluded based on their curves. The participant in 14b made judgments exactly opposite the expected behavior for their first 200 trials, which indicates confusion about which key on the keyboard corresponded to which judgment. In 14c, the participant’s judgements in the last 200 trials appear completely random; they likely stopped paying attention at some point during the task. Because this criterion is somewhat subjective, only the most obviously invalid data were excluded. Any participants with questionable but ambiguous curves had their data included to avoid the possibility of biased exclusions.
Figure 14: Individual Subject Curves
A (Good Subject Curve)
B
C
D
E
References
Devine, S., Germain, N., Ehrlich, S., & Eppinger, B. (2022). Changes in the prevalence of thin bodies bias young women’s judgments about body size. Psychological Science, 33(8), 1212-1225. https://doi.org/10.1177/09567976221082941
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149-1160. https://doi.org/10.3758/BRM.41.4.1149
Levari, D. E., Gilbert, D. T., Wilson, T. D., Sievers, B., Amodio, D. M., & Wheatley, T. (2018). Prevalence-induced concept change in human judgment. Science, 360(6396), 1465-1467. https://doi.org/10.1126/science.aap8731
Moussally, J. M., Rochat, L., Posada, A., & Van der Linden, M. (2017). A database of body-only computer-generated pictures of women for body-image studies: Development and preliminary validation. Behavior Research Methods, 49(1), 172-183. https://doi.org/10.3758/s13428-016-0703-7
World Health Organization. (1995). Physical status: The use of and interpretation of anthropometry. Report of a WHO expert committee. https://apps.who.int/iris/handle/10665/37003
- We are using the category labels “thin” and “overweight” because they were used in the original paper. These labels do not necessarily correspond to their everyday meanings, and should not be taken as objective measures of health or perception, nor as the opinions of the researchers. More information on the decisions behind the categorizations can be found in the Understanding the Categorizations Used section.