Transparent Replications

by Clearer Thinking

Rapid replications for reliable research

Explaining our Replicability Ratings


Transparent Replications rates the studies that we replicate on three main criteria: transparency, replicability, and clarity. You can read more about our transparency ratings here.

The replicability rating is our evaluation of the degree of consistency between the findings we obtained in our replication study and the findings in the original study. Our goal with this rating is to give readers an at-a-glance understanding of how closely our results matched those of the original study. We report it as the number of findings that replicated out of the total number of findings reported in the original study, and we also convert it to a star rating (out of 5 stars). So if 50% of the findings replicated, we would give the paper 2.5 stars; if 100% of the findings replicated, we would give 5 stars.
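The conversion described above is a simple proportion scaled to five stars. As a minimal illustration (not an official tool, and assuming no rounding beyond the raw proportion):

```python
def star_rating(replicated: int, total: int) -> float:
    """Convert the fraction of replicated findings to a 0-5 star score."""
    if total <= 0:
        raise ValueError("total number of findings must be positive")
    return 5 * replicated / total

# Examples from the text: 50% of findings replicating -> 2.5 stars,
# 100% replicating -> 5 stars.
print(star_rating(1, 2))  # 2.5
print(star_rating(4, 4))  # 5.0
```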

That initially sounds simple, but we had to make a few key decisions about what counts and what doesn’t when it comes to assessing whether findings replicate.

What findings count when replicating a study? 

Studies often examine several questions. Sometimes a table with many results will be presented, but only some of those results pertain to hypotheses that the researchers are testing. Should all of the presented results be considered when assessing how well a study replicates? If not, then how does one choose which results to consider?

Our answer to this question is that the results we consider to be the study’s findings are the ones that pertain to the primary hypotheses the researchers present in their paper. 

This means that if a table of results shows a significant relationship between certain variables, but that relationship isn’t part of the theory the paper is testing, we don’t consider whether that result is significant in our replication study when assigning our replicability rating. For example, a study using a linear regression model may include socioeconomic status as a control variable, and in the original regression, socioeconomic status may have a significant relationship with the dependent variable. In the replication, there may be no significant relationship between socioeconomic status and the dependent variable. If the original study doesn’t hypothesize a relationship between socioeconomic status and the dependent variable, then the absence of that relationship in the replication results would not affect the study’s replicability rating.

This also means that if a result is null in the original paper, whether it remains null in the replication is typically only relevant to our ratings if the original authors hypothesized a null result for reasons driven by the claims they make in the paper.

We use this approach to determine which findings to evaluate because we want our replication to be fair to the original study, and we want our ratings to communicate clearly about what we found. If our rating included an evaluation of results that the authors make no claims about, it would not be a fair assessment of how well the study replicated. And if a reader glances at the main claims of the study and then glances at our replicability rating, the reader should get an accurate impression of whether our results were consistent with the authors’ main claims.

In our replication pre-registrations, we list which findings are included when evaluating replicability. In some cases, this involves judgment calls. For instance, if a statistic is somewhat but not closely related to a key hypothesis in the paper, we have to decide whether it is related enough to include. It’s important that we make this decision before collecting data. This ensures that the findings that comprise the rating are determined before we know the replication results.

What counts as getting the same results?

When evaluating the replication results, we need to know in advance what statistical result would count as a finding replicating and what wouldn’t. Typically, for a result to be considered the same as in the original study, it must be in the same direction as the original result, and it must meet a standard for statistical significance.
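The decision rule described above can be sketched in code. This is an illustrative simplification, not our actual analysis pipeline: the function name, the sign-based notion of "direction," and the default 0.05 threshold are all assumptions for the sake of the example (in practice, the significance criterion matches the original paper's method).

```python
def finding_replicated(original_effect: float,
                       replication_effect: float,
                       replication_p: float,
                       alpha: float = 0.05) -> bool:
    """A finding counts as replicated if the replication effect is in the
    same direction as the original effect and meets the significance
    threshold. (Effects of exactly zero are not handled in this sketch.)"""
    same_direction = (original_effect > 0) == (replication_effect > 0)
    return same_direction and replication_p < alpha

# Same direction and significant -> replicated.
print(finding_replicated(0.4, 0.3, 0.01))   # True
# Opposite direction -> not replicated, even if significant.
print(finding_replicated(0.4, -0.1, 0.01))  # False
```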

There may be cases where the original study does not find support for one of the authors’ original hypotheses, but we find a significant result supporting the hypothesis in our replication study. Although this result is different from the results obtained in the original study, it is a result in support of the original authors’ hypothesis. We would discuss this result in our report, but it would not be included in the replicability rating because the original study’s null result is not considered one of the original study’s findings being tested in the replication (as explained above).

There have been many critiques of the way statistical significance is used to inform one’s understanding of results in the social sciences, and some researchers have started using alternative methods to assess whether a result should be considered evidence of a real effect rather than random chance. The way we determine statistical significance in our replication studies will typically be consistent with the method used in the original paper, since we are attempting to see if the results as presented in the original study can be reproduced on their own terms. If we have concerns about how the statistical significance of the results is established in the original study, those concerns will be addressed in our report, and may inform the study’s clarity rating. In such cases, we may also conduct extra analyses (in addition to those performed in the original paper) and compare them to the original paper’s results as well.

In some replication projects with very large sample sizes, such as Many Labs, a minimum effect size might also need to be established to determine whether a finding has replicated because the extremely high statistical power will mean that even tiny effects are statistically significant. In our case this isn’t likely to be necessary because, unless the original study was underpowered, the statistical power of our studies will not be dramatically larger than that of the original study.

In our replication pre-registrations, we define what statistical results we would count as replicating the original findings.

What does the replicability rating mean?

With this understanding of how the replicability rating is constructed, how should it be interpreted?

If a study has a high replicability rating, that means that conducting the same experiment and performing the same analyses on the newly collected data generated results that were largely consistent with the findings of the original paper.

If a study has a low replicability rating, it means that many of the results in the replication study were not consistent with the findings reported in the original study. This should reduce one’s confidence that the original hypotheses are correct.

A low replicability rating for a study does not mean that the original researchers did something wrong. A study that is well-designed, properly conducted, and correctly analyzed will sometimes generate results that do not replicate. Even the best research has some chance of being a false positive. When that happens, researchers have the opportunity to incorporate the results from the replication into their understanding of the questions under study and to use those results to aid in their future investigations. It’s also possible that we will obtain a false negative result in a replication study (no study has 100% power to detect an effect).

The replicability rating also does not evaluate the validity of the original experiment as a test of the theory being presented, or whether the analyses chosen were the best analyses to test the hypotheses. Questions about the validity, generalizability, and appropriateness of the analyses are addressed in our “clarity” rating, not our “replicability” rating.

For these reasons, we encourage looking at the replicability rating in the context of the transparency and clarity ratings to get a more complete picture of the study being evaluated. For example, if a study with a high replicability rating received a low transparency rating, then the study didn’t use open science best practices, which means that we may not have had access to all the original materials needed to replicate the study accurately. Or in the case of a study with a high replicability rating, but a low clarity rating, we can infer that the experimental protocol generates consistent results, but that there may be questions about what those results should be understood to mean.

As we conduct more replications, we expect to learn from the process. Hence, our procedures (including those mentioned in this article) may change over time as we discover flaws in our process and improve it.

By rigorously evaluating studies using these three criteria (“transparency,” “replicability,” and “clarity”), we aim to encourage and reward the publication of reliable research results that people can be highly confident in applying or building on in future research.