Excerpts and comments on researcher degrees of freedom and the file drawer problem

I have been trying to reproduce several studies and have noticed that the reporting of results from these studies often presents a much stronger impression of results than I get from an investigation of the data itself. I plan to report some of these reproduction attempts, so I have been reading literature on researcher degrees of freedom and the file drawer problem. Below I’ll post and comment on some interesting passages that I have happened upon.

To put it another way: without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology. (Gelman and Loken, 2013, 14-15, emphasis in the original)


I wonder how many people in the general population take seriously general claims based on only small mTurk and college student samples, provided that these people are informed that these general claims are based only on small unrepresentative samples; I suspect that some of the “taking seriously” that leads to publication in leading psychology journals reflects professional courtesy among peer researchers whose work is also largely based on small unrepresentative samples.

Maybe it’s because I haven’t done much work with small unrepresentative samples, but I feel cheated when investing time in an article framed in general language that has conclusions based on small unrepresentative samples. Here’s an article that I recently happened upon: “White Americans’ opposition to affirmative action: Group interest and the harm to beneficiaries objection.” The abstract:

We focused on a powerful objection to affirmative action – that affirmative action harms its intended beneficiaries by undermining their self-esteem. We tested whether White Americans would raise the harm to beneficiaries objection particularly when it is in their group interest. When led to believe that affirmative action harmed Whites, participants endorsed the harm to beneficiaries objection more than when led to believe that affirmative action did not harm Whites. Endorsement of a merit-based objection to affirmative action did not differ as a function of the policy’s impact on Whites. White Americans used a concern for the intended beneficiaries of affirmative action in a way that seems to further the interest of their own group.


So who were these white Americans?

Sixty White American students (37% female, mean age = 19.6) at the University of Kansas participated in exchange for partial course credit. One participant did not complete the dependent measure, leaving 59 participants in the final sample. (p. 898)


I won’t argue that this sort of research should not be done, but I’d like to see this sort of exploratory research replicated with a more representative sample. One of the four co-authors listed her institutional affiliation at California State University San Bernardino, and two other co-authors listed their institutional affiliation at Tulane University, so I would have liked to have seen a second study among a different sample of students. At the very least, I’d like to see a description of the restricted nature of the sample in the abstract to let me and other readers make a more informed judgment about the value of investing time in the article.

The Gelman and Loken (2013) passage cited above reminded me of a recent controversy regarding a replication attempt of Schnall et al. (2008). I read about the controversy in a Nicole Janz post at Political Science Replication. The result of the replication (a perceived failure to replicate) was not shocking because Schnall et al. (2008) had reported only two experiments based on data from 40 and 43 University of Plymouth undergraduates.

Schnall in a post on the replication attempt:

My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.


I like the term “data detectives” a bit better than “replication police” (h/t Nicole Janz), so I think that I might adopt the label “data detective” for myself.

I can sympathize with the graduate students’ fear that someone might target my work and try to find an error in that work, but that’s a necessary occupational hazard for a scientist.

The best way to protect research from data detectives is to produce reproducible and perceived replicable research; one of the worst ways to protect research from data detectives is to publish low-powered studies in a high-profile journal, because the high profile draws attention and the low power increases suspicions that the finding was due to the non-reporting of failed experiments.

From McBee and Matthews (2014):

Researchers who try to serve the interests of science are going to find themselves out-competed by those who elect to “play the game,” because the ethical researcher will conduct a number of studies that will prove unpublishable because they lack statistically significant findings, whereas the careerist will find ways to achieve significance far more frequently. (p. 77)


This reflects part of the benefit produced by data detectives and the replication police: a more even playing field for researchers reluctant to take advantage of researcher degrees of freedom.

This Francis (2012) article is an example of a data detective targeting an article to detect non-reporting of experiments. Balcetis and Dunning (2010) reported five experiments rejecting the null hypothesis; the experiments had Ns, effect sizes, and powers as listed below in a table drawn from Francis (2012) p. 176.

Francis 2012Francis summed the powers to get 3.11, which indicates the number of times that we should expect the null hypothesis to be rejected given the observed effect sizes and powers of the 5 experiments; Francis multiplied the powers to get 0.076, which indicates the probability that the null hypothesis will be rejected in all 5 experiments.

Here is Francis again detecting more improbable results. And again. Here’s a back-and-forth between Simonsohn and Francis on Francis’ publication bias studies.

Here’s the Galak and Meyvis (2012) reply to another study in which Francis claimed to have detected non-reporting of experiments in Galak and Meyvis (2011). Galak and Meyvis admit to the non-reporting:

We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant. (p. 595)


…but argue that it’s not a problem because they weren’t interested in effect sizes:

However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions. (p. 595)


Even if it is true that the authors were unconcerned with effect size, I do not understand how that justifies not reporting results that fail to reach conventional levels of statistical significance.

So what about readers who *are* interested in effect sizes? Galak and Meyvis write:

If a researcher is interested in estimating the size of an effect reported in a published paper, we recommend asking the authors for their file drawer and conducting a meta-analysis. (p. 595-596)


That’s an interesting solution: if you are reading an article and wonder about the effect size, put down the article, email the researchers, hope that the researchers respond, hope that the researchers send the data, and then — if you receive the data — conduct your own meta-analysis.

The ex ante value of replications

Jeremy Freese recently linked to a Jason Mitchell essay that discussed perceived problems with replications. Mitchell discussed many facets of replication, but I will restrict this post to Mitchell’s claim that “[r]ecent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value.”

Mitchell’s claim appears to be based on a perceived asymmetry between positive and negative findings: “When an experiment succeeds, we can celebrate that the phenomenon survived these all-too-frequent shortcomings. But when an experiment fails, we can only wallow in uncertainty about whether a phenomenon simply does not exist or, rather, whether we were just a bit too human that time around.”

Mitchell is correct that a null finding can be caused by experimental error, but Mitchell appears to overlook the fact that positive findings can also be caused by experimental error.

Mitchell also appears to confront only the possible “ex post” value of replications, but there is a possible “ex ante” value to replications.

Ward Farnsworth discussed ex post and ex ante thinking using the example of a person who accidentally builds a house that extends onto a neighbor’s property: ex post thinking concerns how to best resolve the situation at hand, but ex ante thinking concerns how to make this problem less likely to occur in the future; tearing down the house is a wasteful decision through the perspective of ex post thinking, but it is a good decision from the ex ante perspective because it incentivizes more careful construction in the future.

In a similar way, the threat of replication incentivizes more careful social science. Rational replicators should gravitate toward research for which the evidence appears to be relatively fragile: all else equal, the value of a replication is higher for replicating a study based on 83 undergraduates at one particular college than for replicating a study based on a nationally-representative sample of 1,000 persons; all else equal, a replicator should pass on replicating a stereotype threat study in which the dependent variable is percent correct in favor of replicating a study in which the stereotype effect was detected only using the more unusual measure of percent accuracy, measured as the percent correct of the problems that the respondent attempted.

Mitchell is correct that there is a real possibility that a researcher’s positive finding will not be replicated because of error on the part of the replicator, but, as a silver lining, this negative possibility incentivizes researchers concerned about failed replications to produce higher-quality research that reduces the chance that a replicator targets their research in the first place.