Excerpts and comments on researcher degrees of freedom and the file drawer problem

I have been trying to reproduce several studies and have noticed that the reporting of results from these studies often presents a much stronger impression than the one I get from investigating the data myself. I plan to report some of these reproduction attempts, so I have been reading the literature on researcher degrees of freedom and the file drawer problem. Below I’ll post and comment on some interesting passages that I have happened upon.

To put it another way: without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology. (Gelman and Loken, 2013, 14-15, emphasis in the original)

 

I wonder how many people in the general population take seriously general claims based on only small mTurk and college student samples, provided that these people are informed that these general claims are based only on small unrepresentative samples; I suspect that some of the “taking seriously” that leads to publication in leading psychology journals reflects professional courtesy among peer researchers whose work is also largely based on small unrepresentative samples.

Maybe it’s because I haven’t done much work with small unrepresentative samples, but I feel cheated when investing time in an article framed in general language that has conclusions based on small unrepresentative samples. Here’s an article that I recently happened upon: “White Americans’ opposition to affirmative action: Group interest and the harm to beneficiaries objection.” The abstract:

We focused on a powerful objection to affirmative action – that affirmative action harms its intended beneficiaries by undermining their self-esteem. We tested whether White Americans would raise the harm to beneficiaries objection particularly when it is in their group interest. When led to believe that affirmative action harmed Whites, participants endorsed the harm to beneficiaries objection more than when led to believe that affirmative action did not harm Whites. Endorsement of a merit-based objection to affirmative action did not differ as a function of the policy’s impact on Whites. White Americans used a concern for the intended beneficiaries of affirmative action in a way that seems to further the interest of their own group.

 

So who were these white Americans?

Sixty White American students (37% female, mean age = 19.6) at the University of Kansas participated in exchange for partial course credit. One participant did not complete the dependent measure, leaving 59 participants in the final sample. (p. 898)

 

I won’t argue that this sort of research should not be done, but I’d like to see this sort of exploratory research replicated with a more representative sample. One of the four co-authors listed her institutional affiliation at California State University San Bernardino, and two others listed theirs at Tulane University, so I would have liked to see at least a second study among a different sample of students. At the very least, I’d like the abstract to describe the restricted nature of the sample, so that I and other readers can make a more informed judgment about the value of investing time in the article.

The Gelman and Loken (2013) passage cited above reminded me of a recent controversy regarding a replication attempt of Schnall et al. (2008). I read about the controversy in a Nicole Janz post at Political Science Replication. The result of the replication (a perceived failure to replicate) was not shocking because Schnall et al. (2008) had reported only two experiments based on data from 40 and 43 University of Plymouth undergraduates.

Schnall in a post on the replication attempt:

My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.

 

I like the term “data detectives” a bit better than “replication police” (h/t Nicole Janz), so I think that I might adopt the label “data detective” for myself.

I can sympathize with the graduate students’ fear, since someone might likewise target my work and try to find an error in it, but that sort of scrutiny is a necessary occupational hazard for a scientist.

The best way to protect research from data detectives is to produce reproducible and perceived replicable research; one of the worst ways to protect research from data detectives is to publish low-powered studies in a high-profile journal, because the high profile draws attention and the low power increases suspicions that the finding was due to the non-reporting of failed experiments.

From McBee and Matthews (2014):

Researchers who try to serve the interests of science are going to find themselves out-competed by those who elect to “play the game,” because the ethical researcher will conduct a number of studies that will prove unpublishable because they lack statistically significant findings, whereas the careerist will find ways to achieve significance far more frequently. (p. 77)

 

This reflects part of the benefit produced by data detectives and the replication police: a more even playing field for researchers reluctant to take advantage of researcher degrees of freedom.

This Francis (2012) article is an example of a data detective targeting an article to detect non-reporting of experiments. Balcetis and Dunning (2010) reported five experiments rejecting the null hypothesis; the experiments had Ns, effect sizes, and powers as listed below in a table drawn from Francis (2012) p. 176.

Francis 2012

Francis summed the powers to get 3.11, the number of times that we should expect the null hypothesis to be rejected given the observed effect sizes and powers of the 5 experiments; Francis multiplied the powers to get 0.076, the probability that the null hypothesis would be rejected in all 5 experiments.
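Given the five reported power values, both of Francis’s quantities are one-line computations. Here is a sketch with hypothetical power values chosen only for illustration; the actual powers are in Francis’s table on p. 176:

```python
# Hypothetical power values for five experiments (illustrative only;
# Francis 2012 reports the actual powers on p. 176).
powers = [0.70, 0.70, 0.70, 0.70, 0.31]

# Expected number of significant results if each experiment is run once.
expected_rejections = sum(powers)

# Probability that all five experiments reject the null hypothesis.
p_all_reject = 1.0
for p in powers:
    p_all_reject *= p

print(round(expected_rejections, 2))  # 3.11
print(round(p_all_reject, 3))         # 0.074
```

A product this small is the point of the exercise: under the reported effect sizes, a clean sweep of five significant results is itself an improbable outcome, which raises suspicion that unsuccessful experiments went unreported.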

Here is Francis again detecting more improbable results. And again. Here’s a back-and-forth between Simonsohn and Francis on Francis’ publication bias studies.

Here’s the Galak and Meyvis (2012) reply to another study in which Francis claimed to have detected non-reporting of experiments in Galak and Meyvis (2011). Galak and Meyvis admit to the non-reporting:

We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant. (p. 595)

 

…but argue that it’s not a problem because they weren’t interested in effect sizes:

However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions. (p. 595)

 

Even if it is true that the authors were unconcerned with effect size, I do not understand how that justifies not reporting results that fail to reach conventional levels of statistical significance.

So what about readers who *are* interested in effect sizes? Galak and Meyvis write:

If a researcher is interested in estimating the size of an effect reported in a published paper, we recommend asking the authors for their file drawer and conducting a meta-analysis. (p. 595-596)

 

That’s an interesting solution: if you are reading an article and wonder about the effect size, put down the article, email the researchers, hope that the researchers respond, hope that the researchers send the data, and then — if you receive the data — conduct your own meta-analysis.

Self-archived articles

I came across an interesting site, Dynamic Ecology, and saw a post on self-archiving of journal articles. The post mentioned SHERPA/RoMEO, which lists archiving policies for many journals. The only journal that I have published in whose SHERPA/RoMEO-listed policy permits self-archiving is PS: Political Science & Politics, so I am linking below to pdfs of the PS articles that I have published.

This first article attempts to help graduate students who need seminar paper ideas. The article grew out of a graduate seminar in US voting behavior with David C. Barker. I noticed that several articles on the seminar reading list had placed in top-tier journals despite making an incremental theoretical contribution using publicly-available data, which was something that I as a graduate student felt I could realistically aspire to.

For instance, John R. Petrocik in 1996 provided evidence that candidates and parties “owned” certain issues, such as Democrats owning care for the poor and Republicans owning national defense. Danny Hayes extended that idea by using publicly-available ANES data to provide evidence that candidates and parties owned certain traits, such as Democrats being more compassionate and Republicans being more moral.

The original manuscript identified the Hayes article as a travel-type article in which the traveling is done by analogy. The final version of the manuscript lost the Hayes citation but had 19 other ideas for seminar papers. Ideas on the cutting room floor included replication and picking a fight with another researcher.

Of Publishable Quality: Ideas for Political Science Seminar Papers. 2011. PS: Political Science & Politics 44(3): 629-633.

  1. pdf version, copyright held by American Political Science Association

This next article grew out of reviews that I conducted for friends, colleagues, and journals. I noticed that I kept making the same or similar comments, so I produced a central repository for generalized forms of these comments in the hope that — for example — I do not review any more manuscripts that formally list hypotheses about the control variables.

Rookie Mistakes: Preemptive Comments on Graduate Student Empirical Research Manuscripts. 2013. PS: Political Science & Politics 46(1): 142-146.

  1. pdf version, copyright held by American Political Science Association

The next article grew out of friend and colleague Jonathan Reilly’s dissertation. Jonathan noticed that studies of support for democracy had treated “don’t know” responses as if the respondents had never been asked the question. So even though 73 percent of respondents in China expressed support for democracy, that figure was reported as 96 percent because “don’t know” responses were removed from the analysis.

The manuscript initially did not include imputation of preferences for non-substantive responders, but a referee encouraged us to estimate missing preferences. My prior was that multiple imputation was “making stuff up,” but research into missing data methods taught me that the alternative — deletion of cases — assumes that cases are missing completely at random, which did not appear to be true in our study: the percent of missing cases in a country correlated at -0.30 and -0.43 with the country’s Polity IV democratic rating, which meant that respondents were more likely to issue a non-substantive response in countries where political and social liberties are more restricted.
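A toy calculation shows how non-random missingness distorts listwise-deleted estimates. The counts below are invented for illustration and are not the survey’s actual response counts:

```python
# Hypothetical country: 73 percent of 1,000 respondents support democracy,
# but "don't know" responses come mostly from non-supporters (invented counts).
supporters, nonsupporters = 730, 270
dk_supporters, dk_nonsupporters = 20, 250   # assumed non-random missingness

answered_support = supporters - dk_supporters        # 710 substantive supporters
answered_oppose = nonsupporters - dk_nonsupporters   # 20 substantive opponents

true_support = supporters / (supporters + nonsupporters)
deleted_estimate = answered_support / (answered_support + answered_oppose)

print(round(true_support, 2))      # 0.73
print(round(deleted_estimate, 2))  # 0.97: deletion inflates measured support
```

Listwise deletion recovers the true 73 percent only if “don’t know” responses are spread randomly across supporters and opponents; the Polity correlations suggested that they are not.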

Don’t Know Much about Democracy: Reporting Survey Data with Non-Substantive Responses. 2012. PS: Political Science & Politics 45(3): 462-467. Second author, with Jonathan Reilly.

  1. pdf version, copyright held by American Political Science Association

Measuring abortion absolutism

The American National Elections Studies (ANES) has measured abortion attitudes since 1980 with an item that dramatically inflates the percentage of pro-choice absolutists:

There has been some discussion about abortion during recent years. Which one of the opinions on this page best agrees with your view? You can just tell me the number of the opinion you choose.
1. By law, abortion should never be permitted.
2. The law should permit abortion only in case of rape, incest, or when the woman’s life is in danger.
3. The law should permit abortion for reasons other than rape, incest, or danger to the woman’s life, but only after the need for the abortion has been clearly established.
4. By law, a woman should always be able to obtain an abortion as a matter of personal choice.
5. Other {SPECIFY}

 

In a book chapter of Improving Public Opinion Surveys: Interdisciplinary Innovation and the American National Election Studies, Heather Marie Rice and I discussed this measure and results from a new abortion attitudes measure piloted in 2006 and included on the 2008 ANES Time Series Study. The 2006 and 2008 studies did not ask any respondents both abortion attitudes measures, but the 2012 study did. This post presents data from the 2012 study describing how persons selecting an absolute abortion policy option responded when asked about policies for specific abortion conditions.

Based on the five-part item above, and removing from the analysis the five persons who provided an Other response, 44 percent of the population agreed that “[b]y law, a woman should always be able to obtain an abortion as a matter of personal choice.” The figure below indicates how these pro-choice absolutists later responded to items about specific abortion conditions.

Red bars indicate the percentage of persons who agreed on the 2012 pre-election survey that “[b]y law, a woman should always be able to obtain an abortion as a matter of personal choice” but reported opposition to abortion for the corresponding condition in the 2012 post-election survey.

2012abortionANESprochoice4

Sixty-six percent of these pro-choice absolutists on the 2012 pre-election survey later reported opposition to abortion if the reason for the abortion is that the child will not be the sex that the pregnant woman wanted. Eighteen percent of these pro-choice absolutists later reported neither favoring nor opposing abortion for that reason, and 16 percent later reported favoring abortion for that reason. Remember that this 16 percent favoring abortion for reasons of fetal sex selection is 16 percent of the pro-choice absolutist subsample.

In the overall US population, only 8 percent favor abortion for fetal sex selection; this 8 percent is a more accurate estimate of the percent of pro-choice absolutists in the population than the 44 percent estimate from the five-part item.

Based on the five-part item above, and removing from the analysis the five persons who provided an Other response, 12 percent of the population agreed that “[b]y law, abortion should never be permitted.” The figure below indicates how these pro-life absolutists later responded to items about specific abortion conditions.

Green bars indicate the percentage of persons who agreed on the 2012 pre-election survey that “[b]y law, abortion should never be permitted” but reported support for abortion for the corresponding condition in the 2012 post-election survey.

2012abortionANESprolife4

Twenty-nine percent of these pro-life absolutists on the 2012 pre-election survey later reported support for abortion if the reason for the abortion is that the woman might die from the pregnancy. Twenty-nine percent of these pro-life absolutists later reported neither favoring nor opposing abortion for that reason, and 42 percent later reported opposing abortion for that reason. Remember that this 42 percent opposing abortion even to save the pregnant woman’s life is 42 percent of the pro-life absolutist subsample.

In the overall US population, only 11 percent oppose abortion if the woman might die from the pregnancy; this 11 percent is a more accurate estimate of the percent of pro-life absolutists in the US population than the 12 percent estimate from the five-part item.

There is a negligible difference in measured pro-life absolutism between the two methods, but the five-part item inflated pro-choice absolutism by a factor of more than five. Our book chapter suggested that this inflation might occur because the typical person considers abortion in terms of the hard cases, especially since the five-part item mentions only the hard cases of rape, incest, and danger to the pregnant woman’s life.

Notes

1. The percent of absolutists is slightly smaller if absolutism is measured as supporting or opposing abortion in each listed condition.

2. The percent of pro-life absolutists is likely overestimated in the “fatal” abortion condition item because the item asks about abortion if “staying pregnant could cause the woman to die”; presumably, there would be less opposition to abortion if the item stated with certainty that staying pregnant would cause the woman to die.

3. Data presented above are for persons who answered the five-part abortion item on the 2012 ANES pre-election survey and answered at least one abortion condition item on the 2012 ANES post-election survey. Don’t know and refusal responses were listwise deleted for each cross-tabulation. Data were weighted with the Stata command svyset [pweight=weight_full], strata(strata_full); weighted cross-tabulations were calculated with the command svy: tabulate X Y if Y==Z, where X is the abortion condition item, Y is the five-part abortion item, and Z is one of the absolute policy options on the five-part item.

4. Here is the text for each abortion condition item that appeared on the 2012 ANES Time Series post-election survey:

[First,/Next,] do you favor, oppose, or neither favor nor oppose abortion being legal if:
* staying pregnant could cause the woman to die
* the pregnancy was caused by the woman being raped
* the fetus will be born with a serious birth defect
* the pregnancy was caused by the woman having sex with a blood relative
* staying pregnant would hurt the woman’s health but is very unlikely to cause her to die
* having the child would be extremely difficult for the woman financially
* the child will not be the sex the woman wants it to be

There was also a general item on the post-election survey:

Next, do you favor, oppose, or neither favor nor oppose abortion being legal if the woman chooses to have one?

5. Follow-up items to the post-election survey abortion items asked respondents to indicate intensity of preference, such as favor a great deal, favor moderately, or favor a little. These follow-up items were not included in the above analysis.

6. There were more than 5000 respondents for the pre-election and post-election surveys.
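The svy: tabulate cross-tabulation described in note 3 reduces to sums of probability weights within the relevant subsample. Here is a minimal sketch with made-up respondents; the variable layout mirrors note 3 but the data and weights are invented:

```python
# Hypothetical rows: (condition-item response, five-part choice, pweight).
rows = [
    ("favor", 4, 1.2), ("oppose", 4, 0.8), ("favor", 4, 1.0),
    ("oppose", 4, 1.5), ("neither", 4, 0.5), ("favor", 1, 2.0),
]

# Mimic svy: tabulate X Y if Y==4: keep pro-choice absolutists (option 4)
# and take the weighted share reporting opposition on the condition item.
subset = [(resp, w) for resp, choice, w in rows if choice == 4]
total_weight = sum(w for _, w in subset)
oppose_weight = sum(w for resp, w in subset if resp == "oppose")

print(round(100 * oppose_weight / total_weight))  # weighted percent opposing: 46
```

The weighted percentage differs from the raw percentage whenever the weights are unequal across response categories, which is why the svyset weighting step matters for the figures above.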

How to use population weights in SPSS Complex Samples

My previous posts discussed the p-values that the base module of SPSS reports for statistical significance tests using weighted data; these p-values are not correct for probability-weighted analyses. Jon Peck informed me of SPSS Complex Samples, which can provide correct p-values for statistical significance tests in probability-weighted analyses. Complex Samples does not have the most intuitive setup, so this post describes the procedure for analyzing data with probability weights in SPSS Statistics 21.

SPSS0

SPSS1

The dataset that I was working with had probability weights but no clustering or stratification, so the Stratify By and Clusters boxes remain empty in the image below.

SPSS4

The next dialog box has options for Simple Systematic and Simple Sequential. Either method will work if Proportions are set to 1 in the subsequent dialog box.

SPSS3

SPSS4

SPSS5

SPSS6

SPSS7

SPSS8

SPSS9

I conducted an independent samples t-test, so I selected the General Linear Model command below.

SPSS10

SPSS11

Click the Statistics button in the image above and then click the t-test box in the image below to tell SPSS to conduct a t-test.

SPSS12

SPSS13

Hit OK to get the output.

rattan2012outputSPSS

The SPSS output above has the same p-value as the probability-weighted Stata output below.

rattan2012outputStata

SPSS ate my observations

My previous post discussed p-values in SPSS and Stata for probability-weighted data. This post provides more information on weighting in the base module of SPSS. Data in this post are from Craig and Richeson (2014), downloaded from the TESS archives; SPSS commands are from personal communication with Maureen Craig, who kindly and quickly shared her replication code.

Figure 2 in Craig and Richeson’s 2014 Personality and Social Psychology Bulletin article depicts point estimates and standard errors for racial feeling thermometer ratings made by white non-Hispanic respondents. The article text confirms what the figure shows: whites in the racial shift condition (who were exposed to a news article titled “In a Generation, Racial Minorities May Be the U.S. Majority”) rated Blacks/African Americans, Latinos/Hispanics, and Asian-Americans lower on the feeling thermometers, at a statistically significant level, than did whites in the control condition (who were exposed to a news article titled “U.S. Census Bureau Reports Residents Now Move at a Higher Rate”).

CraigRicheson2014PSPB

Craig and Richeson generated a weight variable that retained the original post-stratification weights for non-Hispanic white respondents but changed the weight to 0.001 for respondents who were not non-Hispanic white. Figure 2 results were drawn from the SPSS UNIANOVA command, which “provides regression analysis and analysis of variance for one dependent variable by one or more factors and/or variables,” according to the SPSS web entry for the UNIANOVA command.

The SPSS output below represents a weighted analysis in the base SPSS module for the command UNIANOVA therm_bl BY dummyCond WITH cPPAGE cPPEDUCAT cPPGENDER, in which therm_bl, dummyCond, cPPAGE, cPPEDUCAT, and cPPGENDER respectively indicate numeric ratings on a 0-to-100 feeling thermometer scale for blacks, a dummy variable indicating whether the respondent received the control news article or the treatment news article, respondent age, respondent education on a four-level scale, and respondent sex. The 0.027 Sig. value for dummyCond indicates that the mean thermometer rating made by white non-Hispanics in the control condition was different at the 0.027 level of statistical significance from the mean thermometer rating made by white non-Hispanics in the treatment condition.

CR2014PSPB

The image below presents results for the same analysis conducted using probability weights in Stata, with weightCR indicating a weight variable mimicking the post-stratification weight created by Craig and Richeson: the corresponding p-value is 0.182, not 0.027, a difference due to the Stata p-value reflecting a probability-weighted analysis and the SPSS p-value reflecting a frequency-weighted analysis.

CR2014bl0

So why did SPSS return a p-value of 0.027 for dummyCond?

The image below is drawn from online documentation for the SPSS weight command. The second bullet point indicates that SPSS often rounds fractional weights to the nearest integer. The third bullet point indicates that SPSS statistical procedures ignore cases with a weight of zero, so cases with fractional weights that round to zero will be ignored. The first bullet point indicates that SPSS arithmetically replicates a case according to the weight variable: for instance, SPSS treats a case with a weight of 3 as if that case were 3 independent and identical cases.

weightsSPSS

Let’s see if this is what SPSS did. The command gen weightCRround = round(weightCR) in the Stata output below generates a variable with the values of weightCR rounded to the nearest integer. When the Stata command used the frequency weight option with this rounded weight variable, Stata reported p-values identical to the SPSS p-values.

CR2014bl2

The Stata output below illustrates what happened in the above frequency-weighted analysis. The expand weightCRround command replicated each dataset case n-1 times, in which n is the value of the weightCRround variable: for example, each case with a weightCRround value of 3 now appears three times in the dataset. Stata retained one instance of each case with a weightCRround value of zero, but SPSS ignores cases with a weight of zero in weighted analyses; therefore, the regression excluded cases with a zero value for weightCRround.

Stata p-values from a non-weighted regression on this adjusted dataset were identical to SPSS p-values reported using the Craig and Richeson commands.

CR2014bl3
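SPSS’s round-then-replicate treatment of fractional weights can be mimicked in a few lines. The case IDs and weights below are invented, and int(w + 0.5) stands in for round-half-up rounding to integers:

```python
# Hypothetical case IDs with fractional weights (invented values).
cases = [("a", 0.4), ("b", 1.6), ("c", 2.5), ("d", 1.0)]

expanded = []
for case_id, w in cases:
    copies = int(w + 0.5)                # round half up to the nearest integer
    expanded.extend([case_id] * copies)  # weight 0.4 rounds to 0: case dropped

print(expanded)  # ['b', 'b', 'c', 'c', 'c', 'd']
```

Case a vanishes entirely and cases b and c are over-counted, which is the same kind of distortion that shifted the thermometer difference from 3.13 units in the original data to 4.67 units after rounding and expansion.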

So how much did SPSS alter the dataset? The output below is for the original dataset: the racial shift and control conditions respectively had 233 and 222 white non-Hispanic respondents with full data on therm_bl, cPPAGE, cPPEDUCAT, and cPPGENDER; the difference in mean therm_bl ratings across conditions was 3.13 units.

CR2014bl4before

The output below is for the dataset after executing the round and expand commands: the racial shift and control conditions respectively had 189 and 192 white non-Hispanic respondents with a non-zero weight and full data on therm_bl, cPPAGE, cPPEDUCAT, and cPPGENDER; the difference in mean therm_bl ratings across conditions was 4.67, a 49 percent increase over the original difference of 3.13 units.

CR2014bl4after

Certain weighted procedures in the SPSS base module report p-values identical to p-values reported in Stata when weights are rounded, cases are expanded by those weights, and cases with a zero weight are ignored; other weighted procedures in the SPSS base module report p-values identical to p-values reported in Stata when the importance weight option is selected or when the analytic weight option is selected and the sum of the weights is 1.

(Stata’s analytic weight option treats each weight as an indication of the number of observations represented in a particular case; for instance, an analytic weight of 4 indicates that the values for the corresponding case reflect the mean values for four observations; see here.)

Test analyses that I conducted produced the following relationship between SPSS output and Stata output.

SPSS weighted base module procedures that reported p-values identical to Stata p-values when weights were rounded, cases were expanded by those weights, and cases with a zero weight were ignored:

  1. UNIANOVA with weights indicated in the WEIGHT BY command

SPSS weighted base module procedures that reported p-values identical to Stata p-values when the importance weight or analytic weight option was selected and the sum of the weights was 1:

  1. Independent samples t-test
  2. Linear regression with weights indicated in the WEIGHT BY command
  3. Linear regression with weights indicated in the REGWT subcommand in the regression menu (weighted least squares analysis)
  4. UNIANOVA with weights indicated in the REGWT subcommand in the regression menu (weighted least squares analysis)

SPSS has a procedure that correctly calculates p-values with survey weights, as Jon Peck noted in a comment to the previous post. The next post will describe that procedure.

Problems with SPSS survey weights

Here are t-scores and p-values from a set of t-tests that I recently conducted in SPSS and in Stata:

Group 1 unweighted
t = 1.082 in SPSS (p = 0.280)
t = 1.082 in Stata (p = 0.280)

Group 2 unweighted
t = 1.266 in SPSS (p = 0.206)
t = 1.266 in Stata (p = 0.206)

Group 1 weighted
t = 1.79 in SPSS (p = 0.075)
t = 1.45 in Stata (p = 0.146)

Group 2 weighted
t = 2.15 in SPSS (p = 0.032)
t = 1.71 in Stata (p = 0.088)

There was no difference between unweighted SPSS p-values and unweighted Stata p-values, but the weighted SPSS p-values fell below conventional levels of statistical significance (0.10 for Group 1 and 0.05 for Group 2) that the probability-weighted Stata p-values did not.

John Hendrickx noted some problems with weights in SPSS:

One of the things you can do with Stata that you can’t do with SPSS is estimate models for complex surveys. Most SPSS procedures will allow weights, but although these will produce correct estimates, the standard errors will be too small (aweights or iweights versus pweights). SPSS cannot take clustering into account at all.

 
Re-analysis of Group 1 weighted and Group 2 weighted indicated that t-scores in Stata were the same as t-scores in SPSS when using either the analytic weight option [aw=weight] or the importance weight option [iw=weight].

SPSS has another issue with weights, indicated on the IBM help site:

If the weighted number of cases exceeds the sample size, tests of significance are inflated; if it is smaller, they are deflated.

 
This means that, for significance testing, SPSS treats the sample size as the sum of the weights and not as the number of observations: if there are 1,000 observations and the mean weight is 2, SPSS will conduct significance tests as if there were 2,000 observations. Stata with the probability weight option treats the sample size as the number of observations no matter the sum of the weights.
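The inflation is easy to see with the standard error of a weighted mean. The numbers below are invented: four observations, each given a weight of 2, so that the weights sum to 8:

```python
import math

# Hypothetical data: 4 observations, each weighted 2, so weights sum to 8.
values = [1.0, 2.0, 3.0, 4.0]
weights = [2.0] * len(values)

wsum = sum(weights)
mean = sum(w * x for w, x in zip(weights, values)) / wsum
wss = sum(w * (x - mean) ** 2 for w, x in zip(weights, values))

# Frequency-weight treatment: the sample size is taken to be sum(weights) = 8.
se_freq = math.sqrt(wss / (wsum - 1) / wsum)

# Treating the sample size as the actual number of observations, 4; with
# equal weights this reduces to the unweighted standard error.
n = len(values)
se_obs = math.sqrt((wss / 2) / (n - 1) / n)  # wss/2 is the unweighted sum of squares

print(round(se_freq, 3))  # 0.423: smaller standard error, inflated significance
print(round(se_obs, 3))   # 0.645
```

Dividing by a pretended sample of 8 instead of the actual sample of 4 shrinks the standard error, which is exactly how the weighted SPSS p-values fell below thresholds that the Stata p-values did not.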

I multiplied the weight variable by 10 in the dataset that I have been working in. For this inflated weight variable, Stata t-scores did not change for the analytic weight option, but Stata t-scores did inflate for the importance weight option.
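The invariance of the analytic-weight t-scores to rescaling follows from normalizing the weights to sum to the number of observations, so that only relative weights matter. Here is a sketch with invented data and a deliberately simplified weighted standard error:

```python
import math

def aweight_se(values, weights):
    # Mimic the analytic-weight convention: rescale the weights to sum to
    # the number of observations, so only relative weights matter.
    n = len(values)
    total = sum(weights)
    w = [wi * n / total for wi in weights]
    mean = sum(wi * xi for wi, xi in zip(w, values)) / n
    var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, values)) / (n - 1)
    return math.sqrt(var / n)

x = [1.0, 2.0, 4.0, 8.0]
w = [0.5, 1.5, 1.0, 2.0]

# Multiplying every weight by 10 leaves the analytic-weight result unchanged.
print(aweight_se(x, w) == aweight_se(x, [10 * wi for wi in w]))  # True
```

An importance-weighted calculation skips the normalization step and uses the raw weights directly, which is why inflating the weight variable by 10 changed the importance-weight t-scores but not the analytic-weight t-scores.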

UPDATE (2014-Apr-21)

Jon Peck noted in the comments that SPSS has a Complex Samples procedure. SPSS p-values from the Complex Samples procedure matched Stata p-values using probability weights:

SPSS

Stata

The Complex Samples procedure appears to require a plan file. I tried several permutations for the plan, and the procedure worked correctly with this setup:

SPSS-CS

Does commercial tv news make people forget?

John Sides at the Monkey Cage discusses an article on public broadcasting and political knowledge. The cross-sectional survey data analyzed in the article cannot resolve the question of causal direction, as Sides notes:

Obviously, there are challenges of sorting out correlation and causation here. Do people who consume public broadcasting become more knowledgeable? Or are knowledgeable people just more likely to consume public broadcasting? Via statistical modeling, Soroka and colleagues go some distance in isolating the possible effects of public broadcasting—though they are clear that their modeling is no panacea.

Nevertheless, the results are interesting. In most countries, people who consume more public broadcasting know more about current events than people who consume less of it. But these same differences emerge to a lesser extent among those who consume more or less commercial broadcasting. This suggests that public broadcasting helps citizens learn. Here’s a graph:
soroka

 
But the article should not be interpreted as providing evidence that “public broadcasting helps citizens learn.”

Cross-sectional survey data cannot resolve the question of causal direction, but theory can: if we observe a correlation between, say, race and voting for a particular political party, we can rule out the possibility that voting for a particular political party is causing race.

Notice that in the United Kingdom, consumption of commercial broadcasting news correlates with a substantial decrease in political knowledge: if the figure is interpreted as evidence that broadcasting causes knowledge, then the UK results must be interpreted as commercial broadcasting news causing people to have less political knowledge. I think that we can safely rule out that possibility.

The results presented in the figure are more likely to reflect self-selection: persons in the UK with more political knowledge choose to watch public broadcasting news, and persons in the UK with less political knowledge choose to watch commercial broadcasting news; that doesn’t mean that public broadcasting has zero effect on political knowledge, but it does mean that the evidence presented in the figure does not provide enough information to assess the magnitude of the effect.