When are one-tailed hypotheses appropriate?

Comments to this scatterplot post contained a discussion about when one-tailed statistical significance tests are appropriate. I’d say that one-tailed tests are appropriate only for a certain type of applied research. Let me explain…

Statistical significance tests attempt to assess the probability that we mistake noise for signal. The conventional 0.05 level of statistical significance in social science represents a willingness to mistake noise for signal 5% of the time.

Two-tailed tests presume that these errors can occur because we mistake noise for signal in the positive direction or because we mistake noise for signal in the negative direction: therefore, for two-tailed tests we typically allocate half of the acceptable error to the left tail and half of the acceptable error to the right tail.

One-tailed tests presume either that: (1) we will never mistake noise for signal in one of the directions because it is impossible to have a signal in that direction, so that permits us to place all of the acceptable error in the other direction’s tail; or (2) we are interested only in whether there is an effect in a particular direction, so that permits us to place all of the acceptable error in that direction’s tail.

Notice that it is easier to mistake noise for signal in a one-tailed test than in a two-tailed test because one-tailed tests have more acceptable error in the tail that we are interested in.

So let’s say that we want to test the hypothesis that X has a particular directional effect on Y. Use of a one-tailed test would mean either that: (1) it is impossible that the true direction is the opposite of the direction predicted by the hypothesis or (2) we don’t care whether the true direction is the opposite of the direction predicted by the hypothesis.

I’m not sure that we can ever declare things impossible in social science research, so (1) is not justified. The problem with (2) is that — for social science conducted to understand the world — we should always want to differentiate between “no evidence of an effect at a statistically significant level” and “evidence of an effect at a statistically significant level, but in the direction opposite to what we expected.”

To illustrate a problem with (2), let’s say that we commit before the study to a one-tailed test for whether X has a positive effect on Y, but the results of the study indicate that the effect of X on Y is negative at a statistically significant level, at least if we had used a two-tailed test. Now we are in a bind: if we report only that there is no evidence that X has a positive effect on Y at a statistically significant level, then we have omitted important information about the results; but if we report that the effect of X on Y is negative at a statistically significant level with a two-tailed test, then we have abandoned our original commitment to a one-tailed test in the hypothesized direction.

Now, when is a one-tailed test justified? The best justification that I have encountered for a one-tailed test is the scenario in which the same decision will be made if X has no effect on Y and if X has a particular directional effect on Y, such as “we will switch to a new program if the new program is equal to or better than our current program”; but that’s for applied science, and not for social science conducted to understand the world: social scientists interested in understanding the world should care whether the new program is equal to or better than the current program.

In cases of strong theory or a clear prediction from the literature supporting a directional hypothesis, it might be acceptable — before the study — to allocate 1% of the acceptable error to the opposite direction and 4% of the acceptable error to the predicted direction, or some other unequal allocation of acceptable error. That unequal allocation of acceptable error would provide a degree of protection against unexpected effects that is lacking in a one-tailed test.

Advertisements

List experiments are not useful for estimating U.S. vote fraud

Ahlquist, Mayer, and Jackman (2013, p. 3) wrote:

List experiments are a commonly used social scientific tool for measuring the prevalence of illegal or undesirable attributes in a population. In the context of electoral fraud, list experiments have been successfully used in locations as diverse as Lebanon, Russia and Nicaragua. They present our best tool for detecting fraudulent voting in the United States.*

 
I’m not sure that list experiments are the best tool for detecting fraudulent voting in the United States. But, first, let’s introduce the list experiment.

The list experiment goes back at least to Judith Droitcour Miller’s 1984 dissertation, but she called the procedure the item count method (see page 188 of this 1991 book). Ahlquist, Mayer, and Jackman (2013) reported results from list experiments that split a sample into two groups: members of the first group received a list of 4 items and were instructed to indicate how many of the 4 items applied to themselves; members of the second group received a list of 5 items — the same 4 items that the first group received, plus an additional item — and were instructed to indicate how many of the 5 items applied to themselves. The difference in the mean number of items selected by the groups was then used to estimate the percent of the sample and — for weighted data — the percent of the population to which the fifth item applied.

Ahlquist, Mayer, and Jackman (2013) reported four list experiments from September 2013, with these statements as the fifth item:

  • “I cast a ballot under a name that was not my own.”
  • “Political candidates or activists offered you money or a gift for your vote.”
  • “I read or wrote a text (SMS) message while driving.”
  • “I was abducted by extraterrestrials (aliens from another planet).”

Figure 4 of Ahlquist, Mayer, and Jackman (2013) displayed results from three of these list experiments:

amj2013f4

My presumption is that vote buying and voter impersonation are low frequency events in the United States: I’d probably guess somewhere between 0 and 1 percent, and closer to 0 percent than to 1 percent. If that’s the case, then a list experiment with 3,000 respondents is not going to detect such low frequency events. 95 percent confidence intervals for weighted estimates in Figure 4 appear to span 20 percentage points or more: the weighted 95 percent confidence interval for vote buying appears to range from -7 percent to 17 percent. Moreover, notice how much estimates varied between the December 2012 and September 2013 waves of the list experiment: the point estimate for voter impersonation in December 2012 was 0 percent, and the point estimate for voter impersonation in September 2013 was -10 percent, a ten-point swing in point estimates.

So, back to the original point, list experiments are not the best tool for detecting vote fraud in the United States because vote fraud in the United States is a low frequency event that list experiments cannot detect without an improbably large sample size: the article indicates that at least 260,000 observations would be necessary to detect a 1% difference.

If that’s the case, then what’s the purpose of a list experiment to detect vote fraud with only 3,000 observations? Ahlquist, Mayer, and Jackman (2013, p. 31) wrote that:

From a policy perspective, our findings are broadly consistent with the claims made by opponents of stricter voter ID laws: voter impersonation was not a serious problem in the 2012 election.

 
The implication appears to be that vote fraud is a serious problem only if the fraud is common. But there’s a lot of problems that are serious without being common.

So, if list experiments are not the best tool for detecting vote fraud in the United States, then what is a better way? I think that — if the goal is detecting the presence of vote fraud and not estimating its prevalence — then this is one of those instances in which journalism is better than social science.

* This post was based on the October 30, 2013, version of the Ahlquist, Mayer, and Jackman manuscript, which was located here. A more recent version is located here and has replaced the “best tool” claim about list experiments:

List experiments are a commonly used social scientific tool for measuring the prevalence of illegal or undesirable attributes in a population. In the context of electoral fraud, list experiments have been successfully used in locations as diverse as Lebanon, Russia, and Nicaragua. They present a powerful but unused tool for detecting fraudulent voting in the United States.

 
It seems that “unused” is applicable, but I’m not sure that a “powerful” tool for detecting vote fraud in the United States would produce 95 percent confidence intervals that span 20 percentage points.

P.S. The figure posted above has also been modified in the revised manuscript. I have a pdf of the October 30, 2013, version, in case you are interested in verifying the quotes and figure.

Self-archived articles

I came across an interesting site, Dynamic Ecology, and saw a post on self-archiving of journal articles.The post mentioned SHERPA/RoMEO, which lists archiving policies for many journals. The only journal covered by SHERPA/RoMEO that I have published in that permits self-archiving is PS: Political Science & Politics, so I am linking below to pdfs of PS articles that I have published.

This first article attempts to help graduate students who need seminar paper ideas. The article grew out of a graduate seminar in US voting behavior with David C. Barker. I noticed that several articles on the seminar reading list placed in top-tier journals but made an incremental theoretical contribution and used publicly-available data, which was something that I as a graduate student felt that I could realistically aspire to.

For instance, John R. Petrocik in 1996 provided evidence that candidates and parties “owned” certain issues, such as Democrats owning care for the poor and Republicans owning national defense. Danny Hayes extended that idea by using publicly-available ANES data to provide evidence that candidates and parties owned certain traits, such as Democrats being more compassionate and Republicans being more moral.

The original manuscript identified the Hayes article as a travel-type article in which the traveling is done by analogy. The final version of the manuscript lost the Hayes citation but had 19 other ideas for seminar papers. Ideas on the cutting room floor included replication and picking a fight with another researcher.

Of Publishable Quality: Ideas for Political Science Seminar Papers. 2011. PS: Political Science & Politics 44(3): 629-633.

  1. pdf version, copyright held by American Political Science Association

This next article grew out of reviews that I conducted for friends, colleagues, and journals. I noticed that I kept making the same or similar comments, so I produced a central repository for generalized forms of these comments in the hope that — for example — I do not review any more manuscripts that formally list hypotheses about the control variables.

Rookie Mistakes: Preemptive Comments on Graduate Student Empirical Research Manuscripts. 2013. PS: Political Science & Politics 46(1): 142-146.

  1. pdf version, copyright held by American Political Science Association

The next article grew out of friend and colleague Jonathan Reilly’s dissertation. Jonathan noticed that studies of support for democracy had treated don’t know responses as if the respondents had never been asked the question. So even though 73 percent of respondents in China expressed support for democracy, that figure was reported as 96 percent because don’t know responses were removed from the analysis.

The manuscript initially did not include imputation of preferences for non-substantive responders, but a referee encouraged us to estimate missing preferences. My prior was that multiple imputation was “making stuff up,” but research into missing data methods taught me that the alternative — deletion of cases — assumed that cases were missing at random, which did not appear to be true in our study: the percent of missing cases in a country correlated at -0.30 and -0.43 with the country’s Polity IV democratic rating, which meant that respondents were more likely to issue a non-substantive response in countries where political and social liberties are more restricted.

Don’t Know Much about Democracy: Reporting Survey Data with Non-Substantive Responses. 2012. PS: Political Science & Politics 45(3): 462-467. Second author, with Jonathan Reilly.

  1. pdf version, copyright held by American Political Science Association