This post presents selected excerpts from Jesper W. Schneider’s 2014 Scientometrics article, “Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations” [ungated version here]. In the excerpts that follow, most citations have been removed, and page number references to the article are not included because my copy of the article lacked page numbers.

The first excerpt notes that the common procedure followed in most social science research is a mishmash of two separate procedures:

What is generally misunderstood is that what today is known, taught and practiced as NHST [null hypothesis significance testing] is actually an anonymous hybrid or mix-up of two divergent classical statistical theories, R. A. Fisher’s ‘significance test’ and Neyman’s and Pearson’s ‘hypothesis test’. Even though NHST is presented somewhat differently in statistical textbooks, most of them do present p values, null hypotheses (H0), alternative hypotheses (HA), Type I (α) and II (β) error rates as well as statistical power, as if these concepts belong to one coherent theory of statistical inference, but this is not the case. Only null hypotheses and p values are present in Fisher’s model. In Neyman–Pearson’s model, p values are absent, but contrary to Fisher, two hypotheses are present, as well as Type I and II error rates and statistical power.

The next two excerpts contrast the two procedures:

In Fisher’s view, the p value is an epistemic measure of evidence from a single experiment and not a long-run error probability, and he also stressed that ‘significance’ depends strongly on the context of the experiment and whether prior knowledge about the phenomenon under study is available. To Fisher, a ‘significant’ result provides evidence against H0, whereas a non-significant result simply suspends judgment—nothing can be said about H0.

They [Neyman and Pearson] specifically rejected Fisher’s quasi-Bayesian interpretation of the ‘evidential’ p value, stressing that if we want to use only objective probability, we cannot infer from a single experiment anything about the truth of a hypothesis.

The next excerpt reports evidence that p-values overstate the evidence against the null hypothesis. I have retained the reference citations here:

Using both likelihood and Bayesian methods, more recent research have demonstrated that p values overstate the evidence against H0, especially in the interval between significance levels 0.01 and 0.05, and therefore can be highly misleading measures of evidence (e.g., Berger and Sellke 1987; Berger and Berry 1988; Goodman 1999a; Sellke et al. 2001; Hubbard and Lindsay 2008; Wetzels et al. 2011). What these studies show is that p values and true evidential measures only converge at very low p values. Goodman (1999a, p. 1008) suggests that only p values less than 0.001 represent strong to very strong evidence against H0.
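
Stepping outside the excerpt for a moment: one way to get a feel for the size of the overstatement is the calibration proposed in Sellke et al. (2001), which bounds the Bayes factor in favor of H0 from below by −e·p·ln(p) for p < 1/e. The excerpt itself does not give this formula, so the sketch below is my own illustration. The function names are made up, and the 50/50 prior on H0 is an assumption chosen only for convenience.

```python
import math

def bayes_factor_bound(p):
    """Lower bound on the Bayes factor in favor of H0: -e * p * ln(p).

    The bound applies only for 0 < p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound is defined only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def min_posterior_h0(p, prior_h0=0.5):
    """Smallest posterior probability of H0 implied by the bound.

    prior_h0 is an assumed prior probability that H0 is true.
    """
    prior_odds = prior_h0 / (1 - prior_h0)
    posterior_odds = bayes_factor_bound(p) * prior_odds
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: posterior probability of H0 is at least {min_posterior_h0(p):.3f}")
```

Under these assumptions, p = 0.05 leaves H0 with a posterior probability of at least roughly 0.29, which is far weaker evidence than "1 in 20" suggests. The bound shrinks quickly as p gets small, in line with the excerpt's point that p values and evidential measures converge only at very low p values.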

This next excerpt emphasizes the difference between p and alpha:

Hubbard (2004) has referred to p < α as an ‘alphabet soup’, that blurs the distinctions between evidence (p) and error (α), but the distinction is crucial as it reveals the basic differences underlying Fisher’s ideas on ‘significance testing’ and ‘inductive inference’, and Neyman–Pearson views on ‘hypothesis testing’ and ‘inductive behavior’.

The next excerpt contains a caution against use of p-values in observational research:

In reality therefore, inferences from observational studies are very often based on single non-replicable results which at the same time no doubt also contain other biases besides potential sampling bias. In this respect, frequentist analyses of observational data seems to depend on unlikely assumptions that too often turn out to be so wrong as to deliver unreliable inferences, and hairsplitting interpretations of p values becomes even more problematic.

The next excerpt cautions against incorrect interpretation of p-values:

Many regard p values as a statement about the probability of a null hypothesis being true or conversely, 1 − p as the probability of the alternative hypothesis being true. But a p value cannot be a statement about the probability of the truth or falsity of any hypothesis because the calculation of p is based on the assumption that the null hypothesis is true in the population.
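
A small simulation of my own (not from the article) may make this concrete. Every simulated study below draws its data from a population in which H0 is true by construction, yet the p values still spread across the whole range from 0 to 1, and about 5% fall below 0.05. The two-sided z-test with a known standard deviation, the sample size, and the seed are all assumptions chosen to keep the sketch short.

```python
import random
from statistics import NormalDist, fmean

def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p value for H0: population mean = 0, with known sigma."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_null_p_values(n_studies=10_000, n=30, seed=1):
    """Run n_studies studies in which H0 is true and return their p values."""
    rng = random.Random(seed)
    return [
        two_sided_p([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(n_studies)
    ]

p_values = simulate_null_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share of p < 0.05 when H0 is true in every study: {share_significant:.3f}")
```

Here the probability that H0 is true is 1 in every study, whatever p value comes out, so a small p value cannot be read as a small probability that the null hypothesis is true.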

The final excerpt is a hopeful note that the importance attached to p-values will wane:

Once researchers recognize that most of their research questions are really ones of parameter estimation, the appeal of NHST will wane. It is argued that researchers will find it much more important to report estimates of effect sizes with CIs [confidence intervals] and to discuss in greater detail the sampling process and perhaps even other possible biases such as measurement errors.
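
As a rough sketch of what that kind of reporting might look like (again my own illustration, not something from the article), the snippet below estimates a difference in means and attaches a confidence interval instead of a verdict of "significant" or "not significant". The data are invented, the function name is hypothetical, and the interval uses a normal approximation, which is an assumption that is only reasonable for larger samples. For small samples a t-based interval would be wider.

```python
from statistics import NormalDist, fmean, variance

def mean_difference_ci(group_a, group_b, level=0.95):
    """Estimate mean(group_b) - mean(group_a) with a normal-approximation CI.

    Returns (estimate, lower, upper).
    """
    diff = fmean(group_b) - fmean(group_a)
    # Standard error of the difference, using the sample variances
    se = (variance(group_a) / len(group_a) + variance(group_b) / len(group_b)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se

# Invented scores, for illustration only
control = [1, 2, 3, 4, 5]
treated = [2, 3, 4, 5, 6]

estimate, lower, upper = mean_difference_ci(control, treated)
print(f"estimated difference: {estimate:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

The interval shows both the size of the estimated effect and how imprecisely it is estimated, which a bare p value does not.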

The Schneider article is worthwhile for background and information on p-values. I’d also recommend this article on p-value misconceptions.