Statistical significance, and all that jazz

About two years ago, I was a reasonable person who argued that tests of statistical significance were useful in some limited situations. After completing research [...], I have concluded that tests of statistical significance should never be used.

This is J. Scott Armstrong, quoted in Decision Science News (via Statistical Modeling).

Here's the paper. It's a very readable piece, and I agree with almost all the points Armstrong is making. Unfortunately, it is gated; here is my summary:

1a. As a diagnostic tool, tests of statistical significance are too blunt and unnecessarily miss useful information. In some respects, they are also arbitrary. They can mislead the researcher.
1b. To make matters worse, many researchers do not understand how to construct appropriate and/or informative tests, or how to interpret them. Furthermore, journals tend to place unwarranted importance on 'statistical significance', and even authors who know better bend over backwards to please them.

2a. Reporting statistical significance at the x% level is too blunt and unnecessarily conceals useful information.
2b. To make matters worse, many consumers of statistical research do not understand how tests of statistical significance should be interpreted, or even what statistical significance means. Misguided commentary by the researcher doesn't help either.
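
To make point 2a concrete, here is a minimal sketch in Python (standard library only). The two studies and their numbers are made up for illustration and do not come from Armstrong's paper; the p-values use a simple normal approximation. Both studies would be reported as "significant at the 5% level", yet the estimated effects differ by a factor of 100.

```python
from statistics import NormalDist

def two_sided_p(estimate, std_error):
    """Two-sided p-value for H0: effect = 0 (normal approximation)."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval around the estimate."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return (estimate - z_crit * std_error, estimate + z_crit * std_error)

# Hypothetical studies: (estimated effect, standard error).
studies = {"A": (0.02, 0.01), "B": (2.0, 1.0)}

for name, (est, se) in studies.items():
    p = two_sided_p(est, se)
    lo, hi = confidence_interval(est, se)
    print(f"Study {name}: effect={est}, p={p:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

The p-values are the same for both studies because the ratio of estimate to standard error is the same; only the estimate and its interval reveal how different the two findings are.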

A particular problem arises because there's rarely a good reason why the null hypothesis should be 'favoured'. When testing for statistical significance, the null is considered innocent until proven guilty, and the burden of proof is excessively high at 'standard levels'. The example Armstrong uses has to do with assessing whether combining forecasts can improve accuracy - why should there be a presumption that it doesn't?

The author also provides a handy list of what to do once you get rid of tests of statistical significance altogether:

What should one do without tests of statistical significance? There are better ways to report findings. To assess—
• importance, use effect sizes
• confidence, use prediction intervals
• replicability, use replications and extensions
• generality, use meta-analyses.
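
As a rough illustration of the first two items on that list, the sketch below computes an effect size (Cohen's d with a pooled standard deviation) and a prediction interval for one new observation. This is my own example, not Armstrong's: the forecast-error figures are invented, the function names are hypothetical, and the prediction interval uses a normal quantile where a t quantile would be more appropriate for samples this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def cohens_d(group_1, group_2):
    """Standardised mean difference using the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    s1, s2 = stdev(group_1), stdev(group_2)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(group_1) - mean(group_2)) / pooled

def prediction_interval(sample, level=0.95):
    """Approximate interval for a single new observation.

    Assumption: normal quantile instead of a t quantile, so the
    interval is somewhat too narrow for small samples.
    """
    n = len(sample)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * stdev(sample) * sqrt(1 + 1 / n)
    centre = mean(sample)
    return (centre - half_width, centre + half_width)

# Invented absolute forecast errors (in %), for illustration only.
single_method_errors = [12.1, 9.8, 11.5, 10.9, 13.2, 10.4]
combined_errors = [9.7, 8.9, 10.2, 9.1, 10.8, 9.5]

d = cohens_d(single_method_errors, combined_errors)
lo, hi = prediction_interval(combined_errors)
print(f"Effect size (Cohen's d): {d:.2f}")
print(f"95% prediction interval for the next combined-forecast error: ({lo:.2f}, {hi:.2f})")
```

The effect size says how large the improvement is in standard-deviation units, and the prediction interval says what range of errors to expect next time; neither is reduced to a yes/no verdict.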

Finally, are there any circumstances in which tests of statistical significance could be useful? Perhaps, but only when prediction intervals, replications, extensions and meta-analyses are impossible or too 'expensive' for the purpose at hand, and the limited (and to some extent arbitrary) information that a significance test conveys can still offer some guidance. Armstrong himself is tentative:

This does not rule out the possibility that statistical significance might help in other areas such as (1) in aiding decision makers by flagging areas that need attention; (2) as part of a forecasting procedure (e.g., helping to decide whether to apply a seasonality adjustment or when to damp trends); or (3) serving as a guide to a scientist who is analyzing a problem (e.g., as a quick way to highlight areas that need further study). On the other hand, this is mere speculation on my part.

Before leaving this post, here's a related short paper written by my favourite blogger and Hal Stern: 'The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant' (free access).
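
The point made in that paper's title is easy to check numerically. The sketch below uses illustrative numbers and assumes the two estimates are independent and approximately normal: one estimate is 'significant' at the 5% level, the other is not, and yet the difference between them is nowhere near significant.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(estimate, std_error):
    """Two-sided p-value for H0: effect = 0 (normal approximation)."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative estimates and standard errors for two independent studies.
est_1, se_1 = 25.0, 10.0
est_2, se_2 = 10.0, 10.0

# For independent estimates, the variances of the two estimates add.
diff = est_1 - est_2
se_diff = sqrt(se_1**2 + se_2**2)

p_1 = two_sided_p(est_1, se_1)
p_2 = two_sided_p(est_2, se_2)
p_diff = two_sided_p(diff, se_diff)

print(f"Study 1: p = {p_1:.3f}")
print(f"Study 2: p = {p_2:.3f}")
print(f"Difference: {diff:.0f} with s.e. {se_diff:.1f}, p = {p_diff:.3f}")
```

The three p-values come out at roughly 0.01, 0.32 and 0.29, so labelling the first study a 'finding' and the second a 'non-finding' overstates how much the two actually disagree.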