Ranking Fields by p-Value Suspiciousness

Cremieux

Apr 22, 2023

What proportion of reported p-values in your favorite field are between 0.05 and 0.01?

24 Comments

David Hugh-Jones

Apr 22, 2023

I am very surprised at genetics looking so bad. Geneticists are mostly very aware of the problems of multiple hypothesis testing and do a lot about it. I really want to see how the Z statistics were collected.

Cremieux

Apr 22, 2023

Maybe in statistical genetics. Pulling papers at random from the dataset, it doesn't look like genetics is a very statistically involved field in the main. A lot of it is lab work like

https://clinicalepigeneticsjournal.biomedcentral.com/articles/10.1186/1868-7083-6-2

https://academic.oup.com/dnaresearch/article/21/1/15/347005

https://academic.oup.com/dnaresearch/article/19/1/37/485401

There's probably a lot of p-value clumping due to bad habits. Chavalarias reported that between 1990 and 2015, the proportion of reported p-values given by inequalities instead of exact numbers declined from around 87.5% to just over 50%. So, I have updated the post to include results for when p-values were reported exactly or as inequalities.

Compared to the base results, the ranks for the reports based on inequalities were correlated at 0.989. The ranks for the exact reports correlated with the base ones at 0.545. If economics is included in these ranks with its full-sample values, the correlations change to 0.990 and 0.595. Exact reports were more credible, but still extremely dubious, and exact reporting did not bring any field ahead of economics.

David Hugh-Jones

Apr 22, 2023

So I mean there’s nothing necessarily suspicious about reporting “p < 0.05”, right? There’s even an argument that that is good practice. It certainly doesn’t prove p-hacking. I think the exact reports are more interesting.

Cremieux

Apr 22, 2023

Right. A large percentage of p < 0.05 reports are examples of the incorrect p-value rounding QRP (questionable research practice), or are fraudulent or in error and nowhere near 0.05, but we can't know that happened for any particular example without recomputing it.

David Hugh-Jones

May 2, 2023

On a completely different topic... data like this might be a good showcase for my new ggmagnify R package (github.com/hughjonesd/ggmagnify). Would you be willing to share?

Cremieux

May 2, 2023

That sounds like an excellent idea. For anyone else searching, it is the allp dataset in the tidypvals package.

David Hugh-Jones

May 2, 2023

Ah, found it, no worries.

Brian Chau

Apr 22, 2023

What math papers are reporting p-values? Stats?

Cremieux

Apr 22, 2023

That category should be thought of as informatics. It's mostly methodological work like the sorts you would find in Comput Math Methods Med, Int J Nanomedicine, Bioinform Biol Insights, etc.

Nino

Dec 11, 2024

Thank you, it’s really interesting! I was wondering if you could clarify something about the first image. A statistical power of 80% indicates that 80% of the area under the yellow distribution lies to the right of 1.96, but there are two degrees of freedom (mean and standard deviation). Could you explain why that specific effect size or standard deviation was chosen? Additionally, it’s not immediately clear why the distribution should be normal under the alternative hypothesis. In other words, could you please elaborate on how the distribution for the alternative hypothesis was derived?

Apr 25

Others have made this (and related) points, and to my knowledge he hasn't responded yet.

IMO, he goofed here, and must have picked a specific alternative hypothesis but didn't say which one.

Unfortunately, this kind of analysis entirely depends on the true distribution of alternative hypotheses, which is unknowable (and debatable) across fields, so the essay is much weaker than it may appear.

Brad

Mar 29

Wouldn’t we, in something like a Bayesian sense, expect scientific experiments to generally outperform the assumptions baked into a significance test? 1) Professors tend to know their fields well and select experiments that are likely to pan out based on the literature or even just intuition. 2) There are other cases, like drug design, where we may have a computational team screen hundreds of thousands of small molecules and only pass the top 20 to an experimental team for in vitro testing, making those small molecules very likely to have some impact on e.g. enzyme activity.

Mansoor Nasir

Jul 8, 2024

Human factors and user design are heavily statistics-based. I didn't see a defined category for them. Do you know what they might have been grouped under?

JJ Lee

May 13, 2024

Might it not be the case that the fields which scored better in your analysis tend to have lower powered studies? You point out that environmental science, neuroscience, and economics all tend to be weakly powered; these fields all also seem to have lower p-values in the 0.05-0.01 range. Is this not what you would expect if the studies are weaker? Or is there little variance in statistical power between fields?

Cremieux

May 13, 2024

It's hard to get a totally accurate read on the statistical power in each field, but at a glance the excess of p-values doesn't look strongly correlated with typical power estimates. The thing worth noting is that, at any level of power, there shouldn't be a hollowing out of p-values below significance thresholds, and fields exceed the chance level even if they had the optimal level of power for producing p-values in exactly that range.
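
To make the "at any level of power" point concrete, here is a minimal sketch, not the post's actual method. It assumes a two-sided z-test whose statistic is normal with unit variance and a mean (`shift`, a hypothetical name) set by the true effect, then scans over shifts to find the largest share of results that could land between p = 0.05 and p = 0.01 if every study had the same power.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal

def share_in_band(shift, p_hi=0.05, p_lo=0.01):
    """P(p_lo < p < p_hi) for a two-sided z-test, assuming z ~ Normal(shift, 1)."""
    z_hi = STD.inv_cdf(1 - p_hi / 2)  # |z| cutoff for p = 0.05 (about 1.96)
    z_lo = STD.inv_cdf(1 - p_lo / 2)  # |z| cutoff for p = 0.01 (about 2.58)
    z = NormalDist(mu=shift, sigma=1.0)
    upper = z.cdf(z_lo) - z.cdf(z_hi)    # positive z values in the band
    lower = z.cdf(-z_hi) - z.cdf(-z_lo)  # negative z values in the band
    return upper + lower

# Grid search over assumed true shifts from 0 (the null) to 6.
shifts = [i / 100 for i in range(0, 601)]
best_shift = max(shifts, key=share_in_band)
best_share = share_in_band(best_shift)

# Two-sided power at the 0.05 level implied by the best shift.
z_crit = STD.inv_cdf(0.975)
best_dist = NormalDist(mu=best_shift, sigma=1.0)
best_power = (1 - best_dist.cdf(z_crit)) + best_dist.cdf(-z_crit)

print(f"shift={best_shift:.2f} power={best_power:.2f} share={best_share:.3f}")
```

Under these assumptions the ceiling is roughly 0.24, reached when the mean of z sits about halfway between the two cutoffs. A mix of studies with different powers can only average below that ceiling, so a field whose observed share is higher is hard to explain by power alone, at least within this simplified model.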

Dean Eckles

Mar 27, 2024

One additional note is that economics often uses p < 0.1 as a secondary threshold, with top journals like QJE putting stars next to such results, while other fields' journals would not (using a different symbol or none at all).

Gusto

Aug 1, 2023

Thanks. Would you be so kind as to elaborate on how you arrived at the following conclusions: “we can calculate that at 80% power and with all real effects, 12.60% of results should return p-values between 0.05 and 0.01. Under the null hypothesis, only 4% would be in that range”?

Gusto

Jul 29, 2023

"In fact, with this knowledge, we can calculate that at 80% power and with all real effects, 12.60% of results should return p-values between 0.05 and 0.01. Under the null hypothesis, only 4% would be in that range. But there are problems comparing this to the observed distribution of p-values. To name a few"

Is the point (and its corresponding graph) that if we have a set of studies with an 80% statistical power and their effects are genuine, the distribution of their z-values should be something akin to the 'Alternative with 80% Power' graph, having an average z-value of ~3.7, as opposed to the observed clustering around 1.96?

Cremieux

Jul 31, 2023

Yes.

Mar 21

But you can only get that graph using a specific value of the alternative hypothesis, so I think he's asking which you chose there?

Power is a function of x, i.e. P(reject H_0 | Alternative hypothesis of x).
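
As an illustration of why the choice of alternative matters, here is a hedged sketch of the calculation under one common textbook assumption: a two-sided z-test at the 0.05 level, with z normally distributed with unit variance and its mean chosen so that power is 80%. The function names are made up for this example, and this is not necessarily the alternative the post used.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal

def shift_for_power(power, alpha=0.05):
    """Mean of z giving roughly the requested two-sided power.

    Assumes z ~ Normal(shift, 1) and ignores the tiny chance of
    rejecting in the opposite tail.
    """
    return STD.inv_cdf(1 - alpha / 2) + STD.inv_cdf(power)

def share_in_band(shift, p_hi=0.05, p_lo=0.01):
    """P(p_lo < p < p_hi) for a two-sided z-test, assuming z ~ Normal(shift, 1)."""
    z_hi = STD.inv_cdf(1 - p_hi / 2)  # |z| cutoff for p = 0.05 (about 1.96)
    z_lo = STD.inv_cdf(1 - p_lo / 2)  # |z| cutoff for p = 0.01 (about 2.58)
    z = NormalDist(mu=shift, sigma=1.0)
    upper = z.cdf(z_lo) - z.cdf(z_hi)    # positive z values in the band
    lower = z.cdf(-z_hi) - z.cdf(-z_lo)  # negative z values in the band
    return upper + lower

null_share = share_in_band(0.0)                   # no true effect
alt_shift = shift_for_power(0.80)                 # about 2.80
alt_share = share_in_band(alt_shift)              # share at 80% power

print(f"null: {null_share:.4f}")
print(f"80% power (shift {alt_shift:.2f}): {alt_share:.4f}")
```

The null figure comes out at 4%, matching the post. Under this particular unit-variance alternative the 80%-power figure is roughly 21% rather than the quoted 12.60%, which is consistent with the point that the result depends on which alternative (mean and spread) was chosen.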

Jab

Jul 27, 2023 (edited)

Economics is mostly made up to begin with. It's much easier to create consistency inside a pseudo-scientific simulation of "reality" than within the natural sciences.
