24 Comments
David Hugh-Jones:

I am very surprised at genetics looking so bad. Geneticists are mostly very aware of the problems of multiple hypothesis testing and do a lot about it. I really want to see how the Z statistics were collected.

Cremieux:

Maybe in statistical genetics. Pulling papers at random from the dataset, genetics doesn't look like a very statistically involved field in the main. A lot of it is lab work, like:

https://clinicalepigeneticsjournal.biomedcentral.com/articles/10.1186/1868-7083-6-2

https://academic.oup.com/dnaresearch/article/21/1/15/347005

https://academic.oup.com/dnaresearch/article/19/1/37/485401

There's probably a lot of p-value clumping due to bad habits. Chavalarias reported that between 1990 and 2015, the proportion of reported p-values given by inequalities instead of exact numbers declined from around 87.5% to just over 50%. So, I have updated the post to include results for when p-values were reported exactly or as inequalities.

Compared to the base results, the ranks for the reports based on inequalities were correlated at 0.989. The ranks for the exact reports correlated with the base ones at 0.545. If economics is included in these ranks with its full-sample values, the correlations change to 0.990 and 0.595. Exact reports were more credible, but still extremely dubious, and exact reporting did not bring any field ahead of economics.
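
For anyone who wants to reproduce this kind of comparison, it is just a rank correlation between two orderings of the fields. A minimal sketch in R, with made-up ranks rather than the actual values from the post:

# Hypothetical field ranks under two reporting conditions (illustrative only)
base_ranks       <- c(economics = 1, neuroscience = 2, psychology = 3, genetics = 4, medicine = 5)
inequality_ranks <- c(economics = 1, neuroscience = 3, psychology = 2, genetics = 4, medicine = 5)

# Spearman correlation of the two rankings
cor(base_ranks, inequality_ranks, method = "spearman")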

David Hugh-Jones:

So I mean there’s nothing necessarily suspicious about reporting “p < 0.05”, right? There’s even an argument that that is good practice. It certainly doesn’t prove p-hacking. I think the exact reports are more interesting.

Cremieux:

Right. A large percentage of p < 0.05 reports are examples of the incorrect p-value rounding QRP, or are fraudulent or erroneous and nowhere near 0.05, but we can't know that happened for any particular example without going and recomputing it.
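
A sketch of what that recomputation looks like in practice (statcheck-style logic, with made-up numbers): take the reported test statistic and degrees of freedom, recompute the exact p-value, and compare it against the reported inequality.

# Hypothetical reported result: t(28) = 2.01, "p < .05"
reported_t  <- 2.01
reported_df <- 28

exact_p <- 2 * pt(abs(reported_t), df = reported_df, lower.tail = FALSE)
exact_p          # about 0.054
exact_p < 0.05   # FALSE, so the reported inequality would not check out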

David Hugh-Jones:

On a completely different topic... data like this might be a good showcase for my new ggmagnify R package (github.com/hughjonesd/ggmagnify). Would you be willing to share?

Cremieux:

That sounds like an excellent idea. For anyone else searching, it is the allp dataset in the tidypvals package.

David Hugh-Jones:

Ah, found it, no worries.

Brian Chau:

What math papers are reporting p-values? Stats?

Cremieux:

That category should be thought of as informatics. It's mostly methodological work of the sort you would find in Comput Math Methods Med, Int J Nanomedicine, Bioinform Biol Insights, etc.

Nino:

Thank you, it’s really interesting! I was wondering if you could clarify something about the first image. A statistical power of 80% indicates that 80% of the area under the yellow distribution lies to the right of 1.96, but there are two degrees of freedom (mean and standard deviation). Could you explain why that specific effect size or standard deviation was chosen? Additionally, it’s not immediately clear why the distribution should be normal under the alternative hypothesis. In other words, could you please elaborate on how the distribution for the alternative hypothesis was derived?

JM:

Others have made this point (and related ones), and to my knowledge he hasn't responded yet.

IMO, he goofed here, and must have picked a specific alternative hypothesis but didn't say which one.

Unfortunately, this kind of analysis entirely depends on the true distribution of alternative hypotheses, which is unknowable (and debatable) across fields, so the essay is much weaker than it may appear.

Brad:

Wouldn’t we, in something like a Bayesian sense, expect scientific experiments to generally outperform the assumptions baked into a significance test? 1) Professors tend to know their fields well and select experiments that are likely to pan out based on the literature or even just intuition. 2) There are other cases, like drug design, where we may have a computational team screen hundreds of thousands of small molecules and only pass the top 20 to an experimental team for in vitro testing, making those small molecules very likely to have some impact on e.g. enzyme activity.

Mansoor Nasir:

Human factors and user design are heavily statistics-based. I didn't see a defined category for them. Do you know what they might have been grouped under?

JJ Lee:

Might it not be the case that the fields which scored better in your analysis tend to have lower-powered studies? You point out that environmental science, neuroscience, and economics all tend to be weakly powered; these fields also seem to have a lower share of p-values in the 0.05-0.01 range. Is this not what you would expect if the studies are weaker? Or is there little variance in statistical power between fields?

Cremieux:

It's hard to get a totally accurate read on the statistical power in each field, but at a glance the excess of p-values doesn't look to be strongly correlated with typical power estimates. The thing worth noting is that, at any level of power, there shouldn't be a hollowing-out of p-values below significance thresholds, and fields exceed the chance level even if they had the optimal level of power for landing p-values exactly within that range.
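
To put a number on the "even at the optimal power" part, here is a minimal sketch, assuming z statistics are Normal(mu, 1) under the alternative and a two-sided test at alpha = 0.05 (an assumption for illustration, not necessarily the post's exact setup):

# Share of p-values expected in the (0.01, 0.05] band when z ~ Normal(mu, 1)
frac_in_band <- function(mu, lo = qnorm(0.975), hi = qnorm(0.995)) {
  (pnorm(hi - mu) - pnorm(lo - mu)) + (pnorm(-lo - mu) - pnorm(-hi - mu))
}

mus    <- seq(0, 6, by = 0.01)
shares <- sapply(mus, frac_in_band)
max(shares)             # the largest share any single alternative can produce (~0.24)
mus[which.max(shares)]  # attained with mu roughly midway between 1.96 and 2.58

Under this simple model, no single power level can push more than about a quarter of p-values into that band, so a pile-up beyond that ceiling can't be explained by power alone.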

Dean Eckles:

One additional note is that economics often uses p < 0.1 as a secondary threshold, with top journals like QJE putting stars next to such results, while other fields' journals would not (using a different symbol or none at all).

Gusto:

Thanks. Would you be so kind as to elaborate on how you arrived at the following conclusions: “we can calculate that at 80% power and with all real effects, 12.60% of results should return p-values between 0.05 and 0.01. Under the null hypothesis, only 4% would be in that range”?

Gusto:

"In fact, with this knowledge, we can calculate that at 80% power and with all real effects, 12.60% of results should return p-values between 0.05 and 0.01. Under the null hypothesis, only 4% would be in that range. But there are problems comparing this to the observed distribution of p-values. To name a few"

Is the point (and its corresponding graph) that if we have a set of studies with an 80% statistical power and their effects are genuine, the distribution of their z-values should be something akin to the 'Alternative with 80% Power' graph, having an average z-value of ~3.7, as opposed to the observed clustering around 1.96?

Cremieux:

Yes.
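
For the "under the null hypothesis, only 4% would be in that range" part of the quote, the arithmetic is just that null p-values are uniform on (0, 1):

# Under the null, p ~ Uniform(0, 1), so the share between 0.01 and 0.05 is exactly 4%
punif(0.05) - punif(0.01)  # 0.04

The 12.60% figure for the alternative additionally depends on which effect-size distribution is assumed.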

JM:

But you can only get that graph using a specific value of the alternative hypothesis, so I think he's asking which you chose there?

Power is a function of x, i.e. P(reject H_0 | Alternative hypothesis of x).
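
To spell that out with a sketch (assuming a standard two-sided z-test at alpha = 0.05, which may or may not match the post's figure):

# Power as a function of the assumed alternative mean mu, when z ~ Normal(mu, 1)
power_two_sided <- function(mu, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(-z_crit - mu) + pnorm(z_crit - mu, lower.tail = FALSE)
}

power_two_sided(0)    # 0.05: with no effect, "power" is just the false-positive rate
power_two_sided(2.8)  # ~0.80: the textbook alternative mean for 80% power

# The mu that gives exactly 80% power under these assumptions
uniroot(function(mu) power_two_sided(mu) - 0.80, c(0, 6))$root  # ~2.80

Different choices of mu trace out different alternative curves, which is why the figure can only be drawn for one specific choice.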

Jab:

Economics is mostly made up to begin with. It's much easier to create consistency inside a pseudo-scientific simulation of "reality" than within the natural sciences.

zinjanthropus:

OT -- if I wanted to learn statistics of the type you do, theoretically and practically, what would you recommend doing?

TGGP:

It seems wrong for economics to be at the top of a hierarchy of the sciences:

https://pubmed.ncbi.nlm.nih.gov/20383332/

Comment deleted (Apr 22, 2023)
Cremieux:

Thanks. Updated for clarity.
