The State of 'the Gaps' 2022-23
How large are America's racial gaps in standardized test performance in the current year?
On large-scale tests of reasoning ability, the observed mean difference between non-Hispanic whites and African Americans is 1.1 ± 0.2 standard deviation.
This law is more a stylized fact than a law. After all, it was based on hundreds of large-scale observations, not the identification of a fundamental constant of the universe. Indeed, just four years after its coining, William Dickens and James Flynn argued that the law did not, in fact, hold. They contended that the Black-White gap had declined by 4 to 7 IQ points between 1972 and 2002![1]
Dickens and Flynn’s argument was based on the observation that the Black-White gaps on the Stanford-Binet (SB), the Wechsler Intelligence Scale for Children (WISC), the Wechsler Adult Intelligence Scale (WAIS), and the Armed Forces Qualification Test (AFQT) had all apparently declined. Expressed in standardized terms relative to the White group using Hedges’ g, their claim of gap closure seems to hold some water, since Blacks gained 2.53 points on the SB, 6.44 points on the WISC, 1.36 points on the WAIS, and 3.57 points on the AFQT.[2] The simple mean gap closure was 3.48 points.
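The simple mean quoted above can be checked directly. The sketch below only re-averages the four gains given in the text; the helper names (`simple_mean`, `round2`) are illustrative, and decimal arithmetic is used so that rounding to two places behaves predictably.

```python
from decimal import Decimal, ROUND_HALF_UP

# Gains in IQ points for each test, as quoted in the text.
gains = {
    "SB": Decimal("2.53"),
    "WISC": Decimal("6.44"),
    "WAIS": Decimal("1.36"),
    "AFQT": Decimal("3.57"),
}

def simple_mean(values):
    """Unweighted arithmetic mean of a collection of Decimal values."""
    values = list(values)
    return sum(values, Decimal(0)) / len(values)

def round2(x):
    """Round a Decimal to two places, with halves rounded up."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

mean_gain = simple_mean(gains.values())
print(round2(mean_gain))  # 3.48
```

The unrounded mean is 3.475 points, so the 3.48 figure is the rounded unweighted average, with each test counted once regardless of sample size.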
Despite claims they were comprehensive and careful in their search for gap estimates, Dickens and Flynn made errors, inappropriately extrapolated to obtain their results, and ignored evidence from other sources and even from within their paper.
A few of the problems with Dickens and Flynn’s work were pointed out by Philippe Rushton and Arthur Jensen in a review of Richard Nisbett’s Intelligence and How to Get It.[3] They noted that the Wonderlic Personnel Test (WPT) showed a gain of 2.4 points between 1970 and 2001, the Kaufman Assessment Battery for Children (K-ABC) showed a loss of 1 point between 1983 and 2004, the Woodcock-Johnson (WJ) showed no movement whatsoever between 1987 and 1999[4], and the Differential Ability Scales (DAS) showed a gain of 1.83 points between 1972 and 1986. Including the WPT, K-ABC, WJ, and DAS, the mean gain drops from 3.48 to 2.14 points. As Rushton and Jensen put it in their 2006 response to Dickens and Flynn:[5]
To claim a 4- to 7-point gain for Blacks, Dickens and Flynn chose three independent tests showing medium gains (the Wechsler, Stanford-Binet, and Armed Forces Qualification tests) and relegated to their Appendix B four or more tests showing lesser gains…. Even the tests Dickens and Flynn did analyze do not support their conclusion. The alleged gain of 4 to 7 points is from a “projected” trend line based on a small IQ rise per year multiplied by more years than are in the data using unclear procedures. Simple arithmetic applied to the data in their Table A1 shows a mean gain for Blacks of only 3.44 IQ points, from 86.44 to 89.88.
As Rushton and Jensen noted in 2010, these projections are doubly inappropriate because they cannot be assumed and were, in fact, contradicted by evidence from other tests. For example, they noted that Gottfredson’s analysis of the National Assessment of Educational Progress (NAEP) showed gap narrowing into the 1980s from 1.07 to 0.89 standard deviations, but nothing thereafter. Projecting gains forward with that data would deliver the wrong result.
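The figures in this passage are plain arithmetic and can be reproduced. The sketch below assumes the eight-test average is an unweighted mean of the gains quoted in the text, and that standard deviation units convert to IQ points at 15 points per SD; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

def round2(x):
    """Round a Decimal to two places, with halves rounded up."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# The four tests Dickens and Flynn analyzed, plus the four noted by Rushton and Jensen.
gains = {
    "SB": Decimal("2.53"), "WISC": Decimal("6.44"),
    "WAIS": Decimal("1.36"), "AFQT": Decimal("3.57"),
    "WPT": Decimal("2.4"), "K-ABC": Decimal("-1"),
    "WJ": Decimal("0"), "DAS": Decimal("1.83"),
}
eight_test_mean = sum(gains.values(), Decimal(0)) / len(gains)
print(round2(eight_test_mean))  # 2.14

# The Table A1 means quoted by Rushton and Jensen.
table_a1_gain = Decimal("89.88") - Decimal("86.44")
print(table_a1_gain)  # 3.44

# NAEP narrowing from 1.07 to 0.89 SD, at an assumed 15 IQ points per SD.
IQ_SD = Decimal("15")
naep_narrowing = (Decimal("1.07") - Decimal("0.89")) * IQ_SD
print(round2(naep_narrowing))  # 2.70
```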
Projection and omission aren’t the only issues in Dickens and Flynn’s analysis; they also committed the cardinal psychometric sin of comparing incomparable scores.
The meaning of sumscores can never be assumed to be constant. A simple reason for this is that test content changes with the times. Even content that’s repeated over the years is interpreted differently across the generations. This matters. In WISC-V Clinical Use and Interpretation, Weiss et al. begrudgingly noted why that test’s fullscale scores appeared to show racial gap closure:
In the early part of the last century, Spearman (1927) hypothesized that group differences in IQ test scores could be explained by innate differences in “g” between the races, and this position continues to rear its ugly head 70 years later (Jensen, 1998; Murray, 2005). Some will likely follow this antiquated line of reasoning and argue that [African American] FSIQ was increased in WISC-IV and WISC-V by increasing the contribution of cognitively less complex subtests with lower “g” loadings (e.g., Coding and Symbol Search) in the FSIQ, and they could be correct in so far as psychometric studies of “g” are concerned. (p. 169)[6]
Dickens and Flynn’s results were also likely to be driven by capitalizing on tests becoming less g-loaded, since Black-White differences are due to g. Dickens and Flynn provided evidence for this perspective whilst claiming to show evidence against it.
Dickens and Flynn claimed that, between the WAIS-R and the WAIS-III, Blacks gained 1.20 IQ points in the full sample and 2.82 points among individuals under age 25. At the same time, they claimed that practically all of those gains (1.17 and 2.57) were in g. The way they did this was nonsensical, because it involved simply computing the gains in a principal component score, which is a sumscore that contains both g and non-g variance! They conflated g and non-g gains and completely ignored that they had done so. This should have been very obvious to them, because the Black-White differences (in the full sample) correlated at 0.65 with g loadings (0.74 in the <25 sample), whereas the Black gains were negatively correlated with g loadings (-0.28 and -0.73 in the full and <25 samples, respectively). They repeated this error when they compared the WISC-R and the WISC-IV, where they noted IQ gains of 4.93, “g gains” of 4.67, a correlation of Black-White differences and g loadings of 0.86, and a correlation of Black gains and g loadings of -0.38.
To actually estimate gains in g requires modeling g in a way that separates g and non-g variance, not principal components analysis (PCA). While PCA can compute good-enough scores in general, this is a scenario where it’s important to use more appropriate methods, like multi-group confirmatory factor analysis (MGCFA). Luckily, appropriate methods have been applied to some of the tests they used.
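A toy calculation may help show why a composite score cannot separate the two kinds of gain. The sketch assumes a one-factor model in which each subtest mean shift equals its loading times the shift in g plus a subtest-specific shift, and treats the composite as a weighted sum of subtests (a principal component score is one such weighted sum). Every loading, weight, and shift below is a hypothetical value chosen for illustration, not an estimate from any real test.

```python
# Toy one-factor model: subtest_shift_i = loading_i * delta_g + delta_specific_i.
# All numbers are hypothetical and only illustrate the algebra.
loadings = [0.8, 0.7, 0.6, 0.3]      # hypothetical g loadings for four subtests
weights = [0.25, 0.25, 0.25, 0.25]   # equal-weight composite

def composite_shift(delta_g, delta_specific):
    """Mean shift in the composite implied by a shift in g and subtest-specific shifts."""
    return sum(
        w * (loading * delta_g + d)
        for w, loading, d in zip(weights, loadings, delta_specific)
    )

# Scenario A: g does not move; only the least g-loaded subtest gains 0.4 SD.
shift_a = composite_shift(0.0, [0.0, 0.0, 0.0, 0.4])

# Scenario B: no subtest-specific gains; g moves by just enough to match scenario A.
delta_g_b = shift_a / sum(w * loading for w, loading in zip(weights, loadings))
shift_b = composite_shift(delta_g_b, [0.0, 0.0, 0.0, 0.0])

print(f"composite shift, scenario A (no g gain): {shift_a:.3f}")
print(f"composite shift, scenario B (pure g gain): {shift_b:.3f}")
```

Both scenarios move the composite by the same amount, so the composite alone cannot say which one occurred. Telling them apart requires a model that estimates the latent mean and the subtest intercepts separately, which is what MGCFA does.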
For the AFQT, Dickens and Flynn claimed gains of 3.62 points. We saw that the corrected form of their estimate using Hedges’ g was 3.57 points. Using the estimates for the 1997 sample from Hu et al. and the estimates for the 1979 sample from Lasker, Nyborg & Kirkegaard[7], the gap closure for g declines to 1.95 points, a reduction of 46% relative to their claimed 3.62 points. Methodology matters, however, and the gap may have actually increased in size.[8]
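The size of the reduction depends on which baseline is used. The short sketch below computes it against both the claimed figure and the Hedges' g figure from the text; `percent_reduction` is an illustrative helper.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100 * (before - after) / before

claimed_gain = 3.62    # Dickens and Flynn's claimed AFQT gain, in IQ points
hedges_gain = 3.57     # the same gain expressed via Hedges' g
latent_gain = 1.95     # gap closure in g from the latent estimates

vs_claimed = percent_reduction(claimed_gain, latent_gain)
vs_hedges = percent_reduction(hedges_gain, latent_gain)

print(f"relative to the claimed 3.62 points: {vs_claimed:.1f}%")   # 46.1%
print(f"relative to the Hedges' g 3.57 points: {vs_hedges:.1f}%")  # 45.4%
```

The 46% figure matches the claimed 3.62 points as the baseline; against 3.57 points the reduction is about 45%.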
Through reanalysis of the norming samples’ correlation matrices, means, and standard deviations with MGCFA, I found that the WISC-R to WISC-IV change was from a g gap of 1.19 to 1.09 g, or a more modest gap closure of 1.5 points. I do not yet have access to the WISC-V to update the WISC’s estimate[9], but it is not unheard of for small gains to wash out between norming sample years. For example, Murray analyzed the WJ-I, WJ-II, and WJ-III tests and found a fullscale difference of 1.23 g for the WJ-I (1976-77), 0.89 g for the WJ-II (1986-88), and 1.05 g for the WJ-III (1996-99). If we extrapolated from the WJ-I to WJ-II, we would expect the gap to be 5.1 points smaller, but by the later cohort, it had grown again so the possible gains shrank to just 2.7 points. But Murray didn’t go latent, and he recognized how this might be useful: “The data… lend themselves to [MGCFA]…. I leave those analyses to specialists in the relevant techniques.” Being such a specialist, I can confirm Murray’s results with respect to g, finding gaps of 1.18 g, 0.95 g, and 1.14 g in the respective test administrations, suggesting a nonsignificant gap closure of 0.60 IQ points between 1976-77 and 1996-99. Unfortunately, I do not have the data from the WJ-IV so we’ll have to stick with the WJ-III result.
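The point figures in this paragraph all come from the same conversion: the change in the standardized gap multiplied by the IQ standard deviation. A minimal sketch, assuming the conventional 15 points per SD (the function name is illustrative):

```python
IQ_SD = 15  # assumed IQ points per standard deviation

def closure_in_iq_points(gap_before, gap_after, sd=IQ_SD):
    """Gap closure in IQ points; positive means the gap narrowed."""
    return (gap_before - gap_after) * sd

wisc_latent = closure_in_iq_points(1.19, 1.09)           # WISC-R to WISC-IV, latent g
wj_observed_to_1988 = closure_in_iq_points(1.23, 0.89)   # WJ-I to WJ-II, fullscale
wj_observed_to_1999 = closure_in_iq_points(1.23, 1.05)   # WJ-I to WJ-III, fullscale
wj_latent_to_1999 = closure_in_iq_points(1.18, 1.14)     # WJ-I to WJ-III, latent g

for label, value in [
    ("WISC-R to WISC-IV (latent)", wisc_latent),
    ("WJ-I to WJ-II (observed)", wj_observed_to_1988),
    ("WJ-I to WJ-III (observed)", wj_observed_to_1999),
    ("WJ-I to WJ-III (latent)", wj_latent_to_1999),
]:
    print(f"{label}: {value:.2f} IQ points")
```

The values come out to 1.50, 5.10, 2.70, and 0.60 points, matching the figures in the text.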
I have the norming sample data from the WAIS-R, but not the WAIS-III. However, this isn’t a worry, since the WAIS-IV has been analyzed appropriately by Craig Frisby and Alexander Beaujean. Where I found a latent gap of 1.03 g for the WAIS-R (and Dickens and Flynn even reported an observed fullscale gap of 1.01 g), Frisby and Beaujean found a gap in latent g of 1.16 g, or an increase of 1.95 points.[10] Gap closure evidently did not stick for this test. The WAIS-V has been released, but no appropriate analyses of it have been published and I don’t have the data so we’ll stick with the WAIS-IV result.
Unfortunately, I do not have access to the data for the SB, so completely correcting Dickens and Flynn’s estimates isn’t possible. Regardless, we can still update Dickens and Flynn’s estimates using the same sets of tests to obtain a new average gap closure of 1.01 points over an even longer timeframe. Adding the WJ, DAS, K-ABC, and WPT results, the gap apparently closed by just 0.98 points. If all the data were available to be analyzed properly, I have no doubt the apparent gap closure would shrink even further.[11]
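Both updated averages can be reproduced from numbers given earlier in the text, under an assumption about the inputs that the text does not spell out: the SB keeps its original 2.53-point figure (since it could not be reanalyzed), the WISC, WAIS, AFQT, and WJ use the latent-g results, and the DAS, K-ABC, and WPT keep the values reported by Rushton and Jensen. The sketch uses unweighted means.

```python
from decimal import Decimal, ROUND_HALF_UP

def round2(x):
    """Round a Decimal to two places, with halves rounded up."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def simple_mean(values):
    """Unweighted arithmetic mean of a collection of Decimal values."""
    values = list(values)
    return sum(values, Decimal(0)) / len(values)

# Assumed inputs, in IQ points of gap closure (negative means the gap widened).
core_tests = {
    "SB": Decimal("2.53"),     # original figure, not reanalyzed
    "WISC": Decimal("1.5"),    # latent g, WISC-R to WISC-IV
    "WAIS": Decimal("-1.95"),  # latent g, WAIS-R to WAIS-IV
    "AFQT": Decimal("1.95"),   # latent g, 1979 to 1997 samples
}
extra_tests = {
    "WJ": Decimal("0.60"),     # latent g, WJ-I to WJ-III
    "DAS": Decimal("1.83"),
    "K-ABC": Decimal("-1"),
    "WPT": Decimal("2.4"),
}

four_test_mean = simple_mean(core_tests.values())
eight_test_mean = simple_mean({**core_tests, **extra_tests}.values())

print(round2(four_test_mean))   # 1.01
print(round2(eight_test_mean))  # 0.98
```

Under these assumed inputs the means are 1.0075 and 0.9825 points, which round to the 1.01 and 0.98 figures quoted above.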
The difference between results handled poorly and results handled better is incredible:
Dickens and Flynn’s claims continue to be cited and taken seriously to this day, but few have apparently bothered to understand how their work was impacted by extrapolations that were unwise when they were made and which later turned out to be wrong, in addition to their interpretive errors and the omission of contradictory data.
Because Black-White differences around this size have been observed for a very long time, I’m confident this tiny apparent gap closure isn’t real. Because more recent results still show the same gap as ever, all of them consistent with the Fundamental Law of Sociology, I’m even more confident. As noted in an earlier article, three large, recent (post-2010) U.S. cohorts found gaps of 1.09 g, 1.05 g, and 1.06 g. There are no contradictory findings, so we do not have any reason to think the gap has closed.
But there’s no need to rest on our laurels. Let’s go ahead and look at results for the post-COVID era, and let’s expand the sample to look at other races and ethnicities too!
Let’s answer the question: how large are America’s racial intelligence gaps?