"You Couldn't Replicate Our Study Because You're Ugly"
Attractiveness rating studies shouldn't be taken too seriously
This post is brought to you by my sponsor, Warp.
This is the ninth in a series of timed posts. The way these work is that if it takes me more than one hour to complete the post, an applet that I made deletes everything I’ve written so far and I abandon the post. Check out previous examples here, here, here, here, here, here, here, here, and here. I placed the advertisement before I started my writing timer.
Plenty of scientists study the impacts of attractiveness on real-world outcomes like success landing a job, getting dates, or doing well in games. Unfortunately, the methods they use to do these studies are fraught, sometimes even bordering on absurd.
Like many literatures, attractiveness research is apparently littered with replication issues. But unlike so many other fields, there might not really be a huge replication problem, and the reason is that the field’s methods are so absurd. Consider the case of the following pair of articles: one by López Bóo, Rossi and Urzúa, and a discrepant article by Ruffle and Shtudiner.
Ruffle and Shtudiner’s article first appeared online as a preprint uploaded to the SSRN preprint server in November of 2010. López Bóo, Rossi and Urzúa’s article first appeared as a preprint uploaded to the IZA preprint server in February of 2012. By the time López Bóo, Rossi and Urzúa’s paper was uploaded, Ruffle and Shtudiner’s had already gained some public acclaim because of a surprising finding: attractive women were penalized in job applications.
Ruffle and Shtudiner found that when they sent out job applications to employment agencies, men with attractive pictures got significantly more job offers than men without pictures (β = 0.073, z = 4.06), whereas unattractive men got significantly fewer (β = -0.051, z = 3.64). For women, the effects were directionally negative in both cases, but only marginally significant (β = -0.031 and -0.039, z = 1.94 and 2.18). However, when callback rates were examined for people who applied directly to companies, male attractiveness became irrelevant (z’s = 0.72 and 1.36), and attractive women became seriously disadvantaged relative to women who elected not to include a picture with their application (β = -0.066, z = 2.54).1
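To put those test statistics in more familiar terms, here’s a quick back-of-the-envelope conversion of the reported z-scores into two-sided p-values. This is a sketch that assumes the paper’s z-statistics are standard normal test statistics, which is how I’m reading their reporting:

```python
from scipy.stats import norm

# Reported z-statistics from Ruffle and Shtudiner (magnitudes; signs follow the betas)
results = {
    "attractive men, agencies (beta = 0.073)":     4.06,
    "plain men, agencies (beta = -0.051)":         3.64,
    "attractive women, agencies (beta = -0.031)":  1.94,
    "plain women, agencies (beta = -0.039)":       2.18,
    "attractive women, companies (beta = -0.066)": 2.54,
}

for label, z in results.items():
    p = 2 * norm.sf(z)  # two-sided p-value for a standard normal test statistic
    print(f"{label}: z = {z:.2f}, p = {p:.4f}")
```

The women’s agency-application effects sit at or near the conventional 0.05 line (p ≈ 0.052 and 0.029), which is part of why I’d call the whole pattern statistically tenuous.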
If these findings are correct, it suggests that, perhaps, the women who tend to do hiring discriminate against attractive prospective coworkers of the same sex, limiting intrasexual competition in the workplace. But these findings may not be real. Beyond their statistical tenuousness, López Bóo, Rossi and Urzúa simply couldn’t replicate the finding that attractive women suffer hiring discrimination relative to unattractive women. From the initial preprint:
The estimated coefficient associated with [Ruffle and Shtudiner’s focal] interaction term [in our study was] not significant, suggesting that the beauty premium is similar for women and men. This contradicts the evidence in Ruffle and Shtudiner suggesting that only men enjoy a beauty premium, and subsequently, it challenges their explanation, namely that female jealousy of attractive women in the workplace explains the punishment of attractive women.
Why? Because, López Bóo, Rossi and Urzúa suggested, their results relied on objective measures of beauty, and Ruffle and Shtudiner’s relied on subjective beauty measures.
An objective measure of beauty? That doesn’t seem possible, but there’s something to the claim. Their objective beauty measure was to take normal pictures of people and distort them with an image-editing tool so that they look so bad no sensible person would consider them attractive compared to their unmanipulated counterparts. This is how that looks:
This approach to doing attractiveness studies is interesting, and it definitely seems like it could be useful. It’s hard to argue that it doesn’t do what it’s supposed to, because the edits really do make for pictures that are so relatively unattractive that basically everyone would agree the “Attractive” versions are better than their “Unattractive” counterparts. The problem is that this method doesn’t allow researchers to test Ruffle and Shtudiner’s claims, at least with these images. The reason is that, while this “objective” attractiveness manipulation allows for relative comparisons within pairs of manipulated and unmanipulated images, it doesn’t allow for comparisons across pictures. The people López Bóo, Rossi and Urzúa dubbed “Attractive” and “Unattractive” are correctly labeled within pairs, but someone they’re calling “Attractive” might still be a 4/10, with a manipulated image that’s brought down to a 2/10. Ruffle and Shtudiner noticed this and updated their preprint in response:
In an attempt to account for our papers’ seemingly divergent results on attractive females, we asked our original panel of judges to rate the attractiveness of all [the] photos from López Bóo et al. on the same 1-9 scale [that we used]. Seven of our eight judges rated all 16 (4 in each beauty category) of the… photos that appear in… their paper. The judges’ mean ratings for the four beauty categories appear in Table 1 in the first row (our photos) and third row (theirs). Both in our paper and in theirs, the judges rated the attractive candidates significantly higher than the plain candidates for males and females alike, thereby attesting to the reasonableness of the choice of attractive and plain photos in both papers.
The most striking difference between the attractiveness ratings is that our attractiveness candidates are rated markedly higher than those in López Bóo et al.. At 8.0 out of 9, the mean rating for our attractive females is nearly twice that of 4.11 for theirs, and our attractive males’ mean rating of 6.48 is more than 60% higher than theirs of 4.14. In fact, the mean ratings of their attractive candidates are sufficiently low that they resemble more… those of our plain females… Similarly, their attractive males are rated only one point higher than our plain males.
In other words, Ruffle and Shtudiner’s subjective approach, where a diverse group of judges rated different pictures on a ‘typical’ attractiveness scale (à la “How Hot, 1-10?”), allowed for an absolute ranking of attractiveness on which López Bóo, Rossi and Urzúa’s pictures weren’t even competitive. Or put differently, they just weren’t hot enough to test Ruffle and Shtudiner’s hypothesis. Ouch!
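To make the scale problem concrete, here’s a toy numerical illustration of why within-pair validity doesn’t buy you cross-study comparability. The numbers are invented, loosely echoing the means quoted above:

```python
# Hypothetical mean ratings on a 1-9 scale.
# Study A (subjective panel selection): photos chosen to span the scale.
study_a = {"attractive": 8.0, "plain": 4.5}

# Study B (distortion method): each "unattractive" photo is a degraded copy
# of its original, so the within-pair ordering is guaranteed by construction,
# but nothing pins down where the pair sits on the absolute scale.
study_b = {"attractive": 4.1, "plain": 2.1}

# Within each study, "attractive" beats "plain"...
assert study_a["attractive"] > study_a["plain"]
assert study_b["attractive"] > study_b["plain"]

# ...yet Study B's "attractive" photos rate below Study A's "plain" ones, so
# the two studies' "attractive" conditions are not testing the same thing.
print(study_b["attractive"] < study_a["plain"])  # True
```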
But this doesn’t mean Ruffle and Shtudiner were right; it just means that López Bóo, Rossi and Urzúa didn’t meaningfully fail to replicate their findings. It could still be the case that Ruffle and Shtudiner’s ratings were biased or deficient. They seem valid enough, but who can really say? Their ratings were not objective, just potentially subject to strong intersubjective agreement. I think that’s probably good enough. What we should say based on these two results is not that the papers are contradictory, but that both methods have their uses; their results just can’t be compared willy-nilly, and doing so is likely to generate confusion and claims of replication failures where there aren’t any.2
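And “intersubjectively agreed upon” is checkable, at least in principle: with the judges’ raw ratings in hand, you could compute inter-rater agreement directly. Here’s a minimal sketch using an invented ratings matrix, since neither paper’s judge-level data is reproduced here:

```python
import numpy as np
from itertools import combinations

# Hypothetical data: 8 judges (rows) rating 16 photos (columns) on a 1-9 scale.
rng = np.random.default_rng(0)
latent = rng.uniform(1, 9, size=16)  # each photo's "true" attractiveness
ratings = np.clip(latent + rng.normal(0, 1.0, (8, 16)), 1, 9)

# One simple agreement index: the mean pairwise correlation between judges.
pairwise = [np.corrcoef(ratings[i], ratings[j])[0, 1]
            for i, j in combinations(range(8), 2)]
print(f"mean inter-judge correlation: {np.mean(pairwise):.2f}")
```

High agreement wouldn’t make the ratings objective, but it’s the kind of evidence that makes “good enough” a defensible call.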
As to whether Ruffle and Shtudiner’s unintuitive result really replicates, I can’t say. I could see it holding up, but to my eyes, it doesn’t look all that reliable to begin with. It looks more like a finding that raises eyebrows and upsets people than one that actually replicates.
Attractiveness studies already have a lot of potential to be contentious, but what if we made them extremely controversial?
A Message From My Sponsor
Steve Jobs is quoted as saying, “Design is not just what it looks like and feels like. Design is how it works.” Few B2B SaaS companies take this as seriously as Warp, the payroll and compliance platform used by based founders and startups.
Warp has cut out all the features you don’t need (looking at you, Gusto’s “e-Welcome” card for new employees) and has automated the ones you do: federal, state, and local tax compliance, global contractor payments, and 5-minute onboarding.
In other words, Warp saves you enough time that you’ll stop having the intrusive thoughts suggesting you might actually need to make your first Human Resources hire.
Get started now at joinwarp.com/crem and get a $1,000 Amazon gift card when you run payroll for the first time.
Get Started Now: www.joinwarp.com/crem
Worse Than Attractiveness Studies
Aldana and Salazar just published an interesting experimental study that claims to show that both heterosexuals and homosexuals on the dating app Tinder prefer “otherwise equivalent” Whites over Blacks. Thus, rather than helping to “build a more racially diverse world”, dating apps may instead “reinforce discrimination”.
This evidence for pro-White racial discrimination is not particularly credible because of the nature of the images the researchers used. The images are not equivalent in terms of looks: fine-looking White people and merely OK-looking Black people, who aren’t even making the same facial expressions.3
The problem in this design is immediately apparent. The solution, not so much, for a lot of reasons.
To do this study right, a simple way to start would be with a lot more pictures. With more people in the pile, you can match a large sample of Black and White candidates on multiple dimensions of rated attractiveness (both mean ratings and rating variability) as judged by a large, diverse panel of judges. Then, once you have images of lots of Blacks and Whites who are actually “otherwise equivalent”, rather than clearly non-equivalent based on ratings that were either poor or insufficiently descriptive for the task at hand, you can deploy them on the app with a diversity of profile information, from ages to interests to supporting hobby pictures, and use that to see whether people still have a racial preference by examining compensating differentials and matched results at scale. A sketch of what that matching step could look like follows below.
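Everything in this sketch (the rating arrays, the caliper, the greedy pairing) is an illustrative assumption on my part, not anything from the paper:

```python
import numpy as np

# Hypothetical panel data: for each candidate photo, the mean and standard
# deviation of attractiveness ratings from a large, diverse panel of judges.
rng = np.random.default_rng(1)
white_stats = np.column_stack([rng.uniform(3, 8, 200), rng.uniform(0.5, 2.0, 200)])
black_stats = np.column_stack([rng.uniform(3, 8, 200), rng.uniform(0.5, 2.0, 200)])

def greedy_match(a, b, caliper=0.25):
    """Pair each row of a with an unused row of b whose mean rating and
    rating SD both differ by less than the caliper."""
    used, pairs = set(), []
    for i, row in enumerate(a):
        diffs = np.abs(b - row).max(axis=1)  # worst dimension-wise gap per candidate
        for j in np.argsort(diffs):
            if j not in used and diffs[j] < caliper:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs

pairs = greedy_match(white_stats, black_stats)
print(f"{len(pairs)} matched pairs that are 'otherwise equivalent' on mean and variability")
```

Only the matched pairs would go on the app; everything else gets discarded, which is why you need a big pile of pictures to start with.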
But even if you do this, you have an uphill battle to face, for the same reason audit studies are so fraught: it is frequently hard to distinguish statistical discrimination from taste-based discrimination. Take the authors’ choice to portray everyone as if they were into hiking and being outdoors. There are differences in hobbies and leisure time-use by race, and everyone knows that some people, regardless of race, look more or less “sporty”, “nerdy”, “artsy”, and so on. So if you aren’t matching on perceived hobbies, why would you expect those hobby pictures to lead to the same judgments?
Without putting in a lot more work to improve the study’s design, you’d have to assume Tinder users are choosing to be ignorant in order to draw conclusions about racial discrimination. I could say more, but I think I can cut the commentary off there.
I have no doubt that attractiveness studies can be useful, but I have major doubts that they’re being conducted anywhere near as well as they should be, and when a study focuses on the intersection of attractiveness and race, sex, religion, or any other demographic, I am extremely confident it’s going to be bad. Because these studies are so often methodologically faulty that results from different studies are incomparable, ‘failed replications’ won’t be disqualifying. That’s not good, because if replications aren’t disqualifying, their purpose is limited, and the reliability of the literature’s key findings will remain extraordinarily uncertain.
1. And marginally significantly disadvantaged relative to women who were unattractive (-0.007, 0.80; p-difference = 0.12).
2. For what it’s worth, López Bóo, Rossi and Urzúa eventually dropped the claim that they failed to replicate Ruffle and Shtudiner’s work.
3. The authors of this study tried to portray every profile with similar interests, but another neglected dimension is that the people don’t all look equally like they’re into the same hobbies, whether because of race or otherwise.
FWIW, I'm sure you're aware of the OKCupid data from before the Great Awokening showing that, indeed, white men and Asian women were rated most attractive by most races, and Asian men and black women least. The exception was that black men were rated most attractive by black women.
The latest numbers I can find are from Quartz, in 2013, which find similar things:
https://qz.com/149342/the-uncomfortable-racial-preferences-revealed-by-online-dating
The exception now seems to be that black men are at the bottom of the heap (except, again, in the eyes of black women), and there is a preference among Asian men for Latina women; I don't know what this means or why it occurs (though there may be some delicious fusion restaurants in our future).
Those photoshopped “unattractive” faces look like they have developmental abnormalities or have recovered from facial reconstruction surgery after disfiguring injuries. Either of those possibilities has a much greater effect than just being “unattractive”.
I bet AI image generators would help here if used carefully. Could adjust the prompts to get just the right range of images.