much ado about reproducing social science experiments
Social science gets a bad rap not only because it sometimes makes us confront some very ugly truths about human nature, but because its studies can be very difficult to reproduce, so much so that a project dedicated to doing just that found it couldn't get the same results as more than half of the papers it tried to replicate. Ironically enough, an effort to replicate that replication effort didn't succeed either. For those having trouble following, let's recap: researchers trying to figure out how many social science papers can be reproduced didn't conduct a study that others were able to reproduce. That's a disaster on a meta level, but apparently it's more or less to be expected given the subject matter, the measurement biases, and the flaws involved. In a study challenging the supposedly abysmal replication rate reported by the Replication Project, it quickly becomes evident that the criteria by which the tested studies failed were simply too rigid, to the point of ignoring clearly stated uncertainty and error margins, and that some experiments were performed using different methods than the papers they were meant to replicate.
Had the Replication Project simply followed the original studies carefully and included the papers' error bars when comparing final results, it would have found over 70% of the replication attempts successful. That still doesn't sound great, with more than one in four experiments not really panning out a second time, but that's the wrong way to think about it. Social sciences are trying to measure very complicated things, and they won't get the same answer every time. There will be lots and lots of noise before we uncover a signal, and that's really what science does. Where a quantification-minded curmudgeon sees failed replication attempts, a scientist sees lessons in what not to do when designing the next experiment. It would've been great to see the much desired 92% successful replication rate the Replication Project set as its benchmark, but that number reduces the complexity of doing bleeding edge science, which often needs to get it wrong before it gets it right, to the equivalent of answering questions on an unpleasantly thorough pop quiz. Add the facts that the project's researchers refused to account for something as simple as error bars when rendering their final judgments, and that they would once in a while neglect to follow the designs they were testing, and it's difficult to trust their conclusions.
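To make the error-bar point concrete, here's a minimal sketch in Python of how the scoring criterion alone can swing the headline replication rate. The effect sizes and standard errors are made-up, purely illustrative numbers, not anything from the Replication Project or any real paper; the comparison is between a strict rule, where a replication only counts if it reaches statistical significance on its own, and a more forgiving one, where it counts if its estimate lands inside the original paper's 95% error bars.

# Minimal sketch: two ways of scoring a "successful replication".
# All numbers below are hypothetical, for illustration only.
from scipy import stats

# (original_effect, original_se, replication_effect, replication_se)
studies = [
    (0.45, 0.15, 0.30, 0.14),
    (0.60, 0.20, 0.25, 0.18),
    (0.30, 0.10, 0.05, 0.12),
    (0.50, 0.12, 0.48, 0.15),
]

def significant(effect, se, alpha=0.05):
    # Strict criterion: does the replication differ from zero on its own?
    z = effect / se
    return 2 * (1 - stats.norm.cdf(abs(z))) < alpha

def within_original_interval(orig, orig_se, rep, z=1.96):
    # Lenient criterion: does the replication fall inside the original's 95% CI?
    return orig - z * orig_se <= rep <= orig + z * orig_se

strict = sum(significant(rep, rep_se) for _, _, rep, rep_se in studies)
lenient = sum(within_original_interval(o, o_se, rep) for o, o_se, rep, _ in studies)

print(f"'significant again' criterion: {strict}/{len(studies)} replicate")
print(f"'inside original error bars' criterion: {lenient}/{len(studies)} replicate")

With these invented numbers, the strict rule calls two of four replications successful while the error-bar rule calls three of four successful, which is the kind of gap the critics argue separates the Replication Project's headline figure from a fairer accounting.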
Where does this leave us? Well, there is a replication problem in the social sciences, so much so that studies claiming to measure it are themselves flawed and difficult to replicate. There are constant debates about which study got it right and which didn't, and we can choose to see this as a huge problem we have to tackle to save the discipline. Or we can remember that this back and forth over how well certain studies hold up over time, and whose paper got it wrong and whose got it right, is exactly what we want to see in a healthy scientific field. The last thing we want is researchers not calling out studies they see as flawed, because we're trying to find out how people think and how societies work, not hit an ideal replication benchmark. That fixation on benchmarks is part of the asinine, self-destructive trend of quantifying the quality out of modern scientific papers by using a bunch of simple, irrelevant, or tangential metrics to determine the worth of the research being done, and it really needs to stop. Look, we definitely want lots of papers we can replicate at the end of the day. But far more important than that, we want to see researchers giving it their best, most honest, most thorough try, and if they fail to prove something, or we can't easily replicate their findings, that can be even more important than a positive, repeatable result.