Reflections on My Adventures in Replication
My attempts to demonstrate the problems in the knowledge production system have led me to a Damascene conversion on the need for peer review
Over the past year or so, I have taken a break from the study of economic history. Having entered middle age, I returned to a question that moved me when I was barely a man: What do we actually know? I have trespassed on others’ territory, engaging in a series of replication studies in economics. Here I will outline where I have been and what I have concluded about the problems in our knowledge production system.
My adventures began with Karl Marx or, more specifically, a “Synthetic Karl Marx.” In June 2023, the Journal of Political Economy published an article that seemed to use econometrics to demonstrate that Marx was “an occasionally acknowledged but relatively minor figure between his death and the events of 1917.” Only the Russian Revolution, the article claimed, saved him from obscurity. Knowing a little intellectual history, I found this odd, so I decided to investigate.
The result was an article in Econ Journal Watch, an online journal that I have long admired. My article on “Synthetic Karl Marx” was published in September 2024, together with a response from the two scholars whose work I was criticizing. I followed with a rejoinder in March 2025, which was accompanied by another reply from my interlocutors. Among other things, I showed that their article’s main “Synthetic Karl Marx” was an absurd Frankenstein’s monster, consisting mainly of Abraham Lincoln and Oscar Wilde, that served as the counterfactual for what would have happened to Marx’s Google n-gram share if there had never been a Russian Revolution. The whole thing was nonsensical, yet it had been published in the Journal of Political Economy, one of the “top five” economics journals that make or break economists’ careers.
I then began to wonder what other nonsense was being given the rubber stamp of truth by the peer review system.
The Synthetic Control Method (SCM) seemed like the obvious place to begin, given that it was the econometric technique that had been used to make the claims about Marx’s influence in the JPE. My initial assumption was that the article’s authors, Magness and Makovi, had misused a fundamentally sound method. But my doubts grew, not least when I found a weakness in the SCM that made it easy to engineer statistically significant results. So I looked harder at the existing literature.
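For readers who have not met the SCM, its core idea can be sketched in a few lines. What follows is a toy illustration, not the procedure used in any of the articles discussed here: the data are invented, the donor names are hypothetical, and a simple grid search over two donors stands in for the constrained optimization (and covariate matching) that real implementations use. The "synthetic" unit is a weighted average of donor units, with non-negative weights that sum to one, chosen to track the treated unit before the event.

```python
# Toy sketch of the synthetic control idea. All numbers and names are
# hypothetical; a grid search replaces the usual constrained optimizer.

# Pre-treatment outcomes for the treated unit and two donor units.
treated = [1.0, 1.2, 1.4, 1.6]
donors = {
    "donor_a": [0.8, 1.0, 1.2, 1.4],
    "donor_b": [1.6, 1.8, 2.0, 2.2],
}


def synthetic(weights, donors):
    """Weighted average of the donor series, period by period."""
    names = list(donors)
    periods = len(donors[names[0]])
    return [sum(weights[n] * donors[n][t] for n in names) for t in range(periods)]


def pre_period_loss(weights, treated, donors):
    """Sum of squared gaps between the treated unit and its synthetic twin."""
    synth = synthetic(weights, donors)
    return sum((y - s) ** 2 for y, s in zip(treated, synth))


def fit_two_donor_weights(treated, donors, steps=100):
    """Search weights (w, 1 - w) with 0 <= w <= 1 for the best pre-period fit."""
    first, second = list(donors)
    best_weights, best_loss = None, None
    for i in range(steps + 1):
        w = i / steps
        weights = {first: w, second: 1 - w}
        loss = pre_period_loss(weights, treated, donors)
        if best_loss is None or loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss


weights, loss = fit_two_donor_weights(treated, donors)
print(weights, loss)
```

The weights are picked purely because they reproduce the pre-period numbers. Nothing in the fitting step checks whether the donors are a sensible comparison for the treated unit.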
And then I found something very silly.
It was in “Comparative Politics and the Synthetic Control Method”, an article by Alberto Abadie, Alexis Diamond, and Jens Hainmueller (2015) that is routinely used to teach the SCM in universities. It purports to provide “precise quantitative inference” for the effect of German reunification on West Germany’s real GDP per capita from 1990 to 2003. Looking at the replication, however, I realized that Abadie et al. had actually used nominal GDP per capita but then referred to it as real GDP per capita throughout. The result would eventually be a paper released by the Institute for Replication (I4R), together with an unconvincing response from Abadie et al. In the paper, I found far more issues than expected, going beyond the use of the wrong outcome variable. My findings suggest that the SCM should not be used for real GDP per capita at all, even though it routinely is. In their response, Abadie et al. also seemed to struggle somewhat with index number theory, which is concerning, given the SCM is essentially an index number-generating algorithm and they are its architects.
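The difference between nominal and real GDP per capita is easy to show with a small worked example. The figures below are invented, and the function names are my own; this is a minimal sketch of deflating a nominal series by a price index, not a reconstruction of the data in Abadie et al.

```python
# Minimal sketch: converting a nominal series to real terms with a price
# index. All figures are hypothetical and chosen only for illustration.


def real_series(nominal, price_index, base_index=100.0):
    """Deflate nominal values so they are expressed in base-period prices."""
    if len(nominal) != len(price_index):
        raise ValueError("nominal and price_index must have the same length")
    return [n * base_index / p for n, p in zip(nominal, price_index)]


def total_growth(series):
    """Proportional change between the first and last observation."""
    return series[-1] / series[0] - 1.0


# Nominal GDP per capita grows 10% a year; prices rise 5% a year.
nominal_gdp_pc = [20000.0, 22000.0, 24200.0]
price_index = [100.0, 105.0, 110.25]

real_gdp_pc = real_series(nominal_gdp_pc, price_index)
nominal_growth = total_growth(nominal_gdp_pc)
real_growth = total_growth(real_gdp_pc)

print(f"nominal growth: {nominal_growth:.1%}, real growth: {real_growth:.1%}")
```

With any positive inflation, the nominal series grows faster than the real one, so a gap measured on nominal figures mixes price changes with changes in output.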
I began to find more and more problems in economics articles.
Next came “The China Syndrome: Local Labor Market Effects of Import Competition in the United States” by David H. Autor, David Dorn, and Gordon H. Hanson. Published in the American Economic Review in 2013, it is arguably the most influential economics article of the past quarter century, given its use by the Trump administration. Most importantly, I tried to understand the corrections that Kirill Borusyak, Peter Hull, and Xavier Jaravel had made to Autor et al.’s econometrics. Like most people, I obsessed over how they had reduced but not eliminated the coefficient for Chinese imports’ effects on the manufacturing employment share, seemingly supporting Autor et al.’s basic findings. Finally, I wondered what the Borusyak et al. corrections did to Autor et al.’s other dependent variables. I found the answer in a table tucked away on page 42 of Borusyak et al.’s Online Appendix, which showed that their corrections eliminated the major negative effects that Autor et al. had found for welfare indicators. I therefore applied the Borusyak et al. corrections to all of Autor et al.’s dependent variables and found that the analysis collapsed: even if Chinese imports did lead to a reduced manufacturing employment share, they appear to have had no negative effects on welfare.
And then came the most disturbing replication of all. In an article published in the Quarterly Journal of Economics in 2012, Nico Voigtländer and Hans-Joachim Voth claimed to have traced the origins of Nazi antisemitism to medieval Germany. Using large language models (LLMs) to assist me, I found major discrepancies between what Voigtländer and Voth had coded in the dataset and what their sources actually said. In a first draft of my paper, I made 171 corrections to Voigtländer and Voth’s dataset; in their response to that first draft, Voigtländer and Voth identified just 2 errors in my recoding, while also hallucinating 3 others. Subsequently, in private correspondence, Voigtländer and Voth have not been able to identify any more errors in the remaining 169 corrections that I made to their dataset. They have also offered no explanation for why there are numerous discrepancies in their dataset relating to the Gedenkbuch, the official government record of German Jews killed during the Holocaust. In my last email from them, Voth instructed me to send a revised version to the Pope and the United Nations. I therefore decided to make the paper available online, but Voigtländer and Voth have still not provided any answers to the questions that I raise.
I had also decided to automate my efforts. There was simply too much for me to replicate. I could not produce any more of these time-consuming replication studies. Instead, I thought that I could use the machines to do what I had been doing manually, albeit on a more superficial level. I became a “prompt engineer.” I spent months building a system that would use LLMs to investigate the existing literature. The result can be found at isitcredible.com, a website that produces reports by “Reviewer 2,” my automated assessment system for academic texts.
My goal was grandiose in a Borgesian way: with tongue only slightly in cheek, I wanted to produce thousands of these reports, leading to a “Total Audit of All Human Knowledge.” I hoped to use machines to provide a better answer to the question that had motivated me as a young man: What do we actually know? Eventually, I believed, isitcredible.com would tell us the limits of our knowledge. It would help us to be more like Socrates, when he famously said that he was wiser than a know-all because “what I do not know, I do not think I know, either.” It would use machines to operationalize Karl Popper’s definition of the scientific method as “the search for and the elimination of errors in the service of truth.”
Such was my belief in the capabilities of the machines that I also decided to run an experiment.
During the testing of Reviewer 2, my system found many problems in published articles. In Sarah Pierce et al.’s “Prime Editing-Installed Suppressor tRNAs for Disease-Agnostic Genome Editing,” an article on gene editing published in Nature, it discovered an error-riddled pathologist’s report in the supplementary materials. I wrote to Nature about it, and they then allowed the authors to retrospectively alter a signed-and-dated pathologist’s report. This seemed odd, and piqued my suspicions. I therefore decided to investigate the article further. Using LLMs, I dug into the article’s replication data, and together we appeared to discover problems in the safety validation. I dug deeper using multiple models and the case seemed watertight: there was clearly something wrong with the safety data, the LLMs and I concurred. I then sought to validate my suspicions with human expertise: I made numerous appeals on social media for people with an understanding of gene editing to give me their opinion; I sent a paper to the authors explaining my concerns; I posted on PubPeer, appealing for information. But I had no luck: experts on gene editing, it seemed, were hard to find. I was nevertheless convinced that I had discovered something important. Eventually, I submitted the paper to Nature as a Matters Arising. I saw it as a test: could the LLMs allow me to detect a major issue in an article on a scientific topic far beyond my field of expertise?
The answer was a resounding “No.”
When the authors of the article sent their response to Nature, it turned out that I had been extremely stupid: what I thought had been a failure of safety validation was in fact a sign of the drug working as expected. A random economic historian on a hill in Wales, it turns out, should not presume to know things on scientific topics far beyond his field of expertise. It was a total humiliation.
The problem, I think, was that having found such problems in major economics articles, I assumed that I would also find them in STEM articles, given that they both operate under the same peer review system. The robots then sycophantically helped me to find what I wanted to find, even though it wasn’t there. We hallucinated together, and I began to believe that I knew what I did not actually know. I had become the antithesis of Socrates’ wise man: a sophist, taken in by my own nonsense.
One of my mistakes was to assume that peer review is the same in STEM as in economics. If you look at the Pierce et al. article in Nature, however, you can see that it is quite different: two of the article’s reviewers are, for example, named. There is also a clear procedure for post-publication review: any random economic historian on a hill in Wales can submit a Matters Arising comment if they incorrectly believe they have discovered major issues in a published article.
My impression is that STEM is moving toward a system of peer review that is more transparent and continuous. The reviewers are often identified, and sometimes the journals make the reviews public. There is also a strong culture of post-publication criticism. I recently watched scientists descend like rabid dogs on an article in Nature that claimed a treatment for lung cancer was far more effective when given in the morning than in the evening. Seeing them go to work on PubPeer and X was a joy to behold. Their culture of criticism accords with Popper’s definition of the scientific method as “the search for and the elimination of errors in the service of truth.” Furthermore, retractions are relatively common, with clearly defined procedures to follow.
The contrast with economics is stark. While economists pride themselves on the robustness of their seminars, what actually matters is publication in just five journals. The editors have immense power. Peer review is closed and anonymous. Virtually nothing is ever retracted. Post-publication peer review is minimal. Instead, my experience suggests that there is a culture of not publicly criticizing anything that has been published. If you do, you are viewed as too aggressive, possibly due to some kind of personality defect. Meanwhile, the original authors can use their right to reply for deflection and ad hominem attacks. The fear of upsetting one’s superiors is palpable.
The machines will perhaps bring about some changes. They are massively useful for replication. Without asking a robot to explain it to me repeatedly as if I were a five-year-old, I never would have understood the SCM or Borusyak et al.’s critique of Autor et al.; I never would have done all the coding my replications required; I never would have been able to search for errors in German-language source materials. The cost of doing replication studies has dropped dramatically, even if the institutions and the culture of economics are still hostile to them. Perhaps the system will catch up.
Ultimately, however, human experts are still needed to determine the truth. The danger of the machines was demonstrated during my humiliation by Pierce et al.: using the robots, I turned myself into a know-all who actually knew nothing. For this reason, I no longer think that my Reviewer 2 system can produce a “Total Audit of All Human Knowledge,” and I have deleted the public archive from isitcredible.com. We cannot delegate the determination of truth to the machines. Hence, while I will continue to use my system in my own work (and I will keep it available for others for as long as the fees are paying the server costs), I do not think it should be allowed to publicly pronounce on the credibility of research. That is ultimately for human experts to decide.
Peer review is necessary. On this, at least, I have had a Damascene conversion.
When I began my adventures in replication, I wanted to demonstrate how fundamentally corrupt the peer review system is: how it allows seemingly credible nonsense to be published with the rubber stamp of truth. In my replications of widely cited economics articles, I discovered plenty of evidence for this. But now I think that the system can be salvaged and made into something more effective. Developments in STEM suggest that a more open and continuous system of peer review is possible. As and when I return to my critique of economics, this will be my hypothesis: the problem is not peer review per se, but the anti-scientific way in which it is applied by and to economists.
For now, however, I am returning to economic history: I am working again on my book on American capitalism. During my adventures in replication, I have learned a lot about the knowledge production system, but it has also been tiring. As Saul of Tarsus discovered, it is hard to kick against the pricks.
Always have to appreciate someone capable of roundly admitting having been wrong, respect!
Which models did you use?