Aggregating evidence
Rafe Meager on aggregating evidence in the social sciences, the research process, and how to read results.
You can listen to this podcast on Spotify, Apple Podcasts, or wherever else you get your podcasts. You can also watch this conversation on YouTube.
The conversation about evidence-based policy often focuses on the policy side. The questions we ask tend to be something like: why is this amazing body of evidence not already shaping decisions?
But what we should also be asking, on the research side, is whether the evidence base itself is actually worthy of shaping policy.
This week on Ideas in Development, I was joined by Rafe Meager, who is an Associate Professor at the University of New South Wales, and is one of the leading meta-analysts in economics. We discuss why the evidence base is messier and more uncertain than we would like, and how we can fix that by aggregating across studies.
A paper is not a proof
It helps to start with what a single paper actually involves. Rafe describes research as a long, artistic process with a significant front-end thinking cost, including years of accumulating background knowledge, understanding, and insight before the idea for a project actually arrives. This is then followed by even more years of iteration, working papers and convincing expert referees that your result is new, plausibly true and relevant.
It can be tempting to treat the published paper that emerges out of this process as though it were a mathematical proof that something is true or false. But Rafe argues that this is the wrong mental model.
“It’s not right to think about papers as proving that something is happening or not happening, or true or not true… empirical social science is about the balance of probabilities”
In practice, we accumulate evidence for or against fairly narrow propositions. Even a genuine mathematical proof only applies where its axioms hold, and whether those axioms describe your actual situation is itself a fuzzy, subjective judgement.
“If you do statistics enough, you come to respect the noise. You come to respect what’s not captureable as something… important alongside what we’re very happy that we can capture and explain.”
The noise we ignore
Even well-designed and well-executed papers can mislead.
“There’s just very little serious attempt to deal with noise. People do these power calculations that are mostly, to be honest, made up or kind of reverse engineered. And we wouldn’t accept that in other areas of our statistical practice.”
Rafe argues that while the discipline obsesses over internal validity and causal identification, we don’t hold ourselves to the same standards when it comes to dealing with noise. Precision, variance and statistical power are relegated to secondary concerns. The result is an evidence base far less precise than its confident abstracts suggest.
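To make the point about power concrete, here is a minimal sketch of a textbook power calculation for a two-arm trial, using a normal approximation. The function name, sample size and effect sizes are illustrative assumptions, not figures from the conversation. It shows how strongly the answer depends on the effect size you choose to assume, which is why a reverse-engineered assumption can make almost any design look adequately powered.

```python
from statistics import NormalDist


def power_two_arm(effect, outcome_sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two equal-sized arms."""
    std_normal = NormalDist()
    # Standard error of a difference in means between two arms of equal size.
    standard_error = outcome_sd * (2.0 / n_per_arm) ** 0.5
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / standard_error
    # Probability of rejecting in either tail when the true effect is `effect`.
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)


# Hypothetical design: 400 people per arm, outcome measured in standard deviations.
for assumed_effect in (0.10, 0.15, 0.20):
    power = power_two_arm(assumed_effect, outcome_sd=1.0, n_per_arm=400)
    print(f"assumed effect {assumed_effect:.2f} SD -> power {power:.2f}")
```

Under these assumed numbers, the same design moves from clearly underpowered to conventionally acceptable purely by changing the effect size that is plugged in.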
And every result is produced against a backdrop of existing knowledge, which shapes how a study is designed. So even a study that does the best possible job given what is known can still mislead if that knowledge turns out to be wrong.
“You can do the absolute best job given the existing knowledge that you have. And yet if that existing knowledge is wrong, the study might still be misleading in a way that no one could know.”
Social science is hard!
We then discuss external validity — whether a result learned in one place and time tells you anything about another. Classical statistical inference assumes you are drawing from a defined population by some known process. But in social science there is never really a fixed population.
“This idea of a population, especially in social science… there’s no population of adults in the UK. That’s changing every single day because people are moving in and out of the population.”
Everything flows. So even extrapolating from the past to the future for the ‘same’ population carries an error that classical statistics simply does not capture.
Faced with this, the temptation is often one of two bad responses. One is to treat a single study as decisive: a zero effect on microcredit here means it works nowhere; a positive effect on cash transfers there means we should roll it out everywhere. The other is the fashionable pessimism that says it is absurd to imagine anything that works in Kenya could work in Ghana. Rafe rejects both. It is not absurd at all – it may or may not be true, and we have the capacity to study that and we should.
The whole premise of policymaking, as opposed to a tailored personal solution for every single individual, is that aggregation is possible. The question is how to do it honestly.
Separating real differences from sampling noise
If a trial in Ghana gives a different estimate from one in Kenya, that does not necessarily mean the underlying effects differ, because rerunning the Kenyan study would itself produce a different estimate. Sampling error alone leads to apparent differences, and on sensible assumptions it will overstate how different the effects really are. So what you need is a technique that separates true variation from sampling variation, and the regression tools of an introductory econometrics course will not cut it.
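A small simulation can illustrate why raw estimates overstate how much effects differ. This is a sketch under made-up assumptions (normally distributed true effects, the same known standard error at every site, invented numbers), and the simple variance subtraction at the end is a crude moment-based stand-in for illustration, not the method Rafe uses.

```python
import random
import statistics

STANDARD_ERROR = 0.10  # assumed sampling noise of each site's estimate


def simulate_site_estimates(n_sites, true_mean, true_sd, standard_error, seed=0):
    """Draw a true effect per site, then one noisy estimate of each true effect."""
    rng = random.Random(seed)
    true_effects = [rng.gauss(true_mean, true_sd) for _ in range(n_sites)]
    estimates = [rng.gauss(effect, standard_error) for effect in true_effects]
    return true_effects, estimates


true_effects, estimates = simulate_site_estimates(
    n_sites=5000, true_mean=0.10, true_sd=0.05, standard_error=STANDARD_ERROR
)

spread_true = statistics.pstdev(true_effects)
spread_estimated = statistics.pstdev(estimates)

# The variance of the estimates is roughly (true variance + sampling variance),
# so subtracting the sampling variance gives a rough guess at the true spread.
corrected_variance = max(spread_estimated**2 - STANDARD_ERROR**2, 0.0)
spread_corrected = corrected_variance**0.5

print(f"spread of true effects:      {spread_true:.3f}")
print(f"spread of raw estimates:     {spread_estimated:.3f}")
print(f"spread after noise removed:  {spread_corrected:.3f}")
```

In this setup the raw estimates vary roughly twice as much as the true effects do, simply because each estimate carries its own sampling error on top of any real difference between sites.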
The job requires a deconvolution, a problem that astrophysicists also face, and Rafe’s tool of choice is Bayesian hierarchical modelling, which nests the uncertainty within each setting inside a second layer of uncertainty across settings.
And this does not require hundreds of studies! People trained in classical statistics often assume, incorrectly, that meta-analysis demands a large number of estimates.
This is where Rafe’s microfinance work comes in, which aggregates seven randomised trials around the world that had a range of estimated effects. The microcredit literature is large, but most of it does not focus on the question Rafe cared about: what happens if you expand access?
Only a few studies answer that, and with a small number of genuinely comparable data points, Bayesian hierarchical modelling, which does not lean on large-sample approximations, can help.
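For readers who want to see the two layers of uncertainty in action, below is a minimal sketch of a normal-normal hierarchical model fitted by grid approximation. Everything in it is an assumption made for illustration: the seven estimates and standard errors are invented (they are not the microcredit results), the priors on the average effect and on the cross-study spread are flat over the grid, and the function names are hypothetical. Real analyses would use richer models and dedicated software.

```python
import math


def pooled_mean_given_tau(estimates, std_errors, tau):
    """Precision-weighted mean across studies for a given cross-study sd tau."""
    total_var = [tau**2 + se**2 for se in std_errors]
    precision = sum(1.0 / v for v in total_var)
    mu_hat = sum(y / v for y, v in zip(estimates, total_var)) / precision
    return mu_hat, precision, total_var


def posterior_over_tau(estimates, std_errors, tau_grid):
    """Grid posterior for tau with the average effect integrated out (flat priors)."""
    log_post = []
    for tau in tau_grid:
        mu_hat, precision, total_var = pooled_mean_given_tau(estimates, std_errors, tau)
        log_p = -0.5 * math.log(precision)
        for y, v in zip(estimates, total_var):
            log_p += -0.5 * math.log(v) - 0.5 * (y - mu_hat) ** 2 / v
        log_post.append(log_p)
    peak = max(log_post)  # subtract the peak for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


def pooled_and_shrunk(estimates, std_errors, tau_grid):
    """Posterior mean of the average effect and of each study's own effect."""
    weights = posterior_over_tau(estimates, std_errors, tau_grid)
    pooled = 0.0
    shrunk = [0.0] * len(estimates)
    for tau, w in zip(tau_grid, weights):
        mu_hat, _, _ = pooled_mean_given_tau(estimates, std_errors, tau)
        pooled += w * mu_hat
        for i, (y, se) in enumerate(zip(estimates, std_errors)):
            pull = se**2 / (se**2 + tau**2)  # weight placed on the pooled mean
            shrunk[i] += w * ((1 - pull) * y + pull * mu_hat)
    return pooled, shrunk


# Invented inputs for seven hypothetical studies.
estimates = [0.02, 0.15, -0.05, 0.08, 0.30, 0.00, 0.10]
std_errors = [0.10, 0.12, 0.08, 0.15, 0.20, 0.09, 0.11]
tau_grid = [i * 0.005 for i in range(101)]  # 0.000 to 0.500

tau_weights = posterior_over_tau(estimates, std_errors, tau_grid)
pooled, shrunk = pooled_and_shrunk(estimates, std_errors, tau_grid)

print(f"pooled average effect: {pooled:.3f}")
for raw, adjusted in zip(estimates, shrunk):
    print(f"raw {raw:+.2f} -> partially pooled {adjusted:+.3f}")
```

The partially pooled estimates sit closer together than the raw ones, with noisier studies pulled harder towards the average. That is the model's way of attributing part of the apparent disagreement between sites to sampling error.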
Can your system actually deliver this?
Working with Noam Angrist and Youth Impact, which was scaling targeted instruction across Botswana and into Namibia and South Africa, Rafe used these methods to find which features of an intervention have to be preserved at scale and which can be compromised. Aggregation methods cannot answer this causally, but they can show which features correlate most strongly with larger treatment effects, telling you where to look.
What emerged was that implementation fidelity was very strongly predictive of effect size. Take-up is not a simple binary. Attending one remedial class is not the same as attending eight weeks of them, and the quality of teacher training mattered enormously for something as difficult as assessing children, splitting them by level and teaching each group accordingly. When Youth Impact then varied the intensity of targeting directly, more targeting produced more learning, corroborating the correlation.
Prior to the study, some economists had insisted that implementation could not matter much in studies, because trials are ‘gold-plated’ – everyone tries hard, so there is no meaningful variation to find. But basic contact with reality says otherwise: even smart people trying hard are usually barely getting these studies over the line, because they are attempting something ambitious.
Implementation, in other words, can be studied scientifically. For a policymaker, this work reframes the central question. Instead of asking “does this program work?”, the better question is “can my system actually deliver this program?”
What aggregation cannot fix
Rafe is equally clear about the limits of aggregation methods. Sites where rigorous evaluation is possible may differ systematically from those where it is not.
Publication bias persists, as the research community has preferences about what it wants to see that may or may not track the truth.
And there is no clean solution to any of this, which is precisely why Rafe calls for interdisciplinary collaboration and partnership between academics and policymakers. Neither group would really want to do without the other – more sources of knowledge are simply more.
Heuristics for the rest of us
Asked how they react to a single causal claim on social media or in an abstract, Rafe offered a set of rules of thumb.
More than one study is always better than one – is this the first study of a question, or building on an established literature?
The quality of the causal and statistical techniques matters.
Was the study pre-registered and, if so, what was the pre-analysis plan?
Ask people who know more than you do whether a finding passes their sniff test.
Seeking out the bits of knowledge you are missing is among the most useful things a reader can do.
The future of aggregation
Looking ahead, as AI drives up the volume of research, the need for disciplined aggregation will only increase. Careful attention must be paid to the quality of inputs and to whether studies are even conceptually coherent enough to combine.
Rafe worries about ‘cookie cutter’ approaches that apply meta-analytic recipes without care – more methodological progress is required for these aggregation methods.
Beyond that, we need to empower the public to understand evidence the way it understands football teams. Some evidence is simply better than other evidence, the quality of studies varies, and saying so out loud need not be hard. The alternative we are left with – a public fed an endless stream of contradictory “studies say” headlines and given no tools to weigh them – is a recipe for losing trust in social science altogether.

