Wednesday, September 7, 2011

Why Randomized Controlled Trials Work in Public Health...and Not Much Else

If there is a next big thing at the moment in the field of economics, it is the application of techniques from medical research—specifically, “randomized controlled trials,” or RCTs—to assess the effectiveness of development or other government-initiated projects. The general thrust of the work is as simple as it is brilliant: rather than employ complex, and often unreliable, statistical methods to tease out the extent to which a project or policy actually had a beneficial impact on intended beneficiaries, why not follow the tried and true methods employed by pharmaceutical researchers to assess the efficacy of medical treatments? The steps are: 1) randomly divide the experimental population into a treatment group that receives “benefits” from the program, and a control group that doesn’t; 2) assess outcomes for both groups; and 3) determine whether or not a significant difference exists between the two groups.
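The three steps above can be sketched in a few lines of code. This is a purely illustrative simulation (all names and numbers are invented, not drawn from any actual study): randomly assign subjects, measure outcomes in both groups, and compare the group means.

```python
import random
import statistics
from math import sqrt

def run_rct(population, treat, outcome, seed=0):
    """Illustrative sketch of a randomized controlled trial.

    population -- iterable of subjects (here, baseline values)
    treat      -- function applying the intervention to a subject
    outcome    -- function measuring a subject's outcome
    Returns (difference in mean outcomes, rough z-statistic).
    """
    rng = random.Random(seed)
    subjects = list(population)
    rng.shuffle(subjects)                         # step 1: random assignment
    half = len(subjects) // 2
    treatment, control = subjects[:half], subjects[half:]
    y_t = [outcome(treat(s)) for s in treatment]  # step 2: assess outcomes
    y_c = [outcome(s) for s in control]
    # step 3: difference in means, scaled by its standard error
    diff = statistics.mean(y_t) - statistics.mean(y_c)
    se = sqrt(statistics.variance(y_t) / len(y_t) +
              statistics.variance(y_c) / len(y_c))
    return diff, diff / se
```

With a simulated population and a hypothetical intervention that raises each subject's outcome by a fixed amount, the estimated effect recovers that amount and the z-statistic flags it as significant. A real trial would of course add pre-registration, power calculations, and a proper hypothesis test rather than this rough z-statistic.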

This approach, championed by MIT economist Esther Duflo (winner of the prestigious John Bates Clark Medal, granted to the most promising economist under the age of forty) among others, is a marked improvement on the status quo in development aid, which not infrequently involves assessing program effectiveness by simply checking whether or not the money was spent. Clearly, a world in which aid money is allocated to projects that actually benefit people is preferable to one in which money goes to whoever most effectively maneuvers to get the money to start with and then most reliably manages to spend it. In this way, the use of RCTs in development arguably advances the goal of aid effectiveness.

So, to be clear, RCTs quite credibly represent the “gold standard” in program assessment. If other methods are used, it is usually because RCTs are too expensive or otherwise impractical. Describe your favorite entrepreneurial initiative to an economist and the most probable skeptical response you’ll receive is: “Sounds wonderful, but where’s the evidence of effectiveness? In order to find out whether or not it worked, you really need to do a randomized controlled trial.”

So what is the problem with applying RCTs to development?  The Achilles heel of RCTs is a little thing known to the statistically inclined as “external validity”—a phrase that translates informally to “Who cares?” (In a future post I'll elaborate on another fundamental flaw of RCTs, which is technically termed the "non-stationarity" of underlying processes.)

The concept of external validity is straightforward. For any assessment, “internal validity” refers to the soundness with which a trial is conducted and the reliability of its results in the original setting. A professionally conducted RCT that yields a high level of statistical significance is said to be “internally valid.” However, it is fairly obvious that an intervention rigorously proven to work in one setting may or may not work in another. This second criterion—the extent to which results apply outside the original research setting—is known as “external validity.” External validity may be low because the populations in the original and the new research setting are not really comparable—for example, results of a clinical trial conducted on adults may not apply to children. But external validity may also be low because the environment in the new study setting differs in some fundamental way, not accounted for by the researcher, from the original study setting. Econometric studies that seek to draw conclusions about effectiveness from data spanning large geographical areas or highly varied populations thus typically have lower levels of internal validity, but higher levels of external validity.
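To make the distinction concrete, here is a hypothetical simulation (all numbers invented): a program whose true effect depends on some contextual variable is estimated cleanly in Setting A, so the trial is internally valid; but that estimate badly mispredicts Setting B, where the context is weaker, so the result has low external validity.

```python
import random
import statistics

def estimated_effect(context_strength, n=2000, seed=0):
    """Simulate a trial of a program whose true per-subject effect is
    5.0 * context_strength (a made-up model, for illustration only).
    Returns the estimated average treatment effect."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n):
        treated_outcome = 5.0 * context_strength + rng.gauss(0, 1)
        control_outcome = rng.gauss(0, 1)
        diffs.append(treated_outcome - control_outcome)
    return statistics.mean(diffs)

# Setting A: strong enabling context. The trial precisely recovers a
# large effect -- high internal validity.
effect_a = estimated_effect(context_strength=1.0)

# Setting B: the same program in a weaker context. Carrying the Setting A
# estimate over would badly overstate the benefit -- low external validity.
effect_b = estimated_effect(context_strength=0.2)
```

The point of the sketch: nothing in the Setting A trial itself reveals that `context_strength` matters, which is exactly why a rigorous result can fail to travel.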

So, once again, the fundamental issue is not the purity of the methodology employed (as exciting as such methodological purity is to the technically inclined) but rather the inherent complexity of the world being studied.

For this precise reason, it turns out that those who most vociferously and naïvely advocate that we apply techniques from public health to economics (a group that does not include Esther Duflo) make a fundamental error. They fail to appreciate the fact that, when it comes to external validity, public health is the exception that proves the rule. Indeed, in aid-led development in general, of the few real historical successes, nearly all are in public health. Outside of public health, few of the large-scale, top-down development programs have in fact succeeded.

Why is this? Multiple conjectures are possible. But one persuasive one is this: when it comes to biophysical function, people are people. For this reason, a carefully developed medical protocol (read “recipe”) proven to be effective for one population is highly likely to work for another population. The smallpox vaccine tested on one population tended to work on other populations; this made it possible to eradicate smallpox. Oral rehydration therapy tested on one group of children tended to work on other groups of children; millions of children have been spared preventable deaths because the technique has been adopted on a global basis. Indeed, medical protocols have such a high level of external validity that, in the United States alone, tens if not hundreds of thousands of lives could be saved every year through a more determined focus on adherence to their particulars.[1]

These huge successes were achieved, and continue to be achievable, through bold action taken by public health officials. They are rightly celebrated and encouraged, but—outside of other public health applications—not easily replicated. Successes in medicine contrast sharply with failures in other domains. Decades of efforts to design and deploy improved cook-stoves—with the linked aims of reducing both deforestation and the illness and death due to indoor air pollution—have so far primarily yielded an accumulation of Western inventions maladapted to needs and realities in various parts of the world, along with locally developed innovations that cannot be expanded to meet the true scale of the challenge.[2] For development programs in general, and RCTs in particular, public health is the exception that proves the rule.

What does work in areas outside of public health? How is it possible to design, test, and implement effective solutions in environments where complexity and volatility are dominant? A general principle applies: success requires adaptability and flexibility as well as structure—a societal capacity to scale successful efforts combined with an ingrained practice of entrepreneurial exploration. As the uniquely insightful Mancur Olson wrote in his classic Power and Prosperity (pp. 188-189):
Because uncertainties are so pervasive and unfathomable, the most dynamic and prosperous societies are those that try many, many things. They are societies with countless thousands of entrepreneurs who have relatively good access to credit and venture capital.
What works in development, according to Olson, is entrepreneurial exploration. Why? Because we don't know what works.

[1] See Atul Gawande (2009), The Checklist Manifesto: How to Get Things Right. New York: Metropolitan Books.
[2] Burkhard Bilger (2009), Hearth Surgery: The Quest for a Stove That Can Save the World. The New Yorker. December 21.


21 comments:

  1. Phil

    Your claim about (lack of) external validity is surely an empirical one. Perhaps a programme which does well in one place or circumstance will reproduce well elsewhere; perhaps it won't.

    So far, where programmes have been rigorously evaluated in different circumstances, the results have been (perhaps surprisingly) rather consistent.

    You may be right that the external validity is greater in health than in other fields; but until we do more rigorous evaluations we don't know.

    Owen

    ReplyDelete
  2. Excellent post; but I'm not even sure it's always the best thing in public health, especially where behavioural interventions are concerned and where legal and social frameworks aggravate people's vulnerability. People are people; and contexts are contexts.

    ReplyDelete
  3. Owen,

    Agreed that external validity is an empirical point. However, there are some first principles we can apply to predict (albeit imperfectly) where external validity is likely to be a major issue, and where it won't.

    In the post I link to a paper I wrote a while back with Stuart Kauffman, Jose Lobo, and Karl Shell about "production recipes" and technological innovation. This is, actually, an important hook. The take-away from that paper that is relevant here is that (if our model is the "correct" one) the greater the extent of inter-dependencies present among the different components of a particular recipe (read protocol), the lower the expected correlation between two recipes that differ along only one or a few components. Hausmann has made a similar point, framing it in terms of the high-dimensionality of development challenges:

    http://bit.ly/h3AVDK

    To unpack all of this really requires another post...maybe more. But I'll start with another post.

    Phil

    ReplyDelete
  4. Phil,

    I have yet to meet the person often described in posts like this who recommends RCTs for everything and believes that a single RCT is definitive across contexts. I'm beginning to think that they are mythical.

    All of the economists I know who promote RCTs--like Banerjee, Duflo, Dean Karlan, Jonathan Morduch and David McKenzie--are very cautious about interpreting the results of their studies. If there is a knock on them, it is not that they underplay or ignore external validity questions, but that they overemphasize external validity and seek to test everything, everywhere. When they give advice it's not that something "works" and policy makers should implement it; it's that something might work (you might say because of a Bayesian frame) and therefore should be tried and tested in the specific context.

    In terms of Public Health, I think you've missed the most important reason that RCTs "work" in that context. That reason is that every intervention is replicated dozens if not hundreds of times before it becomes accepted practice or a drug is approved. In pharmaceutical terms this means testing and refining compounds at a molecular level, then testing and refining at a cellular level, then testing and refining in simple animal models, like C. elegans, then testing and refining in complex animal models, then Phase 1 clinical trials, then Phase 2 clinical trials, then Phase 3 clinical trials, all before a drug is even submitted for approval. Even still, it is not generally accepted as effective until it has been tested several times by other parties. Combine this with the fact that other parties are likely testing the same or similar compounds and publishing papers.

    Thus the reason external validity questions are not an issue in public health RCTs is that those questions have been answered by rigorous replications in multiple contexts.

    Therefore, the fundamental flaw here is not in RCTs. The fundamental flaw is in the development policy arena where replications are discounted and discouraged.

    Tim

    ReplyDelete
  5. I'll bet you a tenner that you can't find a single academic economist running an RCT who does not appreciate the importance of external validity. Who are the naive advocates?

    ReplyDelete
  6. I think you give far too much credit to public health campaigns - many of them have been hopelessly unsuccessful and non-scientific and many continue to waste aid funding today.

    Even the smallpox campaign, which you mention in your post, operated for a long time using a mass vaccination strategy (rather than a targeted approach) that turned out to be highly ineffective and inefficient until they switched (based on anecdotal evidence that it worked in one place and thus perhaps was worth trying in another).

    Plus at the time, many of the vaccines that were in use had never been tested using clinical trials - these were things that people scraped off someone's arm and stuck in someone else's arm and based on simple observation figured out that it prevented disease.

    Perhaps at the biophysical level there is more external validity, but don't confuse medical protocols with public health protocols. They are very different things. We are still a long way from knowing really what works in many public health examples.

    ReplyDelete
  7. Tim, Sure, professionals know this. But, implicitly and sometimes explicitly, they (including Duflo and Karlan) publicly oversell methodological innovation while continuing the general practice among "development" economists to ignore entrepreneurship almost entirely. (And no, "searchers" are not the same as entrepreneurs... don't even get me started: http://bit.ly/c0Jiya ).

    What motivated (OK, provoked) me back into the RCT fray with this post was a piece in Fast Company last month by Anya Kamenetz (for whom I have very high regard)

    http://bit.ly/okxXXD

    in which Karlan is quoted as saying "The social entrepreneurship world is in a weird spot, to be honest with you. It’s a world full of rhetoric about impact investing, yet I have very rarely seen an investor actually take that seriously. When you look at the actual analysis it lacks rigor."

    At the time I read this I was wrapping up editing a special edition of Innovations journal on impact investing for SOCAP11 ( http://bit.ly/ne8eFq , ref my previous post).
    Juxtaposing Karlan's comment with what I know of the impact investing world, I'd say "The RCT world is in a weird spot, to be honest with you. It’s a world full of rhetoric about impact creation, yet I have very rarely seen an academic economist actually take that seriously. When you look at the actual work it lacks relevance."

    ReplyDelete
  8. More to the point, because of the external validity issue, I have to ask: why is an RCT any better than a detailed ethnographic study of the same issue? Ethnographies have long been derided as non-generalizable case studies, yet it seems to me that the careful use of RCTs opens this method to the same critique. What do RCTs bring to the table that we didn't already have?

    ReplyDelete
  9. I'd be really curious to hear a response to Ed's question, from someone more steeped in research methodology than myself. It strikes me, as a consumer of much of this research (but not a researcher myself), that clarity about the comparative advantage that each methodology enjoys is extremely important. I think there probably *is* a really good answer about what RCTs bring to the table (if we take non-generalizability as one of their fundamental attributes), but I'd love to hear someone with more methodological chops than myself take a stab at what precisely it is.

    ReplyDelete
  10. Phil: Why single out RCTs for threats to external validity? By focusing on RCTs, your post makes it sound like other designs don't have to worry about generalizability. The reality is that, whether I randomly assign 120 Ugandan villages to treatment or control groups, or use a quasi-experimental design instead, generalizability is still an issue.

    ReplyDelete
  11. Phil, I'm pretty surprised that you think Duflo and Karlan have ignored entrepreneurship given all of the ongoing RCTs on microfinance, enterprise, and new technologies.

    I'm also pretty surprised that you think RCTs lack relevance given the number of governments, NGOs and private companies that are collaborating directly with JPAL and IPA to answer various questions, many of which are influencing policy and decisions and have direct relevance to the lives of millions.

    (I should mention that IPA has paid my salary for the past year).

    best,

    Lee

    ReplyDelete
  12. I'm fairly sure the answer to Ed's question is: Randomization and Control. That's only half in jest.

    But if by detailed ethnography you mean careful, specific data gathering over an extended period using both quantitative and qualitative measures, then there wouldn't seem to be much difference in the data gathering portion; there would be a lot of overlap.

    So that would mean that the difference is in how the subjects are chosen and the confidence level one has in the differences between people with access to the intervention and those who do not have access to the intervention.

    If the detailed ethnography subjectively picks subjects and relies solely on observables to determine similarities and differences--or if there is no control group who doesn't have access to the intervention--then there is an obvious difference. And an obvious benefit to RCTs.

    If on the other hand, the detailed ethnography limits subjectivity by randomizing and ensuring there is a valid control group then, well, you've got a detailed ethnography that is also an RCT.

    ReplyDelete
  13. Lee,

    Hmmm... You know enough about all this that I don't need to get into how microfinance mostly doesn't have much to do with microenterprise, and microenterprise in turn doesn't have much to do with entrepreneurship (in the Schumpeterian sense, ref. http://bit.ly/qbCENw ). I'll go back and look at Karlan's work on technology, but I'm pretty sure that's going to be subject to the non-stationarity critique that will follow in my next post.

    As for "the number of governments, NGOs and private companies that are collaborating directly with JPAL and IPA to answer various questions, many of which are influencing policy and decisions and have direct relevance to the lives of millions." That is exactly why RCTs should be of concern. RCTs may well influence policy. Policy may affect the lives of millions. But, outside of public health, and in the presence of (likely aggravated) external validity concerns, the net effect is questionable.

    Indeed, RCTs are most damaging if (because?) they provide an appealing rationale for technocrats to prop up broken systems with incremental fixes. In the aggregate, this strikes me as the most obvious (if not inevitable) outcome of the increased use of RCTs ... good intentions of "randomistas" notwithstanding.

    Phil

    ReplyDelete
  14. You could argue that it was the RCTs that burst the whole microfinance=entrepreneurship bubble.

    IPA has an SME research initiative partly funded by Kauffman which is looking more closely at entrepreneurship (http://www.poverty-action.org/sme)

    Let me give a specific example on working with governments - IPA is working with the Ghana Ministry of Education to implement and evaluate a pilot education reform to hire contract teachers, (http://www.poverty-action.org/remedialeducation/scalingup) reaching 400 schools across the country, so there is a pretty reasonable shot at external validity for the rest of Ghana. The program was first run in India - so there was no assumption that it would also work in Ghana, instead it is being tested.

    Is incremental change to a poorly performing system really a bad thing? Do you think it would be better to sit back and hope the government education sector just collapses?

    ReplyDelete
  15. Yes, I'll be very interested to see how the SME initiative (http://www.poverty-action.org/sme) goes. New territory for IPA, and obviously I am skeptical. But may have potential.

    As for the education project in Ghana--look, really, if the exact same approach had been imported five years ago to the District of Columbia Public Schools, where I taught twenty years ago, and where my daughters have gone, I'd have said, "Give me an f-ing break! We need to do a hell of a lot more to this school system than haul in some tutors. We have to push back against the entrenched interests that have kept real progress in education from happening in this city for the past thirty years." Well, what Michelle Rhee did when she arrived as Chancellor was exactly what needed to be done. She was effective because she understood that the only response to systemic failure was systemic change. She was effective because, although she had a bureaucratic responsibility, she approached it with an entrepreneurial mentality. (Now, of course, she's gone. Failures in governance, of course, are everywhere.)

    That said, just because I wouldn't buy the concept you're evaluating in Ghana for Washington DC, does that mean I think it has no value in Accra? Of course not. But, a priori, I'm hardly sold. Even considering the second- and third-order distortions of the environment to which I referred above that are an inherent part of aid projects, it may indeed turn out that hiring contract teachers does yield improved outcomes for students. Even if that were the case, it might still be true that the youth of Ghana would have been better off if your entire team had quit at the outset, found some first-rate Ghanaian partners (Yaw Nyarko at NYU could have helped) and brought the idiscoveri XSEED model to Ghana instead (http://www.xseed.in/ ; http://www.mitpressjournals.org/doi/abs/10.1162/inov_a_00010 ). Or maybe the Hole-in-the-Wall approach (http://www.hole-in-the-wall.com/). So many possibilities. How can you know that assessing this particular intervention was the best possible use of your time? The opportunity cost of talent in aid assessment may be the greatest cost of all.

    ReplyDelete
  16. Will someone name me three positive (or negative) results which have been convincingly (and comparably) replicated, with results released, in five different settings?

    Go!

    ReplyDelete
  17. Or, if it is any easier, 5 positive/negative results in 3 different settings each.

    ReplyDelete
  18. Phil,

    Are you still standing? Well, it strikes me from your response that you are responding to Dean's prod with a reciprocal prod rather than taking on the truth or falsity of Dean's statement directly.

    I will admit that I'm not fully up to date on the state of analytical rigor in social entrepreneurship--and that's because I found the rhetoric so cloying and substance-free (just declaring my priors) that I've tuned a lot of it out over the last year.

    What I'd be really interested in is the direct challenge to Dean's statement. Where are the investors (not just consultants proposing methodologies) applying analytical rigor to measure total return of social enterprises? I'd love to know they are out there. I'd love to see actual money moved based on rigorous evaluation (RCT or otherwise) of social enterprises' impact.

    Tim

    ReplyDelete
  19. Like Tim, @deankarlan points out to me offline that my quote from him above ("The social entrepreneurship world is in a weird spot, to be honest with you..." from http://bit.ly/okxXXD ) has nothing to do with external validity. (Agreed). He emphasizes that the point he was making was that people in the impact investing world do not measure impact rigorously--at least not so far as he has seen.

    He also stresses that no one who works with RCTs believes that a single assessment can be broadly generalized. He affirms that the professionals who do this work are highly sensitive to the issue of external validity that I highlight in this post. (Again, I don't disagree in the slightest.)

    I appreciate Dean's willingness to reach out, and have learned something from the above exchange (including his responses directly to me).

    In future posts I'll continue to explore the complex and varied themes addressed in the excellent comments to this post--in particular as they relate to
    (1) the cost/benefit trade-offs involved in choosing among assessment methodologies,
    (2) the ways "impact investors" and "randomistas" understand and use the word "impact", and
    (3) most generally, what entrepreneurship has to do with the balance between "intensive" and "extensive" search in the process of development.

    ReplyDelete
  20. Phil, thanks for the interesting post. I offer a few thoughts on my blog:
    http://blogs.cgdev.org/open_book/?p=6843
    --David Roodman

    ReplyDelete
  21. I added your post to a compilation of recent posts related to randomized control trials and aid effectiveness. You can see it at: http://www.how-matters.org/2011/05/24/rcts-and-aid-effectiveness-compilatio/
    You can also read my "how matters" advice for donors on RCTs at: http://www.how-matters.org/2011/05/25/rcts-how-matters-advice-for-donors/

    ReplyDelete