Rob Van Den Berg is the Director of the Global Environment Facility Evaluation Office (GEFEO). The original version of this article appeared in the September 2013 edition of Evaluation Connections, the monthly newsletter of the European Evaluation Society (EES).
Over the past decade, evaluations have been influenced by the scientific method: to show what works and what doesn’t through ‘rigorous’ methods. The focus of discussion has often been on proving that an intervention actually works. Relatively less attention has been paid to why a specific intervention or policy does not work. The stock answer from impact evaluators has been that failure indicates that a supposed causal mechanism does not exist. But other evaluators have been quick to point out that the causal mechanism may in fact work except that the intervention did not succeed due to “implementation failure”. The reasons for failure can be manifold, and evaluations often cannot clearly identify why something does not work.
Many impact evaluation tools such as randomized controlled trials or quasi-experimental evaluations use a ‘black box’ approach. They are not geared to identify implementation failure, to differentiate it from design failure, or to seek an explanation in unexpected changes of external circumstances. Data is gathered, but only its relevance to the proof or disproof of the causality hypothesis is considered. Evidence that does not bear on the attribution question is discarded. Those in favor of this approach tend to dismiss such knowledge: if something does not work, one should obviously stop doing it, they say. However, an important objective of evaluations is to learn from mistakes, and the current popularity of one particular form of impact evaluation is not helping.
The disregard for learning from mistakes rests on the belief that evaluations need large n to provide positive proof that something works, and that in fact only large n can have scientific meaning. In a presentation to the International Development Evaluation Association Global Assembly in Barbados in May 2013, I pointed out that the philosophy of science recognizes more than one way to skin a cat, or to use rigorous methodology to explain what is happening in interventions. In fact, a full range of n (from very small to very large) is in use in scientific research. For example, in astronomy the adoption of new or rival scientific theories is not decided by experiments or by differences in n, but by the differences in explanatory power between theories.
A guiding scientific principle has been Occam’s razor, i.e. when there are conflicting theories based on the same data, scientists should adopt the theory that is simpler. Both the Ptolemaic and the Copernican theories of the solar system use the same n: data on which planet is where in the sky at what moment in time. The Copernican system provides a simpler explanation and calculation of the movements of the planets than the Ptolemaic version, which had convoluted circles within circles to explain why a planet was observed at a certain point in space and time. Still simpler explanations were provided by Kepler, showing that the planets do not run in circles around the sun, but in ellipses. Again, his laws of planetary motion were taken up in astronomy not because new n emerged or experiments were conducted in the planetary system, but because they provided a better explanation than Copernicus’ proposal that the planets ran in circles around the sun.
Astronomy also provides us with an example of one n that tipped the balance between the Newtonian view of gravity and the Einsteinian view. Einstein’s general theory of relativity posed that light rays would be bent by gravity; Newton’s theory does not include this. At an eclipse of the sun on May 29, 1919, one star was observed that was actually behind the sun; its light was bent by the gravity of the sun so that it could be seen. This one n proved to the world that Einstein was right and Newton was wrong. The search for such a crucial and decisive n is often elusive; it is difficult to find one n that would prove or disprove a theory. The best known example of negative evidence through one n in the philosophy of science has been the “all swans are white” theory that philosopher Karl Popper used to demonstrate that if one black swan is observed, the theory would be disproven.
Popper proposed shifting the search for evidence from positive evidence that something works to negative evidence that it does not. His claim was that even an enormous number of observations of white swans will not survive the sole fact of one black swan. Popper posited that it is better to search for black swans, i.e. to disprove theories, than to gather ever larger n that still would not prove that the theory is the case. This is easier said than done, as illustrated in the contributions of Nassim Taleb on “black swan events” in hedge fund and investment calculations that were based on very large n and led to the 2008 international financial crisis. Taleb argues that statistical systems of large n are fundamentally incomplete as they cannot deal with unexpected events.
In the current emphasis in the evaluation community on large n, the lessons from the natural and physical sciences are ignored, and evaluations are increasingly turning into scientific research on whether causality can be demonstrated to exist through statistical analysis of large n.
This is a search for positive evidence that disregards the induction problem and disregards small n – and especially unexpected “black swan” n – as irrelevant for scientific discovery. The current mainstream claims that only large n – through counterfactual analysis based on control groups – will demonstrate whether something works. Facts, from this perspective, are only powerful if you have a lot of them. There is nothing wrong with a lot of facts – the more the better – but if we only select them for the positive conclusions we can draw from them, we fail to learn from failure. Even single facts may have the power to prove or disprove theories. This requires analyzing facts that experimental methods do not normally capture. By gathering evidence on implementation failure, barriers to progress and unexpected developments one may learn from failure as well as success.
In early May, the author of this blog, Rob Van Den Berg, and Christine Wörlen, consultant of the Climate-Eval online community of practice, made presentations at the Global Assembly of the International Development Evaluation Association (IDEAS) in Bridgetown, Barbados. While Rob focused on “The Importance of Negative Evidence,” Christine focused on the “Theory of No Change”. We will bring you a blog from Christine in the days ahead.
Click below for the PPT presentations: