Handbook of Labor Economics, Vol. 4, No. Suppl PA, 2011
ISSN: 1573-4463
doi: 10.1016/S0169-7218(11)00411-4
Chapter 5 Program Evaluation and Research Designs
Abstract
This chapter provides a selective review of some contemporary approaches to program evaluation. One motivation for our review is the recent emergence and increasing use of a particular kind of “program” in applied microeconomic research, the so-called Regression Discontinuity (RD) Design of Thistlethwaite and Campbell (1960). We organize our discussion of these various research designs by how they secure internal validity: in this view, the RD design can be seen as a close “cousin” of the randomized experiment. An important distinction that emerges from our discussion of “heterogeneous treatment effects” is between ex post (descriptive) and ex ante (predictive) evaluations; these two types of evaluations have distinct but complementary goals. A second important distinction we make is between statistical statements that are descriptions of our knowledge of the program assignment process and statistical statements that are structural assumptions about individual behavior. Using these distinctions, we examine some commonly employed evaluation strategies, and assess them with a common set of criteria for “internal validity”, the foremost goal of an ex post evaluation. In some cases, we also provide some concrete illustrations of how internally valid causal estimates can be supplemented with specific structural assumptions to address “external validity”: an internally valid “experimental” estimate can be viewed as a “leading term” in an extrapolation for a parameter of interest in an ex ante evaluation.
This chapter provides a selective review of some contemporary approaches to program evaluation. Our review is primarily motivated by the recent emergence and increasing use of a particular kind of “program” in applied microeconomic research, the so-called Regression Discontinuity (RD) Design of Thistlethwaite and Campbell (1960). In a recent survey, Lee and Lemieux (2009) point out that the RD design has found good use in a wide variety of contexts, and that over the past decade, the way in which researchers view the approach has evolved to a point where it is now considered to yield highly credible and transparent causal inferences. At the time of the last volumes of the Handbook of Labor Economics, the RD design was viewed simultaneously as a “special case” of Instrumental Variables (IV) (Angrist and Krueger, 1999) and a “special case” of a “selection on observables”, or matching approach (Heckman et al., 1999). Recent theoretical analyses and the way in which practitioners interpret RD designs reveal a different view; Lee and Lemieux (2009) point out that the RD design can be viewed as a close “cousin” of the randomized experiment. In this chapter, we provide an extended discussion of this view, and also discuss some of the issues that arise in the practical implementation of the RD design. The view of the RD design as a “cousin” of the randomized experiment leads to our second, broader objective in this review: to chart out this perspective’s implicit “family tree” of commonly used program evaluation approaches. 1
Our discussion necessarily involves a discussion of “heterogeneous treatment effects”, which is one of the central issues in a wider debate about the relative merits of “structural” versus “design-based”/“experimentalist” approaches. 2 In setting forth a particular family tree, we make no attempt to make explicit or implicit judgments about what is a “better” or “more informative” approach to conducting research. Instead, we make two distinctions that we think are helpful in our review.
First, we make a clear distinction between two very different kinds of evaluation problems. One is what could be called the ex-post evaluation problem, where the main goal is to document “what happened” when a particular program was implemented. The problem begins with an explicit understanding that a very particular program was run, individuals were assigned to, or self-selected into, program status in a very particular way (and we as researchers may or may not know very much about the process), and that because of the way the program was implemented, it may only be possible to identify effects for certain sub-populations. In this sense, the data and the context (the particular program) define and set limits on the causal inferences that are possible. Achieving a high degree of internal validity (a high degree of confidence that what is measured indeed represents a causal phenomenon) is the primary goal of the ex post evaluation problem.
The other evaluation problem is the ex-ante evaluation problem, which begins with an explicit understanding that the program that was actually run may not be the one that corresponds to a particular policy of interest. Here, the goal is not descriptive, but is instead predictive. What would be the impact if we expanded eligibility of the program? What would the effects be of a similar program if it were run at a national (as opposed to a local) level? Or if it were run today (as opposed to 20 years ago)? It is essentially a problem of forecasting or extrapolating, with the goal of achieving a high degree of external validity. 3
We recognize that in reality, no researcher will only pursue (explicitly or implicitly) one of these goals to the exclusion of the other. After all, presumably we are interested in studying the effects of a particular program that occurred in the past because we think it has predictive value for policy decisions in the here and now. Likewise, a forecasting exercise usually begins with some assessment of how well methods perform “in-sample”. Nevertheless, keeping the “intermediate goals” separate allows us to discuss more clearly how to achieve those goals, without having to discuss which of them is “more important” or ambitious, or more worthy of a researcher’s attention.
The second distinction we make—and one that can be more helpful than one between “structural” and “design-based” approaches—is the one between “structural” and “design-based” statistical conditions. When we have some institutional knowledge about the process by which treatment was assigned, and when there can be common agreement about how to represent that knowledge as a statistical statement, we will label that a “D”-condition; “D” for “data-based”, “design-driven”, or “descriptive”. These conditions are better thought of as descriptions of what actually generated the data, rather than assumptions. By contrast, when important features of the data generating process are unknown, we will have to invoke some conjectures about behavior (perhaps motivated by a particular economic model), or other aspects about the environment. When we do not literally know if the conditions actually hold, but nevertheless need them to make inferences, we will label them “S”-conditions; “S” for “structural”, “subjective”, or “speculative”. As we shall see, inference about program effects will frequently involve a combination of “D” and “S” conditions: it is useful to be able to distinguish between conditions whose validity is secure and those conditions whose validity is not secure.
Note that although we may not know whether “S”-conditions are literally true, sometimes they will generate strong testable implications, and sometimes they will not. And even if there is a strong link between what we know about program assignment and a “D”-condition, a skeptic may prefer to treat those conditions as hypotheses; so we will also consider the testable implications that various “D”-conditions generate.
Using these distinctions, we examine some commonly employed evaluation strategies, and assess them against a common set of criteria for “internal validity”. We also provide a few concrete illustrations of how the goals of an ex post evaluation are quite complementary to those of an ex ante evaluation. Specifically, for a number of the designs where “external validity” is an issue, we show some examples in which internally valid causal estimates, supplemented with specific “S”-conditions, can be viewed as a “leading term” in an extrapolation for a parameter of interest from an ex ante evaluation standpoint.
Our review of commonly employed evaluation strategies will highlight and emphasize the following ideas, some of which have long been known and understood, others that have gained much attention in the recent literature, and others that have been known for some time but perhaps have been under-appreciated:
The chapter is organized as follows: in Section 2 we provide some background for our review, including our criteria for assessing various research designs; we also make some important distinctions between types of “program evaluation” that will be useful in what follows. One important distinction will be between research designs where the investigator has detailed institutional knowledge of the process by which individuals were assigned to treatment (“dominated by knowledge of the assignment process”) and those research designs where such information is lacking, which we describe as being “dominated by self-selection.” In Section 3, we discuss the former: this includes both randomized controlled trials and the regression discontinuity design. In Section 4, we discuss the latter: this includes “differences-in-differences”, instrumental variables (“selection on unobservables”), and matching estimators (“selection on observables”). Section 5 concludes.
The term “program evaluation” is frequently used to describe any systematic attempt to collect and analyze information about the implementation and outcomes of a “program”—a set of policies and procedures. Although program evaluations often include “qualitative” information, such as narrative descriptions about aspects of the program’s implementation, our focus will be solely on statistical and econometric evaluation. For our purposes, a program is a set of interventions, actions or “treatments” (typically binary), which are assigned to participants and are suspected of having some consequences on the outcomes experienced by the participants. Individuals who are “assigned” or “exposed” to treatment may or may not take up the treatment; when some individuals are assigned to, but do not take up the treatment we will often find it convenient to evaluate the effect of the offer of treatment (an “intent to treat analysis”), rather than the effect of the treatment per se, although we will examine what inferences can be made about the effect of the treatment in these situations. The problem will be to study the causal effect of the treatment when “the effects under investigation tend to be masked by fluctuations outside the experimenter’s control” (Cox, 1958). Examples of programs and treatments include not only explicit social experiments such as those involving the provision of job training to individuals under the Job Training Partnership Act (JTPA) (Guttman, 1983), but also “treatments” provided outside the context of specifically designed social experiments. Some examples of the latter include the provision of collective bargaining rights to workers at firms (DiNardo and Lee, 2004), the effects of social insurance on labor market outcomes (Lemieux and Milligan, 2008), health insurance (Card et al., 2009b,a) and schooling to mothers (McCrary and Royer, 2010).
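The difference between the effect of the offer of treatment and the effect of the treatment itself can be illustrated with a small simulation. The sketch below is purely hypothetical: the function name `simulate_itt`, the sample size, the take-up rate, and the constant effect size are all assumptions made for illustration and are not drawn from any of the studies cited above. Offers are randomized, only some offered individuals take up the treatment, and the intent-to-treat contrast compares mean outcomes by offer status.

```python
import random


def simulate_itt(n=20000, take_up=0.6, effect=2.0, seed=0):
    """Simulate randomized offers with incomplete take-up.

    All parameter values are hypothetical. Returns the intent-to-treat
    contrast (difference in mean outcomes by offer status) and the
    difference in take-up rates by offer status.
    """
    rng = random.Random(seed)
    outcome_sum = {0: 0.0, 1: 0.0}
    group_size = {0: 0, 1: 0}
    treated_count = {0: 0, 1: 0}
    for _ in range(n):
        offered = 1 if rng.random() < 0.5 else 0      # randomized offer
        would_take_up = rng.random() < take_up        # latent willingness
        d = 1 if (offered and would_take_up) else 0   # no treatment without an offer
        y = rng.gauss(0.0, 1.0) + effect * d          # constant effect (assumed)
        outcome_sum[offered] += y
        group_size[offered] += 1
        treated_count[offered] += d
    itt = outcome_sum[1] / group_size[1] - outcome_sum[0] / group_size[0]
    take_up_gap = (treated_count[1] / group_size[1]
                   - treated_count[0] / group_size[0])
    return itt, take_up_gap


itt, take_up_gap = simulate_itt()
rescaled = itt / take_up_gap  # effect per person actually treated
print(f"ITT: {itt:.3f}, take-up gap: {take_up_gap:.3f}, rescaled: {rescaled:.3f}")
```

In this stylized setting the intent-to-treat contrast is diluted by non-participation, and dividing by the take-up gap recovers the assumed effect only because the simulation builds in a constant effect and an offer that influences outcomes solely through take-up.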
Our review will be selective. In particular, we will focus most of our attention on situations in which “institutional knowledge of the data generation process” strongly informs the statistical and econometric analysis. 4 With such a focus, randomized controlled trials (RCTs) and the regression discontinuity design (RDD) are featured not because they are “best” in some single index ranking of “relevance”, but because they often provide situations where a “tight link” between the posited statistical model and the institutional details of the experiment lends credibility to the conclusions. The statistical model employed to analyze a simple, well-designed RCT often bears a tighter resemblance to the institutional details of the designed experiment than does, for example, a Mincerian wage regression. In this latter case, the credibility of the exercise does not rest on the fact that wages are set in the market place as a linear combination of a non-stochastic relationship between potential experience, schooling, etc. and a stochastic error term: the credibility of such an exercise instead rests on factors other than its close resemblance to the institutional realities of wage setting.
The distinction between these situations has sometimes been blurred: the Neyman-Holland-Rubin Model (Splawa-Neyman et al., 1990, 1935; Rubin, 1990, 1974, 1986; Holland, 1986), which we discuss later, has been used in situations both where the investigator does have detailed institutional knowledge of the data generating process and where the investigator does not. Our focus is on “the experiment that happened” rather than the “experiment we would most like to have been conducted”. As others have noted, this focus can be limiting, and a given experiment may provide only limited information (if any) on structural parameters interesting to some economists (see for example Heckman and Vytlacil (2007a)). If a designed experiment assigns a package of both “remedial education” and “job search assistance” to treated individuals, for example, we may not be able to disentangle the separate effects of each component on subsequent employment outcomes. We may be able to do better if the experiment provides random assignment of each of the components separately and together, but this will depend crucially on the experiment that was actually conducted.
In adopting such a focus, we do not mean to suggest that the types of research designs we discuss should be the only ones pursued by economists and we wish to take no position on where the “marginal research dollar” should be spent or the appropriate amount of energy which should be dedicated to “structural analyses”; for some examples of some recent contributions to this debate see Deaton (2008), Heckman and Urzua (2009), Imbens (2009), Keane (2009) and Rust (2009). Moreover, even with this narrow focus there are several important subjects we will not cover, such as those involving a continuously distributed randomized instrument as in Heckman and Vytlacil (2001a); some of these issues are treated in Taber and French (2011).
It will be useful to reiterate a distinction that has been made elsewhere (see for example, Todd and Wolpin (2006) and Wolpin (2007)), between ex ante evaluation and ex post evaluation. Ex post policy evaluation occurs upon or after a policy has been implemented; information is collected about the outcomes experienced by those who participated in the “experiment” and an attempt is made to make inferences about the role of a treatment in influencing the outcomes. An ex post evaluation generally proceeds by selecting a statistical model with a tight fit to the experiment that actually happened (whether or not the experiment was “planned”). The claims that are licensed from such evaluations are context dependent—an experiment conducted among a specific group of individuals, at a specific time and specific place, may or may not be a reliable indicator of what a treatment would do among a different group of individuals at a different time or place. The credibility of an ex post evaluation depends on the credibility of the statistical model of the experiment. Drug trials and social experiments are examples of “planned” experiments; similarly, regression discontinuity designs, although not necessarily planned, can also often provide opportunities for an ex post evaluation.
Ex ante evaluation, by contrast, does not require an experiment to have happened. It is the attempt to “study the effects of policy changes prior to their implementation” (Todd and Wolpin, 2006). 5 Unlike the ex post evaluation, the credibility of an ex ante evaluation depends on the credibility of the statistical model of the behavior of individuals and the environment to which the individuals are subjected. An influential ex ante evaluation was McFadden et al. (1977), which built a random utility model to forecast the demand for the San Francisco BART subway system before it was built. In that case, the random utility model is a more or less “complete”, albeit highly stylized, description of utility maximizing agents, their “preferences”, etc. In short, the statistical model explains why individuals make their observed choices. The model of behavior and the environment is the data generation process.
This contrasts sharply with ex post evaluation, where apart from the description of the treatment assignment mechanism, one is as agnostic as possible about what specific behavioral model is responsible for the observed data other than the assignment mechanism. We describe this below as “pan-theoretic”—the goal in an ex post evaluation is to write down a statistical model of the assignment process or the experiment that is consistent with as broad a class of potential models as possible. When the analyst has detailed institutional knowledge of the assignment mechanism, there is usually very little discretion in the choice of statistical model—it is dictated by the institutional details of the actual experiment. As observed by Wolpin (2007), however, this is not the case in the ex ante evaluation: “Researchers, beginning with the same question and using the same data, will generally differ along many dimensions in the modeling assumptions they make, and resulting models will tend to be indistinguishable in terms of model fit.”
Since human behavior is so complicated and poorly understood (relative to the properties of simple treatment assignment mechanisms), ex ante evaluations typically place a high premium on some form of “parsimony”—some potential empirical pathways are necessarily omitted from the model. Researchers in different fields, or different economists, may construct models of the same outcomes which are very different. Because many different models—with different implications, but roughly the same “fit” to the data—might be used in an ex ante evaluation, there are a wide variety of ways in which such models are validated (see Heckman (2000), Keane and Wolpin (2007), Keane (2009) and the references therein for useful discussion). Given the goal of providing a good model of what might happen in contexts different than those in which the data was collected, testing or validating the model is considerably more difficult. Indeed, “the examination of models’ predictive ability is not especially common in the microeconometrics literature” (Fang et al., 2007). Part of the difficulty is that by necessity, some variables in the model are “exogenous” (determined outside the model), and if these variables affect the outcome being studied, it is not sufficient to know the structure. For the ex ante evaluation to be reliable, “it is also necessary to know past and future values of all exogenous variables” (Marschak, 1953). Finally, it is worth noting that an ex ante evaluation (as opposed to a mere forecasting exercise) generally requires a specification of “values” (a clear discussion of the many issues involved can be found in Heckman and Smith (1998)).
In the following table, we outline some of the similarities and differences between the two kinds of evaluations, acknowledging the difficulties of “painting with a broad brush”:
Ex post program evaluation | Ex ante program evaluation |
---|---|
What did the program do? Retrospective: what happened? | What do we think a program will do? Prospective/predictive: what would happen? |
Focus on the program at hand | Focus on forecasting effects of a different program |
For what population do we identify causal effect? | For what population do we want to identify causal effect? |
Desirable to have causal inferences not reliant on specific structural framework/model | Question ill-posed without structural framework/paradigm |
No value judgments on “importance” of causal facts | Some facts will be more helpful than others |
Inferences require assumptions | Predictions require assumptions |
Desirable to test assumptions whenever possible | Desirable to test assumptions whenever possible |
Ex Ante problem guides what programs to design/analyze | Would like predictions consistent with results of Ex Post evaluation |
Inference most appropriate for situations that “resemble” the experiment and are similar to that which produced the observed data | Inferences intended for situations that are different from that which produced the observed data |
Here we describe a prototypical ex post program evaluation, where the perspective is that an event has occurred (i.e. some individuals were exposed to the program, while others were not) and data has been collected. The ex post evaluation question is: Given the particular program that was implemented, and the data that was collected, what is the causal effect of the program on a specific outcome of interest?
For example, suppose a state agency implements a new program that requires unemployment insurance claimants to be contacted via telephone by a job search counselor for information and advice about re-employment opportunities, and data is collected on the labor market behavior of the claimants before and after being exposed to this program. The ex post evaluation problem is to assess the impact of this particular job search program on labor market outcomes (e.g. unemployment durations) for the population of individuals to whom it was exposed.
One might also want to know what the program’s impact would be in a different state, or 5 years from now, or for a different population (e.g. recent high school graduates, rather than the recently unemployed), or if the job counselor were to make a personal visit to the UI claimant (rather than a phone call). But in our hypothetical example none of these things happened. We consider these questions to be the concern of an ex ante program evaluation—a forecast of the effect of a program that has not occurred. For now, we consider the program that was actually implemented, and its effect on the population to which the program was actually exposed, and focus on the goal of making as credible and precise causal inferences as possible (see Heckman and Vytlacil (2007a,b), Abbring and Heckman (2007), Keane and Wolpin (2007) and Todd and Wolpin (2006) for discussion).
We describe the general evaluation problem using the following notation: $Y$ is the outcome of interest. $D$ is the program, or treatment, status variable, equal to 1 if “treated” and 0 if not. $W$ is a vector of all variables that could impact $Y$, some observable and others unobservable to the researcher, realized prior to the determination of program status. $U$ is a fundamentally unobservable random variable that indexes an individual’s “type”, and $F_U(u)$ is the cdf of $U$.
A general framework for the evaluation problem can be given by the system:
$$W \equiv w(U) \tag{1}$$
$$P^* \equiv p^*(W, U) \equiv \Pr[D = 1 \mid W, U] \tag{2}$$
$$Y \equiv y(D, W, U) \tag{3}$$
In the first equation, $W$ is a random vector because $U$ denotes the type of a randomly chosen individual from the population. With $w(\cdot)$ being a real-valued function, those with the same $U$ (identical agents) will have the same $W$, but there may be variation in $U$ conditional on an observed value of $W$: $w(\cdot)$ need not be one-to-one. Furthermore, since $W$ is determined before $D$, $D$ does not enter the function $w(\cdot)$.
The second equation defines the latent propensity to be treated, $P^*$. Program status can be influenced by type $U$ or the factors $W$. Additionally, by allowing $P^*$ to take values between 0 and 1, we are allowing for the possibility of “other factors” outside of $W$ and $U$ that could have impacted program status. If there are no “other factors”, then $P^*$ takes on the values 0 or 1. Even though our definition of types implies no variation in $W$ conditional on $U$, it is still meaningful to consider the structural relation between $W$ and $P^*$. In particular, for a given value of $U$ equal to $u$, if one could select all alternative values of $U$ for which the relation between $W$ and $P^*$ is exactly the same, then for that subset of $U$s, the variation in $W$ could be used to trace out the impact of $W$ on $P^*$ for $U = u$. $W$ might include years of education obtained prior to exposure to the job search assistance program, and one could believe that education could impact the propensity to be a program participant. It is important to note that $P^*$ is quite distinct from the well-known “propensity score”, as we will discuss in Section 4.3. Not only is $P^*$ potentially a function of some unobservable elements of $W$, but even conditional on $W$, $P^*$ can vary across individuals.
The final equation is the outcome equation, with the interest centering on the impact of $D$ on $Y$, keeping all other things constant. As with $P^*$, although our definition of types implies no variation in $W$ conditional on $U$, it is still meaningful to consider the structural relation between $W$ and $Y$. Specifically, given a particular $U = u$, if one could select all the alternate values of $U$ such that the relation between $D$, $W$ and $Y$ is the same, then for that subset of $U$s, the variation in $W$ could be used to trace out the impact of $W$ on $Y$ for $U = u$.
Note that this notation has a direct correspondence to the familiar “potential outcomes framework”, in which the potential outcomes are $y(1, W, U)$ and $y(0, W, U)$. The framework also accommodates standard latent variable threshold-crossing models (Heckman, 1974, 1976, 1978) such as:
$$Y = \alpha + D\beta + X\gamma + \varepsilon$$
$$D = 1[X\delta + V > 0]$$
where $X$, $\varepsilon$ (with an arbitrary joint distribution) are elements of $W$, and $P^* = \Pr[V > -X\delta \mid X, \varepsilon]$. The framework also corresponds to that presented in Heckman and Vytlacil (2005). 7 The key difference is that we will not presume the existence of a continuously distributed instrument $Z$ that is independent of all the unobservables in $W$.
Throughout this chapter, we maintain a standard assumption in the evaluation literature (and in much of micro-econometrics) that each individual’s behaviors or outcomes do not directly impact the behaviors of others (i.e., we abstract from “peer effects”, general equilibrium concerns, etc.).
Define the causal effect for an individual with $U = u$ and $W = w$ as
$$\Delta(w, u) \equiv y(1, w, u) - y(0, w, u).$$
If $U$ and all the elements of $W$ were observed, then the causal effect could be identified at any value of $W$ and $U$ provided there existed some treated and non-treated individuals.
The main challenge, of course, is that the econometrician will never observe $U$ (even if individuals can be partially distinguished through the observable elements of $W$). Thus, even conditional on $W = w$, it is in general only possible to learn something about the distribution of $\Delta(w, U)$. Throughout this chapter we will focus, as does much of the evaluation literature, on average effects
$$\int \Delta(w(u), u)\, \psi(u)\, dF_U(u)$$
where $\psi(u)$ is some weighting function such that $\int \psi(u)\, dF_U(u) = 1$ (see Heckman and Vytlacil (2007a) and Abbring and Heckman (2007) for a discussion of distributional effects and effects other than the average).
The source of the causal inference problem stems from unobserved heterogeneity in $P^*$, which will cause treated and untreated populations to be noncomparable. The treated will tend to have higher $P^*$ (and hence the $U$ and $W$ that lead to high $P^*$), while the untreated will have lower $P^*$ (and hence values of $U$ and $W$ that lead to low $P^*$). Since $U$ and $W$ determine $Y$, the average $Y$ will generally be different for different populations. More formally, we have
$$\begin{aligned} E[Y \mid D = 1] - E[Y \mid D = 0] &= \int E[y(1, w(U), U) \mid D = 1, P^* = p^*]\, f_{P^* \mid D = 1}(p^*)\, dp^* - \int E[y(0, w(U), U) \mid D = 0, P^* = p^*]\, f_{P^* \mid D = 0}(p^*)\, dp^* \\ &= \int E[y(1, w(U), U) \mid P^* = p^*]\, f_{P^* \mid D = 1}(p^*)\, dp^* - \int E[y(0, w(U), U) \mid P^* = p^*]\, f_{P^* \mid D = 0}(p^*)\, dp^* \end{aligned}$$
where the $f_{P^* \mid D = d}(p^*)$ is the density of $P^*$ conditional on $D = d$, and the second equality follows from the fact that $E[y(d, w(U), U) \mid D = d, P^* = p^*] = E[y(d, w(U), U) \mid P^* = p^*]$: for all observations with an identical probability of receiving treatment, the distribution of unobservables will be identical between $D = 1$ and $D = 0$ populations. 8 Importantly, any nontrivial marginal density $f_{P^*}(p^*)$ will necessarily lead to $f_{P^* \mid D = 1}(p^*) \neq f_{P^* \mid D = 0}(p^*)$. 9
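The consequence of unobserved heterogeneity in the propensity to be treated can be sketched numerically. In the hypothetical simulation below, $U$ denotes an individual's type, $P^*$ the latent propensity to be treated, $D$ treatment status and $Y$ the outcome; the uniform distribution of types, the linear outcome equation, the unit treatment effect, and the function name `naive_contrast` are all assumptions chosen only for illustration. When $P^*$ varies with type, the raw difference in mean outcomes between treated and untreated mixes the treatment effect with differences in type; when $P^*$ is the same for everyone, the two groups are comparable.

```python
import random


def naive_contrast(n=50000, heterogeneous=True, effect=1.0, seed=1):
    """Difference in mean outcomes between treated and untreated.

    Hypothetical data generating process (an assumption, not the
    chapter's model of any particular program):
      U  ~ Uniform(0, 1)          individual type
      P* = U if heterogeneous, else 0.5 for everyone
      D  ~ Bernoulli(P*)          treatment status
      Y  = effect * D + 2 * U + standard normal noise
    """
    rng = random.Random(seed)
    total = {0: 0.0, 1: 0.0}
    count = {0: 0, 1: 0}
    for _ in range(n):
        u = rng.random()
        p_star = u if heterogeneous else 0.5
        d = 1 if rng.random() < p_star else 0
        y = effect * d + 2.0 * u + rng.gauss(0.0, 1.0)
        total[d] += y
        count[d] += 1
    return total[1] / count[1] - total[0] / count[0]


biased = naive_contrast(heterogeneous=True)     # P* varies with type
unbiased = naive_contrast(heterogeneous=False)  # P* identical for all
print(f"heterogeneous P*: {biased:.3f}, constant P*: {unbiased:.3f}")
```

Under these assumptions the treated have higher average $U$ than the untreated whenever $P^*$ depends on $U$, so the raw contrast overstates the assumed effect of 1; with a constant $P^*$ the densities of $P^*$ conditional on $D = 1$ and $D = 0$ coincide and the contrast is close to the assumed effect.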
In our discussion below, we will point out how various research designs grapple with the problem of unobserved heterogeneity in . In summary, in an ex post evaluation problem, the task is to translate whatever knowledge we have about the assignment mechanism into restrictions on the functions given in Eqs (1), (2), or (3), and to investigate, as a result, what causal effects can be identified from the data.
We argue that in an ex post evaluation of a program, the goal is to make causal inferences with a high degree of “internal validity”: the aim is to make credible inferences and qualify them as precisely as possible. In such a descriptive exercise, the degree of “external validity” is irrelevant. On the other hand, “external validity” will be of paramount importance when one wants to make predictive statements about the impact of the same program on a different population, or when one wants to use the inferences to make guesses about the possible effects of a slightly different program. That is, we view “external validity” to be the central issue in an attempt to use the results of an ex post evaluation for an ex ante program evaluation; we further discuss this in the next section.
What constitutes an inference with high “internal validity”? 10 Throughout this chapter we will consider three criteria. The first is the extent to which there is a tight correspondence between what we know about the assignment-to-treatment mechanism and our statistical model of the process. In some cases, the assignment mechanism might leave very little room as to how it is to be formally translated into a statistical assumption. In other cases, little might be known about the process leading to treatment status, leaving much more discretion in the hands of the analyst to model the process. We view this discretion as potentially expanding the set of “plausible” (yet different) inferences that can be made, and hence generating doubt as to which one is correct.
The second criterion is the broadness of the class of models with which the causal inferences are consistent. Ideally, one would like to make a causal inference that is consistent with any conceivable behavioral model. By this criterion, it would be undesirable to make a causal inference that is only valid if a very specific behavioral model is true, and it is unknown how the inferences would change under plausible deviations from the model in question.
The last criterion we will consider is the extent to which the research design is testable; that is, the extent to which we can treat the proposed treatment assignment mechanism as a null hypothesis that could, in principle, be falsified with data (e.g. probabilistically, via a formal statistical test).
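One way to make this criterion concrete: if treatment is claimed to have been assigned at random, then characteristics determined before assignment should be balanced across treated and untreated groups, and that implication can be confronted with data. The sketch below uses a simple permutation test on entirely simulated data; the function name `permutation_p_value`, the sample size, and the two assignment rules are hypothetical and chosen only to illustrate the idea.

```python
import random


def permutation_p_value(x, d, draws=2000, seed=0):
    """Permutation test of balance in a pre-treatment covariate x.

    Returns the share of random relabelings of treatment status whose
    absolute difference in group means is at least as large as the
    observed one. Small values cast doubt on random assignment.
    """
    rng = random.Random(seed)

    def gap(labels):
        treated = [xi for xi, di in zip(x, labels) if di == 1]
        control = [xi for xi, di in zip(x, labels) if di == 0]
        return abs(sum(treated) / len(treated) - sum(control) / len(control))

    observed = gap(d)
    labels = list(d)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(labels)  # relabel while keeping group sizes fixed
        if gap(labels) >= observed:
            extreme += 1
    return extreme / draws


data_rng = random.Random(42)
x = [data_rng.gauss(0.0, 1.0) for _ in range(400)]

# Assignment that depends on the covariate (self-selection on x).
d_selected = [1 if xi > 0 else 0 for xi in x]
# Assignment by coin flip, unrelated to the covariate.
d_random = [1 if data_rng.random() < 0.5 else 0 for _ in x]

p_selected = permutation_p_value(x, d_selected)
p_random = permutation_p_value(x, d_random)
print(f"p-value under selection on x: {p_selected:.3f}")
print(f"p-value under coin-flip assignment: {p_random:.3f}")
```

The test can only reject the hypothesized assignment mechanism; a large p-value under coin-flip assignment is consistent with randomization but does not prove it.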
Overall, if one were to adopt these three criteria, then a research design would have low “internal validity” when (1) the statistical model is not based on what is actually known about the treatment assignment mechanism, but based entirely on speculation, (2) inferences are known only to be valid for one specific behavioral model amongst many other plausible alternatives and (3) there is no way to test the key assumption that achieves identification.
What is the role of economic (for that matter, any other) theory in the ex post evaluation problem? First of all, economic theories motivate what outcomes we wish to examine, and what causal relationships we wish to explore. For example, our models of job search (see McCall and McCall (2008) for example) may motivate us to examine the impact of a change in benefit levels on unemployment duration. Or if we were interested in the likely impacts of the “program” of a hike in the minimum wage, economists are likely to be most interested in the impact on employment, either for the purposes of measuring demand elasticities, or perhaps assessing the empirical relevance of a perfectly competitive labor market against that of a market in which firms face upward-sloping labor supply curves (Card and Krueger, 1995; Manning, 2003).
Second, when our institutional knowledge does not put enough structure on the problem to identify any causal effects, then assumptions about individuals’ behavior must be made to make any causal statement, however conditional and qualified. In this way, structural assumptions motivated by economic theory can help “fill in the gaps” in the knowledge of the treatment assignment process.
Overall, in an ex post evaluation, the imposition of structural assumptions motivated by economic theory is done out of necessity. The ideal is to conjecture as little as possible about individuals’ behavior so as to make the causal inferences valid under the broadest class of all possible models. For example, one could imagine beginning with a simple Rosen-type model of schooling with wealth maximization (Rosen, 1987) as a basis for empirically estimating the impact of a college subsidy program on educational attainment and lifetime earnings. The problem with such an approach is that this would raise the question as to whether the causal inferences entirely depend on that particular Rosen-type model. What if one added consumption decisions to the model? What about saving and borrowing? What if there are credit constraints? What if there are unpredictable shocks to non-labor income? What if agents maximize present discounted utility rather than discounted lifetime wealth? The possible permutations go on and on.
It is tempting to reason that we have no choice but to adopt a specific model of economic behavior and to admit that causal inferences are conditional only on the model being true; that the only alternative is to make causal inferences that depend on assumptions that we do not even know we are making. 11 But this reasoning equates the specificity of a model with its completeness, which we believe to be very different notions.
Suppose, for example—in the context of evaluating the impact of our hypothetical job search assistance program—that the type of a randomly drawn individual from the population is given by the random variable (with a cdf ), that represents all the constraints and actions the individual takes prior to, and in anticipation of, the determination of participating in the program , and that outcomes are determined by the system given by Eqs (1)-(3). While there is no discussion of utility functions, production functions, information sets, or discount rates, the fact is that this is a complete model of the data generating process; that is, we have enough information to derive expressions for the joint distribution of the observables from the primitives of and (1)-(3). At the same time it is not a very specific (or economic) model, but in fact, quite the opposite: it is perhaps the most general formulation that one could consider. It is difficult to imagine any economic model—including a standard job search model—being inconsistent with this framework.
Another example of this can be seen in the context of the impact of a job training program on earnings. One of the many different economic structures consistent with (1)-(3) is a Roy-type model of selection (Roy, 1951; Heckman and Honore, 1990) into training. 12 The Roy-type model is certainly specific, assuming perfect foresight on earnings in both the “training” and “no-training” regimes, as well as income maximization behavior. If one obtains causal inferences in the Roy model framework, an open question would be how the inferences change under different theoretical frameworks (e.g. a job search-type model, where training shifts the wage offer distribution upward). But if we can show that the causal inferences are valid within the more general (but nonetheless complete) formulation of (1)-(3), then we know the inferences will still hold under the Roy-type model, a job search model, or any number of plausible alternative economic theories.
We now consider a particular kind of predictive, or ex ante, evaluation problem: suppose the researcher is interested in predicting the effects of a program “out of sample”. For example, the impact of the Job Corps Training program on the earnings of youth in 1983 in the 10 largest metropolitan areas in the US may be the focus of an ex post evaluation, simply because the data at hand comes from such a setting. But it is natural to ask any one or a combination of the following questions: What would be the impact today (or some date in the future)? What would be the impact of an expanded version of the program in more cities (as opposed to the limited number of sites in the data)? What would be the impact on an older group of participants (as opposed to only the youth)? What would be the impact of a program that expanded eligibility for the program? These are examples of the questions that are in the domain of an ex ante evaluation problem.
Note that while the ex post evaluation problem has a descriptive motivation, the above questions implicitly have a prescriptive motivation. After all, there seems no other practical reason why knowing the impact of the program “today” would be any “better” than knowing the impact of the program 20 years ago, other than because such knowledge helps us make a particular policy decision today. Similarly, the only reason we would deem it “better” to know the impact for an older group of participants, or participants from less disadvantaged backgrounds, or participants in a broader group of cities is because we would like to evaluate whether actually targeting the program along any of these dimensions would be a good idea.
One can characterize an important distinction between the ex post and ex ante evaluation problems in terms of Eq. (4). In an ex post evaluation, the weights are dictated by the constraints of the available data, and what causal effects are most plausibly identified. It is simply accepted as a fact—however disappointing it may be to the researcher—that there are only a few different weighted average effects that can be plausibly identified, whatever weights they involve. By contrast, in an ex ante evaluation, the weights are chosen by the researcher, irrespective of the feasibility of attaining the implied weighted average “of interest”. These weights may reflect the researcher’s subjective judgement about what is an “interesting” population to study. Alternatively, they may be implied by a specific normative framework. A clear example of the latter is found in Heckman and Vytlacil (2005), who begin with a Benthamite social welfare function to define a “policy relevant treatment effect”, which is a weighted average treatment effect with a particular form for the weights .
One can thus view “external validity” to be the degree of similarity between the weights characterized in the ex post evaluation and the weights defined as being “of interest” in an ex ante evaluation. From this perspective, any claim about whether a particular causal inference is “externally valid” is necessarily imprecise without a clear definition of the desired weights and their theoretical justification. Again, the PRTE of Heckman and Vytlacil (2005) is a nice example where such a precise justification is given.
Overall, in contrast to the ex post evaluation, the goals of an ex ante evaluation are not necessarily tied to the specific context of or data collected on any particular program. In some cases, the researcher may be interested in the likely effects of a program on a population for which the program was already implemented; the goals of the ex post and ex ante evaluation would then be similar. But in other cases, the researcher may have reason to be interested in the likely effects of the program on different populations or in different “economic environments”; in these cases ex post and ex ante evaluations—even when they use the same data—would be expected to yield different results. It should be clear that however credible or reliable the ex post causal inferences are, ex ante evaluations using the same data will necessarily be more speculative and dependent on more assumptions, just as forecasting out of sample is a more speculative exercise than within-sample prediction.
In this chapter, we focus most of our attention on the goals of the ex post evaluation problem, that of achieving a high degree of internal validity. We recognize that the weighted average effects that are often identified in ex post evaluation research designs may not correspond to a potentially more intuitive “parameter of interest”, raising the issue of “external validity”. Accordingly—using well-known results in the econometric and evaluation literature—we sketch out a few approaches for extrapolating from the average effects obtained from the ex post analysis to effects that might be the focus of an ex ante evaluation.
Throughout the chapter, we limit ourselves to contexts in which a potential instrument is binary, because in the real-world examples where potential instruments have been explicitly or “naturally” randomized, the instrument is invariably binary. As is well-understood in the evaluation literature, this creates a gap between what causal effects we can estimate and the potentially more “general” average effects of interest. It is intuitive that such a gap would diminish if one had access to an instrumental variable that is continuously distributed. Indeed, as Heckman and Vytlacil (2005) show, when the instrument is essentially randomized (and excluded from the outcome equation) and continuously distributed in such a way that the implied probability of treatment is continuously distributed on the unit interval, then the full set of what they define as Marginal Treatment Effects (MTE) can be used to construct various policy parameters of interest.
In this section, we consider a group of research designs in which the model for the data generating process is to a large extent dictated by explicit institutional knowledge of how treatment status was assigned. We make the case that these four well-known cases deliver causal inferences with a high degree of “internal validity” for at least three reasons: (1) some important or all aspects of the econometric model are a literal description of the treatment assignment process, (2) the validity of the causal inferences holds within a seemingly broad class of competing behavioral models, and perhaps most importantly, (3) the statistical statements that describe the assignment process simultaneously generate strong observable predictions in the data. For these reasons, we argue that these cases might be considered “high-grade” experiments/natural experiments. 13
In this section, we also consider the issue of “external validity” and the ex ante evaluation problem. It is well understood that in the four cases below, the populations for which average causal effects are identified may not correspond to the “populations of interest”. The ATE identified in a small, randomized experiment does not necessarily reflect the impact of a widespread implementation of the program; the Local Average Treatment Effect (LATE) of Imbens and Angrist (1994) is distinct from the ATE; the causal effect identified by the Regression Discontinuity Design of Thistlethwaite and Campbell (1960) does not reflect the effect of making the program available to individuals whose assignment variable is well below the discontinuity threshold. For each case, we illustrate how imposing some structure on the problem can provide an explicit link between the quantities identified in the ex post evaluation and the parameters of interest in an ex ante evaluation problem.
We start by considering simple random assignment with perfect compliance. “Perfect compliance” refers to the case in which individuals who are assigned a particular treatment do indeed receive that treatment. For example, consider a re-employment program for unemployment insurance claimants, where the “program” is being contacted (via telephone and/or personal visit) by a career counselor, who provides information that facilitates the job search process. Here, participation in this “program” is not voluntary, and it is easy to imagine a public agency randomly choosing a subset of the population of UI claimants to receive this treatment. The outcome might be time to re-employment or total earnings in a period following the treatment.
In terms of the framework defined by Eqs (1)-(3), this situation can be formally represented as
That is, for the entire population being studied, every individual has the same probability of being assigned to the program.
It is immediately clear that the distribution of becomes degenerate, with a single mass point at , and so the difference in the means in Eq. (5) becomes
where the ATE is the “average treatment effect”. The weights from Eq. (4) are in this case. A key problem posed in Eq. (5) is the potential relationship between the latent propensity and (the functions and ). Pure random assignment “solves” the problem by eliminating all variation in .
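The role of a constant assignment probability can be illustrated with a small exact calculation. The sketch below is only an illustration under stated assumptions: the three "types", their population shares, their potential outcomes, and the two sets of assignment probabilities are all hypothetical and are not taken from the chapter. It compares the treatment-control difference in means with the ATE when every type faces the same probability of treatment, and when the probability varies with type.

```python
from fractions import Fraction as F

# Hypothetical population of three types: (population share, y0, y1),
# where y0 and y1 are the outcomes without and with treatment.
TYPES = [(F(1, 2), 10, 12), (F(1, 4), 20, 25), (F(1, 4), 30, 30)]


def average_treatment_effect(types):
    """Population-weighted mean of y1 - y0."""
    return sum(share * (y1 - y0) for share, y0, y1 in types)


def difference_in_means(types, propensities):
    """E[Y | treated] - E[Y | control] when the k-th type is treated
    with probability propensities[k]."""
    rows = list(zip(types, propensities))
    treated_mass = sum(s * p for (s, _, _), p in rows)
    control_mass = sum(s * (1 - p) for (s, _, _), p in rows)
    mean_treated = sum(s * p * y1 for (s, _, y1), p in rows) / treated_mass
    mean_control = sum(s * (1 - p) * y0 for (s, y0, _), p in rows) / control_mass
    return mean_treated - mean_control


ate = average_treatment_effect(TYPES)
# Pure random assignment: every type has the same probability of treatment.
randomized = difference_in_means(TYPES, [F(1, 3)] * 3)
# Assignment probability that varies with type (no randomization).
type_dependent = difference_in_means(TYPES, [F(1, 5), F(1, 2), F(4, 5)])

print("ATE:", ate)
print("difference in means, constant probability:", randomized)
print("difference in means, type-dependent probability:", type_dependent)
```

Exact fractions are used instead of simulated draws so that the comparison does not depend on sampling noise.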
Internal validity: pan-theoretic causal inference
Let us now assess this research design on the basis of the three criteria described in Section 2.2.1. First, given the general formulation of the problem in Eqs (1)-(3), Condition D1 is much less an assumption than a literal description of the assignment process; the “D” denotes a descriptive element of the data generating process. Indeed, it is not clear how else one would formally describe the randomized experiment.
Second, the causal inference is apparently valid for any model that is consistent with the structure given in Eqs (1)-(3). As discussed in Section 2.2.1, it is difficult to conceive of a model of behavior that would not be consistent with (1), (2), and (3). So even though we are not explicitly laying out the elements of a specific model of behavior (e.g. a job search model), it should be clear that the distribution of types, Eqs (1)-(3), and Condition D1 together constitute a complete model of the data generating process, and that causal inference is far from being “atheoretic”. Indeed, the causal inference is best described as “pan-theoretic”, consistent with a broad (arguably the broadest) class of possible behavioral models.
Finally, and perhaps most crucially, even though one could consider D1 to be a descriptive statement, we could alternatively treat it as a hypothesis, one with testable implications. Specifically, D1 implies
and similarly, . That is, the distribution of unobserved “types” is identical in the treatment and control groups. Since is unobservable, this itself is not testable. But a direct consequence of the result is that the pre-determined characteristics/actions must be identical between the two groups as well,
which is a testable implication (as long as there are some observable elements of ).
The implication that the entire joint distribution of all pre-determined characteristics be identical in both the treatment and control states is indeed quite a stringent test, and also independent of any model of the determination of . It is difficult to imagine a more stringent test.
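The balance implication can be made concrete with a small exact calculation. This is a sketch under stated assumptions: the types, their population shares, and the observed characteristic attached to each type are hypothetical. It computes the distribution of a pre-determined characteristic among treated and control units, first under a constant assignment probability and then under an assignment probability that depends on type.

```python
from fractions import Fraction as F

# Hypothetical types: (population share, observed pre-determined characteristic).
TYPES = [(F(1, 2), "short"), (F(1, 4), "medium"), (F(1, 4), "long")]


def characteristic_distribution(types, propensities, treated):
    """Distribution of the characteristic among treated (treated=True)
    or control (treated=False) units."""
    mass = {}
    for (share, x), p in zip(types, propensities):
        weight = share * (p if treated else 1 - p)
        mass[x] = mass.get(x, 0) + weight
    total = sum(mass.values())
    return {x: m / total for x, m in mass.items()}


constant = [F(1, 3)] * 3                 # same probability for every type
varying = [F(1, 5), F(1, 2), F(4, 5)]    # probability depends on type

balanced_treated = characteristic_distribution(TYPES, constant, treated=True)
balanced_control = characteristic_distribution(TYPES, constant, treated=False)
unbalanced_treated = characteristic_distribution(TYPES, varying, treated=True)
unbalanced_control = characteristic_distribution(TYPES, varying, treated=False)

print(balanced_treated, balanced_control)
print(unbalanced_treated, unbalanced_control)
```

Under the constant probability the two distributions coincide with the population distribution, which is the observable prediction described above; a difference between them is evidence against the randomization hypothesis.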
Although it may be tempting to conclude that “even random assignment must assume that the unobservables are uncorrelated with treatment”, on the contrary, the key point here is that the balance of unobservable types between the treatment and control groups is not a primitive assumption; instead, it is a direct consequence of the assignment mechanism, which is described by D1. Furthermore, balance in the observable elements of is not an additional assumption, but a natural implication of balance in the unobservable type .
One might also find D1 “unappealing” since mathematically it seems like a strong condition. But from an ex post evaluation perspective, whether D1 is a “strong” or “weak” condition is not as important as the fact that D1 is beyond realistic: it is practically a literal description of the randomizing process.
Now, suppose there is a subset of elements in —call this vector —that are observed by the experimenter. A minor variant on the above mechanism is when the probability of assignment to treatment is different for different groups defined by , but the probability of treatment is identical for all individuals within each group defined by . 14 In our hypothetical job search assistance experiment, we could imagine initially stratifying the study population by their previous unemployment spell history: “short”, “medium”, and “long”-(predicted) spell UI claimants. This assignment procedure can be described as
In this case, where there may be substantial variation in the unobservable type for a given , the probability of receiving treatment is identical for everyone with the same .
The results from simple random assignment naturally follow,
essentially an average treatment effect, conditional on .
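The stratified case can be sketched in the same way. In the hypothetical example below (the strata labels, shares, potential outcomes, and assignment probabilities are all assumptions made for illustration), the assignment probability differs across strata but is constant within each stratum. The within-stratum difference in means then equals the within-stratum average effect, while the pooled difference in means does not equal the overall average effect.

```python
from fractions import Fraction as F

# Hypothetical types: (population share, stratum, y0, y1).
TYPES = [
    (F(1, 4), "short", 10, 12),
    (F(1, 4), "short", 14, 20),
    (F(1, 4), "long", 30, 31),
    (F(1, 4), "long", 40, 41),
]
# Assignment probability depends only on the observed stratum.
ASSIGNMENT_PROB = {"short": F(1, 5), "long": F(4, 5)}


def difference_in_means(types, probs):
    """E[Y | treated] - E[Y | control] over the supplied types."""
    treated_mass = sum(s * probs[x] for s, x, _, _ in types)
    control_mass = sum(s * (1 - probs[x]) for s, x, _, _ in types)
    mean_treated = sum(s * probs[x] * y1 for s, x, _, y1 in types) / treated_mass
    mean_control = sum(s * (1 - probs[x]) * y0 for s, x, y0, _ in types) / control_mass
    return mean_treated - mean_control


def average_effect(types):
    """Share-weighted mean of y1 - y0 over the supplied types."""
    total = sum(s for s, _, _, _ in types)
    return sum(s * (y1 - y0) for s, _, y0, y1 in types) / total


def in_stratum(types, stratum):
    return [row for row in types if row[1] == stratum]


short_diff = difference_in_means(in_stratum(TYPES, "short"), ASSIGNMENT_PROB)
long_diff = difference_in_means(in_stratum(TYPES, "long"), ASSIGNMENT_PROB)
pooled_diff = difference_in_means(TYPES, ASSIGNMENT_PROB)

print(short_diff, average_effect(in_stratum(TYPES, "short")))
print(long_diff, average_effect(in_stratum(TYPES, "long")))
print(pooled_diff, average_effect(TYPES))
```

The pooled comparison fails here only because the strata differ both in their assignment probabilities and in their outcome levels; conditioning on the stratum removes that problem.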
We mention this case not because D2 is a weaker, and hence more palatable assumption, but rather because it is useful to know that the statement in D2, like the mechanism described by D1, is one that typically occurs when randomized experiments are implemented. For example, in the Negative Income Tax Experiments (Robins, 1985; Ashenfelter and Plant, 1990), the stratifying variables were the pre-experimental incomes, and families were randomized into the various treatment groups with varying probabilities, but those probabilities were identical for every unit with the same pre-experimental income. Another example is the Moving to Opportunity Experiment (Orr et al., 2003), which investigated the impact of individuals moving to a more economically advantaged neighborhood. The experiment was done in 5 different cities (Baltimore, Boston, Chicago, Los Angeles, and New York) over the period 1994-1998. Unanticipated variation in the rate at which people found eligible leases led the experimenters to change the fraction of individuals randomly assigned to the treatments two different times during the experiment (Orr et al., 2003, page 232). In this case, families were divided into different “blocks” or “strata” by location and time, and there was a different randomization ratio for each of these blocks.
This design—being very similar to the simple randomization case—would have a similar level of internal validity, according to two of our three criteria. Whether this design is testable (the third criterion we are considering) depends on the available data. By the same argument as in the simple random assignment case, we have
So if the conditional randomization scheme is based on all of the pre-determined characteristics that are observed by the analyst, then there are no testable implications. On the other hand, if there are additional pre-determined characteristics that are observed (but not used in the stratification), then once again, one can treat D2 as a hypothesis, and test that hypothesis by examining whether the distribution of those extra variables is the same in the treated and control groups (conditional on the stratifying variables).
We have focused so far on the role that randomization (as described by D1 or D2) plays in ensuring a balance of the unobservable types in the treated and control groups, and have argued that in principle, this can deliver causal inferences with a high degree of internal validity.
Another characteristic of the randomized experiment is that it can be described as a “pre-specified” research design. In principle, before the experiment is carried out, the researcher is able to dictate in advance what analyses are to be performed. Indeed, in medical research conducted in the US, prior to conducting a medical experiment, investigators will frequently post a complete description of the experiment in advance at a web site such as clinicaltrials.gov. This posting includes how the randomization will be performed, the rules for selecting subjects, the outcomes that will be investigated, and what statistical tests will be performed. Among other things, such pre-announcement prevents the possibility of “selective reporting”, that is, reporting the results only from those trials that achieve the “desired” result. The underlying notion motivating such a procedure has been described as providing a “severe test”: a test which “provides an overwhelmingly good chance of revealing the presence of a specific error, if it exists—but not otherwise” (Mayo, 1996, page 7). This notion conveys the idea that convincing statistical evidence does not rely only on the “fit” of the data to a particular hypothesis but on the procedure used to arrive at the result. Good procedures are ones that make fewer “errors.”
It should be recognized, of course, that this “ideal” of pre-specification is rarely implemented in social experiments in economics. In the empirical analysis of randomized evaluations, analysts often cannot help but be interested in the effects for different sub-groups (in which they were not initially interested), and the analysis can soon resemble a data-mining exercise. 15 That said, the problem of data-mining is not specific to randomized experiments, and a researcher armed with a lot of explanatory variables in a non-experimental setting can easily find many “significant” results even among purely randomly generated “data” (see Freedman (1983) for one illustration). It is probably constructive to consider that there is a spectrum of pre-specification, with the pre-announcement procedure described above on one extreme, and specification searching and “significance hunting” with non-experimental data on the other. In our discussion below, we make the case that detailed knowledge of the assignment-to-treatment process can serve much the same role as a pre-specified research design in “planned” experiments—as a kind of “straight jacket” which largely dictates the nature of statistical analysis.
Another noteworthy consequence of this particular data generating process is that it is essentially a “statistical machine” or a “chance set up” (Hacking, 1965) whose “operating characteristics” or statistical properties are well-understood, such as a coin flip. Indeed, after a randomizer assigns n individuals to the (well-defined) treatment, and n individuals to the control, for a total of 2n individuals, one can conduct a non-parametric exact test of a sharp null hypothesis that does not require any particular distributional assumptions.
Consider the sharp null hypothesis that there is no treatment effect for any individuals (which implies that the two samples are drawn from the same distribution). In this case the assignment of the label “treatment” or “control” is arbitrary. In this example there are $\binom{2n}{n}$ different ways the labels “treatment” and “control” could have been assigned. Now consider the following procedure: compute a test statistic, such as the difference in mean outcomes between the treatment and control groups, under the labels actually assigned; recompute the statistic under each of the other possible assignments of the labels; and take as the p-value the fraction of all assignments that yield a statistic at least as extreme as the one actually observed.
This particular “randomization” or “permutation” test was originally proposed by Fisher (1935) for its utility to “supply confirmation whenever, rightly or, more often wrongly, it is suspected that the simpler tests have been appreciably injured by departures from normality.” (Fisher, 1966, page 48) (see Lehmann (1959, pages 183–192) for a detailed discussion). Our purpose in introducing it here is not to advocate for randomization inference as an “all purpose” solution for hypothesis testing; rather our purpose is to show just how powerful detailed institutional knowledge of the DGP can be.
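A randomization test of this kind can be sketched in a few lines. The version below is a minimal illustration, not a reproduction of any particular implementation: the function name, the choice of the absolute difference in means as the test statistic, and the example outcome values are assumptions. It enumerates every way of relabeling the pooled observations, so it is practical only for small samples.

```python
from itertools import combinations
from statistics import mean


def exact_randomization_pvalue(treated, control):
    """Two-sided exact p-value for the sharp null of no effect for anyone,
    using the absolute difference in means as the test statistic."""
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    observed = abs(mean(treated) - mean(control))
    indices = range(len(pooled))

    n_assignments = 0
    n_as_extreme = 0
    # Every subset of size n_treated is one way the "treatment" label
    # could have been assigned under the sharp null.
    for chosen in combinations(indices, n_treated):
        chosen_set = set(chosen)
        relabeled_treated = [pooled[i] for i in chosen]
        relabeled_control = [pooled[i] for i in indices if i not in chosen_set]
        statistic = abs(mean(relabeled_treated) - mean(relabeled_control))
        n_assignments += 1
        # Small tolerance guards against floating-point ties.
        if statistic >= observed - 1e-12:
            n_as_extreme += 1
    return n_as_extreme / n_assignments


# Hypothetical outcomes for three treated and three control individuals.
p_value = exact_randomization_pvalue([7, 8, 9], [1, 2, 3])
print(p_value)
```

With three treated and three control observations there are 20 possible relabelings, so the smallest attainable two-sided p-value is 2/20; this is a reminder that the exactness of the test comes at the cost of a coarse set of attainable significance levels in very small samples.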
Up to this point, with our focus on an ex post evaluation we have considered the question, “For the individuals exposed to the randomized evaluation, what was the impact of the program?” We now consider a particular ex ante evaluation question, “What would be the impact of a full-scale implementation of the program?”, in a context when that full-scale implementation has not occurred. It is not difficult to imagine that the individuals who participate in a small-scale randomized evaluation may differ from those who would receive treatment under full-scale implementation. One could take the perspective that this therefore makes the highly credible/internally valid causal inferences from the randomized evaluation irrelevant, and hence that there is no choice but to pursue non-experimental methods, such as a structural modeling approach to evaluation, to answer the “real” question of interest.
Here we present an alternative view that this ex ante evaluation question is an extrapolation problem. And far from being irrelevant, estimates from a randomized evaluation can form the basis for such an extrapolation. And rather than viewing structural modeling and estimation as an alternative or substitute for experimental methods, we consider the two approaches to be potentially quite complementary in carrying out this extrapolation. That is, one can adopt certain assumptions about behavior and the structure of the economy to make precise the linkage between highly credible impact estimates from a small-scale experiment and the impact of a hypothetical full-scale implementation.
We illustrate this with the following example. Suppose one conducted a small-scale randomized evaluation of a job training program where participation in the experimental study was voluntary, while actual receipt of training was randomized. The question is, what would be the impact on earnings if we opened up the program so that participation in the program was voluntary?
First, let us define the parameter of interest as
where and are the earnings of a randomly drawn individual under two regimes: full-scale implementation of the program (), or no program at all (). This corresponds to the parameter of interest that motivates the Policy Relevant Treatment Effect (PRTE) of Heckman and Vytlacil (2001b). We might like to know the average earnings gain for everyone in the population. We can also express this as
where the is the treatment status indicator in the regime, and the subscripts denote the potential outcomes.
Make the following assumptions:
This setup will imply that
Thus, S1 through S4 are simply a set of economic assumptions that says potential outcomes are unaffected by the implementation of the program; this corresponds to what Heckman and Vytlacil (2005) call policy invariance. This policy invariance comes about because of the linear production technology, which implies that wages are determined by the technological parameters, and not the supply of labor for each level of human capital.
With this invariance, we may suppress the superscript for the potential outcomes; Eq. (7) will become
16 Note that the key causal parameter will in general be different from , where is the indicator for having participated in the smaller scale randomized experiment (bearing the risk of not being selected for treatment). That is, the concern is that those who participate in the experimental study may not be representative of the population that would eventually participate in a full-scale implementation.
How could they be linked? Consider the additional assumptions
Together, S5 and S6 imply that we could characterize the selection into the program in the experimental regime as
where is the probability of being randomized into receiving the treatment (conditional on participating in the experimental study). Note that this presumes that in the experimental regime, all individuals in the population have the option of signing up for the experimental evaluation.
Finally, assume a functional form for the distribution of training effects in the population:
Applying assumption S7 yields the following expressions
where and are the standard normal pdf and cdf, respectively.
The probability of assignment to treatment in the experimental regime characterizes the scale of the program. The smaller this probability is, the smaller the expected gain to participating in the experimental study, and hence the average effect for the study participants will be more positively selected. On the other hand, as this probability approaches 1, the experimental estimate approaches the policy parameter of interest because the experiment becomes the program of interest. Although we are considering the problem of predicting a “scaling up” of the program, this is an interesting case to consider because it implies that for an already existing program, one can potentially conduct a randomized evaluation, where a small fraction of individuals are denied the program (so that the probability of assignment to treatment is close to 1), and the resulting experimentally identified effect can be directly used to predict the aggregate impact of completely shutting down the program. 17
The left-hand side of the first equation is the “parameter of interest” (i.e. what we want to know) in an ex ante evaluation problem. The left-hand side of the second equation is “what can be identified” (i.e. what we do know) from the experimental data in the ex post evaluation problem. The latter may not be “economically interesting” per se, but at the same time it is far from being unrelated to the former.
Indeed, the average treatment effect identified from the randomized experiment is the starting point or “leading term”, when we combine the above two expressions to yield
with the only unknown parameters in this expression being the predicted take-up in a full-scale implementation and the degree of heterogeneity of the potential training effects in the entire population. It is intuitive that any ex ante evaluation of the full-scale implementation that has not yet occurred will, at a minimum, need these two quantities.
In presenting this example, we do not mean to assert that the economic assumptions S1 through S7 are particularly realistic. Nor do we assert they are minimally sufficient to lead to an extrapolative expression. There are as many different ways to model the economy as there are economists (and probably more!). Instead, we are simply illustrating that an ex ante evaluation attempt can directly use the results of an ex post evaluation, and in this way the description of the data generating process in an ex post evaluation (D1 or D2) can be quite complementary to the structural economic assumptions (S1 through S7). D1 is the key assumption that helps you identify whether there is credible evidence (arguably the most credible that is possible) of a causal phenomenon, while S1 through S7 provide a precise framework to think about making educated guesses about the effects of a program that has yet to be implemented. Although the experimentally identified effect may not be of direct interest, obtaining credible estimates of this quantity would seem helpful for making a prediction about the effect of the full-scale implementation.
We now consider another data generating process that we know often occurs in reality—when there is randomization in the “intent to treat”, but where participation in the program is potentially non-random and driven by self-selection. To return to our hypothetical job search assistance program, instead of mandating the treatment (personal visit/phone call from a career counselor), one could make participation in receiving such a call voluntary. Furthermore, one could take UI claimants and randomize them into two groups: one group receives information about the existence of this program, and the other does not receive the information. One can easily imagine that those who voluntarily sign up to be contacted by the job counselor might be systematically different from those who do not, and in ways related to the outcome. One can also imagine being interested in knowing the “overall effect” of “providing information about the program”, but more often it is the case that we are interested in participation in the program per se (the treatment of “being contacted by the job counselor”).
We discuss this widely known data generating process within the very general framework described by Eqs (1)-(3). We will introduce a more accommodating monotonicity condition than that employed in Imbens and Angrist (1994) and Angrist et al. (1996). When we do so, the familiar “Wald” estimand will give an interpretation of an average treatment effect that is not quite as “local” as implied by the “local average treatment effect” (LATE), which is described as “the average treatment effect for the [subpopulation of] individuals whose treatment status is influenced by changing an exogenous regressor that satisfies an exclusion restriction” (Imbens and Angrist, 1994).
We begin with describing random assignment of the “intent to treat” as
This is analogous to D1 (and D2), except that instead of randomizing the treatment, we are randomizing the instrumental variable. Like D1 and D2, it is appropriate to consider this a description of the process when we know that has been randomized.
Since we have introduced a new variable , we must specify how it relates to the other variables:
Although this is a re-statement of Eqs (1) and (3), given the existence of , this is a substantive and crucial assumption. It is the standard excludability condition: cannot have an impact on , either directly or indirectly through influencing the other factors . It is typically not a literal descriptive statement in the way that D1 through D3 can sometimes be. It is a structural (“S”) assumption on the same level as S1 through S7 and it may or may not be plausible depending on the context.
S9 is a generalization of the monotonicity condition used in Imbens and Angrist (1994) and Angrist et al. (1996). In those papers, or take on the values or 0; that is, for a given individual type , their treatment status is deterministic for a given value of the instrument . This would imply that would have a distribution with three points of support: 0 (the latent propensity for “never-takers”), (the latent propensity for “compliers”), and 1 (the latent propensity for “always-takers”). 18
In the slightly more general framework presented here, for each type , for a given value of the instrument , treatment status is allowed to be probabilistic: some fraction (potentially strictly between 0 and 1) of them will be treated. can thus take on a continuum of values between 0 and 1. The probabilistic nature of the treatment assignment can be interpreted in at least two ways: (1) for a particular individual of type , there are random shocks beyond the individual’s control that introduce some uncertainty into the treatment receipt (e.g. there was a missed newspaper delivery, so the individual did not see an advertisement for the job counseling program), or (2) even for the same individual type (and hence with the same potential outcomes), there is heterogeneity in individuals in the factors that determine participation even conditional on (e.g. heterogeneity in costs of participation).
S9 allows some violations of “deterministic” monotonicity at the individual level (the simultaneous presence of “compliers” and “defiers”), but requires that —conditional on the individual type —the probability of treatment rises when moves from to . In other words, S9 requires that—conditional on —on average the “compliers” outnumber the “defiers”. To use the notation in the literature, where and are the possible treatments when or , respectively, the monotonicity condition discussed in the literature is . By contrast, S9 requires . Integrating over , S9 thus implies that , but the converse is not true. Furthermore, while implies S9, the converse is not true.
Averaging over the distribution of conditional on yields
Taking the difference between the and individuals, this yields the reduced-form
where D3 allows us to combine the two integrals. Note also that without S8, we would be unable to factor out the term .
It is useful here to contrast the DGP given by D3 and S8 with the randomized experiment with perfect compliance, in how it confronts the problem posed by Eq. (5). With perfect compliance, the randomization made it so that was the same constant for both treated and control individuals, so the two terms in Eq. (5) could be combined. With non-random selection into treatment, we must admit the possibility of variability in . But instead of making the contrast between and , it is made between versus individuals, who, by D3, have the same distribution of types (). Thus, the randomized instrument allows us to compare two groups with the same distribution of latent propensities : .
Dividing the preceding equation by a normalizing factor, it follows that the Wald Estimand will identify
Therefore, there is an alternative to the interpretation of the Wald estimand as the LATE. 19 It can be viewed as the weighted average treatment effect for the entire population where the weights are proportional to the increase in the probability of treatment caused by the instrument, . 20 This weighted average interpretation requires the weights to be non-negative, which will be true if and only if the probabilistic monotonicity condition S9 holds. Note the connection with the conventional LATE interpretation: when treatment is a deterministic function of , then the monotonicity means only the compliers (i.e. ) collectively receive 100 percent of the weight, while all other units receive 0 weight.
The general framework given by Eqs (1)-(3) and the weaker monotonicity condition S9 thus lead to a less “local” interpretation than LATE. For example, Angrist and Evans (1998) use a binary variable that indicates whether the first two children were of the same gender (Same Sex) as an instrument for whether the family ultimately has more than 2 children (More than 2). They find a first-stage coefficient of around 0.06. The conventional monotonicity assumption, which presumes that D is a deterministic function of Z, leads to the interpretation that 6 percent of families are “compliers”: those that are induced to have a third child because their first two children were of the same gender. This naturally leads to the conclusion that the average effect “only applies” to 6 percent of the population.
In light of Eq. (9), however, an alternative interpretation is that the Wald estimand yields a weighted average over 100 percent of the population, with individual weights proportional to the individual-specific impact of Z (Same Sex) on D (More than 2). In fact, if Z (Same Sex) had the same 0.06 impact on the probability of having more than 2 children for all families, the Wald estimand would yield the ATE. Nothing in this scenario prevents a substantial amount of variation in , , (and hence ), as well as non-random selection into treatment (e.g. correlation between and ). 21 With our hypothetical instrument of “providing information about the job counseling program”, a first-stage effect on participation of 0.02 can be interpreted as a 0.02 effect in probability for all individuals.
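The weighted-average interpretation can be checked numerically. The following is a minimal sketch, not taken from the chapter: it assumes a hypothetical population of four types with made-up shares, treatment probabilities under Z = 0 and Z = 1 (`p0`, `p1`), untreated outcomes, and type-specific effects. It computes the Wald estimand exactly from group means and compares it with the average of the effects weighted by each type's change in treatment probability.

```python
# Hypothetical population of four types; every number below is made up.
# share:  population share of the type
# p0, p1: probability of treatment when Z = 0 / Z = 1 (S9 requires p1 >= p0)
# y0:     outcome without treatment; effect: type-specific treatment effect
types = [
    {"share": 0.4, "p0": 0.10, "p1": 0.12, "y0": 10.0, "effect": 1.0},
    {"share": 0.3, "p0": 0.50, "p1": 0.60, "y0": 12.0, "effect": 3.0},
    {"share": 0.2, "p0": 0.80, "p1": 0.80, "y0": 15.0, "effect": 5.0},
    {"share": 0.1, "p0": 0.00, "p1": 0.30, "y0": 8.0, "effect": -2.0},
]


def mean_outcome(z):
    """E[Y | Z = z]. Randomization of Z keeps the type shares fixed across z,
    and exclusion means z matters only through the treatment probability."""
    key = "p1" if z == 1 else "p0"
    return sum(t["share"] * (t["y0"] + t[key] * t["effect"]) for t in types)


def mean_treatment(z):
    """E[D | Z = z], the first-stage mean."""
    key = "p1" if z == 1 else "p0"
    return sum(t["share"] * t[key] for t in types)


def wald_estimand():
    """Reduced-form difference divided by first-stage difference."""
    return (mean_outcome(1) - mean_outcome(0)) / (mean_treatment(1) - mean_treatment(0))


def weighted_average_effect():
    """Average effect with weights proportional to share * (p1 - p0)."""
    weights = [t["share"] * (t["p1"] - t["p0"]) for t in types]
    return sum(w * t["effect"] for w, t in zip(weights, types)) / sum(weights)


def average_treatment_effect():
    """Unweighted population average of the type-specific effects."""
    return sum(t["share"] * t["effect"] for t in types)


if __name__ == "__main__":
    print(wald_estimand(), weighted_average_effect(), average_treatment_effect())
```

In this made-up population the Wald estimand coincides with the weighted average but not with the ATE, because the types whose participation responds most to the instrument have atypical effects. If every type had the same `p1 - p0`, the two would agree.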
In summary, the data generating process given by D3, S8, and S9 (compared to one where there is deterministic monotonicity) is a broader characterization of the models for which the Wald estimand identifies an average effect. Accordingly, the Wald estimand can have a broader interpretation as a weighted average treatment effect. The framework used to yield the LATE interpretation restricts the heterogeneity in the latent propensity to have only three points of support, one each for never-takers, compliers, and always-takers. Thus, the LATE interpretation, which admits that effects can only be identified for the compliers, is the one that most exaggerates the “local” nature or “unrepresentativeness” of the Wald-identified average effect.
Finally, it is natural to wonder why there is so much of a focus on the Wald estimand. In a purely ex post evaluation analysis, the reason is not that IV is a “favorite” or “common” estimand. 22 Rather, in an ex post evaluation, we may have limited options, based on the realities of how the program was conducted, and what data are available. So, for example, as analysts we may be confronted with an instrument, “provision of information about the job counseling program” (), which was indeed randomized as described by D3, and on purely theoretical grounds, we are comfortable with the additional structural assumptions S8 and S9. But suppose we are limited by the observable data , and know nothing else about the structure given in Eqs (1)-(3), and therefore wish our inferences to be invariant to any possible behavioral model consistent with those equations. If we want to identify some kind of average , then what alternative do we have but the Wald estimand? It is not clear there is one.
The definition of the weights of “interest” is precisely the first step of an ex ante evaluation of a program. We argue that the results of an analysis that yields us an average effect, as in (9), may well not be the direct “parameter of interest”, but could be used as an ingredient to predict such a parameter in an ex ante evaluation analysis. We illustrate this notion with a simple example below.
In terms of our three criteria to assess internal validity, how does this research design fare—particularly in comparison to the randomized experiment with perfect compliance? First, only part of the data generating process given by D3, S8, and S9 is a literal description of the assignment process: if is truly randomly assigned, then D3 is not so much an assumption, but a description. On the other hand, S8 and S9 will typically be conjectures about behavior rather than being an implication of our institutional knowledge. 23 This is an example where there are “gaps” in our understanding of the assignment process, and structural assumptions work together with experimental variation to achieve identification.
As for our second criterion, with the addition of the structure imposed by S8 and S9, it is clear that the class of all behavioral models for which the causal inference in Eq. (9) is valid is smaller. It is helpful to consider, for our hypothetical instrument, the kinds of economic models that would or would not be consistent with S8 and S9. If individuals are choosing the best job search activity amongst all known available feasible options, then the instrument of “providing information about the existence of a career counseling program” could be viewed as adding one more known alternative. A standard revealed preference argument would dictate that if an individual already chose to participate under Z = 0, then it would still be optimal to participate if Z = 1: this would satisfy S9. Furthermore, it is arguably true that most attempts at modeling this process would not specify a direct impact of this added information on human capital; this would be an argument for S8. On the other hand, what if the information received about the program carried a signal of some other factor? It could indicate to the individual that the state agency is monitoring their job search behavior more closely. This might induce the individual to search more intensively, independently of participating in the career counseling program; this would be a violation of the exclusion restriction S8. Or perhaps the information provided sends a positive signal about the state of the job market and induces the individual to pursue other job search activities instead of the program; this might lead to a violation of S9.
For our third criterion, we can see that some aspects of D3, S8, and S9 are potentially testable. Suppose the elements of can be categorized into the vector (the variables determined prior to ) and (after ). And suppose we can observe a subset of elements from each of these vectors as variables and , respectively. Then the randomization in D3 has the direct implication that the distributions of for and should be identical:
Furthermore, since the exclusion restriction S8 dictates that all factors that determine are not influenced by , then D3 and S8 jointly imply that the distribution of are identical for the two groups:
The practical limitation here is that this test pre-supposes the researcher’s really do reflect elements of that influence . If are not a subset of , then even if there is imbalance in , S8 could still hold. Contrast this with the implication of D3 (and D1 and D2) that any variable determined prior to the random assignment should have a distribution that is identical between the two randomly assigned groups. Also, there seems no obvious way to test the proposition that does not directly impact , which is another condition required by S8.
Finally, if S9 holds, it must also be true that
That is, if probabilistic monotonicity holds for all , then it must also hold for groups of individuals, defined by the value of (which is a function of ). This inequality also holds for any variables determined prior to .
In summary, we conclude (unsurprisingly) that for programs where there is random assignment in the “encouragement” of individuals to participate, causal inferences will be of strictly lower internal validity, relative to the perfect compliance case. Nevertheless, the design does seem to satisfy, even if to a lesser degree, the three criteria that we are considering. Our knowledge of the assignment process does dictate an important aspect of the statistical model (and other aspects need to be modeled with structural assumptions), the causal inferences using the Wald estimand appear valid within a reasonably broad class of models (even if it is not as broad as that for the perfect compliance case), and there are certain aspects of the design that generate testable implications.
Perhaps the most common criticism leveled at the LATE parameter is that it may not be the “parameter of interest”. 24 By the same token, one may have little reason to be satisfied with the particular weights in the average effect expressed in (9). Returning to our hypothetical example in which the instrument “provide information on career counseling program” is randomized, the researcher may not be interested in an average effect that over-samples those who are more influenced by the instrument. For example, a researcher might be interested in predicting the average impact on individuals of a mandatory job counseling program, like the hypothetical example in the case of the randomized experiment with perfect compliance. That is, it may be of interest to predict what would happen if people were required to participate (i.e. every UI claimant will receive a call/visit from a job counselor). Moreover, it has been suggested that LATE is an “instrument-specific” parameter and a singular focus on LATE risks conflating “definition of parameters with issues of identification” (Heckman and Vytlacil, 2005): different instruments can be expected to yield different “LATEs”.
Our view is that these are valid criticisms from the perspective of an ex ante evaluation standpoint, where causal “parameters of interest” are defined by a theoretical framework describing the policy problem. But from an ex post evaluation perspective, within which internal validity is the primary goal, these issues are, by definition, unimportant. When the goal is to describe whatever one can about the causal effects of a program that was actually implemented (e.g. randomization of “information about the job counseling program”, ), a rigorous analysis will lead to precise statements about the causal phenomena that are possible to credibly identify. Sometimes what one can credibly identify may correspond to a desired parameter from a well-defined ex ante evaluation; sometimes it will not.
Although there has been considerable emphasis in the applied literature on the fact that LATE may differ from ATE, as well as discussion in the theoretical literature about the “merits” of LATE as a parameter, far less effort has been spent in actually using estimates of LATE to learn about ATE. Even though standard “textbook” selection models can lead to simple ways to extrapolate from LATE to ATE (see Heckman and Vytlacil (2001a) and Heckman et al. (2001, 2003)), our survey of the applied literature revealed very few other attempts to actually produce these extrapolations.
Although we have argued that the ATE from a randomized experiment may not directly correspond to the parameter of interest, for the following derivations, let us stipulate that ATE is useful, either for extrapolation (as illustrated in Section 3.1.4), or as an “instrument-invariant” benchmark that can be compared to other ATEs extrapolated from other instruments or alternative identification strategies.
Consider re-writing the structure given by Eqs (1)-(3), and D3, S8, and S9 as
where are constants, is a binary instrument, and characterize both the individual’s type and all other factors that determine and selection, and is independent of by D3 and S8. and are constants in the selection equation: S9 is satisfied. , , can be normalized to be mean zero error terms. The ATE is by construction equal to .
Let us adopt the following functional form assumption:
This is simply the standard dummy endogenous variable system (as in Heckman (1976, 1978) and Maddala (1983)), with the special case of a dummy variable instrument, and is a case that is considered in recent work by Angrist (2004) and Oreopoulos (2006). With this one functional form assumption, we obtain a relationship between LATE and ATE.
where φ and Φ are the pdf and cdf of the standard normal, respectively. 25 This is a standard result that directly follows from the early work on selection models (Heckman (1976, 1978); see also Heckman et al. (2001, 2003)). This framework has been used to discuss the relationship between ATE and LATE (see Heckman et al. (2001), Angrist (2004), and Oreopoulos (2006)). With a few exceptions (such as Heckman et al. (2001)), applied researchers typically do not make use of the fact that even with information on just the three variables Y, D, and Z, the “selection correction” term in Eq. (10) can be computed. In particular
implies that
and analogously that
Having identified and , we have the expression
This expression is quite similar to that given in Section 3.1.4. Once again, the result of an ex post evaluation can be viewed as the leading term in the extrapolative goal of an ex ante evaluation: to obtain the effects of a program that was not implemented (i.e. random assignment with perfect compliance) or the ATE. This expression also shows how the goals of the ex post and ex ante evaluation problems can be complementary. Ex post evaluations aim to get the best estimate of the first term in the above equation, whereas ex ante evaluations are concerned with the assumptions necessary to extrapolate from LATE to ATE as in the above expression.
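The extrapolation from LATE to ATE under joint normality can be illustrated with a short calculation. This is a sketch under stated assumptions, not the chapter's own derivation: the display equations are not reproduced here, so the parameterization below is assumed. Selection is D = 1[gamma0 + gamma1·Z + V ≥ 0] with V standard normal, and `c1`, `c0` stand for the covariances between V and the errors in the treated and untreated outcomes. The function names and all numbers are hypothetical. The code first builds the population moments implied by chosen parameters, then recovers the ATE using only the treatment rates and the four group means of Y by (D, Z).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides pdf, cdf, inv_cdf


def implied_moments(alpha1, alpha0, gamma0, gamma1, c1, c0):
    """Population moments implied by the assumed normal selection model.

    Returns, for z in {0, 1}: Pr[D=1|Z=z], E[Y|D=1,Z=z], E[Y|D=0,Z=z].
    """
    moments = {}
    for z in (0, 1):
        index = gamma0 + gamma1 * z
        p = STD_NORMAL.cdf(index)
        pdf = STD_NORMAL.pdf(index)
        moments[z] = {
            "p": p,
            # E[V | V >= -index] = pdf / p
            "y_treated": alpha1 + c1 * pdf / p,
            # E[V | V < -index] = -pdf / (1 - p)
            "y_untreated": alpha0 - c0 * pdf / (1.0 - p),
        }
    return moments


def extrapolate_ate(m):
    """Return (LATE, implied ATE) from observable moments only."""
    p0, p1 = m[0]["p"], m[1]["p"]
    # The selection index is identified from the treatment rates.
    pdf0 = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(p0))
    pdf1 = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(p1))

    # Wald estimand: difference in mean outcomes over difference in treatment rates.
    ybar = {z: m[z]["p"] * m[z]["y_treated"] + (1.0 - m[z]["p"]) * m[z]["y_untreated"]
            for z in (0, 1)}
    late = (ybar[1] - ybar[0]) / (p1 - p0)

    # Covariances are identified from how group means shift with the instrument.
    c1 = (m[1]["y_treated"] - m[0]["y_treated"]) / (pdf1 / p1 - pdf0 / p0)
    c0 = (m[1]["y_untreated"] - m[0]["y_untreated"]) / (
        -pdf1 / (1.0 - p1) + pdf0 / (1.0 - p0))

    correction = (c1 - c0) * (pdf1 - pdf0) / (p1 - p0)
    return late, late - correction


if __name__ == "__main__":
    m = implied_moments(alpha1=2.0, alpha0=1.0, gamma0=-0.5, gamma1=0.8, c1=0.6, c0=-0.2)
    print(extrapolate_ate(m))
```

With sample data, the group means and treatment rates would be replaced by their sample analogues. The recovered ATE is only as credible as the normality assumption itself.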
If S10 were adopted, how much might estimates of LATE differ from those of ATE in practice? To investigate this, we obtained data from a select group of empirical studies and computed both the estimates of LATE and the “selection correction” term in (13), using (11) and (12). 26
The results are summarized in Table 1. For each of the studies, we present the simple difference in the means , the Wald estimate of LATE, the implied selection error correlations, , , a selection correction term, and the implied ATE. For comparison, we also give the average value of Y. Standard deviations are in brackets. The next to last row of the table gives the value of the second term in (13), and its significance (calculated using the delta method).
The quantity is the implied covariance between the gains and the selection error . A “selection on gains” phenomenon would imply . With the study of Abadie et al. (2002) in the first column, we see a substantial negative selection term, where the resulting LATE is significantly less than either the simple difference, or the implied ATE. If indeed the normality assumption is correct, this would imply that for the purposes of obtaining ATE, which might be the target of an ex ante evaluation, simply using LATE would be misleading, and ultimately even worse than using the simple difference in means.
On the other hand, in the analysis of Angrist and Evans (1998), the estimated LATE is actually quite similar to the implied ATE. So while a skeptic might consider it “uninteresting” to know the impact of having more than 2 children for the “compliers” (those whose family size was impacted by the gender mix of the first two children), it turns out in this context—if one accepts the functional form assumption—LATE and ATE do not differ very much. 27 The other studies are examples of intermediate cases: LATE may not be equal to ATE, but it is closer to ATE than the simple difference .
Our point here is neither to recommend nor to discourage the use of this normal selection model for extrapolation. Rather, it is to illustrate and emphasize that even if LATE is identifying an average effect for a conceptually different population from the ATE, this does not necessarily mean that the two quantities in an actual application are very different. Our other point is that any inference that uses LATE to make any statement about the causal phenomena outside the context from which a LATE is generated, must necessarily rely on a structural assumption, whether implicitly or explicitly. In this discussion of extrapolating from LATE to ATE, we are being explicit that we are able to do this through a bivariate normal assumption. While such an assumption may seem unpalatable to some, it is clear that to insist on making no extrapolative assumptions is to abandon the ex ante evaluation goal entirely.
This section provides an extended discussion of identification and estimation of the regression discontinuity (RD) design. RD designs were first introduced by Thistlethwaite and Campbell (1960) as a way of estimating treatment effects in a non-experimental setting, where treatment is determined by whether an observed “assignment” variable (also referred to in the literature as the “forcing” variable or the “running” variable) exceeds a known cutoff point. In their initial application of RD designs, Thistlethwaite and Campbell (1960) analyzed the impact of merit awards on future academic outcomes, using the fact that the allocation of these awards was based on an observed test score. The main idea behind the research design was that individuals with scores just below the cutoff (who did not receive the award) were good comparisons to those just above the cutoff (who did receive the award). Although this evaluation strategy has been around for almost fifty years, it did not attract much attention in economics until relatively recently.
Since the late 1990s, a growing number of studies have relied on RD designs to estimate program effects in a wide variety of economic contexts. Like Thistlethwaite and Campbell (1960), early studies by Van der Klaauw (2002) and Angrist and Lavy (1999) exploited threshold rules often used by educational institutions to estimate the effect of financial aid and class size, respectively, on educational outcomes. Black (1999) exploited the presence of discontinuities at the geographical level (school district boundaries) to estimate the willingness to pay for good schools. Following these early papers in the area of education, the past five years have seen a rapidly growing literature using RD designs to examine a range of questions. Examples include: the labor supply effect of welfare, unemployment insurance, and disability programs; the effects of Medicaid on health outcomes; the effect of remedial education programs on educational achievement; the empirical relevance of median voter models; and the effects of unionization on wages and employment.
An important impetus behind this recent flurry of research is a recognition, formalized by Hahn et al. (2001), that RD designs require seemingly mild assumptions compared to those needed for other non-experimental approaches. Another reason for the recent wave of research is the belief that the RD design is not “just another” evaluation strategy, and that causal inferences from RD designs are potentially more credible than those from typical “natural experiment” strategies (e.g. difference-in-difference or instrumental variables), which have been heavily employed in applied research in recent decades. This notion has a theoretical justification: Lee (2008) formally shows that one need not assume the RD design isolates treatment variation that is “as good as randomized”; instead, such randomized variation is a consequence of agents’ inability to precisely control the assignment variable near the known cutoff.
So while the RD approach was initially thought to be “just another” program evaluation method with relatively little general applicability outside of a few specific problems, recent work in economics has shown quite the opposite. 28 In addition to providing a highly credible and transparent way of estimating program effects, RD designs can be used in a wide variety of contexts covering a large number of important economic questions. These two facts likely explain why the RD approach is rapidly becoming a major element in the toolkit of empirical economists.
Before presenting a more formal discussion of various identification and estimation issues, we first briefly highlight what we believe to be the most important points that have emerged from the recent theoretical and empirical literature on the RD design. 29 In this chapter, we will use X to denote the assignment variable, and treatment will be assigned to individuals when X exceeds a known threshold c, which we later normalize to 0 in our discussion.
When there is a payoff or benefit to receiving a treatment, it is natural for an economist to consider how an individual may behave to obtain such benefits. For example, if students could effectively “choose” their test score through effort, those who chose a score of c (and hence received the merit award) could be somewhat different from those who chose scores just below c. The important lesson here is that the existence of a treatment being a discontinuous function of an assignment variable is not sufficient to justify the validity of an RD design. Indeed, if anything, discontinuous rules may generate incentives, causing behavior that would invalidate the RD approach.
This is a crucial feature of the RD design, and a reason that RD designs are often so compelling. Intuitively, when individuals have imprecise control over the assignment variable, even if some are especially likely to have values of X near the cutoff, every individual will have approximately the same probability of having an X that is just above (receiving the treatment) or just below (being denied the treatment) the cutoff, similar to a coin-flip experiment. This result clearly differentiates the RD and IV (with a non-randomized instrument) approaches. When using IV for causal inference, one must assume the instrument is exogenously generated as if by a coin-flip. Such an assumption is often difficult to justify (except when an actual lottery was run, as in Hearst et al. (1986) or Angrist (1990), or if there were some biological process, e.g. gender determination of a baby, mimicking a coin-flip). By contrast, the variation that RD designs isolate is randomized as a consequence of the assumption that individuals have imprecise control over the assignment variable (Lee, 2008).
This is the key implication of the local randomization result. If variation in the treatment near the threshold is approximately randomized, then it follows that all “baseline characteristics” –all those variables determined prior to the realization of the assignment variable—should have the same distribution just above and just below the cutoff. If there is a discontinuity in these baseline covariates, then at a minimum, the underlying identifying assumption of individuals’ inability to precisely manipulate the assignment variable is unwarranted. Thus, the baseline covariates are used to test the validity of the RD design. By contrast, when employing an IV or a matching/regression-control strategy in non-experimental situations, assumptions typically need to be made about the relationship of these other covariates to the treatment and outcome variables. 30
It is tempting to conclude that the RD delivers treatment effects that “only apply” for the sub-population of individuals whose X is arbitrarily close to the threshold c. Such an interpretation would imply that the RD identifies treatment effects for “virtually no one”. Fortunately, as we shall see below, there is an alternative interpretation: the average effect identified by a valid RD is that of a weighted average treatment effect where the weights are the relative ex ante probability that the value of an individual’s assignment variable will be in the neighborhood of the threshold (Lee, 2008).
Randomized experiments from non-random selection
As argued in Lee and Lemieux (2009), while there are some mechanical similarities between the RD design and a “matching on observables” approach or between the RD design and an instrumental variables approach, the RD design can instead be viewed as a close “cousin” of the randomized experiment, in the sense that what motivates the design and what “dictates” the modeling is specific institutional knowledge of the treatment assignment process. We illustrate this by once again using the common framework given by Eqs (1)-(3). We begin with the case of the “sharp” RD design, whereby the treatment status is a deterministic “step-function” of an observed assignment variable X. That is, D = 1 if and only if X crosses the discontinuity threshold (normalized to 0 in our discussion).
Returning to our hypothetical job search assistance program, suppose that the state agency needed to ration the number of participants in the program, and therefore mandated treatment (personal visit and/or phone call from job counselor) for those whom the agency believed the program would benefit most. In particular, suppose the agency used information on individuals’ past earnings and employment to generate a score X that indicated the likely benefit of the program to the individual, and determined treatment status based on whether X exceeded the threshold 0.
Such a discontinuous rule can be described as
Both D4 and D5 come from institutional knowledge of how treatment is assigned, and thus are more descriptions (“D”) than assumptions.
This assumption ensures that there are some individuals at the threshold.
Since we have introduced a variable X that is realized before D, we must specify its relation to Y, so that Eq. (3) becomes
This assumption states that for any individual type U, as X crosses the discontinuity threshold 0, any change in Y must be attributable to D and D only. As we shall see, this assumption, while necessary, is not sufficient for the RD to deliver valid causal inferences.
The most important assumption for identification is
This condition, which we will discuss in greater detail below, says that individuals, no matter how much they can influence the distribution of X with their actions, cannot precisely control X, even if they may make decisions in anticipation of this uncertainty.
There are at least two alternative interpretations of this condition. One is that individuals may actually precisely control X, but they are responding to different external factors, which generates a distribution of different possible values of X that could occur depending on these outside forces. S13 says that the distribution of X, as driven by those outside forces, must have a continuous density. The other interpretation is that even conditional on U, there exists heterogeneity in the factors that determine X. In this case, S13 is a statement about the distribution of this heterogeneity in X having continuous density conditional on the type U.
As Lee (2008) shows, it is precisely S13 that will generate a local randomization result. In particular, S13 implies
which says that the distribution of the unobserved “types” will be approximately equal on either side of the discontinuity threshold in a neighborhood of 0. This is the sense in which it accomplishes local randomization, akin to the randomization in an experiment. The difference is in how the problem expressed in Eq. (5) is being confronted. The experimenter in the randomized experiment ensures that treated and non-treated individuals have the same distribution of latent propensities by dictating that all individuals have the same fixed . In the non-experimental context here, we have no control over the distribution of , but if S13 holds, then the (non-degenerate) distribution of (which is a function of ) will approximately be equal between treated and non-treated individuals—for those with realized in a small neighborhood of 0.
Now consider the expectation of at the discontinuity threshold
where the third line follows from Bayes’ Rule. Similarly, we have
where the last line follows from the continuity assumption S12 and S13.
The RD estimand—the difference between the above two quantities—is thus
That is, the discontinuity in the conditional expectation function identifies a weighted average of causal impacts , where the weights are proportional to type ’s relative likelihood that the assignment variable is in the neighborhood of the discontinuity threshold 0.
Equation (14) provides a quite different alternative to the interpretation of the estimand as “the treatment effect for those whose X are close to zero”, which connotes a very limited inference, because in the limit, there are no individuals at X = 0. The variability in weights in (14) depends very much on the typical scale of relative to the location of across the types. That is, if for each type there is negligible variability in X, then the RD estimand will indeed identify a treatment effect for those individuals who can be most expected to have X close to the threshold. On the other extreme, if there is large variability, with having flat tails, the weights will tend to be more uniform. 31
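The role of the weights can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions not in the text: three hypothetical types, each with a normally distributed assignment variable (type-specific mean, common standard deviation), an outcome that is linear and continuous in X, and made-up type-specific effects. It evaluates the conditional mean of Y just above and just below the threshold via Bayes' rule and compares the jump with the density-weighted average of effects.

```python
from statistics import NormalDist

# Hypothetical types; X | type ~ Normal(mu, sd). All numbers are made up.
types = [
    {"share": 0.5, "mu": -1.0, "sd": 0.5, "base": 10.0, "effect": 1.0},
    {"share": 0.3, "mu": 0.2, "sd": 0.5, "base": 12.0, "effect": 3.0},
    {"share": 0.2, "mu": 1.5, "sd": 0.5, "base": 15.0, "effect": 5.0},
]


def density_at(t, x, scale=1.0):
    """Density of X at x for type t; `scale` inflates the within-type spread."""
    return NormalDist(t["mu"], t["sd"] * scale).pdf(x)


def cond_mean_outcome(x, treated, scale=1.0):
    """E[Y | X = x], mixing over types with Bayes' rule weights share * f(x | type)."""
    num = den = 0.0
    for t in types:
        w = t["share"] * density_at(t, x, scale)
        # The outcome is continuous in x; only treatment status jumps at 0.
        y = t["base"] + 0.5 * x + (t["effect"] if treated else 0.0)
        num += w * y
        den += w
    return num / den


def rd_estimand(scale=1.0, eps=1e-9):
    """Jump in E[Y | X = x] at the threshold 0 (sharp design: D = 1 when X >= 0)."""
    return cond_mean_outcome(eps, True, scale) - cond_mean_outcome(-eps, False, scale)


def weighted_average_effect(scale=1.0):
    """Average effect with weights proportional to share * f(0 | type)."""
    w = [t["share"] * density_at(t, 0.0, scale) for t in types]
    return sum(wi * t["effect"] for wi, t in zip(w, types)) / sum(w)


if __name__ == "__main__":
    print(rd_estimand(), weighted_average_effect())
    print(weighted_average_effect(scale=100.0))  # weights become nearly uniform
```

With the small within-type spread, the type centered near the threshold dominates the weights. When the spread is made very large relative to the differences in location, the weights approach the population shares and the weighted average approaches the population average effect (2.4 in this made-up example).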
One of the reasons why RD design can be viewed as a “cousin” of the randomized experiment is that the latter is really a special case of the sharp RD design. When randomly assigning treatment, one can imagine accomplishing this through a continuously distributed random variable that has the same distribution for every individual. In that case, would not enter the function determining , and hence S12 would be unnecessary. It would follow that , and consequently every individual would receive equal weight in the average effect expression in Eq. (14).
The final point to notice about the average effect that is identified by the RD design is that it is a weighted average of , which may be different from a weighted average of for some other , even if the weights do not change. Of course, there is no difference between the two quantities in situations where is thought to have no impact on the outcome (or the individual treatment effect). For example, if a test score is used for one and only one purpose—to award a scholarship —then it might be reasonable to assume that has no other impact on future educational outcomes . As discussed in Lee (2008) and Lee and Lemieux (2009), in other situations, the concept of for values of other than may not make much practical sense. For example, in the context of estimating the electoral advantage to incumbency in US House elections (Lee, 2008), it is difficult to conceive of the counterfactual when is away from the threshold: what does it mean for the outcome that would have been obtained if the candidate who became the incumbent with 90 percent of the vote had not become the incumbent, having won 90 percent of the vote? Here, incumbent status is defined by .
We now assess this design on the basis of the three criteria discussed in Section 2.2.1. First, some of the conditions for identification are indeed literally descriptions of the assignment process: D4 and D5. Others, like S11 through S13, are conjectures about the assignment process, and the underlying determinants of . S11 requires that there is positive density of at the threshold. S12 allows —which can be viewed as capturing “all other factors” that determine —to have its own structural effect on . It is therefore not as restrictive as a standard exclusion restriction, but S12 does require that the impact of is continuous. S13 is the most important condition for identification, and we discuss it further below.
Our second criterion is the extent to which inferences could be consistent with many competing behavioral models. The question is to what extent S12 and S13 restrict the class of models for which the RD causal inference remains valid. When program status is determined solely on the basis of a score , and is used for nothing else but the determination of , we expect most economic models to predict that the only reason why would have a discontinuous impact on would be because an individual’s status switches from non-treated to treated. So, from the perspective of modeling economic behavior, S12 does not seem to be particularly restrictive.
By contrast, S13 is potentially restrictive, ruling out some plausible economic behavior. Are individuals able to influence the assignment variable, and if so, what is the nature of this control? This is probably the most important question to ask when assessing whether a particular application should be analyzed as an RD design. If individuals have a great deal of control over the assignment variable and if there is a perceived benefit to a treatment, one would certainly expect individuals on one side of the threshold to be systematically different from those on the other side.
Consider the test-taking example from Thistlethwaite and Campbell (1960). Suppose there are two types of students: and . Suppose type students are more able than types, and that types are also keenly aware that passing the relevant threshold (50 percent) will give them a scholarship benefit, while types are completely ignorant of the scholarship and the rule. Now suppose that 50 percent of the questions are trivial to answer correctly, but due to random chance, students will sometimes make careless errors when they initially answer the test questions, but would certainly correct the errors if they checked their work. In this scenario, only type students will make sure to check their answers before turning in the exam, thereby assuring themselves of a passing score. The density of their score is depicted in the truncated density in Fig. 1. Thus, while we would expect those who barely passed the exam to be a mixture of type and type students, those who barely failed would exclusively be type students. In this example, it is clear that the marginal failing students do not represent a valid counterfactual for the marginal passing students. Analyzing this scenario within an RD framework would be inappropriate.
On the other hand, consider the same scenario, except assume that questions on the exam are not trivial; there are no guaranteed passes, no matter how many times the students check their answers before turning in the exam. In this case, it seems more plausible that among those scoring near the threshold, it is a matter of “luck” as to which side of the threshold they land. Type students can exert more effort—because they know a scholarship is at stake—but they do not know the exact score they will obtain. This can be depicted by the untruncated density in Fig. 1. In this scenario, it would be reasonable to argue that those who marginally failed and passed would be otherwise comparable, and that an RD analysis would be appropriate and would yield credible estimates of the impact of the scholarship.
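The contrast between the two scenarios can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the text: the two types are labeled A (aware of the scholarship) and B, scores are normally distributed, the passing score is 50, and the function name `share_type_a_near_cutoff` is hypothetical. It compares the share of type A students just below and just above the cutoff under imprecise and precise control of the score.

```python
import numpy as np

def share_type_a_near_cutoff(score, is_a, cutoff=50.0, width=1.0):
    """Share of type-A students just below and just above the cutoff."""
    below = (score >= cutoff - width) & (score < cutoff)
    above = (score >= cutoff) & (score < cutoff + width)
    return is_a[below].mean(), is_a[above].mean()

rng = np.random.default_rng(0)
n = 200_000
is_a = rng.random(n) < 0.5  # assumption: half of the students are type A

# Imprecise control: type A exerts more effort (higher mean score),
# but luck (the normal noise) still decides who lands on which side.
score_imprecise = np.where(is_a, 55.0, 45.0) + rng.normal(0.0, 10.0, n)

# Precise control: every type-A student who would have failed checks
# the work and ends up at or just above the passing score.
score_precise = score_imprecise.copy()
fixers = is_a & (score_precise < 50.0)
score_precise[fixers] = 50.0 + rng.random(int(fixers.sum()))

imprecise_below, imprecise_above = share_type_a_near_cutoff(score_imprecise, is_a)
precise_below, precise_above = share_type_a_near_cutoff(score_precise, is_a)

print("imprecise control:", imprecise_below, imprecise_above)
print("precise control:  ", precise_below, precise_above)
```

Under imprecise control the composition of types is nearly the same on both sides of the cutoff, and the gap shrinks as `width` shrinks. Under precise control no type A students remain just below the cutoff, so the marginal failers are not a valid comparison group.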
These two examples make it clear that one must have some knowledge about the mechanism generating the assignment variable, beyond knowing that if it crosses the threshold, the treatment is “turned on”. The “folk wisdom” in the literature is to judge whether the RD is appropriate based on whether individuals could manipulate the assignment variable and precisely “sort” around the discontinuity threshold. The key word here should be “precise”, rather than “manipulate”. After all, in both examples above, individuals do exert some control over the test score. And indeed, in virtually every known application of the RD design, it is easy to tell a plausible story that the assignment variable is to some degree influenced by someone. But individuals will not always have precise control over the assignment variable. It should, perhaps, seem obvious that it is necessary to rule out precise sorting to justify the use of an RD design. After all, individual self-selection into treatment or control regimes is exactly why a simple comparison of means is unlikely to yield valid causal inferences. Precise sorting around the threshold is self-selection.
Finally, the data generating process given by D4, D5, S11, S12, and S13 has many testable implications. D4 and D5 are directly verifiable, and S11 can be checked since the marginal density can be observed from the data. S12 appears fundamentally unverifiable, but S13, which generates the local randomization result, is testable in two ways. First, as McCrary (2008) points out, the continuity of each type’s density (S13) implies that the observed marginal density is continuous, leading to a natural test: examining the data for a potential discontinuity in the density of at the threshold; McCrary (2008) proposes an estimator for this test. Second, the local randomization result implies that
because is by definition determined prior to (Eq. (1)). This is analogous to the test of randomization where treated and control observations are tested for balance in observable elements of . 32
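Both testable implications can be sketched in a few lines. The code below is a deliberately crude illustration and is not the estimator proposed by McCrary (2008): it compares raw counts in two equal-width windows on either side of the threshold, and compares the mean of a predetermined covariate across the same two windows. The names `v` (assignment variable, threshold normalized to 0), `w` (predetermined covariate), `density_jump_ratio`, and `covariate_gap` are assumptions introduced here for illustration.

```python
import numpy as np

def density_jump_ratio(v, cutoff=0.0, width=0.1):
    """Ratio of observation counts just above to just below the cutoff."""
    n_below = np.sum((v >= cutoff - width) & (v < cutoff))
    n_above = np.sum((v >= cutoff) & (v < cutoff + width))
    return n_above / n_below

def covariate_gap(v, w, cutoff=0.0, width=0.1):
    """Difference in mean covariate across the cutoff, with a standard error."""
    below = (v >= cutoff - width) & (v < cutoff)
    above = (v >= cutoff) & (v < cutoff + width)
    diff = w[above].mean() - w[below].mean()
    se = np.sqrt(w[above].var(ddof=1) / above.sum()
                 + w[below].var(ddof=1) / below.sum())
    return diff, se

rng = np.random.default_rng(1)
n = 100_000
v = rng.normal(0.0, 1.0, n)            # smooth density, no sorting
w = 0.2 * v + rng.normal(0.0, 1.0, n)  # covariate varies smoothly with v

ratio_clean = density_jump_ratio(v)
diff, se = covariate_gap(v, w)

# Hypothetical precise sorting: half of those just below move just above.
movers = (v >= -0.1) & (v < 0.0) & (rng.random(n) < 0.5)
v_sorted = v.copy()
v_sorted[movers] = -v_sorted[movers]
ratio_sorted = density_jump_ratio(v_sorted)

print("count ratio without sorting:", ratio_clean)
print("count ratio with sorting:   ", ratio_sorted)
print("covariate gap and s.e.:     ", diff, se)
```

A ratio far from one, or a covariate gap that is large relative to its standard error, would cast doubt on S13. In practice one would use a local polynomial method at the boundary rather than raw window counts, since a fixed window confounds a discontinuity with a steep but continuous slope.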
In summary, based on the above three criteria, the RD design does appear to have the potential to deliver highly credible causal inferences, because some important aspects of the model are literal descriptions of the assignment process and because the conditions for identification are consistent with a broad class of behavioral models (e.g. can be endogenous, as it is influenced by actions in , as long as there is “imprecise” manipulation of ). Perhaps most importantly, the key condition that is not derived from our institutional knowledge, and that is the key restriction on economic behavior (S13), has a testable implication that is as strong as that given by the randomized experiment.
Consider again our hypothetical job search assistance program, where individuals were assigned to the treatment group if their score (based on earnings and employment information and the state agency’s model of who would most benefit from the program) exceeded the threshold . From an ex ante evaluation perspective, one could potentially be interested in predicting the policy impact of “shutting down the program” (Heckman and Vytlacil, 2005). As pointed out in Heckman and Vytlacil (2005), the treatment on the treated parameter (TOT) is an important ingredient for such a prediction. But as we made precise in the ex post evaluation discussion, the RD estimand identifies a particular weighted average treatment effect that in general will be different from the TOT. Is there a way to extrapolate from the average effect in (14) to TOT?
We now sketch out one such proposal for doing this, recognizing this is a relatively uncharted area of research (and that TOT, while an ingredient in computing the impact of the policy of “shutting down the program”, may not be of interest for a different proposed policy). Using D4, D5, S11, S12, and S13, we can see that the TOT is
This reveals that the key missing ingredient is the second term of this difference.
where the second line again follows from Bayes’ rule.
Suppose has bounded support, and also assume
With the addition of S14 and S15, this will imply that we can use the Taylor approximation
In principle, once this function can be (approximately) identified, one can average the effects over the treated population using the conditional density .
Once again, we see that the leading term of an extrapolation for an ex ante evaluation is related to the results of an ex post evaluation: it is precisely the counterfactual average in Eq. (14). There are a number of ways of estimating derivatives nonparametrically (Fan and Gijbels, 1996; Pagan and Ullah, 1999), or one could alternatively attempt to approximate the function for with a low-order global polynomial. We recognize that, in practical empirical applications, estimates of higher order derivatives may be quite imprecise.
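The global-polynomial variant of this idea can be sketched as follows. This is an illustration under strong assumptions of our own choosing, not a restatement of S14 and S15: the untreated mean outcome is taken to be a low-order polynomial in the assignment variable on both sides of the threshold, the threshold is normalized to 0, treatment is `v >= 0`, and the function name `tot_by_polynomial_extrapolation` is hypothetical. The untreated polynomial is fitted on the untreated side, extrapolated to the treated side, and averaged over the observed distribution of the assignment variable among the treated.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def tot_by_polynomial_extrapolation(v, y, degree=2, cutoff=0.0):
    """TOT sketch: treated mean minus extrapolated untreated mean."""
    x = v - cutoff
    treated = x >= 0
    # Fit the untreated outcome curve using untreated observations only.
    coefs = P.polyfit(x[~treated], y[~treated], degree)
    # Extrapolate the untreated curve to the treated observations.
    y0_hat = P.polyval(x[treated], coefs)
    return y[treated].mean() - y0_hat.mean()

rng = np.random.default_rng(2)
n = 5_000
v = rng.uniform(-1.0, 1.0, n)
d = v >= 0

# Simulated data (noise-free for clarity): the effect is 2 at the
# threshold and grows with v, so the TOT exceeds the effect at v = 0.
y0 = 1.0 + 0.5 * v + 0.2 * v**2
effect = 2.0 + v
y = y0 + d * effect

tot_hat = tot_by_polynomial_extrapolation(v, y, degree=2)
tot_true = effect[d].mean()
print("extrapolated TOT:", tot_hat, "true TOT in simulation:", tot_true)
```

The estimate is only as good as the extrapolation: if the untreated curve departs from the fitted polynomial away from the threshold, the result inherits that error, which is the sense in which the additional structure goes beyond what the RD design itself delivers.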
Nevertheless, we provide this simple method to illustrate the separate role of conditions for credible causal inference (D4, D5, S11, S12, and S13), and the additional structure needed (S14 and S15) to identify a policy-relevant parameter such as the TOT. What alternative additional structure could be imposed on Eqs (1)-(3) to identify parameters such as the TOT seems to be an open question, and likely to be somewhat context-dependent.
There has been a recent explosion of applied research utilizing the RD design (see Lee and Lemieux (2009)). From this applied literature, a certain “folk wisdom” has emerged about sensible approaches to implementing the RD design in practice. The key challenge with the RD design is how to use a finite sample of data on and to estimate the conditional expectations in the discontinuity . Lee and Lemieux (2009) discuss these common practices and their justification in greater detail. Here we simply highlight some of the key points of that review, and then conclude with the recommended “checklist” suggested by Lee and Lemieux (2009). 33
It has become standard to summarize RD analyses with a simple graph showing the relationship between the outcome and assignment variable. This has several advantages. The presentation of the “raw data” enhances the transparency of the research design. A graph can also give the reader a sense of whether the “jump” in the outcome variable at the cutoff is unusually large compared to the bumps in the regression curve away from the cutoff. Also, a graphical analysis can help identify why different functional forms give different answers, and can help identify outliers, which can be a problem in any empirical analysis. The problem with graphical presentations, however, is that there is some room for the researcher to construct graphs making it seem as though there are effects when there are none, or hiding effects that truly exist. A way to guard against this visual bias is to partition into intervals—with the discontinuity threshold at one of the boundaries—and present the mean within each interval. Often, the data on will already be discrete. This way, the behavior of the function around the threshold is given no special “privilege” in the presentation, yet it allows for the data to “speak for itself” as to whether there is an important jump at the threshold. 34
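The binning described above can be sketched as follows, assuming an assignment variable `v`, an outcome `y`, and a hypothetical helper `binned_means`. Bin edges are placed at multiples of the bin width measured from the cutoff, so the threshold always coincides with a bin boundary and no bin mixes observations from both sides.

```python
import numpy as np

def binned_means(v, y, cutoff=0.0, bin_width=0.5):
    """Mean of y within bins of v whose edges include the cutoff."""
    # Integer bin index relative to the cutoff; bin 0 starts at the cutoff.
    idx = np.floor((v - cutoff) / bin_width).astype(int)
    bins = np.unique(idx)
    mids = cutoff + (bins + 0.5) * bin_width
    means = np.array([y[idx == b].mean() for b in bins])
    counts = np.array([(idx == b).sum() for b in bins])
    return mids, means, counts

# Tiny hand-made example so the result can be checked by eye.
v = np.array([-0.75, -0.25, -0.1, 0.0, 0.3, 0.6])
y = np.array([1.0, 2.0, 4.0, 10.0, 12.0, 20.0])
mids, means, counts = binned_means(v, y)
print(mids, means, counts)
```

Plotting `means` against `mids` gives the standard RD graph. An observation exactly at the cutoff falls in the first bin to the right, which assumes treatment is assigned at or above the threshold.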
Here it is helpful to keep distinct the notions of identification and estimation. The RD design, as discussed above, is non-parametrically identified, and no parametric restrictions are needed to compute , given an infinite amount of data. But with a finite sample, one has a choice of different statistics, some referred to as “non-parametric” (e.g. kernel regression, local linear regression), while others are considered “parametric” (e.g. a low-order polynomial). As Powell (1994) points out, it is perhaps more helpful to view models rather than particular statistics as “parametric” or “non-parametric”. 35 The bottom line is that when the analyst chooses a particular functional form (say, a low-order polynomial) in estimation, and the true function does not belong to that polynomial class, then the resulting estimator will, in general, be biased. When the analyst uses a non-parametric procedure such as local linear regression (essentially running a regression using only data points “close” to the cutoff), there will also be bias. 36 With a finite sample, it is impossible to know which case has a smaller bias without knowing something about the true function. There will be some functions where a low-order polynomial is a very good approximation and produces little or no bias, and therefore it is efficient to use all data points, both “close to” and “far away” from the threshold. In other situations, a polynomial may be a bad approximation, and smaller biases will occur with a local linear regression.
In practice, parametric and non-parametric approaches often lead to the computation of the exact same statistic. For example, the procedure of regressing the outcome on and a treatment dummy and an interaction , can be viewed as a “parametric” regression or a local linear regression with a very large bandwidth. Similarly, if one wanted to exclude the influence of data points in the tails of the distribution, one could call the exact same procedure “parametric” after trimming the tails, or “non-parametric” by viewing the restriction in the range of as a result of using a smaller bandwidth. 37 Our main suggestion in estimation is to not rely on one particular method or specification. In any empirical analysis, results that are stable across alternative and equally plausible specifications are generally viewed as more reliable than those that are sensitive to minor changes in specification. RD is no exception in this regard.
Trying many different specifications will often produce a wide range of estimates. Although there is no simple formula that works in all situations and contexts for weeding out inappropriate specifications, it seems reasonable, at a minimum, not to rely on an estimate resulting from a specification that can be rejected by the data when tested against a strictly more flexible specification. For example, it seems wise to place less confidence in results from a low-order polynomial model, when it is rejected in favor of a less restrictive model (e.g., separate means for each discrete value of ). Similarly, there seems little reason to prefer a specification that uses all the data, if using the same specification but restricting to observations closer to the threshold gives a substantially (and statistically) different answer.
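The equivalence described above can be made concrete with a short sketch. Assuming an assignment variable `v` with the threshold normalized to 0 and treatment assigned when `v >= 0` (our assumptions), the hypothetical function `rd_linear` regresses the outcome on a constant, the treatment dummy, `v`, and their interaction. With `bandwidth=np.inf` it is the “parametric” regression on all the data; with a finite bandwidth it is a local linear regression with a rectangular kernel.

```python
import numpy as np

def rd_linear(v, y, bandwidth=np.inf, cutoff=0.0):
    """Coefficient on the treatment dummy from an interacted linear fit."""
    x = v - cutoff
    keep = np.abs(x) <= bandwidth           # rectangular kernel / trimming
    x, yk = x[keep], y[keep]
    d = (x >= 0).astype(float)
    X = np.column_stack([np.ones_like(x), d, x, d * x])
    beta, *_ = np.linalg.lstsq(X, yk, rcond=None)
    return beta[1]                          # estimated jump at the cutoff

rng = np.random.default_rng(3)
v = rng.uniform(-1.0, 1.0, 2_000)
d = (v >= 0).astype(float)

# Case 1: the true function is linear on each side; any bandwidth works.
y_linear = 1.0 + 2.0 * d + 0.5 * v - 0.3 * d * v
jump_all = rd_linear(v, y_linear)
jump_local = rd_linear(v, y_linear, bandwidth=0.5)

# Case 2: a cubic term makes the global linear specification biased.
y_curved = y_linear + v**3
jump_wide = rd_linear(v, y_curved)
jump_narrow = rd_linear(v, y_curved, bandwidth=0.2)

print(jump_all, jump_local, jump_wide, jump_narrow)
```

In the first case both calls recover the true jump of 2. In the second, the estimate moves as the bandwidth shrinks, which is the kind of sensitivity that should lower confidence in the full-sample specification. The simulated data are noise-free, so the comparison isolates bias and says nothing about the variance cost of a smaller bandwidth.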
A Recommended “checklist” for implementation
Below we summarize the recommendations given by Lee and Lemieux (2009) for the analysis, presentation, and estimation of RD designs.
Although it is impractical for researchers to present every permutation of presentation (e.g. points 2-4 for every one of 20 baseline covariates), probing the sensitivity of the results to this array of tests and alternative specifications—even if they only appear in online appendices—is an important component of a thorough RD analysis.
Returning to our hypothetical job search assistance program, consider the same setup as the RD design described above: based on past employment and earnings and a model of who would most benefit from the program, the government constructs a score where those with will receive the treatment (phone call/personal visit of job counselor). But now assume that crossing the threshold only determines whether the agency explicitly provides information to the individual about the existence of the program. In other words, determines —as defined in the case of random assignment with imperfect compliance, discussed above in Section 3.2. As a result, is no longer a deterministic function of , but is a potentially discontinuous function in . This is known as the “fuzzy” RD design (see Hahn et al. (2001), for example, for a formal definition).
The easiest way to understand the fuzzy RD design is to keep in mind that the relationship between the fuzzy design and the sharp design parallels the relation between random assignment with imperfect compliance and random assignment with perfect compliance. Because of this parallel, our assessment of the potential internal validity of the design and potential caveats follows the discussion in Section 3.2. Testing the design follows the same principles as in Section 3.2. Furthermore, one could, in principle, combine the extrapolative ideas in Section 3.3.2 and Section 3.2.2 to use fuzzy RD design estimates to make extrapolations along two dimensions: predicting the effect under mandatory compliance, and predicting the average effects at points away from the threshold.
We limit our discussion here to making explicit the conditions for identification of average effects within the common econometric framework we have been utilizing. The different conditions for this case are
In sum, we have (1) the conditions from the Sharp RD (D5 ( observable), S11 (Positive Density at Threshold), S13 (Continuous Density at Threshold)), (2) a condition from the randomized experiment with imperfect compliance (S9 (Probabilistic Monotonicity)), and (3) two new “hybrid” conditions—D6 (Discontinuous rule for ) and S14 (Exclusion Restriction).
Given the discontinuous rule D6, we have
and because of S11, S13, and S14, we can combine the integrals so that the difference equals
Normalizing the difference by the quantity yields
where the normalizing factor ensures that the weights in square brackets average to one. 38
Thus the fuzzy RD estimand is a weighted average of treatment effects. The weights reflect two factors: the relative likelihood that a given type ’s will be close to the threshold reflected in the term , and the influence that crossing the threshold has on the probability of treatment, as reflected in . S9 ensures these weights are nonnegative.
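A sample analogue of this normalized difference can be sketched as a ratio of two estimated discontinuities: the jump in the mean outcome divided by the jump in the probability of treatment. The code below is a minimal illustration, not a recommended estimator. It assumes a threshold at 0, uses separate linear fits within a bandwidth on each side to approximate the one-sided limits, and the names `side_limits` and `fuzzy_rd` are hypothetical.

```python
import numpy as np

def side_limits(v, z, bandwidth, cutoff=0.0):
    """Intercepts of separate linear fits just below and just above the cutoff."""
    x = v - cutoff
    limits = []
    for side in (x < 0, x >= 0):
        keep = side & (np.abs(x) <= bandwidth)
        X = np.column_stack([np.ones(keep.sum()), x[keep]])
        beta, *_ = np.linalg.lstsq(X, z[keep], rcond=None)
        limits.append(beta[0])
    return limits[0], limits[1]

def fuzzy_rd(v, y, d, bandwidth, cutoff=0.0):
    """Outcome discontinuity divided by treatment-probability discontinuity."""
    y_lo, y_hi = side_limits(v, y, bandwidth, cutoff)
    d_lo, d_hi = side_limits(v, d, bandwidth, cutoff)
    return (y_hi - y_lo) / (d_hi - d_lo)

rng = np.random.default_rng(4)
n = 200_000
v = rng.uniform(-1.0, 1.0, n)

# Crossing the threshold raises the treatment probability from 0.2 to 0.7.
p_treat = 0.2 + 0.5 * (v >= 0)
d = (rng.random(n) < p_treat).astype(float)

# Simulated outcome with a constant treatment effect of 3 (an assumption
# made so that the target of the ratio is unambiguous).
y = 1.0 + 0.5 * v + 3.0 * d + rng.normal(0.0, 1.0, n)

estimate = fuzzy_rd(v, y, d, bandwidth=0.5)
print("fuzzy RD estimate:", estimate)
```

With heterogeneous effects the same ratio would estimate a weighted average of the kind described above rather than a single common effect. The denominator must be bounded away from zero for the ratio to be informative.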
From a purely ex ante evaluation perspective, the weights would seem peculiar, and not related to any meaningful economic concept. But from a purely ex post evaluation perspective, the weights are a statement of fact. As soon as one believes that there is causal information in comparing just above and below the threshold—and such an intuition is entirely driven by the institutional knowledge given by D5 and D6—then it appears that the only way to obtain some kind of average effect (while remaining as agnostic as possible about the other unobservable mechanisms that enter the latent propensity ) with the data , is to make sure that the implied weights integrate to 1. We have no choice but to divide the difference by .
As we have argued in previous sections, rather than abandon potentially highly credible causal evidence because the data and circumstances that we are handed did not deliver the “desired” weights, we believe a constructive approach might build on the credibility of quasi-experimental estimates, and use them as inputs in an extrapolative exercise that will necessarily involve imposing more structure on the problem.