Handbook of Labor Economics, Vol. 4, No. Suppl PA, 2011
ISSN: 1573-4463
doi: 10.1016/S0169-7218(11)00412-6
Chapter 6: Identification of Models of the Labor Market
Abstract
This chapter discusses identification of common selection models of the labor market. We start with the classic Roy model and show how it can be identified with exclusion restrictions. We then extend the argument to the generalized Roy model, treatment effect models, duration models, search models, and dynamic discrete choice models. In all cases, key ingredients for identification are exclusion restrictions and support conditions.
JEL classification
• C14 • C51 • J22 • J24
Keywords
• Identification • Roy model • Discrete choice • Selection • Treatment effects
This chapter discusses identification of common selection models of the labor market. We are primarily concerned with nonparametric identification. We view nonparametric identification as important for the following reasons.
First, recent advances in computer power, more widespread use of large data sets, and better methods mean that estimation of increasingly flexible functional forms is possible. Flexible functional forms should be encouraged. The functional form and distributional assumptions used in much applied work rarely come from the theory. Instead, they come from convenience. Furthermore, they are often not innocuous. 1
Second, the process of thinking about nonparametric identification is a useful input into applied work. It helps an applied researcher understand both which type of data would be ideal and which aspects of the model she might have some hope of estimating. If a feature of the model is not nonparametrically identified, then one knows it cannot be identified directly from the data; some additional type of functional form assumption must be made. As a result, readers of empirical papers are often skeptical of the results in cases in which the model is not nonparametrically identified.
Third, identification is an important part of a proof of consistency of a nonparametric estimator.
However, we acknowledge the following limitation of focusing on nonparametric identification. With any finite data set, an empirical researcher can almost never be completely nonparametric. Some aspects of the data that might be formally identified could never be estimated with any reasonable level of precision. Instead, estimators are usually only nonparametric in the sense that one allows the flexibility of the model to grow with the sample size. A nice example of this is sieve estimators, in which one estimates finite parameter models but lets the number of parameters grow with the sample size. An example would be approximating a function by a polynomial and letting the degree of the polynomial grow as the sample size increases. However, in that case one still must verify that the model is nonparametrically identified in order to show that the estimator is consistent. One must also construct standard errors appropriately. In this chapter we do not consider the purely statistical aspects of nonparametric estimation, such as calculation of standard errors. This is a very large topic within econometrics. 2
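To make the sieve idea concrete, here is a minimal sketch. The regression function, sample sizes, and the degree rule (roughly $n^{1/4}$) are our own illustrative choices, not from the chapter: a polynomial series estimator whose degree grows with the sample size approximates a smooth nonlinear regression function increasingly well.

```python
import numpy as np

rng = np.random.default_rng(0)

def sieve_fit(n, degree):
    """Polynomial series estimate of g(x) = sin(4x) from n noisy draws."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(4 * x) + rng.normal(0, 0.3, n)
    coefs = np.polyfit(x, y, degree)          # least squares in a polynomial basis
    grid = np.linspace(-0.9, 0.9, 181)        # evaluate inside the support
    return np.max(np.abs(np.polyval(coefs, grid) - np.sin(4 * grid)))

# Let the number of basis terms grow slowly with the sample size.
errors = [sieve_fit(n, degree=int(round(n ** 0.25))) for n in (200, 2000, 20000)]
print(errors)  # the error with the largest n is far below the error with the smallest
```

With few observations a low-degree polynomial misses the curvature of the target function; as the sample (and hence the degree) grows, the approximation error falls toward the sampling-noise floor.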
The key issue in identification of most models of the labor market is the selection problem. For example, individuals are typically not randomly assigned to jobs. With this general goal in mind we begin with the simplest and most fundamental selection model in labor economics, the Roy (1951) model. We go into some detail to explain Heckman and Honoré’s (1990) results on identification of this model. A nice aspect of identification of the Roy model is that the basic methodology used in this case can be extended to show identification of other labor models. We spend the rest of the chapter showing how this basic intuition can be used in a wide variety of labor market models. Specifically we cover identification in the generalized Roy model, treatment effect models, the competing risk model, search models, and forward looking dynamic models. While we are clearly not covering all models in labor economics, we hope the ideas are presented in a way that the similarities in the basic models can be seen and can be extended by the reader to alternative frameworks.
The plan of this chapter is specifically as follows. Section 2 discusses some econometric preliminaries. We consider the Roy model in Section 3, generalize this to the Generalized Roy model in Section 4, and then use the model to think about identification of treatment effects in Section 5. In Section 6 we consider duration models and search models and then consider estimation of dynamic discrete choice models in Section 7. Finally in Section 8 we offer some concluding thoughts.
Throughout this chapter we use capital letters with subscripts to denote random variables and small letters without subscripts to denote possible outcomes of that random variable. We will also try to be explicit throughout this chapter in denoting conditioning. Thus, for example, we will use the notation $E(Y \mid X = x)$ to denote the expected value of the outcome $Y$ conditional on the regressor variable $X$ being equal to some realization $x$.
The word “identification” has come to mean different things to different labor economists. Here, we use a formal econometrics definition of identification. Consider two different models that lead to two data generating processes. If the data generated by these two models have exactly the same distribution, then the two models are not separately identified from each other. However, if any two different model specifications lead to different data distributions, the two specifications are separately identified. We give a more precise definition below. Our definition of identification is based on some of the notation and setup of Matzkin (2007), following an exposition based on Shaikh (2010).
Let $P_0$ denote the true distribution of the observed data $Z$. An econometric model defines a data generating process. We assume that the model is specified up to an unknown element $\theta$, which may contain a vector of parameters, functions, and distribution functions. This $\theta$ is known to lie in a space $\Theta$. Within the class of models, the element $\theta$ determines the distribution $P_\theta$ of the data that is observable to the researcher. Notice that identification is fundamentally data dependent. With a richer data set, the distribution $P_\theta$ would be a different object.

Let $\mathcal{P}$ be the set of all possible distributions that could be generated by the class of models we consider (i.e. $\mathcal{P} \equiv \{P_\theta : \theta \in \Theta\}$). We assume that the model is correctly specified, which means that $P_0 \in \mathcal{P}$. The identified set is defined as

$$\Theta_0(P_0) \equiv \{\theta \in \Theta : P_\theta = P_0\}.$$

This is the set of possible $\theta$ that could have generated data that has distribution $P_0$. By assuming that $P_0 \in \mathcal{P}$ we have assumed that our model is correctly specified, so this set is not empty. We say that $\theta$ is identified if $\Theta_0(P_0)$ is a singleton for all $P_0 \in \mathcal{P}$.
The question we seek to answer here is under what conditions it is possible to learn about $\theta$ (or some feature of $\theta$) from the distribution $P_0$ of the observed data. Our interest is not always to identify the full data generating process. Often we are interested in only a subset of the model, or a particular outcome from it. Specifically, our goal may be to identify

$$\psi \equiv f(\theta),$$

where $f$ is a known function. For example, in a regression model $Y = X'\beta + \varepsilon$, the feature of interest is typically the vector of regression coefficients, in which case $f$ would take the trivial form $f(\theta) = \beta$. However, this notation allows for more general cases in which we might be interested in identifying specific aspects of the model. For example, if our interest is in identifying the covariance between $X$ and $Y$ in the case of the linear regression model, we do not need to know $\theta$ per se, but rather a transformation of these parameters. We could also be interested in a forecast from the model, such as $E(Y \mid X = x)$ for some specific $x$. The distinction between identification of features of the model as opposed to the full model is important, as in many cases the full model is not identified but the key feature of interest is identified.
To think about identification of $\psi$ we define

$$\Psi_0(P_0) \equiv \{f(\theta) : \theta \in \Theta_0(P_0)\}.$$

That is, it is the set of possible values of $\psi$ that are consistent with the data distribution $P_0$. We say that $\psi$ is identified if $\Psi_0(P_0)$ is a singleton.
As an example consider the standard regression model with two regressors:

$$Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon \quad (2.1)$$

with $E(\varepsilon \mid X_1 = x_1, X_2 = x_2) = 0$ for any value $(x_1, x_2)$ (where $Z = (Y, X_1, X_2)$). In this case $\theta = (\beta_1, \beta_2, G)$, where $G$ is the joint distribution of $(X_1, X_2)$ and $\varepsilon$. One would write $\Theta$ as $B \times \mathcal{G}$, where $B$ is the parameter space for $(\beta_1, \beta_2)$ and $\mathcal{G}$ is the space of joint distributions between $(X_1, X_2)$ and $\varepsilon$ that satisfy $E(\varepsilon \mid X_1 = x_1, X_2 = x_2) = 0$ for all $(x_1, x_2)$. Since the data here is represented by $Z = (Y, X_1, X_2)$, $P_\theta$ represents the joint distribution of $(Y, X_1, X_2)$. Given knowledge of $(\beta_1, \beta_2)$ and $G$ we know the data generating process and thus we know $P_\theta$.

To focus ideas suppose we are interested in identifying $\beta_1$ (i.e. $f(\theta) = \beta_1$) in regression model (2.1) above. Let the true value of the data generating process be $\theta^* = (\beta_1^*, \beta_2^*, G^*)$ so that by definition $P_{\theta^*} = P_0$. In this case $\Theta_0(P_0)$ is the set of $\theta$ that would lead our data to have distribution $P_0$, and $\Psi_0(P_0)$ is the set of values of $\beta_1$ in this set.

In the case of two covariates, we know the model is identified as long as $X_1$ and $X_2$ are not degenerate and not collinear. To see how this definition of identification applies to this model, note that for any $(b_1, b_2) \neq (\beta_1^*, \beta_2^*)$, the lack of perfect multicollinearity means that we can always find values of $(x_1, x_2)$ in the support of $(X_1, X_2)$ for which

$$\beta_1^* x_1 + \beta_2^* x_2 \neq b_1 x_1 + b_2 x_2.$$

Since $E(Y \mid X_1 = x_1, X_2 = x_2)$ is one aspect of the joint distribution of $(Y, X_1, X_2)$, it must be the case that when $(b_1, b_2) \neq (\beta_1^*, \beta_2^*)$, $P_{(b_1, b_2, G^*)} \neq P_0$. Since this is true for any value of $(b_1, b_2) \neq (\beta_1^*, \beta_2^*)$, $\Psi_0(P_0)$ must be the singleton $\{\beta_1^*\}$.
However, consider the well known case of perfect multicollinearity in which the model is not identified. In particular suppose that

$$X_2 = \delta X_1$$

for some constant $\delta$. For the true value $(\beta_1^*, \beta_2^*)$ consider some other value $(b_1, b_2) = (\beta_1^* + \delta \beta_2^*, 0)$. Then for any $x_1$,

$$\beta_1^* x_1 + \beta_2^* \delta x_1 = b_1 x_1 + b_2 \delta x_1.$$

If the distribution of $(X_1, \varepsilon)$ is the same for the two models, then the joint distribution of $(Y, X_1, X_2)$ is the same in the two cases. Thus the identification condition above is violated because $P_{(b_1, b_2, G^*)} = P_0$ with $(b_1, b_2) \neq (\beta_1^*, \beta_2^*)$, and thus $b_1 \in \Psi_0(P_0)$. Since the true value $\beta_1^* \in \Psi_0(P_0)$ as well, $\Psi_0(P_0)$ is not a singleton and thus $\beta_1$ is not identified.
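This failure of identification can be seen in a small simulation (the parameter values below are our own illustrative choices): with $X_2 = \delta X_1$, two different coefficient pairs generate exactly the same outcome variable, draw by draw, so no amount of data can distinguish them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
delta = 2.0

x1 = rng.normal(size=n)
x2 = delta * x1                     # perfect multicollinearity
eps = rng.normal(size=n)

# True parameters and an observationally equivalent alternative.
b1_true, b2_true = 1.0, 0.5
b1_alt, b2_alt = b1_true + delta * b2_true, 0.0

y_true = b1_true * x1 + b2_true * x2 + eps
y_alt = b1_alt * x1 + b2_alt * x2 + eps

# The two parameter values imply identical data, observation by observation.
print(np.max(np.abs(y_true - y_alt)))
```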
Another important issue is the support of the data. The simplest definition of support is just the range of the data. When data are discrete, this is the set of values that occur with positive probability. Thus a binary variable that is either zero or one would have support $\{0, 1\}$. The result of a die roll has support $\{1, 2, 3, 4, 5, 6\}$. With continuous variables things get somewhat more complicated. One can think of the support of a random variable as the set of values for which the density is positive. For example, the support of a normal random variable would be the full real line (which we will often refer to as “full support”). The support of a uniform variable on $[0, 1]$ is $[0, 1]$. The support of an exponential variable would be the positive real line.

This can be somewhat trickier in dealing with outcomes that occur with measure zero. For example one could think of the support of a uniform variable on $[0, 1]$ as $[0, 1]$, $(0, 1)$, or $[0, 1)$. The distinction between these objects will not be important in what we are doing, but to be formal we will use the Davidson (1994) definition of support. He defines the support of a random variable with distribution $F$ as the set of points at which $F$ is (strictly) increasing. 3 By this definition, the support of a uniform would be $[0, 1]$. We will also use the notation $\mathrm{supp}(X)$ to denote the unconditional support of random variable $X$ and $\mathrm{supp}(X \mid W = w)$ to denote the support conditional on another variable taking the value $w$.
To see the importance of this concept, consider a simple case of the separable regression model

$$Y = g(X) + \varepsilon \quad (2.2)$$

with a single continuously distributed regressor $X$ and $E(\varepsilon \mid X = x) = 0$ for $x \in \mathrm{supp}(X)$. In this case we know that

$$E(Y \mid X = x) = g(x).$$

Letting $\mathcal{X}$ be the support of $X$, it is straightforward to see that $g$ is identified on the set $\mathcal{X}$. But $g$ is not identified outside the set $\mathcal{X}$ because the data is completely silent about these values. Thus if $\mathcal{X} = \mathbb{R}$, $g$ is globally identified. However, if $\mathcal{X}$ only covers a subset of the real line it is not. For example, one interesting counterfactual is the change in the expected value of $Y$ if $X$ were increased by some amount $\delta$: $E(g(X + \delta)) - E(g(X))$. If $\mathcal{X} = \mathbb{R}$ this is trivially identified, but if the support of $X$ were bounded from above, this would no longer be the case. That is, if the supremum of $\mathcal{X}$ is $\overline{x}$, then for any value of $x > \overline{x} - \delta$, $g(x + \delta)$ is not identified, and thus the unconditional expected value of $g(X + \delta)$ is not identified either. This is just a restatement of the well known fact that one cannot project $g$ outside the support of the data unless one makes functional form assumptions. Our point here is that support assumptions are very important in nonparametric identification results. One can only identify $g$ over the full range of plausible values of $X$ if $X$ has full support. For this reason, we will often make strong support condition assumptions. This also helps illuminate the tradeoff between functional form assumptions and flexibility. In order to project off the support of the data in a simple regression model one needs to use some functional form assumption. The same is true for selection models.
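A simulation makes the point (the functions and support below are our own illustrative choices): two regression functions that agree on the support of $X$ but differ outside it generate data with exactly the same distribution, so the data cannot distinguish their off-support counterfactuals.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x = rng.uniform(0.0, 1.0, n)        # supp(X) = [0, 1]
eps = rng.normal(0.0, 0.1, n)

def g1(x):
    return np.sin(3 * x)

def g2(x):
    # Agrees with g1 on [0, 1], diverges outside the support.
    return np.where((x >= 0) & (x <= 1), np.sin(3 * x), np.sin(3 * x) + (x - 1))

y1 = g1(x) + eps
y2 = g2(x) + eps

print(np.max(np.abs(y1 - y2)))      # identical data on the support
print(float(g1(2.0) - g2(2.0)))     # but the counterfactual at x = 2 differs
```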
There is one complication that we need to deal with throughout. It is not a terribly important issue, but will shape some of our assumptions. Consider again the separable regression model

$$Y = g(X) + \varepsilon.$$

As mentioned above $E(Y \mid X = x) = g(x)$, so it seems trivial to see that $g$ is identified, but that is not quite true. To see the problem, suppose that both $X$ and $\varepsilon$ are standard normals. Consider two different models for $g$: one in which $g_1(x) = x$ everywhere, and a second in which $g_2(x) = x$ at every point except $x = 0$, where $g_2(0) = 1$. These models only differ at the point $x = 0$, but since $X$ is normal this is a zero probability event, and we could never distinguish between these models because they imply the same joint distribution of $(X, Y)$. For the exact same reason it isn’t really a concern (except in very special cases, such as if one was evaluating a policy in which we would set $X = 0$ for everyone). Since this will be an issue throughout this chapter we explain how to deal with it now and use this convention throughout the chapter.
We will make the following assumptions.
Assumption 2.1. $X$ can be written as $X = (X_c, X_d)$, where the elements of $X_c$ are continuously distributed (no point has positive mass), and $X_d$ is distributed discretely (all support points have positive mass).

Assumption 2.2. For any $x_d \in \mathrm{supp}(X_d)$, $g(X_c, x_d)$ is almost surely continuous across $X_c$.

The first part says that we can partition our observables into continuous and discrete ones. One could easily allow for variables that are partially continuous and partially discrete, but this would just make our results more tedious to exposit. The second assumption states that choosing a value of $X_c$ at which $g$ is discontinuous (in the continuous variables) is a zero probability event.

Theorem 2.1. Under Assumptions 2.1 and 2.2 and assuming model (2.2) with $E(\varepsilon \mid X = x) = 0$ for $x \in \mathrm{supp}(X)$, $g$ is identified on a set that has measure 1.
(Proof in Appendix.)
The proof just states that $g$ is identified almost everywhere. More specifically, it is identified everywhere that it is continuous.
The classic model of selection in the labor market is the Roy (1951) model. In the Roy model, workers choose one of two possible occupations: hunting and fishing. They cannot pursue both at the same time. The worker’s log wage is $W_F$ if he fishes and $W_H$ if he hunts. Workers maximize income, so they choose the occupation with the higher wage. Thus a worker chooses to fish if $W_F > W_H$. The occupation is defined as

$$J = \begin{cases} F & \text{if } W_F > W_H \\ H & \text{otherwise} \end{cases} \quad (3.1)$$

and the log wage is defined as

$$W = \max(W_F, W_H). \quad (3.2)$$
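A quick simulation of this two-sector choice (the distributional parameters are our own illustrative choices) shows the selection problem that drives the rest of the chapter: the mean of observed wages in a sector differs from the mean of latent wages in that sector, because workers sort into the sector where they earn more.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Latent log wages in each sector (illustrative parameters).
mu_f, mu_h = 1.0, 1.0
wf = mu_f + rng.normal(0.0, 1.0, n)
wh = mu_h + rng.normal(0.0, 1.0, n)

# Workers maximize income: J = F iff W_F > W_H; observed wage is the max.
fish = wf > wh
w_obs = np.maximum(wf, wh)

mean_latent_f = wf.mean()            # ~ mu_f
mean_observed_f = wf[fish].mean()    # mean among those who chose fishing
print(mean_latent_f, mean_observed_f)
```

The observed fishing-sector mean lies well above the latent mean: fishermen are disproportionately those with good fishing draws.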
Workers face a simple binary choice: choose the job with the highest wage. This simplicity has led the model to be used in one form or another in a number of important labor market contexts. Many discrete choice models share the Roy model’s structure. Examples in labor economics include the choice of whether to continue schooling, what school to attend, what occupation to pursue, whether to join a union, whether to migrate, whether to work, whether to obtain training, and whether to marry.
As mentioned in the introduction, we devote considerable attention to identification of this model. In subsequent sections we generalize these results to other models.
The responsiveness of the supply of fishermen to changes in the price of fish depends critically on the joint distribution of $(W_F, W_H)$. Thus we need to know what a fisherman would have made if he had chosen to hunt. However, we do not observe this but must infer its counterfactual distribution from the data at hand. Our focus is on this selection problem. Specifically, much of this chapter is concerned with the following question: under what conditions is the joint distribution of $(W_F, W_H)$ identified? We start by considering estimation in a parametric model and then consider nonparametric identification.
Roy (1951) is concerned with how occupational choice affects the aggregate distribution of earnings and makes a series of claims about this relationship. These claims turn out to be true when the distribution of skills in the two occupations is lognormal.
Heckman and Honoré (1990) consider identification of the Roy model (i.e., the joint distribution of ). They show that there are two methods for identifying the Roy model. The first is through distributional assumptions. The second is through exclusion restrictions. 4
In order to focus ideas, we use the following case:

$$W_F = g_F(X, Z_F) + \varepsilon_F \quad (3.3)$$
$$W_H = g_H(X, Z_H) + \varepsilon_H \quad (3.4)$$

where the unobservable error terms $(\varepsilon_F, \varepsilon_H)$ are independent of the observable variables $(X, Z_F, Z_H)$, and $W_F$ and $W_H$ denote log wages in the fishing and hunting sectors respectively. We distinguish between three types of variables. $X$ influences productivity in both fishing and hunting, $Z_F$ influences fishing only, and $Z_H$ influences hunting only. The variables $Z_F$ and $Z_H$ are “exclusion restrictions,” and play a very important role in the identification results below. In the context of the Roy model, an exclusion restriction could be a change in the price of rabbits, which increases income from hunting but not from fishing. The notation is general enough to incorporate a model without exclusion restrictions (in which case one or both of $Z_F$ and $Z_H$ would be empty).
Our version of the Roy framework imposes two strong assumptions. First, that $g_j(X, Z_j)$ is separable from $\varepsilon_j$ for $j \in \{F, H\}$. Second, we assume that $(\varepsilon_F, \varepsilon_H)$ and $(X, Z_F, Z_H)$ are independent of one another. Note that independence implies homoskedasticity: the variance of $\varepsilon_j$ cannot depend on $(X, Z_F, Z_H)$. There is a large literature looking at various other more flexible specifications and this is discussed thoroughly in Matzkin (2007). It is also trivial to extend this model to allow for a more general relationship between the errors and the observables, as we discuss in Section 3.3 below.
We focus on the separable independent model for two reasons. First, the assumptions of separability and independence have bite beyond a completely general nonparametric relationship. That is, to the extent that they are true, identification is facilitated by these assumptions. Presumably because researchers think these assumptions are approximately true, virtually all empirical research uses them. Second, despite being strong, these assumptions are obviously much weaker than the standard assumptions that $g_j$ is linear (i.e. $g_j(X, Z_j) = X'\beta_j + Z_j'\gamma_j$) and that $(\varepsilon_F, \varepsilon_H)$ is normally distributed. One approach to writing this chapter would have been to go through all of the many specifications and alternative assumptions. We choose to focus on a single base specification for expositional simplicity.
Heckman and Honoré (1990) first discuss identification of the joint distribution of $(W_F, W_H)$ using distributional assumptions. They show that when one can observe the distribution of wages in both sectors, and assuming $(\varepsilon_F, \varepsilon_H)$ is joint normally distributed, then the joint distribution of $(W_F, W_H)$ is identified from a single cross section even without any exclusion restrictions or regressors. To see why, write equations (3.3) and (3.4) without regressors (so $g_j$ reduces to $\mu_j$, the mean of $W_j$):

$$W_F = \mu_F + \varepsilon_F, \qquad W_H = \mu_H + \varepsilon_H,$$

where

$$\begin{pmatrix} \varepsilon_F \\ \varepsilon_H \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_F^2 & \sigma_{FH} \\ \sigma_{FH} & \sigma_H^2 \end{pmatrix} \right).$$

Letting

$$\lambda(c) \equiv \frac{\phi(c)}{\Phi(c)}$$

(with $\phi$ and $\Phi$ the pdf and cdf of a standard normal), and, for each sector $j \in \{F, H\}$, using the choice probability together with the conditional moments of observed wages among workers who chose sector $j$, one can derive conditions from properties of normal random variables found in Heckman and Honoré (1990). For example, with $\nu \equiv \varepsilon_H - \varepsilon_F$ and $\sigma_\nu^2 \equiv \sigma_F^2 + \sigma_H^2 - 2\sigma_{FH}$,

$$\Pr(J = F) = \Phi\left( \frac{\mu_F - \mu_H}{\sigma_\nu} \right), \qquad E(W \mid J = F) = \mu_F + \frac{\sigma_F^2 - \sigma_{FH}}{\sigma_\nu}\, \lambda\left( \frac{\mu_F - \mu_H}{\sigma_\nu} \right),$$

with analogous expressions for the hunting sector and for the higher conditional moments. This gives us seven equations in the five unknowns $\mu_F$, $\mu_H$, $\sigma_F^2$, $\sigma_H^2$, and $\sigma_{FH}$. It is straightforward to show that the five parameters can be identified from this system of equations.
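As a numerical sanity check on the selection formulas (the parameter values are our own illustrative choices), the Monte Carlo below compares the simulated mean of observed fishing wages with the normal-selection formula $E(W \mid J = F) = \mu_F + ((\sigma_F^2 - \sigma_{FH})/\sigma_\nu)\,\lambda((\mu_F - \mu_H)/\sigma_\nu)$:

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Illustrative parameters of the bivariate normal Roy model.
mu_f, mu_h = 1.0, 0.5
sig_f, sig_h, sig_fh = 1.0, 1.2, 0.3

cov = [[sig_f**2, sig_fh], [sig_fh, sig_h**2]]
eps = rng.multivariate_normal([0.0, 0.0], cov, size=n)
wf, wh = mu_f + eps[:, 0], mu_h + eps[:, 1]
fish = wf > wh

def phi(c):
    return math.exp(-c * c / 2) / math.sqrt(2 * math.pi)

def Phi(c):
    return 0.5 * (1 + math.erf(c / math.sqrt(2)))

sig_nu = math.sqrt(sig_f**2 + sig_h**2 - 2 * sig_fh)
c = (mu_f - mu_h) / sig_nu
lam = phi(c) / Phi(c)

formula = mu_f + (sig_f**2 - sig_fh) / sig_nu * lam
simulated = wf[fish].mean()
print(simulated, formula)
```

The simulated selected mean and the analytical formula agree up to Monte Carlo error, as does the share of fishermen with $\Phi((\mu_F - \mu_H)/\sigma_\nu)$.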
However, Theorems 7 and 8 of Heckman and Honoré (1990) show that when one relaxes the log normality assumption, without exclusion restrictions in the outcome equation, the model is no longer identified. This is true despite the strong assumption of agent income maximization. This result is not particularly surprising in the sense that our goal is to estimate a full joint distribution of the two dimensional object $(W_F, W_H)$, but all we can observe is two one dimensional distributions (wages conditional on job choice). Since there is no information in the data about the wage that a fisherman may have received as a hunter, one cannot identify this joint distribution. In fact, Theorem 7 of Heckman and Honoré (1990) states that we can never distinguish the actual model from an alternative model in which the two skills are independent of each other.
It is often the case that we only observe wages in one sector. For example, when estimating models of participation in the labor force, the wage is observed only if the individual works. We can map this into our model by associating working with “fishing” and not working with “hunting.” That is, we let $W_F$ denote income if working and let $W_H$ denote the value of not working.5
But there are other examples in which we observe the wage in only one sector. For example, in many data sets we do not observe wages of workers in the black market sector. Another example is return immigration in which we know when a worker leaves the data to return to their home country, but we do not observe that wage.
In Section 3.2 we discuss identification of the nonparametric version of the model. However, it turns out that identification of the more complicated model is quite similar to estimation of the model with normally distributed errors. Thus we review this in detail before discussing the nonparametric model. We also remark that providing a consistent estimator also provides a constructive proof of identification, so one can also interpret these results as (informally) showing identification in the normal model. The model is similar to Willis and Rosen’s (1979) Roy model of educational choices or Lee’s (1978) model of union status, and the empirical approach is analogous. We assume that

$$W_F = X'\beta_F + Z_F'\gamma_F + \varepsilon_F, \qquad W_H = X'\beta_H + Z_H'\gamma_H + \varepsilon_H,$$

with $(\varepsilon_F, \varepsilon_H)$ joint normally distributed, independent of $(X, Z_F, Z_H)$, and with covariance matrix as above.
In a labor supply model where fishing represents market work, $W_F$ is the market wage, which will be observed for workers only. $W_H$, the pecuniary value of not working, is never observed in the data. Keane et al.’s (2011) example of the static model of a married woman’s labor force participation is similar.
One could simply estimate this model by maximum likelihood. However, we discuss a more traditional four step method to illustrate how the parametric model is identified. This four step process will be analogous to the more complicated nonparametric identification below. Step 1 is a “reduced form probit” of occupational choices as a function of all covariates in the model. Step 2 estimates the wage equation by controlling for selection as in the second step of a Heckman two-step (Heckman, 1979). Step 3 uses the coefficients of the wage equation and plugs them back into a probit equation to estimate a “structural probit.” Step 4 shows identification of the remaining elements of the variance-covariance matrix of the residuals.
The probability of choosing fishing (i.e., work) is:

$$\Pr(J = F \mid X, Z_F, Z_H) = \Phi\big( X'\pi_X + Z_F'\pi_F + Z_H'\pi_H \big) \quad (3.5)$$

where $\Phi$ is the cdf of a standard normal, $\sigma_\nu$ is the standard deviation of $\nu \equiv \varepsilon_H - \varepsilon_F$, and

$$\pi_X = \frac{\beta_F - \beta_H}{\sigma_\nu}, \qquad \pi_F = \frac{\gamma_F}{\sigma_\nu}, \qquad \pi_H = -\frac{\gamma_H}{\sigma_\nu}.$$

This is referred to as the “reduced form model” as it is a reduced form in the classical sense: the parameters are a known function of the underlying structural parameters. It can be estimated by maximum likelihood as a probit model. Let $(\hat\pi_X, \hat\pi_F, \hat\pi_H)$ represent the estimated parameter vector. This is all that can be learned from the choice data alone. We need further information to identify the structural coefficients and to separate them from the scale $\sigma_\nu$.
This is essentially the second stage of a Heckman (1979) two step. To review the idea behind it, let

$$I^* \equiv X'\pi_X + Z_F'\pi_F + Z_H'\pi_H,$$

so that $J = F$ exactly when $\nu / \sigma_\nu \leq I^*$. Then consider the regression

$$W_F = X'\beta_F + Z_F'\gamma_F + E(\varepsilon_F \mid X, Z_F, Z_H, J = F) + \xi,$$

where (by definition of regression) $E(\xi \mid X, Z_F, Z_H, J = F) = 0$, and thus the expected wage of those who choose to work is

$$E(W_F \mid X, Z_F, Z_H, J = F) = X'\beta_F + Z_F'\gamma_F + \rho\, \lambda(I^*), \quad (3.6)$$

where $\rho \equiv (\sigma_F^2 - \sigma_{FH})/\sigma_\nu$ and $\lambda(c) = \phi(c)/\Phi(c)$ is the inverse Mills ratio. Showing that $E(\varepsilon_F \mid X, Z_F, Z_H, J = F) = \rho\, \lambda(I^*)$ is a fairly straightforward integration problem and is well known. Because Eq. (3.6) is a conditional expectation function, OLS regression of $W_F$ on $X$, $Z_F$, and $\lambda(\hat I^*)$ gives consistent estimates of $\beta_F$, $\gamma_F$, and $\rho$, where $\hat I^*$ is the value of the index estimated in Eq. (3.5).
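The sketch below simulates this model and runs the Step 2 regression. All parameter values are our own illustrative choices, and for brevity it uses the true first-stage index in the Mills ratio rather than estimating the probit (the probit estimate is what Step 1 would supply in practice):

```python
import numpy as np
from math import erf, pi, sqrt

rng = np.random.default_rng(5)
n = 200_000

# Illustrative structural parameters.
bF0, bF1, gF = 1.0, 0.5, 0.8        # W_F = bF0 + bF1*X + gF*Z_F + eps_F
bH0, bH1, gH = 0.0, 0.2, 1.0        # W_H = bH0 + bH1*X + gH*Z_H + eps_H
sig_f2, sig_h2, sig_fh = 1.0, 1.0, 0.5

x, zf, zh = rng.normal(size=(3, n))
eps = rng.multivariate_normal([0, 0], [[sig_f2, sig_fh], [sig_fh, sig_h2]], n)
wf = bF0 + bF1 * x + gF * zf + eps[:, 0]
wh = bH0 + bH1 * x + gH * zh + eps[:, 1]
sel = wf > wh                       # J = F

# Inverse Mills ratio evaluated at the (true) scaled index I*.
sig_nu = sqrt(sig_f2 + sig_h2 - 2 * sig_fh)
istar = ((bF0 - bH0) + (bF1 - bH1) * x + gF * zf - gH * zh) / sig_nu
Phi = 0.5 * (1 + np.vectorize(erf)(istar / sqrt(2)))
lam = (np.exp(-istar**2 / 2) / sqrt(2 * pi)) / Phi

# Step 2: OLS of W_F on X, Z_F, and lambda among workers.
A = np.column_stack([np.ones(sel.sum()), x[sel], zf[sel], lam[sel]])
coef, *_ = np.linalg.lstsq(A, wf[sel], rcond=None)

# For comparison, naive OLS that omits the selection correction.
B = np.column_stack([np.ones(sel.sum()), x[sel], zf[sel]])
naive, *_ = np.linalg.lstsq(B, wf[sel], rcond=None)
print(coef, naive)
```

The corrected regression recovers the wage-equation coefficients and the Mills-ratio coefficient $\rho = (\sigma_F^2 - \sigma_{FH})/\sigma_\nu$, while the naive regression on the selected sample is biased.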
Note that we do not require an exclusion restriction here. Since $\lambda$ is a nonlinear function while the index enters the wage equation linearly, this model is identified. However, without an exclusion restriction, identification is purely through functional form. When we consider a nonparametric version of the model below, exclusion restrictions are necessary. We discuss this issue in Section 3.2.
Our next goal is to estimate $\beta_H$ and $\gamma_H$. In Step 1 we obtained consistent estimates of $(\pi_X, \pi_F, \pi_H)$, and in Step 2 we obtained consistent estimates of $\beta_F$ and $\gamma_F$.

When there is only one exclusion restriction (i.e. $Z_F$ is a scalar), identification proceeds as follows. Because we identified $\gamma_F$ in Step 2 and $\pi_F = \gamma_F/\sigma_\nu$ in Step 1, we can identify $\sigma_\nu = \gamma_F/\pi_F$. Once $\sigma_\nu$ is identified, it is easy to see how to identify $\beta_H$ (because $\pi_X = (\beta_F - \beta_H)/\sigma_\nu$ and $\beta_F$ is identified) and $\gamma_H$ (because $\pi_H = -\gamma_H/\sigma_\nu$ and $\sigma_\nu$ are identified).
In terms of estimation of these objects, if there is more than one exclusion restriction the model is over-identified. If we have two exclusion restrictions, $Z_F$ and $\gamma_F$ are both 2 × 1 vectors, and thus we wind up with two consistent estimates of $\sigma_\nu$. The most standard way of solving this problem is by estimating the “structural probit:”

$$\Pr(J = F \mid X, Z_F, Z_H) = \Phi\left( \frac{(X'\hat\beta_F + Z_F'\hat\gamma_F) - X'\beta_H - Z_H'\gamma_H}{\sigma_\nu} \right). \quad (3.7)$$

That is, one just runs a probit of $J$ on $X'\hat\beta_F + Z_F'\hat\gamma_F$, $X$, and $Z_H$, where $\hat\beta_F$ and $\hat\gamma_F$ are our estimates of $\beta_F$ and $\gamma_F$.
Step 3 is essential if our goal is to estimate the labor supply equation. If we are only interested in controlling for selection to obtain consistent estimates of the wage equation, we do not need to worry about the structural probit. However, notice that

$$\Pr(J = F \mid X, Z_F, Z_H) = \Phi\left( \frac{(X'\beta_F + Z_F'\gamma_F) - X'\beta_H - Z_H'\gamma_H}{\sigma_\nu} \right),$$

and thus the response of participation to the systematic component of the offered log wage, $E(W_F \mid X, Z_F) = X'\beta_F + Z_F'\gamma_F$, is:

$$\frac{\partial \Pr(J = F \mid X, Z_F, Z_H)}{\partial E(W_F \mid X, Z_F)} = \frac{1}{\sigma_\nu}\, \phi\left( \frac{(X'\beta_F + Z_F'\gamma_F) - X'\beta_H - Z_H'\gamma_H}{\sigma_\nu} \right),$$

where, as before, $W_F$ is the log of income if working. Thus knowledge of $\sigma_\nu$ is essential for identifying the effects of wages on participation.

One could not estimate the structural probit without the exclusion restriction, as the first two components of the probit in Eq. (3.7) would be perfectly collinear. For any value of $\sigma_\nu$ we could find values of $\beta_H$ and $\gamma_H$ that deliver the same choice probabilities. Furthermore, if these parameters were not identified, the elasticity of labor supply with respect to wages would not be identified either.
Lastly, we identify all the components of the covariance matrix, $(\sigma_F^2, \sigma_H^2, \sigma_{FH})$, as follows. We have described how to obtain consistent estimates of $\sigma_\nu^2 = \sigma_F^2 + \sigma_H^2 - 2\sigma_{FH}$ and of the Step 2 coefficient $\rho = (\sigma_F^2 - \sigma_{FH})/\sigma_\nu$. This gives us two equations in three parameters. We can obtain the final equation by using the variance of the residual in the selection model, since

$$E(\xi^2 \mid X, Z_F, Z_H, J = F) = \sigma_F^2 - \rho^2 \lambda(I^*)\big[ \lambda(I^*) + I^* \big].$$

Let $\mathcal{F}$ index the set of individuals who choose $J = F$, let $N_F$ be their number, and let $\hat\xi_i$ be the residual from Eq. (3.6) for individuals who choose $J = F$. Using “hats” to denote estimators, we can estimate $\sigma_F^2$ as

$$\hat\sigma_F^2 = \frac{1}{N_F} \sum_{i \in \mathcal{F}} \left( \hat\xi_i^2 + \hat\rho^2 \lambda(\hat I_i^*)\big[ \lambda(\hat I_i^*) + \hat I_i^* \big] \right).$$

Given $\hat\sigma_F^2$, $\hat\rho$, and $\hat\sigma_\nu$, the remaining parameters follow from $\sigma_{FH} = \sigma_F^2 - \rho\sigma_\nu$ and $\sigma_H^2 = \sigma_\nu^2 - \sigma_F^2 + 2\sigma_{FH}$.
Although the parametric case with exclusion restrictions is more commonly known, the model in the previous section is still identified nonparametrically if the researcher is willing to impose stronger support conditions on the observable variables. Heckman and Honoré (1990, Theorem 12) provide conditions under which one can identify the model nonparametrically using exclusion restrictions. We present this case below.
Assumption 3.1. $(\varepsilon_F, \varepsilon_H)$ is continuously distributed with distribution function $G$, support $\mathbb{R}^2$, and is independent of $(X, Z_F, Z_H)$. The marginal distributions of $\varepsilon_F$ and of $\nu \equiv \varepsilon_H - \varepsilon_F$ have medians equal to zero.

Assumption 3.2. For any value of $x$ in the support of $X$, the support of $g_F(x, Z_F)$ is the full real line, and the support of $g_H(x, Z_H)$ is the full real line.

Assumption 3.2 is crucial for identification. It states that for any value of $(X, Z_H)$, $g_F(X, Z_F)$ varies across the full real line, and for any value of $(X, Z_F)$, $g_H(X, Z_H)$ varies across the full real line. This means that we can condition on a set of variables for which the probability of being a hunter (i.e. $\Pr(J = H \mid X, Z_F, Z_H)$) is arbitrarily close to 1. This is clearly a very strong assumption that we will discuss further.
We need the following two assumptions for the reasons discussed in Section 2.4.
Assumption 3.3. $(X, Z_F, Z_H)$ can be written as a vector of continuous elements (no point has positive mass) and discrete elements (all support points have positive mass).

Assumption 3.4. For any value of the discrete elements, $g_F$ and $g_H$ are almost surely continuous across the continuous elements.
Under these assumptions we can prove the theorem following Heckman and Honoré (1990).
Theorem 3.1. If ($J$; $W_F$ when $J = F$; $X$, $Z_F$, $Z_H$) are all observed and generated under model (3.1)-(3.4), then under Assumptions 3.1–3.4, $g_F$, $g_H$, and $G$ are identified on a set that has measure 1.
(Proof in Appendix.)
A key theme of this chapter is that the basic structure of identification in this model is similar to identification of more general selection models, so we explain this result in some detail. The basic structure of the proof we present below is similar to Heckman and Honoré’s proof of their Theorems 10 and 12. We modify the proof to allow for the case where $W_H$ is not observed.
The proof in the Appendix is more precise, but in the text we present the basic ideas. We follow a structure analogous to the parametric empirical approach when the residuals are normally distributed as presented in Section 3.1. First we consider identification of the occupational choice given only observable covariates and the choice model. This is the nonparametric analogue of the reduced form probit. Second we establish identification of $g_F$ given the data on fishermen’s wages, which is the analogue of the second stage of the Heckman two step, and is more broadly the nonparametric version of the classical selection model. In the third step we consider the nonparametric analogue of identification of the structural probit. Since we will have already established identification of $g_F$, identification of this part of the model boils down to identification of $g_H$. Finally, in the fourth step we consider identification of $G$ (the joint distribution of $(\varepsilon_F, \varepsilon_H)$). We discuss each of these steps in order.

To map the Roy model into our formal definition of identification presented in Section 2.2, the model is determined by $\theta = (g_F, g_H, G)$, where $G$ is the joint distribution of $(\varepsilon_F, \varepsilon_H)$. The observable data here is ($J$; $W_F$ when $J = F$; $X$, $Z_F$, $Z_H$). Thus $P_\theta$ is the joint distribution of this observable data, and $\Theta_0(P_0)$ represents the possible data generating processes consistent with $P_0$.
The nonparametric identification of this model is established in Matzkin (1992). We can write the choice model as

$$\Pr(J = F \mid X, Z_F, Z_H) = G_\nu\big( g_F(X, Z_F) - g_H(X, Z_H) \big),$$

where $G_\nu$ is the distribution function of $\nu \equiv \varepsilon_H - \varepsilon_F$.

Using data only on choices, this model is only identified up to a monotonic transformation. To see why, note that we can write $J = F$ when

$$g_F(X, Z_F) - g_H(X, Z_H) \geq \nu, \quad (3.8)$$

but this is equivalent to the condition

$$\tau\big( g_F(X, Z_F) - g_H(X, Z_H) \big) \geq \tau(\nu), \quad (3.9)$$

where $\tau$ is any strictly increasing function. Clearly the model in Eq. (3.8) cannot be distinguished from an alternative model in Eq. (3.9). This is the nonparametric analog of the problem that the scale (i.e., the variance of $\nu$) and location (only the difference between $g_F$ and $g_H$, but not the level of either) of the parametric binary choice model are not identified. Without loss of generality we can normalize the model up to a monotonic transformation. There are many ways to do this. A very convenient normalization is to choose the transformation $\tau = G_\nu$, because $U \equiv G_\nu(\nu)$ has a uniform distribution. 6 So we define

$$h(X, Z_F, Z_H) \equiv G_\nu\big( g_F(X, Z_F) - g_H(X, Z_H) \big).$$

Then

$$\Pr(J = F \mid X = x, Z_F = z_F, Z_H = z_H) = h(x, z_F, z_H).$$

Thus we have established that we can (i) write the model as $J = F$ if and only if $h(X, Z_F, Z_H) \geq U$, where $U$ is uniformly distributed on $[0, 1]$, and (ii) that $h$ is identified.
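The normalization works because of the probability integral transform: for any continuously distributed error, applying its own cdf yields a uniform variable on $[0, 1]$. A quick numerical check, using a normal error purely as an illustration:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
nu = rng.normal(0.0, 2.0, 100_000)          # any continuous error works

# Apply its own cdf (here N(0, 4), so cdf(x) = Phi(x / 2)).
u = 0.5 * (1 + np.vectorize(erf)(nu / (2.0 * sqrt(2))))

# Empirical cdf of U should lie on the 45-degree line.
grid = np.linspace(0.05, 0.95, 19)
ecdf = np.array([(u <= t).mean() for t in grid])
print(np.max(np.abs(ecdf - grid)))
```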
This argument can be mapped into our formal definition of identification from Section 2.2 above. The goal here is identification of $h$, so we define $f(\theta) = G_\nu(g_F - g_H) = h$. Note that even though $h$ is not explicitly part of $\theta$, it is a known function of the components of $\theta$. The key set now is $\Psi_0(P_0)$, which is now defined as the set of possible values of $h$ that could have generated the joint distribution of the observable data. Since $h(x, z_F, z_H) = \Pr(J = F \mid X = x, Z_F = z_F, Z_H = z_H)$, no other possible value of $h$ could generate the data. Thus $\Psi_0(P_0)$ only contains the true value of $h$ and is thus a singleton.
Next consider identification of $g_F$. Median regression identifies

$$\mathrm{Med}(W \mid X = x, Z_F = z_F, Z_H = z_H, J = F).$$

The goal is to identify $g_F$. The problem is that when we vary $(x, z_F)$ we also typically vary the conditional distribution of $\varepsilon_F$ among fishermen. This is the standard selection problem. Because we can add any constant to $g_F$ and subtract it from $\varepsilon_F$ without changing the model, a normalization that allows us to pin down the location of $g_F$ is that $\mathrm{Med}(\varepsilon_F) = 0$. The problem is that this is the unconditional median rather than the conditional one. The solution here is what is often referred to as identification at infinity (e.g. Chamberlain, 1986, or Heckman, 1990). For some value $(x, z_F)$ suppose we can find a value of $z_H$ that sends $\Pr(J = F \mid X = x, Z_F = z_F, Z_H = z_H)$ arbitrarily close to one. It is referred to as identification at infinity because if $g_H$ were linear in the exclusion restriction this could be achieved by sending $z_H \to -\infty$. In our fishing/hunting example, this could be sending the price of rabbits to zero, which in turn sends log income from hunting to $-\infty$. Then notice that 7

$$\mathrm{Med}(W \mid X = x, Z_F = z_F, Z_H = z_H, J = F) \to g_F(x, z_F)$$

as $\Pr(J = F \mid X = x, Z_F = z_F, Z_H = z_H) \to 1$. Thus $g_F$ is identified.
Conditioning on values of $z_H$ so that $\Pr(J = F \mid x, z_F, z_H)$ is arbitrarily close to one is essentially conditioning on a group of individuals for whom there is no selection, and thus there is no selection problem. Thus we are essentially saying that if we can condition on a group of people for whom there is no selection, we can solve the selection bias problem.
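The sketch below illustrates the argument in a simulated Roy model (the functional forms and parameter values are our own): among observations whose hunting offer is driven very low, essentially everyone fishes and the median observed wage recovers $g_F$, while with a moderate hunting offer the selected median is biased upward.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000

x = rng.uniform(0.0, 1.0, n)
g_f = 1.0 + x                                # true g_F(x); Med(eps_F) = 0
eps_f = rng.normal(0.0, 1.0, n)
eps_h = rng.normal(0.0, 1.0, n)

def selected_median_bias(z_h):
    """Median of W - g_F(x) among fishermen when g_H = z_h + eps_H."""
    wf = g_f + eps_f
    wh = z_h + eps_h
    fish = wf > wh
    return np.median((wf - g_f)[fish]), fish.mean()

bias_far, p_far = selected_median_bias(z_h=-10.0)   # Pr(J = F) ~ 1
bias_near, p_near = selected_median_bias(z_h=0.5)   # interior choice prob
print(bias_far, bias_near)
```

When the hunting offer is pushed far down, the selected sample is essentially the full population and the median bias vanishes; at an interior choice probability the fishermen are positively selected on $\varepsilon_F$.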
While this may seem like cheating, without strong functional form assumptions it is necessary for identification. To see why, suppose there is some upper bound of equal to which would prevent us from using this type of argument. Consider any potential worker with a value of . For those individuals it must be the case that
so they must always be hunters. As a result, the data are completely uninformative about the distribution of for these individuals. For this reason the unconditional median of would not be identified. We will discuss approaches to dealing with this problem in the Treatment Effect section below.
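The logic of identification at infinity can be checked in a small simulation. Everything below — the functional forms, parameter values, and the way the instrument enters — is our own illustrative assumption, not the chapter's:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

# Hypothetical Roy economy: correlated sector-specific unobservables.
eps_f, eps_h = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], n).T
z = rng.uniform(-4.0, 4.0, n)      # exclusion restriction: shifts fishing income only
w_f = 1.0 + z + eps_f              # latent log income as a fisherman
w_h = 0.5 + eps_h                  # latent log income as a hunter
fish = w_f > w_h                   # Roy selection: choose the higher income

# Naive: the median of observed fishing incomes near z = 0 is biased upward,
# because only those with favorable draws of eps_f select into fishing.
naive_bias = np.median(w_f[fish & (np.abs(z) < 0.1)]) - 1.0

# "At infinity": for large z, P(fish | z) is close to one, selection vanishes,
# and the conditional median recovers the median of the latent income equation.
tail = fish & (z > 3.8)
inf_bias = np.median(w_f[tail] - (1.0 + z[tail]))
```

In this design `naive_bias` is clearly positive while `inf_bias` is close to zero; if `z` were bounded well below its upper tail, no such clean subpopulation would exist, which is exactly the support problem discussed in the text.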
To relate this to the framework from Section 2.2 above, we now define , so contains the values of consistent with . However, since
is the only element of , and thus it is identified.
Identification of the slope only without “identification at infinity”
If one is only interested in identifying the “slope” of and not the intercept, one can avoid using an identification at infinity argument. That is, for any two points and , consider identifying the difference . The key to identification is the existence of the exclusion restriction . For these two points, suppose we can find values and so that
There may be many pairs of that satisfy this equality and we could choose any of them. Define . The key point is that since , the probability of being a fisherman is the same at the two sets of points, and thus the bias terms are also the same: .
As long as we have sufficient variation in we can do this everywhere and identify up to location.
In terms of identifying , the exclusion restriction that influences wages as a fisherman but not as a hunter (i.e. ) will be crucial. Consider identifying for any particular value . The key here is finding a value of so that
Assumption 3.2 guarantees that we can do this. To see why Eq. (3.10) is useful, note that it must be that for this value of
But the fact that has median zero implies that
Since is identified, is identified from this expression. 8
Again to relate this to the framework in Section 2.2 above, now and is the set of functions that are consistent with . Above we showed that if , then . Thus since we already showed that is identified, is the only element of .
Next consider identification of given and . We will show how to identify the joint distribution of closely following the exposition of Heckman and Taber (2008). Note that from the data one can observe
which is the cumulative distribution function of evaluated at the point . By varying the point of evaluation one can identify the joint distribution of from which one can derive the joint distribution of .
Finally in terms of the identification conditions in Section 2.2 above, now and is the set of distributions consistent with . Since is uniquely defined by the expression (3.12) and since everything else in this expression is identified, is the only element of .
For expositional purposes we focus on the case in which the observables are independent of the unobservables, but relaxing these assumptions is easy to do. The simplest case is to allow for a general relationship between and . To see how easy this is, consider a case in which is just binary, for example denoting men and women. Independence seems like a very strong assumption in this case. For example, the distribution of unobserved preferences might be different for women and men, leading to different selection patterns. In order to allow for this, we could identify and estimate the Roy model separately for men and for women. Expanding from binary to finite support is trivial, and going beyond that to continuous is straightforward. Thus one can relax the independence assumption easily. But for expositional purposes we prefer our specification.
The distinction between and was not important in steps 1 and 2 of our discussion above. When one is only interested in the outcome equation , relaxing the independence assumption between and can be done as well. However, in step 3 this distinction is important in identifying and the independence assumption is not easy to relax.
If we allow for general dependence between and , the “identification at infinity” argument becomes more important as the argument about “Identification of the Slope Only without Identification at Infinity” no longer goes through. In that case the crucial feature of the model was that . However, without independence this is no longer generally true because . Thus even if , when , in general .
We now show that the model is not identified in general without an exclusion restriction. 9 Consider a simplified version of the model,
where is uniform (0,1) and is independent of with distribution and we use the location normalization . As in Section 3.2, we observe , whether or , and if then we observe .
We can think about estimating the model from the median regression
Under the assumption that it must be the case that , but this is our only restriction on and . Thus the model above has the same conditional median as an alternative model
where and . Equations (3.13) and (3.14) are observationally equivalent. Without an exclusion restriction, it is impossible to tell if observed income from working varies with because it varies with or because it varies with the labor force participation rate and thus the extent of selection. Thus the models in Eqs (3.13) and (3.14) are not distinguishable using conditional medians.
To show the two models are indistinguishable using the full joint distribution of the data, consider an alternative data generating model with the same first stage, but now is determined by
where is independent of with . Let be the joint distribution of in the alternative model. We will continue to assume that in the alternative model . The question is whether the alternative model is able to generate the same data distribution.
In the alternative model
Thus these two models generate exactly the same joint distribution of data and cannot be separately identified as long as we define so that 10
We next consider the “Generalized Roy Model” (as defined in, e.g., Heckman and Vytlacil, 2007a). The basic Roy model assumes that workers only care about their income. The Generalized Roy Model allows workers to care about non-pecuniary aspects of the job as well. Let and be the utility that individual would receive from being a fisherman or a hunter respectively, where for ,
where represents the non-pecuniary utility gain from observables and . The variable allows for the fact that there may be other variables that affect the taste for hunting versus fishing directly, but do not affect wages in either sector. 11 Note that we are imposing separability between and . In general we can provide conditions in which the results presented here will go through if we relax this assumption, but we impose it for expositional simplicity. The occupation is now defined as
We continue to assume that
It will be useful to define a reduced form version of this model. Note that people fish when
In the previous section we described how the choice model can only be identified up to a monotonic transform and that assuming the error term is uniform is a convenient normalization. We do the same thing here. Let be the distribution function of . Then we define
As above, this normalization is convenient because it is straightforward to show that
and that is uniformly distributed on the unit interval.
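The uniform normalization is simply the probability integral transform. The following minimal sketch uses a logistic latent error chosen purely for illustration (our assumption, not the chapter's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Any continuous latent error works; the logistic is an arbitrary choice.
v = rng.logistic(loc=1.0, scale=2.0, size=n)
F = lambda t: 1.0 / (1.0 + np.exp(-(t - 1.0) / 2.0))  # its CDF, strictly increasing

u = F(v)      # transformed error: distributed Uniform(0, 1)

# Applying F to both sides leaves the choice rule unchanged: 1{g > v} = 1{F(g) > u}.
g = 0.5       # an arbitrary value of the choice index
same_choice = np.array_equal(g > v, F(g) > u)

# u is (up to sampling error) uniform: each decile holds roughly 10% of draws.
decile_shares = np.histogram(u, bins=10, range=(0.0, 1.0))[0] / n
```

Because the CDF is strictly increasing, the observed choices are identical under the original and transformed errors, which is why the uniform normalization is without loss of generality.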
We assume that the econometrician can observe the occupations of the workers and the wages that they receive in their chosen occupations as well as .
It turns out that the basic assumptions that allow us to identify the Roy model also allow us to identify the generalized Roy model.
We start with the reduced form model, for which we need two more assumptions.
Assumption 4.1. is continuously distributed and is independent of . Furthermore, is distributed uniform on the unit interval and the medians of both and are zero.
Assumption 4.2. The support of is for all .
We also slightly extend the restrictions on the functions to include and .
Assumption 4.3. can be written as where the elements of are continuously distributed (no point has positive mass), and are distributed discretely (all support points have positive mass).
Assumption 4.4. For any ,, , and are almost surely continuous across
Theorem 4.1. Under Assumptions 4.1–4.4, and the joint distribution of and of are identified from the joint distribution of on a set that has measure 1, where are generated by model (4.1)–(4.4).
(Proof in Appendix.)
The intuition for identification follows directly from that given for the basic Roy model. We show this in three steps:
so identification of from comes directly.
The analogous argument works for when we send .
The analogous argument works for the joint distribution of .
Note that not all parameters are identified, such as the non-pecuniary gain from fishing . To identify the “structural” generalized Roy model we make two additional assumptions:
Assumption 4.5. The median of is zero.
Assumption 4.6. For any value of , has full support (i.e. the whole real line).
Theorem 4.2. Under Assumptions 4.1–4.6, , the distribution of , and the distribution of are identified.
(Proof in Appendix.)
Note that Theorem 4.1 gives the joint distribution of while Theorem 4.2 gives the joint distribution of . Since , this really just amounts to saying that is identified.
Furthermore, whereas and are identified in Theorem 4.1, is identified in Theorem 4.2. Recall is the added utility (measured in money) of being a fisherman relative to a hunter. The exclusion restrictions and help us identify this. These exclusion restrictions allow us to vary the pecuniary gains of the two sectors, holding preferences constant. Identification is analogous to the “Step 3: identification of ” in the standard Roy model. To see where identification comes from, for every think about the following conditional median
Since the median of is zero, this means that
and thus
Because and is identified, is identified also. The argument above shows that we do not need both and ; we only need or .
Suppose there is no variable that affects earnings in one sector but not preferences ( or ). An alternative way to identify is to use a cost measured in dollars. Consider the linear version of the model with normal errors and without exclusion restrictions so that
The reduced form probit is:
where is the standard deviation of . Theorem 4.1 above establishes that the functions and (i.e., and ) as well as the variance of and are identified. We still need to identify , and . Thus we are able to identify
If and are scalars we still have three parameters () and two restrictions . If they are not scalars, we still have one more parameter than restriction. However, suppose that one of the exclusion restrictions represents a cost variable that is measured in the same units as . For example, in a schooling case suppose that represents the present value of earnings as a college graduate, represents the present value of earnings as a high school graduate, and the exclusion restriction, , represents the present value of college tuition. In this case the coefficient on is , so is identified. Given , it is very easy to show that the rest of the parameters are identified as well. Heckman et al. (1998) provide an example of this argument using tuition in the style above. In Section 7.3 we discuss Heckman and Navarro (2007), who use this approach as well.
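To make this argument concrete, here is a sketch in notation we introduce for illustration (none of it is the chapter's own): let $Y_c$ and $Y_h$ denote the present values of earnings as a college and high school graduate, $T$ tuition measured in the same dollars, and $\eta$ the preference error with standard deviation $\sigma_\eta$. The schooling choice and its reduced-form probit are

\[
  S \;=\; \mathbf{1}\{\,\beta_c'x - \beta_h'x - T + \eta \;>\; 0\,\},
  \qquad
  \Pr(S = 1 \mid x, T) \;=\; \Phi\!\left(\frac{(\beta_c - \beta_h)'x - T}{\sigma_\eta}\right).
\]

The probit identifies only the ratios $(\beta_c - \beta_h)/\sigma_\eta$ and $-1/\sigma_\eta$; but because tuition is known to enter with a structural coefficient of $-1$, the reduced-form coefficient on $T$ reveals $\sigma_\eta$, and the remaining parameters follow.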
In pointing out what is identified in the model, it is also important to point out what is not. Most importantly, in the generalized Roy model we were able to identify the joint distribution between the error term in the selection equation and each of the outcome errors, but not the joint distribution of the error terms across the two outcome equations. In particular, the joint distribution between the sector-specific error terms is not identified. Even strong functional form assumptions will not solve this problem. For example, it is easy to show that in the joint normal model the covariance of is not identified.
As the theorems above make clear, nonparametric identification requires exclusion restrictions. However, completely parametric models typically do not require exclusion restrictions. In specific empirical examples, identification could be coming primarily from the exclusion restriction or primarily from the functional form assumptions (or some combination of the two). When researchers use exclusion restrictions in data, it is important to be clear about which assumptions are actually driving identification.
We describe one example from Altonji et al. (2005b). Based on Evans and Schwab (1995), Neal (1997), and Neal and Grogger (2000) they consider a bivariate probit model of Catholic schooling and college attendance.
where is the indicator function taking the value one if its argument is true and zero otherwise, is a dummy variable indicating attendance at a Catholic school, and is a dummy variable indicating college attendance. Identification of the effect of Catholic schooling on college attendance (or high school graduation) is the primary focus of these studies. The question at hand is whether, in practice, the assumed functional forms for and are important for identifying the coefficient and thus the effect of Catholic schools on college attendance.
The model in Eqs (4.7) and (4.8) is a minor extension of the generalized Roy model. The first key difference is that the outcome variable in Eq. (4.8) is binary (attend college or not), whereas in the case of the Generalized Roy model the outcomes were continuous (earnings in either sector). The second key difference is that the outcome equations for Catholic versus non-Catholic school differ only in the intercept (). The error term () and the slope coefficients () are restricted to be the same. Nevertheless, the machinery to prove non-parametric identification of the Generalized Roy model can be applied to this framework. 12
Using data from the National Longitudinal Survey of 1972, Altonji et al. (2005b) consider an array of instruments and different specifications for Eqs (4.7) and (4.8). In Table 1 we present a subset of their results. We show four different models. The “Single Equation Model” gives results in which selection into Catholic school is not accounted for. The first column gives results from a probit model (with point estimates, standard errors, and marginal effects). The second column gives results from a linear probability model. Next we present the estimates of from bivariate probit models with alternative exclusion restrictions. The final row presents the results with no exclusion restrictions. Finally, we present results from instrumental variable linear probability models with the same set of exclusion restrictions.
Single equation models
| | Probit | OLS |
| Marginal effect | 0.239 [0.640] (0.198) | 0.239 (0.070) |

Two equation models
| Excluded variable | Bivariate probit | 2SLS |
| Catholic | 0.285 [0.761] (0.543) | −0.093 (0.324) |
| Catholic × Distance | 0.478 [1.333] (0.516) | 2.572 (2.442) |
| None | 0.446 [1.224] (0.542) | |
Urban Non-Whites from NLS-72. The first set of results comes from simple probits and from OLS. The remaining results come from bivariate probits and from two stage least squares. We present the marginal effect of Catholic high school attendance on college attendance. [Point estimate from probit in brackets.] (Standard errors in parentheses.) Source: Altonji et al. (2005b).
One can see that the marginal effect from the single equation probit is very similar to the OLS estimate. It indicates that college attendance rates are approximately 23.9 percentage points higher for Catholic high school graduates than for public high school graduates. The rest of the table presents results from three bivariate probit models and two instrumental variables models using alternative exclusion restrictions. The problem is clearest when the interaction between the student being Catholic and distance to the nearest Catholic school is used as an instrument. The 2SLS gives nonsensical results: a coefficient of 2.572 with an enormous standard error. This indicates that the instrument has little power. However, the bivariate probit result is more reasonable. It suggests that the true marginal causal effect is around 0.478 and the point estimate is statistically significant. This seems inconsistent with the 2SLS results, which indicated that this exclusion restriction had very little power. However, it becomes clear what is going on when we compare this result to the model at the bottom of the table without an exclusion restriction. The estimate is very similar, with a similar standard error. The linearity and normality assumptions drive the results.
The case in which Catholic religion by itself is used as an instrument is less problematic. The IV result suggests strong positive selection but still yields a large standard error. The bivariate probit model suggests a marginal effect that is a bit larger than the OLS effect. However, note that the standard errors for the models with and without an exclusion restriction are quite similar, which seems inconsistent with the idea that the exclusion restriction is providing a lot of identifying information. Further note that the IV result suggests a strong positive selection bias while the bivariate probit without exclusion restrictions suggests a strong negative bias. The bivariate probit in which Catholic is excluded is somewhere between the two. This suggests that both functional form and exclusion restrictions are important in this case. We should emphasize the “suggests” part of this sentence, as none of this is a formal test. It does, however, make one wonder how much trust to put in the bivariate probit results by themselves.
Another paper documenting the importance of functional form assumptions is Das et al. (2003), who estimate the return to education for young Australian women. They estimate equations for years of education, the probability of working, and wages. When estimating the wage equation they address both the endogeneity of years of education and the selection that arises because we only observe wages for workers. They allow for flexibility in the returns to education (where the return depends on years of education) and also in the distribution of the residuals. They find that when they assume normality of the error terms, the return to education is approximately 12%, regardless of years of education. However, once they allow for more flexible functional forms for the error terms, they find that the returns to education decline sharply with years of education. For example, they find that at 10 years of education, the return to education is over 15%. However, at 14 years, the return to education is only about 5%.
There is a very large literature on the estimation of treatment effects. For more complete summaries see Heckman and Robb (1986), Heckman et al. (1999), Heckman and Vytlacil (2007a,b), Abbring and Heckman (2007), or Imbens and Wooldridge (2009). 13 DiNardo and Lee (2011) provide a discussion that is complementary to ours. Our goal in this section is not to survey the whole literature but to provide a brief summary and put it into the context of identification of the Generalized Roy Model.
The goal of this literature is to estimate the value of receiving a treatment defined as:
In the context of the Roy model, is the income gain from moving from hunting to fishing. This income gain potentially varies across individuals in the population. Thus for people who choose to be fishermen, is positive and for people who choose to be hunters, is negative.
Estimation of treatment effects is of great interest in many literatures. The term “treatment effect” makes the most sense in the context of the medical literature. Choice could represent taking a medical treatment (such as an experimental drug) while could represent no treatment. In that case and would represent some measure of health status for individual with and without the treatment. Thus the treatment effect is the effect of the drug on the health outcome for individual .
The classic example in labor economics is job training. In that case, would represent a labor market outcome for individuals who received training and would represent the outcome in the absence of training.
In both the case of drug treatment and job training, empirical researchers have exploited randomized trials. Medical patients are often randomly assigned either a treatment or a placebo (i.e., a sugar pill that should have no effect on health). Likewise, many job training programs are randomly assigned. For example, in the case of the Job Training Partnership Act, a large number of unemployed individuals applied for job training (see e.g. Bloom et al., 1997). Of those who applied for training, some were assigned training and some were assigned no training.
Because assignment is random and affects the level of treatment, one can treat assignment as an exclusion restriction that is correlated with treatment (i.e., the probability that ) but is uncorrelated with preferences or ability because it is random. In this sense, random assignment solves the selection problem that is the focus of the Roy model. As we show below, exogenous variation provided by experiments allows the researcher to cleanly identify some properties of the distribution of and under relatively weak assumptions. Furthermore, the methods for estimating these objects are simple, which adds to their appeal.
The treatment effect framework is also widely used for evaluating quasi-experimental data. By quasi-experimental data, we mean data that are not experimental but exploit variation that is “almost as good as” random assignment.
Within the context of the generalized Roy model note that in general
An important special case of the treatment effect defined in Eq. (5.1) is when
In this case, the treatment effect is a constant across individuals. Identification of this parameter is relatively straightforward. However, there is a substantial literature that studies identification of heterogeneous treatment effects. As we point out above, treatment effects are positive for some people and negative for others in the context of the Roy model. Furthermore, there is ample empirical evidence that the returns to job training are not constant, but instead vary across the population (Heckman et al., 1999).
In Section 4.2 we explained why the joint distribution of is not identified. This means that the distribution of is not identified, and even relatively simple summary statistics like the median of this distribution are not identified in general. The key problem is that even when assignment is random, we do not observe the same people in both occupations.
Since the full generalized Roy model is complicated, hard to describe, and very demanding in terms of data, researchers often focus on a single summary statistic. The most common in this literature is the Average Treatment Effect (ATE), defined as
From Theorem 4.1 we know that (under the assumptions of that theorem) the distribution of and are identified. Thus, their expected values are also identified under the one additional assumption that these expected values exist.
Assumption 5.1. The expected values of and are finite.
Theorem 5.1. Under the assumptions of Theorem 4.1 and Assumption 5.1, the Average Treatment Effect is identified.
(Proof in Appendix.)
To see where identification of this object comes from, abstract from so that the only observable is , which affects the non-pecuniary gain in utility across occupations. With experimental data, could be randomly generated assignments to occupation. Notice that
Thus the exclusion restriction is the key to identification. Note also that we need groups of individuals where (who are always fishermen) and (who are always hunters); thus “identification at infinity” is essential as well. For the reasons discussed in the nonparametric Roy model above, if were never higher than some then would not be identified. Similarly if were never lower than some , then would not be identified.
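A simulation illustrates why both tails of the support are needed. All functional forms and parameter values below are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# Potential outcomes with correlated unobservables; true ATE = 1.
eps1, eps0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], n).T
y1 = 1.0 + eps1                    # outcome as a fisherman
y0 = 0.0 + eps0                    # outcome as a hunter
z = rng.uniform(-5.0, 5.0, n)      # non-pecuniary instrument with wide support
d = (y1 - y0 + z) > 0              # selection on gains, shifted by z
ate = np.mean(y1 - y0)             # infeasible benchmark: uses both outcomes

# Each tail of z identifies one unconditional mean:
ey1 = y1[d & (z > 4.5)].mean()     # P(d=1 | z) ~ 1 here: no selection into fishing
ey0 = y0[~d & (z < -4.5)].mean()   # P(d=1 | z) ~ 0 here: no selection into hunting
ate_hat = ey1 - ey0
```

If `z` were truncated at, say, 1, the cell behind `ey1` would be empty and the ATE would not be identified, though objects like the LATE discussed below still would be.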
While one could directly estimate the ATE using “identification at infinity”, as described above, this is not the common practice and not something we would advocate. The standard approach would be to estimate the full Generalized Roy Model and then use it to simulate the various treatment effects. This is often done using a completely parametric approach as in, for example, the classic paper by Willis and Rosen (1979). However, there are quite a few nonparametric alternatives as well, including construction of the Marginal Treatment effects as discussed in Sections 5.3 and 5.4 below.
As it turns out, even with experimental data, it is rarely the case that is identically one or zero with positive probability. In the case of medicine, some people assigned the treatment do not take the treatment. In the training example, many people who are offered subsidized training decide not to undergo the training. Thus, when compliance with assignment is less than 100%, we cannot recover the . In Section 5.2 we discuss more precisely what we do recover when there is less than 100% compliance.
It is also instructive to relate the ATE to instrumental variables estimation. Let be the outcome of interest
and let be a dummy variable indicating whether . Consider estimating the model
using instrumental variables with as an instrument for . Assume that is correlated with but not with or . Consider first the constant treatment effect model described in Eqs (5.2) and (5.3) so that for everyone in the population. In that case
Then two stage least squares on the model above yields
Thus in the constant treatment effect model, instrumental variables provide a consistent estimate of the treatment effect. However, this result does not carry over to heterogeneous treatment effects or the average treatment effect, as Heckman (1997) shows. Following the expression above we get
in general. In Sections 5.2 and 5.3 below, we describe what instrumental variables identify.
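Heckman's (1997) point — that IV need not recover the ATE when gains are heterogeneous and people select on them — can be seen in a simulation (all parameter values are illustrative assumptions of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Individual gains correlated with take-up; true ATE = 1.
gain = rng.normal(1.0, 1.0, n)           # heterogeneous treatment effect
cost = rng.normal(0.0, 1.0, n)           # idiosyncratic cost of treatment
y0 = rng.normal(0.0, 1.0, n)
z = rng.integers(0, 2, n)                # randomized binary instrument
d = (gain - cost + 2.0 * z) > 1.5        # selection on gains: takers are atypical
y = y0 + d * gain

# Wald / IV estimate:
iv = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
ate = gain.mean()
```

Here `iv` falls noticeably short of `ate`: it averages gains only over those whose treatment status responds to `z`, the group formalized as compliers in the next subsection.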
In practice there are two potential problems with the assumptions behind Theorem 5.1 above
We discuss a number of different approaches, some of which assume an exclusion restriction but relax the support conditions and others that do not require exclusion restrictions.
Imbens and Angrist (1994) and Angrist et al. (1996) consider identification when the support of takes on a finite number of points. They show that when varying the instrument over this range, they can identify what they call a Local Average Treatment Effect. Furthermore, they show how instrumental variables can be used to estimate it. It is again easiest to think about this problem after abstracting from , as it is straightforward to condition on these variables (see Imbens and Angrist, 1994, for details). For simplicity’s sake, consider the case in which the instrument is binary and takes on the values . In many cases not only is the instrument discrete, but it is also binary. For example, in randomized medical trials, represents assignment to treatment, whereas represents assignment to the placebo. In job training programs, represents assignment to the training program, whereas represents no assigned training.
It is important to point out that not all patients assigned treatment actually receive the treatment. Thus if the patient actually takes the drug and if the individual does not take the drug. Likewise, not all individuals who are assigned training actually receive the training, so if the individual goes to training and if she does not. The literature on Local Average Treatment Effects handles this case as well as many others. However, we do require that the instrument of assignment has power: . Without loss of generality we will assume that .
Using the reduced form version of the generalized Roy model the choice problem is
where is uniformly distributed.
The following six objects can be learned directly from the data:
The above equations show that our earlier assumption that implies . This, combined with the structure embedded in Eq. (5.6) means that
so an individual who is a fisherman when is also a fisherman when . Similar reasoning implies . Using this and Bayes’ rule yields
Using the fact that , one can show that
Combining Eq. (5.10) with Eqs (5.8) and (5.9) yields
Rearranging Eq. (5.11) shows that we can identify
since everything on the right hand side is directly identified from the data.
Using the analogous argument one can show that
is identified. But this means that we can identify
which Imbens and Angrist (1994) define as the Local Average Treatment Effect. This is the average treatment effect for the group of individuals who would alter their treatment status if their value of changed. Given the variation in , this is the only group for whom we can identify a treatment effect. Any individual in the data with would never choose , so the data are silent about . Similarly, the data are silent about .
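The Wald construction can be verified in a simulation of the reduced-form latent index model (functional forms and parameter values are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

# Latent index model: d = 1{u < p(z)} with u uniform, so there are no defiers.
u = rng.uniform(0.0, 1.0, n)
z = rng.integers(0, 2, n)
p = np.where(z == 1, 0.5, 0.1)               # the instrument has power
d = u < p
gain = 2.0 - 2.0 * u + rng.normal(0.0, 1.0, n)  # gains fall with u; ATE = 1
y0 = rng.normal(0.0, 1.0, n)
y = y0 + d * gain

# LATE estimator built only from quantities observable in the data:
late_hat = (y[z == 1].mean() - y[z == 0].mean()) / \
           (d[z == 1].mean() - d[z == 0].mean())

# Infeasible benchmark: average gain among compliers (0.1 <= u < 0.5).
compliers = (u >= 0.1) & (u < 0.5)
late_true = gain[compliers].mean()
```

The estimator recovers the compliers' average gain (about 1.4 in this design) rather than the population ATE of 1, exactly as the derivation above indicates.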
Imbens and Angrist (1994) also show that the standard linear Instrumental Variables estimator yields consistent estimates of Local Average Treatment Effects. Consider the instrumental variables estimator of Eq. (5.4)
In Eq. (5.5) we showed that
Let denote the probability that . The numerator of the above expression is
where the key simplification comes from the fact that
Next consider the denominator
Imbens and Angrist never explicitly use the generalized Roy model or the latent index framework. Instead, they write their problem only in terms of the choice probabilities. However, in order to do this they must make one additional assumption. Specifically, they assume that if when then when . Thus changing to never causes some people to switch from fishing to hunting. It only causes people to switch from hunting to fishing. They refer to this as a monotonicity assumption. Vytlacil (2002) points out that this is implied by the latent index model when the index is separable from , as we assumed in Eq. (5.6). As is implied by Eq. (5.7), increasing the index will cause some people to switch from hunting to fishing, but not the reverse. 14
Throughout, we use the latent index framework that is embedded in the Generalized Roy model, for three reasons. First, we can appeal to the identification results of the Generalized Roy model. Second, the latent index can be interpreted as the added utility from making a decision. Thus we can use the estimated model for welfare analysis. Third, placing the choice in an optimizing framework allows us to test the restrictions on choice that come from the theory of optimization.
As we have pointed out, not everyone offered training actually takes the training. For example, in the case of the JTPA, only 60% of those offered the training actually received it (Bloom et al., 1997). Presumably, those who took the training are those who stood the most to gain from it. For example, the reason that many people do not take training is that they receive a job offer before training begins. For these people, the training may have been of relatively little value. Furthermore, 2% of those who applied for but were not assigned to the training program wound up receiving the training (Bloom et al., 1997). Angrist et al. (1996) refer to those who were assigned training but did not take it as never-takers. Those who receive the training whether or not they are assigned are always-takers. Those who receive the training only when assigned the training are compliers. In terms of the latent index framework, the never-takers are those for whom , the compliers are those for whom , and the always-takers are those for whom .
The monotonicity assumption embedded in the latent index framework rules out the existence of a final group: the defiers. In the context of training, this would be an individual who receives training when not assigned training but would not receive training when assigned. At least in the context of training programs (and many other contexts) it seems safe to assume that there are no defiers.
Heckman and Vytlacil (1999, 2001, 2005, 2007b) develop a framework that is useful for constructing many types of treatment effects. They focus on the marginal treatment effect (MTE), defined in our context as

Δ^MTE(x, u) = E(Y_fi − Y_hi | X_i = x, u_i = u).
They show formally how to identify this object. We present their methodology using our notation.
Note that if we allow for regressors X_i and let the exclusion restriction Z_i take on values beyond zero and one, then if p_1 and p_2 are in the support of P(Z_i), Eq. (5.12) can be rewritten as

E(Y_i D_i | X_i = x, P(Z_i) = p) = ∫_0^p E(Y_fi | X_i = x, u_i = u) du
for p ∈ {p_1, p_2}. Now notice that for any p_1 < p_2,

[E(Y_i D_i | X_i = x, P(Z_i) = p_2) − E(Y_i D_i | X_i = x, P(Z_i) = p_1)] / (p_2 − p_1) = E(Y_fi | X_i = x, p_1 < u_i ≤ p_2).
Thus if p is in the support of P(Z_i), then E(Y_fi | X_i = x, u_i = p) is identified by taking p_1 and p_2 close to p. Since the model is symmetric, under similar conditions E(Y_hi | X_i = x, u_i = p) is identified as well. Finally, since

Δ^MTE(x, p) = E(Y_fi | X_i = x, u_i = p) − E(Y_hi | X_i = x, u_i = p),
the marginal treatment effect is identified.
The marginal treatment effect is interesting in its own right. It is the value of the treatment for an individual with X_i = x and u_i = u. In addition, it is also useful because the different types of treatment effects can be written in terms of the marginal treatment effect. For example

ATE(x) = E(Y_fi − Y_hi | X_i = x) = ∫_0^1 Δ^MTE(x, u) du.
One can see from this expression that without full support the ATE will not be identified, because Δ^MTE(x, u) will not be identified everywhere.
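To make the relationship concrete, here is a minimal numerical sketch in Python. The MTE function used is purely hypothetical (an assumption for illustration, not anything estimated in the literature); the point is only that the ATE averages the MTE over the uniformly distributed u.

```python
# Hypothetical MTE: individuals with low u (those most eager to take
# the treatment) are assumed to gain the most. The functional form
# 4 - 3u is an illustrative assumption, not an estimate.
def mte(u):
    return 4.0 - 3.0 * u

def ate_from_mte(mte_fn, n=100_000):
    """ATE = integral of the MTE over u in [0, 1] (midpoint rule)."""
    h = 1.0 / n
    return sum(mte_fn((i + 0.5) * h) for i in range(n)) * h

print(round(ate_from_mte(mte), 4))  # integral of 4 - 3u over [0, 1] = 2.5
```

Identifying the ATE this way requires knowing the MTE at every u in [0, 1], which is exactly the full support requirement discussed above.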
Heckman and Vytlacil (2005) also show that the instrumental variables estimator defined in Eq. (5.5) (conditional on X_i = x) is

β_IV(x) = ∫_0^1 Δ^MTE(x, u) ω(x, u) du,
where they give an explicit functional form for the weights ω(x, u). It is complicated enough that we do not repeat it here, but it can be found in Heckman and Vytlacil (2005).
This framework is also useful for seeing what is not identified. In particular, if P(Z_i) does not have full support, so that it is bounded away from zero or one, the average treatment effect will not be identified. However, many other interesting treatment effects can be identified. For example, the Local Average Treatment Effect in a model with no regressors is

LATE(p_1, p_2) = (1 / (p_2 − p_1)) ∫_{p_1}^{p_2} Δ^MTE(u) du.
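The LATE is simply the average of the MTE over the slice of u that the instrument can move. A small self-contained sketch, again with a purely hypothetical MTE function chosen only for illustration:

```python
# Hypothetical MTE, for illustration only.
def mte(u):
    return 4.0 - 3.0 * u

def late(mte_fn, p1, p2, n=100_000):
    """LATE(p1, p2) = average of the MTE over u in (p1, p2]:
    the effect for compliers, i.e. those with p1 < u <= p2."""
    h = (p2 - p1) / n
    total = sum(mte_fn(p1 + (i + 0.5) * h) for i in range(n)) * h
    return total / (p2 - p1)

# Compliers when the instrument moves the propensity from 0.2 to 0.6:
print(round(late(mte, 0.2, 0.6), 4))  # average of 4 - 3u over (0.2, 0.6] = 2.8
```

Note that late(mte, 0, 1) would equal the ATE; with limited support of the propensity score, only the narrower averages are identified.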
More generally, in this series of papers, Heckman and Vytlacil show that the marginal treatment effect can be used to organize many ideas in the literature. One interesting case is policy effects. They define the policy relevant treatment effect as the average effect of the treatment induced by a particular policy. They show that if the relationship between the policy and the observable covariates is known, the policy relevant treatment effect can be identified from the marginal treatment effects.
Heckman and Vytlacil (1999, 2001, 2005) suggest procedures to estimate the marginal treatment effect. They suggest what they call “local instrumental variables.” Using our notation for the generalized Roy model in which D_i = 1 when P(Z_i) ≥ u_i, where u_i is uniformly distributed on [0, 1], they show that

Δ^MTE(x, p) = ∂E(Y_i | X_i = x, P(Z_i) = p) / ∂p.
To see why this is the same definition of MTE as in Eq. (5.15), note that

E(Y_i | X_i = x, P(Z_i) = p) = ∫_0^p E(Y_fi | X_i = x, u_i = u) du + ∫_p^1 E(Y_hi | X_i = x, u_i = u) du,

so differentiating with respect to p yields E(Y_fi | X_i = x, u_i = p) − E(Y_hi | X_i = x, u_i = p), which is the marginal treatment effect.
Thus one can estimate the marginal treatment effect in three steps. First estimate P(Z_i), second estimate E(Y_i | X_i = x, P(Z_i) = p) using some type of nonparametric regression approach, and third take the derivative with respect to p.
Because, as a normalization, u_i is uniformly distributed,

Pr(D_i = 1 | Z_i = z) = Pr(u_i ≤ P(z)) = P(z).

Thus we can estimate P(z) from a nonparametric regression of D_i on Z_i.
A very simple way to do this is to use a linear probability model of D_i regressed on a polynomial in Z_i. By letting the number of terms in the polynomial grow with the sample size, this can be considered a nonparametric estimator. For the second stage we regress the outcome on a polynomial in our estimate of P(Z_i). To see how this works, consider the case in which both polynomials are quadratic. We would use the following two stage least squares procedure:

D_i = π_0 + π_1 Z_i + π_2 Z_i² + η_i,    (5.17)
Y_i = α_0 + α_1 P̂_i + α_2 P̂_i² + ε_i,    (5.18)
where P̂_i is the predicted value from the first stage. The coefficient α_2 may not be 0 because as we change the instrument it affects different groups of people. The MTE is the derivative of E(Y_i | P(Z_i) = p) with respect to p. For the case above the MTE is:

Δ^MTE(p) = α_1 + 2 α_2 p.    (5.19)
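The two-stage procedure can be sketched as follows. The data generating process is invented for illustration (the propensity P(z) = z and the true MTE of 2 − 2u are assumptions), but the estimation steps are the quadratic two stage procedure described in the text: a linear probability model of D on (Z, Z²), then a regression of Y on the fitted probability and its square, with the MTE read off as α_1 + 2 α_2 p.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulated generalized Roy model (all functional forms are
# illustrative assumptions): D = 1 when u <= P(Z) with P(z) = z,
# and the true MTE is 2 - 2u.
z = rng.uniform(size=n)
u = rng.uniform(size=n)
d = (u <= z).astype(float)
y_h = 0.5 + rng.normal(scale=0.1, size=n)   # outcome if untreated
y_f = y_h + 2.0 - 2.0 * u                   # outcome if treated
y = d * y_f + (1 - d) * y_h

# First stage: linear probability model of D on a quadratic in Z.
X1 = np.column_stack([np.ones(n), z, z ** 2])
pi = np.linalg.lstsq(X1, d, rcond=None)[0]
p_hat = X1 @ pi

# Second stage: regress Y on a quadratic in the fitted probability.
X2 = np.column_stack([np.ones(n), p_hat, p_hat ** 2])
alpha = np.linalg.lstsq(X2, y, rcond=None)[0]

# MTE(p) = alpha_1 + 2 * alpha_2 * p; the truth here is 2 - 2p.
mte_hat = lambda p: alpha[1] + 2 * alpha[2] * p
print(round(mte_hat(0.25), 2), round(mte_hat(0.75), 2))
```

With this design E(Y | P(Z) = p) = 0.5 + 2p − p², so the quadratic second stage is exactly correctly specified and the estimates of (α_1, α_2) should be close to (2, −1).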
Although the polynomial procedure above is transparent, the most common technique used to estimate the MTE is local linear regression.
French and Song (2010) estimate the labor supply response to Disability Insurance (DI) receipt for DI applicants. Individuals are deemed eligible for DI benefits if they are “unable to engage in substantial gainful activity”—i.e., if they are unable to work. Beneficiaries receive, on average, $12,000 per year, plus Medicare health insurance. Thus, there are strong incentives to apply for benefits. Beneficiaries continue to receive these benefits only if they earn less than a certain amount per year ($10,800 in 2007). For this reason, the DI system likely has strong labor supply disincentives. A healthy DI recipient is unlikely to work if that causes the loss of DI and health insurance benefits.
The DI system attempts to allow benefits only to those who are truly disabled. Many DI applicants have their case heard by a judge who determines whether they are truly disabled. Some applicants appear more disabled than others. The most disabled applicants are unable to work, and thus will not work whether or not they get the benefit. In less serious cases, the applicant will work, but only if she is denied benefits. The question, then, is what is the optimal threshold level of observed disability above which the individual is allowed benefits? Given the definition of disability, this threshold should depend on the probability that an individual does not work even when denied the benefit. Furthermore, optimal taxation arguments suggest that benefits should be given to groups whose labor supply is insensitive to benefit allowance. Thus the effect of DI allowance on labor supply is of great interest to policy makers.
OLS is likely to be inconsistent because those who are allowed benefits are likely to be less healthy than those who are denied. Those allowed benefits would have had low earnings even if they did not receive benefits. French and Song propose an IV estimator using the process of assignment of cases to judges. Cases are assigned to judges on a rotational basis within each hearing office, which means that for all practical purposes, judges are randomly assigned to cases conditional on the hearing office and the day. Some judges are much more lenient than others. For example, the least lenient 5% of all judges allow benefits in less than 45% of the cases they hear, whereas the most lenient 5% of all judges allow benefits in 80% of the cases they hear. Although some of those who are denied benefits appeal and get benefits later, most do not. If assignment of cases to judges is random, then judge assignment is a plausibly exogenous instrument. Furthermore, as long as judges vary in terms of leniency and not in their ability to detect individuals who are disabled, 15 the instrument can identify a MTE.
French and Song use a two stage procedure. In the first stage they estimate the probability that an individual is allowed benefits, conditional on the average judge specific allowance rate. They estimate a version of Eq. (5.17) where D_i is an indicator equal to 1 if case i was allowed benefits and Z_i is the average allowance rate of the judge who heard case i. In the second stage they estimate earnings conditional on whether the individual was allowed benefits (as predicted by the judge specific allowance rate). They estimate a version of Eq. (5.18) where Y_i is annual earnings 5 years after assignment to a judge. Figure 1 shows the estimated MTE (using the formula in Eq. (5.19)) for several different specifications of the polynomials in the first and second stage equations. Assuming that the treatment effect is constant (i.e., α_2 = 0), they find that annual earnings 5 years after assignment to a judge are $1500 for those allowed benefits and $3900 for those denied benefits, so the estimated treatment effect is a $2400 decline in earnings. This is the MTE-linear case in Fig. 1. However, this masks considerable heterogeneity in the treatment effects. They find that when allowance rates rise, the labor supply response of the marginal case also rises. When allowing the quadratic term to be non-zero, they find that for less lenient judges (who allow 45% of all cases) the MTE is a $1800 decline in earnings, while for more lenient judges (who allow 80% of all cases) the MTE is a $3200 decline in earnings. This pattern is consistent with the notion that as allowance rates rise, healthier individuals are allowed benefits. These healthier individuals are more likely to work when not receiving DI benefits, and thus their labor supply response to DI receipt is greater. Figure 1 also shows results when allowing for cubic and quartic terms in the polynomials in the first and second stage equations.
One problem with an instrument such as this is that it lacks full support. Even the most lenient judge does not allow everyone benefits, and even the strictest judge does not deny everyone. However, the current policy debate is over changing the thresholds by only a modest amount. For this reason, the MTE on the support of the data is the effect of interest, whereas the ATE is not.
Doyle (2007) estimates the marginal treatment effect of foster care on future earnings and other outcomes. Foster care likely increases the earnings of some children but decreases those of others. For the most serious child abuse cases, foster care will likely help the child. For less serious cases, the child is probably best left at home. The question, then, is at what point should the child abuse investigator remove the child from the household? What is the optimal threshold level of observed abuse beyond which the child is removed from the household and placed into foster care?
Only children from the most disadvantaged backgrounds are placed in foster care. They would have had low earnings even if they were not placed in foster care. Thus, OLS estimates are likely inconsistent. To overcome this problem, Doyle uses IV. Case investigators are assigned to cases on a rotational basis, conditional on the time and location of the case. Case investigators are assigned to possible child abuse cases after a complaint of possible child abuse is made (by the child’s teacher, for example). Investigators have a great deal of latitude over whether the child should be sent into foster care. Furthermore, some investigators are much more lenient than others. For example, one standard deviation in the case manager removal differential (the difference between an investigator's average removal rate and the removal rate of other investigators who handle cases at the same time and place) is 10%. Whether the child is removed from the home is a good predictor of whether the child is sent to foster care. So long as assignment of cases to investigators is random and investigators vary only in terms of leniency (and not in their ability to detect child abuse), investigator assignment is a plausibly exogenous instrument.
Doyle uses a two stage procedure where in the first stage he estimates the probability that a child is placed in foster care as a function of the investigator removal rate. In the second stage he estimates adult earnings as a function of whether the child was placed in foster care (as predicted by the instrument). He finds that children placed into foster care earn less than those not placed into foster care over most of the range of the data. Two stage least squares estimates reveal that foster care reduces adult quarterly earnings by about $1000, which is very close to average earnings. Interestingly, he finds that when child foster care placement rates rise, earnings of the marginal case fall. For example, earnings of the marginal child handled by a lenient investigator (who places only 20% of the children in foster care) are unaffected by placement. For less lenient investigators, who place 25% of the cases in foster care, earnings of the marginal case decline by over $1500.
Carneiro and Lee (2009) estimate the counterfactual marginal distributions of wages for college and high school graduates, and examine who enters college. They find that those with the highest returns are the most likely to attend college. Thus, increases in college attendance change the distribution of ability among college and high school graduates. For fixed skill prices, they find that a 14% increase in college participation (analogous to the increase observed in the 1980s) reduces the college premium by 12%. Likewise, Carneiro et al. (2010) find that while the conventional IV estimate of the return to schooling (using distance to a college and local labor market conditions as the instruments) is 0.095, the estimated marginal return to a policy that expands each individual’s probability of attending college by the same proportion is only 0.015.
Perhaps the simplest and most common assumption is that assignment of the treatment is random conditional on observable covariates (sometimes referred to as unconfoundedness). The easiest way to think about this is that the selection error term is independent of the error terms in the outcome equations:

Assumption 5.2. D_i is determined by X_i and ν_i only, where ν_i is independent of the error terms in the two outcome equations.
We continue to assume that Y_fi and Y_hi are determined as in the generalized Roy model above. Note that we have explicitly dropped Z_i from the model, as we consider cases in which we do not have exclusion restrictions. The implication of this assumption is that unobservable factors that determine one’s income as a fisherman do not affect the choice to become a fisherman. That is, while it allows for selection on observables in a very general way, it does not allow for selection on unobservables.
Interestingly, this is still not enough for us to identify the Average Treatment Effect. If there are values of the observable covariates for which Pr(D_i = 1 | X_i = x) = 0 or Pr(D_i = 1 | X_i = x) = 1, the model is not identified. If Pr(D_i = 1 | X_i = x) = 1, then it is straightforward to identify E(Y_fi | X_i = x), but E(Y_hi | X_i = x) is not identified. Thus we need the additional assumption
Assumption 5.3. For almost all x in the support of X_i,

0 < Pr(D_i = 1 | X_i = x) < 1.
Theorem 5.2. Under Assumptions 5.2 and 5.3 the Average Treatment Effect is identified.
(Proof in Appendix.)
Estimation in this case is relatively straightforward. One can use matching 16 or regression analysis to estimate the average treatment effect.
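As a concrete sketch of the regression/matching idea, the following simulates selection on a single observed binary covariate (the data generating process is invented for illustration) and recovers the ATE by differencing treated and untreated means within covariate cells and then averaging over the covariate distribution:

```python
import random

random.seed(1)
n = 100_000

# Invented DGP: selection depends only on the observed covariate x,
# so unconfoundedness holds; the true effect is 1 + x (ATE = 1.5).
rows = []
for _ in range(n):
    x = random.randint(0, 1)
    d = 1 if random.random() < 0.3 + 0.4 * x else 0
    y = x + (1.0 + x) * d + random.gauss(0, 0.5)
    rows.append((x, d, y))

def cell_mean(xv, dv):
    ys = [y for (x, d, y) in rows if x == xv and d == dv]
    return sum(ys) / len(ys)

# ATE = sum over x of [E(Y | D=1, x) - E(Y | D=0, x)] * Pr(x)
ate = 0.0
for xv in (0, 1):
    pr_x = sum(1 for (x, _, _) in rows if x == xv) / n
    ate += (cell_mean(xv, 1) - cell_mean(xv, 0)) * pr_x

# Naive comparison of treated and untreated means, ignoring x:
treated = [y for (_, d, y) in rows if d == 1]
control = [y for (_, d, y) in rows if d == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)

print(round(ate, 2), round(naive, 2))  # adjusted estimate near the true ATE of 1.5
```

Because treatment is more common when x = 1 (where outcomes and effects are larger), the naive difference in means overstates the ATE, while the covariate-adjusted estimate does not.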
In our original discussion of identification we defined the identified set as the set of values of the parameter that are consistent with the distribution of the observed data. We said that the parameter was identified if this set was a singleton. However, there is another concept of identification we have not discussed until this point: set identification. Sometimes a parameter of interest is not point identified, but this does not mean we cannot say anything about it. In this subsection we consider set identification (i.e., trying to characterize the identified set), focusing on the case in which the parameter of interest is the Average Treatment Effect. Suppose that we have some prior knowledge (possibly an exclusion restriction that gives us a LATE). What can we learn about the ATE without making any functional form assumptions? In a series of papers, Manski (1989, 1990, 1995, 1997) and Manski and Pepper (2000, 2009) develop procedures to derive set estimators of the Average Treatment Effect and other parameters under weak assumptions. By “set identification” we mean characterizing the set of Average Treatment Effects that are consistent with the data given the assumptions. Throughout this section we continue to assume that the structure of the Generalized Roy model holds and we derive results under these assumptions. In many cases the papers we mention do not impose this structure and obtain more general results.
Following Manski (1990) or Manski (1995), notice that

E(Y_fi) = E(Y_i | D_i = 1) Pr(D_i = 1) + E(Y_fi | D_i = 0) Pr(D_i = 0)    (5.20)
E(Y_hi) = E(Y_i | D_i = 0) Pr(D_i = 0) + E(Y_hi | D_i = 1) Pr(D_i = 1).    (5.21)
We observe all of the objects in Eqs (5.20) and (5.21) except E(Y_fi | D_i = 0) and E(Y_hi | D_i = 1). The data are completely uninformative about these two objects. However, suppose we have some prior knowledge about the support of Y_fi and Y_hi. In particular, suppose that the support of Y_fi and Y_hi is bounded above by y^u and from below by y^l. Thus, by assumption, y^l ≤ E(Y_fi | D_i = 0) ≤ y^u and y^l ≤ E(Y_hi | D_i = 1) ≤ y^u. Using these assumptions and Eqs (5.20) and (5.21) we can establish that

E(Y_i | D_i = 1) Pr(D_i = 1) + y^l Pr(D_i = 0) ≤ E(Y_fi) ≤ E(Y_i | D_i = 1) Pr(D_i = 1) + y^u Pr(D_i = 0)    (5.22)
E(Y_i | D_i = 0) Pr(D_i = 0) + y^l Pr(D_i = 1) ≤ E(Y_hi) ≤ E(Y_i | D_i = 0) Pr(D_i = 0) + y^u Pr(D_i = 1).    (5.23)
Using these bounds and the definition of the ATE,

ATE = E(Y_fi) − E(Y_hi),

yields

E(Y_i | D_i = 1) Pr(D_i = 1) + y^l Pr(D_i = 0) − E(Y_i | D_i = 0) Pr(D_i = 0) − y^u Pr(D_i = 1) ≤ ATE ≤ E(Y_i | D_i = 1) Pr(D_i = 1) + y^u Pr(D_i = 0) − E(Y_i | D_i = 0) Pr(D_i = 0) − y^l Pr(D_i = 1).    (5.24)
In practice the bounds above can yield wide ranges and are often not particularly informative. A number of other assumptions can be used to decrease the size of the identified set.
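The worst case bounds are simple to compute from observed data. A sketch with toy numbers (the data are invented; the only substantive input is the assumed support [0, 1] of the outcome):

```python
def worst_case_ate_bounds(treated, control, y_lo, y_hi):
    """Manski worst-case bounds on the ATE from the observed outcomes
    of treated and untreated units, given a known outcome support."""
    n = len(treated) + len(control)
    p1 = len(treated) / n              # Pr(D = 1)
    p0 = 1 - p1                        # Pr(D = 0)
    m1 = sum(treated) / len(treated)   # E(Y | D = 1)
    m0 = sum(control) / len(control)   # E(Y | D = 0)
    lo = m1 * p1 + y_lo * p0 - (m0 * p0 + y_hi * p1)
    hi = m1 * p1 + y_hi * p0 - (m0 * p0 + y_lo * p1)
    return lo, hi

# Toy data with outcomes known to lie in [0, 1]:
lo, hi = worst_case_ate_bounds([0.6, 0.8], [0.3, 0.5], 0.0, 1.0)
print(round(lo, 2), round(hi, 2))  # -0.35 0.65
```

Note that the width of these bounds is always y^u − y^l regardless of the data, which is one way of seeing why they are often uninformative on their own.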
Manski (1990, 1995) shows that one method of tightening the bounds is with an instrumental variable. We can write the expressions (5.20) and (5.21) conditional on Z_i for any z in its support: for each z,

E(Y_fi | Z_i = z) = E(Y_i | D_i = 1, Z_i = z) Pr(D_i = 1 | Z_i = z) + E(Y_fi | D_i = 0, Z_i = z) Pr(D_i = 0 | Z_i = z),    (5.25)

and analogously for E(Y_hi | Z_i = z).
Since Z_i is, by assumption, mean independent of Y_fi and Y_hi (it only affects the probability of choosing one occupation versus the other), E(Y_fi | Z_i = z) = E(Y_fi) and E(Y_hi | Z_i = z) = E(Y_hi). Assume there is a binary instrumental variable, Z_i, which equals either 0 or 1. We can then follow exactly the same argument as in Eqs (5.22) and (5.23), but conditioning on Z_i. Using Eq. (5.25) yields

E(Y_i | D_i = 1, Z_i = 1) Pr(D_i = 1 | Z_i = 1) + y^l Pr(D_i = 0 | Z_i = 1) ≤ E(Y_fi)    (5.26)
E(Y_hi) ≤ E(Y_i | D_i = 0, Z_i = 0) Pr(D_i = 0 | Z_i = 0) + y^u Pr(D_i = 1 | Z_i = 0).    (5.27)
Thus we can bound the ATE from below by subtracting (5.27) from (5.26):

ATE ≥ E(Y_i | D_i = 1, Z_i = 1) Pr(D_i = 1 | Z_i = 1) + y^l Pr(D_i = 0 | Z_i = 1) − E(Y_i | D_i = 0, Z_i = 0) Pr(D_i = 0 | Z_i = 0) − y^u Pr(D_i = 1 | Z_i = 0).    (5.28)
Our choice of a binary Z_i can be trivially relaxed. In cases in which Z_i takes on many values, one could choose any two values in the support of Z_i to construct upper and lower bounds. If our goal is to minimize the size of the identified set, we would choose the values z_1 and z_2 to minimize the difference between the upper and lower bounds constructed as in (5.28), which is

(y^u − y^l) [Pr(D_i = 0 | Z_i = z_1) + Pr(D_i = 1 | Z_i = z_2)].
The importance of support conditions once again becomes apparent from this expression. If we could find values z_1 and z_2 such that

Pr(D_i = 0 | Z_i = z_1) = 0 and Pr(D_i = 1 | Z_i = z_2) = 0,

then this expression is zero and we obtain point identification of the ATE. When Pr(D_i = 0 | Z_i = z) and Pr(D_i = 1 | Z_i = z) are bounded away from zero we are only able to obtain set estimates. A nice aspect of this approach is that it represents a middle ground between identifying the LATE and claiming the ATE is not identified. If identification at infinity does not hold exactly, but approximately, so that one can find values z_1 and z_2 for which Pr(D_i = 0 | Z_i = z_1) and Pr(D_i = 1 | Z_i = z_2) are small, then the bounds will be tight. If one cannot find such values, the bounds will be far apart.
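The instrument tightens the bounds because the bounds computed conditional on each value of z can be intersected: the largest lower bound and smallest upper bound across z remain valid for E(Y_f) and E(Y_h). A sketch with invented data, assuming the outcome support is [0, 1]:

```python
def bounds_with_instrument(data, y_lo, y_hi):
    """data maps each instrument value z to a list of (y, d) pairs.
    Returns worst-case ATE bounds after intersecting the bounds on
    E(Y_f) and E(Y_h) across values of z (Manski's IV bounds)."""
    f_lo, f_hi = y_lo, y_hi   # bounds on E(Y_f)
    h_lo, h_hi = y_lo, y_hi   # bounds on E(Y_h)
    for obs in data.values():
        n = len(obs)
        p = sum(d for _, d in obs) / n               # Pr(D = 1 | z)
        eyd = sum(y * d for y, d in obs) / n         # E(Y D | z)
        eyn = sum(y * (1 - d) for y, d in obs) / n   # E(Y (1-D) | z)
        f_lo = max(f_lo, eyd + y_lo * (1 - p))
        f_hi = min(f_hi, eyd + y_hi * (1 - p))
        h_lo = max(h_lo, eyn + y_lo * p)
        h_hi = min(h_hi, eyn + y_hi * p)
    return f_lo - h_hi, f_hi - h_lo

# Invented example: z=1 induces 3 of 4 people into treatment, z=0 only 1 of 4.
data = {
    0: [(1.0, 1), (0.0, 0), (0.0, 0), (0.0, 0)],
    1: [(1.0, 1), (1.0, 1), (1.0, 1), (0.0, 0)],
}
print(bounds_with_instrument(data, 0.0, 1.0))  # (0.5, 1.0)
```

Pooling the same data without the instrument would give ATE bounds of [0, 1], so here the instrument cuts the width of the identified set in half.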
In many cases these bounds may be wide. Wide bounds can be viewed in two ways. One interpretation is that the bounding procedure is not particularly helpful in learning about the true ATE. A different interpretation is that the data, without additional assumptions, are not particularly helpful for learning about the ATE. Below we discuss additional assumptions for tightening the bounds on the ATE: Monotone Treatment Response, Monotone Treatment Selection, and Monotone Instrumental Variables. In order to keep matters simple, below we assume that there is no exclusion restriction. However, if an exclusion restriction is available, it can be used to tighten the bounds further.
Next we consider the assumption of Monotone Treatment Response introduced in Manski (1997), which we write as
Assumption 5.4. Monotone Treatment Response:

Y_fi ≥ Y_hi with probability one.
In the fishing/hunting example this is not a particularly natural assumption, but for many applications in labor economics it is. Suppose we are interested in knowing the returns to a college degree, so that Y_fi is income for individual i as a college graduate whereas Y_hi is income as a high school graduate. It is reasonable to believe that the causal effect of school or training cannot be negative. That is, one could reasonably assume that receiving more education cannot causally lower your wage. Thus, Monotone Treatment Response seems like a reasonable assumption in this case. This can tighten the bounds above quite a bit because now we know that

E(Y_fi | D_i = 0) ≥ E(Y_i | D_i = 0) and E(Y_hi | D_i = 1) ≤ E(Y_i | D_i = 1).
From this Manski (1997) shows that

0 ≤ ATE ≤ E(Y_i | D_i = 1) Pr(D_i = 1) + y^u Pr(D_i = 0) − E(Y_i | D_i = 0) Pr(D_i = 0) − y^l Pr(D_i = 1).
Another interesting assumption that can also help tighten the bounds is the Monotone Treatment Selection assumption introduced in Manski and Pepper (2000). In our framework this can be written as

Assumption 5.5. Monotone Treatment Selection: for j = f or j = h,

E(Y_ji | D_i = 1) ≥ E(Y_ji | D_i = 0).
Again this might not be completely natural for the fishing/hunting example, but it may be plausible in many other cases. For example, it seems like a reasonable assumption for schooling if we believe that there is positive sorting into schooling. Put differently, suppose the average college graduate is a more able person than the average high school graduate and would earn higher income even without the college degree. If this is true, then the average difference in earnings between college and high school graduates overstates the true causal effect of college on earnings. This also helps to further tighten the bounds, as it implies that

E(Y_fi | D_i = 0) ≤ E(Y_i | D_i = 1) and E(Y_hi | D_i = 1) ≥ E(Y_i | D_i = 0).
Note that by combining the MTR and MTS assumptions, one can get the tighter bounds:

0 ≤ ATE ≤ E(Y_i | D_i = 1) − E(Y_i | D_i = 0).
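Under MTR and MTS combined, the bounds no longer depend on the assumed outcome support at all: the ATE lies between zero and the raw difference in observed means. A minimal sketch with invented wage data:

```python
def mtr_mts_ate_bounds(treated, control):
    """Bounds on the ATE under Monotone Treatment Response
    (Y_f >= Y_h) plus Monotone Treatment Selection: the ATE lies
    between zero and the difference in observed group means."""
    m1 = sum(treated) / len(treated)   # E(Y | D = 1)
    m0 = sum(control) / len(control)   # E(Y | D = 0)
    return 0.0, m1 - m0

# Invented hourly wages: college graduates vs high school graduates.
lo, hi = mtr_mts_ate_bounds([20.0, 24.0, 28.0], [14.0, 16.0, 18.0])
print(lo, hi)  # 0.0 8.0
```

The upper bound is just the naive mean comparison, which under MTS overstates the causal effect, so it is a valid ceiling.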
Manski and Pepper (2000) also develop the idea of a monotone instrumental variable. An instrumental variable is defined as one for which, for any two values of the instrument z_1 and z_2,

E(Y_ji | Z_i = z_1) = E(Y_ji | Z_i = z_2) for j = f, h.

In words, the assumption is that the instrument does not directly affect the outcome variable; it only affects one’s choices. Using somewhat different notation, but their exact wording, they define a monotone instrumental variable in the following way
Assumption 5.6. Let Z be an ordered set. Covariate Z_i is a monotone instrumental variable in the sense of mean-monotonicity if, for j = f, h, each value of x, and all z_1 and z_2 such that z_2 ≥ z_1,

E(Y_ji | X_i = x, Z_i = z_2) ≥ E(Y_ji | X_i = x, Z_i = z_1).
This is a straightforward generalization of the instrumental variable assumption, but it imposes much weaker requirements on an instrument. It does not require that the instrument be mean independent of the potential outcomes, but simply that mean outcomes increase monotonically with the instrument. An example is parental income, which has often been used as an instrument for education. Richer parents are better able to afford a college degree for their child. However, it seems likely that the children of rich parents would have had high earnings even in the absence of a college degree.
They show that this implies that, for each z,

max_{z_1 ≤ z} {E(Y_i | D_i = 1, Z_i = z_1) Pr(D_i = 1 | Z_i = z_1) + y^l Pr(D_i = 0 | Z_i = z_1)} ≤ E(Y_fi | Z_i = z) ≤ min_{z_2 ≥ z} {E(Y_i | D_i = 1, Z_i = z_2) Pr(D_i = 1 | Z_i = z_2) + y^u Pr(D_i = 0 | Z_i = z_2)},

with an analogous expression for E(Y_hi | Z_i = z). Bounds on the unconditional means E(Y_fi) and E(Y_hi), and hence on the ATE, follow by averaging these conditional bounds over the distribution of Z_i.
One can obtain tighter bounds by combining the Monotone Instrumental Variable assumption with the Monotone Treatment Response assumption but we do not explicitly present this result.
Blundell et al. (2007) estimate changes in the distribution of wages in the United Kingdom using bounds to allow for the impact of non-random selection into work. They first document the growth in wage inequality among workers over the 1980s and 1990s. However, they point out that rates of non-participation in the labor force grew in the UK over the same time period. Nevertheless, they show that selection effects alone cannot explain the rise in inequality observed among workers: the worst case bounds establish that inequality has increased. However, worst case bounds are not sufficiently informative to answer such questions as whether most of the rise in wage inequality is due to increases in wage inequality within education groups versus across education groups. Next, they add additional assumptions to tighten the bounds. First, they assume the probability of work is higher for those with higher wages, which is essentially the Monotone Treatment Selection assumption shown in Assumption 5.5. Second, they make the Monotone Instrumental Variables assumption shown in Assumption 5.6: they assume that higher values of out of work benefit income are positively associated with wages. They show that both of these assumptions tighten the bounds considerably. They find that when these additional restrictions are imposed, they can show that both within group and between group inequality have increased.
Altonji et al. (2005a) suggest another approach, which is to use the amount of selection on observable covariates as a guide to the potential amount of selection on unobservables. To motivate this approach, consider an experiment in which treatment status is randomly assigned. The key to random assignment is that it imposes that treatment status be independent of the unobservables in the treatment model. Since they are unobservable, one can never explicitly test whether the treatment was truly random. However, if randomization was carried out correctly, treatment should also be uncorrelated with observable covariates. This is testable, and applying this test is standard in experimental approaches.
Researchers use this same argument in non-experimental cases as well. If a researcher wants to argue that his instrument or treatment is approximately randomly assigned, then it should be uncorrelated with observable covariates as well. Even if this is not strictly required for consistency of instrumental variables estimates, readers may be skeptical of the assumption that the instrument is uncorrelated with the unobservables if it is correlated with the observables. Researchers often test for this type of relationship as well. 17 The problem with this approach is that simply testing the null of uncorrelatedness is not that useful. Just because one rejects the null does not mean it is not approximately true. We would not want to throw out an instrument with a tiny bias just because we have a data set large enough to detect a small correlation between it and an observable. Along the same lines, just because one fails to reject the null does not mean it is true. With a small data set and little power, one could fail to reject the null even though the instrument is poor. To address these issues, Altonji et al. (2005a) design a framework that allows them to describe how large the treatment effect would be if “selection on the unobservables is the same as selection on the observables.”
Their key variables are discrete, so they consider a latent variable model in which the dummy variable Y_i for graduation from high school can be written as

Y_i = 1(Y_i* ≥ 0),

where the latent index Y_i* can be written as

Y_i* = α CH_i + Σ_j s_j γ_j x_ij + Σ_j (1 − s_j) γ_j x_ij.

Here CH_i indicates attendance at a Catholic high school, the x_ij represent all covariates, both those that are observable to the econometrician and those that are unobservable, the variable s_j is a dummy variable representing whether covariate j is observable to the empirical researcher, W_i ≡ Σ_j s_j γ_j x_ij represents the observable part of the index, and u_i ≡ Σ_j (1 − s_j) γ_j x_ij denotes the unobservable part.
Within this framework, one can see that different assumptions about what dictates which covariates are observed can be used to identify the model. Their specific goal is to quantify what it means for “selection on the observables to be the same as selection on the unobservables.” They argue that the most natural way to formalize this idea is to assume that the s_j are randomly assigned, so that the unobservables and observables are drawn from the same underlying distribution.
The next question is what this assumption implies about the data that can be useful for identification. They consider the projection

Proj(ξ_i | W_i, u_i) = φ_0 + φ_W W_i + φ_u u_i,

where ξ_i can be any random variable. They show that if the s_j are randomly assigned,

φ_W = φ_u.
This restriction is typically sufficient to insure identification of the effect of Catholic schooling, α. 18
Altonji et al. (2005a,b) argue that for their example this is an extreme assumption, and the truth is somewhere in between this assumption and the assumption that the treatment is uncorrelated with the unobservables, which would correspond to φ_u = 0. They treat these two polar cases, φ_u = 0 and φ_u = φ_W, as bounds on the amount of selection on unobservables.
There are at least three arguments for why selection on unobservables would be expected to be less severe than selection on observables (as it is measured here). First, some of the variation in the unobservable is likely just measurement error in the dependent variable. Second, data collectors tend to collect the variables that are most likely to be correlated with the outcomes of interest. Third, there is often a time lapse between when the baseline data (the observables) are collected and when the outcome is realized. If unanticipated events occur between these two points in time, selection on unobservables will be weaker than selection on observables.
Notice that if the observable index W_i is unrelated to the treatment, then assuming φ_u = 0 is the same as assuming φ_u = φ_W. However, if the relationship were very strong, the two estimates would be very different, which would shed doubt on the assumption of random assignment. Since the comparison essentially picks up the relationship between the instrument and the observable covariates, the bounds will be wide when there is a lot of selection on observables and tight when there is little selection on observables.
Altonji, Elder, and Taber consider whether the decision to attend Catholic high school affects outcomes such as test scores and high school graduation rates. Those who attend Catholic schools have higher graduation rates than those who do not. However, those who attend Catholic schools may be very different from those who do not. They find that (on the basis of observables) while this is true in the population, it is not true when one conditions on the individuals who attended Catholic school in eighth grade. To formalize this, they use their approach and estimate the model under the two different assumptions. In their application the projection variable, ξ_i, is the latent variable determining whether an individual attends Catholic school. First they estimate a simple probit of high school graduation on Catholic high school attendance as well as many other covariates. This corresponds to the φ_u = 0 case. They find a marginal effect of 0.08, meaning that Catholic school raises high school graduation by eight percentage points. Next they estimate a bivariate probit of Catholic high school attendance and high school graduation subject to the constraint that φ_u = φ_W. In this case they find a Catholic high school effect of 0.05. The closeness of these two estimates strongly suggests that the Catholic high school effect is not simply a product of omitted variable bias. The closeness arose both because the estimated amount of selection on observables was small and because they use a wide array of powerful explanatory variables.