Chapter 6 Identification of Models of the Labor Market

Eric French*, Christopher Taber**

* Federal Reserve Bank of Chicago

** Department of Economics, University of Wisconsin-Madison and NBER


This chapter discusses identification of common selection models of the labor market. We start with the classic Roy model and show how it can be identified with exclusion restrictions. We then extend the argument to the generalized Roy model, treatment effect models, duration models, search models, and dynamic discrete choice models. In all cases, key ingredients for identification are exclusion restrictions and support conditions.

JEL classification

• C14 • C51 • J22 • J24


Keywords

• Identification • Roy model • Discrete choice • Selection • Treatment effects

1 Introduction

This chapter discusses identification of common selection models of the labor market. We are primarily concerned with nonparametric identification. We view nonparametric identification as important for the following reasons.

First, recent advances in computer power, more widespread use of large data sets, and better methods mean that estimation of increasingly flexible functional forms is possible. Flexible functional forms should be encouraged. The functional form and distributional assumptions used in much applied work rarely come from the theory. Instead, they come from convenience. Furthermore, they are often not innocuous. 1

Second, the process of thinking about nonparametric identification is useful input into applied work. It is helpful to an applied researcher both in informing her about which type of data would be ideal and which aspects of the model she might have some hope of estimating. If a feature of the model is not nonparametrically identified, then one knows it cannot be identified directly from the data. Some additional type of functional form assumption must be made. As a result, readers of empirical papers are often skeptical of the results in cases in which the model is not nonparametrically identified.

Third, identification is an important part of a proof of consistency of a nonparametric estimator.

However, we acknowledge the following limitation of focusing on nonparametric identification. With any finite data set, an empirical researcher can almost never be completely nonparametric. Some aspects of the data that might be formally identified could never be estimated with any reasonable level of precision. Instead, estimators are usually only nonparametric in the sense that one allows the flexibility of the model to grow with the sample size. A nice example of this is sieve estimators, in which one estimates finite parameter models but the number of parameters grows with the sample size. An example would be approximating a function by a polynomial and letting the degree of the polynomial get large as the sample size increases. However, in that case one still must verify that the model is nonparametrically identified in order to show that the estimator is consistent. One must also construct standard errors appropriately. In this chapter we do not consider the purely statistical aspects of nonparametric estimation, such as calculation of standard errors. This is a very large topic within econometrics. 2

The key issue in identification of most models of the labor market is the selection problem. For example, individuals are typically not randomly assigned to jobs. With this general goal in mind we begin with the simplest and most fundamental selection model in labor economics, the Roy (1951) model. We go into some detail to explain Heckman and Honoré’s (1990) results on identification of this model. A nice aspect of identification of the Roy model is that the basic methodology used in this case can be extended to show identification of other labor models. We spend the rest of the chapter showing how this basic intuition can be used in a wide variety of labor market models. Specifically we cover identification in the generalized Roy model, treatment effect models, the competing risk model, search models, and forward looking dynamic models. While we are clearly not covering all models in labor economics, we hope the ideas are presented in a way that the similarities in the basic models can be seen and can be extended by the reader to alternative frameworks.

The plan of this chapter is specifically as follows. Section 2 discusses some econometric preliminaries. We consider the Roy model in Section 3, generalize this to the Generalized Roy model in Section 4, and then use the model to think about identification of treatment effects in Section 5. In Section 6 we consider duration models and search models and then consider estimation of dynamic discrete choice models in Section 7. Finally in Section 8 we offer some concluding thoughts.

2 Econometric Preliminaries

2.1 Notation

Throughout this chapter we use capital letters with $i$ subscripts to denote random variables and small letters without $i$ subscripts to denote possible outcomes of that random variable. We will also try to be explicit throughout this chapter in denoting conditioning. Thus, for example, we will use the notation

$$E(Y_i \mid X_i = x)$$

to denote the expected value of outcome $Y_i$ conditional on the regressor variable $X_i$ being equal to some realization $x$.

2.2 Identification

The word “identification” has come to mean different things to different labor economists. Here, we use a formal econometrics definition of identification. Consider two different models that lead to two data generating processes. If the data generated by these two models have exactly the same distribution then the two models are not separately identified from each other. However, if any two different model specifications lead to different data distributions, the two specifications are separately identified. We give a more precise definition below. Our definition of identification uses some of the notation and setup of Matzkin (2007) and follows the exposition in Shaikh (2010).

Let $P$ denote the true distribution of the observed data $X$. An econometric model defines a data generating process. We assume that the model is specified up to an unknown vector $\theta$ of parameters, functions and distribution functions. This is known to lie in the space $\Theta$. Within the class of models, the element $\theta \in \Theta$ determines the distribution of the data that is observable to the researcher, $P_\theta$. Notice that identification is fundamentally data dependent. With a richer data set, the distribution $P_\theta$ would be a different object.

Let $\mathcal{P}$ be the set of all possible distributions that could be generated by the class of models we consider (i.e. $\mathcal{P} \equiv \{P_\theta : \theta \in \Theta\}$). We assume that the model is correctly specified, which means that $P \in \mathcal{P}$. The identified set is defined as

$$\Theta(P) \equiv \{\theta \in \Theta : P_\theta = P\}.$$

This is the set of possible $\theta$ that could have generated data that has distribution $P$. By assuming that $P \in \mathcal{P}$ we have assumed that our model is correctly specified so this set is not empty. We say that $\theta$ is identified if $\Theta(P)$ is a singleton for all $P \in \mathcal{P}$.

The question we seek to answer here is under what conditions is it possible to learn about $\theta$ (or some feature of $\theta$) from the distribution of the observed data $P$. Our interest is not always to identify the full data generating process. Often we are interested in only a subset of the model, or a particular outcome from it. Specifically, our goal may be to identify

$$\psi(\theta),$$

where $\psi$ is a known function. For example in a regression model $Y_i = X_i'\beta + u_i$, the feature of interest is typically the regression coefficients. In this case $\psi$ would take the trivial form

$$\psi(\theta) = \beta.$$

However, this notation allows for more general cases in which we might be interested in identifying specific aspects of the model. For example, if our interest is in identifying the covariance between $X_i$ and $Y_i$ in the case of the linear regression model, we do not need to know $\theta$ per se, but rather a transformation of these parameters. That is we could be interested in

$$\psi(\theta) = \mathrm{cov}(X_i, Y_i).$$

We could also be interested in a forecast of the model such as

$$\psi(\theta) = x'\beta$$

for some specific $x$. The distinction between identification of features of the model as opposed to the full model is important, as in many cases the full model is not identified but the key feature of interest is identified.

To think about identification of $\psi(\theta)$ we define

$$\Psi(P) \equiv \{\psi(\theta) : \theta \in \Theta(P)\}.$$

That is, it is the set of possible values of $\psi(\theta)$ that are consistent with the data distribution $P$. We say that $\psi(\theta)$ is identified if $\Psi(P)$ is a singleton.
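As a toy illustration of these definitions (not taken from the chapter), suppose the parameter space contains four hypothetical values and the distribution of the data depends on the parameter only through its square. The sketch below uses the made-up names `p_theta`, `identified_set` and `feature_set` to compute the identified set for the parameter and for the feature given by its square. The full parameter is not identified here, yet the feature is.

```python
def p_theta(theta):
    """Distribution of the observable implied by theta.

    Hypothetical model: the data are Bernoulli with success probability theta**2,
    so the distribution is summarized by that single number.
    """
    return theta ** 2


# Hypothetical parameter space and true parameter value.
parameter_space = [-0.5, -0.2, 0.2, 0.5]
true_theta = 0.5
P = p_theta(true_theta)  # distribution of the observed data

# Identified set: every theta that generates the observed distribution.
identified_set = [t for t in parameter_space if p_theta(t) == P]

# Identified set for the feature psi(theta) = theta**2.
feature_set = {t ** 2 for t in identified_set}

theta_is_identified = len(identified_set) == 1
feature_is_identified = len(feature_set) == 1
```

Here `identified_set` contains both 0.5 and its negative, so the parameter itself is not identified, while `feature_set` is a singleton.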

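As an example consider the standard regression model with two regressors:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \qquad (2.1)$$

with $E(u_i \mid X_i = x) = 0$ for any value $x$ (where $X_i = (X_{1i}, X_{2i})$). In this case $\theta = (\beta, F_{X,u})$, where $\beta = (\beta_0, \beta_1, \beta_2)$ and $F_{X,u}$ is the joint distribution of $X_i$ and $u_i$. One would write $\Theta$ as $\mathcal{B} \times \mathcal{F}_{X,u}$, where $\mathcal{B}$ is the parameter space for $\beta$ and $\mathcal{F}_{X,u}$ is the space of joint distributions between $X_i$ and $u_i$ that satisfy $E(u_i \mid X_i = x) = 0$ for all $x$. Since the data here is represented by $(X_{1i}, X_{2i}, Y_i)$, $P_\theta$ represents the joint distribution of $(X_{1i}, X_{2i}, Y_i)$. Given knowledge of $\beta$ and $F_{X,u}$ we know the data generating process and thus we know $P_\theta$.

To focus ideas suppose we are interested in identifying $\beta_1$ (i.e. $\psi(\theta) = \beta_1$) in regression model (2.1) above. Let the true value of the data generating process be $\theta^* = (\beta^*, F^*_{X,u})$ so that by definition $P_{\theta^*} = P$. In this case $\Theta(P) \equiv \{\theta \in \Theta : P_\theta = P\}$, that is it is the set of $\theta$ that would lead our data $(X_{1i}, X_{2i}, Y_i)$ to have distribution $P$. In this case $\Psi(P)$ is the set of values of $\beta_1$ in this set (i.e. $\Psi(P) = \{\beta_1 : (\beta, F_{X,u}) \in \Theta(P)\}$).

In the case of 2 covariates, we know the model is identified as long as $X_{1i}$ and $X_{2i}$ are not degenerate and not collinear. To see how this definition of identification applies to this model, note that for any $\beta_1 \neq \beta_1^*$ the lack of perfect multicollinearity means that we can always find values of $(x_1, x_2)$ for which

$$\beta_0^* + \beta_1^* x_1 + \beta_2^* x_2 \neq \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$

Since $E(Y_i \mid X_i = x)$ is one aspect of the joint distribution of $(X_i, Y_i)$, it must be the case that when $\beta_1 \neq \beta_1^*$, $P_\theta \neq P$. Since this is true for any value of $\beta_1 \neq \beta_1^*$, then $\Psi(P)$ must be the singleton $\{\beta_1^*\}$.

However, consider the well known case of perfect multicollinearity in which the model is not identified. In particular suppose that

$$X_{1i} = X_{2i}.$$

For the true value of $\theta^* = (\beta_0^*, \beta_1^*, \beta_2^*, F^*_{X,u})$ consider some other value $\tilde{\theta} = (\beta_0^*, \tilde{\beta}_1, \tilde{\beta}_2, F^*_{X,u})$ with $\tilde{\beta}_1 \neq \beta_1^*$ and $\tilde{\beta}_1 + \tilde{\beta}_2 = \beta_1^* + \beta_2^*$. Then for any $x$ in the support of $X_i$,

$$\beta_0^* + \beta_1^* x_1 + \beta_2^* x_2 = \beta_0^* + (\beta_1^* + \beta_2^*) x_1 = \beta_0^* + \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2.$$

If $F_{X,u}$ is the same for the two models, then the joint distribution of $(X_i, Y_i)$ is the same in the two cases. Thus the identification condition above is violated because with $\tilde{\beta}_1 \neq \beta_1^*$, $P_{\tilde{\theta}} = P$ and thus $\tilde{\beta}_1 \in \Psi(P)$. Since the true value $\beta_1^* \in \Psi(P)$ as well, $\Psi(P)$ is not a singleton and thus $\beta_1$ is not identified.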
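The collinearity argument can be checked numerically. The sketch below assumes the collinear case in which the two regressors are equal, and uses hypothetical coefficient values and a made-up helper `cond_mean`. Two coefficient vectors with the same sum of slopes give the same conditional mean at every point of the collinear support, but differ as soon as a point off that support is available.

```python
def cond_mean(beta, x1, x2):
    """Conditional mean of Y given (x1, x2) in a linear model with two regressors."""
    b0, b1, b2 = beta
    return b0 + b1 * x1 + b2 * x2


beta_true = (1.0, 2.0, 3.0)  # hypothetical true coefficients
beta_alt = (1.0, 4.0, 1.0)   # different slopes, same sum of slopes

# Under perfect collinearity (x1 == x2) only these kinds of points are observed.
collinear_points = [(x, x) for x in (-2.0, -0.5, 0.0, 1.0, 3.5)]
same_on_support = all(
    abs(cond_mean(beta_true, x1, x2) - cond_mean(beta_alt, x1, x2)) < 1e-12
    for x1, x2 in collinear_points
)

# A point with x1 != x2 (available only without collinearity) separates them.
differs_off_support = cond_mean(beta_true, 1.0, 0.0) != cond_mean(beta_alt, 1.0, 0.0)
```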

2.3 Support

Another important issue is the support of the data. The simplest definition of support is just the range of the data. When data are discrete, this is the set of values that occur with positive probability. Thus a binary variable that is either zero or one would have support $\{0, 1\}$. The result of a die roll has support $\{1, 2, 3, 4, 5, 6\}$. With continuous variables things get somewhat more complicated. One can think of the support of a random variable as the set of values for which the density is positive. For example, the support of a normal random variable would be the full real line (which we will often refer to as “full support”). The support of a uniform variable on $[0, 1]$ is $[0, 1]$. The support of an exponential variable would be the positive real line.

This can be somewhat trickier in dealing with outcomes that occur with measure zero. For example one could think of the support of a uniform variable as $[0, 1]$, $(0, 1)$, or $(0, 1]$. The distinction between these objects will not be important in what we are doing, but to be formal we will use the Davidson (1994) definition of support. He defines the support of a random variable with distribution $F$ as the set of points at which $F$ is (strictly) increasing. 3 By this definition, the support of a uniform would be $[0, 1]$. We will also use the notation $\mathrm{supp}(Y_i)$ to denote the unconditional support of random variable $Y_i$ and $\mathrm{supp}(Y_i \mid X_i = x)$ to denote the conditional support.

To see the importance of this concept, consider a simple case of the separable regression model

$$Y_i = g(X_i) + u_i$$

with a single continuous $X_i$ variable and $E(u_i \mid X_i = x) = 0$ for $x \in \mathrm{supp}(X_i)$. In this case we know that

$$E(Y_i \mid X_i = x) = g(x).$$

Letting $\mathcal{X}$ be the support of $X_i$, it is straightforward to see that $g$ is identified on the set $\mathcal{X}$. But $g$ is not identified outside the set $\mathcal{X}$ because the data is completely silent about these values. Thus if $\mathcal{X} = \mathbb{R}$, $g$ is globally identified. However, if $\mathcal{X}$ only covers a subset of the real line it is not. For example, one interesting counterfactual is the change in the expected value of $Y_i$ if $X_i$ were increased by $\delta$: $E(g(X_i + \delta)) - E(g(X_i))$. If $\mathcal{X} = \mathbb{R}$ this is trivially identified, but if the support of $X_i$ were bounded from above, this would no longer be the case. That is, if the supremum of $\mathcal{X}$ is $x^u < \infty$, then for any value of $x > x^u - \delta$, $g(x + \delta)$ is not identified and thus the unconditional expected value of $g(X_i + \delta)$ is not identified either. This is just a restatement of the well known fact that one cannot project out of the data unless one makes functional form assumptions. Our point here is that support assumptions are very important in nonparametric identification results. One can only identify $g$ over the range of plausible values of $X_i$ if $X_i$ has full support. For this reason, we will often make strong support condition assumptions. This also helps illuminate the tradeoff between functional form assumptions and flexibility. In order to project off the support of the data in a simple regression model one needs to use some functional form assumption. The same is true for selection models.
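The role of support can be seen in a small simulation. The sketch below is only illustrative: the regression function, the noise level, the bandwidth and the helper `local_mean` are all hypothetical choices. The regressor is uniform on the unit interval, so a local average recovers the regression function inside the support but has nothing to average outside it.

```python
import random

random.seed(0)


def g(x):
    """Hypothetical true regression function."""
    return x ** 2


# The regressor is uniform on [0, 1), so no observation has x > 1.
data = []
for _ in range(20000):
    x = random.random()
    y = g(x) + random.gauss(0.0, 0.1)
    data.append((x, y))


def local_mean(x0, bandwidth=0.05):
    """Average of y over observations with |x - x0| <= bandwidth, or None if none exist."""
    ys = [y for x, y in data if abs(x - x0) <= bandwidth]
    return sum(ys) / len(ys) if ys else None


inside = local_mean(0.5)    # close to g(0.5) = 0.25
outside = local_mean(1.5)   # None: the data are silent about x = 1.5
```

Any value reported for the regression function at 1.5 would have to come from a functional form assumption rather than from the data.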

2.4 Continuity

There is one complication that we need to deal with throughout. It is not a terribly important issue, but will shape some of our assumptions. Consider again the separable regression model

$$Y_i = g(X_i) + u_i \qquad (2.2)$$

As mentioned above $E(Y_i \mid X_i = x) = g(x)$, so it seems trivial to see that $g$ is identified, but that is not quite true. To see the problem, suppose that both $X_i$ and $u_i$ are standard normals. Consider two different models for $g$,

Model 1:

$$g(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$$

Model 2:

$$g(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases}$$

These models only differ at the point $x = 0$, but since $X_i$ is normal this is a zero probability event and we could never distinguish between these models because they imply the same joint distribution of $(X_i, Y_i)$. For the exact same reason it isn’t really a concern (except in very special cases such as if one was evaluating a policy in which we would set $X_i = 0$ for everyone). Since this will be an issue throughout this chapter we explain how to deal with it now and use this convention throughout the chapter.

We will make the following assumptions.

Assumption 2.1. $X_i$ can be written as $(X_i^c, X_i^d)$, where the elements of $X_i^c$ are continuously distributed (no point has positive mass), and $X_i^d$ is distributed discretely (all support points have positive mass).

Assumption 2.2. For any $x^d \in \mathrm{supp}(X_i^d)$, $g(x^c, x^d)$ is almost surely continuous across $x^c \in \mathrm{supp}(X_i^c \mid X_i^d = x^d)$.

The first part says that we can partition our observables into continuous and discrete ones. One could easily allow for variables that are partially continuous and partially discrete, but this would just make our results more tedious to exposit. The second assumption states that choosing a value of $X_i$ at which $g$ is discontinuous (in the continuous variables) is a zero probability event.

Theorem 2.1. Under Assumptions 2.1 and 2.2 and assuming model (2.2) with $E(u_i \mid X_i = x) = 0$ for $x \in \mathrm{supp}(X_i)$, $g(x)$ is identified on a set $\mathcal{X}^*$ that has measure 1.

(Proof in Appendix.)

The proof just states that $g$ is identified almost everywhere. More specifically it is identified everywhere that it is continuous.

3 The Roy Model

The classic model of selection in the labor market is the Roy (1951) model. In the Roy model, workers choose one of two possible occupations: hunting and fishing. They cannot pursue both at the same time. The worker’s log wage is $Y_{fi}$ if he fishes and $Y_{hi}$ if he hunts. Workers maximize income so they choose the occupation with the higher wage. Thus a worker chooses to fish if $Y_{fi} > Y_{hi}$. The occupation is defined as

$$J_i = \begin{cases} f & \text{if } Y_{fi} > Y_{hi} \\ h & \text{if } Y_{fi} \leq Y_{hi} \end{cases} \qquad (3.1)$$

and the log wage is defined as

$$Y_i = \max\{Y_{fi}, Y_{hi}\} \qquad (3.2)$$

Workers face a simple binary choice: choose the job with the highest wage. This simplicity has led the model to be used in one form or another in a number of important labor market contexts. Many discrete choice models share the Roy model’s structure. Examples in labor economics include the choice of whether to continue schooling, what school to attend, what occupation to pursue, whether to join a union, whether to migrate, whether to work, whether to obtain training, and whether to marry.
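The selection problem created by this choice rule is easy to see in a simulation. The sketch below assumes, purely for illustration, that log wages in the two sectors are independent standard normal draws with hypothetical means `MU_F` and `MU_H`. Each worker fishes when the fishing wage is higher, and only the chosen wage is observed.

```python
import random

random.seed(1)

N = 100_000
MU_F, MU_H = 0.0, 0.0  # hypothetical mean log wages in fishing and hunting

sum_all_f = 0.0   # sum of fishing wages over everyone (not observable in practice)
sum_obs_f = 0.0   # sum of fishing wages over workers who actually fish
n_fish = 0

for _ in range(N):
    y_f = MU_F + random.gauss(0.0, 1.0)
    y_h = MU_H + random.gauss(0.0, 1.0)
    sum_all_f += y_f
    if y_f > y_h:          # occupation rule: fish when the fishing wage is higher
        n_fish += 1
        sum_obs_f += y_f   # the observed wage equals max(y_f, y_h) = y_f

share_fish = n_fish / N
mean_yf_population = sum_all_f / N
mean_yf_fishermen = sum_obs_f / n_fish
```

Under these assumptions about half of workers fish, and the average observed fishing wage is well above the population mean of the fishing wage, because fishermen are a selected sample.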

As mentioned in the introduction, we devote considerable attention to identification of this model. In subsequent sections we generalize these results to other models.

The responsiveness of the supply of fishermen to changes in the price of fish depends critically on the joint distribution of $(Y_{fi}, Y_{hi})$. Thus we need to know what a fisherman would have made if he had chosen to hunt. However, we do not observe this but must infer its counterfactual distribution from the data at hand. Our focus is on this selection problem. Specifically, much of this chapter is concerned with the following question: Under what conditions is the joint distribution of $(Y_{fi}, Y_{hi})$ identified? We start by considering estimation in a parametric model and then consider nonparametric identification.

Roy (1951) is concerned with how occupational choice affects the aggregate distribution of earnings and makes a series of claims about this relationship. These claims turn out to be true when the distribution of skills in the two occupations is lognormal.

Heckman and Honoré (1990) consider identification of the Roy model (i.e., the joint distribution of $(Y_{fi}, Y_{hi})$). They show that there are two methods for identifying the Roy model. The first is through distributional assumptions. The second is through exclusion restrictions. 4

In order to focus ideas, we use the following case:

$$Y_{fi} = g_f(X_{fi}, X_{0i}) + \varepsilon_{fi} \qquad (3.3)$$

$$Y_{hi} = g_h(X_{hi}, X_{0i}) + \varepsilon_{hi} \qquad (3.4)$$

where the unobservable error terms $(\varepsilon_{fi}, \varepsilon_{hi})$ are independent of the observable variables $X_i = (X_{0i}, X_{fi}, X_{hi})$ and $Y_{fi}$ and $Y_{hi}$ denote log wages in the fishing and hunting sectors respectively. We distinguish between three types of variables. $X_{0i}$ influences productivity in both fishing and hunting, $X_{fi}$ influences fishing only, and $X_{hi}$ influences hunting only. The variables $X_{fi}$ and $X_{hi}$ are “exclusion restrictions,” and play a very important role in the identification results below. In the context of the Roy model, an exclusion restriction could be a change in the price of rabbits which increases income from hunting, but not from fishing. The notation is general enough to incorporate a model without exclusion restrictions (in which case one or more of the $X_{ji}$ would be empty).

Our version of the Roy framework imposes two strong assumptions. First, that $Y_{ji}$ is separable in $g_j(X_{ji}, X_{0i})$ and $\varepsilon_{ji}$ for $j \in \{f, h\}$. Second, we assume that $g_j(X_{ji}, X_{0i})$ and $\varepsilon_{ji}$ are independent of one another. Note that independence implies homoskedasticity: the variance of $\varepsilon_{ji}$ cannot depend on $X_i$. There is a large literature looking at various other more flexible specifications and this is discussed thoroughly in Matzkin (2007). It is also trivial to extend this model to allow for a general relationship between $X_{0i}$ and $(\varepsilon_{fi}, \varepsilon_{hi})$, as we discuss in Section 3.3 below.

We focus on the separable independent model for two reasons. First, the assumptions of separability and independence have bite beyond a completely general nonparametric relationship. That is, to the extent that they are true, identification is facilitated by these assumptions. Presumably because researchers think these assumptions are approximately true, virtually all empirical research uses these assumptions. Second, despite these strong assumptions, they are obviously much weaker than the standard assumptions that $g_j$ is linear (i.e. $g_f(X_{fi}, X_{0i}) = X_{fi}'\gamma_{ff} + X_{0i}'\gamma_{0f}$) and that $(\varepsilon_{fi}, \varepsilon_{hi})$ is normally distributed. One approach to writing this chapter would have been to go through all of the many specifications and alternative assumptions. We choose to focus on a single base specification for expositional simplicity.

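Heckman and Honoré (1990) first discuss identification of the joint distribution of $(Y_{fi}, Y_{hi})$ using distributional assumptions. They show that when one can observe the distribution of wages in both sectors, and assuming $(Y_{fi}, Y_{hi})$ is joint normally distributed, then the joint distribution of $(Y_{fi}, Y_{hi})$ is identified from a single cross section even without any exclusion restrictions or regressors. To see why, write equations (3.3) and (3.4) without regressors (so $g_f(X_{fi}, X_{0i}) = \mu_f$, the mean of $Y_{fi}$):

$$Y_{fi} = \mu_f + \varepsilon_{fi}$$

$$Y_{hi} = \mu_h + \varepsilon_{hi}$$

where

$$\begin{pmatrix} \varepsilon_{fi} \\ \varepsilon_{hi} \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_f^2 & \sigma_{fh} \\ \sigma_{fh} & \sigma_h^2 \end{pmatrix}\right).$$

Letting

$$\lambda(\cdot) = \frac{\phi(\cdot)}{\Phi(\cdot)}$$

(with $\phi$ and $\Phi$ the pdf and cdf of a standard normal),

$$c = \frac{\mu_f - \mu_h}{\sqrt{\sigma_f^2 + \sigma_h^2 - 2\sigma_{fh}}},$$

and for each $j \in \{f, h\}$,

$$\tau_j = \frac{\sigma_j^2 - \sigma_{fh}}{\sqrt{\sigma_f^2 + \sigma_h^2 - 2\sigma_{fh}}}.$$

One can derive the following conditions from properties of normal random variables found in Heckman and Honoré (1990):

$$\Pr(J_i = f) = \Phi(c)$$

$$E(Y_i \mid J_i = f) = \mu_f + \tau_f \lambda(c)$$

$$E(Y_i \mid J_i = h) = \mu_h + \tau_h \lambda(-c)$$

$$\mathrm{Var}(Y_i \mid J_i = f) = \sigma_f^2 + \tau_f^2\left(-\lambda(c)c - \lambda^2(c)\right)$$

$$\mathrm{Var}(Y_i \mid J_i = h) = \sigma_h^2 + \tau_h^2\left(\lambda(-c)c - \lambda^2(-c)\right)$$

$$E\left(\left[Y_i - E(Y_i \mid J_i = f)\right]^3 \mid J_i = f\right) = \tau_f^3 \lambda(c)\left[2\lambda^2(c) + 3c\lambda(c) + c^2 - 1\right]$$

$$E\left(\left[Y_i - E(Y_i \mid J_i = h)\right]^3 \mid J_i = h\right) = \tau_h^3 \lambda(-c)\left[2\lambda^2(-c) - 3c\lambda(-c) + c^2 - 1\right]$$

This gives us seven equations in the five unknowns $\mu_f, \mu_h, \sigma_f^2, \sigma_h^2$, and $\sigma_{fh}$. It is straightforward to show that the five parameters can be identified from this system of equations.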
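The first two of these moment conditions can be checked by simulation. The sketch below is illustrative only: the parameter values are hypothetical, and it relies on the standard truncated-normal results that the probability of fishing is the normal cdf evaluated at the standardized mean difference `c`, and that the mean fishing wage among fishermen is the fishing mean plus `tau_f` times the ratio of the normal pdf to the normal cdf at `c`.

```python
import math
import random

random.seed(2)


def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical parameter values.
mu_f, mu_h = 0.5, 0.0
s_f, s_h, rho = 1.0, 0.8, 0.3
s_fh = rho * s_f * s_h                                  # covariance of the errors
sigma = math.sqrt(s_f ** 2 + s_h ** 2 - 2.0 * s_fh)     # sd of the error difference
c = (mu_f - mu_h) / sigma
tau_f = (s_f ** 2 - s_fh) / sigma

prob_f_formula = Phi(c)
mean_f_formula = mu_f + tau_f * phi(c) / Phi(c)

# Simulate the model and compute the same two moments from the draws.
N = 200_000
n_f = 0
sum_f = 0.0
for _ in range(N):
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    e_f = s_f * z1
    e_h = s_h * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    y_f = mu_f + e_f
    y_h = mu_h + e_h
    if y_f > y_h:
        n_f += 1
        sum_f += y_f

prob_f_sim = n_f / N
mean_f_sim = sum_f / n_f
```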

However, Theorems 7 and 8 of Heckman and Honoré (1990) show that when one relaxes the log normality assumption, without exclusion restrictions in the outcome equation, the model is no longer identified. This is true despite the strong assumption of agent income maximization. This result is not particularly surprising in the sense that our goal is to estimate a full joint distribution of a two dimensional object $(Y_{fi}, Y_{hi})$, but all we can observe is two one dimensional distributions (wages conditional on job choice). Since there is no information in the data about the wage that a fisherman may have received as a hunter, one cannot identify this joint distribution. In fact, Theorem 7 of Heckman and Honoré (1990) states that we can never distinguish the actual model from an alternative model in which skills are independent of each other.

3.1 Estimation of the normal linear labor supply model

It is often the case that we only observe wages in one sector. For example, when estimating models of participation in the labor force, the wage is observed only if the individual works. We can map this into our model by associating working with “fishing” and not working with “hunting.” That is, we let $Y_{fi}$ denote income if working and let $Y_{hi}$ denote the value of not working. 5

But there are other examples in which we observe the wage in only one sector. For example, in many data sets we do not observe wages of workers in the black market sector. Another example is return immigration in which we know when a worker leaves the data to return to their home country, but we do not observe that wage.

In Section 3.2 we discuss identification of the nonparametric version of the model. However, it turns out that identification of the more complicated model is quite similar to estimation of the model with normally distributed errors. Thus we review this in detail before discussing the nonparametric model. We also remark that providing a consistent estimator also provides a constructive proof of identification, so one can also interpret these results as (informally) showing identification in the normal model. The model is similar to Willis and Rosen’s (1979) Roy Model of educational choices or Lee’s (1978) model of union status and the empirical approach is analogous. We assume that

$$Y_{fi} = X_{fi}'\gamma_{ff} + X_{0i}'\gamma_{0f} + \varepsilon_{fi}$$

$$Y_{hi} = X_{hi}'\gamma_{hh} + X_{0i}'\gamma_{0h} + \varepsilon_{hi}$$

$$\begin{pmatrix} \varepsilon_{fi} \\ \varepsilon_{hi} \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_f^2 & \sigma_{fh} \\ \sigma_{fh} & \sigma_h^2 \end{pmatrix}\right).$$

In a labor supply model where $J_i = f$ represents market work, $Y_{fi}$ is the market wage which will be observed for workers only. $Y_{hi}$, the pecuniary value of not working, is never observed in the data. Keane et al.’s (2011) example of the static model of a married woman’s labor force participation is similar.

One could simply estimate this model by maximum likelihood. However we discuss a more traditional four step method to illustrate how the parametric model is identified. This four step process will be analogous to the more complicated nonparametric identification below. Step 1 is a “reduced form probit” of occupational choices as a function of all covariates in the model. Step 2 estimates the wage equations by controlling for selection as in the second step of a Heckman Two step (Heckman, 1979). Step 3 uses the coefficients of the wage equations and plugs these back into a probit equation to estimate a “structural probit.” Step 4 shows identification of the remaining elements of the variance-covariance matrix of the residuals.

Step 1: Estimation of choice model

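The probability of choosing fishing (i.e., work) is:

$$\Pr(J_i = f \mid X_i = x) = \Pr(Y_{fi} > Y_{hi} \mid X_i = x) = \Phi\left(\frac{x_f'\gamma_{ff} - x_h'\gamma_{hh} + x_0'(\gamma_{0f} - \gamma_{0h})}{\sigma^*}\right) = \Phi(x'\gamma^*) \qquad (3.5)$$

where $\Phi$ is the cdf of a standard normal, $\sigma^*$ is the standard deviation of $\varepsilon_{hi} - \varepsilon_{fi}$, and

$$\gamma^* \equiv \left(\frac{\gamma_{ff}}{\sigma^*}, \; -\frac{\gamma_{hh}}{\sigma^*}, \; \frac{\gamma_{0f} - \gamma_{0h}}{\sigma^*}\right).$$

This is referred to as the “reduced form model” as it is a reduced form in the classical sense: the parameters are a known function of the underlying structural parameters. It can be estimated by maximum likelihood as a probit model. Let $\hat{\gamma}^*$ represent the estimated parameter vector. This is all that can be learned from the choice data alone. We need further information to identify $\sigma^*$ and to separate $\gamma_{0f}$ from $\gamma_{0h}$.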

Step 2: Estimating the wage equation

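This is essentially the second stage of a Heckman (1979) two step. To review the idea behind it, let

$$\varepsilon_i^* = \frac{\varepsilon_{hi} - \varepsilon_{fi}}{\sigma^*}.$$

Then consider the regression

$$\varepsilon_{fi} = \tau \varepsilon_i^* + \zeta_i,$$

where $\mathrm{cov}(\varepsilon_i^*, \zeta_i) = 0$ (by definition of regression) and thus:

$$\tau = \frac{\mathrm{cov}(\varepsilon_{fi}, \varepsilon_i^*)}{\mathrm{var}(\varepsilon_i^*)} = \frac{\sigma_{fh} - \sigma_f^2}{\sigma^*}.$$

The wage of those who choose to work is

$$E(Y_{fi} \mid J_i = f, X_i = x) = x_f'\gamma_{ff} + x_0'\gamma_{0f} + \tau E(\varepsilon_i^* \mid \varepsilon_i^* \leq x'\gamma^*) = x_f'\gamma_{ff} + x_0'\gamma_{0f} - \tau\lambda(x'\gamma^*) \qquad (3.6)$$

where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$. Showing that $E(\varepsilon_i^* \mid \varepsilon_i^* \leq x'\gamma^*) = -\lambda(x'\gamma^*)$ is a fairly straightforward integration problem and is well known. Because Eq. (3.6) is a conditional expectation function, OLS regression of $Y_i$ on $X_{fi}$, $X_{0i}$, and $\lambda(X_i'\hat{\gamma}^*)$ gives consistent estimates of $\gamma_{ff}$, $\gamma_{0f}$, and $-\tau$. $\hat{\gamma}^*$ is the value of $\gamma^*$ estimated in Eq. (3.5).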
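The integration result behind the selection correction can be verified numerically. The sketch below checks the standard fact that a standard normal variable, conditional on lying below a cutoff, has mean equal to minus the ratio of the normal pdf to the normal cdf at that cutoff (the inverse Mills ratio). The cutoff value `a` and the helper names are hypothetical.

```python
import math
import random

random.seed(4)


def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills(a):
    """Ratio of the standard normal pdf to the cdf at a."""
    return phi(a) / Phi(a)


a = 0.3  # hypothetical value of the probit index for some covariate value

draws = [random.gauss(0.0, 1.0) for _ in range(200_000)]
selected = [z for z in draws if z <= a]      # the selected ("working") subsample

mean_selected = sum(selected) / len(selected)
mean_formula = -inverse_mills(a)
```

The simulated mean of the selected draws should sit close to `mean_formula`, which is the term that the second step adds to the wage regression as a control.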

Note that we do not require an exclusion restriction. Since $\lambda$ is a nonlinear function, but $g_f$ is linear, this model is identified. However, without an exclusion restriction, identification is purely through functional form. When we consider a nonparametric version of the model below, exclusion restrictions are necessary. We discuss this issue in Section 3.2.

Step 3: The structural probit

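Our next goal is to estimate $\gamma_{hh}$ and $\gamma_{0h}$. In Step 1 we obtained consistent estimates of $\gamma^* = \left(\gamma_{ff}/\sigma^*, \, -\gamma_{hh}/\sigma^*, \, (\gamma_{0f} - \gamma_{0h})/\sigma^*\right)$ and in Step 2 we obtained consistent estimates of $\gamma_{ff}$ and $\gamma_{0f}$.

When there is only one exclusion restriction (i.e. $\gamma_{ff}$ is a scalar), identification proceeds as follows. Because we identified $\gamma_{ff}$ in Step 2 and $\gamma_{ff}/\sigma^*$ in Step 1, we can identify $\sigma^*$. Once $\sigma^*$ is identified, it is easy to see how to identify $\gamma_{hh}$ (because $-\gamma_{hh}/\sigma^*$ is identified) and $\gamma_{0h}$ (because $(\gamma_{0f} - \gamma_{0h})/\sigma^*$ and $\gamma_{0f}$ are identified).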
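With a single exclusion restriction this argument is just arithmetic, which the sketch below spells out. It assumes scalar coefficients and that the reduced form probit delivers the fishing coefficient divided by the scale, minus the hunting coefficient divided by the scale, and the difference in common coefficients divided by the scale. The function name, argument names and numerical values are all hypothetical.

```python
def recover_structural(gamma_ff, gamma_0f, rf_f, rf_h, rf_0):
    """Recover (sigma_star, gamma_hh, gamma_0h) from Step 1 and Step 2 estimates.

    gamma_ff, gamma_0f : wage equation coefficients from Step 2
    rf_f = gamma_ff / sigma_star              (reduced form probit, Step 1)
    rf_h = -gamma_hh / sigma_star             (reduced form probit, Step 1)
    rf_0 = (gamma_0f - gamma_0h) / sigma_star (reduced form probit, Step 1)
    """
    sigma_star = gamma_ff / rf_f              # the same coefficient, with and without scaling
    gamma_hh = -rf_h * sigma_star
    gamma_0h = gamma_0f - rf_0 * sigma_star
    return sigma_star, gamma_hh, gamma_0h


# Hypothetical structural values used to build the "estimates".
true_sigma_star, true_ff, true_0f, true_hh, true_0h = 2.0, 0.8, 1.2, 0.5, 0.4
rf = (true_ff / true_sigma_star,
      -true_hh / true_sigma_star,
      (true_0f - true_0h) / true_sigma_star)

sigma_star, gamma_hh, gamma_0h = recover_structural(true_ff, true_0f, *rf)
```

The first line of the function fails if the fishing exclusion restriction has a zero coefficient, which mirrors the need for a relevant exclusion restriction.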

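In terms of estimation of these objects, if there is more than one exclusion restriction the model is over-identified. If we have two exclusion restrictions, $\gamma_{ff}$ and $\gamma_{ff}/\sigma^*$ are both 2 × 1 vectors, and thus we wind up with 2 consistent estimates of $\sigma^*$. The most standard way of solving this model is by estimating the “structural probit:”

$$\Pr(J_i = f \mid X_i = x) = \Phi\left(\frac{1}{\sigma^*}\left(x_f'\gamma_{ff} + x_0'\gamma_{0f}\right) - x_0'\frac{\gamma_{0h}}{\sigma^*} - x_h'\frac{\gamma_{hh}}{\sigma^*}\right) \qquad (3.7)$$

That is, one just runs a probit of $J_i$ on $(X_{fi}'\hat{\gamma}_{ff} + X_{0i}'\hat{\gamma}_{0f})$, $X_{0i}$, and $X_{hi}$ where $\hat{\gamma}_{ff}$ and $\hat{\gamma}_{0f}$ are our estimates of $\gamma_{ff}$ and $\gamma_{0f}$.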

Step 3 is essential if our goal is to estimate the labor supply equation. If we are only interested in controlling for selection to obtain consistent estimates of the wage equation, we do not need to worry about the structural probit. However, notice that

$$\Pr(J_i = f \mid X_i = x) = \Phi\left(\frac{x_f'\gamma_{ff} + x_0'\gamma_{0f} - x_h'\gamma_{hh} - x_0'\gamma_{0h}}{\sigma^*}\right)$$

and thus the labor supply elasticity is:

$$\frac{\partial \log \Pr(J_i = f \mid X_i = x)}{\partial E(Y_{fi} \mid X_i = x)} = \frac{\phi(x'\gamma^*)}{\Phi(x'\gamma^*)} \frac{1}{\sigma^*} = \frac{\lambda(x'\gamma^*)}{\sigma^*},$$

where, as before, $Y_{fi}$ is the log of income if working. Thus knowledge of $\sigma^*$ is essential for identifying the effects of wages on participation.

One could not estimate the structural probit without the exclusion restriction $X_{fi}$ as the first two components of the probit in Eq. (3.7) would be perfectly collinear. For any $\sigma^* > 0$ we could find a value of $\gamma_{0h}$ and $\gamma_{hh}$ that delivers the same choice probabilities. Furthermore, if these parameters were not identified, the elasticity of labor supply with respect to wages would not be identified either.

Step 4: Estimation of the variance matrix of the residuals

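Lastly, we identify all the components of $\Sigma$, $(\sigma_f^2, \sigma_h^2, \sigma_{fh})$ as follows. We have described how to obtain consistent estimates of $\sigma^* = \sqrt{\sigma_f^2 + \sigma_h^2 - 2\sigma_{fh}}$ and $\tau = (\sigma_{fh} - \sigma_f^2)/\sigma^*$. This gives us two equations in three parameters. We can obtain the final equation by using the variance of the residual in the selection model since

$$\mathrm{Var}(\varepsilon_{fi} \mid J_i = f, X_i = x) = \sigma_f^2 + \tau^2\left[-\lambda(x'\gamma^*)\, x'\gamma^* - \lambda^2(x'\gamma^*)\right].$$

Let $i = 1, \ldots, N_f$ index the set of individuals who choose $J_i = f$ and $\hat{\varepsilon}_{fi}$ is the residual $Y_i - X_{fi}'\hat{\gamma}_{ff} - X_{0i}'\hat{\gamma}_{0f} + \hat{\tau}\lambda(X_i'\hat{\gamma}^*)$ for individuals who choose $J_i = f$. Using “hats” to denote estimators we can estimate the model as

$$\hat{\sigma}_f^2 = \frac{1}{N_f}\sum_{i=1}^{N_f}\left(\hat{\varepsilon}_{fi}^2 + \hat{\tau}^2\left[\lambda(X_i'\hat{\gamma}^*)\, X_i'\hat{\gamma}^* + \lambda^2(X_i'\hat{\gamma}^*)\right]\right)$$

$$\hat{\sigma}_{fh} = \hat{\sigma}_f^2 + \hat{\tau}\hat{\sigma}^*$$

$$\hat{\sigma}_h^2 = \hat{\sigma}^{*2} - \hat{\sigma}_f^2 + 2\hat{\sigma}_{fh}$$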

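The last part of Step 4 is again simple algebra. The sketch below assumes the relations that the squared scale equals the sum of the two error variances minus twice their covariance, and that the selection coefficient `tau` equals the covariance minus the fishing variance, divided by the scale. Given the fishing variance, `tau` and the scale, it backs out the covariance and the hunting variance. The function name and numbers are hypothetical, and the example is a round trip from known values rather than an estimator.

```python
import math


def recover_covariance(sigma_f2, tau, sigma_star):
    """Back out (sigma_fh, sigma_h2) from sigma_f2, tau and sigma_star.

    Uses tau = (sigma_fh - sigma_f2) / sigma_star and
    sigma_star**2 = sigma_f2 + sigma_h2 - 2 * sigma_fh.
    """
    sigma_fh = sigma_f2 + tau * sigma_star
    sigma_h2 = sigma_star ** 2 - sigma_f2 + 2.0 * sigma_fh
    return sigma_fh, sigma_h2


# Hypothetical true variance matrix entries.
true_f2, true_h2, true_fh = 1.0, 0.64, 0.24

# Quantities that the earlier steps are assumed to deliver.
sigma_star = math.sqrt(true_f2 + true_h2 - 2.0 * true_fh)
tau = (true_fh - true_f2) / sigma_star

rec_fh, rec_h2 = recover_covariance(true_f2, tau, sigma_star)
```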
3.2 Identification of the Roy model: the non-parametric approach

Although the parametric case with exclusion restrictions is more commonly known, the model in the previous section is still identified non-parametrically if the researcher is willing to impose stronger support conditions on the observable variables. Heckman and Honoré (1990, Theorem 12) provide conditions under which one can identify the model nonparametrically using exclusion restrictions. We present this case below.

Assumption 3.1. $(\varepsilon_{fi}, \varepsilon_{hi})$ is continuously distributed with distribution function $G$, support $\mathbb{R}^2$, and is independent of $X_i$. The marginal distributions of $\varepsilon_{fi}$ and $\varepsilon_{fi} - \varepsilon_{hi}$ have medians equal to zero.

Assumption 3.2. $\mathrm{supp}\left(g_f(X_{fi}, x_0), g_h(X_{hi}, x_0)\right) = \mathbb{R}^2$ for all $x_0 \in \mathrm{supp}(X_{0i})$.

Assumption 3.2 is crucial for identification. It states that for any value of $g_h(x_h, x_0)$, $g_f(X_{fi}, x_0)$ varies across the full real line and for any value of $g_f(x_f, x_0)$, $g_h(X_{hi}, x_0)$ varies across the full real line. This means that we can condition on a set of variables for which the probability of being a hunter (i.e. $\Pr(J_i = h \mid X_i = x)$) is arbitrarily close to 1. This is clearly a very strong assumption that we will discuss further.

We need the following two assumptions for the reasons discussed in Section 2.4.

Assumption 3.3. $X_i = (X_{fi}, X_{hi}, X_{0i})$ can be written as $(X_i^c, X_i^d)$ where the elements of $X_i^c$ are continuously distributed (no point has positive mass), and $X_i^d$ is distributed discretely (all support points have positive mass).

Assumption 3.4. For any $x^d \in \mathrm{supp}(X_i^d)$, $g_f(x_f, x_0)$ and $g_h(x_h, x_0)$ are almost surely continuous across $x^c \in \mathrm{supp}(X_i^c \mid X_i^d = x^d)$.

Under these assumptions we can prove the theorem following Heckman and Honoré (1990).

Theorem 3.1. If ($J_i \in \{f, h\}$, $Y_{fi}$ if $J_i = f$, $X_i$) are all observed and generated under model (3.1)-(3.4), then under Assumptions 3.1-3.4, $g_f$, $g_h$, and $G$ are identified on a set $\mathcal{X}^*$ that has measure 1.

(Proof in Appendix.)

A key theme of this chapter is that the basic structure of identification in this model is similar to identification of more general selection models, so we explain this result in much detail. The basic structure of the proof we present below is similar to Heckman and Honoré’s proof of their Theorems 10 and 12. We modify the proof to allow for the case where $Y_{hi}$ is not observed.

The proof in the Appendix is more precise, but in the text we present the basic ideas. We follow a structure analogous to the parametric empirical approach when the residuals are normally distributed as presented in Section 3.1. First we consider identification of the occupational choice given only observable covariates and the choice model. This is the nonparametric analogue of the reduced form probit. Second we estimate $g_f$ given the data on $Y_{fi}$, which is the analogue of the second stage of the Heckman two step, and is more broadly the nonparametric version of the classical selection model. In the third step we consider the nonparametric analogue of identification of the structural probit. Since we will have already established identification of $g_f$, identification of this part of the model boils down to identification of $g_h$. Finally in the fourth step we consider identification of $G$ (the joint distribution of $(\varepsilon_{fi}, \varepsilon_{hi})$). We discuss each of these steps in order.

To map the Roy model into our formal definition of identification presented in Section 2.2, the model is determined by $\theta = (g_f, g_h, G)$, where $G$ is the joint distribution of $(\varepsilon_{fi}, \varepsilon_{hi})$. The observable data here is ($X_i$, $J_i$, $Y_{fi}$ if $J_i = f$). Thus $P$ is the joint distribution of this observable data and $\Theta(P)$ represents the possible data generating processes consistent with $P$.

Step 1: Identification of choice model

The nonparametric identification of this model is established in Matzkin (1992). We can write the model as


where image is the distribution function of image.

Using data only on choices, this model is only identified up to a monotonic transformation. To see why, note that we can write image when

image     (3.8)

but this is equivalent to the condition

image     (3.9)

where image is any strictly increasing function. Clearly the model in Eq. (3.8) cannot be distinguished from an alternative model in Eq. (3.9). This is the nonparametric analog of the problem that the scale (i.e., the variance of image) and location (only the difference between image and image but not the level of either) of the parametric binary choice model are not identified. Without loss of generality we can normalize the model up to a monotonic transformation. There are many ways to do this. A very convenient normalization is to choose the transformation image because image has a uniform distribution. 6 So we define





Thus we have established that we can (i) write the model as image if and only if image where image is uniform image and (ii) that image is identified.
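The normalization can be checked numerically. The following is a minimal sketch, assuming a hypothetical latent-index rule in which an individual chooses f when `index > eps` with a logistic error; the names `index`, `eps`, and `F` are illustrative and are not the chapter's notation. Applying the error's own distribution function to both sides leaves every choice unchanged, makes the transformed error uniform, and turns the transformed index into the choice probability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical latent-index choice model (illustrative, not the chapter's notation):
# an individual chooses f when index > eps.
x = rng.normal(size=n)
index = 0.5 + 1.5 * x
eps = rng.logistic(size=n)


def F(e):
    """Distribution function of the standard logistic error."""
    return 1.0 / (1.0 + np.exp(-e))


choice_original = index > eps

# Apply the strictly increasing transformation F to both sides of the inequality.
g = F(index)  # normalized index
u = F(eps)    # normalized error, uniform on (0, 1)
choice_normalized = g > u

# Share of individuals whose choice is the same under both representations.
agreement = float(np.mean(choice_original == choice_normalized))

# A uniform (0, 1) variable has mean 1/2 and variance 1/12.
u_mean, u_var = float(u.mean()), float(u.var())

# The normalized index should match the observed choice frequency near x = 1.
band = np.abs(x - 1.0) < 0.05
freq_in_band = float(choice_original[band].mean())
g_at_one = float(F(0.5 + 1.5 * 1.0))
```

Any other strictly increasing transformation would also preserve the choices; the distribution function is convenient only because the normalized index can then be read directly off observed choice frequencies.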

This argument can be mapped into our formal definition of identification from Section 2.2 above. The goal here is identification of image, so we define image. Note that even though image is not part of image, it is a known function of the components of image. The key set now is image, which is now defined as the set of possible values image that could have generated the joint distribution of image. Since image, no other possible value of image could generate the data. Thus image only contains the true value and is thus a singleton.

Step 2: Identification of the wage equation image

Next consider identification of image. Median regression identifies


The goal is to identify image. The problem is that when we vary image we also typically vary image. This is the standard selection problem. Because we can add any constant to image and subtract it from image without changing the model, a normalization that allows us to pin down the location of image is that image. The problem is that this is the unconditional median rather than the conditional one. The solution here is what is often referred to as identification at infinity (e.g. Chamberlain, 1986, or Heckman, 1990). For some value image suppose we can find a value of image to send image arbitrarily close to one. It is referred to as identification at infinity because if image were linear in the exclusion restriction image this could be achieved by sending image. In our fishing/hunting example, this could be sending the price of rabbits to zero which in turn sends log income from hunting to image. Then notice that 7


Thus image is identified.

Conditioning on image so that image is arbitrarily close to one is essentially conditioning on a group of individuals for whom there is no selection, and thus there is no selection problem. Thus we are essentially saying that if we can condition on a group of people for whom there is no selection we can solve the selection bias problem.
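A small simulation illustrates the argument. This is a sketch under assumed functional forms, not the chapter's specification: log wages are a constant plus a bivariate normal error with median zero, a worker fishes when the fishing wage exceeds the hunting wage, and the hypothetical variable `g_h` stands for the hunting index that the exclusion restriction lets us move while holding the fishing index `g_f` fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
g_f = 1.0  # true median log wage as a fisherman at a fixed x (hypothetical value)

# Correlated errors with median zero (an assumption made for this sketch).
eps = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=n)
eps_f, eps_h = eps[:, 0], eps[:, 1]


def observed_median(g_h):
    """Return Pr(fish) and the median of fishing wages among those who fish."""
    y_f = g_f + eps_f
    y_h = g_h + eps_h
    fish = y_f > y_h
    return float(fish.mean()), float(np.median(y_f[fish]))


# Lowering g_h (for example, a falling price of rabbits) pushes Pr(fish) toward one.
results = {g_h: observed_median(g_h) for g_h in (1.0, -1.0, -8.0)}
```

When `g_h` equals `g_f`, about half of the workers fish and the observed median overstates `g_f` because fishermen are positively selected. As `g_h` falls, the fishing probability approaches one and the observed median approaches `g_f`.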

While this may seem like cheating, without strong functional form assumptions it is necessary for identification. To see why, suppose there is some upper bound of image equal to image which would prevent us from using this type of argument. Consider any potential worker with a value of image. For those individuals it must be the case that


so they must always be a hunter. As a result, the data is completely uninformative about the distribution of image for these individuals. For this reason the unconditional median of image would not be identified. We will discuss approaches to dealing with this potential problem in the Treatment Effect section below.

To relate this to the framework from Section 2.2 above now we define image, so image contains the values of image consistent with image. However since


image is the only element of image, thus it is identified.

Identification of the slope only without “identification at infinity”

If one is only interested in identifying the “slope” of image and not the intercept, one can avoid using an identification at infinity argument. That is, for any two points image and image, consider identifying the difference image. The key to identification is the existence of the exclusion restriction image. For these two points, suppose we can find values image and image so that


There may be many pairs of image that satisfy this equality and we could choose any of them. Define image. The key aspect of this is that since image, and thus the probability of being a fisherman is the same given the two sets of points, then the bias terms are also the same: image.

This allows us to write


As long as we have sufficient variation in image we can do this everywhere and identify image up to location.
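The matching argument can be sketched in a simulation. The functional forms are assumptions made for illustration: `g_f` is a hypothetical linear fishing wage function, errors are bivariate normal and independent of the observables, and the hunting index is set by construction so that the two points have the same choice probability. An econometrician would instead find such pairs by matching observed choice probabilities.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000


def g_f(x):
    """Hypothetical fishing wage function, unknown to the econometrician."""
    return 0.8 * x


def observed_median(x, g_h_value):
    """Return Pr(fish) and the median observed fishing wage at (x, g_h_value)."""
    eps = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=n)
    y_f = g_f(x) + eps[:, 0]
    y_h = g_h_value + eps[:, 1]
    fish = y_f > y_h
    return float(fish.mean()), float(np.median(y_f[fish]))


x_a, x_b = 0.0, 2.0
gap = 0.5  # common value of g_f - g_h, so both points share one choice probability

p_a, med_a = observed_median(x_a, g_f(x_a) - gap)
p_b, med_b = observed_median(x_b, g_f(x_b) - gap)
matched_difference = med_b - med_a  # selection bias cancels

# Naive comparison: move x but leave the hunting index where it was at x_a.
_, med_naive = observed_median(x_b, g_f(x_a) - gap)
naive_difference = med_naive - med_a  # bias terms differ, so this is off

true_difference = g_f(x_b) - g_f(x_a)
```

The matched difference recovers the change in the wage function, while the naive difference does not, because moving `x` alone also changes the choice probability and hence the selection bias.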

Step 3: Identification of image

In terms of identifying image, the exclusion restriction that influences wages as a fisherman but not as a hunter (i.e. image) will be crucial. Consider identifying image for any particular value image. The key here is finding a value of image so that

image     (3.10)

Assumption 3.2 guarantees that we can do this. To see why Eq. (3.10) is useful, note that it must be that for this value of image

image     (3.11)

But the fact that image has median zero implies that


Since image is identified, image is identified from this expression. 8

Again to relate this to the framework in Section 2.2 above, now image and image is the set of functions image that are consistent with image. Above we showed that if image, then image. Thus since we already showed that image is identified, image is the only element of image.
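Step 3 can be sketched as a search for the value of the fishing exclusion restriction at which exactly half of the workers fish. The setup is hypothetical: `g_f(z) = z` is treated as already identified from Step 2, `g_h_true` is the unknown hunting index at a fixed value of the other covariates, and the difference in errors is normal with median zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
g_h_true = 0.7  # hypothetical value, unknown to the econometrician


def g_f(z):
    """Fishing wage function, treated as already identified from Step 2."""
    return z


eps = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=n)
diff = eps[:, 0] - eps[:, 1]  # eps_f - eps_h, which has median zero


def prob_fish(z):
    """Observed share of fishermen when the exclusion restriction equals z."""
    return float(np.mean(g_f(z) + diff > g_h_true))


# prob_fish is nondecreasing in z, so bisection finds the point where it equals 1/2.
lo, hi = -5.0, 5.0
for _ in range(40):
    mid = 0.5 * (lo + hi)
    if prob_fish(mid) < 0.5:
        lo = mid
    else:
        hi = mid
z_star = 0.5 * (lo + hi)

# At z_star the two indices coincide, so the identified g_f reveals g_h.
g_h_recovered = g_f(z_star)
```

The search needs the support of the exclusion restriction to be rich enough that a choice probability of one half is actually attained, which is the role played by the support assumption in the text.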

Step 4: Identification of image

Next consider identification of image given image and image. We will show how to identify the joint distribution of image closely following the exposition of Heckman and Taber (2008). Note that from the data one can observe

image     (3.12)

which is the cumulative distribution function of image evaluated at the point image. By varying the point of evaluation one can identify the joint distribution of image from which one can derive the joint distribution of image.

Finally in terms of the identification conditions in Section 2.2 above, now image and image is the set of distributions image consistent with image. Since image is uniquely defined by the expression (3.12) and since everything else in this expression is identified, image is the only element of image.

3.3 Relaxing independence between observables and unobservables

For expositional purposes we focus on the case in which the observables are independent of the unobservables, but relaxing these assumptions is easy to do. The simplest case is to allow for a general relationship between image and image. To see how easy this is, consider a case in which image is just binary, for example denoting men and women. Independence seems like a very strong assumption in this case. For example, the distribution of unobserved preferences might be different for women and men, leading to different selection patterns. In order to allow for this, we could identify and estimate the Roy model separately for men and for women. Expanding from binary image to finite support image is trivial, and going beyond that to continuous image is straightforward. Thus one can relax the independence assumption easily. But for expositional purposes we prefer our specification.

The distinction between image and image was not important in steps 1 and 2 of our discussion above. When one is only interested in the outcome equation image, relaxing the independence assumption between image and image can be done as well. However, in step 3 this distinction is important in identifying image and the independence assumption is not easy to relax.

If we allow for general dependence between image and image, the “identification at infinity” argument becomes more important as the argument about “Identification of the Slope Only without Identification at Infinity” no longer goes through. In that case the crucial feature of the model was that image. However, without independence this is no longer generally true because image. Thus even if image, when image, in general image.

3.4 The importance of exclusion restrictions

We now show that the model is not identified in general without an exclusion restriction. 9 Consider a simplified version of the model,



where image is uniform (0,1) and image is independent of image with distribution image and we use the location normalization image. As in Section 3.2, we observe image, whether image or image, and if image then we observe image.

We can think about estimating the model from the median regression

image     (3.13)

Under the assumption that image it must be the case that image, but this is our only restriction on image and image. Thus the model above has the same conditional median as an alternative model

image     (3.14)

where image and image. Equations (3.13) and (3.14) are observationally equivalent. Without an exclusion restriction, it is impossible to tell if observed income from working varies with image because it varies with image or because it varies with the labor force participation rate and thus the extent of selection. Thus the models in Eqs (3.13) and (3.14) are not distinguishable using conditional medians.
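The observational equivalence of conditional medians can be illustrated with a simulated pair of models. Everything in this sketch is an assumption chosen so that the selection term has a closed form: participation occurs when a uniform draw `u` is below `p(x)`, model A has a wage error `b * (u - 0.5) + eta` that is correlated with `u`, and model B has no selection at all but a wage function that absorbs the selection term `b * (p(x) - 1) / 2`.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
b = 1.0  # strength of selection in model A (hypothetical)


def p(x):
    """Participation probability; the first stage is the same in both models."""
    return 0.2 + 0.6 * x  # x is taken to lie in [0, 1]


def g(x):
    """Wage function in model A."""
    return 1.0 + 0.5 * x


def g_alt(x):
    """Wage function in model B, which absorbs model A's selection term."""
    return g(x) + b * (p(x) - 1.0) / 2.0


def observed_median_A(x):
    u = rng.uniform(size=n)
    eta = rng.normal(scale=0.5, size=n)
    y = g(x) + b * (u - 0.5) + eta  # error depends on u, so there is selection
    return float(np.median(y[u <= p(x)]))


def observed_median_B(x):
    u = rng.uniform(size=n)
    eta = rng.normal(scale=0.5, size=n)
    y = g_alt(x) + eta  # error independent of u, so there is no selection
    return float(np.median(y[u <= p(x)]))


grid = (0.0, 0.5, 1.0)
medians_A = [observed_median_A(x) for x in grid]
medians_B = [observed_median_B(x) for x in grid]
max_gap = max(abs(a - c) for a, c in zip(medians_A, medians_B))
```

The two models have different wage functions, yet their observed conditional medians agree up to simulation noise. With a variable that shifts `p` without entering `g`, the two would no longer coincide.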

To show the two models are indistinguishable using the full joint distribution of the data, consider an alternative data generating model with the same first stage, but now image is determined by


where image is independent of image with image. Let image be the joint distribution of image in the alternative model. We will continue to assume that in the alternative model image. The question is whether the alternative model is able to generate the same data distribution.

In the true model


In the alternative model


Thus these two models generate exactly the same joint distribution of data and cannot be separately identified as long as we define image so that 10


4 The Generalized Roy Model

We next consider the “Generalized Roy Model” (as defined in, e.g., Heckman and Vytlacil, 2007a). The basic Roy model assumes that workers only care about their income. The Generalized Roy Model allows workers to care about non-pecuniary aspects of the job as well. Let image and image be the utility that individual image would receive from being a fisherman or a hunter respectively, where for image,

image     (4.1)

where image represents the non-pecuniary utility gain from observables image and image. The variable image allows for the fact that there may be other variables that affect the taste for hunting versus fishing directly, but do not affect wages in either sector. 11 Note that we are imposing separability between image and image. In general we can provide conditions in which the results presented here will go through if we relax this assumption, but we impose it for expositional simplicity. The occupation is now defined as

image     (4.2)

We continue to assume that


image     (4.3)

image     (4.4)

It will be useful to define a reduced form version of this model. Note that people fish when


In the previous section we described how the choice model can only be identified up to a monotonic transform and that assuming the error term is uniform is a convenient normalization. We do the same thing here. Let image be the distribution function of image. Then we define

image     (4.5)

image     (4.6)

As above, this normalization is convenient because it is straightforward to show that


and that image is uniformly distributed on the unit interval.

We assume that the econometrician can observe the occupations of the workers and the wages that they receive in their chosen occupations as well as image.

4.1 Identification

It turns out that the basic assumptions that allow us to identify the Roy model also allow us to identify the generalized Roy model.

We start with the reduced form model in which we need two more assumptions.

Assumption 4.1. image is continuously distributed and is independent of image. Furthermore, image is distributed uniform on the unit interval and the medians of both image and image are zero.

Assumption 4.2. The support of image is image for all image.

We also slightly extend the restrictions on the functions to include image and image.

Assumption 4.3. image can be written as image where the elements of image are continuously distributed (no point has positive mass), and image are distributed discretely (all support points have positive mass).

Assumption 4.4. For any image, image, image, image and image are almost surely continuous across


Theorem 4.1. Under Assumptions 4.1-4.4, image and the joint distribution of image and of image are identified from the joint distribution of image on a set image that has measure 1 where image are generated by model (4.1)-(4.4).

(Proof in Appendix.)

The intuition for identification follows directly from the intuition given for the basic Roy model. We show this in 3 steps:

1. Identification of image is like the “Step 1: identification of choice model” section. We can only identify image up to a monotonic transformation for exactly the same reason given in that section. We impose the normalization that image is uniform in Assumption 4.1. Given that assumption


so identification of image from image comes directly.

2. Identification of image and image is completely analogous to “Step 2: identification of image” in Section 3.2. That is


The analogous argument works for image when we send image.

3. Identification of the joint distribution of image and of image is analogous to the “Step 4: identification of image” discussion in the Roy model. That is if we let image represent the joint distribution of image then


The analogous argument works for the joint distribution of image.

Note that not all parameters are identified. For example, the non-pecuniary gain from fishing image is not. To identify the “structural” generalized Roy model we make two additional assumptions:

Assumption 4.5. The median of image is zero.

Assumption 4.6. For any value of image, image has full support (i.e. the whole real line).

Theorem 4.2. Under Assumptions 4.1-4.6, image, the distribution of image, and the distribution of image are identified.

(Proof in Appendix.)

Note that Theorem 4.1 gives the joint distribution of image while Theorem 4.2 gives the joint distribution of image. Since image, this really just amounts to saying that image is identified.

Furthermore, whereas image and image are identified in Theorem 4.1, image is identified in Theorem 4.2. Recall image is the added utility (measured in money) of being a fisherman relative to a hunter. The exclusion restrictions image and image help us identify this. These exclusion restrictions allow us to vary the pecuniary gains of the two sectors, holding preferences image constant. Identification is analogous to the “Step 3: identification of image” in the standard Roy model. To see where identification comes from, for every image think about the following conditional median


Since the median of image is zero, this means that


and thus


Because image and image is identified, image is identified also. The argument above shows that we do not need both image and image; we only need image or image.

Suppose there is no variable that affects earnings in one sector but not preferences (image or image). An alternative way to identify image is to use a cost measured in dollars. Consider the linear version of the model with normal errors and without exclusion restrictions image so that




The reduced form probit is:


where image is the standard deviation of image. Theorem 4.1 above establishes that the functions image and image (i.e., image and image) as well as the variance of image and image are identified. We still need to identify image, image and image. Thus we are able to identify


If image and image are scalars we still have three parameters (image) and two restrictions image. If they are not scalars, we still have one more parameter than restriction. However suppose that one of the exclusion restrictions represents a cost variable that is measured in the same units as image. For example in a schooling case suppose that image represents the present value of earnings as a college graduate, image represents the present value of earnings as a high school graduate, and the exclusion restriction, image, represents the present value of college tuition. In this case image the coefficient on image is image, so image is identified. Given image it is very easy to show that the rest of the parameters are identified as well. Heckman et al. (1998) provide an example of this argument using tuition in the manner described above. In Section 7.3 we discuss Heckman and Navarro (2007) who use this approach as well.
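The scale argument can be sketched numerically. The sketch assumes, as an illustration of the tuition example, that tuition enters the choice index with coefficient minus one in dollar units and that the composite error is normal with standard deviation `sigma_true`; the names `gain`, `attend_rate`, and `sigma_true` are hypothetical. Under these assumptions the reduced form probit coefficient on tuition equals minus one over the standard deviation, so its reciprocal recovers the scale.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(5)
n = 400_000
sigma_true = 2.0  # standard deviation of the composite error (hypothetical)
gain = 0.5        # net gain from college at a fixed x, in the units of tuition


def attend_rate(tuition):
    """Observed college attendance rate at a given tuition level."""
    nu = rng.normal(scale=sigma_true, size=n)
    return float(np.mean(gain - tuition + nu > 0))


inv_cdf = NormalDist().inv_cdf  # inverse of the standard normal CDF

# The probit index is (gain - tuition) / sigma, recovered from attendance rates.
t0, t1 = 0.0, 1.0
index0 = inv_cdf(attend_rate(t0))
index1 = inv_cdf(attend_rate(t1))

tuition_coef = (index1 - index0) / (t1 - t0)  # approximately -1 / sigma
sigma_hat = -1.0 / tuition_coef
```

Once the scale is known, the remaining reduced form coefficients can be converted into dollar units by multiplying by `sigma_hat`.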

4.2 Lack of identification of the joint distribution of image

In pointing out what is identified in the model it is also important to point out what is not identified. Most importantly in the generalized Roy model we were able to identify the joint distribution between the error terms in the selection equation and each of the outcomes, but not the joint distribution of the variables in the outcome equation. In particular the joint distribution between the error terms image is not identified. Even strong functional form assumptions will not solve this problem. For example, it is easy to show that in the joint normal model the covariance of image is not identified.
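The joint normal case can be illustrated with two parameterizations that differ only in the covariance between the two outcome errors. This is a sketch with hypothetical numbers: the errors are ordered as (fishing wage error, hunting wage error, selection error), the worker fishes when a fixed index exceeds the selection error, and only the wage in the chosen sector is observed.

```python
import numpy as np


def covariance(cov_fh):
    """Covariance of (eps_f, eps_h, nu); only the (eps_f, eps_h) entry varies."""
    return np.array([
        [1.0, cov_fh, 0.5],
        [cov_fh, 1.0, -0.3],
        [0.5, -0.3, 1.0],
    ])


def observed_moments(cov_fh, seed, n=300_000):
    """Moments of the observable data: sector share and sector-specific wages."""
    rng = np.random.default_rng(seed)
    e = rng.multivariate_normal(np.zeros(3), covariance(cov_fh), size=n)
    fish = 0.2 > e[:, 2]     # hypothetical choice index of 0.2
    w_f = 1.0 + e[fish, 0]   # fishing wages are seen only for fishermen
    w_h = 0.5 + e[~fish, 1]  # hunting wages are seen only for hunters
    return np.array([fish.mean(), w_f.mean(), w_f.std(), w_h.mean(), w_h.std()])


m_low = observed_moments(0.0, seed=6)
m_high = observed_moments(0.4, seed=7)
max_moment_gap = float(np.max(np.abs(m_low - m_high)))
```

Both covariance matrices are valid, and the observable moments agree up to simulation noise, because the data only ever pair the selection error with one outcome error at a time.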

4.3 Are functional forms innocuous? Evidence from Catholic schools

As the theorems above make clear, nonparametric identification requires exclusion restrictions. However, completely parametric models typically do not require exclusion restrictions. In specific empirical examples, identification could primarily be coming from the exclusion restriction or identification could be coming primarily from the functional form assumptions (or some combination of the two). When researchers use exclusion restrictions in data, it is important to be clear about which assumptions actually drive identification.

We describe one example from Altonji et al. (2005b). Based on Evans and Schwab (1995), Neal (1997), and Neal and Grogger (2000) they consider a bivariate probit model of Catholic schooling and college attendance.

image     (4.7)

image     (4.8)

where image is the indicator function taking the value one if its argument is true and zero otherwise, image is a dummy variable indicating attendance at a Catholic school, and image is a dummy variable indicating college attendance. Identification of the effect of Catholic schooling on college attendance (or high school graduation) is the primary focus of these studies. The question at hand is in practice whether the assumed functional forms for image and image are important for identifying the image coefficient and thus the effect of Catholic schools on college attendance.

The model in Eqs (4.7) and (4.8) is a minor extension of the generalized Roy model. The first key difference is that the outcome variable in Eq. (4.8) is binary (attend college or not), whereas in the case of the Generalized Roy model the outcomes were continuous (earnings in either sector). The second key difference is that the outcome equation for Catholic versus Non-Catholic school only differs in the intercept (image). The error term (image) and the slope coefficients (image) are restricted to be the same. Nevertheless, the machinery to prove non-parametric identification of the Generalized Roy model can be applied to this framework. 12

Using data from the National Longitudinal Survey of 1972, Altonji et al. (2005b) consider an array of instruments and different specifications for Eqs (4.7) and (4.8). In Table 1 we present a subset of their results. We show four different models. The “Single Equation Model” gives results in which selection into Catholic school is not accounted for. The first column gives results from a probit model (with point estimates, standard errors, and marginal effects). The second column gives results from a linear probability model. Next we present the estimates of image from bivariate probit models with alternative exclusion restrictions. The final row presents the results with no exclusion restrictions. Finally we also present results from an instrumental variable linear probability model with the same set of exclusion restrictions.

Table 1 Estimated effects of Catholic schools on college attendance from linear and nonlinear models.

Single equation models
                        Probit              OLS
                        0.239               0.239
                        (0.198)             (0.070)

Two equation models
Excluded variable       Bivariate probit    2SLS
Catholic                0.285               −0.093
                        (0.543)             (0.324)
Catholic × Distance     0.478               2.572
                        (0.516)             (2.442)
None                    0.446

Urban Non-Whites from NLS-72. The first set of results comes from simple probits and from OLS. The further results come from Bivariate Probits and from two stage least squares. We present the marginal effect of Catholic high school attendance on college attendance. [Point Estimate from Probit in Brackets.] (Standard Errors in Parentheses.) Source: Altonji et al. (2005b).

One can see that the marginal effect from the single equation probit is very similar to the OLS estimate. It indicates that college attendance rates are approximately 23.9 percentage points higher for Catholic high school graduates than for public high school graduates. The rest of the table presents results from three bivariate probit models and two instrumental variables models using alternative exclusion restrictions. The problem is clearest when the interaction between the student coming from a Catholic school and distance to the nearest Catholic school is used as an instrument. The 2SLS gives nonsensical results: a coefficient of 2.572 with an enormous standard error. This indicates that the instrument has little power. However, the bivariate probit result is more reasonable. It suggests that the true marginal causal effect is around 0.478 and the point estimate is statistically significant. This seems inconsistent with the 2SLS results which indicated that this exclusion restriction had very little power. However it is clear what is going on when we compare this result to the model at the bottom of the table without an exclusion restriction. The estimate is very similar with a similar standard error. The linearity and normality assumptions drive the results.

The case in which Catholic religion by itself is used as an instrument is less problematic. The IV result suggests a strong amount of positive selection but still yields a large standard error. The bivariate probit model suggests a marginal effect that is a bit larger than the OLS effect. However, note that the standard errors for the model with and without an exclusion restriction are quite similar, which seems inconsistent with the idea that the exclusion restriction is providing a lot of identifying information. Further note that the IV result suggests a strong positive selection bias while the bivariate probit without exclusion restrictions suggests a strong negative bias. The bivariate probit in which Catholic is excluded is somewhere between the two. This suggests that both functional form and exclusion restrictions are important in this case. We should emphasize the “suggests” part of this sentence as none of this is a formal test. It does, however, make one wonder how much trust to put in the bivariate probit results by themselves.

Another paper documenting the importance of functional form assumptions is Das et al. (2003), who estimate the return to education for young Australian women. They estimate equations for years of education, the probability of working, and wages. When estimating the wage equation they address both the endogeneity of years of education and also selection caused because we only observe wages for workers. They allow for flexibility in the returns to education (where the return depends on years of education) and also in the distribution of the residuals. They find that when they assume normality of the error terms, the return to education is approximately 12%, regardless of years of education. However, once they allow for more flexible functional forms for the error terms, they find that the returns to education decline sharply with years of education. For example, they find that at 10 years of education, the return to education is over 15%. However, at 14 years, the return to education is only about 5%.

5 Treatment Effects

There is a very large literature on the estimation of treatment effects. For more complete summaries see Heckman and Robb (1986), Heckman et al. (1999), Heckman and Vytlacil (2007a,b), Abbring and Heckman (2007), or Imbens and Wooldridge (2009). 13 DiNardo and Lee (2011) provide a discussion that is complementary to ours. Our goal in this section is not to survey the whole literature but to provide a brief summary and to put it into the context of identification of the Generalized Roy Model.

The goal of this literature is to estimate the value of receiving a treatment defined as:

image     (5.1)

In the context of the Roy model, image is the income gain from moving from hunting to fishing. This income gain potentially varies across individuals in the population. Thus for people who choose to be fishermen, image is positive and for people who choose to be hunters, image is negative.
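The sign pattern of the gains is easy to see in a simulated basic Roy model. The numbers below are hypothetical, and the variable `gain` stands for the individual income gain from moving from hunting to fishing.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000

# Hypothetical potential log incomes in each occupation.
y_f = 1.0 + rng.normal(size=n)
y_h = 1.2 + rng.normal(size=n)

gain = y_f - y_h  # individual gain from fishing rather than hunting
fish = y_f > y_h  # basic Roy model: choose the higher-income occupation

average_gain = float(gain.mean())           # average over the whole population
gain_fishermen = float(gain[fish].mean())   # average gain among fishermen
gain_hunters = float(gain[~fish].mean())    # average gain among hunters
```

Every fisherman has a positive gain and every hunter a nonpositive one, even though the population average gain here is negative.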

Estimation of treatment effects is of great interest in many literatures. The term “treatment effect” makes the most sense in the context of the medical literature. Choice image could represent taking a medical treatment (such as an experimental drug) while image could represent no treatment. In that case image and image would represent some measure of health status for individual image with and without the treatment. Thus the treatment effect image is the effect of the drug on the health outcome for individual image.

The classic example in labor economics is job training. In that case, image would represent a labor market outcome for individuals who received training and image would represent the outcome in the absence of training.

In both the case of drug treatment and job training, empirical researchers have exploited randomized trials. Medical patients are often randomly assigned either a treatment or a placebo (i.e., a sugar pill that should have no effect on health). Likewise, many job training programs are randomly assigned. For example, in the case of the Job Training Partnership Act, a large number of unemployed individuals applied for job training (see e.g. Bloom et al., 1997). Of those who applied for training, some were assigned training and some were assigned no training.

Because assignment is random and affects the level of treatment, one can treat assignment as an exclusion restriction that is correlated with treatment (i.e., the probability that image) but is uncorrelated with preferences or ability because it is random. In this sense, random assignment solves the selection problem that is the focus of the Roy model. As we show below, exogenous variation provided by experiments allows the researcher to cleanly identify some properties of the distribution of image and image under relatively weak assumptions. Furthermore, the methods for estimating these objects are simple, which adds to their appeal.

The treatment effect framework is also widely used for evaluating quasi-experimental data. By quasi-experimental data, we mean data that are not experimental, but exploit variation that is “almost as good as” random assignment.

5.1 Treatment effects and the generalized Roy model

Within the context of the generalized Roy model note that in general


An important special case of the treatment effect defined in Eq. (5.1) is when

image     (5.2)

image     (5.3)

In this case, the treatment effect image is a constant across individuals. Identification of this parameter is relatively straightforward. However, there is a substantial literature that studies identification of heterogeneous treatment effects. As we point out above, treatment effects are positive for some people and negative for others in the context of the Roy model. Furthermore, there is ample empirical evidence that the returns to job training are not constant, but instead vary across the population (Heckman et al., 1999).

In Section 4.2 we explain why the joint distribution of image is not identified. This means that the distribution of image is not identified and even a relatively simple summary statistic like the median of this distribution is not identified in general. The key problem is that even when assignment is random, we do not observe the same people in both occupations.

Since the full generalized Roy model is complicated, hard to describe, and very demanding in terms of data, researchers often focus on a summary statistic to summarize the result. The most common in this literature is the Average Treatment Effect (ATE) defined as


From Theorem 4.1 we know that (under the assumptions of that theorem) the distributions of image and image are identified. Thus, their expected values are also identified under the one additional assumption that these expected values exist.

Assumption 5.1. The expected values of image and image are finite.

Theorem 5.1. Under the assumptions of Theorem 4.1 and Assumption 5.1, the Average Treatment effect is identified.

(Proof in Appendix.)

To see where identification of this object comes from, abstract from image so that the only observable is image, which affects the non-pecuniary gain in utility across occupations. With experimental data, image could be randomly generated assignments to occupation. Notice that


Thus the exclusion restriction is the key to identification. Note also that we need groups of individuals where image (who are always fishermen) and image (who are always hunters); thus “identification at infinity” is essential as well. For the reasons discussed in the nonparametric Roy model above, if image were never higher than some image then image would not be identified. Similarly if image were never lower than some image, then image would not be identified.
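The role of full support can be sketched in a simulation. The setup is an assumption made for illustration: a scalar `z` shifts the non-pecuniary utility of fishing without entering either wage, and the wage errors are correlated normals with mean zero. Comparing fishermen at a very high `z` with hunters at a very low `z` recovers the average treatment effect, while the same comparison over a narrow range of `z` does not.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400_000

eps_f = rng.normal(size=n)
eps_h = 0.5 * eps_f + rng.normal(size=n)
y_f = 2.0 + eps_f  # potential log wage as a fisherman
y_h = 1.5 + eps_h  # potential log wage as a hunter
true_ate = 0.5     # equals 2.0 - 1.5 because the errors have mean zero


def fisher_mean(z):
    """Share of fishermen at z and their mean observed wage."""
    fish = y_f - y_h + z > 0  # z shifts the non-pecuniary utility of fishing
    return float(fish.mean()), float(y_f[fish].mean())


def hunter_mean(z):
    """Share of hunters at z and their mean observed wage."""
    fish = y_f - y_h + z > 0
    return float(1.0 - fish.mean()), float(y_h[~fish].mean())


# Full support: extreme values of z make everyone a fisherman or everyone a hunter.
p_f, m_f = fisher_mean(10.0)
p_h, m_h = hunter_mean(-10.0)
ate_full_support = m_f - m_h

# Limited support: z never moves far enough to eliminate selection.
_, m_f_lim = fisher_mean(1.0)
_, m_h_lim = hunter_mean(-1.0)
ate_limited_support = m_f_lim - m_h_lim
```

With limited support the comparison mixes the treatment effect with selection bias in both sectors, which is the problem taken up in the sections that follow.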

While one could directly estimate the ATE using “identification at infinity”, as described above, this is not the common practice and not something we would advocate. The standard approach would be to estimate the full Generalized Roy Model and then use it to simulate the various treatment effects. This is often done using a completely parametric approach as in, for example, the classic paper by Willis and Rosen (1979). However, there are quite a few nonparametric alternatives as well, including construction of the Marginal Treatment effects as discussed in Sections 5.3 and 5.4 below.

As it turns out, even with experimental data, it is rarely the case that image is identically one or zero with positive probability. In the case of medicine, some people assigned the treatment do not take the treatment. In the training example, many people who are offered subsidized training decide not to undergo the training. Thus, when compliance with assignment is less than 100%, we cannot recover the image. In Section 5.2 we discuss more precisely what we do recover when there is less than 100% compliance.

It is also instructive to relate the ATE to instrumental variables estimation. Let image be the outcome of interest


and let image be a dummy variable indicating whether image. Consider estimating the model

image     (5.4)

using instrumental variables with image as an instrument for image. Assume that image is correlated with image but not with image or image. Consider first the constant treatment effect model described in Eqs (5.2) and (5.3) so that image for everyone in the population. In that case


Then two stage least squares on the model above yields


Thus in the constant treatment effect model, instrumental variables provide a consistent estimate of the treatment effect. However, this result does not carry over to heterogeneous treatment effects or the average treatment effects as Heckman (1997) shows. Following the expression above we get

image     (5.5)

in general. In Sections 5.2 and 5.3 below, we describe what instrumental variables identify.
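The contrast between the constant and heterogeneous cases can be illustrated with a small simulation. This is only a sketch under assumptions of our own choosing: the data generating process, the variable names (z for the instrument, d for the treatment dummy, v for the selection unobservable), and the parameter values are hypothetical and are not taken from the chapter.

```python
import numpy as np

def iv_estimate(y, d, z):
    """Simple IV estimate: ratio of sample covariances Cov(y, z) / Cov(d, z)."""
    return np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

rng = np.random.default_rng(0)
n = 400_000
z = rng.integers(0, 2, size=n)              # binary instrument, independent of all errors
v = rng.normal(size=n)                      # unobservable that drives selection
d = (0.5 * z + v > 0).astype(float)         # treatment choice depends on z and v
y0 = 1.0 + 0.8 * v + rng.normal(size=n)     # untreated outcome, correlated with selection

# Case 1 (assumed): constant treatment effect of 2.0 for everyone.
y_const = y0 + 2.0 * d
ols_const = np.cov(y_const, d)[0, 1] / np.var(d, ddof=1)   # biased by selection on v
iv_const = iv_estimate(y_const, d, z)                       # close to 2.0

# Case 2 (assumed): heterogeneous effect 2.0 + 2.0 * v, so gains are related to selection.
gain = 2.0 + 2.0 * v
y_het = y0 + gain * d
iv_het = iv_estimate(y_het, d, z)           # does not converge to the ATE
ate = gain.mean()                           # average treatment effect, close to 2.0

print(f"constant effect:      OLS = {ols_const:.3f}, IV = {iv_const:.3f}")
print(f"heterogeneous effect: IV = {iv_het:.3f}, ATE = {ate:.3f}")
```

In the second case the instrument only moves individuals whose unobservable lies in a particular range, so the IV estimate reflects the average gain for that group rather than the population average.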

In practice there are two potential problems with the assumptions behind Theorem 5.1 above:

The researcher may not have a valid exclusion restriction. We discuss some of the options for this case in Sections 5.5-5.7.
Even if they do, the variable may not have full support. By this we mean that the instrumental variable image may not vary enough, so that for some observed values of image everyone is always a fisherman and for other observed values of image everyone is always a hunter. We discuss what can be identified using exclusion restrictions with limited support in Sections 5.2-5.4 and 5.6.

We discuss a number of different approaches, some of which assume an exclusion restriction but relax the support conditions and others that do not require exclusion restrictions.

5.2 Local average treatment effects

Imbens and Angrist (1994) and Angrist et al. (1996) consider identification when the support of image takes on a finite number of points. They show that when varying the instrument over this range, they can identify what they call a Local Average Treatment Effect. Furthermore, they show how instrumental variables can be used to estimate it. It is again easiest to think about this problem after abstracting from image, as it is straightforward to condition on these variables (see Imbens and Angrist, 1994, for details). For simplicity’s sake, consider the case in which the instrument image is binary and takes on the values image. In many cases not only is the instrument discrete, but it is also binary. For example, in randomized medical trials, image represents assignment to treatment, whereas image represents assignment to the placebo. In job training programs, image represents assignment to the training program, whereas image represents no assigned training.

It is important to point out that not all patients assigned treatment actually receive the treatment. Thus image if the patient actually takes the drug and image if the individual does not take the drug. Likewise, not all individuals who are assigned training actually receive the training, so image if the individual goes to training and image if she does not. The literature on Local Average Treatment Effects handles this case as well as many others. However, we do require that the instrument of assignment has power: image. Without loss of generality we will assume that image.

Using the reduced form version of the generalized Roy model the choice problem is

image     (5.6)

where image is uniformly distributed.

The following six objects can be learned directly from the data:







The above equations show that our earlier assumption that image implies image. This, combined with the structure embedded in Eq. (5.6) means that

image     (5.7)

so that an individual who is a fisherman when image is also a fisherman when image. Similar reasoning implies image. Using this and Bayes’ rule yields

image     (5.8)

image     (5.9)

Using the fact that image, one can show that

image     (5.10)

Combining Eq. (5.10) with Eqs (5.8) and (5.9) yields

image     (5.11)

Rearranging Eq. (5.11) shows that we can identify

image     (5.12)

since everything on the right hand side is directly identified from the data.

Using the analogous argument one can show that


is identified. But this means that we can identify

image     (5.13)

which Imbens and Angrist (1994) define as the Local Average Treatment Effect. This is the average treatment effect for that group of individuals who would alter their treatment status if their value of image changed. Given the variation in image, this is the only group for whom we can identify a treatment effect. Any individual in the data with image would never choose image, so the data are silent about image. Similarly the data are silent about image.
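The identification argument above uses only two choice probabilities and four conditional means. The sketch below builds the Local Average Treatment Effect from those six sample moments and checks it against simulated data. The simulated model, the variable names (z, d, u, y0, y1), and the convention that treatment is chosen when a uniform unobservable u falls below the choice probability are assumptions made for illustration.

```python
import numpy as np

def late_from_moments(y, d, z):
    """LATE for binary instrument z and binary treatment d, using six sample moments."""
    p0 = d[z == 0].mean()                     # P(D=1 | Z=0)
    p1 = d[z == 1].mean()                     # P(D=1 | Z=1)
    m1_z1 = y[(d == 1) & (z == 1)].mean()     # E[Y | D=1, Z=1]
    m1_z0 = y[(d == 1) & (z == 0)].mean()     # E[Y | D=1, Z=0]
    m0_z1 = y[(d == 0) & (z == 1)].mean()     # E[Y | D=0, Z=1]
    m0_z0 = y[(d == 0) & (z == 0)].mean()     # E[Y | D=0, Z=0]
    # Mean treated and untreated outcomes for those who switch when z goes from 0 to 1.
    ey1_switchers = (m1_z1 * p1 - m1_z0 * p0) / (p1 - p0)
    ey0_switchers = (m0_z0 * (1 - p0) - m0_z1 * (1 - p1)) / (p1 - p0)
    return ey1_switchers - ey0_switchers

rng = np.random.default_rng(1)
n = 400_000
z = rng.integers(0, 2, size=n)
u = rng.uniform(size=n)                       # uniform selection unobservable
p_z = np.where(z == 1, 0.6, 0.2)              # assumed choice probabilities
d = (u <= p_z).astype(int)
y1 = 3.0 - 2.0 * u + rng.normal(size=n)       # treated outcome (hypothetical)
y0 = 1.0 + rng.normal(size=n)                 # untreated outcome (hypothetical)
y = np.where(d == 1, y1, y0)

switchers = (u > 0.2) & (u <= 0.6)            # treated only when z = 1
true_late = (y1 - y0)[switchers].mean()
late_hat = late_from_moments(y, d, z)
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print(f"true LATE = {true_late:.3f}, moment-based = {late_hat:.3f}, Wald ratio = {wald:.3f}")
```

The moment-based expression and the Wald ratio coincide in every sample, because the difference in mean outcomes across instrument values is algebraically the same combination of the six moments.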

Imbens and Angrist (1994) also show that the standard linear Instrumental Variables estimator yields consistent estimates of Local Average Treatment Effects. Consider the instrumental variables estimator of Eq. (5.4)


In Eq. (5.5) we showed that


Let image denote the probability that image. The numerator of the above expression is


where the key simplification comes from the fact that


Next consider the denominator




Imbens and Angrist never explicitly use the generalized Roy model or the latent index framework. Instead, they write their problem only in terms of the choice probabilities. However, in order to do this they must make one additional assumption. Specifically, they assume that if image when image then image when image. Thus changing image to image never causes some people to switch from fishing to hunting. It only causes people to switch from hunting to fishing. They refer to this as a monotonicity assumption. Vytlacil (2002) points out that this is implied by the latent index model when the index image is separable from image, as we assumed in Eq. (5.6). As is implied by Eq. (5.7), increasing the index image will cause some people to switch from hunting to fishing, but not the reverse. 14

Throughout, we use the latent index framework that is embedded in the Generalized Roy model, for three reasons. First, we can appeal to the identification results of the Generalized Roy model. Second, the latent index can be interpreted as the added utility from making a decision. Thus we can use the estimated model for welfare analysis. Third, placing the choice in an optimizing framework allows us to test the restrictions on choice that come from the theory of optimization.

As we have pointed out, not everyone offered training actually takes the training. For example, in the case of the JTPA, only 60% of those offered the training actually received it (Bloom et al., 1997). Presumably, those who took the training are those who stood the most to gain from the training. For example, the reason that many people do not take training is that they receive a job offer before training begins. For these people, the training may have been of relatively little value. Furthermore, 2% of those who applied for the training program and were not assigned to it wound up receiving the training (Bloom et al., 1997). Angrist et al. (1996) refer to those who were assigned training but did not take the training as never-takers. Those who receive the training whether or not they are assigned are always-takers. Those who receive the training only when assigned the training are compliers. In terms of the latent index framework, the never-takers are those for whom image, the compliers are those for whom image, and the always-takers are those for whom image.

The monotonicity assumption embedded in the latent index framework rules out the existence of a final group: the defiers. In the context of training, this would be an individual who receives training when not assigned training but would not receive training when assigned. At least in the context of training programs (and many other contexts) it seems safe to assume that there are no defiers.
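Once defiers are ruled out, the population shares of the three remaining groups follow directly from the two take-up rates. The short sketch below applies this to the JTPA figures quoted above (60% take-up among those offered training, 2% among those not assigned). The function name and its arguments are hypothetical.

```python
def compliance_shares(p_treated_if_assigned, p_treated_if_not_assigned):
    """Shares of always-takers, never-takers, and compliers, assuming there are no defiers."""
    compliers = p_treated_if_assigned - p_treated_if_not_assigned
    if compliers < 0:
        raise ValueError("take-up must be weakly higher under assignment (no defiers)")
    return {
        "always_takers": p_treated_if_not_assigned,   # treated even when not assigned
        "never_takers": 1.0 - p_treated_if_assigned,  # untreated even when assigned
        "compliers": compliers,                       # treated only when assigned
    }

# JTPA take-up rates as reported in the text (Bloom et al., 1997).
shares = compliance_shares(0.60, 0.02)
for group, share in shares.items():
    print(f"{group}: {share:.2f}")
```

With these inputs, roughly 2% of the population are always-takers, 40% are never-takers, and 58% are compliers, so the Local Average Treatment Effect describes a little over half of the applicants.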

5.3 Marginal treatment effects

Heckman and Vytlacil (1999, 2001, 2005, 2007b) develop a framework that is useful for constructing many types of treatment effects. They focus on the marginal treatment effect (MTE) defined in our context as


They show formally how to identify this object. We present their methodology using our notation.

Note that if we allow for regressors image and let the exclusion restriction image take on values beyond zero and one, then if image and image are in the support of the data, Eq. (5.12) can be rewritten as

image     (5.14)

for image. Now notice that for any image,


Thus if image is in the support of image, then image is identified. Since the model is symmetric, under similar conditions image is identified as well. Finally since

image     (5.15)

the marginal treatment effect is identified.

The marginal treatment effect is interesting in its own right. It is the value of the treatment for any individual with image and image. In addition, it is also useful because the different types of treatment effects can be defined in terms of the marginal treatment effect. For example


One can see from this expression that without full support this will not be identified because image will not be identified everywhere.

Heckman and Vytlacil (2005) also show that the instrumental variables estimator defined in Eq. (5.5) (conditional on image) is


where they give an explicit functional form for image. It is complicated enough that we do not repeat it here but it can be found in Heckman and Vytlacil (2005).

This framework is also useful for seeing what is not identified. In particular if image does not have full support so that it is bounded above or below, the average treatment effect will not be identified. However, many other interesting treatment effects can be identified. For example, the Local Average Treatment Effect in a model with no regressors image is

image     (5.16)
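To make the role of support concrete, the sketch below treats the average treatment effect and the Local Average Treatment Effect as averages of the same marginal treatment effect curve over different ranges of the uniform unobservable. The linear MTE function, the choice probabilities 0.2 and 0.6, and the convention that treatment is chosen when the unobservable lies below the choice probability are assumptions made for illustration.

```python
import numpy as np

def average_mte(mte, lo, hi, n_grid=100_001):
    """Average of mte(u) over u in [lo, hi], computed with the trapezoid rule."""
    u = np.linspace(lo, hi, n_grid)
    vals = mte(u)
    integral = np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(u))
    return integral / (hi - lo)

def mte(u):
    """Hypothetical marginal treatment effect: gains fall as u rises."""
    return 2.0 - 2.0 * u

# ATE: requires the MTE over the whole unit interval (full support).
ate = average_mte(mte, 0.0, 1.0)

# LATE: only requires the MTE between the two observed choice probabilities.
late = average_mte(mte, 0.2, 0.6)

print(f"ATE = {ate:.3f}, LATE over (0.2, 0.6) = {late:.3f}")
```

If the instrument only moves the choice probability between 0.2 and 0.6, the data trace out the curve on that interval alone, which is enough for the second average but not for the first.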

More generally, in this series of papers, Heckman and Vytlacil show that the marginal treatment effect can also be used to organize many ideas in the literature. One interesting case is policy effects. They define the policy relevant treatment effect as the treatment resulting from a particular policy. They show that if the relationship between the policy and the observable covariates is known, the policy relevant treatment effect can be identified from the marginal treatment effects.

5.4 Applications of the marginal treatment effects approach

Heckman and Vytlacil (1999, 2001, 2005) suggest procedures to estimate the marginal treatment effect. They suggest what they call “local instrumental variables.” Using our notation for the generalized Roy model in which image when image, where image is uniformly distributed, they show that


To see why this is the same definition of MTE as in Eq. (5.15), note that


Thus one can estimate the marginal treatment effect in three steps. First estimate image, second estimate image using some type of nonparametric regression approach, and third take the derivative.

Because as a normalization image is uniformly distributed


Thus we can estimate image from a nonparametric regression of image on image.

A very simple way to do this is to use a linear probability model of image regressed on a polynomial of image. By letting the number of terms in the polynomial grow with the sample size, this can be considered a nonparametric estimator. For the second stage we regress the outcome image on a polynomial of our estimate of image. To see how this works consider the case in which both polynomials are quadratics. We would use the following two stage least squares procedure:

image     (5.17)

image     (5.18)

where image is the predicted value from the first stage. The image coefficient may not be 0 because as we change image the instrument affects different groups of people. The MTE is the effect of changing image on image. For the case above the MTE is:

image     (5.19)

Although the polynomial procedure above is transparent, the most common technique used to estimate the MTE is local linear regression.
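The quadratic two stage procedure can be sketched in a few lines. The code below is a minimal illustration on simulated data, not a reproduction of any estimator used in the papers cited here. The variable names, the coefficient names alpha and beta, and the data generating process (in which the true MTE is 2 - 2u) are assumptions. The second stage coefficient on the linear term plus twice the coefficient on the squared term times the predicted probability gives the estimated MTE.

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ols(X, y):
    """Least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 1_000_000
z = rng.uniform(size=n)                          # continuous instrument
u = rng.uniform(size=n)                          # uniform selection unobservable
d = (u <= 0.1 + 0.8 * z).astype(float)           # treated when u is below the choice probability
y0 = 1.0 - u + rng.normal(size=n)                # untreated outcome
y1 = y0 + 2.0 - 2.0 * u + rng.normal(size=n)     # treated outcome; true MTE(u) = 2 - 2u
y = np.where(d == 1, y1, y0)

# First stage: linear probability model of treatment on a quadratic in the instrument.
alpha = ols(poly_design(z, 2), d)
p_hat = poly_design(z, 2) @ alpha

# Second stage: outcome on a quadratic in the predicted treatment probability.
beta = ols(poly_design(p_hat, 2), y)

def mte_hat(p):
    """Derivative of the fitted second stage with respect to the predicted probability."""
    return beta[1] + 2.0 * beta[2] * p

for p in (0.3, 0.5, 0.7):
    print(f"p = {p:.1f}: estimated MTE = {mte_hat(p):.3f}, true MTE = {2.0 - 2.0 * p:.3f}")
```

The estimates are only meaningful for probabilities inside the range of the predicted values, here roughly 0.1 to 0.9. Standard errors would need to account for the estimated first stage, which this sketch ignores.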

French and Song (2010) estimate the labor supply response to Disability Insurance (DI) receipt for DI applicants. Individuals are deemed eligible for DI benefits if they are “unable to engage in substantial gainful activity”—i.e., if they are unable to work. Beneficiaries receive, on average $12,000 per year, plus Medicare health insurance. Thus, there are strong incentives to apply for benefits. They continue to receive these benefits only if they earn less than a certain amount per year ($10,800 in 2007). For this reason, the DI system likely has strong labor supply disincentives. A healthy DI recipient is unlikely to work if that causes the loss of DI and health insurance benefits.

The DI system attempts to allow benefits only to those who are truly disabled. Many DI applicants have their case heard by a judge who determines those who are truly disabled. Some applicants appear more disabled than others. The most disabled applicants are unable to work, and thus will not work whether or not they get the benefit. For less serious cases, the applicant will work, but only if she is denied benefits. The question, then, is what is the optimal threshold level for the amount of observed disability before the individual is allowed benefits? Given the definition of disability, this threshold should depend on the probability that an individual does not work, even when denied the benefit. Furthermore, optimal taxation arguments suggest that benefits should be given to groups whose labor supply is insensitive to benefit allowance. Thus the effect of DI allowance on labor supply is of great interest to policy makers.

OLS is likely to be inconsistent because those who are allowed benefits are likely to be less healthy than those who are denied. Those allowed benefits would have had low earnings even if they did not receive benefits. French and Song propose an IV estimator using the process of assignment of cases to judges. Cases are assigned to judges on a rotational basis within each hearing office, which means that for all practical purposes, judges are randomly assigned to cases conditional on the hearing office and the day. Some judges are much more lenient than others. For example, the least lenient 5% of all judges allow benefits to less than 45% of the cases they hear, whereas the most lenient 5% of all judges allow benefits to 80% of all the cases they hear. Although some of those who are denied benefits appeal and get benefits later, most do not. If assignment of cases to judges is random then the instrument of judge assignment is a plausibly exogenous instrument. Furthermore, as long as judges vary in terms of leniency and not ability to detect individuals who are disabled, 15 the instrument can identify an MTE.

French and Song use a two stage procedure. In the first stage they estimate the probability that an individual is allowed benefits, conditional on the average judge specific allowance rate. They estimate a version of Eq. (5.17) where image is an indicator equal to 1 if case image was allowed benefits and image is the average allowance rate of the judge who heard case image. In the second stage they estimate earnings conditional on whether the individual was allowed benefits (as predicted by the judge specific allowance rate). They estimate a version of Eq. (5.18) where image is annual earnings 5 years after assignment to a judge. Figure 1 shows the estimated MTE (using the formula in Eq. (5.19)) using several different polynomial specifications in the first and second stage equations. Assuming that the treatment effect is constant (i.e., image), they find that annual earnings 5 years after assignment to a judge are $1500 for those allowed benefits and $3900 for those denied benefits, so the estimated treatment effect is a $2400 decline in earnings. This is the MTE-linear case in Fig. 1. However, this masks considerable heterogeneity in the treatment effects. They find that when allowance rates rise, the labor supply response of the marginal case also rises. When allowing for the quadratic term image to be non-zero, they find that less lenient judges (who allow 45% of all cases) have an MTE of a $1800 decline in earnings. More lenient judges (who allow 80% of all cases) have an MTE of a $3200 decline in earnings. This result is consistent with the notion that as allowance rates rise, more healthy individuals are allowed benefits. These healthier individuals are more likely to work when not receiving DI benefits, and thus their labor supply response to DI receipt is greater. Figure 1 also shows results when allowing for cubic and quartic terms in the polynomials in the first and second stage equations.


Figure 1 Marginal treatment effect.

One problem with an instrument such as this is that the instrument lacks full support. Even the most lenient judge does not allow everyone benefits. Even the strictest judge does not deny everyone. However, the current policy debate is whether the thresholds should be changed by only a modest amount. For this reason, the MTE on the support of the data is the effect of interest, whereas the ATE is not.

Doyle (2007) estimates the Marginal Treatment Effect of foster care on future earnings and other outcomes. Foster care likely increases earnings of some children but decreases it for others. For the most serious child abuse cases, foster care will likely help the child. For less serious cases, the child is probably best left at home. The question, then, is at what point should the child abuse investigator remove the child from the household? What is the optimal threshold level for the amount of observed abuse before which the child is removed from the household and placed into foster care?

Only children from the most disadvantaged backgrounds are placed in foster care. They would have had low earnings even if they were not placed in foster care. Thus, OLS estimates are likely inconsistent. To overcome this problem, Doyle uses IV. Case investigators are assigned to cases on a rotational basis, conditional on time and the location of the case. Case investigators are assigned to possible child abuse cases after a complaint of possible child abuse is made (by the child’s teacher, for example). Investigators have a great deal of latitude about whether the child should be sent into foster care. Furthermore, some investigators are much more lenient than others. For example, one standard deviation in the case manager removal differential (the difference between his average removal rate and the removal rate of other investigators who handle cases at the same time and place) is 10%. Whether the child is removed from the home is a good predictor of whether the child is sent to foster care. So long as assignment of cases to investigators is random and investigators only vary in terms of leniency (and not ability to detect child abuse) then the instrument of investigator assignment is a useful and plausibly exogenous instrument.

Doyle uses a two stage procedure where in the first stage he estimates the probability that a child is placed in foster care as a function of the investigator removal rate. In the second stage he estimates adult earnings as a function of whether the child was placed in foster care (as predicted by the instrument). He finds that children placed into foster care earn less than those not placed into foster care over most of the range of the data. Two stage least squares estimates reveal that foster care reduces adult quarterly earnings by about $1000, which is very close to average earnings. Interestingly, he finds that when child foster care placement rates rise, earnings of the marginal case fall. For example, earnings of the marginal child handled by a lenient investigator (who places only 20% of the children in foster care) are unaffected by placement. For less lenient investigators, who place 25% of the cases in foster care, earnings of the marginal case decline by over $1500.

Carneiro and Lee (2009) estimate the counterfactual marginal distributions of wages for college and high school graduates, and examine who enters college. They find that those with the highest returns are the most likely to attend college. Thus, increases in college attendance cause changes in the distribution of ability among college and high school graduates. For fixed skill prices, they find that a 14% increase in college participation (analogous to the increase observed in the 1980s) reduces the college premium by 12%. Likewise, Carneiro et al. (2010) find that while the conventional IV estimate of the return to schooling (using distance to a college and local labor market conditions as the instruments) is 0.095, the estimated marginal return to a policy that expands each individual’s probability of attending college by the same proportion is only 0.015.

5.5 Selection on observables

Perhaps the simplest and most common assumption is that assignment of the treatment is random conditional on observable covariates (sometimes referred to as unconfoundedness). The easiest way to think about this is that the selection error term is independent of the other error terms:

Assumption 5.2.


where image is independent of image.

We continue to assume that image and image. Note that we have explicitly dropped image from the model as we consider cases in which we do not have exclusion restrictions. The implication of this assumption is that unobservable factors that determine one’s income as a fisherman do not affect the choice to become a fisherman. That is, while it allows for selection on observables in a very general way, it does not allow for selection on unobservables.

Interestingly, this is still not enough for us to identify the Average Treatment Effect. If there are values of observable covariates image for which image or image the model is not identified. If image then it is straightforward to identify image, but image is not identified. Thus we need the additional assumption

Assumption 5.3. For almost all image in the support of image,


Theorem 5.2. Under Assumptions 5.2 and 5.3 the Average Treatment Effect is identified.

(Proof in Appendix.)

Estimation in this case is relatively straightforward. One can use matching 16 or regression analysis to estimate the average treatment effect.
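As a minimal illustration of estimation under selection on observables, the sketch below uses exact matching on a single discrete covariate: it compares treated and untreated means within each covariate cell and averages the differences using the cell shares. The simulated data, the variable names, and the choice of a three-valued covariate are assumptions made for the example.

```python
import numpy as np

def ate_by_stratification(y, d, x):
    """ATE under selection on observables, by exact matching on a discrete covariate x."""
    ate = 0.0
    for value in np.unique(x):
        cell = x == value
        treated = cell & (d == 1)
        untreated = cell & (d == 0)
        if not treated.any() or not untreated.any():
            # The support condition fails: one group is missing in this cell.
            raise ValueError(f"no overlap in cell x = {value}")
        ate += cell.mean() * (y[treated].mean() - y[untreated].mean())
    return ate

rng = np.random.default_rng(3)
n = 200_000
x = rng.integers(0, 3, size=n)                       # observable covariate with 3 values
p = np.array([0.2, 0.5, 0.8])[x]                     # treatment probability depends on x only
d = (rng.uniform(size=n) < p).astype(int)            # no selection on unobservables
y0 = 1.0 + 2.0 * x + rng.normal(size=n)              # untreated outcome
y1 = y0 + 1.0 + 0.5 * x                              # treated outcome; effect varies with x
y = np.where(d == 1, y1, y0)

naive = y[d == 1].mean() - y[d == 0].mean()          # confounded by x
ate_hat = ate_by_stratification(y, d, x)
true_ate = (y1 - y0).mean()

print(f"naive difference = {naive:.3f}, stratified ATE = {ate_hat:.3f}, true ATE = {true_ate:.3f}")
```

The function raises an error when a cell contains only treated or only untreated observations, which is the sample counterpart of a violation of the support condition in Assumption 5.3.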

5.6 Set identification of treatment effects

In our original discussion of identification we defined image as “the set of values of image that are consistent with the data distribution image.” We said that image was identified if this set was a singleton. However, there is another concept of identification we have not discussed until this point; this is set identification. Sometimes we may be interested in a parameter that is not point identified, but this does not mean we cannot say anything about it. In this subsection we consider the case of set identification (i.e. trying to characterize the set image) focusing on the case in which image is the Average Treatment Effect. Suppose that we have some prior knowledge (possibly an exclusion restriction that gives us a LATE). What can we learn about the ATE without making any functional form assumptions? In a series of papers Manski (1989, 1990, 1995, 1997) and Manski and Pepper (2000, 2009) develop procedures to derive set estimators of the Average Treatment Effect and other parameters given weak assumptions. By “set identification” we mean the set of possible Average Treatment Effects given the assumptions placed on the data. Throughout this section we will continue to assume that the structure of the Generalized Roy model holds and we derive results under these assumptions. In many cases the papers we mentioned do not impose this structure and get more general results.

Following Manski (1990) or Manski (1995), notice that

image     (5.20)

image     (5.21)

We observe all of the objects in Eqs (5.20) and (5.21) except image and image. The data are completely uninformative about these two objects. However, suppose we have some prior knowledge about the support of image and image. In particular, suppose that the support of image and image are bounded above by image and from below by image. Thus, by assumption image and image. Using these assumptions and Eqs (5.20) and (5.21) we can establish that

image     (5.22)

image     (5.23)

Using these bounds and the definition of the ATE

image     (5.24)



In practice the bounds above can yield wide ranges and are often not particularly informative. A number of other assumptions can be used to decrease the size of the identified set.
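The worst case bounds are simple to compute from three observable quantities and the assumed support of the outcome. The sketch below is an illustration with made-up inputs. The function and argument names are hypothetical, and "treated" stands for the group whose treated outcome is observed.

```python
def worst_case_ate_bounds(mean_y_treated, mean_y_untreated, p_treated, y_lower, y_upper):
    """Worst case bounds on the ATE when outcomes are known to lie in [y_lower, y_upper]."""
    p1 = p_treated
    p0 = 1.0 - p_treated
    # Mean treated outcome: observed for the treated, unknown (but bounded) for the untreated.
    ey1_lo = mean_y_treated * p1 + y_lower * p0
    ey1_hi = mean_y_treated * p1 + y_upper * p0
    # Mean untreated outcome: observed for the untreated, unknown (but bounded) for the treated.
    ey0_lo = mean_y_untreated * p0 + y_lower * p1
    ey0_hi = mean_y_untreated * p0 + y_upper * p1
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

# Hypothetical inputs: a binary outcome, half the population treated.
lo, hi = worst_case_ate_bounds(
    mean_y_treated=0.7, mean_y_untreated=0.4, p_treated=0.5, y_lower=0.0, y_upper=1.0
)
print(f"ATE is bounded between {lo:.2f} and {hi:.2f} (width {hi - lo:.2f})")
```

Whatever the inputs, the width of the interval equals the width of the assumed support of the outcome, so without further assumptions the bounds always include zero.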

Manski (1990, 1995) shows that one method of tightening the bounds is with an instrumental variable. We can write the expressions (5.20) and (5.21) conditional on image for any image as for each image,

image     (5.25)

Since image is, by assumption, mean independent of image and image (it only affects the probability of choosing one occupation versus the other), we have image and image. Assume there is a binary instrumental variable, image, which equals either 0 or 1. We can then follow exactly the same argument as in Eqs (5.22) and (5.23); conditioning on image and using Eq. (5.25) yields

image     (5.26)

image     (5.27)

Thus we can bound image from below by subtracting (5.27) from (5.26):

image     (5.28)

Our choice of a binary value of image can be trivially relaxed. In the cases in which image takes on many values one could choose any two values in the support of image to get upper and lower bounds. If our goal is to minimize the size of the set we would choose the values image and image to minimize the difference between the upper and lower bounds in (5.28):


The importance of support conditions once again becomes apparent from this expression. If we could find values image and image such that



then this expression is zero and we obtain point identification of the ATE. When image or image are bounded from below we are only able to obtain set estimates. An attractive aspect of this approach is that it provides a middle ground between identifying only the LATE and claiming that the ATE is not identified. If the identification at infinity effect is not exactly true, but approximately true so that one can find values of image and image so that image and image are small, then the bounds will be tight. If one cannot find such values, the bounds will be far apart.
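The way an instrument tightens the bounds can be sketched as follows. For each value of the instrument one computes worst case bounds on each mean potential outcome, and mean independence allows one to keep the largest lower bound and the smallest upper bound across instrument values. This intersection form is one standard way of writing the bounds and may differ in presentation from the expressions above. The function name, argument names, and numerical inputs are hypothetical.

```python
import numpy as np

def iv_ate_bounds(mean_y_treated, mean_y_untreated, p_treated, y_lower, y_upper):
    """Bounds on the ATE using a mean-independent instrument.

    The first three arguments are arrays with one entry per instrument value:
    E[Y | treated, Z=z], E[Y | untreated, Z=z], and P(treated | Z=z).
    """
    m1 = np.asarray(mean_y_treated, dtype=float)
    m0 = np.asarray(mean_y_untreated, dtype=float)
    p1 = np.asarray(p_treated, dtype=float)
    p0 = 1.0 - p1
    # Intersect the worst case bounds across instrument values.
    ey1_lo = np.max(m1 * p1 + y_lower * p0)
    ey1_hi = np.min(m1 * p1 + y_upper * p0)
    ey0_lo = np.max(m0 * p0 + y_lower * p1)
    ey0_hi = np.min(m0 * p0 + y_upper * p1)
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

# Hypothetical binary instrument that moves the treatment probability from 0.1 to 0.9.
lo, hi = iv_ate_bounds(
    mean_y_treated=[0.8, 0.7],
    mean_y_untreated=[0.4, 0.3],
    p_treated=[0.1, 0.9],
    y_lower=0.0,
    y_upper=1.0,
)
print(f"ATE is bounded between {lo:.2f} and {hi:.2f}")
```

In this example the interval has width 0.2 rather than 1, because at one instrument value almost everyone is treated and at the other almost no one is. As the two probabilities approach one and zero the interval collapses to a point.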

In many cases these bounds may be wide. Wide bounds can be viewed in two ways. One interpretation is that the bounding procedure is not particularly helpful in learning about the true ATE. However, a different interpretation is that it shows that the data, without additional assumptions, are not particularly helpful for learning about the ATE. Below we discuss additional assumptions for tightening the bounds on the ATE, such as Monotone treatment response, Monotone treatment selection, and Monotone instruments. In order to keep matters simple, below we assume that there is no exclusion restriction. However, if an exclusion restriction is known, this allows us to tighten the bounds.

Next we consider the assumption of Monotone Treatment Response introduced in Manski (1997), which we write as

Assumption 5.4. Monotone Treatment Response


with probability one.

In the fishing/hunting example this is not a particularly natural assumption, but for many applications in labor economics it is. Suppose we are interested in knowing the returns to a college degree, and image is income for individual image if a college graduate whereas image is income if a high school graduate. It is reasonable to believe that the causal effect of school or training cannot be negative. That is, one could reasonably assume that receiving more education can’t causally lower your wage. Thus, Monotone Treatment Response seems like a reasonable assumption in this case. This can tighten the bounds above quite a bit because now we know that

image     (5.29)

image     (5.30)

From this Manski (1997) shows that


Another interesting assumption that can also help tighten the bounds is the Monotone Treatment Selection assumption introduced in Manski and Pepper (2000). In our framework this can be written as

Assumption 5.5. Monotone Treatment Selection: for image or image,


Again this might not be completely natural for the fishing/hunting example, but may be plausible in many other cases. For example it seems like a reasonable assumption in schooling if we believe that there is positive sorting into schooling. Put differently, suppose the average college graduate is a more able person than the average high school graduate and would earn higher income, even if she did not have the college degree. If this is true, then the average difference in earnings between college and high school graduates overstates the true causal effect of college on earnings. This also helps to further tighten the bounds as this implies that


Note that by combining the MTR and MTS assumption, one can get the tighter bounds:


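A small sketch may help compare the monotonicity assumptions. It computes bounds on the ATE under Monotone Treatment Response alone, Monotone Treatment Selection alone, and both together, using the standard forms of these bounds for a binary treatment. The direction of each assumption (the treated outcome is weakly larger for everyone, and the treated group has weakly higher mean potential outcomes) follows the schooling example in the text. The function name, argument names, and inputs are hypothetical.

```python
def monotone_ate_bounds(mean_y_treated, mean_y_untreated, p_treated, y_lower, y_upper):
    """ATE bounds under MTR, MTS, and both, for outcomes known to lie in [y_lower, y_upper]."""
    m1, m0 = mean_y_treated, mean_y_untreated
    p1 = p_treated
    p0 = 1.0 - p_treated
    worst_lo = (m1 * p1 + y_lower * p0) - (m0 * p0 + y_upper * p1)
    worst_hi = (m1 * p1 + y_upper * p0) - (m0 * p0 + y_lower * p1)
    return {
        # MTR: treatment never lowers the outcome, so the effect cannot be negative.
        "mtr": (0.0, worst_hi),
        # MTS: the treated would do weakly better in either state, so the raw gap is an upper bound.
        "mts": (worst_lo, m1 - m0),
        # Both: the effect lies between zero and the raw difference in means.
        "mtr_mts": (0.0, m1 - m0),
    }

# Hypothetical inputs: outcomes scaled to [0, 1], half the population treated.
bounds = monotone_ate_bounds(0.7, 0.4, 0.5, 0.0, 1.0)
for name, (lo, hi) in bounds.items():
    print(f"{name}: [{lo:.2f}, {hi:.2f}]")
```

With these inputs the worst case interval of [-0.35, 0.65] shrinks to [0, 0.65] under MTR, to [-0.35, 0.30] under MTS, and to [0, 0.30] when both are imposed.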
Manski and Pepper (2000) also develop the idea of a monotone instrumental variable. An instrumental variable is defined as one for which for any two values of the instrument image and image,


In words, the assumption is that the instrument does not directly affect the outcome variable image. It only affects one’s choices. Using somewhat different notation, but their exact wording, they define a monotone instrumental variable in the following way

Assumption 5.6. Let image be an ordered set. Covariate image is a monotone instrumental variable in the sense of mean-monotonicity if, for image, each value of image, and all image such that image,


This is a straightforward generalization of the instrumental variable assumption, but it imposes much weaker requirements on an instrument. It does not require that the instrument be uncorrelated with the outcome, but simply that the outcome increase monotonically with the instrument. An example is that parental income has often been used as an instrument for education. Richer parents are better able to afford a college degree for their child. However, it seems likely that the children of rich parents would have had high earnings, even in the absence of a college degree.

They show that this implies that


One can obtain tighter bounds by combining the Monotone Instrumental Variable assumption with the Monotone Treatment Response assumption but we do not explicitly present this result.

Blundell et al. (2007) estimate changes in the distribution of wages in the United Kingdom using bounds to allow for the impact of non-random selection into work. They first document the growth in wage inequality among workers over the 1980s and 1990s. However, they point out that rates of non-participation in the labor force have grown in the UK over the same time period. Nevertheless, they show that selection effects alone cannot explain the rise in inequality observed among workers: the worst case bounds establish that inequality has increased. However, worst case bounds are not sufficiently informative to understand such questions as whether most of the rise in wage inequality is due to increases in wage inequality within education groups versus across education groups. Next, they add additional assumptions to tighten the bounds. First, they assume the probability of work is higher for those with higher wages, which is essentially the Monotone Treatment Selection assumption shown in Assumption 5.5. Second, they make the Monotone Instrumental Variables assumption shown in Assumption 5.6. They assume that higher values of out of work benefit income are positively associated with wages. They show that both of these assumptions tighten the bounds considerably. With these additional restrictions, they can show that both within group and between group inequality have increased.

5.7 Using selection on observables to infer selection on unobservables

Altonji et al. (2005a) suggest another approach which is to use the amount of selection on observable covariates as a guide to the potential amount of selection on unobservables. To motivate this approach, consider an experiment in which treatment status is randomly assigned. The key to random assignment is that it imposes that treatment status be independent of the unobservables in the treatment model. Since they are unobservable, one can never explicitly test whether the treatment was truly random. However, if randomization was carried out correctly, treatment should also be uncorrelated with observable covariates. This is testable, and applying this test is standard in experimental approaches.

Researchers use this same argument in non-experimental cases as well. If a researcher wants to argue that his instrument or treatment is approximately randomly assigned, then it should be uncorrelated with observable covariates as well. Even if this is strictly not required for consistent estimates of instrumental variables, readers may be skeptical of the assumption that the instrument is uncorrelated with the unobservables if it is correlated with the observables. Researchers often test for this type of relationship as well. 17 The problem with this approach is that simply testing the null of uncorrelatedness is not that useful. Just because you reject the null does not mean it isn’t approximately true. We would not want to throw out an instrument with a tiny bias just because we have a data set large enough to detect a small correlation between it and an observable. Along the same lines, just because you fail to reject the null does not mean it is true. If one has a small data set with little power one could fail to reject the null even though the instrument is poor. To address these issues, Altonji et al. (2005a) design a framework that allows them to describe how large the treatment effect would be if “selection on the unobservables is the same as selection on the observables.”

Their key variables are discrete, so they consider a latent variable model in which a dummy variable for graduation from high school can be written as

$$Y_i = 1\left(Y_i^* > 0\right),$$

where $Y_i^*$ can be written as

$$Y_i^* = \alpha T_i + \sum_{j=1}^{K} W_{ij}\Gamma_j = \alpha T_i + \sum_{j=1}^{K} S_j W_{ij}\Gamma_j + \sum_{j=1}^{K} (1 - S_j) W_{ij}\Gamma_j = \alpha T_i + X_i'\Gamma_X + u_i.$$

The $W_{ij}$ represent all covariates, both those that are observable to the econometrician and those that are unobservable, the variable $S_j$ is a dummy variable representing whether the covariate is observable to the empirical researcher, $X_i'\Gamma_X$ represents the observable part of the index, and $u_i$ denotes the unobservable part.

Within this framework, one can see that different assumptions about what dictates which covariates are observed (the $S_j$) can be used to identify the model. Their specific goal is to quantify what it means for “selection on the observables to be the same as selection on the unobservables.” They argue that the most natural way to formalize this idea is to assume that $S_j$ is randomly assigned so that the unobservables and observables are drawn from the same underlying distribution.

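The next question is what this assumption implies about the data that can be useful for identification. They consider the projection of a variable $T_i^*$ on the observable part of the index, $X_i'\Gamma_X$, and the unobservable part, $u_i$:

$$\operatorname{Proj}\left(T_i^* \mid X_i'\Gamma_X, u_i\right) = \phi_0 + \phi_{X'\Gamma} X_i'\Gamma_X + \phi_u u_i,$$

where $T_i^*$ can be any random variable. They show that if $S_j$ is randomly assigned,

$$\phi_u = \phi_{X'\Gamma}.$$

This restriction is typically sufficient to ensure identification of $\alpha$. 18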
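The equality of the two projection coefficients under random assignment of observability can be illustrated numerically. The sketch below is not the authors' procedure. It assumes a large number of mutually independent covariates with unit variance, so that the covariance between the projected variable and each part of the index reduces to a sum of coefficient products. The names `gamma` (outcome coefficient), `beta` (loading of the projected variable on a covariate), and `share_observed` are hypothetical.

```python
import random

def projection_coefficients(K=20000, share_observed=0.5, seed=0):
    """Return the projection coefficients on the observable and
    unobservable parts of the index when observability is random.

    Assumption: covariates are mutually independent with unit variance,
    so Cov(T*, observable index) is the sum of beta_j * gamma_j over
    observed j, and its variance is the sum of gamma_j ** 2 (likewise
    for the unobservable part). The two parts are then uncorrelated, so
    each joint projection coefficient equals the simple ratio below.
    """
    rng = random.Random(seed)
    num_x = den_x = num_u = den_u = 0.0
    for _ in range(K):
        gamma = rng.gauss(0.0, 1.0)               # outcome coefficient
        beta = 0.5 * gamma + rng.gauss(0.0, 1.0)  # loading in projected variable
        if rng.random() < share_observed:         # covariate randomly observed
            num_x += beta * gamma
            den_x += gamma * gamma
        else:                                     # covariate unobserved
            num_u += beta * gamma
            den_u += gamma * gamma
    return num_x / den_x, num_u / den_u

phi_x, phi_u = projection_coefficients()
print(f"coefficient on observable part:   {phi_x:.3f}")
print(f"coefficient on unobservable part: {phi_u:.3f}")
```

With many covariates both ratios settle near the same value (0.5 under these assumed loadings), which is the sense in which selection on unobservables mirrors selection on observables. If observability were instead related to how strongly a covariate loads on the projected variable, the two ratios would generally differ.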

Altonji et al. (2005a,b) argue that for their example this is an extreme assumption and that the truth lies somewhere between this assumption and the assumption that $T_i^*$ is uncorrelated with the unobservables, which would correspond to $\phi_u = 0$. They assume that when $\phi_{X'\Gamma} > 0$,

$$0 \le \phi_u \le \phi_{X'\Gamma}.$$
There are at least three arguments for why selection on unobservables would be expected to be less severe than selection on observables (as it is measured here). First, some of the variation in the unobservable is likely just measurement error in the dependent variable. Second, data collectors likely collect the variables that are likely to be correlated with many things. Third, there is often a time lapse between the time the baseline data are collected (the observables) and when the outcome is realized. Unanticipated events that occur between these two periods affect the outcome but cannot have influenced the earlier treatment decision, which would again make selection on unobservables weaker.

Notice that if $\phi_{X'\Gamma} = 0$ then assuming $\phi_u = \phi_{X'\Gamma}$ is the same as assuming $\phi_u = 0$. However, if $\phi_{X'\Gamma}$ were very large the two estimates would be very different, which would cast doubt on the assumption of random assignment. Since $\phi_{X'\Gamma}$ essentially picks up the relationship between the instrument and the observable covariates, the bounds will be wide when there is a lot of selection on observables and tight when there is little selection on observables.
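How the two assumptions bracket the treatment effect can be sketched in a linear analog of the model. This is an illustration only, not the bivariate probit that the authors estimate. The data-generating process is hypothetical: a single observable index with coefficient one, an unobservable with the same variance, and a treatment that loads equally (`phi`) on both, so that equal selection holds by construction. The function names are also hypothetical.

```python
import random

def simulate_moments(n=50000, alpha=1.0, phi=0.3, seed=0):
    """Sample second moments of (Y, T, X) from an assumed linear model in
    which treatment loads equally on the observable and the unobservable."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                      # observable part of the index
        u = rng.gauss(0.0, 1.0)                      # unobservable part
        t = phi * x + phi * u + rng.gauss(0.0, 1.0)  # equal selection on both
        rows.append((alpha * t + x + u, t, x))
    means = [sum(col) / n for col in zip(*rows)]
    def cov(a, b):
        return sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / n
    return {"yy": cov(0, 0), "tt": cov(1, 1), "xx": cov(2, 2),
            "yt": cov(0, 1), "yx": cov(0, 2), "tx": cov(1, 2)}

def selection_gap(a, m):
    """Selection on unobservables minus selection on observables implied
    by a candidate treatment effect a."""
    gamma = (m["yx"] - a * m["tx"]) / m["xx"]            # coefficient on x
    var_r = m["yy"] - 2 * a * m["yt"] + a * a * m["tt"]  # Var(Y - a*T)
    var_u = var_r - gamma ** 2 * m["xx"]                 # implied Var(u)
    cov_tu = (m["yt"] - a * m["tt"]) - gamma * m["tx"]   # implied Cov(T, u)
    return cov_tu / var_u - m["tx"] / (gamma * m["xx"])

def ols_effect(m):
    """Coefficient on T from regressing Y on T and X (no selection on u)."""
    return ((m["yt"] - m["yx"] * m["tx"] / m["xx"])
            / (m["tt"] - m["tx"] ** 2 / m["xx"]))

def bisect(f, lo, hi, tol=1e-8):
    """Root of f on [lo, hi]; requires a sign change over the bracket."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("no sign change on the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_lo * f(mid) <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f(mid)
    return 0.5 * (lo + hi)

m = simulate_moments()
alpha_no_selection = ols_effect(m)
alpha_equal_selection = bisect(lambda a: selection_gap(a, m), 0.0, 2.0)
print(f"no selection on unobservables: {alpha_no_selection:.3f}")
print(f"equal selection:               {alpha_equal_selection:.3f}")
```

Under these assumed parameters the equal-selection estimate should land near the true effect of 1.0, while the estimate that ignores selection on unobservables is biased upward to roughly 1.28. Setting `phi` closer to zero shrinks the gap between the two, mirroring the point that the bounds are tight when there is little selection on observables. The bracket `[0, 2]` is specific to this example; in general the equal-selection condition can have more than one root.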

Altonji, Elder, and Taber consider the case of whether the decision to attend Catholic high school affects outcomes such as test scores and high school graduation rates. Those who attend Catholic schools have higher graduation rates than those who do not attend Catholic schools. However, those who attend Catholic schools may be very different from those who do not. They find that (on the basis of observables) while this is true in the population, it is not true when one conditions on the individuals who attend Catholic school in eighth grade. To formalize this, they use their approach and estimate the model under the two different assumptions. In their application the projection variable, $T_i^*$, is the latent variable determining whether an individual attends Catholic school. First they estimate a simple probit of high school graduation on Catholic high school attendance as well as many other covariates. This corresponds to the $\phi_u = 0$ case. They find a marginal effect of 0.08, meaning that Catholic school raises high school graduation by eight percentage points. Next they estimate a bivariate probit of Catholic high school attendance and high school graduation subject to the constraint that $\phi_u = \phi_{X'\Gamma}$. In this case they find a Catholic high school effect of 0.05. The closeness of these two estimates strongly suggests that the Catholic high school effect is not simply a product of omitted variable bias. The tightness of the two estimates arose both because $\phi_{X'\Gamma}$ was small and because they use a wide array of powerful explanatory variables.

