Handbook of Labor Economics, Vol. 4, No. Suppl PA, 2011
ISSN: 1573-4463
doi: 10.1016/S0169-7218(11)00410-2
Chapter 4: The Structural Estimation of Behavioral Models: Discrete Choice Dynamic Programming Methods and Applications
Abstract
The purpose of this chapter is twofold: (1) to provide an accessible introduction to the methods of structural estimation of discrete choice dynamic programming (DCDP) models and (2) to survey the contributions of applications of these methods to substantive and policy issues in labor economics. The first part of the chapter describes solution and estimation methods for DCDP models using, for expository purposes, a prototypical female labor force participation model. The next part reviews the contribution of the DCDP approach to three leading areas in labor economics: labor supply, job search and human capital. The final section discusses approaches to validating DCDP models.
JEL classification
• J • C51 • C52 • C54
Keywords
• Structural estimation • Discrete choice • Dynamic programming • Labor supply • Job search • Human capital
The purpose of this chapter is twofold: (1) to provide an accessible introduction to the methods of structural estimation of discrete choice dynamic programming (DCDP) models and (2) to survey the contributions of applications of these methods to substantive and policy issues in labor economics.1 The development of estimation methods for DCDP models over the last 25 years has opened up new frontiers for empirical research in labor economics as well as other areas such as industrial organization, economic demography, health economics, development economics and political economy.2 Reflecting the generality of the methodology, the first DCDP papers, associated with independent contributions by Gotz and McCall (1984), Miller (1984), Pakes (1986), Rust (1987) and Wolpin (1984), addressed a variety of topics, foreshadowing the diverse applications to come in labor economics and other fields. Gotz and McCall considered the sequential decision to re-enlist in the military, Miller the decision to change occupations, Pakes the decision to renew a patent, Rust the decision to replace a bus engine and Wolpin the decision to have a child.
The first part of this chapter provides an introduction to the solution and estimation methods for DCDP models. We begin by placing the method within the general latent variable framework of discrete choice analysis. This general framework nests static and dynamic models and nonstructural and structural estimation approaches. Our discussion of DCDP models starts by considering an agent making a binary choice. For concreteness, and for simplicity, we take as a working example the unitary model of a married couple’s decision about the woman’s labor force participation. To fix ideas, we use the static model with partial wage observability, that is, when wage offers are observed only for women who are employed, to draw the connection between theory, data and estimation approaches. In that context, we delineate several goals of estimation, for example, testing theory or evaluating counterfactuals, and discuss the ability of alternative estimation approaches, encompassing those that are parametric or nonparametric and structural or nonstructural, to achieve those goals. We show how identification issues relate to what one can learn from estimation.
The discussion of the static model sets the stage for dynamics, which we introduce again, for expository purposes, within the labor force participation example by incorporating a wage return to work experience (learning by doing).3 A comparison of the empirical structure of the static and dynamic models reveals that the dynamic model is, in an important sense, a static model in disguise. In particular, the essential element in the estimation of both the static and dynamic model is the calculation of a latent variable representing the difference in payoffs associated with the two alternatives (in the binary case) that may be chosen. In the static model, the latent variable is the difference in alternative-specific utilities. In the case of the dynamic model, the latent variable is the difference in alternative-specific value functions (expected discounted values of payoffs). The only essential difference between the static and dynamic cases is that alternative-specific utilities are more easily calculated than alternative-specific value functions, which require solving a dynamic programming problem. In both cases, computational considerations play a role in the choice of functional forms and distributional assumptions.
There are a number of modeling choices in all discrete choice analyses, although some are more important in the dynamic context because of computational issues. Modeling choices include the number of alternatives, the size of the state space, the error structure and distributional assumptions and the functional forms for the structural relationships. In addition, in the dynamic case, one must make an assumption about how expectations are formed.4 To illustrate the DCDP methodology, the labor force participation model assumes additive, normally distributed, iid over time errors for preferences and wage offers. We first discuss the role of exclusion restrictions in identification, and work through the solution and estimation procedure. We then show how a computational simplification can be achieved by assuming errors to be independent type 1 extreme value (Rust, 1987) and describe the model assumptions that are consistent with adopting that simplification. Although temporal independence of the unobservables is often assumed, the DCDP methodology does not require it. We show how the solution and estimation of DCDP models is modified to allow for permanent unobserved heterogeneity and for serially correlated errors. In the illustrative model, the state space was chosen to be of a small finite dimension. We then describe the practical problem that arises in implementing the DCDP methodology as the state space expands, the well-known curse of dimensionality (Bellman, 1957), and describe suggested practical solutions found in the literature including discretization, approximation and randomization.
To illustrate the DCDP framework in a multinomial choice setting, we extend the labor force participation model to allow for a fertility decision at each period and for several levels of work intensity. In that context, we also consider the implications of introducing nonadditive errors (that arise naturally within the structure of models that fully specify payoffs and constraints) and general functional forms. It is a truism that any dynamic optimization model that can be (numerically) solved can be estimated.
Throughout the presentation, the estimation approach is assumed to be maximum likelihood or, as is often the case when there are many alternatives, simulated maximum likelihood. However, with simulated data from the solution to the dynamic programming problem, other methods, such as minimum distance estimation, are also available. We do not discuss those methods because, except for solving the dynamic programming model, their application is standard. Among the more recent developments in the DCDP literature is a Bayesian approach to the solution and estimation of DCDP models. Although the method has the potential to reduce the computational burden associated with DCDP models, it has not yet found wide application. We briefly outline the approach. All of these estimation methods require that the dynamic programming problem be fully solved (numerically). We complete the methodology section with a brief discussion of a method that does not require solving the full dynamic programming problem (Hotz and Miller, 1993).
Applications of the DCDP approach within labor economics have spanned most major areas of research. We discuss the contributions of DCDP applications in three main areas: (i) labor supply, (ii) job search and (iii) schooling and career choices. Although the boundaries among these areas are not always clear and these areas do not exhaust all of the applications of the method in labor economics, they form a reasonably coherent taxonomy within which to demonstrate key empirical contributions of the approach.5 In each area, we show how the DCDP applications build on the theoretical insights and empirical findings in the prior literature. We highlight the findings of the DCDP literature, particularly those that involve counterfactual scenarios or policy experiments.
The ambitiousness of the research agenda that the DCDP approach can accommodate is a major strength. This strength is purchased at a cost. To be able to perform counterfactual analyses, DCDP models must rely on extra-theoretic modeling choices, including functional form and distributional assumptions. Although the DCDP approach falls short of an assumption-free ideal, as do all other empirical approaches, it is useful to ask whether there exists convincing evidence about the credibility of these exercises. In reviewing the DCDP applications, we pay careful attention to the model validation exercises that were performed. The final section of the chapter addresses the overall issue of model credibility.
The development of the DCDP empirical framework was a straightforward and natural extension of the static discrete choice framework. The common structure they share is based on the latent variable specification, the building block for all economic models of discrete choice. To illustrate the general features of the latent variable specification, consider a binary choice model in which an economic agent with imperfect foresight, denoted by $i$, makes a choice at each discrete period $t$, from $t = 1, \ldots, T$, between two alternatives $d_{it} \in \{0, 1\}$. In the labor economics context, examples might be the choice of whether to accept a job offer or remain unemployed or whether to attend college or enter the labor force. The outcome is determined by whether a latent variable, $v^*_{it}$, reflecting the difference in the (expected) payoffs of the $d_{it} = 1$ and $d_{it} = 0$ alternatives, crosses a scalar threshold value, which, without loss of generality, is taken to be zero. The preferred alternative is the one with the largest payoff, i.e., $d_{it} = 1$ if $v^*_{it} \geq 0$ and $d_{it} = 0$ otherwise.
In its most general form, the latent variable may be a function of three types of variables: $\tilde{d}_{it-1}$, a vector of the history of past choices ($d_{i\tau}, \tau = 1, \ldots, t-1$), $\tilde{X}_{it}$, a vector of contemporaneous and lagged values of additional variables ($X_{i\tau}, \tau = 1, \ldots, t$) that enter the decision problem, and $\tilde{\epsilon}_{it}$, a vector of contemporaneous and lagged unobservables that also enter the decision problem.6 The agent's decision rule at each age is given by whether the latent variable crosses the threshold, that is,

$d_{it} = 1$ if $v^*_{it}(\tilde{d}_{it-1}, \tilde{X}_{it}, \tilde{\epsilon}_{it}) \geq 0$, $d_{it} = 0$ otherwise.
All empirical binary choice models, dynamic or static, are special cases of this formulation. The underlying behavioral model that generated the latent variable is dynamic if agents are forward looking and $v^*_{it}$ either contains past choices, $\tilde{d}_{it-1}$, or unobservables, $\tilde{\epsilon}_{it}$, that are serially correlated.7 The underlying model is static (i) if agents are myopic or (ii) if agents are forward looking and there is no link among the past, current and future periods through past choices or serially correlated unobservables.
Researchers may have a number of different, though not necessarily mutually exclusive, goals. They include:
(1) testing the predictions of the underlying behavioral theory;
(2) determining the impact of changes in the observable state variables on choices;
(3) quantifying the impact of counterfactual policies, that is, of changes in variables or parameters that do not vary in the data.
It is assumed that these statements are ceteris paribus, not only in the sense of conditioning on the other observables, but also in conditioning on the unobservables and their joint conditional (on observables) distribution.8 Different empirical strategies, for example, structural or nonstructural, may be better suited for some of these goals than for others.
In drawing out the connection between the structure of static and dynamic discrete choice models, it is instructive to consider an explicit example. We take as the prime motivating example one of the oldest and most studied topics in labor economics, the labor force participation of married women. 9 We first illustrate the connection between research goals and empirical strategies in a static framework and then modify the model to allow for dynamics.
Consider the following static model of the labor force participation decision of a married woman. Assume a unitary model in which the couple's utility is given by

$U_{it} = U(c_{it}, d_{it}; n_{it}, x_{it}, \epsilon_{it}),$  (1)

where $c_{it}$ is household $i$'s consumption at period $t$, $d_{it} = 1$ if the wife works and is equal to zero otherwise, $n_{it}$ is the number of young children in the household, and $x_{it}$ and $\epsilon_{it}$ are other observable factors and unobservable factors that affect the couple's valuation of the wife's leisure (or home production). In this context, $t$ corresponds to the couple's duration of marriage. The utility function has the usual properties: $\partial U/\partial c_{it} > 0$, $\partial^2 U/\partial c_{it}^2 < 0$.
The wife receives a wage offer of $w_{it}$ in each period and the husband, who is assumed to work each period, generates income $y_{it}$. If the wife works, the household incurs a per-child child-care cost, $\pi$, which is assumed to be time-invariant and the same for all households.10 The household budget constraint is thus

$c_{it} = y_{it} + w_{it} d_{it} - \pi n_{it} d_{it}.$  (2)
Wage offers are not generally observed for nonworkers. It is, thus, necessary to specify a wage offer function to carry out estimation. Let wage offers be generated by

$w_{it} = z_{it}\gamma + \eta_{it},$  (3)

where $z_{it}$ and $\eta_{it}$ are observable and unobservable factors. $z_{it}$ would conventionally contain educational attainment and "potential" work experience (age − education − 6). Unobservable factors $\epsilon_{it}$ that enter the couple's utility function and unobservable factors $\eta_{it}$ that influence the woman's wage offer are assumed to be mutually serially uncorrelated and to have joint distribution $F(\epsilon_{it}, \eta_{it})$.
Substituting the wage offer function (3) into the budget constraint (2), and the result into the utility function (1), yields

$U_{it} = U\big(y_{it} + (z_{it}\gamma + \eta_{it} - \pi n_{it}) d_{it}, d_{it}; n_{it}, x_{it}, \epsilon_{it}\big),$

from which we get alternative-specific utilities, $U^1_{it}$ if the wife works and $U^0_{it}$ if she does not, namely

$U^1_{it} = U(y_{it} + z_{it}\gamma + \eta_{it} - \pi n_{it}, 1; n_{it}, x_{it}, \epsilon_{it}),$
$U^0_{it} = U(y_{it}, 0; n_{it}, x_{it}, \epsilon_{it}).$

The latent variable function, the difference in utilities, $v^*_{it} = U^1_{it} - U^0_{it}$, is thus given by

$v^*_{it} = v^*(z_{it}, y_{it}, n_{it}, x_{it}, \epsilon_{it}, \eta_{it}).$  (6)

The participation decision is determined by the sign of the latent variable: $d_{it} = 1$ if $v^*_{it} \geq 0$, $d_{it} = 0$ otherwise.
It is useful to distinguish the household's state space, $\Omega_{it}$, consisting of all of the determinants of the household's decision, that is, $z_{it}, y_{it}, n_{it}, x_{it}, \epsilon_{it}, \eta_{it}$, from the part of the state space observable to the researcher, $\Omega^-_{it}$, that is, consisting only of $z_{it}, y_{it}, n_{it}, x_{it}$. Now, define $S(\Omega^-_{it})$ to be the set of values of the unobservables that enter the utility and wage functions that induces a couple with a given observable state space ($\Omega^-_{it}$) to choose $d_{it} = 1$. Then, the probability of choosing $d_{it} = 1$, conditional on $\Omega^-_{it}$, is given by

$\Pr(d_{it} = 1 \mid \Omega^-_{it}) = \int_{S(\Omega^-_{it})} dF(\epsilon_{it}, \eta_{it}),$  (8)

where $S(\Omega^-_{it}) = \{(\epsilon_{it}, \eta_{it}) : v^*_{it} \geq 0\}$.
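To make the integral in (8) concrete, the choice probability can be approximated by Monte Carlo integration: draw the unobservables from their joint distribution and count the fraction of draws for which the latent variable is nonnegative. The sketch below is illustrative only; it assumes linear functional forms and normal errors of the kind adopted in the parametric specification later in the chapter, and every parameter value is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (not estimates from the chapter)
beta0, beta1, beta2 = 4.0, 1.0, 0.02   # value of leisure: intercept, children, marriage duration
gamma = 0.6                            # wage return to schooling
pi_cost = 1.5                          # per-child child-care cost
sigma_eps, sigma_eta = 1.0, 0.8        # std. devs. of preference and wage shocks

def participation_prob(z_edu, n_kids, x_dur, draws=200_000):
    """Monte Carlo estimate of Pr(d=1 | observable state): the measure of the
    set S of (eps, eta) draws for which working yields the higher utility."""
    eps = rng.normal(0.0, sigma_eps, draws)   # preference (leisure) shock
    eta = rng.normal(0.0, sigma_eta, draws)   # wage shock
    wage = gamma * z_edu + eta                # wage offer
    v_star = (wage - pi_cost * n_kids) - (beta0 + beta1 * n_kids + beta2 * x_dur) - eps
    return float((v_star >= 0.0).mean())

print(participation_prob(z_edu=12.0, n_kids=1, x_dur=5.0))
```

Because the probability is a monotone function of the wage offer, raising `z_edu` (the excluded wage determinant) raises the simulated participation rate, which is exactly the testable implication discussed below.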
As is clear from (8), $\Pr(d_{it} = 1 \mid \Omega^-_{it})$ is a composite of three elements of the model: the utility function, the wage offer function and the joint error distribution $F$. These elements comprise the structure of the participation model. Structural estimation (S) is concerned with recovering some or all of the structural elements of the model. Nonstructural (NS) estimation is concerned with recovering $\Pr(d_{it} = 1 \mid \Omega^-_{it})$. In principle, each of these estimation approaches can adopt auxiliary assumptions in terms of parametric (P) forms for some or all of the structural elements or for $\Pr(d_{it} = 1 \mid \Omega^-_{it})$ or be nonparametric (NP). Thus, there are four possible approaches to estimation: NP-NS, P-NS, NP-S and P-S. 11
We now turn to a discussion about the usefulness of each of these approaches for achieving the three research goals mentioned above. The first research goal, testing the theory, requires that there be at least one testable implication of the model. From (6) and the properties of the utility function, it is clear that an increase in the wage offer increases the utility of working, but has no effect on the utility of not working. Thus, the probability of working for any given agent must be increasing in the wage offer. The second goal, to determine the impact of changing any of the state variables in the model on an individual's participation probability, requires taking the derivative of the participation probability with respect to the state variable of interest. The third goal requires taking the derivative of the participation probability with respect to something that does not vary in the data. That role is played by the unknown child care cost parameter, $\pi$. Determining its impact would provide a quantitative assessment of the effect of a child care subsidy on a married woman's labor force participation.12
Given the structure of the model, to achieve any of these goals, regardless of the estimation approach, it is necessary to adopt an assumption of independence between the unobservable factors affecting preferences and wage offers and the observable factors. Absent such an assumption, variation in the observables, either among individuals or over time for a given individual, would cause participation to differ both because of their effect on preferences and/or wage offers and because of their relationship to the unobserved determinants of preferences and/or wage offers through $F$. In what follows, we adopt the assumption of full independence, that is, $F(\epsilon_{it}, \eta_{it} \mid z_{it}, y_{it}, n_{it}, x_{it}) = F(\epsilon_{it}, \eta_{it})$, so as not to unduly complicate the discussion.
If we make no further assumptions, we can estimate $\Pr(d_{it} = 1 \mid z_{it}, y_{it}, n_{it}, x_{it})$ nonparametrically.
Goal 1: To accomplish the first goal, we need to be able to vary the wage offer independently of other variables that affect participation. To do that, there must be an exclusion restriction, in particular, a variable in $z_{it}$ that is not in $x_{it}$. Moreover, determining the sign of the effect of a wage increase on the participation probability requires knowing the sign of the effect of the variable in $z_{it}$ (not in $x_{it}$) on the wage. Of course, if we observed all wage offers, the wage would enter into the latent variable rather than the wage determinants ($z_{it}$ and $\eta_{it}$) and the prediction of the theory could be tested directly without an exclusion restriction.
What is the value of such an exercise? Assume that the observation set is large enough that sampling error can be safely ignored and consider the case where all wage offers are observed. Suppose one finds, after nonparametric estimation of the participation probability function, that there is some "small" range of wages over which the probability of participation is declining as the wage increases. Thus, the theory is rejected by the data. Now, suppose we wanted to use the estimated participation probability function to assess the impact of a proportional wage tax on participation. This is easily accomplished by comparing the sample participation probability in the data with the participation probability that comes about by reducing each individual's wage by the tax. Given that the theory is rejected, should we use the participation probability function for this purpose? Should our answer depend on the size of the range of wages over which the violation occurs? Should we add more variables and retest the model? And, if the model is not rejected after adding those variables, should we then feel comfortable in using it for the tax experiment? If there are no ready answers to these questions in so simple a model, as we believe is the case, then how should we approach them in contexts where the model's predictions are not so transparent and therefore for practical purposes untestable, as is normally the case in DCDP models? Are there other ways to validate models? We leave these as open questions for now, but return to them in the concluding section of the chapter.
Goal 2: Clearly, it is possible, given an estimate of $\Pr(d_{it} = 1 \mid z_{it}, y_{it}, n_{it}, x_{it})$, to determine the effect on participation of a change in any of the variables within the range of the data. However, one cannot predict the effect of a change in a variable that falls outside of the range of the data.
Goal 3: It is not possible to separately identify the child care cost $\pi$ and the effect of young children on the value of leisure. To see that, note that the two enter the participation probability only through the composite effect of children on the net gain from working; knowledge of $\Pr(d_{it} = 1 \mid z_{it}, y_{it}, n_{it}, x_{it})$ does not allow one to separate them. We thus cannot perform the child care subsidy policy experiment.
In this approach, one chooses a functional form for $\Pr(d_{it} = 1 \mid z_{it}, y_{it}, n_{it}, x_{it})$. For example, one might choose a cumulative standard normal function in which the variables in $\Omega^-_{it}$ enter as a single index.
Goal 1: As in the NP-NS approach, because of the partial observability of wage offers, testing the model's prediction still requires an exclusion restriction, that is, a variable in $z_{it}$ that is not in $x_{it}$.
Goal 2: It is possible, given an estimate of the parametric participation probability function, to determine the effect on participation of a change in any of the variables not only within, but also outside, the range of the data.
Goal 3: As in the NP-NS approach, it is not possible to separately identify $\pi$ from variation in $n_{it}$ because $n_{it}$ enters the latent variable both through the child care cost and through preferences.
In this approach, one would attempt to separately identify the structural elements of (8) without imposing auxiliary assumptions about those functions. This is clearly infeasible when wages are only observed for those who work. 13
Although given our taxonomy, there are many possible variations on which functions to impose parametric assumptions, it is too far removed from the aims of this chapter to work through those possibilities.14 We consider only the case in which all of the structural elements are parametric. Specifically, the structural elements are specified as follows:

$U_{it} = c_{it} + \alpha_{it}(1 - d_{it}), \quad \alpha_{it} = \beta_0 + \beta_1 n_{it} + \beta_2 x_{it} + \epsilon_{it},$
$c_{it} = y_{it} + (w_{it} - \pi n_{it}) d_{it},$
$w_{it} = z_{it}\gamma + \eta_{it},$  (11)

where $(\epsilon_{it}, \eta_{it}) \sim N(0, \Sigma)$.15 This specification of the model leads to a latent variable function, the difference in utilities, $v^*_{it}$, given by

$v^*_{it} = z_{it}\gamma - (\pi + \beta_1) n_{it} - \beta_0 - \beta_2 x_{it} + \xi_{it},$  (13)

where $\xi_{it} = \eta_{it} - \epsilon_{it}$ and the observable part of the state space now consists of $z_{it}$, $n_{it}$ and $x_{it}$.16
The likelihood function, incorporating the wage information for those women who work, is

$L(\theta) = \prod_{i=1}^{I} \Pr(d_{it} = 1, w_{it} \mid \Omega^-_{it})^{d_{it}}\,\Pr(d_{it} = 0 \mid \Omega^-_{it})^{1 - d_{it}}.$  (14)

The parameters to be estimated include $\beta_0$, $\beta_1$, $\beta_2$, $\gamma$, $\pi$, $\sigma_\eta$, $\sigma_\xi$ and the covariance $\sigma_{\xi\eta}$.17 First, it is not possible to separately identify the child care cost, $\pi$, from the effect of children on the utility of not working, $\beta_1$; only $\pi + \beta_1$ is potentially identified. Joint normality is sufficient to identify the wage parameters, $\gamma$ and $\sigma_\eta$, as well as $\sigma_{\xi\eta}$ (Heckman, 1979). The data on work choices identify $\pi + \beta_1$, $\beta_0$ and $\beta_2$ relative to $\sigma_\xi$. To identify $\sigma_\xi$, note that there are three possible types of variables that appear in the likelihood function, variables that appear only in $z_{it}$, that is, only in the wage function, variables that appear only in $x_{it}$, that is, only in the value of leisure function, and variables that appear in both $z_{it}$ and $x_{it}$. Having identified the parameters of the wage function (the $\gamma$'s), the identification of $\sigma_\xi$ (and thus also the remaining preference parameters) requires the existence of at least one variable of the first type, that is, a variable that appears only in the wage equation.18
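As a concrete sketch, the likelihood (14) can be coded directly: a worker contributes the density of her observed wage times the probability of working conditional on that wage residual, while a nonworker contributes the probability of not working. The Python below is illustrative only; the parameter names are ours, the composite child-cost/preference effect of children enters as a single coefficient (reflecting the identification discussion above), and the selection structure follows from joint normality.

```python
import math

def Phi(a):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def phi(a):  # standard normal pdf
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def log_likelihood(data, theta):
    """Log-likelihood in the spirit of (14).
    data: tuples (d, w, z, n, x); w is ignored when d = 0.
    theta: (beta0, b1pi, beta2, gamma, sig_eta, sig_xi, rho), where b1pi is the
    composite pi + beta1 and rho is the correlation of the composite shock
    xi = eta - eps with the wage shock eta."""
    beta0, b1pi, beta2, gamma, sig_eta, sig_xi, rho = theta
    ll = 0.0
    for d, w, z, n, x in data:
        # cut-off: the couple chooses work iff xi >= xi_star
        xi_star = -(gamma * z - b1pi * n - beta0 - beta2 * x)
        if d == 1:
            eta = w - gamma * z                          # wage residual
            cond_mean = rho * (sig_xi / sig_eta) * eta   # E[xi | eta]
            cond_sd = sig_xi * math.sqrt(1.0 - rho * rho)
            ll += math.log(phi(eta / sig_eta) / sig_eta)                 # wage density
            ll += math.log(1.0 - Phi((xi_star - cond_mean) / cond_sd))   # Pr(work | eta)
        else:
            ll += math.log(Phi(xi_star / sig_xi))                        # Pr(not work)
    return ll
```

Maximizing this function over `theta`, subject to the exclusion restriction that some element of $z_{it}$ is absent from $x_{it}$, delivers the P-S estimates; only the composite `b1pi`, not its two components separately, is recoverable.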
Goal 1: As in the NS approaches, there must be an exclusion restriction, in particular, a variable in that is not in .
Goal 2: It is possible to determine the effect on participation of a change in any of the variables within and outside of the range of the data.
Goal 3: As noted, it is possible to identify $\pi + \beta_1$. Suppose then that a policy maker is considering implementing a child care subsidy program, where none had previously existed, in which the couple is provided a subsidy of $s$ dollars per young child if the wife works when there is a young child in the household. The policy maker would want to know the impact of the program on the labor supply of women and the program's budgetary implications. With such a program, the couple's budget constraint under the child care subsidy program is

$c_{it} = y_{it} + w_{it} d_{it} - (\pi - s) n_{it} d_{it},$  (15)

where $\pi - s$ is the net (of subsidy) cost of child care. With the subsidy, the probability that the woman works is

$\Pr(d_{it} = 1 \mid \Omega^-_{it}; s) = \Phi\!\left(\frac{z_{it}\gamma - (\pi + \beta_1 - s) n_{it} - \beta_0 - \beta_2 x_{it}}{\sigma_\xi}\right),$  (16)

where $\Phi$ is the standard normal cumulative distribution function. Given identification of the parameters from maximizing the likelihood (14), to predict the effect of the policy on participation, that is, the difference in the participation probability when $s$ is positive and when $s$ is zero, it is necessary, as seen in (16), to have identified $\pi + \beta_1$ and $\sigma_\xi$. Government outlays on the program would be equal to the subsidy amount times the number of women with young children who work under the subsidy.
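Under these assumptions the policy experiment is a one-line calculation: evaluate (16) at a zero subsidy and at the proposed subsidy level, holding the identified composite of the child care cost and the child preference effect fixed. A minimal Python sketch, with made-up parameter values standing in for estimates:

```python
import math

def Phi(a):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

# Hypothetical "estimates" (for illustration only)
gamma, b1pi, beta0, beta2, sig_xi = 0.6, 2.5, 4.0, 0.02, 1.28

def work_prob(z, n, x, s=0.0):
    """Participation probability as in (16): the subsidy s lowers the effective
    per-young-child cost of working from (pi + beta1) to (pi + beta1 - s)."""
    index = gamma * z - (b1pi - s) * n - beta0 - beta2 * x
    return Phi(index / sig_xi)

baseline = work_prob(z=12.0, n=1, x=5.0)
subsidized = work_prob(z=12.0, n=1, x=5.0, s=1.0)
print(subsidized - baseline)   # predicted effect of the subsidy on participation
```

Multiplying the subsidy amount by the predicted number of working women with young children gives the budgetary outlay. Note that no household in the data ever faced a subsidy: the prediction rests entirely on wage variation, as discussed below.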
It is important to note that the policy effect is estimated without direct policy variation, i.e., we did not need to observe households in both states of the world, with and without the subsidy program. What was critical for identification was (exogenous) variation in the wage (independent of preferences). Wage variation is important in estimating the policy effect because, in the model, the child care cost is a tax on working that is isomorphic to a tax on the wage. Wage variation, independent of preferences, provides policy-relevant variation.
To summarize, testing the prediction that participation rises with the wage offer requires an exclusion restriction regardless of the approach. This requirement arises because of the non-observability of wage offers for those that choose not to work.19 With regard to the second goal, the parametric approach allows extrapolation outside of the sample range of the variables whereas nonparametric approaches do not. Finally, subject to identification, the P-S approach enables the researcher to perform counterfactual exercises, subsidizing the price of child care in the example, even in the absence of variation in the child care price.20
In the previously specified static model, there was no connection between the current participation decision and future utility. One way, among many, to introduce dynamic considerations is through human capital accumulation on the job. In particular, suppose that the woman's wage increases with actual work experience, $h_{it}$, as skills are acquired through learning by doing. To capture that, rewrite (11) as

$w_{it} = z_{it}\gamma + \gamma_1 h_{it} + \eta_{it},$  (17)

where $h_{it}$ is work experience at the start of period $t$. Given this specification, working in any period increases all future wage offers. Work experience, $h_{it}$, evolves according to

$h_{it} = h_{it-1} + d_{it-1},$  (18)

where $h_{i1} = 0$.21 Thus, at any period $t$, the woman may have accumulated up to $t - 1$ periods of work experience. We will be more specific about the evolution of the other state space elements when we work through the solution method below. For now, we assume only that their evolution is non-stochastic.
Normally distributed additive shocks
As in the static model, and again for presentation purposes, we assume that the preference shock ($\epsilon_{it}$) and the wife's wage shock ($\eta_{it}$) are distributed joint normal. In addition, we assume that they are mutually serially independent and independent of observables, that is, $F(\epsilon_{it}, \eta_{it} \mid \Omega^-_{it}) = F(\epsilon_{it}, \eta_{it})$.
Assume, in this dynamic context, that the couple maximizes the expected present discounted value of remaining lifetime utility at each period starting from an initial period, $t = 1$, and ending at period $T$, the assumed terminal decision period.2223 Letting $V_t(\Omega_{it})$ be the maximum expected present discounted value of remaining lifetime utility at $t$ given the state space $\Omega_{it}$ and discount factor $\delta$,

$V_t(\Omega_{it}) = \max_{\{d_{i\tau}\}_{\tau=t}^{T}} E\!\left[\sum_{\tau=t}^{T} \delta^{\tau-t} U_{i\tau} \,\Big|\, \Omega_{it}\right].$  (19)

The state space at $t$ consists of the same elements as in the static model augmented to include the amount of accumulated work experience, $h_{it}$.
The value function can be written as the maximum over the two alternative-specific value functions, $V^k_t(\Omega_{it}), k \in \{0, 1\}$,

$V_t(\Omega_{it}) = \max\big(V^0_t(\Omega_{it}), V^1_t(\Omega_{it})\big),$  (20)

each of which obeys the Bellman equation

$V^k_t(\Omega_{it}) = U^k_{it} + \delta E\big[V_{t+1}(\Omega_{it+1}) \mid \Omega_{it}, d_{it} = k\big]$ for $t < T$, with $V^k_T(\Omega_{iT}) = U^k_{iT}$.  (21)

The expectation in (21) is taken over the distribution of the random components of the state space at $t + 1$, namely $\epsilon_{it+1}$ and $\eta_{it+1}$, conditional on the state space elements at $t$.
The latent variable in the dynamic case is the difference in alternative-specific value functions, $v^*_{it} = V^1_t(\Omega_{it}) - V^0_t(\Omega_{it})$, namely24

$v^*_{it} = z_{it}\gamma + \gamma_1 h_{it} - (\pi + \beta_1) n_{it} - \beta_0 - \beta_2 x_{it} + \xi_{it} + \delta\big\{E\big[V_{t+1}(\Omega_{it+1}) \mid \Omega_{it}, d_{it} = 1\big] - E\big[V_{t+1}(\Omega_{it+1}) \mid \Omega_{it}, d_{it} = 0\big]\big\},$  (22)

with the participation decision given by $d_{it} = 1$ if $v^*_{it} \geq 0$, $d_{it} = 0$ otherwise.  (23)25
Comparing the latent variable functions in the dynamic (22) and static (13) cases, the only difference is the appearance in the dynamic model of the difference in the future component of the expected value functions under the two alternatives. This observation was a key insight in the development of estimation approaches for DCDP models.
To calculate these alternative-specific value functions, note first that $\Omega^-_{it+1}$, the observable part of the state space at $t + 1$, is fully determined by $\Omega^-_{it}$ and the choice at $t$, $d_{it}$. Thus, one needs to be able to calculate $E\big[V_{t+1}(\Omega_{it+1}) \mid \Omega^-_{it+1}\big]$ at all values of $\Omega^-_{it+1}$ that may be reached from the state space elements at $t$ and a choice at $t$. A full solution of the dynamic programming problem consists, then, of finding $E\big[\max\big(V^0_{t+1}, V^1_{t+1}\big) \mid \Omega^-_{it+1}\big]$ for all values of $\Omega^-_{it+1}$ at all $t$. We denote this function by $Emax(\Omega^-_{it+1})$ or $Emax_{t+1}$ for short.
In the finite horizon model we are considering, the solution method is by backwards recursion. However, there are a number of additional details about the model that must first be addressed. Specifically, it is necessary to assume something about how the exogenous observable state variables evolve, that is, to specify $\Pr(z_{it+1}, y_{it+1}, n_{it+1}, x_{it+1} \mid z_{it}, y_{it}, n_{it}, x_{it})$.26 For ease of presentation, to avoid having to specify the transition processes of the exogenous state variables, we assume that $z_{it+1} = z_{it}$ and $y_{it+1} = y_{it}$.
The number of young children, however, is obviously not constant over the life cycle. But, after the woman reaches the end of her fecund period, the evolution of $n_{it}$ is non-stochastic.27 To continue the example, we restrict attention to the woman's post-fecund period. Thus, during that period $n_{it}$ is perfectly foreseen, although the future path of $n_{it}$ at any $t$ depends on the exact ages of the young children in the household at $t$.28 Thus, the ages of existing young children at $t$ are elements of the state space at $t$.
As seen in (21), to calculate the alternative-specific value functions at period $T - 1$ for each element of $\Omega^-_{iT-1}$, we need to calculate what we have referred to above as $Emax_T$. Using the fact that, under normality, the expectation of the maximum of two jointly normal random variables has a closed form in terms of the standard normal pdf $\phi$ and cdf $\Phi$, we get

$Emax_T(\Omega^-_{iT}) = \bar{U}^1_{iT}\,\Phi\!\left(\frac{\bar{v}^*_{iT}}{\sigma_\xi}\right) + \bar{U}^0_{iT}\,\Phi\!\left(-\frac{\bar{v}^*_{iT}}{\sigma_\xi}\right) + \sigma_\xi\,\phi\!\left(\frac{\bar{v}^*_{iT}}{\sigma_\xi}\right),$  (24)

where $\bar{U}^1_{iT}$ and $\bar{U}^0_{iT}$ are the non-stochastic parts of the alternative-specific utilities at $T$ and $\bar{v}^*_{iT} = \bar{U}^1_{iT} - \bar{U}^0_{iT}$.29
Note that evaluating this expression requires an integration (the normal cdf) which has no closed form; it thus must be computed numerically. The right hand side of (24) is a function of $z_i$, $y_i$, $n_{iT}$, $x_{iT}$ and $h_{iT}$.30 Given a set of model parameters, the $Emax_T$ function takes on a scalar value for each element of its arguments. Noting that $h_{iT} = h_{iT-1} + d_{iT-1}$, and being explicit about the elements of $\Omega^-_{iT-1}$, the alternative-specific value functions at $T - 1$ are (dropping the $i$ subscript for convenience):

$V^1_{T-1} = y + z\gamma + \gamma_1 h_{T-1} - \pi n_{T-1} + \eta_{T-1} + \delta\,Emax_T(z, y, n_T, x_T, h_{T-1} + 1),$
$V^0_{T-1} = y + \beta_0 + \beta_1 n_{T-1} + \beta_2 x_{T-1} + \epsilon_{T-1} + \delta\,Emax_T(z, y, n_T, x_T, h_{T-1}).$

Thus,

$v^*_{T-1} = z\gamma + \gamma_1 h_{T-1} - (\pi + \beta_1) n_{T-1} - \beta_0 - \beta_2 x_{T-1} + \xi_{T-1} + \delta\big[Emax_T(\cdot, h_{T-1} + 1) - Emax_T(\cdot, h_{T-1})\big].$

As before, because $y$ enters both $V^1_{T-1}$ and $V^0_{T-1}$ additively, it drops out of $v^*_{T-1}$ and thus out of the participation decision rule.31
To calculate the alternative-specific value functions at $T - 2$, we will need to calculate $Emax_{T-1}$. Following the development for period $T - 1$,

$Emax_{T-1}(\Omega^-_{iT-1}) = \bar{V}^1_{T-1}\,\Phi\!\left(\frac{\bar{v}^*_{T-1}}{\sigma_\xi}\right) + \bar{V}^0_{T-1}\,\Phi\!\left(-\frac{\bar{v}^*_{T-1}}{\sigma_\xi}\right) + \sigma_\xi\,\phi\!\left(\frac{\bar{v}^*_{T-1}}{\sigma_\xi}\right),$  (29)

where $\bar{V}^k_{T-1}$ denotes the non-stochastic part of $V^k_{T-1}$ and $\bar{v}^*_{T-1} = \bar{V}^1_{T-1} - \bar{V}^0_{T-1}$. The right hand side of (29) is a function of $z$, $y$, $h_{T-1}$, $n_{T-1}$, $x_{T-1}$ and the ages of the young children in the household. As with $Emax_T$, given a set of model parameters, the $Emax_{T-1}$ function takes on a scalar value for each element of its arguments. Noting that $h_{iT-1} = h_{iT-2} + d_{iT-2}$, the alternative-specific value functions at $T - 2$ and the latent variable function are given by

$V^1_{T-2} = y + z\gamma + \gamma_1 h_{T-2} - \pi n_{T-2} + \eta_{T-2} + \delta\,Emax_{T-1}(\cdot, h_{T-2} + 1),$
$V^0_{T-2} = y + \beta_0 + \beta_1 n_{T-2} + \beta_2 x_{T-2} + \epsilon_{T-2} + \delta\,Emax_{T-1}(\cdot, h_{T-2}),$
$v^*_{T-2} = z\gamma + \gamma_1 h_{T-2} - (\pi + \beta_1) n_{T-2} - \beta_0 - \beta_2 x_{T-2} + \xi_{T-2} + \delta\big[Emax_{T-1}(\cdot, h_{T-2} + 1) - Emax_{T-1}(\cdot, h_{T-2})\big].$

As before, $y$ at $T - 2$ drops out of $v^*_{T-2}$ and thus out of the decision rule.
We can continue to solve backwards in this fashion. The full solution of the dynamic programming problem is the set of functions $Emax_t$ for all $t$ from $2, \ldots, T$. These functions provide all of the information necessary to calculate the cut-off values, the $\xi^*_{it}$'s (the values of the composite error $\xi_{it}$ at which the couple is indifferent between the two alternatives), that are the inputs into the likelihood function.
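The backward recursion just described is short to implement. The Python sketch below solves for the Emax function at every attainable experience level, using the closed-form expectation of the maximum of two jointly normal payoffs (the computation behind (24) and (29)). All parameter values and the path of young children are invented for illustration, and the husband's income is omitted because it drops out of the participation decision.

```python
import math

def Phi(a):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def phi(a):  # standard normal pdf
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

# Illustrative parameters (not estimates)
gamma0, gamma1 = 7.0, 0.3      # wage: w = gamma0 + gamma1*h + eta
beta0, beta1 = 4.0, 1.0        # value of leisure: beta0 + beta1*n
pi_cost = 1.5                  # per-child cost of working
sig_xi = 1.3                   # std. dev. of the composite shock xi = eta - eps
delta, T = 0.95, 10
n_path = [1, 1, 1] + [0] * (T - 3)   # perfectly foreseen young-children path

# Emax[t][h] for t = 1..T and experience h = 0..t-1; row T+1 is the zero terminal value
Emax = [[0.0] * (T + 1) for _ in range(T + 2)]
cutoff = {}   # xi*: at state (t, h) the couple works iff xi >= cutoff[(t, h)]
for t in range(T, 0, -1):
    for h in range(t):
        n = n_path[t - 1]
        # deterministic parts of the alternative-specific value functions
        v1 = (gamma0 + gamma1 * h) - pi_cost * n + delta * Emax[t + 1][h + 1]
        v0 = beta0 + beta1 * n + delta * Emax[t + 1][h]
        cutoff[(t, h)] = v0 - v1
        d = (v1 - v0) / sig_xi
        # E[max(v1 + eta, v0 + eps)] when eta - eps ~ N(0, sig_xi^2)
        Emax[t][h] = v1 * Phi(d) + v0 * Phi(-d) + sig_xi * phi(d)
```

The cutoffs feed directly into the likelihood: at state $(t, h)$ the model's participation probability is `Phi(-cutoff[(t, h)] / sig_xi)`, and estimation iterates between this recursion and the likelihood as the parameters are updated.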
Estimation of the dynamic model requires that the researcher have data on work experience, $h_{it}$. More generally, assume that the researcher has longitudinal data for $I$ married couples and denote by $t_{0i}$ and $t_{1i}$ the first and last periods of data observed for married couple $i$. Note that $t_{0i}$ need not be the first period of marriage (although it may be, subject to the marriage occurring after the woman's fecund period) and $t_{1i}$ need not be the last (although it may be). Denoting $\theta$ as the vector of model parameters, the likelihood function is given by

$L(\theta) = \prod_{i=1}^{I} \prod_{t=t_{0i}}^{t_{1i}} \Pr(d_{it} = 1, w_{it} \mid \Omega^-_{it})^{d_{it}}\,\Pr(d_{it} = 0 \mid \Omega^-_{it})^{1 - d_{it}},$

where $\Pr(d_{it} = 1, w_{it} \mid \Omega^-_{it})$ is the joint probability that the wife works and receives the observed wage and where $\Omega^-_{it}$ now includes accumulated work experience, $h_{it}$.32
Given joint normality of $\epsilon_{it}$ and $\eta_{it}$, the likelihood function is analytic, namely

$\Pr(d_{it} = 1, w_{it} \mid \Omega^-_{it}) = \frac{1}{\sigma_\eta}\,\phi\!\left(\frac{\eta_{it}}{\sigma_\eta}\right)\left[1 - \Phi\!\left(\frac{\xi^*_{it} - \rho\frac{\sigma_\xi}{\sigma_\eta}\eta_{it}}{\sigma_\xi\sqrt{1 - \rho^2}}\right)\right],$
$\Pr(d_{it} = 0 \mid \Omega^-_{it}) = \Phi\!\left(\frac{\xi^*_{it}}{\sigma_\xi}\right),$

where $\eta_{it} = w_{it} - z_{it}\gamma - \gamma_1 h_{it}$ and where $\rho$ is the correlation coefficient between $\xi_{it}$ and $\eta_{it}$.33 Estimation proceeds by iterating between the solution of the dynamic programming problem and the likelihood function for alternative sets of parameters. Maximum likelihood estimates are consistent, asymptotically normal and efficient.
Given the solution of the dynamic programming problem for the cut-off values, the $\xi^*_{it}$'s, the estimation of the dynamic model is in principle no different than the estimation of the static model. However, the dynamic problem introduces an additional parameter, the discount factor, $\delta$, and additional assumptions about how households forecast future unobservables.34 The practical difference in terms of implementation is the computational effort of having to solve the dynamic programming problem in each iteration on the model parameters in maximizing the likelihood function.
Identification of the model parameters requires the same exclusion restriction as in the static case, that is, the appearance of at least one variable in the wage equation that does not affect the value of leisure. Work experience, $h_{it}$, would serve that role if it does not also enter into the value of leisure (that is, if $h_{it}$ is not an element of $x_{it}$). A heuristic argument for the identification of the discount factor can be made by noting that the difference in the future component of the expected value functions under the two alternatives in (22) is in general a nonlinear function of the state variables and depends on the same set of parameters as in the static case. Rewriting (22) as

$v^*_{it} = z_{it}\gamma + \gamma_1 h_{it} - (\pi + \beta_1) n_{it} - \beta_0 - \beta_2 x_{it} + \xi_{it} + \delta\,G_{t+1}(\Omega^-_{it}),$

where $G_{t+1}(\Omega^-_{it})$ is the difference in the future component of the expected value functions, the nonlinearities in $G_{t+1}$ that arise from the distributional and functional form assumptions may be sufficient to identify the discount factor.35
As in the static model, identification of the model parameters implies that all three research goals previously laid out can be met. In particular, predictions of the theory are testable, the effects on participation of changes in observables that vary in the sample are estimable and a quantitative assessment of the counterfactual child care subsidy is feasible. The effect of such a subsidy will differ from that in a static model because any effect of the subsidy on the current participation decision will be transmitted to future participation decisions through the change in work experience and thus future wages. If a surprise (permanent) subsidy were introduced at some time, the effect of the subsidy on participation would require that the couple’s dynamic programming problem be re-solved with the subsidy in place and the solution compared to that without the subsidy. A pre-announced subsidy to take effect at a later date would require that the solution be obtained back to the period of the announcement because, given the dynamics, such a program would have effects on participation starting from the date of the announcement.36
Independent additive type-1 extreme value errors
When shocks are additive and come from independent type-1 extreme value distributions, as first noted by Rust (1987), the solution to the dynamic programming problem and the choice probability both have closed forms, that is, they do not require a numerical integration as in the additive normal error case. The cdf of a type-1 extreme value random variable ε with scale parameter ρ is F(ε) = exp(−e^(−ε/ρ)), with mean equal to ργ, where γ ≈ 0.577 is Euler’s constant, and variance ρ²π²/6.
Under the extreme value assumption, it can be shown that for period (dropping the subscript for convenience),
and for ,
where denotes the vector of values. The solution, as in the case of normal errors, consists of calculating the functions by backwards recursion. As seen, unlike the case of normal errors, the functions and the choice probabilities have closed form solutions; their calculation does not require a numerical integration.
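The closed forms just described can be written down directly. The sketch below, with illustrative alternative-specific values `v`, shows the two standard results for iid type-1 extreme value shocks with scale ρ: the Emax function is ρ times the "log-sum-exp" of the values (plus ρ times Euler's constant), and the choice probabilities take the multinomial logit form. Neither requires numerical integration.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329   # Euler's constant

def emax_extreme_value(v, rho=1.0):
    """E[max_k (v_k + eps_k)] for iid type-1 extreme value shocks, scale rho:
    rho * (gamma + log sum_k exp(v_k / rho))."""
    v = np.asarray(v, dtype=float)
    m = v.max()                     # subtract the max to stabilize the exps
    return rho * (EULER_GAMMA + m / rho + np.log(np.exp((v - m) / rho).sum()))

def choice_probs(v, rho=1.0):
    """Multinomial logit choice probabilities implied by the same shocks."""
    v = np.asarray(v, dtype=float)
    e = np.exp((v - v.max()) / rho)
    return e / e.sum()

v = [1.0, 0.3]                      # e.g., values of working vs. staying home
```

In the backward recursion, `emax_extreme_value` replaces the numerical integration needed in the normal-error case, which is the source of the computational gain discussed in the text.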
The extreme value assumption is, however, somewhat problematic in the labor force participation model as structured. For there to be a closed form solution to the DCDP problem, the scale parameter, and thus the error variance, must be the same for both the preference shock and the wage shock, a rather strong restriction that is unlikely to hold. The root of the problem is that the participation decision rule depends on the wage shock. Suppose, however, that the participation model was modified so that the decision rule no longer included a wage shock. Such a modification could be accomplished in either of two ways: by assuming that the wife’s wage offer is not observed at the time that the participation decision is made or by assuming that the wage is deterministic (but varies over time and across women due to measurement error). In the former case, the wage shock is integrated out in calculating the expected utility of working, while in the latter there is no wage shock entering the decision problem. Then, by adding an independent type-1 extreme value error to the utility when the wife works, the participation decision rule will depend on the difference in two extreme value taste errors, which leads to the closed form expressions given above.
In either case, there is no longer a selection issue with respect to observed wages. Because the observed wage shock is independent of the participation decision, the wage parameters can be estimated by adding the wage density to the likelihood function for participation and any distributional assumption, such as log normality, can be assumed. In addition, as in the case of normal errors, identification of the wage parameters, along with the exclusion restriction already discussed, implies identification of the rest of the model parameters (including the scale parameter). Thus, the three research goals are achievable. Whether the model assumptions necessary to take advantage of the computational gains from adopting the extreme value distribution are warranted raises the issue of how models should be judged and which model is “best,” a subject we take up later in the chapter.
We have already encountered unobserved state variables in the labor force participation model, namely the stochastic elements in that affect current choices. However, there may be unobserved state variables that have persistent effects through other mechanisms. Such a situation arises, for example, when the distribution of is not independent of past shocks, that is, when .
A specific example, commonly adopted in the literature, is when shocks have a permanent-transitory structure. For reasons of tractability, it is often assumed that the permanent component takes on a discrete number of values and follows a joint multinomial distribution. Specifically,
where there are types each of husbands and wives , and thus couple types and where and are joint normal and iid over time.37 Each wife’s type is assumed to occur with probability and each husband’s type with probability , with for . A couple’s type is defined by their value of , where the probability of a couple being of type is given by , with .38 A couple is assumed to know their own and their spouse’s type, so the state space is augmented by the husband’s and wife’s type. Even though types are not known to the researcher, it is convenient to add them to the state variables in what we previously defined as the observable elements of the state space, . The reason is that, unlike the iid shocks and , which do not enter the functions (they are integrated out), the types do enter the functions. The dynamic programming problem must be solved for each couple’s type.
The likelihood function must also be modified to account for the fact that the types are unobserved. In particular, letting be the likelihood function for a type couple, the sample likelihood is the product over individuals of the type probability weighted sum of the type-specific likelihoods, namely
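The type-weighted likelihood just described has a simple computational form. The sketch below assumes the researcher has already computed, for each couple, the likelihood conditional on each of the K discrete types (by solving the dynamic programming problem type by type); the inputs here are placeholders for those type-specific likelihoods.

```python
import numpy as np

def sample_loglik(type_probs, type_likelihoods):
    """
    type_probs:       (K,) unconditional type probabilities pi_k, summing to one.
    type_likelihoods: (N, K) array; entry (i, k) is couple i's likelihood
                      conditional on being of type k (from the type-k DP solution).
    Returns the sample log likelihood: sum_i log( sum_k pi_k * L_ik ).
    """
    pi = np.asarray(type_probs, dtype=float)
    L = np.asarray(type_likelihoods, dtype=float)
    mix = L @ pi                # type-probability-weighted sum for each couple
    return np.log(mix).sum()
```

Note that the mixing happens at the level of each couple's entire likelihood contribution, not period by period, because a couple's type is fixed over time.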
A second example is where the joint errors follow an ARIMA process. To illustrate, suppose that the errors follow a first-order autoregressive process, namely that and , where and are joint normal and iid over time. Consider again the alternative-specific value functions at , explicitly accounting for the evolution of the shocks, namely
where the integration is now taken over the joint distribution of and . To calculate the alternative-specific value function at , it is necessary that the function include not only , as previously specified, but also the shocks at and . Thus, serial correlation augments the state space that enters the functions. The added complication is that these state space elements, unlike those we have so far considered, are continuous variables, an issue we discuss later. The likelihood function is also more complicated to calculate as it requires an integration for each couple of dimension equal to the number of observation periods (and there are two additional parameters, and ).39
The existence of unobserved state variables also creates a potentially difficult estimation issue with respect to the treatment of initial conditions (Heckman, 1981). Having restricted the model to the period starting at the time the wife is no longer fecund, by that time most women will have accumulated some work experience; that is, initial work experience will not be zero and will vary in the estimation sample. Our estimation discussion implicitly assumed that the woman’s “initial” work experience, that is, her work experience in the first observed period, could be treated as exogenous, that is, as uncorrelated with the stochastic elements of the future participation decisions. When there are unobserved initial state variables, permanent types or serially correlated shocks, this assumption is unlikely to hold.
Although we have not specified the labor force participation model governing decisions prior to this period, to avoid accounting for fertility decisions, it is reasonable to suppose that women who worked more while they were of childbearing ages come from a different type distribution than women who worked less. Or, in the case in which there are serially correlated shocks, women with greater work experience during the childbearing period may have experienced shocks (to wages or preferences) that are correlated with those that arise after. Put differently, it seems much more reasonable to assume that the same model governs the participation decision during pre- and post-childbearing ages than to assume that there are two different models in which decisions across those periods are stochastically independent (conditional on observables).
There are several possible solutions to the initial conditions problem. Suppose for the sake of exposition, though unrealistically, that all women begin marriage with zero work experience.40 At the time of marriage, in the case of permanent unobserved heterogeneity, the couple is assumed to be “endowed” with a given set of preferences. A couple who intrinsically places a low value on the wife’s leisure will be more likely to choose to have the wife work and thus accumulate work experience. Such women will have accumulated more work experience upon reaching the end of their childbearing years than women in marriages where the wife’s value of leisure is intrinsically greater. Thus, when the end of the childbearing years are reached, there will be a correlation between the accumulated work experience of wives and the preference endowment, or type, of couples.
Suppose that participation decisions during the childbearing years were governed by the same behavioral model (modified to account for fertility) as those during the infecund observation period. In particular, suppose that given a couple’s type, all shocks (the ’s in (41) and (42)) are iid. In that case, work experience can be taken as exogenous conditional on a couple’s type. To condition the likelihood (43) on initial experience, we specify a type probability function conditional on work experience at the beginning of the infecund period. Specifically, we would replace , taken to be scalar parameters in the likelihood function (43), with the type probability function , where, as previously defined, is the first (post-childbearing) period observed for couple .41
The type probability function can itself be derived using Bayes’ rule starting from the true initial decision period (taken to be the start of marriage in this example). Specifically, denoting the couple’s endowment pair as “type” and dropping the subscript, because
the type probability function is
Estimating the type probability function as a nonparametric function of initial work experience provides an “exact” solution (subject to sampling error) to the initial conditions problem, yielding type probabilities for each level of experience that would be the same as those obtained if we had solved and estimated the model back to the true initial period and explicitly used (47). Alternatively, because the type probabilities must also be conditioned on all other exogenous state variables, perhaps making nonparametric estimation infeasible, estimating a flexible functional form would provide an “approximate” solution.
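The Bayes'-rule construction of the type probability function amounts to a one-line posterior update. In the sketch below, the within-type distribution of initial work experience is a placeholder for the object that would, in principle, come from solving the fecund-period model (or be approximated by a flexible functional form, as the text suggests).

```python
import numpy as np

def type_prob_given_experience(prior, p_h_given_type):
    """
    prior:          (K,) unconditional type probabilities pi_k.
    p_h_given_type: (K,) probability of the observed initial experience level,
                    conditional on each type (a placeholder here for the
                    object implied by the fecund-period model).
    Returns the posterior P(type = k | observed initial experience).
    """
    joint = np.asarray(prior, dtype=float) * np.asarray(p_h_given_type, dtype=float)
    return joint / joint.sum()      # Bayes' rule: normalize the joint probabilities
```

These posterior probabilities replace the unconditional type probabilities as the mixing weights in the likelihood, conditioning each couple's type on their observed initial experience.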
If the shocks are serially correlated, work experience at the start of the infecund period is correlated with future choices not only because it affects future wages, but also because of the correlation of stochastic shocks across fecund and infecund periods. In that case, as suggested by Heckman (1981) in a nonstructural setting, we would need to have data on exogenous initial conditions at the time of the true initial period (taken here to be the start of marriage), when the labor supply decision process is assumed to begin. Given that, we can specify a density for work experience as a function of those exogenous initial conditions at the start of marriage and incorporate it in the likelihood function.42
As we have seen, the solution of the dynamic programming problem required that the functions be calculated for each point in the state space. If and take on only a finite number of discrete values (e.g., years of schooling, number of children), as does , the solution method simply involves solving for the functions at each point in the state space. However, if either or contains a continuous variable (or if the shocks follow an ARIMA process, as already discussed), the dimensionality of the problem is infinite and one obviously cannot solve the dynamic programming problem at every state point. Furthermore, one could imagine making the model more complex in ways that would increase the number of state variables and hence the size of the state space, for example, by letting the vector of taste shifters include not just number of children but the number of children in different age ranges. In general, in a finite state space problem, the size of the state space grows exponentially with the number of state variables. This is the so-called curse of dimensionality, first associated with Bellman (1957).
Estimation requires that the dynamic programming problem be solved many times—once for each trial parameter vector that is considered in the search for the maximum of the likelihood function (and perhaps at many nearby parameter vectors, to obtain gradients used in a search algorithm). This means that an actual estimation problem will typically involve solving the DP problem thousands of times. Thus, from a practical perspective, it is necessary that one be able to obtain a solution rather quickly for estimation to be feasible. In practice, there are two main ways to do this. One is just to keep the model simple so that the state space is small. But, this precludes studying many interesting problems in which there are a large set of choices that are likely to be interrelated (for example, choices of fertility, labor supply, schooling, marriage and welfare participation).
A second approach, which a number of researchers have pursued in recent years, is to abandon “exact” solutions to DP problems in favor of approximate solutions that can be obtained with greatly reduced computational time. There are three main approximate solution methods that have been discussed in the literature:43
1. Discretization: This approach is applicable when the state space is large due to the presence of continuous state variables. The idea is straightforward: simply discretize the continuous variables and solve for the functions only on the grid of discretized values. To implement this method one must either (i) modify the law of motion for the state variables so they stay on the discrete grid (e.g., one might work with a discrete AR(1) process) or (ii) employ a method to interpolate between grid points. Clearly, the finer the discretization, the closer the approximation will be to the exact solution. Discretization does not formally break the curse of dimensionality because the time required to compute an approximate solution still increases exponentially as the number of state variables increases. But it can be an effective way to reduce computation time in a model with a given number of state variables.
2. Approximation and interpolation of the functions: This approach was originally proposed by Bellman et al. (1963) and extended to the type of models generally of interest to labor economists by Keane and Wolpin (1994). It is applicable when the state space is large either due the presence of continuous state variables or because there are a large number of discrete state variables (or both). In this approach the functions are evaluated at a subset of the state points and some method of interpolation is used to evaluate at other values of the state space. This approach requires that the interpolating functions be specified parametrically. For example, they might be specified as some regression function in the state space elements or as some other approximating function such as a spline. Using the estimated values of the rather than the true values is akin to having a nonlinear model with specification error. The degree of approximation error is, however, subject to control. In a Monte Carlo study, Keane and Wolpin (1994) provide evidence on the effect of this approximation error on the bias of the estimated model parameters under alternative interpolating functions and numbers of state points. Intuitively, as the subset of the state points that are chosen is enlarged and the dimension of the approximating function is increased, the approximation will converge to the true solution.44
As with discretization, the approximation/interpolation method does not formally break the curse of dimensionality, except in special cases. This is because the curse of dimensionality applies to polynomial approximation (see Rust (1997)). As the number of state variables grows larger, the computation time needed to attain a given accuracy in a polynomial approximation to the Emax function grows exponentially.45 Despite this, the Keane and Wolpin (1994) approach (as well as some closely related variants) has proven to be a useful way to reduce computation time in models with large state spaces, and it has been widely applied in recent years. Rather than describe the method in detail here, we will illustrate the method later in a specific application.
3. Randomization: This approach was developed by Rust (1997). It is applicable when the state space is large due the presence of continuous state variables, but it requires that choice variables be discrete and that state variables be continuous. It also imposes important constraints on how the state variables may evolve over time. Specifically, Rust (1997) shows that solving a random Bellman equation can break the curse of dimensionality in the case of DCDP models in which the state space is continuous and evolves stochastically, conditional on the alternative chosen. Note that because work experience is discrete and evolves deterministically in the labor force participation model presented above, this method does not strictly apply. But, suppose instead that we modeled work experience as a continuous random variable with density function where is random variable indicating the extent to which working probabilistically augments work experience or not working depletes effective work experience (due to depreciation of skills). The random Bellman equation (ignoring and ), the analog of (20), is in that case given by
where are randomly drawn state space elements. The approximate value function converges to as at a rate. Notice that this is still true if is a vector of state variables, regardless of the dimension of the vector. Thus, the curse of dimensionality is broken here, exactly analogously to the way that simulation breaks the curse of dimensionality in approximation of multivariate integrals (while discretization methods and quadrature do not). 46
The above approach only delivers a solution for the value functions on the grid . But forming a likelihood will typically require calculating value functions at other points. A key point is that is, in Rust’s terminology, self-approximating. Suppose we wish to construct the alternative specific value function at a point that is not part of the grid . Then we simply form:
Notice that, because any state space element at can be reached from any element at with some probability given by , the value function at can be calculated from (49) at any element of the state space at . In contrast to the methods of approximation described above, the value function does not need to be interpolated using an auxiliary interpolating function.47 This “self-interpolating” feature of the random Bellman equation is also crucial for breaking the curse of dimensionality (which, as noted above, plagues interpolation methods).
Of course, the fact that the randomization method breaks the curse of dimensionality does not mean it will outperform other methods in specific problems. That the method breaks the curse of dimensionality is a statement about its behavior under the hypothetical scenario of expanding the number of state variables. For any given application with a given number of state variables, it is an empirical question whether a method based on discretization, approximation/interpolation or randomization will produce a more accurate approximation in given computation time.48 Obviously more work is needed on comparing alternative approaches.49
The structure of the labor force decision problem described above was kept simple to provide an accessible introduction to the DCDP methodology. In this section, we extend that model to allow for:
The binary choice problem considers two mutually exclusive alternatives, the multinomial problem more than two. The treatment of static multinomial choice problems is standard. The dynamic analog to the static multinomial choice problem is conceptually no different than in the binary case. In terms of its representation, it does no injustice to simply allow the number of mutually exclusive alternatives, and thus the number of alternative-specific value functions in (21), to be greater than two. Analogously, if there are mutually exclusive alternatives, there will be latent variable functions (relative to one of the alternatives, arbitrarily chosen). The static multinomial choice problem raises computational issues with respect to the calculation of the likelihood function. Having to solve the dynamic multinomial choice problem, that is, for the function that enters the multinomial version of (21) at all values of and at all , adds significant computational burden.
For concreteness, we consider the extension of DCDP models to the case with multiple discrete alternatives by augmenting the dynamic labor force participation model to include a fertility decision in each period so that the model can be extended to childbearing ages. In addition, to capture the intensive work margin, we allow the couple to choose among four labor force alternatives for the wife. We also drop the assumption that errors are additive and normal. In particular, in the binary model we assumed, rather unconventionally, that the wage has an additive error in levels. The usual specification (based on both human capital theory and on empirical fit) is that the log wage has an additive error.50 Although it is necessary to impose functional form and distributional assumptions to solve and estimate DCDP models, it is not necessary to do so to describe solution and estimation procedures. We therefore do not impose such assumptions, reflecting the fact that the researcher is essentially unconstrained in the choice of parametric and distributional assumptions (subject to identification considerations).
The following example also illustrates the interplay between model development and data. The development of a model requires that the researcher decide on the choice set, on the structural elements of the model and on the arguments of those structural elements. In an ideal world, a researcher, based on prior knowledge, would choose a model, estimate it and provide a means to validate it. However, in part because there are only a few data sets on which to do independent validations and in part because it is not possible to foresee where models will fail to fit important features of data, the process by which DCDP models are developed and empirically implemented involves a process of iterating among the activities of model specification, estimation and model validation (for example, checking model fit). Any empirical researcher will recognize this procedure regardless of whether the estimation approach is structural or nonstructural.
A researcher who wished to study the relationship between fertility and labor supply of married women would likely have in mind some notion of a model, and, in that context, begin by exploring the data. A reasonable first step would be to estimate regressions of participation and fertility as functions of “trial” state variables, interpreted as approximations to the decision rules in a DCDP model. 51 As an example, consider a sample of white married women (in their first marriage) taken from the 1979-2004 rounds of the NLSY79. Ages at marriage range from 18 to 43, with 3/4ths of these first marriages occurring before the age of 27. We adopt, as is common in labor supply models, a discrete decision period to be a year. 52 The participation measure consists of four mutually exclusive and exhaustive alternatives, working less than 500 hours during a calendar year , working between 500 and 1499 hours , working between 1500 and 2499 hours and working more than 2500 hours .53 The fertility measure is the dichotomous variable indicating whether or not the woman had a birth during the calendar year. The approximate decision rule for participation is estimated by an ordered probit and the fertility decision rule by a binary probit. The variables included in these approximate decision rules, corresponding to the original taxonomy in section II, are {total hours worked up to , hours worked in , whether a child was born in , number of children born between and , number of children ever born, (years of marriage up to )} and = {age of wife, age of spouse, schooling of wife, schooling of spouse}. Consistent with any DCDP model, the same state variables enter the approximate decision rules for participation and for fertility. As seen in Table 1, the state variables appear to be related to both decision variables and in reasonable ways.54
Employment hours (ordered probit) a | Fertility (probit) b | |
---|---|---|
Work experience (hours) | 4.09 E −05 | 8.32E −06 |
(3.22E −06) c | (4.33E −06) | |
Hours | 1.04 | −0.047 |
(0.042) | (0.051) | |
Hours = 2 | 1.90 | −0.126 |
(0.049) | (0.051) | |
Hours | 3.16 | −0.222 |
(0.110) | (0.089) | |
Age | −0.075 | 0.211 |
(0.008) | (0.035) | |
Age squared | — | (−0.004) |
(0.0005) | ||
Birth | −0.497 | −0.320 |
(0.047) | (0.778) | |
Births ( to ) | −0.349 | 0.448 |
(0.031) | (0.054) | |
Total births | 0.099 | −0.337 |
(0.028) | (0.061) | |
Schooling | 0.077 | 0.004 |
(0.009) | (0.011) | |
Age of spouse | 0.007 | −0.016 |
(0.004) | (0.004) | |
Schooling of spouse | −0.036 | 0.021 |
(0.007) | (0.010) | |
Marital duration | −0.025 | −0.015 |
(0.006) | (0.008) | |
Constant | — | −3.41 |
(0.497) | ||
Cut point | −0.888 | — |
(0.171) | ||
Cut point | 0.076 | — |
(0.172) | ||
Cut point | 2.48 | — |
(0.175) | ||
Pseudo R2 | .295 | .094 |
a 8183 person-period observations.
b 8786 person-period observations.
c Robust standard errors in parenthesis.
Suppose the researcher is satisfied that the state variables included in the approximate decision rules should be included in the DCDP model. The researcher, however, has to make a choice as to where in the set of structural relationships the specific state variables should appear: the utility function, the market wage function, the husband’s earnings function and/or the budget constraint. The researcher also must decide about whether and where to include unobserved heterogeneity and/or serially correlated errors. Some of these decisions will be governed by computational considerations. Partly because of that and partly to avoid overfitting, researchers tend to begin with parsimonious specifications in terms of the size of the state space. The “final” specification evolves through the iterative process described above.
As an example, let the married couple’s per-period utility flow include consumption , a per-period disutility from each working alternative and a per-period utility flow from the stock of children . The stock of children includes a newborn, that is a child born at the beginning of period . Thus,
where the , and are time-varying preference shocks associated with each of the four choices that are assumed to be mutually serially uncorrelated. Allowing for unobserved heterogeneity, the type specification is (following (41))
where the ’s are mutually serially independent shocks.
The household budget constraint incorporates a cost of avoiding a birth (contraceptive costs, ), which, for biological reasons, will be a function of the wife’s age (her age at marriage, , plus the duration of marriage, ) and (child) age-specific monetary costs of supplying children with consumption goods and with child care if the woman works ( per work hour). Household income is the sum of husband’s earnings and wife’s earnings, the product of an hourly wage and hours worked (1000 hours if , 2000 hours if , 3000 hours ). Specifically, the budget constraint is
where are the number of children in different age classes, e.g., 0-1, 2-5, etc.55 To simplify, we do not allow for uncertainty about births. A couple can choose to have a birth (with probability one) and thus not pay the contraceptive cost or choose not to have a birth (with probability one) and pay the avoidance cost.56
The wife’s Ben Porath-Griliches wage offer function depends on her level of human capital, , which is assumed to be a function of the wife’s completed schooling , assumed fixed after marriage, the wife’s work experience, that is, the number of hours worked up to , and on the number of hours worked in the previous period:
where the are (assumed to be time-invariant) competitively determined skill rental prices that may differ by hours worked and is a time varying shock to the wife’s human capital following a permanent (discrete type)-transitory scheme.57 Husband’s earnings depends on his human capital according to:
where is the husband’s schooling and is his age at (his age at marriage plus ).58
The time-varying state variables, the stock of children (older than one) of different ages, the total stock of children and work experience, evolve according to:
The state variables in , augmented to include type, consist of the stock of children (older than one) of different ages, the wife’s work experience and previous period work status, the husband’s and wife’s age at marriage, the husband and wife’s schooling levels and the couple’s type. The choice set during periods when the wife is fecund, assumed to have a known terminal period , consists of the four work alternatives plus the decision of whether or not to have a child. There are thus eight mutually exclusive choices, given by , where the first superscript refers to the work choice and the second to the fertility choice .59 When the wife is no longer fecund, and the choice set consists only of the four mutually exclusive alternatives, .
The objective function of the couple is, as in the binary case, to choose the mutually exclusive alternative at each that maximizes the remaining expected discounted value of the couple’s lifetime utility. Defining to be the contemporaneous utility flow for the work and fertility choices, the alternative-specific value functions for the multinomial choice problem are
where, letting be the vector of alternative specific value functions relevant at period ,
and where the expectation in (59) is taken over the joint distribution of the preference and income shocks, .60 The ’s may have a general contemporaneous correlation structure, but, as noted, are mutually serially independent.
The model is solved by backwards recursion. The solution requires, as in the binary case, that the function be calculated at each state point and for all . In the model as it is now specified, the function is a six-variate integral (over the preference shocks, the wife’s wage shock and the husband’s earnings shock). The state space at consists of all feasible values of . Notice that all of the state variables are discrete and the dimension of the state space is therefore finite. However, the state space, though finite, is huge. The reason is that to keep track of the number of children in each of the three age groups, it is necessary to keep track of the complete sequence of births. If a woman has say 30 fecund periods, the number of possible birth sequences is 230= 1,073,700,000. Even without multiplying by the dimension of the other state variables, full solution of the dynamic programming problem is infeasible, leaving aside the iterative process necessary for estimation.
It is thus necessary to use an approximation method, among those previously discussed, for solving the dynamic programming problem, that is, for solving for the functions. As an illustration, we present an interpolation method based on regression. To see how it works, consider first the calculation of the for any given state space element. At the woman is no longer fecund, so we need to calculate
where is the six-tuple vector of shocks. Although this expression is a six-variate integration, at most four of the shocks actually affect for any given choice. Given the lack of a closed form expression, must be calculated numerically. A straightforward method is Monte Carlo integration. Letting be the random draw, , from the joint distribution, an estimate of at say the th value of the state space in , is
Given the infeasibility of calculating at all points in the state space, suppose one randomly draws state points (without replacement) and calculates the function for those state space elements according to (62). We can treat these values of as a vector of dependent variables in an interpolating regression
where is a period-specific vector of regression coefficients and is a flexible function of state variables.61 With this interpolating function in hand, estimates of the function can be obtained at any state point in the set .
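The approximation procedure just described, Monte Carlo integration of the Emax function at a randomly drawn subset of state points followed by an interpolating regression, can be sketched as follows. All of the specifics here are illustrative assumptions, not the chapter's model: the alternative-specific payoff function, the scalar state (work experience), the number of draws and sampled states, and the quadratic basis are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Step 1: Monte Carlo integration of Emax at a given state point ---
# Toy alternative-specific values: four alternatives whose payoffs depend
# on a scalar state (e.g. work experience) and on a six-tuple shock vector.
def alt_values(state, eps):
    base = np.array([0.0, 0.2, 0.5, 0.6]) + 0.01 * state
    return base + eps[:4]  # only four of the six shocks matter here

def emax(state, n_draws=500):
    """E[max over alternatives], estimated by averaging per-draw maxima."""
    draws = rng.standard_normal((n_draws, 6))  # i.i.d. N(0,1) shocks (illustrative)
    return np.mean([alt_values(state, e).max() for e in draws])

# --- Step 2: interpolating regression over a random subset of states ---
all_states = np.arange(200)                          # full (finite) state space
sampled = rng.choice(all_states, size=40, replace=False)
emax_at_sampled = np.array([emax(s) for s in sampled])

# Flexible function of the state variables: here, a quadratic basis
X = np.column_stack([np.ones(sampled.size), sampled, sampled.astype(float) ** 2])
coef, *_ = np.linalg.lstsq(X, emax_at_sampled, rcond=None)

def emax_hat(state):
    """Interpolated Emax at ANY state point, sampled or not."""
    return coef[0] + coef[1] * state + coef[2] * state ** 2
```

In practice the regressors would be a flexible function of all the state variables (experience, children by age group, schooling, and so on), and the regression would be re-estimated at every period of the backwards recursion.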
Given , we can similarly calculate at a subset of the state points in . Using the draws from , the estimate of at the th state space element is
where is given by (59). Using the calculated for randomly drawn state points from as the dependent variables in the interpolating function,
provides estimated values for the function at any state point in the set .62 Continuing this procedure, we can obtain the interpolating functions for all of the functions for all from (the age at which the woman becomes infertile) through , that is, .
At , the choice set now includes the birth of a child. All of the functions from to require numerical integrations over the eight mutually exclusive choices based on the joint error distribution . At any within the fecund period, at the th state point,
Again taking random draws from the state space at , we can generate interpolating functions:63
In the binary case with additive normal errors, the cut-off values for the participation decision, which were the ingredients for the likelihood function calculation, were analytical. Moreover, although the likelihood function (35) did not have a closed form representation, it required the calculation only of a univariate cumulative normal distribution. In the multinomial choice setting we have described, the set of values of the vector that determine the optimal choices, and that serve as limits of integration in the probabilities associated with the work alternatives comprising the likelihood function, has no analytical form, and the likelihood function requires a multivariate integration.
To accommodate these complications, maximum likelihood estimation of the model uses simulation methods. To describe the procedure, let the set of values of for which the th choice is optimal at be denoted by . Consider the probability that a couple chooses neither to work nor have a child, , in a fecund period :
This integral can be simulated by randomly taking draws from the joint distribution of , with draws denoted by , and determining the fraction of times that the value function for that alternative is the largest among all eight feasible alternatives, that is,
One can similarly form an estimate of the probability for other nonwork alternatives, namely for for any and for for any . Recall that for infecund periods, there are only four alternatives because is constrained to be zero.
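A crude frequency simulator of this kind can be sketched as below. The eight-alternative value function used here is a hypothetical stand-in (each alternative loading on one of the six shocks), not the couple's actual value functions; the simulator itself just counts how often each alternative attains the maximum.

```python
import numpy as np

def frequency_simulator(value_fn, n_alts=8, n_draws=5000, seed=0):
    """Fraction of shock draws for which each alternative attains the
    maximum alternative-specific value function."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_alts)
    for eps in rng.standard_normal((n_draws, 6)):   # six-tuple shock vector
        counts[np.argmax(value_fn(eps))] += 1
    return counts / n_draws                          # simulated choice probabilities

# Toy value function over the eight work-by-fertility alternatives
def toy_values(eps):
    base = np.linspace(0.0, 0.7, 8)
    return base + eps[np.arange(8) % 6]  # each alternative loads on one shock

probs = frequency_simulator(toy_values)
```

The simulated probabilities are nonnegative and sum to one by construction, but, as discussed below, they are step functions of the underlying parameters.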
When the wife works, the relevant probability contains the chosen joint alternative and the observed wage. For concreteness, consider the case where , . Then the likelihood contribution for an individual who works 2000 hours in period at a wage of is
For illustrative purposes, suppose that the (log) wage equation is additive in ,
and further that is jointly normal.64 With these assumptions, and denoting the deterministic part of the right hand side of (72) by , we can write
where is the Jacobian of the transformation from the distribution of to the distribution of . Under these assumptions is normal and the frequency simulator for the conditional probability takes the same form as (69) except that is set equal to and the other five ’s are drawn from . Thus, denoting the fixed value of as ,
Although these frequency simulators converge to the true probabilities as , there is a practical problem in implementing this approach. Even for large , the likelihood is not smooth in the parameters, which precludes the use of derivative methods (e.g., BHHH). This lack of smoothness forces the use of non-derivative methods, which converge more slowly. However, frequency simulators can be smoothed, which makes the likelihood function differentiable and improves the performance of optimization routines. One example is the smoothed logit simulator (McFadden, 1989), namely (in the case we just considered),
where is shorthand for the value functions in (74) and is a smoothing parameter. As , the RHS converges to the frequency simulator. The other choice probabilities associated with work alternatives are similarly calculated.
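A minimal sketch of the smoothed logit simulator, for hypothetical alternative-specific values at a single shock draw, is below. The softmax form and the choice of smoothing parameter values are illustrative; subtracting the maximum before exponentiating is a standard numerical-stability device.

```python
import numpy as np

def smoothed_logit_simulator(values, tau):
    """McFadden-style logit smoother: a softmax over the alternative-specific
    value functions with smoothing parameter tau. As tau -> 0 it converges to
    the (non-smooth) frequency indicator for the argmax alternative."""
    z = (values - values.max()) / tau   # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Three hypothetical alternative-specific values at one shock draw
v = np.array([1.0, 0.2, 0.5])
sharp = smoothed_logit_simulator(v, tau=0.01)   # nearly an indicator function
smooth = smoothed_logit_simulator(v, tau=1.0)   # differentiable in the values
```

Averaging the smoothed probabilities over shock draws yields a simulated likelihood that is differentiable in the parameters, at the cost of a small bias controlled by the smoothing parameter.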
Conceptually, any dynamic programming problem that admits a numerical solution can be estimated. In addition to simulated maximum likelihood, researchers have used various alternative simulation estimation methods, including minimum distance estimation, simulated method of moments and indirect inference. There is nothing in the application of these estimation methods to DCDP models that is special, other than having to iterate between solving the dynamic programming problem and minimizing a statistical objective function.
The main limiting factor in estimating DCDP models is the computational burden associated with the iterative process. It is therefore not surprising that there have been continuing efforts to reduce the computational burden of estimating DCDP models. We briefly review two such methods.
As has been discussed elsewhere (see Geweke and Keane (2000)), it is difficult to apply the Bayesian approach to inference in DCDP models because the posterior distribution of the model parameters given the data is typically intractably complex. Recently, however, computationally practical Bayesian approaches that rely on Markov Chain Monte Carlo (MCMC) methods have been developed by Imai et al. (2009) and Norets (2009). We will discuss the Imai et al. (2009) approach in the stationary case, where it is most effective. Thus, we remove time superscripts from the value functions and denote as the next period state. We also make the parameter vector explicit. Thus, corresponding to Eqs. (20) and (21), we have
where
The basic idea is to treat not only the parameters but also the value functions and expected value functions as objects that are to be updated on each iteration of the MCMC algorithm. Hence, we add the superscript to the value functions, the expected value functions and the parameters to denote the values of these objects on iteration . We use to denote the approximation to the expected value and to denote the likelihood.
The Imai et al. (2009) algorithm consists of three steps: the parameter update step (using the Metropolis-Hastings algorithm), the Dynamic Programming step, and the expected value approximation step:
(1) The Parameter Updating Step (Metropolis-Hastings algorithm)
First, draw a candidate parameter vector from the proposal density . Then, evaluate the likelihood conditional on and conditional on . Now, form the acceptance probability
We then accept with probability , that is,
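The parameter-updating step can be sketched as a standard Metropolis-Hastings update. Everything here is a stylized stand-in: a one-dimensional parameter, a symmetric random-walk proposal, and a toy log-likelihood (so that the acceptance ratio reduces to a likelihood ratio); in the actual algorithm the likelihood evaluation would itself use the current approximate value functions.

```python
import numpy as np

def mh_step(theta, log_lik, rng, proposal_sd=0.5):
    """One Metropolis-Hastings update with a symmetric random-walk proposal:
    draw a candidate, accept with probability min(1, L(candidate)/L(current))."""
    candidate = theta + rng.normal(0.0, proposal_sd)
    log_accept = log_lik(candidate) - log_lik(theta)   # symmetric proposal cancels
    if np.log(rng.uniform()) < log_accept:
        return candidate    # candidate accepted
    return theta            # candidate rejected; keep current draw

# Toy target: a standard normal "posterior" for a scalar parameter
log_lik = lambda th: -0.5 * th ** 2
rng = np.random.default_rng(3)
chain = [0.0]
for _ in range(5000):
    chain.append(mh_step(chain[-1], log_lik, rng))
chain = np.array(chain)
```

Run long enough, the chain's draws approximate the posterior distribution of the parameter; here the sample mean and standard deviation should be close to 0 and 1.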
(2) The Dynamic Programming (or Bellman equation iteration) Step
The following Bellman equation step is nested within the parameter updating step:
The difficulty here is in obtaining the expected value function approximation that appears on the right hand side of (81). We describe this next.
(3) Expected value approximation step.
The expected value function approximation is computed using information from earlier iterations of the MCMC algorithm. The problem is that, at iteration s, the value functions have not, in general, yet been calculated at the specific parameter value drawn at that iteration. Intuitively, the idea is to approximate the expected value functions at by using value functions that were already calculated at earlier iterations of the MCMC algorithm, emphasizing parameter values that are in some sense “close” to .
Specifically, the expected value function is approximated as
where denotes a parameter value from an earlier iteration of the MCMC algorithm and is the value function at state point that was calculated on iteration .65 Finally, is a weighting function that formalizes the notion of closeness between and . Imai et al. (2009) use a weighting function given by
where is a kernel with bandwidth .
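The expected value approximation step can be sketched as a kernel-weighted average over past iterations' value functions. The Gaussian kernel, the two-dimensional parameter, and the toy "history" of past draws used here are illustrative assumptions; the essential idea is only that past value functions computed at parameter values close to the current draw receive more weight.

```python
import numpy as np

def approx_emax(theta, past_thetas, past_values, h=0.2):
    """Kernel-weighted average of value functions from earlier MCMC
    iterations, weighting parameter draws close to theta more heavily."""
    d = np.linalg.norm(past_thetas - theta, axis=1)   # distance to each past draw
    w = np.exp(-0.5 * (d / h) ** 2)                    # Gaussian kernel, bandwidth h
    w /= w.sum()
    return w @ past_values

# Toy history: the value at a fixed state point is roughly linear in the parameter
rng = np.random.default_rng(4)
past_thetas = rng.uniform(-1, 1, size=(200, 2))
past_values = 1.0 + past_thetas @ np.array([0.5, -0.3])

v_hat = approx_emax(np.array([0.2, 0.1]), past_thetas, past_values)
```

As the discussion below notes, in the actual algorithm the bandwidth would shrink as iterations accumulate and the averaging window would exclude early, poorly approximated iterations.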
Under certain conditions, as the number of iterations grows large, the output of this algorithm converges to the posterior distribution of the parameter vector, as well as to the correct (state and parameter contingent) value functions. One condition is “forgetting.” That is, the algorithm will typically be initialized using rather arbitrary initial value functions. Hence, the sum in (82) should be taken using a moving window of more recent iterations so that early iterations are dropped. Another key point is that, as one iterates, more lagged values of become available, so more values that are “close” to the current draw become available. Hence, the bandwidth in the kernel smoother in (83) should become narrower as one iterates. Note that satisfying both the “forgetting” and “narrowing” conditions simultaneously requires that the “moving window” mentioned earlier expand as one iterates, but not too quickly. Norets (2009) and Imai et al. (2009) derive precise rates.
The Bayesian methods described here are in principle applicable to non-stationary models as well. This should be obvious given that a non-stationary model can always be represented as a stationary model with (enough) age specific variables included in the state space. However, this creates the usual curse of dimensionality, as the state space may expand substantially as a result. Unlike, say, the approximate solution algorithm proposed by Keane and Wolpin (1994), these Bayesian algorithms are not designed (or intended) to be methods for handling extremely large state space problems. Combining the two ideas is a useful avenue for future research.
It is worth noting that no DCDP work that we are aware of has ever reported a distribution of policy simulations that accounts for parameter uncertainty; and, it is also rarely done in nonstructural work.66 The Bayesian approach provides a natural way to do this, and Imai et al. (2009) have produced code that generates such a distribution.
Hotz and Miller (1993) developed a method for implementing DCDP models that does not involve solving the DP model, that is, calculating the functions. Hotz and Miller (HM) prove that, for additive errors, the functions can be written solely as functions of conditional choice probabilities and state variables for any joint distribution of additive shocks. Although the method does not require that errors be distributed extreme value, the computational advantage of the method is best exploited under that assumption.
Consider again the binary choice model.67 From (38), one can see that if we have an estimate of the conditional choice probabilities at all state points, can also be calculated at all state points. Denoting the (estimate of the) conditional choice probability by ,
Consider now period and suppose we have an estimate of the conditional choice probabilities, . Then,
where, for convenience, we have included only work experience in the function. We can continue substituting the estimated conditional choice probabilities in this recursive manner, yielding at any
These functions can be used in determining the cut-off values that enter the likelihood function.
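Under the extreme value assumption, the mapping from conditional choice probabilities back to expected values has a well-known closed form, which is what makes the HM approach so convenient in that case: for i.i.d. type-1 extreme value errors, E[max_k (v_k + eps_k)] = v_j + gamma - ln p_j for ANY alternative j, where gamma is Euler's constant. The sketch below verifies this identity against the log-sum formula for hypothetical alternative values.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def emax_from_ccp(v_j, p_j):
    """Hotz-Miller inversion under i.i.d. type-1 extreme value errors:
    recover E[max] from one alternative's value and its choice probability."""
    return v_j + EULER_GAMMA - np.log(p_j)

# Hypothetical deterministic values for three alternatives
v = np.array([1.0, 0.3, -0.2])
p = np.exp(v) / np.exp(v).sum()     # extreme value choice probabilities

# Closed form: E[max] = gamma + ln(sum_k exp v_k)
closed_form = EULER_GAMMA + np.log(np.exp(v).sum())
recovered = emax_from_ccp(v[0], p[0])   # same answer from any alternative
```

In estimation, the p's are replaced by conditional choice probabilities estimated from the data, so the expected value functions, and hence the cut-off values in the likelihood, can be built up without ever solving the dynamic program.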
As with other approaches, there are limitations. First, the empirical strategy involves estimating the conditional choice probabilities from the data (nonparametrically if the data permit). In the case at hand, the conditional choice probabilities correspond to the proportion of women who work for given values of the state variables (for example, for all levels of work experience). To implement this procedure, one needs estimates of the conditional choice probabilities through the final decision period and for each possible value of the state space. Thus, we either need longitudinal data that extend to the end of the decision period or we need to assume that the conditional choice probabilities can be obtained from synthetic cohorts. The latter method requires an assumption of stationarity: in forecasting the conditional choice probabilities of a 30 year old observed in year when reaching age 60 in year , it is assumed that the 30 year old will face the same decision-making environment (for example, the same wage offer function) as the 60 year old observed in year . Most DCDP models in the literature that solve the full dynamic programming problem implicitly make such an assumption as well, though it is not dictated by the method.68 Moreover, it must also be assumed that there are no state variables observed by the agent but unobserved by us; otherwise, we will not be matching the 30 year olds to 60 year olds with the same unobserved state values.69 Second, the convenience of using additive extreme value errors brings with it the previously discussed limitations of that assumption. Third, the estimates are not efficient, because the fact that the functions themselves contain the parameters of the model structure is not taken into account.