Provided −F̈(α, y; θ) = K + V is positive definite for all α, Newton–Raphson iterates starting at any value α^{(0)} will converge to the unique modal point, denoted α*. A sufficient condition is that log p(y|xᵀβ + α) be concave in α. This is satisfied for the exponential family with canonical link, in which case K = diag(b″(Xᵀβ + α)). Note that α* is a function of θ and y, but we will typically suppress this dependence in the notation.
Using this expansion, we rewrite F in (6.13) as
$$F(\alpha, y; \theta) = F_a(\alpha, y; \theta) + R(\alpha; \alpha^*), \tag{6.18}$$

where

$$F_a(\alpha, y; \theta) = F(\alpha^*, y; \theta) - \tfrac{1}{2}\,(\alpha - \alpha^*)^\top (K^* + V)\,(\alpha - \alpha^*). \tag{6.19}$$
Ignoring the error term R(α; α*) provides an approximation to p(y, α) of the form p_a(y, α) = exp(F_a(α, y)), which, when integrated over α, gives the Laplace approximation to the marginal distribution of y and hence to the likelihood (6.12), of the form
$$L_a(\theta; y_n, \alpha^*) = (2\pi)^{n/2}\,|K^* + V|^{-1/2} \exp\{F(\alpha^*, y; \theta)\} = \frac{|V|^{1/2}}{|K^* + V|^{1/2}}\, \exp\Big\{\log p(y \mid X\beta + \alpha^*) - \tfrac{1}{2}\,\alpha^{*\top} V \alpha^*\Big\}, \tag{6.20}$$
and
$$L(\theta) = L_a(\theta)\, E_a\!\left[e^{R(\alpha, \alpha^*)}\right], \tag{6.21}$$

where the expectation is with respect to the approximate posterior p_a(α|y; θ) ∼ N(α*, (K* + V)^{-1}).
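As a concrete illustration, the following sketch (Python with NumPy/SciPy; the function and variable names are our own, not the chapter's) finds the mode α* by Newton–Raphson and evaluates the Laplace log-likelihood (6.20) for a Poisson model with canonical log link and a stationary AR(1) latent process:

```python
import numpy as np
from scipy.special import gammaln

def ar1_precision(n, phi, sigma2):
    # Tridiagonal precision matrix V = Gamma_n^{-1} of a stationary AR(1).
    V = np.zeros((n, n))
    np.fill_diagonal(V, 1.0 + phi ** 2)
    V[0, 0] = V[-1, -1] = 1.0
    i = np.arange(n - 1)
    V[i, i + 1] = V[i + 1, i] = -phi
    return V / sigma2

def laplace_loglik(y, xbeta, V, max_iter=50, tol=1e-10):
    # Newton-Raphson for the mode alpha* of F(alpha, y; theta), followed by
    # the Laplace approximation (6.20) to the log-likelihood.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        lam = np.exp(xbeta + alpha)        # Poisson: K = diag(b'') = diag(lam)
        grad = (y - lam) - V @ alpha       # gradient of F in alpha
        H = np.diag(lam) + V               # K + V: minus the Hessian, pos. definite
        step = np.linalg.solve(H, grad)    # dense O(n^3) solve; banded version below
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    lam = np.exp(xbeta + alpha)            # quantities at the mode alpha*
    logp = np.sum(y * (xbeta + alpha) - lam - gammaln(y + 1))
    _, logdet_v = np.linalg.slogdet(V)
    _, logdet_h = np.linalg.slogdet(np.diag(lam) + V)
    # log L_a = 0.5 log|V| - 0.5 log|K* + V| + log p(y|Xb + a*) - 0.5 a*' V a*
    return 0.5 * (logdet_v - logdet_h) + logp - 0.5 * alpha @ V @ alpha, alpha
```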
As a result of the very high dimension of the latent process α, Newton–Raphson updates are computationally expensive to implement naively, as each step would cost O(n³) operations for the calculations involving the inversion of the Hessian matrix. Also, the efficiency of this computation depends very strongly on the form of the autocovariance matrix Γ_n for α. A convenient and flexible choice for the {α_t} process is the causal AR(p) model,

$$\alpha_t = \phi_1 \alpha_{t-1} + \cdots + \phi_p \alpha_{t-p} + \epsilon_t, \quad \{\epsilon_t\} \sim \text{IID } N(0, \sigma^2), \tag{6.22}$$
in which case V in (6.16) is a banded matrix and is hence sparse. For this class of models, Davis and Rodriguez-Yam (2005) employ the innovations algorithm for computation of the update steps, leading to a fast and stable algorithm for obtaining the mode α* and the determinant |K* + V| required in (6.20).
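The banded structure is what makes each update cheap. The sketch below is our illustration of that point using SciPy's banded Cholesky solvers for the AR(1) case (it is not the innovations-algorithm implementation of Davis and Rodriguez-Yam), replacing the dense O(n³) step of the earlier sketch with O(n) work:

```python
import numpy as np
from scipy.linalg import cholesky_banded, solveh_banded

def banded_newton_step(lam, phi, sigma2, grad):
    # K* + V is tridiagonal for an AR(1) latent process; store it in the
    # upper banded form used by scipy.linalg to avoid dense O(n^3) work.
    n = len(lam)
    ab = np.zeros((2, n))
    ab[1] = lam + (1.0 + phi ** 2) / sigma2      # main diagonal of K + V
    ab[1, 0] = lam[0] + 1.0 / sigma2             # boundary terms of V
    ab[1, -1] = lam[-1] + 1.0 / sigma2
    ab[0, 1:] = -phi / sigma2                    # superdiagonal of V
    step = solveh_banded(ab, grad)               # O(n) Newton step
    c = cholesky_banded(ab)                      # banded Cholesky factor of K + V
    logdet = 2.0 * np.sum(np.log(c[1]))          # log|K + V| from the factor's diagonal
    return step, logdet
```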
The above Laplace approximation is equivalent to finding the mode of the smoothing
density of the state given the observations. Durbin and Koopman (1997) and Durbin and
Koopman (2012) use a modal approximation to obtain the importance density and arrive
at an alternative expression to (6.21) shown by Davis and Rodriguez-Yam (2005) to be of
the form
$$L(\theta) = L_g(\theta)\, E_a\!\left[\frac{p(y \mid \alpha)}{g(y \mid \alpha)}\right], \tag{6.23}$$
where the expectation is taken with respect to the previously defined Gaussian approximating smoothing density for α|y ∼ N(α*, (K* + V)^{-1}), g(y|α) is the corresponding Gaussian density for y|α, and L_g(θ) = A(α*)L_a(θ), where A(α*) is defined in Davis and Rodriguez-Yam (2005). As a result of (6.23), the importance sampling method based on the Laplace approximation and that based on the Gaussian approximation of Durbin and Koopman will give identical results (for the same draws from the approximating conditional distribution of α|y).
Unlike the use of L_a alone to approximate the likelihood, L_g alone cannot be used as an approximation to L(θ), and simulation is required to adjust it. An alternative approximation to the likelihood, not requiring simulation, is proposed in Durbin and Koopman (1997), mainly for obtaining starting values for the optimization of the likelihood over θ. This is of the form
$$\log L_{a,DK} = \log L_g + \log \hat{w} + \log\!\Big(1 + \frac{1}{8}\sum_{t=1}^{n} \hat{l}_t^{(4)}\, v_t^2\Big), \tag{6.24}$$
where log ŵ = log p(y|α*) − log p_a(y|α*), v_t = ((K*^{-1} + V^{-1})^{-1})_{tt}, and l̂_t^{(4)} = b^{(4)}(x_tᵀβ + α_t*) is the fourth derivative of the t-th element of l̂ evaluated at the mode, with b^{(4)} being the fourth derivative of b.
All of the quantities needed for the approximation (6.24) can be calculated readily for the
exponential family and require only the quantities needed for the Laplace approximation
above.
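For the Poisson model with log link, b(η) = e^η, so b^{(4)}(η) = e^η and the correction term in (6.24) follows directly from the Laplace output. A minimal sketch (our names; v_t is computed as in the definition following (6.24), densely here purely for clarity):

```python
import numpy as np

def dk_correction(xbeta, alpha_star, K, V):
    # Simulation-free correction term of (6.24); K = diag(b'') at the mode.
    lam = np.exp(xbeta + alpha_star)     # b^{(4)} at the mode in the Poisson case
    Vt = np.linalg.inv(np.linalg.inv(K) + np.linalg.inv(V))
    v = np.diag(Vt)                      # v_t as defined after (6.24)
    return np.log1p(np.sum(lam * v ** 2) / 8.0)
```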
Shephard and Pitt (1997) and Durbin and Koopman (1997) use Kalman filtering and smoothing to obtain the mode corresponding to an approximating linear, Gaussian state space model. Details are in Davis and Rodriguez-Yam (2005) and Durbin and Koopman (2012). For this procedure to be valid, it is required that log p(y|xᵀβ + α) be concave in α, which is a stronger condition than is needed for the approach based on the innovations algorithm to obtain the Laplace approximation. In cases where log p(y|xᵀβ + α) is not concave, Jungbacker and Koopman (2007) show that the Kalman filter and smoother algorithms of Durbin and Koopman (1997) can still be used, in conjunction with a line-search version of Newton–Raphson, to obtain the required mode for the Laplace approximating density. However, the derivation of this result in Jungbacker and Koopman (2007) is based on different arguments than those presented in Durbin and Koopman (2012).
We let θ̂_a = arg max_θ L_a(θ) and call this the approximate MLE for θ. Use of derivative information can substantially assist numerical optimization algorithms in finding the maximum over θ. However, expressions for the first derivative vector involve the third-order partial derivatives of F(α*(θ); θ), which in turn require implicit differentiation. Needless to say, the analytic expressions are complex and need to be worked out for each specific model being considered for the latent process autocorrelation structure. Numerical derivatives based on finite differences are often used, but these can be slow to execute and inaccurate. An alternative to using numerical derivatives, suggested in Skaug (2002) and Skaug and Fournier (2006), is to use automatic differentiation (AD). Sometimes called "algorithmic differentiation," this method takes computer code for the calculation of L_a(θ; y_n, α*) in (6.20) and produces new code that evaluates its derivatives. This is not the same as using a symbolic differentiator such as those in Maple or Mathematica.
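As one illustration of AD (using the jax library, not the ADMB tool of Skaug and Fournier), the Laplace log-likelihood sketched earlier can be rewritten with jax.numpy, and jax.grad then generates code for its derivatives in θ. Differentiating through a fixed number of Newton iterations is a common device and is adequate once the mode has converged; all names below are our illustrative choices:

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import gammaln

def laplace_loglik(theta, y, x):
    # theta = (beta, phi, log sigma^2): a jax re-implementation of the
    # Poisson/AR(1) Laplace approximation (6.20) sketched above.
    beta, phi, logs2 = theta
    n = y.shape[0]
    d = jnp.full(n, 1.0 + phi ** 2).at[0].set(1.0).at[-1].set(1.0)
    V = (jnp.diag(d) - phi * (jnp.eye(n, k=1) + jnp.eye(n, k=-1))) / jnp.exp(logs2)
    xbeta = x * beta
    alpha = jnp.zeros(n)
    for _ in range(25):            # fixed iteration count keeps the map differentiable
        lam = jnp.exp(xbeta + alpha)
        alpha = alpha + jnp.linalg.solve(jnp.diag(lam) + V, (y - lam) - V @ alpha)
    lam = jnp.exp(xbeta + alpha)
    logp = jnp.sum(y * (xbeta + alpha) - lam - gammaln(y + 1.0))
    return (0.5 * (jnp.linalg.slogdet(V)[1] - jnp.linalg.slogdet(jnp.diag(lam) + V)[1])
            + logp - 0.5 * alpha @ V @ alpha)

score = jax.grad(laplace_loglik)   # derivative code generated automatically
```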
Davis and Rodriguez-Yam (2005) demonstrate the very good accuracy of the Laplace approximation, without the use of importance sampling, for the models considered here. Use of higher-order terms (Shun and McCullagh, 1995) in the Taylor series expansion of F in (6.13) would be relatively easy to implement and is likely to improve the performance even further, particularly for highly discrete data such as Bernoulli responses. While the Laplace approximation appears to be very close to the required integral likelihood, convergence of the Laplace approximation for parameter-driven processes has not been established, despite the many methods that base their analysis on approximating the likelihood with the Laplace approximation and simulating the error, such as Durbin and Koopman (2000) and Davis and Rodriguez-Yam (2005).
As yet, there is no usable asymptotic theory for the Laplace approximate MLE. Simulation results in Davis and Rodriguez-Yam (2005) for reasonably simple examples suggest that the bias is small and the distribution is approximately normal.
Despite the drawbacks of asymptotic bias and the lack of a CLT for the estimates obtained by maximizing the approximate likelihoods L_a(θ) in (6.20) or L_{a,DK}(θ) in (6.24), they can be very useful in the model-building phase for deciding on appropriate regressors and a latent process model form. Here, the emphasis should be on speed in arriving at a parsimonious model in which relevant regressors are included and serial dependence is adequately taken into account. Approximate standard errors could be obtained using numerical differentiation of the approximate likelihood to obtain an approximate Hessian and hence information matrix; for example, with a standard optimizer such as 'optim' in R (R Core Team, 2014), the Hessian at convergence can be requested. Another approach, suggested in Davis and Rodriguez-Yam (2005), uses bootstrap methods to adjust for bias and to obtain an approximate distribution for the approximate likelihood estimators. Alternatively, information-theoretic criteria could be used to compare models of different structures or complexity, or informal use of the likelihood ratio test for nested model structures could be employed. Once one or several suitable model structures have been decided upon by this approach, the approximate likelihood methods could be enhanced by importance sampling to improve the accuracy of inference and final model choice.
6.2.5 Importance Sampling
The general use of importance sampling methods to evaluate the likelihood for nonlinear, non-Gaussian models is well reviewed in Durbin and Koopman (2012). Importance sampling is used to approximate the expectation of the error term of the Laplace approximation L_a(θ) to the likelihood L(θ) in (6.21) and, for the Gaussian approximation, to approximate the expectation term multiplying L_g(θ) in (6.23). Both of these implementations of importance sampling draw samples from the same modal approximation to the posterior distribution of α|y. An alternative approach to approximating L(θ) is efficient importance sampling (EIS), which uses a global approximation minimizing the variance of the log of the importance weights. We discuss these three approaches in more detail here.
6.2.5.1 Importance Sampling based on Laplace Approximation
Based on (6.21), Davis and Rodriguez-Yam (2005) use importance sampling to approximate the expectation as

$$Er_a(\theta) = E_a\!\left[e^{R(\alpha, \alpha^*)}\right] \approx \frac{1}{M}\sum_{i=1}^{M} e^{R(\alpha^{(i)};\, \alpha^*)} =: e^{\hat{e}(\theta)}, \tag{6.25}$$
where the α^{(i)} are independent draws from p_a(α|y). However, this calculation is slow, needing M Monte Carlo simulations at each step of the optimization of the likelihood.
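A minimal sketch of the Monte Carlo step in (6.25), reusing the Poisson/AR(1) ingredients from the earlier sketches (our names; a dense Cholesky factor is used for clarity where a banded one would be preferred):

```python
import numpy as np

def e_hat(y, xbeta, alpha_star, V, M=1000, seed=0):
    # Estimate e-hat(theta) in (6.25) by averaging exp{R(alpha; alpha*)} over
    # draws alpha^(i) from the modal approximation N(alpha*, (K* + V)^{-1}).
    rng = np.random.default_rng(seed)
    lam_star = np.exp(xbeta + alpha_star)
    H = np.diag(lam_star) + V                 # posterior precision K* + V
    L = np.linalg.cholesky(np.linalg.inv(H))
    def F(a):                                 # F(alpha, y) up to alpha-free constants
        return np.sum(y * (xbeta + a) - np.exp(xbeta + a)) - 0.5 * a @ V @ a
    R = np.empty(M)
    for i in range(M):
        a = alpha_star + L @ rng.standard_normal(len(y))
        d = a - alpha_star
        R[i] = F(a) - (F(alpha_star) - 0.5 * d @ H @ d)   # R = F - F_a, from (6.18)
    return np.log(np.mean(np.exp(R)))
```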
They also suggest using approximate importance sampling (AIS), in which e^{ê(θ)} in (6.25) is approximated using a first-order Taylor expansion in θ around θ̂_AL of the form

$$\hat{e}_T(\theta; \hat{\theta}_{AL}) = \hat{e}(\hat{\theta}_{AL}) + q_{AL}^\top\, (\theta - \hat{\theta}_{AL}), \tag{6.26}$$

where q_AL = ∇_θ ê(θ)|_{θ = θ̂_AL} is obtained using numerical differentiation. This method is considerably faster than IS because the AIS approximate likelihood, given by

$$L_c(\theta) = L_a(\theta)\, \exp\big\{\hat{e}_T(\theta; \hat{\theta}_{AL})\big\}, \tag{6.27}$$

requires q_AL (which uses importance sampling) to be computed only once, before L_c(θ) is optimized.
Numerical experiments in Davis and Rodriguez-Yam (2005) show that this estimator has no significant drop in accuracy compared to traditional importance sampling. However, it is possible to iterate the AIS approximation of Er_a(θ) if further accuracy is required, with the expansion point θ̂_AL replaced by the updated estimate of θ at each step.
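A sketch of the AIS recipe in (6.26) and (6.27): ê and its gradient q_AL are estimated once at θ̂_AL, here by central differences with common random numbers, after which the surrogate log L_c is cheap to optimize. The function names are ours:

```python
import numpy as np

def ais_surrogate(e_hat_fn, theta_al, h=1e-4):
    # e_hat_fn(theta) should return the IS estimate e-hat(theta) of (6.25),
    # computed with *fixed* random draws so the differences below are smooth.
    e0 = e_hat_fn(theta_al)
    q_al = np.array([(e_hat_fn(theta_al + h * e) - e_hat_fn(theta_al - h * e)) / (2 * h)
                     for e in np.eye(len(theta_al))])    # q_AL by central differences
    def log_Lc(theta, log_La):
        # log of (6.27): log L_a(theta) + e-hat_T(theta; theta_hat_AL)
        return log_La + e0 + q_al @ (np.asarray(theta) - theta_al)
    return log_Lc
```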
6.2.5.2 Importance Sampling based on Gaussian Approximation
Durbin and Koopman (1997) advocate using the form (6.23) to approximate the likelihood by approximating E_a[w(α, y)], where w(α, y) = p(y, α)/g(y, α) = p(y|α)/g(y|α) are the importance weights, by

$$E_a[w(\alpha, y)] \approx \frac{1}{M}\sum_{i=1}^{M} w\big(\alpha^{(i)}, y\big), \tag{6.28}$$
and, as before, the α^{(i)} are independent draws from the same p_a(α|y) as used in the Davis and Rodriguez-Yam approach to importance sampling. However, Durbin and Koopman (2012) and Shephard and Pitt (1997) use the simulation smoother of De Jong and Shephard (1995) to draw the samples from this approximating posterior distribution. In order to improve the accuracy and numerical efficiency of the Monte Carlo estimate of the error term, Durbin and Koopman (1997) employ antithetic variables to balance location and scale, and a control variable based on a fourth-order Taylor expansion of the weight function in the error term. This is not the same as the AIS method of Davis and Rodriguez-Yam (2005).
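The location-balancing part of the antithetic scheme is easy to sketch: each draw α* + δ is paired with its reflection α* − δ about the mode, keeping the sample exactly centred at α* and typically reducing the variance of the weight average (the scale-balancing and control-variable parts of the scheme are omitted here):

```python
import numpy as np

def antithetic_draws(alpha_star, L, M, seed=1):
    # L: Cholesky factor of (K* + V)^{-1}. Yields M location-balanced pairs.
    rng = np.random.default_rng(seed)
    for _ in range(M):
        delta = L @ rng.standard_normal(len(alpha_star))
        yield alpha_star + delta
        yield alpha_star - delta          # reflection about the mode
```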
6.2.5.3 Efficient Importance Sampling
Choice of the importance sampling density is critical for efficient approximation of the likelihood by simulation. A modal approximation was used in the methods of the previous two subsections. An alternative choice of importance density leads to the EIS method; see Jung and Liesenfeld (2001), Richard and Zhang (2007), and Jung et al. (2006) for complete details on this approach. For EIS, the importance densities are based on the (approximate) minimization of the variance of the log importance weights log w(α, y), defined as in the Durbin–Koopman method. Lee and Koopman (2004) compare this with the DK procedure and conclude that the two have similar performance when applied to the stochastic volatility model.
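A heavily simplified sketch of the EIS idea, assuming a Gaussian importance density that factors over t (the full method handles the state transition and solves the regressions recursively; see Richard and Zhang, 2007): each observation density is approximated by a log-quadratic fit over the current draws, and the fitted coefficients update the importance density's natural parameters.

```python
import numpy as np

def eis_regression_step(alpha_draws, logp_obs):
    # alpha_draws: (M, n) draws from the current importance density.
    # logp_obs[i, t] = log p(y_t | alpha_t^(i)). Fit a_t + b_t*a + c_t*a^2 by
    # least squares; (b_t, c_t) give the natural-parameter updates.
    M, n = alpha_draws.shape
    b = np.empty(n)
    c = np.empty(n)
    for t in range(n):
        Z = np.column_stack([np.ones(M), alpha_draws[:, t], alpha_draws[:, t] ** 2])
        coef, *_ = np.linalg.lstsq(Z, logp_obs[:, t], rcond=None)
        b[t], c[t] = coef[1], coef[2]
    return b, -2.0 * c        # mean*precision and precision contributions
```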
Recently, Koopman et al. (2015) have presented an approach called numerically accelerated importance sampling (NAIS) for nonlinear, non-Gaussian state space models. The NAIS method combines fast numerical integration techniques with the Kalman filter smoothing methods proposed by Shephard and Pitt (1997) and Durbin and Koopman (1997). They demonstrate significant computational speed improvements over standard EIS implementations, as well as improved accuracy. Additionally, by using new control variables, substantial improvement in the variability of likelihood estimates can be achieved. A key component of the NAIS method is construction of the importance sampling density by numerical integration using Gauss–Hermite quadrature, in contrast to using simulated trajectories.
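The numerical-integration building block can be illustrated with Gauss–Hermite quadrature: an expectation under a Gaussian density is replaced by a short weighted sum over fixed nodes, with no simulation. A minimal sketch (our names and toy values; not the NAIS algorithm itself):

```python
import numpy as np
from scipy.special import gammaln

def gauss_hermite_expectation(f, mean, sd, degree=20):
    # E[f(a)] for a ~ N(mean, sd^2) via Gauss-Hermite nodes for exp(-x^2).
    x, w = np.polynomial.hermite.hermgauss(degree)
    return (w @ f(mean + np.sqrt(2.0) * sd * x)) / np.sqrt(np.pi)

# Example: E[p(y_t | alpha_t)] for a Poisson count y_t = 3 with x_t'beta = 1
# and alpha_t ~ N(0, 0.5^2) under the importance density.
val = gauss_hermite_expectation(
    lambda a: np.exp(3.0 * (1.0 + a) - np.exp(1.0 + a) - gammaln(4.0)),
    mean=0.0, sd=0.5)
```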
6.2.6 Composite Likelihood
As already noted, the lack of a tractably computable form for the likelihood is one of the drawbacks in using SSMs for count time series. Estimation procedures described in previous sections essentially resort to simulation-based procedures for computing the likelihood and then maximizing it. With any simulation procedure, one cannot be certain that the simulated likelihood provides a good facsimile of the actual likelihood, or even of the likelihood in the proximity of the maximum likelihood estimator. An alternative estimation procedure, which has recently grown in popularity, is based on the composite likelihood. The idea is that perhaps one does not need to compute the entire likelihood but only the likelihoods of more manageable subcollections of the data, which are then combined. For example, the likelihood given in (6.5) requires an n-fold integration, which is generally impractical even for moderate sample sizes. However, a similar integral based on only two or three dimensions can be computed numerically rather accurately. The objective, then, is to replace a large-dimensional integral with many small and manageable integrations.

The special issue of Statistica Sinica (Volume 21, 2011) provides an excellent survey of the use of composite likelihood methods in statistics. In the development below, we will follow the treatment of Davis and Yau (2011), with emphasis on their Example 3.4. Ng et al. (2011), also in the special issue, consider composite likelihood for related time series models.

While we will focus this treatment on the pairwise likelihood for count time series, extensions to combining weighted likelihoods based on more general subsets of the data are relatively straightforward. The downside in our application is that it may not be numerically practical to compute density functions of more than just a few observations.
Suppose Y_1, ..., Y_n are observations from a time series for which we denote by p(y_{i_1}, y_{i_2}, ..., y_{i_k}; θ) the likelihood for a parameter θ based on the k-tuple of distinct observations y_{i_1}, y_{i_2}, ..., y_{i_k}. For ease of notation, we have suppressed the dependence of p(·; θ) on k and on the vector i_1, i_2, ..., i_k. For k fixed, the composite likelihood based on consecutive k-tuples is given by

$$CPL_k(\theta; Y^{(n)}) = \sum_{t=1}^{n-k+1} \log p(Y_t, Y_{t+1}, \ldots, Y_{t+k-1}; \theta). \tag{6.29}$$
Normally, one does not take k very large due to the difficulty of computing the k-dimensional joint densities. The composite likelihood estimator θ̂ is found by maximizing CPL_k. Under identifiability, regularity, and suitable mixing conditions, θ̂ is consistent and asymptotically normal. This was established for linear time series models in
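As a concrete illustration of (6.29) with k = 2, the following sketch evaluates the pairwise log-likelihood for the Poisson model with a stationary AR(1) latent process, computing each bivariate marginal by tensor-product Gauss–Hermite quadrature. The constant mean specification x_tᵀβ = β and all names are our illustrative choices:

```python
import numpy as np
from scipy.special import gammaln

def pairwise_loglik(y, beta, phi, sigma2, degree=20):
    # CPL_2 of (6.29): each term is a 2-D integral over (alpha_t, alpha_{t+1}),
    # evaluated by tensor-product Gauss-Hermite quadrature.
    x, w = np.polynomial.hermite.hermgauss(degree)
    X1, X2 = np.meshgrid(x, x, indexing="ij")
    W = np.outer(w, w) / np.pi                    # weights for a standard bivariate normal
    g0 = sigma2 / (1.0 - phi ** 2)                # stationary variance of alpha_t
    C = np.linalg.cholesky(np.array([[g0, phi * g0], [phi * g0, g0]]))
    A1 = np.sqrt(2.0) * C[0, 0] * X1              # nodes for (alpha_t, alpha_{t+1})
    A2 = np.sqrt(2.0) * (C[1, 0] * X1 + C[1, 1] * X2)
    total = 0.0
    for t in range(len(y) - 1):                   # consecutive pairs
        lp = (y[t] * (beta + A1) - np.exp(beta + A1) - gammaln(y[t] + 1.0)
              + y[t + 1] * (beta + A2) - np.exp(beta + A2) - gammaln(y[t + 1] + 1.0))
        total += np.log(np.sum(W * np.exp(lp)))   # log p(y_t, y_{t+1}; theta)
    return total
```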