Provided −F̈(α, y; θ) = K + V is positive definite for all α, Newton–Raphson iterates starting at any value α^{(0)} will converge to the unique modal point, denoted α*. A sufficient condition is that log p(y|xᵀβ + α) be concave in α. This is satisfied for the exponential family with canonical link, in which case K = diag(b″(Xᵀβ + α)). Note that α* is a function of θ and y, but we will typically suppress this dependence in the notation.
Using this expansion, we rewrite F in (6.13) as
$$F(\alpha, y; \theta) = F_a(\alpha, y; \theta) + R(\alpha; \alpha^*), \tag{6.18}$$

where

$$F_a(\alpha, y; \theta) = F(\alpha^*, y; \theta) - \tfrac{1}{2}\,(\alpha - \alpha^*)^\top (K^* + V)\,(\alpha - \alpha^*). \tag{6.19}$$
Ignoring the error term R(α; α*) provides an approximation to p(y, α) of the form p_a(y, α) = exp(F_a(α, y)), which, when integrated over α, gives the Laplace approximation to the marginal distribution of y and hence to the likelihood (6.12), of the form
$$L_a(\theta; y_n, \alpha^*) = (2\pi)^{n/2}\,|K^* + V|^{-1/2} \exp\{F(\alpha^*, y; \theta)\} = \frac{|V|^{1/2}}{|K^* + V|^{1/2}}\, \exp\Big\{\log p(y \mid X\beta + \alpha^*) - \tfrac{1}{2}\,\alpha^{*\top} V \alpha^*\Big\}, \tag{6.20}$$
and
$$L(\theta) = L_a(\theta)\, E_a\!\left[e^{R(\alpha, \alpha^*)}\right], \tag{6.21}$$

where the expectation is with respect to the approximate posterior p_a(α|y; θ) ∼ N(α*, (K* + V)^{-1}).
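As a concrete illustration, the following sketch (Python with NumPy/SciPy; the function and variable names are our own, not the chapter's) finds the mode α* by Newton–Raphson and evaluates the Laplace log-likelihood (6.20) for a Poisson model with canonical log link and a stationary AR(1) latent process:

```python
import numpy as np
from scipy.special import gammaln

def ar1_precision(n, phi, sigma2):
    # Tridiagonal precision matrix V = Gamma_n^{-1} of a stationary AR(1).
    V = np.zeros((n, n))
    np.fill_diagonal(V, 1.0 + phi ** 2)
    V[0, 0] = V[-1, -1] = 1.0
    i = np.arange(n - 1)
    V[i, i + 1] = V[i + 1, i] = -phi
    return V / sigma2

def laplace_loglik(y, xbeta, V, max_iter=50, tol=1e-10):
    # Newton-Raphson for the mode alpha* of F(alpha, y; theta), followed by
    # the Laplace approximation (6.20) to the log-likelihood.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        lam = np.exp(xbeta + alpha)        # Poisson: K = diag(b'') = diag(lam)
        grad = (y - lam) - V @ alpha       # gradient of F in alpha
        H = np.diag(lam) + V               # K + V: minus the Hessian, pos. definite
        step = np.linalg.solve(H, grad)    # dense O(n^3) solve; banded version below
        alpha = alpha + step
        if np.max(np.abs(step)) < tol:
            break
    lam = np.exp(xbeta + alpha)            # quantities at the mode alpha*
    logp = np.sum(y * (xbeta + alpha) - lam - gammaln(y + 1))
    _, logdet_v = np.linalg.slogdet(V)
    _, logdet_h = np.linalg.slogdet(np.diag(lam) + V)
    # log L_a = 0.5 log|V| - 0.5 log|K* + V| + log p(y|Xb + a*) - 0.5 a*' V a*
    return 0.5 * (logdet_v - logdet_h) + logp - 0.5 * alpha @ V @ alpha, alpha
```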
As a result of the very high dimension of the latent process α, Newton–Raphson updates are computationally expensive to implement naively, as each step would cost O(n³) operations for the calculations involving the inversion of the Hessian matrix. Also, the efficiency of this computation depends very strongly on the form of the autocovariance matrix Γ_n for α. A convenient and flexible choice for the {α_t} process is the causal AR(p) model,

$$\alpha_t = \phi_1 \alpha_{t-1} + \cdots + \phi_p \alpha_{t-p} + \epsilon_t, \quad \{\epsilon_t\} \sim \text{IID } N(0, \sigma^2), \tag{6.22}$$
in which case V in (6.16) is a banded matrix and is hence sparse. For this class of models, Davis and Rodriguez-Yam (2005) employ the innovations algorithm for computation of the update steps, leading to a fast and stable algorithm for obtaining the mode α* and the determinant |K* + V| required in (6.20).
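The banded structure is what makes each update cheap. The sketch below is our illustration of that point using SciPy's banded Cholesky solvers for the AR(1) case (it is not the innovations-algorithm implementation of Davis and Rodriguez-Yam), replacing the dense O(n³) step of the earlier sketch with O(n) work:

```python
import numpy as np
from scipy.linalg import cholesky_banded, solveh_banded

def banded_newton_step(lam, phi, sigma2, grad):
    # K* + V is tridiagonal for an AR(1) latent process; store it in the
    # upper banded form used by scipy.linalg to avoid dense O(n^3) work.
    n = len(lam)
    ab = np.zeros((2, n))
    ab[1] = lam + (1.0 + phi ** 2) / sigma2      # main diagonal of K + V
    ab[1, 0] = lam[0] + 1.0 / sigma2             # boundary terms of V
    ab[1, -1] = lam[-1] + 1.0 / sigma2
    ab[0, 1:] = -phi / sigma2                    # superdiagonal of V
    step = solveh_banded(ab, grad)               # O(n) Newton step
    c = cholesky_banded(ab)                      # banded Cholesky factor of K + V
    logdet = 2.0 * np.sum(np.log(c[1]))          # log|K + V| from the factor's diagonal
    return step, logdet
```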
The above Laplace approximation is equivalent to finding the mode of the smoothing
density of the state given the observations. Durbin and Koopman (1997) and Durbin and
Koopman (2012) use a modal approximation to obtain the importance density and arrive
at an alternative expression to (6.21) shown by Davis and Rodriguez-Yam (2005) to be of
the form
$$L(\theta) = L_g(\theta)\, E_a\!\left[\frac{p(y \mid \alpha)}{g(y \mid \alpha)}\right], \tag{6.23}$$
where the expectation is taken with respect to the previously defined Gaussian approximating smoothing density for α|y ∼ N(α*, (K* + V)^{-1}), g(y|α) is the corresponding Gaussian density for y|α, and L_g(θ) = A(α*)L_a(θ), where A(α*) is defined in Davis and Rodriguez-Yam (2005). As a result of (6.23), the importance sampling method based on the Laplace approximation and that based on the Gaussian approximation of Durbin and Koopman will give identical results (for the same draws from the approximating conditional distribution of α|y).
Unlike the use of L_a alone to approximate the likelihood, L_g alone cannot be used as an approximation to L(θ), and simulation is required to adjust it. An alternative approximation to the likelihood, not requiring simulation, is proposed in Durbin and Koopman (1997), mainly for obtaining starting values for the optimization of the likelihood over θ. This is of the form
$$\log L_{a,DK} = \log L_g + \log \hat{w} + \log\!\Big(1 + \frac{1}{8}\sum_{t=1}^{n} \hat{l}_t^{(4)}\, v_t^2\Big), \tag{6.24}$$
where log ŵ = log p(y|α*) − log p_a(y|α*), v_t = ((K*^{-1} + V^{-1})^{-1})_{tt}, and l̂_t^{(4)} = b^{(4)}(x_tᵀβ + α_t*) is the fourth derivative of the t-th element of l̂ evaluated at the mode, with b^{(4)} being the fourth derivative of b.
All of the quantities needed for the approximation (6.24) can be calculated readily for the
exponential family and require only the quantities needed for the Laplace approximation
above.
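For the Poisson model with log link, b(η) = e^η, so b^{(4)}(η) = e^η and the correction term in (6.24) follows directly from the Laplace output. A minimal sketch (our names; v_t is computed as in the definition following (6.24), densely here purely for clarity):

```python
import numpy as np

def dk_correction(xbeta, alpha_star, K, V):
    # Simulation-free correction term of (6.24); K = diag(b'') at the mode.
    lam = np.exp(xbeta + alpha_star)     # b^{(4)} at the mode in the Poisson case
    Vt = np.linalg.inv(np.linalg.inv(K) + np.linalg.inv(V))
    v = np.diag(Vt)                      # v_t as defined after (6.24)
    return np.log1p(np.sum(lam * v ** 2) / 8.0)
```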
Shephard and Pitt (1997) and Durbin and Koopman (1997) use Kalman filtering and smoothing to obtain the mode corresponding to an approximating linear, Gaussian state space model. Details are in Davis and Rodriguez-Yam (2005) and Durbin and Koopman (2012). For this procedure to be valid, it is required that log p(y|xᵀβ + α) be concave in α, which is a stronger condition than is needed for the approach based on the innovations algorithm to obtain the Laplace approximation. In cases where log p(y|xᵀβ + α) is not concave, Jungbacker and Koopman (2007) show that the Kalman filter and smoother algorithms of Durbin and Koopman (1997) can still be used, in conjunction with a line-search version of Newton–Raphson, to obtain the required mode for the Laplace approximating density. However, the derivation of this result in Jungbacker and Koopman (2007) is based on different arguments than those presented in Durbin and Koopman (2012).
We let θ̂_a = arg max_θ L_a(θ) and call this the approximate MLE for θ. Use of derivative information can substantially assist numerical optimization algorithms in finding the maximum over θ. However, expressions for the first derivative vector involve the third-order partial derivatives of F(α*(θ); θ), which in turn require implicit differentiation. Needless to say, the analytic expressions are complex and need to be worked out for each specific model being considered for the latent process autocorrelation structure. Numerical derivatives based on finite differences are often used, but these can be slow to execute and inaccurate. An alternative to using numerical derivatives, suggested in Skaug (2002) and Skaug and Fournier (2006), is to use automatic differentiation (AD). Sometimes called "algorithmic differentiation," this method takes computer code for the calculation of L_a(θ; y_n, α*) in (6.20) and produces new code that evaluates its derivatives. This is not the same as using a symbolic differentiator such as those in Maple or Mathematica.
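As one illustration of AD (using the jax library, not the ADMB tool of Skaug and Fournier), the Laplace log-likelihood sketched earlier can be rewritten with jax.numpy, and jax.grad then generates code for its derivatives in θ. Differentiating through a fixed number of Newton iterations is a common device and is adequate once the mode has converged; all names below are our illustrative choices:

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import gammaln

def laplace_loglik(theta, y, x):
    # theta = (beta, phi, log sigma^2): a jax re-implementation of the
    # Poisson/AR(1) Laplace approximation (6.20) sketched above.
    beta, phi, logs2 = theta
    n = y.shape[0]
    d = jnp.full(n, 1.0 + phi ** 2).at[0].set(1.0).at[-1].set(1.0)
    V = (jnp.diag(d) - phi * (jnp.eye(n, k=1) + jnp.eye(n, k=-1))) / jnp.exp(logs2)
    xbeta = x * beta
    alpha = jnp.zeros(n)
    for _ in range(25):            # fixed iteration count keeps the map differentiable
        lam = jnp.exp(xbeta + alpha)
        alpha = alpha + jnp.linalg.solve(jnp.diag(lam) + V, (y - lam) - V @ alpha)
    lam = jnp.exp(xbeta + alpha)
    logp = jnp.sum(y * (xbeta + alpha) - lam - gammaln(y + 1.0))
    return (0.5 * (jnp.linalg.slogdet(V)[1] - jnp.linalg.slogdet(jnp.diag(lam) + V)[1])
            + logp - 0.5 * alpha @ V @ alpha)

score = jax.grad(laplace_loglik)   # derivative code generated automatically
```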
Davis and Rodriguez-Yam (2005) demonstrate the very good accuracy of the Laplace approximation, without the use of importance sampling, for the models considered here. Use of higher-order terms (Shun and McCullagh, 1995) in the Taylor series expansion of F in (6.13) would be relatively easy to implement and is likely to improve the performance even further, particularly for highly discrete data such as Bernoulli responses. While the Laplace approximation appears to be very close to the required integral likelihood, convergence of the Laplace approximation for parameter-driven processes has not been established, despite the many methods that base their analysis on approximating the likelihood with the Laplace approximation and simulating the error, such as Durbin and Koopman (2000) and Davis and Rodriguez-Yam (2005).
As yet, there is no usable asymptotic theory for the Laplace approximate MLE. Simulation results in Davis and Rodriguez-Yam (2005) for reasonably simple examples suggest that the bias is small and the distribution is approximately normal.
Despite the drawbacks of asymptotic bias and the lack of a CLT for the estimates obtained by maximizing the approximate likelihoods L_a(θ) in (6.20) or L_{a,DK}(θ) in (6.24), they can be very useful in the model-building phase for deciding on appropriate regressors and a latent process model form. Here, the emphasis should be on speed in arriving at a parsimonious model in which relevant regressors are included and serial dependence is adequately taken into account. Approximate standard errors could be obtained using numerical differentiation of the approximate likelihood to obtain an approximate Hessian and hence information matrix; for example, with a standard optimizer such as 'optim' in R (R Core Team, 2014), the Hessian at convergence can be requested. Another approach, suggested in Davis and Rodriguez-Yam (2005), uses bootstrap methods to adjust for bias and to obtain an approximate distribution for the approximate likelihood estimators. Alternatively, information-theoretic criteria could be used to compare models of different structures or complexity, or informal use of the likelihood ratio test for nested model structures could be employed. Once one or several suitable model structures have been decided upon by this approach, the approximate likelihood methods could be enhanced by importance sampling to improve the accuracy of inference and final model choice.
6.2.5 Importance Sampling
The general use of importance sampling methods to evaluate the likelihood for nonlinear, non-Gaussian models is well reviewed in Durbin and Koopman (2012). Importance sampling is used to approximate the expectation of the error term of the Laplace approximation L_a(θ) to the likelihood L(θ) in (6.21) and, for the Gaussian approximation, to approximate the expectation term multiplying L_g(θ) in (6.23). Both of these implementations of importance sampling draw samples from the same modal approximation to the posterior distribution of α|y. An alternative approach to approximating L(θ) is efficient importance sampling (EIS), which uses a global approximation minimizing the variance of the log of the importance weights. We discuss these three approaches in more detail here.
6.2.5.1 Importance Sampling based on Laplace Approximation
Based on (6.21), Davis and Rodriguez-Yam (2005) use importance sampling to approximate the expectation as

$$Er_a(\theta) = E_a\!\left[e^{R(\alpha, \alpha^*)}\right] \approx \frac{1}{M}\sum_{i=1}^{M} e^{R(\alpha^{(i)};\, \alpha^*)} =: e^{\hat{e}(\theta)}, \tag{6.25}$$
where the α^{(i)} are independent draws from p_a(α|y). However, this calculation is slow, needing M Monte Carlo simulations at each step of the optimization of the likelihood.
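A minimal sketch of the Monte Carlo step in (6.25), reusing the Poisson/AR(1) ingredients from the earlier sketches (our names; a dense Cholesky factor is used for clarity where a banded one would be preferred):

```python
import numpy as np

def e_hat(y, xbeta, alpha_star, V, M=1000, seed=0):
    # Estimate e-hat(theta) in (6.25) by averaging exp{R(alpha; alpha*)} over
    # draws alpha^(i) from the modal approximation N(alpha*, (K* + V)^{-1}).
    rng = np.random.default_rng(seed)
    lam_star = np.exp(xbeta + alpha_star)
    H = np.diag(lam_star) + V                 # posterior precision K* + V
    L = np.linalg.cholesky(np.linalg.inv(H))
    def F(a):                                 # F(alpha, y) up to alpha-free constants
        return np.sum(y * (xbeta + a) - np.exp(xbeta + a)) - 0.5 * a @ V @ a
    R = np.empty(M)
    for i in range(M):
        a = alpha_star + L @ rng.standard_normal(len(y))
        d = a - alpha_star
        R[i] = F(a) - (F(alpha_star) - 0.5 * d @ H @ d)   # R = F - F_a, from (6.18)
    return np.log(np.mean(np.exp(R)))
```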
They also suggest using approximate importance sampling (AIS), in which e^{ê(θ)} in (6.25) is approximated using a first-order Taylor expansion in θ around θ̂_AL of the form

$$\hat{e}_T(\theta; \hat{\theta}_{AL}) = \hat{e}(\hat{\theta}_{AL}) + q_{AL}^\top\, (\theta - \hat{\theta}_{AL}), \tag{6.26}$$

where q_AL = ∇_θ ê(θ)|_{θ = θ̂_AL} is obtained using numerical differentiation. This method is considerably faster than IS because the AIS approximate likelihood, given by

$$L_c(\theta) = L_a(\theta)\, \exp\big\{\hat{e}_T(\theta; \hat{\theta}_{AL})\big\}, \tag{6.27}$$

requires q_AL (which uses importance sampling) to be computed only once, before L_c(θ) is optimized.
Numerical experiments in Davis and Rodriguez-Yam (2005) show that this estimator has no significant drop in accuracy compared to traditional importance sampling. However, it is possible to iterate the AIS approximation of Er_a(θ) if further accuracy is required, with the expansion point θ̂_AL replaced by the updated estimate of θ at each step.
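A sketch of the AIS recipe in (6.26) and (6.27): ê and its gradient q_AL are estimated once at θ̂_AL, here by central differences with common random numbers, after which the surrogate log L_c is cheap to optimize. The function names are ours:

```python
import numpy as np

def ais_surrogate(e_hat_fn, theta_al, h=1e-4):
    # e_hat_fn(theta) should return the IS estimate e-hat(theta) of (6.25),
    # computed with *fixed* random draws so the differences below are smooth.
    e0 = e_hat_fn(theta_al)
    q_al = np.array([(e_hat_fn(theta_al + h * e) - e_hat_fn(theta_al - h * e)) / (2 * h)
                     for e in np.eye(len(theta_al))])    # q_AL by central differences
    def log_Lc(theta, log_La):
        # log of (6.27): log L_a(theta) + e-hat_T(theta; theta_hat_AL)
        return log_La + e0 + q_al @ (np.asarray(theta) - theta_al)
    return log_Lc
```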
6.2.5.2 Importance Sampling based on Gaussian Approximation
Durbin and Koopman (1997) advocate using the form (6.23) to approximate the likelihood by approximating E_a[w(α, y)], where w(α, y) = p(y, α)/g(y, α) = p(y|α)/g(y|α) are the importance weights, by

$$E_a[w(\alpha, y)] \approx \frac{1}{M}\sum_{i=1}^{M} w\big(\alpha^{(i)}, y\big), \tag{6.28}$$
and, as before, the α^{(i)} are independent draws from the same p_a(α|y) as used in the Davis and Rodriguez-Yam approach to importance sampling. However, Durbin and Koopman (2012) and Shephard and Pitt (1997) use the simulation smoother of De Jong and Shephard (1995) to draw the samples from this approximating posterior distribution. In order to improve the accuracy and numerical efficiency of the Monte Carlo estimate of the error term, Durbin and Koopman (1997) employ antithetic variables to balance location and scale, and a control variable based on a fourth-order Taylor expansion of the weight function in the error term. This is not the same as the AIS method of Davis and Rodriguez-Yam (2005).
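The location-balancing part of the antithetic scheme is easy to sketch: each draw α* + δ is paired with its reflection α* − δ about the mode, keeping the sample exactly centred at α* and typically reducing the variance of the weight average (the scale-balancing and control-variable parts of the scheme are omitted here):

```python
import numpy as np

def antithetic_draws(alpha_star, L, M, seed=1):
    # L: Cholesky factor of (K* + V)^{-1}. Yields M location-balanced pairs.
    rng = np.random.default_rng(seed)
    for _ in range(M):
        delta = L @ rng.standard_normal(len(alpha_star))
        yield alpha_star + delta
        yield alpha_star - delta          # reflection about the mode
```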
6.2.5.3 Efficient Importance Sampling
Choice of the importance sampling density is critical for efficient approximation of the likelihood by simulation. A modal approximation was used in the methods of the previous two subsections. An alternative choice of importance density leads to the EIS method; see Jung and Liesenfeld (2001), Richard and Zhang (2007), and Jung et al. (2006) for complete details on this approach. For EIS, the importance densities are based on the (approximate) minimization of the variance of the log importance weights log w(α, y), defined as in the Durbin–Koopman method. Lee and Koopman (2004) compare this with the DK procedure and conclude that the two have similar performance when applied to the stochastic volatility model.
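A heavily simplified sketch of the EIS idea, assuming a Gaussian importance density that factors over t (the full method handles the state transition and solves the regressions recursively; see Richard and Zhang, 2007): each observation density is approximated by a log-quadratic fit over the current draws, and the fitted coefficients update the importance density's natural parameters.

```python
import numpy as np

def eis_regression_step(alpha_draws, logp_obs):
    # alpha_draws: (M, n) draws from the current importance density.
    # logp_obs[i, t] = log p(y_t | alpha_t^(i)). Fit a_t + b_t*a + c_t*a^2 by
    # least squares; (b_t, c_t) give the natural-parameter updates.
    M, n = alpha_draws.shape
    b = np.empty(n)
    c = np.empty(n)
    for t in range(n):
        Z = np.column_stack([np.ones(M), alpha_draws[:, t], alpha_draws[:, t] ** 2])
        coef, *_ = np.linalg.lstsq(Z, logp_obs[:, t], rcond=None)
        b[t], c[t] = coef[1], coef[2]
    return b, -2.0 * c        # mean*precision and precision contributions
```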
Recently, Koopman et al. (2015) have presented an approach called numerically accelerated importance sampling (NAIS) for nonlinear, non-Gaussian state space models. The NAIS method combines fast numerical integration techniques with the Kalman filter smoothing methods proposed by Shephard and Pitt (1997) and Durbin and Koopman (1997). They demonstrate significant computational speed improvements over standard EIS implementations, as well as improved accuracy. Additionally, by using new control variables, substantial improvement in the variability of likelihood estimates can be achieved. A key component of the NAIS method is construction of the importance sampling density by numerical integration using Gauss–Hermite quadrature, in contrast to using simulated trajectories.
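The numerical-integration building block can be illustrated with Gauss–Hermite quadrature: an expectation under a Gaussian density is replaced by a short weighted sum over fixed nodes, with no simulation. A minimal sketch (our names and toy values; not the NAIS algorithm itself):

```python
import numpy as np
from scipy.special import gammaln

def gauss_hermite_expectation(f, mean, sd, degree=20):
    # E[f(a)] for a ~ N(mean, sd^2) via Gauss-Hermite nodes for exp(-x^2).
    x, w = np.polynomial.hermite.hermgauss(degree)
    return (w @ f(mean + np.sqrt(2.0) * sd * x)) / np.sqrt(np.pi)

# Example: E[p(y_t | alpha_t)] for a Poisson count y_t = 3 with x_t'beta = 1
# and alpha_t ~ N(0, 0.5^2) under the importance density.
val = gauss_hermite_expectation(
    lambda a: np.exp(3.0 * (1.0 + a) - np.exp(1.0 + a) - gammaln(4.0)),
    mean=0.0, sd=0.5)
```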
6.2.6 Composite Likelihood
As already noted, the lack of a tractably computable form for the likelihood is one of the drawbacks in using SSMs for count time series. Estimation procedures described in previous sections essentially resort to simulation-based procedures for computing the likelihood and then maximizing it. With any simulation procedure, one cannot be certain that the simulated likelihood provides a good facsimile of the actual likelihood, or even of the likelihood in the proximity of the maximum likelihood estimator. An alternative estimation procedure, which has recently grown in popularity, is based on the composite likelihood. The idea is that perhaps one does not need to compute the entire likelihood but only the likelihoods of more manageable subcollections of the data, which are then combined. For example, the likelihood given in (6.5) requires an n-fold integration, which is generally impractical even for moderate sample sizes. However, a similar integral based on only two or three dimensions can be computed numerically rather accurately. The objective, then, is to replace a large-dimensional integral with many small and manageable integrations.

The special issue of Statistica Sinica (Volume 21, 2011) provides an excellent survey of the use of composite likelihood methods in statistics. In the development below, we will follow the treatment of Davis and Yau (2011), with emphasis on their Example 3.4. Ng et al. (2011), also in the special issue, consider composite likelihood for related time series models.

While we will focus this treatment on the pairwise likelihood for count time series, extensions to combining weighted likelihoods based on more general subsets of the data are relatively straightforward. The downside in our application is that it may not be numerically practical to compute density functions of more than just a few observations.
Suppose Y_1, ..., Y_n are observations from a time series for which we denote by p(y_{i_1}, y_{i_2}, ..., y_{i_k}; θ) the likelihood for a parameter θ based on the k-tuple of distinct observations y_{i_1}, y_{i_2}, ..., y_{i_k}. For ease of notation, we have suppressed the dependence of p(·; θ) on k and on the vector i_1, i_2, ..., i_k. For k fixed, the composite likelihood based on consecutive k-tuples is given by

$$CPL_k(\theta; Y^{(n)}) = \sum_{t=1}^{n-k+1} \log p(Y_t, Y_{t+1}, \ldots, Y_{t+k-1}; \theta). \tag{6.29}$$
Normally, one does not take k very large due to the difficulty of computing the k-dimensional joint densities. The composite likelihood estimator θ̂ is found by maximizing CPL_k. Under identifiability, regularity, and suitable mixing conditions, θ̂ is consistent and asymptotically normal. This was established for linear time series models in
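As a concrete illustration of (6.29) with k = 2, the following sketch evaluates the pairwise log-likelihood for the Poisson model with a stationary AR(1) latent process, computing each bivariate marginal by tensor-product Gauss–Hermite quadrature. The constant mean specification x_tᵀβ = β and all names are our illustrative choices:

```python
import numpy as np
from scipy.special import gammaln

def pairwise_loglik(y, beta, phi, sigma2, degree=20):
    # CPL_2 of (6.29): each term is a 2-D integral over (alpha_t, alpha_{t+1}),
    # evaluated by tensor-product Gauss-Hermite quadrature.
    x, w = np.polynomial.hermite.hermgauss(degree)
    X1, X2 = np.meshgrid(x, x, indexing="ij")
    W = np.outer(w, w) / np.pi                    # weights for a standard bivariate normal
    g0 = sigma2 / (1.0 - phi ** 2)                # stationary variance of alpha_t
    C = np.linalg.cholesky(np.array([[g0, phi * g0], [phi * g0, g0]]))
    A1 = np.sqrt(2.0) * C[0, 0] * X1              # nodes for (alpha_t, alpha_{t+1})
    A2 = np.sqrt(2.0) * (C[1, 0] * X1 + C[1, 1] * X2)
    total = 0.0
    for t in range(len(y) - 1):                   # consecutive pairs
        lp = (y[t] * (beta + A1) - np.exp(beta + A1) - gammaln(y[t] + 1.0)
              + y[t + 1] * (beta + A2) - np.exp(beta + A2) - gammaln(y[t + 1] + 1.0))
        total += np.log(np.sum(W * np.exp(lp)))   # log p(y_t, y_{t+1}; theta)
    return total
```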