272 Handbook of Discrete-Valued Time Series
Thus, $L_T$ can be computed by using the following simple recursion:

$$\alpha_1 = \delta P(x_1); \qquad \alpha_t = \alpha_{t-1}\,\Gamma P(x_t) \ \text{ for } t = 2, 3, \ldots, T; \qquad L_T = \alpha_T \mathbf{1}'.$$
The number of operations involved in this computation is of order $Tm^2$, which is a great
improvement on the $O(Tm^T)$ operations required to evaluate $L_T$ using (12.3). There is still
the minor complication that the recursion given earlier often results in numerical underflow,
but that can be remedied as follows. For each $t$, scale up the vector $\alpha_t$ so that its $m$
components add to 1, keep track of the sum of the logs of all the scale factors thus applied,
and adjust the resulting value of the log-likelihood by this sum.
The recursive algorithm given earlier is called the forward algorithm and the $m$ components
of the vector $\alpha_t$ are called the forward probabilities, the $j$th component being the
probability $\Pr(X^{(t)} = x^{(t)},\, C_t = j)$. The algorithm, which requires a single pass through
the data, is fast enough to enable us to apply direct numerical maximization to estimate
the parameters of the model.
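As an illustration, here is a minimal Python sketch of the scaled forward recursion, alongside the $O(Tm^T)$ brute-force enumeration of state sequences that it replaces. The two-state transition matrix, Poisson means, and data are invented for the example, not taken from the text.

```python
import numpy as np
from itertools import product
from math import factorial

def pois_pmf(x, lam):
    """Poisson pmf; vectorized over an array of means lam."""
    return np.exp(-lam) * lam**x / factorial(x)

# Illustrative two-state model (assumed values, not from the text).
gamma = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
lam   = np.array([1.0, 5.0])
delta = np.array([2/3, 1/3])   # solves delta @ gamma == delta

def forward_loglik(x, gamma, lam, delta):
    """Scaled forward recursion: alpha_1 = delta P(x_1),
    alpha_t = alpha_{t-1} Gamma P(x_t), L_T = alpha_T 1'.
    Each alpha_t is rescaled to sum to 1; the accumulated log scale
    factors give log L_T without underflow."""
    phi = delta * pois_pmf(x[0], lam)   # row vector times diag P(x) is elementwise
    llk = np.log(phi.sum())
    phi = phi / phi.sum()
    for xt in x[1:]:
        phi = (phi @ gamma) * pois_pmf(xt, lam)
        llk += np.log(phi.sum())
        phi = phi / phi.sum()
    return llk

def brute_force_lik(x, gamma, lam, delta):
    """Direct sum over all m^T state sequences, the O(T m^T) evaluation
    the forward algorithm avoids; feasible only for tiny T."""
    m = len(lam)
    total = 0.0
    for seq in product(range(m), repeat=len(x)):
        p = delta[seq[0]] * pois_pmf(x[0], lam[seq[0]])
        for t in range(1, len(x)):
            p *= gamma[seq[t-1], seq[t]] * pois_pmf(x[t], lam[seq[t]])
        total += p
    return total
```

For a short series the two evaluations agree to machine precision, while only the forward recursion remains feasible as $T$ grows.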
12.3.2 Marginal and Joint Distributions
The matrix expression for the likelihood provides a simple means of computing various
marginal and other distributions of interest. For instance, for a stationary Poisson–HMM
(i.e., if $\delta\Gamma = \delta$) it follows that

1. $\Pr(X_t = x) = \delta P(x)\,\mathbf{1}'$;

2. $\Pr(X_t = v,\ X_{t+k} = w) = \delta P(v)\,\Gamma^k P(w)\,\mathbf{1}'$;

3. if $X^{(-t)}$ denotes the observations at all times $\in \{1, 2, \ldots, T\}$ other than $t$, which lies
strictly between 1 and $T$,

$$\Pr(X^{(-t)} = x^{(-t)}) = \delta P(x_1)\,\Gamma P(x_2) \cdots \Gamma P(x_{t-1})\,\Gamma^2\,P(x_{t+1}) \cdots \Gamma P(x_T)\,\mathbf{1}';$$
4. if $a$ and $b$ are nonnegative integers and $t$ lies strictly between 1 and $T$,

$$\Pr\!\left(X^{(-t)} = x^{(-t)},\ a \le X_t \le b\right) = \delta P(x_1)\left(\prod_{k=2}^{t-1}\Gamma P(x_k)\right)\Gamma\left(\sum_{x_t=a}^{b} P(x_t)\right)\left(\prod_{k=t+1}^{T}\Gamma P(x_k)\right)\mathbf{1}'.$$
Statements 3 and 4 generalize in straightforward fashion to cases in which there are several
missing observations, or several interval-censored observations, or both.
Statement 4 can be shown by summing the full likelihood

$$L_T = \delta P(x_1) \prod_{k=2}^{T} \Gamma P(x_k)\, \mathbf{1}'$$
273 Hidden Markov Models for Discrete-Valued Time Series
over the appropriate range of $x_t$-values. Statement 3 then follows from

$$\sum_{x_t=0}^{\infty} P(x_t) = I_m.$$
The probability $\Pr(X^{(-t)} = x^{(-t)})$ is an example of what Little (2009, p. 411) terms an
) is an example of what Little (2009, p. 411) terms an
“ignorable likelihood,” which can be used to estimate parameters when the missingness
mechanism is ignorable. Similarly, likelihoods of the form shown in statement 4 can be
used to estimate parameters when there is interval censoring but one can assume that the
censoring mechanism is ignorable.
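Statement 3 can likewise be implemented by substituting the identity matrix for $P(x_t)$ at the missing time points. A sketch with illustrative parameters (unscaled, hence only suitable for short series):

```python
import numpy as np
from math import factorial

def pois_pmf(x, lam):
    return np.exp(-lam) * lam**x / factorial(x)

def P(x, lam):
    return np.diag(pois_pmf(x, lam))

# Illustrative stationary two-state Poisson-HMM (assumed values).
gamma = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
lam   = np.array([1.0, 5.0])
delta = np.array([2/3, 1/3])

def likelihood(x, gamma, lam, delta):
    """Likelihood allowing missing observations: wherever x[t] is None,
    P(x_t) is replaced by the identity I_m, since sum_{x=0}^{inf} P(x) = I_m.
    Unscaled, so only for short series."""
    m = len(lam)
    v = delta @ (np.eye(m) if x[0] is None else P(x[0], lam))
    for xt in x[1:]:
        v = v @ gamma @ (np.eye(m) if xt is None else P(xt, lam))
    return v @ np.ones(m)
```

Summing the full likelihood over all values of the missing observation recovers the same number, which is exactly how statement 3 is derived.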
12.3.3 Moments
The moments and autocorrelation function (ACF) of $\{X_t\}$ are also available. Let the distribution $p_i$ have mean $\mu_i$ and variance $\sigma_i^2$ (both finite) and let $M$ be the diagonal matrix
having $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$ on the diagonal. Then if the Markov chain is stationary, we
have the following:

$E(X_t) = \delta\mu'$;

$\operatorname{Var}(X_t) = \sum_{i=1}^{m} \delta_i\,(\sigma_i^2 + \mu_i^2) - (\delta\mu')^2$;

for positive integers $k$, $E\big(f(X_t, X_{t+k})\big) = \sum_{i,j=1}^{m} \delta_i\,(\Gamma^k)_{ij}\, f(i, j)$;

for positive integers $k$, $\operatorname{Cov}(X_t, X_{t+k}) = \delta M \Gamma^k \mu' - (\delta\mu')^2$;

for positive integers $k$, the ACF of $\{X_t\}$ is given by

$$\rho_k = \operatorname{Corr}(X_t, X_{t+k}) = \frac{\delta M \Gamma^k \mu' - (\delta\mu')^2}{\operatorname{Var}(X_t)}.$$
If the eigenvalues of $\Gamma$ are distinct, or more generally if $\Gamma$ is diagonalizable, then the ACF
is a linear combination of the $k$th powers of the eigenvalues other than 1.
If the state-dependent distributions are all Poisson, it follows that
$$\operatorname{Var}(X_t) = E(X_t) + \sum_{i<j} \delta_i \delta_j (\mu_i - \mu_j)^2,$$

which displays the overdispersion of the marginal distribution of $X_t$ relative to the Poisson.
A Poisson–HMM can, therefore, represent both overdispersion and serial dependence in a
series of unbounded counts. The variance and covariance of a two-state Poisson–HMM are
$$\operatorname{Var}(X_t) = \delta_1\mu_1 + \delta_2\mu_2 + \delta_1\delta_2(\mu_1 - \mu_2)^2$$

and, for all positive integers $k$,

$$\operatorname{Cov}(X_t, X_{t+k}) = \delta_1\delta_2(\mu_1 - \mu_2)^2\,(1 - \gamma_{12} - \gamma_{21})^k.$$

The ACF is then of the form $A w^k$, where (apart from degenerate cases) $A \in (0, 1)$ and
$w \in (-1, 1)$.
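The two-state formulas can be checked numerically against the general matrix expressions. A sketch with invented parameter values (and with $w$ recognized as the second eigenvalue of $\Gamma$, consistent with the eigenvalue characterization of the ACF above):

```python
import numpy as np

# Illustrative two-state Poisson-HMM (assumed parameter values).
g12, g21 = 0.1, 0.2
gamma = np.array([[1 - g12, g12],
                  [g21, 1 - g21]])
mu    = np.array([1.0, 5.0])                 # state-dependent means
delta = np.array([g21, g12]) / (g12 + g21)   # stationary distribution
M     = np.diag(mu)

mean = delta @ mu                            # E(X_t) = delta mu'
# General form: sum_i delta_i (sigma_i^2 + mu_i^2) - (delta mu')^2,
# with sigma_i^2 = mu_i for the Poisson.
var  = delta @ (mu + mu**2) - mean**2

def cov(k):
    """Matrix form: Cov(X_t, X_{t+k}) = delta M Gamma^k mu' - (delta mu')^2."""
    return delta @ M @ np.linalg.matrix_power(gamma, k) @ mu - mean**2

# Two-state closed forms from the text:
var2 = delta[0]*mu[0] + delta[1]*mu[1] + delta[0]*delta[1]*(mu[0] - mu[1])**2
w    = 1 - g12 - g21                         # also the 2nd eigenvalue of gamma
A    = delta[0]*delta[1]*(mu[0] - mu[1])**2 / var   # ACF: rho_k = A * w**k
```

The matrix form `cov(k)` and the closed form $\delta_1\delta_2(\mu_1-\mu_2)^2 w^k$ agree for every $k$.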
12.4 Parameter Estimation by Maximum Likelihood
We describe two methods for computing maximum likelihood estimates of the parameters
of an HMM: numerical maximization of likelihood by a general-purpose optimizer and
the expectation-maximization (EM) algorithm. For Bayesian estimation, see Scott (2002),
Frühwirth-Schnatter (2006), Rydén (2008), or Zucchini and MacDonald (2009, Chapter 7).
12.4.1 Maximum Likelihood via Direct Numerical Maximization
Since likelihood evaluation is straightforward and requires only $O(Tm^2)$ operations, one
obvious route to maximum likelihood estimates is numerical maximization by means of
a general-purpose numerical optimizer. There are of course constraints on the parameters
which the optimization must respect. For an $m$-state Poisson–HMM there are nonnegativity
constraints, both on the $m$ state-dependent means $\lambda_i$ and on the transition probabilities $\gamma_{ij}$,
and $m$ row-sum constraints, $\sum_{j=1}^{m} \gamma_{ij} = 1$. Ignoring constraints can cause the optimizer
to fail.
One strategy is to use a constrained optimizer such as the NAG routine E04UCF or
constrOptim in R. However, it has been our experience that, in the case of HMMs, convergence is usually faster if one uses an unconstrained optimizer. The necessary constraints
can easily be imposed indirectly by reparametrizing the model in terms of unconstrained
parameters. As an illustration we give a mapping that transforms the constrained “natural parameters” $\lambda_i$ and $\gamma_{ij}$ of a (stationary) $m$-state Poisson–HMM to unconstrained
“working parameters” $\eta_i$ and $\tau_{ij}$. First, for $i = 1, 2, \ldots, m$, set $\eta_i = \ln(\lambda_i)$. Second, for
$i \ne j$ set

$$\tau_{ij} = \ln\!\left(\gamma_{ij} \Big/ \Big(1 - \sum_{k \ne i} \gamma_{ik}\Big)\right) = \ln(\gamma_{ij}/\gamma_{ii}).$$
The transformations in the opposite direction are $\lambda_i = \exp(\eta_i)$ and

$$\gamma_{ij} = \exp(\tau_{ij}) \Big/ \Big(1 + \sum_{k \ne i} \exp(\tau_{ik})\Big).$$
Given any values of the working parameters, we can find the corresponding natural
parameters and evaluate the (log-)likelihood. It is, therefore, possible to maximize the log-likelihood as a function of the working parameters by using an unconstrained optimizer;
the maximizing values of the natural parameters are then easily deduced. This process does
confine the estimates of the natural parameters to the interior of the parameter space. Put
differently, it will not result in transition probabilities or state-dependent means that are
exactly zero.
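A minimal sketch of this reparametrization in Python (the helper names `natural_to_working` and `working_to_natural` are my own, not from the text):

```python
import numpy as np

def natural_to_working(lam, gamma):
    """eta_i = ln(lambda_i); tau_ij = ln(gamma_ij / gamma_ii) for i != j
    (the diagonal entries tau_ii come out as 0 and are simply unused)."""
    eta = np.log(lam)
    tau = np.log(gamma / np.diag(gamma)[:, None])
    return eta, tau

def working_to_natural(eta, tau):
    """Inverse map: lambda_i = exp(eta_i) and, for i != j,
    gamma_ij = exp(tau_ij) / (1 + sum_{k != i} exp(tau_ik)),
    which automatically yields positive rows summing to 1."""
    e = np.exp(tau)
    np.fill_diagonal(e, 1.0)     # the '1 +' in the denominator
    gamma = e / e.sum(axis=1, keepdims=True)
    return np.exp(eta), gamma
```

An unconstrained optimizer can then work on `eta` and the off-diagonal entries of `tau` flattened into a single vector, converting back to natural parameters inside the likelihood evaluation.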
An advantage of direct numerical maximization is that the model being fitted can be
changed without this causing much change to the estimation procedure. The general matrix
expression (12.4) for the likelihood can be used for univariate or multivariate observations,
for models with or without covariates, for models in which not all the state-dependent
distributions are of the same type, etc.
12.4.2 Maximum Likelihood via the EM Algorithm
There is a strong historical link between HMMs and the EM algorithm, as the Baum–Welch
algorithm for finding maximum likelihood estimators in such a model is an important forerunner and special case of EM (Dempster et al., 1977; Welch, 2003). The EM algorithm is
therefore regarded by many as the “method of choice” for finding maximum likelihood
estimates in HMMs. Indeed, in a very thorough comparative review of the use of EM and
Markov chain Monte Carlo in HMMs (Rydén, 2008), there appears to be no mention of
direct numerical maximization of likelihood as an alternative to EM in finding MLEs.
We now describe the application of the EM algorithm to an HMM. We do not assume
here that the Markov chain is stationary, but we indicate in Section 12.4.3 the modification
that will be needed if that assumption is made. In order to apply the EM algorithm to an
HMM we need both the forward probabilities (defined in Section 12.3.1) and the “backward
probabilities”

$$\beta_t(j) = \Pr(X_{t+1} = x_{t+1}, \ldots, X_T = x_T \mid C_t = j).$$
The latter probabilities are found by the following backward recursion:

$$\beta_T = \mathbf{1}, \qquad \beta_t = \Gamma P(x_{t+1})\,\beta_{t+1}, \quad \text{for } t = T-1, T-2, \ldots, 1.$$
For the EM algorithm, we need the complete-data log-likelihood (CDLL), the logarithm of $\Pr(x^{(T)}, c^{(T)})$.
Defining

$$u_j(t) = 1 \ \text{if and only if}\ c_t = j \quad (t = 1, 2, \ldots, T)$$

and

$$v_{jk}(t) = 1 \ \text{if and only if}\ c_{t-1} = j \ \text{and}\ c_t = k \quad (t = 2, 3, \ldots, T),$$
we can show that the CDLL of the model is

$$\ln \Pr(x^{(T)}, c^{(T)}) = \sum_{j=1}^{m} u_j(1) \ln \delta_j + \sum_{j=1}^{m}\sum_{k=1}^{m}\left(\sum_{t=2}^{T} v_{jk}(t)\right)\ln \gamma_{jk} + \sum_{j=1}^{m}\sum_{t=1}^{T} u_j(t) \ln p_j(x_t) \tag{12.5}$$

$$= \text{term 1} + \text{term 2} + \text{term 3}.$$
The EM algorithm for HMMs is as follows.
E step: In the CDLL, replace the quantities $v_{jk}(t)$ and $u_j(t)$ by their conditional
expectations given the observations $x^{(T)}$ and given the current parameter estimates:

$$\hat{u}_j(t) = \Pr(C_t = j \mid x^{(T)}) = \alpha_t(j)\,\beta_t(j)/L_T; \tag{12.6}$$
and
$$\hat{v}_{jk}(t) = \Pr(C_{t-1} = j,\ C_t = k \mid x^{(T)}) = \alpha_{t-1}(j)\,\gamma_{jk}\,p_k(x_t)\,\beta_t(k)/L_T. \tag{12.7}$$
M step: Having replaced $v_{jk}(t)$ and $u_j(t)$ by $\hat{v}_{jk}(t)$ and $\hat{u}_j(t)$, maximize the CDLL,
expression (12.5), with respect to the three sets of parameters: the initial distribution $\delta$, the t.p.m. $\Gamma$, and the parameters of the state-dependent distributions
(e.g., $\lambda_1, \ldots, \lambda_m$ in the case of a simple Poisson–HMM).

Examination of (12.5) reveals that the M step splits here into three separate
maximizations.
1. Set $\delta_j = \hat{u}_j(1) \Big/ \sum_{j=1}^{m} \hat{u}_j(1) = \hat{u}_j(1)$.

2. Set $\gamma_{jk} = f_{jk} \Big/ \sum_{k=1}^{m} f_{jk}$, where $f_{jk} = \sum_{t=2}^{T} \hat{v}_{jk}(t)$.
3. The maximization of the third term may be easy or difficult, depending on the
nature of the state-dependent distributions assumed. It is essentially the standard
problem of maximum likelihood estimation for the distributions concerned. In the
case of Poisson and normal distributions, closed-form solutions are available. In
some other cases, for example, the gamma distributions and the negative binomial,
numerical maximization—or some modification of EM—will be necessary to carry
out this part of the M step.
One starts by giving initial estimates of the model parameters. The E and M steps are
then repeated until convergence is achieved. As in the case of likelihood evaluation via
the forward recursion, precautions have to be taken to avoid underflow.
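Putting (12.6), (12.7), and the M-step formulas together, one EM iteration for a Poisson–HMM might be sketched as follows. This is a sketch under stated assumptions: unscaled $\alpha$ and $\beta$ for brevity, no stationarity constraint on $\delta$ (matching this section), and the closed-form Poisson update $\lambda_j = \sum_t \hat{u}_j(t)x_t / \sum_t \hat{u}_j(t)$, the standard weighted-mean solution alluded to in item 3. All parameter values are illustrative.

```python
import numpy as np
from math import factorial

def pois_pmf(x, lam):
    return np.exp(-lam) * lam**x / factorial(x)

def em_step(x, gamma, lam, delta):
    """One EM iteration for a Poisson-HMM, not assuming stationarity.
    Unscaled alpha/beta for brevity; real code would rescale."""
    T, m = len(x), len(lam)
    p = np.array([pois_pmf(xt, lam) for xt in x])    # p[t, j] = p_j(x_t)
    alpha = np.zeros((T, m))
    beta  = np.ones((T, m))
    alpha[0] = delta * p[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ gamma) * p[t]
    for t in range(T-2, -1, -1):
        beta[t] = gamma @ (p[t+1] * beta[t+1])
    L = alpha[-1].sum()
    # E step, (12.6) and (12.7):
    u = alpha * beta / L                             # u[t, j] = u_hat_j(t)
    f = np.zeros((m, m))                             # f_jk = sum_t v_hat_jk(t)
    for t in range(1, T):
        f += gamma * np.outer(alpha[t-1], p[t] * beta[t]) / L
    # M step:
    delta_new = u[0]                                 # delta_j = u_hat_j(1)
    gamma_new = f / f.sum(axis=1, keepdims=True)     # gamma_jk = f_jk / sum_k f_jk
    lam_new   = u.T @ np.array(x, float) / u.sum(axis=0)  # Poisson weighted mean
    return gamma_new, lam_new, delta_new, np.log(L)  # log L at the input parameters
```

Iterating `em_step` from any starting values produces a non-decreasing sequence of log-likelihoods, the usual diagnostic for a correct EM implementation.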
12.4.3 Remarks
It seems clear from Sections 12.4.1 and 12.4.2 that, for such an HMM, direct numerical maximization of the observed data likelihood is at least conceptually simpler than EM. To carry
out the former we need only a likelihood evaluator and a general-purpose optimizer such
as nlm in R. But in order to apply EM, we first compute the forward probabilities (which
are all that is needed to evaluate the likelihood) and then do considerably more, including computation of the backward probabilities and the quantities $\hat{u}_j(t)$, $\hat{v}_{jk}(t)$, and $f_{jk}$. In
both methods, it is possible to become trapped in a local maximum which is not the global
maximum, although EM seems less likely to fail in this way than direct numerical maximization; see Bulla and Berzel (2008, Table 2). However, in HMMs there are in our view
several advantages of numerical maximization over EM, as there seem to be in some other
contexts as well; see MacDonald (2014).
The assumption of stationarity often seems appropriate in time series applications.
However, for HMMs fitted by EM it is almost never assumed that the underlying (homogeneous) Markov chain is stationary, that is, that $\delta\Gamma = \delta$. (The MLE of $\delta$ then turns out to
be a unit vector: one element is 1 and the others 0.) One obvious reason for not assuming
stationarity is convenience. For stationary series, there is no explicit formula for the matrix
$\Gamma$ which maximizes term 1 + term 2 of the CDLL in the M step; see, for example, Bulla
and Berzel (2008, Section 2.3). It is not difficult to carry out the maximization numerically,
but that implies a numerical optimization within each M step, that is, a maximization loop