272 Handbook of Discrete-Valued Time Series
Thus, $L_T$ can be computed by using the following simple recursion:

$$\alpha_1 = \delta P(x_1); \qquad \alpha_t = \alpha_{t-1}\,\Gamma P(x_t) \ \text{ for } t = 2, 3, \ldots, T; \qquad L_T = \alpha_T \mathbf{1}'.$$
The number of operations involved in this computation is of order $Tm^2$, which is a great
improvement on the $O(Tm^T)$ operations required to evaluate $L_T$ using (12.3). There is still
the minor complication that the recursion given earlier often results in numerical underflow,
but that can be remedied as follows. For each $t$, scale up the vector $\alpha_t$ so that its $m$
components add to 1, keep track of the sum of the logs of all the scale factors thus applied,
and adjust the resulting value of the log-likelihood by this sum.
The recursive algorithm given earlier is called the forward algorithm and the $m$ components
of the vector $\alpha_t$ are called the forward probabilities, the $j$th component being the
probability $\Pr(X^{(t)} = x^{(t)},\, C_t = j)$. The algorithm, which requires a single pass through
the data, is fast enough to enable us to apply direct numerical maximization to estimate
the parameters of the model.
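As an illustration, here is a minimal Python sketch of the scaled forward recursion, alongside the $O(Tm^T)$ brute-force enumeration of state sequences that it replaces. The two-state transition matrix, Poisson means, and data are invented for the example, not taken from the text.

```python
import numpy as np
from itertools import product
from math import factorial

def pois_pmf(x, lam):
    """Poisson pmf; vectorized over an array of means lam."""
    return np.exp(-lam) * lam**x / factorial(x)

# Illustrative two-state model (assumed values, not from the text).
gamma = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
lam   = np.array([1.0, 5.0])
delta = np.array([2/3, 1/3])   # solves delta @ gamma == delta

def forward_loglik(x, gamma, lam, delta):
    """Scaled forward recursion: alpha_1 = delta P(x_1),
    alpha_t = alpha_{t-1} Gamma P(x_t), L_T = alpha_T 1'.
    Each alpha_t is rescaled to sum to 1; the accumulated log scale
    factors give log L_T without underflow."""
    phi = delta * pois_pmf(x[0], lam)   # row vector times diag P(x) is elementwise
    llk = np.log(phi.sum())
    phi = phi / phi.sum()
    for xt in x[1:]:
        phi = (phi @ gamma) * pois_pmf(xt, lam)
        llk += np.log(phi.sum())
        phi = phi / phi.sum()
    return llk

def brute_force_lik(x, gamma, lam, delta):
    """Direct sum over all m^T state sequences, the O(T m^T) evaluation
    the forward algorithm avoids; feasible only for tiny T."""
    m = len(lam)
    total = 0.0
    for seq in product(range(m), repeat=len(x)):
        p = delta[seq[0]] * pois_pmf(x[0], lam[seq[0]])
        for t in range(1, len(x)):
            p *= gamma[seq[t-1], seq[t]] * pois_pmf(x[t], lam[seq[t]])
        total += p
    return total
```

For a short series the two evaluations agree to machine precision, while only the forward recursion remains feasible as $T$ grows.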
12.3.2 Marginal and Joint Distributions
The matrix expression for the likelihood provides a simple means of computing various
marginal and other distributions of interest. For instance, for a stationary Poisson–HMM
(i.e., if $\delta\Gamma = \delta$) it follows that

1. $\Pr(X_t = x) = \delta P(x)\,\mathbf{1}'$;

2. $\Pr(X_t = v,\ X_{t+k} = w) = \delta P(v)\,\Gamma^k P(w)\,\mathbf{1}'$;

3. if $X^{(-t)}$ denotes the observations at all times $\in \{1, 2, \ldots, T\}$ other than $t$, which lies
strictly between 1 and $T$,

$$\Pr(X^{(-t)} = x^{(-t)}) = \delta P(x_1)\,\Gamma P(x_2) \cdots \Gamma P(x_{t-1})\,\Gamma^2\,P(x_{t+1}) \cdots \Gamma P(x_T)\,\mathbf{1}';$$
4. if $a$ and $b$ are nonnegative integers and $t$ lies strictly between 1 and $T$,

$$\Pr\!\left(X^{(-t)} = x^{(-t)},\ a \le X_t \le b\right) = \delta P(x_1)\left(\prod_{k=2}^{t-1}\Gamma P(x_k)\right)\Gamma\left(\sum_{x_t=a}^{b} P(x_t)\right)\left(\prod_{k=t+1}^{T}\Gamma P(x_k)\right)\mathbf{1}'.$$
Statements 3 and 4 generalize in straightforward fashion to cases in which there are several
missing observations, or several interval-censored observations, or both.
Statement 4 can be shown by summing the full likelihood

$$L_T = \delta P(x_1) \prod_{k=2}^{T} \Gamma P(x_k)\, \mathbf{1}'$$
273 Hidden Markov Models for Discrete-Valued Time Series
over the appropriate range of $x_t$-values. Statement 3 then follows from

$$\sum_{x_t=0}^{\infty} P(x_t) = I_m.$$
The probability $\Pr(X^{(-t)} = x^{(-t)})$ is an example of what Little (2009, p. 411) terms an
) is an example of what Little (2009, p. 411) terms an
“ignorable likelihood,” which can be used to estimate parameters when the missingness
mechanism is ignorable. Similarly, likelihoods of the form shown in statement 4 can be
used to estimate parameters when there is interval censoring but one can assume that the
censoring mechanism is ignorable.
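Statement 3 can likewise be implemented by substituting the identity matrix for $P(x_t)$ at the missing time points. A sketch with illustrative parameters (unscaled, hence only suitable for short series):

```python
import numpy as np
from math import factorial

def pois_pmf(x, lam):
    return np.exp(-lam) * lam**x / factorial(x)

def P(x, lam):
    return np.diag(pois_pmf(x, lam))

# Illustrative stationary two-state Poisson-HMM (assumed values).
gamma = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
lam   = np.array([1.0, 5.0])
delta = np.array([2/3, 1/3])

def likelihood(x, gamma, lam, delta):
    """Likelihood allowing missing observations: wherever x[t] is None,
    P(x_t) is replaced by the identity I_m, since sum_{x=0}^{inf} P(x) = I_m.
    Unscaled, so only for short series."""
    m = len(lam)
    v = delta @ (np.eye(m) if x[0] is None else P(x[0], lam))
    for xt in x[1:]:
        v = v @ gamma @ (np.eye(m) if xt is None else P(xt, lam))
    return v @ np.ones(m)
```

Summing the full likelihood over all values of the missing observation recovers the same number, which is exactly how statement 3 is derived.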
12.3.3 Moments
The moments and autocorrelation function (ACF) of $\{X_t\}$ are also available. Let the distribution $p_i$ have mean $\mu_i$ and variance $\sigma_i^2$ (both finite) and let $M$ be the diagonal matrix
having $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$ on the diagonal. Then if the Markov chain is stationary, we
have the following:

$E(X_t) = \delta\mu'$;

$\operatorname{Var}(X_t) = \sum_{i=1}^{m} \delta_i\,(\sigma_i^2 + \mu_i^2) - (\delta\mu')^2$;

for positive integers $k$, $E\big(f(X_t, X_{t+k})\big) = \sum_{i,j=1}^{m} \delta_i\,(\Gamma^k)_{ij}\, f(i, j)$;

for positive integers $k$, $\operatorname{Cov}(X_t, X_{t+k}) = \delta M \Gamma^k \mu' - (\delta\mu')^2$;

for positive integers $k$, the ACF of $\{X_t\}$ is given by

$$\rho_k = \operatorname{Corr}(X_t, X_{t+k}) = \frac{\delta M \Gamma^k \mu' - (\delta\mu')^2}{\operatorname{Var}(X_t)}.$$
If the eigenvalues of $\Gamma$ are distinct, or more generally if $\Gamma$ is diagonalizable, then the ACF
is a linear combination of the $k$th powers of the eigenvalues other than 1.
If the state-dependent distributions are all Poisson, it follows that
$$\operatorname{Var}(X_t) = E(X_t) + \sum_{i<j} \delta_i \delta_j (\mu_i - \mu_j)^2,$$

which displays the overdispersion of the marginal distribution of $X_t$ relative to the Poisson.
A Poisson–HMM can, therefore, represent both overdispersion and serial dependence in a
series of unbounded counts. The variance and covariance of a two-state Poisson–HMM are
$$\operatorname{Var}(X_t) = \delta_1\mu_1 + \delta_2\mu_2 + \delta_1\delta_2(\mu_1 - \mu_2)^2$$

and, for all positive integers $k$,

$$\operatorname{Cov}(X_t, X_{t+k}) = \delta_1\delta_2(\mu_1 - \mu_2)^2\,(1 - \gamma_{12} - \gamma_{21})^k.$$

The ACF is then of the form $A w^k$, where (apart from degenerate cases) $A \in (0, 1)$ and
$w \in (-1, 1)$.
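The two-state formulas can be checked numerically against the general matrix expressions. A sketch with invented parameter values (and with $w$ recognized as the second eigenvalue of $\Gamma$, consistent with the eigenvalue characterization of the ACF above):

```python
import numpy as np

# Illustrative two-state Poisson-HMM (assumed parameter values).
g12, g21 = 0.1, 0.2
gamma = np.array([[1 - g12, g12],
                  [g21, 1 - g21]])
mu    = np.array([1.0, 5.0])                 # state-dependent means
delta = np.array([g21, g12]) / (g12 + g21)   # stationary distribution
M     = np.diag(mu)

mean = delta @ mu                            # E(X_t) = delta mu'
# General form: sum_i delta_i (sigma_i^2 + mu_i^2) - (delta mu')^2,
# with sigma_i^2 = mu_i for the Poisson.
var  = delta @ (mu + mu**2) - mean**2

def cov(k):
    """Matrix form: Cov(X_t, X_{t+k}) = delta M Gamma^k mu' - (delta mu')^2."""
    return delta @ M @ np.linalg.matrix_power(gamma, k) @ mu - mean**2

# Two-state closed forms from the text:
var2 = delta[0]*mu[0] + delta[1]*mu[1] + delta[0]*delta[1]*(mu[0] - mu[1])**2
w    = 1 - g12 - g21                         # also the 2nd eigenvalue of gamma
A    = delta[0]*delta[1]*(mu[0] - mu[1])**2 / var   # ACF: rho_k = A * w**k
```

The matrix form `cov(k)` and the closed form $\delta_1\delta_2(\mu_1-\mu_2)^2 w^k$ agree for every $k$.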
12.4 Parameter Estimation by Maximum Likelihood
We describe two methods for computing maximum likelihood estimates of the parameters
of an HMM: numerical maximization of likelihood by a general-purpose optimizer and
the expectation-maximization (EM) algorithm. For Bayesian estimation, see Scott (2002),
Frühwirth-Schnatter (2006), Rydén (2008), or Zucchini and MacDonald (2009, Chapter 7).
12.4.1 Maximum Likelihood via Direct Numerical Maximization
Since likelihood evaluation is straightforward and requires only $O(Tm^2)$ operations, one
obvious route to maximum likelihood estimates is numerical maximization by means of
a general-purpose numerical optimizer. There are of course constraints on the parameters
which the optimization must respect. For an $m$-state Poisson–HMM there are nonnegativity
constraints, both on the $m$ state-dependent means $\lambda_i$ and on the transition probabilities $\gamma_{ij}$,
and $m$ row-sum constraints, $\sum_{j=1}^{m} \gamma_{ij} = 1$. Ignoring constraints can cause the optimizer
to fail.
One strategy is to use a constrained optimizer such as the NAG routine E04UCF or
constrOptim in R. However, it has been our experience that, in the case of HMMs, convergence is usually faster if one uses an unconstrained optimizer. The necessary constraints
can easily be imposed indirectly by reparametrizing the model in terms of unconstrained
parameters. As an illustration we give a mapping that transforms the constrained “natural parameters” $\lambda_i$ and $\gamma_{ij}$ of a (stationary) $m$-state Poisson–HMM to unconstrained
“working parameters” $\eta_i$ and $\tau_{ij}$. First, for $i = 1, 2, \ldots, m$, set $\eta_i = \ln(\lambda_i)$. Second, for
$i \ne j$ set

$$\tau_{ij} = \ln\!\left(\gamma_{ij} \Big/ \Big(1 - \sum_{k \ne i} \gamma_{ik}\Big)\right) = \ln(\gamma_{ij}/\gamma_{ii}).$$
The transformations in the opposite direction are $\lambda_i = \exp(\eta_i)$ and

$$\gamma_{ij} = \exp(\tau_{ij}) \Big/ \Big(1 + \sum_{k \ne i} \exp(\tau_{ik})\Big).$$
Given any values of the working parameters, we can find the corresponding natural
parameters and evaluate the (log-)likelihood. It is, therefore, possible to maximize the log-likelihood as a function of the working parameters by using an unconstrained optimizer;
the maximizing values of the natural parameters are then easily deduced. This process does
confine the estimates of the natural parameters to the interior of the parameter space. Put
differently, it will not result in transition probabilities or state-dependent means that are
exactly zero.
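A minimal sketch of this reparametrization in Python (the helper names `natural_to_working` and `working_to_natural` are my own, not from the text):

```python
import numpy as np

def natural_to_working(lam, gamma):
    """eta_i = ln(lambda_i); tau_ij = ln(gamma_ij / gamma_ii) for i != j
    (the diagonal entries tau_ii come out as 0 and are simply unused)."""
    eta = np.log(lam)
    tau = np.log(gamma / np.diag(gamma)[:, None])
    return eta, tau

def working_to_natural(eta, tau):
    """Inverse map: lambda_i = exp(eta_i) and, for i != j,
    gamma_ij = exp(tau_ij) / (1 + sum_{k != i} exp(tau_ik)),
    which automatically yields positive rows summing to 1."""
    e = np.exp(tau)
    np.fill_diagonal(e, 1.0)     # the '1 +' in the denominator
    gamma = e / e.sum(axis=1, keepdims=True)
    return np.exp(eta), gamma
```

An unconstrained optimizer can then work on `eta` and the off-diagonal entries of `tau` flattened into a single vector, converting back to natural parameters inside the likelihood evaluation.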
An advantage of direct numerical maximization is that the model being fitted can be
changed without this causing much change to the estimation procedure. The general matrix
expression (12.4) for the likelihood can be used for univariate or multivariate observations,
for models with or without covariates, for models in which not all the state-dependent
distributions are of the same type, etc.
12.4.2 Maximum Likelihood via the EM Algorithm
There is a strong historical link between HMMs and the EM algorithm, as the Baum–Welch
algorithm for finding maximum likelihood estimators in such a model is an important forerunner and special case of EM (Dempster et al., 1977; Welch, 2003). The EM algorithm is
therefore regarded by many as the “method of choice” for finding maximum likelihood
estimates in HMMs. Indeed, in a very thorough comparative review of the use of EM and
Markov chain Monte Carlo in HMMs (Rydén, 2008), there appears to be no mention of
direct numerical maximization of likelihood as an alternative to EM in finding MLEs.
We now describe the application of the EM algorithm to an HMM. We do not assume
here that the Markov chain is stationary, but we indicate in Section 12.4.3 the modification
that will be needed if that assumption is made. In order to apply the EM algorithm to an
HMM we need both the forward probabilities (defined in Section 12.3.1) and the “backward
probabilities”

$$\beta_t(j) = \Pr(X_{t+1} = x_{t+1}, \ldots, X_T = x_T \mid C_t = j).$$
The latter probabilities are found by the following backward recursion:

$$\beta_T = \mathbf{1}, \qquad \beta_t = \Gamma P(x_{t+1})\,\beta_{t+1}, \quad \text{for } t = T-1, T-2, \ldots, 1.$$
For the EM algorithm, we need the complete-data log-likelihood (CDLL), the logarithm of $\Pr(x^{(T)}, c^{(T)})$.
Defining

$$u_j(t) = 1 \ \text{if and only if}\ c_t = j \quad (t = 1, 2, \ldots, T)$$

and

$$v_{jk}(t) = 1 \ \text{if and only if}\ c_{t-1} = j \ \text{and}\ c_t = k \quad (t = 2, 3, \ldots, T),$$
we can show that the CDLL of the model is

$$\ln \Pr(x^{(T)}, c^{(T)}) = \sum_{j=1}^{m} u_j(1) \ln \delta_j + \sum_{j=1}^{m}\sum_{k=1}^{m}\left(\sum_{t=2}^{T} v_{jk}(t)\right)\ln \gamma_{jk} + \sum_{j=1}^{m}\sum_{t=1}^{T} u_j(t) \ln p_j(x_t) \tag{12.5}$$

$$= \text{term 1} + \text{term 2} + \text{term 3}.$$
The EM algorithm for HMMs is as follows.
E step: In the CDLL, replace the quantities $v_{jk}(t)$ and $u_j(t)$ by their conditional
expectations given the observations $x^{(T)}$ and given the current parameter estimates:

$$\hat{u}_j(t) = \Pr(C_t = j \mid x^{(T)}) = \alpha_t(j)\,\beta_t(j)/L_T; \tag{12.6}$$
and
$$\hat{v}_{jk}(t) = \Pr(C_{t-1} = j,\ C_t = k \mid x^{(T)}) = \alpha_{t-1}(j)\,\gamma_{jk}\,p_k(x_t)\,\beta_t(k)/L_T. \tag{12.7}$$
M step: Having replaced $v_{jk}(t)$ and $u_j(t)$ by $\hat{v}_{jk}(t)$ and $\hat{u}_j(t)$, maximize the CDLL,
expression (12.5), with respect to the three sets of parameters: the initial distribution $\delta$, the t.p.m. $\Gamma$, and the parameters of the state-dependent distributions
(e.g., $\lambda_1, \ldots, \lambda_m$ in the case of a simple Poisson–HMM).

Examination of (12.5) reveals that the M step splits here into three separate
maximizations.
1. Set $\delta_j = \hat{u}_j(1) \Big/ \sum_{j=1}^{m} \hat{u}_j(1) = \hat{u}_j(1)$.

2. Set $\gamma_{jk} = f_{jk} \Big/ \sum_{k=1}^{m} f_{jk}$, where $f_{jk} = \sum_{t=2}^{T} \hat{v}_{jk}(t)$.
3. The maximization of the third term may be easy or difficult, depending on the
nature of the state-dependent distributions assumed. It is essentially the standard
problem of maximum likelihood estimation for the distributions concerned. In the
case of Poisson and normal distributions, closed-form solutions are available. In
some other cases, for example, the gamma distributions and the negative binomial,
numerical maximization—or some modification of EM—will be necessary to carry
out this part of the M step.
One starts by giving initial estimates of the model parameters. The E and M steps are
then repeated until convergence is achieved. As in the case of likelihood evaluation via
the forward recursion, precautions have to be taken to avoid underflow.
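Putting (12.6), (12.7), and the M-step formulas together, one EM iteration for a Poisson–HMM might be sketched as follows. This is a sketch under stated assumptions: unscaled $\alpha$ and $\beta$ for brevity, no stationarity constraint on $\delta$ (matching this section), and the closed-form Poisson update $\lambda_j = \sum_t \hat{u}_j(t)x_t / \sum_t \hat{u}_j(t)$, the standard weighted-mean solution alluded to in item 3. All parameter values are illustrative.

```python
import numpy as np
from math import factorial

def pois_pmf(x, lam):
    return np.exp(-lam) * lam**x / factorial(x)

def em_step(x, gamma, lam, delta):
    """One EM iteration for a Poisson-HMM, not assuming stationarity.
    Unscaled alpha/beta for brevity; real code would rescale."""
    T, m = len(x), len(lam)
    p = np.array([pois_pmf(xt, lam) for xt in x])    # p[t, j] = p_j(x_t)
    alpha = np.zeros((T, m))
    beta  = np.ones((T, m))
    alpha[0] = delta * p[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ gamma) * p[t]
    for t in range(T-2, -1, -1):
        beta[t] = gamma @ (p[t+1] * beta[t+1])
    L = alpha[-1].sum()
    # E step, (12.6) and (12.7):
    u = alpha * beta / L                             # u[t, j] = u_hat_j(t)
    f = np.zeros((m, m))                             # f_jk = sum_t v_hat_jk(t)
    for t in range(1, T):
        f += gamma * np.outer(alpha[t-1], p[t] * beta[t]) / L
    # M step:
    delta_new = u[0]                                 # delta_j = u_hat_j(1)
    gamma_new = f / f.sum(axis=1, keepdims=True)     # gamma_jk = f_jk / sum_k f_jk
    lam_new   = u.T @ np.array(x, float) / u.sum(axis=0)  # Poisson weighted mean
    return gamma_new, lam_new, delta_new, np.log(L)  # log L at the input parameters
```

Iterating `em_step` from any starting values produces a non-decreasing sequence of log-likelihoods, the usual diagnostic for a correct EM implementation.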
12.4.3 Remarks
It seems clear from Sections 12.4.1 and 12.4.2 that, for such an HMM, direct numerical maximization of the observed data likelihood is at least conceptually simpler than EM. To carry
out the former we need only a likelihood evaluator and a general-purpose optimizer such
as nlm in R. But in order to apply EM, we first compute the forward probabilities (which
are all that is needed to evaluate the likelihood) and then do considerably more, including computation of the backward probabilities and the quantities $\hat{u}_j(t)$, $\hat{v}_{jk}(t)$, and $f_{jk}$. In
both methods, it is possible to become trapped in a local maximum which is not the global
maximum, although EM seems less likely to fail in this way than direct numerical maximization; see Bulla and Berzel (2008, Table 2). However, in HMMs there are in our view
several advantages of numerical maximization over EM, as there seem to be in some other
contexts as well; see MacDonald (2014).
The assumption of stationarity often seems appropriate in time series applications.
However, for HMMs fitted by EM it is almost never assumed that the underlying (homogeneous) Markov chain is stationary, that is, that $\delta\Gamma = \delta$. (The MLE of $\delta$ then turns out to
be a unit vector: one element is 1 and the others 0.) One obvious reason for not assuming
stationarity is convenience. For stationary series, there is no explicit formula for the matrix
$\Gamma$ which maximizes term 1 + term 2 of the CDLL in the M step; see, for example, Bulla
and Berzel (2008, Section 2.3). It is not difficult to carry out the maximization numerically,
but that implies a numerical optimization within each M step, that is, a maximization loop