nested within the EM loop. There is no such complication if direct numerical maximiza-
tion of the observed data likelihood is used; only one numerical optimization is needed,
irrespective of whether stationarity is assumed.
Second, in models of relatively complex structure, for example, those of Zucchini et al.
(2008), it is a clear advantage of direct numerical maximization that we do not have to
derive and code the E and M steps, only a likelihood evaluator. This advantage is par-
ticularly marked if details of the model are repeatedly modified in a search for suitable
structure.
12.5 Forecasting and Decoding
The matrix product expression (12.4) for the likelihood makes it possible to compute various conditional distributions associated with the observations $X_t$ and latent variables $C_t$. We list several of these distributions here and discuss their uses in forecasting and “decoding.” We do not assume in this section that the Markov chain is stationary.
12.5.1 Forecast Distributions
First note that the forecast distribution (for $h$ periods ahead) is a ratio of likelihoods:
$$\Pr(X_{T+h} = x \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \frac{\alpha_T \Gamma^h \mathbf{P}(x) \mathbf{1}'}{L_T}.$$
Then write $\phi_T = \alpha_T / \alpha_T \mathbf{1}' = \alpha_T / L_T$, getting
$$\Pr(X_{T+h} = x \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \phi_T \Gamma^h \mathbf{P}(x) \mathbf{1}', \qquad (12.8)$$
which is a mixture of the $m$ state-dependent distributions $p_i$. As $h$ increases, $\Gamma^h$ approaches $\mathbf{1}'\delta$, and the probability (12.8) therefore tends to $\delta \mathbf{P}(x) \mathbf{1}'$. The speed of this convergence is determined by the second largest eigenvalue modulus of $\Gamma$.
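To make (12.8) concrete, the following minimal Python sketch evaluates the forecast distribution of a Poisson–HMM via a scaled forward pass. The transition probability matrix Gamma, initial distribution delta, state-dependent means lam, and the short series x are illustrative assumptions, not quantities estimated in this chapter.

```python
# Minimal sketch of the forecast distribution (12.8) for a Poisson-HMM.
import numpy as np
from scipy.stats import poisson

def filtered_state_dist(x, Gamma, delta, lam):
    """Scaled forward pass; returns phi_T = alpha_T / L_T."""
    phi = delta * poisson.pmf(x[0], lam)        # alpha_1 = delta P(x_1)
    phi /= phi.sum()
    for obs in x[1:]:
        phi = (phi @ Gamma) * poisson.pmf(obs, lam)
        phi /= phi.sum()                         # rescale to avoid underflow
    return phi

def forecast_pmf(x, Gamma, delta, lam, h, support):
    """Pr(X_{T+h} = v | X^(T) = x^(T)) for each v in support, per (12.8)."""
    phi_T = filtered_state_dist(x, Gamma, delta, lam)
    weights = phi_T @ np.linalg.matrix_power(Gamma, h)   # phi_T Gamma^h
    return np.array([weights @ poisson.pmf(v, lam) for v in support])

# Illustrative two-state example (all parameter values are assumptions)
Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])   # t.p.m.
delta = np.array([0.5, 0.5])                 # initial distribution
lam   = np.array([3.0, 10.0])                # state-dependent Poisson means
x     = [2, 4, 3, 12, 9, 3]
print(forecast_pmf(x, Gamma, delta, lam, h=1, support=range(21)))
```

As $h$ grows, the weight vector $\phi_T \Gamma^h$ approaches $\delta$, reproducing the limiting behaviour noted above.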
12.5.2 State Probabilities, State Prediction, and Decoding
It can be shown that, for $t = 1, 2, \ldots, T$,
$$\alpha_t(i)\,\beta_t(i) = \Pr(\mathbf{X}^{(T)} = \mathbf{x}^{(T)},\, C_t = i),$$
from which it follows that
$$\Pr(C_t = i \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \frac{\alpha_t(i)\,\beta_t(i)}{L_T}. \qquad (12.9)$$
This gives the conditional distribution (under the model) of the state at time t, given all the
observations.
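The probabilities (12.9) can be computed with one forward and one backward pass. The sketch below does this for a Poisson–HMM, with the recursions scaled so that the product $\alpha_t(i)\beta_t(i)/L_T$ is formed without underflow; the parameter values are again illustrative assumptions, not fitted values.

```python
# Sketch of the smoothed state probabilities (12.9) for a Poisson-HMM.
import numpy as np
from scipy.stats import poisson

def state_probs(x, Gamma, delta, lam):
    x = np.asarray(x)
    T, m = len(x), len(delta)
    P = poisson.pmf(x[:, None], lam)          # P[t, i] = p_i(x_t)
    # scaled forward pass; c[t] holds the per-step scale factors
    alpha = np.zeros((T, m))
    c = np.zeros(T)
    a = delta * P[0]
    c[0] = a.sum()
    alpha[0] = a / c[0]
    for t in range(1, T):
        a = (alpha[t-1] @ Gamma) * P[t]
        c[t] = a.sum()
        alpha[t] = a / c[t]
    # scaled backward pass, using the same scale factors
    beta = np.ones((T, m))
    for t in range(T - 2, -1, -1):
        beta[t] = Gamma @ (P[t+1] * beta[t+1]) / c[t+1]
    probs = alpha * beta                       # row t is Pr(C_t = . | X^(T))
    return probs / probs.sum(axis=1, keepdims=True)   # numerical safeguard

Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])
lam   = np.array([3.0, 10.0])
print(state_probs([2, 4, 3, 12, 9, 3], Gamma, delta, lam))
```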
Similarly, it can be shown that, for $h \in \mathbb{N}$,
$$\Pr(C_{T+h} = i \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \alpha_T \Gamma^h \mathbf{e}_i / L_T = \phi_T \Gamma^h \mathbf{e}_i, \qquad (12.10)$$
where $\mathbf{e}_i = (0, \ldots, 0, 1, 0, \ldots, 0)'$ has a one in the $i$th position only. This enables us to forecast the (latent) state $h$ periods ahead. If $h$ tends to infinity, the probability (12.10) tends to $\delta_i$:
$$\lim_{h \to \infty} \phi_T \Gamma^h \mathbf{e}_i = \phi_T \mathbf{1}' \delta \mathbf{e}_i = 1 \times \delta \mathbf{e}_i = \delta_i.$$
The speed of convergence is once again determined by the second largest eigenvalue modulus of $\Gamma$.
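This convergence is easy to check numerically. In the short sketch below, the t.p.m. and the filtered state distribution $\phi_T$ are assumed values chosen purely for illustration.

```python
# State prediction (12.10): the h-step-ahead state distribution is
# phi_T Gamma^h, whose i-th entry is phi_T Gamma^h e_i.
import numpy as np

Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])   # assumed t.p.m.
phi_T = np.array([0.3, 0.7])                 # assumed filtered distribution at T
for h in (1, 5, 50):
    print(h, phi_T @ np.linalg.matrix_power(Gamma, h))
# The printed rows converge to delta = (2/3, 1/3), at a rate governed by the
# second largest eigenvalue modulus of Gamma (here 0.7).
```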
By using Equation (12.9) we can find that state $i \in \{1, 2, \ldots, m\}$ which is at time $t$ the most likely, given all the observations. This is known as local decoding. If we do that for all times $t$, the resulting path may in some circumstances be useful, but it is not in general the most likely path followed by the Markov chain; that is, it is not in general that sequence of states $c_1, c_2, \ldots, c_T$ which maximizes the conditional probability
$$\Pr(\mathbf{C}^{(T)} = \mathbf{c}^{(T)} \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}).$$
That maximizing path (which need not be unique) can be found by an application of
dynamic programming known as the Viterbi algorithm (Viterbi, 1967, 2006), and the pro-
cess is known as global decoding. The results of local and global decoding are often very
similar but not identical.
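A compact sketch of global decoding by the Viterbi algorithm for a Poisson–HMM follows; it works with log-probabilities to avoid underflow, and all parameter values are illustrative assumptions. Local decoding would instead take, for each $t$, the argmax of the smoothed probabilities (12.9).

```python
# Sketch of the Viterbi algorithm (global decoding) for a Poisson-HMM.
import numpy as np
from scipy.stats import poisson

def viterbi(x, Gamma, delta, lam):
    x = np.asarray(x)
    T, m = len(x), len(delta)
    logP = poisson.logpmf(x[:, None], lam)    # log p_i(x_t)
    logG = np.log(Gamma)
    xi = np.zeros((T, m))
    back = np.zeros((T, m), dtype=int)
    xi[0] = np.log(delta) + logP[0]
    for t in range(1, T):
        cand = xi[t-1][:, None] + logG        # cand[j, i]: from state j to i
        back[t] = cand.argmax(axis=0)
        xi[t] = cand.max(axis=0) + logP[t]
    # backtrack one maximizing state sequence (it need not be unique)
    path = np.zeros(T, dtype=int)
    path[-1] = xi[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t+1, path[t+1]]
    return path + 1                            # states labelled 1..m

Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])
lam   = np.array([3.0, 10.0])
print(viterbi([2, 4, 3, 12, 9, 15, 11, 3], Gamma, delta, lam))
```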
12.6 Model Selection and Checking
A question that often arises when we seek to use an HMM in practice is: How many
states should there be? A common question of another type is: Should we use, for exam-
ple, Poisson state–dependent distributions or negative binomial? In the absence of useful
subject-matter information, model selection questions such as these are not easy to answer
completely satisfactorily, but there are some simple tools that can certainly help us to select
from a group of competing models. An introduction to model selection may be found in
Zucchini (2000). But once we have chosen the “best” model according to some criterion,
we still have to consider the question of whether the chosen model can be regarded as
adequate. In this section, we describe two model selection criteria, Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC), and the use of pseudo-residuals (also known as quantile residuals) to check the adequacy of the model.
12.6.1 AIC and BIC
Two widely used model selection criteria are the AIC and the BIC. The AIC selects that model which, of those under consideration, has the smallest value of the quantity
$$\mathrm{AIC} = -2 \ln L + 2p.$$
Here, $\ln L$ is the log-likelihood of the fitted model and $p$ denotes the number of parameters of the model. The first term measures the lack of fit, and the second term is a penalty which increases with the number of parameters. The criterion BIC differs from AIC in the penalty term only:
$$\mathrm{BIC} = -2 \ln L + p \ln T.$$
The penalty term of BIC is greater than that of AIC if $T > e^2$, that is, for $T \geq 8$. Thus, the BIC generally tends to select models with fewer parameters than does the AIC.
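In code the two criteria are one-liners. The check below uses, as an example, the two-state stationary model reported later in Table 12.2 ($-\ln L = 618.7$, $p = 4$, $T = 242$); small discrepancies from the tabulated values are due only to rounding of $\ln L$.

```python
# AIC and BIC as functions of the maximized log-likelihood, the number of
# parameters p, and the series length T.
import math

def aic(log_lik, p):
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, T):
    return -2.0 * log_lik + p * math.log(T)

# Two-state stationary Poisson-HMM of Section 12.7: -ln L = 618.7, p = 4, T = 242
print(aic(-618.7, 4))        # approx. 1245.4 (Table 12.2: 1245.3)
print(bic(-618.7, 4, 242))   # approx. 1259.4 (Table 12.2: 1259.3)
```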
12.6.2 Model Checking by Pseudo-Residuals
There are simple informal checks that can be made on the adequacy of an HMM. One can
(assuming stationarity) compare sample and model quantities such as the mean, variance,
ACF, and the distribution of the observations. If, for example, the data were to display
marked overdispersion but the model did not, we would discard or modify the model.
But there are additional systematic checks that can be performed. We describe here the
use of “pseudo-residuals.” For $t = 1, 2, \ldots, T$, define
$$u_t^- = \Pr(X_t < x_t \mid \mathbf{X}^{(-t)} = \mathbf{x}^{(-t)}), \qquad z_t^- = \Phi^{-1}(u_t^-)$$
and
$$u_t^+ = \Pr(X_t \leq x_t \mid \mathbf{X}^{(-t)} = \mathbf{x}^{(-t)}), \qquad z_t^+ = \Phi^{-1}(u_t^+),$$
with $\Phi$ denoting the standard normal distribution function. The interval $[u_t^-, u_t^+]$ (on a probability scale) or $[z_t^-, z_t^+]$ (on a “normal” scale) gives an indication of how extreme $x_t$ is relative to its conditional distribution given the other observations. The conditional distribution of one observation given the others is therefore needed, but it is a ratio of likelihoods and can be found in very much the same way as were the conditional distributions in Section 12.5.1.
There are several ways in which such pseudo-residual “segments” $[z_t^-, z_t^+]$ can be used. We give one here. If the observations were continuous, there would be a single quantity $z_t$ and not an interval, which would have a standard normal distribution if the model were correct, and so a quantile–quantile (QQ) plot could be used to assess the adequacy of the model. For discrete-valued series, we use the “mid-pseudo-residual”
$$z_t^{\mathrm{m}} = \Phi^{-1}\!\left(\frac{u_t^- + u_t^+}{2}\right)$$
to sort the pseudo-residuals in order to produce the QQ plot. Although we can claim no more than approximate normality for mid-pseudo-residuals, they are nevertheless useful for identifying poor fits. An example of their application is given in Section 12.7; see Figure 12.5.
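A sketch of the computation follows, for Poisson state-dependent distributions. One convenient way to organize the ratio of likelihoods is as a mixture: the conditional distribution of $X_t$ given the other observations has mixture weights proportional to $(\alpha_{t-1}\Gamma)_i\,\beta_t(i)$ (and $\delta_i\,\beta_1(i)$ for $t = 1$), which the scaled forward and backward passes deliver directly. All parameter values below are illustrative assumptions.

```python
# Sketch of pseudo-residuals (and mid-pseudo-residuals) for a Poisson-HMM.
import numpy as np
from scipy.stats import norm, poisson

def pseudo_residuals(x, Gamma, delta, lam):
    x = np.asarray(x)
    T, m = len(x), len(delta)
    P = poisson.pmf(x[:, None], lam)           # P[t, i] = p_i(x_t)
    # scaled forward pass
    alpha = np.zeros((T, m))
    c = np.zeros(T)
    a = delta * P[0]
    c[0] = a.sum()
    alpha[0] = a / c[0]
    for t in range(1, T):
        a = (alpha[t-1] @ Gamma) * P[t]
        c[t] = a.sum()
        alpha[t] = a / c[t]
    # scaled backward pass
    beta = np.ones((T, m))
    for t in range(T - 2, -1, -1):
        beta[t] = Gamma @ (P[t+1] * beta[t+1]) / c[t+1]
    # mixture weights of Pr(X_t = v | X^(-t)); scaling constants cancel
    w = np.vstack([delta * beta[0], (alpha[:-1] @ Gamma) * beta[1:]])
    w /= w.sum(axis=1, keepdims=True)
    u_minus = (w * poisson.cdf(x[:, None] - 1.0, lam)).sum(axis=1)
    u_plus  = (w * poisson.cdf(x[:, None], lam)).sum(axis=1)
    return norm.ppf(u_minus), norm.ppf(u_plus), norm.ppf((u_minus + u_plus) / 2)

# Illustrative example; the sorted mid-pseudo-residuals would be plotted
# against standard normal quantiles to produce a QQ plot like Figure 12.5.
Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])
lam   = np.array([3.0, 10.0])
_, _, z_mid = pseudo_residuals([2, 4, 3, 12, 9, 3], Gamma, delta, lam)
print(np.sort(z_mid))
```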
12.7 An Example: Weekly Sales of a Soap Product
We describe here the fitting of HMMs to a series of counts and demonstrate briefly the
use of the forecasting, decoding, model selection, and checking techniques introduced in
Sections 12.5 and 12.6. Consider the series of weekly sales (in integer units, 242 weeks) of a
TABLE 12.1
Weekly Sales of the Soap Product; to Be Read across Rows
1 6 9 18 14 8 8 1 6 7 3 3 1 3 4 12 8 10 8 2
17 15 7 12 22 10 4 7 5 0 2 5 3 4 4 7 5 6 1 3
4 5 3 7 3 0 4 5 3 3 4 4 4 4 4 3 5 5 5 7
4 0 4 3 2 6 3 8 9 6 3 4 3 3 3 3 2 1 4 5
5 2 7 5 2 3 1 3 4 6 8 8 5 7 2 4 2 7 4 15
15 12 21 20 13 9 8 0 13 9 8 0 6 2 0 3 2 4 4 6
3 2 5 5 3 2 1 1 3 1 2 6 2 7 3 2 4 1 5 6
8 14 5 3 6 5 11 4 5 9 9 7 9 8 3 4 8 6 3 5
6 3 1 7 4 9 2 6 6 4 6 6 13 7 4 8 6 4 4 4
9 2 9 2 2 2 13 13 4 5 1 4 6 5 4 2 3 10 6 15
5 9 9 7 4 4 2 4 2 3 8 15 0 0 3 4 3 4 7 5
7 6 0 6 4 14 5 1 6 5 5 4 9 4 14 2 2 1 5 2
6 4
particular soap product in a supermarket, as shown in Table 12.1 and Figure 12.3. The data were downloaded in 2007 from http://gsbwww.uchicago.edu/kilts/research/db/dominicks, the Dominick’s Finer Food database at the Kilts Center for Marketing, Graduate School of Business of the University of Chicago. (That database is now, as of April 28, 2015, at http://research.chicagobooth.edu/kilts/marketing-databases/dominicks; University of Chicago Booth School of Business, 2015.) (The product was “Zest White Water 15 oz.,” with code 3700031165 and store number 67.)
The data display considerable overdispersion relative to Poisson: the sample mean is
5.44, the sample variance 15.40. In addition, the sample ACF (Figure 12.4) shows strong
evidence of serial dependence. Any satisfactory model has to mimic these properties.
Stationary Poisson–HMMs with one to four states were fitted to the soap sales by direct numerical maximization of the log-likelihood; the number of parameters, minus log-likelihood, and AIC and BIC values are shown in Table 12.2. Also given are the results for
two- to four-state models in which stationarity is not assumed. A one-state model (a
sequence of independent Poisson random variables) shows up as completely inadequate,
FIGURE 12.3
Series of weekly sales of the soap product. (Time series plot: Count, 0–25, against Week, 0–250.)
FIGURE 12.4
Autocorrelation function of the weekly soap sales. (Sample ACF against Lag, 0–20.)
TABLE 12.2
Soap Sales Data: Comparison, by AIC and BIC, of Poisson–HMMs with 1–4 States

Type of Model    No. of States    No. of Parameters    −ln L    AIC       BIC
Stationary       1                1                    711.8    1425.5    1429.0
Stationary       2                4                    618.7    1245.3    1259.3
Stationary       3                9                    610.5    1239.0    1270.4
Stationary       4                16                   604.2    1240.4    1296.2
Nonstationary    2                5                    618.5    1246.9    1264.4
Nonstationary    3                11                   610.2    1242.4    1280.8
Nonstationary    4                19                   602.8    1243.5    1309.8

Note: Within each type of model, AIC is minimized by the three-state model and BIC by the two-state model.
but this is not surprising in view of the overdispersion and autocorrelation already noted.
For both the stationary and the nonstationary case, the AIC is minimized by a three-state
model and the BIC by a two-state model.
Figure 12.5 displays QQ plots of (mid-)quantile residuals for stationary models with
1–3 states. The plots indicate that the one-state model is clearly inferior to the two- and
three-state models and that there is little to choose between the other two.
In passing, it is interesting to note that, in the three- and four-state cases, it is not difficult to find, both by direct numerical maximization and by EM, a local maximum of the log-likelihood that is not a global maximum. This phenomenon underscores the importance of trying several sets of starting values for the iterations.
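To illustrate, here is a sketch of direct numerical maximization with several random starting values for a stationary Poisson–HMM. The working parameters are transformed (log means; row-wise softmax for the t.p.m.) so that the optimizer runs unconstrained, and the stationary distribution is recomputed from $\Gamma$ at each evaluation. The number of starts, the optimizer settings, and the short series are all illustrative choices, not the computations actually used in this chapter.

```python
# Sketch: fit a stationary Poisson-HMM by direct numerical maximization,
# repeated from several random starting values to guard against local maxima.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def unpack(theta, m):
    lam = np.exp(theta[:m])                       # state-dependent means > 0
    G = np.exp(theta[m:].reshape(m, m))
    Gamma = G / G.sum(axis=1, keepdims=True)      # rows sum to one
    return lam, Gamma

def neg_log_lik(theta, x, m):
    lam, Gamma = unpack(theta, m)
    if not np.all(np.isfinite(Gamma)):
        return np.inf
    # stationary distribution: solve delta (I - Gamma + U) = 1', U all ones
    delta = np.linalg.solve((np.eye(m) - Gamma + 1.0).T, np.ones(m))
    ll, w = 0.0, delta
    for obs in x:                                  # scaled forward recursion
        v = w * poisson.pmf(obs, lam)
        s = v.sum()
        ll += np.log(s + 1e-300)                   # guard against underflow
        w = (v / s) @ Gamma
    return -ll

rng = np.random.default_rng(1)
x = np.array([2, 4, 3, 12, 9, 15, 11, 3, 2, 5, 6, 14, 9, 2, 3, 4])  # illustrative
m, best = 2, None
for _ in range(5):                                 # several random starts
    theta0 = np.concatenate([np.log(rng.uniform(2.0, 15.0, m)),
                             rng.normal(0.0, 1.0, m * m)])
    res = minimize(neg_log_lik, theta0, args=(x, m), method="Nelder-Mead",
                   options={"maxiter": 5000, "fatol": 1e-8, "xatol": 1e-8})
    if best is None or res.fun < best.fun:
        best = res
print(-best.fun, *unpack(best.x, m), sep="\n")     # ln L, means, t.p.m.
```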
The best stationary three-state model found has state-dependent means given by $\lambda = (3.74, 8.44, 14.93)$, t.p.m.