nested within the EM loop. There is no such complication if direct numerical maximiza-
tion of the observed data likelihood is used; only one numerical optimization is needed,
irrespective of whether stationarity is assumed.
Second, in models of relatively complex structure, for example, those of Zucchini et al.
(2008), it is a clear advantage of direct numerical maximization that we do not have to
derive and code the E and M steps, only a likelihood evaluator. This advantage is par-
ticularly marked if details of the model are repeatedly modified in a search for suitable
structure.
12.5 Forecasting and Decoding
The matrix product expression (12.4) for the likelihood makes it possible to compute various conditional distributions associated with the observations $X_t$ and latent variables $C_t$. We list several of these distributions here and discuss their uses in forecasting and “decoding.” We do not assume in this section that the Markov chain is stationary.
12.5.1 Forecast Distributions
First note that the forecast distribution (for $h$ periods ahead) is a ratio of likelihoods:
$$\Pr(X_{T+h} = x \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \frac{\alpha_T \Gamma^h \mathbf{P}(x) \mathbf{1}'}{L_T}.$$
Then write $\phi_T = \alpha_T / \alpha_T \mathbf{1}' = \alpha_T / L_T$, getting
$$\Pr(X_{T+h} = x \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \phi_T \Gamma^h \mathbf{P}(x) \mathbf{1}', \qquad (12.8)$$
which is a mixture of the $m$ state-dependent distributions $p_i$. As $h$ increases, $\Gamma^h$ approaches $\mathbf{1}'\delta$, and the probability (12.8) therefore tends to $\delta \mathbf{P}(x) \mathbf{1}'$. The speed of this convergence is determined by the second largest eigenvalue modulus of $\Gamma$.
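To make (12.8) concrete, the following minimal Python sketch evaluates the forecast distribution of a Poisson–HMM via a scaled forward pass. The transition probability matrix Gamma, initial distribution delta, state-dependent means lam, and the short series x are illustrative assumptions, not quantities estimated in this chapter.

```python
# Minimal sketch of the forecast distribution (12.8) for a Poisson-HMM.
import numpy as np
from scipy.stats import poisson

def filtered_state_dist(x, Gamma, delta, lam):
    """Scaled forward pass; returns phi_T = alpha_T / L_T."""
    phi = delta * poisson.pmf(x[0], lam)        # alpha_1 = delta P(x_1)
    phi /= phi.sum()
    for obs in x[1:]:
        phi = (phi @ Gamma) * poisson.pmf(obs, lam)
        phi /= phi.sum()                         # rescale to avoid underflow
    return phi

def forecast_pmf(x, Gamma, delta, lam, h, support):
    """Pr(X_{T+h} = v | X^(T) = x^(T)) for each v in support, per (12.8)."""
    phi_T = filtered_state_dist(x, Gamma, delta, lam)
    weights = phi_T @ np.linalg.matrix_power(Gamma, h)   # phi_T Gamma^h
    return np.array([weights @ poisson.pmf(v, lam) for v in support])

# Illustrative two-state example (all parameter values are assumptions)
Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])   # t.p.m.
delta = np.array([0.5, 0.5])                 # initial distribution
lam   = np.array([3.0, 10.0])                # state-dependent Poisson means
x     = [2, 4, 3, 12, 9, 3]
print(forecast_pmf(x, Gamma, delta, lam, h=1, support=range(21)))
```

As $h$ grows, the weight vector $\phi_T \Gamma^h$ approaches $\delta$, reproducing the limiting behaviour noted above.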
12.5.2 State Probabilities, State Prediction, and Decoding
It can be shown that, for $t = 1, 2, \ldots, T$,
$$\alpha_t(i)\,\beta_t(i) = \Pr(\mathbf{X}^{(T)} = \mathbf{x}^{(T)},\, C_t = i),$$
from which it follows that
$$\Pr(C_t = i \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \frac{\alpha_t(i)\,\beta_t(i)}{L_T}. \qquad (12.9)$$
This gives the conditional distribution (under the model) of the state at time t, given all the
observations.
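The probabilities (12.9) can be computed with one forward and one backward pass. The sketch below does this for a Poisson–HMM, with the recursions scaled so that the product $\alpha_t(i)\beta_t(i)/L_T$ is formed without underflow; the parameter values are again illustrative assumptions, not fitted values.

```python
# Sketch of the smoothed state probabilities (12.9) for a Poisson-HMM.
import numpy as np
from scipy.stats import poisson

def state_probs(x, Gamma, delta, lam):
    x = np.asarray(x)
    T, m = len(x), len(delta)
    P = poisson.pmf(x[:, None], lam)          # P[t, i] = p_i(x_t)
    # scaled forward pass; c[t] holds the per-step scale factors
    alpha = np.zeros((T, m))
    c = np.zeros(T)
    a = delta * P[0]
    c[0] = a.sum()
    alpha[0] = a / c[0]
    for t in range(1, T):
        a = (alpha[t-1] @ Gamma) * P[t]
        c[t] = a.sum()
        alpha[t] = a / c[t]
    # scaled backward pass, using the same scale factors
    beta = np.ones((T, m))
    for t in range(T - 2, -1, -1):
        beta[t] = Gamma @ (P[t+1] * beta[t+1]) / c[t+1]
    probs = alpha * beta                       # row t is Pr(C_t = . | X^(T))
    return probs / probs.sum(axis=1, keepdims=True)   # numerical safeguard

Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])
lam   = np.array([3.0, 10.0])
print(state_probs([2, 4, 3, 12, 9, 3], Gamma, delta, lam))
```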
Similarly, it can be shown that, for $h \in \mathbb{N}$,
$$\Pr(C_{T+h} = i \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}) = \alpha_T \Gamma^h \mathbf{e}_i / L_T = \phi_T \Gamma^h \mathbf{e}_i, \qquad (12.10)$$
where $\mathbf{e}_i = (0, \ldots, 0, 1, 0, \ldots, 0)'$ has a one in the $i$th position only. This enables us to forecast the (latent) state $h$ periods ahead. If $h$ tends to infinity, the probability (12.10) tends to $\delta_i$:
$$\lim_{h \to \infty} \phi_T \Gamma^h \mathbf{e}_i = \phi_T \mathbf{1}' \delta \mathbf{e}_i = 1 \times \delta \mathbf{e}_i = \delta_i.$$
The speed of convergence is once again determined by the second largest eigenvalue modulus of $\Gamma$.
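This convergence is easy to check numerically. In the short sketch below, the t.p.m. and the filtered state distribution $\phi_T$ are assumed values chosen purely for illustration.

```python
# State prediction (12.10): the h-step-ahead state distribution is
# phi_T Gamma^h, whose i-th entry is phi_T Gamma^h e_i.
import numpy as np

Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])   # assumed t.p.m.
phi_T = np.array([0.3, 0.7])                 # assumed filtered distribution at T
for h in (1, 5, 50):
    print(h, phi_T @ np.linalg.matrix_power(Gamma, h))
# The printed rows converge to delta = (2/3, 1/3), at a rate governed by the
# second largest eigenvalue modulus of Gamma (here 0.7).
```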
By using Equation (12.9) we can find that state $i \in \{1, 2, \ldots, m\}$ which is at time $t$ the most likely, given all the observations. This is known as local decoding. If we do that for all times $t$, the resulting path may in some circumstances be useful, but it is not in general the most likely path followed by the Markov chain; that is, it is not in general that sequence of states $c_1, c_2, \ldots, c_T$ which maximizes the conditional probability
$$\Pr(\mathbf{C}^{(T)} = \mathbf{c}^{(T)} \mid \mathbf{X}^{(T)} = \mathbf{x}^{(T)}).$$
That maximizing path (which need not be unique) can be found by an application of
dynamic programming known as the Viterbi algorithm (Viterbi, 1967, 2006), and the pro-
cess is known as global decoding. The results of local and global decoding are often very
similar but not identical.
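A compact sketch of global decoding by the Viterbi algorithm for a Poisson–HMM follows; it works with log-probabilities to avoid underflow, and all parameter values are illustrative assumptions. Local decoding would instead take, for each $t$, the argmax of the smoothed probabilities (12.9).

```python
# Sketch of the Viterbi algorithm (global decoding) for a Poisson-HMM.
import numpy as np
from scipy.stats import poisson

def viterbi(x, Gamma, delta, lam):
    x = np.asarray(x)
    T, m = len(x), len(delta)
    logP = poisson.logpmf(x[:, None], lam)    # log p_i(x_t)
    logG = np.log(Gamma)
    xi = np.zeros((T, m))
    back = np.zeros((T, m), dtype=int)
    xi[0] = np.log(delta) + logP[0]
    for t in range(1, T):
        cand = xi[t-1][:, None] + logG        # cand[j, i]: from state j to i
        back[t] = cand.argmax(axis=0)
        xi[t] = cand.max(axis=0) + logP[t]
    # backtrack one maximizing state sequence (it need not be unique)
    path = np.zeros(T, dtype=int)
    path[-1] = xi[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t+1, path[t+1]]
    return path + 1                            # states labelled 1..m

Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])
lam   = np.array([3.0, 10.0])
print(viterbi([2, 4, 3, 12, 9, 15, 11, 3], Gamma, delta, lam))
```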
12.6 Model Selection and Checking
A question that often arises when we seek to use an HMM in practice is: How many
states should there be? A common question of another type is: Should we use, for exam-
ple, Poisson state–dependent distributions or negative binomial? In the absence of useful
subject-matter information, model selection questions such as these are not easy to answer
completely satisfactorily, but there are some simple tools that can certainly help us to select
from a group of competing models. An introduction to model selection may be found in
Zucchini (2000). But once we have chosen the “best” model according to some criterion,
we still have to consider the question of whether the chosen model can be regarded as
adequate. In this section, we describe two model selection criteria, Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC), and the use of pseudo-residuals (also known as quantile residuals) to check the adequacy of the model.
12.6.1 AIC and BIC
Two widely used model selection criteria are the AIC and the BIC. The AIC selects that model which, of those under consideration, has the smallest value of the quantity
$$\mathrm{AIC} = -2 \ln L + 2p.$$
Here, $\ln L$ is the log-likelihood of the fitted model and $p$ denotes the number of parameters of the model. The first term measures the lack of fit, and the second term is a penalty which increases with the number of parameters. The criterion BIC differs from AIC in the penalty term only:
$$\mathrm{BIC} = -2 \ln L + p \ln T.$$
The penalty term of BIC is greater than that of AIC if $T > e^2$, that is, for $T \geq 8$. Thus, the BIC generally tends to select models with fewer parameters than does the AIC.
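In code the two criteria are one-liners. The check below uses, as an example, the two-state stationary model reported later in Table 12.2 ($-\ln L = 618.7$, $p = 4$, $T = 242$); small discrepancies from the tabulated values are due only to rounding of $\ln L$.

```python
# AIC and BIC as functions of the maximized log-likelihood, the number of
# parameters p, and the series length T.
import math

def aic(log_lik, p):
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, T):
    return -2.0 * log_lik + p * math.log(T)

# Two-state stationary Poisson-HMM of Section 12.7: -ln L = 618.7, p = 4, T = 242
print(aic(-618.7, 4))        # approx. 1245.4 (Table 12.2: 1245.3)
print(bic(-618.7, 4, 242))   # approx. 1259.4 (Table 12.2: 1259.3)
```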
12.6.2 Model Checking by Pseudo-Residuals
There are simple informal checks that can be made on the adequacy of an HMM. One can
(assuming stationarity) compare sample and model quantities such as the mean, variance,
ACF, and the distribution of the observations. If, for example, the data were to display
marked overdispersion but the model did not, we would discard or modify the model.
But there are additional systematic checks that can be performed. We describe here the
use of “pseudo-residuals.” For $t = 1, 2, \ldots, T$, define
$$u_t^- = \Pr(X_t < x_t \mid \mathbf{X}^{(-t)} = \mathbf{x}^{(-t)}), \qquad z_t^- = \Phi^{-1}(u_t^-)$$
and
$$u_t^+ = \Pr(X_t \leq x_t \mid \mathbf{X}^{(-t)} = \mathbf{x}^{(-t)}), \qquad z_t^+ = \Phi^{-1}(u_t^+),$$
with $\Phi$ denoting the standard normal distribution function. The interval $[u_t^-, u_t^+]$ (on a probability scale) or $[z_t^-, z_t^+]$ (on a “normal” scale) gives an indication of how extreme $x_t$ is relative to its conditional distribution given the other observations. The conditional distribution of one observation given the others is therefore needed, but it is a ratio of likelihoods and can be found in very much the same way as were the conditional distributions in Section 12.5.1.
There are several ways in which such pseudo-residual “segments” $[z_t^-, z_t^+]$ can be used. We give one here. If the observations were continuous, there would be a single quantity $z_t$ and not an interval, which would have a standard normal distribution if the model were correct, and so a quantile–quantile (QQ) plot could be used to assess the adequacy of the model. For discrete-valued series, we use the “mid-pseudo-residual”
$$z_t^{\mathrm{m}} = \Phi^{-1}\!\left(\frac{u_t^- + u_t^+}{2}\right)$$
to sort the pseudo-residuals in order to produce the QQ plot. Although we can claim no more than approximate normality for mid-pseudo-residuals, they are nevertheless useful for identifying poor fits. An example of their application is given in Section 12.7; see Figure 12.5.
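A sketch of the computation follows, for Poisson state-dependent distributions. One convenient way to organize the ratio of likelihoods is as a mixture: the conditional distribution of $X_t$ given the other observations has mixture weights proportional to $(\alpha_{t-1}\Gamma)_i\,\beta_t(i)$ (and $\delta_i\,\beta_1(i)$ for $t = 1$), which the scaled forward and backward passes deliver directly. All parameter values below are illustrative assumptions.

```python
# Sketch of pseudo-residuals (and mid-pseudo-residuals) for a Poisson-HMM.
import numpy as np
from scipy.stats import norm, poisson

def pseudo_residuals(x, Gamma, delta, lam):
    x = np.asarray(x)
    T, m = len(x), len(delta)
    P = poisson.pmf(x[:, None], lam)           # P[t, i] = p_i(x_t)
    # scaled forward pass
    alpha = np.zeros((T, m))
    c = np.zeros(T)
    a = delta * P[0]
    c[0] = a.sum()
    alpha[0] = a / c[0]
    for t in range(1, T):
        a = (alpha[t-1] @ Gamma) * P[t]
        c[t] = a.sum()
        alpha[t] = a / c[t]
    # scaled backward pass
    beta = np.ones((T, m))
    for t in range(T - 2, -1, -1):
        beta[t] = Gamma @ (P[t+1] * beta[t+1]) / c[t+1]
    # mixture weights of Pr(X_t = v | X^(-t)); scaling constants cancel
    w = np.vstack([delta * beta[0], (alpha[:-1] @ Gamma) * beta[1:]])
    w /= w.sum(axis=1, keepdims=True)
    u_minus = (w * poisson.cdf(x[:, None] - 1.0, lam)).sum(axis=1)
    u_plus  = (w * poisson.cdf(x[:, None], lam)).sum(axis=1)
    return norm.ppf(u_minus), norm.ppf(u_plus), norm.ppf((u_minus + u_plus) / 2)

# Illustrative example; the sorted mid-pseudo-residuals would be plotted
# against standard normal quantiles to produce a QQ plot like Figure 12.5.
Gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
delta = np.array([0.5, 0.5])
lam   = np.array([3.0, 10.0])
_, _, z_mid = pseudo_residuals([2, 4, 3, 12, 9, 3], Gamma, delta, lam)
print(np.sort(z_mid))
```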
12.7 An Example: Weekly Sales of a Soap Product
We describe here the fitting of HMMs to a series of counts and demonstrate briefly the
use of the forecasting, decoding, model selection, and checking techniques introduced in
Sections 12.5 and 12.6. Consider the series of weekly sales (in integer units, 242 weeks) of a
TABLE 12.1
Weekly Sales of the Soap Product; to Be Read across Rows
1 6 9 18 14 8 8 1 6 7 3 3 1 3 4 12 8 10 8 2
17 15 7 12 22 10 4 7 5 0 2 5 3 4 4 7 5 6 1 3
4 5 3 7 3 0 4 5 3 3 4 4 4 4 4 3 5 5 5 7
4 0 4 3 2 6 3 8 9 6 3 4 3 3 3 3 2 1 4 5
5 2 7 5 2 3 1 3 4 6 8 8 5 7 2 4 2 7 4 15
15 12 21 20 13 9 8 0 13 9 8 0 6 2 0 3 2 4 4 6
3 2 5 5 3 2 1 1 3 1 2 6 2 7 3 2 4 1 5 6
8 14 5 3 6 5 11 4 5 9 9 7 9 8 3 4 8 6 3 5
6 3 1 7 4 9 2 6 6 4 6 6 13 7 4 8 6 4 4 4
9 2 9 2 2 2 13 13 4 5 1 4 6 5 4 2 3 10 6 15
5 9 9 7 4 4 2 4 2 3 8 15 0 0 3 4 3 4 7 5
7 6 0 6 4 14 5 1 6 5 5 4 9 4 14 2 2 1 5 2
6 4
particular soap product in a supermarket, as shown in Table 12.1 and Figure 12.3. The data were downloaded in 2007 from http://gsbwww.uchicago.edu/kilts/research/db/dominicks, the Dominick’s Finer Food database at the Kilts Center for Marketing, Graduate School of Business of the University of Chicago. (That database is now, as of April 28, 2015, at http://research.chicagobooth.edu/kilts/marketing-databases/dominicks; University of Chicago Booth School of Business, 2015.) (The product was “Zest White Water 15 oz.,” with code 3700031165 and store number 67.)
The data display considerable overdispersion relative to Poisson: the sample mean is
5.44, the sample variance 15.40. In addition, the sample ACF (Figure 12.4) shows strong
evidence of serial dependence. Any satisfactory model has to mimic these properties.
Stationary Poisson–HMMs with one to four states were fitted to the soap sales by direct numerical maximization of the log-likelihood; the number of parameters, minus log-likelihood, and AIC and BIC values are shown in Table 12.2. Also given are the results for
two- to four-state models in which stationarity is not assumed. A one-state model (a
sequence of independent Poisson random variables) shows up as completely inadequate,
FIGURE 12.3
Series of weekly sales of the soap product. (Time series plot: Count, 0–25, against Week, 0–250.)
FIGURE 12.4
Autocorrelation function of the weekly soap sales. (Sample ACF against Lag, 0–20.)
TABLE 12.2
Soap Sales Data: Comparison, by AIC and BIC, of Poisson–HMMs with 1–4 States

Type of Model    No. of States    No. of Parameters    −ln L    AIC       BIC
Stationary       1                1                    711.8    1425.5    1429.0
Stationary       2                4                    618.7    1245.3    1259.3
Stationary       3                9                    610.5    1239.0    1270.4
Stationary       4                16                   604.2    1240.4    1296.2
Nonstationary    2                5                    618.5    1246.9    1264.4
Nonstationary    3                11                   610.2    1242.4    1280.8
Nonstationary    4                19                   602.8    1243.5    1309.8

Note: Within each type of model, AIC is minimized by the three-state model and BIC by the two-state model.
but this is not surprising in view of the overdispersion and autocorrelation already noted.
For both the stationary and the nonstationary case, the AIC is minimized by a three-state
model and the BIC by a two-state model.
Figure 12.5 displays QQ plots of (mid-)quantile residuals for stationary models with
1–3 states. The plots indicate that the one-state model is clearly inferior to the two- and
three-state models and that there is little to choose between the other two.
In passing, it is interesting to note that, in the three- and four-state cases, it is not difficult to find, both by direct numerical maximization and by EM, a local maximum of the log-likelihood that is not a global maximum. This phenomenon underscores the importance of trying several sets of starting values for the iterations.
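To illustrate, here is a sketch of direct numerical maximization with several random starting values for a stationary Poisson–HMM. The working parameters are transformed (log means; row-wise softmax for the t.p.m.) so that the optimizer runs unconstrained, and the stationary distribution is recomputed from $\Gamma$ at each evaluation. The number of starts, the optimizer settings, and the short series are all illustrative choices, not the computations actually used in this chapter.

```python
# Sketch: fit a stationary Poisson-HMM by direct numerical maximization,
# repeated from several random starting values to guard against local maxima.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def unpack(theta, m):
    lam = np.exp(theta[:m])                       # state-dependent means > 0
    G = np.exp(theta[m:].reshape(m, m))
    Gamma = G / G.sum(axis=1, keepdims=True)      # rows sum to one
    return lam, Gamma

def neg_log_lik(theta, x, m):
    lam, Gamma = unpack(theta, m)
    if not np.all(np.isfinite(Gamma)):
        return np.inf
    # stationary distribution: solve delta (I - Gamma + U) = 1', U all ones
    delta = np.linalg.solve((np.eye(m) - Gamma + 1.0).T, np.ones(m))
    ll, w = 0.0, delta
    for obs in x:                                  # scaled forward recursion
        v = w * poisson.pmf(obs, lam)
        s = v.sum()
        ll += np.log(s + 1e-300)                   # guard against underflow
        w = (v / s) @ Gamma
    return -ll

rng = np.random.default_rng(1)
x = np.array([2, 4, 3, 12, 9, 15, 11, 3, 2, 5, 6, 14, 9, 2, 3, 4])  # illustrative
m, best = 2, None
for _ in range(5):                                 # several random starts
    theta0 = np.concatenate([np.log(rng.uniform(2.0, 15.0, m)),
                             rng.normal(0.0, 1.0, m * m)])
    res = minimize(neg_log_lik, theta0, args=(x, m), method="Nelder-Mead",
                   options={"maxiter": 5000, "fatol": 1e-8, "xatol": 1e-8})
    if best is None or res.fun < best.fun:
        best = res
print(-best.fun, *unpack(best.x, m), sep="\n")     # ln L, means, t.p.m.
```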
The best stationary three-state model found has state-dependent means given by $\lambda = (3.74, 8.44, 14.93)$, t.p.m.