a multiplicity of different model specifications for the same data set and it is to issues of
this nature that we now turn.
9.4.2 Scoring Rules and Model Selection
The relative performance of a model within a given group of competing models can be
assessed using scoring rules. Scoring rules, or functions, are regularly used in decision
analysis to measure the quality of probabilistic predictions by assigning a numerical score
based on the predictive distribution and the observed data. They are closely related to (gen-
eralized) entropy measures; see, for example, Jose et al. (2008). An important property of a
scoring rule is propriety. A scoring rule is said to be proper if a forecaster achieves their best
score by predicting according to their true belief about the predictive distribution. A for-
mal denition of this concept and further discussion can be found in Gneiting and Raftery
(2007). Boero et al. (2011) provide a comprehensive evaluation of scoring rules along with
some historical background, and Czado et al. (2009) offer an account of scoring rules in the
context of count data.
In the present framework, scoring rules are used as a model selection tool and are
computed as averages over the relevant set of (in-sample) predictions, say
$(T - p)^{-1} \sum_{t=p+1}^{T} s[F(x_t)]$, where $s[\cdot]$ denotes a generic scoring rule, $x_t$ the observed count, and $F(x_t)$
is defined in the text following (9.10). Scoring rules are, generally, negatively oriented
penalties that one seeks to minimize. The literature has developed a large number of scoring
rules and, unless there is a unique and clearly defined underlying decision problem,
there is no automatic choice of a (proper) scoring rule to be used in any given situation.
Therefore, the use of a variety of scoring rules may be appropriate to take advantage of
specific emphases and strengths.
We have found three proper scoring rules to be particularly useful in comparing time
series models for counts, and these are now introduced as scores per observation to be
aggregated as indicated in the previous paragraph. The first scoring rule we consider is the
logarithmic score. It is defined as the negative of the logarithm of the predictive distribution
evaluated at the observed count, and it is closely related to the classical Shannon entropy
$$\mathrm{logs}(F(x_t \mid F_{t-1})) = -\log p(x_t \mid F_{t-1}),$$
where $p(x_t \mid F_{t-1})$ is the probability mass of the predictive distribution at the observed count.
In contrast to the other scoring rules discussed in the following text, the logarithmic score is
what is called a local scoring rule in that it provides a small value if the observed count is in
the high-density region of the predictive distribution and large values otherwise. Geweke
and Amisano (2011) provide a careful analysis of the properties of weighted linear combina-
tions of prediction models and base their model choice procedure on the minimization of
the logarithmic score.
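A minimal sketch of the logarithmic score for a single observation follows; the Poisson predictive distribution used here is purely illustrative and is not taken from the chapter.

```python
import numpy as np
from scipy.stats import poisson

def log_score(pmf, x):
    """Logarithmic score: minus the log of the predictive mass at the observed count."""
    return -np.log(pmf[x])

# Hypothetical one-step-ahead predictive distribution: Poisson with mean 2.0
pmf = poisson.pmf(np.arange(50), mu=2.0)   # support truncated where the remaining mass is negligible
print(log_score(pmf, x=3))                 # small when x falls in a high-probability region
```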
The quadratic score has been specifically proposed in the assessment of time series
predictions of counts. It involves an augmentation of the information contained in the logarithmic
score by a summary measure from all probability ordinates, denoted by
$\lVert p \rVert^2 = \sum_{j=0}^{\infty} p(j)^2$, where $p(j)$ represents the probability that $x_t = j$ in the probability mass function
of the predictive distribution, and is given by
$$\mathrm{qs}(F(x_t \mid F_{t-1})) = -2\,p(x_t \mid F_{t-1}) + \lVert p \rVert^2.$$
The quadratic score was proposed by Wecker (1989) in the specific context of predictions
for time series of counts.
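Under the same conventions as the sketch above (a predictive pmf truncated at a point beyond which the remaining mass is negligible), the quadratic score can be computed as follows; this is our illustration, not code from the chapter.

```python
import numpy as np

def quadratic_score(pmf, x):
    """Quadratic score: -2 p(x) + ||p||^2, where ||p||^2 = sum_j p(j)^2.

    pmf holds the predictive probability masses p(0), p(1), ..., truncated so that
    the omitted tail contributes negligibly to ||p||^2.
    """
    pmf = np.asarray(pmf)
    return -2.0 * pmf[x] + np.sum(pmf ** 2)
```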
The nal scoring measure we consider is the ranked probability score dened by
rps(F(x
t
|F
t1
)) =
[F(j) 1(x
t
j)]
2
,
j=0
where 1(·) is an indicator function. This rule assesses the sum of squared differences between
the cumulative probabilities of the modeled conditional distribution and the step function implied
by the observation. Hence, it penalizes more severely when the predictions are far from the observed
outcomes.
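A corresponding sketch for the ranked probability score, again truncating the infinite sum at the end of the supplied support (where F(j) is essentially one and the summands vanish), is given below; it is an illustration under the same hypothetical conventions as above.

```python
import numpy as np

def ranked_probability_score(pmf, x):
    """Ranked probability score: sum_j [F(j) - 1(x <= j)]^2 over the truncated support."""
    pmf = np.asarray(pmf)
    cdf = np.cumsum(pmf)                             # predictive cdf F(j)
    step = (np.arange(len(pmf)) >= x).astype(float)  # indicator 1(x <= j)
    return np.sum((cdf - step) ** 2)
```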
Some authors (including Weiß 2009 and Zhu 2011) proposed the use of information crite-
ria, such as the popular Akaike information criterion (AIC), as means of choosing between
nonnested time series models for counts, despite the fact that little is known about their
ability to do so in this framework. In addition, Psaradakis et al. (2009) examined the abil-
ity of some popular information criteria, like AIC, the Bayesian information criterion (BIC)
and the Hannan and Quinn criterion (HQ) to distinguish between some nonlinear time
series models. They argued that all three criteria have a useful role to play in a time series
model selection exercise. Although their study was not based on count time series models
directly, it may serve as some justication for using such model selection devices in the
present context.
Scoring rules and values for two information criteria for a PINAR(1) model fitted to the
cuts data are as follows: logarithmic score 2.4549, quadratic score 0.9001, ranked probability
score 1.5932, AIC = 290.2678, and BIC = 294.4490. For the iceberg order data, the corre-
sponding values are as follows: logarithmic score 1.3014, quadratic score 0.6500, ranked
probability score 0.5122, AIC = 1414.9705, and BIC = 1422.4586. These values are simply
reported here, but will be employed in the following section to facilitate comparison of the
simple PINAR(1) model to others fitted to each data set.
9.4.3 Cuts and Iceberg Data Revisited
It will now be very clear that the basic PINAR(1) models fitted to the two real life data sets
introduced in Section 9.2 have been revealed to be deficient according to a range of criteria.
We are now in a position to reconsider specification of appropriate count time series models
for these data.
Consider first the cuts data. Diagnostic analyses provided in earlier sections after fitting
a simple PINAR(1) model to these data reveal a number of difficulties. First, while the mean
of the Pearson residuals is close to zero, their variance is considerably larger than unity (at
1.607). Next, the dependence structure in the data is not well captured by the model. This is
evident in the left-hand panels of two figures: from the correlogram of the Pearson residuals
in Figure 9.4 and from the parametric resampling exercise depicted in Figure 9.3. Possible
evidence of distributional misspecification is available in the PIT in the relevant panel of
Figure 9.6. In addition, an analysis of the component residuals of Section 9.3 reveals unaccounted
variation in the data and the graphical evidence in Figure 9.5 may be indicative
of unmodeled seasonal variation. From a pragmatic point of view, given the lower arrivals
p-value for the IM test reported previously and the fact that seasonal arrivals could very
well induce seasonal departures, it seems reasonable to account for this variation by
modifying the arrival process first.
In seeking to remedy the aforementioned deficiencies in the PINAR(1) model with no
covariates, we undertake a limited specification search. This leads us to propose fitting a
GP(1) model of the form (9.1) with time-varying innovation rate $\lambda_t$ to the data. The resultant
fitted model is (estimated asymptotic standard errors are given in parentheses below
parameter estimates)
$$\hat{X}_t = R_t\bigl(X_{t-1};\ \underset{(0.072)}{0.478}\bigr) + \varepsilon_t, \quad \text{where } \hat{\varepsilon}_t \sim \mathrm{GP}\bigl(\hat{\lambda}_t,\ \underset{(0.066)}{0.165}\bigr),$$
and
$$\hat{\lambda}_t = \exp\bigl\{\underset{(0.190)}{0.942} - \underset{(0.106)}{0.216}\,\sin(2\pi t/12) - \underset{(0.110)}{0.333}\,\cos(2\pi t/12)\bigr\}.$$
It can be seen that estimated coefficients relating to seasonal effects are both statistically
different from zero at most conventional significance levels, as is the dispersion parameter
of the GP distribution (p-value 0.0125). The values for the various scoring rules and the
information criteria are as follows: logarithmic score 2.3252, quadratic score 0.8840, ranked
probability score 1.4645, AIC = 277.0928, and BIC = 284.0615. All of these are lower than
their counterparts provided toward the end of Section 9.4.2 for the PINAR(1) model with
no covariates. Further summary statistics are as follows: variance of the Pearson residuals:
1.0165 and uniformity test, G, of the PIT histogram: 1.9107 (p-value 0.9928). On the evidence
presented, a researcher would clearly prefer the GP(1) model with deterministic seasonality
to the original PINAR(1) model.
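To illustrate the deterministic seasonal component, the sketch below evaluates the fitted innovation rate over one annual cycle using the point estimates reported above; the code is ours and simply reproduces the fitted exponential-harmonic form (the signs of the harmonic coefficients follow the reconstruction given above).

```python
import numpy as np

def lambda_hat(t):
    """Fitted seasonal innovation rate of the GP(1) model for the cuts data (period 12)."""
    return np.exp(0.942 - 0.216 * np.sin(2 * np.pi * t / 12)
                        - 0.333 * np.cos(2 * np.pi * t / 12))

months = np.arange(1, 13)
print(np.round(lambda_hat(months), 3))   # innovation mean varies over the seasonal cycle
```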
Some diagnostic plots relating to this new model specification are the subject of
Figure 9.7. It is readily seen from all three panels in the figure that there is no evidence of
model misspecification. These should be compared and contrasted with comparable panels
in Figures 9.3, 9.4, and 9.6. Hence, a simple change in innovation distribution, together
with allowing a time-varying innovation mean, leads to a marked improvement in the suitability
of the (new) model for which the methods discussed do not reveal any statistical
inadequacies.
Turning to the iceberg order data, diagnostic results reported in earlier sections after fitting
a PINAR(1) model clearly reject this initial model. However, evidence for distributional
misspecification is not as clear-cut as for the cuts data set. The variance of the Pearson residuals
is larger than unity at 1.289. But the PIT histogram in the lower left panel of Figure 9.6
displays only limited departure from uniformity and the G-statistic corroborates this by
not rejecting the null of a uniform PIT histogram (p-value = 0.395). A misspecification of
the dependence structure is evident from the right-hand panels of Figures 9.3, 9.4 and the
bottom right panel in Figure 9.6. In contrast to the previous data set, we do not infer that
a seasonal pattern is unaccounted for, as some experimentation (not reported here) shows
no improvement over the basic PINAR(1) model.
A limited specication search (details of which again go unreported to save space) leads
us to propose a model of the form (9.1) with no covariates for the iceberg order data, but
with GP innovations and associated random operator of Joe (1996). The proposed DGP
sets p = 2 and is denoted a GP(2) model. Note that the model cannot be written in the
form (9.2), since the random operator R
t
(F
t1
, α) of (9.1) has two lags in F
t1
,but the
dependence parameter vector α has three elements. By closure under convolution, this
leads to the marginal distribution of the counts being taken to be GP. The resultant tted
model is (estimated asymptotic standard errors again in parentheses)
X
ˆ
t
= R
t
(X
t1
; 0.1954 , 0.046 , 0.4671 ) ε, where εˆ ∼ GP(0.3259, 0.1696 ).
(0.0129) (0.0139) (0.0268) (0.0262) (0.0255)
FIGURE 9.7
Diagnostics for a GP(1) model with exogenous variables fitted to the cuts data. (Panels: ACF of Pearson residuals, PIT histogram, parametric bootstrap.)
All the parameter estimates in α are positive and significantly different from zero. The
overdispersion parameter η is also significant (the p-value is zero to four decimal places).
The values for the various scoring rules and information criteria are as follows: logarithmic
score 1.2592, quadratic score 0.6376, ranked probability score 0.5024, AIC = 1370.00, and
BIC = 1382.48. The scoring rules and information criteria, as compared to those reported
in Section 9.4.2, uniformly favor the GP(2) specification over the PINAR(1) model. Further
summary statistics are as follows: variance of the Pearson residuals: 1.0001 and unifor-
mity test of the PIT histogram: 0.9958 (p-value 0.9994). The diagnostic plots are provided
in Figure 9.8. None of the three panels indicate model misspecication.
9.5 Evidence with Artificial Data
The evidence on the use of model validation and diagnostic methods provided in Section
9.4.3 relates only to two data sets for which deficiencies in a PINAR(1) model can be highlighted
and, perhaps, rectified by simple model respecifications. To further gauge and
illustrate the performance of the diagnostic tools presented in previous sections, we report
FIGURE 9.8
Diagnostics for a GP(2) model fitted to the iceberg order data. (Panels: ACF of Pearson residuals, PIT histogram, parametric bootstrap.)
the results of some simulation experiments seeking to reflect additional common situations
that might be faced by applied workers wishing to assess the adequacy of a fitted
model for count time series. The purpose of this section is to report the results of four such
experiments.
The first experiment aims to analyze the ability of the diagnostic devices, including scoring
rules, to detect a misspecification in the distributional assumption of the innovations in
a proposed model. In particular, it reflects a common situation in applied work where the
count time series exhibits marginal overdispersion reflected by a variance-to-mean ratio
greater than one.
Data are generated from a first-order integer autoregressive process (9.1) with innovations
$\varepsilon_t \sim \mathrm{GP}(\lambda, \eta)$ and the random operator $R_t(\cdot)$ proposed by Joe (1996); this is the setup
denoted GP(1) in Section 9.1. The specific form of the GP distribution we employ is
$$p(\varepsilon) = \lambda(\lambda + \varepsilon\eta)^{\varepsilon - 1} \exp(-\lambda - \varepsilon\eta)/\varepsilon!, \qquad \varepsilon = 0, 1, 2, \ldots,\ \lambda > 0,\ \eta \in [0, 1).$$
Further details can be found, inter alia, in Jung and Tremayne (2011b).
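The generalized Poisson probability mass function just displayed is straightforward to evaluate directly; the sketch below (our illustration, not code from the chapter) checks that the probabilities sum to approximately one and that the innovation mean equals λ/(1 − η) for the innovation parameter values used in the simulation experiment reported next.

```python
import numpy as np
from math import exp, factorial

def gp_pmf(k, lam, eta):
    """Generalized Poisson pmf: lam * (lam + k*eta)**(k - 1) * exp(-lam - k*eta) / k!."""
    return lam * (lam + k * eta) ** (k - 1) * exp(-lam - k * eta) / factorial(k)

lam, eta = 0.8, 0.2                                            # innovation parameters used below
probs = np.array([gp_pmf(k, lam, eta) for k in range(100)])    # truncated support 0..99
print(probs.sum())                                             # approximately 1
print((np.arange(100) * probs).sum())                          # innovation mean lam/(1 - eta) = 1.0
```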
For the simulation experiment, we use the following parameter values: $\alpha_1 = 0.6$, $\lambda = 0.8$,
and $\eta = 0.2$. This leads to a GP-distributed count time series with (theoretical) mean and
variance of 2.5 and 3.9, respectively, and moderate dependence structure (the $\tau$th autocorrelation
function ordinate $\rho(\tau) = 0.6^{\tau}$, for $\tau = 1, 2, \ldots$). Here, and in conjunction with