292 Handbook of Discrete-Valued Time Series
$f_Y(\omega; \beta) = \beta' f_Y(\omega)\beta = \beta' f_Y^{re}(\omega)\beta$, where $f_Y^{re}(\omega)$ denotes the real part of $f_Y(\omega)$. The optimality criterion can thus be expressed as

$$\lambda(\omega) = \sup_{\beta} \frac{\beta' f_Y^{re}(\omega)\beta}{\beta' V \beta}, \tag{13.4}$$

where $V$ is the variance–covariance matrix of $Y_t$. The resulting scaling $\beta(\omega)$ is called the optimal scaling.
The $Y_t$ process is a multivariate point process, and any particular component of $Y_t$ is the individual point process for the corresponding state (e.g., the first component of $Y_t$ indicates whether or not the process is in state $c_1$ at time $t$). For any fixed $t$, $Y_t$ represents a single observation from a simple multinomial sampling scheme. It readily follows that $V = D - pp'$, where $p = (p_1, \ldots, p_{k+1})'$, and $D$ is the $(k+1) \times (k+1)$ diagonal matrix $D = \mathrm{diag}\{p_1, \ldots, p_{k+1}\}$. Since, by assumption, $p_j > 0$ for $j = 1, 2, \ldots, k+1$, it follows that $\mathrm{rank}(V) = k$ with the null space of $V$ being spanned by $\mathbf{1}_{k+1}$. For any $(k+1) \times k$ full rank matrix $Q$ whose columns are linearly independent of $\mathbf{1}_{k+1}$, $Q'VQ$ is a $k \times k$ positive definite symmetric matrix.
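The structure of $V$ can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the probability vector `p` and the helper name `multinomial_covariance` are illustrative choices, not taken from the text. It builds $V = D - pp'$, confirms that the vector of ones lies in its null space and that its rank is $k$, and checks that $Q'VQ$ is positive definite for one admissible $Q$.

```python
import numpy as np

def multinomial_covariance(p):
    """Return V = D - p p' for one multinomial draw with cell probabilities p."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

# Hypothetical probabilities for k + 1 = 4 states (illustrative values only).
p = np.array([0.1, 0.2, 0.3, 0.4])
V = multinomial_covariance(p)

ones = np.ones(len(p))
null_check = V @ ones                  # V annihilates the vector of ones
rank_V = np.linalg.matrix_rank(V)      # expected to equal k = 3

# One admissible Q: k columns of the identity, none proportional to the ones vector.
k = len(p) - 1
Q = np.vstack([np.eye(k), np.zeros((1, k))])
min_eig = np.linalg.eigvalsh(Q.T @ V @ Q).min()   # positive if Q'VQ is positive definite
```

Any other full rank $Q$ with columns linearly independent of the ones vector could be substituted in the last step.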
With the matrix $Q$ as previously defined, and for $-1/2 < \omega \le 1/2$, define $\lambda(\omega)$ to be the largest eigenvalue of the determinantal equation

$$|Q' f_Y^{re}(\omega) Q - \lambda\, Q'VQ| = 0,$$

and let $b(\omega) \in \mathbb{R}^k$ be any corresponding eigenvector, that is,

$$Q' f_Y^{re}(\omega) Q\, b(\omega) = \lambda(\omega)\, Q'VQ\, b(\omega).$$
The eigenvalue $\lambda(\omega) \ge 0$ does not depend on the choice of $Q$. Although the eigenvector $b(\omega)$ depends on the particular choice of $Q$, the equivalence class of scalings associated with $\beta(\omega) = Qb(\omega)$ does not depend on $Q$. A convenient choice of $Q$ is $Q = [I_k \mid 0]'$, where $I_k$ is the $k \times k$ identity matrix and $0$ is the $k \times 1$ vector of zeros. For this choice, $Q' f_Y^{re}(\omega) Q$ and $Q'VQ$ are the upper $k \times k$ blocks of $f_Y^{re}(\omega)$ and $V$, respectively. This choice corresponds to setting the last component of $\beta(\omega)$ to zero.
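The invariance to $Q$ can be illustrated with a small numerical sketch, assuming NumPy. The function name `envelope_at_frequency`, the probabilities `p`, and the matrix `f_re` are all hypothetical: `f_re` is a made-up symmetric matrix that, like $V$, annihilates the vector of ones, standing in for $f_Y^{re}(\omega)$ at a single frequency. The generalized eigenproblem is reduced to an ordinary symmetric one through a Cholesky factor of $Q'VQ$, which is one of several possible ways to solve it.

```python
import numpy as np

def envelope_at_frequency(f_re, V, Q):
    """Largest root of |Q' f_re Q - lambda Q'VQ| = 0 and the scaling beta = Q b."""
    A = Q.T @ f_re @ Q
    B = Q.T @ V @ Q
    L_inv = np.linalg.inv(np.linalg.cholesky(B))   # B = L L'
    M = L_inv @ A @ L_inv.T                        # same eigenvalues as the pencil (A, B)
    vals, vecs = np.linalg.eigh((M + M.T) / 2)     # eigenvalues in ascending order
    b = L_inv.T @ vecs[:, -1]                      # back-transform so that A b = lambda B b
    return vals[-1], Q @ b

# Illustrative inputs (not from the text).
p = np.array([0.1, 0.2, 0.3, 0.4])
V = np.diag(p) - np.outer(p, p)
u = np.array([1.0, -1.0, 0.0, 0.0])                # orthogonal to the ones vector
f_re = V + 0.5 * np.outer(u, u)

k = len(p) - 1
Q_last = np.vstack([np.eye(k), np.zeros((1, k))])   # Q = [I_k | 0]': last scale fixed at 0
Q_first = np.vstack([np.zeros((1, k)), np.eye(k)])  # alternative: first scale fixed at 0

lam_last, beta_last = envelope_at_frequency(f_re, V, Q_last)
lam_first, beta_first = envelope_at_frequency(f_re, V, Q_first)

def power_ratio(beta):
    """The criterion in (13.4) evaluated at a given scaling."""
    return (beta @ f_re @ beta) / (beta @ V @ beta)
```

Under these assumptions, `lam_last` and `lam_first` agree up to round-off, and either scaling attains that value when inserted into the ratio in (13.4), even though the two vectors themselves differ.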
The value $\lambda(\omega)$ itself has a useful interpretation; specifically, $\lambda(\omega)\,d\omega$ represents the largest proportion of the total power that can be attributed to the frequencies $(\omega, \omega + d\omega)$ for any particular scaled process $X_t(\beta)$, with the maximum being achieved by the scaling $\beta(\omega)$. This result is demonstrated in Figure 13.3. Because of its central role, $\lambda(\omega)$ was defined to be the spectral envelope of a stationary categorical time series.
The name spectral envelope is appropriate since $\lambda(\omega)$ envelopes the standardized spectrum of any scaled process. That is, given any $\beta$ normalized so that $X_t(\beta)$ has total power one, $f(\omega; \beta) \le \lambda(\omega)$ with equality if and only if $\beta$ is proportional to $\beta(\omega)$.
Although the law of the process $X_t(\beta)$ for any one-to-one scaling $\beta$ completely determines the law of the categorical process $X_t$, information is lost when one restricts attention to the spectrum of $X_t(\beta)$. Less information is lost when one considers the spectrum of $Y_t$. Dealing directly with the spectral density $f_Y(\omega)$ itself is somewhat cumbersome since it is a function into the set of complex Hermitian matrices. Alternatively, one can view the spectral envelope as an easily understood, parsimonious tool for exploring the periodic nature of a categorical time series with a minimal loss of information.
293 Spectral Analysis of Qualitative Time Series
[Figure 13.3 plot, titled "The spectral envelope": x-axis Frequency (0.0 to 0.5), y-axis Power (0 to 35); legend: Spec env, Scaling 1, Scaling 2.]

FIGURE 13.3
Demonstration of the spectral envelope. The short dashed line indicates a spectral density corresponding to some scaling. The long dashed line indicates a spectral density corresponding to a different scaling. The thick solid line is the spectral envelope, which can be thought of as throwing a blanket over all possible spectral densities corresponding to all possible scalings of the sequence. Because the exhibited spectral densities attain the value of the spectral envelope at the frequency near 0.1, the corresponding scaling is optimal at that frequency. The scaling at the frequency near 1/3 is close to optimal, but the spectral envelope indicates that there is a scaling that can get more power at frequency 1/3. In addition to finding interesting frequencies (e.g., there is something interesting near the frequency of 0.2 that neither scaling 1 nor 2 discovers), the spectral envelope reveals frequencies for which nothing is interesting (e.g., no matter which scaling is used, there is nothing interesting in this sequence in the frequency range above 0.4).
In view of (13.4), there is an apparent relationship between the spectral envelope and principal components. This relationship is discussed in Section 13.7.
13.4 Estimation
In view of the dimension reduction mentioned in the previous section, the easiest way to estimate the spectral envelope is to fix the scale of the last state at 0, and then select the indicator vectors to be $k$-dimensional. More precisely, to estimate the spectral envelope and the optimal scalings given a stationary categorical sequence, $\{X_t;\ t = 1, \ldots, n\}$, with state-space $C = \{c_1, \ldots, c_{k+1}\}$, perform the following tasks.
(1) Form $k \times 1$ vectors $\{Y_t,\ t = 1, \ldots, n\}$ as follows:

$$Y_t = e_j \ \text{ if } X_t = c_j,\ j = 1, \ldots, k; \qquad Y_t = 0_k \ \text{ if } X_t = c_{k+1},$$

where $e_j$ is a $k \times 1$ vector with 1 in the $j$th position and zeros elsewhere and $0_k$ is the $k \times 1$ vector of zeros.
(2) Calculate the (fast) Fourier transform of the data,

$$d(j/n) = n^{-1/2} \sum_{t=1}^{n} Y_t \exp\!\left(-2\pi i t \frac{j}{n}\right).$$

Note that $d(j/n)$ is a $k \times 1$ complex-valued vector. Calculate the periodogram, $I(j/n) = d(j/n)\,d^{*}(j/n)$, for $j = 1, \ldots, n/2$, and retain only the real part, say $I^{re}(j/n)$.
(3) Smooth the real part of the periodogram as preferred to obtain $\hat f^{re}(j/n)$, a consistent estimator of the real part of the spectral matrix. Time series texts such as Shumway and Stoffer (2011) that cover the spectral domain will have an extensive discussion on consistent estimation of the spectral density.
(4) Calculate the $k \times k$ covariance matrix of the data, $S = n^{-1} \sum_{t=1}^{n} (Y_t - \bar{Y})(Y_t - \bar{Y})'$, where $\bar{Y}$ is the sample mean of the data.
(5) For each $\omega_j = j/n$, $j = 1, \ldots, n/2$, determine the largest eigenvalue and the corresponding eigenvector of the matrix $2n^{-1} S^{-1/2} \hat f^{re}(\omega_j) S^{-1/2}$. Note that $S^{-1/2}$ is the inverse of the unique square root matrix of $S$.
(6) The sample spectral envelope $\hat\lambda(\omega_j)$ is the eigenvalue obtained in the previous step. If $b(\omega_j)$ denotes the eigenvector obtained in the previous step, the optimal sample scaling is $\hat\beta(\omega_j) = S^{-1/2} b(\omega_j)$; this will result in $k$ values, the $(k+1)$-st value being held fixed at zero.
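Steps (1) through (6) can be sketched in a few lines. The following is an illustrative Python version using NumPy, not the R code referred to in the text. Several choices are assumptions made here: the function name `spectral_envelope`, the use of a simple $(2m+1)$-point moving average as the smoother in step (3), the circular treatment of the frequency axis when smoothing, and the centering of the indicators before the transform. The test sequence is simulated, with a period-3 pattern planted in it.

```python
import numpy as np

def spectral_envelope(x, states, m=1):
    """Sample spectral envelope and scalings following steps (1)-(6).

    x      : sequence of category labels
    states : the k + 1 states in order; the last one gets scale zero
    m      : half-width of a simple (2m + 1)-point moving average (assumed smoother)
    """
    n, k = len(x), len(states) - 1

    # (1) k-dimensional indicator vectors; the last state maps to the zero vector.
    position = {c: j for j, c in enumerate(states)}
    Y = np.zeros((n, k))
    for t, c in enumerate(x):
        if position[c] < k:
            Y[t, position[c]] = 1.0

    # Centering leaves the DFT at nonzero Fourier frequencies unchanged and
    # zeroes it at frequency 0, so smoothing does not leak the mean.
    Yc = Y - Y.mean(axis=0)

    # (2) DFT and real part of the periodogram. np.fft.fft sums over t = 0..n-1
    # rather than 1..n; the resulting phase factor cancels in d d*.
    d = np.fft.fft(Yc, axis=0) / np.sqrt(n)
    I_re = (np.einsum("ja,jb->jab", d.real, d.real)
            + np.einsum("ja,jb->jab", d.imag, d.imag))

    # (3) Equal-weight average over 2m + 1 neighbouring frequencies (circular).
    f_re = sum(np.roll(I_re, q, axis=0) for q in range(-m, m + 1)) / (2 * m + 1)

    # (4) Sample covariance matrix S.
    S = Yc.T @ Yc / n

    # S^{-1/2} from the spectral decomposition of S.
    evals, P = np.linalg.eigh(S)
    S_inv_half = P @ np.diag(evals ** -0.5) @ P.T

    # (5)-(6) Largest eigenpair of 2 n^{-1} S^{-1/2} f_re S^{-1/2} at each omega_j.
    J = n // 2
    freqs = np.arange(1, J + 1) / n
    lam = np.empty(J)
    beta = np.empty((J, k))
    for idx, j in enumerate(range(1, J + 1)):
        M = (2.0 / n) * S_inv_half @ f_re[j] @ S_inv_half
        vals, vecs = np.linalg.eigh((M + M.T) / 2)
        lam[idx] = vals[-1]
        beta[idx] = S_inv_half @ vecs[:, -1]
    return freqs, lam, beta

# Simulated four-letter sequence with every third symbol forced to "G".
rng = np.random.default_rng(0)
x = [str(c) for c in rng.choice(list("ACGT"), size=300)]
for t in range(0, 300, 3):
    x[t] = "G"
freqs, lam, beta = spectral_envelope(x, states=["A", "C", "G", "T"], m=1)
peak = freqs[np.argmax(lam)]   # expected to fall near 1/3 for this sequence
```

Other smoothers can be swapped into step (3) without touching the remaining steps; only the value of $\nu_m$ used for inference would change.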
Any standard programming language can be used to do the calculations; basically, one only has to be able to compute fast Fourier transforms and eigenvalues and eigenvectors of real symmetric matrices. Some examples using the R statistical programming language (R Core Team, 2013) may be found in Section 13.8; also, see Shumway and Stoffer (2011, Chapter 7). Inference for the sample spectral envelope and the sample optimal scalings is described in detail in Stoffer et al. (1993a). A few of the main results of that paper are as follows.
If $X_t$ is an i.i.d. sequence and $\hat\lambda(\omega_j)$ is the largest eigenvalue of the periodogram matrix, $I(\omega_j)$, then the following large sample approximation based on the chi-square distribution is valid for $x > 0$:

$$\Pr\{n 2^{-1} \hat\lambda(\omega_j) < x\} \doteq \Pr\{\chi^2_{2k} < 4x\} - \pi^{1/2} x^{(k-1)/2} e^{-x}\, \frac{\Pr\{\chi^2_{k+1} < 2x\}}{\Gamma(k/2)}, \tag{13.5}$$

where $k + 1$ is the size of the alphabet being considered. Note that $I(\omega_j)$ has at most one positive eigenvalue and consequently, $\hat\lambda(\omega_j) = \mathrm{tr}\, I(\omega_j)$.
Footnote (on $S^{-1/2}$ in step (5)): If $S = P\Lambda P'$ is the spectral decomposition of $S$, then $S^{-1/2} = P\Lambda^{-1/2}P'$, where $\Lambda^{-1/2}$ is the diagonal matrix with the reciprocal of the root eigenvalues along the diagonal.
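In the general case, if a smoothed estimator is used and $\lambda(\omega)$ is a distinct root (which implies that $\lambda(\omega) > 0$), then, independently, for any collection of Fourier frequencies $\{\omega_i;\ i = 1, \ldots, M\}$, $M$ fixed, and for large $n$ and $m$,

$$\nu_m\, \frac{\hat\lambda(\omega_i) - \lambda(\omega_i)}{\lambda(\omega_i)} \sim AN(0, 1) \tag{13.6}$$

and

$$\nu_m \left[\hat\beta(\omega_i) - \beta(\omega_i)\right] \sim AN(0, \Sigma_i), \tag{13.7}$$

where $\Sigma_i = V^{-1/2}\, \Omega_i\, V^{-1/2}$ with

$$\Omega_i = \frac{\lambda(\omega_i) H(\omega_i)^{+} f^{re}(\omega_i) H(\omega_i)^{+} - a(\omega_i) a(\omega_i)'}{2},$$

and $H(\omega_i) = f^{re}(\omega_i) - \lambda(\omega_i) I_{k-1}$, $a(\omega_i) = H(\omega_i)^{+} f^{im}(\omega_i) V^{-1/2} u(\omega_i)$, and $H(\omega_i)^{+}$ refers to the Moore–Penrose inverse of $H(\omega_i)$.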
The term $\nu_m$ depends on the type of estimator being used. For example, in the case of estimation via weighted averaging of the periodogram, that is,

$$\hat f(\omega) = \sum_{q=-m}^{m} h_q\, I(\omega_{j+q}),$$

where $\{\omega_{j+q};\ q = 0, \pm 1, \ldots, \pm m\}$ is a band of frequencies where $\omega_j$ is the fundamental frequency closest to $\omega$, and such that the weights $h_q = h_{-q}$ are positive and $\sum_{q=-m}^{m} h_q = 1$, then $\nu_m^{-2} = \sum_{q=-m}^{m} h_q^2$. If a simple average is used, $h_q = 1/(2m + 1)$, then $\nu_m^2 = (2m + 1)$.
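The relation between the weights and $\nu_m$ is easy to check numerically. The sketch below uses only the Python standard library; the function name `nu_m` and the triangular weights are illustrative assumptions, while the simple-average case reproduces $\nu_m^2 = 2m + 1$ from the text.

```python
import math

def nu_m(weights):
    """Return nu_m, where nu_m^{-2} is the sum of squared weights h_{-m}, ..., h_m."""
    h = list(weights)
    if any(w <= 0 for w in h):
        raise ValueError("weights must be positive")
    if abs(sum(h) - 1.0) > 1e-12:
        raise ValueError("weights must sum to one")
    if any(abs(a - b) > 1e-12 for a, b in zip(h, reversed(h))):
        raise ValueError("weights must be symmetric, h_q = h_{-q}")
    return 1.0 / math.sqrt(sum(w * w for w in h))

m = 3

# Simple average: h_q = 1 / (2m + 1), so nu_m^2 should equal 2m + 1 = 7.
simple = [1.0 / (2 * m + 1)] * (2 * m + 1)
nu_simple = nu_m(simple)

# Hypothetical triangular weights over the same band, normalized to sum to one.
raw = [1, 2, 3, 4, 3, 2, 1]
triangular = [r / sum(raw) for r in raw]
nu_triangular = nu_m(triangular)
```

Among weights on a fixed band, the simple average gives the largest $\nu_m$, so unequal weights such as the triangular ones yield a smaller value.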
Based on these results, asymptotic normal confidence intervals and tests for $\lambda(\omega)$ can be readily constructed. Similarly, for $\beta(\omega)$, asymptotic confidence ellipsoids and chi-square tests can be constructed; details can be found in Stoffer et al. (1993a, Theorems 3.1–3.3). As a note, we mention that this technique is not restricted to the use of sinusoids. In Stoffer et al. (1993b), the use of the Walsh basis of square-wave functions that take only the values ±1 is described.
A simple asymptotic test statistic for $\beta(\omega)$ can be obtained. Let $\hat H(\omega) = \hat f_Y^{re}(\omega) - \hat\lambda(\omega) I_k$, and

$$\xi_m(\omega) = \frac{\sqrt{2}\, \nu_m\, \hat f_Y^{re}(\omega)^{-1/2}\, \hat H(\omega) \left[\hat\beta(\omega) - \beta(\omega)\right]}{\hat\lambda(\omega)^{1/2}}.$$

Then,

$$\xi_m(\omega)'\, \xi_m(\omega) \tag{13.8}$$

converges ($m \to \infty$) in distribution to a distribution that is stochastically less than $\chi^2_k$ and stochastically greater than $\chi^2_{k-1}$. Note that the test statistic (13.8) is zero if $\beta(\omega)$ is replaced by $\hat\beta(\omega)$. One can check whether or not a particular element of $\beta(\omega)$ is zero by inserting $\hat\beta(\omega)$ in for $\beta(\omega)$, but with the particular element zeroed out and the resulting vector rescaled to be of unit length, into (13.8).

Footnote (on the Walsh basis): The Walsh functions are a completion of the Haar functions; a summary of their use in statistics is given in Stoffer (1991).
Significance thresholds for a consistent spectral envelope estimate can easily be computed using the following approximations. Using a first-order Taylor expansion, we have

$$\log \hat\lambda(\omega) \approx \log \lambda(\omega) + \frac{\hat\lambda(\omega) - \lambda(\omega)}{\lambda(\omega)},$$

so that ($n, m \to \infty$)

$$\nu_m \left[\log \hat\lambda(\omega) - \log \lambda(\omega)\right] \sim AN(0, 1). \tag{13.9}$$

It also follows that $E[\log \hat\lambda(\omega)] \approx \log \lambda(\omega)$ and $\mathrm{Var}[\log \hat\lambda(\omega)] \approx \nu_m^{-2}$. If there is no signal present in a sequence of length $n$, we expect $\lambda(j/n) \approx 2/n$ for $1 < j < n/2$, and hence approximately $(1 - \alpha) \times 100\%$ of the time, $\log \hat\lambda(\omega)$ will be less than $\log(2/n) + (z_\alpha/\nu_m)$, where $z_\alpha$ is the $(1 - \alpha)$ upper tail cutoff of the standard normal distribution. Exponentiating, the $\alpha$ critical value for $\hat\lambda(\omega)$ becomes $(2/n)\exp(z_\alpha/\nu_m)$. Although this method is a bit crude, from our experience, thresholding at very small $\alpha$-levels (say, $\alpha = 10^{-4}$ to $10^{-6}$, depending on the size of $n$) works well. Some further insight into choosing $\alpha$ will be given in the numerical examples.
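The critical value $(2/n)\exp(z_\alpha/\nu_m)$ takes one line to compute. The sketch below uses only the Python standard library; the function name `envelope_threshold` is hypothetical, and the smoothing choice (a simple average with $m = 3$, hence $\nu_m = \sqrt{7}$) is an assumption made purely for illustration. The length $n = 3954$ is that of the coding sequence analyzed in the numerical examples.

```python
import math
from statistics import NormalDist

def envelope_threshold(n, nu_m, alpha):
    """Approximate alpha critical value (2/n) * exp(z_alpha / nu_m)."""
    # z_alpha is the standard normal quantile with upper tail probability alpha.
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return (2.0 / n) * math.exp(z_alpha / nu_m)

n = 3954                      # sequence length used in the numerical examples
nu = math.sqrt(2 * 3 + 1)     # assumed: simple average with m = 3

threshold_1e4 = envelope_threshold(n, nu, alpha=1e-4)
threshold_1e6 = envelope_threshold(n, nu, alpha=1e-6)
```

Smaller $\alpha$ raises the threshold, and heavier smoothing (larger $\nu_m$) pulls it back toward the no-signal level $2/n$.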
13.5 Numerical Examples
As a simple example of the kind of analysis that can be accomplished, we consider the gene BNRF1 (bp 1736–5689) of the EBV. Since we are considering the nucleotide sequence consisting of four bp, we use the following indicator vectors to represent the data:

$$Y_t = (1, 0, 0)' \ \text{ if } X_t = A; \qquad Y_t = (0, 1, 0)' \ \text{ if } X_t = C;$$
$$Y_t = (0, 0, 1)' \ \text{ if } X_t = G; \qquad Y_t = (0, 0, 0)' \ \text{ if } X_t = T,$$

so that the scale for the thymine nucleotide, T, is set to zero. Figure 13.4 shows the spectral envelope estimate of the entire coding sequence (3954 bp long). The figure also shows a strong signal at frequency 1/3; the corresponding optimal scaling was A = 0.10, C = 0.61, G = 0.78, T = 0, which indicates that the signal is in the strong–weak bonding alphabet, S = {C, G} and W = {A, T}.
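The indicator coding and the resulting scaled series $X_t(\beta) = \beta' Y_t$ can be illustrated with a short standard-library sketch. The function names `encode` and `scale` and the eight-letter fragment are made up for illustration (the fragment is not taken from BNRF1); the scaling values are the ones reported above for frequency 1/3.

```python
# Indicator vectors from the text; T is the reference state with scale zero.
INDICATORS = {
    "A": (1, 0, 0),
    "C": (0, 1, 0),
    "G": (0, 0, 1),
    "T": (0, 0, 0),
}

def encode(sequence):
    """Map a nucleotide string to its list of 3-dimensional indicator vectors Y_t."""
    return [INDICATORS[base] for base in sequence.upper()]

def scale(sequence, beta):
    """Return the scaled series X_t(beta) = beta' Y_t."""
    return [sum(b * y for b, y in zip(beta, Y)) for Y in encode(sequence)]

# Scaling reported in the text at frequency 1/3 (A, C, G); T is held at 0.
beta_third = (0.10, 0.61, 0.78)

toy = "ACGTTGCA"              # hypothetical fragment for illustration only
scaled = scale(toy, beta_third)
```

With this scaling, C and G map to similar large values while A and T map to values near zero, which is the numerical form of the strong–weak grouping.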
In Shumway and Stoffer (2011, Example 7.18), there is evidence that the gene is not homogeneous, and in fact, the last fourth of the gene is unlike the first three quarters of the gene. For example, Figure 13.5 shows a dynamic spectral envelope with a block size of 500. Precise details of the analysis can be found in the R code, Section 13.8. It is immediately
Footnote (on the strong–weak bonding alphabet): W refers to adenine (A) or thymine (T) for the weak hydrogen bonding interaction between the base pairs. S refers to guanine (G) or cytosine (C) for the strong hydrogen bonding interaction between the base pairs.