13
Spectral Analysis of Qualitative Time Series
David Stoffer
CONTENTS
13.1 Introduction
13.2 Scaling Categorical Time Series
13.3 Definition of Spectral Envelope
13.4 Estimation
13.5 Numerical Examples
13.6 Enveloping Spectral Surfaces
13.7 Principal Components
13.8 R Code
References
13.1 Introduction
Qualitative-valued time series are frequently encountered in diverse applications such as
economics, medicine, psychology, geophysics, and genomics, to mention a few. The fact
that the data are categorical does not preclude the need to extract pertinent information in
the same way that is done with quantitative-valued time series. One particular area that
was neglected was the frequency domain, or spectral analysis, of categorical time series.
In this chapter, we explore an approach based on scaling and the spectral envelope, which
was introduced by Stoffer et al. (1993a).
First, we discuss the concept of scaling categorical variables, and then we use the idea
to develop spectral analysis of qualitative time series. In doing so, the spectral envelope
and optimal scaling are introduced, and their properties are discussed. The spectral envelope
and the corresponding optimal scaling are population concepts; consequently, efficient
estimation is presented. Pertinent theoretical results are also summarized. Examples of using
the methodology on a DNA sequence are given. The examples include a piecewise analysis
of a gene in the Epstein–Barr virus (EBV). Often, one collects qualitative-valued time series
in experimental designs. This problem is also explored as we discuss the analysis of
replicated series that depend on covariates. The spectral envelope is intimately associated with
the concept of principal component analysis of time series, and this relationship is explored
in a separate section. Finally, we list an R script that can be used to calculate the spectral
envelope and optimal scalings. The script can also be used to perform the aforementioned
dynamic analysis.
288 Handbook of Discrete-Valued Time Series
13.2 Scaling Categorical Time Series
Our work on the spectral envelope was motivated by collaborations with researchers who
collected categorical-valued time series with an interest in the cyclic behavior of the data.
For example, Table 13.1 shows the per minute sleep state of an infant taken from a study on
the effects of prenatal exposure to alcohol. Details can be found in Stoffer et al. (1988), but
briey, an electroencephalographic () sleep recording of approximately 2 hours (h) is
obtained on a full-term infant 24–36 h after birth, and the recording is scored by a pediatric
neurologist for sleep state. There are two main types of sleep, non-rapid eye movement
(on-), also known as quiet sleep and rapid eye movement (), also known as active
sleep. In addition, there are four stages of on- (NR1NR4), with NR1 being the “most
active” of the four states, and nally awake (AW), which naturally occurs briey through
the night. This particular infant was never awake during the study.
It is not too difficult to notice a pattern in the data if one concentrates on REM versus
non-REM sleep states. But it would be difficult to try to assess patterns in a longer sequence,
or if there were more categories, without some graphical aid. One simple method would
be to scale the data, that is, assign numerical values to the categories and then draw a time plot
of the scales. Since the states have an order, one obvious scaling is as follows:
NR4 = 1, NR3 = 2, NR2 = 3, NR1 = 4, REM = 5, AW = 6, (13.1)
and Figure 13.1 (often referred to as a hypnogram) shows the time plot using this scaling.
Another interesting scaling might be to combine the quiet states and the active states:
NR4 = NR3 = NR2 = NR1 = 0, REM = 1, AW = 2. (13.2)
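As a minimal sketch of what scaling means in practice, the following Python fragment applies the scalings (13.1) and (13.2) to a short, made-up run of sleep states. The names SCALING_13_1, SCALING_13_2, and scale_series, as well as the toy sequence, are illustrative assumptions and are not taken from the chapter's R script.

```python
# Scalings (13.1) and (13.2) written as lookup tables.
SCALING_13_1 = {"NR4": 1, "NR3": 2, "NR2": 3, "NR1": 4, "REM": 5, "AW": 6}
SCALING_13_2 = {"NR4": 0, "NR3": 0, "NR2": 0, "NR1": 0, "REM": 1, "AW": 2}


def scale_series(states, scaling):
    """Replace each category label with its assigned numerical value."""
    return [scaling[s] for s in states]


# Hypothetical toy sequence (NOT the data of Table 13.1).
toy_states = ["REM", "REM", "NR2", "NR4", "NR4", "NR1", "REM"]

scaled_ordered = scale_series(toy_states, SCALING_13_1)    # [5, 5, 3, 1, 1, 4, 5]
scaled_collapsed = scale_series(toy_states, SCALING_13_2)  # [1, 1, 0, 0, 0, 0, 1]
```

Either list can then be drawn as a time plot; the second one keeps only the distinction between quiet sleep, active sleep, and wakefulness.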
TABLE 13.1
Per Minute Infant EEG Sleep States (Read Down and Across)
REM NR2 NR4 NR2 NR1 NR2 NR3 NR4 NR1 NR1 REM
REM REM NR4 NR1 NR1 NR2 NR4 NR4 NR1 NR1 REM
REM REM NR4 NR1 NR1 REM NR4 NR4 NR1 NR1 REM
REM NR3 NR4 NR1 REM REM NR4 NR4 NR1 NR1 REM
REM NR4 NR4 NR1 REM REM NR4 NR4 NR1 NR1 REM
REM NR4 NR4 NR1 REM REM NR4 NR4 NR1 NR1 REM
REM NR4 NR4 NR2 REM NR2 NR4 NR4 NR1 NR1 NR2
REM NR4 NR4 REM REM NR2 NR4 NR4 NR1 REM
NR2 NR4 NR4 NR1 REM NR2 NR4 NR4 NR1 REM
REM NR2 NR4 NR1 REM NR3 NR4 NR2 NR1 REM
The so-called “ordering” of sleep states is somewhat tenuous. For example, sleep does not progress through
these stages in sequence. For a typical normal healthy adult, sleep begins in stage NR1 and progresses into
stages NR2, NR3, and NR4. Sleep moves through these stages repeatedly before entering REM sleep. Moreover,
sleep typically transitions between REM and stage NR2. Sleep cycles through these stages approximately four
or five times throughout the night. On average, adults enter the REM stage approximately 90 min after falling
asleep. The first cycle of REM sleep might last only a short amount of time, but each cycle becomes longer.
FIGURE 13.1
Time plot of the EEG sleep state data in Table 13.1 using the scaling in (13.1). (Vertical axis: sleep state, NR4 to REM; horizontal axis: minute, 0 to 100.)
FIGURE 13.2
Periodogram of the EEG sleep state data in Table 13.1 based on the scaling in (13.1). The peak corresponds to a
frequency of approximately one cycle every 60 min. (Vertical axis: periodogram ordinates; horizontal axis: frequency, 0.0 to 0.5, with the peak marked at 1/60.)
The time plot using (13.2) would be similar to Figure 13.1 as far as the cyclic (in and out
of REM sleep) behavior of this infant’s sleep pattern. Figure 13.2 shows the periodogram of
the sleep data using the scaling in (13.1). Note that there is a large peak at the frequency
corresponding to one cycle every 60 min. As one might imagine, the general appearance of
the periodogram using the scaling (13.2) (not shown) is similar to Figure 13.2. Most of us
would feel comfortable with this analysis even though we made an arbitrary and ad hoc
choice about the particular scaling. It is evident from the data (without any scaling) that,
if the interest is in infant’s sleep cycling, this particular sleep study indicates that an infant
cycles between REM and non-REM sleep at a rate of about one cycle per hour.
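To make the periodogram step concrete, here is a small sketch in plain Python that computes periodogram ordinates of a scaled series at the Fourier frequencies and locates the peak. The function name periodogram and the toy series are assumptions made for illustration; the series is an artificial 10-minute cycle, not the data of Table 13.1, and the normalization (squared modulus of the DFT divided by n) is one common convention.

```python
import cmath
import math


def periodogram(x):
    """Return [(frequency, ordinate)] at the Fourier frequencies k/n, k = 1..n//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]  # remove the mean before transforming
    out = []
    for k in range(1, n // 2 + 1):
        dft = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                  for t, v in enumerate(centered))
        out.append((k / n, abs(dft) ** 2 / n))
    return out


# Hypothetical toy series: 5 min of REM followed by 5 min of NR4, repeated
# for 60 min and scaled as in (13.1) (REM = 5, NR4 = 1).
toy_scaled = ([5] * 5 + [1] * 5) * 6

ordinates = periodogram(toy_scaled)
peak_frequency, peak_value = max(ordinates, key=lambda fo: fo[1])
# The toy series repeats every 10 min, so the peak sits at 0.1 cycles per minute.
```

Applied to a properly scaled version of the sleep data, the same computation would produce the kind of display described for Figure 13.2.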
The intuition used in the previous example is lost when one considers a long DNA
sequence. Briefly, a DNA strand can be viewed as a long string of linked nucleotides. Each
nucleotide is composed of a nitrogenous base, a five-carbon sugar, and a phosphate group.
There are four different bases that can be grouped by size: the pyrimidines, thymine (T)
and cytosine (C), and the purines, adenine (A) and guanine (G). The nucleotides are linked
together by a backbone of alternating sugar and phosphate groups with the 5′ carbon of
one sugar linked to the 3′ carbon of the next, giving the string direction. DNA molecules
occur naturally as a double helix composed of polynucleotide strands with the bases facing
inward. The two strands are complementary, so it is sufficient to represent a DNA molecule
by a sequence of bases on a single strand. Thus, a strand of DNA can be represented as a
sequence of letters, termed base pairs (bp), from the finite alphabet {A, C, G, T}. The order
of the nucleotides contains the genetic information specific to the organism. Expression of
information stored in these molecules is a complex multistage process. One important task
is to translate the information stored in the protein-coding sequences (CDS) of the DNA.
A common problem in analyzing long DNA sequence data is identifying CDS that are
dispersed throughout the sequence and separated by noncoding regions (which make
up most of the DNA). Table 13.2 shows part of the EBV DNA sequence. The entire EBV
DNA sequence consists of approximately 172,000 bp.
One could try scaling according to the pyrimidine–purine alphabet, that is, A = G = 0
and C = T = 1, but this is not necessarily of interest for every CDS of EBV. There are
numerous possible alphabets of interest; for example, one might focus on the strong–weak
hydrogen bonding alphabet S = {C, G} = 0 and W = {A, T} = 1. While model calculations
as well as experimental data strongly agree that some kind of periodic signal exists in
certain DNA sequences, there is a large disagreement about the exact type of periodicity.
In addition, there is disagreement about which nucleotide alphabets are involved in the
signals; for example, compare Ioshikhes et al. (1996) with Satchwell et al. (1986).
If we consider the naive approach of arbitrarily assigning numerical values (scales) to
the categories and then proceeding with a spectral analysis, the result will depend on the
particular assignment of numerical values. The obvious problem of being arbitrary is illustrated
as follows: Suppose we observe the sequence ATCTACATG ..., then setting A = G = 0
TABLE 13.2
Part of the Epstein–Barr Virus DNA Sequence (Read Across and Down)
AGAATTCGTC TTGCTCTATT CACCCTTACT TTTCTTCTTG CCCGTTCTCT TTCTTAGTAT
GAATCCAGTA TGCCTGCCTG TAATTGTTGC GCCCTACCTC TTTTGGCTGG CGGCTATTGC
CGCCTCGTGT TTCACGGCCT CAGTTAGTAC CGTTGTGACC GCCACCGGCT TGGCCCTCTC
ACTTCTACTC TTGGCAGCAG TGGCCAGCTC ATATGCCGCT GCACAAAGGA AACTGCTGAC
ACCGGTGACA GTGCTTACTG CGGTTGTCAC TTGTGAGTAC ACACGCACCA TTTACAATGC
ATGATGTTCG TGAGATTGAT CTGTCTCTAA CAGTTCACTT CCTCTGCTTT TCTCCTCAGT
CTTTGCAATT TGCCTAACAT GGAGGATTGA GGACCCACCT TTTAATTCTC TTCTGTTTGC
ATTGCTGGCC GCAGCTGGCG GACTACAAGG CATTTACGGT TAGTGTGCCT CTGTTATGAA
ATGCAGGTTT GACTTCATAT GTATGCCTTG GCATGACGTC AACTTTACTT TTATTTCAGT
TCTGGTGATG CTTGTGCTCC TGATACTAGC GTACAGAAGG AGATGGCGCC GTTTGACTGT
TTGTGGCGGC ATCATGTTTT TGGCATGTGT ACTTGTCCTC ATCGTCGACG CTGTTTTGCA
GCTGAGTCCC CTCCTTGGAG CTGTAACTGT GGTTTCCATG ACGCTGCTGC TACTGGCTTT
CGTCCTCTGG CTCTCTTCGC CAGGGGGCCT AGGTACTCTT GGTGCAGCCC TTTTAACATT
GGCAGCAGGT AAGCCACACG TGTGACATTG CTTGCCTTTT TGCCACATGT TTTCTGGACA
CAGGACTAAC CATGCCATCT CTGATTATAG CTCTGGCACT GCTAGCGTCA CTGATTTTGG
GCACACTTAA CTTGACTACA ATGTTCCTTC TCATGCTCCT ATGGACACTT GGTAAGTTTT
CCCTTCCTTT AACTCATTAC TTGTTCTTTT GTAATCGCAG CTCTAACTTG GCATCTCTTT
TACAGTGGTT CTCCTGATTT GCTCTTCGTG CTCTTCATGT CCACTGAGCA AGATCCTTCT
GGCACGACTG TTCCTATATG CTCTCGCACT CTTGTTGCTA GCCTCCGCGC TAATCGCTGG
TGGCAGTATT TTGCAAACAA ACTTCAAGAG TTTAAGCAGC ACTGAATTTA TACCCAGTGA
S refers to guanine (G) or cytosine (C) for the strong hydrogen bonding interaction between the base pairs.
W refers to adenine (A) or thymine (T) for the weak hydrogen bonding interaction between the base pairs.
and C = T = 1 yields the numerical sequence 011101010 ..., which is not very interesting.
However, if we use the strong–weak bonding alphabet, W = {A, T} = 0 and S = {C, G} = 1,
then the sequence becomes 001001001 ..., which is very interesting. It should be clear, then,
that one does not want to focus on only one scaling. Instead, the focus should be on finding
scalings that bring out all of the interesting features in the data. Rather than choose values
arbitrarily, the spectral envelope approach selects scales that help emphasize any periodic
feature that exists in a categorical time series of virtually any length in a quick and automated
fashion. In addition, the technique can help in determining whether a sequence is
merely a random assignment of categories.
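The dependence on the chosen alphabet can be checked directly on the nine bases quoted above. The sketch below, in plain Python, applies the pyrimidine–purine and strong–weak scalings to ATCTACATG and compares the periodogram ordinate at one cycle every 3 bp. The helper names scale and periodogram_at are assumptions introduced for illustration, and only the nine bases given in the text are used (the continuation indicated by the ellipsis is not guessed).

```python
import cmath
import math

sequence = "ATCTACATG"  # the short example sequence from the text

# Two of the alphabets discussed above, written as scalings.
PYRIMIDINE_PURINE = {"A": 0, "G": 0, "C": 1, "T": 1}
STRONG_WEAK = {"A": 0, "T": 0, "C": 1, "G": 1}


def scale(seq, scaling):
    """Map each base to its numerical value under the given scaling."""
    return [scaling[base] for base in seq]


def periodogram_at(x, k):
    """Periodogram ordinate of x at the Fourier frequency k/n (k not 0).

    No mean removal is needed here: at a nonzero Fourier frequency a
    constant shift contributes nothing to the DFT.
    """
    n = len(x)
    dft = sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
    return abs(dft) ** 2 / n


x_pp = scale(sequence, PYRIMIDINE_PURINE)
x_sw = scale(sequence, STRONG_WEAK)

print("".join(map(str, x_pp)))  # 011101010
print("".join(map(str, x_sw)))  # 001001001

# Power at one cycle every 3 bp (k = 3 when n = 9).
power_pp = periodogram_at(x_pp, 3)
power_sw = periodogram_at(x_sw, 3)
```

For these nine bases the strong–weak scaling concentrates far more power at frequency 1/3 than the pyrimidine–purine scaling does, which is the sense in which the second numerical sequence is the interesting one.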
13.3 Denition of Spectral Envelope
As a general description, the spectral envelope is a frequency-based, principal component
technique applied to a multivariate time series. In this section, we will focus on the basic
concept and its use in the analysis of categorical time series. Technical details can be found
in Stoffer et al. (1993a).
Briey, in establishing the spectral envelope for categorical time series, we addressed
the basic question of how to efciently discover periodic components in categorical
time series. This was accomplished via nonparametric spectral analysis as follows. Let
{X
t
; t = 0, ±1, ±2, ...} be a categorical-valued time series with nite state-space C =
{c
1
, c
2
, ..., c
k+1
}. Assume that X
t
is stationary and p
j
= Pr{X
t
=c
j
}> 0for j =1, 2, ..., k + 1.
For β =(β
1
, β
2
, ..., β
k+1
)
R
k+1
, denote by X
t
(β) the real-valued stationary time series
corresponding to the scaling that assigns the category c
j
the numerical value β
j
,for
j = 1, 2, ..., k + 1. Our goal was to nd scaling β so that the spectral density is in some
sense interesting and to summarize the spectral information by what we called the spectral
envelope.
We chose $\beta$ to maximize the power (variance) at each frequency $\omega$, across frequencies
$\omega \in (-1/2, 1/2]$, relative to the total power $\sigma^2(\beta) = \mathrm{Var}\{X_t(\beta)\}$. That is, we chose $\beta(\omega)$, at
each $\omega$ of interest, so that

$$\lambda(\omega) = \sup_{\beta} \left\{ \frac{f(\omega; \beta)}{\sigma^2(\beta)} \right\}, \tag{13.3}$$

over all $\beta$ not proportional to $1_{k+1}$, the $(k+1) \times 1$ vector of ones. Note that $\lambda(\omega)$ is not defined
if $\beta = a 1_{k+1}$ for $a \in \mathbb{R}$ because such a scaling corresponds to assigning each category
the same value $a$; in this case, $f(\omega; \beta) \equiv 0$ and $\sigma^2(\beta) = 0$. The optimality criterion $\lambda(\omega)$
possesses the desirable property of being invariant under location and scale changes of $\beta$.
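The invariance property, and the reason constant scalings are excluded, can be illustrated numerically. The sketch below is plain Python and rests on a loud assumption: the raw periodogram divided by the sample variance is used as a crude sample stand-in for the ratio $f(\omega; \beta)/\sigma^2(\beta)$ in (13.3). It is not the estimator developed later in the chapter, and the function names, the three-category series, and the scalings are all hypothetical.

```python
import cmath
import math


def normalized_power(x, k):
    """Periodogram ordinate at frequency k/n divided by the sample variance.

    A crude sample stand-in (an assumption of this sketch) for the ratio
    f(omega; beta) / sigma^2(beta) that appears inside the sup in (13.3).
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    variance = sum(v * v for v in centered) / n
    if variance == 0:
        # Every category received the same value: the ratio is 0/0.
        raise ValueError("constant scaling: the ratio is not defined")
    dft = sum(v * cmath.exp(-2j * math.pi * k * t / n)
              for t, v in enumerate(centered))
    return (abs(dft) ** 2 / n) / variance


def apply_scaling(categories, beta):
    """Assign each category its numerical value under the scaling beta."""
    return [beta[c] for c in categories]


# Hypothetical three-category series and scaling.
categories = list("abcabcaabbcc")
beta = {"a": 0.0, "b": 1.0, "c": 3.0}
# Same scaling after a scale change (times 2.5) and a location change (minus 7).
shifted_rescaled = {c: 2.5 * v - 7.0 for c, v in beta.items()}

r1 = normalized_power(apply_scaling(categories, beta), 4)
r2 = normalized_power(apply_scaling(categories, shifted_rescaled), 4)
# r1 and r2 agree up to floating-point rounding.
```

The shift drops out when the mean is removed, and the scale factor multiplies numerator and denominator by the same squared constant, so the ratio is unchanged.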
As in most scaling problems for categorical data, it was useful to represent the categories
in terms of the vectors $e_1, e_2, \ldots, e_{k+1}$, where $e_j$ represents the $(k+1) \times 1$ vector with one
in the $j$th row and zeros elsewhere. We then defined a $(k+1)$-dimensional stationary time
series $Y_t$ by $Y_t = e_j$ when $X_t = c_j$. The time series $X_t(\beta)$ can be obtained from the $Y_t$ time
series by the relationship $X_t(\beta) = \beta' Y_t$. Assume that the vector process $Y_t$ has a continuous
spectral density denoted by $f_Y(\omega)$. For each $\omega$, $f_Y(\omega)$ is, of course, a $(k+1) \times (k+1)$
complex-valued Hermitian matrix. Note that the relationship $X_t(\beta) = \beta' Y_t$ implies that