370 ◾ Jon Atle Gulla et al.
worry about dates of documents. In some cases, though, dates may severely affect
document retrieval and ranking; for example, a user may be seeking only published
information or information from particular time periods, or may post queries that
use terms that were not in use at the time of publication, even though the
document content is relevant to the query.
One observation with ranking based on cosine similarity and PageRank is
that older documents tend to receive higher scores than newer ones. Since older
documents are more likely to be known to a large community, often more links
lead to them, giving them a higher PageRank and thereby higher relevance score
than recently published documents. For developing domains like scientific
disciplines, in which new information tends to be more reliable and relevant than
old information, this effect of PageRank may be unfortunate. Yu et al. (2004)
present a timed page rank score TPR that boosts documents that have been
recently published:
$$\mathrm{TPR}(A) = \mathrm{Aging}(A) \cdot PR_T(A)$$

$$PR_T(A) = (1 - d) + d \left( \frac{w_1 \, PR_T(p_1)}{C(p_1)} + \cdots + \frac{w_n \, PR_T(p_n)}{C(p_n)} \right)$$

where $PR_T(A)$ is the time-weighted PageRank score of paper A, $PR_T(p_i)$ is the
time-weighted PageRank score of paper $p_i$ that links to paper A, $C(p_i)$ is the
number of outbound links of paper $p_i$, and d is a dampening factor set to 0.85.
A time weight, $w_i$, has a value that reduces exponentially with citation age.
Aging(A) is an aging
factor for paper A set to 1 for brand new papers, and which declines linearly with
time down to 0.5. Initial experiments with papers about particle physics suggest
that the new score better reflects the relevance of new papers and better captures
the likelihood that later papers include citations to them.
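
The timed score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Yu et al.: the function names, the decay rate `decay`, the fixed iteration count, the 10-year `horizon` for the aging factor, and the choice to measure citation age from the publication year of the citing paper are all hypothetical. The source states only that the weights decay exponentially and that the aging factor falls linearly from 1 to 0.5.

```python
import math


def timed_pagerank(citations, pub_year, now, d=0.85, decay=0.5, iters=50):
    """Iterate PR_T(A) = (1 - d) + d * sum_i w_i * PR_T(p_i) / C(p_i).

    citations: dict mapping each citing paper to the list of papers it cites.
    pub_year:  dict mapping every paper to its publication year.
    decay and iters are illustrative assumptions, not values from the source.
    """
    papers = set(citations)
    for targets in citations.values():
        papers.update(targets)

    # C(p): number of outbound links (citations made) by paper p.
    out_count = {p: len(citations.get(p, [])) for p in papers}

    # For every cited paper, collect (citing paper, time weight w_i).
    incoming = {p: [] for p in papers}
    for src, targets in citations.items():
        # Assumption: a citation is as old as the paper that makes it.
        citation_age = now - pub_year[src]
        w = math.exp(-decay * citation_age)
        for dst in targets:
            incoming[dst].append((src, w))

    pr = {p: 1.0 for p in papers}
    for _ in range(iters):
        pr = {
            p: (1 - d) + d * sum(w * pr[s] / out_count[s] for s, w in incoming[p])
            for p in papers
        }
    return pr


def aging(age, horizon=10.0):
    """Linear decline from 1.0 (brand new) to a floor of 0.5 at `horizon` years."""
    return max(0.5, 1.0 - 0.5 * age / horizon)


def timed_scores(citations, pub_year, now):
    """TPR(A) = Aging(A) * PR_T(A) for every paper."""
    pr_t = timed_pagerank(citations, pub_year, now)
    return {p: aging(now - pub_year[p]) * pr_t[p] for p in pr_t}
```

With two papers citing the same target, the more recent citation contributes more to the target's score because its weight is closer to 1; an old citation from an equally ranked paper counts for much less.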
More generally, a user may want to retrieve documents from particular time
periods. Systems like GEIN support this by storing temporal data as part of the
metadata of documents (see Figure 13.17). Every document or catalog item in
GEIN is associated with a number of valid time periods, a number of semantic
categories, a number of geographical locations, and a specific item class (Tochtermann
et al., 1997). When searching for information sources, a user may add a temporal
restriction to the query that specifies the relevant period, e.g., all documents about
emission levels in Munich from 1990 to 1995. Grandi et al. (2005) developed a
similar approach, but included a more fine-grained temporal system with attributes
for validity time, efficacy time, transaction time, and publication time.
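
A temporal restriction of this kind amounts to an interval-overlap test against the time periods stored in the metadata. The Python sketch below illustrates the idea; the `CatalogItem` record, its field names, and the `temporal_search` function are hypothetical and do not reflect the actual GEIN schema or query interface. It also assumes that a document matches when any of its valid periods overlaps the queried period; a stricter system could instead require full containment.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogItem:
    """Hypothetical catalog record; field names are illustrative only."""
    title: str
    valid_periods: list = field(default_factory=list)  # [(start, end), ...]
    locations: set = field(default_factory=set)
    categories: set = field(default_factory=set)


def overlaps(period, start, end):
    """True if the closed interval `period` intersects [start, end]."""
    p_start, p_end = period
    return p_start <= end and start <= p_end


def temporal_search(items, start, end, location=None):
    """Return items with at least one valid period overlapping [start, end]."""
    hits = []
    for item in items:
        if location is not None and location not in item.locations:
            continue
        if any(overlaps(p, start, end) for p in item.valid_periods):
            hits.append(item)
    return hits
```

For the query in the text, one would call `temporal_search(items, date(1990, 1, 1), date(1995, 12, 31), location="Munich")` over a list of such records.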
If a document date is not known, it may be feasible to estimate it using tem-
poral mining or information extraction. Statistical techniques for estimating dates
of documents were suggested by Kanhabua and Nørvåg (2009) and de Jong et al.
(2005). These methods rely on extensive training data and are suitable for documents
that contain no direct or indirect time indications. Sometimes temporal phrases in