As discussed in Chapter 9, Advanced Machine Learning with Streaming and Graph Data, in this section we will see how to configure the Spark-TS package developed by Cloudera. Mainly, we will talk about the TimeSeriesRDD.
Time series data consists of sequences of measurements, each occurring at a point in time. A variety of terms are used to describe time series data, and many of them apply to conflicting or overlapping concepts. In the interest of clarity, in Spark-TS, Cloudera sticks to a particular vocabulary in which three objects are important: a time series is an array of double values, each tied to a timestamp; an instant is the vector of values across all the series in a collection at a single point in time; and an observation is a tuple of (timestamp, key, value). The timestamps themselves are represented by the DateTimeIndex class, as shown at https://github.com/sryza/spark-timeseries/blob/master/src/main/scala/com/cloudera/sparkts/DateTimeIndex.scala. However, not all data with timestamps is time series data. For example, logs do not fit directly into time series, since they consist of discrete events rather than scalar measurements taken at intervals; a count of log messages per hour, on the other hand, would constitute a time series.
The most straightforward way to access Spark-TS from Scala is to depend on it in a Maven project. Do this by including the following repo in the pom.xml:

<repositories>
  <repository>
    <id>cloudera-repos</id>
    <name>Cloudera Repos</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

Then include the following dependency, also in the pom.xml:

<dependency>
  <groupId>com.cloudera.sparkts</groupId>
  <artifactId>sparkts</artifactId>
  <version>0.1.0</version>
</dependency>
To get the raw pom.xml file, interested readers should go to https://github.com/sryza/spark-timeseries/blob/master/pom.xml.
Alternatively, to access it in a spark-shell, download the JAR from https://repository.cloudera.com/cloudera/libs-release-local/com/cloudera/sparkts/sparkts/0.1.0/sparkts-0.1.0-jar-with-dependencies.jar, and then launch the shell with the following command as discussed in the Using external libraries with Spark Core section in this chapter:
spark-shell --jars sparkts-0.1.0-jar-with-dependencies.jar --driver-class-path sparkts-0.1.0-jar-with-dependencies.jar
According to the Spark-TS engineering blog on the Cloudera website at http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/, the TimeSeriesRDD is central to Spark-TS: each record in the RDD stores a full univariate series. This layout makes operations that apply exclusively to time series much more efficient. For example, if you want to generate a set of lagged time series from your original collection of time series, each lagged series can be computed just by looking at a single record in the input RDD.
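This single-record property can be illustrated with a plain-Scala sketch. Note that this is deliberately not the sparkts API: the `lag` helper, the string key, and the bare Array[Double] (standing in for a Breeze vector) are simplifications for illustration only.

```scala
// Each record carries a key plus the complete univariate series, so a
// lag-1 copy is computed from that one record alone -- no shuffle or
// lookup into other records is required.
def lag(series: Array[Double], k: Int): Array[Double] =
  Array.fill(k)(Double.NaN) ++ series.dropRight(k)

// A small local stand-in for a distributed collection of (key, series) pairs:
val collection = Seq(("price", Array(1.0, 2.0, 3.0, 4.0)))

// Lagging maps each record independently; the first k positions have no
// predecessor and are marked NaN.
val lagged = collection.map { case (key, vals) => (s"lag1($key)", lag(vals, 1)) }
// lagged: Seq(("lag1(price)", Array(NaN, 1.0, 2.0, 3.0)))
```

In the real library, the per-record transformation would run in parallel across the cluster, but the computation for each series still touches only its own array.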
Similarly, when imputing missing values based on surrounding values, or fitting time series models to each series, all of the data needed is present in a single array. The central abstraction of the Spark-TS library is therefore the TimeSeriesRDD, which is simply a collection of time series on which you can operate in a distributed fashion. This approach allows you to avoid storing timestamps for each series and instead store a single DateTimeIndex to which all the series vectors conform. TimeSeriesRDD[K] extends RDD[(K, Vector[Double])], where K is the key type (usually a string) and the second element in the tuple is a Breeze vector representing the time series.
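The same locality argument applies to missing-value imputation. The sketch below is plain Scala, not sparkts's own fill implementation; it only shows that linear interpolation over one series can be done entirely within a single array, which is why the one-record-per-series layout keeps such operations cheap.

```scala
// Linearly interpolate interior NaN gaps in a single series.
// Everything happens inside one array -- no other records are consulted.
def fillLinear(series: Array[Double]): Array[Double] = {
  val out = series.clone()
  for (i <- out.indices if out(i).isNaN) {
    // Nearest non-missing neighbor before i (in the partially filled output)
    // and after i (in the original values).
    val prev = (i - 1 to 0 by -1).find(j => !out(j).isNaN)
    val next = (i + 1 until series.length).find(j => !series(j).isNaN)
    (prev, next) match {
      case (Some(p), Some(n)) =>
        val frac = (i - p).toDouble / (n - p)
        out(i) = out(p) + frac * (series(n) - out(p))
      case _ => // leading or trailing gaps have no two anchors; leave as NaN
    }
  }
  out
}
```

For example, fillLinear(Array(1.0, Double.NaN, 3.0)) fills the gap with 2.0.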
A more technical discussion can be found on GitHub at https://github.com/sryza/spark-timeseries. Since Spark-TS is a third-party package, a more detailed discussion is beyond the scope of this book.