Time series analysis

While the IMDb data contained movie release years, the objects of interest were fundamentally the individual films and their ratings, not a linked series of events over time that might be correlated with one another. This latter type of data – a time series – raises a different set of questions. Are data points correlated with one another? If so, over what timeframe are they correlated? How noisy is the signal? Pandas DataFrames have many built-in tools for time series analysis, which we will examine in the following sections.

Cleaning and converting

In our previous example, we were able to use the data more or less in the form in which it was supplied, but there is no guarantee that this will always be the case. In our second example, we'll look at a time series of oil prices in the US by year over the last century (Makridakis, Spyros, Steven C. Wheelwright, and Rob J. Hyndman. Forecasting Methods and Applications, John Wiley & Sons, Inc., New York (1998)). We'll start again by loading this data into the notebook and inspecting it visually using tail(), by typing:

>>> oil_prices = pd.read_csv('oil.csv')
>>> oil_prices.tail()

This gives the following output:


The last row is unexpected, since it does not look like a year at all. In fact, it is a footer comment in the spreadsheet. As it is not actually part of the data, we need to remove it from the dataset, which we can do with the following command:

>>> oil_prices = oil_prices[~np.isnan(oil_prices[oil_prices.columns[1]])] 

This removes from the dataset any rows in which the second column is NaN (not a correctly formatted number). We can verify that we have cleaned up the dataset by using the tail() command again.
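The effect of this filter can be sketched on a small, hypothetical frame (the values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the oil data: a spreadsheet footer row produces a NaN price
toy = pd.DataFrame({
    'Year': ['1995', '1996', 'Source: spreadsheet footer'],
    'Price': [16.86, 20.29, np.nan],
})

# Keep only rows where the second column is a valid number, as in the text
clean = toy[~np.isnan(toy[toy.columns[1]])]
print(clean['Year'].tolist())  # → ['1995', '1996']
```

The boolean mask indexes the frame directly, so the footer row is dropped without renumbering the surviving rows.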

The second aspect of this data that we would like to clean up is the format. If we look at the format of the columns using:

>>> oil_prices.dtypes

we see that the year is not by default interpreted as a Python date type:


We would like the Year column to be a Python date type. Pandas provides the built-in capability to perform this conversion using the convert_objects() command:

>>> oil_prices = oil_prices.convert_objects(convert_dates='coerce')

At the same time, we can rename the price column to something a little less verbose using the rename() command:

>>> oil_prices.rename(columns={oil_prices.columns[1]: 'Oil_Price_1997_Dollars'}, inplace=True)

We can then verify with the head() command that the output shows these changes:

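Note that convert_objects() was deprecated and later removed in newer pandas releases; pd.to_datetime with errors='coerce' performs the same conversion. A sketch of the conversion and rename on a toy frame (column names and values invented):

```python
import pandas as pd

# Hypothetical stand-in for the oil data
toy = pd.DataFrame({'Year': ['1995', '1996'],
                    'Price of a barrel of oil in 1997 dollars': [16.86, 20.29]})

# Modern equivalent of convert_objects(convert_dates='coerce'):
# unparseable values become NaT instead of raising an error
toy['Year'] = pd.to_datetime(toy['Year'], errors='coerce')

# Rename the verbose price column, as in the text
toy.rename(columns={toy.columns[1]: 'Oil_Price_1997_Dollars'}, inplace=True)
print(list(toy.columns))  # → ['Year', 'Oil_Price_1997_Dollars']
```

After the conversion, toy.dtypes reports the Year column as datetime64, matching what the text achieves with convert_objects().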

We now have the data in a format in which we can start running some diagnostics on this time series.

Time series diagnostics

We can plot this data with the matplotlib commands covered in the previous section:

>>> oil_prices.plot(x='Year',y='Oil_Price_1997_Dollars')

This produces the following time series plot:

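The same plotting call can be sketched in self-contained form with invented values, using a non-interactive backend so it also runs outside the notebook:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside the notebook
import pandas as pd

# Hypothetical values standing in for the oil price series
toy = pd.DataFrame({'Year': [1995, 1996, 1997],
                    'Oil_Price_1997_Dollars': [16.86, 20.29, 18.64]})

# DataFrame.plot returns the matplotlib Axes, labeled from the x column
ax = toy.plot(x='Year', y='Oil_Price_1997_Dollars')
print(ax.get_xlabel())  # → Year
```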

There are a number of natural questions we might ask of this data. Are the fluctuations in oil prices per year completely random, or do year-by-year measurements correlate with one another? There seem to be some cycles in the data, but it is difficult to quantify the degree of this correlation. A visual tool we can use to help diagnose this feature is a lag_plot, which is available in Pandas using the following commands:

>>> from pandas.tools.plotting import lag_plot
>>> lag_plot(oil_prices.Oil_Price_1997_Dollars)

A lag plot simply plots a yearly oil price (x-axis) versus the oil price in the year immediately following it (y-axis). If there is no correlation, we would expect a circular cloud. The linear pattern here shows that there is some structure in the data, which fits with the fact that year-by-year prices go up or down. How strong is this correlation compared to expectation? We can use an autocorrelation plot to answer this question, using the following commands:

>>> from pandas.tools.plotting import autocorrelation_plot
>>> autocorrelation_plot(oil_prices['Oil_Price_1997_Dollars'])

This gives the following autocorrelation plot:


In this plot, the correlation between points at different lags (differences in years) is plotted along with the 95% (solid) and 99% (dashed) confidence interval lines for the range of correlation expected on random data. Based on this visualization, there appears to be significant correlation for lags of less than 10 years, which fits the approximate duration of the peak price periods in the first plot of this data above.
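The same diagnosis can be reproduced numerically: the confidence bands drawn by autocorrelation_plot are approximately z/√N around zero, and Series.autocorr(lag) gives the sample autocorrelation at a chosen lag (the same quantity the lag plot visualizes at lag 1). A sketch with a synthetic stand-in for the price series; note that in newer pandas these plotting helpers live in pandas.plotting rather than pandas.tools.plotting:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
# A random walk as a synthetic stand-in for the smoothly varying price series
series = pd.Series(np.cumsum(rng.randn(100)))

band95 = 1.96 / np.sqrt(len(series))  # approximate 95% band used by autocorrelation_plot

# Lags whose sample autocorrelation falls outside the band
significant = [lag for lag in range(1, 30) if abs(series.autocorr(lag)) > band95]
print(1 in significant)  # → True: a random walk is strongly autocorrelated at short lags
```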

Joining signals and correlation

Lastly, let us look at an example of comparing the oil price time series to another dataset, the number of car crash fatalities in the US for the given years (List of Motor Vehicle Deaths in U.S. by Year. Wikipedia. Wikimedia Foundation. Web. 02 May 2016. https://en.wikipedia.org/wiki/List_of_motor_vehicle_deaths_in_U.S._by_year).

We might hypothesize, for instance, that as the price of oil increases, consumers will on average drive less, leading to fewer car crash fatalities. Again, we will need to convert the Year column to date format, after first converting it from a number to a string, using the following commands:

>>> car_crashes=pd.read_csv("car_crashes.csv")
>>> car_crashes.Year=car_crashes.Year.astype(str)
>>> car_crashes=car_crashes.convert_objects(convert_dates='coerce') 

Checking the first few lines with the head() command confirms that we have successfully formatted the data:

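The string cast matters here: a bare numeric year would be misread as a timestamp in nanoseconds since the epoch. A sketch with pd.to_datetime, the modern replacement for convert_objects (the fatality counts below are invented):

```python
import pandas as pd

# Hypothetical stand-in for the crash data
toy = pd.DataFrame({'Year': [1995, 1996],
                    'Car_Crash_Fatalities_US': [41817, 42065]})

# Cast the numeric year to str so it parses as a calendar year, not an epoch offset
toy.Year = pd.to_datetime(toy.Year.astype(str), format='%Y', errors='coerce')
print(toy.Year.dt.year.tolist())  # → [1995, 1996]
```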

We can join this data to the oil price data and compare the two trends over time. Notice that we rescale the crash data by dividing by 1,000 so that both series can be easily viewed on the same axis:

>>> car_crashes['Car_Crash_Fatalities_US']=car_crashes['Car_Crash_Fatalities_US']/1000

We then use merge() to join the data, specifying the column used to match rows in each dataset with the on argument, and plot the result using:

>>> oil_prices_car_crashes = pd.merge(oil_prices,car_crashes,on='Year')
>>> oil_prices_car_crashes.plot(x='Year')

The resulting plot is shown below:

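The behavior of merge() is easy to see on two toy frames (values invented): with the default inner join, only years present in both frames survive.

```python
import pandas as pd

prices = pd.DataFrame({'Year': [1995, 1996, 1997], 'Price': [16.9, 20.3, 18.6]})
crashes = pd.DataFrame({'Year': [1996, 1997, 1998], 'Fatalities': [43.6, 43.5, 43.4]})

# Default inner join: rows are matched on Year, unmatched years are dropped
merged = pd.merge(prices, crashes, on='Year')
print(merged['Year'].tolist())  # → [1996, 1997]
```

Passing how='outer' would instead keep all years from both frames, filling the gaps with NaN.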

How correlated are these two signals? We can again use an autocorrelation plot to explore this question:

>>> autocorrelation_plot(oil_prices_car_crashes[['Car_Crash_Fatalities_US','Oil_Price_1997_Dollars']])

This gives:


So it appears that the correlation is outside the expected random fluctuation at lags of 20 years or less, a longer range of correlation than appears in the oil prices alone.
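To test the leading-indicator hypothesis more directly, we could also correlate one signal against shifted copies of the other and look for the lag with the strongest relationship. A sketch on synthetic data in which the second series follows the first with a known 3-step delay (all names and values invented):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2)
driver = pd.Series(rng.randn(300))
# A follower that tracks the driver with a 3-step delay, plus a little noise
follower = driver.shift(3) + 0.1 * rng.randn(300)

# Correlation between the two signals at each candidate lag
corrs = {lag: driver.corr(follower.shift(-lag)) for lag in range(6)}
best = max(corrs, key=corrs.get)
print(best)  # → 3: the built-in delay is recovered
```

Applied to the merged frame, the same scan over lags would indicate roughly how many years oil prices lead (or trail) the fatality counts.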

Tip

Working with large datasets

The examples we give in this section are of modest size. In real-world applications, we may deal with datasets that will not fit in memory on our computer, or require analyses that are so computationally intensive that they must be split across multiple machines to run in a reasonable timeframe. For these use cases, it may not be possible to use IPython Notebook in the form we have illustrated using Pandas DataFrames. A number of alternative applications are available for processing data at this scale, including PySpark (http://spark.apache.org/docs/latest/api/python/), H2O (http://www.h2o.ai/), and XGBoost (https://github.com/dmlc/xgboost). We can also use many of these tools through a notebook, and thus achieve interactive manipulation and modeling for extremely large data volumes.
