In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.
Let's start with ten hourly data points from a single day:
>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00    30
2015-04-29 09:00:00    27
2015-04-29 10:00:00    54
2015-04-29 11:00:00     9
2015-04-29 12:00:00    48
Freq: H, dtype: int64
If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:
>>> ts.resample('15min').head()
2015-04-29 08:00:00     30
2015-04-29 08:15:00    NaN
2015-04-29 08:30:00    NaN
2015-04-29 08:45:00    NaN
2015-04-29 09:00:00     27
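The call above relies on resample returning the upsampled series directly. Newer pandas releases return a deferred Resampler object instead, so the following is a minimal sketch of the same step under that assumption. The three fixed values are made up for reproducibility, and asfreq() is used to materialize the new index without filling anything.

```python
import pandas as pd

# Three hourly measurements with fixed (illustrative) values.
rng = pd.date_range('2015-04-29 08:00', periods=3, freq='h')
ts = pd.Series([30, 27, 54], index=rng)

# resample() alone only describes the regrouping; asfreq() builds the
# 15-minute index and leaves every slot without a measurement as NaN.
upsampled = ts.resample('15min').asfreq()

print(upsampled.head())
print(upsampled.isna().sum())  # number of slots without a measurement
```

Of the nine 15-minute slots between 08:00 and 10:00, only the three on the hour carry a measured value.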
There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:
>>> ts.resample('15min', fill_method='ffill').head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    30
2015-04-29 08:30:00    30
2015-04-29 08:45:00    30
2015-04-29 09:00:00    27
Freq: 15T, dtype: int64
>>> ts.resample('15min', fill_method='bfill').head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    27
2015-04-29 08:30:00    27
2015-04-29 08:45:00    27
2015-04-29 09:00:00    27
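In newer pandas releases the fill_method keyword is no longer accepted, and forward and backward filling are expressed as methods on the Resampler object. The sketch below assumes such a release and uses made-up fixed values so that the result can be checked by hand.

```python
import pandas as pd

# Three hourly measurements with fixed (illustrative) values.
rng = pd.date_range('2015-04-29 08:00', periods=3, freq='h')
ts = pd.Series([30, 27, 54], index=rng)

# Forward fill: each new slot repeats the last measurement before it.
forward = ts.resample('15min').ffill()

# Backward fill: each new slot takes the next measurement after it.
backward = ts.resample('15min').bfill()

print(forward.head())
print(backward.head())
```

Forward filling never looks into the future, which usually makes it the safer choice when the upsampled series feeds a model or a backtest.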
With the limit parameter, it is possible to control the number of missing values to be filled:
>>> ts.resample('15min', fill_method='ffill', limit=2).head()
2015-04-29 08:00:00     30
2015-04-29 08:15:00     30
2015-04-29 08:30:00     30
2015-04-29 08:45:00    NaN
2015-04-29 09:00:00     27
Freq: 15T, dtype: float64
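The same limited fill can be sketched against the Resampler interface of newer pandas releases, where limit is passed to ffill() directly. The fixed values are again an assumption made for reproducibility.

```python
import pandas as pd

# Three hourly measurements with fixed (illustrative) values.
rng = pd.date_range('2015-04-29 08:00', periods=3, freq='h')
ts = pd.Series([30, 27, 54], index=rng)

# Fill at most two consecutive gaps after each measurement; the third
# 15-minute slot in every hour stays NaN.
limited = ts.resample('15min').ffill(limit=2)

print(limited)
```

A limit is useful when a stale measurement should only be trusted for a bounded amount of time.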
If you want to adjust the labels during resampling, you can use the loffset keyword argument:
>>> ts.resample('15min', fill_method='ffill', limit=2, loffset='5min').head()
2015-04-29 08:05:00     30
2015-04-29 08:20:00     30
2015-04-29 08:35:00     30
2015-04-29 08:50:00    NaN
2015-04-29 09:05:00     27
Freq: 15T, dtype: float64
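The loffset keyword was later deprecated and removed from pandas. A minimal sketch of the equivalent label shift, assuming a release without loffset, is to add the offset to the index after resampling. The variable names and fixed values here are illustrative only.

```python
import pandas as pd

# Three hourly measurements with fixed (illustrative) values.
rng = pd.date_range('2015-04-29 08:00', periods=3, freq='h')
ts = pd.Series([30, 27, 54], index=rng)

filled = ts.resample('15min').ffill(limit=2)

# Shift every label by five minutes; the values themselves are untouched.
shifted = filled.copy()
shifted.index = filled.index + pd.Timedelta(minutes=5)

print(shifted.head())
```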
There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.
We can ask pandas to interpolate a time series for us:
>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00    30.00
2015-04-29 08:15:00    29.25
2015-04-29 08:30:00    28.50
2015-04-29 08:45:00    27.75
2015-04-29 09:00:00    27.00
Freq: 15T, dtype: float64
This is the default behavior of the interpolate method: a linear interpolation, in which pandas assumes a linear relationship between two neighboring existing points.
pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.
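As a small illustration of why the choice of method matters, the sketch below compares two methods that, to our knowledge, do not need scipy: 'linear', which treats the points as equally spaced, and 'time', which weights by the actual time distance. The irregular index and its values are invented for this example.

```python
import pandas as pd

# An irregularly spaced series with one missing value at 08:15.
idx = pd.to_datetime(['2015-04-29 08:00',
                      '2015-04-29 08:15',
                      '2015-04-29 09:00'])
gappy = pd.Series([30.0, None, 27.0], index=idx)

# 'linear' ignores the index: the gap sits halfway between its neighbors.
by_position = gappy.interpolate(method='linear')

# 'time' uses the timestamps: 08:15 is a quarter of the way from 08:00 to 09:00.
by_time = gappy.interpolate(method='time')

print(by_position.iloc[1])  # 28.5
print(by_time.iloc[1])      # 29.25
```

On a regularly spaced index, such as the 15-minute grid produced by resampling, both methods give the same result.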
While, by default, pandas objects are time zone unaware, many real-world applications will make use of time zones. As with working with time in general, time zones are no trivial matter: do you know which countries have daylight saving time, and do you know when the time zone is switched in those countries? Thankfully, pandas builds on the time zone capabilities of two popular and proven utility libraries for time and date handling: pytz and dateutil. By default, a timestamp carries no time zone information:
>>> t = pd.Timestamp('2000-01-01')
>>> t.tz is None
True
To supply time zone information, you can use the tz keyword argument:
>>> t = pd.Timestamp('2000-01-01', tz='Europe/Berlin')
>>> t.tz
<DstTzInfo 'Europe/Berlin' CET+1:00:00 STD>
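To make the effect of the tz argument more tangible, the following sketch compares the UTC offset of a winter and a summer timestamp in the same zone. The variable names are illustrative, and the exact type of the tz attribute may differ between pandas versions, so the example only inspects offsets.

```python
import pandas as pd

naive = pd.Timestamp('2000-01-01')
winter = pd.Timestamp('2000-01-01', tz='Europe/Berlin')
summer = pd.Timestamp('2000-07-01', tz='Europe/Berlin')

print(naive.tz is None)    # True
print(winter.utcoffset())  # 1:00:00 (standard time)
print(summer.utcoffset())  # 2:00:00 (daylight saving time)
```

The zone name stays the same all year; the daylight saving switch is resolved per timestamp by the underlying time zone library.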
This works for ranges as well:
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz='Europe/London')
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D', tz='Europe/London')
Time zone objects can also be constructed beforehand:
>>> import pytz
>>> tz = pytz.timezone('Europe/London')
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz=tz)
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D', tz='Europe/London')
Sometimes, you will already have a time zone unaware time series object that you would like to make time zone aware. The tz_localize method helps to switch between time zone aware and time zone unaware objects:
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D')
>>> ts = pd.Series(np.random.randn(len(rng)), rng)
>>> ts.index.tz is None
True
>>> ts_utc = ts.tz_localize('UTC')
>>> ts_utc.index.tz
<UTC>
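The sketch below illustrates what localizing does to the labels: the wall-clock times stay as they are and only gain an offset. It also shows that localizing an already aware series is rejected. The fixed values and the relocalize_failed flag are illustrative.

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=3, freq='D')
ts = pd.Series([1.0, 2.0, 3.0], index=rng)

# Attach a zone: midnight stays midnight, now interpreted as Berlin time.
ts_berlin = ts.tz_localize('Europe/Berlin')
print(ts_berlin.index[0])  # 2000-01-01 00:00:00+01:00

# A series that is already aware cannot be localized a second time.
try:
    ts_berlin.tz_localize('UTC')
    relocalize_failed = False
except TypeError:
    relocalize_failed = True
print(relocalize_failed)  # True
```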
To move a time zone aware object to other time zones, you can use the tz_convert method:
>>> ts_utc.tz_convert('Europe/Berlin').index.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>
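Converting changes how each instant is displayed, not the instant itself. The sketch below checks this on a small series with made-up values. As far as we can tell, the LMT+0:53:00 in the output above is only how pytz prints the generic zone object; the individual timestamps carry the correct +01:00 offset.

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=3, freq='D')
ts_utc = pd.Series([1.0, 2.0, 3.0], index=rng).tz_localize('UTC')

ts_berlin = ts_utc.tz_convert('Europe/Berlin')

print(ts_utc.index[0])     # 2000-01-01 00:00:00+00:00
print(ts_berlin.index[0])  # 2000-01-01 01:00:00+01:00

# Both labels denote the same point in time.
print(ts_utc.index[0] == ts_berlin.index[0])  # True
```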
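Finally, to detach any time zone information from an object, it is possible to pass None to either tz_convert or tz_localize. tz_convert(None) first converts the timestamps to UTC and then drops the time zone, whereas tz_localize(None) drops the time zone and keeps the local wall-clock times; for a series that is already in UTC, both give the same result:
>>> ts_utc.tz_convert(None).index.tz is None
True
>>> ts_utc.tz_localize(None).index.tz is None
True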
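For a series in UTC the two calls are interchangeable, but for any other zone they produce different naive labels. A minimal sketch with a Berlin-based index and made-up values shows the difference:

```python
import pandas as pd

rng = pd.date_range('2000-01-01', periods=3, freq='D', tz='Europe/Berlin')
ts_berlin = pd.Series([1.0, 2.0, 3.0], index=rng)

# Convert to UTC first, then drop the zone.
as_utc_naive = ts_berlin.tz_convert(None)

# Drop the zone and keep the local wall-clock time.
as_local_naive = ts_berlin.tz_localize(None)

print(as_utc_naive.index[0])    # 1999-12-31 23:00:00
print(as_local_naive.index[0])  # 2000-01-01 00:00:00
```

Which variant is appropriate depends on whether downstream code expects UTC or local wall-clock labels.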