Interpolation 

Interpolation fills a run of consecutive missing values by fitting a simple mathematical relationship through the known values on either side of the gap. By default, pandas performs linear interpolation (which assumes a straight-line relationship between neighboring data points), but many other methods are available, such as polynomial, spline, quadratic, and Akima, which assume a polynomial or piece-wise polynomial relationship.
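To make the linear case concrete, the following is a minimal sketch that fills a gap by hand with NumPy's np.interp. The array values and the variable names (values, positions, known, filled) are illustrative assumptions, not part of the original example:

```python
import numpy as np

# Hypothetical data: two missing values between the known points 3 and 11
values = np.array([1.0, 3.0, np.nan, np.nan, 11.0])
positions = np.arange(len(values))
known = ~np.isnan(values)

# Evaluate the straight line through the neighboring known points
# at the positions of the missing values
filled = values.copy()
filled[~known] = np.interp(positions[~known], positions[known], values[known])
print(filled)
```

The gap between 3 and 11 spans three equal steps of 8/3, so the two missing entries become roughly 5.67 and 8.33.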

The interpolate method can be applied directly to a Series or to all the columns of a DataFrame:

import numpy as np
import pandas as pd
A=[1,3,np.nan,np.nan,11,np.nan,91,np.nan,52]
pd.Series(A).interpolate()

The following is the output:

Output Series with missing values filled using simple (linear) interpolation
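The same call works on a DataFrame, where each column is interpolated independently. The following is a small sketch under assumed data; the frame and its column names a and b are hypothetical and chosen only so the result is easy to verify by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; the column names are illustrative only
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [10.0, 20.0, np.nan, np.nan, 50.0],
})

# With the default arguments, each column is filled linearly down the rows
filled_frame = frame.interpolate()
print(filled_frame)
```

Because both columns were built from evenly spaced values, the filled frame should recover 1 to 5 in column a and 10 to 50 in column b.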

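Other methods, such as spline, assume a piece-wise polynomial relationship and are selected through the method argument (in pandas, spline and polynomial interpolation rely on SciPy being installed and require an order):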

pd.Series(A).interpolate(method='spline',order=2)

The following is the output:

Output Series with missing values filled using spline interpolation

Similarly, polynomial interpolation can be done like so:

pd.Series(A).interpolate(method='polynomial',order=2)

A different column can be created for each interpolation method in the same DataFrame to compare their results, as shown here:

#Needed for generating plot inside Jupyter notebook
%matplotlib inline
#Setting seed for regenerating the same random number
np.random.seed(10)
#Generate Data
A=pd.Series(np.arange(1,100,0.5)**3+np.random.normal(5,7,len(np.arange(1,100,0.5))))
#Sample random places to introduce missing values
np.random.seed(5)
NA=list(set(np.random.randint(1,100) for i in range(25)))
#Introduce missing values (a list is used because recent pandas versions do not accept a set as an indexer)
A.iloc[NA]=np.nan
#Define the list of interpolation methods
methods=['linear','quadratic','cubic']
#Apply the interpolation methods and create a DataFrame
df = pd.DataFrame({m: A.interpolate(method=m) for m in methods})
#Find the mean of each column (each interpolation method)
df.mean()

The following is the output:

Comparing mean values after interpolating using different methods

As we can see, the means are slightly different for each column because separate interpolation methods were used.

You can also check the values where interpolations were made to see how different/similar they are. This can be done as follows:

np.random.seed(5)
NA1=[np.random.randint(1,100) for i in range(25)]
df.iloc[NA1,:]
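An alternative to regenerating the random positions is to build a Boolean mask with isna() before interpolating and use it to select the filled rows. The following is a minimal sketch under assumed data: the truth series, the removed positions, and the names holes, mask, and comparison are hypothetical, and only the linear method is used so that the example needs nothing beyond pandas and NumPy:

```python
import numpy as np
import pandas as pd

# Hypothetical example: a known cubic signal with a few values removed
truth = pd.Series(np.arange(1, 11, dtype=float) ** 3)
holes = truth.copy()
holes.iloc[[2, 5, 6]] = np.nan

# Boolean mask of the positions that are about to be filled
mask = holes.isna()

filled = holes.interpolate(method="linear")

# Show only the interpolated rows next to the values that were removed
comparison = pd.DataFrame({"truth": truth[mask], "linear": filled[mask]})
comparison["abs_error"] = (comparison["linear"] - comparison["truth"]).abs()
print(comparison)
```

Because the mask records exactly which rows were missing, it avoids the duplicate positions that can appear when the random indices are drawn again, and additional columns for other methods could be added to the comparison in the same way.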