Interpolation fills a run of consecutive missing values by fitting a simple mathematical relationship through the known data points on either side of the gap. By default, pandas performs linear interpolation (which assumes a linear relationship between neighboring data points), but many other methods are available, such as polynomial, spline, quadratic, and Akima (which assume a polynomial or piece-wise polynomial relationship).
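To make the linear case concrete, the following is a minimal sketch that fills the gaps by hand with np.interp, assuming the values are equally spaced (which is also what the default linear method assumes). The variable names are illustrative only and are not part of pandas:

```python
import numpy as np
import pandas as pd

# Illustrative data: NaN marks the missing values
values = np.array([1, 3, np.nan, np.nan, 11, np.nan, 91, np.nan, 52])
positions = np.arange(len(values))   # treat the values as equally spaced
known = ~np.isnan(values)            # mask of the observed points

# Fill each missing position on the straight line joining its known neighbors
manual = values.copy()
manual[~known] = np.interp(positions[~known], positions[known], values[known])

print(manual)
# Check that the manual result agrees with the pandas default
print(np.allclose(manual, pd.Series(values).interpolate()))
```

Here, the two missing values between 3 and 11 are placed one-third and two-thirds of the way along the line joining them, giving roughly 5.67 and 8.33.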
The interpolate method can be applied to a series or all the columns of a DataFrame directly:
import numpy as np
import pandas as pd
A=[1,3,np.nan,np.nan,11,np.nan,91,np.nan,52]
pd.Series(A).interpolate()
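The following is the output:
0     1.000000
1     3.000000
2     5.666667
3     8.333333
4    11.000000
5    51.000000
6    91.000000
7    71.500000
8    52.000000
dtype: float64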
Other methods, such as spline, can be used instead. Spline interpolation assumes a piece-wise polynomial relationship, requires the order argument, and relies on SciPy being installed:
pd.Series(A).interpolate(method='spline',order=2)
The following is the output:
Similarly, polynomial interpolation can be done like so:
pd.Series(A).interpolate(method='polynomial',order=2)
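The interpolate method also works on a whole DataFrame, where each column is interpolated independently. The following is a minimal sketch; the DataFrame and its column names (a and b) are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column DataFrame with gaps in both columns
df_small = pd.DataFrame({
    'a': [1, 3, np.nan, np.nan, 11],
    'b': [10.0, np.nan, 30.0, np.nan, 50.0],
})

# With the default axis, each column is filled using only its own values
filled = df_small.interpolate()
print(filled)
```

Because the default method is linear, the gaps in column b are filled with the midpoints of their neighbors, while the two gaps in column a are spread evenly between 3 and 11.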
A different column can be created for each interpolation method in the same DataFrame to compare their results, as shown here:
#Needed for generating plot inside Jupyter notebook
%matplotlib inline
#Setting seed for regenerating the same random number
np.random.seed(10)
#Generate Data
A=pd.Series(np.arange(1,100,0.5)**3+np.random.normal(5,7,len(np.arange(1,100,0.5))))
#Sample random places to introduce missing values
np.random.seed(5)
NA=set([np.random.randint(1,100) for i in range(25)])
#Introduce missing values (convert the set to a list, since newer pandas versions reject a set as an indexer)
A[list(NA)]=np.nan
#Define the list of interpolation methods
methods=['linear','quadratic','cubic']
#Apply the interpolation methods and create a DataFrame
df = pd.DataFrame({m: A.interpolate(method=m) for m in methods})
#Find the mean of each column (each interpolation method)
df.mean()
The following is the output:
As we can see, the means are slightly different for each column because a different interpolation method was used for each.
You can also inspect the rows where values were interpolated to see how similar or different the results of each method are. This can be done as follows:
#Reuse the same seed so that the same random positions are drawn again
np.random.seed(5)
NA1=[np.random.randint(1,100) for i in range(25)]
df.iloc[NA1,:]
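Since the data here is synthetic, one possible way to take the comparison further is to keep a copy of the values before removing them and measure how far each method lands from the originals. The following is only a sketch under that assumption; the seed, the variable names, and the use of the mean absolute error are illustrative choices, and the quadratic and cubic methods need SciPy:

```python
import numpy as np
import pandas as pd

# Hypothetical seed, used only to make this sketch reproducible
rng = np.random.default_rng(0)

# Rebuild a cubic trend with noise, similar in shape to the data above
x = np.arange(1, 100, 0.5)
clean = pd.Series(x**3 + rng.normal(5, 7, len(x)))

# Choose distinct interior positions and blank them out in a copy
missing = sorted(set(rng.integers(1, 100, size=25).tolist()))
holed = clean.copy()
holed.iloc[missing] = np.nan

# Fill the gaps with each method
methods = ['linear', 'quadratic', 'cubic']
filled = pd.DataFrame({m: holed.interpolate(method=m) for m in methods})

# Absolute error of each method at the positions that were removed
errors = filled.iloc[missing].sub(clean.iloc[missing], axis=0).abs()
summary = errors.mean()
print(summary)
```

A smaller value in summary means that the method recovered the removed points more closely on this particular dataset; the ranking may differ for data with a different shape or noise level.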