Linear relations are commonplace in science and data analysis. Linear models are generally easier to understand than non-linear models, which is why tools for linear models were historically developed first. In certain cases, it pays to linearize (make linear) data to make analysis simpler. A simple strategy that sometimes works is to square or cube one or more variables, which moves the data up an imaginary power ladder. Similarly, we can move the data down the ladder by taking the square or cube root.
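As a minimal sketch of the power ladder idea, the following snippet uses synthetic, noise-free data (an assumption made purely for illustration, not the data of this recipe) in which y grows as the cube of x. The helper name linear_corr is hypothetical. Moving x up the ladder or y down the ladder should bring the Pearson correlation very close to 1:

```python
import numpy as np

# Synthetic, noise-free data (illustrative assumption): y grows as the cube of x.
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * x ** 3


def linear_corr(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


raw_corr = linear_corr(x, y)            # curved relation, noticeably below 1
up_corr = linear_corr(x ** 3, y)        # move x up the ladder (cube)
down_corr = linear_corr(x, np.cbrt(y))  # move y down the ladder (cube root)

print(raw_corr, up_corr, down_corr)
```

Either transformation linearizes the relation here; which variable to transform is a modeling choice that depends on how the result will be interpreted.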
In this recipe, we will use the Duncan dataset, described at https://vincentarelbundock.github.io/Rdatasets/doc/car/Duncan.html (retrieved August 2015). The data was gathered around 1961 and covers 45 occupations in four columns: type, income, education, and prestige. We will look at income and prestige. These variables seem to be linked by a cubic polynomial, so we can take the cube root of income or the cube of prestige. To check the result, we will visualize the residuals of the regression. We expect the residuals to be randomly distributed, that is, to follow no recognizable pattern.
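The residual check can also be approximated numerically. The sketch below is not part of the recipe: it uses synthetic data with a fixed random seed, and the lag-1 autocorrelation of the residuals (ordered by x) is just one possible stand-in for "a recognizable pattern". The helper names residuals_of_line and lag1_autocorr are hypothetical.

```python
import numpy as np

# Synthetic data (illustrative assumption): a cubic trend plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 200)
y = x ** 3 + rng.normal(scale=20.0, size=x.size)


def residuals_of_line(a, b):
    """Residuals of a least-squares straight-line fit of b on a."""
    slope, intercept = np.polyfit(a, b, 1)
    return b - (slope * a + intercept)


def lag1_autocorr(r):
    """Correlation between each residual and its neighbor (r is ordered by x)."""
    return np.corrcoef(r[:-1], r[1:])[0, 1]


raw_resid = residuals_of_line(x, y)       # straight line through curved data
lin_resid = residuals_of_line(x ** 3, y)  # straight line after cubing x

raw_pattern = lag1_autocorr(raw_resid)    # high: residuals follow a smooth curve
lin_pattern = lag1_autocorr(lin_resid)    # near zero: residuals look like noise

print(raw_pattern, lin_pattern)
```

A residual plot remains the more informative check, because it shows the shape of any leftover structure rather than a single number.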
In the following steps, I will demonstrate the basic data transformation:
import matplotlib.pyplot as plt
import numpy as np
import dautil as dl
import seaborn as sns
import statsmodels.api as sm
from IPython.display import HTML
# Load the Duncan data and take the cube root of income on a copy
df = sm.datasets.get_rdataset("Duncan", "car", cache=True).data
transformed = df.copy()
transformed['income'] = np.power(transformed['income'], 1.0/3)
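# Original data with a cubic polynomial fit
# (context is assumed to be defined earlier in the notebook; it is not shown here)
sp = dl.plotting.Subplotter(2, 2, context)
sp.label()
sns.regplot(x='income', y='prestige', data=df, order=3, ax=sp.ax)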
# Transformed data with a linear fit
sp.label(advance=True)
sns.regplot(x='income', y='prestige', data=transformed, ax=sp.ax)
# Residuals of the cubic polynomial fit on the original data
sp.label(advance=True)
sns.residplot(x='income', y='prestige', data=df, order=3, ax=sp.ax)
# Residuals of the linear fit on the transformed data
sp.label(advance=True)
sp.ax.set_xlim([1, 5])
sns.residplot(x='income', y='prestige', data=transformed, ax=sp.ax)
plt.tight_layout()
HTML(dl.report.HTMLBuilder().watermark())
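A side note on the cube root in the transformation step: np.power with the exponent 1.0/3 is fine for non-negative values, which the income column is expected to contain. For data that may include negative numbers, np.power returns nan, while np.cbrt returns the real cube root. The following sketch uses made-up values (not taken from the Duncan data) to show the difference:

```python
import numpy as np

# Made-up values (illustrative assumption), including a negative number.
values = np.array([-8.0, 0.0, 27.0])

# A fractional power of a negative float is undefined in real arithmetic,
# so NumPy yields nan for the first element (the warning is silenced here).
with np.errstate(invalid="ignore"):
    via_power = np.power(values, 1.0 / 3)

# np.cbrt computes the real cube root and handles negative inputs.
via_cbrt = np.cbrt(values)

print(via_power)
print(via_cbrt)
```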
Refer to the following screenshot for the end result (refer to the transforming_up.ipynb file in this book's code bundle):