Transforming data with the power ladder

Linear relations are commonplace in science and data analysis. Linear models are generally easier to understand than non-linear models, which is one reason why tools for linear models were developed first. In certain cases, it pays to linearize the data (transform it so that the relation becomes linear) to simplify the analysis. A simple strategy that sometimes works is to square or cube one or more variables, which moves them up an imaginary power ladder. Similarly, we can move the data down the ladder by taking the square or cube root.
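As a minimal sketch of the idea, the following snippet tries several rungs of the power ladder on synthetic, noise-free data and picks the power that makes the relation most linear. The data, the `ladder` list, and the `linearity` helper are assumptions made for illustration and are not part of the recipe.

```python
import numpy as np

# Synthetic, noise-free data in which y grows as the cube of x (illustration only).
x = np.linspace(1, 10, 50)
y = 2 * x ** 3

# Rungs of the power ladder, from "up" (powers > 1) to "down" (powers < 1).
ladder = [3, 2, 1, 1 / 2, 1 / 3]


def linearity(power):
    """Absolute Pearson correlation between x and y raised to `power`."""
    return abs(np.corrcoef(x, np.power(y, power))[0, 1])


# Score every rung and keep the one that gives the most linear relation.
scores = {power: linearity(power) for power in ladder}
best_power = max(scores, key=scores.get)
print(best_power, scores[best_power])
```

Because y was built as a cube of x, the cube root (power 1/3) straightens the relation here. With real data, the correlation is only a rough guide, and a residual plot remains the better check.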

In this recipe, we will use data from the Duncan dataset as described in https://vincentarelbundock.github.io/Rdatasets/doc/car/Duncan.html (retrieved August 2015). The data was gathered around 1961 and covers 45 occupations with four columns: type, income, education, and prestige. We will take a look at income and prestige. These variables seem to be linked by a cubic polynomial, so we can take the cube root of income or the cube of prestige. To check the result, we will visualize the residuals of the regression. We expect the residuals to be randomly distributed, which means that they should not follow a recognizable pattern.
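The residual check can also be sketched numerically. The following example uses synthetic data (not the Duncan dataset), fits a straight line before and after a cube root transformation, and measures how strongly the residuals still curve. The `linear_residuals` and `curvature` helpers are hypothetical names introduced for this sketch, and the curvature score is just one crude way to detect a pattern.

```python
import numpy as np

# Synthetic data: the response is linear in the cube root of x, plus noise.
rng = np.random.default_rng(42)
x = np.linspace(1, 100, 200)
y = 10 * np.cbrt(x) + rng.normal(scale=0.5, size=x.size)


def linear_residuals(predictor, response):
    """Residuals of a least-squares straight-line fit."""
    slope, intercept = np.polyfit(predictor, response, deg=1)
    return response - (slope * predictor + intercept)


def curvature(predictor, residuals):
    """Absolute correlation between the residuals and the squared,
    centered predictor; values near 0 suggest no leftover curvature."""
    centered_sq = (predictor - predictor.mean()) ** 2
    return abs(np.corrcoef(centered_sq, residuals)[0, 1])


raw_score = curvature(x, linear_residuals(x, y))
root = np.cbrt(x)
transformed_score = curvature(root, linear_residuals(root, y))
print(raw_score, transformed_score)
```

A high score for the raw fit and a low score after the transformation indicate that the transformation removed the systematic pattern from the residuals.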

How to do it...

In the following steps, I will demonstrate the basic data transformation:

  1. The imports are as follows:
    import matplotlib.pyplot as plt
    import numpy as np
    import dautil as dl
    import seaborn as sns
    import statsmodels.api as sm
    from IPython.display import HTML
  2. Load and transform the data as follows:
    df = sm.datasets.get_rdataset("Duncan", "car", cache=True).data
    transformed = df.copy()
    transformed['income'] = np.power(transformed['income'], 1.0/3)
  3. Plot the original data with a Seaborn regression plot (cubic polynomial) as follows:
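    # `context` is not defined in this listing; it is assumed to be
    # created earlier in the accompanying notebook.
    sp = dl.plotting.Subplotter(2, 2, context)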
    sp.label()
    sns.regplot(x='income', y='prestige', data=df, order=3, ax=sp.ax)
  4. Plot the transformed data with the following lines:
    sp.label(advance=True)
    sns.regplot(x='income', y='prestige', data=transformed, ax=sp.ax)
  5. Plot the residuals plot for the cubic polynomial:
    sp.label(advance=True)
    sns.residplot(x='income', y='prestige', data=df, order=3, ax=sp.ax)
  6. Plot the residuals plot for the transformed data as follows:
    sp.label(advance=True)
    sp.ax.set_xlim([1, 5])
    sns.residplot(x='income', y='prestige', data=transformed, ax=sp.ax)
    plt.tight_layout()
    HTML(dl.report.HTMLBuilder().watermark())
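For readers who do not have dautil installed, the same two-by-two layout can be approximated with plain matplotlib and NumPy. This is only a sketch under stated assumptions: it uses a synthetic stand-in for the income and prestige columns instead of downloading the Duncan dataset, and the `fit_and_residuals` helper is a hypothetical name, not part of the recipe.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the income and prestige columns (hypothetical values).
rng = np.random.default_rng(1)
income = np.linspace(5, 80, 45)
prestige = 25 * np.cbrt(income) - 30 + rng.normal(scale=5, size=income.size)


def fit_and_residuals(x, y, order):
    """Fit a polynomial of the given order; return fitted values and residuals."""
    fitted = np.polyval(np.polyfit(x, y, order), x)
    return fitted, y - fitted


# One column per model: cubic fit on the raw data, linear fit on the cube root.
panels = [
    ("Cubic fit, original income", income, 3),
    ("Linear fit, cube root of income", np.cbrt(income), 1),
]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for column, (title, x, order) in enumerate(panels):
    fitted, residuals = fit_and_residuals(x, prestige, order)
    # Top row: data with the fitted curve.
    axes[0, column].scatter(x, prestige)
    axes[0, column].plot(x, fitted)
    axes[0, column].set_title(title)
    # Bottom row: residuals around a zero reference line.
    axes[1, column].scatter(x, residuals)
    axes[1, column].axhline(0, linestyle="--")
    axes[1, column].set_title("Residuals")
fig.tight_layout()
```

To use the real data, replace the synthetic arrays with the income and prestige columns of the DataFrame loaded in step 2.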

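The end result is a two-by-two grid of plots: the original data with the cubic fit, the transformed data with a linear fit, and the residuals of each fit (refer to the transforming_up.ipynb file in this book's code bundle).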
