Exploratory data analysis (EDA)

To identify our missing values, we will begin with an EDA of our dataset. We will use some helpful Python packages, pandas and numpy, to store our data and make some simple calculations, as well as some popular visualization tools to see what the distribution of our data looks like. Let's dive into some code. First, we will do some imports:

# import packages we need for exploratory data analysis (EDA)
import pandas as pd # to store tabular data
import numpy as np # to do some math
import matplotlib.pyplot as plt # a popular data visualization tool
import seaborn as sns # another popular data visualization tool
%matplotlib inline
plt.style.use('fivethirtyeight') # a popular data visualization theme

We will import our tabular data through a CSV, as follows:

# load in our dataset using pandas
pima = pd.read_csv('../data/pima.data')

pima.head()

The head method allows us to see the first few rows in our dataset. The output is as follows:

6 148 72 35 0 33.6 0.627 50 1
0 1 85 66 29 0 26.6 0.351 31 0
1 8 183 64 0 0 23.3 0.672 32 1
2 1 89 66 23 94 28.1 0.167 21 0
3 0 137 40 35 168 43.1 2.288 33 1
4 5 116 74 0 0 25.6 0.201 30 0


Something's not right here; there are no column names. The CSV must not have the column names built into the file. No matter, we can use the data source's website to fill them in, as shown in the following code:

pima_column_names = ['times_pregnant', 'plasma_glucose_concentration',
                     'diastolic_blood_pressure', 'triceps_thickness',
                     'serum_insulin', 'bmi', 'pedigree_function',
                     'age', 'onset_diabetes']

pima = pd.read_csv('../data/pima.data', names=pima_column_names)

pima.head()

Now, using the head method again, we can see our columns with the appropriate headers. The output of the preceding code is as follows:

times_pregnant plasma_glucose_concentration diastolic_blood_pressure triceps_thickness serum_insulin bmi pedigree_function age onset_diabetes
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1


Much better! Now we can use the column names to do some basic stats, selection, and visualization.
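For instance (a minimal illustration of named-column selection; this snippet is ours, not from the original walkthrough), we can now grab a single feature by name and compute a statistic on it:

pima['plasma_glucose_concentration'].mean()  # average glucose concentration across all patients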

If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let's try to visualize some of the differences between those who developed diabetes and those who did not. Our hope is that a histogram will reveal some sort of pattern, or an obvious difference in values, between the two prediction classes:

# get a histogram of the plasma_glucose_concentration column for
# both classes

col = 'plasma_glucose_concentration'
plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()

The output of the preceding code is as follows:

It seems that this histogram is showing us a pretty big difference in plasma_glucose_concentration between the two prediction classes. Let's show the same histogram style for multiple columns, as follows:

for col in ['bmi', 'diastolic_blood_pressure', 'plasma_glucose_concentration']:
    plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

The output of the preceding code will give us the following three histograms. The first one shows us the distribution of bmi for the two class variables (non-diabetes and diabetes):

The next histogram again shows a contrastingly different distribution of a feature across our two class variables. This time we are looking at diastolic_blood_pressure:

The final graph will show plasma_glucose_concentration differences between our two class variables:


We can definitely see some major differences simply by looking at a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. We will use seaborn, the visualization tool we imported at the beginning of this chapter, to draw our correlation matrix, as follows:

# look at the heatmap of the correlation matrix of our dataset
sns.heatmap(pima.corr())
# plasma_glucose_concentration definitely seems to be an interesting feature here

The following heatmap shows the correlation amongst the different columns in our Pima dataset:

This correlation matrix is showing a strong correlation between plasma_glucose_concentration and onset_diabetes. Let's take a further look at the numerical correlations for the onset_diabetes column, with the following code:

pima.corr()['onset_diabetes'] # numerical correlation matrix
# plasma_glucose_concentration definitely seems to be an interesting feature here

times_pregnant                  0.221898
plasma_glucose_concentration    0.466581
diastolic_blood_pressure        0.065068
triceps_thickness               0.074752
serum_insulin                   0.130548
bmi                             0.292695
pedigree_function               0.173844
age                             0.238356
onset_diabetes                  1.000000
Name: onset_diabetes, dtype: float64

We will explore the power of correlation in more depth in Chapter 4, Feature Construction, but for now we are using exploratory data analysis (EDA) to hint at the fact that the plasma_glucose_concentration column will be an important factor in our prediction of the onset of diabetes.

Moving on to more important matters at hand, let's see if we are missing any values in our dataset by invoking the built-in isnull() method of the pandas DataFrame:

pima.isnull().sum()

times_pregnant                  0
plasma_glucose_concentration    0
diastolic_blood_pressure        0
triceps_thickness               0
serum_insulin                   0
bmi                             0
pedigree_function               0
age                             0
onset_diabetes                  0
dtype: int64

Great! We don't have any missing values. Let's go on to do some more EDA, first using the shape attribute to see the number of rows and columns we are working with:

pima.shape  # (# rows, # cols)
(768, 9)

This confirms that we have 9 columns (including our response variable) and 768 data observations (rows). Now, let's take a peek at the percentage of patients who developed diabetes, using the following code:

pima['onset_diabetes'].value_counts(normalize=True) 
# get null accuracy, 65% did not develop diabetes

0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64

This shows us that 65% of the patients did not develop diabetes, while about 35% did. We can use a nifty built-in method of a pandas DataFrame called describe to look at some basic descriptive statistics:

pima.describe()  # get some basic descriptive statistics

We get the output as follows:

       times_pregnant  plasma_glucose_concentration  diastolic_blood_pressure  triceps_thickness  serum_insulin         bmi  pedigree_function         age  onset_diabetes
count      768.000000                    768.000000                 768.000000         768.000000     768.000000  768.000000         768.000000  768.000000      768.000000
mean         3.845052                    120.894531                  69.105469          20.536458      79.799479   31.992578           0.471876   33.240885        0.348958
std          3.369578                     31.972618                  19.355807          15.952218     115.244002    7.884160           0.331329   11.760232        0.476951
min          0.000000                      0.000000                   0.000000           0.000000       0.000000    0.000000           0.078000   21.000000        0.000000
25%          1.000000                     99.000000                  62.000000           0.000000       0.000000   27.300000           0.243750   24.000000        0.000000
50%          3.000000                    117.000000                  72.000000          23.000000      30.500000   32.000000           0.372500   29.000000        0.000000
75%          6.000000                    140.250000                  80.000000          32.000000     127.250000   36.600000           0.626250   41.000000        1.000000
max         17.000000                    199.000000                 122.000000          99.000000     846.000000   67.100000           2.420000   81.000000        1.000000

This quickly shows us some basic stats, such as the mean, standard deviation, and several percentile measurements of our data. But notice that the minimum value of the bmi column is 0. That is medically impossible; there must be a reason for this. Perhaps the number zero has been encoded as a missing value instead of being left as None or an empty cell. Upon closer inspection (and as the quick check after the following list confirms), we see that the value 0 appears as a minimum value for the following columns:

  • times_pregnant
  • plasma_glucose_concentration
  • diastolic_blood_pressure
  • triceps_thickness
  • serum_insulin
  • bmi
  • onset_diabetes
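As a quick programmatic check (a minimal sketch; it relies only on the built-in min method of a pandas DataFrame), we can ask for every column's minimum directly instead of reading it off the describe table:

pima.min()  # per-column minimums; several of them are 0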

Because zero is a class label for onset_diabetes and 0 is a perfectly valid number for times_pregnant, we may conclude that the number 0 is encoding missing values for the following columns (a count of the zeros in each appears after the list):

  • plasma_glucose_concentration
  • diastolic_blood_pressure
  • triceps_thickness
  • serum_insulin
  • bmi
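To see how widespread the problem is, here is a small sketch that counts the zeros in each of those columns. The suspect_columns name is ours, introduced just for this check:

# columns we suspect of using 0 as a missing-value placeholder
suspect_columns = ['plasma_glucose_concentration', 'diastolic_blood_pressure',
                   'triceps_thickness', 'serum_insulin', 'bmi']

# compare every value to 0 and sum the True results per column
(pima[suspect_columns] == 0).sum()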

So, we actually do have missing values! It was obviously not luck that we happened upon these zeros; we knew beforehand that they were there. As a data scientist, you must be ever vigilant and know as much about the dataset as possible in order to find missing values encoded as other symbols. Be sure to read any and all documentation that comes with open datasets, in case it mentions any missing values.

If no documentation is available, some common values used instead of missing values are listed below (a load-time handling sketch follows the list):

  • 0 (for numerical values)
  • unknown or Unknown (for categorical variables)
  • ? (for categorical variables)
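When sentinels like these are documented, one option is to convert them to proper NaN values at load time using pandas' na_values parameter. The following is a minimal sketch; the filename and sentinel list are illustrative, not taken from the Pima documentation:

# convert the listed sentinel strings to NaN while parsing the file
df = pd.read_csv('some_file.csv', na_values=['?', 'unknown', 'Unknown'])

Note that we could not blindly add 0 to that list for the Pima data, since 0 is a legitimate value for columns such as times_pregnant; sentinel handling often has to be done per column.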

So, we have five columns in which missing values exist; now we get to talk about how to deal with them in depth.
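As a preview (one possible first step, reusing the suspect_columns list from the sketch above; the in-depth treatment comes next), we could re-encode those zeros as np.nan so that pandas recognizes them as missing:

# work on a copy so the original DataFrame is left untouched
pima_nan = pima.copy()
# replace the suspicious zeros with np.nan in the five suspect columns
pima_nan[suspect_columns] = pima_nan[suspect_columns].replace(0, np.nan)
pima_nan.isnull().sum()  # the hidden missing values now show up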
