To identify our missing values we will begin with an EDA of our dataset. We will be using some useful python packages, pandas and numpy, to store our data and make some simple calculations as well as some popular visualization tools to see what the distribution of our data looks like. Let's begin and dive into some code. First, we will do some imports:
# import packages we need for exploratory data analysis (EDA)
import pandas as pd # to store tabular data
import numpy as np # to do some math
import matplotlib.pyplot as plt # a popular data visualization tool
import seaborn as sns # another popular data visualization tool
%matplotlib inline
plt.style.use('fivethirtyeight') # a popular data visualization theme
We will import our tabular data through a CSV, as follows:
# load in our dataset using pandas
pima = pd.read_csv('../data/pima.data')
pima.head()
The head method allows us to see the first few rows in our dataset. The output is as follows:
  | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1
0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0
1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1
2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0
3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1
4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0
Something's not right here; there are no column names. The first row of data has been read in as the header, so the CSV must not have the column names built into the file. No matter: we can use the data source's website to fill them in, as shown in the following code:
pima_column_names = ['times_pregnant', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness', 'serum_insulin', 'bmi', 'pedigree_function', 'age', 'onset_diabetes']
pima = pd.read_csv('../data/pima.data', names=pima_column_names)
pima.head()
Now, using the head method again, we can see our columns with the appropriate headers. The output of the preceding code is as follows:
  | times_pregnant | plasma_glucose_concentration | diastolic_blood_pressure | triceps_thickness | serum_insulin | bmi | pedigree_function | age | onset_diabetes
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1
Much better! Now we can use the column names to do some basic statistics, selection, and visualization.
If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let us try to visualize some of the differences between those that developed diabetes and those that did not. Our hope is that the histogram will reveal some sort of pattern, or obvious difference in values between the classes of prediction:
# get a histogram of the plasma_glucose_concentration column for
# both classes
col = 'plasma_glucose_concentration'
plt.hist(pima[pima['onset_diabetes']==0][col], bins=10, alpha=0.5, label='non-diabetes')
plt.hist(pima[pima['onset_diabetes']==1][col], bins=10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()
The output of the preceding code is as follows:
It seems that this histogram is showing us a pretty big difference in plasma_glucose_concentration between the two prediction classes. Let's show the same histogram style for multiple columns, as follows:
for col in ['bmi', 'diastolic_blood_pressure', 'plasma_glucose_concentration']:
    plt.hist(pima[pima['onset_diabetes']==0][col], bins=10, alpha=0.5, label='non-diabetes')
    plt.hist(pima[pima['onset_diabetes']==1][col], bins=10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()
The output of the preceding code will give us the following three histograms. The first one shows us the distributions of bmi for the two class variables (non-diabetes and diabetes):
The next histogram again shows us contrastingly different distributions of a feature across our two class variables. This time, we are looking at diastolic_blood_pressure:
The final graph will show plasma_glucose_concentration differences between our two class variables:
We can definitely see some major differences simply by looking at just a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, perhaps we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. We will use the visualization tool, seaborn, which we imported at the beginning of this chapter for our correlation matrix as follows:
# look at the heatmap of the correlation matrix of our dataset
sns.heatmap(pima.corr())
# plasma_glucose_concentration definitely seems to be an interesting feature here
The following heatmap is the correlation matrix of our dataset, showing the correlation amongst the different columns of our Pima dataset:
This correlation matrix is showing a strong correlation between plasma_glucose_concentration and onset_diabetes. Let's take a further look at the numerical correlations for the onset_diabetes column, with the following code:
pima.corr()['onset_diabetes'] # numerical correlation matrix
# plasma_glucose_concentration definitely seems to be an interesting feature here
times_pregnant                  0.221898
plasma_glucose_concentration    0.466581
diastolic_blood_pressure        0.065068
triceps_thickness               0.074752
serum_insulin                   0.130548
bmi                             0.292695
pedigree_function               0.173844
age                             0.238356
onset_diabetes                  1.000000
Name: onset_diabetes, dtype: float64
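Sorting these values by absolute magnitude is a handy way to rank features by the strength of their linear association with the response. The following is a small self-contained sketch on toy data (the feature and target names are hypothetical, not from the Pima dataset), where feature_a is built to track the target and feature_b is pure noise:

```python
import numpy as np
import pandas as pd

# toy data (hypothetical): feature_a is built to track the target, feature_b is noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    'feature_a': x + rng.normal(scale=0.5, size=200),
    'feature_b': rng.normal(size=200),
    'target': (x > 0).astype(int),
})

# rank features by the absolute value of their correlation with the target
corrs = df.corr()['target'].drop('target').abs().sort_values(ascending=False)
print(corrs)  # feature_a ranks first, feature_b near zero
```

On the real dataset, the same idea applied to pima.corr()['onset_diabetes'] would put plasma_glucose_concentration at the top of the ranking.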
We will explore the power of correlation in Chapter 4, Feature Construction, but for now we are using exploratory data analysis (EDA) to hint at the fact that the plasma_glucose_concentration column will be an important factor in our prediction of the onset of diabetes.
Moving on to more important matters at hand, let's see if we are missing any values in our dataset by invoking the built-in isnull() method of the pandas DataFrame:
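On the full DataFrame, this check would be a call such as pima.isnull().sum() (the exact aggregation is an assumption on my part). The following self-contained sketch on a toy frame, with illustrative values rather than the real Pima data, shows the idea, and also why this check alone can be misleading:

```python
import pandas as pd

# toy frame standing in for pima (values are illustrative only)
df = pd.DataFrame({
    'serum_insulin': [0.0, 94.0, 168.0],
    'bmi': [33.6, 26.6, 23.3],
})

# isnull flags only true NaN/None cells; zeros used as sentinels are invisible to it
missing = df.isnull().sum()
print(missing)  # both counts are 0, even though serum_insulin contains a suspicious 0
```

Note that isnull only sees genuine NaN/None cells, a point that will matter shortly.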
Great! We don't have any missing values. Let's go on to do some more EDA, first using the shape attribute to see the number of rows and columns we are working with:
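As a quick illustration on a toy frame (illustrative values, not the Pima data), shape is an attribute that evaluates to a (rows, columns) tuple:

```python
import pandas as pd

# small illustrative frame; on the real dataset, pima.shape evaluates to (768, 9)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(df.shape)  # (rows, columns)
```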
This confirms that we have 9 columns (including our response variable) and 768 data observations (rows). Now, let's take a peek at the percentage of patients who developed diabetes, using the following code:
pima['onset_diabetes'].value_counts(normalize=True)
# get null accuracy, 65% did not develop diabetes
0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64
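Null accuracy is the accuracy a model would achieve by always predicting the most common class; it is the baseline any real model must beat. The following is a minimal sketch on hypothetical labels:

```python
import pandas as pd

# hypothetical labels: 0 = did not develop diabetes, 1 = did
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# null accuracy: the score earned by always predicting the most common class
null_accuracy = y.value_counts(normalize=True).max()
print(null_accuracy)  # 0.7
```

For our Pima data, the same calculation gives a null accuracy of about 65%.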
This shows us that 65% of the patients did not develop diabetes, while about 35% did. We can use a nifty built-in method of a pandas DataFrame called describe to look at some basic descriptive statistics:
pima.describe() # get some basic descriptive statistics
We get the output as follows:
      | times_pregnant | plasma_glucose_concentration | diastolic_blood_pressure | triceps_thickness | serum_insulin | bmi | pedigree_function | age | onset_diabetes
count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000
mean  | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958
std   | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951
min   | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000
25%   | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000
50%   | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000
75%   | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000
max   | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000
This quickly shows us some basic statistics, such as the mean, standard deviation, and several percentile measurements of our data. But notice that the minimum value of the bmi column is 0. That is medically impossible; there must be a reason for it. Perhaps the number zero has been encoded as a placeholder for missing values, instead of a None value or an empty cell. Upon closer inspection, we see that the value 0 appears as the minimum value for the following columns:
- times_pregnant
- plasma_glucose_concentration
- diastolic_blood_pressure
- triceps_thickness
- serum_insulin
- bmi
- onset_diabetes
Because zero is a class for onset_diabetes and 0 is actually a viable number for times_pregnant, we may conclude that the number 0 is encoding missing values for:
- plasma_glucose_concentration
- diastolic_blood_pressure
- triceps_thickness
- serum_insulin
- bmi
So, we actually do have missing values! It was obviously not luck that we happened upon these zeros; we knew to look for them beforehand. As a data scientist, you must be ever vigilant and make sure that you know as much about the dataset as possible, in order to find missing values encoded as other symbols. Be sure to read any and all documentation that comes with open datasets, in case it mentions any missing values.
If no documentation is available, some common values used instead of missing values are:
- 0 (for numerical values)
- unknown or Unknown (for categorical variables)
- ? (for categorical variables)
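To hunt for zero-encoded missing values programmatically, a small self-contained sketch (toy columns with hypothetical values; the real check would run on the corresponding pima columns) might look like this:

```python
import numpy as np
import pandas as pd

# toy columns mimicking pima, where 0 stands in for a missing measurement (hypothetical values)
df = pd.DataFrame({
    'serum_insulin': [0.0, 94.0, 168.0, 0.0],
    'bmi': [33.6, 26.6, 0.0, 28.1],
})

# count the suspicious zeros in each column
zero_counts = (df == 0).sum()
print(zero_counts)  # serum_insulin: 2, bmi: 1

# once identified, re-encode them as NaN so that pandas treats them as missing
df_fixed = df.replace(0, np.nan)
print(df_fixed.isnull().sum())  # now matches the zero counts
```

After re-encoding, the standard pandas missing-value machinery (isnull, dropna, fillna, and so on) sees these cells for what they are.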
We now have five columns in which missing values exist, so it's time to talk about how to deal with them, in depth.