Implementing logistic regression with Python

We have now covered the mathematics behind the logistic regression algorithm. Let's take a dataset and implement a logistic regression model from scratch. The dataset we will be working with comes from the marketing department of a bank and records whether customers subscribed to a term deposit, along with information about each customer and about how the bank engaged and reached out to them to sell the term deposit.

Let us import the dataset and start exploring it:

import pandas as pd
bank=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Logistic Regression/bank.csv',sep=';')
bank.head()

The dataset looks as follows:


Fig. 6.6: A glimpse of the bank dataset

There are 4119 records and 21 columns. The column names are as follows:

bank.columns.values

Fig. 6.7: The columns of the bank dataset

The details of each column are mentioned in the Data Dictionary file present in the Logistic Regression folder of the Google Drive folder. The type of each column can be found as follows:

bank.dtypes

Fig. 6.8: The column types of the bank dataset

Processing the data

The y column is the outcome variable, recording yes and no: yes for customers who bought the term deposit and no for those who didn't. Let us start by converting yes/no to 1/0 so that the column can be used in modelling. This can be done as follows:

bank['y']=(bank['y']=='yes').astype(int)

The preceding code snippet converts yes to 1 and no to 0: the comparison produces True/False values, and the astype(int) method converts these Booleans to integers (1/0).

The education column of the dataset has many categories, and we need to reduce them for better modelling. The education column has the following categories:

bank['education'].unique()

Fig. 6.9: The categories of the education column in the bank dataset

The basic category has been repeated three times, probably to capture 4, 6, and 9 years of education. Let us club these three together and call them Basic. Also, let us rename the other categories so that they read better:

import numpy as np
bank['education']=np.where(bank['education'] =='basic.9y', 'Basic', bank['education'])
bank['education']=np.where(bank['education'] =='basic.6y', 'Basic', bank['education'])
bank['education']=np.where(bank['education'] =='basic.4y', 'Basic', bank['education'])
bank['education']=np.where(bank['education'] =='university.degree', 'University Degree', bank['education'])
bank['education']=np.where(bank['education'] =='professional.course', 'Professional Course', bank['education'])
bank['education']=np.where(bank['education'] =='high.school', 'High School', bank['education'])
bank['education']=np.where(bank['education'] =='illiterate', 'Illiterate', bank['education'])
bank['education']=np.where(bank['education'] =='unknown', 'Unknown', bank['education'])

After the change, this is how the categories look:
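To confirm the recoding, we can rerun the same check used before the change (no new columns or helpers are assumed here):

bank['education'].unique()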


Fig. 6.10: The categories of the education column after the change

Data exploration

First of all, let us find out the number of people who purchased the term deposit and those who didn't:

bank['y'].value_counts()

Fig. 6.11: Total number of yes's and no's in the bank dataset

There are 3668 no's and 451 yes's in the outcome variable.
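Since y is already coded as 1/0, its mean gives the share of positive outcomes directly; a quick check using only the column created earlier:

bank['y'].mean()   # roughly 0.11, that is, about 11% of customers bought the term deposit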

As you might have observed, there are many numerical variables in the dataset. Let us get a sense of how these numbers differ across the two outcome classes (yes and no):

bank.groupby('y').mean()

Fig. 6.12: The mean of the numerical variables for yes's and no's

A few points to note from the preceding output are as follows:

  • The average age of customers who bought the term deposit is higher than that of the customers who didn't.
  • The pdays (days since the customer was last contacted) is understandably lower for the customers who bought it. The lower the pdays, the better the memory of the last call and hence the better chances of a sale.
  • Surprisingly, the campaign value (the number of contacts or calls made during the current campaign) is lower for customers who bought the term deposit.

We can calculate categorical means for other categorical variables such as education and marital status to get a more detailed sense of our data. The categorical means for education look as follows:

bank.groupby('education').mean()

Fig. 6.13: The mean of the numerical variables for different categories of education

Data visualization

Let us visualize the data to get a clearer picture of the significant variables. Let us start with a histogram of education with separate bars for customers who bought the term deposit and those who didn't.

The tabular data for Education Level and whether customers purchased the deposit looks as follows:
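The table shown in Fig. 6.14 can be reproduced with pd.crosstab; this is a small sketch using only columns already present in the data frame:

pd.crosstab(bank.education, bank.y)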


Fig. 6.14: Tabular data for Education Level and Purchase

The same data can be plotted as a bar chart using the following code snippet:

%matplotlib inline
import matplotlib.pyplot as plt
pd.crosstab(bank.education,bank.y).plot(kind='bar')
plt.title('Purchase Frequency for Education Level')
plt.xlabel('Education')
plt.ylabel('Frequency of Purchase')

Fig. 6.15: Bar chart for Education Level and Frequency of Purchase

As is evident in the preceding plot, the frequency of purchase of the deposit depends a great deal on the Education Level. Thus, the Education Level can be a good predictor of the outcome variable.

Let us draw a stacked bar chart of the marital status and the purchase of the term deposit. Basically, the chart will represent the proportion of customers who bought the term deposit within each marital status. It looks as follows:

table=pd.crosstab(bank.marital,bank.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Marital Status vs Purchase')
plt.xlabel('Marital Status')
plt.ylabel('Proportion of Customers')

Fig. 6.16: The bar chart for the Marital Status and Proportion of Customers

The frequency of the purchase of the deposit is more or less the same for each marital status; hence, it might not be very helpful in predicting the outcome.

Let us plot the bar chart for the Frequency of Purchase against each day of the week to see whether this can be a good predictor of the outcome:

%matplotlib inline
import matplotlib.pyplot as plt
pd.crosstab(bank.day_of_week,bank.y).plot(kind='bar')
plt.title('Purchase Frequency for Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Frequency of Purchase')

The plot (the frequency of the positive outcomes) varies depending on the day of the week; hence, it might be a good predictor of the outcome:


Fig. 6.17: The bar chart for Day of the Week and Frequency of Purchase

The histogram of the Age variable looks as follows, suggesting that most of the customers of the bank in this dataset are in the age range of 30-40:

import matplotlib.pyplot as plt
bank.age.hist()
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

Fig. 6.18: The histogram of the customers' Age

Another bar chart of Poutcome and the frequency of purchase shows that the Poutcome might be an important predictor of the outcome:
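The code for this chart is not shown in the text; a minimal sketch in the same style as the earlier crosstab plots (assuming pandas and matplotlib are already imported as above) would be:

pd.crosstab(bank.poutcome, bank.y).plot(kind='bar')
plt.title('Purchase Frequency for Poutcome')
plt.xlabel('Poutcome')
plt.ylabel('Frequency of Purchase')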


Fig. 6.19: The bar chart for Poutcome and Frequency of Purchase

Several other charts can be plotted to gauge which variables are more significant and which ones are not in order to predict the outcome variable.
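If needed, the same crosstab plot can be looped over several categorical columns at once. This is only an exploratory sketch; the columns named here (job, housing, and loan) are picked for illustration from the dataset's categorical variables:

for col in ['job', 'housing', 'loan']:
    # one bar chart of purchase frequency per categorical column
    pd.crosstab(bank[col], bank.y).plot(kind='bar')
    plt.title('Purchase Frequency for ' + col)
    plt.xlabel(col)
    plt.ylabel('Frequency of Purchase')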

Creating dummy variables for categorical variables

There are many categorical variables in the dataset, and they need to be converted to dummy variables before they can be used for modelling. We know the process of converting a categorical variable into dummy variables. However, since there are many categorical variables, it is more time-efficient to automate the process using a for loop. The following code snippet creates dummy variables for each categorical variable and joins them to the bank data frame:

cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
for var in cat_vars:
    # create one dummy column per category, prefixed with the variable name
    cat_list = pd.get_dummies(bank[var], prefix=var)
    bank = bank.join(cat_list)

The actual categorical variables need to be removed once the dummy variables have been created. We have done something similar earlier in this book:

cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
bank_vars=bank.columns.values.tolist()
to_keep=[i for i in bank_vars if i not in cat_vars]

Let us subset the bank data frame to keep only the columns present in the to_keep list:

bank_final=bank[to_keep]
bank_final.columns.values

Fig. 6.20: Column names after creating dummy variables for categorical variables

The outcome variable is y and all the other variables are predictor variables. The lists of predictor (X) and outcome (Y) column names can be created using the following code snippet:

bank_final_vars=bank_final.columns.values.tolist()
Y=['y']
X=[i for i in bank_final_vars if i not in Y ]

Feature selection

Before implementing the model, let us perform feature selection to decide which variables are significant enough to predict the outcome with good accuracy. We are free to select as many variables as we like; here, let us select 12 columns. This can be done as follows, similar to what was done in the chapter on linear regression:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# select the 12 strongest predictors; ravel() flattens the single-column outcome into a 1-D array
rfe = RFE(model, n_features_to_select=12)
rfe = rfe.fit(bank_final[X], bank_final[Y].values.ravel())
print(rfe.support_)
print(rfe.ranking_)

The output of the preceding code snippet is two arrays: one contains the support and the other contains the ranking. The columns that have True in the support array, or equivalently the value 1 in the ranking array, are selected for the model. If we want to include more than 12 columns in the model, we can also select the columns with rank 2 onwards:


Fig. 6.21: The outcome of the feature selection process. The columns with True/1 in the respective positions should be selected for the final model
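Rather than reading the selected columns off the Boolean array by eye, the support mask can be zipped with the predictor list; a small sketch, assuming the rfe object and the X list of column names defined above:

selected = [col for col, keep in zip(X, rfe.support_) if keep]
print(selected)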

The columns that are selected using this method are as follows:

'previous', 'euribor3m', 'job_entrepreneur', 'job_self-employed', 'poutcome_success', 'poutcome_failure', 'month_oct', 'month_may','month_mar', 'month_jun', 'month_jul', 'month_dec'

Next, we will try to fit a logistic regression model using the preceding selected variables as predictor variables, with the y as the outcome variable:

cols=['previous', 'euribor3m', 'job_entrepreneur', 'job_self-employed', 'poutcome_success', 'poutcome_failure', 'month_oct', 'month_may',
    'month_mar', 'month_jun', 'month_jul', 'month_dec'] 
X=bank_final[cols]
Y=bank_final['y']

Implementing the model

Let's first use the statsmodels.api method to run the logistic regression model, as shown in the following code snippet:

import statsmodels.api as sm
logit_model=sm.Logit(Y,X)
result=logit_model.fit()
print(result.summary())

Fig. 6.22: The summary of the logistic regression model with selected variables

One advantage of this method is that p-values are calculated automatically in the result summary. The scikit-learn method doesn't have this facility, but it is more powerful for calculation-intensive tasks such as prediction, calculating scores, and advanced functions such as feature selection. The statsmodels.api method can be used while exploring and fine-tuning the model, while the scikit-learn method can be used for the final model that predicts the outcome.

The summary of the model looks as shown in the preceding screenshot. For each variable, a coefficient value has been estimated, and corresponding to each estimate there is a standard error and a p-value. This p-value corresponds to the hypothesis test of the Wald statistic, and the lower the p-value, the more significant the variable is in the model. For most of the variables in this model, the p-values are very low, and hence most of them are significant in the model.

We will now use the scikit-learn method to fit the model, as shown in the following code snippet:

from sklearn import linear_model
clf = linear_model.LogisticRegression()
clf.fit(X, Y)

The accuracy of this model can be calculated as follows:

clf.score(X,Y)

The value comes out to be 0.902. The mean value of the outcome is 0.11, meaning that the outcome is positive (1) around 11% of the time and negative around 89% of the time. So, even by predicting 0 for all the rows, one could have achieved an accuracy of 89%. Our model takes this accuracy to 90.2%. To enhance it a little, we can try changing the number of columns, using a training-testing split, and applying cross-validation.
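A minimal sketch of how such an evaluation might look, assuming only the X and Y defined above (the 30% test size and 10 folds are illustrative choices, not values from the text):

from sklearn.model_selection import train_test_split, cross_val_score

# hold out 30% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
clf2 = linear_model.LogisticRegression()
clf2.fit(X_train, Y_train)
print(clf2.score(X_test, Y_test))   # accuracy on unseen data

# 10-fold cross-validated accuracy on the full data
scores = cross_val_score(linear_model.LogisticRegression(), X, Y, scoring='accuracy', cv=10)
print(scores.mean())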

For this method, the values of the coefficients can be obtained using the following code snippet:

import numpy as np
pd.DataFrame(list(zip(X.columns, np.transpose(clf.coef_))))

Fig. 6.23: Coefficients for the variables in the model

The variable coefficients indicate the change in the log(odds) for a unit change in the variable. The coefficient for the previous variable is 0.37. This implies that if the previous variable increases by 1, the log(odds) will increase by 0.37 and, hence, the probability of a purchase will change accordingly.
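To translate this log-odds change into an effect on the odds themselves, the coefficient can be exponentiated; a one-line check using numpy (already imported above):

np.exp(0.37)   # about 1.45: a unit increase in 'previous' multiplies the odds of purchase by roughly 1.45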
