Case study 2 – Why do some people cheat on their spouses?

In 1978, a survey was conducted on housewives in order to discern factors that led them to pursue extramarital affairs. This study became the basis for many future studies of both men and women, all attempting to focus on features of people and marriages that led either partner to seek partners elsewhere behind their spouse's back.

Supervised learning is not always about prediction. In this case study, we will only try to identify which few of the many available factors are most strongly associated with someone pursuing an affair.

First, let's read in the data:

# Using the dataset from a 1978 survey conducted to measure how likely women were to have extramarital affairs
# http://statsmodels.sourceforge.net/stable/datasets/generated/fair.html 
 
import statsmodels.api as sm 
affairs_df = sm.datasets.fair.load_pandas().data 
affairs_df.head() 

The statsmodels website provides a data dictionary, as follows:

  • rate_marriage: The rating given to the marriage (given by the wife); 1 = very poor, 2 = poor, 3 = fair, 4 = good, 5 = very good; ordinal level
  • age: Age of the wife; ratio level
  • yrs_married: Number of years married; ratio level
  • children: Number of children between husband and wife; ratio level
  • religious: How religious the wife is; 1 = not, 2 = mildly, 3 = fairly, 4 = strongly; ordinal level
  • educ: Level of education; 9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree; ratio level
  • occupation: 1 = student; 2 = farming, agriculture; semi-skilled, or unskilled worker; 3 = white-collar; 4 = teacher, counselor, social worker, nurse; artist, writer; technician, skilled worker; 5 = managerial, administrative, business; 6 = professional with advanced degree; nominal level
  • occupation_husb: Husband's occupation. Same as occupation; nominal level
  • affairs: Measure of time spent in extramarital affairs; ratio level

Okay, so we have a quantitative response, but my question is simply which factors are associated with someone having an affair at all. The exact number of minutes or hours does not really matter that much. For this reason, let's make a new categorical variable called affair_binary, which is either true (they had an affair for more than 0 minutes) or false (they had an affair for 0 minutes):

# Create a categorical variable 
 
affairs_df['affair_binary'] = (affairs_df['affairs'] > 0) 

Again, this column has either a true or a false value. The value is true if the person had an extramarital affair for more than 0 minutes. The value is false otherwise. From now on, let's use this binary response as our primary response. Now, we are trying to find which of these variables are associated with our response, so let's begin.

Let's start with a simple correlation matrix. Recall that this matrix shows us linear correlations between our quantitative variables and our response. I will show the correlation matrix as both a matrix of decimals and also as a heat map. Let's see the numbers first:

# find linear correlations between variables and affair_binary 
affairs_df.corr()

Correlation matrix for extramarital affairs data from a Likert survey conducted in 1978

Remember that we ignore the diagonal series of 1s because they are merely telling us that every quantitative variable is correlated with itself. Note the other correlated variables, which are the values closest to 1 and -1 in the last row or column (the matrix is always symmetrical across the diagonal).

We can see a few standout variables:

  • affairs
  • age
  • yrs_married
  • children

These are the top four variables with the largest magnitude (absolute value). However, one of these variables is cheating. The affairs variable is the largest in magnitude, but is obviously correlated to affair_binary because we made the affair_binary variable directly based on affairs. So let's ignore that one.
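
Reading the last row of a large matrix by eye is error-prone, so it can help to pull out just the response column and sort it by magnitude. The following is a minimal sketch on made-up data: toy_df and the helper rank_by_correlation are hypothetical names introduced here for illustration and are not part of the survey dataset or of pandas.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey; the values are made up.
toy_df = pd.DataFrame({
    "rate_marriage": [5, 4, 2, 1, 3, 5, 2, 4],
    "yrs_married":   [2, 5, 16, 23, 9, 1, 13, 6],
    "affairs":       [0.0, 0.0, 1.2, 3.5, 0.0, 0.0, 0.4, 0.0],
})
toy_df["affair_binary"] = toy_df["affairs"] > 0


def rank_by_correlation(df, response, exclude=()):
    """Return correlations with `response`, ordered by absolute value."""
    # Cast to float so the boolean response is treated as 0/1.
    corr = df.astype(float).corr()[response]
    # Drop the response itself and any column it was derived from.
    corr = corr.drop(labels=[response, *exclude])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranking = rank_by_correlation(toy_df, "affair_binary", exclude=["affairs"])
print(ranking)
```

On the real data, the same idea would be applied to affairs_df, again excluding affairs because affair_binary was built from it.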

Let's take a look at our correlation heat map to see whether our views can be seen there:

import seaborn as sns 
sns.heatmap(affairs_df.corr()) 

Correlation matrix

The same correlation matrix, but this time as a heat map. Note the colors close to dark red and dark blue (excluding the diagonal).

We are looking for the dark red and dark blue areas of the heat map. These colors are associated with the most correlated features.

Remember that correlations are not the only way to identify which features are associated with our response. This method only shows us how linearly correlated the variables are with each other. We may find another variable that affects affairs by evaluating the feature importances of a decision tree classifier. These methods might reveal new variables that are associated with our response, but not in a linear fashion.
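
To see why a linear correlation can miss a real association, here is a small sketch on synthetic data (nothing here comes from the survey; x, noise, and y are made up for illustration). The response depends entirely on x, but in a U-shaped way, so the correlation is close to zero while a decision tree still picks x as the important feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic feature, symmetric around zero, plus an unrelated noise feature.
x = np.linspace(-1, 1, 200)
noise = rng.normal(size=200)

# The response is True only at the extremes of x: a non-linear relationship.
y = np.abs(x) > 0.5

# Linear (Pearson) correlation between x and the response.
r = np.corrcoef(x, y.astype(float))[0, 1]

# Fit a tree on both features and read off its feature importances.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(np.column_stack([x, noise]), y)
importances = tree.feature_importances_

print("correlation of x with y:", r)
print("tree importances (x, noise):", importances)
```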

Note

Also notice that there are two variables here that don't actually belong… Can you spot them? These are the occupation and occupation_husb variables. Recall that we deemed them nominal earlier, so they have no right to be included in this correlation matrix. They appear because Pandas, unknowingly, stores their category codes as numbers and therefore treats them as quantitative variables. Don't worry, we will fix this soon.

First, let's make ourselves an X and a y DataFrame:

affairs_X = affairs_df.drop(['affairs', 'affair_binary'], axis=1)  
# data without the affairs or affair_binary column 
 
affairs_y = affairs_df['affair_binary'] 

Now, we will instantiate a decision tree classifier and cross-validate our model in order to determine whether or not the model is doing an okay job at fitting our data:

from sklearn.tree import DecisionTreeClassifier
# import the classifier

model = DecisionTreeClassifier()
# instantiate the model

from sklearn.model_selection import cross_val_score
# import our cross validation module (sklearn.cross_validation in older scikit-learn releases)

# estimate the accuracy with 10-fold cross validation
scores = cross_val_score(model, affairs_X, affairs_y, cv=10)
 
print( scores.mean(), "average accuracy" )
0.659756806845 average accuracy 
 
 
print( scores.std(), "standard deviation") # very low, meaning variance of the model is low 
0.0204081732291 standard deviation 
 
# Looks ok on the cross validation side 

Because the standard deviation of our fold scores is low (and variance is the square of standard deviation), we may take this as a sign that the variance of our model is low. This is good because it means that our model is not fitting wildly differently on each fold of the cross validation and that it is generally a reliable model.
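
A low standard deviation tells us the model is stable, but not whether its accuracy is any good. One check that could be added here, as our own suggestion rather than part of the original analysis, is to compare against a baseline that always guesses the most common class. The sketch below uses scikit-learn's DummyClassifier on made-up data; X_toy and y_toy are hypothetical stand-ins for affairs_X and affairs_y.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up features and an imbalanced response that is unrelated to them.
X_toy = rng.normal(size=(300, 3))
y_toy = rng.random(300) < 0.3

# Baseline: always predict whichever class is most frequent in the training fold.
baseline = DummyClassifier(strategy="most_frequent")
tree = DecisionTreeClassifier(random_state=0)

baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=10)
tree_scores = cross_val_score(tree, X_toy, y_toy, cv=10)

print(baseline_scores.mean(), "baseline average accuracy")
print(tree_scores.mean(), "tree average accuracy")
```

If the tree's average accuracy is not clearly above the baseline's, the stability of its scores says little about how useful the model is.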

Because we agree that our decision tree classifier is generally a reliable model, we can fit the tree to our entire dataset and use the importance metric to identify which variables our tree deems the most important:

import pandas as pd

# Explore individual features that make the biggest impact
# The top three are rate_marriage, yrs_married, and occupation_husb. But one of these variables doesn't quite make sense, right?
# It's the occupation variable: because it is nominal, its numeric codes have no meaningful order
model.fit(affairs_X, affairs_y)
pd.DataFrame({'feature': affairs_X.columns, 'importance': model.feature_importances_}).sort_values('importance').tail(3)

So, yrs_married and rate_marriage both are important, but the most important variable is occupation_husb. But that doesn't make sense because that variable is nominal! So, let's apply our dummy variable technique wherein we create new columns that represent each option for occupation_husb and also for occupation.

Firstly, for the occupation column:

# Dummy Variables:

# Encoding qualitative (nominal) data using separate columns (see slides for linear regression for more)

# prefix='occ' yields column names such as occ_2.0; .iloc[:, 1:] drops the first dummy column
occupation_dummies = pd.get_dummies(affairs_df['occupation'], prefix='occ').iloc[:, 1:]

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
affairs_df = pd.concat([affairs_df, occupation_dummies], axis=1)
affairs_df.head()

This new DataFrame has many new columns:


Remember, each of these new columns, occ_2.0, occ_4.0, and so on, is a binary variable that indicates whether or not the wife holds job 2, or 4, and so on:

# Now for the husband's job

# prefix='occ_husb' yields column names such as occ_husb_2.0
occupation_dummies = pd.get_dummies(affairs_df['occupation_husb'], prefix='occ_husb').iloc[:, 1:]

# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
affairs_df = pd.concat([affairs_df, occupation_dummies], axis=1)
affairs_df.head()
 
(6366, 15) 
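
As an aside, pandas can drop the first dummy column for us. The sketch below is a small illustration on a made-up Series (the values are hypothetical, not taken from the survey) using the drop_first parameter of pd.get_dummies, which should be equivalent to slicing with .iloc[:, 1:].

```python
import pandas as pd

# Made-up occupation codes, stored as floats like the survey columns.
occupation = pd.Series([1.0, 2.0, 3.0, 2.0, 1.0], name="occupation")

# One binary column per category, skipping the first category (1.0),
# which becomes the implicit reference level.
dummies = pd.get_dummies(occupation, prefix="occ", drop_first=True)

print(dummies)
```

A row with all dummy columns equal to zero (or False) belongs to the dropped first category.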

Now we have dummy columns for both the wife's and the husband's occupation! Let's run our tree again and find the most important variables:

# remove appropriate columns for feature set
affairs_X = affairs_df.drop(['affairs', 'affair_binary', 'occupation', 'occupation_husb'], axis=1)
affairs_y = affairs_df['affair_binary']

model = DecisionTreeClassifier()

from sklearn.model_selection import cross_val_score

# estimate the accuracy with 10-fold cross validation
scores = cross_val_score(model, affairs_X, affairs_y, cv=10)

print(scores.mean(), "average accuracy")
print(scores.std(), "standard deviation")  # very low, meaning variance of the model is low

# Still looks ok

# Explore individual features that make the biggest impact
model.fit(affairs_X, affairs_y)
pd.DataFrame({'feature': affairs_X.columns, 'importance': model.feature_importances_}).sort_values('importance').tail(10)

And there you have it:

  • rate_marriage: The rating of the marriage, as told by the decision tree
  • children: The number of children they had, as told by the decision tree and our correlation matrix
  • yrs_married: The number of years they had been married, as told by the decision tree and our correlation matrix
  • educ: The level of education the women had, as told by the decision tree
  • age: The age of the women, as told by the decision tree and our correlation matrix

These seem to be the top five most important variables in determining whether or not a woman from the 1978 survey would be involved in an extramarital affair.
