Chi-square test

In this section, we will learn how to implement a chi-square test from scratch in Python and run it on an example dataset.

A chi-square test is conducted to determine whether there is a statistically significant association between two categorical variables. Note that it tests association, not causation.

For example, in the following dataset, a chi-square test can be used to determine whether color preference is associated with personality type (introvert or extrovert):

The two hypotheses for a chi-square test are as follows:

  • H0: Color preferences are not associated with a personality type
  • Ha: Color preferences are associated with a personality type

To calculate the chi-square statistic, we assume that the null hypothesis is true. If there is no relationship between the two variables, then each cell's expected count is simply proportional to its row and column totals: we take the column total's share of the grand total and multiply it by the row total for that cell. In other words, the absence of an association implies a simple proportional distribution of counts. Therefore, we calculate the expected frequency in each cell (assuming the null hypothesis is true) as follows:

Expected Frequency = (Row Total x Column Total) / Grand Total
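This formula can be checked on a small table. The following sketch uses a hypothetical 2x2 table of counts (the numbers are illustrative, not from the dataset in this section):

```python
import numpy as np

# Hypothetical 2x2 table of observed counts
# (rows: personality types, columns: color preferences)
observed = np.array([[30, 20],
                     [10, 40]])

row_tot = observed.sum(axis=1)   # [50, 50]
col_tot = observed.sum(axis=0)   # [40, 60]
total = observed.sum()           # 100

# Expected frequency for each cell: (row total x column total) / grand total
expected = np.outer(row_tot, col_tot) / total
print(expected)
```

Here, `np.outer` builds the full matrix of row-total-times-column-total products in one step, so the formula is applied to every cell at once.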

Once the expected frequencies have been calculated, we sum, over all cells, the squared difference between the observed and expected frequency divided by the expected frequency:

Chi_Square_Stat = Sum((Observed Frequency - Expected Frequency)**2 / Expected Frequency)

This statistic follows a chi-square distribution with a parameter called the degrees of freedom (DOF). The degrees of freedom are given by the following equation:

DOF = (Number of Rows - 1) * (Number of Columns - 1)
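As a quick worked example, for the 2 x 4 table used later in this section (two personality types, four colors), the formula gives 3 degrees of freedom:

```python
# Degrees of freedom for an r x c contingency table
rows, cols = 2, 4  # two personality types, four color preferences
dof = (rows - 1) * (cols - 1)
print(dof)  # 3
```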

There is a different distribution for each degree of freedom. This is shown in the following diagram:

Chi-square distributions for different degrees of freedom

Like any other test we've looked at, we need to decide on a significance level and find the p-value associated with the chi-square statistic for that degree of freedom.

If the p-value is less than the alpha value, the null hypothesis can be rejected.
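Rather than looking the p-value up in a table, it can also be computed programmatically, assuming SciPy is available. The statistic and degrees of freedom below are hypothetical placeholder values:

```python
from scipy.stats import chi2

alpha = 0.01
stat, dof = 16.3, 3  # hypothetical statistic and degrees of freedom

# Survival function sf(x, dof) = P(X >= x) under a chi-square
# distribution with the given degrees of freedom, i.e. the p-value
p_value = chi2.sf(stat, dof)
print(p_value, p_value < alpha)
```

If `p_value < alpha` evaluates to `True`, the null hypothesis is rejected at that significance level.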

This whole calculation can be done by writing some Python code. The following two functions calculate the chi-square statistic and the degrees of freedom:

import numpy as np

# Function to calculate the chi-square statistic
def chi_sq_stat(data_ob):
    col_tot = data_ob.sum(axis=0)
    row_tot = data_ob.sum(axis=1)
    tot = data_ob.sum()
    # Expected frequencies under H0: (row total x column total) / grand total
    data_ex = np.outer(row_tot, col_tot) / tot
    chi = (data_ob - data_ex)**2 / data_ex
    return chi.sum()

# Function to calculate the degrees of freedom
def degree_of_freedom(data_ob):
    dof = (data_ob.shape[0] - 1) * (data_ob.shape[1] - 1)
    return dof

# Calculating these for the observed data
data_ob = np.array([(20, 6, 30, 44), (180, 34, 50, 36)])
chi_sq_stat(data_ob)
degree_of_freedom(data_ob)

The chi-square statistic comes out to approximately 71.2, while the degrees of freedom are 3. The p-value can be looked up using the table found here: https://people.smp.uq.edu.au/YoniNazarathy/stat_models_B_course_spring_07/distributions/chisqtab.pdf.

From the table, the p-value for a statistic of 71.2 with 3 degrees of freedom is very close to 0. Even if we choose alpha to be a small number, such as 0.01, the p-value is still smaller. With this, we can say that the null hypothesis can be rejected with a high degree of statistical confidence.
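As a sanity check on the from-scratch implementation, the same result can be obtained with SciPy's built-in test, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import chi2_contingency

data_ob = np.array([(20, 6, 30, 44), (180, 34, 50, 36)])

# correction=False disables Yates' continuity correction,
# which SciPy would otherwise apply to tables with 1 degree of freedom,
# so the result matches the plain formula used above
chi2_stat, p_value, dof, expected = chi2_contingency(data_ob, correction=False)
print(chi2_stat, dof, p_value)
```

`chi2_contingency` also returns the matrix of expected frequencies, which is useful for verifying the intermediate step of the manual calculation.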
