ANOVA test

Now let's talk about another popular hypothesis test, called ANOVA. It is used to test whether data points coming from different groups, or collected under different experimental conditions, are statistically similar to or different from each other. Examples include the average height of students in different sections of a class in a school, or the peptide length of a certain protein found in humans across different ethnicities.

ANOVA calculates two metrics to conduct this test:

  • Variance among different groups
  • Variance within each group

Based on these metrics, a test statistic is calculated with the variance among different groups as the numerator and the variance within each group as the denominator. If this statistic is large enough, it means that the variance among the groups is larger than the variance within the groups, implying that the data points coming from the different groups are indeed different.

Let's look at how variance among different groups and variance within each group can be calculated. Suppose we have k groups that data points are coming from:

The data points from group 1 are X11, X12, ..., X1n.

The data points from group 2 are X21, X22, ..., X2n.

Similarly, the data points from group k are Xk1, Xk2, ..., Xkn.

Let's use the following abbreviations and symbols to describe some features of this data:

  • Variance among different groups is represented by SSAG
  • Variance within each group is represented by SSWG
  • Number of elements in group k is represented by nk
  • Mean of the data points in group k is represented by µk
  • Mean of all data points across groups is represented by µ
  • Number of groups is represented by k

Let's define the two hypotheses for our statistical test:

  • Ho: µ1 = µ2 = ... = µk
  • Ha: at least one µi is different from the others

In other words, the null hypothesis states that the mean of the data points across all the groups is the same, while the alternative hypothesis says that the mean of at least one group is different from the others.

This leads to the following equations:

SSAG = (∑ nk * (µk - µ)**2) / (k - 1)

SSWG = (∑ ∑ (Xki - µk)**2) / (n*k - k)

In SSAG, the summation is over all the groups.

In SSWG, the inner summation is across the data points within a particular group, and the outer summation is across the groups.

The denominators in both cases denote the degrees of freedom (DOF). For SSAG, we are dealing with k group means, and once the overall mean is fixed, the last one can be derived from the other k-1 values; therefore, the DOF is k-1. For SSWG, there are n*k data points in total, but the k group means are fixed by the data within each group, and so the DOF is n*k - k.
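To make these formulas concrete, here is a minimal NumPy sketch that computes SSAG and SSWG for three small made-up groups (the numbers are purely illustrative):

    import numpy as np

    # Three hypothetical groups of measurements (purely illustrative data)
    groups = [np.array([4.1, 4.3, 3.9, 4.2]),
              np.array([5.0, 5.2, 4.8, 5.1]),
              np.array([4.5, 4.4, 4.6, 4.7])]

    k = len(groups)                      # number of groups
    N = sum(len(g) for g in groups)      # total number of data points
    mu = np.concatenate(groups).mean()   # overall mean across all groups

    # SSAG: sum over groups of nk * (µk - µ)**2, divided by k - 1
    SSAG = sum(len(g) * (g.mean() - mu)**2 for g in groups) / (k - 1)

    # SSWG: squared deviations within each group, summed over groups, divided by N - k
    SSWG = sum(((g - g.mean())**2).sum() for g in groups) / (N - k)

    print(SSAG, SSWG)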

Once these numbers have been calculated, the test statistic is calculated as follows:

Test Statistic = SSAG/SSWG

This ratio of SSAG to SSWG follows a distribution called the F-distribution, and so the statistic is called an F statistic. The F-distribution is defined by the two degrees of freedom, and there is a separate distribution for each combination of them, as shown in the following graph:

F-distribution based on the two DOFs, d1 = k-1 and d2 = n*k - k

Like any of the other tests we've looked at, we need to decide on a significance level and find the p-value associated with the F statistic for those degrees of freedom. If the p-value is less than the alpha value, the null hypothesis can be rejected. This whole calculation can be done by writing some Python code.
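As a quick illustration, and assuming SciPy is available, the p-value for a given F statistic can be obtained from the survival function of the F-distribution; the numbers below are placeholders:

    from scipy import stats

    # Degrees of freedom: d1 = k - 1 (among groups), d2 = n*k - k (within groups)
    d1, d2 = 3, 20      # placeholder values for illustration
    F_stat = 3.0        # placeholder F statistic

    # p-value: probability of an F value at least this large under the null hypothesis
    p_value = stats.f.sf(F_stat, d1, d2)
    print(p_value)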

Let's have a look at some example data and see how ANOVA can be applied:

    import pandas as pd
    data=pd.read_csv('ANOVA.csv')
    data.head()

This results in the following output:

Head of the OD data by Lot and Run

We are interested in finding out whether the mean OD is the same for different lots and runs. We will apply ANOVA for that purpose, but before that, we can draw a boxplot to get an intuitive sense of the differences in the distributions for different lots and runs:

Boxplot of OD by Lot

Similarly, a boxplot of OD grouped by Run can also be plotted:

Boxplot of OD by Run
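These plots can be produced with pandas' built-in boxplot support; the following is a minimal sketch, assuming matplotlib is installed and the columns are named OD, Lot, and Run as in the data shown above:

    import matplotlib.pyplot as plt

    # Boxplot of OD grouped by Lot
    data.boxplot(column='OD', by='Lot')

    # Boxplot of OD grouped by Run
    data.boxplot(column='OD', by='Run')

    plt.show()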

Now, let's write the Python code to perform the calculations:

    import numpy as np

    # Calculating SSAG (variance among the lot groups)
    group_mean=data.groupby('Lot').mean()
    group_mean=np.array(group_mean['OD'])        # mean OD of each lot
    tot_mean=np.array(data['OD'].mean())         # overall mean OD
    group_count=data.groupby('Lot').count()
    group_count=np.array(group_count['OD'])      # number of data points in each lot
    fac1=(group_mean-tot_mean)**2
    fac2=fac1*group_count
    DF1=(data['Lot'].unique()).size-1            # DOF among groups: k - 1
    SSAG=(fac2.sum())/DF1
    SSAG

    # Calculating SSWG (variance within the lot groups)
    group_var=[]
    for i in range((data['Lot'].unique()).size):
        lot_data=np.array(data[data['Lot']==i+1]['OD'])
        lot_data_mean=lot_data.mean()
        group_var_int=((lot_data-lot_data_mean)**2).sum()
        group_var.append(group_var_int)
    group_var_sum=(np.array(group_var)).sum()
    DF2=data.shape[0]-(data['Lot'].unique()).size    # DOF within groups: N - k
    SSWG=group_var_sum/DF2
    SSWG

    F=SSAG/SSWG
    F

The value of the F statistic comes out to be 3.84, while the degrees of freedom are 4 and 69, respectively.
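As a sanity check, the same one-way ANOVA can be run with SciPy's f_oneway function, which should reproduce the F statistic computed above and also report the associated p-value directly (a minimal sketch, assuming the same data DataFrame):

    from scipy.stats import f_oneway

    # Collect the OD values of each lot into a separate array
    groups = [data[data['Lot'] == lot]['OD'].values for lot in data['Lot'].unique()]

    # f_oneway returns the F statistic and the associated p-value
    F_check, p_check = f_oneway(*groups)
    print(F_check, p_check)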

At a significance level—that is, alpha—of 0.05, the critical value of the F statistic lies between 2.44 and 2.52 (from the F-distribution table found at: http://socr.ucla.edu/Applets.dir/F_Table.html).
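Instead of reading the table, the critical value can also be computed from the inverse CDF of the F-distribution; a minimal sketch, assuming SciPy:

    from scipy import stats

    alpha = 0.05
    d1, d2 = 4, 69   # degrees of freedom from the example above

    # Critical value: the F value that leaves probability alpha in the right tail
    critical_value = stats.f.ppf(1 - alpha, d1, d2)
    print(critical_value)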

Since the value of the F statistic (3.84) is larger than the critical value of 2.52, the F statistic lies in the rejection region, and so the null hypothesis can be rejected. Therefore, it can be concluded that the mean OD values are different for the different lot groups. At a significance level of 0.001, however, the critical value becomes larger than the F statistic, and so the null hypothesis can't be rejected; in that case, we would have to treat the OD means from the different groups as statistically the same. The same test can be performed for the different run groups. This has been left as an exercise for you to practice with.
