Statistics

A dataset is often summarized by a value that describes its central position. Such a value is known as a measure of central tendency and is one kind of summary statistic. The most common measures are the mean, median, and mode. The mean is the most widely used, but the appropriate measure depends on the scenario: for example, the median is less sensitive to outliers than the mean. The following examples show how to compute each measure of central tendency in R.

Mean

The mean is the equally weighted average of the sample. For example, the following code computes the arithmetic mean of Volume in the dataset Sampledata:

mean(Sampledata$Volume) 
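
As a quick check of the definition, the following sketch uses a small made-up vector (the values are illustrative and are not taken from Sampledata) and shows that mean() matches the sum divided by the number of observations. It also shows the na.rm argument, which may be needed if the column contains missing values.

```r
# Hypothetical volumes, for illustration only
volume <- c(120, 150, 90, 200, 140)

# Arithmetic mean: every observation carries the same weight 1/n
manual_mean <- sum(volume) / length(volume)
builtin_mean <- mean(volume)

# With a missing value, mean() returns NA unless na.rm = TRUE
volume_na <- c(volume, NA)
mean_with_na <- mean(volume_na)
mean_na_removed <- mean(volume_na, na.rm = TRUE)

print(builtin_mean)
print(mean_na_removed)
```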

Median

The median is the middle value of the data when it is sorted (for an even number of observations, it is the average of the two middle values). It can be computed by executing the following code:

median(Sampledata$Volume) 
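
To make the "middle value" idea concrete, here is a minimal sketch on two made-up vectors (illustrative values, not from Sampledata), one with an odd and one with an even number of observations. It compares a manual computation with median().

```r
# Hypothetical values, for illustration only
odd_x  <- c(7, 1, 5, 3, 9)        # n = 5
even_x <- c(7, 1, 5, 3, 9, 11)    # n = 6

# Odd n: the middle element of the sorted vector
sorted_odd <- sort(odd_x)
manual_odd <- sorted_odd[(length(sorted_odd) + 1) / 2]

# Even n: the average of the two middle elements
sorted_even <- sort(even_x)
n <- length(sorted_even)
manual_even <- mean(sorted_even[c(n / 2, n / 2 + 1)])

print(c(manual_odd, median(odd_x)))
print(c(manual_even, median(even_x)))
```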

Mode

The mode is the value of an attribute that occurs with the highest frequency. R has no built-in function for the statistical mode (the base function mode() returns an object's storage mode instead), so we write a function to compute it:

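findmode <- function(x) {
   # Distinct values, in order of first appearance
   uniqx <- unique(x)
   # Count occurrences of each distinct value and return the most frequent one
   uniqx[which.max(tabulate(match(x, uniqx)))]
}
findmode(Sampledata$return)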

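Executing the preceding code gives the mode of the return attribute of the dataset. If several values tie for the highest frequency, the function returns the one that appears first in the data.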
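
The following sketch applies findmode to made-up vectors (illustrative values only) to show how it behaves when two values are equally frequent. The helper find_all_modes is a hypothetical variant, not part of the original text, that returns every tied value.

```r
findmode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Hypothetical variant: return all values that share the highest frequency
find_all_modes <- function(x) {
  uniqx <- unique(x)
  counts <- tabulate(match(x, uniqx))
  uniqx[counts == max(counts)]
}

# 2 and 3 both occur twice; findmode returns the one seen first
x <- c(2, 3, 3, 5, 2)
single_mode <- findmode(x)
all_modes <- find_all_modes(x)

# The same functions also work on character data
letter_mode <- findmode(c("a", "b", "b"))

print(single_mode)
print(all_modes)
```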

Summary

We can also generate basic statistics of a column by executing the following code:

summary(Sampledata$Volume) 

This generates the minimum, first quartile (Q1), median, mean, third quartile (Q3), and maximum.
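
As an illustration, the sketch below runs summary() on a small made-up vector (values are not from Sampledata) and retrieves the same quartiles with quantile(). It assumes the default quantile method in base R.

```r
# Hypothetical volumes, for illustration only
volume <- c(120, 150, 90, 200, 140)

# Six-number summary: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s <- summary(volume)
print(s)

# The quartiles can also be requested directly
quartiles <- quantile(volume, probs = c(0.25, 0.75))
print(quartiles)

# Interquartile range: Q3 minus Q1
iqr <- IQR(volume)
print(iqr)
```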

Moment

Moments describe characteristics of a distribution, such as variance and skewness. The following code gives the third-order central moment of the attribute Volume; one can change the order to obtain other characteristics. The moment() function comes from the package e1071, which must be installed and loaded first:

library(e1071)
moment(Sampledata$Volume, order=3, center=TRUE)
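
To show what a central moment is, here is a minimal base R sketch that does not require e1071. The function name central_moment and the data are hypothetical; the formula is the usual definition (the average of the deviations from the mean raised to the given order), which is assumed here to correspond to calling moment() with center=TRUE.

```r
# Hypothetical helper: central moment of a given order
central_moment <- function(x, order) {
  mean((x - mean(x))^order)
}

# Hypothetical data, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

m2 <- central_moment(x, 2)   # second central moment (divides by n)
m3 <- central_moment(x, 3)   # third central moment

# var() divides by n - 1, so it differs from m2 by the factor (n - 1) / n
n <- length(x)
rescaled_var <- var(x) * (n - 1) / n

print(c(m2, m3, rescaled_var))
```

Note that the second central moment computed this way divides by n, so it is slightly smaller than the sample variance returned by var().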

Kurtosis

Kurtosis measures whether the data is heavy-tailed or light-tailed relative to a normal distribution. Datasets with high kurtosis tend to have heavy tails, or outliers. Datasets with low kurtosis tend to have light tails and fewer outliers. The computed kurtosis is compared with that of a normal distribution, and the interpretation is based on that comparison. The kurtosis() function in e1071 reports excess kurtosis, that is, kurtosis minus 3, so a normal distribution corresponds to a value of 0.

The kurtosis of Volume is given by the following code:

kurtosis(Sampledata$Volume) 

It gives the value 5.777117, which indicates that the distribution of Volume is leptokurtic (heavy-tailed).
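
For intuition, excess kurtosis can be sketched in base R from central moments. The helper name and the data below are hypothetical, and this is the simple moment-based estimator; kurtosis() in e1071 offers several estimator variants through its type argument, so its values on small samples may differ somewhat from this sketch.

```r
# Hypothetical data, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Hypothetical helper: central moment of a given order
central_moment <- function(x, order) {
  mean((x - mean(x))^order)
}

# Excess kurtosis: fourth central moment over squared second moment, minus 3
excess_kurtosis <- central_moment(x, 4) / central_moment(x, 2)^2 - 3

# A negative value suggests lighter tails than a normal distribution,
# a positive value suggests heavier tails
print(excess_kurtosis)
```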

Skewness

Skewness is a measure of the asymmetry of the distribution. As a rule of thumb, if the mean of the data values is less than the median, the distribution is left-skewed (negative skewness), and if the mean is greater than the median, the distribution is right-skewed (positive skewness).

The skewness of Volume is computed as follows in R:

skewness(Sampledata$Volume) 

This gives the result 1.723744; the positive value means the distribution is right-skewed.
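
The link between the sign of the skewness and the mean-versus-median rule can be sketched in base R. The helper name and the data are hypothetical, and the formula is the simple moment-based estimator, which may differ slightly from the default used by skewness() in e1071 on small samples.

```r
# Hypothetical data with a longer right tail, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Hypothetical helper: central moment of a given order
central_moment <- function(x, order) {
  mean((x - mean(x))^order)
}

# Moment-based skewness: third central moment over the second moment to the power 3/2
skew <- central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

# For this right-skewed sample the mean exceeds the median
print(skew)
print(c(mean = mean(x), median = median(x)))
```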

Note

Computing skewness and kurtosis requires the package e1071, which must be installed with install.packages("e1071") and loaded with library(e1071) before use.
