We will now cover some basic measures of central tendency and dispersion, along with simple plots. The first question to address is how R handles missing values in calculations. To see what happens, create a vector with a missing value (NA in the R language), then sum the values of the vector with sum():
> a = c(1,2,3,NA)
> sum(a)
[1] NA
Unlike SAS, which would sum the non-missing values, R returns NA to signal that at least one value is missing. We could create a new vector with the missing value deleted, but you can also tell sum() to exclude any missing values by passing the na.rm=TRUE argument:
> sum(a, na.rm=TRUE)
[1] 6
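The alternative mentioned above, building a new vector with the missing value removed, can be sketched with is.na(). This is only an illustration; the name b is not from the original text.

```r
a = c(1, 2, 3, NA)

# is.na() flags each missing element with TRUE
is.na(a)

# Keep only the non-missing elements in a new vector
# (b is an illustrative name, not from the original text)
b = a[!is.na(a)]
sum(b)

# na.rm=TRUE is also accepted by other summary functions, such as mean()
mean(a, na.rm=TRUE)
```

Passing na.rm=TRUE is usually the shorter route, while a filtered copy is handy when the same cleaned vector is reused in several calculations.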
R provides functions to compute the measures of central tendency and dispersion of a vector:
> data = c(4,3,2,5.5,7.8,9,14,20)
> mean(data)
[1] 8.1625
> median(data)
[1] 6.65
> sd(data)
[1] 6.142112
> max(data)
[1] 20
> min(data)
[1] 2
> range(data)
[1]  2 20
> quantile(data)
   0%   25%   50%   75%  100% 
 2.00  3.75  6.65 10.25 20.00 
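To make the dispersion measures less of a black box, the following sketch recomputes the standard deviation by hand. sd() uses the sample formula, which divides by n - 1. The names n, manual_var, and manual_sd are illustrative.

```r
data = c(4, 3, 2, 5.5, 7.8, 9, 14, 20)

n = length(data)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var = sum((data - mean(data))^2) / (n - 1)
manual_sd = sqrt(manual_var)

# Compare with the built-in functions (all.equal allows for rounding error)
all.equal(manual_var, var(data))
all.equal(manual_sd, sd(data))

# range() returns the minimum and maximum; diff() gives the distance between them
diff(range(data))
```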
The summary() function reports the minimum, first quartile, median, mean, third quartile, and maximum in a single call:
> summary(data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   3.750   6.650   8.162  10.250  20.000 
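The quartiles reported by summary() are the same values that quantile() returns. As a small sketch, the probs argument of quantile() requests specific quantiles, and IQR() gives the distance between the first and third quartiles. The name q is illustrative.

```r
data = c(4, 3, 2, 5.5, 7.8, 9, 14, 20)

# Request only the first and third quartiles
q = quantile(data, probs=c(0.25, 0.75))
q

# The interquartile range is the third quartile minus the first
IQR(data)

# Individual entries of summary() can be extracted by name
summary(data)["Median"]
```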
We can use plots to visualize the data. The base plot here will be barplot(), then we will use abline() to add horizontal lines at the mean and median. As the default line is solid, we will draw a dashed line for the median with lty=2 to distinguish it from the mean:
> barplot(data)
> abline(h=mean(data))
> abline(h=median(data), lty=2)
The output of the preceding commands is as follows:
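Two lines that differ only in line type are easy to mix up, so one possible extension is to add a legend. The sketch below repeats the bar plot and labels the lines with legend(). The title text, the legend position, and the name mids are illustrative choices, not part of the original example.

```r
data = c(4, 3, 2, 5.5, 7.8, 9, 14, 20)

# barplot() draws the bars and returns the x-positions of their midpoints
mids = barplot(data, main="Values with mean and median")

abline(h=mean(data))            # solid line (lty=1 is the default)
abline(h=median(data), lty=2)   # line type 2 for the median

# Label the two reference lines using the same line types
legend("topleft", legend=c("mean", "median"), lty=c(1, 2))
```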
A number of functions are available to generate random data from different distributions. Here, we look at one such function, rnorm(), which by default draws from a normal distribution with a mean of zero and a standard deviation of one, and use it to create 100 data points. We will then plot the values and also plot a histogram. To reproduce the results shown here, set the same random seed with set.seed():
> set.seed(1)
> norm = rnorm(100)
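The effect of set.seed() can be checked directly: resetting the seed to the same value before each call makes rnorm() return the same draws. The sketch below also shows the mean and sd arguments of rnorm() for a distribution other than the standard normal. The names first, second, and third, and the values 10 and 2, are illustrative.

```r
# Same seed, same sequence of random numbers
set.seed(1)
first = rnorm(100)

set.seed(1)
second = rnorm(100)

identical(first, second)

# mean and sd default to 0 and 1; other values can be supplied
third = rnorm(100, mean=10, sd=2)
mean(third)
sd(third)
```

The sample mean and standard deviation of third will be close to, but not exactly, 10 and 2, since they are estimated from only 100 random draws.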
This is the plot of the 100 data points:
> plot(norm)
The output of the preceding command is as follows:
Finally, produce a histogram with hist(norm):
> hist(norm)
The following is the output of the preceding command:
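As a further sketch, hist() also returns an object describing the bins, which can be inspected without drawing anything by passing plot=FALSE. The breaks argument suggests how many bins to use, and R treats the number as a suggestion rather than an exact count. The name h and the value 20 are illustrative.

```r
set.seed(1)
norm = rnorm(100)

# plot=FALSE returns the histogram object without drawing it
h = hist(norm, plot=FALSE)
h$breaks        # bin boundaries
h$counts        # number of points falling in each bin
sum(h$counts)   # every one of the 100 points lands in some bin

# Ask for roughly 20 bins to get a finer-grained picture
hist(norm, breaks=20, main="Histogram of norm")
```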