How it works...

In any dataset there is always a chance that some values are missing. In R, a missing value is represented by the NA symbol, which stands for not available. Most functions in R have some way to deal with missing values, but many do not ignore them by default: for example, the mean and sum functions return NA if any element of their input is NA. Consider the following simple example, in which a vector containing one NA is created. When we try to find its mean, the result is NA because of the missing value. The mean function supports a parameter, na.rm, that removes the NA values from the vector before the mean is computed:

> t = c(1,2,3,NA,4,5) # Created a vector with missing value
> mean(t) # finding mean will give NA
[1] NA
> mean(t, na.rm = TRUE) # using na.rm = TRUE will work
[1] 3
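The same behavior applies to sum, and is.na can be used to count or locate the missing entries directly. The following is a minimal sketch using the same values as the vector above; the variable names are chosen here for illustration only.

```r
# Same values as the example vector above (named x here for illustration).
x <- c(1, 2, 3, NA, 4, 5)

# sum() propagates NA by default, just like mean().
total_raw <- sum(x)                  # NA

# With na.rm = TRUE the NA is dropped before summing.
total_clean <- sum(x, na.rm = TRUE)  # 15

# is.na() returns a logical vector; summing it counts the missing values.
n_missing <- sum(is.na(x))           # 1

# which() gives the positions of the missing values.
missing_at <- which(is.na(x))        # 4

print(total_raw)
print(total_clean)
print(n_missing)
print(missing_at)
```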

This approach works because most functions provide a similar way to handle NA values. It is generally better, however, to deal with such values explicitly, either by removing the affected observations from the set or by replacing the missing values with estimates; the latter is known as imputing. In the preceding recipe, we used the is.na function to find the NA values in the data. In the next step, we computed the percentage of missing values in the set: using the sapply function, we computed the percentage of missing values for each attribute (column).
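The percentage computation can be sketched as follows. This is an illustration rather than the recipe's exact code: it assumes the data is R's built-in airquality data frame (which the Ozone and Solar.R names mentioned in this section suggest), and the name pct_missing is made up for this example.

```r
# Assumption: the recipe's data is the built-in airquality data frame.
data(airquality)

# For each column, is.na() yields a logical vector; its mean is the
# fraction of missing entries, so multiplying by 100 gives a percentage.
pct_missing <- sapply(airquality, function(col) mean(is.na(col)) * 100)

# Sort so that the columns with the most missing values come first.
pct_missing <- sort(pct_missing, decreasing = TRUE)

print(round(pct_missing, 2))
```

Taking the mean of a logical vector is a compact alternative to dividing sum(is.na(col)) by length(col); both give the same result.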

To visualize the missing values, the Amelia package is used, which plots a missing value map of all observations and attributes on one chart. From the plot, it is clearly visible that the Ozone attribute has the most missing values, followed by Solar.R.
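A plot of this kind can be produced with the missmap function from the Amelia package. The sketch below again assumes the built-in airquality data frame and that Amelia may or may not be installed; the plot title and the na_counts name are illustrative choices, not taken from the recipe.

```r
# Assumption: the recipe's data is the built-in airquality data frame.
data(airquality)

# Per-column count of missing values, used as a text fallback.
na_counts <- colSums(is.na(airquality))

if (requireNamespace("Amelia", quietly = TRUE)) {
  # missmap() draws one cell per observation and attribute,
  # distinguishing missing from observed values.
  Amelia::missmap(airquality, main = "Missing values in airquality")
} else {
  # Amelia is not installed: print the counts instead of plotting.
  print(sort(na_counts, decreasing = TRUE))
}
```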
