Deleting missing values

The simplest way to handle NA values is to delete any entry that contains an NA value, or more than a certain number of NA values. When removing entries with NA values, there is a trade-off between the correctness of the data and the completeness of the data. Data entries that contain NA values may also contain several useful non-NA values, and removing too many data entries could reduce the dataset to a point where it is no longer useful.
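To make the trade-off concrete, here is a minimal sketch using a made-up data frame (toy and its column names are hypothetical, not part of the roads dataset). It counts the NA values in each row and then applies three deletion rules of decreasing strictness:

```r
# Hypothetical toy data: three yearly columns, four regions.
toy <- data.frame(
  y2011 = c(10, NA, NA, 4),
  y2012 = c(12, 7, NA, NA),
  y2013 = c(11, 8, NA, NA)
)

# Count the NA values in each row.
na.count <- rowSums(is.na(toy))

# Strictest rule: keep only rows with no NA values at all.
toy.complete <- toy[na.count == 0, ]

# Looser rule: keep rows with at most one NA value.
toy.mostly <- toy[na.count <= 1, ]

# Loosest rule: drop only the rows in which every value is NA.
toy.any <- toy[na.count < ncol(toy), ]
```

The stricter the rule, the fewer rows survive: one row under the first rule, two under the second, and three under the third.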

For this dataset, it is not that important to have all of the years present; even one year is enough to give us a rough idea of how much road length is in the particular region at any point over the 12 years. A safe approach for this particular application would be to remove all of the rows where all of the values are NA.

A quick shortcut for finding the rows in which all values are NA is the rowSums() function. The rowSums() function computes the sum of each row, and takes an na.rm parameter that tells it to ignore NA values. The following computes, for each row of the roads.num2 dataframe, the sum of the non-NA values:

roads.num2.rowsums <- rowSums(roads.num2,na.rm=TRUE)
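As a small illustration of what na.rm changes, the following sketch runs rowSums() both ways on a made-up data frame (toy.num is a hypothetical stand-in for roads.num2):

```r
# Hypothetical stand-in for roads.num2: numeric columns only.
toy.num <- data.frame(
  y2011 = c(10, NA, NA, 0),
  y2012 = c(12, 7, NA, 0)
)

# Without na.rm, any row that contains an NA sums to NA.
sums.default <- rowSums(toy.num)

# With na.rm = TRUE, NA values are skipped, so an all-NA row sums to 0.
sums.narm <- rowSums(toy.num, na.rm = TRUE)
```

Here the third row, which is entirely NA, and the fourth row, which is entirely 0, both end up with a row sum of 0 when na.rm = TRUE.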

Because NA values are ignored, in the resulting vector of row sums, a 0 corresponds to either a row with all NA values or a region with no roads. In either case, a 0 value corresponds to a row that is not important and can be filtered out. The following creates an index that can be used to filter out all such rows:

roads.keep3 <- roads.num2.rowsums > 0
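The comparison produces a logical vector with one TRUE or FALSE per row. The sketch below shows this on made-up data (toy.num is hypothetical), and also shows one way to tell an all-NA row apart from a row of zeros, in case that distinction matters for a different application:

```r
# Hypothetical stand-in for roads.num2.
toy.num <- data.frame(
  y2011 = c(10, NA, NA, 0),
  y2012 = c(12, 7, NA, 0)
)

toy.rowsums <- rowSums(toy.num, na.rm = TRUE)

# TRUE for rows worth keeping, FALSE for all-NA rows and all-zero rows.
toy.keep <- toy.rowsums > 0

# TRUE only for rows in which every value is NA.
toy.all.na <- rowSums(is.na(toy.num)) == ncol(toy.num)
```

In this example, toy.keep is FALSE for the last two rows, while toy.all.na is TRUE only for the third row.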

In the following continuation of r_intro.R, the roads.keep3 vector is used to filter out the rows that have either all NA values or 0 roads:

roads3 <- roads2[roads.keep3,]
roads.num3 <- roads.num2[roads.keep3,]
roads.means3 <- roads.means2[roads.keep3]
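The pattern here is to apply one logical index to every object that shares the same row order, so that they stay aligned. A minimal sketch on made-up data follows; the toy names are hypothetical, and the use of rowMeans() to build the vector of means is an assumption made only for illustration:

```r
# Hypothetical data frame with an identifier column and two numeric columns.
toy <- data.frame(
  region = c("A", "B", "C", "D"),
  y2011 = c(10, NA, NA, 0),
  y2012 = c(12, 7, NA, 0)
)
toy.num <- toy[, c("y2011", "y2012")]

# A parallel vector with one value per row (assumed here to be row means).
toy.means <- rowMeans(toy.num, na.rm = TRUE)

# One index, built once.
toy.keep <- rowSums(toy.num, na.rm = TRUE) > 0

# Data frames are filtered by row, so the index goes before the comma.
toy.filtered <- toy[toy.keep, ]
toy.num.filtered <- toy.num[toy.keep, ]

# A plain vector has no columns, so no comma is needed.
toy.means.filtered <- toy.means[toy.keep]
```

After filtering, all three objects contain only the entries for regions A and B, in the same order.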

Next, I will do a quick demonstration of another approach to NA handling, replacing the values with a constant.
