It is generally accepted that data quality issues fall into one of the following categories:
The quality of your data is affected by the way it is entered, stored, and managed. Addressing data quality, a process most often referred to as data quality assurance (DQA), requires routine, regular review and evaluation of the data, along with the ongoing processes termed profiling and scrubbing. This is vital even when the data is stored in multiple disparate systems, which makes these processes more difficult.
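As a minimal sketch of what profiling and scrubbing can look like in R (the data frame and column names here are hypothetical, not from the text):

```r
# Hypothetical data frame standing in for raw, unreviewed data
raw <- data.frame(
  age  = c(34, NA, 52, 29, 41),
  city = c("Akron", "akron", "Dayton", "", "Dayton"),
  stringsAsFactors = FALSE
)

# Profile: inspect structure, summary statistics, and missing-value counts
str(raw)
summary(raw)
colSums(is.na(raw))               # NA count per column

# Scrub: normalize case/whitespace and treat empty strings as missing
raw$city <- tolower(trimws(raw$city))
raw$city[raw$city == ""] <- NA
```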
Here, tidying the data is much more project centric: we are probably not concerned with creating a formal DQA process, only with making certain that the data is correct for a particular predictive project.
In statistics, data that has not yet been reviewed by the data scientist is considered raw and cannot be reliably used in a predictive project. Tidying the data will usually involve several steps, and taking the extra time to break out the work is strongly recommended, rather than haphazardly addressing multiple data issues together.
The first step requires bringing the data to what may be called mechanical correctness. In this first step, you focus on things such as:
The second step is to address the statistical soundness of the data. Here we correct values that are mechanically correct but are still likely (depending upon the subject matter) to affect a statistical outcome.
These issues may include:
Finally, the last step (before actually attempting to use the data) may be re-formatting. In this step, the data scientist determines the form the data must take in order to be processed most efficiently, based upon the intended use or objective.
For example, one might decide to:
There are a number of fairly routine methods for resolving the aforementioned data errors in R. For example, R provides the is functions (is.numeric, is.character, and so on) to test for an object's data type, and the as functions (as.numeric, as.character, and so on) for explicit conversion. Dates deserve particular attention and are handled with the as.Date
function. Typically, date values are important to a statistical model, so it is worth taking the time to understand the format of a model's date fields and to ensure they are handled properly. Most often, dates and times appear in raw data as strings, which can be converted and formatted as required; string fields containing a saledate and a returndate, for example, can be converted to date type values and compared with a common time function, difftime.

Recoding values is another routine tidying task. In the following code, numeric codes for education level are converted to a labeled factor (note that names(recode) supplies the labels):

> participant <- c(1, 2, 3, 4, 5, 6, 7, 8)
> recode <- c(Doctoral=1, Masters=2, Bachelors=3, Associates=4,
+             Nondegree=5, SomeCollege=6, HighSchool=7, None=8)
> (participant <- factor(participant, levels=recode, labels=names(recode)))
[1] Doctoral Masters Bachelors Associates Nondegree SomeCollege HighSchool None
Levels: Doctoral Masters Bachelors Associates Nondegree SomeCollege HighSchool None
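The original listing for the saledate/returndate conversion is not reproduced here; a minimal sketch of the idea, using hypothetical date strings, might look like this:

```r
# Hypothetical sale and return dates held as strings in the raw data
saledate   <- "2014-01-05"
returndate <- "2014-01-19"

# Convert the strings to Date values
saledate   <- as.Date(saledate, format = "%Y-%m-%d")
returndate <- as.Date(returndate, format = "%Y-%m-%d")

# Elapsed time between sale and return
difftime(returndate, saledate, units = "days")   # Time difference of 14 days
```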
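The is and as functions mentioned earlier can likewise be illustrated with a small example (the values here are arbitrary):

```r
x <- "3.14"          # a number stored as a character string
is.character(x)      # TRUE  -- x is currently character data
is.numeric(x)        # FALSE

y <- as.numeric(x)   # explicit conversion to numeric
is.numeric(y)        # TRUE

as.numeric("abc")    # values that cannot be converted become NA (with a warning)
```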
The functions var, cov, and cor compute the variance, covariance, and correlation of variables. var offers the option to set na.rm to TRUE, which tells R to exclude any records or cases with missing values. cov and cor handle missing values through their use argument instead; for example, use = "complete.obs" restricts the computation to complete cases.
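A short sketch of these options, using arbitrary vectors in which one value is missing:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 10)

var(x)                           # NA -- the missing value propagates
var(x, na.rm = TRUE)             # variance of the four observed values

# cov and cor control NA handling via 'use' rather than na.rm
cov(x, y, use = "complete.obs")
cor(x, y, use = "complete.obs")  # 1 -- y is an exact linear function of x
```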