Chapter 3. Exploring data

This chapter covers

  • Using summary statistics to explore data
  • Exploring data using visualization
  • Finding problems and issues during data exploration

In the last two chapters, you learned how to set the scope and goal of a data science project, and how to start working with your data in R. In this chapter, you’ll start to get your hands into the data. As shown in the mental model (figure 3.1), this chapter emphasizes the science of exploring the data, prior to the model-building step. Your goal is to have data that is as clean and useful as possible.

Example

Suppose your goal is to build a model to predict which of your customers don’t have health insurance. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and so on.

Figure 3.1. Chapter 3 mental model

You’ve put all your data into a single data frame called customer_data that you’ve input into R.[1] Now you’re ready to start building the model to identify the customers you’re interested in.

1

We have a copy of this synthetic dataset available for download from https://github.com/WinVector/PDSwR2/tree/master/Custdata, and once it's saved, you can load it into R with the command customer_data <- readRDS("custdata.RDS"). This dataset is derived from the census data that you saw in chapter 2. We have introduced a little noise to the age variable to reflect what is typically seen in real-world noisy datasets. We have also included some columns not necessarily relevant to our example scenario, but which exhibit some important data anomalies.

It’s tempting to dive right into the modeling step without looking very hard at the dataset first, especially when you have a lot of data. Resist the temptation. No dataset is perfect: you’ll be missing information about some of your customers, and you’ll have incorrect data about others. Some data fields will be dirty and inconsistent. If you don’t take the time to examine the data before you start to model, you may find yourself redoing your work repeatedly as you discover bad data fields or variables that need to be transformed before modeling. In the worst case, you’ll build a model that returns incorrect predictions—and you won’t be sure why.

Get to know your data before modeling

By addressing data issues early, you can save yourself some unnecessary work, and a lot of headaches!

You’d also like to get a sense of who your customers are. Are they young, middle-aged, or seniors? How affluent are they? Where do they live? Knowing the answers to these questions can help you build a better model, because you’ll have a more specific idea of what information most accurately predicts the probability of insurance coverage.

In this chapter, we’ll demonstrate some ways to get to know your data, and discuss some of the potential issues that you’re looking for as you explore. Data exploration uses a combination of summary statistics—means and medians, variances, and counts—and visualization, or graphs of the data. You can spot some problems just by using summary statistics; other problems are easier to find visually.

Organizing data for analysis

For most of this book, we’ll assume that the data you’re analyzing is in a single data frame. This is not how data is usually stored. In a database, for example, data is usually stored in normalized form to reduce redundancy: information about a single customer is spread across many small tables. In log data, data about a single customer can be spread across many log entries, or sessions. These formats make it easy to add (or, in the case of a database, modify) data, but are not optimal for analysis. You can often join all the data you need into a single table in the database using SQL, but in chapter 5, we’ll discuss commands like join that you can use within R to further consolidate data.
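For example, here is a minimal sketch of an SQL-style join done in R with base R’s merge(); the table and column names are purely illustrative, not part of the book’s dataset:

# Two hypothetical normalized tables keyed by customer ID (illustrative data).
demographics <- data.frame(custid = c("a", "b", "c"), age = c(32, 47, 61))
accounts     <- data.frame(custid = c("a", "b", "c"), income = c(40000, 62000, 28000))

# Consolidate them into a single analysis-ready data frame (an inner join on custid).
customer_joined <- merge(demographics, accounts, by = "custid")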

3.1. Using summary statistics to spot problems

In R, you’ll typically use the summary() command to take your first look at the data. The goal is to understand whether you have the kind of customer information that can potentially help you predict health insurance coverage, and whether the data is of good enough quality to be informative.[1]

1

If you haven’t already done so, we suggest you follow the steps in section A.1 of appendix A to install R, packages, tools, and the book examples.

Listing 3.1. The summary() command
setwd("PDSwR2/Custdata")                                                 1
customer_data = readRDS("custdata.RDS")
summary(customer_data)
##     custid              sex        is_employed       income           2
##  Length:73262       Female:37837   FALSE: 2351   Min.   :  -6900
##  Class :character   Male  :35425   TRUE :45137   1st Qu.:  10700
##  Mode  :character                  NA's :25774   Median :  26200
##                                                  Mean   :  41764
##                                                  3rd Qu.:  51700
##                                                  Max.   :1257000
##
##             marital_status  health_ins                                3
##  Divorced/Separated:10693   Mode :logical
##  Married           :38400   FALSE:7307
##  Never married     :19407   TRUE :65955
##  Widowed           : 4762
##
##
##
##                        housing_type   recent_move      num_vehicles   4
##  Homeowner free and clear    :16763   Mode :logical   Min.   :0.000
##  Homeowner with mortgage/loan:31387   FALSE:62418     1st Qu.:1.000
##  Occupied with no rent       : 1138   TRUE :9123      Median :2.000
##  Rented                      :22254   NA's :1721      Mean   :2.066
##  NA's                        : 1720                   3rd Qu.:3.000
##                                                       Max.   :6.000
##                                                       NA's   :1720
##       age               state_of_res     gas_usage                    5
##  Min.   :  0.00   California  : 8962   Min.   :  1.00
##  1st Qu.: 34.00   Texas       : 6026   1st Qu.:  3.00
##  Median : 48.00   Florida     : 4979   Median : 10.00
##  Mean   : 49.16   New York    : 4431   Mean   : 41.17
##  3rd Qu.: 62.00   Pennsylvania: 2997   3rd Qu.: 60.00
##  Max.   :120.00   Illinois    : 2925   Max.   :570.00
##                   (Other)     :42942   NA's   :1720

  • 1 Change this to your actual path to the directory where you unpacked PDSwR2
  • 2 The variable is_employed is missing for about a third of the data. The variable income has negative values, which are potentially invalid.
  • 3 About 90% of the customers have health insurance.
  • 4 The variables housing_type, recent_move, num_vehicles, and gas_usage are each missing 1720 or 1721 values.
  • 5 The average value of the variable age seems plausible, but the minimum and maximum values seem unlikely. The variable state_of_res is a categorical variable; summary() reports how many customers are in each state (for the first few states).

The summary() command on a data frame reports a variety of summary statistics on the numerical columns of the data frame, and count statistics on any categorical columns (if the categorical columns have already been read in as factors[1]).

1

Categorical variables are of class factor in R. They can be represented as strings (class character), and some analytical functions will automatically convert string variables to factor variables. To get a useful summary of a categorical variable, it needs to be a factor.
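For example, if a categorical column had been read in as character, you could convert it before summarizing. This is just a sketch; the custdata.RDS file provided with the book already stores its categorical columns as factors.

# Convert a character column to a factor so that summary() reports per-level counts.
customer_data$state_of_res <- as.factor(customer_data$state_of_res)
summary(customer_data$state_of_res)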

As you see from listing 3.1, the summary of the data helps you quickly spot potential problems, like missing data or unlikely values. You also get a rough idea of how categorical data is distributed. Let’s go into more detail about the typical problems that you can spot using the summary.

3.1.1. Typical problems revealed by data summaries

At this stage, you’re looking for several common issues:

  • Missing values
  • Invalid values and outliers
  • Data ranges that are too wide or too narrow
  • The units of the data

Let’s address each of these issues in detail.

Missing values

A few missing values may not really be a problem, but if a particular data field is largely unpopulated, it shouldn’t be used as an input without some repair (as we’ll discuss in section 4.1.2). In R, for example, many modeling algorithms will, by default, quietly drop rows with missing values. As you see in the following listing, all the missing values in the is_employed variable could cause R to quietly ignore more than a third of the data.

Listing 3.2. Will the variable is_employed be useful for modeling?
## is_employed                                            1
## FALSE: 2351
## TRUE :45137
## NA's :25774

##                       housing_type   recent_move       2
## Homeowner free and clear    :16763   Mode :logical
## Homeowner with mortgage/loan:31387   FALSE:62418
## Occupied with no rent       : 1138   TRUE :9123
## Rented                      :22254   NA's :1721
## NA's                        : 1720
##
##
##   num_vehicles     gas_usage
##  Min.   :0.000   Min.   :  1.00
##  1st Qu.:1.000   1st Qu.:  3.00
##  Median :2.000   Median : 10.00
##  Mean   :2.066   Mean   : 41.17
##  3rd Qu.:3.000   3rd Qu.: 60.00
##  Max.   :6.000   Max.   :570.00
##  NA's   :1720    NA's   :1720

  • 1 The variable is_employed is missing for more than a third of the data. Why? Is employment status unknown? Did the company start collecting employment data only recently? Does NA mean “not in the active workforce” (for example, students or stay-at-home parents)?
  • 2 The variables housing_type, recent_move, num_vehicles, and gas_usage are missing relatively few values—about 2% of the data. It’s probably safe to just drop the rows that are missing values, especially if the missing values are all in the same 1720 rows.

If a particular data field is largely unpopulated, it’s worth trying to determine why; sometimes the fact that a value is missing is informative in and of itself. For example, why is the is_employed variable missing so many values? There are many possible reasons, as we noted in listing 3.2.

Whatever the reason for missing data, you must decide on the most appropriate action. Do you include a variable with missing values in your model, or not? If you decide to include it, do you drop all the rows where this field is missing, or do you convert the missing values to 0 or to an additional category? We’ll discuss ways to treat missing data in chapter 4. In this example, you might decide to drop the data rows where you’re missing data about housing or vehicles, since there aren’t many of them. You probably don’t want to throw out the data where you’re missing employment information, since employment status is probably highly predictive of having health insurance; you might instead treat the NAs as a third employment category. You will also likely encounter missing values when the model is scored, so you should decide how to handle them during model training.
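For example, a quick way to tally the missing values, and to check whether the housing and vehicle fields tend to be missing in the same rows, is the following sketch (not from the book):

# Count the NAs in each column of the data frame.
colSums(is.na(customer_data))

# Do housing_type and num_vehicles tend to be missing together (the same rows)?
sum(is.na(customer_data$housing_type) & is.na(customer_data$num_vehicles))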

Invalid values and outliers

Even when a column or variable isn’t missing any values, you still want to check that the values that you do have make sense. Do you have any invalid values or outliers? Examples of invalid values include negative values in what should be a non-negative numeric data field (like age or income) or text where you expect numbers. Outliers are data points that fall well out of the range of where you expect the data to be. Can you spot the outliers and invalid values in the next listing?

Listing 3.3. Examples of invalid values and outliers
summary(customer_data$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -6900   10700   26200   41764   51700 1257000       1

summary(customer_data$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00   34.00   48.00   49.16   62.00  120.00       2

  • 1 Negative values for income could indicate bad data. They might also have a special meaning, like “amount of debt.” Either way, you should check how prevalent the issue is, and decide what to do. Do you drop the data with negative income? Do you convert negative values to zero?
  • 2 Customers of age zero, or customers of an age greater than about 110, are outliers. They fall out of the range of expected customer values. Outliers could be data input errors. They could be special sentinel values: zero might mean “age unknown” or “refuse to state.” And some of your customers might be especially long-lived.

Often, invalid values are simply bad data input. A negative number in a field like age, however, could be a sentinel value to designate “unknown.” Outliers might also be data errors or sentinel values. Or they might be valid but unusual data points—people do occasionally live past 100.

As with missing values, you must decide the most appropriate action: drop the data field, drop the data points where this field is bad, or convert the bad data to a useful value. For example, even if you feel certain outliers are valid data, you might still want to omit them from model construction, if the outliers interfere with the model-fitting process. Generally, the goal of modeling is to make good predictions on typical cases, and a model that is highly skewed to predict a rare case correctly may not always be the best model overall.
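Before deciding, you might count how widespread the suspicious values are; here is one possible check (a sketch, not from the book):

# How many customers have negative income, or an implausible age?
sum(customer_data$income < 0, na.rm = TRUE)
sum(customer_data$age == 0 | customer_data$age > 110, na.rm = TRUE)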

Data range

You also want to pay attention to how much the values in the data vary. If you believe that age or income helps to predict the probability of health insurance coverage, then you should make sure there is enough variation in the age and income of your customers for you to see the relationships. Let’s look at income again, in the next listing. Is the data range wide? Is it narrow?

Listing 3.4. Looking at the data range of a variable
summary(customer_data$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -6900   10700   26200   41764   51700 1257000       1

  • 1 Income ranges from zero to over a million dollars, a very wide range.

Even ignoring negative income, the income variable in listing 3.4 ranges from zero to over a million dollars. That’s pretty wide (though typical for income). Data that ranges over several orders of magnitude like this can be a problem for some modeling methods. We’ll talk about mitigating data range issues when we talk about logarithmic transformations in chapter 4.

Data can be too narrow, too. Suppose all your customers are between the ages of 50 and 55. It’s a good bet that age range wouldn’t be a very good predictor of the probability of health insurance coverage for that population, since it doesn’t vary much at all.

How narrow is “too narrow” for a data range?

Of course, the term narrow is relative. If we were predicting the ability to read for children between the ages of 5 and 10, then age probably is a useful variable as is. For data including adult ages, you may want to transform or bin ages in some way, as you don’t expect a significant change in reading ability between ages 40 and 50. You should rely on information about the problem domain to judge if the data range is narrow, but a rough rule of thumb relates to the ratio of the standard deviation to the mean. If that ratio is very small, then the data isn’t varying much.
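For example, a quick way to compute that ratio for age (a sketch, not from the book):

# Coefficient of variation: the standard deviation relative to the mean.
# A value near zero suggests the variable hardly varies.
sd(customer_data$age, na.rm = TRUE) / mean(customer_data$age, na.rm = TRUE)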

We’ll revisit data range in section 3.2, when we talk about examining data graphically.

One factor that determines apparent data range is the unit of measurement. To take a nontechnical example, we measure the ages of babies and toddlers in weeks or in months, because developmental changes happen at that time scale for very young children. Suppose we measured babies’ ages in years. It might appear numerically that there isn’t much difference between a one-year-old and a two-year-old. In reality, there’s a dramatic difference, as any parent can tell you! Units can present potential issues in a dataset for another reason, as well.

Units

Does the income data in listing 3.5 represent hourly wages, or yearly wages in units of $1000? As a matter of fact, it’s yearly wages in units of $1000, but what if it were hourly wages? You might not notice the error during the modeling stage, but down the line someone will start inputting hourly wage data into the model and get back bad predictions in return.

Listing 3.5. Checking units; mistakes can lead to spectacular errors
IncomeK = customer_data$income/1000
summary(IncomeK)                                        1
 ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -6.90   10.70   26.20   41.76   51.70 1257.00

  • 1 The variable IncomeK is defined as IncomeK = customer_data$income/1000. But suppose you didn’t know that. Looking only at the summary, the values could plausibly be interpreted to mean either “hourly wage” or “yearly income in units of $1000.”

Are time intervals measured in days, hours, minutes, or milliseconds? Are speeds in kilometers per second, miles per hour, or knots? Are monetary amounts in dollars, thousands of dollars, or 1/100 of a penny (a customary practice in finance, where calculations are often done in fixed-point arithmetic)? This is actually something that you’ll catch by checking data definitions in data dictionaries or documentation, rather than in the summary statistics; the difference between hourly wage data and annual salary in units of $1000 may not look that obvious at a casual glance. But it’s still something to keep in mind while looking over the value ranges of your variables, because often you can spot when measurements are in unexpected units. Automobile speeds in knots look a lot different than they do in miles per hour.

3.2. Spotting problems using graphics and visualization

As you’ve seen, you can spot plenty of problems just by looking over the data summaries. For other properties of the data, pictures are better than text.

We cannot expect a small number of numerical values [summary statistics] to consistently convey the wealth of information that exists in data. Numerical reduction methods do not retain the information in the data.

William Cleveland, The Elements of Graphing Data

Figure 3.2 shows a plot of how customer ages are distributed. We’ll talk about what the y-axis of the graph means later; for now, just know that the height of the graph corresponds to how many customers in the population are of that age. As you can see, information like the peak age of the distribution, the range of the data, and the presence of outliers is easier to absorb visually than it is to determine textually.

Figure 3.2. Some information is easier to read from a graph, and some from a summary.

The use of graphics to examine data is called visualization. We try to follow William Cleveland’s principles for scientific visualization. Details of specific plots aside, the key points of Cleveland’s philosophy are these:

  • A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
  • Strive for clarity. Make the data stand out. Specific tips for increasing clarity include these:

    • Avoid too many superimposed elements, such as too many curves in the same graphing space.
    • Find the right aspect ratio and scaling to properly bring out the details of the data.
    • Avoid having the data all skewed to one side or the other of your graph.
  • Visualization is an iterative process. Its purpose is to answer questions about the data.

During the visualization stage, you graph the data, learn what you can, and then regraph the data to answer the questions that arise from your previous graphic. Different graphics are best suited for answering different questions. We’ll look at some of them in this section.

In this book, we’ll demonstrate the visualizations and graphics using the R graphing package ggplot2 (the R realization of Leland Wilkinson’s Grammar of Graphics, Springer, 1999), as well as some prepackaged ggplot2 visualizations from the package WVPlots. You may also want to check out the ggpubr and ggstatsplot packages for more prepackaged ggplot2 graphs. And, of course, other R visualization packages, such as base graphics or the lattice package, can produce similar plots.

A note on ggplot2

The theme of this section is how to use visualization to explore your data, not how to use ggplot2. The ggplot2 package is based on Leland Wilkinson’s book, Grammar of Graphics. We chose ggplot2 because it excels at combining multiple graphical elements together, but its syntax can take some getting used to. Here are the key points to understand when looking at our code snippets:

  • Graphs in ggplot2 can only be defined on data frames. The variables in a graph—the x variable, the y variable, the variables that define the color or the size of the points—are called aesthetics, and are declared by using the aes function.
  • The ggplot() function declares the graph object. The arguments to ggplot() can include the data frame of interest and the aesthetics. The ggplot() function doesn’t itself produce a visualization; visualizations are produced by layers.
  • Layers produce the plots and plot transformations and are added to a given graph object using the + operator. Each layer can also take a data frame and aesthetics as arguments, in addition to plot-specific parameters. Examples of layers are geom_point (for a scatter plot) or geom_line (for a line plot).

This syntax will become clearer in the examples that follow. For more information, we recommend Hadley Wickham’s reference site https://ggplot2.tidyverse.org/reference/, which has pointers to online documentation; the Graphs section of Winston Chang’s site http://www.cookbook-r.com/; and Winston Chang’s R Graphics Cookbook (O’Reilly, 2012).

In the next two sections, we’ll show how to use pictures and graphs to identify data characteristics and issues. In section 3.2.2, we’ll look at visualizations for two variables. But let’s start by looking at visualizations for single variables.

3.2.1. Visually checking distributions for a single variable

In this section we will look at

  • Histograms
  • Density plots
  • Bar charts
  • Dot plots

The visualizations in this section help you answer questions like these:

  • What is the peak value of the distribution?
  • How many peaks are there in the distribution (unimodality versus bimodality)?
  • How normal (or lognormal) is the data? We’ll discuss normal and lognormal distributions in appendix B.
  • How much does the data vary? Is it concentrated in a certain interval or in a certain category?

One of the things that’s easy to grasp visually is the shape of the data distribution. The graph in figure 3.3 is somewhat flattish between the ages of about 25 and about 60, falling off slowly after 60. However, even within this range, there seems to be a peak around the late 20s to early 30s, and another in the early 50s. This data has multiple peaks: it is not unimodal.[1]

1

The strict definition of unimodal is that a distribution has a unique maximum value; in that sense, figure 3.3 is unimodal. However, most people use the term “unimodal” to mean that a distribution has a unique peak (local maximum); the customer age distribution has multiple peaks, and so we will call it multimodal.

Figure 3.3. The density plot of age

Unimodality is a property you want to check in your data. Why? Because (roughly speaking) a unimodal distribution corresponds to one population of subjects. For the solid curve in figure 3.4, the mean customer age is about 50, and 50% of the customers are between 34 and 64 (the first and third quartiles, shown shaded). So you can say that a “typical” customer is middle-aged and probably possesses many of the demographic qualities of a middle-aged person—though, of course, you have to verify that with your actual customer information.

Figure 3.4. A unimodal distribution (solid curve) can usually be modeled as coming from a single population of users. With a bimodal distribution (dashed curve), your data often comes from two populations of users.

The dashed curve in figure 3.4 shows what can happen when you have two peaks, or a bimodal distribution. (A distribution with more than two peaks is multimodal.) This set of customers has about the same mean age as the customers represented by the solid curve—but a 50-year-old is hardly a “typical” customer! This (admittedly exaggerated) example corresponds to two populations of customers: a fairly young population mostly in their teens to late twenties, and an older population mostly in their 70s. These two populations probably have very different behavior patterns, and if you want to model whether a customer probably has health insurance or not, it wouldn’t be a bad idea to model the two populations separately.

The histogram and the density plot are two visualizations that help you quickly examine the distribution of a numerical variable. Figures 3.2 and 3.3 are density plots. Whether you use histograms or density plots is largely a matter of taste. We tend to prefer density plots, but histograms are easier to explain to less quantitatively-minded audiences.

Histograms

A basic histogram bins a variable into fixed-width buckets and returns the number of data points that fall into each bucket as a height. For example, suppose you wanted a sense of how much your customers pay in monthly gas heating bills. You could group the gas bill amounts in intervals of $10: $0–10, $10–20, $20–30, and so on. Customers at a boundary go into the higher bucket: people who pay around $20 a month go into the $20–30 bucket. For each bucket, you then count how many customers are in that bucket. The resulting histogram is shown in figure 3.5.

Figure 3.5. A histogram tells you where your data is concentrated. It also visually highlights outliers and anomalies.

You create the histogram in figure 3.5 in ggplot2 with the geom_histogram layer.

Listing 3.6. Plotting a histogram
library(ggplot2)                              1
ggplot(customer_data, aes(x=gas_usage)) +
  geom_histogram(binwidth=10, fill="gray")    2

  • 1 Load the ggplot2 library, if you haven’t already done so.
  • 2 The binwidth parameter tells the geom_histogram call how to make bins of $10 intervals (default is datarange/30). The fill parameter specifies the color of the histogram bars (default: black).

With the proper binwidth, histograms visually highlight where the data is concentrated, and point out the presence of potential outliers and anomalies. In figure 3.5, for example, you see that some outlier customers have much larger gas bills than is typical, so you may possibly want to drop those customers from any analysis that uses gas heating bills as an input. You also see an unusually high concentration of people who pay $0–10/month in gas. This could mean that most of your customers don’t have gas heating, but on further investigation you notice this in the data dictionary (table 3.1).

Table 3.1. Data dictionary entry for gas_usage

Value     Definition
NA        Unknown or not applicable
001       Included in rent or condo fee
002       Included in electricity payment
003       No charge or gas not used
004-999   $4 to $999 (rounded and top-coded)

In other words, the values in the gas_usage column are a mixture of numerical values and symbolic codes encoded as numbers. The values 001, 002, and 003 are sentinel values, and to treat them as numerical values could potentially lead to incorrect conclusions in your analysis. One possible solution in this case is to convert the numeric values 1-3 into NA, and add additional Boolean variables to indicate the possible cases (included in rent/condo fee, and so on).
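Here is a minimal sketch of that treatment (the new column names are illustrative, not part of the book’s dataset); chapter 4 discusses this kind of repair in more detail:

# Record the sentinel codes as their own Boolean indicator variables...
customer_data$gas_with_rent        <- (customer_data$gas_usage == 1)
customer_data$gas_with_electricity <- (customer_data$gas_usage == 2)
customer_data$no_gas_bill          <- (customer_data$gas_usage == 3)

# ...then replace the sentinel codes with NA so gas_usage is purely numeric.
customer_data$gas_usage <- ifelse(customer_data$gas_usage < 4,
                                  NA, customer_data$gas_usage)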

The primary disadvantage of histograms is that you must decide ahead of time how wide the buckets are. If the buckets are too wide, you can lose information about the shape of the distribution. If the buckets are too narrow, the histogram can look too noisy to read easily. An alternative visualization is the density plot.

Density plots

You can think of a density plot as a continuous histogram of a variable, except the area under the density plot is rescaled to equal one. A point on a density plot corresponds to the fraction of data (or the percentage of data, divided by 100) that takes on a particular value. This fraction is usually very small. When you look at a density plot, you’re more interested in the overall shape of the curve than in the actual values on the y-axis. You’ve seen the density plot of age; figure 3.6 shows the density plot of income.

Figure 3.6. Density plots show where data is concentrated.

You produce figure 3.6 with the geom_density layer, as shown in the following listing.

Listing 3.7. Producing a density plot
library(scales)                                           1

ggplot(customer_data, aes(x=income)) + geom_density() +
  scale_x_continuous(labels=dollar)                       2

  • 1 The scales package brings in the dollar scale notation.
  • 2 Sets the x-axis labels to dollars

When the data range is very wide and the mass of the distribution is heavily concentrated to one side, like the distribution in figure 3.6, it’s difficult to see the details of its shape. For instance, it’s hard to tell the exact value where the income distribution has its peak. If the data is non-negative, then one way to bring out more detail is to plot the distribution on a logarithmic scale, as shown in figure 3.7. This is equivalent to plotting the density plot of log10(income).

Figure 3.7. The density plot of income on a log10 scale highlights details of the income distribution that are harder to see in a regular density plot.

In ggplot2, you can plot figure 3.7 with the geom_density and scale_x_log10 layers, such as in the following listing.

Listing 3.8. Creating a log-scaled density plot
ggplot(customer_data, aes(x=income)) +
  geom_density() +
  scale_x_log10(breaks = c(10, 100, 1000, 10000, 100000, 1000000),
                labels=dollar) +                                   1
  annotation_logticks(sides="bt", color="gray")                    2

  • 1 Sets the x-axis to be in log10 scale, with manually set tick points and labels as dollars
  • 2 Adds log-scaled tick marks to the top and bottom of the graph

When you issue the preceding command, you also get back a warning message:

## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 6856 rows containing non-finite values (stat_density).

This tells you that ggplot2 ignored the zero- and negative-valued rows (since log(0) is -Infinity and the logarithm of a negative number is undefined), and that there were 6856 such rows. Keep that in mind when evaluating the graph.
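If you want to confirm which rows were dropped, one quick check (a sketch, not from the book) is to count the rows with non-positive income; the count should match the number reported in the warning:

sum(customer_data$income <= 0)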

When should you use a logarithmic scale?

You should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units. You should also use a log scale to better visualize data that is heavily skewed.

For example, in income data, a difference in income of $5,000 means something very different in a population where the incomes tend to fall in the tens of thousands of dollars than it does in populations where income falls in the hundreds of thousands or millions of dollars. In other words, what constitutes a “significant difference” depends on the order of magnitude of the incomes you’re looking at. Similarly, in a population like that in figure 3.7, a few people with very high income will cause the majority of the data to be compressed into a relatively small area of the graph. For both those reasons, plotting the income distribution on a logarithmic scale is a good idea.

In log space, income is distributed as something that looks like a “normalish” distribution, as will be discussed in appendix B. It’s not exactly a normal distribution (in fact, it appears to be at least two normal distributions mixed together).

Bar charts and dotplots

A bar chart is a histogram for discrete data: it records the frequency of every value of a categorical variable. Figure 3.8 shows the distribution of marital status in your customer dataset. If you believe that marital status helps predict the probability of health insurance coverage, then you want to check that you have enough customers with different marital statuses to help you discover the relationship between being married (or not) and having health insurance.

Figure 3.8. Bar charts show the distribution of categorical variables.

The ggplot2 command to produce figure 3.8 uses geom_bar:

ggplot(customer_data, aes(x=marital_status)) + geom_bar(fill="gray")

This graph doesn’t really show any more information than summary(customer_data$marital_status) would show, but some people find the graph easier to absorb than the text. Bar charts are most useful when the number of possible values is fairly large, like state of residence. In this situation, we often find that a horizontal graph like that shown in figure 3.9 is more legible than a vertical graph.

Figure 3.9. A horizontal bar chart can be easier to read when there are several categories with long names.

The ggplot2 command to produce figure 3.9 is shown in the next listing.

Listing 3.9. Producing a horizontal bar chart
ggplot(customer_data, aes(x=state_of_res)) +
  geom_bar(fill="gray") +                        1
   coord_flip()                                  2

  • 1 Plots bar chart as before: state_of_res is on x-axis, count is on y-axis
  • 2 Flips the x and y axes: state_of_res is now on the y-axis

Cleveland[1] prefers the dot plot to the bar chart for visualizing discrete counts. This is because bars are two dimensional, so that a difference in counts looks like a difference in bar areas, rather than merely in bar heights. This can be perceptually misleading. Since the dot-and-line of a dot plot is not two dimensional, the viewer considers only the height difference when comparing two quantities, as they should.

1

See William S. Cleveland, The Elements of Graphing Data, Hobart Press, 1994.

Cleveland also recommends that the data in a bar chart or dot plot be sorted, to more efficiently extract insight from the data. This is shown in figure 3.10. Now it is easy to see in which states the most customers—or the fewest—live.

Figure 3.10. Using a dot plot and sorting by count makes the data even easier to read.

A sorted visualization requires a bit more manipulation, at least in ggplot2, because by default, ggplot2 will plot the categories of a factor variable in alphabetical order. Fortunately, much of the code is already wrapped in the ClevelandDotPlot function from the WVPlots package.

Listing 3.10. Producing a dot plot with sorted categories
library(WVPlots)                                    1
ClevelandDotPlot(customer_data, "state_of_res",     2
     sort = 1, title="Customers by state") +        3
 coord_flip()                                       4

  • 1 Loads the WVPlots library
  • 2 Plots the state_of_res column of the customer_data data frame
  • 3 “sort = 1” sorts the categories in increasing order (most frequent last).
  • 4 Flips the axes as before

Before we move on to visualizations for two variables, we’ll summarize the visualizations that we’ve discussed in this section in table 3.2.

Table 3.2. Visualizations for one variable

Graph type: Histogram or density plot
Uses: Examine the data range; check the number of modes; check whether the distribution is normal/lognormal; check for anomalies and outliers.
Examples: Examine the distribution of customer age to get the typical customer age range. Examine the distribution of customer income to get the typical income range.

Graph type: Bar chart or dot plot
Uses: Compare the frequencies of the values of a categorical variable.
Examples: Count the number of customers from different states of residence to determine which states have the largest or smallest customer base.

3.2.2. Visually checking relationships between two variables

In addition to examining variables in isolation, you’ll often want to look at the relationship between two variables. For example, you might want to answer questions like these:

  • Is there a relationship between the two inputs age and income in my data?
  • If so, what kind of relationship, and how strong?
  • Is there a relationship between the input marital status and the output health insurance? How strong?

You’ll precisely quantify these relationships during the modeling phase, but exploring them now gives you a feel for the data and helps you determine which variables are the best candidates to include in a model.

This section explores the following visualizations:

  • Line plots and scatter plots for comparing two continuous variables
  • Smoothing curves and hexbin plots for comparing two continuous variables at high volume
  • Different types of bar charts for comparing two discrete variables
  • Variations on histograms and density plots for comparing a continuous and discrete variable

First, let’s consider the relationship between two continuous variables. The first plot you might think of (though it’s not always the best) is the line plot.

Line plots

Line plots work best when the relationship between two variables is relatively clean: each x value has a unique (or nearly unique) y value, as in figure 3.11. You plot figure 3.11 with geom_line.

Figure 3.11. Example of a line plot

Listing 3.11. Producing a line plot
x <- runif(100)                                           1
y <- x^2 + 0.2*x                                          2
ggplot(data.frame(x=x,y=y), aes(x=x,y=y)) + geom_line()   3

  • 1 First, generate the data for this example. The x variable is uniformly randomly distributed between 0 and 1.
  • 2 The y variable is a quadratic function of x.
  • 3 Plots the line plot

When the data is not so cleanly related, line plots aren’t as useful; you’ll want to use the scatter plot instead, as you’ll see in the next section.

Scatter plots and smoothing curves

You’d expect there to be a relationship between age and health insurance, and also a relationship between income and health insurance. But what is the relationship between age and income? If they track each other perfectly, then you might not want to use both variables in a model for health insurance. The appropriate summary statistic is the correlation, which we compute on a safe subset of our data.

Listing 3.12. Examining the correlation between age and income
customer_data2 <- subset(customer_data,
                   0 < age & age < 100 &
                    0 < income & income < 200000)        1

cor(customer_data2$age, customer_data2$income)           2
 ## [1] 0.005766697                                      3

  • 1 Only consider a subset of data with reasonable age and income values.
  • 2 Gets correlation of age and income
  • 3 Resulting correlation is positive but nearly zero.

The correlation is positive, as you might expect, but nearly zero, meaning there is apparently not much relation between age and income. A visualization gives you more insight into what’s going on than a single number can. Let’s try a scatter plot first (figure 3.12). Because our dataset has over 64,000 rows, which is too many for a legible scatter plot, we will sample the dataset down before plotting. You plot figure 3.12 with geom_point, as shown in listing 3.13.

Figure 3.12. A scatter plot of income versus age

Listing 3.13. Creating a scatterplot of age and income
set.seed(245566)                                                   1

customer_data_samp <-
  dplyr::sample_frac(customer_data2, size=0.1, replace=FALSE)      2

ggplot(customer_data_samp, aes(x=age, y=income)) +                 3
  geom_point() +
  ggtitle("Income as a function of age")

  • 1 Make the random sampling reproducible by setting the random seed.
  • 2 For legibility, only plot a 10% sample of the data. We will show how to plot all the data in a following section.
  • 3 Creates the scatterplot

The relationship between age and income isn’t easy to see. You can try to make the relationship clearer by also plotting a smoothing curve through the data, as shown in figure 3.13.

Figure 3.13. A scatter plot of income versus age, with a smoothing curve

The smoothing curve makes it easier to see that in this population, income tends to increase with age from a person’s twenties until their mid-thirties, after which income increases at a slower, almost flat, rate until about a person’s mid-fifties. Past the mid-fifties, income tends to decrease with age.

In ggplot2, you can plot a smoothing curve to the data by using geom_smooth:

ggplot(customer_data_samp, aes(x=age, y=income)) +
  geom_point() + geom_smooth() +
  ggtitle("Income as a function of age")

For datasets with a small number of points, the geom_smooth function uses the loess (or lowess) function to calculate smoothed local linear fits of the data. For larger datasets, like this one, geom_smooth uses a spline fit.
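If you prefer to choose the smoother yourself rather than rely on geom_smooth()’s automatic selection, you can name the method explicitly. This is a sketch, not code from the book:

# Force a local-regression (loess) fit instead of the automatically chosen smoother.
ggplot(customer_data_samp, aes(x=age, y=income)) +
  geom_point() +
  geom_smooth(method = "loess") +
  ggtitle("Income as a function of age")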

By default, geom_smooth also plots a "standard error" ribbon around the smoothing curve. This ribbon is wider where there are fewer data points and narrower where the data is dense. It’s meant to indicate where the smoothing curve estimate is more uncertain. For the plot in figure 3.13, the scatterplot is so dense that the smoothing ribbon isn’t visible, except at the extreme right of the graph. Since the scatterplot already gives you the same information that the standard error ribbon does, you can turn it off with the argument se=FALSE, as we will see in a later example.

A scatter plot with a smoothing curve also makes a useful visualization of the relationship between a continuous variable and a Boolean. Suppose you’re considering using age as an input to your health insurance model. You might want to plot health insurance coverage as a function of age, as shown in figure 3.14.

Figure 3.14. Fraction of customers with health insurance, as a function of age

The variable health_ins has the value 1 (for TRUE) when the person has health insurance, and 0 (for FALSE) otherwise. A scatterplot of the data will have all the y-values at 0 or 1, which may not seem informative, but a smoothing curve of the data estimates the average value of the 0/1 variable health_ins as a function of age. The average value of health_ins for a given age is simply the probability that a person of that age in your dataset has health insurance.

Figure 3.14 shows you that the probability of having health insurance increases as customer age increases, from about 80% at age 20 to nearly 100% after about age 75.

Why keep the scatterplot?

You might ask, why bother to plot the points? Why not just plot the smoothing curve? After all, the data only takes on the values 0 and 1, so the scatterplot doesn’t seem informative.

This is a matter of taste, but we like to keep the scatterplot because it gives us a visual estimate of how much data there is in different ranges of the x variable. For example, if your data has only a dozen or so customers in the 70–100 age range, then you know that estimates of the probability of health insurance in that age range may not be very good. Conversely, if you have hundreds of customers spread over that age range, then you can have more confidence in the estimate.

The standard error ribbon that geom_smooth plots around the smoothing curve gives equivalent information, but we find the scatterplot more helpful.

An easy way to plot figure 3.14 is with the BinaryYScatterPlot function from WVPlots:

BinaryYScatterPlot(customer_data_samp, "age", "health_ins",
                   title = "Probability of health insurance by age")

By default, BinaryYScatterPlot fits a logistic regression curve through the data. You will learn more about logistic regression in chapter 8, but for now just know that a logistic regression tries to estimate the probability that the Boolean outcome y is true, as a function of the data x.
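If you prefer to stay in plain ggplot2, a rough equivalent (a sketch, not the WVPlots implementation) is to ask geom_smooth for a logistic regression fit:

# Plot the 0/1 outcome and overlay a logistic-regression smoothing curve.
ggplot(customer_data_samp, aes(x = age, y = as.numeric(health_ins))) +
  geom_point(position = position_jitter(height = 0.02)) +   # jitter so the 0/1 points don't overplot
  geom_smooth(method = "glm", method.args = list(family = binomial), se = FALSE) +
  ggtitle("Probability of health insurance by age")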

If you tried to plot all the points from the customer_data2 dataset, the scatter plot would turn into an illegible smear. To plot all the data in higher volume situations like this, try an aggregated plot, like a hexbin plot.

Hexbin plots

A hexbin plot is like a two-dimensional histogram. The data is divided into bins, and the number of data points in each bin is represented by color or shading. Let’s go back to the income versus age example. Figure 3.15 shows a hexbin plot of the data. Note how the smoothing curve traces out the shape formed by the densest region of data.

Figure 3.15. Hexbin plot of income versus age, with a smoothing curve superimposed

To make a hexbin plot in R, you must have the hexbin package installed. We’ll discuss how to install R packages in appendix A. Once hexbin is installed and the library loaded, you create the plots using the geom_hex layer, or use the convenience function HexBinPlot from WVPlots, as we do here. HexBinPlot predefines a color scale where denser cells are colored darker; the default ggplot2 color scale colors denser cells lighter.

Listing 3.14. Producing a hexbin plot
library(WVPlots)                                                            1

HexBinPlot(customer_data2, "age", "income", "Income as a function of age") +  2
  geom_smooth(color="black", se=FALSE)                                        3

  • 1 Loads the WVPlots library
  • 2 Plots the hexbin of income as a function of age
  • 3 Adds the smoothing line in black; suppresses standard error ribbon (se=FALSE)

In this section and the previous section, we’ve looked at plots where at least one of the variables is numerical. But in our health insurance example, the output is categorical, and so are many of the input variables. Next we’ll look at ways to visualize the relationship between two categorical variables.

Bar charts for two categorical variables

Let’s examine the relationship between marital status and the probability of health insurance coverage. The most straightforward way to visualize this is with a stacked bar chart, as shown in figure 3.16.

Figure 3.16. Health insurance versus marital status: stacked bar chart

The stacked bar chart makes it easy to compare the total number of people in each marital category, and to compare the number of uninsured people in each marital category. However, you can’t directly compare the number of insured people in each category, because the bars don’t all start at the same level. So some people prefer the side-by-side bar chart, shown in figure 3.17, which makes it easier to compare the number of both insured and uninsured across categories—but not the total number of people in each category.

Figure 3.17. Health insurance versus marital status: side-by-side bar chart

If you want to compare the number of insured and uninsured people across categories, while keeping a sense of the total number of people in each category, one plot to try is what we call a shadow plot. A shadow plot of this data creates two graphs, one for the insured population and one for the uninsured population. Both graphs are superimposed against a “shadow graph” of the total population. This allows comparison both across and within marital status categories, while maintaining information about category totals. This is shown in figure 3.18.

Figure 3.18. Health insurance versus marital status: shadow plot

The main shortcoming of all the preceding charts is that you can’t easily compare the ratios of insured to uninsured across categories, especially for rare categories like Widowed. You can use what ggplot2 calls a filled bar chart to plot a visualization of the ratios directly, as in figure 3.19.

Figure 3.19. Health insurance versus marital status: filled bar chart

The filled bar chart makes it obvious that divorced customers are slightly more likely to be uninsured than married ones. But you’ve lost the information that being widowed, though highly predictive of insurance coverage, is a rare category.

Which bar chart you use depends on what information is most important for you to convey. The code to generate each of these plots is given next. Note the use of the fill aesthetic in the ggplot2 commands; this tells ggplot2 to color (fill) the bars according to the value of the variable health_ins. The position argument to geom_bar specifies the bar chart style.

Listing 3.15. Specifying different styles of bar chart
ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
                        geom_bar()                                           1

ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
                     geom_bar(position = "dodge")                            2

ShadowPlot(customer_data, "marital_status", "health_ins",
                         title = "Health insurance status by marital status")3
ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
                     geom_bar(position = "fill")                             4

  • 1 Stacked bar chart, the default
  • 2 Side-by-side bar chart
  • 3 Uses the ShadowPlot command from the WVPlots package for the shadow plot
  • 4 Filled bar chart

In the preceding examples, one of the variables was binary; the same plots can be applied to two variables that each have several categories, but the results are harder to read. Suppose you’re interested in the distribution of marriage status across housing types. Some find the side-by-side bar chart easiest to read in this situation, but it’s not perfect, as you see in figure 3.20.

Figure 3.20. Distribution of marital status by housing type: side-by-side bar chart

A graph like figure 3.20 gets cluttered if either of the variables has a large number of categories. A better alternative is to break the distributions into different graphs, one for each housing type. In ggplot2 this is called faceting the graph, and you use the facet_wrap layer. The result is shown in figure 3.21.

Figure 3.21. Distribution of marital status by housing type: faceted side-by-side bar chart

The code for figures 3.20 and 3.21 looks like the next listing.

Listing 3.16. Plotting a bar chart with and without facets
cdata <- subset(customer_data, !is.na(housing_type))        1

ggplot(cdata, aes(x=housing_type, fill=marital_status)) +   2
   geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip()                                              3

ggplot(cdata, aes(x=marital_status)) +                      4
  geom_bar(fill="darkgray") +
  facet_wrap(~housing_type, scales="free_x") +              5
  coord_flip()                                              6

  • 1 Restricts to the data where housing_type is known
  • 2 Side-by-side bar chart
  • 3 Uses coord_flip() to rotate the graph so that marital_status is legible
  • 4 The faceted bar chart
  • 5 Facets the graph by housing_type. The scales="free_x" argument specifies that each facet has an independently scaled x-axis; the default is that all facets have the same scales on both axes. The argument "free_y" would free the y-axis scaling, and the argument "free" frees both axes.
  • 6 Uses coord_flip() to rotate the graph

Comparing a continuous and categorical variable

Suppose you want to compare the age distributions of people of different marital statuses in your data. You saw in section 3.2.1 that you can use histograms or density plots to look at the distribution of continuous variables like age. Now you want multiple distribution plots: one for each category of marital status. The most straightforward way to do this is to superimpose these plots in the same graph.

Figure 3.22 compares the age distributions of the widowed (dashed line) and never married (solid line) populations in the data. You can quickly see that the two populations are distributed quite differently: the widowed population skews older, and the never married population skews younger.

Figure 3.22. Comparing the age distributions of the widowed and never married populations

The code to produce figure 3.22 is as follows.

Listing 3.17. Comparing population densities across categories
customer_data3 = subset(customer_data2, marital_status %in%
   c("Never married", "Widowed"))                             1
ggplot(customer_data3, aes(x=age, color=marital_status,       2
   linetype=marital_status)) +
   geom_density() + scale_color_brewer(palette="Dark2")

  • 1 Restricts to the data for widowed or never married people
  • 2 Differentiates the color and line style of the plots by marital_status

Overlaid density plots give you good information about distribution shape: where populations are dense and where they are sparse, whether the populations are separated or overlap. However, they lose information about the relative size of each population. This is because each individual density plot is scaled to have unit area. This has the advantage of improving the legibility of each individual distribution, but can fool you into thinking that all the populations are about the same size. In fact, the superimposed density plots in figure 3.22 can also fool you into thinking that the widowed population becomes greater than the never married population after age 55, which is actually not true.

To retain information about the relative size of each population, use histograms. Histograms don’t superimpose well, so you can use the facet_wrap() command with geom_histogram(), as you saw with bar charts in listing 3.16. You can also produce a histogram version of the shadow plot, using the ShadowHist() function from WVPlots, as shown next.

Listing 3.18. Comparing population densities across categories with ShadowHist()
ShadowHist(customer_data3, "age", "marital_status",
 "Age distribution for never married vs. widowed populations", binwidth=5) 1

  • 1 Sets the bin widths of the histogram to 5

The result is shown in figure 3.23. Now you can see that the widowed population is quite small, and doesn’t exceed the never married population until after about age 65—10 years later than the crossover point in figure 3.22.

Figure 3.23. ShadowHist comparison of the age distributions of widowed and never married populations

You should also use faceting when comparing distributions across more than two categories, because too many overlaid plots are hard to read. Try examining the age distributions for all four categories of marital status; the plot is shown in figure 3.24.

Figure 3.24. Faceted plot of the age distributions of different marital statuses

ggplot(customer_data2, aes(x=age)) +
  geom_density() + facet_wrap(~marital_status)

Again, these density plots give you good information about distribution shape, but they lose information about the relative size of each population.
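If you want the faceted view while preserving the relative population sizes, here is a minimal sketch of the faceted-histogram alternative mentioned earlier (the bin width is chosen only for illustration):

# Faceted histograms keep the relative sizes of the marital-status groups visible.
ggplot(customer_data2, aes(x=age)) +
  geom_histogram(binwidth=5, fill="gray") +
  facet_wrap(~marital_status)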

Overview of visualizations for two variables

Table 3.3 summarizes the visualizations for two variables that we’ve covered.

Table 3.3. Visualizations for two variables

Graph type: Line plot
Uses: Shows the relationship between two continuous variables. Best when that relationship is functional, or nearly so.
Examples: Plot y = f(x).

Graph type: Scatter plot
Uses: Shows the relationship between two continuous variables. Best when the relationship is too loose or cloud-like to be easily seen on a line plot.
Examples: Plot income vs. years in the workforce (income on the y-axis).

Graph type: Smoothing curve
Uses: Shows the underlying “average” relationship, or trend, between two continuous variables. Can also be used to show the relationship between a continuous and a binary or Boolean variable: the fraction of true values of the discrete variable as a function of the continuous variable.
Examples: Estimate the “average” relationship of income to years in the workforce.

Graph type: Hexbin plot
Uses: Shows the relationship between two continuous variables when the data is very dense.
Examples: Plot income vs. years in the workforce for a large population.

Graph type: Stacked bar chart
Uses: Shows the relationship between two categorical variables (var1 and var2). Highlights the frequencies of each value of var1. Works best when var2 is binary.
Examples: Plot insurance coverage (var2) as a function of marital status (var1) when you wish to retain information about the number of people in each marital category.

Graph type: Side-by-side bar chart
Uses: Shows the relationship between two categorical variables (var1 and var2). Good for comparing the frequencies of each value of var2 across the values of var1. Works best when var2 is binary.
Examples: Plot insurance coverage (var2) as a function of marital status (var1) when you wish to directly compare the number of insured and uninsured people in each marital category.

Graph type: Shadow plot
Uses: Shows the relationship between two categorical variables (var1 and var2). Displays the frequency of each value of var1, while allowing comparison of var2 values both within and across the categories of var1.
Examples: Plot insurance coverage (var2) as a function of marital status (var1) when you wish to directly compare the number of insured and uninsured people in each marital category and still retain information about the total number of people in each marital category.

Graph type: Filled bar chart
Uses: Shows the relationship between two categorical variables (var1 and var2). Good for comparing the relative frequencies of each value of var2 within each value of var1. Works best when var2 is binary.
Examples: Plot insurance coverage (var2) as a function of marital status (var1) when you wish to compare the ratio of uninsured to insured people in each marital category.

Graph type: Bar chart with faceting
Uses: Shows the relationship between two categorical variables (var1 and var2). Best for comparing the relative frequencies of each value of var2 within each value of var1 when var2 takes on more than two values.
Examples: Plot the distribution of marital status (var2) as a function of housing type (var1).

Graph type: Overlaid density plot
Uses: Compares the distribution of a continuous variable over different values of a categorical variable. Best when the categorical variable has only two or three categories. Shows whether the continuous variable is distributed differently or similarly across the categories.
Examples: Compare the age distribution of married vs. divorced populations.

Graph type: Faceted density plot
Uses: Compares the distribution of a continuous variable over different values of a categorical variable. Suitable for categorical variables with more than three or so categories. Shows whether the continuous variable is distributed differently or similarly across the categories.
Examples: Compare the age distribution of several marital statuses (never married, married, divorced, widowed).

Graph type: Faceted histogram or shadow histogram
Uses: Compares the distribution of a continuous variable over different values of a categorical variable while retaining information about the relative population sizes.
Examples: Compare the age distribution of several marital statuses (never married, married, divorced, widowed), while retaining information about relative population sizes.

There are many other variations and visualizations you could use to explore the data; the preceding set covers some of the most useful and basic graphs. You should try different kinds of graphs to get different insights from the data. It’s an interactive process. One graph will raise questions that you can try to answer by replotting the data again, with a different visualization.

Eventually, you’ll explore your data enough to get a sense of it and to spot most major problems and issues. In the next chapter, we’ll discuss some ways to address common problems that you may discover in the data.

Summary

At this point, you’ve gotten a feel for your data. You’ve explored it through summaries and visualizations; you now have a sense of the quality of your data, and of the relationships among your variables. You’ve caught and are ready to correct several kinds of data issues—although you’ll likely run into more issues as you progress.

Maybe some of the things you’ve discovered have led you to reevaluate the question you’re trying to answer, or to modify your goals. Maybe you’ve decided that you need more or different types of data to achieve your goals. This is all good. As we mentioned in the previous chapter, the data science process is made of loops within loops. The data exploration and data cleaning stages (we’ll discuss cleaning in the next chapter) are two of the more time-consuming—and also the most important—stages of the process. Without good data, you can’t build good models. Time you spend here is time you don’t waste elsewhere.

In the next chapter, we’ll talk about fixing the issues that you’ve discovered in the data.

In this chapter you have learned

  • Take the time to examine and understand your data before diving into the modeling.
  • The summary command helps you spot issues with data range, units, data type, and missing or invalid values.
  • The various visualization techniques have different benefits and applications.
  • Visualization is an iterative process and helps answer questions about the data. Information you learn from one visualization may lead to more questions that you might try to answer with another visualization. If one visualization doesn’t work, try another. Time spent here is time not wasted during the modeling process.