CHAPTER 5
Explore the Data

If you tell a data scientist to go on a fishing expedition … then you deserve what you get, which is a bad analysis.1

Thomas C. Redman, “the Data Doc” and Harvard Business Review contributor

Data projects are never as simple as they appear in a boardroom presentation. Stakeholders typically see a polished PowerPoint deck that follows a rigid script from question to data to answer. What's lost in that story, however, are all the ideas that didn't make the cut: the important decisions and assumptions the data team made along the way to arrive at their answer. A good data team does not follow a linear path but a meandering one, adapting to discoveries in the data. As they get further along in their journey, they circle back to earlier ideas, only to see multiple paths open as a result.

This process of iteration, discovery, and data scrutiny is known as exploratory data analysis (EDA). It was formulated by statistician John Tukey in the 1970s as a way to make sense of data with summary statistics and visualizations before applying more complex methods.2 Tukey saw EDA as detective work. Clues are hidden in data, and the right exploration would reveal next steps to follow. Indeed, EDA is another way to “argue” with your data. It's a fundamental part of all data work that both sets and changes the direction of a project based on what's uncovered.

EXPLORATORY DATA ANALYSIS AND YOU

Exploratory data analysis can be an uncomfortable thought for some: it exposes the subjective nature (the art?) behind all data work. Two teams, given the same problem and data, might take separate analysis paths, possibly landing on the same conclusion. Possibly not. There are just too many decisions to make along the way for any two teams (or individuals) to do everything the same. Each person will bring their own background, ideas, and tools to make recommendations on how best to solve the problem.

Therefore, in this chapter, we present EDA as an ongoing process that is every Data Head's responsibility, whether you're the hands-on data worker or the business leader in the boardroom. You'll learn questions to ask and things to watch out for as you explore data.

EMBRACING THE EXPLORATORY MINDSET

Dozens of tools and programming languages can help data teams quickly and inexpensively explore their data with summary statistics and visualizations. But EDA should not be thought of as a list of tools or a checklist. It's more of a mentality woven into each phase of data work, one you can take part in even without an analytics background.

Questions to Guide You

To help you embrace the exploratory mindset, we'll walk you through a quick scenario against the backdrop of a popular dataset compiled for educational purposes: the Ames Housing Data.4 This is a glimpse into an EDA process.

Although there's no one right path to follow, there are questions you can ask to help guide the team to a meaningful conclusion:

  • Can the data answer the question?
  • Did you discover any relationships?
  • Did you find new opportunities in the data?

Let's set up the scenario and then go through each of the three questions, discuss why they're worth asking, and share challenges you might face.

The Setup

You work for a real estate start-up company and need to drive traffic to your site. But it's hard to pull visitors from real estate tech giants like U.S.-based Zillow.com. Its famous home-value estimation tool, the Zestimate®, brings people (and profits) to Zillow's brand.5 To compete, your company needs its own prediction tool. So, you're tasked to build a model that takes a home's information as inputs and produces an estimated sales price as an output.

The boss sends you a dataset to get started. It has 80 columns describing many aspects of hundreds of residential homes sold in Ames, Iowa, from 2006 to 2010.

Receiving this much data can be overwhelming. However, using the questions outlined previously can help you narrow down how to begin working with the data.

Let's go through them.

CAN THE DATA ANSWER THE QUESTION?

As tempting as it may be to feed your data into the algorithm trend of the moment (e.g., deep learning, covered in Chapter 12), you first need to ask: “Can the data answer the question?” And the answer is often found simply by looking at the data.

Set Expectations and Use Common Sense

You should have a pretty good idea about what information is needed to make an estimate for a home's sales price: size, number of bedrooms, number of bathrooms, the year it was built, etc. These are the popular features home buyers search for on your website. Without them, predicting sales price wouldn't seem reasonable.

You can spot the column names and data types when you open the file. The commonsense features you expect are present, as well as helpful ordinal data (Overall quality of the home, 1–10, 10 being “Very Excellent”), nominal data (Neighborhood), and a host of other features. So far, the data passes an initial sniff test.
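If you're following along at a keyboard, this first sniff test takes only a few lines. Here's a minimal sketch in Python with pandas; it assumes the data is saved locally as ames_housing.csv (our hypothetical filename) and uses column names from the Kaggle copy of the dataset cited in the notes:

    import pandas as pd

    # Load the housing data (hypothetical local filename).
    df = pd.read_csv("ames_housing.csv")

    # How big is it, and what types of data do the columns hold?
    print(df.shape)    # (rows, columns); expect roughly 80 columns
    print(df.dtypes)   # numeric vs. text columns at a glance

    # Are the commonsense features present?
    print(df[["GrLivArea", "BedroomAbvGr", "FullBath", "YearBuilt"]].head())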

Next, you'd likely examine the values the variables take on. Do they cover the scenarios you want to analyze? For example, if you discover the variable “Building Type: Type of Dwelling” only includes single family homes but no apartments, duplexes, or condos, then your model will have limited scope compared to Zillow's. The Zestimate® can predict the sales price of a condo, but if you don't have historical condo data, your company can't reliably predict their sales price.
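Checking that coverage is one line in the same hedged sketch (“BldgType” is the Kaggle column name for Building Type):

    import pandas as pd

    df = pd.read_csv("ames_housing.csv")  # hypothetical local filename

    # Which dwelling types does the data actually cover?
    print(df["BldgType"].value_counts())
    # If nearly every row is single-family, the model's scope will be
    # narrow: no condo history means no reliable condo estimates.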

The lesson: Avoid the fishing expeditions you were warned about in the quote to start this chapter. Make sure the data makes sense for what it's being asked to do.

Do the Values Make Intuitive Sense?

Software will generate a slew of summary statistics for you. Your job is to put the data into context. Check if the summary statistics match your intuitive understanding of the problem. Visualizations are also a key component of EDA—use them to spot anomalies and other weirdness in the data.
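For instance, pandas generates that slew of statistics in one call, and a histogram provides a quick visual check (same assumptions as the earlier sketches):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("ames_housing.csv")  # hypothetical local filename

    # Count, mean, quartiles, min, and max for key numeric columns.
    print(df[["SalePrice", "GrLivArea", "YearBuilt"]].describe())

    # Does the distribution of sale prices match your intuition?
    df["SalePrice"].hist(bins=50)
    plt.xlabel("Sale price ($)")
    plt.show()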

Watch Out: Outliers and Missing Values

Every dataset will have anomalies, outliers, and missing values. How you deal with these matters.

For instance, the box plots in Figure 5.2 used a rule of thumb to flag several data points as potential outliers. But just because a data graphic classifies certain points as “outliers,” don't turn off critical thinking and delete them on the assumption that they can't be useful. You'll never catch Zillow removing useful information from its datasets simply because a visualization flagged it as an outlier. Use the context of the data: homes that cost much more than the bulk of other homes are a known and common feature of real estate data. Recall the lessons from the previous chapter. You should at least have a good business justification to remove outliers. Do you have one here?
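For the curious, here's a sketch of the 1.5 × IQR rule of thumb that box plots typically use, under the same assumptions as before. Note that it flags points for inspection, not deletion:

    import pandas as pd

    df = pd.read_csv("ames_housing.csv")  # hypothetical local filename

    # The 1.5 x IQR rule of thumb behind box-plot "outlier" dots.
    q1, q3 = df["SalePrice"].quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    flagged = df[df["SalePrice"] > upper_fence]

    # Flagging is not deleting: inspect these homes before deciding.
    print(f"{len(flagged)} homes flagged above ${upper_fence:,.0f}")
    print(flagged[["SalePrice", "GrLivArea", "Neighborhood"]].head())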

And what about missing values? Does a missing value in “Basement Size” mean the house has a basement and the area is unknown? Or does it mean there is no basement and the value should be 0?
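A few lines will reveal where the gaps are, but the code can't answer that question; someone with domain knowledge has to. (Again we assume the Kaggle column names, where “TotalBsmtSF” holds basement size.)

    import pandas as pd

    df = pd.read_csv("ames_housing.csv")  # hypothetical local filename

    # Which columns have missing values, and how many?
    print(df.isna().sum().sort_values(ascending=False).head(10))

    # Filling with 0 encodes the "no basement" interpretation; leaving
    # the value missing encodes "basement size unknown." That's a
    # business decision, not a coding one.
    # df["TotalBsmtSF"] = df["TotalBsmtSF"].fillna(0)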

If we are diving into the weeds a little, that's our intent. Data workers make hundreds of these tiny decisions during projects. The cumulative effect can be substantial. Left to their own devices—and without the guidance of domain expertise—data workers may continue chipping away at the data, removing difficult and nuanced cases, until the data is too detached from the reality it's trying to capture to be useful. This is why it's important for everyone, including managers, to really understand what their data teams are doing.

DID YOU DISCOVER ANY RELATIONSHIPS?

Fortunately, a first pass of the housing data with summary statistics and visualizations seems encouraging: the data can plausibly support a predictive model for sales price. So you press on to the next question: “Did you discover any relationships?”

Visualizing the data has given you a head start: higher overall quality and larger square footage are unsurprisingly related to higher sales prices. This is the feedback you want from data. The relationships make sense and the variables you've plotted will help you build a model to predict sales price. What other variables share a relationship with sales price?

At this point, summary statistics can steer you toward interesting patterns and relationships in the data, because generating every possible scatter plot may not be practical. The relationship a scatter plot shows can instead be distilled into a single summary statistic, correlation, which is suggestive (but not proof) of a relationship between two numeric variables.


FIGURE 5.6 Square footage and sales price have a correlation of 0.62, which measures the tightness of the data points around the solid linear trend line.

Understanding Correlation

Correlation is a measure of how two variables are related. The most common type of correlation used in business is the Pearson correlation coefficient, a statistic between −1 and 1 that measures the linear relationship (think simple straight lines) between pairs of numbers shown on a scatter plot. Correlation can be positive, meaning an increase in one variable is associated with an increase in the other: larger homes sell for more money. Or, correlation can be negative: heavier cars get worse gas mileage. For a visual reference, the correlation of home size and sales price, shown in Figure 5.6, is 0.62. The “tighter” the points around a linear trend, the higher the correlation.7

Correlation can help here in two ways. First, finding variables correlated with sales price would help predict it. Second, correlation can help reduce redundancies in your data because two highly correlated variables contain roughly the same information. Imagine two columns in your data: the area of the home in square feet and the area in square meters. These two are perfectly correlated; only one is needed in an analysis.
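Both uses take a line or two in our running sketch (column names again follow the Kaggle copy of the dataset; “TotRmsAbvGrd” counts the rooms above ground):

    import pandas as pd

    df = pd.read_csv("ames_housing.csv")  # hypothetical local filename

    # Use 1: Pearson correlation of every numeric column with sale
    # price, strongest candidates for the model first.
    corr = df.corr(numeric_only=True)["SalePrice"]
    print(corr.sort_values(ascending=False).head(10))

    # Use 2: spot redundancy. Two highly correlated columns carry
    # roughly the same information, so one may be enough.
    print(df["GrLivArea"].corr(df["TotRmsAbvGrd"]))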

While most of us have a basic grasp of correlation and report the metric often, it can deceive. Let's review how.


FIGURE 5.7 Two datasets with a correlation of 0.8

Watch Out: Misinterpreting Correlation

People often forget that correlation is a measure of linear trend, and not all trends are linear.

Suppose, for instance, you're analyzing two neighborhoods in the housing dataset, each with 11 homes. Crunching some statistics reveals that the number of trees on a property is highly correlated with sales price within these neighborhoods. The correlation is a strong 0.8: properties with more trees tend to sell for more money.

But a visual check of the data exposes something unexpected. In Figure 5.7, the data for the neighborhood on the left shows what we'd typically expect to see with a high correlation: a linear trend with data points scattered about. But the plot on the right shows the number of trees is associated with an increase in sales prices only up to a point (11 trees). After that, it trends downward. In the Hilltop community, some properties might have too many trees crowding their lawn.

Full disclosure: the data shown in Figure 5.7 did not come from the Ames dataset we've been exploring but from a popular dataset called Anscombe's Quartet,8 four datasets with identical summary statistics but clearly different visualizations. (We are showing just two and adjusted the data to reflect the real estate theme.)

The lesson: Use visualizations to verify noteworthy correlations in your data because the linear trend that correlation can identify may not tell the full story.
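You can reproduce the trap yourself with Anscombe's original values (Figure 5.7 scales the y-values by 22,000 to fit the real estate theme):

    import numpy as np
    import matplotlib.pyplot as plt

    # The first two of Anscombe's four datasets (raw values).
    x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
    y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                   7.24, 4.26, 10.84, 4.82, 5.68])
    y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74])

    # Nearly identical correlations (about 0.82 for both)...
    print(np.corrcoef(x, y1)[0, 1])
    print(np.corrcoef(x, y2)[0, 1])

    # ...but only a plot shows the second dataset is a curve that
    # peaks at x = 11 and then turns downward.
    fig, axes = plt.subplots(1, 2, sharey=True)
    axes[0].scatter(x, y1)
    axes[1].scatter(x, y2)
    plt.show()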

Watch Out: Correlation Does Not Imply Causation

Chances are, you've heard the phrase “Correlation does not imply causation” before.10 But it's worth reiterating because of how much it's ignored and even misunderstood.

When two variables are correlated, even strongly correlated, it does not mean that one is causing the other. Yet people fall into this trap far too often, seeking to build a narrative whenever two variables move together. Statisticians use silly examples to show that correlation does not prove causation: ice cream sales are correlated with shark attacks (both spike in the summer months), and shoe size is correlated with reading ability (both increase over time). But to suggest that reducing ice cream sales will mitigate shark attacks, or that buying bigger shoes helps you read, is clearly a joke. Other factors (outside temperature in the ice cream example, age in the shoe size example) obviously drive these spurious relationships.

But when the correlations aren't built around jokes and the true causal factor isn't clear, the mantra “correlation does not imply causation” is often forgotten.

For example, in real estate data, you find school performance metrics are correlated with home values. Does this mean better schools cause a home's value to increase? Good schools seemingly make a neighborhood more desirable. Or, does the causality go the other direction: higher home prices cause a boost in school performance? Maybe the increase in tax revenue provides more resources to the school. Or does causality go in both directions, creating a feedback loop? Most of the time, we just don't know. There are clearly a multitude of other factors at play, and it'd be rare to have all the answers you need inside your dataset.

It's safer to assume “there is no causality” between two correlated variables unless someone has conducted an experiment proving otherwise. But don't take this to the extreme. Both authors have seen cases in business, university, and media settings where causation is assumed when it shouldn't be. But there are also cases where an important association is dismissed out of hand as a causation fallacy. (See the sidebar for an example of causality being dismissed when it should not have been.)

DID YOU FIND NEW OPPORTUNITIES IN THE DATA?

EDA is not just a process to better understand data and set a path forward to solve problems. It's also a chance to find other opportunities in the data: problems whose solutions might be valuable to your organization. A data scientist may spot something interesting or weird in a dataset and then formulate a problem around it.

However, you don't know if anyone needs the solution you've found until you follow the steps in Chapter 1, “What Is the Problem?”

CHAPTER SUMMARY

To become a Data Head, you need to embrace an ongoing process of exploratory data analysis. This will allow for:

  • A clearer path forward to solve the problem.
  • A refined business problem, given the constraints identified in the data.
  • New problems to solve with the data.
  • Cancellation of the project. While unsatisfying, EDA is successful if it stops you from wasting time and money on a dead-end problem.

We guided you through the process using a real estate dataset (one we'll return to in Chapter 9 to finally build that model we've been talking about) and talked about common hurdles you may encounter.

The flow of this chapter assumes you can be a part of the EDA process from beginning to end. There will be times when this isn't possible, particularly for senior leaders overseeing many projects. But missing the early stages does not absolve Data Heads of their responsibility to have an exploratory mindset. If you come into a project near the end, ask why the data team chose the analysis they did and what challenges they faced. You may uncover assumptions you would not have made yourself.

NOTES

  1. Quoted in “Understand Regression Analysis” by Amy Gallo, chapter 10 in HBR Guide to Data Analytics Basics for Managers (HBR Guide Series).
  2. Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
  3. Stakeholders, to be clear, should not micromanage. There needs to be a level of trust between the business and data teams.
  4. De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3). You can download the data from www.kaggle.com/c/house-prices-advanced-regression-techniques.
  5. Zillow takes its Zestimate® very seriously. In 2019, it awarded $1 million to a team of data scientists for improving the accuracy of Zestimate® predictions. venturebeat.com/2019/01/30/zillow-awards-1-million-to-team-that-reduced-home-valuation-algorithm-error-to-below-4
  6. Box plots are also called box-and-whisker plots. The “box” contains the middle half of the data (values between the 25th and 75th percentiles), the line in the box is the median, and the “whiskers” show the range of the remaining data points. The dots beyond the whiskers are potential outliers.
  7. Correlation does not mean “steepness.” Two perfectly correlated variables could appear almost flat (though not exactly horizontal).
  8. Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17−21. We multiplied the dependent variable by 22,000 to create reasonable sales price examples.
  9. The datasaurus was created by Alberto Cairo and the data is available on GitHub: github.com/lockedata/datasauRus
  10. Your authors debated if it's even possible to not mention “Correlation Does Not Imply Causation” in a data book. You can see the outcome.
  11. Fisher, R. A. (1958). Cancer and smoking. Nature, 182(4635), 596.