Identifying effective and ineffective visualizations

The main goal of data visualization is to have the reader quickly digest the data, including possible trends, relationships, and more. Ideally, a reader will not have to spend more than 5-6 seconds digesting a single visualization. For this reason, we must take visuals very seriously and ensure that each one is as effective as possible. Let's look at five basic types of graphs: scatter plots, line graphs, bar charts, histograms, and box plots.

Scatter plots

A scatter plot is probably one of the simplest graphs to create. It is made by creating two quantitative axes and using data points to represent observations. The main goal of a scatter plot is to highlight relationships between two variables and, if possible, reveal a correlation.

For example, we can look at two variables: the average hours of TV watched in a day and a 0-100 scale of work performance (0 being very poor performance and 100 being excellent performance). The goal here is to find a relationship (if it exists) between watching TV and average work performance.

The following code simulates a survey in which a few people reported how much television they watch, on average, in a day, alongside a company-standard work performance metric. These first lines import pandas and create 14 sample survey responses to the question of how many hours of TV each person watches in a day:

import pandas as pd 
hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]

This next line of code creates 14 new sample survey results of the same people being rated on their work performance on a scale from 0 to 100. For example, the first person watched 0 hours of TV a day and was rated 87/100 on their work, while the last person watched, on average, 5 hours of TV a day and was rated 72/100:

work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]

Here, we are creating a DataFrame in order to simplify our exploratory data analysis and make it easier to make a scatter plot:

df = pd.DataFrame({'hours_tv_watched':hours_tv_watched, 'work_performance':work_performance}) 

Now, we are actually making our scatter plot:

df.plot(x='hours_tv_watched', y='work_performance', kind='scatter') 

In the following plot, we can see that our axes represent the number of hours of TV watched in a day and the person's work performance metric:

The scatter plot: hours of TV watched versus work performance

Each point on a scatter plot represents a single observation (in this case, a person), and its location is determined by where the observation stands on each variable. This scatter plot does seem to show a relationship: the people who watch more TV in a day tend to have lower work performance scores.

Of course, as we are now experts in statistics from the last two chapters, we know that this relationship might not be causal. A scatter plot can only reveal a correlation or an association, not causation. Advanced statistical tests, such as the ones we saw in Chapter 8, Advanced Statistics, might work to reveal causation. Later on in this chapter, we will see the damaging effects that trusting correlation might have.
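To put a number on the association seen in the scatter plot, one option is the Pearson correlation coefficient. The following is a minimal sketch that rebuilds the same DataFrame and calls pandas' Series.corr; the variable name r is introduced here for illustration only. A value close to -1 indicates a strong negative linear association, and it still says nothing about causation.

```python
import pandas as pd

# Same simulated survey data as in the scatter plot example
hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]
work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]
df = pd.DataFrame({'hours_tv_watched': hours_tv_watched,
                   'work_performance': work_performance})

# Pearson correlation coefficient (the default method of Series.corr)
r = df['hours_tv_watched'].corr(df['work_performance'])
print(round(r, 2))
```

A negative result here agrees with the downward pattern in the plot, but it is a summary of association only.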

Line graphs

Line graphs are, perhaps, one of the most widely used graphs in data communication. A line graph simply uses lines to connect data points and usually represents time on the x-axis. Line graphs are a popular way to show changes in variables over time. The line graph, like the scatter plot, is used to plot quantitative variables.

As a great example, many of us wonder about the possible links between what we see on TV and our behavior in the world. A friend of mine once took this thought to an extreme: he wondered if he could find a relationship between the TV show The X-Files and the number of UFO sightings in the U.S. He found the number of sightings of UFOs per year and plotted them over time. He then added a quick graphic to ensure that readers would be able to identify the point in time when The X-Files premiered:

Total reported UFO sightings since 1963 (Source: http://www.questionable-economics.com/what-do-we-know-about-aliens/)

It appears to be clear that right after 1993, the year of The X-Files' premiere, the number of UFO sightings started to climb drastically.

This graphic, albeit light-hearted, is an excellent example of a simple line graph. We are told what each axis measures, we can quickly see a general trend in the data, and we can identify with the author's intent, which is to show a relationship between the number of UFO sightings and The X-Files' premiere.

On the other hand, the following is a less impressive line chart:

The line graph: gas price changes

This line graph attempts to highlight the change in the price of gas by plotting three points in time. At first glance, it is not much different from the previous graph; we have time on the bottom x-axis and a quantitative value on the vertical y-axis. The (not so) subtle difference here is that the three points are equally spaced out on the x-axis; however, if we read their actual time indications, they are not equally spaced out in time. A year separates the first two points, whereas a mere 7 days separate the last two points. Spacing them equally distorts the slopes, so a change over one week looks as gradual as a change over a full year.
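One way to avoid this problem is to plot against real dates rather than evenly spaced labels. The following is a minimal sketch, assuming three hypothetical prices and dates (they are not the values from the chart above); with a DatetimeIndex, pandas positions each point according to its actual date.

```python
import pandas as pd

# Hypothetical prices on three unevenly spaced dates (illustrative values only)
dates = pd.to_datetime(['2013-01-01', '2014-01-01', '2014-01-08'])
gas_prices = pd.Series([3.30, 3.40, 3.50], index=dates)

# Gaps between consecutive observations, in days
gaps_in_days = gas_prices.index.to_series().diff().dt.days.dropna().tolist()
print(gaps_in_days)

# Because the index holds real dates, the x-axis spacing reflects elapsed time
ax = gas_prices.plot(marker='o', title='Gas price (hypothetical data)')
ax.set_xlabel('Date')
ax.set_ylabel('Price')
```

On the resulting plot, the last two points sit almost on top of each other, which makes the one-week jump easy to tell apart from the year-long change.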

Bar charts

We generally turn to bar charts when trying to compare variables across different groups. For example, we can plot the number of countries per continent using a bar chart. Note how the x-axis does not represent a quantitative variable; in fact, when using a bar chart, the x-axis is generally a categorical variable, while the y-axis is quantitative.

Note that, for this code, I am using the World Health Organization's report on alcohol consumption around the world by country:

from matplotlib import pyplot as plt
# Load the WHO alcohol consumption data (one row per country)
drinks = pd.read_csv('https://raw.githubusercontent.com/sinanuozdemir/principles_of_data_science/master/data/chapter_2/drinks.csv')
# Count the countries in each continent and draw one bar per continent
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')

The following graph shows us a count of the number of countries in each continent. We can see the continent code at the bottom of the bars and the bar height represents the number of countries we have in each continent. For example, we see that Africa has the most countries represented in our survey, while South America has the fewest:

Bar chart: countries in continent

In addition to the count of countries, we can also plot the average beer servings per continent using a bar chart, as shown:

drinks.groupby('continent').beer_servings.mean().plot(kind='bar')  

The preceding code gives us this chart:

Bar chart: average beer servings per continent

Note how a scatter plot or a line graph would not be able to support this data because they can only handle quantitative variables; bar charts can display categorical variables.
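To see the numbers behind the bar heights, it can help to run the same two aggregations on a tiny table. The following is a minimal sketch, assuming a hypothetical six-row DataFrame named toy with made-up values; only the column names mirror the drinks data.

```python
import pandas as pd

# Hypothetical stand-in for the drinks data (values are made up)
toy = pd.DataFrame({
    'continent': ['AF', 'AF', 'EU', 'EU', 'EU', 'AS'],
    'beer_servings': [25, 35, 200, 300, 250, 80],
})

# One bar per category: how many rows fall in each continent
counts = toy['continent'].value_counts()

# One bar per category: the mean of a quantitative column within each continent
mean_beer = toy.groupby('continent')['beer_servings'].mean()

print(counts)
print(mean_beer)
```

In both cases the index is the categorical variable that ends up on the x-axis, and the values are the quantitative bar heights.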

We can also use bar charts to graph variables that change over time, like a line graph.

Histograms

Histograms show the frequency distribution of a single quantitative variable by splitting the data, by range, into equal-width bins and plotting the raw count of observations in each bin. A histogram is effectively a bar chart where the x-axis is a bin (subrange) of values and the y-axis is a count. As an example, I will import a store's daily number of unique customers, as shown:

rossmann_sales = pd.read_csv('data/rossmann.csv')
 
rossmann_sales.head()

We get the following table:

Table: the first five rows of the Rossmann sales data

Note how we have data for multiple stores (identified by the first column, Store). Let's subset this data for only the first store, as shown:

first_rossmann_sales = rossmann_sales[rossmann_sales['Store']==1] 

Now, let's plot a histogram of the first store's customer count:

first_rossmann_sales['Customers'].hist(bins=20)
plt.xlabel('Customer Bins')
plt.ylabel('Count')

This is what we get:

Histogram: customer counts

The x-axis is now categorical in that each category is a selected range of values; for example, 600-620 customers would potentially be a category. The y-axis, as in a bar chart, plots the number of observations in each category. In this graph, for example, one might take away the fact that most of the time, the number of customers on any given day will fall between 500 and 700.

Altogether, histograms are used to visualize the distribution of values that a quantitative variable can take on.
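The binning step can also be done without drawing anything. The following is a minimal sketch using NumPy's histogram function on a short hypothetical list of daily customer counts (made up for illustration, not taken from the Rossmann file); it returns the count in each equal-width bin together with the bin edges.

```python
import numpy as np

# Hypothetical daily customer counts (illustrative values only)
customers = [0, 0, 480, 510, 530, 560, 590, 610, 640, 1200]

# Split the range of the data into 4 equal-width bins and count each bin
counts, edges = np.histogram(customers, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.0f} to {right:.0f}: {count}')
```

Each printed row corresponds to one bar of the histogram, and the counts always add up to the number of observations.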

Note

In a histogram, we do not put spaces between bars.

Box plots

Box plots are also used to show a distribution of values. They are created by plotting the five-number summary, as follows:

  • The minimum value
  • The first quartile (the number that separates the 25% lowest values from the rest)
  • The median
  • The third quartile (the number that separates the 25% highest values from the rest)
  • The maximum value

In pandas, when we create box plots, the line inside the box (red in the plots shown here) denotes the median, the top of the box (or the right if it is horizontal) is the third quartile, and the bottom (left) part of the box is the first quartile.
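The five-number summary listed above can be computed directly. The following is a minimal sketch on a small hypothetical Series (the values 1 through 9 are chosen only to make the result easy to check by hand), using the quantile method in pandas.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 9
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Minimum, first quartile, median, third quartile, maximum
five_number = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_number)
```

The 0.25, 0.5, and 0.75 entries are the bottom edge, middle line, and top edge of the box, respectively.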

The following is a series of box plots showing the distribution of beer consumption according to continents:

drinks.boxplot(column='beer_servings', by='continent')

We get this graph:

Box plot: beer consumption by continent

Now, we can clearly see the distribution of beer consumption across the continents and how they differ. Africa and Asia have a much lower median of beer consumption than Europe or North America.

Box plots also have the added bonus of being able to show outliers much better than a histogram. This is because the minimum and maximum are parts of the box plot.
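One detail worth checking against your own plots: with matplotlib's default whisker setting (whis=1.5), which pandas box plots use unless told otherwise, the whiskers stop at the most extreme points that lie within 1.5 times the interquartile range of the box, and anything beyond is drawn as an individual outlier marker. The following is a minimal sketch of that rule on hypothetical data; the variable names are ours.

```python
import pandas as pd

# Hypothetical data with one unusually large value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences would be drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

In this toy example, only the value 30 falls outside the fences.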

Getting back to the customer data, let's look at the same store customer numbers, but using a box plot:

first_rossmann_sales.boxplot(column='Customers', vert=False)

This is the graph we get:

Box plot: customer sales

This is the exact same data as plotted earlier in the histogram; however, now it is shown as a box plot. For the purpose of comparison, I will show you both graphs, one after the other:

Histogram: customer counts

Box plot: customer sales

Note how the x-axis for each graph is the same, ranging from 0 to 1,200. The box plot is much quicker at giving us a center for the data (the red line is the median), while the histogram works much better at showing how spread out the data is and which bins are the largest. For example, the histogram reveals that there is a very large bin of zero people. This means that for a little over 150 days of data, there were zero customers.
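A reading like this can be confirmed with a direct count rather than by eye. The following is a minimal sketch, assuming a hypothetical six-row table with the same Store and Customers column names (the values are made up and are not the Rossmann data).

```python
import pandas as pd

# Hypothetical stand-in for the sales data (values are made up)
toy_sales = pd.DataFrame({
    'Store': [1, 1, 1, 1, 2, 2],
    'Customers': [0, 520, 0, 610, 0, 700],
})

# Keep only the rows for the first store, as in the text
first_store = toy_sales[toy_sales['Store'] == 1]

# Number of days on which the first store had zero customers
zero_days = (first_store['Customers'] == 0).sum()
print(zero_days)
```

Running the same comparison on first_rossmann_sales would give the exact height of the zero bin in the histogram.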

Note that we can get the exact numbers used to construct a box plot with the describe method in pandas, as shown:

first_rossmann_sales['Customers'].describe() 
 
 
min         0.000000 
25%       463.000000 
50%       529.000000 
75%       598.750000 
max      1130.000000 