Creating interesting stories with data

Data visualization is regularly promoted for its ability to reveal stories in data and, in some cases, to reveal the less obvious stories visually. In recent years, journalists have been integrating visualizations into their narratives, often helping us understand their stories better. In the commercial world, few grasp how data can be tied to a meaningful story that appeals to the audience both emotionally and intellectually. As Rudyard Kipling wrote, "If history were taught in the form of stories, it would never be forgotten"; a similar thought applies to data. Data is understood and remembered better when it is presented in the right way.

Why are stories so important?

There are many tools and methods of visualization available today: bar and pie charts, tables, line graphs, bubble charts, scatter plots, and so on; the list is long. However, with these tools, the focus is on data exploration, not on aiding a narrative. Visualizations that do help tell stories exist, but they are rare, mainly because finding the story is significantly harder than crunching the numbers. Narratives can be either reader-driven or author-driven.

In an author-driven narrative, the data and visualization are chosen by the author and presented to the reader. A reader-driven narrative, on the other hand, provides tools and methods for the reader to play with the data, which gives the reader more flexibility and more choices in analyzing and understanding the visualization.

Reader-driven narratives

In 2010, researchers at Stanford University reviewed the emerging importance of storytelling and suggested some design strategies for narrative visualization. According to their study, a purely author-driven approach has a strict linear path through the visualization, relies on messaging, and has no interactivity, whereas a purely reader-driven approach has no prescribed ordering of images, no messaging, and a high degree of interactivity. An example of the author-driven approach is a slideshow presentation. The seven genres of narrative visualization listed by that study are magazine style, annotated chart, partitioned poster, flow chart, comic strip, slideshow, and film/video/animation.

Gapminder

A classic example of a reader-driven narrative combined with a data-driven one is Gapminder World (http://gapminder.org/world). It has a collection of over 600 data indicators covering international economy, environment, health, technology, and much more. It provides tools that students can use to study real-world issues and discover patterns, trends, and correlations. It is built on the Trendalyzer software, which was originally developed by Hans Rosling's foundation in Sweden and was acquired by Google in March 2007.


The information visualization technique used by Gapminder is an interactive bubble chart that shows five variables by default: X, Y, bubble size, color, and a time variable that is controlled by a slider. The slider, together with the choice of what goes along the X and Y axes, makes the chart interactive. However, creating a story, even with a tool like this, is not necessarily easy. Storytelling is a craft and can be an effective knowledge-sharing technique, because it conveys context and emotional content more effectively than most other modes of communication.

The most compelling storytellers grasp the importance of understanding the audience. They might tell the same story to a child and to an adult, but the articulation would be different. Similarly, a data-driven or reader-driven story should be adjusted to suit whoever is listening to or studying it. For example, for an executive, statistics are likely the key, whereas a business intelligence manager would most likely be interested in methods and techniques.

There are many JavaScript frameworks available today for creating interactive visualizations, and the most popular one is D3.js. With Python, there are only a few ways today to create an interactive visualization (without using Flash). One way is to generate the data in a JSON format that D3.js can use to plot, and a second option is to use Plotly (http://www.plot.ly). We will go into more detail about Plotly in the concluding section of this chapter.
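
As a minimal sketch of the first option, the following snippet serializes a few records to JSON with Python's standard json module. The field names (year, males, females), the values, and the helper to_d3_json() are hypothetical and are not taken from any dataset in this chapter; D3.js does not impose a schema, so an array of objects is only one common layout.

```python
import json

# Hypothetical records; field names and values are illustrative only
records = [
    {"year": 2001, "males": 10.5, "females": 8.25},
    {"year": 2002, "males": 10.0, "females": 8.0},
]


def to_d3_json(rows):
    """Serialize a list of dicts as a JSON array of objects."""
    return json.dumps(rows, indent=2)


payload = to_d3_json(records)
print(payload)
```

Writing payload to a file and loading that file in the browser (for example, with d3.json in D3.js) would then hand the same records to the plotting code.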

The State of the Union address

Twitter created a visualization of the tweets posted during President Obama's speech, graphing them by location and topic. This visualization is interesting because it captures a lot of detail in one place. Scroll through the speech to see how Twitter reacted; it is posted at http://twitter.github.io/interactive/sotu2015/#p1.


Mortality rate in the USA

The mortality rate in the USA fell by about 17 percent from 1968 to 2010, the years for which we have detailed data (from http://www.who.int/healthinfo/mortality_data/en/). Almost all of this improvement can be attributed to improved survival prospects for men. It looks as if progress stopped in the mid-1990s, but one reason may be that the population has aged a lot since then. A complete description is available from Bloomberg; here, we attempt to display two visualizations:

  • Mortality rate during the period 1968-2010 among men, women, and combined
  • Mortality rate for seven age groups to show some interesting results

The code for this example is as follows:

import csv
import matplotlib.pyplot as plt

# Data from http://www.who.int/healthinfo/mortality_data/en/
with open('/Users/MacBook/Downloads/mortality1.csv') as csvfile:
    mortdata = [row for row in csv.DictReader(csvfile)]

# The file has one row per year, starting with 1968
x = []
males_y = []
females_y = []
every_y = []
yrval = 1968
for row in mortdata:
    x.append(yrval)
    # The csv module returns strings; convert them so the y axis is numeric
    males_y.append(float(row['Males']))
    females_y.append(float(row['Females']))
    every_y.append(float(row['Everyone']))
    yrval += 1

fig = plt.figure(figsize=(15, 13))
plt.plot(x, males_y, color='#1a61c3', label='Males', linewidth=1.8)
plt.plot(x, females_y, color='#bc108d', label='Females', linewidth=1.8)
plt.plot(x, every_y, color='#747e8a', label='Everyone', linewidth=1.8)
plt.ylim(740, 1128)
plt.xlim(1965, 2011)
plt.legend(loc=0, prop={'size': 10})
plt.show()

The mortality rates were measured per 100,000 people. Dividing the population into separate age cohorts shows that the improvements in life expectancy have been ongoing, with the most progress in the age group below 25. What exactly happened to the 25-44 age group (shown in red)? The narrative on Bloomberg explains it well by connecting another fact: deaths caused by AIDS affected that age group heavily during that time.


AIDS killed more than 40,000 Americans a year, and 75 percent of them were in the 25-44 age group, which explains the unusual results seen during that window of time. The code for this second visualization is as follows:

import csv
import matplotlib.pyplot as plt

colorsdata = ['#168cf8', '#ff0000', '#009f00', '#1d437c', '#eb912b', '#8663ec', '#38762b']
labeldata = ['Below 25', '25-44', '45-54', '55-64', '65-74', '75-84', 'Over 85']

# Using reader() instead of DictReader() so that we can loop over
# column positions to build the y-values
with open('/Users/MacBook/Downloads/mortality2.csv') as csvfile:
    mortdata = [row for row in csv.reader(csvfile)]

x = []
y = [[] for _ in labeldata]  # one list of values per age group
for row in mortdata:
    x.append(int(row[0]))  # the first column holds the year
    for col in range(7):
        # The csv module returns strings; convert them to numbers
        y[col].append(float(row[col + 1]))

fig = plt.figure(figsize=(15, 13))
for col in range(7):
    # Draw the 25-44 age group (index 1) with a thicker line
    width = 3.8 if col == 1 else 2
    plt.plot(x, y[col], color=colorsdata[col], label=labeldata[col], linewidth=width)

plt.ylim(35, 102)
plt.xlim(1965, 2015)
plt.legend(loc=0, prop={'size': 10})
plt.show()

The difference between csv.reader() and csv.DictReader() is that when the input CSV file has fieldnames (or column names), DictReader() uses the fieldnames as keys and the actual value in each column as the data value, whereas reader() returns every row as a plain list. In the preceding example, we have used reader() because it is convenient when looping over columns by position (the y-values for each age group are built from row[col+1]). Note that with reader(), if the column names exist in the CSV file, that first row is returned like any other row and must be skipped explicitly.

We have also made filtered data for both these examples available as mortality1.csv and mortality2.csv at http://www.knapdata.com/python.

The result of mortdata[:4] differs between these two methods of reading. When we use reader(), the result of mortdata[:4] will be as follows:

[['1969', '100', '99.92', '97.51', '97.47', '97.54', '97.65', '96.04'],
 ['1970', '98.63', '97.78', '97.16', '97.32', '96.2', '96.51', '83.4'],
 ['1971', '93.53', '95.26', '94.52', '94.89', '93.53', '93.73', '89.63'],
 ['1972', '88.86', '92.45', '94.58', '95.14', '94.55', '94.1', '89.51']]

With DictReader(), assuming that the CSV file has fieldnames, the four rows will be displayed as follows (the order of the keys may vary with the Python version):

[{'25-44': '99.92', '45-54': '97.51', '55-64': '97.47', '65-74': '97.54', '75-84': '97.65', 'Below 25': '100', 'Over 85': '96.04', 'Year': '1969'},
 {'25-44': '97.78', '45-54': '97.16', '55-64': '97.32', '65-74': '96.2', '75-84': '96.51', 'Below 25': '98.63', 'Over 85': '83.4', 'Year': '1970'},
 {'25-44': '95.26', '45-54': '94.52', '55-64': '94.89', '65-74': '93.53', '75-84': '93.73', 'Below 25': '93.53', 'Over 85': '89.63', 'Year': '1971'},
 {'25-44': '92.45', '45-54': '94.58', '55-64': '95.14', '65-74': '94.55', '75-84': '94.1', 'Below 25': '88.86', 'Over 85': '89.51', 'Year': '1972'}]
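
The contrast between the two readers can be sketched on a small in-memory sample. The two data rows below are copied from the reader() output shown earlier; the header row is an assumption for illustration, with column names taken from the DictReader() output, and the helper names read_as_lists() and read_as_dicts() are hypothetical.

```python
import csv
import io

# Two data rows from the reader() output above, preceded by a header
# row that is assumed here for illustration
SAMPLE = (
    "Year,Below 25,25-44,45-54,55-64,65-74,75-84,Over 85\n"
    "1969,100,99.92,97.51,97.47,97.54,97.65,96.04\n"
    "1970,98.63,97.78,97.16,97.32,96.2,96.51,83.4\n"
)


def read_as_lists(text):
    """Read with csv.reader(), skipping the header row explicitly."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # discard the row of column names
    return [row for row in reader]


def read_as_dicts(text):
    """Read with csv.DictReader(), which consumes the header row itself."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


list_rows = read_as_lists(SAMPLE)
dict_rows = read_as_dicts(SAMPLE)

print(list_rows[0][2])        # value addressed by column position
print(dict_rows[0]['25-44'])  # the same value addressed by column name
```

Both print statements address the same cell. Positional access suits loops over columns, while access by name keeps the code readable when only a few specific columns are needed.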

A few other example narratives

There are numerous examples that one can explore, visualize, interact and play with. Some notable ones are as follows:

  • How the recession reshaped the economy in 255 charts (NY Times): This narrative shows how, in the five years since the end of the Great Recession, the economy regained the nine million jobs it had lost, and highlights which industries recovered faster than others. (Source: http://tinyurl.com/nwdp3pp.)
  • Washington Wizards shooting stars of 2013 (Washington Post): This interactive graphic is based on the performance of the Washington Wizards in 2013 and analyzes how the signing of Paul Pierce could bring much improved shooting from the mid-range. (Source: http://www.washingtonpost.com/wp-srv/special/sports/wizards-shooting-stars/.)

Author-driven narratives

The New York Times produces some of the world's best data visualization, multimedia, and interactive stories. Its ambition for these projects has always been to meet the highest journalistic standards and to create genuinely new kinds of experiences for readers. The storytelling culture in its newsroom is one of the sources of energy behind this work.

For example, there is a combined data-driven and author-driven narrative titled The Geography of Chaos in Yemen. On March 26, 2015, Saudi Arabian jets struck targets in Yemen in a drive against the Houthi rebel group. Yemen matters to key players such as Saudi Arabia, Iran, and the United States. The Houthis' influence has grown over the past years, and the authors at the NY Times captured this growth visually.


Yemen is home to Al Qaeda in the Arabian Peninsula, one of Al Qaeda's most active branches. Since 2009, the United States has carried out at least 100 airstrikes in Yemen. In addition to Al Qaeda's presence, the Islamic State is also active in the region, and it recently claimed responsibility for the bombings at two Shiite mosques in Sana that killed more than 135 people. The visualization of these events draws on data from The Bureau of Investigative Journalism and the American Enterprise Institute's Critical Threats Project.


Another good example is the visualization of the Atlantic's past by David McCandless, which shows what the oceans were like before over-fishing. It is hard to imagine the damage that over-fishing is wreaking on the oceans, because the effects are invisible, hidden beneath the surface. The visualization shows the biomass of popularly eaten fish in the North Atlantic Ocean in 1900 and 2000. These fish include tuna, cod, haddock, hake, halibut, herring, mackerel, pollock, salmon, sea trout, striped bass, sturgeon, and turbot, many of which are now vulnerable or endangered.

Dr. Villy Christensen and his colleagues at the University of British Columbia used ecosystem models, underwater terrain maps, fish-catch records, and statistical analysis to render the biomass of the Atlantic fish at various points in the twentieth century.
