The Ebola example

To illustrate the steps mentioned in the previous section and how they lead to an understandable visualization, let us consider the question we had earlier, that is, How many Ebola deaths were reported in the year 2014? This particular question leads to very specific data, which is usually maintained by the World Health Organization (http://www.who.int/en/) or Humanitarian Data Exchange (https://hdx.rwlabs.org). The original source of this data is the World Health Organization (WHO), but the Humanitarian Data Exchange (HDX) is the contributor. Please note, however, that we shall have all the data, along with the Python source code examples for this book, available at a single place.

The data contains information about the spread of the Ebola disease in Guinea, Liberia, Mali, Nigeria, Senegal, Sierra Leone, Spain, United Kingdom, and the United States of America.

The contributor URL for this information is https://data.hdx.rwlabs.org/dataset/ebola-cases-2014/.

The data file, in the Comma Separated Values (CSV) format, contains the indicator, the country name, the date, and a value that is either the number of deaths or the number of infections, depending on the indicator. There are 36 different indicators, and 10 of them are listed as follows (the others can be viewed in Appendix, Go Forth and Explore Visualization):

  • Number of probable Ebola cases in the last 7 days
  • Number of probable Ebola deaths in the last 21 days
  • Number of suspected Ebola cases in the last 21 days
  • Number of suspected Ebola cases in the last 7 days
  • Number of suspected Ebola deaths in the last 21 days
  • Proportion of confirmed Ebola cases that are from the last 21 days
  • Proportion of confirmed Ebola cases that are from the last 7 days
  • Proportion of confirmed Ebola deaths that are from the last 21 days
  • Proportion of suspected Ebola cases that are from the last 7 days
  • Proportion of suspected Ebola deaths that are from the last 21 days

At this point, after looking at the list of indicators, the single question that we had initially, that is, How many Ebola deaths were reported in the year 2014? could be changed to multiple sets of questions. For simplicity, we stay focused on that single question and see how we can further analyze the data and come up with a visualization method. First, let us take a look at the ways to read the data file.

In any programming language, there is more than one way to read a file. One option is the pandas library for Python, which provides high-performance data structures and data analysis tools. The other option is the csv library, which reads data files in the CSV format. What is the difference between them? Both can do the job. Older versions of pandas had issues with memory maps for large data (that is, if the data file in the CSV format was very large), but this has since been optimized. So let's start with the following code:

[1]:  import csv
      with open('/Users/kvenkatr/python/ebola.csv', 'rt') as f:
        # Keep rows whose indicator mentions deaths and whose value is nonzero
        filtereddata = [row for row in csv.reader(f) if row[3] != "0.0" and
        row[3] != "0" and "deaths" in row[0]]

[2]:    len(filtereddata)
Out[2]: 1194

The preceding filter can also be performed using pandas, as follows:

import pandas as pd
eboladata = pd.read_csv("/Users/kvenkatr/python/ebola.csv")
filtered = eboladata[eboladata["value"]>0]
filtered = filtered[filtered["Indicator"].str.contains("deaths")]
len(filtered)

The data can be downloaded from http://www.knapdata.com/python/ebola.csv. The csv version of the code opens the data file in read text (rt) mode, reads each row, and keeps only the rows whose indicator string contains the word deaths and whose value is not zero. This is a very straightforward filter that ignores entries with no reported deaths. Printing only the first five rows of the filtered data shows the following:

[3]:  filtereddata[:5]
Out[3]: 
[['Cumulative number of confirmed Ebola deaths', 'Guinea','2014-08-29', '287.0'],
 ['Cumulative number of probable Ebola deaths','Guinea','2014-08-29',
  '141.0'],
 ['Cumulative number of suspected Ebola deaths','Guinea','2014-08-29',
  '2.0'],
 ['Cumulative number of confirmed, probable and suspected Ebola deaths',
  'Guinea','2014-08-29','430.0'],
 ['Cumulative number of confirmed Ebola deaths',
  'Liberia','2014-08-29','225.0']]
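To see what the filter keeps and what it drops without needing the data file, here is a minimal sketch that applies the same list comprehension to a few in-memory rows. The sample rows are hypothetical (only the first one mirrors the output above), and the header names are an assumption about the file layout.

```python
import csv
import io

# Hypothetical sample in the same four-column layout; not the real data file.
sample = io.StringIO(
    "Indicator,Country,Date,value\n"
    "Cumulative number of confirmed Ebola deaths,Guinea,2014-08-29,287.0\n"
    "Cumulative number of confirmed Ebola cases,Guinea,2014-08-29,482.0\n"
    "Cumulative number of confirmed Ebola deaths,Senegal,2014-08-29,0.0\n"
    "Cumulative number of suspected Ebola deaths,Mali,2014-08-29,0\n"
)

# Same condition as in the text: indicator mentions "deaths", value is nonzero.
filtered_sample = [
    row for row in csv.reader(sample)
    if row[3] != "0.0" and row[3] != "0" and "deaths" in row[0]
]

for row in filtered_sample:
    print(row)
```

Note that the header row is dropped only because its first field does not contain the word deaths; the filter does not skip it explicitly.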

If the reported data for each country are to be separated, how do we filter further? The data file has four columns: indicator, country, date, and number value. We can sort the rows on the country column, as shown in the following code:

[4]:  import operator
      sorteddata = sorted(filtereddata, key=operator.itemgetter(1))
[5]:  sorteddata[:5]
Out[5]: 
[['Cumulative number of confirmed Ebola deaths', 'Guinea','2014-08-29', '287.0'],
 ['Cumulative number of probable Ebola deaths','Guinea','2014-08-29',
  '141.0'],
 ['Cumulative number of suspected Ebola deaths','Guinea','2014-08-29',
  '2.0'],
 ['Cumulative number of confirmed, probable and suspected Ebola deaths',
  'Guinea','2014-08-29','430.0'],
 ['Number of confirmed Ebola deaths in the last 21 days', 'Guinea',
  '2014-08-29','8.0']]
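Sorting puts the rows of each country next to each other, but it does not yet separate them. One possible way to finish the job is to group the sorted rows into a dictionary keyed by country, sketched here with itertools.groupby. The rows and the name rows_by_country are hypothetical and used only for illustration.

```python
import operator
from itertools import groupby

# Hypothetical rows in the same [indicator, country, date, value] layout.
rows = [
    ["Cumulative number of confirmed Ebola deaths", "Liberia", "2014-08-29", "225.0"],
    ["Cumulative number of confirmed Ebola deaths", "Guinea", "2014-08-29", "287.0"],
    ["Cumulative number of probable Ebola deaths", "Guinea", "2014-08-29", "141.0"],
]

# groupby only merges adjacent items, so sort on the same key first.
by_country_key = operator.itemgetter(1)
rows_sorted = sorted(rows, key=by_country_key)
rows_by_country = {
    country: list(group)
    for country, group in groupby(rows_sorted, key=by_country_key)
}

print(sorted(rows_by_country))         # country names
print(len(rows_by_country["Guinea"]))  # number of Guinea rows
```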

After looking at the data so far, there are two indicators that appear to be of interest in the context in which we started this data expedition:

  • Cumulative number of confirmed Ebola deaths
  • Cumulative number of confirmed, probable and suspected Ebola deaths

By plotting the data a few times, we also notice that, among the countries listed, Guinea, Liberia, and Sierra Leone had more confirmed deaths than the others. We will now see how the reported deaths in these three countries can be plotted:

import matplotlib.pyplot as plt  
import csv 
import operator 
import datetime as dt  

with open('/Users/kvenkatr/python/ebola.csv', 'rt') as f: 
  filtereddata = [row for row in csv.reader(f) if row[3] != "0.0" and 
  row[3] != "0" and "deaths" in row[0]] 

sorteddata = sorted(filtereddata, key=operator.itemgetter(1))  
guineadata  = [row for row in sorteddata if row[1] == "Guinea" and 
  row[0] == "Cumulative number of confirmed Ebola deaths"] 
sierradata  = [row for row in sorteddata if row[1] == "Sierra Leone" and 
  row[0] == "Cumulative number of confirmed Ebola deaths"] 
liberiadata = [row for row in sorteddata if row[1] == "Liberia" and 
  row[0] == "Cumulative number of confirmed Ebola deaths"] 

g_x = [dt.datetime.strptime(row[2], '%Y-%m-%d').date() for 
  row in guineadata] 
# The csv reader returns text, so convert the values to numbers before plotting
g_y = [float(row[3]) for row in guineadata] 

s_x = [dt.datetime.strptime(row[2], '%Y-%m-%d').date() for 
  row in sierradata] 
s_y = [float(row[3]) for row in sierradata] 

l_x = [dt.datetime.strptime(row[2], '%Y-%m-%d').date() for
  row in liberiadata] 
l_y = [float(row[3]) for row in liberiadata] 

plt.figure(figsize=(10,10))
plt.plot(g_x,g_y, color='red', linewidth=2, label="Guinea") 
plt.plot(s_x,s_y, color='orange', linewidth=2, label="Sierra Leone")
plt.plot(l_x,l_y, color='blue', linewidth=2, label="Liberia")
plt.xlabel('Date', fontsize=18)
plt.ylabel('Number of Ebola Deaths', fontsize=18)
plt.title("Confirmed Ebola Deaths", fontsize=20)
plt.legend(loc=2)
plt.show()

The result would look like the following image:

[Figure: line plot of confirmed Ebola deaths by date for Guinea, Sierra Leone, and Liberia]
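The plotting code above repeats the same pair of list comprehensions for each country. As a sketch of one way to reduce that repetition, the helper below returns the dates and numeric values for a given country and indicator, ordered by date. The function name extract_series and the sample rows are hypothetical; only the 2014-08-29 values come from the output shown earlier.

```python
import datetime as dt

def extract_series(rows, country, indicator):
    """Return (dates, values) for one country and indicator, ordered by date."""
    points = sorted(
        (dt.datetime.strptime(row[2], "%Y-%m-%d").date(), float(row[3]))
        for row in rows
        if row[1] == country and row[0] == indicator
    )
    dates = [d for d, _ in points]
    values = [v for _, v in points]
    return dates, values

INDICATOR = "Cumulative number of confirmed Ebola deaths"
# Hypothetical rows; the 2014-09-05 value is made up for illustration.
demo_rows = [
    [INDICATOR, "Guinea", "2014-09-05", "300.0"],
    [INDICATOR, "Guinea", "2014-08-29", "287.0"],
    [INDICATOR, "Liberia", "2014-08-29", "225.0"],
]
g_x, g_y = extract_series(demo_rows, "Guinea", INDICATOR)
print(g_x, g_y)
```

The returned lists can be passed straight to plt.plot, for example plt.plot(g_x, g_y, label="Guinea").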

We can construct a similar plot for the other indicator, that is, cumulative number of confirmed, probable, and suspected Ebola deaths. (This may not be the best way to do so, but we could include the data from more countries and plot a similar result.)

import matplotlib.pyplot as plt  
import csv 
import operator 
import datetime as dt  

with open('/Users/kvenkatr/python/ebola.csv', 'rt') as f: 
  filtereddata = [row for row in csv.reader(f) if row[3] != "0.0" and 
  row[3] != "0" and "deaths" in row[0]] 

sorteddata = sorted(filtereddata, key=operator.itemgetter(1))  

# The indicator string must match the file exactly, with no extra spaces
guineadata  = [row for row in sorteddata if row[1] == "Guinea" and 
  row[0] == "Cumulative number of confirmed, probable and suspected Ebola deaths"] 
sierradata  = [row for row in sorteddata if row[1] == "Sierra Leone" and 
  row[0] == "Cumulative number of confirmed, probable and suspected Ebola deaths"] 
liberiadata = [row for row in sorteddata if row[1] == "Liberia" and 
  row[0] == "Cumulative number of confirmed, probable and suspected Ebola deaths"] 

g_x = [dt.datetime.strptime(row[2], '%Y-%m-%d').date() for 
  row in guineadata] 
# The csv reader returns text, so convert the values to numbers before plotting
g_y = [float(row[3]) for row in guineadata] 

s_x = [dt.datetime.strptime(row[2], '%Y-%m-%d').date() for 
  row in sierradata] 
s_y = [float(row[3]) for row in sierradata] 

l_x = [dt.datetime.strptime(row[2], '%Y-%m-%d').date() for
  row in liberiadata] 
l_y = [float(row[3]) for row in liberiadata] 

plt.figure(figsize=(10,10))
plt.plot(g_x,g_y, color='red', linewidth=2, label="Guinea") 
plt.plot(s_x,s_y, color='orange', linewidth=2, label="Sierra Leone")
plt.plot(l_x,l_y, color='blue', linewidth=2, label="Liberia")
plt.xlabel('Date', fontsize=18)
plt.ylabel('Number of Ebola Deaths', fontsize=18)
plt.title("Probable and Suspected Ebola Deaths", fontsize=20)
plt.legend(loc=2)
plt.show()

The plot should look like this:

[Figure: line plot of confirmed, probable, and suspected Ebola deaths by date for Guinea, Sierra Leone, and Liberia]
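The plots show trends, but the original question asks for a number. Because the indicator is cumulative, one possible reading is to take each country's latest 2014 value and add these up. The sketch below does this on hypothetical rows (only the Guinea value for 2014-08-29 comes from the output shown earlier) and assumes that the cumulative count starts at the beginning of the reporting period.

```python
INDICATOR = "Cumulative number of confirmed, probable and suspected Ebola deaths"

# Hypothetical rows in the [indicator, country, date, value] layout.
rows = [
    [INDICATOR, "Guinea", "2014-08-29", "430.0"],
    [INDICATOR, "Guinea", "2014-09-05", "500.0"],
    [INDICATOR, "Liberia", "2014-09-05", "800.0"],
    [INDICATOR, "Liberia", "2015-01-02", "900.0"],
]

latest_2014 = {}  # country -> (date string, value)
for indicator, country, date, value in rows:
    if indicator != INDICATOR or not date.startswith("2014-"):
        continue
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    if country not in latest_2014 or date > latest_2014[country][0]:
        latest_2014[country] = (date, float(value))

total_2014 = sum(value for _, value in latest_2014.values())
print(latest_2014)
print(total_2014)
```

Summing every row instead would count the same deaths many times over, since each row already includes all earlier reports.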