© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Pajankar, Hands-on Matplotlib, https://doi.org/10.1007/978-1-4842-7410-1_18

18. Visualizing Real-Life Data with Matplotlib and Seaborn

Ashwin Pajankar, Nashik, Maharashtra, India

In the previous chapter, you learned how to visualize data with a new data visualization library for scientific Python tasks. You learned to create visualizations from data stored in various formats.

In this chapter, you will take all the knowledge you have obtained in the earlier chapters of this book and put it together to prepare visualizations for real-life data from the COVID-19 pandemic and animal disease datasets obtained from the Internet. The following are the topics you will explore in this chapter:
  • COVID-19 pandemic data

  • Fetching the pandemic data programmatically

  • Preparing the data for visualization

  • Creating visualizations with Matplotlib and Seaborn

  • Creating visualizations of animal disease data

After reading this chapter, you will be comfortable working with and creating visualizations of real-life datasets.

COVID-19 Pandemic Data

The world is facing the COVID-19 pandemic as of this writing (May 2021). COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The symptoms include common flu-like symptoms and breathing troubles.

Several organizations collect and share near-real-time data on the pandemic. Two of them are Johns Hopkins University (https://coronavirus.jhu.edu/map.html) and Worldometers (https://www.worldometers.info/coronavirus/). Both sites publish data about the COVID-19 pandemic and refresh it frequently. Figure 18-1 shows the Johns Hopkins page for COVID-19.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig1_HTML.jpg
Figure 18-1

Johns Hopkins COVID-19 home page

Figure 18-2 shows the Worldometers website.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig2_HTML.jpg
Figure 18-2

Worldometers COVID-19 home page

As I mentioned, the data is refreshed on a frequent basis, so these websites are quite reliable for up-to-date information.

Fetching the Pandemic Data Programmatically

In this section, you will learn how to fetch both datasets (Johns Hopkins and Worldometers) using Python programs. To do that, you need to install the covid library for Python. The library’s home page is located at https://ahmednafies.github.io/covid/, and the PyPI page is https://pypi.org/project/covid/. Create a new notebook for this chapter using Jupyter Notebook. You can install the library with the following command in the notebook:
!pip3 install covid
You can import the library to a notebook or a Python script/program as follows:
from covid import Covid
You can create an object to fetch the data from an online source. By default, the data source is Johns Hopkins:
covid = Covid()

Note that due to high traffic, sometimes the servers are unresponsive. I experienced this multiple times.
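Because the servers are sometimes unresponsive, it can help to wrap the fetch in a small retry loop. The following is a minimal sketch, not part of the covid library: the helper name fetch_with_retry and its parameters are hypothetical, and it catches a broad Exception because the exact error raised by an unresponsive server is an assumption. A stand-in function simulates the failing server so that the sketch runs offline.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=2.0):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper: not part of the covid library.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # assumption: the exact error type is unknown
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # wait before trying again
    raise RuntimeError(f"no response after {attempts} attempts") from last_error


# Stand-in for a server that fails twice and then answers.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server unresponsive")
    return "data"


result = fetch_with_retry(flaky_fetch, attempts=5, delay=0)
print(result, "after", calls["count"], "calls")
```

In the notebook, you could pass a callable such as lambda: Covid().get_data() in place of the stand-in function.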

You can explicitly mention the data source as follows:
covid = Covid(source="john_hopkins")
You can specify Worldometers explicitly as follows:
covid = Covid(source="worldometers")
You can see the source of the data as follows:
covid.source
Based on the data source, this returns a relevant string, as shown here:
'john_hopkins'
You can get status by country name as follows:
covid.get_status_by_country_name("italy")
This returns a dictionary, as follows:
{'id': '86',
 'country': 'Italy',
 'confirmed': 4188190,
 'active': 283744,
 'deaths': 125153,
 'recovered': 3779293,
 'latitude': 41.8719,
 'longitude': 12.5674,
 'last_update': 1621758045000}
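The returned dictionary can be used directly in calculations. The sketch below copies the dictionary shown above, converts last_update to a readable date, and computes the share of confirmed cases that ended in death. It assumes that last_update is a Unix timestamp in milliseconds, which the magnitude of the value suggests but which is not confirmed here.

```python
from datetime import datetime, timezone

# Dictionary copied from the output shown above.
italy = {
    "id": "86",
    "country": "Italy",
    "confirmed": 4188190,
    "active": 283744,
    "deaths": 125153,
    "recovered": 3779293,
    "latitude": 41.8719,
    "longitude": 12.5674,
    "last_update": 1621758045000,
}

# Assumption: last_update is a Unix timestamp in milliseconds.
updated = datetime.fromtimestamp(italy["last_update"] / 1000, tz=timezone.utc)

# Deaths as a percentage of confirmed cases.
death_share = 100 * italy["deaths"] / italy["confirmed"]

print(updated.isoformat())
print(round(death_share, 2))
```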
You can also fetch the status by country ID. Only the Johns Hopkins dataset has this column, so the call returns an error for Worldometers.
# Only valid for Johns Hopkins
covid.get_status_by_country_id(115)
The output is similar to the earlier example, as shown here:
{'id': '115',
 'country': 'Mexico',
 'confirmed': 2395330,
 'active': 261043,
 'deaths': 221597,
 'recovered': 1912690,
 'latitude': 23.6345,
 'longitude': -102.5528,
 'last_update': 1621758045000}
You can also fetch the list of countries as follows:
covid.list_countries()
Here is part of the output:
[{'id': '179', 'name': 'US'},
 {'id': '80', 'name': 'India'},
 {'id': '24', 'name': 'Brazil'},
 {'id': '63', 'name': 'France'},
 {'id': '178', 'name': 'Turkey'},
 {'id': '143', 'name': 'Russia'},
 {'id': '183', 'name': 'United Kingdom'},
....
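Note that the IDs in this list are strings, whereas the earlier example passed an integer to get_status_by_country_id(). A small lookup table makes it easy to go from a country name to its numeric ID. This is a sketch that uses the first entries of the list shown above; the name id_by_name is chosen here for illustration.

```python
# First entries of the list shown above.
countries = [
    {"id": "179", "name": "US"},
    {"id": "80", "name": "India"},
    {"id": "24", "name": "Brazil"},
]

# Map each country name to its ID, converted from string to integer.
id_by_name = {entry["name"]: int(entry["id"]) for entry in countries}

print(id_by_name["India"])
```

With the full list from covid.list_countries(), the result of a lookup such as id_by_name["India"] could then be passed to get_status_by_country_id().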

You will continue using the Johns Hopkins dataset throughout the chapter.

You can get active cases as follows:
covid.get_total_active_cases()
The output is as follows:
27292520
You can get the total confirmed cases as follows:
covid.get_total_confirmed_cases()
The output is as follows:
166723247
You can get the total recovered cases as follows:
covid.get_total_recovered()
The output is as follows:
103133392
You can get total deaths as follows:
covid.get_total_deaths()
The output is as follows:
3454602
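It is worth checking how these four totals relate to each other. The sketch below uses the values copied from the outputs above and shows that active, recovered, and deaths do not add up to the confirmed total. One plausible explanation, stated here as an assumption, is that some countries do not report active or recovered cases; the record for the US in the get_data() output in this section has None in both fields.

```python
# Totals copied from the outputs shown above.
active = 27292520
confirmed = 166723247
recovered = 103133392
deaths = 3454602

# Cases accounted for by the three outcome categories.
accounted_for = active + recovered + deaths

# Confirmed cases that none of the three categories covers.
gap = confirmed - accounted_for

# Deaths as a percentage of confirmed cases.
death_share = 100 * deaths / confirmed

print(accounted_for, gap, round(death_share, 2))
```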
You can fetch all the data with the function call covid.get_data(). This returns a list of dictionaries where every dictionary holds the data of one country. The following is part of the output:
[{'id': '179',
  'country': 'US',
  'confirmed': 33104963,
  'active': None,
  'deaths': 589703,
  'recovered': None,
  'latitude': 40.0,
  'longitude': -100.0,
  'last_update': 1621758045000},
 {'id': '80',
  'country': 'India',
  'confirmed': 26530132,
  'active': 2805399,
  'deaths': 299266,
  'recovered': 23425467,
  'latitude': 20.593684,
  'longitude': 78.96288,
  'last_update': 1621758045000},
......
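Some records contain None where a value was not reported, as the US record above shows. Before plotting, it is useful to know which countries have such gaps. The following sketch works on the two records copied from the output above; the variable names are illustrative.

```python
# Two records copied from the output shown above.
records = [
    {"id": "179", "country": "US", "confirmed": 33104963, "active": None,
     "deaths": 589703, "recovered": None, "latitude": 40.0,
     "longitude": -100.0, "last_update": 1621758045000},
    {"id": "80", "country": "India", "confirmed": 26530132, "active": 2805399,
     "deaths": 299266, "recovered": 23425467, "latitude": 20.593684,
     "longitude": 78.96288, "last_update": 1621758045000},
]

# For every country, list the fields whose value is missing (None).
missing = {
    record["country"]: [key for key, value in record.items() if value is None]
    for record in records
}

# Keep only the countries that have at least one missing field.
incomplete = {country: keys for country, keys in missing.items() if keys}

print(incomplete)
```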

Preparing the Data for Visualization

You have to prepare the fetched data for visualization. To do that, convert the list of dictionaries into a Pandas dataframe. It can be done as follows:
import pandas as pd
df = pd.DataFrame(covid.get_data())
print(df)
Figure 18-3 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig3_HTML.jpg
Figure 18-3

Pandas dataframe for COVID-19 data

You can sort it as follows:
# The name sorted_df avoids shadowing Python's built-in sorted() function
sorted_df = df.sort_values(by=['confirmed'], ascending=False)
Then you have to exclude the data for the world and continents so only the data for the individual countries remains.
excluded = sorted_df[~sorted_df.country.isin(['Europe', 'Asia',
                                              'South America',
                                              'World', 'Africa',
                                              'North America'])]
Let’s find out the top ten records.
top10 = excluded.head(10)
print(top10)
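The sort, exclude, and head steps are easier to follow on a tiny dataframe. The sketch below uses made-up numbers for illustration only; it is not the COVID-19 data. The expression ~df["country"].isin(...) builds a Boolean mask that is True for every row whose country is not in the list, and nlargest() is a Pandas method that combines the sort and the head() call.

```python
import pandas as pd

# Made-up numbers for illustration only; 'World' is an aggregate row.
df = pd.DataFrame({
    "country": ["A", "World", "B", "C"],
    "confirmed": [50, 600, 300, 250],
})

# Drop the aggregate rows so that only individual countries remain.
aggregates = ["World"]
countries_only = df[~df["country"].isin(aggregates)]

# Sort in descending order and keep the first two rows.
top2 = countries_only.sort_values(by="confirmed", ascending=False).head(2)
print(top2)

# nlargest() combines the sort and the head() in one call.
same_top2 = countries_only.nlargest(2, "confirmed")
print(same_top2)
```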
You can then assign the columns to the individual variables as follows:
x = top10.country
y1 = top10.confirmed
y2 = top10.active
y3 = top10.deaths
y4 = top10.recovered

Creating Visualizations with Matplotlib and Seaborn

Let’s visualize the data with Matplotlib and Seaborn. First import all the needed libraries, as shown here:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
A simple line plot can be obtained as follows:
plt.plot(x, y1)
plt.xticks(rotation=90)
plt.show()
Figure 18-4 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig4_HTML.jpg
Figure 18-4

Line plot with Matplotlib

You can add a title to this plot with plt.title(). You can also draw the same plot with the Seaborn library. The following is an example of a line plot with Seaborn:
sns.set_theme(style='whitegrid')
sns.lineplot(x=x, y=y1)
plt.xticks(rotation=90)
plt.show()
This code example uses the function set_theme(). It sets the theme for all the Matplotlib and Seaborn visualizations in the notebook. You can pass one of the strings 'darkgrid', 'whitegrid', 'dark', 'white', or 'ticks' as the style argument of this function. Figure 18-5 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig5_HTML.jpg
Figure 18-5

Line plot with Seaborn
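Neither line plot so far carries a title or axis labels. The following sketch shows one way to add them with Matplotlib's object-oriented interface. It uses four confirmed-case counts copied from the outputs shown earlier in this chapter rather than the full top-ten dataframe, and the title and label texts are only suggestions.

```python
import matplotlib.pyplot as plt

# Confirmed counts copied from outputs shown earlier in this chapter.
countries = ["US", "India", "Italy", "Mexico"]
confirmed = [33104963, 26530132, 4188190, 2395330]

fig, ax = plt.subplots()
ax.plot(countries, confirmed, marker="o")

# Describe the plot so that it can be read without the surrounding text.
ax.set_title("Confirmed COVID-19 Cases")
ax.set_xlabel("Country")
ax.set_ylabel("Confirmed cases")

# Rotate the country names, as in the earlier examples.
ax.tick_params(axis="x", rotation=90)
fig.tight_layout()
# In a notebook the figure is displayed at the end of the cell;
# in a script, call plt.show() as in the other examples.
```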

You can create a simple bar plot with Matplotlib as follows:
plt.bar(x, y1)
plt.xticks(rotation=45)
plt.show()
Figure 18-6 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig6_HTML.jpg
Figure 18-6

Bar plot with Matplotlib

The same visualization can be prepared with Seaborn, which produces a much better bar plot aesthetically.
sns.barplot(x=x, y=y1)
plt.xticks(rotation=45)
plt.show()
Figure 18-7 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig7_HTML.jpg
Figure 18-7

Bar plot with Seaborn

You can even change the color palette as follows:
sns.barplot(x=x, y=y1,
            palette="Blues_d")
plt.xticks(rotation=45)
plt.show()
Figure 18-8 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig8_HTML.jpg
Figure 18-8

Bar plot using Seaborn with custom palette

You can create a multiline graph as follows:
labels = ['Confirmed', 'Active', 'Deaths', 'Recovered']
plt.plot(x, y1, x, y2, x, y3, x, y4)
plt.legend(labels, loc='upper right')
plt.xticks(rotation=90)
plt.show()
Figure 18-9 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig9_HTML.jpg
Figure 18-9

Multiline graph

You can use the Seaborn library to create the same graph as follows:
sns.lineplot(x=x, y=y1)
sns.lineplot(x=x, y=y2)
sns.lineplot(x=x, y=y3)
sns.lineplot(x=x, y=y4)
plt.legend(labels, loc='upper right')
plt.xticks(rotation=45)
plt.show()
Figure 18-10 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig10_HTML.jpg
Figure 18-10

Multiline graph with Seaborn

You will now see how to create a multiple-bar graph with Matplotlib as follows:
df2 = pd.DataFrame([y1, y2, y3, y4])
df2.plot.bar()
plt.legend(x, loc='best')
plt.xticks(rotation=45)
plt.show()
Figure 18-11 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig11_HTML.jpg
Figure 18-11

Multiple-bar graph
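To see why the bars are grouped by metric and the legend lists the countries, it helps to look at the shape of the dataframe that is being plotted. When a dataframe is built from a list of Series, each Series becomes one row, and plot.bar() draws one group of bars per row. The sketch below uses a few values copied from outputs shown earlier in this chapter; the names by_metric and by_country are illustrative.

```python
import pandas as pd

# Values copied from outputs shown earlier in this chapter.
top = pd.DataFrame({
    "country": ["India", "Italy", "Mexico"],
    "confirmed": [26530132, 4188190, 2395330],
    "deaths": [299266, 125153, 221597],
})

# Same construction as above: each Series becomes one row.
by_metric = pd.DataFrame([top.confirmed, top.deaths])
print(by_metric)

# Alternative layout: one row per country, one column per metric.
by_country = top.set_index("country")[["confirmed", "deaths"]]
print(by_country)
```

Calling by_country.plot.bar() would group the bars by country instead, with one bar per metric in each group.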

You can even show this in a horizontal fashion as follows:
df2.plot.barh()
plt.legend(x, loc='best')
plt.xticks(rotation=45)
plt.show()
Figure 18-12 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig12_HTML.jpg
Figure 18-12

Multiple-bar horizontal graph

You can use Seaborn to create a scatter plot as follows:
sns.scatterplot(x=x, y=y1)
sns.scatterplot(x=x, y=y2)
sns.scatterplot(x=x, y=y3)
sns.scatterplot(x=x, y=y4)
plt.legend(labels, loc='best')
plt.xticks(rotation=45)
plt.show()
Figure 18-13 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig13_HTML.jpg
Figure 18-13

Scatter plot with Seaborn

You can even create an area plot with Matplotlib with the following code:
df2.plot.area()
plt.legend(x, loc='best')
plt.xticks(rotation=45)
plt.show()
Figure 18-14 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig14_HTML.jpg
Figure 18-14

Stacked area plot

You can create an unstacked and transparent area plot for the data as follows:
df2.plot.area(stacked=False)
plt.legend(x, loc='best')
plt.xticks(rotation=45)
plt.show()
Figure 18-15 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig15_HTML.jpg
Figure 18-15

Unstacked area plot

You can create a pie chart as follows:
plt.pie(y3, labels=x)
plt.title('Death Toll')
plt.show()
Figure 18-16 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig16_HTML.jpg
Figure 18-16

Pie chart
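A pie chart is easier to read when each wedge shows its percentage. The autopct parameter of pie() takes a format string for that purpose. The following sketch uses four death counts copied from the outputs shown earlier in this chapter instead of the full top-ten data, so the percentages refer to these four countries only.

```python
import matplotlib.pyplot as plt

# Death counts copied from outputs shown earlier in this chapter.
countries = ["US", "India", "Mexico", "Italy"]
deaths = [589703, 299266, 221597, 125153]

fig, ax = plt.subplots()

# autopct formats each wedge's share with one decimal place.
ax.pie(deaths, labels=countries, autopct="%1.1f%%")
ax.set_title("Death Toll")

# The axes now hold one text per label and one per percentage.
for text in ax.texts:
    print(text.get_text())
```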

You can also create a KDE plot with a rug plot, but with the data that we’re using for this example, that may not make a lot of sense.
sns.set_theme(style="ticks")
sns.kdeplot(x=y1)
sns.rugplot(x=y1)
plt.show()
Figure 18-17 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig17_HTML.jpg
Figure 18-17

KDE plot

Creating Visualizations of Animal Disease Data

You can create visualizations for other real-life datasets too. Let’s create visualizations for animal disease data. Let’s first read it from an online repository.
df = pd.read_csv("https://github.com/Kesterchia/Global-animal-diseases/blob/main/Data/Outbreak_240817.csv?raw=True")
Let’s see the top five records.
df.head()
Figure 18-18 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig18_HTML.jpg
Figure 18-18

Animal disease data

Let’s get information about the columns as follows:
df.info()
The output is as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17008 entries, 0 to 17007
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Id                  17008 non-null  int64
 1   source              17008 non-null  object
 2   latitude            17008 non-null  float64
 3   longitude           17008 non-null  float64
 4   region              17008 non-null  object
 5   country             17008 non-null  object
 6   admin1              17008 non-null  object
 7   localityName        17008 non-null  object
 8   localityQuality     17008 non-null  object
 9   observationDate     16506 non-null  object
 10  reportingDate       17008 non-null  object
 11  status              17008 non-null  object
 12  disease             17008 non-null  object
 13  serotypes           10067 non-null  object
 14  speciesDescription  15360 non-null  object
 15  sumAtRisk           9757 non-null   float64
 16  sumCases            14535 non-null  float64
 17  sumDeaths           14168 non-null  float64
 18  sumDestroyed        13005 non-null  float64
 19  sumSlaughtered      12235 non-null  float64
 20  humansGenderDesc    360 non-null    object
 21  humansAge           1068 non-null   float64
 22  humansAffected      1417 non-null   float64
 23  humansDeaths        451 non-null    float64
dtypes: float64(10), int64(1), object(13)
memory usage: 3.1+ MB
Let’s perform a “group by” operation on the column country and compute the sum of total cases, as shown here:
df2 = df.groupby('country')[['sumCases']].sum()
Now let’s sort and select the top ten cases.
df3 = df2.sort_values(by='sumCases', ascending = False).head(10)
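The df.info() output shows that sumCases has fewer non-null entries than the dataframe has rows, so it matters how the group-wise sum treats missing values. The sketch below uses made-up records for illustration only. It shows that sum() skips missing values and that a country with only missing values ends up with a total of zero.

```python
import pandas as pd

# Made-up outbreak records for illustration; None marks a missing count.
outbreaks = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "sumCases": [10.0, None, 5.0, 7.0, None],
})

# Number of missing case counts.
missing_count = int(outbreaks["sumCases"].isna().sum())
print(missing_count)

# Group by country and add up the cases; missing values are skipped.
totals = outbreaks.groupby("country")["sumCases"].sum()
print(totals)

# Sort in descending order and keep the two largest totals.
top = totals.sort_values(ascending=False).head(2)
print(top)
```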
Let’s plot a bar graph, using the following code:
df3.plot.bar()
plt.xticks(rotation=90)
plt.show()
Figure 18-19 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig19_HTML.jpg
Figure 18-19

Bar chart

You can convert the index to a column as follows:
df3.reset_index(level=0, inplace=True)
df3
The output is as follows:
      country                      sumCases
0     Italy                        846756.0
1     Iraq                         590049.0
2     Bulgaria                     453353.0
3     China                        370357.0
4     Taiwan (Province of China)   296268.0
5     Egypt                        284449.0
6     Iran (Islamic Republic of)   225798.0
7     Nigeria                      203688.0
8     Germany                      133425.0
9     Republic of Korea            117018.0
Let’s make a pie chart as follows:
plt.pie(df3['sumCases'],
        labels=df3['country'])
plt.title('Total Cases')
plt.show()
Figure 18-20 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig20_HTML.jpg
Figure 18-20

Pie chart

You can create a more aesthetically pleasing bar chart with Seaborn as follows:
sns.barplot(x='country',
            y='sumCases',
            data=df3)
plt.xticks(rotation=90)
plt.show()
Figure 18-21 shows the output.
../images/515442_1_En_18_Chapter/515442_1_En_18_Fig21_HTML.jpg
Figure 18-21

Bar chart with Seaborn

You’ve just learned to visualize real-life animal disease data.

Summary

In this chapter, you explored more functionality of the Seaborn data visualization library, which is part of the scientific Python ecosystem. You also learned how to import real-life data into Jupyter Notebook. You used the Matplotlib and Seaborn libraries to visualize the data.

As you know, this is the last chapter in the book. While we explored Matplotlib in great detail, we have just scratched the surface of the vast body of knowledge and programming APIs. You now have the knowledge to further explore Matplotlib and other data visualization libraries on your own. Python has many data visualization libraries for scientific data. Examples include Plotly, Altair, and Cartopy. Armed with your knowledge of the basics of data visualization, have fun continuing your journey further into data science and visualization!
