© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Pajankar, Hands-on Matplotlib, https://doi.org/10.1007/978-1-4842-7410-1_17

17. Introduction to Data Visualization with Seaborn

Ashwin Pajankar, Nashik, Maharashtra, India

In the previous chapter, you learned how to visualize data stored in Pandas series and dataframes. Throughout the earlier chapters of this book, you studied the data visualization library Matplotlib extensively, along with two other important data science libraries, NumPy and Pandas. In this chapter, you will take a break from Matplotlib and learn how to use a related data visualization library called Seaborn. The following are the topics you will learn about in this chapter:
  • What is Seaborn?

  • Plotting statistical relationships

  • Plotting lines

  • Visualizing the distribution of data

After reading this chapter, you will be comfortable using the Seaborn library and will be able to create great visualizations of datasets.

What Is Seaborn?

You have learned how to use the Matplotlib library for data visualization, but Matplotlib is not the only data visualization library in Python. Many other libraries can visualize data, and the scientific ones support the data structures of NumPy and Pandas. One such library for scientific data visualization in Python is Seaborn (https://seaborn.pydata.org/index.html). Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive statistical graphics. It has built-in support for the series and dataframe data structures in Pandas and for ndarrays in NumPy.

Let’s create a new notebook for the demonstrations in this chapter. Now, let’s install Seaborn with the following command:
!pip3 install seaborn
You can import the library to your notebook or a Python script with the following statement:
import seaborn as sns
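Some functions used later in this chapter, such as displot(), were added in Seaborn 0.11, so it can be worth checking which version is installed. The following is a minimal sketch of such a check; the variable names are illustrative only.

```python
import seaborn as sns

# Keep only the leading "major.minor" part of a version such as "0.13.2".
major, minor = (int(part) for part in sns.__version__.split(".")[:2])

# displot() was introduced in Seaborn 0.11.
has_displot = (major, minor) >= (0, 11)
print(sns.__version__, "displot available:", has_displot)
```

If the check reports False, upgrading with pip3 install --upgrade seaborn should make the later examples work.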
You know that the Seaborn library supports Pandas dataframes. The Seaborn project also maintains a collection of sample datasets that you can load as dataframes, so we can use them for our demonstrations. These datasets are fetched from an online repository, so you need an Internet connection to list and download them. Let’s see how to retrieve them. The following command returns the list of all the available sample datasets:
sns.get_dataset_names()
The following is the output:
['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'tips',
 'titanic']
You can load these dataframes into Python variables as follows:
iris = sns.load_dataset('iris')
Let’s see the data stored in the iris dataset with the following statement:
iris
Figure 17-1 shows the output.
Figure 17-1. The iris dataset

Plotting Statistical Relationships

You can plot the statistical relationship between two variables with various functions in Seaborn. The general plotting function to do this is relplot(). You can plot various types of data with this function. By default, the relplot() function plots a scatter plot. Here is an example:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
sns.relplot(x='sepal_length',
            y='sepal_width',
            data=iris)
plt.grid('on')
plt.show()
This produces the scatter plot shown in Figure 17-2.
Figure 17-2. The scatter plot

You can explicitly specify the type of plot as follows:
sns.relplot(x='sepal_length', y='sepal_width',
            data=iris, kind='scatter')
plt.grid('on')
plt.show()
The function relplot() is a generic function to which you can pass the kind argument to specify the type of plot. You can also create a scatter plot with the function scatterplot(). For example, the following code creates the same result as shown in Figure 17-2:
sns.scatterplot(x='sepal_length',
                y='sepal_width',
                data=iris)
plt.grid('on')
plt.show()
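Although the two calls draw the same points, they return different objects: relplot() is a figure-level function that creates its own figure and returns a FacetGrid, while scatterplot() is an axes-level function that draws on a Matplotlib Axes and returns it. The following sketch illustrates the difference. It uses a small synthetic dataframe (not the iris data) so that it runs without downloading anything, and the column names length and width are made up for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for two numeric columns (hypothetical data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"length": rng.normal(5.8, 0.8, 50),
                     "width": rng.normal(3.0, 0.4, 50)})

# Figure-level: creates a new figure and returns a FacetGrid.
grid = sns.relplot(x="length", y="width", data=demo, kind="scatter")
grid.ax.grid(True)  # grid.ax is the single Axes inside the FacetGrid

# Axes-level: draws on the Axes you pass in and returns that same Axes.
fig, ax = plt.subplots()
returned = sns.scatterplot(x="length", y="width", data=demo, ax=ax)
ax.grid(True)

print(type(grid).__name__, type(returned).__name__)
```

The axes-level form is convenient when you want to place a Seaborn plot inside a Matplotlib figure that you lay out yourself, for example with plt.subplots().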
You can feed some other columns of the dataset to the plotting function as follows:
sns.relplot(x='petal_length',
            y='petal_width',
            data=iris)
plt.grid('on')
plt.show()
Figure 17-3 shows the output.
Figure 17-3. Another example of a scatter plot

You can also write this with the scatterplot() function as follows:
sns.scatterplot(x='petal_length',
                y='petal_width',
                data=iris)
plt.grid('on')
plt.show()
You can customize the plot and show an additional column with color coding as follows:
sns.relplot(x='sepal_length',
            y='sepal_width',
            hue='species',
            data=iris)
plt.grid('on')
plt.show()
Figure 17-4 shows the output.
Figure 17-4. Scatter plot with colors

You get the same result as shown in Figure 17-4 with the following code:
sns.scatterplot(x='sepal_length',
                y='sepal_width',
                hue='species',
                data=iris)
plt.grid('on')
plt.show()
You can also assign the styles to the scatter plot data points (markers) as follows:
sns.relplot(x='sepal_length', y='sepal_width',
            hue='petal_length', style='species',
            data=iris)
plt.grid('on')
plt.show()
You can see the output in Figure 17-5.
Figure 17-5. Scatter plot with colors and custom styles

The following code produces the same output as shown in Figure 17-5:
sns.scatterplot(x='sepal_length', y='sepal_width',
                hue='petal_length', style='species',
                data=iris)
plt.grid('on')
plt.show()
You can also adjust the sizes of the markers as follows:
sns.relplot(x='sepal_length', y='sepal_width',
            size='petal_length', style='species',
            hue='species', data=iris)
plt.grid('on')
plt.show()
Figure 17-6 shows the output.
Figure 17-6. Scatter plot with colors and custom styles and marker sizes

The following code produces the same result as shown in Figure 17-6:
sns.scatterplot(x='sepal_length', y='sepal_width',
                size='petal_length', style='species',
                hue='species', data=iris)
plt.grid('on')
plt.show()
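The hue, style, and size arguments each map one more column onto a visual property of the markers. The following sketch puts the three together on a synthetic dataframe (the columns x, y, group, and weight are invented for this illustration and are not part of the iris data). The sizes argument, which sets the smallest and largest marker area, is optional.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: one categorical and one numeric extra column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
    "group": np.repeat(["a", "b", "c"], 20),   # categorical column
    "weight": rng.uniform(1, 7, size=60),      # numeric column
})

ax = sns.scatterplot(x="x", y="y",
                     hue="group",      # color encodes the category
                     style="group",    # marker shape encodes the same category
                     size="weight",    # marker area encodes a numeric value
                     sizes=(20, 200),  # smallest and largest marker area
                     data=demo)
ax.grid(True)
```

A categorical column passed to hue gets one distinct color per category, whereas a numeric column (such as petal_length in Figure 17-5) is mapped to a gradual color scale.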

Plotting Lines

You can also show continuous data such as time-series data along a line. Time-series data has timestamps in at least one column or in its index. A good example of a time series is a table of daily temperature records. Let’s create a time-series dataframe to demonstrate the line plots.
df = pd.DataFrame(np.random.randn(100, 4),
                  index=pd.date_range("1/1/2020",
                                      periods=100),
                  columns=list("ABCD"))
df = df.cumsum()
You can use the function relplot() to draw the line as follows:
sns.relplot(x=df.index, y='A', kind="line", data=df)
plt.xticks(rotation=45)
plt.show()
Figure 17-7 shows the output.
Figure 17-7. Line plot of time-series data

You can also produce the output shown in Figure 17-7 with the following code:
sns.lineplot(x=df.index,
             y='A', data=df)
plt.xticks(rotation=45)
plt.show()

In the next section, you will learn how to visualize the distribution of data.

Visualizing the Distribution of Data

One of the most common ways to summarize the distribution of data is a frequency table, also called a frequency distribution table. You divide the range of values that the data can take (the domain) into buckets, and then you list the number of items that fall into each bucket. You can also vary the bucket size; for integer-valued data, the smallest meaningful bucket size is 1.
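To make the idea of a frequency table concrete, here is a small sketch that builds one with Pandas. The data is made up (200 random integer scores from 0 to 99), and the bucket width of 20 is an arbitrary choice for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 integer scores in the range 0..99.
rng = np.random.default_rng(42)
values = rng.integers(0, 100, size=200)

# Bucket edges 0, 20, 40, 60, 80, 100; each bucket is [left, right).
edges = np.arange(0, 101, 20)
buckets = pd.cut(values, bins=edges, right=False)

# Count how many values fall into each bucket, keeping the bucket order.
freq_table = pd.Series(buckets).value_counts(sort=False)
print(freq_table)
```

The counts always add up to the number of items, because every value falls into exactly one bucket. A histogram is simply this table drawn with bars.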

You can visually show the information of a frequency distribution using bars and lines. If you use bars, then it is known as a histogram. You can use the function displot() to visualize the frequency data. Let’s start with dummy univariate data.
x = np.random.randn(100)
sns.displot(x)
plt.show()
Figure 17-8 shows the output.
Figure 17-8. Histogram

You can also make it explicit that you need a histogram in the output as follows:
sns.displot(x, kind='hist')
plt.show()
A histogram is the default kind of graph. You can also show a Gaussian kernel density estimation (KDE) as follows:
sns.displot(x, kind='kde')
plt.show()
Figure 17-9 shows the output.
Figure 17-9. KDE graph
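To see what a Gaussian KDE computes, you can build one by hand with NumPy: place a small bell curve on every data point and average them. The following is only a sketch of the technique, not Seaborn's implementation. The bandwidth uses Scott's rule of thumb, which is an assumption of this sketch, so the curve drawn by Seaborn may differ slightly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100)
n = x.size

# Bandwidth from Scott's rule of thumb (an assumption of this sketch).
h = x.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid that extends a little beyond the data.
grid = np.linspace(x.min() - 5 * h, x.max() + 5 * h, 500)

# One Gaussian bump per data point, averaged over all points.
z = (grid[:, None] - x[None, :]) / h
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# A density should enclose an area of about 1.
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual data points more closely.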

You can visualize an empirical cumulative distribution function (eCDF) as follows:
sns.displot(x, kind='ecdf')
plt.show()
Figure 17-10 shows the output.
Figure 17-10. eCDF graph
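The eCDF at a value v is the fraction of observations that are less than or equal to v. The following NumPy sketch computes it directly for random data; the names xs, ecdf, and frac_below_zero are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100)

# Sort the data; the i-th smallest value has an eCDF height of i / n.
xs = np.sort(x)
ecdf = np.arange(1, xs.size + 1) / xs.size

# Fraction of observations that are <= 0, read off the sorted data.
frac_below_zero = np.searchsorted(xs, 0.0, side="right") / xs.size
print(frac_below_zero)
```

Plotting xs against ecdf as a step function gives the same kind of staircase curve that the eCDF plot shows, and it needs no choice of bin size or bandwidth.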

You can combine a histogram and a KDE as follows:
sns.displot(x, kind='hist', kde=True)
plt.show()
Figure 17-11 shows the output.
Figure 17-11. Histogram combined with KDE

Now let’s use some real-life data, as follows:
tips = sns.load_dataset("tips")
sns.displot(x='total_bill', data=tips, kind='hist')
plt.show()
Figure 17-12 shows the output.
Figure 17-12. Real-life data visualized as a histogram

You can customize the size of bins (or buckets) in the visualization as follows:
sns.displot(x='total_bill', data=tips,
            kind='hist', bins=30, kde=True)
plt.show()
Figure 17-13 shows the output.
Figure 17-13. Customized buckets in a histogram

You can color the plots by the values of another column with the hue argument. Here, the size column (the number of people in the party) is used:
sns.displot(x='total_bill', data=tips,
            kind='kde', hue='size')
plt.show()
Figure 17-14 shows the output.
Figure 17-14. Customized colors in a KDE plot

Up to now, we have used a single variable to show the plot. When you use two variables for plotting, it is known as a bivariate plot. Here is a simple example:
sns.displot(x='total_bill',
            y='tip', data=tips)
plt.show()
Figure 17-15 shows the output.
Figure 17-15. A simple bivariate histogram
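A bivariate histogram is a frequency table with two dimensions: the plane is divided into rectangular cells, and the color of each cell encodes how many points fall into it. The following sketch computes those counts with NumPy. The bill and tip arrays are synthetic stand-ins invented for this illustration, not the real tips data.

```python
import numpy as np

# Hypothetical data loosely shaped like bills and tips.
rng = np.random.default_rng(3)
bill = rng.gamma(shape=4.0, scale=5.0, size=244)
tip = 0.15 * bill + rng.normal(0.0, 1.0, size=244)

# counts[i, j] is the number of points in x-bin i and y-bin j.
counts, xedges, yedges = np.histogram2d(bill, tip, bins=10)

print(counts.shape)        # one cell per pair of bins
print(int(counts.sum()))   # every point lands in exactly one cell
```

With 10 bins per axis there are 11 edges per axis, and the cell counts add up to the number of observations.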

You can add color to this example as follows:
sns.displot(x='total_bill', y='tip',
            hue='size', data=tips)
plt.show()
Figure 17-16 shows the output.
Figure 17-16. A simple bivariate histogram with color

You can also customize the size of bins and add ticks on the x- and y-axes (known as a rug plot) as follows:
sns.displot(x='total_bill', y='tip',
            data=tips, rug=True,
            kind='hist', bins=30)
plt.show()
Figure 17-17 shows the output.
Figure 17-17. A simple bivariate histogram with custom bins and rug plot

A more interesting type of visualization is a bivariate KDE plot. It looks like a contour. The code is as follows:
sns.displot(x='total_bill', y='tip',
            data=tips, kind='kde')
plt.show()
Figure 17-18 shows the output.
Figure 17-18. A simple bivariate KDE plot

You can add a rug plot to the output as follows:
sns.displot(x='total_bill', y='tip',
            data=tips, rug=True,
            kind='kde')
plt.show()
The output combines the KDE and rug visualizations, as shown in Figure 17-19.
Figure 17-19. A simple bivariate KDE plot with a rug plot

Based on the columns in the dataframe, you can create individual visualizations arranged in rows or columns. Let’s create one plot for each value of the size column (the number of people in the party) as follows:
sns.displot(x='total_bill', y='tip',
            data=tips, rug=True,
            kind='kde', col='size')
plt.show()
In the previous example, we are enabling the rug plot feature, and a separate plot is generated for each value of the size column. Figure 17-20 shows the output.
Figure 17-20. A simple bivariate KDE plot with a rug plot arranged in columns

You can also arrange the individual graphs in rows as follows:
sns.displot(x='total_bill', y='tip',
            data=tips, rug=True,
            kind='kde', row='size')
plt.show()
Figure 17-21 shows the output.
Figure 17-21. A simple bivariate KDE plot with a rug plot arranged in rows

You’ve just learned how to visualize the distribution of data.

Summary

This chapter contained lots of demonstrations. You explored the Seaborn data visualization library of Python in detail. Seaborn is a vast library, and we have just scratched its surface in this chapter. You can refer to the home page of the Seaborn project at https://seaborn.pydata.org/index.html for the API documentation, tutorials, and an example gallery.

In the next and final chapter of this book, you will learn how to visualize the real-life data of the currently ongoing COVID-19 pandemic with the Matplotlib and Seaborn data visualization libraries.
