In the previous chapter, you learned how to visualize data stored in the Pandas series and dataframe.
Across the earlier chapters of this book, you studied the data visualization library Matplotlib extensively, along with two other important data science libraries, NumPy and Pandas. In this chapter, you will take a break from Matplotlib and learn how to use a related data visualization library called Seaborn. The following are the topics you will learn about in this chapter:
What is Seaborn?
Plotting statistical relationships
Plotting lines
Visualizing the distribution of data
After reading this chapter, you will be comfortable using the Seaborn library and will be able to create great visualizations of datasets.
What Is Seaborn?
You have learned how to use the Matplotlib library for data visualization. Matplotlib is not the only data visualization library in Python; there are many others. The scientific data visualization libraries support the data structures of NumPy and Pandas. One such library is Seaborn (https://seaborn.pydata.org/index.html). Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive statistical graphics. It has built-in support for the series and dataframe data structures in Pandas and for ndarrays in NumPy.
Let’s create a new notebook for the demonstrations in this chapter. Now, let’s install Seaborn with the following command:
!pip3 install seaborn
You can import the library to your notebook or a Python script with the following statement:
import seaborn as sns
You know that the Seaborn library supports Pandas dataframes. Seaborn also gives you access to a collection of sample datasets that are already populated with data. They are downloaded from the project’s online data repository on request (so an Internet connection is needed) and returned as dataframes, which makes them convenient for our demonstrations. Let’s see how to retrieve them. The following command returns the list of all the available sample datasets:
sns.get_dataset_names()
The following is the output:
['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'exercise',
'flights',
'fmri',
'gammas',
'geyser',
'iris',
'mpg',
'penguins',
'planets',
'tips',
'titanic']
You can load these dataframes into Python variables as follows:
iris = sns.load_dataset('iris')
Let’s see the data stored in the iris dataset with the following statement:
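One way to inspect the dataframe is sketched here with the standard Pandas head() method and shape attribute. The choice of these two views is an assumption for illustration; in a notebook, a bare iris on the last line of a cell renders the whole table.

```python
import seaborn as sns

# load_dataset() downloads the CSV on first use, so this needs network access
iris = sns.load_dataset('iris')

# Print the first five rows and the (rows, columns) dimensions
print(iris.head())
print(iris.shape)
```

The dataset has four numeric measurement columns and one categorical column, species, which is useful later for grouping the points in a plot.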
Plotting Statistical Relationships
You can plot the statistical relationship between two variables with various functions in Seaborn. The general-purpose function for this is relplot(), which can draw more than one type of plot. By default, relplot() draws a scatter plot. Here is an example:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
sns.relplot(x='sepal_length',
y='sepal_width',
data=iris)
plt.grid(True)
plt.show()
This produces the scatter plot shown in Figure 17-2.
You can explicitly specify the type of plot as follows:
sns.relplot(x='sepal_length', y='sepal_width',
data=iris, kind='scatter')
plt.grid(True)
plt.show()
The function relplot() is a generic function to which you can pass the kind argument to specify the type of plot. You can also create a scatter plot directly with the function scatterplot(). For example, the following code produces the same result as shown in Figure 17-2:
sns.scatterplot(x='sepal_length',
y='sepal_width',
data=iris)
plt.grid(True)
plt.show()
You can feed some other columns of the dataset to the plotting function as follows:
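A minimal sketch of the idea follows, using the hue and size parameters of relplot(). So that it runs without a download, it assumes a small synthetic stand-in dataframe named demo with iris-like column names; the values are random, not real measurements. With the real data, you would pass data=iris instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in with iris-like column names (random values, not real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'sepal_length': rng.normal(5.8, 0.8, 90),
    'sepal_width': rng.normal(3.0, 0.4, 90),
    'petal_length': rng.uniform(1.0, 7.0, 90),
    'species': np.repeat(['setosa', 'versicolor', 'virginica'], 30),
})

# hue colors the points by a categorical column;
# size scales the markers by a numeric column
g = sns.relplot(x='sepal_length', y='sepal_width',
                hue='species', size='petal_length',
                data=demo)
plt.grid(True)
plt.show()
```

Each extra column adds one more visual dimension to the same scatter plot, and Seaborn builds the legend for those dimensions automatically.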
Plotting Lines
You can also show continuous data such as time-series data along a line. Time-series data has timestamps in at least one column or in its index. A good example of a time series is a table of daily temperature records. Let’s create a time-series dataframe to demonstrate line plots.
df = pd.DataFrame(np.random.randn(100, 4),
index=pd.date_range("1/1/2020",
periods=100),
columns=list("ABCD"))
df = df.cumsum()
You can use the function relplot() to draw the line as follows:
You can also produce the output shown in Figure 17-7 with the following code:
sns.lineplot(x=df.index,
y='A', data=df)
plt.xticks(rotation=45)
plt.show()
In the next section, you will learn how to visualize the distribution of data.
Visualizing the Distribution of Data
One of the most prominent examples of visualizing the distribution of data is a frequency table or a frequency distribution table. You can create buckets of value ranges that the data can have (the domain), and then you can list the number of items that satisfy the criteria for the bucket. You can also vary the bucket size, with the smallest size being 1.
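As a small illustration of the bucket idea, the following sketch builds a frequency table with the Pandas functions cut() and value_counts(). The dummy data and the bucket edges are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.integers(0, 100, size=200))  # dummy integers in 0..99

# Five equal-width buckets: [0, 20), [20, 40), ..., [80, 100)
bins = [0, 20, 40, 60, 80, 100]

# cut() assigns each value to a bucket; value_counts() counts items per bucket
freq = pd.cut(values, bins=bins, right=False).value_counts().sort_index()
print(freq)
```

Changing the list of edges in bins changes the bucket size, which in turn changes how coarse or fine the frequency table is.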
You can visually show the information of a frequency distribution using bars and lines. If you use bars, then it is known as a histogram. You can use the function displot() to visualize the frequency data. Let’s start with dummy univariate data.
Up to now, we have used a single variable to show the plot. When you use two variables for plotting, it is known as a bivariate plot. Here is a simple example:
The output combines the KDE and rug visualizations, as shown in Figure 17-19.
Based on the columns in the dataframe, you can create individual visualizations arranged in rows or columns. Let’s create one plot for each value of the size column in the tips dataset (this column records the number of people in each party) as follows:
tips = sns.load_dataset('tips')  # load the sample dataset if it is not already in memory
sns.displot(x='total_bill', y='tip',
            data=tips, rug=True,
            kind='kde', col='size')
plt.show()
In the preceding example, we enable the rug plot feature, and a separate plot is generated for each distinct value of the size column, that is, for each party size. Figure 17-20 shows the output.
You can also arrange the individual graphs in rows as follows:
You’ve just learned how to visualize the distribution of data.
Summary
This chapter contained lots of demonstrations. You explored the Seaborn data visualization library of Python in detail. Seaborn is a vast library, and we have just scratched its surface in this chapter. You can refer to the home page of the Seaborn project at https://seaborn.pydata.org/index.html for the API documentation, tutorials, and an example gallery.
In the next and final chapter of this book, you will learn how to visualize the real-life data of the currently ongoing COVID-19 pandemic with the Matplotlib and Seaborn data visualization libraries.