In the previous chapter, you learned how to read the data stored in various file formats into Python variables using NumPy, Pandas, and Matplotlib.
You should be comfortable working with data now. In this chapter, you will practice writing programs for another important and practical aspect of data science: dataset visualization. The chapter contains many short code snippets that demonstrate how to create visualizations of datasets. Let’s continue our data science journey with the following topics:
Simple plots
Bar graphs
Histograms
Box plots
Area plots
Scatter plots
Hexagonal bin plots
Pie charts
After this chapter, you will be able to create impressive visualizations of datasets with Pandas and Matplotlib.
Simple Plots
Let’s jump directly into the hands-on examples for data visualization. You will learn how to visualize simple plots first. I recommend you create a new notebook for the code examples in this chapter.
Let’s start with the magic command that enables inline plotting in the notebook, followed by the imports of all the required libraries, as follows:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Let’s create some data using the routine cumsum(), as follows:
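Here is a minimal sketch of one way to build such a dataset. It assumes that columns B and C are cumulative sums of standard normal samples and that column A is a simple counter from 0 to 99. The name df1 matches the later examples; the choice of random generator is an assumption.

```python
import numpy as np
import pandas as pd

# Assumption: B and C are random walks built with cumsum().
# No seed is set, so the values differ on every run.
df1 = pd.DataFrame(np.random.randn(100, 2), columns=['B', 'C']).cumsum()

# Assumption: A is a plain counter, used later as the x-axis.
df1['A'] = pd.Series(list(range(100)))
print(df1)
```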
The resultant dataset will have three columns, as follows:
B C A
0 -0.684779 -0.655677 0
1 -0.699163 -1.868611 1
2 -0.315527 -3.513103 2
3 -0.504069 -4.175940 3
4 0.998419 -4.385832 4
.. ... ... ..
95 1.149399 -1.445029 95
96 2.035029 -1.886731 96
97 0.938699 0.188980 97
98 2.449148 0.335828 98
99 2.204369 -1.304379 99
[100 rows x 3 columns]
Let’s use the routine plot() to visualize this data. By default, the dataframe’s plot() routine delegates the drawing to Pyplot’s plot(). Here’s an example:
plt.figure()
df1.plot(x='A', y='B')
plt.show()
This code is self-explanatory. We are passing strings that contain the names of columns as arguments for the x- and y-axes. It produces the output depicted in Figure 16-1.
You can use other columns in the visualization as well, as shown here:
plt.figure()
df1.plot(x='A', y='C')
plt.show()
Run this example to see the result. This is how you can use different combinations of columns to visualize data.
Bar Graphs
Let’s create a simple bar graph using the same dataset. Let’s pick a record from this dataframe as follows:
print(df1.iloc[4])
The following is the output:
B 0.998419
C -4.385832
A 4.000000
Name: 4, dtype: float64
Let’s draw a simple bar graph with this data using the routine bar(). The following is the code snippet for that:
plt.figure()
df1.iloc[4].plot.bar()
plt.axhline(0, color='k')
plt.show()
In this code example, we are using axhline() to draw a black horizontal line at y = 0, which marks the x-axis and makes the negative bar easy to tell apart from the positive ones. Figure 16-2 shows the output.
Let’s discuss a more complex example of a bar graph. Let’s create a new dataset as follows:
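A minimal sketch of such a dataset follows. It assumes ten rows of uniform random values in four columns; the name df2 matches the next example, while the column names a through d are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: ten rows, four columns, values drawn uniformly from [0, 1).
# The column names are hypothetical and chosen only for illustration.
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df2)
```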
In the earlier example, we visualized only a single row. Now, let’s visualize the entire dataset as follows:
plt.figure()
df2.plot.bar()
plt.show()
This will draw one bar per column for every row. The bars that belong to the same row are grouped together, as shown in Figure 16-3.
You can see that the indices are represented on the x-axis, and magnitudes are marked on the y-axis. This is an unstacked vertical bar graph. You can create a stacked variation of it by just passing a simple argument as follows:
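One way to do this, sketched under the assumption that the argument in question is stacked=True (the dataset below is a stand-in with illustrative column names):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: stand-in data, ten rows and four hypothetical columns.
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

# stacked=True piles the values of each row on top of one another,
# so every row becomes a single bar made of four segments.
ax = df2.plot.bar(stacked=True)
plt.show()
```

The same argument works with df2.plot.barh() if you prefer horizontal bars.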
You’ve just learned how to create various types of bar graphs.
Histograms
A histogram is a visual representation of the frequency distribution of numerical data. The term was introduced by Karl Pearson.
We first divide the range of the data into intervals called buckets, or bins. The width of the bins depends on the requirements; for integer datasets, the smallest useful bin width is 1. Then, for each bin, we count the number of elements that fall into it. Finally, we show that table of counts as a bar graph.
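The binning step can be sketched with NumPy’s histogram() routine. The data below is hypothetical and chosen only to illustrate bins of width 1 on an integer dataset.

```python
import numpy as np

# Hypothetical integer data, used only to illustrate binning.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Bin edges 1, 2, ..., 6 give five bins of width 1.
# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both.
counts, edges = np.histogram(data, bins=np.arange(1, 7))

# Print the table of counts that a histogram would show as bars.
for low, count in zip(edges[:-1], counts):
    print(f"bin starting at {low}: {count} value(s)")
```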
You can draw the histogram of a given dataset with Pandas and Matplotlib. Let’s create a dataset as follows:
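A minimal sketch of such a dataset and its unstacked histogram follows. It assumes three columns of normally distributed values with shifted means, so that the distributions overlap; the name df4 and the column names are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: three overlapping normal distributions centered at 1, 0, and -1.
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1,
                    'b': np.random.randn(1000),
                    'c': np.random.randn(1000) - 1})

# alpha=0.5 makes the bars half transparent so that
# overlapping columns remain visible.
ax = df4.plot.hist(alpha=0.5)
plt.show()
```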
The alpha argument passed to the plotting routine decides the opacity (or alpha transparency) of the output. The previous example had to be made transparent because the histogram was unstacked, and opaque bars would hide the ones they overlap. Let’s create a stacked histogram with 20 bins, as follows:
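A sketch of the stacked variant follows. It assumes that the 20 refers to the bins argument, which in Pandas sets the number of bins; the dataset and its names are hypothetical stand-ins.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: stand-in data, three overlapping normal distributions.
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1,
                    'b': np.random.randn(1000),
                    'c': np.random.randn(1000) - 1})

# stacked=True piles the counts of the columns on top of one another,
# so no transparency is needed; bins=20 requests 20 bins.
ax = df4.plot.hist(stacked=True, bins=20)
plt.show()
```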
You’ve just learned how to visualize datasets as histograms.
Box Plots
You can visualize data with box plots as well. Box plots (also spelled as boxplots) display the groups of numerical data through their quartiles. Let’s create a dataset as follows:
df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['A', 'B', 'C', 'D', 'E'])
print(df)
The generated dataset is as follows:
A B C D E
0 0.684284 0.033906 0.099369 0.684024 0.533463
1 0.614305 0.645413 0.871788 0.561767 0.149080
2 0.226480 0.440091 0.096022 0.076962 0.674901
3 0.541253 0.409599 0.487924 0.649260 0.582250
4 0.436995 0.142239 0.781428 0.634987 0.825146
5 0.804633 0.874081 0.018661 0.306459 0.008134
6 0.228287 0.418942 0.157755 0.561070 0.740077
7 0.699860 0.230533 0.240369 0.108759 0.843307
8 0.530943 0.374583 0.650235 0.370809 0.595791
9 0.213455 0.221367 0.035203 0.887068 0.593629
You can draw box plots as follows:
plt.figure()
df.plot.box()
plt.show()
This will show the dataset as box plots, as shown in Figure 16-12.
The colors shown here are the default values. You can change them. First, you need to create a dictionary as follows:
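A minimal sketch of such a dictionary follows, together with the call that uses it. The keys boxes, whiskers, medians, and caps are the ones Pandas accepts for box plots; the specific colors and the outlier marker are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data with the same shape as the dataset above.
df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['A', 'B', 'C', 'D', 'E'])

# One color per component of the box plot (illustrative values).
color = {'boxes': 'DarkGreen',
         'whiskers': 'DarkOrange',
         'medians': 'DarkBlue',
         'caps': 'Gray'}

# sym='r+' draws any outliers as red plus signs.
ax = df.plot.box(color=color, sym='r+')
plt.show()
```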
The main advantage of the routine boxplot() is that you can have column-wise visualizations in a single output. Let’s create an appropriate dataset as follows:
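The following sketch assumes that an appropriate dataset is one with a categorical column to group by. The column names Col1, Col2, and X are hypothetical. With the by argument, boxplot() draws one subplot per numeric column and, inside each subplot, one box per group.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: two numeric columns plus a grouping column X.
df = pd.DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'])
df['X'] = pd.Series(['A'] * 5 + ['B'] * 5)
print(df)

# One subplot per numeric column, one box per value of X.
df.boxplot(by='X')
plt.show()
```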
Let’s see how to customize the scatter plot. You can customize the color and the size of the points. The color or size can be a constant or can be variable. The following is an example of variable colors and a constant size for the data points. When the color is variable, a color bar is added to the output by default.
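A minimal sketch of that case follows, using a hypothetical dataset with columns a through d. The c argument takes the name of the column that drives the color, and s is the constant point size.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumption: stand-in data, 50 rows and four hypothetical columns.
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])

# Color varies with column c; every point has the same size, 50.
ax = df.plot.scatter(x='a', y='b', c='c', s=50)
plt.show()
```

To vary the size instead, you can pass a sequence of sizes as s, for example s=df['c'] * 200.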
Let’s create a partial pie chart by passing values whose sum is less than 1.0. The following is the data for that:
series = pd.Series([0.1] * 4,
index=['A', 'B', 'C', 'D'],
name='series2')
print(series)
This creates the following dataset:
A 0.1
B 0.1
C 0.1
D 0.1
Name: series2, dtype: float64
The partial pie chart can be visualized as follows:
plt.figure()
series.plot.pie(figsize=(6, 6))
plt.show()
This creates a partial pie chart whose four wedges cover 40 percent of the circle, as shown in Figure 16-31.
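Whether the gap appears depends on your Matplotlib version: recent releases rescale the values of a pie chart so that they sum to 1 by default, in which case the call above draws a full circle. The following sketch, which calls Matplotlib’s pie() directly with normalize=False (available in Matplotlib 3.3 and later), is one way to keep the partial pie.

```python
import matplotlib.pyplot as plt
import pandas as pd

series = pd.Series([0.1] * 4,
                   index=['A', 'B', 'C', 'D'],
                   name='series2')

fig, ax = plt.subplots(figsize=(6, 6))

# normalize=False keeps the raw fractions, so the four wedges
# cover 0.4 of the circle instead of being rescaled to fill it.
ax.pie(series, labels=list(series.index), normalize=False)
plt.show()
```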
You’ve just learned how to visualize data with pie charts.
Summary
In this chapter, you learned how to visualize data with various techniques. You can use these visualization techniques in real-life projects.
In the coming chapters, we will explore other libraries for creating data visualizations in Python, starting in the next chapter with Seaborn.