© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Pajankar, Hands-on Matplotlib, https://doi.org/10.1007/978-1-4842-7410-1_16

16. Visualizing Data with Pandas and Matplotlib

Ashwin Pajankar
Nashik, Maharashtra, India

In the previous chapter, you learned how to read the data stored in various file formats into Python variables using NumPy, Pandas, and Matplotlib.

You should be comfortable working with data now. In this chapter, you will practice writing programs related to another important and practical aspect of the field of data science: dataset visualization. This chapter contains many short code snippets that demonstrate how to create visualizations of datasets. Let’s continue our journey through data science with the following topics:
  • Simple plots

  • Bar graphs

  • Histograms

  • Box plots

  • Area plots

  • Scatter plots

  • Hexagonal bin plots

  • Pie charts

After this chapter, you will be able to create impressive visualizations of datasets with Pandas and Matplotlib.

Simple Plots

Let’s jump directly into the hands-on examples for data visualization. You will learn how to visualize simple plots first. I recommend you create a new notebook for the code examples in this chapter.

Let’s start with the magic command that enables inline plots in the notebook, followed by the imports of all the required libraries, as follows:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Let’s create some data using the routine cumsum(), as follows:
df1 = pd.DataFrame(np.random.randn(100, 2), columns=['B', 'C']).cumsum()
df1['A'] = pd.Series(list(range(100)))
print(df1)
The resultant dataset will have three columns, as follows:
           B         C   A
0  -0.684779 -0.655677   0
1  -0.699163 -1.868611   1
2  -0.315527 -3.513103   2
3  -0.504069 -4.175940   3
4   0.998419 -4.385832   4
..       ...       ...  ..
95  1.149399 -1.445029  95
96  2.035029 -1.886731  96
97  0.938699  0.188980  97
98  2.449148  0.335828  98
99  2.204369 -1.304379  99
[100 rows x 3 columns]
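The values you see will differ from the ones printed above, because randn() draws new random numbers on every run. As a minimal sketch (the seed value 42 is an arbitrary assumption, not something this chapter prescribes), you can fix the seed to make the dataset repeatable and confirm that cumsum() turns the random steps into running totals:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # arbitrary seed, chosen only for repeatability
steps = pd.DataFrame(np.random.randn(100, 2), columns=['B', 'C'])
df1 = steps.cumsum()                      # running total of each column
df1['A'] = pd.Series(list(range(100)))

# Row n of the cumulative sum is the sum of steps 0..n (label slices are inclusive)
print(np.isclose(df1.loc[2, 'B'], steps.loc[0:2, 'B'].sum()))
```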
Let’s use the routine plot() to visualize this data. By default, the plot() routine of the dataframe object uses Matplotlib to draw a line plot. Here’s an example:
plt.figure()
df1.plot(x='A', y='B')
plt.show()
This code is self-explanatory. We are passing strings that contain the names of columns as arguments for the x- and y-axes. It produces the output depicted in Figure 16-1.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig1_HTML.jpg
Figure 16-1

Visualizing a simple plot
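The routine plot() returns a Matplotlib Axes object, so you can keep customizing the figure after the call. The following is a minimal sketch with a small made-up dataset (the data, the title, and the label are illustrative assumptions). It plots two columns against A in one call by passing a list to y:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data: two cumulative random series and a counter column
demo = pd.DataFrame(np.random.randn(50, 2), columns=['B', 'C']).cumsum()
demo['A'] = np.arange(50)

fig, ax = plt.subplots()
demo.plot(x='A', y=['B', 'C'], ax=ax)   # one line per column listed in y
ax.set_title('B and C against A')        # customize through the Axes object
ax.set_ylabel('cumulative sum')
# In a notebook the figure renders after the cell; in a script, call plt.show()
```

Creating the Axes first with plt.subplots() and passing it as ax is one possible design; it keeps the figure and the plot in a single, explicit object.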

You can use other columns in the visualization as well, as shown here:
plt.figure()
df1.plot(x='A', y='C')
plt.show()

Run this example to see the result. This is how you can use different combinations of columns to visualize data.

Bar Graphs

Let’s create a simple bar graph using the same dataset. Let’s pick a record from this dataframe as follows:
print(df1.iloc[4])
The following is the output:
B    0.998419
C   -4.385832
A    4.000000
Name: 4, dtype: float64
Let’s draw a simple bar graph with this data using the routine bar(). The following is the code snippet for that:
plt.figure()
df1.iloc[4].plot.bar()
plt.axhline(0, color='k')
plt.show()
In this code example, we are using axhline() to draw a horizontal line at y = 0, which marks the x-axis. Figure 16-2 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig2_HTML.png
Figure 16-2

Visualizing a simple bar graph
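When you call plot.bar() on a single row, the column names become the tick labels and the values become the bar heights. The following minimal sketch checks this with a hand-made Series (the numbers are illustrative assumptions, not the values printed above):

```python
import matplotlib.pyplot as plt
import pandas as pd

row = pd.Series({'B': 1.0, 'C': -4.4, 'A': 4.0})  # illustrative values

fig, ax = plt.subplots()
row.plot.bar(ax=ax)
ax.axhline(0, color='k')   # horizontal line at y = 0

# Read the bars back from the Axes: one rectangle per entry of the Series
heights = [bar.get_height() for bar in ax.patches]
labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels, heights)
```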

Let’s discuss a more complex example of a bar graph. Let’s create a new dataset as follows:
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df2)
The output is as follows:
          a         b         c         d
0  0.352173  0.127452  0.637665  0.734944
1  0.375190  0.931818  0.769403  0.927441
2  0.830744  0.942059  0.781032  0.557774
3  0.977058  0.594992  0.557016  0.862058
4  0.960796  0.329448  0.493713  0.971139
5  0.364460  0.516401  0.432365  0.587528
6  0.292020  0.500945  0.889294  0.211502
7  0.770808  0.519468  0.279582  0.419549
8  0.982924  0.458197  0.938682  0.123614
9  0.578290  0.186395  0.901216  0.099061
In the earlier example, we visualized only a single row. Now, let’s visualize the entire dataset as follows:
plt.figure()
df2.plot.bar()
plt.show()
This draws one group of bars for every row, with one bar per column in each group, as shown in Figure 16-3.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig3_HTML.jpg
Figure 16-3

Visualizing a more complex bar graph

You can see that the indices are represented on the x-axis, and magnitudes are marked on the y-axis. This is an unstacked vertical bar graph. You can create a stacked variation of it by just passing a simple argument as follows:
plt.figure()
df2.plot.bar(stacked=True)
plt.show()
Figure 16-4 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig4_HTML.jpg
Figure 16-4

Visualizing vertically stacked bar graphs
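In a stacked bar graph, the bars of one row sit on top of each other, so the total height of each stack is the sum of that row. The following is a minimal sketch that verifies this by reading the rectangles back from the Axes (the variable names top and stack_tops are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

fig, ax = plt.subplots()
df2.plot.bar(stacked=True, ax=ax)

# One container per column; the last one sits on top of each stack
top = ax.containers[-1]
stack_tops = [bar.get_y() + bar.get_height() for bar in top]
print(np.allclose(stack_tops, df2.sum(axis=1)))
```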

You can also create horizontal bar graphs, both stacked and unstacked. Let’s create a horizontally stacked bar graph with the routine barh() as follows:
plt.figure()
df2.plot.barh(stacked=True)
plt.show()
Figure 16-5 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig5_HTML.jpg
Figure 16-5

Visualizing horizontally stacked bar graphs

Let’s write a code snippet for an unstacked horizontal bar graph by omitting the argument as follows:
plt.figure()
df2.plot.barh()
plt.show()
Figure 16-6 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig6_HTML.jpg
Figure 16-6

Visualizing horizontal unstacked bar graphs

You’ve just learned how to create various types of bar graphs.

Histograms

A histogram is a visual representation of the frequency distribution of numerical data. The term was introduced by Karl Pearson.

We first divide the data into various buckets, or bins. The size of the bins depends on the requirements. For integer datasets, the smallest useful bin size is 1. Then, for each bin, you count the number of elements that fall into it. Finally, you show that table of counts as a bar graph.
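The steps just described can be sketched by hand before using the built-in histogram routine. This minimal example uses a tiny made-up integer dataset with a bin size of 1 (the data and the names edges and table are illustrative assumptions) and NumPy’s histogram() routine to build the table of counts:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])   # illustrative integers
edges = np.arange(1, 7)                        # bin edges 1, 2, ..., 6 (bin size 1)

# Count how many values fall into each bin
counts, _ = np.histogram(data, bins=edges)
table = pd.Series(counts, index=edges[:-1], name='count')
print(table.tolist())   # [1, 2, 3, 2, 1]

# Show the table of counts as a bar graph
fig, ax = plt.subplots()
table.plot.bar(ax=ax)
```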

You can draw the histogram of a given dataset with Pandas and Matplotlib. Let’s create a dataset as follows:
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1,
                    'b': np.random.randn(1000),
                    'c': np.random.randn(1000) - 1},
                   columns=['a', 'b', 'c'])
print(df4)
The generated dataset is as follows:
            a         b         c
0    1.454474 -0.517940 -0.772909
1    1.886328  0.868393  0.109613
2    0.041313 -1.959168 -0.713575
3    0.650075  0.457937 -0.501023
4    1.684392 -0.072837  1.821190
..        ...       ...       ...
995  0.800481 -1.209032 -0.249132
996  0.490104  0.253966 -1.185503
997  2.304285  0.082134 -1.068881
998  1.249055  0.040750 -0.488890
999 -1.216627  0.444629 -1.198375
[1000 rows x 3 columns]
Let’s visualize this dataset as a histogram using the routine hist(), as follows:
plt.figure();
df4.plot.hist(alpha=0.7)
plt.show()
Figure 16-7 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig7_HTML.jpg
Figure 16-7

Visualizing a dataset as a histogram

The argument alpha passed to the routine hist() sets the opacity of the bars. You had to make the bars partly transparent in the previous example because the histogram was unstacked, so the bars of the different columns overlap. Let’s create a stacked histogram with 20 bins, as follows:
plt.figure();
df4.plot.hist(stacked=True, bins=20)
plt.show()
Figure 16-8 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig8_HTML.jpg
Figure 16-8

Visualizing the same dataset as a stacked histogram
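The argument bins is the number of bins, not their width. As a minimal sketch, you can confirm this by counting the rectangles that end up on the Axes: three columns with 20 bins each should give 60 bars.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df4 = pd.DataFrame({'a': np.random.randn(1000) + 1,
                    'b': np.random.randn(1000),
                    'c': np.random.randn(1000) - 1})

fig, ax = plt.subplots()
df4.plot.hist(stacked=True, bins=20, ax=ax)

# Every column contributes one rectangle per bin
print(len(ax.patches))
```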

Let’s create a horizontal cumulative histogram of a single column as follows:
plt.figure();
df4['a'].plot.hist(orientation='horizontal', cumulative=True)
plt.show()
Figure 16-9 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig9_HTML.jpg
Figure 16-9

Horizontal cumulative histogram

The vertical version of the same histogram can be created as follows:
plt.figure();
df4['a'].plot.hist(orientation='vertical', cumulative=True)
plt.show()
Figure 16-10 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig10_HTML.jpg
Figure 16-10

Vertical cumulative histogram

Let’s try a fancy type of histogram next. The routine diff() computes the numeric difference between the previous row and the current one.
print(df4.diff())
The output will have the first row populated with NaN for all the columns (as there is no row before the first one). The output is as follows:
            a         b         c
0         NaN       NaN       NaN
1    0.431854  1.386333  0.882522
2   -1.845015 -2.827562 -0.823188
3    0.608762  2.417105  0.212552
4    1.034317 -0.530774  2.322213
..        ...       ...       ...
995  0.411207 -2.847858  0.325067
996 -0.310378  1.462998 -0.936370
997  1.814182 -0.171832  0.116622
998 -1.055230 -0.041384  0.579991
999 -2.465682  0.403880 -0.709485
[1000 rows x 3 columns]
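To see what diff() computes, you can compare it with the same subtraction written out using shift(). This is a minimal sketch on a three-row made-up dataset (the values are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 4.0, 2.0], 'b': [10.0, 10.5, 9.0]})

by_diff = df.diff()            # current row minus previous row
by_shift = df - df.shift(1)    # the same computation written out

print(by_diff)
print(by_diff.equals(by_shift))
```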
Let’s visualize this dataset, as shown here:
plt.figure()
df4.diff().hist(color='k', alpha=0.5, bins=50)
plt.show()
Figure 16-11 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig11_HTML.jpg
Figure 16-11

Column-wise histograms

You’ve just learned how to visualize datasets as histograms.

Box Plots

You can visualize data with box plots as well. Box plots (also spelled as boxplots) display the groups of numerical data through their quartiles. Let’s create a dataset as follows:
df = pd.DataFrame(np.random.rand(10, 5),
                  columns=['A', 'B', 'C', 'D', 'E'])
print(df)
The generated dataset is as follows:
          A         B         C         D         E
0  0.684284  0.033906  0.099369  0.684024  0.533463
1  0.614305  0.645413  0.871788  0.561767  0.149080
2  0.226480  0.440091  0.096022  0.076962  0.674901
3  0.541253  0.409599  0.487924  0.649260  0.582250
4  0.436995  0.142239  0.781428  0.634987  0.825146
5  0.804633  0.874081  0.018661  0.306459  0.008134
6  0.228287  0.418942  0.157755  0.561070  0.740077
7  0.699860  0.230533  0.240369  0.108759  0.843307
8  0.530943  0.374583  0.650235  0.370809  0.595791
9  0.213455  0.221367  0.035203  0.887068  0.593629
You can draw box plots as follows:
plt.figure()
df.plot.box()
plt.show()
This will show the dataset as box plots, as shown in Figure 16-12.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig12_HTML.jpg
Figure 16-12

Vertical box plot
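Each box spans the first to the third quartile of a column, and the line inside the box marks the median. With Matplotlib’s default settings, the whiskers extend to the farthest data points that lie within 1.5 times the interquartile range of the box. The following minimal sketch computes those quartiles directly with quantile() (the names quartiles and iqr are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])

# First quartile, median, and third quartile of every column
quartiles = df.quantile([0.25, 0.5, 0.75])
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]   # height of each box
print(quartiles)
print(iqr)

fig, ax = plt.subplots()
df.plot.box(ax=ax)
```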

The colors shown here are the default values. You can change them. First, you need to create a dictionary as follows:
color = dict(boxes='DarkGreen',
             whiskers='DarkOrange',
             medians='DarkBlue',
             caps='Gray')
print(color)
The following is the output:
{'boxes': 'DarkGreen', 'whiskers': 'DarkOrange', 'medians': 'DarkBlue', 'caps': 'Gray'}
Finally, you pass this dictionary as an argument to the routine that draws the box plot. The argument sym sets the marker used for outlier points (here, red plus signs). The code is as follows:
plt.figure()
df.plot.box(color=color, sym='r+')
plt.show()
Figure 16-13 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig13_HTML.jpg
Figure 16-13

Vertical box plot with customized colors

The following example creates a horizontal box plot visualization:
plt.figure()
df.plot.box(vert=False, positions=[1, 2, 3, 4, 5])
plt.show()
Figure 16-14 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig14_HTML.jpg
Figure 16-14

Horizontal box plot

Let’s see another routine, boxplot(), that also creates box plots. For that, let’s create another dataset, as shown here:
df = pd.DataFrame(np.random.rand(10, 5))
print(df)
The output dataset is as follows:
          0         1         2         3         4
0  0.936845  0.365561  0.890503  0.264896  0.937254
1  0.931661  0.226297  0.887385  0.036719  0.941609
2  0.127896  0.291034  0.161724  0.952966  0.925534
3  0.938686  0.336536  0.934843  0.806043  0.104054
4  0.743787  0.600116  0.989178  0.002870  0.453338
5  0.256692  0.773945  0.165381  0.809204  0.162431
6  0.822131  0.486780  0.453981  0.612403  0.614633
7  0.062387  0.958844  0.247515  0.573431  0.194665
8  0.453193  0.152337  0.062436  0.865115  0.220440
9  0.832040  0.237582  0.837805  0.423779  0.119027
You can draw box plots as follows:
plt.figure()
bp = df.boxplot()
plt.show()
Figure 16-15 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig15_HTML.jpg
Figure 16-15

Box plot in action

The main advantage of the routine boxplot() is that you can group the data by the values of a column and see the groups side by side in a single output. Let’s create an appropriate dataset as follows:
df = pd.DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'] )
df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
print(df)
The output dataset is as follows:
       Col1      Col2  X
0  0.469416  0.341874  A
1  0.176359  0.921808  A
2  0.135188  0.149354  A
3  0.475295  0.360012  A
4  0.566289  0.142729  A
5  0.408705  0.571466  B
6  0.233820  0.470200  B
7  0.679833  0.633349  B
8  0.183652  0.559745  B
9  0.192431  0.726981  B
Let’s create box plots grouped by column X as follows:
plt.figure();
bp = df.boxplot(by='X')
plt.show()
The output will have a title by default explaining how the data is grouped, as shown in Figure 16-16.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig16_HTML.jpg
Figure 16-16

Box plots with groups
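With the argument by, boxplot() draws one subplot per numeric column and, inside each subplot, one box per group. The following minimal sketch shows this and also computes the group medians that the boxes mark (building column X from a plain list is an illustrative shortcut):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'])
df['X'] = ['A'] * 5 + ['B'] * 5

# One subplot per numeric column, one box per value of X
axes = df.boxplot(by='X')
print([ax.get_title() for ax in np.ravel(axes)])

# The medians that the boxes mark, computed directly
print(df.groupby('X').median())
```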

Let’s look at a little more complex example for this. The following is the code for a new dataset:
df = pd.DataFrame(np.random.rand(10,3), columns=['Col1', 'Col2', 'Col3'])
df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])
df['Y'] = pd.Series(['A','B','A','B','A','B','A','B','A','B'])
print(df)
This code creates the following dataset:
       Col1      Col2      Col3  X  Y
0  0.542771  0.175804  0.017646  A  A
1  0.247552  0.503725  0.569475  A  B
2  0.593635  0.842846  0.755377  A  A
3  0.210409  0.235510  0.633318  A  B
4  0.268419  0.170563  0.478912  A  A
5  0.526251  0.258278  0.549876  B  B
6  0.311182  0.212787  0.966183  B  A
7  0.100687  0.432545  0.586907  B  B
8  0.416833  0.879384  0.635664  B  A
9  0.249280  0.558648  0.661523  B  B
You can create box plots in groups of multiple columns (this means the grouping criteria will have multiple columns).
plt.figure();
bp = df.boxplot(column=['Col1','Col2'], by=['X','Y'])
plt.show()
Figure 16-17 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig17_HTML.jpg
Figure 16-17

Box plots with groups (multiple columns in the grouping criteria)

Let’s see a bit more complex example with a dataset that has more variation. The following code creates such a dataset:
np.random.seed(1234)
df_box = pd.DataFrame(np.random.randn(10, 2), columns=['A', 'B'])
df_box['C'] = np.random.choice(['Yes', 'No'], size=10)
print(df_box)
The output is the following dataset:
          A         B    C
0  0.471435 -1.190976   No
1  1.432707 -0.312652  Yes
2 -0.720589  0.887163   No
3  0.859588 -0.636524  Yes
4  0.015696 -2.242685   No
5  1.150036  0.991946  Yes
6  0.953324 -2.021255   No
7 -0.334077  0.002118   No
8  0.405453  0.289092   No
9  1.321158 -1.546906   No
You can use the argument by of the routine boxplot() to group the data by column C and visualize it as follows:
plt.figure()
bp = df_box.boxplot(by='C')
plt.show()
Figure 16-18 shows the output grouped by column C.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig18_HTML.jpg
Figure 16-18

Box plot visualization grouped by column C

You can also use the routine groupby() in Pandas to group the data first and then draw the box plots, as follows:
bp = df_box.groupby('C').boxplot()
plt.show()
Figure 16-19 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig19_HTML.jpg
Figure 16-19

Box plot visualization grouped by column C with groupby()
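The two approaches arrange the output differently. Calling boxplot(by='C') gives one subplot per numeric column, whereas calling groupby('C').boxplot() gives one subplot per group. The following minimal sketch illustrates the second case (the alternating Yes/No column is an illustrative assumption):

```python
import numpy as np
import pandas as pd

df_box = pd.DataFrame(np.random.randn(10, 2), columns=['A', 'B'])
df_box['C'] = ['Yes', 'No'] * 5   # illustrative grouping column

# One subplot per group, each with a box for A and a box for B
per_group = df_box.groupby('C').boxplot()
print(list(per_group.index))
```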

This is how you can visualize datasets as box plots.

Area Plots

You can visualize datasets as area plots too. Let’s create a dataset with four columns as follows:
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['A', 'B', 'C', 'D'])
print(df)
This creates the following dataset:
          A         B         C         D
0  0.982005  0.123943  0.119381  0.738523
1  0.587304  0.471633  0.107127  0.229219
2  0.899965  0.416754  0.535852  0.006209
3  0.300642  0.436893  0.612149  0.918198
4  0.625737  0.705998  0.149834  0.746063
5  0.831007  0.633726  0.438310  0.152573
6  0.568410  0.528224  0.951429  0.480359
7  0.502560  0.536878  0.819202  0.057116
8  0.669422  0.767117  0.708115  0.796867
9  0.557761  0.965837  0.147157  0.029647
You can visualize all this data with the routine area() as follows:
plt.figure()
df.plot.area()
plt.show()
The previous example creates a stacked area plot, as shown in Figure 16-20.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig20_HTML.jpg
Figure 16-20

Stacked area plots
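In a stacked area plot, each column is drawn on top of the previous ones, so the upper edge of each band is the running total across the columns and the top of the last band is the row sum. Pandas requires every column to be either all positive or all negative for stacking. A minimal sketch of those boundaries (the name boundaries is illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])

# Upper boundary of each band: running total across the columns
boundaries = df.cumsum(axis=1)
print(boundaries.head())

fig, ax = plt.subplots()
df.plot.area(ax=ax)   # stacked by default
```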

You can also create unstacked area plots by passing an argument to the routine area() as follows:
plt.figure()
df.plot.area(stacked=False)
plt.show()
The unstacked area plot will be transparent by default so that all the individual area plots are visible. Figure 16-21 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig21_HTML.jpg
Figure 16-21

Unstacked area plots

This is how to create area plots.

Scatter Plots

You can also visualize any dataset as a scatter plot. Let’s create a dataset as follows:
df = pd.DataFrame(np.random.rand(10, 4),
                  columns=['A', 'B', 'C', 'D'])
print(df)
The output dataset is as follows:
          A         B         C         D
0  0.593893  0.114066  0.950810  0.325707
1  0.193619  0.457812  0.920403  0.879069
2  0.252616  0.348009  0.182589  0.901796
3  0.706528  0.726658  0.900088  0.779164
4  0.599155  0.291125  0.151395  0.335175
5  0.657552  0.073343  0.055006  0.323195
6  0.590482  0.853899  0.287062  0.173067
7  0.134021  0.994654  0.179498  0.317547
8  0.568291  0.009349  0.900649  0.977241
9  0.556895  0.084774  0.333002  0.728429
You can visualize columns A and B as a scatter plot as follows:
plt.figure()
df.plot.scatter(x='A', y='B')
plt.show()
Figure 16-22 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig22_HTML.jpg
Figure 16-22

Simple scatter plot

You can visualize multiple groups as follows:
ax = df.plot.scatter(x='A', y='B',
                     color='Blue',
                     label='Group 1')
plt.figure()
df.plot.scatter(x='C', y='D',
                color='Green',
                label='Group 2',
                ax=ax)
plt.show()
Figure 16-23 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig23_HTML.jpg
Figure 16-23

Scatter plot with multiple groups
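The key to drawing several groups in one figure is the argument ax: every call that receives the same Axes object draws onto the same plot. As a minimal sketch of one possible variation, you can create the Axes up front with plt.subplots() and pass it to both calls:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])

fig, ax = plt.subplots()
df.plot.scatter(x='A', y='B', color='Blue', label='Group 1', ax=ax)
df.plot.scatter(x='C', y='D', color='Green', label='Group 2', ax=ax)

# Both groups were drawn on the same Axes
print(len(ax.collections))
```

Note that the axis labels come from the last call, so you may want to set them yourself with ax.set_xlabel() and ax.set_ylabel().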

Let’s see how to customize the scatter plot. You can customize the color and the size of the points. The color or size can be a constant or can be variable. The following is an example of variable colors and a constant size for the data points. When the color is variable, a color bar is added to the output by default.
plt.figure()
df.plot.scatter(x='A', y='B', c='C', s=40)
plt.show()
Figure 16-24 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig24_HTML.jpg
Figure 16-24

Scatter plot with different colors for the data points
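You can also choose which color map translates the values of the color column into colors. The following minimal sketch uses the argument colormap with Matplotlib’s viridis map (the choice of viridis is an illustrative assumption) and checks that the figure contains a second Axes for the color bar:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])

fig, ax = plt.subplots()
df.plot.scatter(x='A', y='B', c='C', s=40, colormap='viridis', ax=ax)

# The figure now holds the scatter Axes and the color bar Axes
print(len(fig.axes))
```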

Let’s assign the size to be variable as follows:
plt.figure()
df.plot.scatter(x='A', y='B', s=df['C']*100)
plt.show()
Figure 16-25 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig25_HTML.jpg
Figure 16-25

Scatter plot with different sizes for the data points

Finally, let’s see an example with fully customized variable sizes and variable colors as follows:
plt.figure()
df.plot.scatter(x='A', y='B', c='C', s=df['D']*100)
plt.show()
Figure 16-26 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig26_HTML.jpg
Figure 16-26

Scatter plot with different colors and sizes for the data points

You’ve just learned how to create and customize scatter plots.

Hexagonal Bin Plots

You can also visualize data with hexagonal bin (hexbin) plots. Let’s prepare a dataset as follows:
df = pd.DataFrame(np.random.randn(100, 2),
                  columns=['A', 'B'])
df['B'] = df['B'] + np.arange(100)
print(df)
The output is as follows:
           A          B
0   0.165445  -1.127470
1  -1.192185   1.818644
2   0.237185   1.663616
3   0.694727   3.750161
4   0.247055   4.645433
..       ...        ...
95  0.650346  94.485664
96  0.539429  97.526762
97 -3.277193  95.151439
98  0.672125  96.507021
99 -0.827198  99.914196
[100 rows x 2 columns]
Let’s visualize this data with a hexbin plot as follows:
plt.figure()
df.plot.hexbin(x='A', y='B', gridsize=20)
plt.show()
Figure 16-27 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig27_HTML.jpg
Figure 16-27

Hexbin plot example

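As you can see, you can customize the size of the grid. The argument gridsize sets the number of hexagons along the x-axis, so a larger value produces smaller hexagons.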
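To get a feel for the argument gridsize, you can draw the same data with a coarse and a fine grid side by side. This is a minimal sketch (the values 5 and 25 and the names coarse and fine are illustrative assumptions). Each hexagon stores the number of points that fall inside it, so the counts of either grid add up to the number of rows:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'])
df['B'] = df['B'] + np.arange(100)

fig, (coarse, fine) = plt.subplots(1, 2, figsize=(10, 4))
df.plot.hexbin(x='A', y='B', gridsize=5, ax=coarse)
df.plot.hexbin(x='A', y='B', gridsize=25, ax=fine)

# Read the per-hexagon counts back from the drawn collections
coarse_counts = coarse.collections[0].get_array()
fine_counts = fine.collections[0].get_array()
print(len(coarse_counts), len(fine_counts))
print(coarse_counts.sum(), fine_counts.sum())
```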

Pie Charts

Finally, you will learn how to create pie charts to visualize datasets. Let’s create a dataset as follows:
series = pd.Series(3 * np.random.rand(4),
                   index=['A', 'B', 'C', 'D'],
                   name='series')
print(series)
This creates the following dataset:
A    1.566910
B    0.294986
C    2.140910
D    2.652122
Name: series, dtype: float64
You can visualize it as follows:
plt.figure()
series.plot.pie(figsize=(6, 6))
plt.show()
Figure 16-28 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig28_HTML.jpg
Figure 16-28

A simple pie chart

Let’s create a dataset with two columns as follows:
df = pd.DataFrame(3 * np.random.rand(4, 2),
                  index=['A', 'B', 'C', 'D'],
                  columns=['X', 'Y'])
print(df)
This generates the following data:
          X         Y
A  1.701163  2.983445
B  0.536219  0.036600
C  1.370995  2.795256
D  2.538074  1.419990
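For a dataframe with several columns, plot.pie() needs to know which column to draw: you either pass a column name as y or pass subplots=True to get one pie per column. The following is a minimal sketch that assumes the one-pie-per-column variant is what you want (the figsize value is an illustrative choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(3 * np.random.rand(4, 2),
                  index=['A', 'B', 'C', 'D'],
                  columns=['X', 'Y'])

# subplots=True draws one pie per column, side by side
axes = df.plot.pie(subplots=True, figsize=(8, 4))
print(len(axes))
```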
Figure 16-29 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig29_HTML.jpg
Figure 16-29

A simple pie chart for a multicolumn dataset

You can customize pie charts. Specifically, you can customize the font, colors, and labels as follows:
plt.figure()
series.plot.pie(labels=['A', 'B', 'C', 'D'],
                colors=['r', 'g', 'b', 'c'],
                autopct='%.2f', fontsize=20,
                figsize=(6, 6))
plt.show()
Figure 16-30 shows the output.
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig30_HTML.jpg
Figure 16-30

A simple yet customized pie chart
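The argument autopct is a format string that is applied to each slice’s share of the total, expressed in percent. The following minimal sketch uses made-up values (1, 2, 3, and 4 are illustrative assumptions) to compute those shares by hand and then reads the text labels back from the chart:

```python
import matplotlib.pyplot as plt
import pandas as pd

series = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'], name='series')

# Share of the whole for each slice, in percent
shares = series / series.sum() * 100
print(shares.round(2).tolist())

fig, ax = plt.subplots(figsize=(6, 6))
series.plot.pie(autopct='%.2f', ax=ax)

# The chart holds the slice labels and the formatted percentages as text
texts = [t.get_text() for t in ax.texts]
print(texts)
```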

Let’s create a partial pie chart by passing values whose sum is less than 1.0. The following is the data for that:
series = pd.Series([0.1] * 4,
                   index=['A', 'B', 'C', 'D'],
                   name='series2')
print(series)
This creates the following dataset:
A    0.1
B    0.1
C    0.1
D    0.1
Name: series2, dtype: float64
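The partial pie chart can be visualized as follows. Recent versions of Matplotlib rescale the values to fill the whole circle by default, so pass normalize=False (available since Matplotlib 3.3), which Pandas forwards to Matplotlib:
plt.figure()
series.plot.pie(figsize=(6, 6), normalize=False)
plt.show()
This creates a partial pie chart that covers 40 percent of the circle (the values sum to 0.4), as shown in Figure 16-31.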
../images/515442_1_En_16_Chapter/515442_1_En_16_Fig31_HTML.jpg
Figure 16-31

A partial pie chart

You’ve just learned how to visualize data with pie charts.

Summary

In this chapter, you learned how to visualize data with various techniques. You can use these visualization techniques in real-life projects. In the coming chapters, we will explore other libraries for creating data visualizations in Python.

In the next chapter, you will learn how to create data visualizations with a new library called Seaborn.
