ggplot2 is an R library for data visualization that is popular among R users. The main idea of ggplot2 is that a visualization is built up from many layers. Like a painter, we start with an empty canvas and then gradually add layers of paint. Usually, we interface with R code from Python with rpy2 (I discuss several interoperability options in Chapter 11 of my book, Python Data Analysis). However, if we only want to use ggplot2, it is more convenient to use the pyggplot library. In this recipe, we will visualize population growth for three countries using World Bank data retrievable through pandas. The data consists of various indicators and related metadata. The spreadsheet at http://api.worldbank.org/v2/en/topic/19?downloadformat=excel (retrieved July 2015) has descriptions of the indicators. I think that we can consider the World Bank dataset to be static; however, similar datasets change often enough to keep an analyst busy almost full time. Renaming an indicator upstream could break the code, so I decided to cache the data via the joblib library. The joblib library is related to scikit-learn, and we will discuss it in more detail in Chapter 9, Ensemble Learning and Dimensionality Reduction. Unfortunately, this approach has some limitations; in particular, we are not able to pickle all Python objects.
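To make the caching idea and its pickling limitation concrete, here is a minimal sketch that uses only the standard library. It is not joblib's API and not the code used in this recipe; the cached_call helper, the fake_download function, and the pop_grow.pkl file name are hypothetical stand-ins chosen for illustration.

```python
import os
import pickle
import tempfile
import threading


def cached_call(func, cache_path):
    """Return func()'s result, reusing a pickled copy at cache_path if present."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)

    result = func()
    # Serialize first, so a pickling failure does not leave a broken cache file.
    payload = pickle.dumps(result)

    with open(cache_path, 'wb') as f:
        f.write(payload)

    return result


calls = []


def fake_download():
    # Hypothetical stand-in for a slow or changeable remote download.
    calls.append(1)
    return {'country': ['A', 'B'], 'growth': [1.2, 0.7]}


cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, 'pop_grow.pkl')

first = cached_call(fake_download, cache_file)   # runs fake_download
second = cached_call(fake_download, cache_file)  # served from the file on disk

# The limitation: some objects cannot be pickled, for example a thread lock.
try:
    pickle.dumps(threading.Lock())
    lock_is_picklable = True
except (TypeError, pickle.PicklingError):
    lock_is_picklable = False
```

After the first call, the result is read back from disk, so later upstream changes no longer affect it until the cache file is deleted. Any result that cannot be pickled cannot be cached this way.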
First, you need R with ggplot2 installed. If you do not plan to use ggplot2 seriously, you may want to skip this recipe altogether. The homepage of R is http://www.r-project.org/ (retrieved July 2015). The documentation of ggplot2 is at http://docs.ggplot2.org/current/index.html (retrieved July 2015). You can install pyggplot with pip; I used pyggplot-23. To install joblib, visit https://pythonhosted.org/joblib/installing.html (retrieved July 2015). I have joblib 0.8.4 installed via Anaconda.
import pyggplot
from dautil import data

dawb = data.Worldbank()
pop_grow = dawb.get_name('pop_grow')
df = dawb.download(indicator=pop_grow, start=1984, end=2014)
df = dawb.rename_columns(df, use_longnames=True)
Create a plot from the DataFrame object we created:
p = pyggplot.Plot(df)
p.add_bar('country', dawb.get_longname(pop_grow), color='year')
p.coord_flip()
p.render_notebook()
Refer to the following plot for the end result:
The code is in the using_ggplot.ipynb file in this book's code bundle.