Loading external datasets in Python

Thanks to the SciPy community, there are many resources out there for getting our hands on some data.

A particularly useful resource comes in the form of the sklearn.datasets package of scikit-learn. This package comes preinstalled with some small datasets that do not require us to download any files from external websites. These datasets include the following:

load_boston: The Boston dataset contains housing prices in different suburbs of Boston, along with features such as the per capita crime rate by town, the proportion of residential land, and the proportion of non-retail business acres (this loader was deprecated in scikit-learn 1.0 and removed in version 1.2)
load_iris: The Iris dataset contains three different types of Iris flowers (Setosa, Versicolor, and Virginica), along with four features describing the width and length of the sepals and petals
load_diabetes: The diabetes dataset lets us predict a quantitative measure of disease progression one year after baseline, based on features such as patient age, sex, body mass index, average blood pressure, and six blood serum measurements
load_digits: The digits dataset contains 8 x 8 pixel images of the digits 0-9
load_linnerud: The Linnerud dataset contains 3 physiological variables and 3 exercise variables measured on 20 middle-aged men in a fitness club
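
As a minimal sketch of how these bundled loaders are used, the following lines load the digits and diabetes datasets and check their sizes. Each loader returns a dictionary-like object with 'data' and 'target' entries. The variable names are only illustrative, and load_boston is left out because recent scikit-learn releases no longer ship it.

```python
from sklearn import datasets

# Bundled datasets load from files shipped with scikit-learn (no download)
digits = datasets.load_digits()
diabetes = datasets.load_diabetes()

# digits['images'] keeps the 8 x 8 layout; digits['data'] flattens each image
print(digits['images'].shape)    # (1797, 8, 8)
print(digits['data'].shape)      # (1797, 64)

# The diabetes data has 10 features per patient and one numeric target value
print(diabetes['data'].shape)    # (442, 10)
print(diabetes['target'].shape)  # (442,)
```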

Also, scikit-learn allows us to download datasets directly from external repositories, such as the following:

fetch_olivetti_faces: The Olivetti faces dataset contains 10 different images each of 40 distinct subjects
fetch_20newsgroups: The 20 newsgroup dataset contains around 18,000 newsgroup posts on 20 topics

Even better, it is possible to download datasets directly from the machine learning database at http://openml.org. For example, to download the Iris flower dataset, simply type the following:

In [1]: from sklearn import datasets
In [2]: iris = datasets.fetch_openml('iris', version=1)
In [3]: iris_data = iris['data']
In [4]: iris_target = iris['target']

The Iris flower database contains a total of 150 samples with 4 features: sepal length, sepal width, petal length, and petal width. The data is divided into three classes: Iris Setosa, Iris Versicolor, and Iris Virginica. Data and labels are delivered in two separate containers, which we can inspect as follows:

In [5]: iris_data.shape
Out[5]: (150, 4)
In [6]: iris_target.shape
Out[6]: (150,)

Here, we can see that iris_data contains 150 samples, each with 4 features, which is why the second entry of the shape is 4. Labels are stored in iris_target, which holds exactly one label per sample.
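
The same layout can be checked without a network connection by using the bundled load_iris loader instead of fetch_openml. This is a small sketch under that assumption. Note that load_iris stores the labels as the integers 0 to 2 rather than as strings, and the names X and y are only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris['data']     # one row per flower, one column per feature
y = iris['target']   # one integer label per flower

# Rows of X and entries of y line up: sample i has features X[i] and label y[i]
n_samples, n_features = X.shape
assert y.shape == (n_samples,)

print(n_samples, n_features)    # 150 4
print(iris['feature_names'])    # names of the 4 feature columns
print(X[0], y[0])               # first flower's measurements and its label
```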

We can further inspect the target values, but printing all 150 of them would not be very informative. Instead, we are interested in the distinct target values, which are easy to find with NumPy:

In [7]: import numpy as np
In [8]: np.unique(iris_target) # Find all unique elements in array
Out[8]: array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
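
np.unique can also report how often each value occurs through its return_counts argument. The sketch below uses the bundled load_iris loader (an assumption made so that it runs offline), whose labels are integer codes that index into the target_names array.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Distinct labels and the number of samples carrying each label
labels, counts = np.unique(iris['target'], return_counts=True)
print(labels)   # [0 1 2]
print(counts)   # [50 50 50]

# Translate the integer codes back into species names
for label, count in zip(labels, counts):
    print(iris['target_names'][label], count)
```

Equal counts per class indicate a balanced dataset, which is worth confirming before training a classifier.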

Another Python library for data analysis that you should have heard about is pandas (http://pandas.pydata.org). pandas implements several powerful data operations for both databases and spreadsheets. As great as the library is, pandas is a bit too advanced for our purposes at this point.
