Data representation in scikit-learn

In contrast to the heterogeneous domains and applications of machine learning, the data representation in scikit-learn is less diverse, and the basic format that many algorithms expect is straightforward—a matrix of samples and features.

The underlying data structure is NumPy's multidimensional array, the ndarray. Each row in the matrix corresponds to one sample, and each column to the value of one feature.
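As a minimal illustration with made-up numbers, a two-dimensional ndarray already has this samples-by-features layout:

```python
import numpy as np

# A toy data matrix: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
print(X.shape)  # (3, 2): three samples, two features
```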

There is a Hello World in the world of machine learning datasets as well: the Iris dataset, whose origins date back to 1936. The standard installation of scikit-learn already gives you access to a couple of datasets, including Iris, which consists of 150 samples of four measurements each, taken from three different Iris flower species:

>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()

The dataset is packaged as a Bunch, which is just a thin wrapper around a dictionary:

>>> type(iris)
sklearn.datasets.base.Bunch
>>> iris.keys()
['target_names', 'data', 'target', 'DESCR', 'feature_names']

Under the data key, we can find the matrix of samples and features, and can confirm its shape:

>>> type(iris.data)
numpy.ndarray
>>> iris.data.shape
(150, 4)

Each entry in the data matrix has been labeled, and these labels can be looked up in the target attribute:

>>> type(iris.target)
numpy.ndarray
>>> iris.target.shape
(150,)
>>> iris.target[:10]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> np.unique(iris.target)
array([0, 1, 2])

The targets are encoded as integers. We can look up the corresponding names in the target_names attribute:

>>> iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='|S10')
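Since target_names is itself an ndarray, the integer labels can be decoded back into species names with fancy indexing. A small sketch using the attributes shown above:

```python
from sklearn import datasets

iris = datasets.load_iris()
# Index target_names with the integer labels to recover the species names.
names = iris.target_names[iris.target]
print(names[:3])  # the first samples all belong to the 'setosa' class
```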

This is the basic anatomy of many datasets: example data, target values, and target names.

What are the features of a single entry in this dataset?

>>> iris.data[0]
array([ 5.1,  3.5,  1.4,  0.2])

The four features are the measurements taken of real flowers: their sepal length and width, and petal length and width. Three different species have been examined: the Iris-Setosa, Iris-Versicolour, and Iris-Virginica.
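The measurement that each column holds is recorded in the feature_names attribute; pairing it with a sample makes the raw numbers readable. A small sketch:

```python
from sklearn import datasets

iris = datasets.load_iris()
# feature_names lists one label per column of the data matrix.
for name, value in zip(iris.feature_names, iris.data[0]):
    print('%s: %.1f' % (name, value))
```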

Machine learning tries to answer the following question: can we predict the species of the flower, given only its sepal and petal measurements?

In the next section, we will see how to answer this question with scikit-learn.

Besides the data about flowers, there are a few other datasets included in the scikit-learn distribution, as follows:

  • The Boston House Prices dataset (506 samples and 13 attributes)
  • The Optical Recognition of Handwritten Digits dataset (5620 samples and 64 attributes)
  • The Iris Plants Database (150 samples and 4 attributes)
  • The Linnerud dataset (30 samples and 3 attributes)
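Each of these ships with its own load_* function and can be loaded the same way as Iris. For example, the handwritten digits dataset (note that the copy bundled with scikit-learn is a subset of 1797 of the samples):

```python
from sklearn import datasets

# load_digits returns the bundled subset of the handwritten digits data:
# each 8x8 grayscale image is flattened into a row of 64 features.
digits = datasets.load_digits()
print(digits.data.shape)
```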

A few datasets are not included, but they can easily be fetched on demand, as they are usually a bit bigger. Among these datasets, you can find a real estate dataset and a news corpus:

>>> ds = datasets.fetch_california_housing()
downloading Cal. housing from http://lib.stat.cmu.edu/modules.php?op=...
>>> ds.data.shape
(20640, 8)
>>> ds = datasets.fetch_20newsgroups()
>>> len(ds.data)
11314
>>> ds.data[0][:50]
u"From: [email protected] (where's my thing)\nSubjec"
>>> sum(len(sample.split()) for sample in ds.data)
3252437

These datasets are a great way to get started with the scikit-learn library, and they will also help you to test your own algorithms. Finally, scikit-learn includes functions (prefixed with datasets.make_) to create artificial datasets as well.
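For instance, make_classification (one of the datasets.make_ helpers) produces a random labeled matrix in the same samples-by-features shape; the parameter values below are arbitrary choices:

```python
from sklearn.datasets import make_classification

# Generate an artificial dataset: 100 samples with 4 features each,
# plus one integer class label per sample. random_state makes it repeatable.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
print(X.shape)
print(y.shape)
```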

If you work with your own datasets, you will have to bring them into the shape that scikit-learn expects, which can be a task of its own. Tools such as Pandas make this task much easier, and a Pandas DataFrame can be converted to a numpy.ndarray with its to_numpy() method (older Pandas versions used as_matrix()).
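A small sketch of this conversion, using a made-up two-feature DataFrame:

```python
import pandas as pd

# A hypothetical dataset held in a DataFrame: 3 samples, 2 features.
df = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7],
                   'sepal_width': [3.5, 3.0, 3.2]})
# to_numpy() yields the plain samples-by-features ndarray scikit-learn expects.
X = df.to_numpy()
print(type(X), X.shape)
```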
