Chapter 15

Dealing with Common Datasets

IN THIS CHAPTER

Check Considering the use of standard datasets

Check Accessing a standard dataset

Check Performing dataset tasks

The reason to have computers in the first place is to manage data. You can easily lose sight of the overriding goal of computers when faced with all the applications that don’t seem to manage anything. However, even these applications manage data. For example, a graphics application, even if it simply displays pictures from last year’s camping trip, is still managing data. When looking at a Facebook page, you see data in myriad forms transferred over an Internet connection. In fact, it would be hard to find a consumer application that doesn’t manage data, and impossible to find a business application that doesn’t manage data in some way. Consequently, data is king on the computer.

Remember The datasets in this chapter are composed of a specific kind of data. For you to be able to perform comparisons, conduct testing, and verify results of a group of applications, each application must have access to the same standard data. Of course, more than just managing data comes into play when you're considering a standard dataset. Other considerations involve convenience and repeatable results. This chapter helps you take these various considerations into account.

Because the sort of management an application performs differs with the purpose of the application, the number of commonly available standard datasets is quite large. Consequently, finding the right dataset for your needs can be time consuming. Along with explaining the need for standardized datasets, this chapter also looks at methods you can use to locate the right standard dataset for your application.

After you have a dataset loaded, you need to perform various tasks with it. An application can perform a simple analysis, display data content, or perform Create, Read, Update, and Delete (CRUD) tasks as described in the “Considering CRUD” section of Chapter 13. The point is that functional applications, like any other application, require access to a standardized data source when you're looking for better ways of accomplishing tasks.

Understanding the Need for Standard Datasets

A standard dataset is one that provides a specific number of records using a specific format. It normally appears in the public domain and is used by professionals around the world for various sorts of tests. Professionals categorize these datasets in various ways:

  • Kinds of fields (features or attributes)
  • Number of fields
  • Number of records (cases)
  • Complexity of data
  • Task categories (such as classification)
  • Missing values
  • Data orientation (such as biology)
  • Popularity

Depending on where you search, you can find all sorts of other information, such as who donated the data and when. In some cases, old data may not reflect current social trends, making any testing you perform suspect. Some languages actually build the datasets into their downloadable source so that you don’t even have to do anything more than load them.

Warning Given the mandates of the General Data Protection Regulation (GDPR), you also need to exercise care in choosing any dataset that could potentially contain individually identifiable information. Some datasets weren't prepared correctly in the past and don't quite meet the requirements. Fortunately, you have access to resources that can help you determine whether a dataset is acceptable, such as the one IBM provides at https://www.ibm.com/security/data-security/gdpr. None of the datasets used in this book are problematic.

Of course, knowing what a standard dataset is and why you would use it are two different questions. Many developers want to test using their own custom data, which is prudent, but using a standard dataset does provide specific benefits, as listed here:

  • Using common data for performance testing
  • Reducing the risk of hidden data errors causing application crashes
  • Comparing results with other developers
  • Creating a baseline test for custom data testing later
  • Verifying the adequacy of error-trapping code used for issues such as missing data
  • Ensuring that graphs and plots appear as they should
  • Saving time creating a test dataset
  • Devising mock-ups for demo purposes that don’t compromise sensitive custom data

Remember A standardized common dataset is just a starting point, however. At some point, you need to verify that your own custom data works, but after verifying that the standard dataset works, you can do so with more confidence in the reliability of your application code. Perhaps the best reason to use one of these datasets is to reduce the time needed to locate and fix errors of various sorts — errors that might otherwise prove time consuming because you couldn’t be sure of the data that you’re using.

Finding the Right Dataset

Locating the right dataset for testing purposes is essential. Fortunately, you don’t have to look very hard because some online sites provide you with everything needed to make a good decision. The following sections offer insights into locating the right dataset for your needs.

Locating general dataset information

Datasets appear in a number of places online, and you can use many of them for general needs. An example of these sorts of datasets appears on the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets.html, shown in Figure 15-1. As the table shows, the site categorizes the individual datasets so that you can find the dataset you need. More important, the table helps you understand the kinds of tasks that people normally employ the dataset to perform.


FIGURE 15-1: Standardized, common datasets are categorized in specific ways.

If you want to know more about a particular dataset, you click its link and go to a page like the one shown in Figure 15-2. You can determine whether a dataset will help you test certain application features, such as searching for and repairing missing values. The Number of Web Hits field tells you how popular the dataset is, which can affect your ability to find others who have used the dataset for testing purposes. All this information is helpful in ensuring that you get the right dataset for a particular need; the goals include error detection, performance testing, and comparison with other applications of the same type.

“Screen capture of UCI Machine Learning Repository: Iris Data Set in the Firefox browser window.”

FIGURE 15-2: Dataset details are important because they help you find the right dataset.

Tip Even if your language provides easy access to these datasets, getting onto a site such as UCI Machine Learning Repository can help you understand which of these datasets will work best. In many cases, a language will provide access to the dataset and a brief description of dataset content — not a complete description of the sort you find on this site.

Using library-specific datasets

Depending on your programming language, you likely need to use a library to work with datasets in any meaningful way. One such library for Python is Scikit-learn (http://scikit-learn.org/stable/). This is one of the more popular libraries because it contains such an extensive set of features and also provides the means for loading both internal and external datasets as described at http://scikit-learn.org/stable/datasets/index.html. You can obtain various kinds of datasets using Scikit-learn as follows:

  • Toy datasets: Provides smaller datasets that you can use to test theories and basic coding.
  • Image datasets: Includes datasets containing basic picture information that you can use for various kinds of graphic analysis.
  • Generators: Defines randomly generated data based on the specifications you provide and the generator used. You can find generators for
    • Classification and clustering
    • Regression
    • Manifold learning
    • Decomposition
  • Support Vector Machine (SVM) datasets: Provides access to both the svmlight (http://svmlight.joachims.org/) and libsvm (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) implementations, which include datasets that enable you to perform sparse dataset tasks.
  • External load: Obtains datasets from external sources. Python provides access to a huge number of datasets, each of which is useful for a particular kind of analysis or comparison. When accessing an external dataset, you may have to rely on additional libraries (a short loading sketch follows this list):
    • pandas.io: Provides access to common data formats that include CSV, Excel, JSON, and SQL.
    • scipy.io: Obtains information from binary formats popular with the scientific community, including .mat and .arff files.
    • numpy/routines.io: Loads columnar data into NumPy (http://www.numpy.org/) arrays.
    • skimage.io: Loads images and videos into NumPy arrays.
    • scipy.io.wavfile.read: Reads .wav file data into NumPy arrays.
  • Other: Includes standard datasets that provide enough information for specific kinds of testing in a real-world manner. These datasets include (but are not limited to) Olivetti Faces and 20 Newsgroups Text.
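
As mentioned in the External load entry, pandas handles most of the common external formats. Here's a minimal sketch of reading a CSV file into memory; the filename my_data.csv is purely a placeholder, so substitute the path of a file on your own system:

import pandas as pd

# Read a local CSV file into a DataFrame (my_data.csv is a placeholder name).
external = pd.read_csv('my_data.csv')

# Display the first few records and the column names to confirm the load.
print(external.head())
print(external.columns)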

Loading a Dataset

The fact that Python provides access to such a large variety of datasets might make you think that a common mechanism exists for loading them. Actually, you need a variety of techniques to load even common datasets. As the datasets become more esoteric, you need additional libraries and other techniques to get the job done. The following sections don’t give you an exhaustive view of dataset loading in Python, but you do get a good overview of the process for commonly used datasets so that you can use these datasets within the functional programming environment. (See the “Finding Haskell support” sidebar in this chapter for reasons that Haskell isn’t included in the sections that follow.)

Working with toy datasets

As previously mentioned, a toy dataset is one that contains a small amount of common data that you can use to test basic assumptions, functions, algorithms, and simple code. The toy datasets reside directly in Scikit-learn, so you don’t have to do anything special except call a function to use them. The following list provides a quick overview of the function used to import each of the toy datasets into your Python code:

  • load_boston(): Regression analysis with the Boston house-prices dataset
  • load_iris(): Classification with the iris dataset
  • load_diabetes(): Regression with the diabetes dataset
  • load_digits([n_class]): Classification with the digits dataset
  • load_linnerud(): Multivariate regression using the linnerud dataset (health data described at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/descr/linnerud.rst)
  • load_wine(): Classification with the wine dataset
  • load_breast_cancer(): Classification with the Wisconsin breast cancer dataset

Remember Note that each of these functions begins with the word load. When you see this formulation in Python, the chances are good that the associated dataset is one of the Scikit-learn toy datasets.

The technique for loading each of these datasets is the same across examples. The following example shows how to load the Boston house-prices dataset:

from sklearn.datasets import load_boston
Boston = load_boston()
print(Boston.data.shape)

To see how the code works, click Run Cell. The output from the print() call is (506, 13), which tells you that the dataset contains 506 records with 13 features each. You can see the output shown in Figure 15-3.


FIGURE 15-3: The Boston object contains the loaded dataset.
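
If you want a bit more detail than the shape provides, the same Boston object also exposes the feature names and the target values directly. Here's a quick sketch using the object created above:

print(Boston.feature_names)    # the names of the 13 features
print(Boston.target.shape)     # (506,), one median house price per record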

Creating custom data

The purpose of each of the data generator functions is to create randomly generated datasets that have specific attributes. For example, you can control the number of data points using the n_samples argument and use the centers argument to control how many groups the function creates within the dataset. Each of the calls starts with the word make. The kind of data depends on the function; for example, make_blobs() creates Gaussian blobs for clustering (see http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html for details). The various functions reflect the kind of labeling provided: single label and multilabel. You can also choose bi-clustering, which allows clustering of both matrix rows and columns. Here's an example of creating custom data:

from sklearn.datasets import make_blobs
X, Y = make_blobs(n_samples=120, n_features=2, centers=4)
print(X.shape)

The output tells you that you have indeed created an X object containing a dataset of 120 cases, each with two features. The Y object contains a cluster label for each case, which the plotting code uses to color the points. Seeing the data plotted using the following code is more interesting:

import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(X[:, 0], X[:, 1], s=25, c=Y)
plt.show()

The %matplotlib magic function appears in Table 11-1. In this case, you tell Notebook to present the plot inline. The output is a scatter chart that uses the first feature in X for the x-axis and the second feature for the y-axis. The c=Y argument tells scatter() to color each point according to its label in Y. Figure 15-4 shows the output of this example. Notice that you can clearly see the four clusters based on their color (even though the colors don't appear in the book).


FIGURE 15-4: Custom datasets provide randomized data output in the form you specify.
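
The generator list earlier in this chapter also mentions classification data. As a complement to make_blobs(), here's a minimal sketch that uses make_classification() to produce labeled data; the argument values are arbitrary choices for illustration rather than anything the chapter requires:

from sklearn.datasets import make_classification

# Generate 200 cases with 5 features (3 of them informative) in 2 classes.
X2, Y2 = make_classification(n_samples=200, n_features=5,
                             n_informative=3, n_classes=2)

# X2 holds the feature values; Y2 holds the class label for each case.
print(X2.shape, Y2.shape)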

Fetching common datasets

At some point, you need larger datasets of common data to use for testing. The toy datasets that worked fine when you were testing your functions may not do the job any longer. Python provides access to larger datasets that help you perform more complex testing without making you depend on a network source for every run. These datasets download once and then load from your system, so you're not constantly waiting on network latency during testing. Consequently, they're between the toy datasets and a real-world dataset in size. More important, because they rely on actual (standardized) data, they reflect real-world complexity. The following list tells you about the common datasets:

  • fetch_olivetti_faces(): Olivetti faces dataset from AT&T containing ten images each of 40 different test subjects; each grayscale image is 64 x 64 pixels in size
  • fetch_20newsgroups(subset='train'): Data from 18,000 newsgroup posts based on 20 topics, with the dataset split into two subgroups: one for training and one for testing
  • fetch_mldata('MNIST original', data_home=custom_data_home): Dataset containing machine learning data in the form of 70,000 28-x-28-pixel handwritten digits from 0 through 9
  • fetch_lfw_people(min_faces_per_person=70, resize=0.4): Labeled Faces in the Wild dataset described at http://vis-www.cs.umass.edu/lfw/, which contains pictures of famous people in JPEG format
  • sklearn.datasets.fetch_covtype(): U.S. forestry dataset containing the predominant tree type in each of the patches of forest in the dataset
  • sklearn.datasets.fetch_rcv1(): Reuters Corpus Volume I (RCV1) is a dataset containing 800,000 manually categorized stories from Reuters, Ltd.

Notice that each of these functions begins with the word fetch. Some of these datasets require a long time to load. For example, the Labeled Faces in the Wild (LFW) dataset is 200MB in size, which means that you wait several minutes just to download and load it. However, at 200MB, the dataset also begins (in small measure) to reflect the size of real-world datasets. The following code shows how to fetch the Olivetti faces dataset:

from sklearn.datasets import fetch_olivetti_faces
data = fetch_olivetti_faces()
print(data.images.shape)

When you run this code, you see that the shape is 400 images, each of which is 64 x 64 pixels. The resulting data object contains a number of properties, including images. To access a particular image, you use data.images[?], where ? is the number of the image you want to access in the range from 0 to 399. Here is an example of how you can display an individual image from the dataset.

import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(data.images[1], cmap="gray")
plt.show()

Tip The cmap argument tells imshow() which color map to use when displaying the image, which is grayscale in this case. The tutorial at https://matplotlib.org/tutorials/introductory/images.html provides additional information on using cmap, as well as on adjusting the image in various ways. Figure 15-5 shows the output from this example.


FIGURE 15-5: The image appears as a 64-x-64-pixel matrix.
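
The earlier list also includes fetch_20newsgroups(). Because it returns text rather than images, the resulting object looks a little different; the following sketch prints the number of training posts and a few of the topic names (the first call downloads the data, so expect a delay):

from sklearn.datasets import fetch_20newsgroups

# Download (and cache) the training subset of the 20 Newsgroups data.
news = fetch_20newsgroups(subset='train')

# data is a list of raw posts; target_names lists the 20 topic labels.
print(len(news.data))
print(news.target_names[:5])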

Manipulating Dataset Entries

You're unlikely to find a common dataset used with Python that doesn't provide relatively good documentation. You need to find the documentation online if you want the full story about how the dataset is put together, what purpose it serves, and who originated it, as well as any needed statistics. Fortunately, you can employ a few tricks to interact with a dataset without resorting to major online research. The following sections offer some tips for working with the dataset entries found in this chapter.

Determining the dataset content

The previous sections of this chapter show how to load or fetch existing datasets from specific sources. These datasets generally have specific characteristics that you can discover online at places like http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html for the Boston house-prices dataset. However, you can also use the dir() function to learn about dataset content. When you use dir(Boston) with the previously created Boston house-prices dataset, you discover that it contains DESCR, data, feature_names, and target properties. Here is a short description of each property:

  • DESCR: Text that describes the dataset content and some of the information you need to use it effectively
  • data: The content of the dataset in the form of values used for analysis purposes
  • feature_names: The names of the various attributes in the order in which they appear in data
  • target: An array of values used with data to perform various kinds of analysis

The print(Boston.DESCR) call displays a wealth of information about the Boston house-prices dataset, including the names of attributes that you can use to interact with the data. Figure 15-6 shows the results of these queries.
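
Assuming the Boston object created in the “Working with toy datasets” section earlier in this chapter, a minimal sketch of those queries looks like this:

# List the properties that the dataset object exposes.
print(dir(Boston))

# Display the full textual description of the dataset.
print(Boston.DESCR)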


FIGURE 15-6: Most common datasets are configured to tell you about themselves.

Remember The information that the datasets contain can have significant commonality. For example, if you use dir(data) for the Olivetti faces dataset example described earlier, you find that it provides access to DESCR, data, images, and target properties. As with the Boston house-prices dataset, DESCR gives you a description of the Olivetti faces dataset, which you can use for things like accessing particular attributes. By knowing the names of common properties and understanding how to use them, you can discover all you need to know about a common dataset in most cases without resorting to any online resource. In this case, you'd use print(data.DESCR) to obtain a description of the Olivetti faces dataset. Also, some of the description data contains links to sites where you can learn more information.
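
Here's a quick sketch of the same approach applied to the Olivetti faces object (data) fetched earlier; printing only part of the description keeps the output manageable:

# Show which properties the Olivetti faces dataset exposes.
print(dir(data))

# Print just the first part of the description to keep the output short.
print(data.DESCR[:500])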

Creating a DataFrame

The common datasets are in a form that allows various types of analysis, as shown by the examples provided on the sites that describe them. However, you might not want to work with the dataset in that manner; instead, you may want something that looks a bit more like a database table. Fortunately, you can use the pandas (https://pandas.pydata.org/) library to perform the conversion in a manner that makes using the datasets in other ways easy. Using the Boston house-prices dataset as an example, the following code performs the required conversion:

import pandas as pd
BostonTable = pd.DataFrame(Boston.data,
                           columns=Boston.feature_names)

If you want to include the target values with the DataFrame, you must also execute: BostonTable['target'] = Boston.target. However, this chapter doesn't use the target data.
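
To verify that the conversion worked, you can ask the new DataFrame for its dimensions and a preview of the first few records. A quick sketch:

# The DataFrame should report 506 rows and 13 columns.
print(BostonTable.shape)

# head() displays the first five records along with their column names.
print(BostonTable.head())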

Accessing specific records

If you were to do a dir() command against a DataFrame, you would find that it provides you with an overwhelming number of functions to try. The documentation at https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.html supplies a good overview of what's possible (which includes all the usual database-specific tasks specified by CRUD). The following example code shows how to perform a query against a pandas DataFrame. In this case, the code selects only those housing areas where the crime rate is below 0.02 per capita.

CRIMTable = BostonTable.query('CRIM < 0.02')
print(CRIMTable.count()['CRIM'])

The output shows that only 17 records match the criteria. The count() function enables the application to count the records in the resulting CRIMTable. The index, ['CRIM'], selects just one of the available columns (every column is likely to report the same count, so you need only one of them).

You can display all these records with all of the attributes, but you may want to see only the number of rooms and the average house age for the affected areas. The following code shows how to display just the attributes you actually need:

print(CRIMTable[['RM', 'AGE']])

Figure 15-7 shows the output from this code. As you can see, the houses vary between 5 and nearly 8 rooms in size. The age varies from almost 14 years to a little over 65 years.


FIGURE 15-7: Manipulating the data helps you find specific information.

You might find it a bit hard to work with the unsorted data in Figure 15-7. Fortunately, you do have access to the full range of common database features. If you want to sort the values by number of rooms, you use:

print(CRIMTable[['RM', 'AGE']].sort_values('RM'))

As an alternative, you can always choose to sort by average home age:

print(CRIMTable[['RM', 'AGE']].sort_values('AGE'))
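
sort_values() also accepts a list of columns and an ascending argument, so a minimal sketch that places the largest houses first looks like this:

print(CRIMTable[['RM', 'AGE']].sort_values(
    ['RM', 'AGE'], ascending=False))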
