Building a data matrix using pandas

Now it's time to introduce another essential data science tool that comes preinstalled with the Anaconda Python distribution: pandas. pandas is built on NumPy and provides several useful tools and methods for working with data structures in Python. Just as we generally import NumPy under the alias np, it is common to import pandas under the alias pd:

In [6]: import pandas as pd

pandas provides a useful data structure called a DataFrame, which can be understood as a generalization of a 2D NumPy array with labeled rows and columns, as shown here:

In [7]: pd.DataFrame({
...         'model': [
...             'Normal Bayes',
...             'Multinomial Bayes',
...             'Bernoulli Bayes'
...         ],
...         'class': [
...             'cv2.ml.NormalBayesClassifier_create()',
...             'sklearn.naive_bayes.MultinomialNB()',
...             'sklearn.naive_bayes.BernoulliNB()'
...         ]
...     })

The output of the cell is a table with three rows (indexed 0 to 2) and two columns, model and class.
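To make the "generalization of a 2D NumPy array" idea concrete, here is a minimal sketch that wraps a small array in a DataFrame. The array contents and the labels (row_a, row_b, x, y, z) are made up for illustration and do not come from the dataset used in this chapter:

```python
import numpy as np
import pandas as pd

# A small 2D NumPy array: two rows, three columns (illustrative values)
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# Wrapping the array in a DataFrame attaches labels to rows and columns
df = pd.DataFrame(arr, index=['row_a', 'row_b'], columns=['x', 'y', 'z'])

print(df.shape)              # the shape matches the underlying array
print(df.loc['row_b', 'y'])  # label-based lookup instead of arr[1, 1]
print(df.to_numpy())         # convert back to a plain NumPy array
```

The values are the same as in the array; what the DataFrame adds is the ability to address them by name rather than by position.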

We can combine the preceding functions to build a pandas DataFrame from the extracted data:

In [8]: def build_data_frame(extractdir, classification):
...         rows = []
...         index = []
...         for file_name, text in read_files(extractdir):
...             rows.append({'text': text, 'class': classification})
...             index.append(file_name)
...
...         data_frame = pd.DataFrame(rows, index=index)
...         return data_frame
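The following sketch shows what the row-and-index pattern inside build_data_frame produces. It does not touch the filesystem: fake_read_files is a hypothetical stand-in for read_files that yields two made-up (file_name, text) pairs, and the class label 0 is likewise an assumption for illustration:

```python
import pandas as pd

def fake_read_files(extractdir):
    # Hypothetical stand-in for read_files: yields (file_name, text) pairs
    yield extractdir + '/1.txt', 'Meeting moved to 3pm'
    yield extractdir + '/2.txt', 'Win a free prize now'

rows = []
index = []
for file_name, text in fake_read_files('demo'):
    # One dict per file; the dict keys become the column names
    rows.append({'text': text, 'class': 0})
    # The file name becomes the row label
    index.append(file_name)

df = pd.DataFrame(rows, index=index)
print(df)
```

Each file ends up as one row, labeled by its file name, with the message body in the text column and the label in the class column.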

We then call it with the following command:

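In [9]: frames = []
...     for source, classification in sources:
...         # source[:-7] drops the last seven characters of the archive name
...         extractdir = '%s/%s' % (datadir, source[:-7])
...         frames.append(build_data_frame(extractdir, classification))
...     # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
...     data = pd.concat(frames)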
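Once the per-source frames are combined, it can be useful to sanity-check the result. The sketch below stacks two small, made-up frames shaped like the output of build_data_frame with pd.concat, then inspects the shape and the per-class counts. The texts, index labels, and the 0/1 class encoding are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical per-source frames, shaped like the output of build_data_frame
ham = pd.DataFrame(
    {'text': ['see you at noon', 'report attached'], 'class': [0, 0]},
    index=['ham/1.txt', 'ham/2.txt'])
spam = pd.DataFrame(
    {'text': ['cheap pills'], 'class': [1]},
    index=['spam/1.txt'])

# Stack the frames row-wise into a single data matrix
data = pd.concat([ham, spam])

print(data.shape)                    # (number of files, number of columns)
counts = data['class'].value_counts()
print(counts)                        # number of rows per class label
```

A quick look at the class counts reveals how balanced the combined dataset is before any classifier is trained on it.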
