Now it's time to introduce another essential data science tool that comes preinstalled with the Anaconda distribution of Python: pandas. pandas is built on NumPy and provides several useful tools and methods for working with data structures in Python. Just as we generally import NumPy under the alias np, it is common to import pandas under the alias pd:
In [6]: import pandas as pd
pandas provides a useful data structure called a DataFrame, which can be understood as a generalization of a 2D NumPy array with labeled rows and columns, as shown here:
In [7]: pd.DataFrame({
...         'model': [
...             'Normal Bayes',
...             'Multinomial Bayes',
...             'Bernoulli Bayes'
...         ],
...         'class': [
...             'cv2.ml.NormalBayesClassifier_create()',
...             'sklearn.naive_bayes.MultinomialNB()',
...             'sklearn.naive_bayes.BernoulliNB()'
...         ]
...     })
The output of the cell is a table with three rows, indexed 0 to 2, and two columns, model and class.
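To make the idea of a DataFrame as a generalization of a 2D NumPy array more concrete, the following minimal sketch wraps a small array in a DataFrame. The values and the labels ('x', 'y', 'a', 'b', 'c') are made up for illustration and are not part of the example above.

```python
import numpy as np
import pandas as pd

# A plain 2D NumPy array: rows and columns are addressed by position only.
arr = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])

# Wrapping it in a DataFrame adds labels for the columns and the rows.
# The labels 'x', 'y' and 'a', 'b', 'c' are illustrative assumptions.
df = pd.DataFrame(arr, columns=['x', 'y'], index=['a', 'b', 'c'])

# Label-based access takes the place of positional indexing such as arr[1, 1].
value = df.loc['b', 'y']

# The underlying values can still be recovered as a NumPy array.
same_values = np.array_equal(df.to_numpy(), arr)
```

Here df.loc['b', 'y'] addresses the same element as arr[1, 1], but by label rather than by position.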
We can combine the preceding functions to build a pandas DataFrame from the extracted data:
In [8]: def build_data_frame(extractdir, classification):
...         rows = []
...         index = []
...         for file_name, text in read_files(extractdir):
...             rows.append({'text': text, 'class': classification})
...             index.append(file_name)
...
...         data_frame = pd.DataFrame(rows, index=index)
...         return data_frame
We then call it with the following command:
In [9]: data = pd.DataFrame({'text': [], 'class': []})
...     for source, classification in sources:
...         extractdir = '%s/%s' % (datadir, source[:-7])
...         # DataFrame.append was removed in pandas 2.0; use pd.concat
...         data = pd.concat([data, build_data_frame(extractdir,
...                                                  classification)])
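The two steps above can be tried without the extracted archives. The sketch below is a self-contained approximation, not the chapter's actual pipeline: read_files is a hypothetical stand-in that yields (file name, text) pairs from an in-memory dictionary, and the directory names, file names, texts, and class labels are invented for illustration. It combines the per-source frames with pd.concat.

```python
import pandas as pd

# Hypothetical stand-in for read_files(): instead of walking a directory,
# it yields (file_name, text) pairs from an in-memory dictionary.
FAKE_DIRS = {
    'ham_dir': {'ham1.txt': 'see you at lunch', 'ham2.txt': 'meeting moved'},
    'spam_dir': {'spam1.txt': 'win money now'},
}


def read_files(extractdir):
    for file_name, text in FAKE_DIRS[extractdir].items():
        yield file_name, text


def build_data_frame(extractdir, classification):
    rows = []
    index = []
    for file_name, text in read_files(extractdir):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)
    # One row per file, labeled by its file name.
    return pd.DataFrame(rows, index=index)


# Illustrative class labels: 0 for ham, 1 for spam (an assumption).
sources = [('ham_dir', 0), ('spam_dir', 1)]

# Build one frame per source, then concatenate them in a single call.
frames = [build_data_frame(extractdir, label) for extractdir, label in sources]
data = pd.concat(frames)
```

Collecting the frames in a list and calling pd.concat once avoids copying the accumulated data on every loop iteration, which matters when there are many sources.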