Loading the dataset

Follow these steps to load the dataset:

  1. If you downloaded the latest code from GitHub, you will find several .zip files in the notebooks/data/chapter7 directory. These files contain raw email data (with fields for To:, Cc:, and text body) that are either classified as spam (with the SPAM = 1 class label) or not (also known as ham, the HAM = 0 class label).
  2. We build a variable called sources, which contains all of the raw data files:
In [1]: HAM = 0
...     SPAM = 1
...     datadir = 'data/chapter7'
...     sources = [
...         ('beck-s.tar.gz', HAM),
...         ('farmer-d.tar.gz', HAM),
...         ('kaminski-v.tar.gz', HAM),
...         ('kitchen-l.tar.gz', HAM),
...         ('lokay-m.tar.gz', HAM),
...         ('williams-w3.tar.gz', HAM),
...         ('BG.tar.gz', SPAM),
...         ('GP.tar.gz', SPAM),
...         ('SH.tar.gz', SPAM)
...     ]
  3. The next step is to extract these files into subdirectories. For this, we can use the extract_tar function we wrote in the previous chapter:
In [2]: def extract_tar(datafile, extractdir):
...         try:
...             import tarfile
...         except ImportError:
...             raise ImportError("You do not have tarfile installed. "
...                               "Try unzipping the file outside of "
...                               "Python.")
...         tar = tarfile.open(datafile)
...         tar.extractall(path=extractdir)
...         tar.close()
...         print("%s successfully extracted to %s" % (datafile,
...                                                    extractdir))
  4. To apply the function to all data files in sources, we need to run a loop. The extract_tar function expects a path to the .tar.gz file (which we build from datadir and an entry in sources) and a directory to extract the files to (datadir). This will extract all emails in, for example, data/chapter7/beck-s.tar.gz to the data/chapter7/beck-s/ subdirectory:
In [3]: for source, _ in sources:
...         datafile = '%s/%s' % (datadir, source)
...         extract_tar(datafile, datadir)
Out[3]: data/chapter7/beck-s.tar.gz successfully extracted to data/chapter7
        data/chapter7/farmer-d.tar.gz successfully extracted to data/chapter7
        data/chapter7/kaminski-v.tar.gz successfully extracted to data/chapter7
        data/chapter7/kitchen-l.tar.gz successfully extracted to data/chapter7
        data/chapter7/lokay-m.tar.gz successfully extracted to data/chapter7
        data/chapter7/williams-w3.tar.gz successfully extracted to data/chapter7
        data/chapter7/BG.tar.gz successfully extracted to data/chapter7
        data/chapter7/GP.tar.gz successfully extracted to data/chapter7
        data/chapter7/SH.tar.gz successfully extracted to data/chapter7
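As an aside, newer versions of Python (3.12 and later) let tarfile reject archive members with unsafe paths during extraction. The following extract_tar_safe function is a hypothetical variant sketched here for illustration, not part of the code above:

```python
import tarfile

def extract_tar_safe(datafile, extractdir):
    # Sketch of a safer variant: the 'data' extraction filter
    # (Python 3.12+) rejects members with absolute paths, '..'
    # components, and other unsafe metadata.
    with tarfile.open(datafile) as tar:
        tar.extractall(path=extractdir, filter='data')
    print("%s successfully extracted to %s" % (datafile, extractdir))
```

On older Python versions, the filter keyword is not available, and plain tar.extractall(path=extractdir) behaves as in extract_tar above.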

Now here's the tricky bit. Every one of these subdirectories contains many other directories, wherein the text files reside. So, we need to write two functions:

  • read_single_file(filename): This is a function that extracts the relevant content from a single file called filename.
  • read_files(path): This is a function that extracts the relevant content from all files in a particular directory called path.

To extract the relevant content from a single file, we need to be aware of how each file is structured. The only thing we know is that the header section of the email (From:, To:, and Cc:) and the main body of text are separated by an empty line, that is, the newline character '\n'. So, what we can do is iterate over every line in the text file and keep only those lines that belong to the main text body, which will be stored in the variable lines. We also want to keep a Boolean flag, past_header, around, which is initially set to False but will be flipped to True once we are past the header section:

  5. We start by initializing those two variables:
In [4]: import os
...     def read_single_file(filename):
...         past_header, lines = False, []
  6. Then, we check whether a file with the name filename exists. If it does, we start looping over it line by line:
...         if os.path.isfile(filename):
...             f = open(filename, encoding="latin-1")
...             for line in f:

You may have noticed the encoding="latin-1" part. Since some of the emails are not in Unicode, this is an attempt to decode the files correctly.

We do not want to keep the header information, so we keep looping until we encounter the '\n' character, at which point we flip past_header from False to True.

  7. At this point, the first condition of the following if-else clause is met, and we append all remaining lines in the text file to the lines variable:
...                 if past_header:
...                     lines.append(line)
...                 elif line == '\n':
...                     past_header = True
...             f.close()
  8. In the end, we concatenate all lines into a single string, separated by the newline character, and return both the full path to the file and the actual content of the file:
...         content = '\n'.join(lines)
...         return filename, content
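For reference, the fragments above assemble into the following self-contained function; the toy email written to a temporary file here is invented purely to demonstrate it:

```python
import os
import tempfile

def read_single_file(filename):
    # Collect only the lines that appear after the first empty line,
    # i.e. skip the email header (From:, To:, Cc:, and so on)
    past_header, lines = False, []
    if os.path.isfile(filename):
        f = open(filename, encoding="latin-1")
        for line in f:
            if past_header:
                lines.append(line)
            elif line == '\n':
                past_header = True
        f.close()
    content = '\n'.join(lines)
    return filename, content

# Demonstrate on a made-up email (not part of the dataset)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("From: alice@example.com\n"
              "To: bob@example.com\n"
              "\n"
              "Hi Bob,\n"
              "see you tomorrow.\n")
    path = tmp.name

fname, body = read_single_file(path)
print(body)  # only the lines after the empty separator line remain
```

Note that because each kept line still ends in its own newline, joining with '\n' doubles the line breaks; this matches the code shown above.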
  9. The job of the second function will be to loop over all files in a folder and call read_single_file on them:
In [5]: def read_files(path):
...         for root, dirnames, filenames in os.walk(path):
...             for filename in filenames:
...                 filepath = os.path.join(root, filename)
...                 yield read_single_file(filepath)

Here, yield is a keyword that is similar to return. The difference is that a function containing yield returns a generator, which produces values one at a time instead of building the full list in memory. This is desirable if you expect to have a large number of items to iterate over.
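To see the difference in isolation, consider this toy generator (the name squares is made up for the example):

```python
def squares(n):
    # Each yield hands back one value and suspends the function,
    # so values are produced lazily instead of stored in a list
    for i in range(n):
        yield i * i

gen = squares(3)
print(gen)         # a generator object, not a list of values
print(list(gen))   # consuming the generator produces [0, 1, 4]
```

This is why read_files can walk a directory containing thousands of emails without holding all of their contents in memory at once.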
