Preparing the data

Preparing the text is a task in its own right. This is because in the real world, text is often messy and cannot be fixed with a few simple scaling operations. For instance, people make typos, add unnecessary characters, or save text in encodings that we cannot read. NLP involves its own set of data cleaning challenges and techniques.

Sanitizing characters

To store text, computers need to encode the characters into bits. There are several different ways to do this, and not all of them can deal with all the characters out there.

It is good practice to keep all the text files in one encoding scheme, usually UTF-8, but of course, that does not always happen. Files might also be corrupted, meaning that a few bits are off, thereby rendering some characters unreadable. Therefore, before we do anything else, we need to sanitize our inputs.
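To see what such an error looks like in practice, the following minimal sketch (the byte string is made up for illustration) shows how strict decoding of invalid UTF-8 fails, while the replace error handler substitutes the Unicode replacement character instead of raising an exception:

# A byte sequence that is not valid UTF-8 (hypothetical example)
raw = b'disaster \xe2\x28\xa1 tweet'

# Strict decoding raises a UnicodeDecodeError
try:
    raw.decode('utf-8')
except UnicodeDecodeError as error:
    print('Strict decoding failed:', error)

# With errors='replace', the offending bytes become the U+FFFD replacement character
print(raw.decode('utf-8', errors='replace'))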

Python offers a helpful codecs library, which allows us to deal with different encodings. Our data is UTF-8 encoded, but there are a few special characters in there that cannot be read easily. Therefore, we have to sanitize our text of these special characters, which we can do by running the following:

import codecs
input_file = codecs.open('../input/socialmedia-disaster-tweets-DFE.csv', 'r', encoding='utf-8', errors='replace')

In the preceding code, codecs.open acts as a stand-in replacement for Python's standard file opening function. It returns a file object, which we can later read line by line. We specify the path of the file we want to read, the mode ('r' for reading), the expected encoding, and what to do with errors. In this case, unreadable bytes are replaced with a special replacement character marker.

To write to the output file, we can just use Python's standard open() function. This function creates a file at the specified path that we can write to:

output_file = open('clean_socialmedia-disaster.csv', 'w')

Now that's done, all we have to do is loop over the lines of the input file that we read with our codecs reader and save them as a regular CSV file again. We can achieve this by running the following:

for line in input_file:
    # Write each sanitized line straight to the output file
    output_file.write(line)

Likewise, it's good practice to close the file objects afterward, which we can do by running:

input_file.close()
output_file.close()
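As an aside, the same copy step can be written with Python's with statement, which closes both files automatically even if an error occurs midway. This is just an equivalent sketch of the sanitizing step above, not a different approach:

import codecs

# Equivalent to the open/loop/close steps above, with automatic cleanup
with codecs.open('../input/socialmedia-disaster-tweets-DFE.csv', 'r',
                 encoding='utf-8', errors='replace') as input_file, \
     open('clean_socialmedia-disaster.csv', 'w') as output_file:
    for line in input_file:
        output_file.write(line)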

Now we can read the sanitized CSV file with pandas:

import pandas as pd

df = pd.read_csv('clean_socialmedia-disaster.csv')

Lemmatization

Lemmas have already made several appearances throughout this chapter. A lemma in the field of linguistics, also called a headword, is the word under which the set of related words or forms appears in a dictionary. For example, "was" and "is" appear under "be," "mice" appears under "mouse," and so on. Quite often, the specific form of a word does not matter very much, so it can be a good idea to convert all your text into its lemma form.

spaCy offers a handy way to lemmatize text, so once again, we're going to load a spaCy pipeline. Only this time, we don't need any pipeline component aside from the tokenizer. The tokenizer splits the text into separate words, usually at spaces. These individual words, or tokens, can then be used to look up their lemma. In our case, it looks like this:

import spacy
nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])  # in spaCy 3.x, load 'en_core_web_sm' instead of 'en'
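To get a feel for what the pipeline returns, we can lemmatize a single made-up sentence first; the exact lemmas can vary slightly between spaCy versions and models:

# Tokenize a sample sentence and look at the lemma of each token
doc = nlp('The mice were running away')
print([token.lemma_ for token in doc])
# 'mice' should map to 'mouse' and 'were' to 'be', for example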

Lemmatization can be slow, especially for big files, so it makes sense to track our progress. tqdm allows us to show progress bars on the pandas apply function. All we have to do is import tqdm as well as the notebook component for pretty rendering in our work environment. We then have to tell tqdm that we would like to use it with pandas. We can do this by running the following:

from tqdm import tqdm, tqdm_notebook

# Register progress_apply on pandas objects; the notebook variant renders nicely in Jupyter
tqdm_notebook.pandas()

We can now run progress_apply on a DataFrame just as we would use the standard apply method, but here it has a progress bar.

For each row, we loop over the words in the text column and save the lemma of the word in a new lemmas column:

df['lemmas'] = df["text"].progress_apply(lambda row: [w.lemma_ for w in nlp(row)])

Our lemmas column is now full of lists, so to turn the lists back into text, we will join all of the elements of the lists with a space as a separator, as we can see in the following code:

df['joint_lemmas'] = df['lemmas'].progress_apply(lambda row: ' '.join(row))
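It is worth spot-checking the result before moving on. Looking at a few rows of the original text next to the joined lemmas shows whether the lemmatization behaves as expected; the output is omitted here:

# Compare raw tweets with their lemmatized counterparts
df[['text', 'joint_lemmas']].head()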

Preparing the target

There are several possible prediction targets in this dataset. In our case, humans were asked to rate a tweet, and they were given three options: Relevant, Not Relevant, and Can't Decide, as the following code shows:

df.choose_one.unique()
array(['Relevant', 'Not Relevant', "Can't Decide"], dtype=object)

The tweets where humans could not decide whether they are about a real disaster are not interesting to us. Therefore, we will simply remove the Can't Decide category, which we can do in the following code:

df = df[df.choose_one != "Can't Decide"]

We are also only interested in mapping text to relevance, so we can drop all the other metadata and keep just these two columns, which we do here:

df = df[['text','choose_one']]

Finally, we're going to convert the target into numbers. This is a binary classification task, as there are only two categories. So, we map Relevant to 1 and Not Relevant to 0:

df['relevant'] = df.choose_one.map({'Relevant':1,'Not Relevant':0})
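Before splitting the data, a quick sanity check confirms that the mapping worked and shows how the two classes are distributed; the exact counts depend on the version of the dataset you downloaded:

# Distribution of the binary target; there should be no NaN values left
print(df.relevant.value_counts(dropna=False))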

Preparing the training and test sets

Before we start building models, we're going to split our data into two sets, the training dataset and the test dataset. To do this we simply need to run the following code:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['joint_lemmas'], 
                                                    df['relevant'], 
                                                    test_size=0.2,
                                                    random_state=42)
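A quick check of the resulting set sizes confirms the 80/20 split. If you want both sets to preserve the class balance, train_test_split also accepts a stratify argument, shown here as an optional variant:

# Roughly 80% of the tweets end up in the training set
print(len(X_train), len(X_test))

# Optional: keep the share of relevant tweets the same in both sets
X_train, X_test, y_train, y_test = train_test_split(df['joint_lemmas'],
                                                    df['relevant'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['relevant'])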