© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
M. Paluszek et al., Practical MATLAB Deep Learning, https://doi.org/10.1007/978-1-4842-7912-0_8

8. Completing Sentences

Michael Paluszek (Plainsboro, NJ, USA), Stephanie Thomas (Princeton, NJ, USA), and Eric Ham (Princeton, NJ, USA)

8.1 Introduction

8.1.1 Sentence Completion

Completing sentences is a useful feature for text entry systems. Given a set of possible sentences, we want the system to predict a missing part of the sentence. We will use the Microsoft Research Sentence Completion Challenge [46]. It is a database of 1040 sentences, each of which has one correct sentence and four imposter sentences. Each imposter sentence differs from the correct sentence by a single word in a fixed position, and the imposter words have occurrence statistics similar to the correct word's. The deep learning system should identify the correct word in each sentence. The sentences were selected from the Sherlock Holmes novels. The imposter words were generated using a language model trained on over 500 nineteenth-century novels: thirty alternatives to the correct word were produced, and human judges picked the four best imposter words from the 30. The database can be downloaded from Google Drive [27].

The first question in the database and the five answers, including the four imposters, are given as follows:

(b) and (d) don’t fit grammatically. (a) and (e) are incompatible with the beginning in which the speaker is recounting general information about the subject’s state. (c) makes the most sense. If after “are” we had “often seen,” then (a) and (e) would be possibilities, and (c) would no longer make sense. You would need additional information to determine if (a) or (e) were correct.

The first few recipes in this chapter are dedicated to preparing the data from our online source. The final recipe, 8.6, creates and trains a deep learning net to complete sentences.

8.1.2 Grammar

Grammar, the structure of a language, is important in interpreting sentences. Since not all of our readers speak English as their primary language, we'll give some examples in other languages.

In Russian, the word order is not fixed. You can always figure out whether the words are adjectives, verbs, nouns, and so forth from their declension and conjugation, but the word order is still important because it determines the emphasis. For example, to say "I am an engineer" in Russian:

Я инженер

We could reverse the order:

Инженер я

which would put the emphasis on "engineer" rather than "I." While it is easy to see that both sentences state "I am an engineer," we don't necessarily know how the speaker feels about it. This may not be important in rote translation but certainly makes a difference in literature.

Japanese is known as a subject-object-verb language: the verb comes at the end of the sentence. Japanese also makes use of particles to denote word function, such as subject or object. For the sentence completion problem, the particle would denote the function of the missing word, and the rest of the sentence would determine what the word might mean. Here are some particles:

"wa/ha" (は) indicates the topic, which could be the object or subject.

"wo/o" (を) indicates the object.

"ga" (が) indicates the subject.

For example, in Japanese

私はエンジニアです

or "watashi wa enjinia desu"

means "I am an engineer." は is the topic marker pointing to "I" (私), and です is the verb. We'd need other sentences to predict a missing word, be it です, 私 ("I"), or エンジニア ("engineer").

Japanese also has the feature that everything except the verb can be omitted:

いただきます

or "itadakimasu." This means "I will eat" whatever is given. You need to know the context, or have other sentences, to understand what is meant by the sentence.

In addition, in Japanese, many different Kanji, or symbols, can mean approximately the same thing, but the emphasis will be different. Other Kanji have different meanings depending on the context. Japanese also does not have any spaces between words; you just have to know when a kana character, like べ, is part of the preceding Kanji. Verb conjugation uses Hiragana characters to indicate past, present, negative, etc. For example, 食べる is "to eat": a Kanji root followed by two Hiragana to form the entire verb. By itself, it is a legitimate sentence, and you need the context to determine who is eating. The negative verb, "not eat," is 食べない, with the Hiragana ない replacing る to form the negative.

Every language needs to be approached a little differently when you are trying to do natural language processing.

8.1.3 Sentence Completion by Pattern Recognition

Our approach is sentence completion by pattern recognition. Given a database of your sentences, the pattern recognition algorithm should be able to recognize the patterns you use and find errors. Also, in most languages, dialog between people uses far fewer words and simpler structures than the written language. You will notice this if you watch a movie in a foreign language of which you have a passable knowledge; you can recognize a lot more than you would expect. Russian is an extreme in this regard: it is very hard to build vocabulary from reading because the language is so complex. Many Russian teachers teach the root system so that you can guess word meanings without constantly referring to a dictionary. Using word roots and sentence structure to guess words is a form of sentence completion. We'll leave that to our Russian readers.

8.1.4 Sentence Generation

As an aside, sentence completion leads to generative deep learning [13]. In generative deep learning, the neural network learns patterns and then can create new material. For example, a deep learning network might learn how a newspaper article is written and be able to generate new articles given basic facts the article is supposed to present. This is not a whole lot different than when writers are paid to write new books in a series such as Tom Swift or Nancy Drew. Presumably, the writer adds his or her personality to the story, but perhaps a reader, who just wants a page-turner, wouldn’t really care.

8.2 Generating a Database

8.2.1 Problem

We want to create a set of sentences accessible from MATLAB.

8.2.2 Solution

Read in the sentences from the database. Write a function, ReadDatabase.m, to read in tab-separated text.

8.2.3 How It Works

The database that we downloaded from Google Drive is a set of CSV files. We first open each file in Excel and save it as tab-delimited text. Once this is done, you are ready to read it into MATLAB. We do this for both test_answer.csv and testing_data.csv. We manually removed the first column of test_answer.csv in Excel because it was not needed. Only the txt files that we generated are needed in this book.

If you have the Statistics and Machine Learning Toolbox, you could use tdfread, s = tdfread(file,delimiter). We'll write the equivalent. There are four outputs shown in the header: the sentences, the range of characters where the word needed for completion fits, the five possible words, and the answer.

We open the file using f = fopen('testing_data.txt','rt');, which opens it for reading in text mode. We search for tabs and append the end-of-line position so that we can find the last word. The second read reads in the test answers and converts them from characters to numbers. We removed all extraneous quotes from the text file with a text editor.
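The full ReadDatabase.m ships with the book and is not reproduced here; a minimal sketch of the tab-scanning idea, with the loop structure and variable names assumed, is:

```matlab
% Minimal sketch of reading tab-delimited text, assuming each line
% holds a sentence followed by candidate words, separated by tabs.
f = fopen('testing_data.txt','rt');     % open for reading in text mode
k = 0;
while ~feof(f)
  tLine = fgetl(f);                              % one line of text
  j     = [strfind(tLine,sprintf('\t')) length(tLine)+1]; % tabs plus end of line
  k     = k + 1;
  s(k,1) = convertCharsToStrings(tLine(1:j(1)-1));        % the sentence
  for i = 1:length(j)-1
    w(k,i) = convertCharsToStrings(tLine(j(i)+1:j(i+1)-1)); % candidate words
  end
end
fclose(f);
```

Appending length(tLine)+1 to the tab positions is what lets the loop pick off the final word, as described above.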
If we run the function, we get the following outputs:

All outputs (except for the answer number) are strings. convertCharsToStrings does the conversion. Now that we have all of the data in MATLAB, we are ready to train the deep learning system to determine the best word for each sentence. As an intermediate step, we will convert the words to numbers.

8.3 Creating a Numeric Dictionary

8.3.1 Problem

We want to create a numeric dictionary to speed neural net training. This eliminates the need for string matching during the training process. Expressing a sentence as a numeric sequence as opposed to a sequence of character arrays (words) essentially gives us a more efficient way to represent the sentence. This will become useful later when we perform machine learning over a database of sentences to learn valid and invalid sequences.

8.3.2 Solution

Write a MATLAB function to search through text and find unique words.

8.3.3 How It Works

The function removes punctuation using erase in the following lines of code.
It then uses split to break up the string and finds unique strings using unique.
This is the built-in demo. It finds 38 unique words.

d is a string array and maps onto array n.
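A sketch of such a dictionary function, under the assumptions above (erase for punctuation, then split and unique; the punctuation list and function name are illustrative, not the book's exact code):

```matlab
% Sketch: build a numeric dictionary from a string of text.
function [d, n] = NumericDictionary( s )
  s = erase(s, ["." "," ";" ":" "?" "!" '"']); % strip punctuation (illustrative list)
  w = split(s);          % break the string into words
  d = unique(w);         % unique words form the dictionary
  n = 1:length(d);       % each word maps onto a number
end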

8.4 Mapping Sentences to Numbers

8.4.1 Problem

We want to map words in sentences to unique numbers.

8.4.2 Solution

Write a MATLAB function to search through text and assign a unique number to each word. This approach will have problems with homonyms.

8.4.3 How It Works

The function splits the string and searches using d. The last line removes any words (in this case, only punctuation) that are not in the dictionary.
This is the built-in demo.
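A sketch of the mapping function, assuming the dictionary d from the previous recipe (the punctuation list and exact search are illustrative):

```matlab
% Sketch: map a sentence onto dictionary indices. Words not found
% in the dictionary d (here, only punctuation) are removed.
function n = MapToNumbers( s, d )
  w = split(erase(s, ["." "," ";" ":" "?" "!"])); % words of the sentence
  n = zeros(1,length(w));
  for k = 1:length(w)
    j = find(strcmpi(w(k), d), 1); % index of the word in the dictionary
    if ~isempty(j)
      n(k) = j;
    end
  end
  n = n(n > 0); % the last line removes words not in the dictionary
end
```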

8.5 Converting the Sentences

8.5.1 Problem

We want to convert the sentences to numeric sequences.

8.5.2 Solution

Write a MATLAB script to take each sentence in our database, add the words, and create a sequence. Each sentence is classified as correct or incorrect.

8.5.3 How It Works

The script reads in the database. It creates a numeric dictionary for all of the sentences and then converts them to numbers. The numeric data is then saved to a mat-file for easy access later. This first part of the script creates 5200 sentences. Each sentence is classified as correct or incorrect. Note how we initialize a string array.
The next section concatenates all of the sentences into a gigantic string and creates a dictionary.
The final part creates the numeric sentences and saves them. This part uses MapToNumbers from the previous recipe. The loop that prints the lines shows a handy way of printing an array using fprintf.
As expected, only one word is different in each set of five sentences.
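The steps above can be sketched as follows; the dictionary function and mat-file names are hypothetical, and filling of s and c from the database is elided:

```matlab
% Sketch of the conversion script. Each of the 1040 questions
% contributes five sentences, one correct (c = 1), four not (c = 0).
nQ = 1040;                 % questions in the database
s  = strings(5*nQ,1);      % initialize a string array of sentences
c  = zeros(5*nQ,1);        % 1 if correct, 0 if an imposter
% ... fill s and c from the database, then:
allText = join(s);                  % one gigantic string
d       = NumericDictionary(allText); % hypothetical dictionary function
x       = cell(5*nQ,1);
for k = 1:length(s)
  x{k} = MapToNumbers(s(k), d);     % numeric sequence for each sentence
end
save('Sentences','x','c','d');      % save to a mat-file for easy access
```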

8.6 Training and Testing

8.6.1 Problem

We want to build a deep learning system to complete sentences. The idea is that the full database of correct and incorrect sentences provides enough information for the neural net to determine which word is the correct word for the sentence.

8.6.2 Solution

Write a MATLAB script to implement an LSTM to classify the sentences as correct or incorrect. The LSTM will be trained with complete sentences. No information about the words, such as whether a word is noun, verb, or adjective, nor of any grammatical structure will be used.

8.6.3 How It Works

We will produce the simplest possible design. It will read in sentences, classified as correct or incorrect, and attempt to determine if new sentences are correct or incorrect just from the learned patterns. This is a very simple and crude approach. We aren't taking advantage of our knowledge of grammar, word types (verb, noun, etc.), or context to help with the predictions. Language modeling is a huge field, and we are not using any results from that body of work. Of course, applying all of the rules of grammar doesn't necessarily ensure success; otherwise, there would be more 800s on the SAT verbal test. We'll show two different sets of layers: the first will produce a good fit, and the second will overfit.

We use the same code as the previous recipe to make sure the sequences are valid. We use a clear all so that the sentences are always the same.
The layers were designed to get a well-fitted training. As the training plot shows, the validation error is the same as the training error. Because we have access to full sequences at prediction time, we use a bidirectional LSTM layer in the network. A bidirectional LSTM layer learns from the full sequence at each step. We use three BiLSTM layers with fully connected layers and dropouts in between.
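A representative layer stack for this design, with hypothetical layer sizes (the book's exact values may differ), might be assembled as:

```matlab
% Representative three-BiLSTM stack; sizes are illustrative.
numHidden = 100;
layers = [ ...
  sequenceInputLayer(1)                          % one-dimensional sequences
  bilstmLayer(numHidden,'OutputMode','sequence') % learns from the full sequence
  fullyConnectedLayer(numHidden)
  dropoutLayer(0.2)
  bilstmLayer(numHidden,'OutputMode','sequence')
  fullyConnectedLayer(numHidden)
  dropoutLayer(0.2)
  bilstmLayer(numHidden,'OutputMode','last')     % last output feeds the classifier
  fullyConnectedLayer(2)                         % two classes: correct/incorrect
  softmaxLayer
  classificationLayer];
```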
The second test uses only correct sentences; the network never identifies any of them as correct. The output of this section is
The first layer says the input is a one-dimensional sequence. The second is the bidirectional LSTM. The next layer is a fully connected layer of neurons. This is repeated with a dropout layer in between. Figure 8.1 shows the training progress. This is followed by a softmax layer and then by the classification layer. The standard softmax is
$$\displaystyle \begin{aligned} \sigma_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} \end{aligned} $$
(8.1)
which is essentially a normalized output.
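As a quick numerical check of Equation 8.1 (the input vector is arbitrary):

```matlab
% Softmax of a small vector; the outputs are positive and sum to one.
z     = [1 2 3];
sigma = exp(z)/sum(exp(z));  % approximately [0.0900 0.2447 0.6652]
sum(sigma)                   % equals 1
```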
Figure 8.1

Training progress.

The testing code is

The results are no better than guessing that all the sentences are wrong. When we test the neural net, it never classifies any sentence as correct, so it really doesn't solve our problem. Given that only 20% of the sentences are correct, the neural net scores 80% by saying they are all incorrect.

Our second neural net has a different structure. We have two BiLSTM layers without the fully connected layer in between. Both BiLSTM layers have the same number of hidden units. This combination was found after trying many different ones.

The training code follows. We convert the classes, 0 and 1, to a categorical variable.
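A sketch of what that training step might look like, assuming x is the cell array of numeric sequences and c the 0/1 label vector from the previous recipe (the training options are illustrative):

```matlab
% Sketch of training: convert 0/1 classes to categorical and train.
yTrain  = categorical(c);           % categorical labels for the classifier
options = trainingOptions('adam', ...
  'MaxEpochs',40, ...
  'Shuffle','every-epoch', ...
  'Plots','training-progress');
net = trainNetwork(x, yTrain, layers, options);
```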
The output is
The testing code is shown as follows:
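A sketch of the testing step, assuming xTest holds the numeric test sequences and cTest their 0/1 labels (the names are illustrative):

```matlab
% Sketch of testing: classify held-out sentences and score accuracy.
yPred    = classify(net, xTest);    % predicted class per sentence
accuracy = sum(yPred == categorical(cTest))/numel(cTest);
fprintf('Accuracy: %8.2f%%\n', 100*accuracy);
```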
Figure 8.2 shows the training. As the training progresses, the validation loss continues to grow. Because it is higher than the training loss, we are overfitting; that is, we have too many neurons. Our training accuracy is over 73%. The training loss continues to improve while the validation accuracy gets worse. The accuracy is worse than in the well-fitted case. However, this network will classify some sentences as correct, instead of just declaring them all wrong and collecting the default 80% accuracy. In that way it is an improvement, though you still wouldn't want to use it as an SAT aid.
Figure 8.2

Training progress.

This network works better than the previous one, despite overfitting. It thinks some sentences are correct and gets that right 12.5% of the time. This may indicate that we do not have enough data for the neural network to form a grammar. The poor performance also helps to show that NLP is a tough problem with many facets of research. You would expect to do better if you tried to take advantage of the grammatical structure of sentences, classifying words into different types, etc.

We’ve shown that we can get a neural network to be somewhat successful in sentence completion. The problem the training faces is that it can be 80% successful by saying there are no correct fits. This simple approach might work better if we had a larger database of sentences. We didn’t use all of the sentences that are available with the book so you can try expanding the set. With enough examples, the network might begin to learn the grammar. It would also be interesting to try this with different languages.
