8.1 Introduction
8.1.1 Sentence Completion
Completing sentences is a useful feature for text entry systems. Given a set of possible sentences, we want the system to predict the missing part of the sentence. We will use the Microsoft Research Sentence Completion Challenge [46]. It is a database of 1,040 sentences, each of which has one correct version and four imposter versions. Each imposter sentence differs from the correct sentence by a single word in a fixed position, and the imposter words have occurrence statistics similar to those of the correct words. The deep learning system should identify the correct word for each sentence. The sentences were selected from Sherlock Holmes novels. The imposter words were generated using a language model trained on over 500 nineteenth-century novels; thirty alternatives to each correct word were produced, and human judges picked the four best imposter words from the thirty. The database can be downloaded from Google Drive [27].
(b) and (d) don’t fit grammatically. (a) and (e) are incompatible with the beginning in which the speaker is recounting general information about the subject’s state. (c) makes the most sense. If after “are” we had “often seen,” then (a) and (e) would be possibilities, and (c) would no longer make sense. You would need additional information to determine if (a) or (e) were correct.
The first few recipes in this chapter are dedicated to preparing the data from our online source. The final recipe, 8.6, creates and trains a deep learning net to complete sentences.
8.1.2 Grammar
Grammar, that is, the structure of a language, is important in interpreting sentences. Since not all of our readers speak English as their primary language, we'll give some examples in other languages.
Japanese is a good example. "I am an engineer" is "watashi wa enjinia desu," where the particle "wa" marks "watashi," "I," as the topic, which would mean the emphasis is on "engineer" not "I." Replacing "wa" with "ga," which indicates the subject, shifts the emphasis to "I." While it is easy to know that either sentence is stating that "I am an engineer," we don't necessarily know how the speaker feels about it. This may not be important in rote translation but certainly makes a difference in literature.

Japanese also routinely omits words that English requires, such as "I" or "engineer." The single word "itadakimasu" means "I will eat" whatever is given; you need to know the context, or have other sentences, to understand what is meant. Verb endings carry grammar as well: "desu" becomes "dewa arimasen" to form the negative.
Every language needs to be approached a little differently when you are trying to do natural language processing.
8.1.3 Sentence Completion by Pattern Recognition
Our approach is sentence completion by pattern recognition. Given a database of your sentences, the pattern recognition algorithm should be able to recognize the patterns you use and find errors. In most languages, dialog between people uses far fewer words and simpler structures than the written language. You will notice this if you watch a movie in a foreign language of which you have passable knowledge: you can recognize a lot more than you would expect. Russian is an extreme in this regard; it is very hard to build vocabulary from reading because the language is so complex. Many Russian teachers teach the root system so that you can guess word meanings without constantly referring to a dictionary. Using word roots and sentence structure to guess words is a form of sentence completion. We'll leave that to our Russian readers.
8.1.4 Sentence Generation
As an aside, sentence completion leads to generative deep learning [13]. In generative deep learning, the neural network learns patterns and then can create new material. For example, a deep learning network might learn how a newspaper article is written and be able to generate new articles given basic facts the article is supposed to present. This is not a whole lot different than when writers are paid to write new books in a series such as Tom Swift or Nancy Drew. Presumably, the writer adds his or her personality to the story, but perhaps a reader, who just wants a page-turner, wouldn’t really care.
8.2 Generating a Database
8.2.1 Problem
We want to create a set of sentences accessible from MATLAB.
8.2.2 Solution
Read in the sentences from the database. Write a function, ReadDatabase.m, to read in tab-separated text.
8.2.3 How It Works
The database that we downloaded from Google Drive consists of .csv files. We first open each file in Excel and save it as tab-delimited text. Once this is done, you are ready to read it into MATLAB. We do this for both test_answer.csv and testing_data.csv. We manually removed the first column of test_answer.csv in Excel because it was not needed. Only the .txt files that we generated are needed in this book.
If you have the Statistics and Machine Learning Toolbox, you could use tdfread, s = tdfread(file,delimiter); we'll write the equivalent. There are four outputs shown in the function header: the sentences, the range of characters where the missing word fits, the five possible words, and the answer.
All outputs (except for the answer number) are strings. convertCharsToStrings does the conversion. Now that we have all of the data in MATLAB, we are ready to train the deep learning system to determine the best word for each sentence. As an intermediate step, we will convert the words to numbers.
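The book's implementation is in MATLAB; as a language-agnostic sketch, the same tab-delimited parsing could look like the following Python. The column layout (sentence, start and stop character indices, then the five candidate words) is an assumption for illustration and should be adjusted to match the actual file.

```python
# Sketch of a tdfread-style reader for the tab-delimited sentence database.
# The column order below is an assumption, not the book's exact format.

def read_database(path):
    sentences, ranges, words = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # skip malformed or header lines
            sentences.append(fields[0])
            ranges.append((int(fields[1]), int(fields[2])))  # character range of the blank
            words.append(fields[3:8])                        # the five candidate words
    return sentences, ranges, words
```

A reader like this returns everything as strings except the character range, mirroring the MATLAB outputs described above.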
8.3 Creating a Numeric Dictionary
8.3.1 Problem
We want to create a numeric dictionary to speed neural net training. This eliminates the need for string matching during the training process. Expressing a sentence as a numeric sequence as opposed to a sequence of character arrays (words) essentially gives us a more efficient way to represent the sentence. This will become useful later when we perform machine learning over a database of sentences to learn valid and invalid sequences.
8.3.2 Solution
Write a MATLAB function to search through text and find unique words.
8.3.3 How It Works
The output d is a string array of the unique words, and it maps onto the numeric array n: the kth word in d corresponds to the kth number in n.
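The idea can be sketched as follows (Python for illustration; the tokenizer regex is an assumption, since the book's MATLAB function may split words differently):

```python
import re

def build_dictionary(sentences):
    """Return a list of unique words (d) and a word -> number map (n)."""
    d = []   # unique words, in order of first appearance
    n = {}   # word -> numeric token
    for sentence in sentences:
        # Lowercase and keep only letters and apostrophes (an assumed policy).
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word not in n:
                n[word] = len(d)
                d.append(word)
    return d, n
```

Each word's position in d is its number, so lookups during training become integer comparisons instead of string matching.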
8.4 Mapping Sentences to Numbers
8.4.1 Problem
We want to map words in sentences to unique numbers.
8.4.2 Solution
Write a MATLAB function to search through text and assign a unique number to each word. Note that this approach will have problems with homonyms: words that are spelled the same but have different meanings receive the same number.
8.4.3 How It Works
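As a sketch of the mapping step (Python for illustration, using a dictionary built as in the previous recipe; skipping unknown words is one possible policy, and a reserved "unknown" token is another):

```python
import re

def sentence_to_numbers(sentence, n):
    """Replace each word of a sentence with its number from the dictionary n.

    Words not present in the dictionary are skipped (an assumed policy).
    """
    return [n[w] for w in re.findall(r"[a-z']+", sentence.lower()) if w in n]
```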
8.5 Converting the Sentences
8.5.1 Problem
We want to convert the sentences to numeric sequences.
8.5.2 Solution
Write a MATLAB script to take each sentence in our database, replace its words with their dictionary numbers, and create a numeric sequence. Each sentence is classified as correct or incorrect.
8.5.3 How It Works
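The conversion can be sketched like this (Python for illustration; the grouping of candidate sentences per question and the 0/1 labels follow the description above, while the function and variable names are assumptions):

```python
def build_training_set(sentence_groups, answers, n):
    """Convert candidate sentences to labeled numeric sequences.

    sentence_groups[i] holds the candidate sentences for question i (five
    per question in this database); answers[i] is the index of the correct
    one. Returns (sequence, label) pairs with label 1 = correct, 0 = not.
    """
    pairs = []
    for group, answer in zip(sentence_groups, answers):
        for k, sentence in enumerate(group):
            seq = [n[w] for w in sentence.lower().split() if w in n]
            pairs.append((seq, 1 if k == answer else 0))
    return pairs
```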
8.6 Training and Testing
8.6.1 Problem
We want to build a deep learning system to complete sentences. The idea is that the full database of correct and incorrect sentences provides enough information for the neural net to determine which word is the correct word for the sentence.
8.6.2 Solution
Write a MATLAB script to implement an LSTM that classifies the sentences as correct or incorrect. The LSTM will be trained with complete sentences. No information about the words, such as whether a word is a noun, verb, or adjective, nor any grammatical structure, will be used.
8.6.3 How It Works
We will produce the simplest possible design. It will read in sentences classified as correct or incorrect and attempt to determine whether new sentences are correct or incorrect purely from the learned patterns. This is a very simple and crude approach. We aren't taking advantage of our knowledge of grammar, word types (verb, noun, etc.), or context to help with the predictions. Language modeling is a huge field, and we are not using any results from that body of work. Of course, applying all of the rules of grammar doesn't necessarily ensure success; otherwise, there would be more 800s on the SAT verbal test. We'll show two different sets of layers. The first will produce a good fit, and the second will overfit.
The results are no better than guessing that all the sentences are wrong. When we test the neural net, it never thinks any sentence is correct, which really doesn't solve our problem. Given that only 20% of the sentences are correct, the neural net scores 80% by saying they are all incorrect.
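The 80% figure follows directly from the class balance; a quick sanity check:

```python
# One correct sentence per five candidates, so a classifier that always
# answers "incorrect" is right four times out of five.
n_questions = 1040
n_sentences = 5 * n_questions              # 5,200 candidate sentences
n_correct = n_questions                    # one correct sentence per question
baseline_accuracy = (n_sentences - n_correct) / n_sentences
print(baseline_accuracy)  # 0.8
```

Any useful classifier therefore has to beat this 80% majority-class baseline, not just score well in absolute terms.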
Our second neural net has a different structure: two BiLSTM layers without a fully connected layer in between. Both BiLSTM layers have the same number of hidden units. This architecture was found after trying many different combinations.
This network works better than the previous one, despite overfitting. It does classify some sentences as correct and gets those right 12.5% of the time. This may indicate that we do not have enough data for the neural network to infer a grammar. The poor performance also helps to show that NLP is a tough problem with many active areas of research. You would expect to do better by taking advantage of the grammatical structure of sentences, classifying words into different types, and so on.
We've shown that we can get a neural network to be somewhat successful at sentence completion. The problem the training faces is that it can be 80% successful by saying there are no correct fits. This simple approach might work better if we had a larger database of sentences. We didn't use all of the sentences available with the book, so you can try expanding the set. With enough examples, the network might begin to learn the grammar. It would also be interesting to try this with different languages.