Creating a pipeline to search text

Searching is a rich and complex topic, with many types of searches and many ways to perform them. The intent here is to demonstrate how various NLP techniques can be applied to support this effort. A single text document can usually be processed in a reasonable amount of time on most machines. However, when multiple large documents need to be searched, creating an index is a common way to keep the search process fast. We will demonstrate one approach to creating an index and then searching with it. Although the text we will use is not that large, it is sufficient to demonstrate the process.
We need to do the following:

  • Read the text from the file
  • Tokenize and find sentence boundaries
  • Remove stop words
  • Accumulate the index statistics
  • Write out the index file

There are several factors that influence the contents of an index file, including:

  • Removal of stop words
  • Case-sensitive searches
  • Finding synonyms
  • Using stemming and lemmatization
  • Allowing searches across sentence boundaries

We will use OpenNLP to demonstrate this process. The intent of this example is to demonstrate how to combine NLP techniques in a pipeline process to solve a search-type problem. This is not a comprehensive solution and we will ignore some techniques, such as stemming. In addition, the actual creation of an index file will not be presented but rather left as an exercise for the reader. Here, we will focus on how NLP techniques can be used. Specifically, we will do the following:

  • Split the book into sentences
  • Convert the sentences to lowercase
  • Remove stop words
  • Create an internal index data structure

We will develop two classes to support the index data structure: Word and Positions. We will also augment the StopWords class, developed in Chapter 2, Finding Parts of Text, to support an overloaded version of the removeStopWords method. The new version will provide a more convenient method for removing stop words. We start with a try-with-resources block to open streams for the sentence model, en-sent.bin, and a file containing the contents of Twenty Thousand Leagues Under the Sea, by Jules Verne. The book was downloaded from http://www.gutenberg.org/ebooks/164. The following code shows a working example of the search:

try (InputStream is = new FileInputStream(new File(getResourcePath() + "en-sent.bin"));
        FileReader fr = new FileReader(getResourcePath() + "pg164.txt");
        BufferedReader br = new BufferedReader(fr)) {
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME detector = new SentenceDetectorME(model);

    String line;
    StringBuilder sb = new StringBuilder();
    while ((line = br.readLine()) != null) {
        sb.append(line + " ");
    }
    String sentences[] = detector.sentDetect(sb.toString());
    for (int i = 0; i < sentences.length; i++) {
        sentences[i] = sentences[i].toLowerCase();
    }

    // StopWords stopWords = new StopWords("stop-words_english_2_en.txt");
    // for (int i = 0; i < sentences.length; i++) {
    //     sentences[i] = stopWords.removeStopWords(sentences[i]);
    // }

    HashMap<String, Word> wordMap = new HashMap<>();
    for (int sentenceIndex = 0; sentenceIndex < sentences.length; sentenceIndex++) {
        String words[] = WhitespaceTokenizer.INSTANCE.tokenize(sentences[sentenceIndex]);
        Word word;
        for (int wordIndex = 0; wordIndex < words.length; wordIndex++) {
            String newWord = words[wordIndex];
            if (wordMap.containsKey(newWord)) {
                word = wordMap.remove(newWord);
            } else {
                word = new Word();
            }
            word.addWord(newWord, sentenceIndex, wordIndex);
            wordMap.put(newWord, word);
        }
    }

    Word sword = wordMap.get("sea");
    ArrayList<Positions> positions = sword.getPositions();
    for (Positions position : positions) {
        System.out.println(sword.getWord() + " is found at line "
            + position.sentence + ", word " + position.position);
    }
} catch (FileNotFoundException ex) {
    Logger.getLogger(SearchText.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
    Logger.getLogger(SearchText.class.getName()).log(Level.SEVERE, null, ex);
}
class Positions {
    int sentence;
    int position;

    Positions(int sentence, int position) {
        this.sentence = sentence;
        this.position = position;
    }
}


public class Word {
    private String word;
    private final ArrayList<Positions> positions;

    public Word() {
        this.positions = new ArrayList<>();
    }

    public void addWord(String word, int sentence, int position) {
        this.word = word;
        Positions counts = new Positions(sentence, position);
        positions.add(counts);
    }

    public ArrayList<Positions> getPositions() {
        return positions;
    }

    public String getWord() {
        return word;
    }
}

Let's break up the code to understand it. The SentenceModel is used to create an instance of the SentenceDetectorME class, as shown here:

SentenceModel model = new SentenceModel(is);
SentenceDetectorME detector = new SentenceDetectorME(model);

Next, we read the book's file line by line and append each line to a StringBuilder instance so that sentence boundaries can be detected across line breaks. The sentDetect method is then applied to create an array of sentences, and the toLowerCase method converts each sentence to lowercase. This ensures that, when stop words are removed, the removal method catches all of them, as shown here:

String line;
StringBuilder sb = new StringBuilder();
while ((line = br.readLine()) != null) {
    sb.append(line + " ");
}
String sentences[] = detector.sentDetect(sb.toString());
for (int i = 0; i < sentences.length; i++) {
    sentences[i] = sentences[i].toLowerCase();
}
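The stop-word removal step is commented out in the main listing, but the overloaded removeStopWords method mentioned earlier is straightforward to add to the Chapter 2 StopWords class. The following is a minimal sketch, not the book's listing; it assumes the class keeps its stop words in a HashSet<String> field named stopWords, which may differ from your version of the class:

// A possible overload for the StopWords class from Chapter 2. It assumes the
// stop words are stored in a HashSet<String> field named stopWords.
public String removeStopWords(String words) {
    String tokens[] = WhitespaceTokenizer.INSTANCE.tokenize(words);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < tokens.length; i++) {
        // Keep only the tokens that are not in the stop-word set
        if (!stopWords.contains(tokens[i])) {
            sb.append(tokens[i] + " ");
        }
    }
    return sb.toString().trim();
}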

The next step is to create an index-like data structure based on the processed text. This structure uses the Word and Positions classes. The Word class consists of a field for the word itself and an ArrayList of Positions objects. Since a word may appear more than once in a document, the list maintains every position at which it occurs. The Positions class contains a field for the sentence number, sentence, and a field for the position of the word within the sentence, position. Both of these classes are defined here:

class Positions {
    int sentence;
    int position;

    Positions(int sentence, int position) {
        this.sentence = sentence;
        this.position = position;
    }
}


public class Word {
    private String word;
    private final ArrayList<Positions> positions;

    public Word() {
        this.positions = new ArrayList<>();
    }

    public void addWord(String word, int sentence, int position) {
        this.word = word;
        Positions counts = new Positions(sentence, position);
        positions.add(counts);
    }

    public ArrayList<Positions> getPositions() {
        return positions;
    }

    public String getWord() {
        return word;
    }
}
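As a quick sanity check, the two classes can be exercised on their own before the full pipeline is run. The word and position values below are made up purely for illustration:

// Record two hypothetical occurrences of a word and read the positions back
Word w = new Word();
w.addWord("nautilus", 0, 4);   // first occurrence: sentence 0, word 4
w.addWord("nautilus", 3, 1);   // second occurrence: sentence 3, word 1
for (Positions p : w.getPositions()) {
    System.out.println(w.getWord() + " -> sentence " + p.sentence
        + ", word " + p.position);
}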

To use these classes, we create a HashMap instance to hold positional information about each word in the file. The creation of the word entries in the map is shown in the following code. Each sentence is tokenized, and then each token is checked to see whether it exists in the map, with the word serving as the key. The containsKey method determines whether the word has already been added. If it has, the existing Word instance is removed from the map so it can be updated; otherwise, a new Word instance is created. Either way, the new positional information is added to the Word instance, which is then put back into the map, as shown here:

HashMap<String, Word> wordMap = new HashMap<>();
for (int sentenceIndex = 0; sentenceIndex < sentences.length; sentenceIndex++) {
    String words[] = WhitespaceTokenizer.INSTANCE.tokenize(sentences[sentenceIndex]);
    Word word;
    for (int wordIndex = 0; wordIndex < words.length; wordIndex++) {
        String newWord = words[wordIndex];
        if (wordMap.containsKey(newWord)) {
            word = wordMap.remove(newWord);
        } else {
            word = new Word();
        }
        word.addWord(newWord, sentenceIndex, wordIndex);
        wordMap.put(newWord, word);
    }
}
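As a side note, if Java 8 or later is available, the containsKey/remove/put sequence in the inner loop can be collapsed into a single call. This is an optional alternative, not part of the listing above:

// Optional alternative (Java 8+): computeIfAbsent creates and stores a new
// Word only when the key is missing, avoiding the remove/put round trip.
Word word = wordMap.computeIfAbsent(newWord, k -> new Word());
word.addWord(newWord, sentenceIndex, wordIndex);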

To demonstrate the actual lookup process, we use the get method to return the Word instance for the word "sea". The list of positions is returned with the getPositions method, and then each position is displayed, as shown here:

Word sword = wordMap.get("sea");
ArrayList<Positions> positions = sword.getPositions();
for (Positions position : positions) {
System.out.println(sword.getWord() + " is found at line "
+ position.sentence + ", word "
+ position.position);
}

The output is as follows:

sea is found at line 0, word 7
sea is found at line 2, word 6
sea is found at line 2, word 37
sea is found at line 3, word 5
sea is found at line 20, word 11
sea is found at line 39, word 3
sea is found at line 46, word 6
sea is found at line 57, word 4
sea is found at line 133, word 2
sea is found at line 229, word 3
sea is found at line 281, word 14
sea is found at line 292, word 12
sea is found at line 320, word 22
sea is found at line 328, word 21
sea is found at line 355, word 22
sea is found at line 363, word 1
sea is found at line 391, word 13
sea is found at line 395, word 6
sea is found at line 450, word 12
sea is found at line 460, word 6
.....
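Note that get returns null when a word is not in the index, so a real search front end should guard the lookup. A small hypothetical helper along these lines could be used; the method name is illustrative and not part of the original example:

// Hypothetical helper for looking up a word safely; it reports a missing word
// instead of throwing a NullPointerException.
private static void printPositions(HashMap<String, Word> wordMap, String target) {
    Word entry = wordMap.get(target);
    if (entry == null) {
        System.out.println(target + " was not found in the index");
        return;
    }
    for (Positions position : entry.getPositions()) {
        System.out.println(entry.getWord() + " is found at line "
            + position.sentence + ", word " + position.position);
    }
}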

This implementation is relatively simple but does demonstrate how to combine various NLP techniques to create and use an index data structure that can be saved as an index file. Other enhancements are possible, including the following:

  • Other filter operations
  • Storing document information in the Positions class
  • Storing chapter information in the Positions class
  • Providing search options, such as:
    • Case-sensitive searches
    • Exact text searches
  • Better exception handling

These are left as exercises for the reader.
