Text indexes are special indexes on string-valued fields, used to support text searches. This book is based on version 3 of the text index functionality, available since MongoDB 3.2.
A text index can be specified similarly to a regular index, by replacing the index sort order (-1, 1) with the word text, as follows:
> db.books.createIndex({"name": "text"})
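Once the index exists, it is queried through the $text operator. The following is a minimal sketch, assuming the books collection used above; the search term mongodb is made up for illustration. The guard lets the snippet run inside the mongo shell, where the global db exists, and stay inert anywhere else.

```javascript
// Index specification from the text above: a text index on the "name" field.
const nameTextIndex = { name: "text" };

// A $text query; "mongodb" is a hypothetical search term.
const textFilter = { $text: { $search: "mongodb" } };

// Only runs inside the mongo shell, where the global `db` is defined.
if (typeof db !== "undefined") {
  db.books.createIndex(nameTextIndex);
  db.books.find(textFilter).forEach((book) => printjson(book));
}
```

Note that the query does not name the indexed field: $text always searches whatever fields the collection's text index covers.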
Since a collection can have only one text index, we need to choose its fields wisely. Rebuilding this text index can take quite some time, and being limited to one per collection makes maintenance quite tricky, as you will see toward the end of this chapter.
Luckily, this index can also be a compound index:
> db.books.createIndex( { "available": 1, "meta_data.page_count": 1, "$**": "text" } )
In a compound index with text fields, the keys that precede the text key act as a prefix, as we explained earlier in this chapter, but with stricter rules than for a regular compound index. A $text query can use this index only if it also includes equality conditions on the preceding keys, here available and meta_data.page_count. In addition, a text index cannot be used to provide the sort order of the results.
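As a sketch of a query shaped for this compound index, the hypothetical helper below builds a filter with equality conditions on both keys that precede the text key, plus the $text search itself. Our understanding is that MongoDB expects these equality conditions for a $text search on a compound text index. The function name, the page count, and the search terms are made up for illustration.

```javascript
// Hypothetical helper: builds a filter for the compound text index
// { available: 1, "meta_data.page_count": 1, "$**": "text" }.
function buildCompoundTextFilter(available, pageCount, searchTerms) {
  return {
    available: available,               // equality on the first prefix key
    "meta_data.page_count": pageCount,  // equality on the second prefix key
    $text: { $search: searchTerms },    // the text search itself
  };
}

// Example values are assumptions, not data from the book.
const compoundFilter = buildCompoundTextFilter(true, 256, "mongodb indexing");

// Only runs inside the mongo shell, where the global `db` is defined.
if (typeof db !== "undefined") {
  db.books.find(compoundFilter).forEach((book) => printjson(book));
}
```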
We can also blindly index each and every field in a document that contains strings as text:
> db.books.createIndex( { "$**": "text" } )
This can result in unbounded index sizes and should be avoided; however, it can be useful if we have unstructured data (for example, data coming straight from application logs, where we don't know which fields may be useful and we want to be able to query as many of them as possible).
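To get a feel for how much a wildcard text index can pull in, the sketch below lists the dotted paths of every string-valued field in a sample document. It is a plain JavaScript illustration, not how MongoDB builds the index; the stringFieldPaths function and the logEntry document are hypothetical.

```javascript
// Recursively collect the dotted paths of all string-valued fields.
function stringFieldPaths(value, path = "", found = new Set()) {
  if (typeof value === "string") {
    found.add(path);
  } else if (Array.isArray(value)) {
    // Array elements share the path of the array field itself.
    for (const item of value) stringFieldPaths(item, path, found);
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      stringFieldPaths(child, path ? `${path}.${key}` : key, found);
    }
  }
  return [...found];
}

// A made-up document, shaped like an application log entry.
const logEntry = {
  level: "error",
  status: 500,
  request: { path: "/books", headers: { agent: "curl" } },
  tags: ["api", "timeout"],
};

console.log(stringFieldPaths(logEntry));
// → [ 'level', 'request.path', 'request.headers.agent', 'tags' ]
```

Every new string field that shows up in the data adds to this list, which is why the index size is hard to bound in advance.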
Text indexes apply stemming (removing common suffixes, such as the plural s/es for English language words) and remove stop words (a, an, the, and so on) from the index. They also have the following properties:
- Case-insensitivity and diacritic insensitivity: A text index is case- and diacritic-insensitive. Version 3 of the text index (the default since version 3.2) supports the common C, the simple S, and, for Turkish languages, the special T case foldings, as described in Unicode Character Database (UCD) 8.0 case folding. In addition to case-insensitivity, version 3 of the text index supports diacritic insensitivity, which extends insensitivity to accented characters in both lowercase and uppercase form. For example, e, è, é, ê, ë, and their capital letter counterparts are all treated as equal when compared using a text index. In the previous versions of the text index, these were treated as different strings.
- Tokenization delimiters: Version 3 of the text index treats as tokenization delimiters the characters categorized as Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space in the UCD 8.0 property list.
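The processing described above (folding case and diacritics, splitting on delimiters, dropping stop words, and stemming) can be approximated in a few lines of plain JavaScript. This is only a rough sketch to make the idea concrete: MongoDB's own implementation uses full Unicode case folding, per-language stop word lists, and a proper stemmer, whereas the stop word list and the plural-stripping rules below are simplified assumptions, and all function names are hypothetical.

```javascript
// A tiny, illustrative stop word list (an assumption, not MongoDB's list).
const STOP_WORDS = new Set(["a", "an", "the", "and", "of"]);

// Approximate case and diacritic folding.
function fold(text) {
  return text
    .normalize("NFD")                 // split letters from their accents
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining accent marks
    .toLowerCase();
}

// Naive English stemming: strip a plural "es" or "s" suffix.
function stem(word) {
  if (/(s|x|z|ch|sh)es$/.test(word)) return word.slice(0, -2);
  if (word.length > 3 && word.endsWith("s") && !word.endsWith("ss")) {
    return word.slice(0, -1);
  }
  return word;
}

// Fold, split on anything that is not a letter or digit, then filter and stem.
function toIndexTerms(text) {
  return fold(text)
    .split(/[^\p{L}\p{N}]+/u)
    .filter((token) => token.length > 0 && !STOP_WORDS.has(token))
    .map(stem);
}

console.log(fold("É") === fold("e"));
// → true
console.log(toIndexTerms("The café-owner sells boxes and books"));
// → [ 'cafe', 'owner', 'sell', 'box', 'book' ]
```

Because both the indexed text and the search string go through the same kind of normalization, a search for Cafe or BOOK would match the sentence above.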