Implementing autocomplete with a custom analyzer

In certain situations, you may want to create your own custom analyzer by composing character filters, tokenizers, and token filters of your choice. Please remember that most requirements can be fulfilled by one of the built-in analyzers with some configuration. Let's create an analyzer that can help when implementing autocomplete functionality.

To support autocomplete, we cannot rely on Standard Analyzer or any of the other pre-built analyzers in Elasticsearch. The analyzer is responsible for generating the terms that are stored in the index at indexing time, so our analyzer must generate terms that can support autocompletion. Let's understand this through a concrete example.

If we were to use Standard Analyzer at indexing time, we can check which terms it would generate for the value Learning Elastic Stack 7 by using the _analyze API:

GET /_analyze
{
  "text": "Learning Elastic Stack 7",
  "analyzer": "standard"
}
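An abridged version of the response (only the token values are shown here; the full response also includes offsets, token types, and positions) looks like this:

{
  "tokens": [
    { "token": "learning" },
    { "token": "elastic" },
    { "token": "stack" },
    { "token": "7" }
  ]
}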

The response of this request contains the terms learning, elastic, stack, and 7 (Standard Analyzer lowercases the tokens it produces). These are the terms that Elasticsearch would create and store in the index if Standard Analyzer was used. What we want to support is that, as the user starts typing a few characters, we can suggest possible matching products. For example, if the user has typed elas, we should still be able to recommend Learning Elastic Stack 7 as a product. Let's compose an analyzer that can generate terms such as el, ela, elas, elast, elasti, elastic, le, lea, and so on:

PUT /custom_analyzer_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "custom_edge_ngram"
            ]
          }
        },
        "filter": {
          "custom_edge_ngram": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

This index definition creates a custom analyzer that uses Standard Tokenizer to create the tokens and applies two token filters: a lowercase token filter and the custom edge_ngram token filter. The edge_ngram token filter breaks each token down into prefixes of 2 to 10 characters, so an incoming token such as elastic generates the tokens el, ela, elas, elast, elasti, and elastic. This is what enables autocompletion searches.
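You can verify which terms the new analyzer produces by running the _analyze API against the index once it has been created; this is just a quick sanity check of the configuration above:

GET /custom_analyzer_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Learning Elastic Stack 7"
}

The response should contain lowercased edge n-grams such as le, lea, lear, learn, el, ela, elas, st, sta, and so on.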

Given that the following two products are indexed, and the user has typed Ela so far, the search should return both products:

POST /custom_analyzer_index/_doc
{
  "product": "Learning Elastic Stack 7"
}

POST /custom_analyzer_index/_doc
{
  "product": "Mastering Elasticsearch"
}

GET /custom_analyzer_index/_search
{
  "query": {
    "match": {
      "product": "Ela"
    }
  }
}

This would not have been possible if the index had been built using Standard Analyzer at indexing time. We will cover the match query later in this chapter. For now, you can assume that it applies the analyzer configured as search_analyzer (Standard Analyzer here) to the given search text and then uses the resulting terms to perform the search. In this case, it searches the index for the term ela, the lowercased form of Ela. Since the index was built using a custom analyzer with an edge_ngram token filter, it finds a match for both products.
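To see for yourself what the match query ends up searching for, you can run the search input through the search analyzer; the only assumption here is that standard is the analyzer configured as search_analyzer, as in the mapping above:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Ela"
}

The single token produced is ela, which matches the ela edge n-gram stored for both products.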

In this section, we have learned about analyzers. Analyzers play a vital role in Elasticsearch: they decide which terms get stored in the index, and the analyzer used at index time therefore determines which search operations can be performed on that index once it is built. For example, Standard Analyzer cannot fulfill the requirement of supporting autocompletion. We have looked at the anatomy of analyzers (character filters, tokenizers, and token filters) and at some of the built-in support in Elasticsearch. We have also looked at a scenario in which building a custom analyzer solves a real business problem: supporting autocomplete in your application.

Before we move on to the next section and start looking at different query types, let's set up the index with the data required for that section. We are going to use product catalog data taken from the popular e-commerce site www.amazon.com. The data can be downloaded from http://dbs.uni-leipzig.de/file/Amazon-GoogleProducts.zip.

Before we start with the queries, let's create the required index and import some data:

PUT /amazon_products
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {}
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "manufacturer": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100
      }
    }
  }
}

The title and description fields are analyzed text fields, which enables full-text queries on them. The manufacturer field is of the text type, but it also has a sub-field named raw, so the same value is indexed in two ways: manufacturer is stored as text and manufacturer.raw is stored as a keyword. All fields of the keyword type internally use the keyword analyzer, which consists of just the keyword tokenizer, a noop tokenizer that simply returns the whole input as one token. Remember, character filters and token filters are optional in an analyzer. Thus, by using the keyword type on the field, we are choosing a noop analyzer and hence skipping the whole analysis process for that field.
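To see the difference between the two sub-fields, you can run the same value through both analyzers; the manufacturer name used here is just an illustrative value, not necessarily one from the dataset:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Victory Multimedia"
}

GET /_analyze
{
  "analyzer": "keyword",
  "text": "Victory Multimedia"
}

The first request produces the two tokens victory and multimedia, whereas the second returns the whole string Victory Multimedia as a single, unmodified token. This is why manufacturer supports full-text matching, while manufacturer.raw is better suited to exact matches, sorting, and aggregations.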

The price field is of the scaled_float type, which internally stores floating-point values as scaled whole numbers. For example, with a scaling factor of 100, 13.99 is stored as 1399. This is more space-efficient than the float and double datatypes.

To import the data, please follow the instructions in the book's accompanying source code repository on GitHub at https://github.com/pranav-shukla/learningelasticstack, in the v7.0 branch.

The instructions for importing data are in chapter-03/products_data/README.md.

After you have imported the data, verify that it is imported with the following query:

GET /amazon_products/_search
{
  "query": {
    "match_all": {}
  }
}
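If you only want to confirm how many documents were imported, the _count API is a lighter-weight check; the exact count depends on how much of the dataset you loaded:

GET /amazon_products/_count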

In the next section, we will look at structured search queries.
