One of the best ways to enhance the search experience is by offering spelling corrections. This is sometimes presented at the top of search results with such text as "Did you mean ...". Solr supports this with the SpellCheckComponent.
For spelling corrections to work, Solr must have a corpus of words (a dictionary) from which to suggest alternatives to those in the user's query. "Dictionary" is meant loosely as a collection of words, and not their definitions. Typically, you configure an appropriately indexed field as the dictionary, or you supply a plain text file instead. Solr can be configured to use one or more of the following spellcheckers:
IndexBasedSpellChecker
FileBasedSpellChecker
DirectSolrSpellChecker
WordBreakSolrSpellChecker
There is also a Suggester SpellChecker that implements auto-suggest / query completion. That choice is deprecated as of Solr 4.7, which introduced a dedicated SearchComponent for suggestions. We'll describe that feature later in this chapter.
The notion of a parallel index, also known as a side-car index, is simply an additional internal working index for a dedicated purpose. These must be 'built', which takes time, and they can get out of sync with the main index.
Assuming your dictionary is going to be based on indexed content instead of a file, a field should be set aside exclusively for this purpose. This is so that it can be analyzed appropriately and so that other fields can be copied into it, as the spellcheckers use just one field. Most Solr setups would have one field; our MusicBrainz searches, on the other hand, are segmented by the data type (artists, releases, and tracks), and so one for each would be best. For the purposes of demonstrating this feature, we will only do it for artists.
In schema.xml, we need to define the field type for spellchecking. This particular configuration is one we recommend for most scenarios:
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
A field type for spellchecking is not marked as stored because the spellcheck component only uses the indexed terms. The important thing is to ensure that the text analysis does not perform stemming, as the corrections presented would suggest the stems, which would look very odd to the user for most stemmer algorithms. It's also hard to imagine a use case that doesn't apply lowercasing.
Now, we need to create a field for this data:
<field name="a_spell" type="textSpell" />
And we need to get data into it with some copyField directives:
<copyField source="a_name" dest="a_spell" />
<copyField source="a_alias" dest="a_spell" />
Arguably, a_member_name may be an additional choice to copy as well, as the dismax search we configured (seen in the following code) searches it too, albeit at a reduced score. This, as well as many decisions with search configuration, is subjective.
To use any search component, it needs to be in the components list of a request handler. The spellcheck component is not in the standard list, so it needs to be added:
<requestHandler name="/mb_artists" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">a_name a_alias^0.8 a_member_name^0.4</str>
    <!-- etc. -->
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
This component should already be defined in solrconfig.xml. Within the spellchecker search component, there are one or more XML blocks named spellchecker, so that different dictionaries and other options can be configured. These might also be loosely referred to as the dictionaries, because the parameter that refers to this choice is named that way (more on that later). We have two spellcheckers configured as follows:
a_spell: This is an index-based spellchecker that is a typical recommended configuration using DirectSolrSpellChecker on the a_spell field.
file: This is a sample configuration where the input dictionary comes from a file (not included).
A complete MusicBrainz implementation would have a different spellchecker for each MB data type, with all of them configured similarly.
The following excerpt is an example configuration showing the key options available in the spellchecker component:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str><!-- 'q' only -->
  <lst name="spellchecker">
    <str name="name">a_spell</str>
    <str name="field">a_spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">1</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
  <!-- just an example -->
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
  </lst>
</searchComponent>
The double layer of spellchecker configuration is perhaps a little confusing. The outer one just names the search component—it's just a container for configuration(s). The inner ones are distinct configurations to choose at search time.
The following options are common to all spellcheckers, unless otherwise specified:
name: This refers to the name of the spellcheck configuration. It defaults to default. Be sure not to have more than one configuration with the same name.
classname: This refers to the implementation of the spellchecker. It's optional but you should be explicit. The choices are solr.DirectSolrSpellChecker, solr.IndexBasedSpellChecker, solr.WordBreakSolrSpellChecker, and solr.FileBasedSpellChecker. Further information on these is just ahead.
accuracy: This sets the minimum spelling correction accuracy to act as a threshold. It falls between 0 and 1 with a default of 0.5. The higher this number is, the simpler the corrections are. The accuracy is computed by the distanceMeasure. This option doesn't apply to WordBreakSolrSpellChecker.
distanceMeasure: This Java class computes how similar a possible misspelling and a candidate correction are. It defaults to org.apache.lucene.search.spell.LevensteinDistance, which is the same algorithm used in fuzzy query matching. Alternatively, org.apache.lucene.search.spell.JaroWinklerDistance works quite well. This option doesn't apply to WordBreakSolrSpellChecker.
field: This refers to the name of the field within the index that contains the dictionary. It's mandatory except when using FileBasedSpellChecker, where it's not applicable, since its data comes from a file, not an index. The field must be indexed, as the data is taken from there and not from the stored content, which is ignored. Generally, this field exists expressly for spell correction purposes and other fields are copied into it.
fieldType: This is a reference to a field type in schema.xml to perform text analysis on words to be spellchecked by the spellcheck.q parameter (not q). If this isn't specified, then it defaults to the field type of the field parameter, and if that isn't specified either, it defaults to a simple whitespace delimiter, which most likely would be a misconfiguration. When using the file-based spellchecker with spellcheck.q, be sure to specify this.
Technically, buildOnCommit and buildOnOptimize should be in the preceding list, but they are only worthwhile for the index-based or file-based spellcheckers, since those maintain a parallel index.
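To make the accuracy threshold more concrete, here is a minimal Python sketch of a normalized edit-distance score. It assumes the score is 1 minus the edit distance divided by the length of the longer word, which is our understanding of how a Levenshtein-style distanceMeasure is normalized. The function names are hypothetical, and this is an illustration rather than Solr's or Lucene's actual code.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def passes_accuracy(word: str, candidate: str, accuracy: float = 0.5) -> bool:
    """A candidate is kept only if its score reaches the accuracy threshold."""
    return similarity(word, candidate) >= accuracy
```

Under these assumptions, the misspelling "smashg" scores about 0.83 against "smash" (one edit over six characters) and 0.75 against "smashing" (two edits over eight characters), so both would pass the default threshold of 0.5.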
The DirectSolrSpellChecker component works directly off the Solr index, without needing to maintain a parallel index that might get out of sync. It's a great choice to start with.
maxEdits: This is the number of changes to allow for each term; the default value is 2. Since most spelling mistakes are only one letter off, setting this to 1 will reduce the number of possible suggestions.
minPrefix: This refers to the minimum number of leading characters that the terms should share. If you want the spelling suggestions to start with the same letter, set this value to 1.
maxInspections: This defines the maximum number of possible matches to review before returning the results; the default is 5.
minQueryLength: This specifies how many characters must be in the input query before suggestions are returned; the default is 4.
maxQueryFrequency: This is the maximum threshold for the number of documents a term must appear in before being considered as a suggestion. This can be a fraction (such as .01 for 1 percent) or an absolute value (such as 2). A lower threshold is better for small indexes.
thresholdTokenFrequency: This specifies a document frequency threshold, which will exclude words that don't occur sufficiently often. This can be expressed as a fraction in the range 0-1, defaulting to 0, which effectively disables the threshold, letting all words through. It can also be expressed as an absolute value.
If there is a lot of data and lots of common words, as opposed to proper nouns, then this threshold should be effective. If testing shows spelling candidates including strange fluke words found in the index, then introduce a threshold that is high enough to weed out such outliers. The threshold will probably be less than 0.01 (one percent of documents).
The IndexBasedSpellChecker component gets the dictionary from the indexed content of a field in a Lucene/Solr index, and it loads it into its own private parallel index to perform spellcheck searches on. The options are explained as follows:
buildOnCommit and buildOnOptimize: These Boolean options (defaulting to false) enable the spellchecker's internal index to be built automatically when Solr performs either a commit or an optimize. This can make keeping the spellchecker in sync easier than building manually, but beware that commits or optimizes will subsequently be hit with a long delay.
spellcheckIndexDir: This is the directory where the spellchecker's internal dictionary is built, not its source. It is relative to Solr's data directory. This is actually optional; omitting it results in an in-memory dictionary.
sourceLocation: If specified, it refers to a directory containing Lucene index files, such as a Solr data directory. This is an unusual expert choice, but shows that the source dictionary does not need to come from Solr's main index; it could be from another location, perhaps from another Solr core. If you are doing this, then you'll probably also need to use the spellcheck.reload command mentioned later.
thresholdTokenFrequency: This has the same definition as in DirectSolrSpellChecker.
The FileBasedSpellChecker component is very similar to IndexBasedSpellChecker, except that it gets the dictionary from a plain text file instead of the index. It maintains its own private parallel index to perform spellcheck searches on. This can be useful if you are using Solr as a spelling server, or if you don't need spelling suggestions to be based on actual terms in the index. The file format is one word per line. You can find an example file (spellings.txt) in the conf directory.
buildOnCommit, buildOnOptimize, and spellcheckIndexDir: For more on these, see the IndexBasedSpellChecker options section.
sourceLocation: This is mandatory and references a plain text file with each word on its own line. Note that an option by the same name but different meaning exists for IndexBasedSpellChecker.
For a freely available English word list, check out Spell Checker Oriented Word Lists (SCOWL) at http://wordlist.sourceforge.net. In addition, see the dictionary files for OpenOffice, which supports many languages at http://wiki.services.openoffice.org/wiki/Dictionaries.
characterEncoding: This is optional, but should be set. It is the character encoding of sourceLocation, defaulting to UTF-8.
The WordBreakSolrSpellChecker component offers suggestions by combining adjacent query terms and/or breaking terms into multiple words from the Solr index. It can detect spelling errors resulting from misplaced whitespace without the use of shingle-based dictionaries and provides collation support for word-break errors, including cases where the user has a mix of single-word spelling errors and word-break errors in the same query. The following are options specific to this spellchecker:
combineWords: This defines whether words should be combined in a dictionary search; the default is true.
breakWords: This defines whether words should be broken during a dictionary search; the default is true.
maxChanges: This defines how many times the spellchecker should check collation possibilities against the index; the default is 10 (can be any integer).
For more advanced options, see the Javadocs at http://lucene.apache.org/solr/4_8_0/solr-core/org/apache/solr/spelling/WordBreakSolrSpellChecker.html.
You can find an example of this spellchecker configuration in Solr's example solrconfig.xml.
We've not yet discussed the parameters of a search with the spellchecker component enabled. But at this point of the configuration discussion, understand that you have the choice of just letting the user query q get processed or you can use spellcheck.q.
When a user query (q parameter) is processed by the spellcheck component to look for spelling errors, Solr needs to determine what words are to be examined. This is a two-step process. The first step is to pull out the queried words from the query string, ignoring any syntax, such as AND. The next step is to process the words with an analyzer so that, among other things, lowercasing is performed.
The analyzer is chosen through a field type specified directly within the search component configuration with queryAnalyzerFieldType. It really should be specified, but it's actually optional. If left unspecified, there would be no text analysis, which would in all likelihood be a misconfiguration.
If the spellcheck.q parameter is given (which really isn't a query per se), then the string is processed with the text analysis referenced by the fieldType option of the spellchecker being used. If a file-based spellchecker is being used, then you should set this explicitly. Index-based spellcheckers will sensibly use the field type of the referenced indexed spelling field.
If the spellchecker you are using is IndexBasedSpellChecker or FileBasedSpellChecker (or, technically, Suggester), then it needs to be built, which is the process in which the dictionary is read and is built into the spellcheckIndexDir. If it isn't built, then no corrections will be offered, and you'll probably be very confused. You'll be even more confused when troubleshooting the results if it was built once but is far out of date and so needs to be built again.
Generally, building is required if it has never been built before, and it should be built periodically when the dictionary changes. It need not necessarily be built for every change, but it won't benefit from any such modifications until it is rebuilt.
Using buildOnOptimize or buildOnCommit is a low-hassle way to keep the spellcheck index up to date. However, most apps never optimize or optimize too infrequently to make use of this, or they commit too frequently. So instead (or in addition to buildOnOptimize), issue build commands manually on a suitable time period and/or at the end of your data loading scripts. Furthermore, setting spellcheckIndexDir will ensure the built spellcheck index is persisted between Solr restarts.
In order to perform the build of a spellchecker, simply enable the component with spellcheck=true, add a special parameter called spellcheck.build, and set it to true:
http://localhost:8983/solr/mbartists/select?qt=mb_artists&rows=0&spellcheck=true&spellcheck.build=true&spellcheck.dictionary=a_spell
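If the build is issued from a data loading script, the request URL can be assembled programmatically. The following Python sketch does this for the build request; the host, port, core name, and qt value are taken from the example URL, while the spellcheck_build_url helper is a hypothetical name of our own.

```python
from urllib.parse import urlencode

# Base URL taken from the example request; adjust host, port, and core as needed.
BASE_URL = "http://localhost:8983/solr/mbartists/select"


def spellcheck_build_url(dictionary: str, base_url: str = BASE_URL) -> str:
    """Return a URL that asks Solr to (re)build one named spellchecker."""
    params = [
        ("qt", "mb_artists"),
        ("rows", 0),                    # no search results are needed
        ("spellcheck", "true"),         # enable the component
        ("spellcheck.build", "true"),   # trigger the build
        ("spellcheck.dictionary", dictionary),
    ]
    return base_url + "?" + urlencode(params)
```

The resulting string can be fetched with any HTTP client, for instance at the end of a nightly load, once per configured dictionary.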
The other spellcheck parameters will be explained shortly. There is an additional related option similar to spellcheck.build called spellcheck.reload. This doesn't rebuild the index, but it basically re-establishes connections with the index (both sourceLocation for index-based spellcheckers and spellcheckIndexDir for all types). If you've decided to have some external process build the dictionary or simply share built indexes between spellcheckers, then Solr needs to know to reload it to see the changes, which is a quick operation.
At this point, we've covered how to configure a spellchecker but not how to issue requests that actually use it. In summary, all that you are required to do is add spellcheck=true to a standard search request, but it is more likely that you will set other options once you start experimenting.
It's important to be aware that there are effectively three mutually exclusive internal modes that this component places itself in:
By default, the spellcheck component only offers suggestions for query terms that are not found in the index.
If spellcheck.onlyMorePopular=true, then the spellcheck component will not only try to offer suggestions for query terms that find no results, it will also do so for the other terms, provided it can offer a suggestion that occurs more frequently in the index. Now Solr is working harder, and intuitively, this should help fix cases when the query is an indexed typo. However, the erroneous query term might not be an indexed typo (for example, June versus Jane); can Solr still try harder? Yes…
If spellcheck.alternativeTermCount is set, then it will try to find suggestions for all terms, and the suggestions need not occur more frequently.
Despite these progressively aggressive spellcheck modes, there might still be no suggestions or fewer than the number asked for if it simply can't find anything suitable.
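The difference between the modes can be modeled with a small Python function. This is a simplified sketch of the behavior described here, not Solr's implementation: keep_suggestion is a hypothetical helper, and treating alternativeTermCount as taking precedence when both options are set is our own assumption.

```python
from typing import Optional


def keep_suggestion(orig_freq: int, suggestion_freq: int,
                    only_more_popular: bool = False,
                    alternative_term_count: Optional[int] = None) -> bool:
    """Decide whether a suggestion for one query term would be offered."""
    if orig_freq == 0:
        # The term is not in the index: every mode offers suggestions.
        return True
    if alternative_term_count is not None:
        # Most aggressive mode: suggest for indexed terms too,
        # with no requirement on relative frequency.
        return alternative_term_count > 0
    if only_more_popular:
        # Suggest for indexed terms only when the suggestion is more frequent.
        return suggestion_freq > orig_freq
    # Default mode: indexed terms are assumed to be spelled correctly.
    return False
```

Here orig_freq and suggestion_freq stand for the document frequencies of the queried term and of the candidate correction, respectively.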
Let's now explore the various request parameters recognized by the spellchecker component:
spellcheck: This refers to a Boolean switch that must be set to true to enable the component in order to see suggested spelling corrections.
spellcheck.dictionary: This is the named reference to a dictionary (spellchecker) to use, as configured in solrconfig.xml. It defaults to default. This can be set multiple times and Solr will merge the suggestions.
spellcheck.q or q: The string containing words to be processed by this component can be specified as the spellcheck.q parameter, and if not present, then the q parameter. Please look for the information presented earlier on how these are processed.
Which should you use: spellcheck.q or q?
Assuming you're handling user queries for Solr that might contain some query syntax, then the default q is right, as Solr will then know to filter out possible uses of Lucene/Solr's syntax, such as AND, OR, fieldname:word, and so on. If not, then spellcheck.q is preferred, as it won't go through that unnecessary processing. This also allows its parsing to be different on a spellchecker-by-spellchecker basis, which we'll leverage in our example.
spellcheck.count: This refers to the maximum number of corrections to offer per word. The default is 1. Corrections are ordered by those closest to the original, as determined by the distanceMeasure algorithm.
spellcheck.extendedResults: This is a Boolean switch that adds frequency information, both for the original word and for the suggestions. It's helpful when debugging.
spellcheck.collate: This is a Boolean switch that adds a revised query string to the output that alters the original query (from spellcheck.q or q) to use the top recommendation for each suggested word. It's smart enough to leave any other query syntax in place. The following are some additional options for use when collation is enabled:
  spellcheck.maxCollations: This specifies the maximum number of collations to return, defaulting to 1.
  spellcheck.maxCollationTries: This specifies the maximum number of collations to try (that is, to verify that they yield results), defaulting to 0, which disables verification. If this is non-zero, then the spellchecker will not return collations that yield no results.
  spellcheck.maxCollationEvaluations: This specifies the maximum number of word correction combinations to rank before the top candidates are tried (verified). Without this limit, queries with many misspelled words could yield a combinatoric explosion of possibilities. The default is 10000, which should be fine.
  spellcheck.collateExtendedResults: This is a Boolean switch that adds more details to the collation response. It adds the collation hits (number of documents found) and a mapping of misspelled words to corrected words.
  spellcheck.collateParam.xx: This will allow parameter override, where xx is the parameter you want to override; for example, to override mm from a low value to a high value so that the spellchecker is truly verifying that the replacement (collation) terms exist together in the same document. This is similar to local-params, but is applied to the collated query string verification when maxCollationTries is used.
spellcheck.onlyMorePopular: This is a Boolean switch that will offer spelling suggestions for queried terms that were found in the index, provided that the suggestions occur more often. This is in addition to the normal behavior of only offering suggestions for queried terms not found in the index. To detect when this happens, enable extendedResults and look for origFreq being greater than 0. This is disabled by default.
spellcheck.alternativeTermCount: This specifies the maximum number of suggestions to return for each query term that already exists in the index/dictionary. Normally, the spellchecker doesn't offer suggestions for such query terms, and so setting this triggers the spellchecker to try to find suggestions for all query terms. The configured number essentially overrides spellcheck.count for such terms, giving the opportunity to use a more conservative (lower) number, since it's less likely one of these query terms was actually misspelled.
spellcheck.maxResultsForSuggest: This specifies the maximum number of results the request can return in order to both generate spelling suggestions and set the correctlySpelled element to false. This acts as an early short-circuit rule in the spellchecker if you set it; otherwise, there is no rule. This option is only applicable when spellcheck.onlyMorePopular is true or spellcheck.alternativeTermCount is set, because only those two options can trigger suggestions for queries that return results.
We'll try out a typical spellcheck configuration that we've named a_spell. We've disabled showing the query results with rows=0 because the actual query results aren't the point of these examples. In this example, it is imagined that the user is searching for the band Smashing Pumpkins, but with a misspelling.
Here are the search results for Smashg Pumpkins, using the a_spell dictionary:
<?xml version="1.0"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">124</int>
    <lst name="params">
      <str name="spellcheck">true</str>
      <str name="indent">on</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.collateExtendedResults">true</str>
      <str name="spellcheck.maxCollationTries">5</str>
      <str name="spellcheck.collate">true</str>
      <str name="rows">0</str>
      <str name="echoParams">explicit</str>
      <str name="q">Smashg Pumpkins</str>
      <str name="spellcheck.dictionary">a_spell</str>
      <str name="spellcheck.count">5</str>
      <str name="qt">/mb_artists</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="smashg">
        <int name="numFound">5</int>
        <int name="startOffset">0</int>
        <int name="endOffset">6</int>
        <int name="origFreq">0</int>
        <arr name="suggestion">
          <lst>
            <str name="word">smash</str>
            <int name="freq">36</int>
          </lst>
          <lst>
            <str name="word">smashing</str>
            <int name="freq">4</int>
          </lst>
          <lst>
            <str name="word">smashign</str>
            <int name="freq">1</int>
          </lst>
          <lst>
            <str name="word">smashed</str>
            <int name="freq">5</int>
          </lst>
          <lst>
            <str name="word">smasher</str>
            <int name="freq">2</int>
          </lst>
        </arr>
      </lst>
      <bool name="correctlySpelled">false</bool>
      <lst name="collation">
        <str name="collationQuery">smashing Pumpkins</str>
        <int name="hits">1</int>
        <lst name="misspellingsAndCorrections">
          <str name="smashg">smashing</str>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
In this scenario, we intentionally chose a misspelling that is closer to another word: "smash". Were it not for maxCollationTries, the suggested collation would be "smash Pumpkins", which would return no results. There are a few things we want to point out regarding the spellchecker response:
startOffset and endOffset are the indexes into the query of the spellchecked word. This information can be used by the client to display the query differently, perhaps displaying the corrected words in bold.
numFound is always the number of suggested words returned, not the total number that would be available if spellcheck.count were raised.
correctlySpelled is intuitively true or false, depending on whether all of the query words were found in the dictionary or not.
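A client can read these elements with any XML parser. The following Python sketch, using only the standard library, extracts the collation and marks the flagged word in the original query using startOffset and endOffset. The embedded XML is a trimmed copy of the response above, and the summarize function and its returned keys are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the spellcheck response (one suggestion kept for brevity).
RESPONSE = """<response>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="smashg">
        <int name="numFound">5</int>
        <int name="startOffset">0</int>
        <int name="endOffset">6</int>
        <int name="origFreq">0</int>
        <arr name="suggestion">
          <lst><str name="word">smash</str><int name="freq">36</int></lst>
        </arr>
      </lst>
      <bool name="correctlySpelled">false</bool>
      <lst name="collation">
        <str name="collationQuery">smashing Pumpkins</str>
        <int name="hits">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def summarize(xml_text: str, query: str) -> dict:
    """Extract the pieces a 'Did you mean' display typically needs."""
    root = ET.fromstring(xml_text)
    suggestions = root.find("./lst[@name='spellcheck']/lst[@name='suggestions']")
    flag = suggestions.find("./bool[@name='correctlySpelled']")
    collation = suggestions.find("./lst[@name='collation']/str[@name='collationQuery']")
    # Collect the character span of each flagged word; the collation entry
    # has no offsets and is skipped.
    spans = []
    for entry in suggestions.findall("./lst"):
        start = entry.find("./int[@name='startOffset']")
        end = entry.find("./int[@name='endOffset']")
        if start is not None and end is not None:
            spans.append((int(start.text), int(end.text)))
    marked = query
    # Work right to left so earlier offsets stay valid after insertion.
    for start, end in sorted(spans, reverse=True):
        marked = marked[:start] + "[" + marked[start:end] + "]" + marked[end:]
    return {
        "correctly_spelled": (flag.text == "true") if flag is not None else None,
        "collation": collation.text if collation is not None else None,
        "marked_query": marked,
    }
```

Calling summarize(RESPONSE, "Smashg Pumpkins") gives a collation of "smashing Pumpkins" and a marked query of "[Smashg] Pumpkins"; a real client would substitute bold markup or a link for the square brackets.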