The lucene query parser we've been using so far offers a rich syntax, but little else. A notable problem with this parser is that the query must be well formed according to its syntax rules, such as having balanced quotes and parentheses. Users who know nothing about this syntax might type just about anything for a query, possibly resulting in an error or unexpected results. The DisMax query parser, named after Lucene's DisjunctionMaxQuery, addresses this problem and adds many features to enhance search relevancy (good scoring). The features of this query parser that have a more direct relationship to scoring are described in the section The DisMax query parser – part 2 in the next chapter. Use of this parser is so important that we need to introduce it here.
You'll see references here to eDisMax, where the e stands for extended. This is a forked evolution of DisMax that adds features. It hasn't yet replaced the original DisMax query parser because it supports more of Lucene's syntax, which carries the risk of a user invoking that syntax inadvertently. So if you don't care about eDisMax's extra features and don't have users who want the more advanced syntax support, then stick with the venerable DisMax. In a future Solr version, perhaps as soon as the next release, we expect dismax to refer to the enhanced version, while the older one will likely exist under another name.
Almost always use defType=edismax or dismax
The dismax (or edismax) query parser should almost always be chosen for parsing user queries, that is, the q parameter. Set it in the request handler definition for your app. Furthermore, we recommend the use of edismax. The only consideration against this is whether it will be a problem for users to be able to use Solr's full syntax, inadvertently or maliciously. This will be explained shortly.
Here is a summary of the features that the dismax query parser has over the lucene query parser:
* It searches multiple fields with different score boosts, using Lucene's DisjunctionMaxQuery.
* It limits the query syntax to a small subset. The edismax query parser permits Solr's full syntax, assuming it parses correctly.
* It boosts documents that match the query as a phrase. The edismax query parser boosts contiguous portions of the query too.
The edismax query parser was only mentioned a couple of times in this list, but it improves on the details of how some of these features work.
These features will subsequently be described in greater detail. But first, let's take a look at a request handler we've set up to search for artists. Solr configuration that is not related to the schema is located in solrconfig.xml. The following definition is a simplified version of the one in this book's code supplement:
<requestHandler name="/mb_artists" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">a_name a_alias^0.8 a_member_name^0.4</str>
    <str name="q.alt">*:*</str>
    <str name="mm">100%</str>
  </lst>
</requestHandler>
In the Query screen of Solr's admin interface, we can refer to this by setting Request-Handler to /mb_artists. You can observe the effect in the URL when you submit the form. It wasn't strictly necessary to set up such a request handler, because each of these parameters can also be passed in the URL, but it's good practice and it's convenient for Solr's search form.
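Because the handler's defaults are ordinary request parameters, the same search could be expressed entirely in a URL. The following Python sketch builds such a URL. The host, port, and core name (localhost:8983 and mbartists) and the sample query are assumptions made for illustration; only the parameter names and values come from the handler definition above.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host, port, and core name for your installation.
BASE_URL = "http://localhost:8983/solr/mbartists/select"


def build_artist_search_url(user_query):
    """Return a URL carrying the same settings as the /mb_artists defaults."""
    params = {
        "defType": "edismax",
        "qf": "a_name a_alias^0.8 a_member_name^0.4",
        "q.alt": "*:*",
        "mm": "100%",
    }
    if user_query:
        # Leave q out entirely for an empty search so that q.alt applies.
        params["q"] = user_query
    return BASE_URL + "?" + urlencode(params)


print(build_artist_search_url("smashing pumpkins"))
```

Keeping these values in solrconfig.xml instead means every client gets the same behavior without repeating them on each request.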
You use the qf parameter to tell the dismax query parser which fields you want to search and their corresponding score boosts. As explained in the section on request handlers, the query parameters can be specified in the URL or in the request handler configuration in solrconfig.xml; you'll probably choose the latter for this one. Here is the relevant line from the request handler configuration shown earlier:
<str name="qf">a_name a_alias^0.8 a_member_name^0.4</str>
This syntax is a space-separated list of field names that can each have optional boosts applied using the same syntax that is used in the query syntax for boosting. This request handler is intended to find artists from a user's query. Such a query would ideally match the artist's name, but we'll also search for aliases as well as bands that the artist is a member of. Perhaps the user didn't recall the band name but knew the artist's name. This configuration would give them the band in the search results, most likely towards the end.
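To make the qf format concrete, here is a small Python sketch that splits a qf value into field names and boosts. The function name parse_qf is hypothetical, and the sketch assumes that a field without an explicit boost has a boost of 1.0; it illustrates the format only and is not how Solr itself parses the parameter.

```python
def parse_qf(qf):
    """Split a qf value into (field, boost) pairs; a missing boost means 1.0."""
    fields = []
    for token in qf.split():
        # partition() returns an empty separator when there is no "^" in the token.
        name, sep, boost = token.partition("^")
        fields.append((name, float(boost) if sep else 1.0))
    return fields


print(parse_qf("a_name a_alias^0.8 a_member_name^0.4"))
# [('a_name', 1.0), ('a_alias', 0.8), ('a_member_name', 0.4)]
```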
The score boosts do not strictly order the results in a cascading fashion. A document with an exact match in a_alias that matched only part of a_name will probably appear on top, despite the lower boost on a_alias. If in your application you are matching identifiers of some sort, then you may want to give that field a very high boost, such as 1000, to virtually assure it will be on top.
One detail involved in searching multiple fields is the effect of stop words (for example, "the", "a", and so on) in the schema definition. If qf refers to some fields that filter stop words and others that don't, then a search involving stop words will usually return no results. The edismax query parser fixes this by making stop words optional in the query unless the query consists entirely of stop words. With dismax, you can avoid the problem by ensuring that the query analyzer chain of every queried field filters out the same set of stop words.
The edismax query parser will first try to parse the user query with the full syntax supported by the lucene query parser, with a couple of tweaks. If it fails to parse, it will fall back to the limited syntax of the original dismax, described in the next paragraph. Someday, this should be configurable, but it is not at this time. The aforementioned "tweaks" to the full syntax are that the or and and Boolean operators can be used in lowercase form, and that pure-negative subqueries are supported.
When using dismax (or edismax, when the user query failed to parse with the lucene query parser), the parser will restrict the syntax permitted to terms, phrases, and use of + and - (but not AND, OR, &&, ||) to make a clause mandatory or prohibited. Anything else is escaped if needed to ensure that the underlying query is valid. The intention is to never trigger an error, but unless you're using edismax, you'll have to code for this possibility due to outstanding bugs (SOLR-422, SOLR-874).
The following query example uses all of the supported features of this limited syntax:
"a phrase query" plus +mandatory without -prohibited
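To show how the pieces of that example break down, the following Python sketch labels each clause as optional, mandatory, or prohibited. It is a toy illustration of the limited syntax, not Solr's parser: the regular expression and the function name classify_clauses are assumptions made here, and the sketch performs none of the escaping described above.

```python
import re

# An optional + or - prefix, followed by a quoted phrase or a bare term.
CLAUSE = re.compile(r'([+-]?)("[^"]*"|\S+)')


def classify_clauses(query):
    """Label each clause of a limited-syntax query by its prefix."""
    labels = {"+": "mandatory", "-": "prohibited", "": "optional"}
    return [(labels[prefix], text) for prefix, text in CLAUSE.findall(query)]


for label, text in classify_clauses('"a phrase query" plus +mandatory without -prohibited'):
    print(label, text)
```

Clauses labeled optional here are the ones whose required count is governed by the min-should-match setting.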
With the lucene query parser, you have a choice of the default operator being OR, thereby requiring just one query clause to match, or choosing AND to make all clauses required. This, of course, only applies to clauses not otherwise explicitly marked required or prohibited in the query using + and -. But these are two extremes, and sometimes it is preferable to find some middle ground. The dismax parser uses a method called min-should-match, a feature that describes how many clauses should match, depending on how many there are in the query; required and prohibited clauses are not included in the numbers. This allows you to quantify the number of clauses as either a percentage or a fixed number. The configuration of this setting is entirely contained within the mm query parameter using a concise syntax specification, which I'll describe in a moment.
Always set mm. When in doubt about what to set it to, use 100 percent. If it is not set, it uses the same defaulting rules as the lucene query parser, most likely resulting in an mm value equivalent to 0 percent, which is probably not what you want.
This feature is most useful when users type many words in their queries, at least three. That in turn suggests a text field with substantial text in it, which is not the case for our MusicBrainz dataset. Nevertheless, we will put this feature to good use.
The following are the four basic mm specification formats expressed as examples:
* 3: This specifies that three clauses are required; the rest are optional.
* -2: This specifies that two clauses are optional; the rest are required.
* 66%: This specifies that 66 percent of the clauses (rounded down) are required; the rest are optional.
* -25%: This specifies that 25 percent of the clauses (rounded down) are optional; the rest are required.
Notice that - inverts the required/optional definition. It does not make any number negative from the standpoint of any definitions herein.
Two additional points about these rules are as follows:
* If the mm rule is a fixed number n, but there are fewer queried clauses, then n is reduced to the queried clause count so that the rule will make sense. For example, if mm is -5 and only two clauses are in the query, then all are optional. Sort of!
* Even when every clause is optional, as in the preceding example or when mm is 0 or 0%, one clause must still match, assuming that there are no required clauses present in the query.
Now that you understand the basic mm specification format, which is for one simple rule, I'll describe the final format, which allows for multiple rules. This format is composed of an ordered, space-separated series of number<basicmm rules. Each can be read as, "If the clause count is greater than number, then apply rule basicmm". Only the right-most rule that meets the clause count threshold is evaluated. Because the rules are listed in ascending order of threshold, the chosen rule is the one with the highest threshold that the clause count exceeds. If none match because there are too few clauses, then all clauses are required, which is a basic specification of 100 percent.
An example of the mm specification is given here:
2<75% 9<-3
This reads as follows:
If there are over nine clauses, then all but three are required (three are optional, and the rest are required). If there are over two clauses, then 75 percent are required (rounded down). Otherwise (one or two clauses) all clauses are required, which is the default rule.
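The mm rules described in this section can be sketched in Python as follows. This is a reading of the rules as stated in the text, not Solr's own implementation, and the function names are hypothetical. It counts only optional clauses, as the text specifies, and it does not model Lucene's separate requirement that at least one clause match when the result is zero.

```python
def basic_rule(basic, clause_count):
    """Apply one basic mm rule such as '3', '-2', '66%', or '-25%'."""
    if basic.endswith("%"):
        number = int(basic[:-1])
        portion = clause_count * abs(number) // 100  # percentage, rounded down
    else:
        number = int(basic)
        portion = abs(number)
    # A leading minus means the portion is optional rather than required.
    required = clause_count - portion if number < 0 else portion
    # Keep the result between zero and the number of clauses in the query.
    return max(0, min(required, clause_count))


def min_should_match(spec, clause_count):
    """Return how many optional clauses must match for an mm specification."""
    required = clause_count  # default rule: all clauses are required
    for rule in spec.split():
        if "<" not in rule:
            return basic_rule(rule, clause_count)
        threshold, basic = rule.split("<", 1)
        if clause_count <= int(threshold):
            break  # thresholds ascend, so no later rule can apply
        required = basic_rule(basic, clause_count)
    return required


for count in (2, 4, 10):
    print(count, min_should_match("2<75% 9<-3", count))
```

For the example specification, the loop reports that 2 of 2, 3 of 4, and 7 of 10 clauses are required, which agrees with the reading given above.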
A simple configuration for min-should-match is to require all clauses:
100%
For MusicBrainz searches, I do not expect users to be using many terms, but I expect most of them to match. If a user searches for three or more terms, then I'll let one be optional. Here is the mm spec:
2<-1
You may be inclined to require all of the search terms; and that's a good common approach. However, if just one word isn't found, then there will be no search results—an occurrence that most search software tries to minimize. Even if you make some of the words optional, the matching documents that have more of the search words will be towards the top of the search results, assuming score-sorted order (you'll learn why in the next chapter). There are other ways to approach this problem, for example, by performing a secondary search if the first returns none or too few. Solr doesn't do this for you, but it's easy for the client to do. This approach could even tell the user that this was done, which would yield a better search experience.
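The secondary-search idea could look something like the following Python sketch on the client side. The search argument is a hypothetical callable standing in for whatever client code actually queries Solr, and the choice of 100% followed by 2<-1 is just one possible pairing of strict and relaxed settings.

```python
def search_with_fallback(search, user_query, min_results=1):
    """Search strictly first, then retry with a relaxed mm if too few results.

    `search` is a hypothetical callable taking (query, mm) and returning a list.
    The second return value tells the caller whether the relaxed search was used,
    so the application can say so to the user.
    """
    results = search(user_query, "100%")
    if len(results) >= min_results:
        return results, False
    return search(user_query, "2<-1"), True


# A stand-in for a real Solr call so that the sketch runs on its own.
def fake_search(query, mm):
    return [] if mm == "100%" else ["Smashing Pumpkins"]


results, relaxed = search_with_fallback(fake_search, "smashing pumpkins live")
print(results, relaxed)
```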
The dismax query parser supports a default query, which is used in the event the user query q is not specified. This parameter is q.alt, and it is not subject to the limited syntax of dismax. Here's an example of it used to match all documents from within the request handler defaults in solrconfig.xml:
<str name="q.alt">*:*</str>
This parameter is usually set to *:* to match all documents and is often specified in the request handler configuration in solrconfig.xml. You'll see with faceting in the next section that there will not necessarily be a keyword search at all, in which case you'll want to display facets over all of the data.
The DisMax and eDisMax query parsers support fielded queries within the q parameter. This means that a user can explicitly search any valid field using this syntax: field_name:value. The uf (user fields) parameter makes it possible to restrict the set of fields the user can search against. The value of this parameter can be a space-delimited list of field names. A wildcard (*) can be used for field name globbing. Dashes can be used to negate fields. For example, to allow user queries to search in the id field and in all fields starting with a_ except a_id, the uf parameter value would be id a_* -a_id.
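The effect of such a uf value can be pictured with a short Python sketch. It assumes that a negated pattern always takes precedence over an inclusion regardless of order, which matches the example above but has not been checked against Solr's behavior in every case; the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def is_user_field_allowed(field, uf):
    """Check a field name against a uf-style list such as 'id a_* -a_id'."""
    patterns = uf.split()
    # Assumption: an exclusion always wins over an inclusion, whatever the order.
    if any(p.startswith("-") and fnmatchcase(field, p[1:]) for p in patterns):
        return False
    return any(not p.startswith("-") and fnmatchcase(field, p) for p in patterns)


for name in ("id", "a_name", "a_id", "r_name"):
    print(name, is_user_field_allowed(name, "id a_* -a_id"))
```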