The query parser named lucene
is Solr's most expressive and capable. With the benefit of hindsight, it should have been named "solr". It is based on Lucene's classic syntax with some additions that will be pointed out explicitly. In fact, you've already seen the first addition, which is local-params.
The lucene
query parser does have a couple of query parameters that can be set. These parameters aren't normally specified though; Lucene's query syntax is easily made explicit to not need these options.
q.op
: This is the default query operator, either AND
or OR
to signify if all of the search terms or just one of the search terms need to match. If this isn't present, then the default is specified in schema.xml
near the bottom in the defaultOperator
attribute. If that isn't specified, then the default is OR.
df
: This is the default field that will be searched by the user query. If this isn't specified, then the default is specified in schema.xml
near the bottom in the <defaultSearchField>
element. If that isn't specified, then a query that does not explicitly specify a field to search will cause an error.
We recommend not using these parameters unless they are used with local-params, such as {! df=text q.op=AND}my query
. Similarly, we recommend not setting the global defaults in the schema. One reason is that they affect every query in the same request, including ones you perhaps didn't intend, such as a facet query. Another is that a query depending on them is ambiguous to anyone who doesn't know what these parameters are set to.
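To make the recommendation concrete, here is a minimal Python sketch that sends the defaults along with the query itself as local-params instead of as global request parameters. The select URL, the helper name build_select_url, and the fl value are assumptions chosen for illustration; adjust them to your own setup.

```python
from urllib.parse import urlencode

# Assumed location of the select handler for the mbartists core.
SELECT_URL = "http://localhost:8983/solr/mbartists/select"


def build_select_url(user_query, default_field, default_op="OR"):
    """Return a select URL whose q carries df and q.op as local-params."""
    # The local-params prefix scopes df and q.op to this one query, so
    # other queries in the same request (facet queries, for example)
    # are not affected by them.
    q = "{!df=%s q.op=%s}%s" % (default_field, default_op, user_query)
    return SELECT_URL + "?" + urlencode({"q": q, "fl": "a_name,score"})


url = build_select_url("Smashing Pumpkins", "a_name", "AND")
print(url)
```

Because the defaults sit next to the query text, anyone reading the request can see exactly how the query will be interpreted.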
To play along with the examples in the book, go to http://localhost:8983/solr/#/mbartists/query
and set the df
parameter to a_name
. We advise you not to use that parameter, but this is for experimentation. The default query operator remains at OR
and doesn't need changing. You may find it easier to scan the results if you set fl
(the field list) to a_name
, score.
Lucene doesn't natively have a query syntax to match all documents. Solr enhances Lucene's query syntax to support this with the following syntax:
*:*
In Solr 4.2, a single asterisk is also accepted:
*
When using dismax
, it's common to set q.alt
to this match-everything query so that a blank query returns all results.
Lucene has a unique way of combining multiple clauses in a query string. It is tempting to think of this as a mundane detail that is common to Boolean operations in programming languages, but Lucene doesn't quite work that way.
A query expression is decomposed into a set of unordered clauses of three types:
A clause can be mandatory:
+Smashing
This matches only artists containing the word Smashing.
A clause can be prohibited:
-Smashing
This matches all artists except those with Smashing. You can also use an exclamation mark, as in !Smashing, but that's rarely used.
A clause can be optional:
Smashing
Spaces must not come between +
, !
or -
and the search word for it to work as described here, otherwise the operator itself is treated like a separate word and the word to its right will default to optional. Typically, the operator won't actually be searched for since text analysis usually removes it.
The term optional deserves further explanation. If the query expression contains at least one mandatory clause, then any optional clause is just that—optional. This notion may seem pointless, but it serves a useful function in scoring documents that match more of them higher. If the query expression does not contain any mandatory clauses, then at least one of the optional clauses must match. The next two examples illustrate optional clauses.
Here, Pumpkins
is optional, and the well-known band will surely be at the top of the list, ahead of bands with names like Smashing Atoms
:
+Smashing Pumpkins
In this example, there are no mandatory clauses and so documents with Smashing
or Pumpkins
are matched, but not Atoms
. The Smashing Pumpkins
is at the top because it matched both, followed by other bands containing only one of those words:
Smashing Pumpkins -Atoms
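The matching rules for the three clause types can be sketched in a few lines of Python. This is only a model of the rules as described here, not Lucene's implementation: documents are reduced to lists of lowercase terms, the function names are made up for the example, and the count of matched optional clauses is a crude stand-in for the real relevancy score.

```python
def matches(doc_terms, mandatory=(), optional=(), prohibited=()):
    """Apply the mandatory/optional/prohibited rules to a list of terms."""
    terms = set(doc_terms)
    if any(t in terms for t in prohibited):
        return False  # a prohibited clause matched
    if not all(t in terms for t in mandatory):
        return False  # a mandatory clause is missing
    if mandatory:
        # With a mandatory clause present, optional clauses only
        # influence ranking, never whether the document matches.
        return True
    # No mandatory clauses: at least one optional clause must match.
    # (A purely negative query therefore matches nothing in this model.)
    return any(t in terms for t in optional)


def optional_hits(doc_terms, optional=()):
    """Count matched optional clauses; more hits should rank higher."""
    terms = set(doc_terms)
    return sum(1 for t in optional if t in terms)


pumpkins = ["the", "smashing", "pumpkins"]
atoms = ["smashing", "atoms"]

# +Smashing Pumpkins: both match, but the first has one optional hit.
print(matches(pumpkins, mandatory=["smashing"], optional=["pumpkins"]))
print(matches(atoms, mandatory=["smashing"], optional=["pumpkins"]))

# Smashing Pumpkins -Atoms: the second document is excluded.
print(matches(atoms, optional=["smashing", "pumpkins"], prohibited=["atoms"]))
```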
If you would like to specify that a certain number or percentage of optional clauses should match or should not match, you can instead use the DisMax query parser with the min-should-match feature, described later in the chapter.
The Boolean operators AND
, OR
, and NOT
can be used as an alternative syntax to arrive at the same set of mandatory, optional, and prohibited clauses that were mentioned previously. Use the debugQuery
feature and observe that the parsedquery
string normalizes this syntax into the previous one (clauses are optional by default, as with OR
).
When the AND
or &&
operator is used between clauses, then both the left and right sides of the operator become mandatory, if not already marked as prohibited. Let's consider this query:
Smashing AND Pumpkins
It is equivalent to:
+Smashing +Pumpkins
Similarly, if the OR
or ||
operator is used between clauses, then both the left and right sides of the operator become optional, unless they are marked mandatory or prohibited. If the default operator is already OR
, then this syntax is redundant. If the default operator is AND
, then this is the only way to mark a clause as optional.
To match artist names that contain Smashing
or Pumpkins
, try:
Smashing || Pumpkins
The NOT
operator is equivalent to the -
syntax. So to find artists with Smashing
but not Atoms
in the name, you can do this:
Smashing NOT Atoms
We didn't need to specify a +
on Smashing
. This is because it is the only optional clause and there are no explicit mandatory clauses. Likewise, using AND
or OR
would have no effect in this example.
It may be tempting to try to combine AND
with OR
such as:
Smashing AND Pumpkins OR Green AND Day
However, this doesn't work as you might expect! Remember that AND
is equivalent to both sides of the operator being mandatory, and thus each of the four clauses becomes mandatory. Our dataset returned no results for this query. To combine query clauses in this way, you will need to use subqueries.
You can use parentheses to compose a query of smaller queries, referred to as subqueries or nested queries. The following example satisfies the intent of the previous example:
(Smashing AND Pumpkins) OR (Green AND Day)
Using what we learned previously, this could also be written as:
(+Smashing +Pumpkins) (+Green +Day)
But this is not the same as:
+(Smashing Pumpkins) +(Green Day)
The preceding subquery is interpreted as documents that must have a name with Smashing
or Pumpkins
and either Green
or Day
in its name. So if there were a band named Green Pumpkins, then it would match.
Solr added another syntax for subqueries to Lucene's old syntax, which allows the subquery to use a different query parser, including local-params. This is an advanced technique, so don't worry if you don't understand it at first.
As an example, suppose you have a search interface with multiple query boxes, where each box searches a different field. You could compose the query string yourself, but you would have some query-escaping issues to deal with. And if you wanted to take advantage of the dismax
parser, then with what you know so far, that isn't possible. Here's an approach using this new syntax:
+{!dismax qf=a_name v=$q.a_name} +{!dismax qf=a_alias v=$q.a_alias}
This example assumes that request parameters of q.a_name
and q.a_alias
are supplied for the user input for these fields in the schema. Recall from the local-params definition that the parameter v
can hold the query and that the $
refers to another named request parameter.
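As a rough sketch of how a client might assemble such a request, the following Python keeps the main query fixed and passes each box's raw input as its own request parameter. The function name build_multi_box_params and the sample inputs are invented for the example.

```python
from urllib.parse import urlencode


def build_multi_box_params(name_input, alias_input):
    """Return request parameters for a two-box search form."""
    return {
        # The main query never changes; each subquery pulls its text
        # from another request parameter through the $ dereference.
        "q": "+{!dismax qf=a_name v=$q.a_name} "
             "+{!dismax qf=a_alias v=$q.a_alias}",
        # User input is passed through untouched: it is never spliced
        # into the main query string, so it needs no query escaping here.
        "q.a_name": name_input,
        "q.a_alias": alias_input,
    }


params = build_multi_box_params('Smashing "Pumpkins', "Billy + Corgan")
query_string = urlencode(params)  # URL encoding is still required
print(query_string)
```

The only encoding left to the client is ordinary URL encoding, which any HTTP library handles.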
With versions of Solr earlier than 4.1, the syntax is slightly different and more complicated. The syntax uses a magic field named _query_
with its value being the subquery, which practically speaking, needs to be quoted. Here's the query from the preceding example, using the old syntax:
+_query_:"{!dismax qf=a_name v=$q.a_name}" +_query_:"{!dismax qf=a_alias v=$q.a_alias}"
Lucene doesn't actually support a pure negative query; for example:
-Smashing -Pumpkins
Solr enhances Lucene to support this, but only at the top-level query, such as in the preceding example. Consider the following, admittedly strange, query:
Smashing (-Pumpkins)
This query attempts to ask the question: Which artist names contain either Smashing
or do not contain Pumpkins
? However, it doesn't work and only matches the first clause (four documents). The second clause should essentially match most documents, resulting in a total for the query that is nearly every document. The artist named Wild Pumpkins at Midnight
is the only one in the index that does not contain Smashing
but does contain Pumpkins
, and so this query should match every document except that one.
To make this work, you have to take the subexpression containing only negative clauses, and add the all-documents query clause: *:*
, as shown here:
Smashing (-Pumpkins *:*)
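If your application generates subqueries, the workaround is easy to automate. The following Python sketch assumes each clause is already a string in prefix-operator form (so NOT-style clauses are not recognized); the function names are hypothetical.

```python
def is_pure_negative(clauses):
    """True when every clause is prohibited (prefixed with - or !)."""
    return bool(clauses) and all(c.startswith(("-", "!")) for c in clauses)


def subquery(clauses):
    """Render clauses as a parenthesized subquery, adding *:* if needed."""
    clauses = list(clauses)
    if is_pure_negative(clauses):
        # Give the negative clauses something to subtract from.
        clauses.append("*:*")
    return "(" + " ".join(clauses) + ")"


print("Smashing " + subquery(["-Pumpkins"]))  # Smashing (-Pumpkins *:*)
```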
Interestingly, this limitation is fixed in the edismax
query parser. Hopefully, a future version of Solr will fix it universally, thereby making this workaround unnecessary.
To have a clause explicitly search a particular field, you need to precede the relevant clause with the field's name, and then add a colon; spaces may be used in between, but that is generally not done:
a_member_name:Corgan
This matches bands containing a member with the name Corgan
. To match Billy
and Corgan
, do the following:
+a_member_name:Billy +a_member_name:Corgan
Or use this shortcut to match multiple words:
a_member_name:(+Billy +Corgan)
The content of the parentheses is a subquery, but with the default field being overridden to be a_member_name
, instead of what the default field would be otherwise. By the way, we could have used
AND
instead of +
, of course. Moreover, in these examples, all of the searches were targeting the same field, but you can certainly match any combination of fields needed.
A clause may be a phrase query: a contiguous series of words to be matched in order. In the previous examples, we've searched for text containing multiple words such as Billy
and Corgan
, but let's say we wanted to match Billy Corgan
(that is, the two words adjacent to each other in that order). This further constrains the query. Double quotes are used to indicate a phrase query, as shown in the following query:
"Billy Corgan"
Related to phrase queries is the notion of the term proximity, also known as the slop factor or a near query. In our previous example, if we wanted to permit these words to be separated by no more than say three words in between, we could do this:
"Billy Corgan"~3
For the MusicBrainz dataset, this is probably of little use. For larger text fields, this can be useful in improving search relevance. The dismax
query parser, which is described in the next chapter, can automatically turn a user's query into a phrase query with a configured slop.
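When phrase queries are generated by code rather than typed by hand, a small helper keeps the quoting and the slop suffix consistent. This is a minimal Python sketch; the helper name phrase is made up, and it only escapes the two characters that would otherwise break out of the quoted phrase.

```python
def phrase(words, slop=0):
    """Build a phrase query from words, with an optional slop factor."""
    text = " ".join(words)
    # Inside double quotes, a backslash or a double quote in the text
    # must be escaped so that it does not end the phrase early.
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    query = '"%s"' % text
    if slop > 0:
        query += "~%d" % slop
    return query


print(phrase(["Billy", "Corgan"]))     # "Billy Corgan"
print(phrase(["Billy", "Corgan"], 3))  # "Billy Corgan"~3
```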
For advanced requirements such as wildcards and Booleans within a phrase query, ComplexPhraseQueryParser
can be used. For more information on this parser, its options and performance considerations, visit https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser.
A plain keyword search will look in the index for an exact match, subsequent to text analysis processing on both the query and input document text (for example, tokenization and lowercasing). But sometimes you need to express a query for a partial match expressed using wildcards.
There is a highly relevant section in Chapter 3, Text Analysis, on partial/substring indexing. In particular, read about ReversedWildcardFilterFactory
. N-grams is a different approach that does not work with wildcard queries.
There are a few points to understand about wildcard queries:
If the field is stemmed, a query such as smashing*
might not find the original text Smashing
. The Porter stemmer will transform this word to smash
, whereas EnglishMinimalStemmer
(used in a_name
) won't touch this word. Consequently, don't stem or use a minimal stemmer.
ReversedWildcardFilterFactory
helps a lot with the performance of leading-wildcard queries. But if you have an asterisk wildcard on both ends of the word, then this is the worst-case scenario.
To find artists containing words starting with Smash
, you can use:
smash*
Or perhaps to find those starting with sma
and ending with ing
, use:
sma*ing
The asterisk matches any number of characters (perhaps none). You can also use ?
to force a match of any single character at that position:
sma??*
That would match words that start with sma
that have at least two more characters, but potentially more.
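To get a feel for these patterns without a running Solr, you can try them against a list of terms in Python. The standard fnmatch module happens to give * and ? the same meaning as described here, so it serves as a stand-in; it is not how Lucene evaluates wildcards, and the term list is invented for the example.

```python
from fnmatch import fnmatchcase

# Pretend these are the lowercased terms in the index for one field.
terms = ["sma", "smash", "smashing", "smacking", "small", "mashing"]


def wildcard(pattern, indexed_terms):
    """Return the terms that match pattern in full, case-sensitively."""
    return [t for t in indexed_terms if fnmatchcase(t, pattern)]


print(wildcard("smash*", terms))   # ['smash', 'smashing']
print(wildcard("sma*ing", terms))  # ['smashing', 'smacking']
print(wildcard("sma??*", terms))   # ['smash', 'smashing', 'smacking', 'small']

# If a stemmer had reduced "smashing" to "smash" at index time,
# the pattern smashing* would have nothing to match.
print(wildcard("smashing*", ["smash"]))  # []
```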
As far as scoring is concerned, each matching term gets the same score regardless of how close it is to the query pattern. Lucene can support a variable score at the expense of performance, but you would need to do some hacking to get Solr to do that.
Fuzzy queries are useful when your search term needn't be an exact match, but the closer the better. The fewer the number of character insertions, deletions, or exchanges relative to the search term length, the better the score. The algorithm used is known as the Levenshtein Distance algorithm, also known as the edit distance. Fuzzy queries have the same need to avoid stemming, just as wildcard queries do. For example:
smashing~
Notice the tilde character at the end. Without this notation, simply smashing
matches only four documents because only that many artist names contain that word. The search term smashing~
matched 26 documents. The default edit distance is 2, but you can reduce it to 1 like so for less fuzzy matching:
smashing~1
That results in six matched documents—two more than a non-fuzzy search. Prior to Lucene 4, the edit distance was specified as a fraction of the number of characters in the word, and Lucene could search based on whatever edit distance this came to, albeit slowly. Lucene 4 is much faster but is limited to an edit distance no greater than 2, so you are now best off simply specifying 1 or 2 instead of using the fractional syntax.
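For readers unfamiliar with edit distance, here is a textbook Python implementation of the Levenshtein distance, together with a filter that mimics a maximum edit distance. It is an illustration only: Lucene uses a much faster automaton-based approach, and the sample terms are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy(term, indexed_terms, max_edits=2):
    """Return the indexed terms within max_edits of term."""
    return [t for t in indexed_terms if edit_distance(term, t) <= max_edits]


terms = ["smashing", "slashing", "crashing", "pumpkins"]
print(fuzzy("smashing", terms))     # ['smashing', 'slashing', 'crashing']
print(fuzzy("smashing", terms, 1))  # ['smashing', 'slashing']
```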
There may be scenarios where you need to match documents using a specific pattern that can't be expressed using wildcard or fuzzy queries. For these cases, a regular expression query might be the answer.
The Solr regular expression syntax is simple and straightforward. Here's an example that matches documents that contain a possible 5-digit zip code, somewhere in the a_address
field:
a_address:/[0-9]{5}/
As you can see, the pattern is enclosed in forward slashes (delimiters). Solr implicitly matches the pattern against each full indexed term, so there is no need to anchor it to the beginning or end of the input string.
Regular expression queries are constant scoring—the scores of any matching documents will always be 1.0.
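The implicit anchoring is the part that most often surprises people, so here is a small Python illustration. Python's re syntax is not identical to Lucene's regular expression syntax, and the term list is invented; the point is only the difference between matching a whole term and searching inside it.

```python
import re

ZIP = re.compile(r"[0-9]{5}")

# Pretend these are the indexed terms of one a_address value.
terms = ["1", "main", "st", "90210", "902101", "apt", "4b"]


def regexp_query(pattern, indexed_terms):
    """Return the terms that the pattern matches in full."""
    return [t for t in indexed_terms if pattern.fullmatch(t)]


print(regexp_query(ZIP, terms))  # ['90210']

# An unanchored search would also accept the six-digit term,
# which is not how the query behaves.
print([t for t in terms if ZIP.search(t)])  # ['90210', '902101']
```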
Lucene lets you query for numeric, date, and even text ranges. The following query matches all of the bands formed in the 1990s:
a_type:2 AND a_begin_date:[1990-01-01T00:00:00.000Z TO 1999-12-31T23:59:59.999Z]
Observe that the date format is the full ISO-8601 date-time in UTC, which Solr mandates (the same format used by Solr to index dates and that which is emitted in search results). The .999
milliseconds part is optional. The [
and ]
brackets signify an inclusive range, and, therefore, it includes the dates on both ends. To specify an exclusive range, use {
and }
. In Solr 3, both sides must be inclusive or both exclusive; Solr 4 allows mixing the two. The workaround in Solr 3 is to introduce an extra clause to include or exclude a side of the range.
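Getting the date format exactly right by hand is error-prone, so it is worth generating range queries in code. The following Python sketch formats timezone-aware datetimes in the required UTC form and chooses the brackets per side; the helper names are invented, and the mixed-bracket result relies on the Solr 4 behavior just described.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format an aware datetime as ISO-8601 in UTC, without milliseconds."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range(field, start, end, include_start=True, include_end=True):
    """Build a range clause with an inclusive or exclusive bracket per side."""
    opening = "[" if include_start else "{"
    closing = "]" if include_end else "}"
    return "%s:%s%s TO %s%s" % (
        field, opening, solr_date(start), solr_date(end), closing)


# The 1990s, expressed as "from 1990 up to but excluding 2000".
nineties = date_range(
    "a_begin_date",
    datetime(1990, 1, 1, tzinfo=timezone.utc),
    datetime(2000, 1, 1, tzinfo=timezone.utc),
    include_end=False,
)
print(nineties)  # a_begin_date:[1990-01-01T00:00:00Z TO 2000-01-01T00:00:00Z}
```

An exclusive upper bound avoids having to spell out the last representable instant of the decade.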
Use the right field type
To get the fastest numerical/date range query performance, particularly when there are many indexed values, use a trie
field (for example, tdate
) with precisionStep
. This was discussed in Chapter 2, Schema Design.
For most numbers in the MusicBrainz schema, we only have identifiers, and so it made sense to use the plain long
field type, but there are some other fields. For the track duration in the tracks data, we could do a query such as the following one to find all of the tracks that are longer than 5 minutes (300 seconds, 300,000 milliseconds):
t_duration:[300000 TO *]
In this example, we can see Solr's support for open-ended range queries by using *
.
Although uncommon, you can also use range queries with text fields. For this to have any use, the field should have only one term indexed. You can control this either by using the string
field type, or by using the KeywordTokenizer
. You may want to do some experimentation. The following example finds all documents where somefield
has a term starting with B
:
somefield:[B TO C}
Neither side of the range, B nor C, is processed with any text analysis that may exist in the field type definition. If there is any text analysis such as lowercasing, you will need to do the same to the query or you will get unexpected results.
Solr extended Lucene's old query parser to add date literals as well as some simple math that is especially useful in specifying date ranges. In addition, there is a way to specify the current date-time using NOW
. The syntax offers addition, subtraction, and rounding at various levels of date granularity, such as years, seconds, and so on down to milliseconds. The operations can be chained together as needed, in which case they are executed from left to right. Spaces aren't allowed. For example:
r_event_date:[* TO NOW-2YEAR]
In the preceding example, we searched for documents where an album was released over two years ago. NOW
has millisecond precision. Let's say what we really wanted was precision to the day. By using /
, we can round down (it never rounds up):
r_event_date:[* TO NOW/DAY-2YEAR]
The units to choose from are YEAR
, MONTH
, DAY
, DATE
(synonymous with DAY
), HOUR
, MINUTE
, SECOND
, MILLISECOND
, and MILLI
(synonymous with MILLISECOND
). Furthermore, they can be pluralized by adding an S
, as in YEARS
.
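To see why the order of operations matters, here is a Python sketch that mimics two of the DateMath operations on a fixed instant. It is not Solr's implementation: the function names are invented, only day rounding and year subtraction are modeled, and clamping February 29 to February 28 is an assumption made by this sketch.

```python
from datetime import datetime, timezone


def round_down_to_day(dt):
    """Mimic /DAY: drop everything below the day."""
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)


def subtract_years(dt, years):
    """Mimic -nYEAR on a calendar basis."""
    try:
        return dt.replace(year=dt.year - years)
    except ValueError:
        # February 29 has no counterpart in the target year;
        # this sketch assumes clamping to February 28.
        return dt.replace(year=dt.year - years, day=28)


# A fixed stand-in for NOW so that the results are reproducible.
now = datetime(2014, 6, 15, 13, 45, 30, 123000, tzinfo=timezone.utc)

# NOW-2YEAR keeps the time of day.
print(subtract_years(now, 2).isoformat())

# NOW/DAY-2YEAR rounds first and then subtracts, left to right.
print(subtract_years(round_down_to_day(now), 2).isoformat())
```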
This so-called DateMath syntax is not just for querying dates; it is for supplying dates to be indexed by Solr too! An index-time common usage is to timestamp added data. Using the NOW
syntax as the default
attribute of a timestamp field definition makes this easy. Here's how to do that: <field name="indexedAt" type="tdate" default="NOW/SECOND" />
.
You can easily modify the degree to which a clause in the query string contributes to the ultimate relevancy score by adding a multiplier. This is called boosting. A value between 0
and 1
reduces the score, and numbers greater than 1
increase it. You'll learn more about scoring in the next chapter. In the following example, we search for artists that either have a member named Billy
, or have a name containing the word Smashing
:
a_member_name:Billy^2 OR Smashing
Here, we search for an artist name containing Billy
, and optionally Bob
or Corgan
, but we're less interested in those that are also named Corgan
:
+Billy Bob Corgan^0.7
Querying for the existence of a value is not actually a new syntax case, but an application of range queries. Suppose you wanted to match all of the documents that have an indexed value in a field. Here, we find all of the documents that have something in a_name
:
a_name:[* TO *]
As a_name
is the default field, just [* TO *]
will do.
This can be negated to find documents that do not have a value for a_name
, as shown in the following code:
-a_name:[* TO *]
Just a_name:*
is usually equivalent, and similarly, -a_name:*
for negation. This was an accidental feature that users discovered. However, for some non-text fields such as numbers and dates, it is much slower, as it uses a completely different code path that was designed for wildcard text matching, not the nature of the actual field type. Consequently, we recommend avoiding this syntax. See SOLR-1982.
Like wildcard and fuzzy queries, these are expensive, slowing down as the number of distinct terms in the field increases.
Performance tip
If you need to perform these frequently, consider adding this to your schema: <field name="field_name_ss" type="string" stored="false" multiValued="true" />
. Then, at index time, add the name of fields that have a value to it. There's JavaScript code for this commented in the update-script.js
file invoked by an UpdateRequestProcessor
. The query would then simply be field_name_ss:a_name
, which is as fast as it gets.
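If you prepare documents in client code instead of in an update script, the same idea is only a few lines. This Python sketch treats a document as a plain dictionary; the function name and the rule for what counts as "having a value" are assumptions of the example.

```python
def add_field_names(doc, target="field_name_ss"):
    """Return a copy of doc with target listing the fields that have a value."""

    def has_value(value):
        # Treat None, empty strings, and empty lists as "no value".
        return value is not None and value != "" and value != []

    names = sorted(
        name for name, value in doc.items()
        if name != target and has_value(value)
    )
    enriched = dict(doc)
    enriched[target] = names
    return enriched


doc = {
    "id": "Artist:11650",
    "a_name": "The Smashing Pumpkins",
    "a_alias": [],
    "a_end_date": None,
}
print(add_field_names(doc)["field_name_ss"])  # ['a_name', 'id']
```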
The following characters are used by the query syntax as described in this chapter:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
In order to use any of these without their syntactical meaning, you need to escape them with a preceding backslash (\), as seen here:
id:Artist\:11650
This also applies to the field name part. In some cases, such as this one, where the character is part of the text that is indexed, the double-quotes phrase query will also work, even though there is only one term:
id:"Artist:11650"
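When queries are assembled from user input, escaping is best done by a helper rather than by hand. Here is a minimal Python sketch of one; the function name escape_term is made up, it escapes & and | individually (which also covers && and ||), and it leaves whitespace alone, so it is meant for single terms only.

```python
import re

# Every single character that has a syntactic meaning in the query syntax.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(text):
    """Prefix each special character in text with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


print("id:" + escape_term("Artist:11650"))  # id:Artist\:11650
print(escape_term("AC/DC"))                 # AC\/DC
```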
A common situation in which a query needs to be generated, and thus escaped properly, is when generating a simple filter query in response to choosing a field-value facet when faceting. This syntax and suggested situation is getting ahead of us, but I'll show it anyway since it relates to escaping. The query uses the term
query parser as {!term f=a_type}group
. What follows }
is not escaped at all; even a \ is interpreted literally, and so with this trick, you needn't worry about escaping rules.