In the previous chapter, you were introduced to the dismax query parser as the preferred choice for user queries. The parser for user queries is set with the defType parameter. The syntax, the fields that are queried with boosts (qf), the min-should-match syntax (mm), and the default query (q.alt) were already described. We're now going to cover the remaining features: the ones that most closely relate to scoring.
The ability to search across multiple fields with different boosts in this query parser is a feature powered by Lucene's DisjunctionMaxQuery query class. Let's start with an example. If the query string is simply rock, then DisMax might be configured to turn this into a DisjunctionMaxQuery similar to this Boolean query:
fieldA:rock^2 OR fieldB:rock^1.2 OR fieldC:rock^0.5
The difference between that Boolean OR query and DisjunctionMaxQuery (we will call it just DisMax henceforth) is only in the scoring. Without getting into the details, if the intention is to search for the same text across multiple fields, then it's better to use the maximum subclause score rather than the sum. DisMax will take the max, whereas Boolean uses the sum.
The dismax query parser has a tie parameter, which takes a value between zero (the default) and one. Raising this value above zero makes it act as a tie-breaker, giving an edge to a document that matched a term in multiple fields over one that matched in a single field. At the highest value of 1, it scores very similarly to a Boolean query.
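The effect of tie can be sketched numerically. The following is a simplified model, not Solr's implementation: it assumes the DisMax score is the best subclause score plus tie times the sum of the remaining subclause scores, and it ignores normalization and coordination factors. The per-field scores are made-up values for illustration.

```python
def boolean_score(clause_scores):
    """Boolean OR (simplified): sum of the matching subclause scores."""
    return sum(clause_scores)


def dismax_score(clause_scores, tie=0.0):
    """DisMax (simplified): best subclause score plus tie * the other scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


# Hypothetical scores for "rock" in fieldA, fieldB, and fieldC.
scores = [0.8, 0.3, 0.1]

only_max = dismax_score(scores)             # tie=0: just the best field
with_tie = dismax_score(scores, tie=0.1)    # small edge for multi-field matches
like_boolean = dismax_score(scores, tie=1)  # tie=1: same as the sum in this model
```

In this model, a document matching in three fields scores slightly higher than one matching only in the best field once tie is above zero, which is the tie-breaking behavior described above.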
Suppose a user searches for Billy Joel. This is interpreted as two terms to search for, and depending on how the request handler is configured, either both must be found in the document or just one. Perhaps for one of the matching documents, Billy is the sole name of a band, and it has a member named Joel. Solr will match this document, and perhaps it is of interest to the user since it contains both words the user typed. However, it's a fairly intuitive observation that a document field containing the entirety of what the user typed, Billy Joel, represents a closer match to what the user is looking for. Solr would certainly find such a document too, but it's hard to predict what the relative scoring might be. To improve the scoring, you might be tempted to automatically quote the user's query, but that would omit documents that don't have the adjacent words. What the DisMax handler can do is add a phrased version of the user's query onto the original query as an optional clause. So, in a nutshell, it rewrites the following query:
Billy Joel
It then turns it into:
+(Billy Joel) "Billy Joel"
The rewritten query shows that the original query is mandatory by using +, and that an optional phrase has been added. A document containing the phrase Billy Joel not only matches that clause of the rewritten query, but it also matches Billy and Joel, for three clauses in total. If in another document the phrase didn't match, but it had both words, then only two clauses would match. Lucene's scoring algorithm would give a higher coordination factor to the first document, and would score it higher, all other factors being equal.
Automatic phrase boosting is not enabled by default. In order to use this feature, you must use the pf parameter, which is an abbreviation of phrase fields. The syntax is identical to qf. You should start with the same value and then make adjustments. Common reasons to vary pf from qf are as follows:
outputUnigrams to false.
pf tips
Start with the same value used as qf, but with boosts cut in half. Remove fields that are always one term, such as an identifier. Use common-grams or shingling, as described in Chapter 10, Scaling Solr, to increase performance.
The previous chapter described phrase slop, also known as term proximity. The syntax follows a phrase with a tilde and a number, as follows:
"Billy Joel"~1
The dismax query parser adds two parameters to automatically set the slop: qs for any explicit phrase queries that the user entered and ps for the phrase boosting mentioned previously. If slop is not specified, then there is no slop, which is equivalent to a value of zero. For more information about slop, see the corresponding discussion in the previous chapter. Here is a sample configuration of both slop settings:
qs=1&ps=0
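To see how these parameters fit together in a request, here is a minimal sketch that assembles a query string using Python's standard library. The parameter names (defType, q, qf, pf, qs, ps) come from the text; the field names a_name and a_alias and their boosts are hypothetical, with the pf boosts set to half the qf boosts as the tip above suggests.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "defType": "dismax",
    "q": "Billy Joel",
    "qf": "a_name^2 a_alias^1.2",  # hypothetical fields and boosts
    "pf": "a_name^1 a_alias^0.6",  # same fields, boosts cut in half
    "qs": 1,                       # slop for phrases the user typed explicitly
    "ps": 0,                       # slop for the automatic phrase boost
}

# urlencode escapes spaces and the ^ character for use in a URL.
query_string = urlencode(params)

# Round-trip to confirm that the values survive the encoding.
decoded = parse_qs(query_string)
```

The resulting query_string would be appended to the URL of whatever search handler you have configured; these defaults could equally be placed in the request handler definition instead.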
In addition to boosting the entire query as a phrase, edismax supports boosting consecutive word pairs if there are more than two queried words, and consecutive triples if there are more than three queried words. These are configured by setting pf2 and pf3, respectively, using the same syntax as the pf parameter. For example, consider the following query:
how now brown cow
It would now become:
+(how now brown cow) "how now brown cow" "how now" "now brown" "brown cow" "how now brown" "now brown cow"
This feature is not affected by the ps (phrase slop) parameter, which only applies to the entire phrase boost; there's ps2 and ps3 to set these slops.
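The rewriting described in this section can be mimicked with a few lines of code. This is a conceptual sketch only: the function names are hypothetical, the output is the simplified notation used in the text rather than the actual Lucene query that Solr builds, and per-field expansion and boosts are left out.

```python
def word_ngrams(words, n):
    """Return every run of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_boosted(query, use_pf=True, use_pf2=False, use_pf3=False):
    """Sketch of the rewritten query: mandatory original plus optional phrases."""
    words = query.split()
    clauses = ["+(" + " ".join(words) + ")"]  # the original query, mandatory
    if use_pf and len(words) > 1:
        clauses.append('"' + " ".join(words) + '"')  # whole-query phrase
    if use_pf2 and len(words) > 2:
        clauses += ['"' + g + '"' for g in word_ngrams(words, 2)]  # word pairs
    if use_pf3 and len(words) > 3:
        clauses += ['"' + g + '"' for g in word_ngrams(words, 3)]  # word triples
    return " ".join(clauses)


two_words = phrase_boosted("Billy Joel")
four_words = phrase_boosted("how now brown cow", use_pf2=True, use_pf3=True)
```

The number of optional clauses grows with query length, which is one way to see why longer queries get slower when pf2 and pf3 are enabled.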
You can expect the relevancy to improve for longer queries, but of course, these queries are going to be even slower now. To speed up such queries, use common-grams or shingling, described in Chapter 10, Scaling Solr. If you are using pf2 or pf3, consider a maxShingleSize of 3 (but monitor its impact on index size), and consider omitting larger text fields from pf2 or pf3.
Continuing with the boosting theme is another way to affect the score of documents: boost queries. The dismax parser lets you specify multiple additional queries using one or more bq parameters, which, like the automatic phrase boost, get added onto the user's query in a similar manner. Remember that boosting only serves to affect the scoring of documents that already matched the user's query in the q parameter. If a matched document also matches a bq query, then it will be scored higher than if it didn't.
For a realistic example of using a boost query, we're going to look at MusicBrainz releases data. Releases have an r_type field containing values such as Album, Single, Compilation, and others, and an r_official field containing values such as Official, Promotion, Bootleg, and Pseudo-Release. We don't want to sort search results based on these, since it's most important to consider search relevancy of the query. However, we might want to influence the score based on these fields. For example, let's say albums are the most relevant release type, whereas a compilation is the least relevant. And let's say that an official release is more relevant than bootleg or promotional or pseudo-releases. We might express this using a boost query like this (defined in the request handler):
bq=r_type:Album^2 (*:* -r_type:Compilation)^2 r_official:Official^2
Searching releases for "the aeroplane flies high" (quoted and not a typo) showed that this boost query did what it should by breaking a score tie in which the release names were the same but these attributes varied. In reality, the boosts would not all be 2; they would be tuned to give the desired relevancy by carefully examining the debugQuery output. One oddity in this query is (*:* -r_type:Compilation)^2, which boosts all documents except compilations. Using r_type:Compilation^0.5 would not work, since it would still add to the score, and only when the document is a compilation, which is exactly what we don't want. Put another way, you can't under-boost, but you can do it indirectly by boosting the inverse set of documents. To understand why *:* is needed, read the previous chapter on the limitations of pure negative queries.
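The reason under-boosting fails can be shown with a toy model. The sketch below assumes that a matching boost clause simply adds a fixed positive amount to the score; real contributions depend on Lucene's scoring formula, so the numbers and the helper name are illustrative only.

```python
def score_with_bq(base_score, matches_bq, bq_contribution):
    """Toy model: a matching optional boost clause adds to the score."""
    return base_score + (bq_contribution if matches_bq else 0.0)


# Two releases that tie on the user query (hypothetical base score).
release_types = {"album": "Album", "compilation": "Compilation"}
base = 1.0

# Attempt 1: bq=r_type:Compilation^0.5 still ADDS to compilations only.
naive = {
    name: score_with_bq(base, r_type == "Compilation", 0.5)
    for name, r_type in release_types.items()
}

# Attempt 2: bq=(*:* -r_type:Compilation)^2 adds to everything else.
inverse = {
    name: score_with_bq(base, r_type != "Compilation", 2.0)
    for name, r_type in release_types.items()
}
```

With the first attempt the compilation ends up ranked above the album, the opposite of the intent; boosting the inverse set ranks the album first.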
Boost functions offer a powerful way to either add or multiply the result of a user-specified formula to a document's score. By formula, I refer to a composition of Solr function queries, which are described in detail later in this chapter. To add to the score, specify the function query with the bf parameter. The edismax query parser adds support for multiplying the score by the result; for this, you specify the function query with the boost parameter. You can specify bf and boost each as many times as you wish.
An example of boosting MusicBrainz tracks by how recently they were released is:
boost=recip(abs(ms(NOW/DAY,r_event_date_earliest)),1,6.3E10,6.3E10)
There cannot be any spaces within the function. The bf and boost parameters are actually not parsed in the same way. The bf parameter allows multiple boost functions within the same parameter, separated by spaces, as an alternative to using additional bf parameters. You can also apply a multiplied boost factor to the function in bf by appending ^100 (or another number) to the end of the function query. This is equivalent to using the mul() function query, described later.
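To get a feel for the values the recency boost shown earlier produces, here is a small sketch that evaluates it outside Solr. It relies on Solr's recip(x,m,a,b) function query being defined as a / (m*x + b); the helper recency_boost and the sample ages are assumptions for illustration, with the age in milliseconds standing in for the ms(NOW/DAY,r_event_date_earliest) part.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Reciprocal function as defined for Solr's recip: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_days):
    """Evaluate recip(abs(age_ms),1,6.3E10,6.3E10) for an age given in days."""
    age_ms = abs(age_days * MS_PER_DAY)
    return recip(age_ms, 1, 6.3e10, 6.3e10)


today = recency_boost(0)              # a release dated today gets the full boost
one_year = recency_boost(365)
ten_years = recency_boost(3650)
half_point_days = 6.3e10 / MS_PER_DAY  # age at which the boost falls to one half
```

Since 6.3E10 milliseconds is roughly two years, the boost starts at 1 for a brand-new release, drops to about one half at two years, and keeps decaying toward zero without ever reaching it.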
Ensure that newSearcher in solrconfig.xml has a sample query using the boost functions you're using. In doing so, you ensure that any referenced fields are loaded into Lucene's field cache instead of penalizing the first query with this cost. Chapter 10, Scaling Solr, has more information on performance tuning.
In a nutshell, if you can tame the difficulty in additive boosting (the bf param), then you'll probably be more satisfied with the scoring. Multiplicative boosting (the boost param) is easier to use, especially if the boost is meant to carry no more weight than the user query, which is usually the case.
If you describe how you'd like the scoring to work as, "I'd like two-thirds of the document score to come from the user query and the remaining one-third to come from my formula" (or whatever ratios), then additive scores are for you. The trick is that you need to know the top score for an excellent match on the user query in order to balance the proportions. Try an exact match on a title (a highly boosted field in the query) and see what the top score is. Do this a number of times for a variety of documents, looking for reasonable consistency. So if, for example, the top end of the user query ends up being 1.5, and you want the function query to contribute about half as much as the user query does to the final score, then adjust the function query so its upper bound is 0.75. Simply multiply by that if you already have the function query in the 0-1 nominal range. Even if these instructions don't seem too bad, in practice tuning additive scores is tricky, since Lucene will react to every change you make by changing the queryNorm part of the score out from under you, which you have no control over. As it does this, keep your eye on the overall ratios that you want between the added boost part and the user query part, not the final score values. Another, bigger problem is that your measurements of the maximum user query score will change as your data changes, which means some ongoing monitoring of whatever values you choose. A further complication is that DisMax's tie parameter tends to interfere with this way of boosting.
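The arithmetic behind the additive example can be written out as a short sketch. The helper names are hypothetical, and the numbers are the ones from the text: a top user query score of 1.5 and a function query that should contribute half as much.

```python
def additive_scale(top_user_score, function_to_user_ratio):
    """Upper bound the function query should reach, given the desired ratio."""
    return top_user_score * function_to_user_ratio


def share_of_total(user_part, function_part):
    """Fraction of the final additive score that comes from the function."""
    return function_part / (user_part + function_part)


top_user_score = 1.5                         # measured from exact-match queries
scale = additive_scale(top_user_score, 0.5)  # multiply a 0-1 function by this
function_share = share_of_total(top_user_score, scale)
```

A scale of 0.75 against a user score of 1.5 gives the function one-third of the total, which is the two-thirds to one-third split described above. The measured top score is the fragile input here, so it would need to be re-checked as the index changes.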
The other way of thinking about your boost function is as a user query score multiplier (a factor). With multiplication, you don't need to concern yourself with whatever a "good" user query score is; it has no bearing here. The tricky part of multiplicative boosts is weighting your boost so that it has the relative impact you want. If you simply supply your nominal range (0-1) function directly as the boost, then it has the same weight as the user query. As you shift the function's values above 0, you reduce the influence it has relative to the user query. For example, if you add 1 to your nominal 0-1 range so that it goes from 1-2, then it is weighted roughly half as much [formula: (2-1)/2 = 0.5].
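The bracketed formula can be turned into a small helper for trying other offsets. Generalizing it to (high - low) / high for any shifted range is an assumption made here for illustration; the text only gives the 1-2 case, and the helper name is hypothetical.

```python
def relative_weight(low, high):
    """Rough weight of a multiplicative boost ranging over [low, high],
    relative to the user query, using the (high - low) / high rule of thumb."""
    return (high - low) / high


unshifted = relative_weight(0, 1)     # nominal 0-1 range: same weight as the query
shifted_by_1 = relative_weight(1, 2)  # the text's example: about half the weight
shifted_by_3 = relative_weight(3, 4)  # larger offset: about a quarter
```

Under this rule of thumb, the larger the constant added to the function, the less the boost can change the ranking. The shift itself could be done inside Solr, for instance by wrapping the function in the sum function query.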
It's possible to use multiplicative boosts that are weighted as more relevant than the user query, but I haven't fully worked out the details. A place to start experimenting with this is boosting the boost function by a power, say 1.7, which appeared to about double the weight.