The DisMax query parser – part 2

In the previous chapter, you were introduced to the dismax query parser as the preferred choice for user queries. The parser for user queries is set with the defType parameter. The syntax, the queried fields with their boosts (qf), the min-should-match setting (mm), and the default query (q.alt) were already described. We're now going to cover the remaining features: the ones that most closely relate to scoring.

Note

Any mention of dismax here applies to the edismax query parser too, unless specified otherwise. As explained in the previous chapter, edismax is the extended DisMax parser. It is generally superior to dismax, as you'll see in the upcoming section.

Lucene's DisjunctionMaxQuery

The ability to search across multiple fields with different boosts in this query parser is a feature powered by Lucene's DisjunctionMaxQuery query class. Let's start with an example. If the query string is simply rock, then DisMax might be configured to turn this into a DisjunctionMaxQuery similar to this Boolean query:

fieldA:rock^2 OR fieldB:rock^1.2 OR fieldC:rock^0.5

The difference between that Boolean OR query and DisjunctionMaxQuery (we will call it just DisMax henceforth) is only in the scoring. Without getting into the details, if the intention is to search for the same text across multiple fields, then it's better to use the maximum subclause score rather than the sum. DisMax will take the max, whereas Boolean uses the sum.

The dismax query parser has a tie parameter, which is between zero (the default) and one. By raising this value above zero, it serves as a tie-breaker to give an edge to a document that matched a term in multiple fields versus one. At the highest value of 1, it scores very similarly to that of a Boolean query.

Tip

In practice, setting tie to a small value like 0.1 is effective.
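
As a rough illustration of how tie changes the combined score, the following Python sketch applies the rule that DisjunctionMaxQuery is documented to use: the best subclause score plus tie times the sum of the other subclause scores. The per-field scores are made-up numbers and the function name is hypothetical; real scores come from Lucene.

```python
def dismax_score(subscores, tie=0.0):
    """Combine per-field scores the way a disjunction-max query does.

    The result is the best subclause score plus `tie` times the sum of
    the remaining subclause scores.
    """
    if not subscores:
        return 0.0
    best = max(subscores)
    return best + tie * (sum(subscores) - best)


# Made-up per-field scores for one document matching "rock" in three fields.
field_scores = [0.8, 0.3, 0.1]

max_only = dismax_score(field_scores, tie=0.0)   # pure max
tie_break = dismax_score(field_scores, tie=0.1)  # small edge for extra fields
like_sum = dismax_score(field_scores, tie=1.0)   # same as summing all fields
```

With tie at 0 only the best field counts, and with tie at 1 the result equals the plain sum, which is why the highest setting behaves much like a Boolean query.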

Boosting – automatic phrase boosting

Suppose a user searches for Billy Joel. This is interpreted as two terms to search for, and depending on how the request handler is configured, either both must be found in the document or just one. Perhaps for one of the matching documents, Billy is the sole name of a band, and it has a member named Joel. Solr will match this document, and perhaps it is of interest to the user since it contains both words the user typed. However, it's a fairly intuitive observation that a document field containing the entirety of what the user typed, Billy Joel, represents a closer match to what the user is looking for. Solr would certainly find such a document too, but it's hard to predict what the relative scoring might be. To improve the scoring, you might be tempted to automatically quote the user's query, but that would omit documents that don't have the words adjacent to each other. What the dismax query parser can do is add a phrased version of the user's query onto the original query as an optional clause. So, in a nutshell, it takes the following query:

Billy Joel

It then turns it into:

+(Billy Joel) "Billy Joel"

Note

The queries here illustrate phrase boosting in its most basic form. It doesn't depict the DisjunctionMaxQuery that dismax uses, because there's no query syntax for it.

The rewritten query depicts that the original query is mandatory by using +, and it shows that we've added an optional phrase. A document containing the phrase Billy Joel not only matches that clause of the rewritten query, but it also matches Billy and Joel—three clauses in total. If in another document the phrase didn't match, but it had both words, then only two clauses would match. Lucene's scoring algorithm would give a higher coordination factor to the first document, and would score it higher, all other factors being equal.

Configuring automatic phrase boosting

Automatic phrase boosting is not enabled by default. In order to use this feature, you must use the pf parameter, which is an abbreviation of phrase fields. The syntax is identical to qf. You should start with the same value and then make adjustments. Common reasons to vary pf from qf are as follows:

  • To use different (typically lower) boost factors so that the impact of phrase boosting isn't overpowering. Experimentation will guide you to make these adjustments.
  • To omit fields that are always one term, such as an identifier, because there's no point in searching the field for phrases.
  • To omit some of the fields that have lots of text since that might slow down search performance too much.
  • To substitute a field for another that has the same data but is analyzed differently. For example, you might choose to speed up these phrase searches by shingling (a text analysis technique described in Chapter 10, Scaling Solr) into a separate field, instead of shingling the original field. Such a shingling configuration would be a little different than described in that chapter; you would set outputUnigrams to false.

    Tip

    pf tips

    Start with the same value used as qf, but with boosts cut in half. Remove fields that are always one term, such as an identifier. Use common-grams or shingling, as described in Chapter 10, Scaling Solr, to increase performance.

Phrase slop configuration

The previous chapter described phrase slop, also known as term proximity. The syntax follows a phrase with a tilde and a number, as follows:

"Billy Joel"~1

The dismax query parser adds two parameters to automatically set the slop: qs for any explicit phrase queries that the user entered and ps for the phrase boosting mentioned previously. If slop is not specified, then there is no slop, which is equivalent to a value of zero. For more information about slop, see the corresponding discussion in the previous chapter. Here is a sample configuration of both slop settings:

qs=1&ps=0
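
To show how these parameters might travel together in a request, here is a minimal Python sketch that builds a query string containing qf, pf, qs, and ps. The field names and boost values are hypothetical placeholders, not taken from the MusicBrainz schema; only the parameter names come from the text above.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical field names and boosts, for illustration only.
params = {
    "defType": "edismax",
    "q": "Billy Joel",
    "qf": "name^2 alias^1.2 member_name^0.5",
    # pf starts from qf with the boosts cut in half.
    "pf": "name^1 alias^0.6 member_name^0.25",
    "qs": 1,  # slop for phrases the user quoted explicitly
    "ps": 0,  # slop for the automatic phrase boost
}

# URL-encode the parameters so they can be appended to a search request.
query_string = urlencode(params)
decoded = parse_qs(query_string)
```

The same values could equally be set as defaults in the request handler definition instead of being sent with each request.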

Partial phrase boosting

In addition to boosting the entire query as a phrase, edismax supports boosting consecutive word pairs if there are more than two queried words, and consecutive triples if there are more than three queried words. Configure these with the pf2 and pf3 parameters, respectively, using the same syntax as pf. For example, consider the following query:

how now brown cow

It would now become:

+(how now brown cow) "how now brown cow" "how now" "now brown" "brown cow" "how now brown" "now brown cow"

This feature is not affected by the ps (phrase slop) parameter, which only applies to the entire phrase boost; use ps2 and ps3 to set the slop for the pair and triple boosts.
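
The rewriting described above can be sketched in a few lines of Python. This is only an illustration of which phrase clauses get added; the function name is hypothetical, and the real parser builds Lucene query objects across the configured fields, not a string.

```python
def phrase_boost_clauses(user_query, use_pf=True, use_pf2=True, use_pf3=True):
    """Return the simplified rewritten query as a string.

    The original words form one mandatory clause; the whole-query phrase
    and the consecutive pairs and triples are appended as optional
    phrase clauses.
    """
    words = user_query.split()
    clauses = ["+(" + " ".join(words) + ")"]
    if use_pf and len(words) > 1:
        clauses.append('"' + " ".join(words) + '"')
    # Pairs need more than two words; triples need more than three.
    for size, enabled in ((2, use_pf2), (3, use_pf3)):
        if enabled and len(words) > size:
            for i in range(len(words) - size + 1):
                clauses.append('"' + " ".join(words[i:i + size]) + '"')
    return " ".join(clauses)


rewritten = phrase_boost_clauses("how now brown cow")
two_words = phrase_boost_clauses("Billy Joel")
```

For a two-word query only the whole-phrase clause is added, which matches the earlier Billy Joel example.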

Tip

You can expect the relevancy to improve for longer queries, but of course, these queries are going to be even slower now. To speed up such queries, use common-grams or shingling, described in Chapter 10, Scaling Solr. If you are using pf2 or pf3, consider a maxShingleSize of 3 (but monitor its impact on index size), and consider omitting larger text fields from pf2 or pf3.

Boosting – boost queries

Continuing with the boosting theme is another way to affect the score of documents: boost queries. The dismax parser lets you specify multiple additional queries using bq parameter(s), which, like the automatic phrase boost, get added onto the user's query in a similar manner. Remember that boosting only serves to affect the scoring of documents that already matched the user's query in the q parameter. If a matched document also matches a bq query, then it will be scored higher than if it didn't.

For a realistic example of using a boost query, we're going to look at MusicBrainz releases data. Releases have an r_type field containing values such as Album, Single, Compilation, and others, and an r_official field containing values such as Official, Promotion, Bootleg, and Pseudo-Release. We don't want to sort search results based on these, since it's most important to consider search relevancy of the query. However, we might want to influence the score based on these fields. For example, let's say albums are the most relevant release type, whereas a compilation is the least relevant. And let's say that an official release is more relevant than bootleg or promotional or pseudo-releases. We might express this using a boost query like this (defined in the request handler):

bq=r_type:Album^2 (*:* -r_type:Compilation)^2 r_official:Official^2

Searching releases for "the aeroplane flies high" (quoted and not a typo) showed that this boost query did what it should by breaking a score tie in which the release names were the same but these attributes varied. In reality, the boosting on each term would not all be 2; they would be tweaked to have the relevancy boost desired by carefully examining the debugQuery output. One oddity in this query is (*:* -r_type:Compilation)^2, which boosts all documents except compilations. Using r_type:Compilation^0.5 would not work since it would still be added to the score and only when the document is a compilation—exactly what we don't want. Put another way, you can't under-boost, but you can indirectly do it by boosting the inverse set of documents. To understand why *:* is needed, read the previous chapter on the limitations of pure negative queries.
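
The reasoning about under-boosting can be checked with a deliberately simplified model in Python, in which every matching boost clause adds a fixed bonus to the document's score. That additive model, the bonus values, and the function name are assumptions for illustration; Lucene's real contribution from a boost clause depends on more than the boost number.

```python
def boosted_score(base_score, doc, boost_clauses):
    """Add a clause's bonus whenever the document matches that clause."""
    bonus = sum(weight for matches, weight in boost_clauses if matches(doc))
    return base_score + bonus


compilation = {"r_type": "Compilation"}
album = {"r_type": "Album"}

# Attempted "under-boost": the clause still adds to compilations only.
naive = [(lambda d: d["r_type"] == "Compilation", 0.5)]
# Boosting the inverse set: everything except compilations gets the bonus.
inverse = [(lambda d: d["r_type"] != "Compilation", 2.0)]

naive_compilation = boosted_score(1.0, compilation, naive)
naive_album = boosted_score(1.0, album, naive)
inverse_compilation = boosted_score(1.0, compilation, inverse)
inverse_album = boosted_score(1.0, album, inverse)
```

In this model the naive clause ranks the compilation above the album, while the inverse clause produces the intended order.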

Boosting – boost functions

Boost functions offer a powerful way to either add the result of a user-specified formula to a document's score or multiply the score by it. By formula, I refer to a composition of Solr function queries, which are described in detail later in this chapter. To add to the score, specify the function query with the bf parameter. The edismax query parser adds support for multiplying the score by the result; for that, specify the function query with the boost parameter. You can specify bf and boost each as many times as you wish.

Note

For a thorough explanation of function queries including instructional MusicBrainz examples, see the next section.

An example of boosting MusicBrainz tracks by how recently they were released is:

boost=recip(abs(ms(NOW/DAY,r_event_date_earliest)),1,6.3E10,6.3E10)

There cannot be any spaces within the function. The bf and boost parameters are actually not parsed in the same way. The bf parameter allows multiple boost functions within the same parameter, separated by space, as an alternative to using additional bf parameters. You can also apply a multiplied boost factor to the function in bf by appending ^100 (or another number) to the end of the function query. This is equivalent to using the mul() function query, described later.
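
To get a feel for the values the recency boost shown earlier produces, this Python sketch evaluates the same formula outside Solr. It relies on Solr's recip(x,m,a,b) being defined as a/(m*x+b); the helper names are hypothetical, and the age is passed in directly as days, so the rounding done by NOW/DAY is not modeled.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_in_days):
    """Evaluate recip(abs(ms(...)),1,6.3E10,6.3E10) for a given age."""
    age_ms = abs(age_in_days * MS_PER_DAY)
    return recip(age_ms, 1, 6.3e10, 6.3e10)


released_today = recency_boost(0)
# 6.3E10 milliseconds is roughly two years (about 729 days).
half_point_days = 6.3e10 / MS_PER_DAY
released_two_years_ago = recency_boost(half_point_days)
released_ten_years_ago = recency_boost(3650)
```

A release dated today gets a multiplier of 1, one about two years old gets 0.5, and older releases continue to decay toward 0 without ever reaching it.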

Tip

Ensure newSearcher in solrconfig.xml has a sample query using the boost functions you're using. In doing so, you ensure that any referenced fields are loaded into Lucene's field cache instead of penalizing the first query with this cost. Chapter 10, Scaling Solr, has more information on performance tuning.

Add or multiply boosts

In a nutshell, if you can tame the difficulty of additive boosting (the bf param), then you'll probably be more satisfied with the scoring. Multiplicative boosting (the boost param) is easier to use, especially if the boost is meant to carry no more weight than the user query, which is usually true.

If you describe how you'd like the scoring to work as, "I'd like two-thirds of the document score to come from the user query and the remaining one-third to come from my formula" (or whatever ratios), then additive scores are for you. The trick is that you need to know the top score for an excellent match on the user query in order to get the proportions right. Try an exact match on a title (a highly boosted field in the query) and see what the top score is. Do this a number of times for a variety of documents, looking for reasonable consistency. So if, for example, the top end of the user query ends up being 1.5, and you want the function query to contribute about half as much as the user query does to the final score, then adjust the function query so its upper bound is 0.75. Simply multiply by that if you already have the function query in the 0-1 nominal range.

Even if these instructions don't seem too bad, in practice tuning additive scores is tricky, since Lucene reacts to every change you make by changing the queryNorm part of the score out from under you, and you have no control over that. As it does this, keep your eye on the overall ratios that you want between the added boost part and the user query part, not on the final score values. A bigger problem is that the maximum user query score you measured will change as your data changes, which means ongoing monitoring of whatever values you choose. A further complication is that DisMax's tie parameter tends to interfere with this way of boosting.
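
The arithmetic behind the 1.5 and 0.75 example is small enough to write out. In this Python sketch the function names are hypothetical, and the top user query score is a number you would have to measure yourself from debugQuery output.

```python
def additive_upper_bound(top_user_score, ratio_to_user_query):
    """Upper bound for the function query, given the desired ratio."""
    return top_user_score * ratio_to_user_query


def share_of_final_score(top_user_score, function_upper_bound):
    """Fraction of the best possible final score that the function supplies."""
    return function_upper_bound / (top_user_score + function_upper_bound)


# Measured top score of 1.5; the function should count half as much.
upper_bound = additive_upper_bound(1.5, 0.5)
function_share = share_of_final_score(1.5, upper_bound)
```

An upper bound of 0.75 against a user query score of 1.5 means the function supplies one-third of the best possible final score, which is the two-thirds to one-third split mentioned at the start.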

The other way of thinking about your boost function is as a user query score multiplier (a factor). With multiplication you don't need to concern yourself with whatever a "good" user query score is—it has no bearing here. The tricky part of multiplicative boosts is weighting your boost, so it has the relative impact you want. If you simply supply your nominal range (0-1) function directly as the boost, then it has the same weight as the user query. As you shift the function's values above 0, you reduce the influence it has relative to the user query. For example, if you add 1 to your nominal 0-1 range so that it goes from 1-2, then it is weighted roughly half as much [formula: (2-1)/2 = 0.5].
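
The weighting rule of thumb in the formula above, (max - min) / max, can be tabulated for a few shifts. This Python sketch only restates that rule; the function name is hypothetical, and the result is a rough guide, not something Solr computes.

```python
def relative_weight(range_min, range_max):
    """Approximate weight of a multiplicative boost versus the user query."""
    return (range_max - range_min) / range_max


unshifted = relative_weight(0, 1)  # nominal 0-1 function used directly
plus_one = relative_weight(1, 2)   # nominal range shifted up by 1
plus_three = relative_weight(3, 4) # shifted up by 3
```

Shifting the range upward keeps reducing the boost's influence: 1.0 unshifted, 0.5 after adding 1, and 0.25 after adding 3.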

It's possible to use multiplicative boosts that are weighted as more relevant than the user query, but I haven't fully worked out the details. A place to start experimenting is raising the boost function to a power, say 1.7, which appeared to roughly double the weight.
