Have you ever searched for something and found a link that wasn't quite what you were looking for but was reasonably close? If you were using an Internet search engine such as Google, then you may have tried the "more like this…" link next to a search result. Some sites use other language like "find similar..." or "related documents…" As these links suggest, they show you pages similar to another page. Solr supports more like this (MLT) too.
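The MLT capability in Solr can be used in the following three ways:
As a search component, which adorns an existing search with MLT results for each document returned.
As a dedicated request handler, which returns documents similar to an input document.
As the mlt query parser, which can more easily be combined with other queries or relevancy boosting than the other options. See the Solr Reference Guide for further information.
The essence of the internal workings of MLT is as follows: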
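First, the terms of the input document are gathered. For a document in the index, MLT loops over the fields listed in mlt.fl, and the term information needed is readily there for the taking if the field has termVectors enabled. Otherwise, it gets the stored text and reanalyzes it to derive the terms (slower). Plain text sent to the request handler is analyzed using the analysis of mlt.fl.
Next, the terms are filtered by the configured thresholds; those that remain are the interesting terms.
Finally, MLT constructs an OR query with these interesting terms across all of the fields listed in mlt.fl.
In the following configuration options, the input document is either each search result returned if MLT is used as a component, or it is the first document returned from a query to the MLT request handler, or it is the plain text sent to the request handler. It simply depends on how you use it.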
Using the MLT search component adorns an existing search with MLT results for each document returned.
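As a rough illustration of the component approach, the following sketch enables MLT in the defaults of a search handler. The handler name /select_mlt and the value chosen for mlt.count are assumptions made for this example, and it presumes the stock search component chain, which normally includes the MLT component. The mlt parameter switches the component on, and mlt.count sets how many similar documents accompany each result.

```xml
<!-- Hypothetical handler name; adapt to your own solrconfig.xml -->
<requestHandler name="/select_mlt" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn the MLT component on for every search sent to this handler -->
    <str name="mlt">true</str>
    <!-- Field(s) from which the interesting terms are drawn -->
    <str name="mlt.fl">t_name</str>
    <!-- Number of similar documents to return per search result (assumed value) -->
    <int name="mlt.count">3</int>
  </lst>
</requestHandler>
```

With something like this in place, each document in the main result list would be accompanied by up to three similar documents.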
Using the MLT request handler is more like a regular search, except that the results are documents similar to the input document. Additionally, filters (the fq parameter) are applied.
The parameters specific to the MLT request handler are as follows:
q, start, and rows: The MLT request handler uses the same standard parameters for the query, start offset, and row count as used for querying. But in this case, start and rows are for paging into the MLT results instead of the results of the query. The query is typically one that simply references one document, such as id:12345 (if your unique field looks like this). start defaults to 0 and rows to 10.
mlt.match.offset: This parameter is the offset into the results of q to pick which document is the input document. It defaults to 0 so that the first result from q is chosen. As q will typically search for one document, this is rarely modified.
mlt.match.include: The input document is normally included in the response if it is in the index (see the match element in the output of the example) because this parameter defaults to true. Set this parameter to false to exclude it, if that information isn't needed.
mlt.interestingTerms: If this is set to list or details, then the so-called interesting terms that MLT uses for the similarity query are returned with the results in an interestingTerms element. If you enable mlt.boost, then specifying details will additionally return the query boost value used for each term. The default, none, disables this. Aside from diagnostic purposes, it might be useful to display these in the user interface, either listed out or in a tag cloud.
facet, ...: The MLT request handler supports faceting the MLT results. See the previous chapter on how to use faceting.
These parameters are common to both the search component and request handler. Some of the thresholds here are to tune which terms are interesting to MLT. In general, expanding thresholds (that is, lowering minimums and increasing maximums) will yield more useful MLT results at the expense of performance. The parameters are explained as follows:
mlt.fl: This provides a comma- or space-separated list of fields to consider in MLT. The interesting terms are searched within these fields only. These field(s) must be indexed. Furthermore, assuming the input document is in the index instead of supplied externally (as is typical), then each field should ideally have termVectors set to true in the schema (best for query performance, although index size is larger). If that isn't done, then the field must be stored so that MLT can re-analyze the text at runtime to derive the term vector information. It isn't necessary to use the same strategy for each field.
mlt.qf: Different field boosts can optionally be specified with this parameter. This uses the same syntax as the qf parameter that is used by the DisMax query parser (for example: field1^2.0 field2^0.5). The fields referenced should also be listed in mlt.fl. If there is a title or similar identifying field, then this field should probably be boosted higher.
mlt.mintf: This parameter specifies the minimum number of times (frequency) a term must be used within a document (across those fields in mlt.fl anyway) for it to be an interesting term. The default is 2. For small documents, such as in the case of our MusicBrainz dataset, try lowering this to 1.
mlt.mindf: This specifies the minimum number of documents that a term must be used in for it to be an interesting term. It defaults to 5, which is fairly reasonable. For very small indexes, as little as 2 is plausible, and maybe larger for large multi-million document indexes with common words.
mlt.maxdf: This specifies the maximum number of documents that a term may be used in for it to be an interesting term. There is no limit, by default.
mlt.minwl: This is used to specify the minimum number of characters in an interesting term. It defaults to 0, effectively disabling the threshold. Consider raising this to 2 or 3.
mlt.maxwl: This parameter specifies the maximum number of characters in an interesting term. It defaults to 0, which disables the threshold. Some really long terms might be flukes in input data and are out of your control, but most likely this threshold can be skipped.
mlt.maxqt: This specifies the maximum number of interesting terms that will be used in an MLT query. It is limited to 25 by default, which is plenty.
mlt.maxntp: Fields without termVectors enabled take longer for MLT to analyze. This parameter sets a threshold to limit the number of terms to consider in an input field to further limit the performance impact. It defaults to 5000.
mlt.boost: This Boolean toggles whether or not to boost each interesting term used in the MLT query differently, depending on how interesting the MLT module deems it to be. It defaults to false, but try setting it to true and evaluating the results.
Usage advice
For ideal query performance, ensure that termVectors is enabled for the field(s) referenced in mlt.fl, particularly in the larger fields. In order to further increase performance, use fewer fields, perhaps just one that is dedicated for use with MLT. Using the copyField directive in the schema makes this easy. The disadvantage is that the source fields cannot be boosted differently with mlt.qf. However, you might have two fields for MLT as a compromise. Use a typical full complement of text analysis including lowercasing, synonyms, a stop list (such as StopFilterFactory), and aggressive stemming in order to normalize the terms as much as possible. The field needn't be stored if its data is copied from some other field that is stored. During an experimentation period, look for interesting terms that are not so interesting, and add them to the stop word list. Lastly, some of the configuration thresholds that scope the interesting terms can be adjusted based on experimentation.
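To make this advice concrete, here is a minimal schema sketch of a single field dedicated to MLT. The field name mlt_text and the type name text_mlt are hypothetical, and the type is assumed to be defined elsewhere in the schema with the kind of analysis described above. The source fields t_name and t_a_name are borrowed from this chapter's example output.

```xml
<!-- Hypothetical MLT-only field: indexed with term vectors, not stored -->
<field name="mlt_text" type="text_mlt" indexed="true" stored="false"
       termVectors="true" multiValued="true"/>

<!-- Copy the source fields into the dedicated field at index time -->
<copyField source="t_name" dest="mlt_text"/>
<copyField source="t_a_name" dest="mlt_text"/>
```

A request would then set mlt.fl to mlt_text. The field is multiValued because more than one source field is copied into it.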
Firstly, an important disclaimer on this example is in order.
The MusicBrainz dataset is not conducive to applying the MLT feature, because it doesn't have any descriptive text. If there were an artist description and/or widespread use of user-supplied tags, then there might be sufficient information to make MLT useful. However, to provide an example of the input and output of MLT, we will use MLT with MusicBrainz anyway.
We'll be using the request handler method, which is the recommended approach. The MLT request handler needs to be configured in solrconfig.xml. The important bit is the reference to the class; the rest of it is our prerogative.
<requestHandler name="/mlt_tracks" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">t_name</str>
    <str name="mlt.mintf">1</str>
    <str name="mlt.mindf">2</str>
    <str name="mlt.boost">true</str>
  </lst>
</requestHandler>
This configuration shows that we're basing the MLT on just track names. Let's now try a query for tracks similar to the song "The End is the Beginning is the End" by The Smashing Pumpkins. The query was performed with echoParams to clearly show the options used:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="mlt.mintf">1</str>
      <str name="mlt.mindf">2</str>
      <str name="mlt.boost">true</str>
      <str name="mlt.fl">t_name</str>
      <str name="rows">5</str>
      <str name="mlt.interestingTerms">details</str>
      <str name="indent">on</str>
      <str name="echoParams">all</str>
      <str name="fl">t_a_name,t_name,score</str>
      <str name="q">id:"Track:1810669"</str>
    </lst>
  </lst>
  <result name="match" numFound="1" start="0" maxScore="16.06509">
    <doc>
      <float name="score">16.06509</float>
      <str name="t_a_name">The Smashing Pumpkins</str>
      <str name="t_name">The End Is the Beginning Is the End</str>
    </doc>
  </result>
  <result name="response" numFound="855211" start="0" maxScore="6.3063927">
    <doc>
      <str name="t_name">End Is the Beginning</str>
      <str name="t_a_name">In Grey</str>
      <float name="score">6.3063927</float>
    </doc>
    <doc>
      <str name="t_name">Is the End the Beginning</str>
      <str name="t_a_name">Mangala Vallis</str>
      <float name="score">5.6426353</float>
    </doc>
    <doc>
      <str name="t_name">The End Is the Beginning</str>
      <str name="t_a_name">Royal Anguish</str>
      <float name="score">5.6426353</float>
    </doc>
    <doc>
      <str name="t_name">The End Is the Beginning</str>
      <str name="t_a_name">Ape Face</str>
      <float name="score">5.6426353</float>
    </doc>
    <doc>
      <str name="t_name">The End Is the Beginning Is the End</str>
      <str name="t_a_name">The Smashing Pumpkins</str>
      <float name="score">5.0179915</float>
    </doc>
  </result>
  <lst name="interestingTerms">
    <float name="t_name:end">1.0</float>
    <float name="t_name:is">0.7513826</float>
    <float name="t_name:the">0.6768603</float>
    <float name="t_name:beginning">0.62302685</float>
  </lst>
</response>
The <result name="match"> element is there due to mlt.match.include defaulting to true. The <result name="response" …> element has the main MLT search results. The fact that so many documents were found is not material to any MLT response; all it takes is one interesting term in common. The interesting terms were deliberately requested so that we can gain insight into the basis of the similarity. The fact that is and the were included shows that we don't have a stop list for this field, an obvious thing to fix to improve the results. Nearly any stop list is going to have such words.
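One possible fix is sketched below: a field type whose analysis chain lowercases terms and removes stop words. The type name text_mlt and the stopwords.txt file are assumptions for illustration. The t_name field would need to use a type like this, and the data would need to be re-indexed, before is and the drop out of the interesting terms.

```xml
<!-- Hypothetical field type; the name and the stop word file are assumptions -->
<fieldType name="text_mlt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalize case so that "The" and "the" become the same term -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop common words such as "is" and "the" listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

Stemming and synonym filters could be appended to the same chain, depending on how aggressively the terms should be normalized.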
For further diagnostic information on the score computation, set debugQuery to true. This is a highly advanced method, but it exposes information invaluable for understanding the scores. Doing so in our example shows that the top hit ranks first not because it alone contains all of the interesting terms, as the others in the top 5 do too, but because it is the shortest in length (a high fieldNorm).