The stats component calculates some mathematical statistics of fields in the index. The main requirement is that the field be indexed. The following statistics are computed over the non-null values (except missing, which counts the nulls):
min
: The smallest value
max
: The largest value
sum
: The sum
count
: The quantity of non-null values accumulated in these statistics
missing
: The quantity of records skipped due to missing values
sumOfSquares
: The sum of the square of each value; this is probably the least useful and is used internally to compute stddev efficiently
mean
: The average value
stddev
: The standard deviation of the values
distinctValues
: A list of all distinct (non-duplicating) values
countDistinct
: The size of distinctValues

If you calculate stats on a string or date field, then only min, max, count, missing, distinctValues, and countDistinct are calculated. The distinctValues and countDistinct are only present if stats.calcdistinct is enabled.
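To make the role of sumOfSquares concrete, here is a minimal Python sketch of how mean and stddev can be derived from just three running totals (count, sum, and sumOfSquares). It assumes the sample (n - 1) form of the standard deviation; the function name `summarize` and the input values are made up for illustration and are not part of Solr.

```python
import math
import statistics


def summarize(values):
    """Derive mean and stddev from count, sum, and sumOfSquares only."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)

    mean = total / count if count else 0.0
    if count > 1:
        # Assumed formula: sample variance expressed through the running totals,
        # so no second pass over the values is needed.
        variance = (count * sum_of_squares - total * total) / (count * (count - 1))
        stddev = math.sqrt(max(variance, 0.0))
    else:
        stddev = 0.0

    return {
        "count": count,
        "sum": total,
        "sumOfSquares": sum_of_squares,
        "mean": mean,
        "stddev": stddev,
    }


# Hypothetical field values, used only to exercise the function.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = summarize(values)
print(stats)
# Cross-check against the two-pass implementation in the standard library.
print(statistics.stdev(values))
```

Because the three totals can simply be added together, partial results from separate subsets of documents can be merged before the final mean and stddev are derived.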
This component is simple to configure. It accepts the following parameters:
stats
: Set this to true in order to enable the component. It defaults to false.
stats.field
: Set this to the name of the indexed field to calculate statistics on. It is required. This field must be indexed or, preferably, have DocValues. This parameter can be added multiple times in order to calculate statistics on more than one field. Like facet.field, it can be preceded with a filter query exclusion in local-params syntax; for example, &stats.field={!ex=t_duration}t_duration&fq={!tag=t_duration}t_duration:1000.
stats.calcdistinct
: A Boolean option to include a list of all distinct (non-duplicating) values for this field. Be judicious about using this! Using it on some fields could easily trigger an OutOfMemoryError. Solr 5.2 has a scalable option to provide an estimated count.
stats.facet
: Optionally, set this to the name of the field that you want to facet the statistics over. Instead of the results having just one set of stats (assuming one stats.field), there will be a set for each value in this field, and those statistics will be based on that corresponding subset of data. This is analogous to the GROUP BY syntax in SQL. This parameter can be specified multiple times to compute the statistics over multiple fields' values. In addition, you can use the field-specific parameter name syntax for cases when you are computing stats on different fields and you want to use a different facet field for each statistic field. For example, you can specify f.t_duration.stats.facet=tracktype, assuming a hypothetical field tracktype to categorize the t_duration statistics on. The field should be indexed or have DocValues, and it should not be tokenized.

Let's look at some statistics for the duration of tracks in MusicBrainz at http://localhost:8983/solr/mbtracks/mb_tracks?rows=0&indent=on&stats=true&stats.field=t_duration.
And here are the results:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5202</int>
  </lst>
  <result name="response" numFound="6977765" start="0"/>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="t_duration">
        <double name="min">0.0</double>
        <double name="max">36059.0</double>
        <double name="sum">1.543289275E9</double>
        <long name="count">6977765</long>
        <long name="missing">0</long>
        <double name="sumOfSquares">5.21546498201E11</double>
        <double name="mean">221.1724348699046</double>
        <double name="stddev">160.70724790290328</double>
      </lst>
    </lst>
  </lst>
</response>
This query shows that on average, a song is 221 seconds (or 3 minutes 41 seconds) in length. An example using stats.facet would produce a much longer result, which won't be given here in order to leave space for other components. However, there is an example at http://wiki.apache.org/solr/StatsComponent.
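For readers who want to script this kind of request, the following Python sketch assembles the query string used above and extracts the numbers from a stats response. It makes no network call: the helper names `build_stats_url` and `parse_stats` are illustrative assumptions, the embedded XML is an abbreviated copy of the response shown above, and tracktype remains the hypothetical facet field mentioned earlier.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Request handler URL taken from the example above.
BASE_URL = "http://localhost:8983/solr/mbtracks/mb_tracks"


def build_stats_url(fields, facets=None):
    """Build a stats request; `facets` maps a stats field to its facet field."""
    params = [("rows", "0"), ("indent", "on"), ("stats", "true")]
    # stats.field may be repeated, once per field.
    params += [("stats.field", name) for name in fields]
    # Field-specific syntax: f.<field>.stats.facet=<facet field>.
    for field, facet in (facets or {}).items():
        params.append((f"f.{field}.stats.facet", facet))
    return BASE_URL + "?" + urlencode(params)


def parse_stats(xml_text):
    """Return {field: {statistic: value}} for the numeric entries in stats_fields."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='stats']/lst[@name='stats_fields']/lst"
    result = {}
    for field in root.findall(path):
        result[field.get("name")] = {
            child.get("name"): float(child.text)
            for child in field
            if child.tag in ("double", "long", "int", "float")
        }
    return result


# Abbreviated copy of the response shown above (header and result elements omitted).
SAMPLE = """<response>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="t_duration">
        <double name="min">0.0</double>
        <double name="max">36059.0</double>
        <double name="sum">1.543289275E9</double>
        <long name="count">6977765</long>
        <long name="missing">0</long>
        <double name="mean">221.1724348699046</double>
      </lst>
    </lst>
  </lst>
</response>"""

url = build_stats_url(["t_duration"])
faceted_url = build_stats_url(["t_duration"], {"t_duration": "tracktype"})
duration = parse_stats(SAMPLE)["t_duration"]

print(url)
print(faceted_url)
print(duration["sum"] / duration["count"], duration["mean"])
```

Dividing sum by count reproduces the reported mean, which is a quick sanity check when post-processing stats output.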