Spatial search is the ability to find geometric information in a multidimensional space. Most information retrieval systems that support spatial data, including Solr, are limited to a two-dimensional Cartesian plane, with additional support for geospatial search in which two dimensions reference the location on the surface of a sphere.
That description is a bit abstract, so let's now review some common spatial requirements of an application. If your Solr documents represent businesses and you know where the business resides in terms of a latitude and a longitude, then you probably want to show search results (businesses) filtered to the vicinity of where the user is looking. The user interface might have a map centered at a region of interest, and/or you know approximately where the user is from the GPS of their mobile device, or you might even have a GeoIP database at your disposal to map their IP address to an approximate location. Beyond filtering, you then might want to sort or relevancy-boost by the distance between the center of where the user's search area is and where the business is. Indexing points, filtering them by a rectangle or circle, getting the distance to sort or boost, and displaying that distance to the user, are the most common user requirements.
Spatial in Solr is confusing because there are basically two implementations to pick from: LatLonType (since Solr 3) and SpatialRecursivePrefixTreeFieldType also known as RPT (since Solr 4), and they are quite different. I'll describe the internal workings of both, and then how to use them, pointing out differences along the way.
Solr 3 was the first release to have spatial support, and its implementation is still appropriate for common requirements. It is principally comprised of the LatLonType
and PointType
field types, the geofilt
and bbox
query parsers, and the geodist
, dist
, and sqedist
functions. There are some other lesser used functions too.
LatLonType
is geospatially oriented: latitude and longitude with geodetic math—notably the Haversine distance formula. PointType
holds x and y coordinates on a classic 2D Cartesian plane with faster and simpler Euclidean geometry calculations, such as the Pythagorean Theorem for distance. PointType
uniquely supports a variable number of dimensions; it's an obscure feature.
The underlying implementation resides in Solr itself, not Lucene-spatial. The overall approach is straightforward, notwithstanding some optimization tricks. The latitude and longitude (or x and y) are internally indexed into separate numeric fields. This approach doesn't support multivalued data, such as modeling businesses with multiple locations. If the query is just a rectangular filter, then it's pretty fast since it just needs to do a couple simple numeric range queries. If, on the other hand, the query shape is a point-distance (circle) shape, then the distance is calculated to potentially all points, depending on the circumstances; consequently, it isn't very scalable. Another thing to be aware of is that the FieldCache is used whenever the distance is required, which is for a point-distance query shape, and/or to sort by distance. The FieldCache holds all coordinate values in memory.
Some of the key characteristics have been summarized here:
bbox
is scalable but geofilt
(circle) is notConfiguring LatLonType
and PointType
is easy. In the following excerpt taken from Solr's example schema, given the field name store, there will be two automatically generated fields named store_0_coordinate
and store_1_coordinate
, which you'll see in Solr's schema browser:
<field name="store" type="location" indexed="true" stored="true"/> <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/> <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
We'll show you how to index and search for LatLonType
further in this chapter.
Solr 4 introduced SpatialRecursivePrefixTreeFieldType
, referred to as RPT because it's a mouthful. This one field type can be configured for either geodetic or Euclidean math. The Solr code is not much more than an adapter to the technology that mostly lives in the Lucene spatial module, plus a dependency on Spatial4j for the shape implementations. Additionally, the third-party JTS Topology Suite is required for some shapes, such as polygons. The implementation scheme is based on a variable-depth hierarchical grid in which the world is decomposed into grid cells, which are in turn recursively decomposed into smaller grid cells, until the desired precision is reached.
Indexed shapes of basically any kind are represented completely on the grid, and there are scalable algorithms to find them in relation to a query shape at search time. It's more powerful and often faster at filtering than the comparatively simple LatLonType. For distance sorting/boosting, it includes a custom point cache, but this feature should be avoided as it doesn't scale well.
This picture shows how a polygon of France is decomposed into geohash grid cells of varying sizes. This was easily generated using a utility in the web demo of Spatial Solr Sandbox on GitHub that generates a KML file that Google Earth can render.
Its key characteristics can be summarized as follows:
Intersects
, Contains
, and IsWithin
predicates
If you need the distance and you have more than perhaps a million documents, or if you have real-time search requirements, then you should use both spatial implementations: RPT for its fast filtering and LatLonType
to get the distance. On the other hand, if RPT's features aren't useful to you and you have a million or less documents, then I recommend using LatLonType
alone. In a future Solr release, see LUCENE-4698.
The RPT field type has many configuration options; we're going to stick to the ones needed for basic geospatial search requirements right now and address other options later. A basic configuration exists in Solr's example schema:
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees" />
To use this field type, you need to declare a field that references this field type.
geo
: This is a Boolean that specifies whether this spatial mathematical model is based on a spherical earth model using latitude and longitude coordinates, or a flat 2D plane.units
: Ignore this; it must be set to degrees
. In Solr 5, it will be renamed to distanceUnits
and support values like kilometers
. In Solr 4, it has no effect. When you see a reference to a measurement in degrees (such as maxDistErr
below), know that it refers to 1/360th of the circumference of a sphere when geo=true
. Assuming the earth is that sphere, this works out to 111.2 kilometers per degree (2πR/360, R=6,353km).maxDistErr
: This refers to the precision of the spatial data in terms of the maximum distance error, as measured in degrees. For example, if the greatest precision you care about is a meter (0.001km), set it as measured in degrees (0.001/111.2 = 0.000009). The actual maximum distance error will be smaller than this, since it's used to choose a grid level that meets or exceeds this precision.distErrPct
: If this number is non-zero, then non-point shapes will be approximated as a function of their overall size. This value is a fraction of a shape's approximate radius to be the allowable error, which in turn indicates to what grid level the shape will be maximally represented as. Under no circumstance can you have more precision than what maxDistErr
yields. It applies to both indexed non-point shapes and query non-point shapes. There is a way to use a different value for a query shape if desired.If that's confusing, look again at the gridded map of France and look at the edge. If distErrPct
is raised higher, the edge will eventually be even blockier. There are obviously scalability limitations if you attempt to index non-point shapes with a distErrPct
of 0
. Query shapes can handle it fine.
The simplest and most common spatial data to index is a point.
If you have named locations (for example, Boston, MA) or addresses, then the data needs to be resolved to latitudes and longitudes using a geocoder. You can run your own with Gisgraphy—found at http://www.gisgraphy.com—or use a hosted service from Google, Yahoo, or others. Most hosted services have caps and/or fees.
When providing data to this field, it is formatted as a string with the latitude then the longitude, separated by a comma. Here's an example in Solr's Update-XML format:
<field name="store">43.17614,-90.57341</field>
Whether you use SolrJ, the DIH, or any other client/format, it appears as a string going in and coming out of Solr. If you have multiple points to index, simply supply them as additional values, as you would for any other multivalued field. LatLonType
can't handle this but RPT will.
If the field type is PointType
, then the dimension order is x,y
. If the field type is RPT and geo is false
, then you should supply points as x y
(a space in between). If you use a comma in this circumstance, then the dimensions will be flipped. This is probably a bug, so don't rely on that behavior.
Perhaps the most common geospatial need is to filter search results to those documents within a distance radius from a center point. If you are building an application in which the user is presented with a map, perhaps using Google Maps, then the center point is the center of the map the user is looking at and the distance radius is the distance from the center to the nearest map edge. Such a query is generally specified using a Solr filter query (fq
parameter) leaving q
open for the possibility of a combined keyword search if desired. Both the geofilt
and bbox
query parsers perform geospatial filtering. The geofilt
query parser implements a point-distance based filter, a circle shape, whereas bbox
uses the minimum bounding latitude-longitude box surrounding that circle. You can also specify an arbitrary rectangle using Solr's standard range syntax but using points.
LatLonType tuning
If you are using LatLonType
, bbox
and the rectangle range query syntax are faster than geofilt
because they are able to make simple/scalable latitude and longitude numeric range searches, whereas geofilt
computes the distance to every point it sees. If you need geofilt
and if your spatial queries aren't very cacheable (tend to not be in your filter cache), then try adding these local-params: cache=false cost=100
.
Here is a quick example based on Solr's example schema and dataset, showing the URL parameters needed to do the search:
q=*:*&fq={!bbox}&sfield=store&pt=45.15,-93.85&d=5
The parameters that geofilt
and bbox
require can be specified as either local-params (between the parser name and closing bracket) or standard request parameters, as shown above. To be clear, here's the same query using local-params:
q=*:*&fq={!bbox sfield=store pt=45.15,-93.85 d=5}
The advantage of not using local-params is that a combined distance sort can reuse the same parameters, as you'll see in a bit. Here are geofilt
and bbox
's parameters:
sfield
: The name of the spatial fieldpt
: A latitude-comma-longitude pair for the center point of the queryd
: The query distance from pt
in kilometers (see sphere_radius
)sphere_radius
: The radius of the sphere (Earth) in desired units for d
. It defaults to the Earth's mean radius in kilometers.To query by an arbitrary latitude-longitude rectangle, use a range query between the lower-left corner (smallest latitude and longitude) to the upper-right like so:
q=*:*&fq=store:[43.2,-94.1 TO 46.3,-92.0]
Solr can sort search results by the distance between a document's point and another point supplied at query time. That point is typically the center point of a combined spatial filter. The typical way to spatially sort is to use Solr's ability to sort by a function query—a spatial function, geodist()
in particular. The geodist()
function calculates the geospatial distance (the "great circle distance") as calculated using the Haversine formula between a pair of points. The points are each taken from the first available of an argument, the pt
parameter, or the sfield
parameter. Any of these might be absent, but at least two must be specified. When a point is specified as an argument (within the parenthesis), it can either be a geospatial field name or a pair of arguments that are in latitude then longitude order. The latitude or longitude can be constants, or they may reference numeric fields' names. Here's an example of this:
&sort=geodist(store,42.4,-71.1) asc
Before the RPT field supported geodist()
in Solr 4.5, a different, more awkward syntax was needed. You used to have to add a score=distance
local-param to the spatial query and then put the spatial query into q
to sort by score and filter at the same time. Or, even more awkwardly, you could sort by the query()
function query. You can find examples of that syntax, if you must, online.
It's often desirable to show the distance to users in search results. The most straightforward way to do this is to use a new Solr 4 feature that allows putting a function query into the field list: the fl
param. Doing this is independent of any filtering or sorting that may exist, and so it's easy to use.
&fl=*,score,dist:geodist()&sfield=store&pt=45.15,-93.8
That example also gave a label of dist
to this number in the search results; another label could have been chosen. Without a label, it is named by the actual function query itself in the results.
Perhaps you don't want to sort by distance, but you want to influence the relevancy (so-called boosting) by distance. Relevancy tuning is covered in the next chapter, so you will need to read that to better understand this section. The base formula to use for distance-based boosting should be the reciprocal. A Solr function query for the reciprocal used for boosting in this way should look like recip(x,1,c,c)
, where x is the distance and c is 1/10th of what I call the horizon distance. If you have a spatial filter in place, then use the radius of the query shape (center to edge or corner) as the horizon. Otherwise, pick an approximate distance at which any greater distance is unlikely to be relevant to the user. The result of the reciprocal function used in this way will range from 1.0 at the center of the query point to 0.1 at the horizon distance; and it approaches 0 further away.
For example, let's say you have a spatial filter of a 100-kilometer radius at 42.15° N, 93.85° W; you have a keyword search in place using the edismax
query parser; and you want to multiply the distance boost to the score:
&defType=edismax&q=…&qf=…&sort=score desc &fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=100 &boost=recip(geodist(),1,10,10)
Sorting or boosting by distance will require all indexed points to be in memory. Use floats instead of doubles, if possible, to reduce this footprint with LatLonType
. The RPT field type still has a sub-par point cache implementation that has a high memory overhead on a per-point and per-document basis. If you can, use LatLonType
for sorting instead of RPT, until that is rectified.
Sorting and boosting on a search that matches a great many documents can be slow. This is because the distance needs to be computed to each matching point, even if you only return the top 10 of them. Furthermore, the Haversine formula involves a fair amount of trigonometry that can computationally really add up when computed between many points. What can be done to help is use a cheaper distance function that is less accurate in a geospatial sense but is faster to compute. For example, if all the indexed data is just in one region of the world, then you could project the data onto a 2D plane with minimal distortion around the edges. The data would go in a pair of float fields and then you could replace use of geodist()
with sqedist()
—squared Euclidean distance. It takes four arguments, the x and y of the indexed fields, and another pair for the query point. Projecting data is well beyond the scope of this book. For further information, visit http://trac.osgeo.org/proj4j/.
This book sets aside a fair amount of space to cover spatial search in Solr for the needs of most applications, but there's more that couldn't be included:
PointType
and dist()
, and the implications of geo="false"
on RPT.Intersects
, Contains
, IsWithin
, and IsDisjointTo
. When indexing non-point shapes, there are more applicable relations than what can occur with just points.BBoxField
field type supports nearly every predicate and has area-overlap relevancy. It is new in Solr 4.10.For more coverage of these spatial topics, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Spatial+Search.