While most of this book assumes that the content you want to index in Solr is in a neatly structured data format of some kind, such as in a database table, a selection of XML files, or CSV, the reality is that we also store information in the much messier world of binary formats such as PDF, Microsoft Office, or even images and music files.
One of the coauthors of this book, Eric Pugh, first became involved with the Solr community when he needed to ingest the thousands of PDF and Microsoft Word documents that a client had produced over the years. The outgrowth of that early effort is Solr Cell, which provides a powerful and simple framework for indexing rich document formats.
Solr Cell is technically called the ExtractingRequestHandler. The current name came about as a derivation of Content Extraction Library, which appeared more fitting to its author, Grant Ingersoll. Perhaps a name including Tika would have been most appropriate, considering that this capability is a small adapter to Tika. You may have noticed that the DIH includes this capability via the appropriately named TikaEntityProcessor.
The complete reference material for Solr Cell is available at https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
Every file format is different and all of them provide different types of metadata, as well as different methods of extracting content. The heavy lifting of providing a single API to an ever-expanding list of formats is delegated to Apache Tika.
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Tika supports a wide variety of formats, from the predictable to the unexpected. Some of the most commonly used formats supported are Adobe PDF and Microsoft Office, including Word, Excel, PowerPoint, Visio, and Outlook. Other supported formats include extracting metadata from images such as JPG, GIF, and PNG, as well as from various audio formats such as MP3, MIDI, and Wave audio. Tika itself does not attempt to parse the individual document formats. Instead, it delegates the parsing to various third-party libraries, while providing a high-level stream of XML SAX events as the documents are parsed. A full list of the document formats supported by the 1.5 version, which is used by Solr 4.8, is available at http://tika.apache.org/1.5/formats.html.
Solr Cell is a fairly thin adapter to Tika, consisting of a SAX ContentHandler that consumes the SAX events and builds the input document from the fields that are specified for extraction.
Some not-so-obvious things to keep in mind when indexing binary documents are as follows:

- If Tika misdetects the format of a document, you can specify the MIME type explicitly with the stream.type parameter.
- The SolrContentHandler that is used by Solr Cell is fairly simplistic. You may find that you need to perform extra massaging of the data being indexed beyond what Solr Cell offers to reduce the junk data being indexed. One approach is to implement a custom Solr UpdateRequestProcessor, described later in this chapter. Another is to subclass ExtractingRequestHandler and override createFactory() to provide a custom SolrContentHandler.

You can learn more about the Tika project at http://tika.apache.org/.
A sample request handler for parsing binary documents, in solrconfig.xml, looks like the following code:
<requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="map.Last-Modified">last_modified</str>
    <str name="uprefix">metadata_</str>
  </lst>
</requestHandler>
Here, we can see that the Tika metadata attribute Last-Modified is being mapped to the Solr field last_modified, assuming we are provided that Tika attribute. The uprefix parameter specifies the prefix to use when storing any Tika fields that don't have a corresponding matching Solr field.
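With this handler configured, a document can be posted to it over HTTP. The following curl invocation is an illustrative sketch; the host, core name, file name, and unique key value are assumptions, not part of the configuration above:

```shell
# Send a PDF to the /update/extract handler (host, core, and file names are
# illustrative). literal.id supplies the unique key value, and commit=true
# makes the document searchable immediately.
curl 'http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true' \
  -F 'myfile=@example.pdf'
```

The file is sent as a multipart upload via -F; Solr Cell hands the stream to Tika for extraction.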
Solr Cell is distributed as a contrib module and is made up of solr-cell-4.x.x.jar and roughly 25 more JARs that support parsing individual document formats. In order to use Solr Cell, you will need to place the Solr Cell JAR and its supporting JARs in the lib directory of the core, as they are not included by default in solr.war. To share these libraries across multiple cores, add them to ./examples/cores/lib/ instead.
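Assuming the stock Solr 4.x distribution layout (the dist/ and contrib/extraction/lib/ paths below are typical, but may differ in your installation), the setup might look like this:

```shell
# Copy the Solr Cell JAR and its supporting parser JARs into a shared lib
# directory (paths are illustrative; adjust to your installation layout).
mkdir -p ./examples/cores/lib
cp dist/solr-cell-4.*.jar ./examples/cores/lib/
cp contrib/extraction/lib/*.jar ./examples/cores/lib/
```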
Before jumping into examples, we'll review Solr Cell's configuration parameters, all of which are optional. They are organized here roughly by their sequence of use internally.

First, Solr Cell (or, more specifically, Tika) determines the format of the document. It generally makes good guesses, but it can be assisted with these parameters:

- resource.name: This is an optional parameter for specifying the name of the file. It can assist Tika in detecting the correct format from the file extension.
- stream.type: This optional parameter allows you to explicitly specify the MIME type of the document, taking precedence over Tika's detection.
Tika converts all input documents into a basic XHTML document, including metadata in the head section. The metadata becomes fields, and all text within the body goes into the content field. The following parameters further refine this:
- capture: This is the XHTML element name (for example, "p") to be copied into its own field; it can be set multiple times.
- captureAttr: This is set to true to capture XHTML attributes into fields named after the attribute. A common example is for Tika to extract the href attributes from all the <a/> anchor tags for indexing into a separate field.
- xpath: This allows you to specify an XPath query to filter which element's text is put into the content field. To return only the metadata and discard all the body content of the XHTML, you would use xpath=/xhtml:html/xhtml:head/descendant:node(). Notice the use of the xhtml: namespace prefix for each element. Note that only a limited subset of the XPath specification is supported. See http://tika.apache.org/0.8/api/org/apache/tika/sax/xpath/XPathParser.html. The API fails to mention that it also supports /descendant:node().
- literal.[fieldname]: This allows you to supply the specified value for this field, for example, for the unique key field.

At this point, each resulting field name is potentially renamed in order to map into the schema. These parameters control this process:
- lowernames: This is set to true to lowercase the field names and convert nonalphanumeric characters to an underscore. For example, Last-Updated becomes last_updated.
- fmap.[tikaFieldName]: This maps a source field name to a target field name. For example, fmap.last_modified=timestamp maps the last_modified metadata field generated by Tika to be recorded in the timestamp field defined in the Solr schema.
- uprefix: This prefix is applied to the field name if the unprefixed name doesn't match an existing field. It is used in conjunction with a dynamic field for mapping individual metadata fields separately:

uprefix=meta_
<dynamicField name="meta_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

- defaultField: This is a field to use if uprefix isn't specified and no existing fields match. It can be used to map all the metadata fields into one multivalued field:

defaultField=meta
<field name="meta" type="text_general" indexed="true" stored="true" multiValued="true"/>
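The renaming steps above can be sketched in code. The following Python function is an approximation of the mapping order (lowernames, then fmap, then uprefix or defaultField for unmatched names), not Solr's actual implementation; the function name and signature are invented for illustration:

```python
import re

def map_field_name(name, schema_fields, fmap=None, lowernames=False,
                   uprefix=None, default_field=None):
    """Approximate sketch of Solr Cell's field-name mapping: lowernames
    first, then fmap.*, then uprefix/defaultField for unknown fields."""
    if lowernames:
        # Lowercase and replace runs of nonalphanumeric characters with '_'
        name = re.sub(r'[^a-z0-9]+', '_', name.lower())
    if fmap and name in fmap:
        name = fmap[name]
    if name in schema_fields:
        return name
    if uprefix is not None:
        # Relies on a matching dynamic field, such as meta_*
        return uprefix + name
    if default_field is not None:
        return default_field
    return name

# Last-Updated -> last_updated via lowernames
print(map_field_name('Last-Updated', {'last_updated'}, lowernames=True))
# An unknown Tika field falls under the meta_ prefix
print(map_field_name('Content-Encoding', {'last_updated'},
                     lowernames=True, uprefix='meta_'))
```

Running this prints last_updated and meta_content_encoding, mirroring the examples given for the lowernames and uprefix parameters.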
The other miscellaneous parameters are as follows:

- boost.[fieldname]: This boosts the specified field by the given factor, a float value, to affect scoring, for example, boost.last_modified=2.5. The default value is 1.0.
- extractOnly: Set this to true to return the XHTML structure of the document as parsed by Tika without indexing it. This is typically done in conjunction with wt=json&indent=true to make the XHTML easier to read. The purpose of this option is to aid in debugging.
- extractFormat: This defaults to xml (when extractOnly=true) to produce the XHTML structure. It can be set to text to return the raw text extracted from the document.
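As a sketch of that debugging workflow, a request like the following (the host, core, and file names are assumptions) returns Tika's view of the document without indexing anything:

```shell
# Preview what Tika extracts from a document without indexing it
# (URL and file name are illustrative).
curl 'http://localhost:8983/solr/collection1/update/extract?extractOnly=true&wt=json&indent=true' \
  -F 'myfile=@example.pdf'
```

Adding extractFormat=text to the query string would return plain extracted text instead of the XHTML structure.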