Chapter 4. Indexing Data

In this chapter, we're going to explore ways to get data into Solr. The process of doing this is referred to as indexing, although the term importing is also used.

This chapter is structured as follows:

  • Communicating with Solr
  • Sending data using Solr's Update-XML, JSON, and CSV formats
  • Commit, optimize, rollback, and the transaction log
  • Atomic updates and optimistic concurrency
  • Importing content from a database or XML using Solr's DataImportHandler (DIH)
  • Extracting text from rich documents through Solr's ExtractingRequestHandler (also known as Solr Cell)
  • Post-processing documents with UpdateRequestProcessors

You will also find some related options in Chapter 9, Integrating Solr, that have to do with language bindings and framework integration, including a web crawler. Most use Solr's Update-XML format.

Tip

In a hurry?

There are many approaches to indexing, and you don't need to be well versed in all of them. The section on commit and optimize is important for everyone because it is universal. If you plan to use a Solr integration framework that handles indexing data, such as Sunspot for Ruby on Rails, then you can follow the documentation for that framework and skip this chapter for now. Otherwise, the DataImportHandler will likely satisfy your needs.

Communicating with Solr

There are quite a few options when it comes to importing data into Solr. In this section, we'll examine a few of those, and then follow up with interaction examples. Details on specific formats, such as Solr's Update-XML, will come later.

The following diagram represents the high-level workflow of the indexing process in Solr. In addition to the predefined importing mechanisms, you can also build custom import handlers. Before generating the index, Solr uses the field definitions and other configurations from schema.xml and solrconfig.xml to process the data for each field.

(Diagram: the high-level workflow of the indexing process in Solr)

Using direct HTTP or a convenient client API

Most applications interact with Solr over HTTP. This can either be done using a typical HTTP client, or indirectly via a Solr integration API such as SolrJ or Sunspot. Such APIs are discussed in Chapter 9, Integrating Solr.

Note

Another option is to embed Solr into your Java application instead of running it as a server. The SolrJ API is conveniently used for both remote and embedded use. More information about SolrJ and Embedded Solr can be found in Chapter 9, Integrating Solr.

Pushing data to Solr or having Solr pull it

Even though an application will be communicating with Solr over HTTP, it does not have to include the documents to be indexed in the request. Solr supports what it calls remote streaming in which it's given a URL to the data. It might be an HTTP URL, but it's more likely to be a filesystem-based URL, applicable when the data is already on Solr's machine or a locally accessible filesystem. This option avoids the overhead of sending documents over HTTP. Another way to ask Solr to pull data is to use the DataImportHandler (DIH), which can pull data from a database and other sources. The DIH offers an extensible framework that can be adapted to custom data sources.

Data formats

The following are various data formats for indexing data into Solr:

  • Solr's Update-XML: Solr accepts documents that are expressed in XML conforming to a simple Solr-specific format. This XML option also has support for commands such as delete, commit, and optimize.

    • Other XML: Any arbitrary XML can be given to Solr along with an XSLT file that Solr will use to translate the XML to the Update-XML format for further processing. There is a short example of this in the Importing XML from a file with XSLT section, by way of comparison.

  • Solr's Update-JSON: This is a JavaScript Object Notation (JSON) variation of Solr's Update-XML. For more details, see https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations.
  • Java-Bin: This is an efficient binary variation of Solr's Update-XML. Officially, only the SolrJ client API supports this, but there is a third-party Ruby port too.
  • CSV: A comma (or other character) separated value format. (A brief sketch of the JSON and CSV forms follows this list.)
  • Rich documents: Most user file formats, such as PDF, XLS, DOC, and PPT. The text and metadata are extracted from these formats and put into various Solr fields. This is enabled via the Solr Cell contrib module.

    Tip

    The DataImportHandler contrib module is a flexible data import framework. It has out-of-the-box support for dealing with arbitrary XML and even e-mail. It is commonly used for pulling data from relational databases. For some enterprises, it might be more appropriate to integrate Solr with Apache Camel, a versatile open-source integration framework based on well-known Enterprise Integration Patterns. It provides a nice Domain Specific Language (DSL) for wiring together different data sources, performing transformations, and finally sending data to Solr. For more details, see http://camel.apache.org/solr.html.
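To give a rough feel for the non-XML formats, here is what a single document might look like in Update-JSON and in CSV. The field names are purely illustrative (not from the MusicBrainz schema used later), and the handler paths depend on what is configured in your solrconfig.xml:

[ {"id": "1", "name": "The Smiths"} ]

The preceding JSON would typically be posted to /update/json (or to /update with a Content-type of application/json). The CSV equivalent is a header line followed by data rows, typically posted to /update/csv:

id,name
1,The Smiths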

We'll demonstrate Solr's capability to import MusicBrainz data in XML, CSV, and from a database. Other examples will include rich document importing via the DIH and Solr Cell. Most likely, an application would use just one format.

Before describing these approaches, we'll discuss Solr's HTTP POST options and remote streaming, which are foundational.

Solr's HTTP POST options

Solr receives commands and possibly documents through HTTP POST.

Tip

Solr lets you use HTTP GET too, such as direct web browser access. However, GET is an inappropriate HTTP verb for anything other than retrieving data, and most web servers limit the size of a GET request, so overly long requests will not be processed correctly. For more information on this concept, read about REST at http://en.wikipedia.org/wiki/Representational_State_Transfer.

One way to send an HTTP POST is through the Unix command-line program curl (also available on Windows through Cygwin: http://www.cygwin.com). An alternative, cross-platform import tool that comes with Solr is post.jar (also known as SimplePostTool), located in Solr's example/exampledocs directory. To get some basic guidance on how to use it, run the following command:

>> java -jar example/exampledocs/post.jar -help

You'll see in a bit that you can post name-value pair options as HTML form data. However, post.jar doesn't support that, so you'll have to specify the URL and put the options in the query string.

Note

The post.jar tool also has an auto mode, which guesses the content type for you and sets a default ID and filename when sending to Solr. A recursive option lets you automatically post a whole directory (including subdirectories).
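As a rough sketch, here is how posting our artists.xml file with an explicit target URL and content type might look, along with an auto/recursive run over a directory. The system property names (url, type, auto, recursive) come from post.jar's -help output and can vary between Solr versions, so treat these as illustrative:

>> java -Durl=http://localhost:8983/solr/mbartists/update -Dtype=text/xml -jar example/exampledocs/post.jar artists.xml
>> java -Dauto=yes -Drecursive=yes -jar example/exampledocs/post.jar /path/to/documents/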

For the next set of examples, we'll use the command-line program curl.

There are several ways to tell Solr to index data, and all of them are through HTTP POST:

  • Send the data as the entire POST payload. The curl program can do this with --data-binary (among other ways) and an appropriate content-type header for whatever the format is.
  • Send name-value pairs akin to an HTML form submission. With curl, such pairs are preceded by -F. If you're giving data to Solr to be indexed, as opposed to having Solr pull it from a database or URL, there are a couple of ways to do that:
    • Put the data into the stream.body parameter. If it's small, perhaps less than a megabyte, this approach is fine. The limit is configured with the multipartUploadLimitInKB setting in solrconfig.xml, defaulting to 2 GB. If you're tempted to increase this limit, you should reconsider your approach.
    • Refer to the data through either a local file on the Solr server using the stream.file parameter or a URL that Solr will fetch through the stream.url parameter. These choices are a feature that Solr calls remote streaming.

Here is an example of the first choice. Let's say we have a Solr Update-XML file named artists.xml in the current directory. We can post it to Solr using the following command line:

>> curl http://localhost:8983/solr/mbartists/update -H 'Content-type:text/xml; charset=utf-8' --data-binary @artists.xml

If it succeeds, you'll have output that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
    <int name="status">0</int><int name="QTime">128</int>
</lst>
</response>
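For context, artists.xml is simply a file in Solr's Update-XML format. A minimal sketch of what such a file contains is shown next; the field names here are illustrative, and the real MusicBrainz fields are covered later in this chapter:

<add>
  <doc>
    <field name="id">1</field>
    <field name="name">The Smiths</field>
  </doc>
</add>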

To use the stream.body feature for the preceding example, do the following:

curl http://localhost:8983/solr/mbartists/update -F stream.body=@artists.xml

In both cases, the @ character instructs curl to read the data from the file rather than sending the literal text @artists.xml. If the XML is short, you can just as easily specify it literally on the command line:

curl http://localhost:8983/solr/mbartists/update -F stream.body=' <commit />'

Notice the leading space in the value; this was intentional. Without it, curl would interpret the leading < (as it does @) as an instruction to read the data from somewhere else. In this case, it might be more appropriate to use --form-string instead of -F, which passes the value literally. However, it requires more typing, and we're feeling lazy.
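For reference, here is roughly what the --form-string variant looks like; because the value is passed verbatim, no leading space is needed:

curl http://localhost:8983/solr/mbartists/update --form-string stream.body='<commit />'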

Remote streaming

In the preceding examples, we've given Solr the data to index in the HTTP message. Alternatively, the POST request can give Solr a pointer to the data in the form of either a file path accessible to Solr or an HTTP URL to it.

Note

The file path is relative to the server running Solr, not the client. Additionally, the files must have the proper filesystem permissions so that Solr can access them.

Just as in the earlier case, the originating request does not return a response until Solr has finished processing it. If the file is of a decent size or is already at some known URL, then you may find remote streaming faster and/or more convenient, depending on your situation.

Here is an example of Solr accessing a local file:

curl http://localhost:8983/solr/mbartists/update -F stream.file=/tmp/artists.xml

Note that we're passing a name-value parameter (stream.file and the path), not the actual data. To have Solr fetch the data from a URL instead, use the stream.url parameter with the URL as its value, as in the sketch below.
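Assuming artists.xml were reachable at some URL (example.com here is just a stand-in), the request might look like this:

curl http://localhost:8983/solr/mbartists/update -F stream.url=http://example.com/artists.xml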

Tip

Security risk

Use of remote streaming (stream.file or stream.url) is enabled by default in solrconfig.xml with the enableRemoteStreaming setting. This can be considered a security risk; so only keep it on if Solr is protected. See Chapter 11, Deployment, for more information.
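In the stock example solrconfig.xml, this setting lives on the requestParsers element inside requestDispatcher. A sketch of what it looks like follows; the attribute values shown are typical defaults and may differ in your configuration:

<requestDispatcher>
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000" />
</requestDispatcher>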
