Configuring the Sphinx search engine

The Sphinx full-text search engine was created by Andrew Aksyonoff and is available from http://www.sphinxsearch.com. It is a standalone search engine: it is built and run as a background application, and client applications, such as our Django site, communicate with it directly. Installing on Windows is simply a matter of downloading the pre-compiled binaries from the Downloads section of the homepage, unzipping them, and configuring Sphinx as a Windows service. Linux and Mac OS X users will need to compile from source (see the Installation section of the documentation for details).

Sphinx can index data from a variety of sources. In our examples, we will connect it to our Django database running on MySQL. It also supports the PostgreSQL database and can even index text files, including HTML and XML documents. For our purposes we will examine a simple MySQL configuration using our Django database.

Sphinx includes two important tools: the search daemon (searchd) and an indexing application (indexer). As Sphinx is a standalone application, the full-text indexes used in searching must be generated by the indexer tool. The indexes are not stored as part of your database, but as Sphinx-specific files. The location of these files, as well as all other indexer parameters, is specified in a configuration file, usually called sphinx.conf.

The sphinx.conf file is the heart of the search configuration. It generally specifies two things: sources and indexes. Sources correspond to the database queries we will build indexes against. Indexes include all the different parameters for indexing our data, including morphology, stop words, minimum word lengths, and so on. We will detail these settings shortly.

Defining the data source

We begin by defining our data source. Using our Product model from earlier in the book, we can define a source that includes the name and description fields for our products. Django automatically generates tables for our models, but we need to know which table names to give Sphinx. The Django convention is to combine the application and model names (appname_modelname), though this can be overridden. If you aren't sure which tables are used in your project, you can verify by running django-admin.py sqlall with your application name as an argument.

In our case, the Product table should be products_product. The Sphinx configuration file has a very simple syntax. We start by defining a source and giving it a name; this name will be used for reference when we define indexes later. Here is a complete source section for products_product:

source products_product
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = xxxxxx
    sql_db   = coleman_book

    sql_query = SELECT id, name, slug, description FROM products_product
}

The usual database parameters are included at the top. The important part is the sql_query statement. This is the query used to retrieve the data we want to index. It should include all of the fields we plan to search as well as the primary key value. Sphinx only supports primary keys that are positive integers, which is the default for Django models.

This source defines the id, name, slug, and description columns of the products_product table as the source's data. You can write any SQL here that the database backend supports, which means we could include a WHERE clause to filter the source data arbitrarily. Because the results can also be filtered by application logic, most of the time it's better to index the larger data set and let our Django application filter what it needs. For very large data sets, however, it may be useful to segment sources this way.
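As a sketch of such a filtered source, suppose our Product model had a hypothetical is_active column (not part of the model as defined in the book) and we only wanted active products searchable:

```
source active_products
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = xxxxxx
    sql_db   = coleman_book

    # is_active is a hypothetical column used here only to
    # illustrate filtering the indexed rows with a WHERE clause
    sql_query = SELECT id, name, slug, description FROM products_product WHERE is_active = 1
}
```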

There can be as many source definitions in the sphinx.conf as are needed for a project. These will generally differ based on the tables and fields that need indexing. For example, we could define a simpler source that just included the name field for an index that would exclusively search product names.
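Such a name-only source might look like this (a sketch reusing the connection settings from above; the name product_names is our own choice):

```
source product_names
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = xxxxxx
    sql_db   = coleman_book

    # only the primary key and the name field are indexed
    sql_query = SELECT id, name FROM products_product
}
```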

When we define indexes, we can index as many different sources as needed, and all of the results will be combined into one large set. This is convenient, but complicated by the fact that primary key values must be unique across all sources; if they are not, searches will return unexpected results. This makes it difficult to combine Django models, for instance, because each model's primary keys typically start from 1 and will overlap. To keep things simple, in our example usage we will only index the single source defined above.
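If you did need to combine two models in one index, one common workaround (sketched here, not taken from the text) is to transform the primary keys in each sql_query so the ID ranges can never collide. The reviews_review table and its fields below are hypothetical:

```
# Even document IDs are reserved for products, odd ones for reviews,
# so the two sources can never produce the same ID.

source products_product
{
    # ... connection settings as before ...
    sql_query = SELECT id * 2 AS id, name, description FROM products_product
}

source reviews_review
{
    # ... connection settings as before ...
    sql_query = SELECT id * 2 + 1 AS id, title AS name, body AS description FROM reviews_review
}
```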

Defining the indexes

Index definitions are very similar to source definitions. We start by naming the index and then define its source and other properties, as below:

index products_product
{
    source  = products_product
    path    = /usr/local/sphinx/var/data/products_product
    morphology = stem_en
    stopwords = stopwords-en.txt
    min_word_len = 2
}

The index definition includes the source, defined earlier in the file, as well as a path to the filesystem location where Sphinx will store this index. Sphinx uses special flat index files to persist indexes. These are loaded from disk when the search daemon begins and must be reloaded as new documents appear in the sources and the index is updated. We will discuss more on updating indexes later in this section.

In addition to these mandatory statements, we have the morphology, stopwords, and min_word_len parameters. These affect the construction of the index in complicated but very useful ways. Unlike a simple MySQL FULLTEXT index, Sphinx supports these advanced preprocessing functions.

The morphology parameter allows Sphinx to apply "morphology preprocessors": a list of filters that use natural-language knowledge in an attempt to generate more accurate search results. The stem_en preprocessor is a Sphinx built-in that applies an English-language "stemmer." A stemmer normalizes words by indexing their root instead of longer variations. For example, the word 'cranberries' would be indexed as 'cranberry', in an effort to provide accurate search results for either term. The stemming function is applied both to the indexed data and to search terms.
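To see conceptually what stemming does, here is a toy Python sketch. This is not the Porter algorithm that Sphinx's stem_en actually uses; it is a deliberately crude illustration of normalizing words to a common root:

```python
def toy_stem(word):
    """Very crude suffix stripping, for illustration only.

    Sphinx's stem_en applies the full Porter stemming algorithm;
    this toy version only handles a couple of plural endings.
    """
    word = word.lower()
    if word.endswith("ies") and len(word) > 5:
        return word[:-3] + "y"   # cranberries -> cranberry
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # documents -> document
    return word

# Both the indexed text and the search terms get the same treatment,
# so a search for 'cranberries' matches documents containing 'cranberry'.
print(toy_stem("cranberries"))  # cranberry
print(toy_stem("cranberry"))    # cranberry
```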

The stopwords parameter specifies a text file that includes a list of "stop words." These are words that will be ignored by the indexer and usually include very common words in the indexed language. In our example, we're providing a list of words in English, such as 'the' or 'and'. The stopwords parameter allows you to specify multiple files separated by commas.

Finally, the min_word_len parameter instructs Sphinx not to index words shorter than two characters. This is an optional parameter (as are morphology and stopwords), but it is useful in tweaking the search results for some data sets.
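The combined effect of the stopwords and min_word_len parameters can be sketched in a few lines of Python. This is an illustration of the filtering, not Sphinx's actual tokenizer:

```python
# A few common English stop words; Sphinx reads these from the
# file(s) named in the stopwords parameter instead.
STOPWORDS = {"the", "and", "a", "an", "of", "or"}

def index_terms(text, min_word_len=2):
    """Return the words that would survive indexing under our settings."""
    terms = []
    for word in text.lower().split():
        word = word.strip(".,!?")          # crude punctuation handling
        if len(word) < min_word_len:       # dropped by min_word_len = 2
            continue
        if word in STOPWORDS:              # ignored by the indexer
            continue
        terms.append(word)
    return terms

print(index_terms("A refreshing Cranberry cocktail"))
# ['refreshing', 'cranberry', 'cocktail']
```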

There are numerous other indexer options, but these are some of the more powerful ones. The Sphinx documentation provides a complete list with thorough descriptions of their usages.

Building and testing our index

Now that we've defined our sources and indexes, we can build the first index of our data. This is accomplished by running the indexer tool included in the Sphinx package. This will generate the index files and store them in the location specified in our index section's path statement.
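Assuming sphinx.conf is in Sphinx's default location, building the index looks like this (pass --config with the path to sphinx.conf if it lives elsewhere):

```
$ indexer products_product              # build just this index
$ indexer --all                         # build every index in sphinx.conf
$ indexer --rotate products_product     # rebuild while searchd is running
```

The --rotate flag tells a running searchd process to pick up the freshly built index files, which matters once the search daemon is running in production.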

Once indexes have been created, we can test our search results in the command line using Sphinx's search tool:

$ search cranberry juice
Sphinx 0.9.8.1-release (r1533)
Copyright © 2001-2008, Andrew Aksyonoff

displaying matches:
1.	document=9628, weight=2
id=9628
name=CranStore Cranberry Juice
slug=cranstore-cranberry-juice
description=A refreshing Cranberry cocktail
...

words:
1.	cranberry: 3 documents, 11 hits
2.	juice: 43 documents, 114 hits

This is a simple way to test your Sphinx indexes and to debug search results. It can be especially useful when trying to understand how Sphinx is using preprocessing filters, such as word stemming, to alter your search results. This command line utility is flexible and includes several flags for added power. For example, if you have multiple indexes, you can specify which index to run a test search against with the -i or --index parameter:

$ search --index products_product cranberry juice

When the configuration is to your liking, you can run the Sphinx search engine background process by issuing the searchd command. Under Windows, this will usually happen automatically when you install Sphinx as a service. On other systems, searchd can be added to your init scripts to execute automatically when the system starts.
