The Whoosh search engine

Whoosh is a search engine written entirely in Python. It's slightly easier to install and run than the Sphinx search engine and some would argue it generally feels "more pythonic". To install Whoosh you can simply easy_install Whoosh or visit http://whoosh.ca to get the latest development version.

The fact that Whoosh is pure Python is very convenient for developers who are not interested in or lack knowledge of Java or compiling UNIX software. It can get you up and running quickly and it supports integration with Django, as we'll see shortly.

Just like with Sphinx, we need to define and build a set of indexes on our data before Whoosh can search it. The Whoosh documentation is very extensive (another advantage of a pure Python tool) and explains all the indexing options in great detail. We will present a quick tutorial here, before continuing on to using Whoosh with Django.

In Sphinx we defined our indexes using the index section of our sphinx.conf. In Whoosh, we use a Schema object, which is defined purely in Python (of course). The schema performs the same role as the index section: it defines the fields and additional options for the index we will build around our data. Unlike Sphinx, Whoosh requires the schema to include "type" information for each field we want to index. This type information affects how Whoosh indexes our data.

The Sphinx approach to indexing was to treat each field in our source definition as text whose contents would be chunked and preprocessed. Whoosh supports this as well, but it also provides alternative methods of handling fields. This is the idea behind field types. The TEXT type is used for fields of text, like Sphinx, but the KEYWORD and ID field types are treated differently.

Fields with the KEYWORD type are treated as lists of keywords or terms, separated by commas or spaces. This would be useful for indexing lists of tags. The ID type is used on fields that store single term values. The documentation suggests using this type for things like URLs and e-mail addresses.
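To make the difference between the three field types concrete, here is a rough pure-Python sketch of how each one might turn a field value into index terms. It is an illustration only, not Whoosh's implementation: the helper names text_terms, keyword_terms, and id_terms are hypothetical, and the real TEXT type runs a configurable analyzer rather than a simple regular expression.

```python
import re


def text_terms(value):
    """TEXT-like handling: lowercase the value and break it into words."""
    return re.findall(r"\w+", value.lower())


def keyword_terms(value, commas=False):
    """KEYWORD-like handling: split on commas or on whitespace."""
    parts = value.split(",") if commas else value.split()
    return [part.strip() for part in parts if part.strip()]


def id_terms(value):
    """ID-like handling: the whole value is kept as one single term."""
    return [value]


print(text_terms(u"Organic Cranberry Sauce"))
print(keyword_terms(u"holiday, canned goods", commas=True))
print(id_terms(u"cranberry-sauce"))
```

Note how the ID-style helper leaves the slug intact, so it can only be matched as a whole, while the TEXT-style helper would let a search for a single word find the product name.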

For our Product model, the field types for the name and description would be TEXT. The slug field is probably best treated as TEXT, but could be considered an ID field (if you wanted to search based on slug). The corresponding Whoosh schema definition would look like this:

from whoosh.fields import Schema, TEXT

# Every field we want to index is declared with its field type.
schema = Schema(name=TEXT, slug=TEXT, description=TEXT)

The Whoosh search engine stores indexes on the disk in files specified when you create an index. This is similar to the path statement in the index section of our Sphinx configuration. Remember, this is a pure Python implementation, so instead of using an indexer tool or other utility to generate our indexes, we have to write a Python script.

To do this we need a Whoosh index object, which defines the location of the index files as well as the schema to be used for indexing. Whoosh includes a convenient function we can use to create this object, called create_index_in.

from whoosh.index import create_index_in

index = create_index_in('indexes', schema)

This will create our index files in a directory called indexes and use the schema object we defined earlier. Likewise, when we're ready to load indexes we've previously created, Whoosh includes a convenient function for that too:

from whoosh.index import open_dir

index = open_dir('indexes')

To actually index data, we have to obtain a Whoosh IndexWriter object and pass it our field data. The following code snippet will index all of our Product objects:

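writer = index.writer()

# Add one document per product, passing a value for each schema field.
for product in Product.objects.all():
    writer.add_document(name=product.name,
                        slug=product.slug,
                        description=product.description)

# Nothing is written to disk until the writer is committed.
writer.commit()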

This indexes all of our Product objects as documents in the index defined earlier. The last line calls the commit() method of the IndexWriter object, which writes the full index to the disk in the location we specified with create_index_in.

If this feels like a very manual process, similar to the exploration of the Sphinx Python API from earlier, it's because it is. We're almost finished, however. Now that we've created an index for all of our Product objects, we can perform a search query by creating a Whoosh Searcher object from our index:

searcher = index.searcher()

And then, we search our index by calling the find method:

results = searcher.find("name", u"cranberry sauce")

Notice how this method requires us to specify the search field in addition to the search term. Each result is returned as a dictionary-like object that maps field names to their corresponding values. The Whoosh API includes additional objects and methods for searches and queries, including a special query language. But the Whoosh engine alone doesn't help us add search to our Django models directly. For that we can use Haystack.
