Search engines overview

The number of search engines available in the open source community has exploded in recent years. Many of these tools grew out of academic and commercial projects, and most are of extremely high quality. We'll discuss a handful of them in this section and go on to integrate two of them with Django later in this chapter.

Sphinx

The Sphinx full-text search engine is a free and open source search engine product that has the added benefit of official, paid-support packages. It is available at http://www.sphinxsearch.com/. Sphinx is written in C++ and is available for most UNIX platforms, including Mac OS X. A Windows version is also available, but is officially not recommended for production use. Sphinx is fast and ranks results with good relevance.

Sphinx includes a search daemon that can be run as a background process on any system. Sphinx also includes API libraries for several popular programming languages, including Python. Sphinx requires you to define your search indexes using a configuration file. This file is read by the included indexer utility, which produces the full-text indexes. The indexer must be re-run on a regular schedule to keep the indexes in sync with your database tables.
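
To make the indexer's role concrete, here is a minimal sketch of the kind of structure a full-text indexer builds: an inverted index that maps each term to the rows containing it. This is not how Sphinx itself is implemented. The `tokenize`, `build_index`, and `search` functions and the sample rows are hypothetical, and real engines add stemming, ranking, and on-disk storage. The sketch also suggests why the indexer has to be re-run: rows added after a build stay invisible to searches until the index is rebuilt.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows):
    """Build an inverted index: term -> set of row ids containing that term.

    `rows` maps a row id to its text, standing in for a database table.
    """
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in tokenize(text):
            index[term].add(row_id)
    return index


def search(index, query):
    """Return the sorted ids of rows that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


# Hypothetical table contents, keyed by primary key.
rows = {
    1: "Cranberry sauce recipe",
    2: "Cranberry juice",
    3: "Apple sauce",
}
index = build_index(rows)

# A row added after the index was built is not searchable yet.
rows[4] = "Spicy cranberry sauce"
stale_results = search(index, "cranberry sauce")

# Re-running the indexer picks up the new row.
index = build_index(rows)
fresh_results = search(index, "cranberry sauce")
```

Here `stale_results` contains only row 1, while `fresh_results` also contains row 4 once the index has been rebuilt.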

For use with Django, Sphinx is a good choice due to David Cramer's excellent django-sphinx utility. This is a wrapper around the Sphinx API that allows Sphinx-specific functionality to be attached to Django models. The django-sphinx utility is available from: http://github.com/dcramer/django-sphinx.

Unlike some of the other Django community projects for search, the django-sphinx application doesn't do any index definition or other search engine tweaking from Django itself. The search engine and indexes are configured entirely via the Sphinx configuration file. This is a matter of preference, as some of the other Django tools allow you to define indexes directly in Python.

Solr

Solr is a search engine built on top of the Apache Lucene search library and it is available at http://lucene.apache.org/solr/. Solr is written in Java and includes many advanced features. It requires a Java Servlet Container within which to run. This makes it a popular choice for enterprise applications where Java and Servlets are very common. To run a standalone Solr server, you'll need a container like Apache's Tomcat or the Jetty application server.

Solr support can be achieved in Django using the Haystack application, which we will discuss later in this chapter. Additional Solr support is also available from the django-solr-search project (also called solango). If you are already using Solr as a search platform internally, these solutions make it easy to attach support to any Django applications you may build. The django-solr-search project is available at: http://code.google.com/p/django-solr-search/.

One nice feature of solango is that you can define the search parameters for your Django models and have it generate the necessary Solr XML schemas automatically. Solango also supports more advanced Solr features like faceting and highlighting.
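
The idea of generating a schema from field definitions can be sketched as follows. This is not solango's actual code or output: the `SEARCH_FIELDS` mapping and the `build_schema` function are hypothetical, and the generated XML is only a simplified fragment in the general shape of a Solr `schema.xml` (a real schema also needs field type definitions, among other things).

```python
import xml.etree.ElementTree as ET

# Hypothetical description of the searchable fields on a model:
# field name -> (Solr field type, indexed?, stored?)
SEARCH_FIELDS = {
    "id": ("string", True, True),
    "name": ("text", True, True),
    "description": ("text", True, False),
}


def build_schema(schema_name, fields):
    """Return a simplified Solr-style schema document as an XML string."""
    schema = ET.Element("schema", name=schema_name)
    fields_el = ET.SubElement(schema, "fields")
    for field_name, (field_type, indexed, stored) in fields.items():
        ET.SubElement(
            fields_el,
            "field",
            name=field_name,
            type=field_type,
            # Solr expects lowercase "true"/"false" attribute values.
            indexed=str(indexed).lower(),
            stored=str(stored).lower(),
        )
    # uniqueKey names the field that identifies each document.
    ET.SubElement(schema, "uniqueKey").text = "id"
    return ET.tostring(schema, encoding="unicode")


schema_xml = build_schema("products", SEARCH_FIELDS)
```

Keeping the field definitions in Python means the schema can be regenerated whenever the definitions change, rather than being edited by hand in two places.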

Whoosh

Whoosh is a search engine written entirely in Python. As a result, it requires no compilation, making it relatively simple to install and maintain. It's an excellent solution for Python projects that need to add search functionality and don't already have a search engine in place. It advertises "fast indexing and retrieval", but its performance is likely lower than that of some of the other search engines we've discussed, which are written in C++ or Java. Whoosh is available at: http://whoosh.ca/.

Whoosh is a very compelling tool if you're interested in a pure-Python solution or don't want to bother with the more complicated setups required with other tools. It is relatively full-featured, and includes much of the advanced functionality you'll find in Sphinx, Solr, or Xapian. We will work with Whoosh and Haystack in the coming pages.

At the time of this writing, however, Whoosh is probably not a practical choice for anything but very small-scale or testing situations. Besides potential performance problems, Whoosh has several technical issues that have rendered it unusable on many Django projects.

Xapian

The Xapian project is another open source, C++ search engine library. It is available at http://xapian.org/. It has bindings for many languages, including Python. It is a very powerful search engine and has an excellent community project for Django called djapian, available at http://code.google.com/p/djapian/.

The Xapian search engine library is a little different from the other tools we've covered. Unlike Solr or Sphinx, for example, there is no application to run under Xapian; it is simply a library. Djapian does a nice job of wrapping this functionality and storing the generated indexes in our Django database. It even includes Django management commands to perform index builds and other operations from the command line.

Haystack

Haystack is an ambitious Django project created by Daniel Lindsley that exposes a common interface to Django for a variety of pluggable search engine backends. These include Whoosh, Solr, and Xapian. This is an excellent tool for getting up and running quickly on any of these search platforms. The advantage of Haystack is that the interface for index definition and search queries remains the same, regardless of the engine being used on the backend. It is available at http://haystacksearch.org/.
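
The pluggable-backend idea can be illustrated with a small sketch. None of the names below come from Haystack: `SearchBackend`, `InMemoryBackend`, `BACKENDS`, and `get_backend` are hypothetical stand-ins, and a real backend would talk to Whoosh, Solr, or Xapian rather than a dictionary. The point is only that calling code depends on the shared interface, so switching engines becomes a configuration change.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Interface that every engine-specific backend must implement."""

    @abstractmethod
    def update(self, doc_id, text):
        """Add or replace a document in the index."""

    @abstractmethod
    def search(self, query):
        """Return the ids of documents matching the query."""


class InMemoryBackend(SearchBackend):
    """Toy backend: case-insensitive substring match over a dict."""

    def __init__(self):
        self._docs = {}

    def update(self, doc_id, text):
        self._docs[doc_id] = text.lower()

    def search(self, query):
        needle = query.lower()
        return sorted(d for d, text in self._docs.items() if needle in text)


# Registry of available backends, keyed by the name used in configuration.
BACKENDS = {"memory": InMemoryBackend}


def get_backend(name):
    """Instantiate the backend selected by a (hypothetical) setting."""
    try:
        backend_class = BACKENDS[name]
    except KeyError:
        raise ValueError("Unknown search backend: %r" % name)
    return backend_class()


# Calling code only uses the SearchBackend interface.
backend = get_backend("memory")
backend.update(1, "Cranberry sauce")
backend.update(2, "Apple pie")
results = backend.search("cranberry")
```

Adding support for another engine in this sketch would mean writing one more `SearchBackend` subclass and registering it in `BACKENDS`; the indexing and query calls would stay unchanged.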

Haystack also includes a lot of extra functionality, such as a set of predefined URL patterns, views, and forms, along with niceties like Django management commands and template tags for "More like this" and term highlighting. Haystack is very Django-specific, however, and it is not recommended as a general-purpose search tool or in cases where heavy usage is expected.
