Adding a search engine with Solr and Haystack

Now, we are going to add search capabilities to our blog. The Django ORM allows you to perform case-insensitive lookups using the icontains filter. For example, you can use the following query to find posts that contain the word framework in their body:

Post.objects.filter(body__icontains='framework')

However, if you need more powerful search functionality, you have to use a proper search engine. We are going to use Solr in conjunction with Django to build a search engine for our blog. Solr is a popular open-source search platform that offers full-text search, term boosting, hit highlighting, faceted search, and dynamic clustering, among other advanced search features.

To integrate Solr into our project, we are going to use Haystack. Haystack is a Django application that works as an abstraction layer for multiple search engines. It offers a simple search API very similar to Django QuerySets. Let's start by installing and configuring Solr and Haystack.

Installing Solr

You will need the Java Runtime Environment version 1.7 or higher to install Solr. You can check your Java version by running the command java -version at the shell prompt. The output might vary, but you need to make sure the installed version is at least 1.7:

java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

If you don't have Java installed or your version is lower than the required one, then you can download Java from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
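If you prefer to script this check, the following is a minimal sketch in Python. The function names are hypothetical, and the regular expression assumes the version appears in the form version "X.Y..." as in the output shown above; other Java distributions may format it differently.

```python
import re
import subprocess


def parse_java_version(output):
    """Return (major, minor) parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', output)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor)


def java_is_recent_enough(output, minimum=(1, 7)):
    """True if the reported version is at least `minimum` (1.7 by default)."""
    version = parse_java_version(output)
    return version is not None and version >= minimum


def installed_java_output():
    """Run `java -version` and return its text (Java prints it to stderr)."""
    proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return proc.stderr + proc.stdout
```

Calling java_is_recent_enough(installed_java_output()) on a machine with Java installed returns True or False; if the java executable is missing, subprocess.run raises FileNotFoundError.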

After checking your Java version, download Solr version 4.10.4 from http://archive.apache.org/dist/lucene/solr/. Unzip the downloaded file and go to the example directory within the Solr installation directory (that is, cd solr-4.10.4/example/). This directory contains a ready-to-use Solr configuration. From this directory, run Solr with the built-in Jetty web server using the following command:

java -jar start.jar

Open your browser and enter the URL http://127.0.0.1:8983/solr/. You should see something like the following:

[Screenshot: the Solr administration console]

This is the Solr administration console. This console shows you usage statistics and allows you to manage your search backend, check the indexed data, and perform queries.

Creating a Solr core

Solr allows you to isolate instances in cores. Each Solr core is a Lucene instance along with a Solr configuration, a data schema, and other required configuration to use it. Solr allows you to create and manage cores on the fly. The example configuration includes a core called collection1. You can see the information of this core if you click on the Core Admin menu tab, as shown in the following screenshot:

[Screenshot: the Core Admin tab showing the collection1 core]

We are going to create a core for our blog application. First, we need to create the file structure for our core. Inside the example directory within the solr-4.10.4/ directory, create a new directory and name it blog. Then create the following empty files and directories inside it:

blog/
    data/
    conf/
        protwords.txt
        schema.xml
        solrconfig.xml
        stopwords.txt
        synonyms.txt
        lang/
            stopwords_en.txt
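If you would rather not create these files by hand, the layout above can be sketched with a few lines of Python. The function name create_core_skeleton and the CORE_FILES list are hypothetical helpers for illustration; the paths are the ones listed above.

```python
from pathlib import Path

# Empty files that the core expects, relative to the core directory.
CORE_FILES = [
    "conf/protwords.txt",
    "conf/schema.xml",
    "conf/solrconfig.xml",
    "conf/stopwords.txt",
    "conf/synonyms.txt",
    "conf/lang/stopwords_en.txt",
]


def create_core_skeleton(example_dir, core_name="blog"):
    """Create the empty directories and files for a core; return its path."""
    core_dir = Path(example_dir) / core_name
    (core_dir / "data").mkdir(parents=True, exist_ok=True)
    for relative in CORE_FILES:
        target = core_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        # touch() leaves an existing file untouched, so re-running is safe.
        target.touch(exist_ok=True)
    return core_dir
```

For example, create_core_skeleton("solr-4.10.4/example") would create the blog directory inside the example directory, assuming you run it from the directory that contains your Solr installation.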

Add the following XML code to the solrconfig.xml file:

<?xml version="1.0" encoding="utf-8" ?>
<config>
  <luceneMatchVersion>LUCENE_36</luceneMatchVersion> 
  <requestHandler name="/select" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="qt">search</str>
      <str name="q">*:*</str>
    </lst>
  </requestHandler>
</config>
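A typo in this file will prevent the core from loading, so it can be worth checking that the XML is well formed before continuing. The following is a small sketch that uses only the Python standard library; request_handlers is a hypothetical helper name, and the check only covers XML syntax, not whether Solr accepts the configuration.

```python
import xml.etree.ElementTree as ET


def request_handlers(config_path):
    """Return {handler name: class} for each <requestHandler> in the file.

    ET.parse raises xml.etree.ElementTree.ParseError if the file is not
    well-formed XML.
    """
    root = ET.parse(config_path).getroot()
    return {
        handler.get("name"): handler.get("class")
        for handler in root.iter("requestHandler")
    }
```

Called on the file above, it should list the /select, /update, /admin, and /admin/ping handlers.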

You can also copy this file from the code that comes along with this chapter. This is a minimal Solr configuration. Edit the schema.xml file and add the following XML code:

<?xml version="1.0" ?>
<schema name="default" version="1.5">
</schema>

This is an empty schema. The schema defines the fields and their types for the data that will be indexed in the search engine. We are going to use a custom schema later.

Now, click on the Core Admin menu tab and then click on the Add Core button. You will see a form like the following that allows you to specify the information for your core:

[Screenshot: the Add Core form]

Fill in the form with the following data:

  • name: blog
  • instanceDir: blog
  • dataDir: data
  • config: solrconfig.xml
  • schema: schema.xml

The name field is the name you want to give to this core. The instanceDir field is the directory of your core. The dataDir is the directory where indexed data will reside. It is located inside the instanceDir. The config field is the name of your Solr XML configuration file and the schema field is the name of your Solr XML data schema file.
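The same fields can also be sent to Solr's CoreAdmin HTTP API instead of the form. This chapter uses the administration console, so treat the following as an optional sketch: core_create_url is a hypothetical helper that only builds the request URL from the values above, and it assumes Solr is running at the default address used in this chapter.

```python
from urllib.parse import urlencode


def core_create_url(solr_url="http://127.0.0.1:8983/solr", **overrides):
    """Build a CoreAdmin CREATE URL using the form values from this section."""
    params = {
        "action": "CREATE",
        "name": "blog",
        "instanceDir": "blog",
        "dataDir": "data",
        "config": "solrconfig.xml",
        "schema": "schema.xml",
    }
    # Keyword arguments replace the defaults, e.g. name="shop".
    params.update(overrides)
    return solr_url.rstrip("/") + "/admin/cores?" + urlencode(params)
```

Opening the resulting URL in a browser, or requesting it with urllib.request.urlopen, asks Solr to create the core; the helper itself performs no network access.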

Now, click the Add Core button. If you see the following, then your new core has been successfully added to Solr:

[Screenshot: the Core Admin tab listing the new blog core]

Installing Haystack

To use Solr with Django, we need Haystack. Install Haystack via pip using the following command:

pip install django-haystack==2.4.0

Haystack can interact with several search engine backends. To use the Solr backend, you also need to install the pysolr module. Run the following command to install it:

pip install pysolr==3.3.2

After installing django-haystack and pysolr, you need to activate Haystack in your project. Open the settings.py file and add haystack to the INSTALLED_APPS setting like this:

INSTALLED_APPS = (
    # ...
    'haystack',
)

You need to define the search engine backends for Haystack. You can do this by adding a HAYSTACK_CONNECTIONS setting. Add the following to your settings.py file:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr/blog'
    },
}

Notice that the URL points to our blog core. Haystack is now installed and ready to be used with Solr.

Building indexes

Now, we have to register the models we want to store in the search engine. The convention for Haystack is to create a search_indexes.py file in your application and register your models there. Create a new file in your blog application directory and name it search_indexes.py. Add the following code to it:

from haystack import indexes
from .models import Post

class PostIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    publish = indexes.DateTimeField(model_attr='publish')

    def get_model(self):
        return Post

    def index_queryset(self, using=None):
        return self.get_model().published.all()

This is a custom SearchIndex for the Post model. With this index, we tell Haystack which data from this model has to be indexed in the search engine. The index is built by subclassing indexes.SearchIndex and indexes.Indexable. Every SearchIndex requires that one of its fields has document=True. The convention is to name this field text. This field is the primary search field. With use_template=True, we are telling Haystack that this field will be rendered from a data template to build the document the search engine will index. The publish field is a datetime field that will also be indexed. We indicate that this field corresponds to the publish field of the Post model by using the model_attr parameter. The field will be indexed with the content of the publish field of the indexed Post object.

Additional fields like this one are useful to provide additional filters to searches. The get_model() method has to return the model for the documents that will be stored in this index. The index_queryset() method returns the QuerySet for the objects that will be indexed. Notice that we are only including published posts.

Now, create the path and file search/indexes/blog/post_text.txt in the templates directory of the blog application and add the following code to it:

{{ object.title }}
{{ object.tags.all|join:", " }}
{{ object.body }}

This is the default path for the document template for the text field of our index. Haystack uses the application name and the model name to build the path dynamically. Every time an object is indexed, Haystack builds a document based on this template and then indexes the document in the Solr search engine.
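The naming convention can be summarized in one line of Python. This is only an illustration of how the default path is composed from the application name, the model name, and the field name; document_template_path is a hypothetical helper, not a Haystack function.

```python
def document_template_path(app_label, model_name, field_name):
    """Compose the default document template path for an index field."""
    return "search/indexes/{}/{}_{}.txt".format(
        app_label, model_name.lower(), field_name
    )


# The text field of the index for the Post model in the blog application:
print(document_template_path("blog", "Post", "text"))
```

For our index, this gives the path used above, search/indexes/blog/post_text.txt.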

Now that we have a custom search index, we have to create the appropriate Solr schema. Solr's configuration is XML-based, so we have to generate an XML schema for the data we are going to index. Fortunately, Haystack offers a way to generate the schema dynamically, based on our search indexes. Open the terminal and run the following command:

python manage.py build_solr_schema

You should see XML output. If you take a look at the bottom of the generated XML code, you will see that Haystack generated fields for your PostIndex automatically:

<field name="text" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="publish" type="date" indexed="true" stored="true" multiValued="false" />

Copy the whole XML output from the initial tag <?xml version="1.0" ?> to the last tag </schema>, including both tags.

This XML is the schema to index data into Solr. Paste the new schema into the blog/conf/schema.xml file inside the example directory of your Solr installation. The schema.xml file is included in the code that comes along with this chapter, so you can also copy it directly from this file.
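If you capture the command output in a script instead of copying it by hand, the step of keeping everything from the initial tag to the closing </schema> tag could look like the following sketch. The function name extract_schema is hypothetical, and the code assumes the output contains exactly one schema.

```python
def extract_schema(output):
    """Return the text from the XML declaration through </schema>."""
    end_tag = "</schema>"
    start = output.find("<?xml")
    end = output.rfind(end_tag)
    if start == -1 or end == -1 or end < start:
        raise ValueError("no complete schema found in the output")
    # Keep both tags, drop anything printed before or after them.
    return output[start:end + len(end_tag)] + "\n"
```

You could obtain the output with subprocess.run([sys.executable, "manage.py", "build_solr_schema"], capture_output=True, text=True).stdout and write the result of extract_schema() to blog/conf/schema.xml.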

Open http://127.0.0.1:8983/solr/ in your browser and click on the Core Admin menu tab, then click on the blog core, and then click the Reload button:

[Screenshot: the Reload button for the blog core in Core Admin]

We reload the core so that it takes into account the schema.xml changes. When the core finishes reloading, the new schema is ready to index new data.

Indexing data

Let's index the posts of our blog into Solr. Open the terminal and execute the following command:

python manage.py rebuild_index

You should see the following warning:

WARNING: This will irreparably remove EVERYTHING from your search index in connection 'default'.
Your choices after this are to restore from backups or rebuild via the `rebuild_index` command.
Are you sure you wish to continue? [y/N]

Enter y for yes. Haystack will clear the search index and insert all published blog posts. You should see an output like this:

Removing all documents from your index because you said so.
All documents removed.
Indexing 4 posts

Open http://127.0.0.1:8983/solr/#/blog in your browser. Under Statistics, you should be able to see the number of indexed documents as follows:

[Screenshot: the Statistics panel of the blog core]

Now, open http://127.0.0.1:8983/solr/#/blog/query in your browser. This is a query interface provided by Solr. Click the Execute query button. The default query requests all documents indexed in your core. You will see JSON output with the results of the query. The returned documents look like the following:

{
    "id": "blog.post.1",
    "text": "Who was Django Reinhardt?
jazz, music
The Django web framework was named after the amazing jazz guitarist Django Reinhardt.",
    "django_id": "1",
    "publish": "2015-09-20T12:49:52Z",
    "django_ct": "blog.post"
  },

This is the data stored for each post in the search index. The text field contains the title, tags separated by commas, and the body of the post, as this field is built with the template we defined before.
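To make the structure of this document explicit, here is a plain-Python approximation of how its values relate to a post. It does not use Haystack or the Django template engine; build_index_document is a hypothetical function that simply reproduces the shape of the output shown above, ignoring details such as template escaping.

```python
def build_index_document(app_label, model_name, pk, title, tags, body, publish):
    """Approximate the document stored in the index for one post."""
    content_type = "{}.{}".format(app_label, model_name)
    return {
        # Identifier made of application, model, and primary key.
        "id": "{}.{}".format(content_type, pk),
        # Same layout as the template: title, comma-separated tags, body.
        "text": "\n".join([title, ", ".join(tags), body]),
        "django_id": str(pk),
        "publish": publish,
        "django_ct": content_type,
    }


doc = build_index_document(
    "blog", "post", 1,
    "Who was Django Reinhardt?",
    ["jazz", "music"],
    "The Django web framework was named after the amazing jazz guitarist "
    "Django Reinhardt.",
    "2015-09-20T12:49:52Z",
)
print(doc["id"])
```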

You have used python manage.py rebuild_index to remove everything in the index and to index the documents again. To update your index without removing all objects, you can use python manage.py update_index. Alternatively, you can use the parameter --age=<num_hours> to update fewer objects. You can set up a cron job for this in order to keep your Solr index updated.
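A scheduled job could call the command through a small wrapper such as the following sketch. The function names are hypothetical, and the code assumes it is run from the project directory, where manage.py lives.

```python
import subprocess
import sys


def update_index_command(age_hours=None, manage_py="manage.py"):
    """Build the argument list for `python manage.py update_index`."""
    command = [sys.executable, manage_py, "update_index"]
    if age_hours is not None:
        # Same option as described above: --age=<num_hours>
        command.append("--age={}".format(int(age_hours)))
    return command


def run_update_index(age_hours=None):
    """Run the command; raises CalledProcessError if it exits with an error."""
    return subprocess.run(update_index_command(age_hours), check=True)
```

For example, run_update_index(24) runs update_index with --age=24.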

Creating a search view

Now, we are going to create a custom view to allow our users to search posts. First, we need a search form. Edit the forms.py file of your blog application and add the following form:

class SearchForm(forms.Form):
    query = forms.CharField()

We will use the query field to let users enter search terms. Edit the views.py file of your blog application and add the following code to it:

from .forms import EmailPostForm, CommentForm, SearchForm
from haystack.query import SearchQuerySet

def post_search(request):
    form = SearchForm()
    # Defaults, so that the context variables exist even when
    # no search has been submitted or the form is invalid
    cd = None
    results = []
    total_results = 0
    if 'query' in request.GET:
        form = SearchForm(request.GET)
        if form.is_valid():
            cd = form.cleaned_data
            # The parentheses let the chained call span two lines
            results = (SearchQuerySet().models(Post)
                       .filter(content=cd['query']).load_all())
            # count total results
            total_results = results.count()
    return render(request,
                  'blog/post/search.html',
                  {'form': form,
                   'cd': cd,
                   'results': results,
                   'total_results': total_results})

In this view, first we instantiate the SearchForm that we created before. We are going to submit the form using the GET method so that the resulting URL includes the query parameter. To see if the form has been submitted, we look for the query parameter in the request.GET dictionary. When the form is submitted, we instantiate it with the submitted GET data and we check that the given data is valid. If the form is valid, we use SearchQuerySet to perform a search for indexed Post objects whose main content contains the given query. The load_all() method loads all related Post objects from the database at once. With this method, we populate the search results with the database objects to avoid per-object access to the database when iterating over results to access object data. Finally, we store the total number of results in a total_results variable and pass the local variables as context to render a template.
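Because the form is submitted with GET, a search is fully described by its URL, which makes results easy to link to or bookmark. The following standard-library sketch shows what such a URL looks like and how the query parameter can be read back. The helper names are hypothetical and the /blog/search/ path is an assumption about where the view is mounted; in the view itself, Django does the parsing for you through request.GET.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(terms, path="/blog/search/"):
    """Build the URL that submitting the form with `terms` would produce."""
    return path + "?" + urlencode({"query": terms})


def submitted_query(url):
    """Return the value of the query parameter, or None if it is absent.

    Note: unlike request.GET, parse_qs drops blank values by default,
    so an empty query parameter is also reported as None here.
    """
    values = parse_qs(urlsplit(url).query).get("query")
    return values[0] if values else None


url = search_url("django reinhardt")
print(url)
print(submitted_query(url))
```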

The search view is ready. We need to create a template to display the form and the results when the user performs a search. Create a new file inside the templates/blog/post/ directory, name it search.html, and add the following code to it:

{% extends "blog/base.html" %}

{% block title %}Search{% endblock %}

{% block content %}
  {% if "query" in request.GET %}
    <h1>Posts containing "{{ cd.query }}"</h1>
    <h3>Found {{ total_results }} result{{ total_results|pluralize }}</h3>
    {% for result in results %}
      {% with post=result.object %}
        <h4><a href="{{ post.get_absolute_url }}">{{ post.title }}</a></h4>
        {{ post.body|truncatewords:5 }}
      {% endwith %}
    {% empty %}
      <p>There are no results for your query.</p>
    {% endfor %}
    <p><a href="{% url "blog:post_search" %}">Search again</a></p>
  {% else %}
    <h1>Search for posts</h1>
    <form action="." method="get">
      {{ form.as_p }}
      <input type="submit" value="Search">
    </form>
  {% endif %}
{% endblock %}

As in the search view, we determine whether the form has been submitted based on the presence of the query parameter. Before the form is submitted, we display the form and a submit button. After the form is submitted, we display the query performed, the total number of results, and the list of results. Each result is a document returned by Solr and encapsulated by Haystack. We need to use result.object to access the actual Post object related to this result.

Finally, edit the urls.py file of your blog application and add the following URL pattern:

  url(r'^search/$', views.post_search, name='post_search'),

Now, open http://127.0.0.1:8000/blog/search/ in your browser. You should see a search form like this:

[Screenshot: the search form]

Now, enter a query and click the Search button. You will see the results of the search query like this:

[Screenshot: the search results page]

Now, you have a powerful search engine built into your project, but from here there are plenty of things you can do with Solr and Haystack. Haystack includes search views, forms, and advanced functionality for search engines. You can read the Haystack documentation at http://django-haystack.readthedocs.org/en/latest/.

The Solr search engine can be adapted to any need by customizing your schema. You can combine analyzers, tokenizers, and token filters that are executed at index or search time to provide a more accurate search for your site's content. You can see all possibilities for this at https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
