Sphinx (http://www.sphinxsearch.com) is a free, open source, full-text search engine, designed from the ground up to integrate well with databases. It has DBMS-like features, is very fast, supports distributed searching, and scales well. It is also designed for efficient memory and disk I/O, which is important because they’re often the limiting factors for large operations.
Sphinx works well with MySQL. It can be used to accelerate a variety of queries, including full-text searches; you can also use it to perform fast grouping and sorting operations, among other applications. It speaks MySQL’s wire protocol and a mostly MySQL-compatible SQL-like dialect, so you can actually query it just like a MySQL database. Additionally, there is a pluggable storage engine that lets a programmer or administrator access Sphinx directly through MySQL. Sphinx is especially useful for certain queries that MySQL’s general-purpose architecture doesn’t optimize very well for large datasets in real-world settings. In short, Sphinx can enhance MySQL’s functionality and performance.
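For instance, because Sphinx speaks MySQL's wire protocol, you can connect with a stock mysql client and query it in its SQL-like dialect, SphinxQL. Here is a brief sketch (the port 9306 listener and the products index name are assumptions that depend on your configuration):

```sql
-- connect with: mysql -h 127.0.0.1 -P 9306
SELECT id, price
FROM products
WHERE MATCH('ipod')
ORDER BY price ASC
LIMIT 10;
```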
The source of data for a Sphinx index is usually the result of a MySQL SELECT query, but you can build an index from an unlimited number of sources of varying types, and each instance of Sphinx can search an unlimited number of indexes. For example, you can pull some of the documents in an index from a MySQL instance running on one remote server, some from a PostgreSQL instance running on another server, and some from the output of a local script through an XML pipe mechanism.
In this appendix, we explore some use cases where Sphinx's capabilities can enhance performance, summarize the steps needed to install and configure it, explain its features in detail, and discuss several real-world implementations.
We start with a simple but complete Sphinx usage example to provide a starting point for further discussion. We use PHP because of its popularity, although APIs are available for a number of other languages, too.
Assume that we’re implementing full-text searching for a comparison-shopping engine. Our requirements are to:
Maintain a searchable full-text index on a product table stored in MySQL
Allow full-text searches over product titles and descriptions
Be able to narrow down searches to a given category if needed
Be able to sort the result not only by relevance, but by item price or submission date
We begin by setting up a data source and an index in the Sphinx configuration file:
source products
{
    type               = mysql
    sql_host           = localhost
    sql_user           = shopping
    sql_pass           = mysecretpassword
    sql_db             = shopping
    sql_query          = SELECT id, title, description, cat_id, price, \
                         UNIX_TIMESTAMP(added_date) AS added_ts \
                         FROM products
    sql_attr_uint      = cat_id
    sql_attr_float     = price
    sql_attr_timestamp = added_ts
}

index products
{
    source  = products
    path    = /usr/local/sphinx/var/data/products
    docinfo = extern
}
This example assumes that the MySQL shopping database contains a products table with the columns we request in our SELECT query to populate our Sphinx index. The Sphinx index is also named products. After creating a new source and index, we run the indexer program to create the initial full-text index data files and then (re)start the searchd daemon to pick up the changes:
$ cd /usr/local/sphinx/bin
$ ./indexer products
$ ./searchd --stop
$ ./searchd
The index is now ready to answer queries. We can test it with Sphinx’s bundled test.php sample script:
$ php -q test.php -i products ipod
Query 'ipod ' retrieved 3 of 3 matches in 0.010 sec.
Query stats:
'ipod' found 3 times in 3 documents
Matches:
1. doc_id=123, weight=100, cat_id=100, price=159.99, added_ts=2008-01-03 22:38:26
2. doc_id=124, weight=100, cat_id=100, price=199.99, added_ts=2008-01-03 22:38:26
3. doc_id=125, weight=100, cat_id=100, price=249.99, added_ts=2008-01-03 22:38:26
The final step is to add searching to our web application. We need to set sorting and filtering options based on user input and format the output nicely. Also, because Sphinx returns only document IDs and configured attributes to the client—it doesn’t store any of the original text data—we need to pull additional row data from MySQL ourselves:
<?php
include ( "sphinxapi.php" );
// ... other includes, MySQL connection code,
// displaying page header and search form, etc. all go here

// set query options based on end-user input
$cl = new SphinxClient ();
$sortby = $_REQUEST["sortby"];
if ( !in_array ( $sortby, array ( "price", "added_ts" ) ) )
    $sortby = "price";
if ( $_REQUEST["sortorder"]=="asc" )
    $cl->SetSortMode ( SPH_SORT_ATTR_ASC, $sortby );
else
    $cl->SetSortMode ( SPH_SORT_ATTR_DESC, $sortby );
$offset = ($_REQUEST["page"]-1)*$rows_per_page;
$cl->SetLimits ( $offset, $rows_per_page );

// issue the query, get the results
$res = $cl->Query ( $_REQUEST["query"], "products" );

// handle search errors
if ( !$res )
{
    print "<b>Search error:</b>" . $cl->GetLastError ();
    die;
}

// fetch additional columns from MySQL
$ids = join ( ",", array_keys ( $res["matches"] ) );
$r = mysql_query ( "SELECT id, title FROM products WHERE id IN ($ids)" )
    or die ( "MySQL error: " . mysql_error() );
while ( $row = mysql_fetch_assoc($r) )
{
    $id = $row["id"];
    $res["matches"][$id]["sql"] = $row;
}

// display the results in the order returned from Sphinx
$n = 1 + $offset;
foreach ( $res["matches"] as $id=>$match )
{
    printf ( "%d. <a href=details.php?id=%d>%s</a>, USD %.2f<br>",
        $n++, $id, $match["sql"]["title"], $match["attrs"]["price"] );
}
?>
Even though the snippet just shown is pretty simple, there are a few things worth highlighting:
The SetLimits() call tells Sphinx to fetch only the number of rows that the client wants to display on a page. It's cheap to impose this limit in Sphinx (unlike in MySQL's built-in search facility), and the number of results that would have been returned without the limit is available in $result['total_found'] at no extra cost.
Because Sphinx only indexes the title column and doesn't store it, we must fetch that data from MySQL.

We retrieve data from MySQL with a single combined query for the whole document batch using the clause WHERE id IN (...), instead of running one query for each match (which would be inefficient).
We inject the rows pulled from MySQL into our full-text search result set, to keep the original sorting order. We explain this more in a moment.
We display the rows using data pulled from both Sphinx and MySQL.
The row injection code, which is PHP-specific, deserves a more detailed explanation. We couldn't simply iterate over the result set from the MySQL query, because the row order can (and in most cases actually will) be different from that specified in the WHERE id IN (...) clause. PHP hashes (associative arrays), however, keep the order in which the matches were inserted into them, so iterating over $result["matches"] will produce rows in the proper sorting order as returned by Sphinx. To keep the matches in the proper order returned from Sphinx (rather than the semirandom order returned from MySQL), therefore, we inject the MySQL query results one by one into the hash that PHP stores from the Sphinx result set of matches.
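The same injection pattern can be shown in language-neutral terms. This is a Python sketch (Python dicts, like PHP hashes, preserve insertion order); the document IDs and titles are made up for illustration:

```python
# Sphinx returns matches keyed by document ID, already in ranked order
sphinx_matches = {123: {"weight": 100}, 125: {"weight": 90}, 124: {"weight": 80}}

# MySQL returns the extra columns in arbitrary ("semirandom") order
mysql_rows = [
    {"id": 124, "title": "ipod nano"},
    {"id": 123, "title": "ipod touch"},
    {"id": 125, "title": "ipod classic"},
]

# inject each MySQL row into the Sphinx-ordered structure
for row in mysql_rows:
    sphinx_matches[row["id"]]["sql"] = row

# iterating still follows Sphinx's ranking, not MySQL's row order
print([m["sql"]["title"] for m in sphinx_matches.values()])
# ['ipod touch', 'ipod classic', 'ipod nano']
```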
There are also a few major implementation and performance differences between MySQL and Sphinx when it comes to counting matches and applying a LIMIT clause. First, LIMIT is cheap in Sphinx. Consider a LIMIT 500,10 clause. MySQL will retrieve 510 semirandom rows (which is slow) and throw away 500, whereas Sphinx will return the IDs that you will use to retrieve the 10 rows you actually need from MySQL. Second, Sphinx will always return the exact number of matches it actually found in the result set, no matter what's in the LIMIT clause. MySQL can't do this efficiently, although MySQL 5.6 will partially improve on this limitation.
Sphinx can complement a MySQL-based application in many ways, bolstering performance where MySQL is not a good solution and adding functionality MySQL can’t provide. Typical usage scenarios include:
Fast, efficient, scalable, relevant full-text searches
Optimizing WHERE conditions on low-selectivity indexes or columns without indexes

Optimizing ORDER BY ... LIMIT N queries and GROUP BY queries
Generating result sets in parallel
Scaling up and scaling out
Aggregating partitioned data
We explore each of these scenarios in the following sections. This list is not exhaustive, though, and Sphinx users find new applications regularly. For example, one of Sphinx’s most important uses—scanning and filtering records quickly—was a user innovation, not one of Sphinx’s original design goals.
MyISAM’s full-text search capability is fast for smaller datasets but performs badly when the data size grows. With millions of records and gigabytes of indexed text, query times can vary from a second to more than 10 minutes, which is unacceptable for a high-performance web application. Although it’s possible to scale MyISAM’s full-text searches by distributing the data in many locations, this requires you to perform searches in parallel and merge the results in your application.
Sphinx works significantly faster than MyISAM’s built-in full-text indexes. For instance, it can search over 1 GB of text within 10 to 100 milliseconds—and that scales well up to 10–100 GB per CPU. Sphinx also has the following advantages:
It can index data stored with InnoDB and other engines, not just MyISAM.
It can create indexes on data combined from many source tables, instead of being limited to columns in a single table.
It can dynamically combine search results from multiple indexes.
In addition to indexing textual columns, its indexes can contain an unlimited number of numeric attributes, which are analogous to “extra columns.” Sphinx attributes can be integers, floating-point numbers, and Unix timestamps.
It can optimize full-text searches with additional conditions on attributes.
Its phrase-based ranking algorithm helps it return more relevant results. For instance, if you search a table of song lyrics for “I love you, dear,” a song that contains that exact phrase will turn up at the top, before songs that just contain “love” or “dear” many times.
It makes scaling out much easier.
Sometimes you'll need to run SELECT queries against very large tables (containing millions of records), with several WHERE conditions on columns that have poor index selectivity (i.e., return too many rows for a given WHERE condition) or could not be indexed at all. Common examples include searching for users in a social network and searching for items on an auction site. Typical search interfaces let the user apply WHERE conditions to 10 or more columns, while requiring the results to be sorted by other columns. See the indexing case study in Chapter 5 for an example of such an application and the required indexing strategies.
With the proper schema and query optimizations, MySQL can work acceptably for such queries, as long as the WHERE clauses don't contain too many columns. But as the number of columns grows, the number of indexes required to support all possible searches grows exponentially. Covering all the possible combinations for just four columns strains MySQL's limits. It becomes very slow and expensive to maintain the indexes, too. This means it's practically impossible to have all the required indexes for many WHERE conditions, and you have to run the queries without indexes.
More importantly, even if you can add indexes, they won't give much benefit unless they're selective. The classic example is a gender column, which isn't much help because it typically selects around half of all rows. MySQL will generally revert to a full table scan when the index isn't selective enough to help it.
Sphinx can perform such queries much faster than MySQL. You can build a Sphinx index with only the required columns from the data. Sphinx then allows two types of access to the data: an indexed search on a keyword or a full scan. In both cases, Sphinx applies filters, which are its equivalent of a WHERE clause. Unlike MySQL, which decides internally whether to use an index or a full scan, Sphinx lets you choose which access method to use.
To use a full scan with filters, specify an empty string as the search query. To use an indexed search, add pseudokeywords to your full-text fields while building the index and then search for those keywords. For example, if you wanted to search for items in category 123, you'd add a "category123" keyword to the document during indexing and then perform a full-text search for "category123." You can either add keywords to one of the existing fields using the CONCAT() function, or create a special full-text field for the pseudokeywords for more flexibility. Normally, you should use filters for nonselective values that cover over 30% of the rows, and fake keywords for selective ones that select 10% or less. If the values are in the 10–30% gray zone, your mileage may vary, and you should use benchmarks to find the best solution.
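As a sketch, the pseudokeyword can be synthesized in the indexing query itself (the column and table names here are assumptions carried over from the earlier shopping example):

source products
{
    # ... connection settings as before ...
    # append a "categoryNNN" pseudokeyword to the indexed text
    sql_query = SELECT id, CONCAT(title, ' category', cat_id) AS title, \
                description, cat_id, price FROM products
}

A full-text search for "category123" then plays the role of an indexed WHERE cat_id = 123 condition.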
Sphinx will perform both indexed searches and scans faster than MySQL. Sometimes Sphinx actually performs a full scan faster than MySQL can perform an index read.
Web applications frequently need the top N results in order. As we discussed previously, this is hard to optimize in MySQL 5.5 and older versions.
The worst case is when the WHERE condition finds many rows (let's say 1 million) and the ORDER BY columns aren't indexed. MySQL uses the index to identify all the matching rows, reads the records one by one into the sort buffer with semirandom disk reads, sorts them all with a filesort, and then discards most of them. It will temporarily store and process the entire result, ignoring the LIMIT clause and churning RAM. And if the result set doesn't fit in the sort buffer, it will need to go to disk, causing even more disk I/O.
This is an extreme case, and you might think it happens rarely in the real world, but in fact the problems it illustrates happen often. MySQL’s limitations on indexes for sorting—using only the leftmost part of the index, not supporting loose index scans, and allowing only a single range condition—mean many real-world queries can’t benefit from indexes. And even when they can, using semirandom disk I/O to retrieve rows is a performance killer.
Paginated result sets, which usually require queries of the form SELECT ... LIMIT N, M, are another performance problem in MySQL. They read N + M rows from disk, causing a large amount of random I/O and wasting memory resources. Sphinx can accelerate such queries significantly by eliminating the two biggest problems:
Sphinx's RAM usage is always strictly limited, and the limit is configurable. Sphinx supports a result set offset and size similar to the MySQL LIMIT N, M syntax, but it also has a max_matches option. This controls the equivalent of the "sort buffer" size, on both a per-server and a per-query basis. Sphinx's RAM footprint is guaranteed to be within the specified limits.
If attributes are stored in RAM, Sphinx does not do any I/O at all. And even if attributes are stored on disk, Sphinx will perform sequential I/O to read them, which is much faster than MySQL’s semirandom retrieval of rows from disks.
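As a sketch, the server-wide cap looks like this in sphinx.conf (the value shown is an assumed example; the client API's SetLimits() also accepts an optional per-query maximum as its third argument):

searchd
{
    # upper bound on matches kept in the per-query "sort buffer"
    max_matches = 1000
}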
You can sort search results by a combination of relevance (weight), attribute values, and (when using GROUP BY) aggregate function values. The sorting clause syntax is similar to a SQL ORDER BY clause:
<?php
$cl = new SphinxClient ();
$cl->SetSortMode ( SPH_SORT_EXTENDED, 'price DESC, @weight ASC' );
// more code and Query() call here...
?>
In this example, price is a user-specified attribute stored in the index, and @weight is a special attribute, created at runtime, that contains each result's computed relevance. You can also sort by an arithmetic expression involving attribute values, common math operators, and functions:
<?php
$cl = new SphinxClient ();
$cl->SetSortMode ( SPH_SORT_EXPR, '@weight + log(pageviews)*1.5' );
// more code and Query() call here...
?>
Support for everyday SQL-like clauses would be incomplete without GROUP BY functionality, so Sphinx has that, too. But unlike MySQL's general-purpose implementation, Sphinx specializes in solving a practical subset of GROUP BY tasks efficiently. This subset covers the generation of reports from big (1–100 million row) datasets when one of the following cases holds:
The result is only a “small” number of grouped rows (where “small” is on the order of 100,000 to 1 million rows).
Very fast execution speed is required and approximate COUNT(*) results are acceptable, when many groups are retrieved from data distributed over a cluster of machines.
This is not as restrictive as it might sound. The first scenario covers practically all imaginable time-based reports. For example, a detailed per-hour report for a period of 10 years will return fewer than 90,000 records. The second scenario could be expressed in plain English as something like “as quickly and accurately as possible, find the 20 most important records in a 100-million-row sharded table.”
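The per-hour figure is easy to verify with quick arithmetic (ignoring leap days):

```python
# rows in a detailed per-hour report covering 10 years
hours_per_year = 24 * 365   # 8,760
rows = 10 * hours_per_year
print(rows)                 # 87600, comfortably under 90,000
```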
These two types of queries can accelerate general-purpose queries, but you can also use them for full-text search applications. Many applications need to display not only full-text matches, but some aggregate results as well. For example, many search result pages show how many matches were found in each product category, or display a graph of matching document counts over time. Another common requirement is to group the results and show the most relevant match from each category. Sphinx’s group-by support lets you combine grouping and full-text searching, eliminating the overhead of doing the grouping in your application or in MySQL.
As with sorting, grouping in Sphinx uses fixed memory. It is slightly (10% to 50%) more efficient than similar MySQL queries on datasets that fit in RAM. In this case, most of Sphinx’s power comes from its ability to distribute the load and greatly reduce the latency. For huge datasets that could never fit in RAM, you can build a special disk-based index for reporting, using inline attributes (defined later). Queries against such indexes execute about as fast as the disk can read the data—about 30–100 MB/sec on modern hardware. In this case, the performance can be many times better than MySQL’s, though the results will be approximate.
The most important difference from MySQL's GROUP BY is that Sphinx may, under certain circumstances, yield approximate results. There are two reasons for this:
Grouping uses a fixed amount of memory. If there are too many groups to hold in RAM and the matches are in a certain “unfortunate” order, per-group counts might be smaller than the actual values.
A distributed search sends only the aggregate results, not the matches themselves, from node to node. If there are duplicate records in different nodes, per-group distinct counts might be greater than the actual values, because the information that can remove the duplicates is not transmitted between nodes.
In practice, it is often acceptable to have fast approximate group-by counts. If this isn’t acceptable, it’s often possible to get exact results by configuring the daemon and client application carefully.
You can generate the equivalent of COUNT(DISTINCT <attribute>), too. For example, you can use this to compute the number of distinct sellers per category in an auction site.
Finally, Sphinx lets you choose criteria to select the single “best” document within each group. For example, you can select the most relevant document from each domain, while grouping by domain and sorting the result set by per-domain match counts. This is not possible in MySQL without a complex query.
Sphinx lets you generate several results from the same data simultaneously, again using a fixed amount of memory. Compared to the traditional SQL approach of either running two queries (and hoping that some data stays in the cache between runs) or creating a temporary table for each search result set, this yields a noticeable improvement.
For example, assume you need per-day, per-week, and per-month reports over a period of time. To generate these with MySQL you'd have to run three queries with different GROUP BY clauses, processing the source data three times. Sphinx, however, can process the underlying data once and accumulate all three reports in parallel.
Sphinx does this with a multi-query mechanism. Instead of issuing queries one by one, you batch several queries and submit them in one request:
<?php $cl = new SphinxClient (); $cl->SetSortMode ( SPH_SORT_EXTENDED, "price desc" ); $cl->AddQuery ( "ipod" ); $cl->SetGroupBy ( "category_id", SPH_GROUPBY_ATTR, "@count desc" ); $cl->AddQuery ( "ipod" ); $cl->RunQueries (); ?>
Sphinx will analyze the request, identify query parts it can combine, and parallelize the queries where possible.
For example, Sphinx might notice that only the sorting and grouping modes differ, and that the queries are otherwise the same. This is the case in the sample code just shown, where the sorting is by price but the grouping is by category_id. Sphinx will create several sorting queues to process these queries. When it runs the queries, it will retrieve the rows once and submit them to all queues. Compared to running the queries one by one, this eliminates several redundant full-text search or full scan operations.
Note that generating parallel result sets, although it’s a common and important optimization, is only a particular case of the more generalized multi-query mechanism. It is not the only possible optimization. The rule of thumb is to combine queries in one request where possible, which generally allows Sphinx to apply internal optimizations. Even if Sphinx can’t parallelize the queries, it still saves network round-trips. And if Sphinx adds more optimizations in the future, your queries will use them automatically with no further changes.
Sphinx scales well both horizontally (scaling out) and vertically (scaling up).
Sphinx is fully distributable across many machines. All the use cases we’ve mentioned can benefit from distributing the work across several CPUs.
The Sphinx search daemon (searchd) supports special distributed indexes, which know which local and remote indexes should be queried and aggregated. This means scaling out is a trivial configuration change. You simply partition the data across the nodes, configure the master node to issue several remote queries in parallel with local ones, and that’s it.
You can also scale up, as in using more cores or CPUs on a single machine to improve latency. To accomplish this, you can just run several instances of searchd on a single machine and query them all from another machine via a distributed index. Alternatively, you can configure a single instance to communicate with itself so that the parallel “remote” queries actually run on a single machine, but on different CPUs or cores.
In other words, with Sphinx a single query can be made to use more than one CPU (multiple concurrent queries will use multiple CPUs automatically). This is a major difference from MySQL, where one query always gets one CPU, no matter how many are available. Also, Sphinx does not need any synchronization between concurrently running queries. That lets it avoid mutexes (a synchronization mechanism), which are a notorious MySQL performance bottleneck on multi-CPU systems.
Another important aspect of scaling up is scaling disk I/O. Different indexes (including parts of a larger distributed index) can easily be put on different physical disks or RAID volumes to improve latency and throughput. This approach has some of the same benefits as MySQL 5.1’s partitioned tables, which can also partition data into multiple locations. However, distributed indexes have some advantages over partitioned tables. Sphinx uses distributed indexes both to distribute the load and to process all parts of a query in parallel. In contrast, MySQL’s partitioning can optimize some queries (but not all) by pruning partitions, but the query processing will not be parallelized. And even though both Sphinx and MySQL partitioning will improve query throughput, if your queries are I/O-bound, you can expect linear latency improvement from Sphinx on all queries, whereas MySQL’s partitioning will improve latency only on those queries where the optimizer can prune entire partitions.
The distributed searching workflow is straightforward:
Issue remote queries on all remote servers.
Perform sequential local index searches.
Read the partial search results from the remote servers.
Merge all the partial results into the final result set, and return it to the client.
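The merge step above can be sketched in Python (a simplified model, not Sphinx's actual implementation): each node returns its matches already sorted, so the master only needs a k-way merge to produce the global top-N.

```python
import heapq

def merge_partial_results(partial_results, limit):
    """Merge per-node result lists (each sorted by descending weight)
    into the global top-`limit` matches."""
    # heapq.merge does a k-way merge of already-sorted inputs
    merged = heapq.merge(*partial_results, key=lambda m: -m["weight"])
    return [m for _, m in zip(range(limit), merged)]

# two "remote servers" returning sorted partial results
node_a = [{"id": 1, "weight": 90}, {"id": 3, "weight": 40}]
node_b = [{"id": 2, "weight": 70}, {"id": 4, "weight": 10}]
top = merge_partial_results([node_a, node_b], limit=3)
print([m["id"] for m in top])  # [1, 2, 3]
```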
If your hardware resources permit it, you can search through several indexes on the same machine in parallel, too. If there are several physical disk drives and several CPU cores, the concurrent searches can run without interfering with each other. You can pretend that some of the indexes are remote and configure searchd to contact itself to launch a parallel query on the same machine:
index distributed_sample { type = distributed local = chunk1 # resides on HDD1 agent = localhost:3312:chunk2 # resides on HDD2, searchd contacts itself }
From the client’s point of view, distributed indexes are absolutely no different from local indexes. This lets you create “trees” of distributed indexes by using nodes as proxies for sets of other nodes. For example, the first-level node could proxy the queries to a number of the second-level nodes, which could in turn either search locally themselves or pass the queries to other nodes, to an arbitrary depth.
Building a scalable system often involves sharding (partitioning) the data across different physical MySQL servers. We discussed this in depth in Chapter 11.
When the data is sharded at a fine level of granularity, simply fetching a few rows with a selective WHERE (which should be fast) means contacting many servers, checking for errors, and merging the results together in the application. Sphinx alleviates this problem, because all the necessary functionality is already implemented inside the search daemon.
Consider an example where a 1 TB table with a billion blog posts is sharded by user ID over 10 physical MySQL servers, so a given user’s posts always go to the same server. As long as queries are restricted to a single user, everything is fine: we choose the server based on user ID and work with it as usual.
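Routing by user ID can be as simple as a modulus over the shard count. This is a deliberately naive sketch (real deployments often use a directory service or consistent hashing so shards can be rebalanced):

```python
SHARD_COUNT = 10  # the ten physical MySQL servers from the example

def shard_for_user(user_id):
    """Map a user ID to the MySQL server holding that user's posts."""
    return user_id % SHARD_COUNT

print(shard_for_user(1234567))  # 7
```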
Now assume that we need to implement an archive page that shows the user's friends' posts. How are we going to display, say, the fiftieth page of that archive, with entries 981 to 1000, sorted by post date? Most likely, the various friends' data will be on different servers. With only 10 friends, there's about a 90% chance that more than 8 servers will be used, and that probability increases to 99% if there are 20 friends. So, for most queries, we will need to contact all the servers. Worse, we'll need to pull 1,000 posts from each server and sort them all in the application. Following the suggestions we've made previously in this book, we'd trim down the required data to the post ID and timestamp only, but that's still 10,000 records to sort in the application. Most modern scripting languages consume a lot of CPU time for that sorting step alone. In addition, we'll either have to fetch the records from each server sequentially (which will be slow) or write some code to juggle the parallel querying threads (which will be difficult to implement and maintain).
In such situations, it makes sense to use Sphinx instead of reinventing the wheel. All we’ll have to do in this case is set up several Sphinx instances, mirror the frequently accessed post attributes from each table—in this example, the post ID, user ID, and timestamp—and query the master Sphinx instance for entries 981 to 1000, sorted by post date, in approximately three lines of code. This is a much smarter way to scale.
Sphinx is a standalone set of programs. The two main programs are:
indexer: a program that fetches documents from specified sources (e.g., from MySQL query results) and creates a full-text index over them. This is a background batch job, which sites usually run regularly.

searchd: a daemon that serves search queries from the indexes indexer builds. This provides the runtime support for applications.
The Sphinx distribution also includes native searchd client APIs in a number of programming languages (PHP, Python, Perl, Ruby, and Java, at the time of this writing), and SphinxSE, which is a client implemented as a pluggable storage engine for MySQL 5.0 and newer. The APIs and SphinxSE allow a client application to connect to searchd, pass it the search query, and fetch back the search results.
Each Sphinx full-text index can be compared to a table in a database; in place of rows in a table, the Sphinx index consists of documents. (Sphinx also has a separate data structure called a multivalued attribute, discussed later.) Each document has a unique 32-bit or 64-bit integer identifier that should be drawn from the database table being indexed (for instance, from a primary key column). In addition, each document has one or more full-text fields (each corresponding to a text column from the database) and numerical attributes. Like a database table, the Sphinx index has the same fields and attributes for all of its documents. Table F-1 shows the analogy between a database table and a Sphinx index.
Sphinx does not store the text fields from the database but just uses their contents to build a search index.
Sphinx installation is straightforward and typically includes the following steps:
After that, the search functionality is immediately available for client programs:
<?php
include ( 'sphinxapi.php' );
$cl = new SphinxClient ();
$res = $cl->Query ( 'test query', 'myindex' );
// use $res search result here
?>
The only thing left to do is run indexer regularly to update the full-text index data. Indexes that searchd is currently serving will stay fully functional during reindexing: indexer will detect that they are in use, create a “shadow” index copy instead, and notify searchd to pick up that copy on completion.
Full-text indexes are stored in the filesystem (at the location specified in the configuration file) and are in a special “monolithic” format, which is not well suited for incremental updates. The normal way to update the index data is to rebuild it from scratch. This is not as big a problem as it might seem, though, for the following reasons:
Indexing is fast. Sphinx can index plain text (without HTML markup) at a rate of 4–8 MB/sec on modern hardware.
You can partition the data in several indexes, as shown in the next section, and reindex only the updated part from scratch on each run of indexer.
There is no need to “defragment” the indexes—they are built for optimal I/O, which improves search speed.
Numeric attributes can be updated without a complete rebuild.
A future version will offer an additional index backend, which will support real-time index updates.
Let’s discuss partitioning in a bit more detail. The simplest partitioning scheme is the main + delta approach, in which two indexes are created to index one document collection. main indexes the whole document set, while delta indexes only documents that have changed since the last time the main index was built.
This scheme matches many data modification patterns perfectly. Forums, blogs, email and news archives, and vertical search engines are all good examples. Most of the data in those repositories never changes once it is entered, and only a tiny fraction of documents are changed or added on a regular basis. This means the delta index is small and can be rebuilt as frequently as required (e.g., once every 1–15 minutes). This is equivalent to indexing just the newly inserted rows.
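A minimal sketch of a main + delta configuration follows. The sph_counter bookkeeping table and the added_ts column are assumptions for illustration; real setups typically maintain the boundary timestamp with sql_query_pre:

source main
{
    # ... connection settings ...
    sql_query = SELECT id, title, body FROM posts \
                WHERE added_ts <= (SELECT last_build FROM sph_counter)
}

source delta : main
{
    sql_query = SELECT id, title, body FROM posts \
                WHERE added_ts > (SELECT last_build FROM sph_counter)
}

index main
{
    source = main
    path   = /usr/local/sphinx/var/data/main
}

index delta : main
{
    source = delta
    path   = /usr/local/sphinx/var/data/delta
}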
You don’t need to rebuild the indexes to change attributes associated with documents—you can do this online via searchd. You can mark rows as deleted by simply setting a “deleted” attribute in the main index. Thus, you can handle updates by marking this attribute on documents in the main index, then rebuilding the delta index. Searching for all documents that are not marked as “deleted” will return the correct results.
Note that the indexed data can come from the results of any SELECT statement; it doesn’t have to come from just a single SQL table. There are no restrictions on the SELECT statements, which means you can preprocess the results in the database before they’re indexed. Common preprocessing examples include joins with other tables, creating additional fields on the fly, excluding some fields from indexing, and manipulating values.
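For example, a single indexing query can join tables, compute a field on the fly, and leave unwanted columns out entirely. The following sphinx.conf fragment is a sketch with a hypothetical schema:

```ini
source posts_src
{
    # join, derive an "author" field, and exclude drafts,
    # all inside the indexing query itself (hypothetical schema)
    sql_query = \
        SELECT p.id, p.title, p.body, \
               CONCAT(u.first_name, ' ', u.last_name) AS author, \
               UNIX_TIMESTAMP(p.created_at) AS created_ts \
        FROM posts p \
        JOIN users u ON u.id = p.author_id \
        WHERE p.is_draft = 0

    sql_attr_timestamp = created_ts
}
```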
Besides “just” indexing and searching through database content, Sphinx offers several other special features. Here’s a partial list of the most important ones:
The searching and ranking algorithms take word positions and the query phrase’s proximity to the document content into account.
You can bind numeric attributes to documents, including multivalued attributes.
You can sort, filter, and group by attribute values.
You can create document snippets with search query keyword highlighting.
You can distribute searching across several machines.
You can optimize queries that generate several result sets from the same data.
You can access the search results from within MySQL using SphinxSE.
You can fine-tune the load Sphinx imposes on the server.
We covered some of these features earlier. This section covers a few of the remaining features.
Sphinx remembers word positions within each document, as do other open source full-text search systems. But unlike most other ones, it uses the positions to rank matches and return more relevant results.
A number of factors might contribute to a document’s final rank. To compute the rank, most other systems use only keyword frequency: the number of times each keyword occurs. The classic BM25 weighting function[232] that virtually all full-text search systems use is built around giving more weight to words that either occur frequently in the particular document being searched or occur rarely in the whole collection. The BM25 result is usually returned as the final rank value.
In contrast, Sphinx also computes query phrase proximity, which is simply the length of the longest verbatim query subphrase contained in the document, counted in words. For instance, the phrase “John Doe Jr” queried against a document with the text “John Black, John White Jr, and Jane Dunne” will produce a phrase proximity of 1, because no two words in the query appear together in the query order. The same query against “Mr. John Doe Jr and friends” will yield a proximity of 3, because three query words occur in the document in the query order. The document “John Gray, Jane Doe Jr” will produce a proximity of 2, thanks to its “Doe Jr” query subphrase.
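The proximity metric just described can be sketched in a few lines of Python. This is a toy model for illustration, not Sphinx’s actual implementation, which works on its inverted index rather than raw text:

```python
import re

def tokenize(text):
    # crude word splitter: lowercase alphanumeric runs
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_proximity(query, document):
    """Length (in words) of the longest contiguous query subphrase
    that appears verbatim, in query order, in the document."""
    q = tokenize(query)
    d = tokenize(document)
    best = 0
    for i in range(len(q)):
        for j in range(i + 1, len(q) + 1):
            sub = q[i:j]
            n = len(sub)
            if n <= best:
                continue
            if any(d[k:k + n] == sub for k in range(len(d) - n + 1)):
                best = n
    return best
```

Running it on the three example documents above reproduces the proximities 1, 3, and 2.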
By default, Sphinx ranks matches using phrase proximity first and the classic BM25 weight second. This means that verbatim query quotes are guaranteed to be at the very top, quotes that are off by a single word will be right below those, and so on.
When and how does phrase proximity affect results? Consider searching 1,000,000 pages of text for the phrase “To be or not to be.” Sphinx will put the pages with verbatim quotes at the very top of the search results, whereas BM25-based systems will first return the pages with the most mentions of “to,” “be,” “or,” and “not”—pages with an exact quote match but only a few instances of “to” will be buried deep in the results.
Most major web search engines today rank results with keyword positions as well. Searching for a phrase on Google will likely result in pages with perfect or near-perfect phrase matches appearing at the very top of the search results, followed by the “bag of words” documents.
However, analyzing keyword positions requires additional CPU time, and sometimes you might need to skip it for performance reasons. There are also cases when phrase ranking produces undesired, unexpected results. For example, searching for tags in a cloud is better without keyword positions: it makes no difference whether the tags from the query are next to each other in the document.
To allow for flexibility, Sphinx offers a choice of ranking modes. Besides the default mode of proximity plus BM25, you can choose from a number of others that include BM25-only weighting, fully disabled weighting (which provides a nice optimization if you’re not sorting by rank), and more.
Each document might contain an unlimited number of numeric attributes. Attributes are user-specified and can contain any additional information required for a specific task. Examples include a blog post’s author ID, an inventory item’s price, a category ID, and so on.
Attributes enable efficient full-text searches with additional filtering, sorting, and grouping of the search results. In theory, they could be stored in MySQL and pulled from there every time a search is performed. But in practice, if a full-text search locates even hundreds or thousands of rows (which is not many), retrieving them from MySQL is unacceptably slow.
Sphinx supports two ways to store attributes: inline in the document lists or externally in a separate file. Inlining requires all attribute values to be stored in the index many times, once for each time a document ID is stored. This inflates the index size and increases I/O, but reduces use of RAM. Storing the attributes externally requires preloading them into RAM upon searchd startup.
Attributes normally fit in RAM, so the usual practice is to store them externally. This makes filtering, sorting, and grouping very fast, because accessing data is a matter of quick in-memory lookup. Also, only the externally stored attributes can be updated at runtime. Inline storage should be used only when there is not enough free RAM to hold the attribute data.
Sphinx also supports multivalued attributes (MVAs). MVA content consists of an arbitrarily long list of integer values associated with each document. Examples of good uses for MVAs are lists of tag IDs, product categories, and access control lists.
Having access to attribute values in the full-text engine allows Sphinx to filter and reject candidate matches as early as possible while searching. Technically, the filter check occurs after verification that the document contains all the required keywords, but before certain computationally intensive calculations (such as ranking) are done. Because of these optimizations, using Sphinx to combine full-text searching with filtering and sorting can be 10 to 100 times faster than using Sphinx for searching and then filtering results in MySQL.
Sphinx supports two types of filters, which are analogous to simple WHERE conditions in SQL:
An attribute value matches a specified range of values (analogous to a BETWEEN clause, or numeric comparisons).
An attribute value matches a specified set of values (analogous to an IN() list).
If the filters will have a fixed number of values (“set” filters instead of “range” filters), and if such values are selective, it makes sense to replace the integer values with “fake keywords” and index them as full-text content instead of attributes. This applies to both normal numeric attributes and MVAs. We’ll see some examples of how to do this later.
Sphinx can also use filters to optimize full scans. Sphinx remembers minimum and maximum attribute values for short continuous row blocks (128 rows, by default) and can quickly throw away whole blocks based on filtering conditions. Rows are stored in the order of ascending document IDs, so this optimization works best for columns that are correlated with the ID. For instance, if you have a row-insertion timestamp that grows along with the ID, a full scan with filtering on that timestamp will be very fast.
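The block-skipping idea can be sketched in Python. This is a toy model of the technique, not Sphinx’s actual storage format: each block of rows carries a (min, max) summary, and a filtered scan rejects whole blocks without touching their rows when the summary cannot intersect the filter range:

```python
BLOCK = 128  # rows per block, mirroring Sphinx's default

def build_block_index(values):
    # per-block (min, max) summaries, in document-ID order
    return [(min(values[i:i + BLOCK]), max(values[i:i + BLOCK]))
            for i in range(0, len(values), BLOCK)]

def scan(values, blocks, lo, hi):
    """Return row positions whose value falls in [lo, hi],
    skipping whole blocks via their (min, max) summaries."""
    hits = []
    for b, (bmin, bmax) in enumerate(blocks):
        if bmax < lo or bmin > hi:
            continue  # whole block rejected without reading its rows
        start = b * BLOCK
        for i in range(start, min(start + BLOCK, len(values))):
            if lo <= values[i] <= hi:
                hits.append(i)
    return hits
```

When the filtered column is correlated with the document ID, as with an insertion timestamp, most blocks are rejected at the summary check and the scan touches only a small fraction of the rows.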
Full-text search results received from Sphinx almost always require additional work involving MySQL—at the very least, to pull out the text column values that the Sphinx index does not store. As a result, you’ll frequently need to JOIN search results from Sphinx with other MySQL tables.
Although you can achieve this by sending the result’s document IDs to MySQL in a query, that strategy leads to neither the cleanest nor the fastest code. For high-volume situations, you should consider using SphinxSE, a pluggable storage engine that you can compile into MySQL 5.0 or newer, or load into MySQL 5.1 or newer as a plugin.
SphinxSE lets programmers query searchd and access search results from within MySQL. The usage is as simple as creating a special table with an ENGINE=SPHINX clause (and an optional CONNECTION clause to locate the Sphinx server if it’s at a nondefault location), and then running queries against that table:

mysql> CREATE TABLE search_table (
    ->   id INTEGER NOT NULL,
    ->   weight INTEGER NOT NULL,
    ->   query VARCHAR(3072) NOT NULL,
    ->   group_id INTEGER,
    ->   INDEX(query)
    -> ) ENGINE=SPHINX CONNECTION="sphinx://localhost:3312/test";
Query OK, 0 rows affected (0.12 sec)

mysql> SELECT * FROM search_table WHERE query='test;mode=all'\G
*************************** 1. row ***************************
      id: 123
  weight: 1
   query: test;mode=all
group_id: 45
1 row in set (0.00 sec)
Each SELECT passes a Sphinx query as the query column in the WHERE clause. The Sphinx searchd server returns the results. The SphinxSE storage engine then translates these into MySQL results and returns them to the SELECT statement. Queries might include JOINs with any other tables stored using any other storage engines.
The SphinxSE engine supports most searching options available via the API, too. You can specify options such as filtering and limits by plugging additional clauses into the query string:
mysql> SELECT * FROM search_table WHERE query='test;mode=all;
    -> filter=group_id,5,7,11;maxmatches=3000';
Per-query and per-word statistics that are returned by the API are also accessible through SHOW ENGINE SPHINX STATUS:
mysql> SHOW ENGINE SPHINX STATUS\G
*************************** 1. row ***************************
  Type: SPHINX
  Name: stats
Status: total: 3, total found: 3, time: 8, words: 1
*************************** 2. row ***************************
  Type: SPHINX
  Name: words
Status: test:3:5
2 rows in set (0.00 sec)
Even when you’re using SphinxSE, the rule of thumb still is to allow searchd to perform sorting, filtering, and grouping—i.e., to add all the required clauses to the query string rather than use WHERE, ORDER BY, or GROUP BY. This is especially important for WHERE conditions. The reason is that SphinxSE is only a client to searchd, not a full-blown built-in search library. Thus, you need to pass everything that you can to the Sphinx engine to get the best performance.
Both indexing and searching operations could impose a significant additional load on either the search server or the database server. Fortunately, a number of settings let you limit the load coming from Sphinx.
An undesired database-side load can be caused by indexer queries that either stall MySQL completely with their locks or just occur too quickly and hog resources from other concurrent queries.
The first case is a notorious problem with MyISAM, where long-running reads lock the tables and stall other pending reads and writes—you can’t simply do SELECT * FROM big_table on a production server, because you risk disrupting all other operations. To work around that, Sphinx offers ranged queries. Instead of configuring a single huge query, you can specify one query that quickly computes the indexable row ranges and another query that pulls out the data step by step, in small chunks:

sql_query_range = SELECT MIN(id), MAX(id) FROM documents
sql_range_step  = 1000
sql_query       = SELECT id, title, body FROM documents WHERE id>=$start AND id<=$end
This feature is extremely helpful for indexing MyISAM tables, but it should also be considered when using InnoDB tables. Although InnoDB won’t just lock the table and stall other queries when running a big SELECT *, it will still use significant machine resources because of its MVCC architecture. Multiversioning for a thousand transactions that cover a thousand rows each can be less expensive than a single long-running million-row transaction.
The second cause of excessive load happens when indexer is able to process the data more quickly than MySQL provides it. You should also use ranged queries in this case. The sql_ranged_throttle option forces indexer to sleep for a given time period (in milliseconds) between subsequent ranged query steps, increasing indexing time but easing the load on MySQL.
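For instance, to pause one second between chunks (the value is illustrative; the directive goes in the data source definition alongside the ranged query):

```ini
# sleep 1,000 milliseconds between consecutive ranged query steps
sql_ranged_throttle = 1000
```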
Interestingly enough, there’s a special case where you can configure Sphinx to achieve exactly the opposite effect: that is, you can improve indexing time by placing more load on MySQL. When the connection between the indexer box and the database box is 100 Mbps, and the rows compress well (which is typical for text data), the MySQL compression protocol can improve overall indexing time. The improvement comes at a cost of more CPU time spent on both the MySQL and indexer sides to compress and uncompress the rows transmitted over the network, respectively, but the overall indexing time could be up to 20–30% less because of greatly reduced network traffic.
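If we recall the Sphinx documentation correctly, compression is enabled on the indexer side by passing the CLIENT_COMPRESS capability flag (value 32) to the MySQL connection; treat the exact directive as an assumption to verify against your Sphinx version:

```ini
source src1
{
    # 32 = CLIENT_COMPRESS: trade CPU on both ends for less network traffic
    mysql_connect_flags = 32
}
```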
Search clusters can suffer from occasional overload, too, so Sphinx provides a few ways to help avoid searchd going off on a spin.
First, a max_children option simply limits the total number of concurrently running queries and tells clients to retry when that limit is reached.
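In sphinx.conf this is a searchd-section directive; the limit value below is illustrative:

```ini
searchd
{
    # reject (and tell clients to retry) once 30 queries run concurrently
    max_children = 30
}
```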
Then there are query-level limits. You can specify that query processing stop either at a given threshold of matches found or a given threshold of elapsed time, using the SetLimits() and SetMaxQueryTime() API calls, respectively. This is done on a per-query basis, so you can ensure that more important queries always complete fully.
Finally, periodic indexer runs can cause bursts of additional I/O that will in turn cause intermittent searchd slowdowns. To prevent that, there are options that limit indexer disk I/O. max_iops enforces a minimal delay between I/O operations, ensuring that no more than max_iops disk operations per second are performed. But even a single operation could be too much; consider a 100 MB read() call as an example. The max_iosize option takes care of that, guaranteeing that the length of every disk read or write will be under a given boundary. Larger operations are automatically split into smaller ones, and these smaller ones are then controlled by the max_iops setting.
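Both directives live in the indexer section of sphinx.conf; the values here are illustrative:

```ini
indexer
{
    # no more than 40 disk operations per second
    max_iops   = 40
    # split any single read or write larger than 1 MB (in bytes)
    max_iosize = 1048576
}
```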
Each of the features we’ve described can be found successfully deployed in production. The following sections review several of these real-world Sphinx deployments, briefly describing the sites and some implementation details.
A popular torrent search engine, Mininova (http://www.mininova.org) provides a clear example of how to optimize “just” full-text searching. Sphinx replaced several MySQL replicas using MySQL built-in full-text indexes, which were unable to handle the load. After the replacement, the search servers were underloaded; the current load average is now in the 0.3–0.4 range.
Here are the database size and load numbers:
The site has a small database, with about 300,000–500,000 records and about 300–500 MB of index.
The site load is quite high: about 8–10 million searches per day at the time of this writing.
The data mostly consists of user-supplied filenames, frequently without proper punctuation. For this reason, prefix indexing is used instead of whole-word indexing. The resulting index is several times larger than it would otherwise be, but it is still small enough that it can be built quickly and its data can be cached effectively.
Search results for the 1,000 most frequent queries are cached on the application side. About 20–30% of all queries are served from the cache. Because of the “long tail” query distribution, a larger cache would not help much more.
For high availability, the site uses two servers with complete full-text index replicas. The indexes are rebuilt from scratch every few minutes. Indexing takes less than one minute, so there’s no point in implementing more complex schemes.
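A periodic rebuild of this kind can be as simple as a cron entry that reindexes everything and tells searchd to pick up the new files. The schedule and paths below are hypothetical:

```
# rebuild all indexes every 3 minutes; --rotate makes searchd
# swap in the new index files without a restart
*/3 * * * * /usr/local/bin/indexer --all --rotate --quiet
```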
The following are lessons learned from this example:
Caching search results in the application helps a lot.
There might be no need to have a huge cache, even for busy applications. A mere 1,000 to 10,000 entries can be enough.
For databases on the order of 1 GB in size, simple periodic reindexing instead of more complicated schemes is OK, even for busy sites.
Mininova is an extreme high-load project case—there’s not that much data, but there are a lot of queries against that data. BoardReader (http://www.boardreader.com) is just the opposite: a forum search engine that performs many fewer searches on a much larger dataset. Sphinx replaced a commercial full-text search engine that took up to 10 seconds per query to search through a 1 GB collection. Sphinx allowed BoardReader to scale greatly, both in terms of data size and query throughput.
Here’s some general information:
There are more than 1 billion documents and 1.5 TB of text in the database.
There are about 500,000 page views and between 700,000 and 1 million searches per day.
At the time of this writing, the search cluster consists of six servers, each with four logical CPUs (two dual-core Xeons), 16 GB of RAM, and 0.5 TB of disk space. The database itself is stored on a separate cluster. The search cluster is used only for indexing and searching.
Each of the six servers runs four searchd instances, so all four cores are used. One of the four instances aggregates the results from the other three. That makes a total of 24 searchd instances. The data is distributed evenly across all of them. Every searchd copy carries several indexes over approximately 1/24 of the total data (about 60 GB).
The search results from the six “first-tier” searchd nodes are in turn aggregated by another searchd instance running on the frontend web server. This instance carries several purely distributed indexes, which reference the six search cluster servers but have no local data at all.
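A purely distributed index of this kind contains no local data, only references to remote agents. The following sphinx.conf sketch uses hypothetical host and index names; the agent syntax is host:port:index-list, per the Sphinx documentation:

```ini
index dist_all
{
    type  = distributed
    agent = search1.example.com:3312:chunk01,chunk02,chunk03,chunk04
    agent = search2.example.com:3312:chunk05,chunk06,chunk07,chunk08
    # ... one agent line per search server ...

    # timeouts in milliseconds (values illustrative)
    agent_connect_timeout = 1000
    agent_query_timeout   = 3000
}
```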
Why have four searchd instances per node? Why not have only one searchd instance per server, configure it to carry four index chunks, and make it contact itself as though it’s a remote server to utilize multiple CPUs, as we suggested earlier? Having four instances instead of just one has its benefits. First, it reduces startup time. There are several gigabytes of attribute data that need to be preloaded in RAM; starting several daemons at a time lets us parallelize that. Second, it improves availability. In the event of searchd failures or updates, only 1/24 of the whole index is inaccessible, instead of 1/6.
Within each of the 24 instances on the search cluster, we used time-based partitioning to reduce the load even further. Many queries need to be run only on the most recent data, so the data is divided into three disjoint index sets: data from the last week, from the last three months, and from all time. These indexes are distributed over several different physical disks on a per-instance basis. This way, each instance has its own CPU and physical disk drive and won’t interfere with the others.
Local cron jobs update the indexes periodically. They pull the data from MySQL over the network but create the index files locally.
Using several explicitly separated “raw” disks proved to be faster than a single RAID volume. Raw disks give control over which files go on which physical disk. That is not the case with RAID, where the controller decides which block goes on which physical disk. Raw disks also guarantee fully parallel I/O on different index chunks, but concurrent searches on RAID are subject to I/O stepping. We chose RAID 0, which has no redundancy, because we don’t care about disk failures; we can easily rebuild the indexes on the search nodes. We could also have used several RAID 1 (mirror) volumes to give the same throughput as raw disks while improving reliability.
Another interesting thing to learn from BoardReader is how Sphinx version updates are performed. Obviously, the whole cluster cannot be taken down. Therefore, backward compatibility is critical. Fortunately, Sphinx provides it—newer searchd versions usually can read older index files, and they are always able to communicate to older clients over the network. Note that the first-tier nodes that aggregate the search results look just like clients to the second-tier nodes, which do most of the actual searching. Thus, the second-tier nodes are updated first, then the first-tier ones, and finally the web frontend.
Lessons learned from this example are:
Sahibinden (http://www.sahibinden.com), a leading Turkish online auction site, had a number of performance problems, including full-text search performance. After deploying Sphinx and profiling some queries, we found that Sphinx could perform many of the frequent application-specific queries with filters faster than MySQL—even when there was an index on one of the participating columns in MySQL. Besides, using Sphinx for non-full-text searches resulted in unified application code that was simpler to write and support.
MySQL was underperforming because the selectivity on each individual column was not enough to reduce the search space significantly. In fact, it was almost impossible to create and maintain all the required indexes, because too many columns required them. The product information tables had about 100 columns, each of which the web application could technically use for filtering or sorting.
Active insertion and updates to the “hot” products table slowed to a crawl, because of too many index updates.
For that reason, Sphinx was a natural choice for all the SELECT queries on the product information tables, not just the full-text search queries.
Here are the database size and load numbers for the site:
The database contains about 400,000 records and 500 MB of data.
The load is about 3 million queries per day.
To emulate normal SELECT queries with WHERE conditions, the Sphinx indexing process included special keywords in the full-text index. The keywords were of the form __CATN__, where N was replaced with the corresponding category ID. This replacement happened during indexing, with the CONCAT() function in the MySQL query, so the source data was not altered.
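The indexing query for such fake keywords might look roughly like the following sketch (the schema and keyword format are illustrative):

```ini
source products_src
{
    # append a fake keyword such as __CAT42__ to the indexed text,
    # so "category = 42" can be matched as an ordinary full-text term
    sql_query = \
        SELECT id, title, \
               CONCAT(description, ' __CAT', category_id, '__') AS content \
        FROM products
}
```

A query for category 42 then simply adds __CAT42__ to the full-text query string instead of setting an attribute filter.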
The indexes needed to be rebuilt as frequently as possible. We settled on rebuilding them every minute. A full reindexing took 9–15 seconds and occupied only one of the server’s several CPUs, so the main + delta scheme discussed earlier was not necessary.
The PHP API turned out to spend a noticeable amount of time (7–9 milliseconds per query) parsing the result set when it had many attributes. Normally, this overhead would not be an issue because the full-text search costs, especially over big collections, would be higher than the parsing cost. But in this specific case, we also needed non-full-text queries against a small collection. To alleviate the issue, the indexes were separated into pairs: a “lightweight” one with the 34 most frequently used attributes, and a “complete” one with all 99 attributes.
Other possible solutions would have been to use SphinxSE or to implement a feature to pull only the specified columns into Sphinx. However, the workaround with two indexes was by far the fastest to implement, and time was a concern.
The following are the lessons learned from this example:
Sometimes, a full scan in Sphinx performs better than an index read in MySQL.
For selective conditions, use a “fake keyword” instead of filtering on an attribute, so the full-text search engine can do more of the work.
APIs in scripting languages can be a bottleneck in certain extreme but real-world cases.
An improvement to the BoardReader service required counting hyperlinks and building various reports from the linking data. For instance, one of the reports needed to show the top N second-level domains linked to during the last week. Another counted the top N second- and third-level domains that linked to a given site, such as YouTube. The queries to build these reports had the following common characteristics:
They always group by domain.
They sort by count per group or by the count of distinct values per group.
They process a lot of data (up to millions of records), but the result set with the best groups is always small.
Approximate results are acceptable.
During the prototype-testing phase, MySQL took up to 300 seconds to execute these queries. In theory, by partitioning the data, splitting it across servers, and manually aggregating the results in the application, it would have been possible to optimize the queries to around 10 seconds. But this is a complicated architecture to build; even the partitioning implementation is far from straightforward.
Because we had successfully distributed the search load with Sphinx, we decided to implement an approximate distributed GROUP BY with Sphinx, too. This required preprocessing the data before indexing to convert all the interesting substrings into standalone “words.” Here’s a sample URL before and after preprocessing:

source_url    = http://my.blogger.com/my/best-post.php
processed_url = my$blogger$com, blogger$com, my$blogger$com$my, my$blogger$com$my$best, my$blogger$com$my$best$post.php
Dollar signs ($) are merely a unified replacement for URL separator characters so that searches can be conducted on any URL part, be it domain or path. This type of preprocessing extracts all “interesting” substrings into single keywords that are the fastest to search. Technically, we could have used phrase queries or prefix indexing, but that would have resulted in bigger indexes and slower performance.
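One plausible reconstruction of this expansion is sketched below in Python. The actual BoardReader preprocessing was done in a MySQL UDF and may differ in details; this version emits every domain suffix of at least two labels plus every cumulative path prefix:

```python
import re

def preprocess_url(url):
    """Expand a URL into standalone '$'-joined keywords, roughly as
    described above (an illustrative reconstruction, not the real UDF)."""
    url = re.sub(r"^[a-z]+://", "", url)      # strip the scheme
    host, _, path = url.partition("/")
    labels = host.split(".")
    # every domain suffix with at least two labels:
    # my$blogger$com, blogger$com
    keywords = ["$".join(labels[i:]) for i in range(len(labels) - 1)]
    # cumulative path prefixes, splitting on '/' and '-' separators
    prefix = "$".join(labels)
    for part in re.split(r"[/\-]", path):
        if part:
            prefix += "$" + part
            keywords.append(prefix)
    return keywords
```

Applied to the sample URL above, it produces exactly the five keywords shown.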
Links are preprocessed at indexing time using a specially crafted MySQL UDF. We also enhanced Sphinx with the ability to count distinct values for this task. After that, we were able to move the queries completely to the search cluster, distribute them easily, and reduce query latency greatly.
Here are the database size and load numbers:
There are about 150–200 million records, which becomes about 50–100 GB of data after preprocessing.
The load is approximately 60,000–100,000 GROUP BY queries per day.
The indexes for the distributed GROUP BY were deployed on the same search cluster of 6 machines and 24 logical CPUs described previously. This is a minor complementary load to the main search load over the 1.5 TB text database.
Sphinx replaced MySQL’s exact, slow, single-CPU computations with approximate, fast, distributed computations. All of the factors that introduce approximation errors are present here: the incoming data frequently contains too many rows to fit in the “sort buffer” (we use a fixed RAM limit of 100K rows), we use COUNT(DISTINCT), and the result sets are aggregated over the network. Despite that, the results for the first 10 to 1,000 groups—which are actually required for the reports—are from 99% to 100% correct.
The indexed data is very different from the data that would be used for an ordinary full-text search. There are a huge number of documents and keywords, even though the documents are very small. The document numbering is nonsequential, because a special numbering convention (source server, source table, and primary key) that does not fit in 32 bits is used. The huge amount of search “keywords” was also causing frequent CRC32 collisions (Sphinx uses CRC32 to map keywords to internal word IDs). For these reasons, we were forced to use 64-bit identifiers everywhere internally.
The current performance is satisfactory. For the most complex domains, queries normally complete in 0.1 to 1.0 seconds.
The following are the lessons learned from this example:
For GROUP BY queries, some precision can be traded for speed.
With huge textual collections or moderately sized special collections, 64-bit identifiers might be required.
Grouply (http://www.grouply.com) built a Sphinx-based solution to search its multimillion-record database of tagged messages, using Sphinx’s MVA support. The database is split across many physical servers for massive scalability, so it might be necessary to query tables that are located on different servers. Arbitrary large-scale joins are impossible because there are too many participating servers, databases, and tables.
Grouply uses Sphinx’s MVA attributes to store message tags. The tag list is retrieved from a Sphinx cluster via the PHP API. This replaces multiple sequential SELECTs from several MySQL servers. To reduce the number of SQL queries as well, certain presentation-only data (for example, a small list of users who last read the message) is also stored in a separate MVA attribute and accessed through Sphinx.
Two key innovations here are using Sphinx to prebuild JOIN results and using its distributed capabilities to merge data scattered over many shards. This would be next to impossible with MySQL alone. Efficient merging would require partitioning the data over as few physical servers and tables as possible, but that would hurt both scalability and extensibility.
Lessons learned from this example are:
Sphinx can be used to aggregate highly partitioned data efficiently.
MVAs can be used to store and optimize prebuilt JOIN results.
We’ve discussed the Sphinx full-text search system only briefly in this appendix. To keep it short, we intentionally omitted discussions of many other Sphinx features, such as HTML indexing support, ranged queries for better MyISAM support, morphology and synonym support, prefix and infix indexing, and CJK indexing. Nevertheless, this appendix should give you some idea of how Sphinx can solve many different real-world problems efficiently. It is not limited to full-text searching; it can solve a number of difficult problems that would traditionally be done in SQL.
Sphinx is neither a silver bullet nor a replacement for MySQL. However, in many cases (which are becoming common in modern web applications), it can be used as a very useful complement to MySQL. You can use it to simply offload some work, or even to create new possibilities for your application.
Download it at http://www.sphinxsearch.com—and don’t forget to share your own usage ideas!