Chapter 13. Full-text searching

Robert C. Cain

Search is everywhere. In addition to the powerful search engines available to us, it seems like every website we visit has a search box for searching within that site. Wouldn’t it be great to incorporate search within your applications? Fortunately, SQL Server provides a powerful text search engine that’s as easy to use as one-two-three!

Foundations of full-text searching

Before we begin the step-by-step process of creating and using full-text indexes, there are a few fundamentals that you’ll need to understand. Full-text search isn’t a fancy way of doing a LIKE search with SQL. Instead, every word is placed into a special type of index called a full-text index. These indexes are organized and stored in full-text catalogs, which act as containers to organize our indexes.

Each word in a full-text index also includes a unique key for that record. You should note that in order to full-text index a table, SQL Server requires the table to have a unique, single-column key. This single-column key is used as part of the ranking functions we’ll cover later in this chapter.

All of the text-based data types are eligible for full-text searching. The complete list is char, nchar, varchar, nvarchar, text, ntext, xml, image, and varbinary(max). According to online documentation from Microsoft, text, ntext, and image data types will be deprecated in future versions of SQL Server, so I suggest avoiding these if you can.

Char, nchar, varchar, and nvarchar all make sense as candidates for full-text indexing. XML also makes sense, because it’s text based, but adds the advantage that markup tags are ignored—only the data is full-text indexed. The data type that might have you scratching your head is varbinary(max). To understand this, we have to briefly delve into the history of the full-text engine.

The code base for the full-text search engine included with SQL Server descended from a product called Microsoft Index Server. With it, you could index various document types stored on your server, be it a Windows NT 4.0 server or IIS (Internet Information Server). The ability to look inside documents and index their content was retained and lives on in SQL Server’s full-text search engine.

SQL Server allows you to store various types of unstructured documents, such as Microsoft Word, Excel, and many others inside a varbinary(max) field. If the full-text engine recognizes the type of document stored in a varbinary(max) field, it’ll open the document and index all words contained in the document.

We’re almost ready to dig into some code, but before we do, you should note that all of the examples in this chapter use the AdventureWorks2008 database. This is freely available from Microsoft’s CodePlex site. As of this writing, you can find AdventureWorks at http://www.codeplex.com/MSFTDBProdSamples/Release/ProjectReleases.aspx.

Creating and maintaining catalogs

The “one” in our one-two-three concerns the catalog. The catalog is a logical container to hold a group of one or more full-text indexes. Creating a catalog is fairly straightforward. Let’s look at the basic statement to create one (for a complete syntax diagram, refer to the SQL Server Books On Line):

CREATE FULLTEXT CATALOG AdventureWorksFTC
AS DEFAULT;

First, note that you’ll want to supply the name of your catalog in place of AdventureWorksFTC. If you only have one full-text catalog for your database, I suggest using the same name as the database followed by FTC (for full-text catalog), as in the example.

The optional AS DEFAULT tells SQL Server to use this particular catalog as the default for all full-text commands if no catalog is specified. It’s a good idea to specify at least one catalog as the default, and if you only have one, you definitely want to add this to the statement.

That’s all there is to it; you now have an empty catalog waiting for your indexes. Before we start loading it with full-text indexes, though, let’s take a moment to look at a few commands available for maintaining the catalog.

The first two are similar to each other, in that they update all of the indexes in the catalog, but they do it in quite different ways. The first is the REBUILD command.

ALTER FULLTEXT CATALOG AdventureWorksFTC REBUILD;

This will go through each index and rebuild it from the source tables. It’s the fastest, most efficient way to rebuild an entire catalog, but it has the side effect of taking the catalog offline—your catalog won’t be available for your users to do any full-text searching. If your operation is a 9-to-5 shop and you’re doing a rebuild during off hours, then REBUILD is the way to go. But what if your operation runs 24 hours a day?

For those situations, we have the REORGANIZE command:

ALTER FULLTEXT CATALOG AdventureWorksFTC REORGANIZE;

The REORGANIZE command will rebuild all indexes, without taking the catalog offline. Your users will still be able to use and query the catalog normally. The downside is that this is a lot slower than doing a rebuild.

Fortunately, doing either a rebuild or a reorganize to your catalog is fairly rare. The most likely call for this is during database updates that span the majority of tables in the database. When making mass updates, you may find a significant speed increase by turning off full-text indexing (using techniques later in this chapter), doing the updates, turning indexing back on, and then rebuilding/reorganizing the entire catalog.

The final command is quite simple: it sets a catalog to be the new default catalog.

ALTER FULLTEXT CATALOG AdventureWorksFTC AS DEFAULT;

Like the previous commands, this isn’t something you’ll use often. Perhaps in a long script, you may want to change the default catalog temporarily to make your coding easier.

Creating and maintaining full-text indexes

Now that the catalog exists, we’re ready to create indexes to put in it. In this section, we’ll see not only how to create a full-text index, but how to maintain it.

Creating the full-text index

The second step in our one-two-three process is to create a full-text index. In the AdventureWorks database is a table called Production.ProductDescription. Full-text searching through product descriptions seems like a logical thing users would want to do, so we’ll use this table. The next piece of information we need to know is what columns to search on. If you examine the table in SQL Server Management Studio, you’ll see it only has one column that’s eligible for full-text searching: Description. The final thing we need to know is the name of the unique index. Expanding the Keys branch in Management Studio shows us one key, named PK_ProductDescription_ProductDescriptionID. Armed with this information, we can now issue the command to create our full-text index on this table:

CREATE FULLTEXT INDEX ON Production.ProductDescription
([Description])
KEY INDEX PK_ProductDescription_ProductDescriptionID
ON AdventureWorksFTC
WITH CHANGE_TRACKING AUTO;

We start by issuing CREATE FULLTEXT INDEX ON XXX (replacing XXX with the name of the table we want to index). Note something interesting, though: at no point do we give the full-text index a name. With full-text indexing, each table is allowed to have one and only one full-text index. Because of this, SQL Server takes care of creating a unique index name for us, allowing us to refer to it by the table name.

The single full-text index per table isn’t the limitation it might seem at first, because you can have as many columns as you want in the index, as line two of the preceding code shows. List each column in parentheses, separated by commas. You can also add and remove columns later, as we’ll see momentarily.

The next line, KEY INDEX, asks you to specify the unique index for your table. This will typically be your primary key index. The important thing is that it be a single-column, non-nullable unique index. SQL Server requires this in order to perform its ranking functions, discussed later in this chapter.

The fourth line, ON AdventureWorksFTC, is optional if you have AdventureWorks as your default catalog. If you omit it, the full-text index will be placed into the default catalog. If you have multiple catalogs for your system, you can add the catalog name in order to place your new full-text index in a catalog other than the default.

The next line, WITH CHANGE_TRACKING, is probably the most important line in the statement. It defines how SQL Server will manage your full-text index; therefore, understanding the options is key to understanding how your index will get updated. The AUTO option is the most straightforward, so we’ll tackle it first.

With AUTO, every time a row in your table is updated, SQL Server will update the full-text index associated with that table. This is by far the easiest way to manage your full-text indexes, but it can cause performance penalties if your table has a large number of updates in a short time span. I’d like to give you a more definitive statement than “large,” but it depends on a variety of variables. How beefy is your server? How much RAM is installed? What’s the speed of the disks? And is the catalog on the same drive as the database or a different one? All of these come into play; my best advice is to set up your index in a test environment with the change tracking set to AUTO, and then test with a load that simulates your production environment. If you can measure an unacceptable decrease in performance, you can instead set change tracking to MANUAL.

With MANUAL change tracking, each time a row is updated in your table, SQL Server sets an internal flag that marks that row as having been updated—but no action is taken to update your full-text index. To update the full-text index, you must issue an ALTER command, which we’ll cover in detail shortly. This method is much more efficient and faster than using AUTO. It does have a downside, though, in that there’s a time delay. You have to set up a job using SQL Server’s job agent to issue the ALTER command at a frequency acceptable to your users. Thus, there will be some time delay between when a user updates a record in a table and when that data is available to be full-text searched on. For tables with large numbers of updates, MANUAL is definitely the preferred method.

The last option, OFF, will create the full-text index and populate it, but then cause no further updates to the index to be exercised. It won’t track changes to the table, as with manual mode, nor will it set up the index to be automatically updated. OFF mode would be useful with static tables—tables where you don’t plan on doing updates. Perhaps these are lookup tables, or they’re tables from a legacy system you want to be able to report on for historical purposes, but that will never be updated.

With all three options, when you create the index, SQL Server immediately populates the full-text index from the source table. There may be times when this is undesirable. With AUTO or MANUAL, you don’t have much choice, but with OFF mode, there’s an additional option: OFF, NO POPULATION. When you tack on NO POPULATION, SQL Server will create the full-text index but not populate it. This would be useful when you want to break your scripts into two parts—one to create the full-text indexes, and a second you’d use later to populate them, perhaps during off hours, using the ALTER statement as shown later in this chapter.

Now that the full-text index has been created, we’re ready for step three, querying data from our full-text index. Before we proceed, let’s take a few moments to examine how to maintain our full-text indexes.

Maintaining full-text indexes

Anyone with five minutes of experience in the computer industry knows that the one thing that’s constant is change. SQL Server provides many ways to change our full-text indexes, most of which are variations of the ALTER command. Let’s look at some ways to maintain the index we just created in the previous section. The first statement is

ALTER FULLTEXT INDEX ON Production.ProductDescription
START UPDATE POPULATION;

This is the command to update the full-text index when CHANGE_TRACKING is set to MANUAL, and probably the command you’ll use the most. When issued, SQL Server will roll through all of the rows in the table, and will update the corresponding full-text index for rows that have been marked as updated. To make life easier, you could issue this command from a scheduled SQL Server job on a timed basis.

A corresponding command is the full population command:

ALTER FULLTEXT INDEX ON Production.ProductDescription
START FULL POPULATION;

This command will rebuild the entire full-text index for this table from the ground up. You’d likely want to use this if you had turned off full-text indexing in order to update the source table and were now ready to get it back in sync with the full-text index.

The next two commands will allow us to add and remove columns from our full-text index:

ALTER FULLTEXT INDEX ON Production.ProductDescription
ADD ([Description]);

ALTER FULLTEXT INDEX ON Production.ProductDescription
DROP ([Description]);

All you need to do is indicate the column you want to add or drop, and SQL Server will take care of the rest.

It’s also possible to alter the change-tracking mode after you create the full-text index. The change-tracking mode works like it does when creating the index. For example, if we wanted to change the tracking mode on the product description table, we’d issue this:

ALTER FULLTEXT INDEX ON Production.ProductDescription
SET CHANGE_TRACKING MANUAL;

To set it back to AUTO:

ALTER FULLTEXT INDEX ON Production.ProductDescription
SET CHANGE_TRACKING AUTO;

Or to turn it off:

ALTER FULLTEXT INDEX ON Production.ProductDescription
SET CHANGE_TRACKING OFF;

If you need to, you can also disable or enable a full-text index:

ALTER FULLTEXT INDEX ON Production.ProductDescription DISABLE;

ALTER FULLTEXT INDEX ON Production.ProductDescription ENABLE;

The DISABLE command will turn off all change tracking, but will leave the data in the index intact. You may want to disable an index if you’ll no longer be using a table for updates, or if you’re about to make a huge number of updates to the table and have change tracking set to AUTO. Be aware, though, that while disabled, no change tracking is performed. Thus, immediately after issuing the ENABLE command, you’ll want to issue the ALTER command with START FULL POPULATION to rebuild the index.

Finally, you may want to remove the full-text index altogether. To do so, issue the DROP command:

DROP FULLTEXT INDEX ON Production.ProductDescription;

Querying full-text indexes

So far we’ve created a catalog, and then created a full-text index to put in the catalog. But users are strange creatures; not only do they expect us to keep their data safe, but they expect to get it back! So let’s proceed to step three in our one-two-three—querying our full-text index.

Basic searches

SQL Server provides an assortment of ways to query our full-text index. The first we’ll look at is the CONTAINS keyword, which is added to the WHERE clause of a query. The form is CONTAINS(column, 'word'), where column is the name of the column you want to look for the text in and 'word' is the word or phrase you want to look for. If you want to look at all of the columns in the table you’ve indexed, you can use an asterisk (*) in place of the column name. Let’s do a simple search of our product description table:

SELECT ProductDescriptionID as PDID, [Description]
FROM [Production].[ProductDescription] pd
WHERE CONTAINS(pd.[Description], 'ride'),

Running this query returns 10 rows back to us. Note that we could also have used CONTAINS(*, 'ride') and received the same results, because we only have one column in the full-text index. CONTAINS looks for an exact match of the word you pass in. Most of the time, searching for an exact match is what you’ll want to do. Sometimes, though, you’ll want something less exact, which returns a broader scope of results. For those times, SQL Server provides the FREETEXT keyword. The format is identical to CONTAINS: FREETEXT(column, 'text').

SELECT ProductDescriptionID AS PDID, [Description]
FROM [Production].[ProductDescription] pd
WHERE FREETEXT(pd.[Description], 'ride'),

When you run this, you get 22 rows back. If you run this and look through the results, in addition to the word ride, you’ll also see descriptions that contain the word riding. This is because of the way FREETEXT functions. FREETEXT works in a two-step process:

  1. Stemming— The full-text search engine takes the word, in this case ride, and adds the variants of the word to its search. Thus it would have ride, rides, rode, ridden, and riding in the list.
  2. Thesaurus— After doing the stemming, FREETEXT then goes to the thesaurus and retrieves the list of words that go with ride. For example, it might add words such as drive, commute, and transportation to the list. The process is repeated for each word found in the stemming process.

FREETEXT then performs a search on all of the words and returns the results to you.

Whereas CONTAINS looks for an exact match, FREETEXT looks to match the meaning of the word you’re searching for. Which will be better for your application depends on your users and their needs.

So far, we’ve looked at some fairly simple queries, but we don’t have to limit ourselves. Here’s a slightly more complex example that uses FREETEXT and joins data from three different tables in the AdventureWorks database. This example is closer to what you might do in a production application:

SELECT [Name], ProductNumber, [Description]
FROM [Production].[Product] p
, [Production].[ProductDescription] pd
, [Production].[ProductModelProductDescriptionCulture] pmpdc
WHERE p.ProductModelID = pmpdc.ProductModelID
AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
AND FREETEXT(pd.[Description], 'shift'),

FORMSOF

CONTAINS and FREETEXT are both powerful, but there are times when you want something a little looser than CONTAINS, but perhaps not quite as free as FREETEXT. SQL Server provides a way to get to this middle ground by using FORMSOF inside the text string we’re searching for. The syntax is admittedly arcane, so let’s look at an example:

SELECT [Name], ProductNumber, [Description]
FROM [Production].[Product] p
, [Production].[ProductDescription] pd
, [Production].[ProductModelProductDescriptionCulture] pmpdc
WHERE p.ProductModelID = pmpdc.ProductModelID
AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
AND CONTAINS(pd.[Description], 'FORMSOF(INFLECTIONAL, light)' );

As you can see, to use this, place FORMSOF inside the text string you’re passing into the CONTAINS keyword. This strange syntax throws a lot of people, so I’m going to say it again to emphasize the point. You must enclose FORMSOF inside the string you’re passing to the CONTAINS keyword.

After FORMSOF, use either the word INFLECTIONAL or THESAURUS, then a comma, and then the word we’re looking for—in this example, light. When using FORMSOF INFLECTIONAL, the full-text engine will go through the stemming process, generating the words light, lightest, lit, and so on. But unlike FULLTEXT, it’ll stop and not go to the thesaurus to add any more words. Conversely, you can also use the thesaurus without the stemming by using FORMSOF THESAURUS:

SELECT [Name], ProductNumber, [Description]
FROM [Production].[Product] p
, [Production].[ProductDescription] pd
, [Production].[ProductModelProductDescriptionCulture] pmpdc
WHERE p.ProductModelID = pmpdc.ProductModelID
AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
AND CONTAINS(pd.[Description], 'FORMSOF(THESAURUS, light)' );

In this second case, the full-text engine will look for all the words in its thesaurus that match the word light, but will not look for any stemmers of light.

Phrases, NEAR, OR, and prefixed terms

We can pull a few other tricks out of our hat when performing searches. It’s possible to search for exact phrases by enclosing them inside double quotes. If we wanted to search for the phrase stiff ride, all we’d have to do is pass it into the CONTAINS predicate like this: CONTAINS(pd.[Description], '"stiff ride"' ). In this case, the full-text engine will only return results when it finds that exact phrase.

Let’s say your marketing folks are typical of those in most companies: they throw great parties, but aren’t very consistent when it comes to data entry. Sometimes they use stiff ride, but other times stiff, ride and perhaps even stiff stable ride. Yet we still want to find results when stiff and ride are in the same description, in close proximity. For those searches, the NEAR keyword was created. Phrase your CONTAINS clause like CONTAINS(pd.[Description], 'stiff NEAR ride' ) and it will return all of the results you’re looking for.

In order to use NEAR most effectively, it should be combined with the ranking features provided by CONTAINSTABLE and FREETEXTTABLE, discussed later in this chapter. Behind the scenes, NEAR returns all results where both words are found. It then assigns a rank to them, based on how far apart in the text the two words occur. The closer together, the higher the rank; the more distant, the lower. A rank of zero is given when the two words are more than 50 words apart. Therefore, you’ll need to combine NEAR with the ranking feature so that results can be sorted in a way most useful to your users.

You may also see times where a tilde (~) is substituted for the word near, as in CONTAINS(pd.[Description], 'stiff ~ ride' ). Although this syntax is acceptable, it’s not nearly as readable and thus not widely used. I highly encourage you to stick to the word NEAR for readability.

It’s also possible to pass an OR clause into the full-text engine: CONTAINS (pd.[Description], 'stiff OR ride' ). This will return all matches where either word—stiff or ride—is in the results.

The final way to search is using prefixed terms. This is the closest to a form of the traditional SQL LIKE syntax, but with some important differences. LIKE performs pattern matching, and will search for the characters. If you were to enter and pd.[Description] like '%light%' as part of a SQL statement, it would return a hit on semilightweight. Assuming you only wanted search results that began with the word light, this would be an undesirable result.

Prefixed term search, on the other hand, only looks for full words (not patterns) that begin with the word you’re searching for. To use a prefixed term, append an asterisk (*) to the end of the word, like so: CONTAINS(pd.[Description], '"light*"' ). If you look at the results of a query with light*, you’ll see the word lightweight returned. This word hasn’t been in any of our previous results because it was neither a stemmer nor a thesaurus match. Instead, the full-text engine went to the index and found all words that began with light and bypassed any stemming or thesaurus activity. It’s necessary to enclose the prefixed term inside double quotes; otherwise the search engine will only look for light and ignore the *. Additionally, even though they’re called prefixed terms, you can place the * at the front of the search word as well as the end.

Note that you may have seen light-weight in some results. This is because the hyphen (-) acted as a word breaker and the full-text engine considered light-weight to be two words—light and weight.

Ranking

When searching, especially with FREETEXT, it can be desirable to order your search results in terms of relevance—returning results that most closely match your search term or phrase first, then those that match the least last. To achieve this, SQL Server provides two more functions: CONTAINSTABLE and FREETEXTTABLE. These directly correspond to the way CONTAINS and FREETEXT searching work, but instead of being used in the WHERE clause, they return a table. Because they both act the same in terms of how to use them, we’ll use FREETEXTTABLE for our example:

SELECT [KEY], RANK
FROM FREETEXTTABLE([Production].[ProductDescription]
, [Description]
, 'light' );

The first thing to notice is that FREETEXTTABLE is used in the FROM clause. As stated, both of these return a table for you to work with. Into the function we pass three parameters. The first is the name of the table we’re free-text searching, in this case Production.ProductDescription. The second item is the column we’re searching, here Description. We could’ve also passed in an * to search through all full-text-indexed columns in our table. The final item is the word we’re looking for; here I used light.

Note our SELECT statement. Both FREETEXTTABLE and CONTAINSTABLE return two columns: KEY and RANK. KEY is the unique key for the row in the source table. RANK is an indicator of relevance.

RANK will be a number from 0 to 1000. Remember some important things when dealing with this number. First of all, it’s not sequential. There will be gaps in the numbers returned. Second, the numbers aren’t unique. You may have multiple rows returned that all have the same rank. Table 1 shows the results of the preceding query run against our AdventureWorks database.

Table 1. Simple ranking query results

Key

RANK

249

113

409

310

457

113

704

310

1090

113

1183

437

1185

310

1199

310

1206

310

As you can see, we have multiple rows with a rank of 113, as well as 310. I think you’d agree that this simple query isn’t overly useful. Let’s look at a slightly more complex query:

SELECT fts.[KEY], fts.[RANK], [Description]
FROM [Production].[ProductDescription] AS pd
INNER JOIN FREETEXTTABLE([Production].[ProductDescription]
, [Description]
, 'light' ) AS fts
ON fts.[KEY] = pd.ProductDescriptionID
ORDER BY fts.[RANK];

Table 2 shows the results from the query.

Table 2. Results for medium-complexity ranking query

Key

Rank

Description

249

113

Value-priced bike with many features of our top-of-the-line models. Has the same light, stiff frame, and the quick acceleration we’re famous for.

457

113

This bike is ridden by race winners. Developed with the AdventureWorks Cycles professional race team, it has an extremely light heat-treated aluminum frame, and steering that allows precision control.

1090

113

Our lightest and best quality aluminum frame made from the newest alloy; it is welded and heat-treated for strength. Our innovative design results in maximum comfort and performance.

1185

310

Aluminum cage is lighter than our mountain version; perfect for long-distance trips.

1199

310

Light-weight, wind-resistant, packs to fit into a pocket.

1206

310

Simple and light-weight. Emergency patches stored in handle.

704

310

A light yet stiff aluminum bar for long-distance riding.

409

310

Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads.

1183

437

Affordable light for safe night riding—uses three AAA batteries.

This gives us more information than the previous query, but, as per listing 1, let’s take it further and create a query you might use in a real-world production application.

Listing 1. Real-world example using FREETEXTTABLE
SELECT fts.[KEY], fts.[RANK], [Name]
, ProductNumber, [Description]
FROM [Production].[ProductDescription] pd
INNER JOIN FREETEXTTABLE([Production].[ProductDescription]
, [Description], 'light' ) as fts
ON fts.[KEY] = pd.ProductDescriptionID
INNER JOIN [Production].[ProductModelProductDescriptionCulture] AS pmpdc
ON pmpdc.ProductDescriptionID = pd.ProductDescriptionID
INNER JOIN [Production].[Product] AS p
ON p.ProductModelID = pmpdc.ProductModelID;

Here we combine data from multiple tables and order them in a manner most relevant to the user. Because this query returns 43 rows, I’ll leave it to your inquiring mind to key it in and see the results.

Whereas querying completes step three of our one-two-three, you’ll often want to customize the behavior of your searches, especially when using FREETEXT. To do that we’ll need to looking at customizing the thesaurus and stopwords.

Custom thesaurus and stopwords

Using FREETEXT or FormsOf-Thesaurus, it’s possible to search for words using the thesaurus to augment your search. A natural question is, can we add our own words to the thesaurus?

The answer is a definite yes. In this section, we’ll look at how to customize your thesaurus, and then cover the concept of stopwords.

Custom thesaurus

The first step is tracking down the name of your thesaurus. SQL Server stores the location in the registry. Open regedit or your favorite registry tool and navigate to HKEY_LOCAL_MACHINE > SOFTWARE > Microsoft > Microsoft SQL Server > [insert your instance name here] > MSSearch > Language > [insert your language abbreviation here].


Warning

Be careful not to accidentally change your registry entry. We only want to examine values here.


My instance name is MSSQL10MSSQLSERVER. For residents of the United States, the language abbreviation is ENU (short for English, US). (If you don’t know your abbreviation, don’t worry; shortly we’ll look at a simple query that will let you discover it.) Regardless of where you live, navigate to the branch with your appropriate language abbreviation. Looking at the name TsaurusFile, you’ll discover the name is tsenu.xml.

Now we need to go to the appropriate place on your hard drive. Assuming your instance is typical, this will be in C:Program FilesMicrosoft SQL Server[your instance name]MSSQLFTData. For me, this was

C:Program FilesMicrosoft SQL ServerMSSQL10.MSSQLSERVERMSSQLFTData

Now that we’ve located it, let’s navigate to the folder and open it up in your favorite text editor. (Be sure your text editor can handle Unicode files. Notepad works fine for this.)


Tip

If you’re running on Vista, Windows 7, or Windows Server, make sure you run the text editor in Administrator mode so you can save changes!


Take a look at listing 2.

Listing 2. Default thesaurus XML file
<XML ID="Microsoft Search Thesaurus">
<!-- Commented out
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>Internet Explorer</sub>
<sub>IE</sub>
<sub>IE5</sub>
</expansion>
<replacement>
<pat>NT5</pat>
<pat>W2K</pat>
<sub>Windows 2000</sub>
</replacement>
<expansion>
<sub>run</sub>
<sub>jog</sub>
</expansion>
</thesaurus>
-->
</XML>

The first thing to notice is the file is commented out. We’ll want to uncomment it as the first step. Next you’ll see the <thesaurus...> section. We won’t need to change it, nor the diacritics tag, which specifies accent sensitivity.

The next three sections are examples. Ultimately you’ll delete them and replace them with your own, but let’s take a moment to look at what’s there. The first section is an <expansion>...</expansion> tag. With an expansion set, all terms are equivalent. With the expansion tags, if the user enters any of those terms, it’s the same as if they’d entered all of the terms. Thus in the first example, if a user were to type in Internet Explorer, a full-text search would return all records that contained Internet Explorer, IE, or IE5.

Replacement sets, the next section, are something you’ll use less often. With replacements, SQL Server doesn’t look for the word in the <pat> (pattern) tag; instead it looks for the word in the <sub> tag. In this case, if a user types in W2K, the full-text search engine will instead look for Windows 2000. One true-life situation I can think of where this would be useful is addresses. If you know your system converts all street or state abbreviations to their full expanded name, then you could use this as a trap with full-text searching. For example:

<replacement>
<pat>St</pat>
<pat>Str</pat>
<sub>Street</sub>
</replacement>

Thus a user typing in Elm St would be able to find Elm Street in your system. You could also expand this with states. Let me reiterate: this example assumes your source system automatically replaces all abbreviations with their full words. Using a replacement prevents the full-text search engine from looking for words that won’t be there.

Replacements can also be useful in situations of error correction. For example, your database of addresses is used by your company’s offshore support center, where they frequently misspell Street as Stret. You could add <pat>Stret</pat> to the thesaurus and help the users.

Of the two, you’ll probably use expansions most of the time, but know that replacement sets exist and what they can be useful for. Finally, please note there are a handful of restrictions around the thesaurus file:

  • You must have system administrator rights to be able to edit the file.
  • You should make sure the editor understands Unicode format.
  • Entries can’t be empty.
  • Phrases placed in the thesaurus have a maximum of 512 characters.
  • You may not have any duplicate entries among the <sub> tags of expansion sets or the <pat> tags of replacement sets.

So let’s create a simple example to test out our custom thesaurus. Open up your tsenu.xml file, and change it to this:

<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>light</sub>
<sub>doodleysquat</sub>
</expansion>
</thesaurus>
</XML>

Here I’m going to make the word light and the word doodleysquat substitutable for each other. In case you’re wondering, we’re picking a nonsense word that we’re sure we won’t find in the AdventureWorks database. Save the changes.

Unfortunately, saving the changes isn’t enough to make the full-text search engine pick up the updates to our thesaurus file. We have to tell SQL Server the file has been updated. In SQL Server 2008, this is fairly simple. Make sure you’re in the right database for your catalog (in this case, AdventureWorks2008) and execute the stored procedure:

exec sys.sp_fulltext_load_thesaurus_file 1033;
go

The 1033 on the end refers to the local identifier (LCID) for the language of your thesaurus file. 1033 is for English, US. To discover your LCID, use this simple query:

select [name], lcid from sys.fulltext_languages order by [name];

Okay, we’re at the finish line. Assuming your full-text engine has restarted, it should’ve picked up your new, customized thesaurus file. Let’s go back to a query we used earlier, slightly altered:

SELECT [Name], ProductNumber, [Description]
FROM [Production].[Product] p
, [Production].[ProductDescription] pd
, [Production].[ProductModelProductDescriptionCulture] pmpdc
WHERE p.ProductModelID = pmpdc.ProductModelID
AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
AND CONTAINS(pd.[Description], 'FORMSOF(Thesaurus, doodleysquat)' );

We should get the same results as if we’d used the word light instead of doodleysquat.

As you can see, adding custom entries to the SQL Server Full Text Search Thesaurus isn’t difficult; there are a few steps you need to follow in order to make it happen. Once you know how, you can use the functionality to make searches for your users more productive.

Stopwords and stoplists

Every language has some words that are used so often that indexing them for full-text searching would be useless—words such as a, an, and, the, or, and so forth. These words are known as noise words and will be ignored when you’re doing any full-text indexing. In some companies, certain words become noise words. For example, your company may have a rule that its company name must appear in all comment records in the form of a copyright notice. In that case, your company name would become a noise word because it appears so often as to be meaningless to search for.

SQL Server 2005 had support for custom noise words for its full-text search, but it was in the form of a simple text file that applied to the entire server. SQL Server 2008 has moved to a new concept, stopwords. This change is much deeper than just a rebranding. With the change comes a lot more flexibility and functionality.

SQL Server 2008 introduces two new tools: stopwords and stoplists. A stoplist acts as a named container for a group of stopwords. You can then associate a stoplist with one or more tables. This is a great enhancement over noise words, which applied to the entire server. Now you can associate a group of stopwords, in a stoplist, with specific tables without affecting the rest of the tables in the database or server.

Let’s run a query to demonstrate. We’ll use the same query we’ve used elsewhere in this chapter:

SELECT [Name], ProductNumber, [Description]
FROM [Production].[Product] p
, [Production].[ProductDescription] pd
, [Production].[ProductModelProductDescriptionCulture] pmpdc
WHERE p.ProductModelID = pmpdc.ProductModelID
AND pmpdc.ProductDescriptionID = pd.ProductDescriptionID
AND CONTAINS(pd.[Description], 'shifting'),

To create a stoplist, you can use the first of the new SQL Server 2008 commands, CREATE FULLTEXT STOPLIST:

CREATE FULLTEXT STOPLIST ArcanesStoplist;

The stoplist will act as a holder for a specific set of words that we want to ignore. We refer to that group of words by the name we gave it, ArcanesStoplist. Now we need to add some words to the list. Here are two ways to do so; both use the ALTER FULLTEXT STOPLIST statement:

ALTER FULLTEXT STOPLIST ArcanesStoplist
ADD 'shifting' LANGUAGE 1033;

ALTER FULLTEXT STOPLIST ArcanesStoplist
ADD 'light' LANGUAGE 'English';

The command is straightforward; use the ALTER FULLTEXT STOPLIST statement and provide the name of the list you want to add a word to. Then comes the command ADD, followed by the word you want to add. Next you have to specify the language. You can specify the language in two ways, either by using the language ID (in this example, 1033) or the name for the language.

If you were to jump the gun and rerun our query, you’d think it would now ignore the word shifting because we just added it as a stopword to our stoplist. But there’s still one more step.

You need to attach your stoplist to a table that has a full-text index on it. This is a major improvement over 2005, where stopwords were implemented as noise words, one simple text file that applied to the entire server. SQL Server 2008 now allows you to get granular with the application of custom groups of words. You’re limited to one stoplist per table, though. On the plus side, one stoplist can be applied to multiple tables. Here’s the code to associate our stoplist with a table:

ALTER FULLTEXT INDEX ON [Production].[ProductDescription]
SET STOPLIST ArcanesStoplist;

All we need to do is specify the table name and the stoplist to associate with that table. Now go run our test query. You should get back zero rows.

Congratulations! You’ve now associated your stoplist with the full-text index.

I’m sure you don’t want to leave it this way, so let’s look at what it will take to clean up the mess. First, you can decide you no longer want the stoplist associated with the full-text index. Time to use the ALTER command again:

ALTER FULLTEXT INDEX ON [Production].[ProductDescription]
SET STOPLIST system;

Setting the stoplist to the keyword system will change from your custom stoplist to the system stoplist. You can also use the word OFF instead of SYSTEM to turn off stopwords altogether for the specified table. If you want to use the standard system set of stopwords instead of using a custom set, use the SYSTEM keyword, as in the previous example. You can also use an INSERT INTO statement to copy the system stopwords into your custom stoplist, and then add or remove words as needed.

There may be times when you want to remove only a word or two from a stoplist, but not disassociate the entire list. It’s possible to easily remove individual words from the list:

ALTER FULLTEXT STOPLIST ArcanesStoplist
DROP 'shifting' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST ArcanesStoplist
DROP 'light' LANGUAGE 'English';

The syntax for the DROP is identical to ADD, except for using the word DROP instead of ADD.

Finally, you may want to drop the stoplist altogether. There’s a DROP statement for that as well:

DROP FULLTEXT STOPLIST ArcanesStoplist;

This covers the basic use of stopwords and stoplists in 2008. Let’s take a moment now to look at some advanced queries that will help you manage your stopwords and stoplists.

Useful Queries Pertaining to Stopwords and Stoplists

Our first query returns a list of all the user-defined stoplists in our database:

SELECT stoplist_id, name
FROM sys.fulltext_stoplists;

Our next query returns a list of stopwords for our user-defined stoplists in the database. Note the linking to get the associated stoplist name and language:

SELECT sl.name as StoplistName
, sw.stopword as StopWord, lg.alias as LanguageAlias
, lg.name as LanguageName, lg.lcid as LanguageLCID
FROM sys.fulltext_stopwords sw
JOIN sys.fulltext_stoplists sl
ON sl.stoplist_id = sw.stoplist_id
JOIN master.sys.syslanguages lg
ON lg.lcid = sw.language_id;

This next query gets a list of all of the stopwords that ship with SQL Server 2008:

SELECT ssw.stopword, slg.name
FROM sys.fulltext_system_stopwords ssw
JOIN sys.fulltext_languages slg
ON slg.lcid = ssw.language_id;

Used together, these queries can help you discover the state of your stopwords and stoplists.

Useful system queries

So far we’ve learned how to create catalogs and full-text indexes, along with a variety of ways to get information back out of our indexes. All are great tools for the database developer. If you’re a database administrator, though, there are few more tricks you can use to administer and take care of your full-text indexes.

Basic queries to discover what catalogs, indexes, and columns exist

Let’s start with a few basic queries. First, let’s retrieve a list of the catalogs that are associated with the current database.

SELECT fulltext_catalog_id, name, is_default
FROM sys.fulltext_catalogs;

This may return one or more rows, depending on how many catalogs you have. If you created the AdventureWorksFTC catalog, it’ll appear in the name column. The fulltext_catalog_id is a numbered auto incrementing primary key without any meaning. The is_default column will contain a 1 if this catalog is the default; otherwise the value will be 0.

This is useful, so let’s dig further. Let’s get a list of all of the indexes in our database. To do so, we’ll delve into the sys.fulltext_indexes view, adding it to the information provided by the sys.fulltext_catalogs view we used in the previous example:

-- List names of all FTS indexes
SELECT cat.[name] as CatalogName
, object_name(object_id) as table_name
, is_enabled
, change_tracking_state_desc
FROM sys.fulltext_indexes, sys.fulltext_catalogs cat
ORDER BY cat.[name], table_name;

The results of this query are shown in table 3.

Table 3. Query results to list all full-text indexes

CatalogName

table_name

is_enabled

change_tracking_state_desc

AdventureWorksFTC

Document

1

AUTO

AdventureWorksFTC

JobCandidate

1

AUTO

AdventureWorksFTC

ProductDescription

1

AUTO

AdventureWorksFTC

ProductReview

1

AUTO

Our query returns four rows, including the row for the Document table we created earlier in the chapter. Note that the query shows that all of these indexes are enabled, and it also displays the change-tracking mode being used.

We can extend the prior query one step further, adding in information about the unique index used for full-text searching and the associated stoplist for a given table, as shown in listing 3.

Listing 3. Full information about tables and full-text searching
SELECT c.name as CatalogName
, t.name as TableName
, idx.name as UniqueIndexName
, case i.is_enabled
when 1 then 'Enabled'
else 'Not Enabled'
end as IsEnabled
, i.change_tracking_state_desc
, sl.name as StoplistName
FROM sys.fulltext_indexes i
JOIN sys.fulltext_catalogs c
ON i.fulltext_catalog_id = c.fulltext_catalog_id
JOIN sys.tables t
ON i.object_id = t.object_id
JOIN sys.indexes idx
ON i.unique_index_id = idx.index_id
AND i.object_id = idx.object_id
LEFT JOIN sys.fulltext_stoplists sl
ON sl.stoplist_id = i.stoplist_id;

So far we’ve looked at catalogs and tables. The next logical question most would ask is “Which columns are full-text indexed?” I’m glad you asked! The query shown in listing 4 will tell us. The results of the query are shown in table 4.

Table 4. Query results to list all columns that are full-text indexed

TableName

ColumnName

IsFullTextIndexed

Document

DocumentSummary

True

Document

Document

True

JobCandidate

Resume

True

ProductDescription

Description

True

ProductReview

Comments

True

Listing 4. List all columns that are full-text indexed
SELECT t.[Name] as TableName
, c.[Name] as ColumnName
, (case ColumnProperty ( (t.[object_id], c.[Name], 'IsFulltextIndexed')
when 1 then 'True'
when 0 then 'False'
else 'Invalid Input'
end) as IsFullTextIndexed
FROM sys.tables t
JOIN sys.all_columns c ON t.[object_id] = c.[object_id]
WHERE ColumnProperty(t.[object_id], c.[Name], 'IsFulltextIndexed') = 1
ORDER BY t.[Name], c.column_id;

Note that our ProductDescription.Description column appears in the list. If you want to check the status of all columns, and not only the ones that are full-text indexed, omit the WHERE clause from the query.

Advanced queries

It would be nice to return more information about our catalogs, such as their size, how many items are in them, and so forth. SQL Server provides a useful function called FullTextCatalogProperty. You provide it two parameters: the name of the catalog and the name of the property you wish to examine. A handful of the properties you can check this way turn out to be useful, so let’s look at a query that returns these as a single result set. Listing 5 shows the query, and table 5 shows the results.

Table 5. Query results for catalog information query

Catalog name

Index size (MB)

Item count

Unique key count

Populate status

Is default catalog

AdventureWorksFTC

0

762

3194

Idle

Yes

AW2008FullTextCatalog

0

762

3194

Idle

No

Listing 5. Using FullTextCatalogProperty to get information
SELECT [name] as CatalogName
, FullTextCatalogProperty('AdventureWorksFTC', 'IndexSize')
AS IndexSizeMB
, FullTextCatalogProperty('AdventureWorksFTC', 'ItemCount')
AS ItemCount
, FullTextCatalogProperty('AdventureWorksFTC', 'UniqueKeyCount')
AS UniqueKeyCount
, CASE FullTextCatalogProperty('AdventureWorksFTC', 'PopulateStatus')
WHEN 0 THEN 'Idle'
WHEN 1 THEN 'Full population in progress'
WHEN 2 THEN 'Paused'
WHEN 3 THEN 'Throttled'
WHEN 4 THEN 'Recovering'
WHEN 5 THEN 'Shutdown'
WHEN 6 THEN 'Incremental population in progress'
WHEN 7 THEN 'Building index'
WHEN 8 THEN 'Disk is full. Paused.'
WHEN 9 THEN 'Change tracking'
ELSE 'Error reading FullTextCatalogProperty PopulateStatus'
END AS PopulateStatus
, CASE is_default
WHEN 1 then 'Yes'
ELSE 'No'
END AS IsDefaultCatalog
FROM sys.fulltext_catalogs
ORDER BY [name];

After the catalog name, you’ll see the size of the index, in megabytes. Don’t be alarmed that what you see in the table is 0; this means the catalog is less than 1 megabyte in size. This is due to the fact that we’re running our query against the small sample database. The next column is the Item Count. This indicates the number of rows from the source system that were indexed.

Unique Key Count, on the other hand, is an indication of the number of entries in the catalog—in other words, how many unique word/primary key combinations are in the catalog.

Populate Status returns a number from 0 to 9, indicating what activity the full-text catalog is performing. Because we don’t want to have to remember what each number means, the query uses a simple case statement to let us know in language we can understand.

Finally, the Is Default Catalog column tells us whether the catalog is the default for this database. Normally it returns 1 (for default) or 0 (for not the default), but because the case statement was useful for populate status, we might as well use it here as well.

At the beginning of the chapter, we talked about the varbinary(max) data type. As you’ll recall, varbinary(max) is used to store a document whose type SQL Server understands; the full-text engine will open the document and index the contents. It’s possible to retrieve a list of all document types understood by SQL Server, using the following query:

SELECT document_type, path, [version], manufacturer
FROM sys.fulltext_document_types;

This returns four columns. Because SQL Server 2008 understands 50 types right out of the box, I won’t list them all here. The main column is the document_type; in it you’ll see extensions such as .txt, .doc, .xls, .aspx, .cmd, .cpp, and many more. If the document saved in a varbinary(max) column has one of the extensions from the document_type column, SQL Server will understand it.

SQL Server uses the library located in the path column to do the work of opening and reading the document. The last two columns, version and manufacturer, are informational and let us know who made the library and what version it is.

If your full-text performance begins to suffer over time, you might want to check and see how many fragments exist. Internally, SQL Server stores index data in special tables it calls full-text index fragments. A table can have one or more fragments associated with it, but if that number gets too high, it can degrade performance.

The query shown in listing 6 will tell you how many fragments exist for your full-text index.

Listing 6. Determining the number of fragments for your full-text indexes
-- See how many fragments exist for each full text index.
-- If multiple closed fragments exist for a table do a REORGANIZE to help performance
SELECT t.name as TableName
, f.data_size
, f.row_count
, case f.status
WHEN 0 THEN 'Newly created and not yet used'
WHEN 1 THEN 'Being used for insert'
WHEN 4 THEN 'Closed ready for query'
WHEN 6 THEN 'Being used for merge input and ready for query'
WHEN 8 THEN 'Marked for deletion. Will not be used for query and merge source'
ELSE 'Unknown status code'
END
FROM sys.fulltext_index_fragments f
JOIN sys.tables t on f.table_id = t.object_id;

When this query returns, look for rows whose type is 4, or Closed ready for query. A table will be listed once for each fragment it has. If it turns out that you have a high number of closed fragments, you should consider doing a REORGANIZE on the index (using the ALTER FULLTEXT INDEX statement). Note two things: first you must do a reorganize, as opposed to a rebuild. Second, the exact number of fragments that will cause you issues is somewhat dependant on your hardware. But as a rule of thumb, if it exceeds 50, start planning a reorganize, and if it’s over 100, start planning in a hurry.

The keywords

We’ll close this chapter out by answering one of the most-often-asked questions: how can I find out what words are contained in my full-text index? New with SQL Server 2008 are a pair of dynamic management functions (DMFs) that can help us answer that very question.

The first is sys.dm_fts_index_keywords. To use this function, pass in the database ID and object ID for the table you want to discover the keywords for. It returns a table with many columns; this query shows you the more useful ones. Note that it also references the sys.columns view in order to get the column name:

SELECT keyword, display_term, c.name, document_count
FROM sys.dm_fts_index_keywords(db_id()
, object_id ('Production.ProductDescription')) fik
JOIN sys.columns c on
c.object_id = object_id('Production.ProductDescription')
AND c.column_id = fik.column_id;

The db_id() function allows us to easily retrieve the database ID. We then use the object_id function to get the ID for the table name, passing in the text-based table name. Table 6 shows a sampling of the results.

Table 6. Sample of results for query to find keywords

Keyword

Display term

Column

Document count

0x006C0069006700680074

light

Description

7

0x006C006900670068007400650072

lighter

Description

1

0x006C0069006700680074006500730074

lightest

Description

1

0x006C0069006700680074007700650069006700680074

lightweight

Description

11

0x006C00690067006E0065

ligne

Description

3

0x006C0069006E0065

line

Description

5

0x006C0069006E006B

link

Description

2

The Keyword column contains the Unicode version of the keyword in hexadecimal format, and is used as a way to link the Display Term—the real indexed word—to other views. The Column column is obvious; the Document Count indicates how many times this keyword appears in the table.

One oddity about this particular DMF is that it doesn’t appear in the Object Explorer—at least, the version used in the writing of this chapter doesn’t. But not to worry: the view still works, and it’s found in the Books Online documentation.

To add to the oddities, there’s a second dynamic management function, one that doesn’t display in the Object Explorer. It’s sys.dm_fts_index_keywords_by_document, and can also return valuable information about your keywords. Here’s a query that will tell us not only what the keywords are, but what rows they are located on in the source table:

SELECT keyword, display_term, c.name
, document_id , occurrence_count
FROM sys.dm_fts_index_keywords_by_document(db_id()
, object_id('Production.ProductDescription'))
JOIN sys.columns c on c.object_id =
object_id('Production.ProductDescription')
ORDER BY display_term;

Like its sister DMF, you pass in the database ID and the object ID for the table. Table 7 shows a sampling of the data returned.

Table 7. Sample of results for query to find keywords and their source row

Keyword

Display term

Column name

Document ID

Occurrence count

0x006C0069006700680074

light

Description

249

1

0x006C0069006700680074

light

Description

409

1

0x006C0069006700680074

light

Description

457

1

0x006C0069006700680074

light

Description

704

1

0x006C0069006700680074

light

Description

1183

1

0x006C0069006700680074

light

Description

1199

1

0x006C0069006700680074

light

Description

1206

1

Keyword and Display Term are the same as the previous view, as is the Column Name. The Document ID is the unique key from the source table, and the Occurrence Count is how many times the word appears in the row referenced by the document ID.

Using this information, we can construct a query that combines data from the source table with this view. This will create a valuable tool for debugging indexes as we try to determine why a particular word appears in a result set:

SELECT  d.keyword, d.display_term
, d.document_id --primary key
, d.occurrence_count, p.Description
FROM sys.dm_fts_index_keywords_by_document(db_id()
, object_id('Production.ProductDescription')) d
JOIN Production.ProductDescription p
ON p.ProductDescriptionID = d.document_id
ORDER BY d.display_term;

As you can see from the results shown in table 8, we can pull the description for the row with the keyword we want.

Table 8. Partial results of expanded query combining keywords with source data

Keyword

Display term

Document ID

Occurrence count

Description

0x006C0069006700680074

light

249

1

Value-priced bike with many features of our top-of-the-line models. Has the same light, stiff frame, and the quick acceleration we’re famous for.

0x006C0069006700680074

light

409

1

Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads.

0x006C0069006700680074

light

457

1

This bike is ridden by race winners. Developed with the AdventureWorks Cycles professional race team, it has a extremely light heat-treated aluminum frame, and steering that allows precision control.

0x006C0069006700680074

light

704

1

A light yet stiff aluminum bar for long-distance riding.

0x006C0069006700680074

light

1183

1

Affordable light for safe night riding; uses 3 AAA batteries.

0x006C0069006700680074

light

1199

1

Light-weight, wind-resistant, packs to fit into a pocket.

0x006C0069006700680074

light

1206

1

Simple and light-weight. Emergency patches stored in handle.

Summary

This concludes our look at full-text searching with SQL Server 2008. We began by creating a catalog to hold our indexes, then proceeded to step two, creating the indexes themselves. Our third step was querying the full-text indexes in a variety of ways. Finally, we looked at some queries that will help us maintain and discover the state of our full-text indexes.

Hopefully you’ll find that using full-text searching can be as easy as one-two-three!

About the author

Robert C. Cain is a Microsoft MVP in SQL development, and is a consultant with COMFRAME as a senior business intelligence architect. Prior to his current position, Robert worked for a regional power company, managing, designing, and implementing the SQL Server data warehouse for the nuclear division. He also spent 10 years as a senior consultant, working for a variety of customers in the Birmingham, Alabama, area using Visual Basic and C#. He maintains the popular blog http://arcanecode.com.

In his spare time, Robert enjoys spending time with his wife and two daughters, digital photography, and amateur radio, holding the highest amateur license available and operating under the call sign N4IXT.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset