CHAPTER 10

image

Full-Text Search

Full-text search (FTS) is a powerful SQL Server feature allowing for advanced searches using multiple languages to find information in documents as well as document properties. FTS is tightly integrated with SQL Server 2012 and can be easily managed with SQL Server Management Studio (SSMS) and monitored with standard dynamic management views. FTS broadens the scope of what is thought of as a T-SQL search by providing meaningful results from sometimes seemingly unstructured textual data. SQL Server 2012 also introduces statistical semantics which allow for searching on document meaning as opposed to simply searching content. Based on word distributions and other factors, statistical semantics allows you to find documents with similar contents.

FTS Architecture

As mentioned earlier, the FTS architecture is tightly integrated with the SQL Server database engine. In fact, FTS consists of two main components: the sqlserver process (sqlserver.exe) and the filter daemon host (fdhost.exe). The filter daemon is responsible for retrieving the text data from the tables and applying word breaks as well as determining the type of text is being retrieved. The filter daemon host applies different rules based on whether the document is a Word document, an Excel file, or even XML. Information is passed between the SQL Server process and the filter daemon host. Because the fdhost process has the responsibility to directly access and filter the data, the process requires a separate security account. This keeps the entire FTS process much more secure than in previous implementations.

The SQL Server process is primarily responsible for maintaining full-text indexes, controlling query optimization, and maintaining the stoplist and theasaures objects. A stoplist is a list of non-essentials words which should be ignored in most linguistic searches. A thesaurus is something you fill out in order to extend the reach of searches to find matches that FTS may not have been able to suggest on its own. Figure 10-1 shows how these architectural components are put to together.

9781430245964_Fig10-01.jpg

Figure 10-1.  FTS architecture (simplified)

Here is a quick summary of some of the beneficial features of FTS:

  • The full-text engine is hosted in the SQL Server process, eliminating much of the overhead associated with interservice communications.
  • Integration with the SQL Server process to better predict query performance through the use of new query operators.
  • Full-text indexes are maintained by the SQL Server process for better optimization.
  • Ability to create customized stoplists of words to ignore during FTS, and the ability to create a thesaurus for more efficient and accurate searching.
  • Dynamic management views and functions that provide greater transparency in understanding how FTS queries are processed and executed.

Creating Full-Text Catalogs and Indexes

The first step to take advantage of SQL Server FTS is to create full-text catalogs and full-text indexes. A full-text catalog can contain one or more full-text indexes, and each full-text index can only be assigned to one full-text catalog. You can create full-text catalogs and full-text indexes in SSMS using GUI (graphical user interface) wizards or T-SQL statements.

Creating Full-Text Catalogs

You can access the GUI full-text catalog wizard by right-clicking Full Text Catalogs in the SSMS Object Explorer. The New Full-Text Catalog option on the pop-up context menu starts the wizard (see Figure 10-2).

9781430245964_Fig10-02.jpg

Figure 10-2.  New Full-Text Catalog Context Menu Option

After selecting New Full-Text Catalog, SSMS presents the wizard’s New Full-Text Catalog window. This window allows you to define the name of your full-text catalog, the full-text catalog’s owner, an accent sensitivity setting, and whether or not this full-text catalog is designated as the default for a database. The New Full-Text Catalog window is shown in Figure 10-3.

9781430245964_Fig10-03.jpg

Figure 10-3.  New Full-Text Catalog Window

For this sample full-text catalog, we chose the following options:

  • The full-text catalog is named AdventureWorksFTCat, and dbo is designated as the owner.
  • The first created full-text catalog is designated the default full-text catalog for the database. When a new full-text index is created you will have a choice to create it in the default catalog or in any additional non-default catalogs.
  • The accent sensitivity is set to Insensitive, meaning that words with accent marks are treated as equivalent to those without accent marks (e.g., for search purposes, resumé is the same as resume).

You can also create and manage full-text catalogs using T-SQL statements. Listing 10-1 shows how to create the same full-text catalog that we created previously in this section with the SSMS wizard.

Listing 10-1. Creating a Full-Text Catalog with T-SQL

CREATE FULLTEXT CATALOG AdventureWorksFTCat
  WITH ACCENT_SENSITIVITY = OFF
  AS DEFAULT
  AUTHORIZATION dbo;

Once you’ve created your full-text catalog, the next step is to build full-text indexes. We describe full-text index creation in the next section. Maximum performance full-text catalogs, particularly those you anticipate will become very large, should be created on filegroups that are located on their own physical drives. This is also useful for administrative functions such as performing filegroup backups and restores independent of data and log files.

Creating Full-Text Indexes

As with full-text catalogs, you have two options for creating full-text indexes—you can use the GUI wizard in SSMS, or you can use T-SQL statements. Once you’ve created a full-text catalog, as described in the previous section, it’s time to define your full-text indexes. Begin by right-clicking a table; the example in Figure 10-4 uses the Production.ProductModel table, in the SSMS Object Explorer to pull up the table context menu. From the context menu, choose the Full-Text Index image Define Full-Text Index option, shown in Figure 10-4.

9781430245964_Fig10-04.jpg

Figure 10-4.  “Full-Text Index” Context Menu

The full-text index wizard shows a splash screen the first time you access it. You can choose to turn off the splash screen or just ignore it. On the next screen, shown in Figure 10-5, the wizard allows you to select a single-column unique index on the table. Every full-text index requires a single-column unique index that allows the full-text index to reference individual rows in the table. If you don’t have a single-column unique index defined on the table you’re trying to create a full-text index on, the wizard will display an error message as soon as you try to run it. In this example, we’ve chosen to use the table’s integer primary key for the full-text index.

image Tip  It’s recommended that you specify a single-column unique index defined on an integer column when creating a full-text index. This will help maximize performance and minimize full-text index storage requirements.

9781430245964_Fig10-05.jpg

Figure 10-5.  Selecting a Single-column Unique Index

After you select a unique index, you’ll choose the columns that will provide the searchable content for the full-text index. You can specify char, nchar, varchar, nvarchar, xml, varbinary, varbinary(max), and image columns in this step. In Figure 10-6, the nvarchar and xml data type columns of the table are selected to participate in the full-text index. We’ve also selected English as the word-breaker language for each of these columns. The word-breaker language specification determines the language used for word-breaking and stemming. SQL Server 2012 currently recognizes over 50 different languages.

9781430245964_Fig10-06.jpg

Figure 10-6.  Selecting Columns to Participate in Full-text Searches

image Note  The type column is the name of a column indicating the document type (e.g., Microsoft Word, Excel, PowerPoint, Adobe PDF, and others) when you full-text index documents stored in varbinary(max) or image columns. Be aware that some document types require installation and configuration of additional IFilter components. More information about full-text and the new filetable feature is available on Microsoft TechNet at http://social.technet.microsoft.com/wiki/contents/articles/9809.store-and-index-documents-in-sql-server-2012-an-end-to-end-walkthrough.aspx.

After you’ve selected the columns that will participate in full-text searches against a table, you must select the change-tracking option. Change tracking determines whether SQL Server maintains a change log for the full-text indexed columns, and how the log is used to update the full-text index. Figure 10-7 shows the change-tracking options available through the wizard.

9781430245964_Fig10-07.jpg

Figure 10-7.  Selecting a Change-tracking Option

The change-tracking options available through the wizard include the following:

  • Automatically: SQL Server updates the full-text index automatically when data is modified in the columns that participate in the full-text index. This is the default option.
  • Manually: The change-tracking log is either used to update the full-text index via SQL Agent on a scheduled basis, or through manual intervention. This option is useful when automatic full-text index updates could slow down your server during business hours.
  • Do not track changes: SQL Server does not track changes. Updating the full-text index requires you to issue an ALTER FULLTEXT INDEX statement with the START FULL or INCREMENTAL POPULATION clause to populate the entire full-text index.

image Tip  Keep in mind that automatic updates to the full-text index are not necessarily immediate updates. When automatic change tracking is specified, there may be some lag time between changes in the table data and updates to the full-text index.

The next step in the wizard allows you to assign your full-text index to a full-text catalog. You can choose a preexisting full-text catalog, like the AdventureWorksFTCat shown in Figure 10-8, or you can create a new full-text catalog. You can also choose a filegroup and full-text stoplist for the full-text index in this step.

9781430245964_Fig10-08.jpg

Figure 10-8.  Assigning a Full-text Index to a Catalog

The final steps of the wizard allow you to create a full-text index population schedule and review your previous wizard selections. Since automatic population is used in the example, no schedule is necessary.

image Note  It is possible you may receive an error on the population schedule screen when using SQL Server 2012 Express Advanced Services.  This might be due to a bug in the application. You can ignore the error and continue. Express Advanced Services does support population schedules so you can avoid the error by manually creating the schedule and bypassing the GUI. It also may be possible to create the schedule through the GUI later by selecing the index properties. For more, go to http://connect.microsoft.com/SQLServer/feedback/details/740181/management-studio-does-not-fully-manage-full-text-in-sql-server-express.

In the review window of the wizard, shown in Figure 10-9, you can look at the choices you’ve made in each step of the wizard and go back to previous steps to make changes if necessary. Once you click the Finish button, the full-text index is created in your database.

9781430245964_Fig10-09.jpg

Figure 10-9.  Review Wizard Selections

The SSMS full-text index wizard is very thorough, but you can also create and manage full-text indexes using T-SQL statements. Listing 10-2 shows the T-SQL statements required to create and enable a full-text index with the same options previously selected in the SSMS wizard example.

Listing 10-2. Creating a Full-Text Index with T-SQL Statements

CREATE FULLTEXT INDEX
ON Production.ProductModel
(
  CatalogDescription LANGUAGE English,
  Instructions LANGUAGE English,
  Name LANGUAGE English
  )
KEY INDEX PK_ProductModel_ProductModelID
ON
(
  AdventureWorksFTCat
)
WITH
(
CHANGE_TRACKING AUTO
);
GO
ALTER FULLTEXT INDEX
ON Production.ProductModel ENABLE;
GO

The CREATE FULLTEXT INDEX statement builds the full-text index on the Production.ProductModel table with the specified options. In this example, the CatalogDescription, Instructions, and Name columns are all participating in the full-text index. The LANGUAGE clause specifies that the English language word breaker will be used to index the columns. A word breaker is a naturally occurring break between words based on a language’s lexicon. Setting the word breaker language to English helps FTS understand how the sentences are structured in order to better search on individual words. The KEY INDEX clause specifies the primary key of the table, PK_ProductModel_ProductModelID, as the single-column unique index for the table. Finally, the CHANGE TRACKING AUTO option turns on automatic change tracking for the full-text index.

The ALTER FULLTEXT INDEX statement in the listing enables the full-text index and starts a full population. ALTER FULLTEXT INDEX is a flexible statement that can be used to add columns to, or remove columns from, a full-text index. You can also use it to enable or disable a full-text index, set the change-tracking options, start or stop a full-text index population, or change full-text index stoplist settings.

image Note  Stoplists are lists of words that are considered unimportant for purposes of FTS. These words are known as stopwords. Stopwords are language dependent, with the English system stoplist containing words like a, an, and, and the (and many others). SQL Server 2012 provides a system stoplist and allows you to create your own custom stoplists. We will discuss stoplists later in this chapter.

Full-Text Querying

After you create a full-text catalog and a full-text index, you can take advantage of FTS with SQL Server’s FTS predicates and functions. SQL Server provides four ways to query a full-text index. The FREETEXT and CONTAINS predicates retrieve rows from a table that match a given FTS criteria, in much the same way that the EXISTS predicate returns rows that meet given criteria. The FREETEXTTABLE and CONTAINSTABLE functions return rowsets with two columns: a key column, which is a row identifier (the unique index value specified when the full-text index was created) and a rank column, which is a relevance rating.

The FREETEXT Predicate

The FREETEXT predicate offers the simplest method of using FTS to search character-based columns of a full-text index. FREETEXT searches for words that match inflectional forms and thesaurus expansions and replacements. The FREETEXT predicate accepts a column name or list of columns, a free-text search string, and an optional language identifier (a locale ID, or LCID). Because it is a predicate, FREETEXT can be used in the WHERE clause of a SELECT query or DML statement. All rows for which the FREETEXT predicate returns true (a match) are returned. Listing 10-3 shows a simple FREETEXT query that uses the full-text index created on the Production.ProductModel table in the previous section. The results are shown in Figure 10-10. The wildcard character (*) passed as a parameter to the FREETEXT predicate indicates that all columns participating in the full-text index should be searched for a match. The second FREETEXT parameter is the word you want to match.

Listing 10-3. Simple FREETEXT Full-Text Query

SELECT
   ProductModelID,
   Name,
   CatalogDescription,
   Instructions
FROM Production.ProductModel
WHERE FREETEXT(*, N'sock'),

9781430245964_Fig10-10.jpg

Figure 10-10.  Using FREETEXT to Find Socks

The FREETEXT predicate automatically stems words to find inflectional forms. The query in Listing 10-3 returns rows that contain an inflectional form of the word sock—in this case, FTS finds two rows that contain the plural form of the word, socks. Notice that if you were to replace the word “socks” with “sox” you receive the same result set. This is because FREETEXT also performs FTS thesaurus expansions and replacements automatically, if a thesaurus file is available.

9781430245964_Fig10-11.jpg

Figure 10-11.  FREETEXT Query Execution Plan

The integration of FTS with the SQL Server query engine results in a more efficient FTS experience. In SQL Server 2012, FTS can take advantage of optimized operators like the Table Valued Function [FulltextMatch] operator shown in Figure 10-11. The query plan shown is generated by the query in Listing 10-3.

FTS PERFORMANCE OPTIMIZATION

In previous releases of SQL Server, the FTS functionality was provided via an independent service known as MSFTESQL (Microsoft Full-Text Engine for SQL Server). Because it was completely separate from the SQL Server query engine, the MSFTESQL service could not take advantage of T-SQL operators to optimize performance. As an example, consider the following variation on the query in Listing 10-3:

 

SELECT
  ProductModelID,
  Name,
  CatalogDescription,
  Instructions
FROM Production.ProductModel
WHERE FREETEXT(*, N'sock')
AND ProductModelID < 100;

Imagine for a moment that the Production.ProductModel table has 1,000,000 rows that match the FREETEXT predicate. Versions of SQL Server prior to SQL Server 2008 were incapable of using the additional T-SQL ProductModelID < 100 predicate in the WHERE clause to limit the rows accessed by the FTS service. The MSFTESQL service had to return all 1,000,000 rows from the FREETEXT predicate and then narrow them down. Beginning with SQL 2008 and continuing in SQL Server 2012, the FTS engine can work in tandem with the SQL Server query engine to optimize the query plan and limit the number of rows touched by the FREETEXT predicate.

image Tip  You’ll see heavy use of the phrase inflectional forms throughout this section. Inflectional forms of words include verb conjugations like go, goes, going, gone, and went. Inflectional forms also include plural and singular noun variants of words, like bike and bikes. Searching for any word with FREETEXT automatically results in matches of all supported inflectional forms.

Listing 10-4 demonstrates a FREETEXT query that retrieves all rows that contain inflectional forms of the word ride in the CatalogDescription column. Another word for this process is called stemming. Inflectional forms that are matched in this query include the plural noun riders and the verb riding. In this FREETEXT query, the CatalogDescription column name is identified by name to restrict the search to a single column, and the LANGUAGE specifier is used to indicate LCID 1033, which is US English. The results are shown in Figure 10-12.

Listing 10-4FREETEXT Query with Automatic Word Stemming

SELECT
  ProductModelID,
  Name,
  CatalogDescription,
  Instructions
FROM Production.ProductModel
WHERE FREETEXT(CatalogDescription, N'weld', LANGUAGE 1033);

9781430245964_Fig10-12.jpg

Figure 10-12.  Automatic Stemming with FREETEXT

You can’t see the words that matched in the xml type CatalogDescription (there’s not enough space on the page to reproduce the entire result). Rest assured that FREETEXT has located valid matches in the row. For the first match the XML has the text “The heat treated welded aluminum,” while the second match has the text “it is welded and heat treated.”

The CONTAINS Predicate

In addition to the FREETEXT predicate, SQL Server 2012 supports the CONTAINS predicate. CONTAINS allows more advanced full-text query options than the FREETEXT predicate. Just like FREETEXT, the CONTAINS predicate accepts a column name or list of columns, a search condition, and an optional language identifier as parameters. The CONTAINS predicate can search for simple strings like FREETEXT, but it also allows sophisticated search conditions that include word or phrase prefixes, words that are in close proximity to other words, inflectional word forms, thesaurus synonyms, and combinations of search criteria.

The simplest CONTAINS predicates are basic word searches, similar to FREETEXT. Unlike FREETEXT, however, the CONTAINS predicate does not automatically search for inflectional forms of words or thesaurus expansions and replacements. Listing 10-5 modifies Listing 10-4 to demonstrate a simple CONTAINS query. The results are shown in Figure 10-13. As you can see, a couple of rows that do not contain an exact match for the word weld are eliminated from the results.

Listing 10-5. Simple CONTAINS Query

SELECT
  ProductModelID ,
  Name,
  CatalogDescription,
  Instructions
FROM Production.ProductModel
WHERE CONTAINS (*, N’weld’);

9781430245964_Fig10-13.jpg

Figure 10-13.  Results of the Simple CONTAINS Query

To use inflectional forms or thesaurus expansions and replacements with CONTAINS, use the FORMSOF generation term in your search condition. Listing 10-6 performs a CONTAINS search on the Name and CatalogDescription columns of the Production.ProductModel table. The results, which include matches for inflectional forms of the word sport, like sports and sporting, are shown in Figure 10-14.

Listing 10-6. Sample CONTAINS Query with FORMSOF Inflectional Generation Term

SELECT
  ProductModelID ,
  Name,
  CatalogDescription
FROM Production.ProductModel
WHERE CONTAINS
(
  (
    Name,
    CatalogDescription
  ),
  N'FORMSOF(INFLECTIONAL, sport)'
);

9781430245964_Fig10-14.jpg

Figure 10-14.  Results of the CONTAINS Query with Inflectional FORMSOF Term

The CONTAINS predicate also allows you to combine simple search terms like these with the AND (&), AND NOT (&!), and OR (|) Boolean operators. Listing 10-7 demonstrates combining two search terms in a CONTAINS predicate. The results of this sample query, which retrieves all rows containing inflectional forms of the word sport (like sports) or the word tube in the Name or CatalogDescription columns, are shown in Figure 10-15.

Listing 10-7. Compound CONTAINS Search Term

SELECT
  ProductModelID ,
  Name,
  CatalogDescription
FROM Production.ProductModel
WHERE CONTAINS
(
  (
    Name,
    CatalogDescription
),
N'"tube" | FORMSOF (INFLECTIONAL, sport)'
);

9781430245964_Fig10-15.jpg

Figure 10-15.  Results of the CONTAINS Query with a Compound Search Term

Listing 10-7 uses FORMSOF to return matches for inflectional forms. You can also use the FORMSOF (THESAURUS,...) format to return matches for expansions and replacements of words, as defined in your language-specific thesaurus files.

CONTAINS also supports prefix searches using the wildcard asterisk (*) character. Place the search word or phrase, immediately followed by the wildcard character, in double quotes to specify a prefix search. Listing 10-8 demonstrates a simple prefix search to retrieve all rows that have a word starting with the prefix bot in the Name column. The results are shown in Figure 10-16.

Listing 10-8. CONTAINS Prefix Search

SELECT
  ProductModelID ,
  Name
FROM Production.ProductModel
WHERE CONTAINS (Name, N'"bot*"'),

9781430245964_Fig10-16.jpg

Figure 10-16.  Results of the CONTAINS Prefix Search

The CONTAINS predicate also supports the NEAR () keyword for proximity searches. NEAR will return matches for words that are close to one another in the source columns. Listing 10-9 demonstrates a NEAR proximity search that looks for instances of the word aluminum that occur in close proximity to the word jig in the Instructions column. The results are shown in Figure 10-17. This example is considered a generic proximity search.

image Tip  Avoid using generic proximity searches. These will be deprecated in future versions of SQL Server. Instead, use the custom proximity searches discussed later in this chapter.

Listing 10-9. CONTAINS Proximity Search

SELECT
  ProductModelID ,
  Name
FROM Production.ProductModel
WHERE CONTAINS (Instructions, N'aluminum NEAR jig'),

9781430245964_Fig10-17.jpg

Figure 10-17.  CONTAINS Proximity Query Results

SQL Server 2012 introduces a custom proximity search for the NEAR clause. It allows you to easily search for words within a customizable distance from one another. It also allows you to define the order of the phrases in your search. The distance is determined by the number of non-searchable words between the words included in your search. If we take the example in Listing 10-9 and convert it to a custom proximity search, we find that in order to get the same results we have to include a distance of three. This means that a maximum of three words exist between the words aluminum and jig. Listing 10-10 shows the revised code.

Listing 10-10. CONTAINS Custom Search

SELECT
  ProductModelID ,
  Name
FROM Production.ProductModel
WHERE CONTAINS(Instructions, 'NEAR((aluminum,jig), 3)'),

Listing 10-10 gives you the same results as Figure 10-17. A distance of two will give you no results but any other number above three gives you the same results as the original. Keep in mind the distance between the words also includes stopwords. Remember stopwords are words usually not included in searches. Keep in mind too that the custom proximity clause is not limited to only two search words. You could have also included words like “bike,” “weld,” and “frame”—for example, NEAR((bike, weld, frame), 3). You can even include phrases like “bike riding” or “welding frame.” Whatever you choose, the distance is still based on the distance between the first and last word listed in the condition.

By default the custom proximity search will ignore the order of the search words. In the example above, jig could be within a distance of three either before or after the word aluminum. If you want to control the order of the search words then you need add the TRUE clause in the NEAR statement. Listing 10-11 shows two examples. The first has jig before aluminum and the second has aluminum before jig. Notice that only the second example returns values.

Listing 10-11. Custom Search with TRUE Clause

SELECT
  ProductModelID ,
  Name
FROM Production.ProductModel
WHERE CONTAINS(Instructions, 'NEAR((jig, aluminum),3, TRUE)'),
 
SELECT
  ProductModelID ,
  Name
FROM Production.ProductModel
WHERE CONTAINS(Instructions, 'NEAR((aluminum, jig),3, TRUE)'),

The custom proximity search also allows for search conditions which combine multiple grouping of words using expressions like AND, OR, and AND NOT. The added flexibility of the SQL Server 2012 custom proximity search adds advanced features not available in the generic search. Going forward, all searches should be done using the custom properties.

The FREETEXTTABLE and CONTAINSTABLE Functions

SQL Server provides TVF-based counterparts to the FREETEXT and CONTAINS predicates, known as FREETEXTTABLE and CONTAINSTABLE. These functions operate like the similarly named predicates, but both functions return result sets consisting of a table with two columns, named KEY and RANK. The KEY column contains the key index values relating back to the unique index of matching rows in the source table, and the RANK column contains relevance rankings.

The FREETEXTTABLE function accepts the name of the table to search, a single column name or column list, a search string, and an optional language identifier just like the FREETEXT predicate. FREETEXTTABLE can also take an additional “top n by rank” parameter to limit the rows returned to a specific number of the highest-ranked rows. The results of FREETEXTTABLE are useful for joining back to the source table via the KEY column of the results. Listing 10-12 demonstrates a simple FREETEXTTABLE query that locates rows where the word aluminum appears in the Instructions column of the Production.ProductModel table. The results are joined back to the source table to return the ProductModelID and Name, as shown in Figure 10-18.

Listing 10-12. FREETEXTTABLE Results Joined to Source Table

SELECT
  ftt.[KEY],
  ftt.[RANK],
  pm.ProductModelID ,
  pm.Name FROM FREETEXTTABLE
(
  Production.ProductModel,
  Instructions,
  N'aluminum'
) ftt
INNER JOIN Production.ProductModel pm
  ON ftt.[KEY] = pm.ProductModelID;

9781430245964_Fig10-18.jpg

Figure 10-18.  Results of the FREETEXTTABLE Query

The CONTAINSTABLE function offers the advanced search capabilities of the CONTAINS predicate in a function form. The CONTAINSTABLE function accepts the name of the source table, a single column name or list of columns, and a CONTAINS-style search condition. Like FREETEXTTABLE, the CONTAINSTABLE function also accepts an optional language identifier and “top n by rank” parameter. Listing 10-13 demonstrates the CONTAINSTABLE function in a simple keyword search that retrieves KEY and RANK values for all rows containing inflectional forms of the word tours. The results are shown in Figure 10-19.

Listing 10-13. Simple CONTAINSTABLE Query

SELECT
  [KEY],
  [RANK]
FROM CONTAINSTABLE (
Production.ProductModel,
[Name],
N'FORMSOF(INFLECTIONAL, tours)'
);

9781430245964_Fig10-19.jpg

Figure 10-19.  Results of the CONTAINSTABLE Query with Inflectional Forms

CONTAINSTABLE supports all of the options supported by the CONTAINS predicate, including the ISABOUT term, which allows you to assign weights to the matched words it locates. With ISABOUT, you assign a weight value between 0.0 and 1.0 to each search word. CONTAINSTABLE applies the weight to the relevance rankings returned in the RANK column. Listing 10-14 shows two CONTAINSTABLE queries. The first query returns all products with the words aluminum or polish in their XML Instructions column. The second query uses ISABOUT to assign each of these words a weight between 0.0 and 1.0, which is then applied to the result RANK for each row. The results, shown in Figure 10-20, demonstrate how ISABOUT weights can rearrange the rankings of your CONTAINSTABLE query results.

Listing 10-14. ISABOUT in a CONTAINSTABLE Query

SELECT
  ct.[RANK],
  ct.[KEY],
  pm.[Name]
FROM CONTAINSTABLE
(
  Production.ProductModel,
  Instructions,
  N'aluminum OR polish'
) ct
INNER JOIN Production.ProductModel pm
  ON ct.[KEY] = pm.ProductModelID
ORDER BY ct.[RANK] DESC;
  SELECT
  ct.[RANK],
  ct.[KEY],
  pm.[Name] FROM CONTAINSTABLE
(
  Production.ProductModel,
  Instructions,
N'ISABOUT(aluminum WEIGHT(1.0 ), polish WEIGHT(0.1))'
) ct
INNER JOIN Production.ProductModel pm
  ON ct.[KEY] = pm.ProductModelID
ORDER BY ct.[RANK] DESC;

9781430245964_Fig10-20.jpg

Figure 10-20.  Changing Result Set Rankings with ISABOUT

Thesauruses and Stoplists

The FREETEXT predicate and FREETEXTTABLE function automatically perform word stemming for inflectional forms and thesaurus expansions and replacements. The CONTAINS predicate and CONTAINSTABLE function require you to explicitly specify that you want inflectional forms and thesaurus expansions and replacements with the FORMSOF term. While inflectional forms include verb conjugations and plural forms of words, thesaurus functionality is based on user-managed XML files that define word replacement and expansion patterns.

Each language-specific thesaurusis located in an XML file in the FTData directory of your SQL Server installation. If you installed SQL Server with the default settings then the directory would be located in the path C:Program FilesMicrosoft SQL ServerMSSQL11.MSSQLSERVERMSSQLFTData. The thesaurus files are named using the format tsnnn.xml, where nnn is a three-letter code representing a specific language. The file name tsenu.xml, for instance, is the US English thesaurus. To demonstrate the FTS thesaurus capabilities, we’ll begin by creating a new full-text index on the Production.Product table using the code in Listing 10-15.

Listing 10-15. Creating a Full-Text Index

CREATE  FULLTEXT  INDEX  ON  Production.Product
(
    Name  LANGUAGE  English,
    Color LANGUAGE English
)
KEY  INDEX PK_Product_ProductID
ON  (AdventureWorksFTCat)
WITH
(
    CHANGE_TRACKING AUTO,
    STOPLIST = SYSTEM
);
GO
ALTER FULLTEXT INDEX ON Production.Product
ENABLE;
GO

You can edit the thesaurus XML files with a simple text editor or a more advanced XML editor. For this example, we opened the tsenu.xml thesaurus file in Notepad, made the appropriate changes, and saved the file back to the MSSQLFTData directory. The contents of the tsenu.xml file, after our edits, are shown in Listing 10-16.

Listing 10-16. Tsenu.xml US English XML Thesaurus File

<XML  ID  =  "Microsoft  Search  Thesaurus">
    <thesaurus xmlns = "x-schema:tsSchema.xml">
        <diacritics_sensitive> 0</diacritics_sensitive>
        <expansion>
        <sub> thin</sub>
        <sub> flat</sub>
        </expansion>
        <replacement>
        <pat> sapphire</pat>
        <pat> indigo</pat>
        <pat> navy</pat>
        <sub> blue</sub>
        </replacement>
    </thesaurus>
</XML>

After editing the XML thesaurus file, you can use the sys.spfulltextloadthesaurusfile stored procedure (SP) to reload the thesaurus file. This procedure accepts an integer LCID parameter, as shown in Listing 10-17. The LCID used in the listing is 1033, which specifies US English.

image Note  Starting in SQL Server 2008, reloading a thesaurus in SQL Server did not require an SQL Server service restart.

Listing 10-17. Reloading US English XML Thesaurus

EXEC sys.sp_fulltext_load_thesaurus_file 1033;
GO

The diacritics_sensitive element of the thesaurus file indicates whether accent marks are replaced during expansion and replacement. For instance, if diacritics_sensitive is set to 0, the words cafe and café are considered equivalent for purposes of the thesaurus. If diacritics_sensitive is set to 1, however, these two words would be considered different.

The expansion element indicates substitutions that should be applied during the full-text query. The word being searched is expanded to match the other words in the expansion set. In the example, if the user queries for the word thin, the search is automatically expanded to include matches for the word flat, and vice versa. An expansion set can include as many substitutions as you care to define, and the thesaurus can contain as many expansion sets as you need. The sample FREETEXT query in Listing 10-18 shows the expansion sets in action, with partial results shown in Figure 10-21.

Listing 10-18. FREETEXT Query with Thesaurus Expansion Sets

SELECT
  ProductID,
  Name
FROM Production.Product
WHERE FREETEXT(*, N'flat'),

9781430245964_Fig10-21.jpg

Figure 10-21.  Partial Results of the Full-text Query with Expansion Sets

The replacement section of the thesaurus file indicates replacements for words that are used in a full-text query. In the example, we’ve defined patterns like navy,sapphire, and indigo, which will be replaced with the word blue. The result is that a full-text query for these replacement patterns will be converted internally to a search for blue. Listing 10-19 shows a FREETEXT query that uses the replacement patterns defined in the thesaurus. You can use any of the replacement patterns defined in the thesaurus file in the full-text query to get the same result. Figure 10-22 shows the results.

Listing 10-19. FREETEXT Query with Thesaurus Replacement Patterns

SELECT
  ProductID,
  Name,
  Color
FROM Production.Product
WHERE FREETEXT(*, N'navy'),

9781430245964_Fig10-22.jpg

Figure 10-22.  Partial Results of the Full-text Query with Replacement Sets

Previous versions of FTS had system-defined lists of noise words, which provided a way to essentially ignore commonly occurring words that don’t help the search. Commonly cited noise words included those like the, a, an, and others. The noise word implementation in previous versions stored the noise words in files in the file system.

SQL Server 2012 implements the classic noise words, known in FTS as stopwords. Stopwords are managed inside the SQL Server database using structures known as stoplists. You can use the system-supplied stoplists or create and manage your own language-specific stoplists with the CREATE FULLTEXT STOPLIST, ALTER FULLTEXT STOPLIST, and DROP FULLTEXT STOPLIST statements. The statement in Listing 10-20 creates a stoplist based on the system stoplist.

Listing 10-20. Creating a Full-Text Stoplist

CREATE FULLTEXT STOPLIST AWStoplist
FROM SYSTEM STOPLIST;
GO

Stoplists are more flexible than the old noise word lists since you can easily use T-SQL statements to add words to your stoplists. Consider AdventureWorks product model searches where the word instructions appears in several of the XML documents in the Instructions column. You can add the word instructions to the previously created stoplist with the ALTER FULLTEXT STOPLIST statement, and then associate the stoplist with the full-text index on the Production.ProductModel table via the ALTER FULLTEXT INDEX statement, as shown in Listing 10-21. This will effectively ignore the word instructions during full-text searches on this column.

Listing 10-21. Adding the Word “Instructions” to the Stoplist

ALTER FULLTEXT STOPLIST AWStoplist
ADD N'instructions' LANGUAGE English;
GO
 
ALTER FULLTEXT INDEX ON Production.ProductModel
SET STOPLIST AWStoplist;
GO

After application of the newly created stoplist, a full-text query against the Production.ProductModel table for the word instructions, as shown in Listing 10-22, will return no results.

Listing 10-22. Full-Text Query with Newly Created Stoplist

SELECT
  ProductModelID,
  Name
FROM Production.ProductModel
WHERE FREETEXT(*, N'instructions'),

Stored Procedures and Dynamic Management Views and Functions

SQL Server 2012 provides access to many of the legacy FTS SPs available in previous releases of SQL Server. Most of these procedures have been deprecated, however, and have been replaced by fully integrated T-SQL statements and dynamic management views and functions.

SQL Server 2012 FTS uses the sys.sp_fulltext_load_thesaurus_file procedure that we introduced earlier in this chapter to load an XML thesaurus file. Another procedure is the sys.sp_fulltext_resetfdhostaccount procedure that updates the Windows username and password that SQL Server uses to start the filter daemon service.

A big issue for developers who used FTS in SQL Server 2005 and earlier was the lack of transparency. Basically everything that FTS did was well hidden from view, and developers and administrators had to troubleshoot FTS issues in the dark. SQL Server 2008 introduced some catalog views and dynamic management functions that made FTS more transparent, and this continues to be the case in SQL Server 2012.

If you’re experiencing FTS query performance issues, the sys.fulltext_index_fragments catalog view can provide insight. This catalog view reports full-text index fragments and their status. You can use the information in this catalog view to decide if it’s time to reorganize your full-text index.

The sys.fulltext_stoplists and sys.fulltext_stopwords catalog views let you see the user-defined stopwords and stoplists defined in the current database. The information returned by these catalog views is useful for troubleshooting issues with certain words being ignored (or not being ignored) in full-text queries. The sys.fulltext_system_stopwords catalog view returns a row for every stopword in the system stoplist, which is useful information to have if you want to use the system stoplist as the basis for your own stoplists.

The sys.dm_fts_parser functionis a useful tool for troubleshooting full-text queries. This function accepts a full-text query string, an LCID, a stoplist ID, and an accent sensitivity setting. The result returned by the function shows the results produced by the word breaker and stemmer for any given full-text query. This information is very useful if you need to troubleshoot or just want to better understand exactly how the word breaker and stemmer affect your queries. Listing 10-23 is a simple demonstration of stemming the word had with the sys.dm_fts_parser function. Results are shown in Figure 10-23.

Listing 10-23. Using Sys.dm_fts_parser to See Word Breaking and Stemming

SELECT
  keyword,
  group_id,
  phrase_id,
  occurrence,
  special_term,
  display_term,
  expansion_type,
  source_term
FROM sys.dm_fts_parser
(
   N'FORMSOF(FREETEXT,had)',
   1033,
   NULL,
   0
);

9781430245964_Fig10-23.jpg

Figure 10-23.  Results of Word-breaking and Stemming the Word “Had”

Statistical Semantics

When you created the index (see Figure 10-6) you had the option to select statistical semantics. Statistical semantics is new in SQL Server 2012 and it dramatically changes what it means to search documents. Everything discussed up to now was focused on searching words within a document. If you needed to find all the words similar to “weld,” you could find them by using FTS functions against text data stored in the SQL Server engine. But what if you wanted to find all the documents stored in your SQL Server database that were related to finance or a particular law case? Or, let’s say, you needed to search through hundreds of resumes to determine which ones best fit a particular job application. This is where statistical semantics becomes helpful. Statistical semantics is used to search for the meaning of documents and not just their content.

The statistical semantic feature requires FTS but is installed as a separate feature. The install file is located on the SQL Server install disk. The 64bit version is located at . . . x64Setup and the file name is SemanticLanguageDatabase.msi. The install wizard is straight-forward. The wizard extracts the semantic database files to a directory. The default directory is C:Program FilesMicrosoft Semantic Language Database. You will then want to copy or move these database files to another location, preferably the same location as your other database files, and then attach the database. Once the database is attached, run the command in Listing 10-24.

Listing 10-24. Initializing the Statistical Semantics Database

EXEC sp_fulltext_semantic_register_language_statistics_db @dbname  =  N'semanticsdb';

Once initialized, you can verify the database is ready by running the code in Listing 10-25. Figure 10-24 shows the results.

Listing 10-25. Verifying Active Statistical Semantics Database

SELECT * FROM sys.fulltext_semantic_language_statistics_database

9781430245964_Fig10-24.jpg

Figure 10-24.  Results of Querying the Semantics Database

From here you can now go back to the properties of the Production.ProductModel FTS index we created earlier in the chapter and checkmark the Statistical Semantics column as shown in Figure 10-25.

9781430245964_Fig10-25.jpg

Figure 10-25.  Enabling Statistical Semantics on Table Columns

Now that statistical semantics is installed we can do things like search for key phrases or find related documents. To find a key phrase we use the TVF semantickeyphrasetable. Searching for key phrases on the Production.ProductModel name column yields the results we see in Figure 10-26. Run the code in Listing 10-26 to get the results.

Listing 10-26. Using the Semantickeyphrasetable Function

SELECT TOP(10) KEYP_TBL.keyphrase
FROM SEMANTICKEYPHRASETABLE
    (
    Production.ProductModel,
 Name
    ) AS KEYP_TBL
ORDER BY KEYP_TBL.score DESC;
GO

9781430245964_Fig10-26.jpg

Figure 10-26.  Results from Semantickeyphrasetable Function

Semantic searching offers some interesting possibilities and broadens the scope of traditional FTS. If you include the SQL Server 2012 FileTable feature then the possibilities widen even further. FileTable allows documents stored on a file system to be integrated and managed through SQL Server. Semantic searching can be performed against these and any other document managed by the SQL Server engine.

Summary

FTS functionality is highly integrated with SQL Server, providing more efficient full-text queries than ever before. Full-text indexes and stoplists are stored in the database, making FTS more manageable, flexible, and scalable.

SQL Server provides the powerful FREETEXT and CONTAINS predicates, and FREETEXTTABLE and CONTAINSTABLE functions, to perform full-text searches. SQL Server also supports thesaurus and stoplist functionality to help customize FTS as well as the new CONTAIN custom search and statistical semantics. SQL Server 2012 also provides dynamic management views and functions to make FTS more transparent and easier to troubleshoot than was the case in previous versions of SQL Server.

EXERCISES

  1. [True/False] Stoplists and full-text indexes are stored in the database.
  2. [Choose one] You can create a full-text index with which of the following methods:
    • a.  Using a wizard in SSMS
    • b.  Using the T-SQL CREATE FULLTEXT INDEX statement
    • c.  Both (a) and (b)
    • d.  None of the above
  3. [Fill in the blanks] The FREETEXT predicate automatically performs word stemming and thesaurus _________ and __________.
  4. [Fill in the blank] Stoplists contain stopwords, which are words that are _________ during full-text querying.
  5. [True/False] The sys.dm_fts_parser dynamic management function shows the results produced by word breaking and stemming.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset