Pattern matching

Pattern matching is the act of checking a given sequence of words for the presence of the constituents of some pattern. Several factors affect the result of pattern matching, including:

  • Text normalization
  • Dictionary
  • Ranking

If the text is not normalized, a text search might not return the expected result. The following example shows how pattern matching can fail with unnormalized text:

car_portal=# SELECT 'elephants'::tsvector @@ 'elephant';
?column?
----------
f
(1 row)

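In the preceding query, casting elephants to tsvector and implicitly casting elephant to tsquery does not generate normalized lexemes, because a plain cast carries no information about the dictionary. To add dictionary information, use to_tsvector and to_tsquery with a text search configuration. In the following examples, the english configuration stems both words to the same lexeme, so the match succeeds, whereas the simple configuration does not stem, so the match fails: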

car_portal=# SELECT to_tsvector('english', 'elephants') @@ to_tsquery('english', 'elephant');
?column?
----------
t
(1 row)

car_portal=# SELECT to_tsvector('simple', 'elephants') @@ to_tsquery('simple', 'elephant');
?column?
----------
f
(1 row)
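
To see why only the english configuration produces a match, it can help to inspect the lexemes that each approach generates. The following is a minimal sketch; the column aliases are illustrative names and are not part of the original examples:

```sql
-- Compare the lexemes produced by a plain cast and by two configurations.
SELECT 'elephants'::tsvector               AS plain_cast,      -- no normalization at all
       to_tsvector('english', 'elephants') AS english_config,  -- stemmed by the english dictionary
       to_tsvector('simple', 'elephants')  AS simple_config;   -- kept as written, no stemming
```

The plain cast and the simple configuration should both keep the word as elephants, while the english configuration should reduce it to the stem eleph, which is the same lexeme that elephant is reduced to.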

Full text search supports pattern matching based on ranks. The tsvector lexemes can be marked with the labels A, B, C, and D, where D is the default and A has the highest rank. The setweight function can be used to assign a weight to a tsvector explicitly, as follows:

car_portal=# SELECT setweight(to_tsvector('english', 'elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B');
?column?
-------------------------
'dolphin':2B 'eleph':1A
(1 row)

For ranking, there are two functions: ts_rank and ts_rank_cd. The ts_rank function is used for standard ranking, while ts_rank_cd is used for the cover density ranking technique. The following examples show the result of ts_rank_cd when searching for eleph and dolphin, respectively:

car_portal=# SELECT ts_rank_cd (setweight(to_tsvector('english','elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B'),'eleph' );
ts_rank_cd
------------
1
(1 row)

car_portal=# SELECT ts_rank_cd (setweight(to_tsvector('english','elephants'),'A') || setweight(to_tsvector('english', 'dolphin'),'B'),'dolphin' );
ts_rank_cd
------------
0.4
(1 row)
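
The values 1 and 0.4 in the preceding results correspond to the default weights that the ranking functions assign to the labels A and B. Both functions also accept an optional array of weights, given in the order {D, C, B, A}, with {0.1, 0.2, 0.4, 1.0} as the default. The following sketch repeats the dolphin query with a custom array; the value 0.8 is an arbitrary choice made for illustration:

```sql
-- Weights are listed in the order {D, C, B, A}.
-- Here, B is raised from its default of 0.4 to 0.8 (an arbitrary illustrative value).
SELECT ts_rank_cd('{0.1, 0.2, 0.8, 1.0}'::float4[],
                  setweight(to_tsvector('english', 'elephants'), 'A') ||
                  setweight(to_tsvector('english', 'dolphin'), 'B'),
                  'dolphin'::tsquery);
```

With this array, a match on a B-labeled lexeme is scored using 0.8 instead of 0.4, so the rank for dolphin should move closer to the rank for eleph.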

Ranking is often used to enhance, filter, and order the results of pattern matching. In real-life scenarios, different document sections can have different weights. For example, when searching for a movie, the highest weight could be given to the movie title and main character, and less weight could be given to the summary of the movie plot.
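
The movie scenario can be sketched as follows. The movies data set, its title and summary columns, and the two rows are all hypothetical and exist only for illustration; they are not part of the car_portal database:

```sql
-- Hypothetical data: two made-up movies, defined inline so the query is self-contained.
WITH movies (title, summary) AS (
  VALUES
    ('Elephant Road', 'Two friends drive across the country in an old van.'),
    ('Circus Days',   'A travelling circus loses its star elephant before opening night.')
)
SELECT title,
       ts_rank_cd(doc, query) AS rank
FROM (
  -- Title lexemes get the highest weight (A); summary lexemes get a lower one (B).
  SELECT title,
         setweight(to_tsvector('english', title), 'A') ||
         setweight(to_tsvector('english', summary), 'B') AS doc
  FROM movies
) AS weighted,
     to_tsquery('english', 'elephant') AS query
WHERE doc @@ query      -- filter: keep only the matching documents
ORDER BY rank DESC;     -- order: best matches first
```

Both rows match the query, but the row that matches in the title should be ranked above the row that matches only in the summary, because the title lexemes carry the A label.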
