Automatic phrase detection

Preprocessing typically involves phrase detection, that is, the identification of tokens that are commonly used together and should receive a single vector representation (for example, New York City; see the discussion of n-grams in Chapter 13, Working with Text Data).

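The original Word2vec authors use a simple lift scoring method that identifies two words $w_i$, $w_j$ as a bigram if their joint occurrence exceeds a given threshold relative to each word's individual appearance, corrected by a discount factor $\delta$:

$$\text{score}(w_i, w_j) = \frac{\text{count}(w_i, w_j) - \delta}{\text{count}(w_i) \times \text{count}(w_j)}$$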

The scorer can be applied repeatedly to identify successively longer phrases.
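
To make the repeated application concrete, here is a minimal pure-Python sketch. It scores each adjacent pair as the bigram count minus the discount, divided by the product of the two unigram counts, and joins pairs that exceed a threshold. The function names, the toy corpus, the underscore separator, and the threshold of 0.1 are illustrative assumptions, not values taken from the original implementation.

```python
from collections import Counter


def score_bigrams(sentences, delta=1.0):
    """Score each adjacent pair: (count(a, b) - delta) / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


def merge_phrases(sentences, threshold, delta=1.0, sep="_"):
    """One pass: greedily join adjacent tokens whose score exceeds the threshold."""
    scores = score_bigrams(sentences, delta)
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            pair = (s[i], s[i + 1]) if i + 1 < len(s) else None
            if pair is not None and scores.get(pair, 0.0) > threshold:
                out.append(s[i] + sep + s[i + 1])
                i += 2  # skip the token that was just merged
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged


# Toy corpus (hypothetical); real corpora need far more data and a tuned threshold.
corpus = [
    ["i", "love", "new", "york", "city"],
    ["new", "york", "city", "is", "big"],
    ["she", "moved", "to", "new", "york", "city"],
    ["a", "new", "car", "is", "fast"],
]

first_pass = merge_phrases(corpus, threshold=0.1)        # finds "new_york"
second_pass = merge_phrases(first_pass, threshold=0.1)   # finds "new_york_city"
print(second_pass[0])
```

Because the second pass treats new_york as a single token, it can pair that token with city, which is how successive passes build trigrams and longer phrases. With a discount of 1, any pair seen only once scores zero, so rare co-occurrences are never merged.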

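An alternative is the normalized pointwise mutual information (NPMI) score, which is more accurate but also more costly to compute. It uses the relative word frequency $P(w)$ and varies between +1 and -1:

$$\text{NPMI}(w_i, w_j) = \frac{\ln\left(\dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}\right)}{-\ln P(w_i, w_j)}$$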
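
As a rough illustration, the NPMI score can be computed from raw counts as follows. This sketch assumes that both unigram and bigram frequencies are estimated relative to the total token count, which keeps the score within the stated range. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def npmi_scores(sentences):
    """NPMI per adjacent pair: ln(P(a,b) / (P(a) P(b))) / -ln P(a,b)."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    total = sum(unigrams.values())  # assumption: one shared denominator for all frequencies
    scores = {}
    for (a, b), n in bigrams.items():
        p_ab = n / total
        p_a = unigrams[a] / total
        p_b = unigrams[b] / total
        # p_ab < 1 always holds here, so the denominator is strictly positive
        scores[(a, b)] = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
    return scores


# Toy corpus (hypothetical)
corpus = [
    ["i", "love", "new", "york", "city"],
    ["new", "york", "city", "is", "big"],
    ["she", "moved", "to", "new", "york", "city"],
    ["a", "new", "car", "is", "fast"],
]

npmi = npmi_scores(corpus)
for pair in [("york", "city"), ("new", "york"), ("new", "car")]:
    print(pair, round(npmi[pair], 3))
```

In this corpus, york and city only ever occur together, so their score sits at the upper bound of 1 (up to floating-point rounding), while new also appears in other contexts, so the pairs that contain it score lower.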
