TF-IDF stands for term frequency-inverse document frequency, a measure of how important a word is to a document within a collection of documents. It is used extensively in information retrieval and reflects the weight of a word in a document. The TF-IDF value grows with the number of occurrences of the word (the frequency of the word/term) and is the product of two components: the term frequency and the inverse document frequency.
TF is the term frequency, which is the frequency of a word/term in the document.
For a term t, TF measures the number of times t occurs in document d. TF is implemented in Spark using hashing, where a term is mapped into an index by applying a hash function.
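The hashing idea can be sketched in plain Python (this is an illustration of the technique, not Spark's HashingTF API; the function name and the feature count of 16 are assumptions for the example). Each term is hashed into one of a fixed number of buckets, and the bucket's count becomes that term's frequency slot:

```python
def hashing_tf(tokens, num_features=16):
    """Map each term to a bucket via a hash function and count occurrences."""
    vector = [0] * num_features
    for term in tokens:
        index = hash(term) % num_features  # term -> fixed index, no vocabulary needed
        vector[index] += 1                 # accumulate the term's frequency
    return vector

tokens = "spark is fast and spark is scalable".split()
tf_vector = hashing_tf(tokens)
```

Note that two distinct terms can hash to the same bucket (a collision), which is the price paid for avoiding a global term-to-index dictionary.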
IDF is the inverse document frequency, which represents the information a term provides based on how rarely it appears across documents. IDF is a log-scaled inverse function of the number of documents containing the term:
IDF = log(TotalDocuments / DocumentsContainingTerm)
Once we have TF and IDF, we can compute the TF-IDF value by multiplying the TF and IDF:
TF-IDF = TF * IDF
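The two formulas can be combined into a short worked example in plain Python (the toy corpus below is invented for illustration, and this uses the unsmoothed log(N/df) form; Spark ML's IDF additionally applies smoothing to handle terms that appear in no document):

```python
import math

# Hypothetical three-document corpus for illustration.
docs = [
    ["spark", "is", "fast"],
    ["spark", "streaming"],
    ["hadoop", "is", "old"],
]

def tf(term, doc):
    """Raw term frequency: how many times the term occurs in the document."""
    return doc.count(term)

def idf(term, corpus):
    """Log-scaled inverse document frequency (assumes the term occurs in the corpus)."""
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    return math.log(n / df)

def tf_idf(term, doc, corpus):
    """TF-IDF = TF * IDF."""
    return tf(term, doc) * idf(term, corpus)
```

Here "spark" occurs in two of the three documents, so its IDF is log(3/2), while "fast" occurs in only one, giving the larger IDF of log(3): rarer terms carry more weight.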
We will now look at how we can generate TF using the HashingTF Transformer in Spark ML.