TF-IDF stands for term frequency-inverse document frequency, a measure of how important a word is to a document within a collection of documents. It is used extensively in information retrieval and reflects the weight of a word in a document. The TF-IDF value grows with the number of occurrences of the word (the frequency of the word/term) and is the product of two components: the term frequency and the inverse document frequency.
TF is the term frequency, which is the frequency of a word/term in the document.
For a term t, TF measures the number of times t occurs in document d. TF is implemented in Spark using hashing, where a term is mapped into an index by applying a hash function.
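The hashing idea can be sketched in plain Python (this is an illustration of the technique, not Spark's HashingTF API; the function name and the feature count of 16 are assumptions for the example). Each term is hashed into one of a fixed number of buckets, and the bucket's count becomes that term's frequency slot:

```python
def hashing_tf(tokens, num_features=16):
    """Map each term to a bucket via a hash function and count occurrences."""
    vector = [0] * num_features
    for term in tokens:
        index = hash(term) % num_features  # term -> fixed index, no vocabulary needed
        vector[index] += 1                 # accumulate the term's frequency
    return vector

tokens = "spark is fast and spark is scalable".split()
tf_vector = hashing_tf(tokens)
```

Note that two distinct terms can hash to the same bucket (a collision), which is the price paid for avoiding a global term-to-index dictionary.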
IDF is the inverse document frequency, which represents the information a term provides based on how rarely it appears across documents. IDF is a log-scaled inverse function of the number of documents containing the term:
IDF = log(TotalDocuments / DocumentsContainingTerm)
Once we have TF and IDF, we can compute the TF-IDF value by multiplying the TF and IDF:
TF-IDF = TF * IDF
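The two formulas can be combined into a short worked example in plain Python (the toy corpus below is invented for illustration, and this uses the unsmoothed log(N/df) form; Spark ML's IDF additionally applies smoothing to handle terms that appear in no document):

```python
import math

# Hypothetical three-document corpus for illustration.
docs = [
    ["spark", "is", "fast"],
    ["spark", "streaming"],
    ["hadoop", "is", "old"],
]

def tf(term, doc):
    """Raw term frequency: how many times the term occurs in the document."""
    return doc.count(term)

def idf(term, corpus):
    """Log-scaled inverse document frequency (assumes the term occurs in the corpus)."""
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    return math.log(n / df)

def tf_idf(term, doc, corpus):
    """TF-IDF = TF * IDF."""
    return tf(term, doc) * idf(term, corpus)
```

Here "spark" occurs in two of the three documents, so its IDF is log(3/2), while "fast" occurs in only one, giving the larger IDF of log(3): rarer terms carry more weight.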
We will now look at how we can generate TF using the HashingTF Transformer in Spark ML.