TF-IDF
There are lots of use cases where we want to find out which words are the most important to a document in a collection or corpus of documents, or in other words, which words add the most value to the document. Possible applications for such a measurement include search engines like Google or DuckDuckGo.
The most famous and widely used measurement for finding the importance of a word in a document is called TF-IDF. TF-IDF has two components: the TF (term frequency) and the IDF (inverse document frequency).
Term Frequency
The main idea of TF is that a word is important to a document if it occurs frequently. Suppose, for example, we have a set of documents and want to find out which of them are the most relevant to the search query "the offside rule" (i.e. documents relating to football rules). A simple way to start would be to eliminate all documents that do not contain all three words "the", "offside" and "rule"; however, this still leaves many documents that might not be relevant. To further distinguish between relevant and non-relevant documents, we can count the number of times each word occurs in each document. This is the so-called term frequency. However, longer documents could then have a greater term frequency than shorter ones, even though the word/term might not be any more relevant to them.
To solve this, the relative frequency is taken instead of the absolute frequency, i.e. the numerator is the raw count $f_{t,d}$ of the term $t$ in the document $d$ and the denominator is simply the total number of terms in document $d$:

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
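As a rough sketch (in Python, assuming the document has already been tokenised into a list of words; the function name is just for illustration), the relative term frequency could be computed like this:

```python
from collections import Counter

def term_frequency(term: str, document: list[str]) -> float:
    """Relative term frequency: the raw count of the term divided by
    the total number of terms in the document."""
    counts = Counter(document)
    return counts[term] / len(document)

doc = "the offside rule is the rule about offside in football".split()
print(term_frequency("offside", doc))  # 2 occurrences / 10 terms = 0.2
```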
There are also some common alternatives, for example using the frequency of the most frequently occurring term in the document as the denominator.
With the introduction of a hyperparameter $K$ you get the so-called augmented term frequency,

$$\mathrm{tf}(t, d) = K + (1 - K) \cdot \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}},$$

which is also commonly seen with $K = 0.5$ for shorter documents.
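A small sketch of this variant (again assuming a tokenised document; the function and parameter names are illustrative) could look as follows:

```python
from collections import Counter

def augmented_term_frequency(term: str, document: list[str], k: float = 0.5) -> float:
    """Augmented TF: K + (1 - K) * f(t, d) / max term frequency in the document,
    which dampens the bias towards longer documents."""
    counts = Counter(document)
    max_frequency = max(counts.values())
    return k + (1 - k) * counts[term] / max_frequency
```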
Inverse Document Frequency
The IDF measures how much information a word provides, i.e. whether it is common or rare across all documents. For example, the word "the" is very common, so TF alone might incorrectly rank documents that contain "the" more frequently higher than other documents. The word "the" is therefore not a good keyword for distinguishing relevant from non-relevant documents, compared to "offside" and "rule". To avoid this, an inverse document frequency factor is used, which gives a higher weight to words that occur rarely.
So for a corpus of documents $D$, with $N$ being the number of documents in the corpus and $n_t$ being the number of documents the term $t$ occurs in, the inverse document frequency can be defined as:

$$\mathrm{idf}(t, D) = \log \frac{N}{n_t}$$
To avoid division by zero, the denominator is also commonly changed to $1 + n_t$:

$$\mathrm{idf}(t, D) = \log \frac{N}{1 + n_t}$$
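A minimal sketch of the smoothed variant, assuming the corpus is a list of tokenised documents (the function name is again just for illustration):

```python
import math

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """IDF with the smoothed denominator 1 + n_t to avoid division by zero."""
    n_documents = len(corpus)
    n_t = sum(1 for document in corpus if term in document)
    return math.log(n_documents / (1 + n_t))
```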
We can then finally combine the two components to form the TF-IDF score of a term $t$ in the document $d$ of corpus $D$:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$
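Reusing the term_frequency and inverse_document_frequency sketches from above, the combination and its effect on the "the offside rule" example could look like this (the toy corpus is made up purely for illustration):

```python
def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF score of a term in a document of the corpus."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

corpus = [
    "the offside rule applies to football".split(),
    "the new traffic rule in the city".split(),
    "the recipe uses the freshest tomatoes".split(),
]
# "the" occurs in every document and scores low (even negative with the
# smoothed IDF), while "offside" occurs in only one document and scores higher.
print(tf_idf("the", corpus[0], corpus))      # ~ -0.05
print(tf_idf("offside", corpus[0], corpus))  # ~ 0.07
```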