luminary.blog
by Oz Akan

TF-IDF Simplified


Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used in Natural Language Processing (NLP) and machine learning to assess the importance of a word within a document relative to a larger collection of documents (corpus). It helps convert text data into numerical representations, making it useful for applications like text classification, document clustering, and information retrieval.

The goal of TF-IDF is to emphasize words that are important in a particular document while filtering out common words that appear frequently across many documents but offer little unique information (e.g., “the,” “is,” “and”).

Let’s dive into the details with some examples.

The Core Components of TF-IDF

Term Frequency (TF)

Term Frequency measures how often a term appears in a document. There are several ways to calculate it:

  1. Raw Count: Simply count the number of times a term appears
  2. Boolean Frequency: 1 if the term appears, 0 if not
  3. Logarithmically Scaled Frequency: log(1 + raw count)
  4. Normalized Frequency: (Count of term in document) / (Total terms in document)

The normalized frequency is most commonly used because it accounts for document length.

Example: Consider the sentence: “The cat sat on the mat.”

For the term “the”:

  • Raw count: 2
  • Normalized TF: 2/6 = 0.33 (appears 2 times in a 6-word document)

For the term “cat”:

  • Raw count: 1
  • Normalized TF: 1/6 = 0.17
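
To make the arithmetic concrete, here is a minimal Python sketch of normalized term frequency. Lowercasing and whitespace tokenization are simplifying assumptions, and the helper name is purely illustrative:

from collections import Counter

def normalized_tf(document):
    # Normalized TF: (count of term in document) / (total terms in document)
    tokens = document.lower().replace(".", "").split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

tf = normalized_tf("The cat sat on the mat.")
print(round(tf["the"], 2))  # 0.33  (2 of 6 words)
print(round(tf["cat"], 2))  # 0.17  (1 of 6 words)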

Inverse Document Frequency (IDF)

IDF measures how important or rare a term is across all documents in the corpus. It’s calculated as:

IDF(term) = log(Total number of documents / Number of documents containing the term)

The logarithm base is a matter of convention; the examples in this post use base 10.

Example: Imagine we have 10 documents, and the word “the” appears in all 10, while “cat” appears in only 3.

For “the”: IDF = log(10/10) = log(1) = 0
For “cat”: IDF = log(10/3) ≈ log(3.33) ≈ 0.52

Notice how common words like “the” get an IDF of 0, effectively removing them from consideration!
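
A matching sketch for IDF, using the base-10 logarithm as in the numbers above. The toy corpus is made up purely to mirror the example:

import math

def idf(term, documents):
    # IDF = log10(total documents / documents containing the term)
    containing = sum(1 for doc in documents if term in doc)
    return math.log10(len(documents) / containing)

# 10 toy documents: "the" appears in all 10, "cat" in only 3
corpus = [{"the", "cat"}] * 3 + [{"the", "dog"}] * 7
print(idf("the", corpus))            # 0.0
print(round(idf("cat", corpus), 2))  # 0.52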

TF-IDF Calculation

TF-IDF combines these two measures: TF-IDF = TF × IDF

Continuing our example:

  • For “the”: TF-IDF = 0.33 × 0 = 0
  • For “cat”: TF-IDF = 0.17 × 0.52 ≈ 0.09

This demonstrates why TF-IDF is so powerful: common words like “the” get scored as 0, while more meaningful words receive higher scores based on their importance in the specific document and rarity across all documents.

Real-World Example: Document Comparison

Let’s look at a more substantial example with three short documents:

  • Document 1: “Machine learning algorithms require data.”
  • Document 2: “Neural networks are a type of machine learning algorithm.”
  • Document 3: “Data science uses machine learning and statistics.”

First, let’s calculate term frequencies (using normalized TF). Document 1 has 5 words, Document 2 has 9, and Document 3 has 7, so a single occurrence contributes 1/5 = 0.2, 1/9 ≈ 0.111, and 1/7 ≈ 0.143 respectively:

Term         Doc 1 TF   Doc 2 TF   Doc 3 TF
machine      0.2        0.111      0.143
learning     0.2        0.111      0.143
algorithms   0.2        0          0
require      0.2        0          0
data         0.2        0          0.143
neural       0          0.111      0
networks     0          0.111      0
are          0          0.111      0
a            0          0.111      0
type         0          0.111      0
of           0          0.111      0
algorithm    0          0.111      0
science      0          0          0.143
uses         0          0          0.143
and          0          0          0.143
statistics   0          0          0.143

Next, let’s calculate the IDF for each term:

Term         Appears in # Docs   IDF = log(3/count)
machine      3                   log(3/3) = 0
learning     3                   log(3/3) = 0
algorithms   1                   log(3/1) ≈ 0.48
require      1                   log(3/1) ≈ 0.48
data         2                   log(3/2) ≈ 0.18
neural       1                   log(3/1) ≈ 0.48
networks     1                   log(3/1) ≈ 0.48
(and so on for the remaining terms)

Finally, we multiply TF × IDF for each term in each document:

For Document 1:

  • “machine”: 0.2 × 0 = 0
  • “learning”: 0.2 × 0 = 0
  • “algorithms”: 0.2 × 0.48 = 0.096
  • “require”: 0.2 × 0.48 = 0.096
  • “data”: 0.2 × 0.18 = 0.036

This process creates a numerical vector for each document, where distinctive terms have higher values and common terms (appearing across all documents) have zero weight.
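
Putting the pieces together, here is a from-scratch sketch that reproduces the hand calculation above (normalized TF, base-10 IDF, no smoothing). Note that scikit-learn's TfidfVectorizer uses different defaults, so its numbers will not match these exactly:

import math

docs = [
    "Machine learning algorithms require data.",
    "Neural networks are a type of machine learning algorithm.",
    "Data science uses machine learning and statistics.",
]
tokenized = [d.lower().replace(".", "").split() for d in docs]

def tf(term, doc):
    # Normalized term frequency within one tokenized document
    return doc.count(term) / len(doc)

def idf(term):
    # Base-10 IDF over the three-document corpus
    containing = sum(1 for doc in tokenized if term in doc)
    return math.log10(len(tokenized) / containing)

# One TF-IDF "vector" (term -> score) per document; matches the hand
# calculation up to rounding
vectors = [{term: tf(term, doc) * idf(term) for term in doc} for doc in tokenized]
for i, vec in enumerate(vectors, start=1):
    print(f"Document {i}:", {t: round(s, 3) for t, s in vec.items()})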

Applications of TF-IDF

1. Document Search and Ranking

When you search for “machine learning tutorial” in a search engine, TF-IDF helps rank articles by how relevant they are to your query. Documents with higher TF-IDF scores for your search terms will appear higher in results.
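
As a rough sketch of this idea using scikit-learn, the query can be treated as a tiny document and the corpus ranked by cosine similarity to it (the sample documents are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "A hands-on machine learning tutorial with examples.",
    "Stock market news and analysis.",
    "Deep learning is a branch of machine learning.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(["machine learning tutorial"])

# Rank documents by cosine similarity to the query, highest first
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")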

2. Document Clustering

TF-IDF vectors allow algorithms to automatically group similar documents together. Using cosine similarity between TF-IDF vectors, we can determine which documents are conceptually related, even if they use different vocabulary.
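
For example, a pairwise similarity matrix can be computed directly from the TF-IDF vectors. This is a minimal sketch; a real clustering workflow would feed these vectors into an algorithm such as k-means:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Machine learning algorithms require data.",
    "Neural networks are a type of machine learning algorithm.",
    "Data science uses machine learning and statistics.",
]

tfidf = TfidfVectorizer().fit_transform(docs)
similarity = cosine_similarity(tfidf)  # 3x3 matrix; entry [i][j] compares doc i and doc j
print(similarity.round(2))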

3. Content-Based Recommendation Systems

Many recommendation systems use TF-IDF to understand document content and recommend similar items to users. For example, a news aggregator might recommend articles based on the similarity of their TF-IDF vectors to articles you’ve previously read.

4. Text Classification

TF-IDF is frequently used as a feature extraction method for text classification tasks. For example, in spam detection, certain words might have higher TF-IDF scores in spam emails versus legitimate emails.
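
A minimal spam-detection sketch along these lines, where the tiny dataset and labels are invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free prize now, click here",
    "Meeting agenda for tomorrow attached",
    "Free money, limited time offer",
    "Please review the quarterly report",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

# TF-IDF features feeding a simple linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)
print(model.predict(["claim your free prize today"]))  # likely [1]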

5. Keyword Extraction

TF-IDF can identify the most important words in a document, which is useful for automatic summarization and keyword extraction.
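
A small sketch of keyword extraction: fit TF-IDF over a corpus, then take the highest-scoring terms of one document (the corpus and the choice of top_n are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Machine learning algorithms require data.",
    "Neural networks are a type of machine learning algorithm.",
    "Data science uses machine learning and statistics.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

top_n = 3
row = tfidf[0].toarray().ravel()       # TF-IDF scores for the first document
top_idx = row.argsort()[::-1][:top_n]  # indices of the highest scores
print([terms[i] for i in top_idx])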

Implementation Example

Here’s how you might implement TF-IDF using Python with the scikit-learn library:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "Machine learning algorithms require data.",
    "Neural networks are a type of machine learning algorithm.",
    "Data science uses machine learning and statistics."
]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Generate the TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF scores for each document
for i, doc in enumerate(documents):
    print(f"Document {i+1}:")
    # Get the TF-IDF scores for non-zero elements
    feature_index = tfidf_matrix[i, :].nonzero()[1]
    tfidf_scores = zip(feature_index, [tfidf_matrix[i, x] for x in feature_index])
    # Sort terms by TF-IDF score, highest first
    for idx, score in sorted(tfidf_scores, key=lambda x: x[1], reverse=True):
        print(f"  {feature_names[idx]}: {score:.4f}")

Avoiding Data Leakage

A critical consideration when using TF-IDF is preventing data leakage. This occurs when information from the test set influences the model training process.

For TF-IDF, this happens when the IDF component is calculated using the entire dataset, including test data. To avoid this:

  1. Calculate TF-IDF parameters (vocabulary and IDF values) using only the training set
  2. Apply these parameters to transform both training and test sets

In scikit-learn, this is implemented by:

  1. Using fit_transform() on the training data
  2. Using only transform() on the test data

Example in Python:

from sklearn.feature_extraction.text import TfidfVectorizer
train_docs = ["The cat sat on the mat.", "The dog barked at the cat."]
test_docs = ["A cat and a dog shared the mat."]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs) # Fit on training data
X_test = vectorizer.transform(test_docs) # Transform test data without refitting

This ensures IDF is learned from training data only.

Beyond Basic TF-IDF

Modern NLP has evolved beyond basic TF-IDF with techniques like:

  • N-grams: Considering sequences of words rather than just individual terms (a short sketch follows at the end of this section)
  • BM25: A probabilistic relevance model that improves upon TF-IDF
  • Word Embeddings: Dense vector representations like Word2Vec and GloVe
  • Contextual Embeddings: Models like BERT that consider the context of words

TF-IDF remains relevant due to its simplicity, interpretability, and effectiveness for many tasks. Despite newer techniques, TF-IDF continues to be an essential tool in the NLP toolkit, providing a strong baseline for text analysis and understanding.
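
For instance, the n-gram idea above plugs straight into the same vectorizer. A brief sketch, where ngram_range=(1, 2) keeps single words and adds two-word phrases as features:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(["machine learning algorithms require data"])
print(vectorizer.get_feature_names_out())
# includes unigrams like 'machine' as well as bigrams like 'machine learning'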

Mathematical Representations

Using the following notation:

  • t = a term (word)
  • d = a single document
  • D = the entire collection of documents (corpus)

1. Term Frequency (TF)

TF measures how frequently a word appears in a given document. It is calculated as:

\text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of words in document } d}

2. Inverse Document Frequency (IDF)

IDF gives more weight to rare words and less weight to common words. It is computed as:

\text{IDF}(t,D) = \log \left( \frac{N}{\text{DF}(t)} \right)

Where:

  • N = total number of documents in the corpus
  • DF(t) = number of documents that contain the term t

3. Computing TF-IDF

Now, multiply TF × IDF:

\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t,D)

