TF-IDF

A Simple Explanation - By Varsha Saini

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important a word is to a document within a collection (corpus): a term scores highly when it appears often in one document but rarely across the rest of the collection.

While working with textual data, we need to convert text into numbers, that is, represent words as vectors. The same word may have different numerical representations depending on the method used to create them. TF-IDF is one such method for producing vector representations of text in a corpus.

It has two parts: Term Frequency and Inverse Document Frequency.

Term Frequency

It represents the contribution of a word to a document: a term that appears frequently in a document has a high term frequency. It is commonly computed as the number of times the term occurs in the document divided by the total number of terms in that document.
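As a minimal sketch, term frequency can be computed with a simple word count. The whitespace tokenization here is an assumption; a real pipeline would also handle punctuation, stop words, and so on.

```python
from collections import Counter

def term_frequency(document):
    # Naive tokenization: lowercase and split on whitespace
    # (an assumption for illustration only).
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # TF of a term = occurrences of the term / total terms in the document
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("the cat sat on the mat")
# "the" occurs 2 times out of 6 tokens, so its TF is 2/6
```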

Inverse Document Frequency

It is based on the observation that if a word appears in every document of the corpus, it probably does not carry much distinguishing information. Therefore, the more common a word is across documents, the less significant it is considered, and the lower its IDF. A standard formulation is IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing the term t.
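A small sketch of the classic, unsmoothed IDF formula (many libraries add smoothing terms, which this example deliberately omits):

```python
import math

def inverse_document_frequency(documents):
    # IDF(t) = log(N / df(t)), where df(t) is the number of
    # documents that contain the term t.
    tokenized = [set(doc.lower().split()) for doc in documents]
    n = len(tokenized)
    vocab = set().union(*tokenized)
    return {term: math.log(n / sum(term in doc for doc in tokenized))
            for term in vocab}

docs = ["the cat sat", "the dog barked", "the cat ran"]
idf = inverse_document_frequency(docs)
# "the" appears in all 3 documents, so its IDF is log(3/3) = 0
```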

TF-IDF = TF * IDF

By multiplying the Term Frequency and Inverse Document Frequency of a word, we get a good measure of that word's importance in a document: the score rewards words that are frequent within a document while penalizing words that are common to all documents.
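Putting the two parts together, a minimal TF-IDF implementation might look like this. It uses the plain TF * log(N / df) weighting from above; production libraries such as scikit-learn apply smoothing and normalization on top of this.

```python
import math
from collections import Counter

def tfidf(documents):
    # Build one sparse TF-IDF vector (as a dict) per document.
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    # df[t] = number of documents containing term t
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in counts.items()})
    return vectors

docs = ["the cat sat on the mat", "the dog sat on the log"]
vecs = tfidf(docs)
# "the" occurs in both documents, so its TF-IDF weight is 0 in both,
# while document-specific words like "cat" get a positive weight
```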

Advantages of TF-IDF

It is a simple and computationally cheap method that weights words by their importance in the corpus, and the resulting vectors can be used to compute the similarity between two documents.
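Document similarity is typically computed as the cosine similarity between the documents' TF-IDF vectors. A sketch, assuming the vectors are stored as sparse term-to-weight dicts (the example weights below are made up for illustration):

```python
import math

def cosine_similarity(vec_a, vec_b):
    # cos(a, b) = dot(a, b) / (|a| * |b|) over sparse dict vectors
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for two short documents
doc_a = {"cat": 0.5, "sat": 0.2}
doc_b = {"cat": 0.5, "dog": 0.3}
# A document compared with itself has similarity 1.0
```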

Disadvantages of TF-IDF

It does not capture word order or position in the text, semantics, or co-occurrences across documents. For example, two documents that express the same idea with different but synonymous vocabulary will appear dissimilar.