tf-idf is one of the most fundamental concepts in information retrieval. It’s the product of 2 statistics, the term frequency (tf), which is the numeric value representing how many times a word appears in a document. The inverse document frequency (idf) illustrates how much information a word provides, i.e., if it’s common or rare in examined documents. Despite its simplicity, it’s still one of the most popular term-weighting schemes; variances of the td-idf are used extensively in search engines as a ranking function to evaluate the relevance between users’ queries and a webpage.
1. Term frequency
Let t be the term we want to find the frequency in document d, the output of the tf(t,d) function is simply the total number of times the term appeared in the document divided by the total number of terms in the document:
For example, we have the document d with the following content: “I am learning information retrieval, and you are learning information retrieval as well,“ and the term t we want to find its frequency is retrieval, then tf(t,d)=2/13 = 0.1538.
Keep in mind that there are various other ways to compute the term frequency, such as only the count frequency of the term itself, boolean frequency (whether a term is in the document or not), etc…
1.1 Bag of words
While computing the term frequency for documents, we can often use bag-of-words for data preprocessing. In this phase, bag-of-words describe the frequency of each word that appeared in the document, disregarding grammar detail and order of words. With the document given in the previous session, we can have a simple bag-of-words representation as follow:
{ "as": 1, "are": 1, "and": 1, "retrieval": 2, "I": 1, "well" : 1, "learning": 2, "information": 2, "am": 1, "you": 1 } |
1.2 Problem with TF
However, let’s say we want to determine whether a query is related to a given document or not; simply counting the term frequency may not be so effective; for example, we have two documents:
And the query q is “the learning process”, we calculate the relevancy score by summing up the frequency of each term in q in the d1 and d2 documents. {score}(q,d) = {{\sum_{w \in q}{tf_{w,d}}.
score(q, d1) = 2/9 + 0/9 + 0/9 = 2/9 \approx 0.22
score(q, d2) = 0/13 + 1/13 + 0/13 = 1/13 \approx 0.077
Document d1 has a higher score, while clearly, we know that document d2 is more relevant.
2. Inverse Document Frequency
As we can see, if using the only term frequency functions, we can run into a situation in which unimportant terms are given more weight than important ones. In the example of the previous section, the word “the” is one of the stop words, which is insignificant and should be filtered out before processing.
The inverse document frequency function measures how much information a term provides in documents, i.e., whether the term rarely appears or is common between documents. Stop or common words can exist in most documents without adding much value. At the same time, a rare term only appears in a small number of documents, indeed giving us more relevant information. The inverse document frequency will assign a lower score for common unimportant terms and give us a higher score for more important ones. The idf function is presented as follows:
- Where N is the number of the documents
- The denominator is the number of documents in which the term t appears. If the term doesn’t appear in any document, this can lead to divide-by-zero; hence it’s common to add 1 to the denominator, i.e. 1 + {|{d \in D: t \in d}|}
We can see that the inverse document frequency function got its name because instead of dividing the number of documents where the terms appeared by the total number of documents, we inverse them. And we take the logarithm (here, we use base e) at the end to compute the frequency score.
Now, let’s compute the idf of this 2 documents D:
- idf("something", D) = \log\frac{2}{2} = 0
- idf("learn", D) = \log\frac{2}{1} = 0.69314
3. Term Frequency-inverse Document Frequency
The two weight schemes tf and idf can be combined to calculate the relevant score as the following:
A high weighting score given by this function is high if both values provided by tf(t,d) and idf(t, D) are high, i.e., when the term has a high frequency in a given document and a low happening frequency in the total number of existing documents. The tdidf(t,d,D) will give us a 0 score if the term t appears in every document (because idf(t,D) = 0), hence using tfidf we can filter out common terms between documents.
4. Code Example
Now we are ready to create a simple program that calculates the relevancy of documents to a user’s query using the td-idf function. We first create a class that models our document:
record Document(int docId, String doc) {} |
Then, we create a function that calculates the term frequency score:
public static double getTermFrequencyScore(String term, Document document) { |
Next, we make another function to calculate the inverse document frequency score; the input will be the term along with the list of documents:
public static double getInverseDocumentFrequencyScore(String term, List<Document> documents) { int size = documents.size(); |
Finally, we can have the function getTfIdfScore, which is the product of returned values from 2 previous functions:
public static double getTfIdfScore(String term, List<Document> allDocs, Document doc) { |
We’re ready to make use of our methods and put them in the utility class RelevancyScoreUtils; next, we create another class named DocumentRelevancyCalculator which has a function that takes a user’s query as input and returns a list of pertaining documents:
record DocumentTermScore(String term, Document doc, double relScore) { } class DocumentRelevancyCalculator { public DocumentRelevancyCalculator(List<Document> documents) { public List<DocumentTermScore> getRelevantDocuments(String query, int k) { |
Finally, we can create a test class that harnesses the function we’ve just created:
class DocumentRelevancyCalculatorTest { @BeforeEach @Test |
To wrap up, term frequency and inverse document frequency scores are often computed together to evaluate the relevancy of a given term to documents or a corpus. The tf-idf function gives a higher weighting score when a term appears more in a document and rarely in other documents, and it gives us a 0 score when the term appears in every document.