Newsfeed Recommendation Engine - Building up its Tech stack from Scratch

Part 01

Introduction

Nowadays, users are bombarded with news from many sources every day, and they waste time searching for the stories that actually interest them. Companies such as Netflix, Amazon, Google, and Yandex use recommender systems to help their users find the right products, movies, videos, or news. In the same way, Cốc Cốc Newsfeed - The Personalized Recommendation System has been developed to relieve information overload and suggest the right articles to the right readers at the right time. This is challenging because a recommender system must consume and process a large volume of raw data generated by millions of users, along with other signals, and distill the most important information before it can find the best match between users and articles.

With the Personalized Recommendation System serving approximately 600-700K daily users, Cốc Cốc Newsfeed benefits both the services and users:

Identify the most relevant articles/news to users
Users find articles/news that fit with their interest
Personalized content
Improve user engagement on Cốc Cốc Product (intention, utilization)
Approach new users & attract a higher volume of traffic/interaction to news websites

Personalized Recommendation Architecture

Recommender Systems (RS) are systems designed to recommend items to users based on many factors, e.g. the user's preferences or an item's popularity. When recommender systems are discussed, people often concentrate on how the algorithms are implemented or on the latest recommendation algorithm. However, a recommender system is much more complex than its algorithms. Once you deploy one to production, you soon recognize it as a systems project with many issues related to performance, data storage, workload, and resources. A recommender system can therefore be understood as an integration of recommendation algorithms and system engineering.

Recommender systems = Recommendation algorithms + System engineering

The figure below illustrates the recommender framework of Cốc Cốc Newsfeed, which consists of 5 discrete components:

Data Crawler
Data Storage
Data processing
Training offline
Serving online

Technical Architecture for NewsFeed Recommendation Systems

♦ Data crawler

With Cốc Cốc Newsfeed, users always get the hottest information that matches their preferences and interests. Articles are crawled frequently from over 200 trusted sources such as dantri.com.vn, 24h.com.vn, and vnexpress.net, to name but a few. Each crawled article carries the attributes that are vital for recommendation, namely event_time, URL, title, description, content, tags, and category, as shown in the block below.

{
    'event_time': '2021-11-11 09:02:00',
    'id': '7863004916896207200',
    'url': 'https://dantri.com.vn/the-thao/bao-thai-lan-noi-gi-ve-viec-hlv-park-gia-han-hop-dong-voi-tuyen-viet-nam-20211110233317757.htm',
    'title': 'Báo Thái Lan nói gì về việc HLV Park gia hạn hợp đồng với tuyển Việt Nam?',
    'description': '(Dân trí) - Thông tin HLV Park Hang Seo tiếp tục ở lại với đội tuyển Việt Nam thêm một năm là thông tin được truyền thông Đông Nam Á quan tâm, bởi nó ảnh',
    'content': 'Tờ Siam Sport của Thái Lan viết: "Tin vui cho đội tuyển Việt Nam trước cuộc đọ sức với Nhật Bản, HLV Park Hang Seo gia hạn hợp đồng đến năm 2023. Đây là động thái chấm dứt những tin đồn về việc ông Park có thể chia tay đội tuyển Việt Nam, từng xuất hiện trước đó. Liên đoàn bóng đá Việt Nam (VFF) xác nhận chính thức về việc gia hạn hợp đồng, điều này sẽ giúp ông Park nắm đội tuyển Việt Nam đến ngày 31/1/2023", vẫn là thông tin được đăng tải từ tờ nhật báo thể thao hàng đầu Thái Lan.',
    'tags': 'HLV Park Hang Seo, đội tuyển Việt Nam, vòng loại World Cup',
    'category_id': 15000,
    'domain': 'dantri.com.vn',
    'image_url': '{BUCKET_URL}/20211111/7863004916896207200.jpg',
}

10 main categories for news at Vnexpress

Normally, on each site, a writer or editor assigns an article to one of several predefined main categories, such as Entertainment, Sport, or Law, based on its content, as shown in the figures above. In some cases deeper categories are used, e.g. Football and Tennis are subcategories of Sport. News categorization makes it easier for publishers to organize, distribute, and analyze content consistently and effectively. However, there is no standard category hierarchy: Zingnews has 12 main categories while Vnexpress has 10, which leaves the crawled articles inconsistently labeled. For this reason, we standardize those categories and define 14 main categories for news that

help readers explore their favorite content and control their feeds more easily.
support analytics on which types of news are the most popular and which are less appreciated.
ensure that news about specific subjects is surfaced accurately and consistently.

Users can control their feed by activating category buttons on Cốc Cốc Newsfeed

Since more than 10K articles are crawled daily, manual news categorization is not feasible. Therefore, we researched and implemented an automated model that categorizes an article based on its title, description, and content. It is a supervised model trained on 316K articles from the 14 main categories, with an imbalanced class distribution. The details of the training and testing process will be discussed in another post.

Along with the category attribute, each article is also represented in a topic distribution space. Topic modeling is an unsupervised machine learning technique; our model is trained on a large dataset (~6M articles) with 150 topics. Each topic consists of a list of keywords that often appear together in a document. For instance, the first topic has keywords such as film, actor, actress, oscar, producer, cinema, and character, so we named it /ENTERTAINMENT/FILM & TV SHOWS, while the second topic relates to /EDUCATION, with educational keywords such as education, program, teacher, evolution, students, and high school.

♦ Data Storage

As stated before, recommender systems consume and process large amounts of many types of data. Along with article data, we also have user data and behavior data. The user profile data includes users' gender and age, the news they read, and their reading preferences, while the behavior data refers to the interactions between users and articles.

For convenience, all of this data is stored in ClickHouse, an open-source, high-performance columnar OLAP database management system.

♦ Data Processing

The main role of this stage is to form user features, item features, and behavior features for the later steps. Articles are first deduplicated and filtered based on criteria such as: content is not cloned, breaking news is eliminated each day, … As a result, our Newsfeed does not display duplicate content even when it comes from various sources, and it always shows the hottest personalized news.
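The text does not say how cloned content is detected. As one possible illustration only, here is a minimal sketch that assumes word-shingle Jaccard similarity; the function names, the shingle size, and the 0.8 threshold are all hypothetical and not taken from the production system.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(articles, threshold=0.8):
    """Keep the first article of every group of near-duplicates.

    `articles` is a list of dicts with 'id' and 'content' keys, like the
    crawled record shown earlier. The threshold is a hypothetical value.
    """
    kept = []           # articles that survive deduplication
    kept_shingles = []  # their shingle sets, in the same order
    for article in articles:
        current = shingles(article["content"])
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of an article we already kept
        kept.append(article)
        kept_shingles.append(current)
    return kept
```

This pairwise comparison is quadratic in the number of articles, so it only illustrates the idea; a real pipeline at this scale would likely need an approximate method.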

Article features include a 14-dimensional category distribution, a 150-dimensional topic distribution vector, and a ranking score. The ranking score of an article can be understood as its popularity on the Internet: the higher the ranking score, the more popular the article. This score is calculated from the number of clicks and impressions on Newsfeed and the article's published time:

where:

NF_impression: the number of impressions of an article on NewsFeed. This compensates for articles that are shown on Newsfeed very often.
NF_click: the number of clicks on an article on NewsFeed. More clicks mean a higher score. To handle the cold-start problem for a newly crawled article, it is instead the number of clicks in the Cốc Cốc browser, where users access the websites directly.
published_hour: the number of hours from the time the article was published on its website to the current time. This guarantees that recent articles have higher priority than old ones.
G: constant gravity coefficient (G=1.7).
Lambda: controls the compensation for high impressions on NewsFeed.

The ranking score decreases as published hours increase
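The formula itself is not reproduced in this text, so the sketch below is only one plausible shape that is consistent with the definitions above: clicks raise the score, impressions damp it through Lambda, and age lowers it through G. The functional form, the +2 hour offset, and the default `lam=0.5` are assumptions made for illustration, not the production formula.

```python
def ranking_score(nf_click, nf_impression, published_hour, gravity=1.7, lam=0.5):
    """Hypothetical gravity-style score; NOT the exact production formula."""
    if published_hour < 0:
        raise ValueError("published_hour must be non-negative")
    # Damp clicks by impressions so heavily shown articles are not over-rewarded.
    damped_clicks = nf_click / (1.0 + nf_impression) ** lam
    # Older articles are pushed down by the gravity exponent (G=1.7 in the text).
    return damped_clicks / (published_hour + 2.0) ** gravity
```

Under these assumptions, an article with the same clicks and impressions scores lower as `published_hour` grows, which matches the behavior described in the caption above.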

User features are determined from the user's reading history. To capture users' preferences over both long and short periods, we define long-term interest as the user's interest over two weeks and short-term interest as the user's interest within the last 6 hours. This ensures that the Recommender System not only learns the users' lasting interests but also serves their immediate ones. For example, the figure below illustrates the browser log of a user; from it we can determine that, in the long term, the user is most keen on category 15000, followed by 16000, 19000, and 20000.
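The two windows can be sketched as follows. This is a minimal illustration that assumes interest is measured as the normalized count of articles read per category inside each window; the source gives the window lengths but not the exact computation, and the sample log is made up.

```python
from collections import Counter
from datetime import datetime, timedelta

LONG_TERM = timedelta(days=14)   # two weeks, as stated in the text
SHORT_TERM = timedelta(hours=6)  # six hours, as stated in the text


def interest_distribution(reads, now, window):
    """Normalized category frequencies of the reads that fall inside the window.

    `reads` is a list of (event_time, category_id) pairs.
    """
    counts = Counter(
        category for event_time, category in reads
        if timedelta(0) <= now - event_time <= window
    )
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}


# Hypothetical reading log for one user.
now = datetime(2021, 11, 11, 12, 0)
reads = [
    (datetime(2021, 11, 11, 9, 0), 15000),
    (datetime(2021, 11, 10, 9, 0), 15000),
    (datetime(2021, 11, 5, 9, 0), 16000),
    (datetime(2021, 10, 1, 9, 0), 19000),  # older than two weeks, ignored
]
long_term = interest_distribution(reads, now, LONG_TERM)
short_term = interest_distribution(reads, now, SHORT_TERM)
```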

The behavior features reflect the interactions between a user and articles: a user can click, view, like, or block an article. Specifically, in our Newsfeed, readers can control their feed by following categories they are interested in or by blocking a domain to prevent articles from that source from being shown on their feed. All events are tracked and pushed to Kafka, which allows multiple consumer groups to consume its messages.

♦ Training Offline

In this section, we concentrate on the matching algorithms in a recommender system. More than 10K articles are crawled every day, so the accumulated number of articles available to recommend to a user can reach 50K over several days. Hence, to serve 2.5 million Cốc Cốc users, the matching module aims to select, from this huge pool, a small proportion of news that the user may like. This module combines several algorithms such as Content-based filtering, Collaborative filtering (User-based & Item-based), and GraphSAGE. We will discuss those algorithms in detail in another post.
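To make the matching step concrete, here is a minimal content-based sketch. It assumes that a user interest vector and each article's topic distribution live in the same space and are compared with cosine similarity; the source names content-based filtering but does not describe the similarity measure, so `cosine`, `match_candidates`, and the default `k` are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_candidates(user_vector, article_vectors, k=100):
    """Return the ids of the k articles most similar to the user vector.

    `article_vectors` maps article id -> topic distribution vector.
    """
    return heapq.nlargest(
        k, article_vectors, key=lambda aid: cosine(user_vector, article_vectors[aid])
    )
```

In practice the vectors would be the 150-dimensional topic distributions described earlier; scanning the whole pool for every user is only workable as an illustration.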

After that, the ranking module ranks the recommended candidate set from most to least preferred based on the user's preferences. Ranking can be framed as a binary classification task with two labels (0 means no click, 1 means click), with the recommended articles sorted by predicted click probability. However, in the first version of the Recommendation System, we use the ranking score described in the previous section to rank candidates before caching them on our Redis Cluster.

♦ Serving online

When a user opens a new tab and scrolls down, the frontend API fetches a recommended list of articles from the Redis Cluster and displays them on the newsfeed. Each page has several components, such as short-term, long-term, and popular recommendations, mixed in predefined proportions. This enables users to follow their favorite news and discover other interesting articles as well.
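The mixing of components on a page can be sketched as below. This is an illustration under assumptions: the three input lists are already ranked, the 40/40/20 split and the page size of 10 are made-up values, and `compose_page` is a hypothetical helper rather than the actual frontend API.

```python
def compose_page(short_term, long_term, popular, page_size=10,
                 proportions=(0.4, 0.4, 0.2)):
    """Mix three ranked lists of article ids into one page without duplicates.

    `proportions` gives the share of the page for short-term, long-term and
    popular recommendations; the default values are hypothetical.
    """
    sources = (short_term, long_term, popular)
    quotas = [int(page_size * p) for p in proportions]
    page, seen = [], set()
    for source, quota in zip(sources, quotas):
        taken = 0
        for article_id in source:
            if taken == quota:
                break
            if article_id not in seen:
                seen.add(article_id)
                page.append(article_id)
                taken += 1
    # Fill any remaining slots (rounding, short lists) from the leftovers in order.
    for source in sources:
        for article_id in source:
            if len(page) == page_size:
                return page
            if article_id not in seen:
                seen.add(article_id)
                page.append(article_id)
    return page
```

An article that appears in more than one list is shown only once, and a list that is too short to fill its quota is backfilled from the other lists.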

Deployment & Monitoring

All services and APIs are dockerized and deployed to our K8S cluster. The namespace, image, resources, environment configuration, and command are declared explicitly in advance. For example, here is the ConfigMap in which we declare environment variables:

apiVersion: v1
kind: ConfigMap
metadata:
  name: Cốc Cốc-nre
data:
  CLICKHOUSE_HOST: "localhost"
  CLICKHOUSE_PORT: "8443"
  CLICKHOUSE_USER: "newsfeed"
  ...
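On the service side, such values are typically read from the process environment. The sketch below assumes the ConfigMap entries are exposed to the container as environment variables; `load_clickhouse_config` is a hypothetical helper, and the fallback values simply mirror the ones shown above.

```python
import os


def load_clickhouse_config(environ=os.environ):
    """Read ClickHouse settings from environment variables.

    The variable names match the ConfigMap above; the fallback values are
    placeholders for local runs.
    """
    return {
        "host": environ.get("CLICKHOUSE_HOST", "localhost"),
        "port": int(environ.get("CLICKHOUSE_PORT", "8443")),  # env values are strings
        "user": environ.get("CLICKHOUSE_USER", "newsfeed"),
    }
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.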

The monitoring dashboard observes various aspects of the Recommender System such as DAU, CTR, number of impressions/clicks, memory used & keyspace of Redis Cluster, …

Things to improve

At the moment, the Newsfeed Recommender System runs smoothly and effectively, providing a trustworthy product for Cốc Cốc users. However, we still face some issues with Redis Cluster performance: each time the recommended lists are updated for millions of users, the CPU load is quite high. Furthermore, to get better recommendations, the simple ranking algorithm should be replaced by more advanced methods such as tree-based models or Graph Neural Networks, which require much more resources, including GPUs.


Phong Do