We’re the Advertising Management team at Cốc Cốc, and we are responsible for developing some server parts of the Demand Side Platform in the Advertising System of Cốc Cốc. We build the platform to help our advertisers manage and optimize their ads.
The problem
In our ads platform, we have 2 primary groups of data that need to be shown in UI:
The first one is data input by the user: Campaign, ads, settings, targeting options like keywords, interests, demographics, topics. This data need the relationship, transactional to constrain data, so we use MariaDB Galera Cluster.
The second one is statistics for campaigns, ads, settings, targeting options… The statistics are a number of events like clicks, shows, view, hit...
Currently, we have about 200-250 million new events daily, so the total of events is huge, years after years. Now, we have about 345 billion events (from 2013), and the number of daily events increased over the years.
As people expected from our UI, they want to be reported in real-time, with data that is sortable and filterable, and for the non-fixed period in just seconds.
One of our mission is to design a system that supports building a flexible report for UI, so that users can select any fields of data that they want, with metrics. Also, those metrics can be broken down by any dimension. For example:
Providing reports for the campaign, advert, and matching broken down metrics by quarter.
Providing reports by the campaign type from day X → Y, breaking down metrics by day of week and location.
Providing reports by location from day X → Y, breaking down metrics by campaign type, and time.
Providing a report for the whole system at the account level, breaking down metrics by accounts.
Type of reports is defined by the user, we can predefine some reports, and then the user can custom it as they want, and then save it.
The solution
Several years ago, our team use Mongo DB to store statistics, the statistics are split by day as collections like:
1 statistics_20190101
2 statistics_20190102
3 statistics_20190103
At that time, we have a problem with showing reports in UI: the data comes from 2 separated databases so it is impossible to fetch data, which to be sorted and filtered from the 2 sources. For example, if we want to provide reports like:
list of campaigns with name contains keyword “laptop”, which have ctr < 0,01% and show < 1000 from 2021-01-01 → 2021-01-30, sort result by spending.
In this case, database MySQL can help to filter out the list of campaign names containing the keyword “laptop”, but the statistics are not in MySQL so we can not sort and filter out in MySQL. If the result from MySQL is 1000 campaigns, then we’ll have to fetch statistics for 1000 campaigns from mongo, surf over 30 collections (for 30 days) to aggregate the data, and then do sort, paginate records by code. This is really complicated, it’s low in performance. We can not provide flexible reports for clients with basic features like selecting, sorting, filtering, paginating data from UI at the time.
♦ So what did we do?
To build a report which can sort, filter, paginate data by database, all of them should be in the same source. It is impossible to query from 2 data sources with flexible queries.
ClickHouse® is an open-source, high-performance columnar OLAP database management system for real-time analytics using SQL.
ClickHouse uses columnar storage, which is hundreds of times faster than MySQL. Even though MySQL’s index has queries, it still has to iterate through billions of records to apply aggregate functions. Columnar databases are significantly faster for grouping and aggregating a large amount of data.
ClickHouse also supports dictionaries with many different sources, which is the most important thing that we need for a flexible report builder.
ClickHouse supports aggregating data automatically from a big table to a smaller table.
ClickHouse supports many types of table engines. It helps to query through many different databases using one query. For example, we can easily make a query to join a MySQL table (by using a remote table in CH) with a ClickHouse table with just one query. And that’s pretty simple for the coding team.
Back to the requirement. We want to build a report for this:
list of campaigns with name contain keyword “laptop” , which have ctr < 0,01% and show < 1000 from 2021-01-01 → 2021-01-30, sort result by spending.
Let's see how we design the event table. The event table will contain all the raw events we have in the system. As you see in the aggregated table, we aggregated statistics at the campaign level only, with the id setting available. For example, we allow users to set location settings and set schedules at the campaign level, however, on lower levels, they were not allowed to do that. That’s why we will only aggregate these data at the campaign level.
Values of column type could be show, click, view_25_percent, hit…
total_event is the total number of events, grouped by all other fields: date, hour, client_id, campaign_id…. and some other ids.
Basically, we will see the number of data records in the aggregated table be reduced, and the number of columns is also reduced => So we can use this table to provide reports at the campaign level. For other levels, we do a similar thing.
♦ How can we query the report?
Here, we fetch the campaigns of user 123 with ctr < 0,01 % and show < 1000 from 2021-11-03 till 2021-12-02
8 if(sum(i_clicks) = 0, null, sum(i_spending) / sum(i_clicks)) as `cpc`,
9 FROM (
10 SELECT campaign_id,
11if(type = 'show', sumMerge(total_events), 0) as i_shows,
12if(type = 'click', sumMerge(total_events), 0) as i_clicks,
13 sumMerge(spending) as i_spending
14FROM aggregated_campaign_statistics
15WHEREdate <= '2021-12-02'
16 AND date >= '2021-11-03'
17 AND client_id = 123
18 AND (type = 'show' OR type = 'click')
19 GROUP BY campaign_id, type, user_id)
20GROUP BY campaign_id, campaign_name
21HAVING ctr < 0.01 and shows < 1000 and lowerUTF8(`campaign_name`) LIKE '%laptop%'
22ORDER BY campaign_id ASC
23LIMIT0,20;
We use the dictionary to fetch related data from MySQL, which means data in MySQL DB that can be used for the dictionary is limited.
In our team, we do not store all information from MySQL in the ClickHouse dictionary. Instead, we store active data only, depending on the level of data.
One of our challenges is to write a report builder with flexible code. We will have more metrics, more data dimensions over time, meaning our code should be flexible to add more fields without refactoring.
Let's try to reduce the full query above to this template:
1 SELECT %(clickHouseOuterFields)s
2 FROM(
3 SELECT %(clickHouseInnerFields)s
4FROM %(clickHouseInnerTable)s
5 WHERE %(clickHouseInnerWhere)s
6 %(clickHouseInnerGroupBy)s
7 )
8 %(clickHouseOuterGroupBy)s
9 %(clickHouseOuterHaving)s
10 %(sqlOrder)s
11 %(sqlLimit)s
And now, we can see that all we need to code is to fill params into the template. One field in the request could appear in some positions of the template, depending on the context:
You just want to fetch that field out
You want to filter by that field
- Some fields need to be put in the WHERE clause
- Some fields need to put in the HAVING clause
You want to sort by that field
If the field is a dimension, you need to put it in GROUP BY as well, so we decided to use config for fields like this:
Basically, we will need to write an SQLBuilder, that will combine the config fields and template to be an SQL query, and then which executes it. This is how it works:
So, you got it? right? That is how we build a report builder at for QC. Performance API request for that is about 300-400ms, so for us, it is acceptable to render in UI.
Other known issues
The solution only works with metrics that can be aggregated:
- clicks
- shows
- spending
- CTR
- CPC
- CPM
- views
- hits
- avg position of ads
- mute
- unmute
- VTR
It doesn't work with metrics that can not be aggregated like
- reach: the number of users who saw the ads
* This number can be added to the report but can not be supported to sort or filter.
* If we want to support sorting and filtering, it must be pre-calculated and stored somewhere else, and we are still looking for a solution.
Finally, I hope that our solution can be useful to you. Maybe it can help you solve many of your related problems.