Preface
Linux tracing is a kind of a new buzzword these days in the performance analysis area. But IMHO the word “tracing” is just a catchy common word that represents 2 methods for collecting events or records for code execution: profiling and event tracing.
In this article, I’ll explain those 2 concepts with examples using bpftrace and perf.
This is the second article in my practical Linux tracing series, see the first one here. As people complain about the long previous post, I’ll keep this short.
Profiling
Profiling is a method for collecting events by sampling or taking snapshots of a whole big object (process or event of a whole system) at a specific frequency.
If our sample rate is fast enough, we can virtually catch all the events executed on your system. Otherwise, you may miss events that were executed in the period of time between our sample actions.
Imagine our server is a mega mall and our applications are people moving back and forth there. You can think about profiling as continuously taking pictures of that mall at a fixed shutter speed.
Linux has perf_event_open syscall, which can help us to do the profiling at the system level. As Julia Evans explained here, one of the things it can do is save the state of the CPU’s registers every time we sample, which is pretty much understandable for what happens behind the scene.
With us, the easiest way to do profiling is by using Perf, for example:
perf record -a -g -F 99 -- sleep 10
|
This will continuously take snapshots (or pictures in our analogy) 99 times per second, for 10 seconds long. We can examine the data later by perf record or perf script
Other tools like bpftrace also provide profiling capability, but actually still use the same data source as perf (from perf_event_output). For example:
bpftrace -e 'profile:hz:99 {@[kstack]=count();}'
|
Profiling this way can help us to have a broad view of our system, which is very cool. But not really a full picture.
It’s simply because of the nature of sampling.
Your rate may not be able to catch up with everything (especially kernel events, which are super fast, at nanoseconds)
Even though you can increase such a rate limit here, trying to catch up
/proc/sys/kernel/perf_event_max_sample_rate
|
it still can make a big impact on your CPU performance or disk IO (in the case of using perf, as it needs to dump several Gbytes of data filesystem), which is also bad.
There’s no perfect rule for setting the rate, it all depends on your system workload.
For example, I can get this warning in while profiling 9999 times per second with dwarf call-graph method, in one of my not-very-busy hosts
perf record -a -g -F 9999 --call-graph dwarf -- sleep 10
|
To avoid overloading our system this way, we better know what exactly we need to trace, instead of profiling everything. Which is another method for tracing that we’ll discuss next.
Event Tracing
As its name self-explains, we trace only some specific event[s] that we need. Just much less overhead in comparison with profiling. And every time such an event happens, we can catch it immediately.
If we get back to the analogy with a mega mall and tracking people above, event tracing should be the action of tracking 1 specific person vs 1 shop only. Every time a person enters a shop, there will be a sensor that triggers our camera to take a shoot immediately.
Event tracing works in pretty much the same way. By using probe (kprobe / uprobe) or tracepoints (Linux kernel tracepoints or User Defined Static Tracepoints), we can get notified when a function is executed or returned. More than that, we can know much more about such execution like parameters, return value, latency, frequency …
For example, if you have a secret file called homework and want to get notified any time somebody opens it, you can track it this way:
bpftrace -e '
|
You can test by opening another terminal and cat that file, then gotcha!
[root@tungdam10-virt.dev:~]# bpftrace -e '
|
Here we focus on only 2 tracepoints sys_enter_open and sys_enter_openat, which is specific enough, I guess.
If you choose the event to trace well, you still can show quite a lot of info about the overall system.
For example, Brendan Gregg’s execsnoop can tell you which command is exec-ed on your host by tracing only 1 tracepoint. I’ll write it here as a one-liner bpftrace so you can see how easy it is
bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
|
Super lightweight and useful.
Which method you should use then?
Depends on our purpose.
If you know what exactly the event you need to check, then use event tracing.
If you don’t know what to do yet, start with profiling. You can examine the big picture first (even if it’s a coarse one), pick some suspicious spots then drill down by using event tracing.
Profiling is a better source for stack trace analysis (especially when it’s visualized by FlameGraph), to find the hot code.
Event tracing is better for function drill-down analysis. It’s much easier to find a slow code this way.
Thanks for reading!
***
For many articles on tracing and stories of the everyday working of Mr. Tung Dam, Cốc Cốc Head of Infrastructure and Support Department, please continue to be notified on our latest blog!
For more of Mr. Tung Dam's sharing: