Chapter 9. Real-World Use Cases

The most important question to ask yourself when adopting a new technology is: “What are its real-world use cases?” That’s why we decided to interview the creators of some of the most exciting BPF projects and let them share their ideas and experience.

Sysdig eBPF God Mode

Sysdig, the company that makes the eponymous open source Linux troubleshooting tool, started playing with eBPF in 2017 under kernel 4.11.

Historically, it has used a kernel module to extract events and do all the kernel-side work, but as the user base grew and more and more companies started experimenting with the tool, the company acknowledged that the module is a limitation for many external adopters, in several ways:

  • There’s an increasing number of users who can’t load kernel modules on their machines. Cloud-native platforms are becoming more and more restrictive about what runtime programs are allowed to do.

  • New contributors (and even longtime ones) often don’t understand the architecture of a kernel module. That reduces the overall number of contributors and limits the growth of the project itself.

  • Kernel module maintenance is difficult, not just because of writing the code, but also because of the effort needed to keep it safe and well organized.

For those reasons, Sysdig decided to try reimplementing the same set of features it has in the module using an eBPF program instead. Another benefit that comes automatically with adopting eBPF is that Sysdig can take advantage of other useful eBPF tracing features. For example, it’s relatively easy to attach eBPF programs to particular execution points in a user-space application using user probes, as described in “User-Space Probes”.

In addition, the project can now use native helper capabilities in eBPF programs to capture stack traces of running processes and augment the typical system call event stream. This gives users even more troubleshooting information.
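The helper typically used for this is bpf_get_stackid(), combined with a BPF_MAP_TYPE_STACK_TRACE map. The following is a minimal sketch of that pattern, not Sysdig’s actual driver code; the program and map names are ours:

    /* Minimal sketch, not Sysdig's actual code: capture the user-space
     * stack of the process issuing a system call so its id can travel
     * with the event frame. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_STACK_TRACE);
        __uint(max_entries, 1024);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, 127 * sizeof(__u64)); /* up to 127 frames */
    } stack_traces SEC(".maps");

    SEC("tracepoint/raw_syscalls/sys_enter")
    int capture_stack(void *ctx)
    {
        /* Store the calling process's user-space stack in the map; the
         * returned id identifies it and can be appended to the event. */
        int stack_id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);
        if (stack_id < 0)
            return 0; /* stack unavailable or map full */

        /* ... attach stack_id to the event frame sent to user-space ... */
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

User-space can later resolve the stack id by reading the same map and symbolizing the returned instruction pointers.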

Although everything works smoothly now, Sysdig initially faced some challenges because of limitations in the eBPF virtual machine, so the chief architect of the project, Gianluca Borello, decided to improve it by contributing patches upstream to the kernel itself, including:

  • The ability to deal with strings in eBPF programs natively

  • Multiple patches (three separate ones) to improve the semantics of arguments in eBPF programs

The latter was particularly essential to dealing with system call arguments, probably the most important data source available in the tool.

Figure 9-1 shows the architecture of the eBPF mode in Sysdig.

Figure 9-1. Sysdig’s eBPF architecture

The core of the implementation is a collection of custom eBPF programs responsible for the instrumentation. These programs are written in a subset of the C programming language. They are compiled using recent versions of Clang and LLVM, which translate the high-level C code into the eBPF bytecode.

There is one eBPF program for every execution point where Sysdig instruments the kernel. Currently, eBPF programs are attached to the following static tracepoints (a skeleton of the corresponding attach points follows the list):

  • System call entry path

  • System call exit path

  • Process context switch

  • Process termination

  • Minor and major page faults

  • Process signal delivery
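As a rough illustration, the skeleton below shows how programs could be attached to those execution points using libbpf section names. The tracepoint names are the standard kernel ones (the page-fault tracepoint shown is x86-specific); the exact attach points in Sysdig’s driver may differ:

    /* Skeleton only: one program per execution point, attached via
     * libbpf section names to standard kernel static tracepoints.
     * Sysdig's driver may use different attach points in detail. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("tracepoint/raw_syscalls/sys_enter")     /* system call entry path */
    int on_sys_enter(void *ctx) { return 0; }

    SEC("tracepoint/raw_syscalls/sys_exit")      /* system call exit path */
    int on_sys_exit(void *ctx) { return 0; }

    SEC("tracepoint/sched/sched_switch")         /* process context switch */
    int on_sched_switch(void *ctx) { return 0; }

    SEC("tracepoint/sched/sched_process_exit")   /* process termination */
    int on_process_exit(void *ctx) { return 0; }

    SEC("tracepoint/exceptions/page_fault_user") /* page faults (x86-specific) */
    int on_page_fault_user(void *ctx) { return 0; }

    SEC("tracepoint/signal/signal_deliver")      /* signal delivery */
    int on_signal_deliver(void *ctx) { return 0; }

    char LICENSE[] SEC("license") = "GPL";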

Each program takes in the execution-point data (e.g., for system calls, the arguments passed by the calling process) and starts processing it. The processing depends on the type of system call. For simple system calls, the arguments are just copied verbatim into an eBPF map used for temporary storage until the entire event frame is formed. For other, more complicated calls, the eBPF programs include the logic to translate or augment the arguments. This enables the Sysdig application in user-space to fully leverage the data.
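The following sketch illustrates the “simple” case: staging the raw arguments in a per-CPU array map that acts as temporary storage until the event frame is complete. The map layout and names are illustrative, not taken from Sysdig’s driver:

    /* Illustrative sketch of staging raw system call arguments in a
     * per-CPU array map until the event frame is assembled. Names and
     * layout are hypothetical, not taken from Sysdig's driver. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Context layout of the raw_syscalls:sys_enter tracepoint. */
    struct sys_enter_ctx {
        unsigned long long unused;  /* common tracepoint fields */
        long id;                    /* system call number */
        unsigned long args[6];      /* raw arguments */
    };

    struct scratch_event {
        __u64 syscall_id;
        __u64 args[6];
    };

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct scratch_event);
    } tmp_scratch SEC(".maps");

    SEC("tracepoint/raw_syscalls/sys_enter")
    int stage_args(struct sys_enter_ctx *ctx)
    {
        __u32 zero = 0;
        struct scratch_event *evt = bpf_map_lookup_elem(&tmp_scratch, &zero);
        if (!evt)
            return 0;

        evt->syscall_id = ctx->id;
        /* Simple syscalls: copy the arguments verbatim. More complex ones
         * would translate or augment them here before the frame is pushed. */
        for (int i = 0; i < 6; i++)
            evt->args[i] = ctx->args[i];

        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";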

Some of the additional data includes the following:

  • Data associated with a network connection (TCP/UDP IPv4/IPv6 tuple, UNIX socket names, etc.)

  • Highly granular metrics about the process (memory counters, page faults, socket queue length, etc.)

  • Container-specific data, such as the cgroups the process issuing the syscall belongs to, as well as the namespaces in which the process lives

As shown in Figure 9-1, after an eBPF program captures all the needed data for a specific system call, it uses a dedicated BPF helper function to push the data to a set of per-CPU ring buffers that the user-space application can read at very high throughput. This is where the usage of eBPF in Sysdig differs from the typical paradigm of using eBPF maps to share “small data” produced in kernel-space with user-space. To learn more about maps and how to communicate between user- and kernel-space, refer to Chapter 3.
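That helper is most likely bpf_perf_event_output() writing into a BPF_MAP_TYPE_PERF_EVENT_ARRAY, which is how BPF programs reach per-CPU ring buffers. Here is a hedged sketch of the pattern; the frame layout and names are ours, not Sysdig’s:

    /* Hedged sketch of pushing a completed event frame to the per-CPU
     * ring buffers with bpf_perf_event_output(). Structure and names
     * are illustrative; Sysdig's driver differs in detail. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct event_frame {
        __u64 timestamp_ns;
        __u64 syscall_id;
        __u64 args[6];
    };

    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } events SEC(".maps");

    SEC("tracepoint/raw_syscalls/sys_exit")
    int flush_event(void *ctx)
    {
        struct event_frame frame = {};
        frame.timestamp_ns = bpf_ktime_get_ns();
        /* ... fill in the rest of the frame from the scratch map ... */

        /* BPF_F_CURRENT_CPU writes to the ring buffer of the CPU the
         * program runs on; user-space reads each buffer independently. */
        return bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                                     &frame, sizeof(frame));
    }

    char LICENSE[] SEC("license") = "GPL";

Because each CPU gets its own buffer, producers never contend with one another, which is what makes the very high read throughput possible.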

From a performance point of view, the results are good! In Figure 9-2 you can see how the overhead of Sysdig’s eBPF instrumentation is only marginally greater than that of the “classic” kernel module instrumentation.

Figure 9-2. Sysdig eBPF performance comparison

You can play with Sysdig and its eBPF support by following the usage instructions, and make sure to also look at the code of the BPF driver.

Flowmill

Flowmill, an observability startup, emerged from Flowtune, an academic research project by its founder, Jonathan Perry. Flowtune examined how to efficiently schedule individual packets in congested datacenter networks. One of the core pieces of technology required for this work was a means of gathering network telemetry with extremely low overhead. Flowmill ultimately adapted this technology to observe, aggregate, and analyze connections between every component in a distributed application to do the following:

  • Provide an accurate view of how services interact in a distributed system

  • Identify areas where statistically significant changes have occurred in traffic rates, errors, or latency

Flowmill uses eBPF kernel probes to trace every open socket and capture operating systems metrics on them periodically. This is complex for a number of reasons:

  • It’s necessary to instrument both new connections and connections already open at the time the eBPF probes are established. Additionally, the instrumentation must account for both TCP and UDP as well as IPv4 and IPv6 code paths through the kernel.

  • For container-based systems, each socket must be attributed to the appropriate cgroup and joined with orchestrator metadata from a platform like Kubernetes or Docker.

  • Network address translation performed via conntrack must be instrumented to establish the mapping between sockets and their externally visible IP addresses. For example, a common networking model in Docker uses source NAT to masquerade containers behind a host IP address, and in Kubernetes a service virtual IP address is used to represent a set of containers.

  • Data collected by eBPF programs must be post-processed to provide aggregates by service and to match data collected on two sides of a connection.

However, adding eBPF kernel probes provides a far more efficient and robust way of gathering this data. It completely eliminates the risk of missing connections and can be done with low overhead on every socket at a subsecond interval. Flowmill’s approach relies on an agent that combines a set of eBPF kprobes and user-space metrics collection, as well as off-box aggregation and post-processing. The implementation makes heavy use of perf rings to pass metrics collected on each socket to user-space for further processing. Additionally, it uses a hash map to keep track of open TCP and UDP sockets.
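A sketch of that bookkeeping pattern might look like the following: a hash map keyed by the kernel socket address, updated from a kprobe on a socket lifecycle function, with a perf ring available for shipping metrics. The choice of tcp_init_sock and the structure layout are our assumptions for illustration, not Flowmill’s actual code:

    /* Illustrative sketch: track open sockets in a hash map keyed by the
     * kernel address of struct sock, updated from a kprobe on
     * tcp_init_sock (our choice, not necessarily Flowmill's), with a
     * perf ring available to ship metrics to user-space.
     * Build (assumption): clang -O2 -g -target bpf -D__TARGET_ARCH_x86 -c */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct sock; /* only used as an opaque pointer here */

    struct sock_stats {
        __u64 created_ns;
        __u64 bytes_sent;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);
        __type(key, __u64);              /* address of struct sock */
        __type(value, struct sock_stats);
    } open_sockets SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } sock_events SEC(".maps");

    SEC("kprobe/tcp_init_sock")
    int BPF_KPROBE(on_tcp_init_sock, struct sock *sk)
    {
        __u64 key = (__u64)sk;
        struct sock_stats stats = { .created_ns = bpf_ktime_get_ns() };

        /* Remember the socket; periodic collection and socket close
         * handlers would read and delete entries from this map. */
        bpf_map_update_elem(&open_sockets, &key, &stats, BPF_ANY);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

A real agent would also enumerate sockets already open at startup and remove entries when sockets close, as described in the list above.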

Flowmill found that there are generally two strategies for designing eBPF instrumentation. The “easy” approach finds the one or two kernel functions that are called on every instrumented event, but it requires the BPF code to maintain more state and do more work per call, on an instrumentation point that is hit very frequently. To alleviate concerns about instrumentation impacting production workloads, Flowmill followed the other strategy: instrument more specific functions that are called less frequently and signify an important event. This has significantly lower overhead but requires more effort to cover all the important code paths, especially across kernel versions as kernel code evolves.

For example, tcp_v4_do_rcv captures all established TCP RX traffic and has access to the struct sock, but it has an extremely high call volume. Instead, users can instrument functions dealing with ACKs, out-of-order packet handling, RTT estimation, and more, which allows handling specific events that influence known metrics.
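As an illustration of the second strategy, the sketch below counts retransmissions per socket by probing tcp_retransmit_skb, a function that fires only on a meaningful, relatively rare event rather than on every packet. The probe choice is our example, not necessarily one Flowmill uses:

    /* Hedged sketch of the "instrument the rare, meaningful event"
     * strategy: count retransmissions per socket via tcp_retransmit_skb
     * instead of hooking the hot per-packet receive path. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct sock; /* opaque pointer, used only as a key */

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);
        __type(key, __u64);   /* address of struct sock */
        __type(value, __u64); /* retransmit count */
    } retransmits SEC(".maps");

    SEC("kprobe/tcp_retransmit_skb")
    int BPF_KPROBE(on_tcp_retransmit, struct sock *sk)
    {
        __u64 key = (__u64)sk;
        __u64 one = 1, *count;

        count = bpf_map_lookup_elem(&retransmits, &key);
        if (count)
            __sync_fetch_and_add(count, 1);
        else
            bpf_map_update_elem(&retransmits, &key, &one, BPF_NOEXIST);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";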

With this approach across TCP, UDP, processes, containers, conntrack, and other subsystems, the system achieves extremely good performance, with overhead low enough that it is difficult to measure in most systems. CPU overhead is generally 0.1% to 0.25% per core, including the eBPF and user-space components, and depends primarily on the rate at which new sockets are created.

There is more about Flowmill and Flowtune on their website.

Sysdig and Flowmill are pioneers in the use of BPF to build monitoring and observability tools, but they are not the only ones. Throughout the book, we’ve mentioned other companies, like Cilium and Facebook, that have adopted BPF as their framework of choice to deliver highly secure and performant networking infrastructure. We’re very excited for the future ahead of BPF and its community, and we cannot wait to see what you build with it.
