Chapter 5. BPF Utilities

So far, we’ve talked about how you can write BPF programs to get more visibility within your systems. Over the years, many developers have built tools with BPF for that same purpose. In this chapter we talk about several of the off-the-shelf tools that you can use every day. Many of these tools are advanced versions of some BPF programs that you’ve already seen. Others are tools that will help you gain direct visibility into your own BPF programs.

We begin by covering BPFTool, a command-line utility to get more information about your BPF programs. Then we cover BPFTrace and kubectl-trace, which will help you write BPF programs more efficiently with a concise domain-specific language (DSL). Finally, we talk about eBPF Exporter, an open source project to integrate BPF with Prometheus.

BPFTool

BPFTool is a kernel utility for inspecting BPF programs and maps. This tool doesn’t come installed by default on any Linux distribution, and it’s in heavy development, so you’ll want to compile the version that best supports your Linux kernel. We cover the version of BPFTool distributed with version 5.1 of the Linux kernel.

In the next sections we discuss how to install BPFTool onto your system and how to use it to observe and change the behavior of your BPF programs and maps from the terminal.

Installation

To install BPFTool, you need to download a copy of the kernel’s source code. There might be some packages for your specific Linux distribution online, but we’re going to cover how to install it from the source because it’s not too complicated.

  1. Use Git to clone the repository from GitHub with git clone https://github.com/torvalds/linux.

  2. Check out the specific kernel version tag with git checkout v5.1.

  3. Within the kernel’s source, navigate to the directory where BPFTool’s source is stored with cd tools/bpf/bpftool.

  4. Compile and install this tool with make && sudo make install.

You can check that BPFTool is correctly installed by checking its version:

# bpftool --version
bpftool v5.1.0

Feature Display

One of the basic operations that you can perform with BPFTool is scanning your system to know which BPF features you have access to. This is great when you don’t remember which version of the kernel introduced which kind of programs or whether the BPF JIT compiler is enabled. To find out the answer to those questions, and many others, run this command:

# bpftool feature

You’ll get some long output with details about all the supported BPF features in your systems. For brevity, we show you a cropped version of that output here:

Scanning system configuration...
bpf() syscall for unprivileged users is enabled
JIT compiler is enabled
...
Scanning eBPF program types...
eBPF program_type socket_filter is available
eBPF program_type kprobe is NOT available
...
Scanning eBPF map types...
eBPF map_type hash is available
eBPF map_type array is available

In this output you can see that our system allows unprivileged users to execute the bpf syscall, although the kernel restricts unprivileged callers to certain operations. You can also see that the JIT is enabled. Newer versions of the kernel enable the JIT by default; it compiles BPF bytecode into native machine code, which makes programs run considerably faster. If your system doesn’t have it enabled, you can run this command to enable it:

# echo 1 > /proc/sys/net/core/bpf_jit_enable

The feature output also shows you which program types and map types are enabled in your system. This command exposes much more information than what we’re showing you here, like BPF helpers supported by program type and many other configuration directives. Feel free to dive into them while exploring your system.
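
If you want to check for a feature from a script instead of reading the output by eye, a small parser is enough. The following is a minimal sketch in Python that assumes the plain-text lines look exactly like the cropped output shown earlier; the function name parse_features and the sample text are ours, for illustration only.

```python
import re

# Sample lines copied from the cropped `bpftool feature` output in this section.
FEATURE_SAMPLE = """\
Scanning eBPF program types...
eBPF program_type socket_filter is available
eBPF program_type kprobe is NOT available
Scanning eBPF map types...
eBPF map_type hash is available
eBPF map_type array is available
"""

# Matches lines such as "eBPF map_type hash is available".
FEATURE_LINE = re.compile(r"^eBPF (program_type|map_type) (\w+) is (NOT )?available$")


def parse_features(text):
    """Return {"program_type": {name: bool}, "map_type": {name: bool}}."""
    features = {"program_type": {}, "map_type": {}}
    for line in text.splitlines():
        match = FEATURE_LINE.match(line.strip())
        if match:
            kind, name, negated = match.groups()
            # The feature is available when the "NOT " marker is absent.
            features[kind][name] = negated is None
    return features


features = parse_features(FEATURE_SAMPLE)
print(features["program_type"])
print(features["map_type"])
```

On a real system you would feed this function the text captured from bpftool feature rather than the embedded sample.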

Knowing what features you have at your disposal can be useful, especially if you need to dive into an unknown system. With that, we’re ready to move on to other interesting BPFTool features, like inspecting loaded programs.

Inspecting BPF Programs

BPFTool gives you direct information about the BPF programs loaded in the kernel. It allows you to investigate what’s already running in your system. It also allows you to load and pin new BPF programs that have been previously compiled from your command line.

The best starting point to learn how to use BPFTool to work with programs is by inspecting what you have running in your system. To do that, you can run the command bpftool prog show. If you’re using Systemd as your init system, you probably already have a few BPF programs loaded and attached to some cgroups; we talk about these a little later. The output of running that command will look like this:

52: cgroup_skb  tag 7be49e3934a125ba
        loaded_at 2019-03-28T16:46:04-0700  uid 0
        xlated 296B  jited 229B  memlock 4096B  map_ids 52,53
53: cgroup_skb  tag 2a142ef67aaad174
        loaded_at 2019-03-28T16:46:04-0700  uid 0
        xlated 296B  jited 229B  memlock 4096B  map_ids 52,53
54: cgroup_skb  tag 7be49e3934a125ba
        loaded_at 2019-03-28T16:46:04-0700  uid 0
        xlated 296B  jited 229B  memlock 4096B  map_ids 54,55

The numbers on the left side, before the colon, are the program identifiers; we use them later to investigate what these programs are all about. From this output you can also learn which kinds of programs your system is running. In this case, the system is running three BPF programs attached to cgroup socket buffers. The loading time will likely match when you booted your system if those programs were actually started by Systemd. You can also see how much memory those programs are currently using and the identifiers for the maps associated with them. All of this is useful at first glance, and because we have the program identifiers, we can dive a little bit deeper.

You can add the program identifier to the previous command as an extra argument: bpftool prog show id 52. With that, BPFTool will show you the same information you saw before, but only for the program identified by the ID 52; that way, you can filter out information that you don’t need. This command also supports a --json flag to generate some JSON output. This JSON output is very convenient if you want to manipulate the output. For example, tools like jq will give you a more structured formatting for this data:

# bpftool prog show --json id 52 | jq
{
  "id": 52,
  "type": "cgroup_skb",
  "tag": "7be49e3934a125ba",
  "gpl_compatible": false,
  "loaded_at": 1553816764,
  "uid": 0,
  "bytes_xlated": 296,
  "jited": true,
  "bytes_jited": 229,
  "bytes_memlock": 4096,
  "map_ids": [
    52,
    53
  ]
}

You can also perform more advanced manipulations and filter only the information that you’re interested in. In the next example, we’re interested only in knowing the BPF program identifier, which type of program it is, and when it was loaded in the kernel:

# bpftool prog show --json id 52 | jq -c '[.id, .type, .loaded_at]'
[52,"cgroup_skb",1553816764]
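
The loaded_at field in the JSON output is a Unix timestamp, which is harder to read than the date in the plain output. As a rough sketch, this Python snippet takes the JSON shown earlier and prints the same three fields with the timestamp rendered in UTC. The function name summarize is an assumption of ours, not part of BPFTool.

```python
import json
from datetime import datetime, timezone

# JSON copied from the `bpftool prog show --json id 52` output in this section.
PROG_JSON = """{"id": 52, "type": "cgroup_skb", "tag": "7be49e3934a125ba",
 "gpl_compatible": false, "loaded_at": 1553816764, "uid": 0,
 "bytes_xlated": 296, "jited": true, "bytes_jited": 229,
 "bytes_memlock": 4096, "map_ids": [52, 53]}"""


def summarize(prog_json):
    """Return 'id type loaded_at', with loaded_at as a UTC ISO 8601 string."""
    prog = json.loads(prog_json)
    loaded = datetime.fromtimestamp(prog["loaded_at"], tz=timezone.utc)
    return f'{prog["id"]} {prog["type"]} {loaded.isoformat()}'


summary = summarize(PROG_JSON)
print(summary)
```

The result is expressed in UTC, so it is seven hours ahead of the local time with the -0700 offset that the plain output shows for the same program.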

When you know a program identifier, you can also get a dump of the whole program using BPFTool; this can be handy when you need to debug the BPF bytecode generated by a compiler:

# bpftool prog dump xlated id 52
   0: (bf) r6 = r1
   1: (69) r7 = *(u16 *)(r6 +192)
   2: (b4) w8 = 0
   3: (55) if r7 != 0x8 goto pc+14
   4: (bf) r1 = r6
   5: (b4) w2 = 16
   6: (bf) r3 = r10
   7: (07) r3 += -4
   8: (b4) w4 = 4
   9: (85) call bpf_skb_load_bytes#7151872
   ...

This program loaded in our kernel by Systemd is inspecting packet data by using the helper bpf_skb_load_bytes.
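
When a dump is long, it can be quicker to list only the helper calls than to read every instruction. The sketch below assumes that call instructions look like the one in the dump above, with opcode 85 followed by the helper name and a # suffix; the function name helper_calls is ours.

```python
import re

# A few lines copied from the `bpftool prog dump xlated id 52` output above.
XLATED_SAMPLE = """\
   0: (bf) r6 = r1
   1: (69) r7 = *(u16 *)(r6 +192)
   2: (b4) w8 = 0
   3: (55) if r7 != 0x8 goto pc+14
   8: (b4) w4 = 4
   9: (85) call bpf_skb_load_bytes#7151872
"""

# Captures the instruction index and the helper name of a call instruction.
CALL_LINE = re.compile(r"^\s*(\d+): \(85\) call (\w+)#")


def helper_calls(dump):
    """Return a list of (instruction_index, helper_name) tuples."""
    calls = []
    for line in dump.splitlines():
        match = CALL_LINE.match(line)
        if match:
            calls.append((int(match.group(1)), match.group(2)))
    return calls


calls = helper_calls(XLATED_SAMPLE)
print(calls)
```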

If you want a more visual representation of this program, including instruction jumps, you can use the visual keyword in this command. That will generate the output in a format that you can convert to a graph representation with tools like dotty, or any other program that can draw graphs:

# bpftool prog dump xlated id 52 visual &> output.out
# dot -Tpng output.out -o visual-graph.png

You can see the visual representation for a small Hello World program in Figure 5-1.

Figure 5-1. Visual representation of a BPF program

If you’re running version 5.1 or newer of the kernel, you’ll also have access to runtime statistics. They tell you how long the kernel is spending on your BPF programs. This feature might not be enabled in your system by default; you’ll need to run this command first to let the kernel know that it needs to show you that data:

# sysctl -w kernel.bpf_stats_enabled=1

When the stats are enabled, you’ll get two more pieces of information when you run BPFTool: the total amount of time that the kernel has spent running that program (run_time_ns), and how many times it has run it (run_cnt):

52: cgroup_skb  tag 7be49e3934a125ba  run_time_ns 14397 run_cnt 39
        loaded_at 2019-03-28T16:46:04-0700  uid 0
        xlated 296B  jited 229B  memlock 4096B  map_ids 52,53
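
The two counters become more useful when you combine them, because dividing run_time_ns by run_cnt gives the average time per invocation. Here is a minimal sketch that reads both values from the first line of the plain output shown above; the function name average_run_ns is hypothetical.

```python
import re

# First line of the stats-enabled `bpftool prog show` output in this section.
STATS_LINE = "52: cgroup_skb  tag 7be49e3934a125ba  run_time_ns 14397 run_cnt 39"


def average_run_ns(line):
    """Return the average nanoseconds per run, or None if stats are missing."""
    match = re.search(r"run_time_ns (\d+) run_cnt (\d+)", line)
    if match is None:
        return None
    total_ns, count = int(match.group(1)), int(match.group(2))
    # Guard against a program that has been loaded but has never run.
    return total_ns / count if count else 0.0


average = average_run_ns(STATS_LINE)
print(f"{average:.1f} ns per run")
```

For program 52 this works out to roughly 369 nanoseconds per run.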

But BPFTool doesn’t only allow you to inspect how your programs are doing; it also lets you load new programs into the kernel and attach some of them to sockets and cgroups. For example, we can load one of our previous programs and pin it to the BPF file system, with this command:

# bpftool prog load bpf_prog.o /sys/fs/bpf/bpf_prog

Because the program is pinned to the filesystem, it stays loaded in the kernel after the command exits, and we can see that it’s still loaded with the previous show command:

# bpftool prog show
52: cgroup_skb  tag 7be49e3934a125ba
        loaded_at 2019-03-28T16:46:04-0700  uid 0
        xlated 296B  jited 229B  memlock 4096B  map_ids 52,53
53: cgroup_skb  tag 2a142ef67aaad174
        loaded_at 2019-03-28T16:46:04-0700  uid 0
        xlated 296B  jited 229B  memlock 4096B  map_ids 52,53
54: cgroup_skb  tag 7be49e3934a125ba
        loaded_at 2019-03-28T16:46:04-0700  uid 0
        xlated 296B  jited 229B  memlock 4096B  map_ids 54,55
60: perf_event  name bpf_prog  tag c6e8e35bea53af79
        loaded_at 2019-03-28T20:46:32-0700  uid 0
        xlated 112B  jited 115B  memlock 4096B

As you can see, BPFTool gives you a lot of information about the programs loaded in your kernel without having to write and compile any code. Let’s see how to work with BPF maps next.

Inspecting BPF Maps

Besides giving you access to inspect and manipulate BPF programs, BPFTool can give you access to the BPF maps that those programs are using. The command to list all maps and filter maps by their identifiers is similar to the show command that you saw previously. Instead of asking BPFTool to display information for prog, let’s ask it to show us information for map:

# bpftool map show
52: lpm_trie  flags 0x1
        key 8B  value 8B  max_entries 1  memlock 4096B
53: lpm_trie  flags 0x1
        key 20B  value 8B  max_entries 1  memlock 4096B
54: lpm_trie  flags 0x1
        key 8B  value 8B  max_entries 1  memlock 4096B
55: lpm_trie  flags 0x1
        key 20B  value 8B  max_entries 1  memlock 4096B

Those maps match the identifiers that you saw earlier attached to your programs. You can also filter maps by their ID, in the same way we filtered programs by their ID earlier.

You can use BPFTool to create and update maps and to list all the elements within a map. Creating a new map requires the same information that you provide when you initialize a map along with one of your programs. We need to specify which type of map we want to create, the size of the keys and values, and its name. Because we’re not initializing the map along with a program, we also need to pin it to the BPF filesystem so that we can use it later:

# bpftool map create /sys/fs/bpf/counter \
    type array key 4 value 4 entries 5 name counter

If you list the maps in the system after running that command, you’ll see the new map at the bottom of the list:

52: lpm_trie  flags 0x1
        key 8B  value 8B  max_entries 1  memlock 4096B
53: lpm_trie  flags 0x1
        key 20B  value 8B  max_entries 1  memlock 4096B
54: lpm_trie  flags 0x1
        key 8B  value 8B  max_entries 1  memlock 4096B
55: lpm_trie  flags 0x1
        key 20B  value 8B  max_entries 1  memlock 4096B
56: lpm_trie  flags 0x1
        key 8B  value 8B  max_entries 1  memlock 4096B
57: lpm_trie  flags 0x1
        key 20B  value 8B  max_entries 1  memlock 4096B
58: array  name counter  flags 0x0
        key 4B  value 4B  max_entries 5  memlock 4096B

After you’ve created the map, you can update and delete elements like we’d do inside a BPF program.

Tip

Remember that you cannot remove elements from fixed-size arrays; you can only update them. But you can totally delete elements from other maps, like hash maps.

If you want to add a new element to the map or update an existing one, you can use the map update command. You can grab the map identifier from the previous example:

# bpftool map update id 58 key 1 0 0 0 value 1 0 0 0

If you try to update an element with an invalid key or value, BPFTool will return an error:

# bpftool map update id 58 key 1 0 0 0 value 1 0 0
Error: value expected 4 bytes got 3
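
Writing keys and values byte by byte is error prone for anything larger than a small number. The following sketch builds the byte list for you, assuming a little-endian machine such as x86-64, which matches the byte order of the example above. The helper names to_bpftool_bytes and update_command are ours.

```python
def to_bpftool_bytes(number, size=4):
    """Render an unsigned integer as space-separated little-endian bytes."""
    return " ".join(str(byte) for byte in number.to_bytes(size, "little"))


def update_command(map_id, key, value):
    """Build a `bpftool map update` command line for 4-byte keys and values."""
    return (
        f"bpftool map update id {map_id} "
        f"key {to_bpftool_bytes(key)} value {to_bpftool_bytes(value)}"
    )


command = update_command(58, 1, 1)
print(command)
print(to_bpftool_bytes(300))
```

The first line printed is the same command we typed by hand earlier. The second shows why a helper is convenient: 300 doesn't fit in one byte, so it becomes 44 1 0 0.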

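BPFTool can give you a dump of all the elements in a map if you need to inspect its values. In the following dump you can see that BPF initializes every element of a fixed-size array map to zero when the map is created; only the element with key 1, which we updated earlier, holds a different value: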

# bpftool map dump id 58
key: 00 00 00 00  value: 00 00 00 00
key: 01 00 00 00  value: 01 00 00 00
key: 02 00 00 00  value: 00 00 00 00
key: 03 00 00 00  value: 00 00 00 00
key: 04 00 00 00  value: 00 00 00 00
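
The dump prints raw bytes in hexadecimal, so you need to decode them to get numbers back. This sketch decodes the output above into a Python dictionary, again assuming little-endian integers; the function name decode_dump is hypothetical.

```python
# Output copied from `bpftool map dump id 58` in this section.
DUMP_SAMPLE = """\
key: 00 00 00 00  value: 00 00 00 00
key: 01 00 00 00  value: 01 00 00 00
key: 02 00 00 00  value: 00 00 00 00
key: 03 00 00 00  value: 00 00 00 00
key: 04 00 00 00  value: 00 00 00 00
"""


def decode_dump(text):
    """Return {key: value}, decoding hex byte strings as little-endian ints."""
    entries = {}
    for line in text.splitlines():
        if not line.startswith("key:"):
            continue
        key_part, value_part = line[len("key:"):].split("value:")
        # bytes.fromhex ignores the spaces between the byte pairs.
        key = int.from_bytes(bytes.fromhex(key_part), "little")
        value = int.from_bytes(bytes.fromhex(value_part), "little")
        entries[key] = value
    return entries


entries = decode_dump(DUMP_SAMPLE)
print(entries)
```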

One of the most powerful options that BPFTool gives you is that you can attach precreated maps to new programs and replace the maps that they would initialize with those preallocated maps. That way, you can give programs access to saved data from the beginning, even if you didn’t write the program to read a map from the BPF file system. To do that, you need to set the map you want to initialize when you load the program with BPFTool. You can specify the map by the ordered identifier that it would have when the program loads it, for example 0 for the first map, 1 for the second one, and so on. You can also specify the map by its name, which is usually more convenient:

# bpftool prog load bpf_prog.o /sys/fs/bpf/bpf_prog_2 \
    map name counter pinned /sys/fs/bpf/counter

In this example we attach the map that we just created to a new program. In this case, we replace the map by its name, because we know that the program initializes a map called counter. You can also use the map’s index position with the keyword idx, as in idx 0, if that’s easier to remember for you.

Accessing BPF maps directly from the command line is useful when you need to debug message passing in real time. BPFTool gives you direct access in a convenient way. Besides introspecting programs and maps, you can use BPFTool to extract much more information from the kernel. Let’s see how to access specific interfaces next.

Inspecting Programs Attached to Specific Interfaces

Sometimes you’ll find yourself wondering which programs are attached to specific interfaces. BPF can load programs that work on top of cgroups, Perf events, and network packets. The subcommands cgroup, perf, and net can help you trace back attachments on those interfaces.

The perf subcommand lists all programs attached to tracing points in the system, like kprobes, uprobes, and tracepoints; you can see that listing by running bpftool perf show.

The net subcommand lists programs attached to XDP and Traffic Control. Other attachments, like socket filters and reuseport programs, are accessible only by using iproute2. You can list the attachments to XDP and TC with bpftool net show, like you’ve seen with other BPF objects.

Finally, the cgroup subcommand lists all programs attached to cgroups. This subcommand is a little bit different than the other ones you’ve seen. bpftool cgroup show requires the path to the cgroup you want to inspect. If you want to list all the attachments in all cgroups in the system, you’ll need to use bpftool cgroup tree, as shown in this example:

# bpftool cgroup tree
CgroupPath
ID       AttachType      AttachFlags     Name
/sys/fs/cgroup/unified/system.slice/systemd-udevd.service
    5        ingress
    4        egress
/sys/fs/cgroup/unified/system.slice/systemd-journald.service
    3        ingress
    2        egress
/sys/fs/cgroup/unified/system.slice/systemd-logind.service
    7        ingress
    6        egress
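
If you need to compare attachments across machines, it helps to turn this listing into a data structure. The sketch below groups the attachments by cgroup path and assumes the layout shown above, where paths start with a slash and attachments are indented beneath them. The function name parse_cgroup_tree is ours.

```python
# Part of the `bpftool cgroup tree` output shown in this section.
TREE_SAMPLE = """\
CgroupPath
ID       AttachType      AttachFlags     Name
/sys/fs/cgroup/unified/system.slice/systemd-udevd.service
    5        ingress
    4        egress
/sys/fs/cgroup/unified/system.slice/systemd-journald.service
    3        ingress
    2        egress
"""


def parse_cgroup_tree(text):
    """Return {cgroup_path: [(program_id, attach_type), ...]}."""
    attachments = {}
    current = None
    for line in text.splitlines():
        if line.startswith("/"):
            current = line.strip()
            attachments[current] = []
        elif current is not None and line.strip():
            # Header lines appear before any path, so they are skipped.
            fields = line.split()
            attachments[current].append((int(fields[0]), fields[1]))
    return attachments


attachments = parse_cgroup_tree(TREE_SAMPLE)
for path, programs in attachments.items():
    print(path, programs)
```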

Thanks to BPFTool, you can verify that your programs are attached correctly to any interface in the kernel, giving you quick visibility into cgroups, Perf, and the network interface.

So far, we’ve talked about how you can enter different commands in your terminal to debug how your BPF programs behave. However, remembering all these commands can be cumbersome when you need them the most. Next we describe how to load several commands from plain-text files so that you can build a set of scripts that you can keep handy without having to retain each option that we’ve talked about.

Loading Commands in Batch Mode

It’s common to run several commands over and over while you’re trying to analyze the behavior of one or multiple systems. You might end up with a collection of commands that you use frequently as part of your toolchain. BPFTool’s batch mode is for you if you don’t want to type those commands every single time.

With batch mode, you can write all of the commands that you want to execute in a file and run all of them at once. You can also write comments in this file by starting a line with #. However, this execution mode is not atomic. BPFTool executes commands line by line, and it will abort the execution if one of the commands fails, leaving the system in the state it was in after running the last successful command.

This is a short example of a file that batch mode can process:

# Create a new hash map
map create /sys/fs/bpf/hash_map type hash key 4 value 4 entries 5 name hash_map
# Now show all the maps in the system
map show

If you save those commands in a file called /tmp/batch_example.txt, you’ll be able to load it with bpftool batch file /tmp/batch_example.txt. You’ll get output similar to the following snippet when you run this command for the first time, but if you try to run it again, the command will exit with no output because we already have a map with the name hash_map in the system, and the batch execution will fail in the first line:

# bpftool batch file /tmp/batch_example.txt
2: lpm_trie  flags 0x1
	key 8B  value 8B  max_entries 1  memlock 4096B
3: lpm_trie  flags 0x1
	key 20B  value 8B  max_entries 1  memlock 4096B
18: hash  name hash_map  flags 0x0
	key 4B  value 4B  max_entries 5  memlock 4096B
processed 2 commands
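
When you maintain several batch files that differ only in the maps they create, you can generate them from a short script. This is a minimal sketch that reproduces the batch file above from a list of map descriptions; the function name batch_lines and the dictionary layout are our own choices, and the script only builds the text without running BPFTool.

```python
def batch_lines(maps):
    """Build the text of a bpftool batch file that creates and lists maps."""
    lines = []
    for spec in maps:
        lines.append(f"# Create a new {spec['type']} map")
        lines.append(
            f"map create /sys/fs/bpf/{spec['name']} type {spec['type']} "
            f"key {spec['key']} value {spec['value']} "
            f"entries {spec['entries']} name {spec['name']}"
        )
    lines.append("# Now show all the maps in the system")
    lines.append("map show")
    return "\n".join(lines) + "\n"


batch_text = batch_lines(
    [{"name": "hash_map", "type": "hash", "key": 4, "value": 4, "entries": 5}]
)
print(batch_text, end="")
```

You can write the resulting text to a file such as /tmp/batch_example.txt and load it with bpftool batch file as described above.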

Batch mode is one of our favorite options in BPFTool. We recommend keeping these batch files in a version control system so that you can share them with your team to create your own set of utility tools. Before jumping to our next favorite utility, let’s see how BPFTool can help you understand the BPF Type Format better.

Displaying BTF Information

BPFTool can display BPF Type Format (BTF) information for any given binary object when it is present. As we mentioned in Chapter 2, BTF annotates program structures with metadata information to help you debug programs.

For example, it can give you the source file and line numbers for each instruction in a BPF program when you add the keyword linum to prog dump.

More recent versions of BPFTool include a new btf subcommand to help you dive into your programs. The initial focus of this command is to visualize structure types. For example, bpftool btf dump id 54 shows all of the BTF types for the program loaded with an ID of 54.

These are some of the things you can use BPFTool for. It’s a low-friction entry point to any system, especially if you don’t work on that system on a day-to-day basis.

BPFTrace

BPFTrace is a high-level tracing language for BPF. It allows you to write BPF programs with a concise DSL, and save them as scripts that you can execute without having to compile and load them in the kernel manually. The language is inspired by other well-known tools, like awk and DTrace. If you’re familiar with DTrace and you’ve always missed being able to use it on Linux, you’re going to find in BPFTrace a great substitute.

One of the advantages of using BPFTrace over writing programs directly with BCC or other BPF tooling is that BPFTrace provides a lot of built-in functionality that you don’t need to implement yourself, such as aggregating information and creating histograms. On the other hand, the language that BPFTrace uses is much more limited, and it will get in your way if you try to implement advanced programs. In this section, we show you the most important aspects of the language. We recommend going to the BPFTrace repository on GitHub to learn more about it.

Installation

You can install BPFTrace in several ways, although its developers recommend you use one of the prebuilt packages for your specific Linux distribution. They also maintain an installation document in their repository that describes all the installation options and the prerequisites for your system.

Language Reference

The programs that BPFTrace executes have a terse syntax. We can divide them into three sections: header, action blocks, and footer. The header is a special block that BPFTrace executes when it loads the program; it’s commonly used to print some information at the top of the output, like a preamble. In the same way, the footer is a special block that BPFTrace executes once before terminating the program. Both the header and footer are optional sections in a BPFTrace program. A BPFTrace program must have at least one action block. Action blocks are where we specify the probes that we want to trace and the actions we perform when the kernel fires the events for those probes. The next snippet shows you these three sections in a basic example:

BEGIN
{
  printf("starting BPFTrace program\n")
}

kprobe:do_sys_open
{
  printf("opening file descriptor: %s\n", str(arg1))
}

END
{
  printf("exiting BPFTrace program\n")
}

The header section is always marked with the keyword BEGIN, and the footer section is always marked with the keyword END. These keywords are reserved by BPFTrace. Action block identifiers define the probe to which you want to attach the BPF action. In the previous example, we printed a log line every time the kernel opens a file.

Besides identifying the program sections, we can already see a few more details about the language syntax in the previous examples. BPFTrace provides some helpers that are translated to BPF code when the program is compiled. The helper printf is a wrapper around the C function printf, which prints program details when you need it. str is a built-in helper that translates a C pointer to its string representation. Many kernel functions receive pointers to characters as arguments; this helper translates those pointers to strings for you.

BPFTrace could be considered a dynamic language in the sense that it doesn’t know the number of arguments a probe might receive when it’s executed by the kernel. This is why BPFTrace provides argument helpers to access the information that the kernel processes. BPFTrace generates these helpers dynamically depending on the number of arguments the block receives, and you can access the information by its position in the list of arguments. In the previous example, arg1 refers to the second argument of do_sys_open, which is the path of the file being opened.

To execute this example, you can save it in a file and run BPFTrace with the file path as the first argument:

# bpftrace /tmp/example.bt

BPFTrace’s language is designed with scripting in mind. In the previous examples, you’ve seen programs spread across several lines so that you can get familiar with the language. But many of the programs that you can write with BPFTrace fit on one single line. You don’t need to store those one-line programs in files to execute them; you can run them with the option -e when you execute BPFTrace. For example, a program that counts how many times the kernel opens each file fits in a one-liner by collapsing the action block into a single line:

# bpftrace -e "kprobe:do_sys_open { @opens[str(arg1)] = count() }"

Now that you know a little bit more about BPFTrace’s language, let’s see how to use it in several scenarios.

Filtering

When you run the previous example, you probably get a stream of files that your system is constantly opening, until you press Ctrl-C to exit the program. That’s because we’re telling BPFTrace to print the path of every file that the kernel opens. There are situations when you want to execute the action block only for specific conditions. BPFTrace calls that filtering.

You can associate one filter to each action block. They are evaluated like action blocks, but the action does not execute if the filter returns a false value. They also have access to the rest of the language, including probe arguments and helpers. These filters are encapsulated within two slashes after the action header:

kprobe:do_sys_open /str(arg1) == "/tmp/example.bt"/
{
  printf("opening file descriptor: %s\n", str(arg1))
}

In this example, we refine our action block to be executed only when the file the kernel is opening is the file that we’re using to store this example. If you run the program with the new filter, you’ll see that it prints the header, but it stops printing there. This is because every file that was triggering our action before is being skipped now thanks to our new filter. If you open the example file several times in a different terminal, you’ll see how the kernel executes the action when the filter matches our file path:

# bpftrace /tmp/example.bt
Attaching 3 probes...
starting BPFTrace program
opening file descriptor: /tmp/example.bt
opening file descriptor: /tmp/example.bt
opening file descriptor: /tmp/example.bt
^Cexiting BPFTrace program

BPFTrace’s filtering capabilities are super helpful to hide information that you don’t need, keeping data scoped to what you really care about. Next we talk about how BPFTrace makes working with maps seamless.

Dynamic Mapping

One handy feature that BPFTrace implements is dynamic map associations. It can generate BPF maps dynamically that you can use for many of the operations you’ve seen throughout the book. All map associations start with the character @, followed by the name of the map that you want to create. You can also update elements in those maps by assigning them values.

If we take the example that we started this section with, we could aggregate how often our system opens specific files. To do that, we need to count how many times the kernel runs the open syscall on a specific file, and then store those counters in a map. To identify those aggregations, we can use the file path as the map’s key. This is how our action block would look in this case:

kprobe:do_sys_open
{
  @opens[str(arg1)] = count()
}

If you run your program again, you’ll get output similar to this:

# bpftrace /tmp/example.bt
Attaching 3 probes...
starting BPFTrace program
^Cexiting BPFTrace program

@opens[/var/lib/snapd/lib/gl/haswell/libdl.so.2]: 1
@opens[/var/lib/snapd/lib/gl32/x86_64/libdl.so.2]: 1
...
@opens[/usr/lib/locale/en.utf8/LC_TIME]: 10
@opens[/usr/lib/locale/en_US/LC_TIME]: 10
@opens[/usr/share/locale/locale.alias]: 12
@opens[/proc/8483/cmdline]: 12

As you can see, BPFTrace prints the contents of the map when it stops the program execution. And as we expected, it’s aggregating how often the kernel is opening the files in our system. By default, BPFTrace is always going to print the contents of every map it creates when it terminates. You don’t need to specify that you want to print a map; it always assumes that you want to. You can change that behavior by clearing the map inside the END block by using the built-in function clear. This works because printing maps always happens after the footer block is executed.
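
On a busy system this list can run to hundreds of lines, so you might want to keep only the most frequently opened files. The following sketch parses the map output shown above and returns the top entries. It assumes that each line has the form @name[key]: count, and the function name top_entries is ours.

```python
import re

# Map lines copied from the BPFTrace output in this section.
OPENS_SAMPLE = """\
@opens[/var/lib/snapd/lib/gl/haswell/libdl.so.2]: 1
@opens[/var/lib/snapd/lib/gl32/x86_64/libdl.so.2]: 1
@opens[/usr/lib/locale/en.utf8/LC_TIME]: 10
@opens[/usr/lib/locale/en_US/LC_TIME]: 10
@opens[/usr/share/locale/locale.alias]: 12
@opens[/proc/8483/cmdline]: 12
"""

# Captures the map key between brackets and the counter after the colon.
MAP_LINE = re.compile(r"^@\w+\[(.*)\]: (\d+)$")


def top_entries(text, limit=3):
    """Return the `limit` keys with the highest counts as (key, count) pairs."""
    counts = []
    for line in text.splitlines():
        match = MAP_LINE.match(line.strip())
        if match:
            counts.append((match.group(1), int(match.group(2))))
    # The sort is stable, so ties keep the order they had in the output.
    return sorted(counts, key=lambda item: item[1], reverse=True)[:limit]


top = top_entries(OPENS_SAMPLE)
for path, count in top:
    print(count, path)
```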

BPFTrace dynamic mapping is super convenient. It removes a lot of boilerplate required to work with maps and focuses on helping you to collect data easily.

BPFTrace is a powerful tool for your day-to-day tasks. Its scripting language gives you enough flexibility to access every aspect of your system without the ceremony of having to compile and load your BPF program into the kernel manually, and this can help you trace and debug problems in your system from the get-go. Check out the reference guide in its GitHub repository to learn how to take advantage of all of its built-in capabilities, such as automatic histograms and stack trace aggregations.

In the next section we explore how to use BPFTrace inside Kubernetes.

kubectl-trace

kubectl-trace is a fantastic plug-in for the Kubernetes command line, kubectl. It helps you schedule BPFTrace programs in your Kubernetes cluster without having to install any additional packages or modules. It does this by scheduling a Kubernetes job with a container image that has everything you need to run the program installed already. This image is called trace-runner, and it’s also available in the public Docker registry.

Installation

You need to install kubectl-trace from its source repository using Go’s toolchain because its developers don’t provide any binary package:

go get -u github.com/iovisor/kubectl-trace/cmd/kubectl-trace

kubectl’s plug-in system will automatically detect this new add-on after Go’s toolchain compiles the program and puts it in the path. kubectl-trace automatically downloads the Docker images that it needs to run in your cluster the first time that you execute it.

Inspecting Kubernetes Nodes

You can use kubectl-trace to target nodes and pods where containers run, and you can also use it to target processes running on those containers. In the first case, you can run pretty much any BPF program that you’d like. However, in the second case, you’re restricted to running only the programs that attach user-space probes to those processes.

If you want to run a BPF program on a specific node, you need a proper identifier so that Kubernetes schedules the job in the appropriate place. After you have that identifier, running the program is similar to running the programs you saw earlier. This is how we would run our one-liner to count file openings:

# kubectl trace run node/node_identifier -e \
  "kprobe:do_sys_open { @opens[str(arg1)] = count() }"

As you can see, the program is exactly the same, but we’re using the command kubectl trace run to schedule it in a specific cluster node. We use the syntax node/... to tell kubectl-trace that we’re targeting a node in the cluster. If we want to target a specific pod, we’d replace node/ with pod/.

Running a program on a specific container requires longer syntax; let’s see an example first and go through it:

# kubectl trace run pod/pod_identifier -n application_name -e <<PROGRAM
uretprobe:/proc/$container_pid/exe:"main.main" {
  printf("exit: %d\n", retval)
}
PROGRAM

There are two interesting things to highlight in this command. The first is that we need the name of the application running in the container to be able to find its process; this corresponds with the application_name in our example. You’ll want to use the name of the binary that’s executed in the container, for example nginx or memcached. Usually, containers run only one process, but this gives us extra guarantees that we’re attaching our program to the correct process. The second aspect to highlight is the inclusion of $container_pid in our BPF program. This is not a BPFTrace helper, but a placeholder that kubectl-trace uses as a replacement for the process identifier. Before running the BPF program, the trace-runner substitutes the placeholder with the appropriate identifier, and it attaches our program to the correct process.
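
To make the placeholder idea concrete, the sketch below shows what such a substitution could look like. It is only an illustration of the concept and not the trace-runner’s actual implementation; the function name render_program and the process identifier 4242 are hypothetical.

```python
# The BPFTrace program from the example above, with its placeholder intact.
PROGRAM = r'''uretprobe:/proc/$container_pid/exe:"main.main" {
  printf("exit: %d\n", retval)
}'''


def render_program(program, pid):
    """Replace the $container_pid placeholder with a concrete process ID."""
    return program.replace("$container_pid", str(pid))


# 4242 is a made-up process identifier used only for this illustration.
rendered = render_program(PROGRAM, 4242)
print(rendered)
```

After the substitution, the probe path points at /proc/4242/exe, which is the binary of the process running inside the container.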

If you run Kubernetes in production, kubectl-trace will make your life much easier when you need to analyze your containers’ behavior.

In this and the previous sections, we’ve focused on tools to help you run BPF programs more efficiently, even within container environments. In the next section we talk about a nice tool to integrate data gathering from BPF programs with Prometheus, a well-known open source monitoring system.

eBPF Exporter

eBPF Exporter is a tool that allows you to export custom BPF tracing metrics to Prometheus. Prometheus is a highly scalable monitoring and alerting system. One key factor that makes Prometheus different from other monitoring systems is that it uses a pull strategy to fetch metrics, instead of expecting the client to push metrics to it. This allows users to write custom exporters that can gather metrics from any system, and Prometheus will pull them using a well-defined API schema. eBPF Exporter implements this API to fetch tracing metrics from BPF programs and import them into Prometheus.
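The pull strategy is easier to picture with a toy exporter. The following Python sketch, built only on the standard library, answers requests to /metrics in Prometheus’ text exposition format and reads the value at the moment of each scrape. The metric name demo_events_total, the port, and the helper names are assumptions for illustration; eBPF Exporter itself is written in Go and is considerably more elaborate:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_counter(name, help_text, value):
    """Render one counter in Prometheus' text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name} {value}\n"
    )


def make_handler(read_value):
    """Build a handler that calls read_value() on every scrape."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            # The value is read now, when Prometheus asks for it.
            body = render_counter(
                "demo_events_total", "Events seen so far", read_value()
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


def serve(read_value, port=8000):
    """Block and answer scrapes; Prometheus decides when to pull."""
    HTTPServer(("", port), make_handler(read_value)).serve_forever()
```

Calling serve(lambda: 42) and pointing a Prometheus scrape job at port 8000 would be enough to see the counter appear. The exporter never initiates a connection, which is the essence of the pull model.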

Installation

Although eBPF Exporter offers binary packages, we recommend installing it from source because new releases are infrequent. Building from source also gives you access to newer functionality built on top of modern versions of BCC, the BPF Compiler Collection.

To install eBPF Exporter from the source, you need to have BCC and Go’s toolchain already installed on your computer. With those prerequisites, you can use Go to download and build the binary for you:

go get -u github.com/cloudflare/ebpf_exporter/...

Exporting Metrics from BPF

eBPF Exporter is configured using YAML files, in which you can specify the metrics that you want to collect from the system, the BPF program that generates those metrics, and how they translate to Prometheus. When Prometheus sends a request to eBPF Exporter to pull metrics, this tool translates the information that the BPF programs are collecting to metric values. Fortunately, eBPF Exporter bundles many programs that collect very useful information from your system, like instructions per cycle (IPC) and CPU cache hit rates.

A simple configuration file for eBPF Exporter includes three main sections. In the first section, you define the metrics that you want Prometheus to pull from the system. Here is where you translate the data collected in BPF maps to metrics that Prometheus understands. Following is an example of these translations from the project’s examples:

programs:
  - name: timers
    metrics:
      counters:
        - name: timer_start_total
          help: Timers fired in the kernel
          table: counts
          labels:
            - name: function
              size: 8
              decoders:
                - name: ksym

We’re defining a metric called timer_start_total, which aggregates how often the kernel starts a timer. We also specify that we want to collect this information from a BPF map called counts. Finally, we define a translation function for the map keys. This is necessary because map keys are usually pointers to the information, and we want to send Prometheus the actual function names.
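The following Python sketch illustrates the idea behind such a translation: it unpacks an 8-byte key (matching size: 8 in the configuration) into an address and looks it up in a symbol table laid out like /proc/kallsyms. The sample addresses are made up, the key is assumed to be little-endian, and this is not eBPF Exporter’s own decoder, only an approximation of what a ksym lookup has to do:

```python
import struct

# Made-up sample in the "address type name" layout of /proc/kallsyms.
SAMPLE_KALLSYMS = """\
ffffffff810a1b20 T tick_sched_timer
ffffffff810b2c40 t delayed_work_timer_fn
"""


def load_symbols(text):
    """Map each symbol address to its name."""
    symbols = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            symbols[int(fields[0], 16)] = fields[2]
    return symbols


def decode_ksym(raw_key, symbols):
    """Turn an 8-byte map key into a function name.

    Assumes a little-endian host; unknown addresses fall back to hex.
    """
    (address,) = struct.unpack("<Q", raw_key)
    return symbols.get(address, hex(address))


symbols = load_symbols(SAMPLE_KALLSYMS)
# Stand-in for the raw key bytes that the BPF map would hold.
key = struct.pack("<Q", 0xFFFFFFFF810A1B20)
print(decode_ksym(key, symbols))
```

Without this step, the function label would carry a raw kernel address, which is hard to read and can change between boots.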

The second section in this example describes the probes we want to attach our BPF program to. In this case, we want to trace the timer start calls; we use the tracepoint timer:timer_start for that:

    tracepoints:
      timer:timer_start: tracepoint__timer__timer_start

Here we’re telling eBPF Exporter that we want to attach the BPF function tracepoint__timer__timer_start to this specific tracepoint. Let’s see how to declare that function next:

    code: |
      BPF_HASH(counts, u64);
      // Generates function tracepoint__timer__timer_start
      TRACEPOINT_PROBE(timer, timer_start) {
          counts.increment((u64) args->function);
          return 0;
      }

The BPF program is inlined within the YAML file. This is probably one of our least favorite parts of this tool because YAML is particular about whitespace, but it works for small programs like this one. eBPF Exporter uses BCC to compile programs, so we have access to all its macros and helpers. The previous snippet uses the macro TRACEPOINT_PROBE to generate the final function, named tracepoint__timer__timer_start, that we attach to our tracepoint.

Cloudflare uses eBPF Exporter to monitor metrics across all of its datacenters. The company made sure to bundle the most common metrics that you’ll want to export from your systems. But as you can see, it’s relatively easy to extend with new metrics.

Conclusion

In this chapter we talked about some of our favorite tools for system analysis. These tools are general enough to keep on hand for when you need to debug any kind of anomaly on your system. As you can see, all these tools abstract the concepts that we saw in the previous chapters to help you use BPF even when the environment is not ready for it. This is one of the many advantages of BPF over other analysis tools; because any modern Linux kernel includes the BPF VM, you can build new tools on top that take advantage of these powerful capabilities.

There are many other tools that use BPF for similar purposes, such as Cilium and Sysdig, and we encourage you to try them.

This chapter and Chapter 4 dealt mostly with system analysis and tracing, but there is much more that you can do with BPF. In the next chapters we dive into its networking capabilities. We show you how to analyze traffic in any network and how to use BPF to control messages in your network.
