Chapter 6. Linux Networking and BPF

From a networking point of view, we use BPF programs for two main use cases: packet capturing and filtering.

This means that a user-space program can attach a filter to any socket and extract information about packets flowing through it and allow/disallow/redirect certain kinds of packets as they are seen at that level.

The goal of this chapter is to explain how BPF programs can interact with the Socket Buffer structure at different stages of the network data path in the Linux kernel network stack. We identify two types of programs as common use cases:

  • Program types related to sockets

  • Programs written for the BPF-based classifier for Traffic Control

Note

The Socket Buffer structure, also called SKB or sk_buff, is the one in the kernel that is created and used for every packet sent or received. By reading the SKB you can pass or drop packets and populate BPF maps to create statistics and flow metrics about the current traffic.

In addition, some BPF programs allow you to manipulate the SKB and, by extension, transform the final packets in order to redirect them or change their fundamental structure. For example, on an IPv6-only system, you might write a program that converts all the received packets from IPv4 to IPv6, which can be accomplished by mangling the packets’ SKBs.

Understanding the differences between the kinds of programs we can write, and how different programs lead to the same goal, is the key to understanding BPF and eBPF in networking. In the next section we look at the first two ways to do filtering at the socket level: using classic BPF filters and using eBPF programs attached to sockets.

BPF and Packet Filtering

As stated, BPF filters and eBPF programs are the principal use cases for BPF programs in the context of networking; however, originally, BPF programs were synonymous with packet filtering.

Packet filtering is still one of the most important use cases and has been expanded from classic BPF (cBPF) to the modern eBPF in Linux 3.19 with the addition of map-related functions to the filter program type BPF_PROG_TYPE_SOCKET_FILTER.

Filters can be used mainly in three high-level scenarios:

  • Live traffic dropping (e.g., allowing only User Datagram Protocol [UDP] traffic and discarding anything else)

  • Live observation of a filtered set of packets flowing into a live system

  • Retrospective analysis of network traffic captured on a live system, using the pcap format, for example

Note

The term pcap comes from the conjunction of two words: packet and capture. The pcap format is implemented as a domain-specific API for packet capturing in a library called Packet Capture Library (libpcap). This format is useful in debugging scenarios when you want to save a set of packets that have been captured on a live system directly to a file to analyze them later using a tool that can read a stream of packets exported in the pcap format.

In the following sections we show two different ways to apply the concept of packet filtering with BPF programs. First we show how a common and widespread tool like tcpdump acts as a higher-level interface for BPF programs used as filters. Then we write and load our own program using the BPF_PROG_TYPE_SOCKET_FILTER BPF program type.

tcpdump and BPF Expressions

When talking about live traffic analysis and observation, one of the command-line tools almost everyone knows is tcpdump. Essentially a frontend for libpcap, it allows the user to define high-level filtering expressions. tcpdump reads packets from a network interface of your choice (or any interface) and writes their content to stdout or a file. The packet stream can then be filtered using the pcap filter syntax, a DSL that filters packets using a higher-level set of expressions made of primitives that are generally easier to remember than BPF assembly. Explaining all the primitives and expressions possible in the pcap filter syntax is out of the scope of this chapter because the entire set can be found in man 7 pcap-filter, but we go through some examples so that you can understand its power.

The scenario: we are on a Linux box exposing a web server on port 8080. This web server is not logging the requests it receives, and we really want to know whether it is receiving any requests and how those requests are flowing into it, because a customer of the served application is complaining about not being able to get any response while browsing the products page. At this point, we know only that the customer is connecting to one of our products pages using our web application served by that web server, and as almost always happens, we have no idea what the cause could be, because end users generally don’t try to debug your services for you, and unfortunately we didn’t deploy any logging or error-reporting strategy on this system, so we are completely blind while investigating the problem. Fortunately, there’s a tool that can come to our rescue: tcpdump, which can be told to filter only IPv4 packets flowing through our system that use the Transmission Control Protocol (TCP) on port 8080. We will then be able to analyze the web server’s traffic and understand which requests are faulty.

Here’s the command to conduct that filtering with tcpdump:

# tcpdump -n 'ip and tcp port 8080'

Let’s take a look at what’s happening in this command:

  • -n tells tcpdump not to convert addresses to their respective names; we want to see the source and destination addresses.

  • ip and tcp port 8080 is the pcap filter expression that tcpdump will use to filter your packets. ip means IPv4, and is the conjunction that lets us build a more complex filter by adding more expressions to match, and then we specify that we are interested only in TCP packets coming from or going to port 8080 with tcp port 8080. In this specific case, a better filter would’ve been tcp dst port 8080, because we are interested only in packets having 8080 as the destination port, not packets coming from it.

The output of that will be something like this (without the redundant parts like complete TCP handshakes):

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlp4s0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:04:29.593703 IP 192.168.1.249.44206 > 192.168.1.63.8080: Flags [P.],
   seq 1:325, ack 1, win 343,
   options [nop,nop,TS val 25580829 ecr 595195678],
   length 324: HTTP: GET / HTTP/1.1
12:04:29.596073 IP 192.168.1.63.8080 > 192.168.1.249.44206: Flags [.],
   seq 1:1449, ack 325, win 507,
   options [nop,nop,TS val 595195731 ecr 25580829],
   length 1448: HTTP: HTTP/1.1 200 OK
12:04:29.596139 IP 192.168.1.63.8080 > 192.168.1.249.44206: Flags [P.],
   seq 1449:2390, ack 325, win 507,
   options [nop,nop,TS val 595195731 ecr 25580829],
   length 941: HTTP
12:04:46.242924 IP 192.168.1.249.44206 > 192.168.1.63.8080: Flags [P.],
   seq 660:996, ack 4779, win 388,
   options [nop,nop,TS val 25584934 ecr 595204802],
   length 336: HTTP: GET /api/products HTTP/1.1
12:04:46.243594 IP 192.168.1.63.8080 > 192.168.1.249.44206: Flags [P.],
   seq 4779:4873, ack 996, win 503,
   options [nop,nop,TS val 595212378 ecr 25584934],
   length 94: HTTP: HTTP/1.1 500 Internal Server Error
12:04:46.329245 IP 192.168.1.249.44234 > 192.168.1.63.8080: Flags [P.],
   seq 471:706, ack 4779, win 388,
   options [nop,nop,TS val 25585013 ecr 595205622],
   length 235: HTTP: GET /favicon.ico HTTP/1.1
12:04:46.331659 IP 192.168.1.63.8080 > 192.168.1.249.44234: Flags [.],
   seq 4779:6227, ack 706, win 506,
   options [nop,nop,TS val 595212466 ecr 25585013],
   length 1448: HTTP: HTTP/1.1 200 OK
12:04:46.331739 IP 192.168.1.63.8080 > 192.168.1.249.44234: Flags [P.],
   seq 6227:7168, ack 706, win 506,
   options [nop,nop,TS val 595212466 ecr 25585013],
   length 941: HTTP

The situation is a lot clearer now! We have a bunch of requests going well, returning a 200 OK status code, but there is also one with a 500 Internal Server Error code on the /api/products endpoint. Our customer is right; we have a problem listing the products!

At this point, you might ask yourself, what does all this pcap filtering stuff and tcpdump have to do with BPF programs if they have their own syntax? Pcap filters on Linux are compiled to BPF programs! And because tcpdump uses pcap filters for the filtering, this means that every time you execute tcpdump using a filter, you are actually compiling and loading a BPF program to filter your packets. Fortunately, by passing the -d flag to tcpdump, you can dump the BPF instructions that it will load while using the specified filter:

tcpdump -d 'ip and tcp port 8080'

The filter is the same as the one used in the previous example, but the output now is a set of BPF assembly instructions because of the -d flag.

Here’s the output:

(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 12
(002) ldb      [23]
(003) jeq      #0x6             jt 4    jf 12
(004) ldh      [20]
(005) jset     #0x1fff          jt 12   jf 6
(006) ldxb     4*([14]&0xf)
(007) ldh      [x + 14]
(008) jeq      #0x1f90          jt 11   jf 9
(009) ldh      [x + 16]
(010) jeq      #0x1f90          jt 11   jf 12
(011) ret      #262144
(012) ret      #0

Let’s analyze it:

ldh [12]

(ld) Load a (h) half-word (16 bits) into the accumulator from offset 12 in the packet, which is where the Ethertype field lives, as shown in Figure 6-1.

jeq #0x800 jt 2 jf 12

(j) Jump if (eq) equal: check whether the Ethertype value loaded by the previous instruction is equal to 0x800, which is the identifier for IPv4, and then use the jump destinations, which are 2 if true (jt) and 12 if false (jf). So execution continues with the next instruction if the Internet Protocol is IPv4; otherwise, it jumps to the end and returns zero.

ldb [23]

(ld) Load a (b) byte: this loads the higher-layer protocol field of the IP header, found at packet offset 23. Offset 23 comes from adding the 14 bytes of the Ethernet Layer 2 frame header (see Figure 6-1) to the offset of the protocol field within the IPv4 header, which is 9, so 14 + 9 = 23.

jeq #0x6 jt 4 jf 12

Again, a jump if equal. In this case, we check whether the protocol extracted by the previous instruction is 0x6, which is TCP. If it is, we jump to the next instruction (4); if it is not, we jump to the end (12) and drop the packet.

ldh [20]

This is another load half-word instruction; in this case, it loads the flags and fragment offset field of the IPv4 header, which sits at packet offset 20.

jset #0x1fff jt 12 jf 6

This jset instruction jumps to 12 if any of the bits selected by the mask are set; otherwise, it goes to 6, which is the next instruction. The mask after the instruction, 0x1fff, tells jset to look only at the last 13 bits of the loaded value, which hold the fragment offset. (Expanded, it becomes 0001 1111 1111 1111.) In practice this skips every IP fragment except the first, since only the first fragment carries the TCP ports.

ldxb 4*([14]&0xf)

(ld) Load into the index register (x) a (b) byte. This instruction loads the IP header length into x: the expression 4*([14]&0xf) takes the low nibble of the byte at offset 14, the IHL field expressed in 32-bit words, and multiplies it by 4 to obtain the header length in bytes.

ldh [x + 14]

Another load half-word instruction that fetches the value at offset (x + 14), IP header length + 14, which is the location of the source port within the packet.

jeq #0x1f90 jt 11 jf 9

If the value at (x + 14) is equal to 0x1f90 (8080 in decimal), meaning the source port is 8080, continue to 11; if not, continue to 9 to check the destination port.

ldh [x + 16]

This is another load half-word instruction that fetches the value at offset (x + 16), which is the location of the destination port in the packet.

jeq #0x1f90 jt 11 jf 12

Here’s another jump if equal, this time checking the destination port: if it is 8080, go to 11 and match; if not, go to 12 and discard the packet.

ret #262144

When this instruction is reached, a match has been found, so the matched snapshot length is returned. By default this value is 262,144 bytes. It can be tuned using the -s parameter in tcpdump.

Diagram showing the Layer 2 Ethernet frame structure and the respective lengths
Figure 6-1. Layer 2 Ethernet frame structure

Here’s the “correct” example because, as we said, in the case of our web server we only need to take into account packets having 8080 as the destination port, not as the source, so the tcpdump filter can specify it with the dst qualifier:

tcpdump -d 'ip and tcp dst port 8080'

In this case, the dumped set of instructions is similar to the previous example, but as you can see, it lacks the entire part matching packets with source port 8080. In fact, there’s no ldh [x + 14] and no corresponding jeq #0x1f90 jt 11 jf 9.

(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 10
(002) ldb      [23]
(003) jeq      #0x6             jt 4    jf 10
(004) ldh      [20]
(005) jset     #0x1fff          jt 10   jf 6
(006) ldxb     4*([14]&0xf)
(007) ldh      [x + 16]
(008) jeq      #0x1f90          jt 9    jf 10
(009) ret      #262144
(010) ret      #0

Besides just analyzing the assembly generated by tcpdump, as we did, you might want to write your own code to filter network packets. It turns out that the biggest challenge in that case is actually debugging the execution of the code to make sure it matches your expectations. For this purpose, the kernel source tree provides a tool in tools/bpf called bpf_dbg.c, which is essentially a debugger that allows you to load a program and a pcap file to test the execution step by step.

Tip

tcpdump can also read directly from a .pcap file and apply BPF filters to it.

Packet Filtering for Raw Sockets

The BPF_PROG_TYPE_SOCKET_FILTER program type allows you to attach a BPF program to a socket. All of the packets received by that socket are passed to the program in the form of an sk_buff struct, and the program can then decide whether to discard or allow them. This kind of program also has the ability to access and work with maps.

Let’s look at an example to see how this kind of BPF program can be used.

The purpose of our example program is to count the number of TCP, UDP, and Internet Control Message Protocol (ICMP) packets flowing in the interface under observation. To do that, we need the following:

  • The BPF program that can see the packets flowing

  • The code to load the program and attach it to a network interface

  • A script to compile the program and launch the loader

At this point, we can write our BPF program in two ways: as C code that is then compiled to an ELF file, or directly as BPF assembly. For this example, we opted for C code to show a higher-level abstraction and how to use Clang to compile the program. It’s important to note that this program uses headers and helpers available only in the Linux kernel’s source tree, so the first thing to do is obtain a copy of it using Git. To avoid differences, you can check out the same commit SHA we used for this example:

export KERNEL_SRCTREE=/tmp/linux-stable
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git \
  $KERNEL_SRCTREE
cd $KERNEL_SRCTREE
git checkout 4b3c31c8d4dda4d70f3f24a165f3be99499e0328

Tip

For BPF support, you need clang >= 3.4.0 with llvm >= 3.7.1. To verify BPF support in your installation, use the command llc -version and look for the BPF target.

Now that you understand socket filtering, we can get our hands on a BPF program of type socket.

The BPF program

The main duty of the BPF program here is to access the packet it receives, check whether its protocol is TCP, UDP, or ICMP, and then increment the counter in the array map at the key for the protocol found.

For this program we are going to take advantage of the loading mechanism that parses ELF files using the helpers located in samples/bpf/bpf_load.c in the kernel source tree. The load function load_bpf_file is able to recognize some specific ELF section headers and can associate them to the respective program types. Here’s how that code looks:

	bool is_socket = strncmp(event, "socket", 6) == 0;
	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
	bool is_raw_tracepoint = strncmp(event, "raw_tracepoint/", 15) == 0;
	bool is_xdp = strncmp(event, "xdp", 3) == 0;
	bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
	bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
	bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
	bool is_sockops = strncmp(event, "sockops", 7) == 0;
	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;

The first thing the code does is create an association between the section header and an internal variable; for SEC("socket"), for example, we end up with bool is_socket = true.

Later in the same file, we see a set of if instructions that create the association between the header and the actual prog_type, so for is_socket we end up with BPF_PROG_TYPE_SOCKET_FILTER:

	if (is_socket) {
		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	} else if (is_kprobe || is_kretprobe) {
		prog_type = BPF_PROG_TYPE_KPROBE;
	} else if (is_tracepoint) {
		prog_type = BPF_PROG_TYPE_TRACEPOINT;
	} else if (is_raw_tracepoint) {
		prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT;
	} else if (is_xdp) {
		prog_type = BPF_PROG_TYPE_XDP;
	} else if (is_perf_event) {
		prog_type = BPF_PROG_TYPE_PERF_EVENT;
	} else if (is_cgroup_skb) {
		prog_type = BPF_PROG_TYPE_CGROUP_SKB;
	} else if (is_cgroup_sk) {
		prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
	} else if (is_sockops) {
		prog_type = BPF_PROG_TYPE_SOCK_OPS;
	} else if (is_sk_skb) {
		prog_type = BPF_PROG_TYPE_SK_SKB;
	} else if (is_sk_msg) {
		prog_type = BPF_PROG_TYPE_SK_MSG;
	} else {
		printf("Unknown event '%s'\n", event);
		return -1;
	}

Good, so because we want to write a BPF_PROG_TYPE_SOCKET_FILTER program, we need to specify SEC("socket") as the ELF section for the function that will act as the entry point of our BPF program.

As you can see by that list, there are a variety of program types related to sockets and in general network operations. In this chapter we are showing examples with BPF_PROG_TYPE_SOCKET_FILTER; however, you can find a definition of all the other program types in Chapter 2. Moreover, in Chapter 7 we discuss XDP programs with the program type BPF_PROG_TYPE_XDP.

Because we want to store the count of packets for every protocol we encounter, we need a key/value map where the protocol is the key and the packet count is the value. For that purpose, we can use a BPF_MAP_TYPE_ARRAY:

struct bpf_map_def SEC("maps") countmap = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 256,
};

The map is defined using the bpf_map_def struct, and it will be named countmap for reference in the program.

At this point, we can write some code to actually count the packets. We know that programs of type BPF_PROG_TYPE_SOCKET_FILTER fit our needs because with such a program we can see all the packets flowing through an interface. Therefore, we place the program in the right ELF section with SEC("socket"):

SEC("socket")
int socket_prog(struct __sk_buff *skb) {
  /* Read the protocol field of the IP header, right after the Ethernet header */
  int proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
  int one = 1;
  int *el = bpf_map_lookup_elem(&countmap, &proto);
  if (el) {
    (*el)++;   /* counter already in the map: increment it */
  } else {
    el = &one; /* first packet seen for this protocol */
  }
  bpf_map_update_elem(&countmap, &proto, el, BPF_ANY);
  return 0;    /* pass zero bytes to the socket; we only count */
}

With the section header in place, we can use the load_byte function to extract the protocol field from the sk_buff struct. Then we use the protocol ID as the key for a bpf_map_lookup_elem operation to extract the current counter value from our countmap, so that we can increment it, or set it to 1 if it is the first packet ever. Finally, we update the map with the incremented value using bpf_map_update_elem.

To compile the program to an ELF file, we just use Clang with -target bpf. This command creates a bpf_program.o file that we will load using the loader:

clang -O2 -target bpf -c bpf_program.c -o bpf_program.o

Load and attach to a network interface

The loader is the program that opens our compiled BPF ELF binary, bpf_program.o, and attaches the BPF program it defines, along with its maps, to a socket created against the interface under observation, in our case lo, the loopback interface.

The most important part of the loader is the actual loading of the ELF file:

  if (load_bpf_file(filename)) {
    printf("%s", bpf_log_buf);
    return 1;
  }

  sock = open_raw_sock("lo");

  if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
                 sizeof(prog_fd[0]))) {
    printf("setsockopt %s\n", strerror(errno));
    return 0;
  }

This populates the prog_fd array with one element: the file descriptor of our loaded program, which we can now attach to the socket descriptor of our loopback interface lo, opened with open_raw_sock.

The attach is done by setting the option SO_ATTACH_BPF to the raw socket opened for the interface.

At this point, our user-space loader is able to look up the map elements while the kernel updates them:

  for (i = 0; i < 10; i++) {
    key = IPPROTO_TCP;
    assert(bpf_map_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);

    key = IPPROTO_UDP;
    assert(bpf_map_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);

    key = IPPROTO_ICMP;
    assert(bpf_map_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);

    printf("TCP %d UDP %d ICMP %d packets\n", tcp_cnt, udp_cnt, icmp_cnt);
    sleep(1);
  }

To do the lookup, we query the array map in a for loop with bpf_map_lookup_elem so that we can read and print the values of the TCP, UDP, and ICMP packet counters, respectively.

The only thing left is to compile the program!

Because this program is using libbpf, we need to compile it from the kernel source tree we just cloned:

$ cd $KERNEL_SRCTREE/tools/lib/bpf
$ make

Now that we have libbpf, we can compile the loader using this script:

KERNEL_SRCTREE=$1
LIBBPF=${KERNEL_SRCTREE}/tools/lib/bpf/libbpf.a
clang -o loader-bin -I${KERNEL_SRCTREE}/tools/lib/bpf/ \
  -I${KERNEL_SRCTREE}/tools/lib -I${KERNEL_SRCTREE}/tools/include \
  -I${KERNEL_SRCTREE}/tools/perf -I${KERNEL_SRCTREE}/samples \
  ${KERNEL_SRCTREE}/samples/bpf/bpf_load.c \
  loader.c "${LIBBPF}" -lelf

As you can see, the script includes a bunch of headers and the libbpf library from the kernel itself, so it must know where to find the kernel source code. Because the script takes the kernel source tree as its first argument, write it into a file and invoke it with that path:

$ ./build-loader.sh /tmp/linux-stable

At this point, the build script has created a loader-bin file that can finally be started along with the BPF program’s ELF file (requires root privileges):

# ./loader-bin bpf_program.o

After the program is loaded and started, it does 10 dumps, one every second, showing the packet count for each of the three protocols considered. Because the program is attached to the loopback device lo, you can run ping alongside the loader and see the ICMP counter increasing.

So run ping to generate ICMP traffic to localhost:

$ ping -c 100 127.0.0.1

This starts pinging localhost 100 times and outputs something like this:

PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.093 ms
64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from 127.0.0.1: icmp_seq=5 ttl=64 time=0.105 ms
64 bytes from 127.0.0.1: icmp_seq=6 ttl=64 time=0.093 ms
64 bytes from 127.0.0.1: icmp_seq=7 ttl=64 time=0.104 ms
64 bytes from 127.0.0.1: icmp_seq=8 ttl=64 time=0.142 ms

Then, in another terminal, we can finally run our BPF program:

# ./loader-bin bpf_program.o

It begins dumping out the following:

TCP 0 UDP 0 ICMP 0 packets
TCP 0 UDP 0 ICMP 4 packets
TCP 0 UDP 0 ICMP 8 packets
TCP 0 UDP 0 ICMP 12 packets
TCP 0 UDP 0 ICMP 16 packets
TCP 0 UDP 0 ICMP 20 packets
TCP 0 UDP 0 ICMP 24 packets
TCP 0 UDP 0 ICMP 28 packets
TCP 0 UDP 0 ICMP 32 packets
TCP 0 UDP 0 ICMP 36 packets

At this point, you already know a good amount of what is needed to filter packets on Linux using a socket filter eBPF program. Here’s some big news: that’s not the only way! Instead of attaching to sockets directly, you might want to instrument the kernel’s packet scheduling subsystem. Just read the next section to learn how.

BPF-Based Traffic Control Classifier

Traffic Control is the kernel’s packet scheduling subsystem. It is made of mechanisms and queuing systems that decide how packets flow and how they are accepted.

Some use cases for Traffic Control include, but are not limited to, the following:

  • Prioritize certain kinds of packets

  • Drop specific kinds of packets

  • Distribute bandwidth

Given that in general Traffic Control is the way to go when you need to redistribute network resources in a system, to get the best out of it, specific Traffic Control configurations should be deployed based on the kind of applications that you want to run. Traffic Control provides a programmable classifier, called cls_bpf, to let you hook BPF programs into different levels of the scheduling operations, where they can read and update the socket buffer and packet metadata to do things like traffic shaping, tracing, preprocessing, and more.

Support for eBPF in cls_bpf was implemented in kernel 4.1, which means that this kind of program has access to eBPF maps, has tail call support, can access IPv4/IPv6 tunnel metadata, and in general can use the helpers and utilities that come with eBPF.

The tooling used to interact with networking configuration related to Traffic Control is part of the iproute2 suite, which contains ip and tc, which are used to manipulate network interfaces and traffic control configuration, respectively.

At this point, learning Traffic Control can be difficult without the proper reference in terms of terminology. The following section can help.

Terminology

As mentioned, there are interaction points between Traffic Control and BPF programs, so you need to understand some Traffic Control concepts. If you have already mastered Traffic Control, feel free to skip this terminology section and go straight to the examples.

Queueing disciplines

Queuing disciplines (qdisc) define the scheduling objects used to enqueue packets going to an interface by changing the way they are sent; those objects can be classless or classful.

The default qdisc is pfifo_fast, which is classless and enqueues packets on three FIFO (first in first out) queues that are dequeued based on their priority; this qdisc is not used for virtual devices like the loopback (lo) or Virtual Ethernet devices (veth) that use noqueue instead. Besides being a good default for its scheduling algorithm, pfifo_fast also doesn’t require any configuration to work.

Virtual interfaces can be distinguished from physical interfaces (devices) by asking the /sys pseudo filesystem:

ls -la /sys/class/net
total 0
drwxr-xr-x  2 root root 0 Feb 13 21:52 .
drwxr-xr-x 64 root root 0 Feb 13 18:38 ..
lrwxrwxrwx  1 root root 0 Feb 13 23:26 docker0 ->
../../devices/virtual/net/docker0
lrwxrwxrwx  1 root root 0 Feb 13 23:26 enp0s31f6 ->
../../devices/pci0000:00/0000:00:1f.6/net/enp0s31f6
lrwxrwxrwx  1 root root 0 Feb 13 23:26 lo -> ../../devices/virtual/net/lo

At this point, some confusion is normal. If you’ve never heard about qdiscs, one thing you can do is to use the ip a command to show the list of network interfaces configured in the current system:

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc
fq_codel state DOWN group default
qlen 1000
link/ether 8c:16:45:00:a7:7e brd ff:ff:ff:ff:ff:ff
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc
noqueue state DOWN group default
link/ether 02:42:38:54:3c:98 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
   valid_lft forever preferred_lft forever
inet6 fe80::42:38ff:fe54:3c98/64 scope link
   valid_lft forever preferred_lft forever

This list already tells us something. Can you find the word qdisc in it? Let’s analyze the situation:

  • We have three network interfaces in this system: lo, enp0s31f6, and docker0.

  • The lo interface is a virtual interface, so it has qdisc noqueue.

  • The enp0s31f6 is a physical interface. Wait, why is the qdisc here fq_codel (fair queue controlled delay)? Wasn’t pfifo_fast the default? It turns out that the system we’re testing the commands on is running Systemd, which is setting the default qdisc differently using the kernel parameter net.core.default_qdisc.

  • The docker0 interface is a bridge interface, so it uses a virtual device and has noqueue qdisc.

The noqueue qdisc doesn’t have classes, a scheduler, or a classifier. What it does is try to send packets immediately. As stated, noqueue is used by default by virtual devices, but it’s also the qdisc that becomes effective for any interface when you delete its currently associated qdisc.

fq_codel is a classless qdisc that classifies incoming packets using a stochastic model so that it can queue traffic flows in a fair way.

The situation should be clearer now; we used the ip command to find information about qdiscs, but it turns out that the iproute2 toolbelt also includes a tool called tc, which has a specific qdisc subcommand you can use to list them:

tc qdisc ls
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq_codel 0: dev enp0s31f6 root refcnt 2 limit 10240p flows 1024 quantum 1514
target 5.0ms interval 100.0ms memory_limit 32Mb ecn
qdisc noqueue 0: dev docker0 root refcnt 2

There’s much more going on here! For docker0 and lo we basically see the same information as with ip a, but for enp0s31f6, for example, it has the following:

  • A limit of 10,240 incoming packets that it can handle.

  • As mentioned, the stochastic model used by fq_codel queues traffic into different flows, and this output also tells us how many of them there are: 1,024.

Now that the key concepts of qdiscs have been introduced, we can take a closer look at classful and classless qdiscs in the next section to understand their differences and which ones are suitable for BPF programs.

Classful qdiscs, filters, and classes

Classful qdiscs allow the definition of classes for different kinds of traffic in order to apply different rules to them. Having a class for a qdisc means that it can contain further qdiscs. With this kind of hierarchy, then, we can use a filter (classifier) to classify the traffic by determining the next class where the packet should be enqueued.

Filters are used to assign packets to a particular class based on their type. Filters are used inside classful qdiscs to determine the class in which the packet should be enqueued, and two or more filters can map to the same class, as shown in Figure 6-2. Every filter uses a classifier to classify packets based on their information.

A qdisc containing a set of filters that map to two different classes that have associated qdiscs themselves
Figure 6-2. Classful qdisc with filters

As mentioned earlier, cls_bpf is the classifier that we want to use to write BPF programs for Traffic Control—we have a concrete example in the next sections on how to use it.

Classes are objects that can live only in a classful qdisc; classes are used in Traffic Control to create hierarchies. Complex hierarchies are made possible by the fact that a class can have filters attached to it, which can then be used as an entry point for another class or for a qdisc.

Classless qdiscs

A classless qdisc is a qdisc that can’t have any children because it is not allowed to have classes associated with it, which in turn means that no filters can be attached to it. Because we can’t add filters and classifiers to them, classless qdiscs are not interesting from a BPF point of view, but they remain useful for simple Traffic Control needs.

After building up some knowledge on qdiscs, filters, and classes, we now show you how to write BPF programs for a cls_bpf classifier.

Traffic Control Classifier Program Using cls_bpf

As we said, Traffic Control is a powerful mechanism that is made even more powerful by classifiers; among all of them, however, there is one that lets you program the network data path: the cls_bpf classifier. This classifier is special because it can run BPF programs, but what does that mean? It means that cls_bpf allows you to hook your BPF programs directly into the ingress and egress layers, and BPF programs hooked to those layers have access to the sk_buff struct for the respective packets.

To better understand the relationship between Traffic Control and BPF programs, see Figure 6-3, which shows how BPF programs are loaded against the cls_bpf classifier. You will also notice that such programs are hooked into the ingress and egress qdiscs. All the other interactions in context are also described. Taking the network interface as the entry point for network traffic, you will see the following:

  • The traffic first goes to the Traffic Control’s ingress hook.

  • Then the kernel executes the BPF program loaded into the ingress qdisc from user-space for every incoming request.

  • After the ingress program is executed, control is handed to the networking stack, which informs the user’s application about the networking event.

  • After the application produces a response, control passes to the Traffic Control’s egress qdisc, where another BPF program executes and, upon completion, gives control back to the kernel.

  • A response is given to the client.

You can write BPF programs for Traffic Control in C and compile them using LLVM/Clang with the BPF backend.

Diagram showing the interactions between Traffic Control and BPF programs loaded using cls_bpf
Figure 6-3. Loading of BPF programs using Traffic Control
Tip

Ingress and egress qdiscs allow you to hook Traffic Control into inbound (ingress) and outbound (egress) traffic, respectively.

To make this example work, you need to run it on a kernel that has cls_bpf compiled in, either built in directly or as a module. To verify that you have everything you need, you can do the following:

zcat /proc/config.gz | grep -i BPF

Make sure you get at least the following output with either y or m:

CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=m
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
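Not every kernel exposes /proc/config.gz (it requires CONFIG_IKCONFIG_PROC). As a fallback, assuming your distribution ships the kernel config in /boot, you can run the same check against the installed kernel’s config file:

```shell
# Same check, reading the on-disk config for the running kernel
grep -i BPF "/boot/config-$(uname -r)"
```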

Let’s now see how we write the classifier:

SEC("classifier")
static inline int classification(struct __sk_buff *skb) {
  void *data_end = (void *)(long)skb->data_end;
  void *data = (void *)(long)skb->data;
  struct ethhdr *eth = data;

  __u16 h_proto;
  __u64 nh_off = 0;
  nh_off = sizeof(*eth);

  if (data + nh_off > data_end) {
    return TC_ACT_OK;
  }

The “main” of our classifier is the classification function. This function is annotated with a section header called classifier so that tc can know that this is the classifier to use.

At this point, we need to extract some information from the skb; the data member contains all the data for the current packet and all its protocol details. To let our program know what’s inside it, we need to cast it to an Ethernet frame (in our case, with the *eth variable). To make the static verifier happy, we need to check that data plus the size of the Ethernet header does not go past data_end. After that, we can go one level inward and get the protocol type from the h_proto member in *eth:

  h_proto = eth->h_proto;

  if (h_proto == bpf_htons(ETH_P_IP)) {
    if (is_http(skb, nh_off) == 1) {
      trace_printk("Yes! It is HTTP!\n");
    }
  }

  return TC_ACT_OK;
}

After we have the protocol, we compare it against the IPv4 EtherType, the one we are interested in, using bpf_htons to convert the constant from host to network byte order; if it matches, we check whether the inner packet is HTTP using our own is_http function. If it is, we print a debug message stating that we found an HTTP packet:

  void *data_end = (void *)(long)skb->data_end;
  void *data = (void *)(long)skb->data;
  struct iphdr *iph = data + nh_off;

  if ((void *)(iph + 1) > data_end) {
    return 0;
  }

  if (iph->protocol != IPPROTO_TCP) {
    return 0;
  }
  __u32 tcp_hlen = 0;

The is_http function is similar to our classifier function, but it starts from an skb already knowing the start offset of the IPv4 protocol data. As we did earlier, we need to do a check before accessing the IP protocol data with the *iph variable to let the static verifier know our intentions.

When that’s done, we check whether the IPv4 header contains a TCP packet so that we can proceed. If the packet’s protocol is of type IPPROTO_TCP, we do some more checks to get the actual TCP header into the *tcph variable:

  plength = ip_total_length - ip_hlen - tcp_hlen;
  if (plength >= 7) {
    unsigned long p[7];
    int i = 0;
    for (i = 0; i < 7; i++) {

      p[i] = load_byte(skb, poffset + i);
    }
    if ((p[0] == 'H') && (p[1] == 'T') && (p[2] == 'T') && (p[3] == 'P')) {
      return 1;
    }
  }

  return 0;
}

Now that we have the TCP header, we can go ahead and load the first seven bytes from the skb struct at poffset, the offset of the TCP payload. At this point we check whether the byte array starts with the sequence HTTP; if so, we know that the Layer 7 protocol is HTTP and we return 1. Otherwise, we return zero.

As you can see, our program is simple. It will basically allow everything, and when receiving an HTTP packet, it will let us know with a debugging message.

You can compile the program with Clang, using the bpf target, as we did before with the socket filter example. The difference for Traffic Control is in the loading: the resulting ELF file classifier.o will be loaded by tc this time and not by our own custom loader:

clang -O2 -target bpf -c classifier.c -o classifier.o

Now we can install the program on the interface we want it to operate on; in our case, it was eth0.

The first command adds the special ingress qdisc to the eth0 device, and the second one loads our cls_bpf classifier into that ingress qdisc. This means that our program will handle all traffic going into that interface; to handle outgoing traffic, we would need to use the egress qdisc instead:

# tc qdisc add dev eth0 handle 0: ingress
# tc filter add dev eth0 ingress bpf obj classifier.o flowid 0:

Our program is now loaded; all we need is to send some HTTP traffic to that interface.

To do that, you need an HTTP server listening on that interface so that you can curl the interface’s IP.

If you don’t have one, you can start a test HTTP server using Python 3 with the http.server module; it serves a directory listing of the current working directory on port 8000:

python3 -m http.server

At this point you can call the server with curl:

$ curl http://192.168.1.63:8000

After that, you should see the HTTP response from the server. Now you can read your debugging messages (created with trace_printk) using the dedicated tc command:

# tc exec bpf dbg

The output will be something like this:

Running! Hang up with ^C!

         python3-18456 [000] ..s1 283544.114997: 0: Yes! It is HTTP!
         python3-18754 [002] ..s1 283566.008163: 0: Yes! It is HTTP!

Congratulations! You just made your first BPF Traffic Control classifier.

Tip

Instead of using a debugging message like we did in this example, you could use a map to communicate to user-space that the interface just received an HTTP packet. We leave this as an exercise for you to do. If you look at classifier.c in the previous example, you can get an idea of how to do that by looking at how we used the map countmap there.

At this point, what you might want is to unload the classifier. You can do that by deleting the ingress qdisc that you just attached to the interface:

# tc qdisc del dev eth0 ingress

Notes on act_bpf and how cls_bpf is different

You might have noticed that another object, called act_bpf, exists for BPF programs. It turns out that act_bpf is an action, not a classifier. This makes it operationally different: because actions are objects attached to filters, act_bpf cannot perform filtering directly, which means Traffic Control needs to consider all packets first. For this reason, it is usually preferable to use the cls_bpf classifier instead of the act_bpf action.

However, because act_bpf can be attached to any classifier, there might be cases for which you find it useful to just reuse a classifier you already have and attach a BPF program to it.
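For example, assuming you already classify traffic with the u32 classifier, a sketch like the following (the interface name and the object file action.o are placeholders) runs a BPF program as an action on every packet that filter matches:

```shell
# On the ingress qdisc added earlier, match all IP traffic with the
# u32 classifier, then run a BPF program on it as an action (act_bpf)
tc filter add dev eth0 ingress protocol ip u32 match u32 0 0 \
    action bpf obj action.o
```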

Differences Between Traffic Control and XDP

Even though the Traffic Control cls_bpf and XDP programs look very similar, they are quite different. XDP programs are executed earlier in the ingress data path, before entering the main kernel network stack, so our program does not have access to a socket buffer struct sk_buff as with tc. XDP programs instead take a different structure, called xdp_buff, which is an early, bare representation of the packet without metadata. All this comes with advantages and disadvantages. For example, because they execute before any kernel networking code, XDP programs can drop packets very efficiently. On the other hand, compared to Traffic Control programs, XDP programs can be attached only to ingress traffic.

At this point, you might be asking yourself when it’s an advantage to use one instead of the other. The answer is that, because XDP programs do not carry the kernel-enriched data structures and metadata, they are better suited for use cases covering OSI layers up to Layer 4. But let’s not spoil all the content of the next chapter!

Conclusion

It should now be pretty clear to you that BPF programs are useful for getting visibility and control at different levels of the networking data path. You’ve seen how to take advantage of them to filter packets using high-level tools that generate BPF assembly. Then we loaded a program onto a network socket, and finally we attached our programs to the Traffic Control ingress qdisc to do traffic classification using BPF programs. In this chapter we also briefly discussed XDP, but be prepared, because in Chapter 7 we cover the topic in its entirety by expanding on how XDP programs are constructed, what kinds of XDP programs there are, and how to write and test them.
