From a networking point of view, we use BPF programs for two main use cases: packet capturing and filtering.
This means that a user-space program can attach a filter to any socket and extract information about packets flowing through it and allow/disallow/redirect certain kinds of packets as they are seen at that level.
The goal of this chapter is to explain how BPF programs can interact with the Socket Buffer structure at different stages of the network data path in the Linux kernel network stack. We identify two common types of programs:
Program types related to sockets
Programs written for the BPF-based classifier for Traffic Control
The Socket Buffer structure, also called SKB or sk_buff, is the structure in the kernel that is created and used for every packet sent or received. By reading the SKB you can pass or drop packets and populate BPF maps to create statistics and flow metrics about the current traffic.
In addition, some BPF programs allow you to manipulate the SKB and, by extension, transform the final packets in order to redirect them or change their fundamental structure. For example, on an IPv6-only system, you might write a program that converts all the received packets from IPv4 to IPv6, which can be accomplished by mangling the packets’ SKB.
Understanding the differences between the kinds of programs we can write, and how different programs lead to the same goal, is key to understanding BPF and eBPF in networking. In the next section we look at the first two ways to do filtering at the socket level: using classic BPF filters, and using eBPF programs attached to sockets.
As stated, BPF filters and eBPF programs are the principal use cases for BPF programs in the context of networking; however, originally, BPF programs were synonymous with packet filtering. Packet filtering is still one of the most important use cases, and it was expanded from classic BPF (cBPF) to the modern eBPF in Linux 3.19 with the addition of map-related functions to the filter program type BPF_PROG_TYPE_SOCKET_FILTER.
Filters can be used mainly in three high-level scenarios:
Live traffic dropping (e.g., allowing only User Datagram Protocol [UDP] traffic and discarding anything else)
Live observation of a filtered set of packets flowing into a live system
Retrospective analysis of network traffic captured on a live system, using the pcap format, for example
The term pcap comes from the conjunction of two words: packet and capture. The pcap format is implemented as a domain-specific API for packet capturing in a library called Packet Capture Library (libpcap). This format is useful in debugging scenarios when you want to save a set of packets that have been captured on a live system directly to a file to analyze them later using a tool that can read a stream of packets exported in the pcap format.
In the following sections we show two different ways to apply the concept of packet filtering with BPF programs. First we show how a common and widespread tool like tcpdump acts as a higher-level interface for BPF programs used as filters. Then we write and load our own program using the BPF_PROG_TYPE_SOCKET_FILTER BPF program type.
When talking about live traffic analysis and observation, one of the command-line tools that almost everyone knows about is tcpdump. Essentially a frontend for libpcap, it allows the user to define high-level filtering expressions. What tcpdump does is read packets from a network interface of your choice (or any interface) and write the content of the packets it receives to stdout or a file. The packet stream can then be filtered using the pcap filter syntax, a DSL used to filter packets with a higher-level set of expressions made of primitives that are generally easier to remember than BPF assembly. Explaining all the primitives and expressions possible in the pcap filter syntax is outside the scope of this chapter (the entire set can be found in man 7 pcap-filter), but we do go through some examples so that you can understand its power.
The scenario is that we have a Linux box exposing a web server on port 8080. This web server is not logging the requests it receives, and we really want to know whether it is receiving any requests and how those requests are flowing into it, because a customer of the served application is complaining about not being able to get any response while browsing the products page. At this point, we know only that the customer is connecting to one of our product pages using the web application served by that web server, and as almost always happens, we have no idea what the cause could be, because end users generally don’t try to debug your services for you. Unfortunately, we didn’t deploy any logging or error-reporting strategy on this system, so we are completely blind while investigating the problem. Fortunately, there’s a tool that can come to our rescue: tcpdump, which can be told to filter only IPv4 packets flowing through our system that are using the Transmission Control Protocol (TCP) on port 8080. We will then be able to analyze the web server’s traffic and identify the faulty requests.
Here’s the command to conduct that filtering with tcpdump:
# tcpdump -n 'ip and tcp port 8080'
Let’s take a look at what’s happening in this command:
-n is there to tell tcpdump not to convert addresses to their respective names; we want to see the source and destination addresses.
ip and tcp port 8080 is the pcap filter expression that tcpdump will use to filter your packets. ip means IPv4, and is a conjunction that allows combining expressions into a more complex filter, and tcp port 8080 specifies that we are interested only in TCP packets coming from or going to port 8080. In this specific case, a better filter would’ve been tcp dst port 8080, because we are interested only in packets having 8080 as the destination port and not in packets coming from it.
The output of that will be something like this (without the redundant parts like complete TCP handshakes):
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlp4s0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:04:29.593703 IP 192.168.1.249.44206 > 192.168.1.63.8080: Flags [P.], seq 1:325, ack 1, win 343, options [nop,nop,TS val 25580829 ecr 595195678], length 324: HTTP: GET / HTTP/1.1
12:04:29.596073 IP 192.168.1.63.8080 > 192.168.1.249.44206: Flags [.], seq 1:1449, ack 325, win 507, options [nop,nop,TS val 595195731 ecr 25580829], length 1448: HTTP: HTTP/1.1 200 OK
12:04:29.596139 IP 192.168.1.63.8080 > 192.168.1.249.44206: Flags [P.], seq 1449:2390, ack 325, win 507, options [nop,nop,TS val 595195731 ecr 25580829], length 941: HTTP
12:04:46.242924 IP 192.168.1.249.44206 > 192.168.1.63.8080: Flags [P.], seq 660:996, ack 4779, win 388, options [nop,nop,TS val 25584934 ecr 595204802], length 336: HTTP: GET /api/products HTTP/1.1
12:04:46.243594 IP 192.168.1.63.8080 > 192.168.1.249.44206: Flags [P.], seq 4779:4873, ack 996, win 503, options [nop,nop,TS val 595212378 ecr 25584934], length 94: HTTP: HTTP/1.1 500 Internal Server Error
12:04:46.329245 IP 192.168.1.249.44234 > 192.168.1.63.8080: Flags [P.], seq 471:706, ack 4779, win 388, options [nop,nop,TS val 25585013 ecr 595205622], length 235: HTTP: GET /favicon.ico HTTP/1.1
12:04:46.331659 IP 192.168.1.63.8080 > 192.168.1.249.44234: Flags [.], seq 4779:6227, ack 706, win 506, options [nop,nop,TS val 595212466 ecr 25585013], length 1448: HTTP: HTTP/1.1 200 OK
12:04:46.331739 IP 192.168.1.63.8080 > 192.168.1.249.44234: Flags [P.], seq 6227:7168, ack 706, win 506, options [nop,nop,TS val 595212466 ecr 25585013], length 941: HTTP
The situation is a lot clearer now! We have a bunch of requests going well, returning a 200 OK status code, but there is also one with a 500 Internal Server Error code on the /api/products endpoint. Our customer is right; we have a problem listing the products!
At this point, you might ask yourself: what do all this pcap filtering stuff and tcpdump have to do with BPF programs, if they have their own syntax? Pcap filters on Linux are compiled to BPF programs! And because tcpdump uses pcap filters for the filtering, every time you execute tcpdump with a filter, you are actually compiling and loading a BPF program to filter your packets. Fortunately, by passing the -d flag to tcpdump, you can dump the BPF instructions that it will load while using the specified filter:
tcpdump -d 'ip and tcp port 8080'
The filter is the same as the one used in the previous example, but the output now is a set of BPF assembly instructions because of the -d flag.
Here’s the output:
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 12
(002) ldb      [23]
(003) jeq      #0x6             jt 4    jf 12
(004) ldh      [20]
(005) jset     #0x1fff          jt 12   jf 6
(006) ldxb     4*([14]&0xf)
(007) ldh      [x + 14]
(008) jeq      #0x1f90          jt 11   jf 9
(009) ldh      [x + 16]
(010) jeq      #0x1f90          jt 11   jf 12
(011) ret      #262144
(012) ret      #0
Let’s analyze it:
ldh [12]
(ld) Load a (h) half-word (16 bits) into the accumulator from offset 12 of the packet, which is the Ethertype field, as shown in Figure 6-1.
jeq #0x800 jt 2 jf 12
(j) Jump if (eq) equal: check whether the Ethertype value from the previous instruction is equal to 0x800 (the identifier for IPv4), then use the jump destinations, which are 2 if true (jt) and 12 if false (jf). So execution continues at the next instruction if the Internet Protocol is IPv4; otherwise it jumps to the end and returns zero.
ldb [23]
Load a (b) byte: this loads the higher-layer protocol field from the IP header, which can be found at offset 23. Offset 23 comes from adding the 14 bytes of headers in the Ethernet layer 2 frame (see Figure 6-1) to the position of the protocol field within the IPv4 header, which is the 9th byte, so 14 + 9 = 23.
jeq #0x6 jt 4 jf 12
Again a jump if equal. In this case, we check whether the previously extracted protocol is 0x6, which is TCP. If it is, we jump to the next instruction (4); if not, we jump to the end (12) and drop the packet.
ldh [20]
This is another load half-word instruction; in this case, it loads the flags and fragment offset field from the IPv4 header.
jset #0x1fff jt 12 jf 6
This jset instruction jumps to 12 if any of the bits selected by the mask is set; otherwise, it continues to 6, the next instruction. The mask 0x1fff after the instruction tells jset to look only at the lowest 13 bits of the field, which hold the fragment offset. (Expanded, it becomes 0001 1111 1111 1111.)
ldxb 4*([14]&0xf)
(ld) Load into register (x) a value computed from the (b) byte at offset 14. This instruction loads the length of the IPv4 header, in bytes, into x.
ldh [x + 14]
Another load half-word instruction that gets the value at offset x + 14 (IP header length + 14), which is the location of the source port within the packet.
jeq #0x1f90 jt 11 jf 9
If the value at x + 14 is equal to 0x1f90 (8080 in decimal), meaning that the source port is 8080, continue to 11; otherwise continue to 9 to check whether the destination port is 8080.
ldh [x + 16]
Another load half-word instruction that gets the value at offset x + 16, which is the location of the destination port in the packet.
jeq #0x1f90 jt 11 jf 12
Another jump if equal, this time used to check the destination port: if it is 8080, go to 11; if not, go to 12 and discard the packet.
ret #262144
When this instruction is reached, a match was found, so return the matched snap length. By default this value is 262,144 bytes; it can be tuned using the -s parameter in tcpdump.

ret #0
When this instruction is reached, no match was found, so return zero: the packet is discarded and nothing is captured.
Here’s the “correct” example because, as we said, in the case of our web server we only need to take into account packets having 8080 as the destination port, not as the source, so the tcpdump filter can specify it with the dst field:
tcpdump -d 'ip and tcp dst port 8080'
In this case, the dumped set of instructions is similar to the previous example, but as you can see, it lacks the entire part about matching packets with source port 8080. In fact, there’s no ldh [x + 14] and no related jeq #0x1f90 jt 11 jf 9.
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 10
(002) ldb      [23]
(003) jeq      #0x6             jt 4    jf 10
(004) ldh      [20]
(005) jset     #0x1fff          jt 10   jf 6
(006) ldxb     4*([14]&0xf)
(007) ldh      [x + 16]
(008) jeq      #0x1f90          jt 9    jf 10
(009) ret      #262144
(010) ret      #0
Besides just analyzing the assembly generated by tcpdump, as we did, you might want to write your own code to filter network packets. It turns out that the biggest challenge in that case is actually debugging the execution of the code to make sure it matches your expectations; for that, the kernel source tree includes a tool in tools/bpf called bpf_dbg.c, which is essentially a debugger that allows you to load a program and a pcap file to test the execution step by step.
The BPF_PROG_TYPE_SOCKET_FILTER program type allows you to attach a BPF program to a socket. All of the packets received by the socket will be passed to the program in the form of an sk_buff struct, and the program can then decide whether to discard or allow them. This kind of program also has the ability to access and work with maps.
Let’s look at an example to see how this kind of BPF program can be used.
The purpose of our example program is to count the number of TCP, UDP, and Internet Control Message Protocol (ICMP) packets flowing in the interface under observation. To do that, we need the following:
The BPF program that can see the packets flowing
The code to load the program and attach it to a network interface
A script to compile the program and launch the loader
At this point, we can write our BPF program in two ways: as C code that is then compiled to an ELF file, or directly as BPF assembly. For this example, we opted for C code to show a higher-level abstraction and how to use Clang to compile the program. It’s important to note that this program uses headers and helpers available only in the Linux kernel’s source tree, so the first thing to do is to obtain a copy of it using Git. To avoid differences, you can check out the same commit SHA we’ve used for this example:
export KERNEL_SRCTREE=/tmp/linux-stable
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git $KERNEL_SRCTREE
cd $KERNEL_SRCTREE
git checkout 4b3c31c8d4dda4d70f3f24a165f3be99499e0328
For BPF support, you will need clang >= 3.4.0 with llvm >= 3.7.1. To verify BPF support in your installation, you can use the command llc -version and look to see whether it has the BPF target.
Now that you understand socket filtering, we can get our hands on a BPF program of type socket.
The main duty of the BPF program here is to access the packet it receives; check whether its protocol is TCP, UDP, or ICMP, and then increment the counter on the map array on the specific key for the found protocol.
For this program we are going to take advantage of the loading mechanism that parses ELF files using the helpers located in samples/bpf/bpf_load.c in the kernel source tree. The load function load_bpf_file is able to recognize some specific ELF section headers and associate them with the respective program types. Here’s how that code looks:
bool is_socket = strncmp(event, "socket", 6) == 0;
bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
bool is_raw_tracepoint = strncmp(event, "raw_tracepoint/", 15) == 0;
bool is_xdp = strncmp(event, "xdp", 3) == 0;
bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
bool is_sockops = strncmp(event, "sockops", 7) == 0;
bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
The first thing that the code does is create an association between the section header and an internal variable; for SEC("socket"), we end up with bool is_socket = true.
Later in the same file, we see a set of if instructions that create the association between the header and the actual prog_type, so for is_socket, we end up with BPF_PROG_TYPE_SOCKET_FILTER:
if (is_socket) {
    prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
} else if (is_kprobe || is_kretprobe) {
    prog_type = BPF_PROG_TYPE_KPROBE;
} else if (is_tracepoint) {
    prog_type = BPF_PROG_TYPE_TRACEPOINT;
} else if (is_raw_tracepoint) {
    prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT;
} else if (is_xdp) {
    prog_type = BPF_PROG_TYPE_XDP;
} else if (is_perf_event) {
    prog_type = BPF_PROG_TYPE_PERF_EVENT;
} else if (is_cgroup_skb) {
    prog_type = BPF_PROG_TYPE_CGROUP_SKB;
} else if (is_cgroup_sk) {
    prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
} else if (is_sockops) {
    prog_type = BPF_PROG_TYPE_SOCK_OPS;
} else if (is_sk_skb) {
    prog_type = BPF_PROG_TYPE_SK_SKB;
} else if (is_sk_msg) {
    prog_type = BPF_PROG_TYPE_SK_MSG;
} else {
    printf("Unknown event '%s'\n", event);
    return -1;
}
Good: because we want to write a BPF_PROG_TYPE_SOCKET_FILTER program, we need to specify SEC("socket") as the ELF header of the function that will act as the entry point for our BPF program.
As you can see from that list, there is a variety of program types related to sockets and, in general, network operations. In this chapter we show examples with BPF_PROG_TYPE_SOCKET_FILTER; however, you can find a definition of all the other program types in Chapter 2. Moreover, in Chapter 7 we discuss XDP programs with the program type BPF_PROG_TYPE_XDP.
Because we want to store the count of packets for every protocol we encounter, we need a key/value map where the protocol is the key and the packet count is the value. For that purpose, we can use a BPF_MAP_TYPE_ARRAY:
struct bpf_map_def SEC("maps") countmap = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 256,
};
The map is defined using the bpf_map_def struct, and it is named countmap for reference in the program.
At this point, we can write some code to actually count the packets. We know that programs of type BPF_PROG_TYPE_SOCKET_FILTER are one of our options because with such a program we can see all the packets flowing through an interface. Therefore we attach the program to the right section header with SEC("socket"):
SEC("socket")
int socket_prog(struct __sk_buff *skb) {
    int proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    int one = 1;
    int *el = bpf_map_lookup_elem(&countmap, &proto);
    if (el) {
        (*el)++;
    } else {
        el = &one;
    }
    bpf_map_update_elem(&countmap, &proto, el, BPF_ANY);
    return 0;
}
After the ELF header attachment, we can use the load_byte function to extract the protocol field from the sk_buff struct. Then we use the protocol ID as the key for a bpf_map_lookup_elem operation to extract the current counter value from countmap, so that we can increment it, or set it to 1 if it is the first packet. Finally, we update the map with the incremented value using bpf_map_update_elem.
To compile the program to an ELF file, we just use Clang with -target bpf. This command creates a bpf_program.o file that we will load using the loader:
clang -O2 -target bpf -c bpf_program.c -o bpf_program.o
The loader is the program that actually opens our compiled BPF ELF binary, bpf_program.o, and attaches the defined BPF program and its maps to a socket created against the interface under observation: in our case lo, the loopback interface.
The most important part of the loader is the actual loading of the ELF file:
if (load_bpf_file(filename)) {
    printf("%s", bpf_log_buf);
    return 1;
}

sock = open_raw_sock("lo");

if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
               sizeof(prog_fd[0]))) {
    printf("setsockopt %s\n", strerror(errno));
    return 0;
}
This will populate the prog_fd array with one element: the file descriptor of our loaded program, which we can now attach to the socket descriptor of our loopback interface lo, opened with open_raw_sock.
The attachment is done by setting the SO_ATTACH_BPF option on the raw socket opened for the interface.
At this point our user-space loader is able to look up map elements as the kernel populates them:
for (i = 0; i < 10; i++) {
    key = IPPROTO_TCP;
    assert(bpf_map_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);
    key = IPPROTO_UDP;
    assert(bpf_map_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);
    key = IPPROTO_ICMP;
    assert(bpf_map_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
    printf("TCP %d UDP %d ICMP %d packets\n", tcp_cnt, udp_cnt, icmp_cnt);
    sleep(1);
}
To do the lookup, we query the array map in a for loop using bpf_map_lookup_elem so that we can read and print the values of the TCP, UDP, and ICMP packet counters, respectively.
The only thing left is to compile the program!
Because this program is using libbpf, we need to compile it from the kernel source tree we just cloned:
$ cd $KERNEL_SRCTREE/tools/lib/bpf
$ make
Now that we have libbpf, we can compile the loader using this script:
KERNEL_SRCTREE=$1
LIBBPF=${KERNEL_SRCTREE}/tools/lib/bpf/libbpf.a
clang -o loader-bin -I${KERNEL_SRCTREE}/tools/lib/bpf/ \
    -I${KERNEL_SRCTREE}/tools/lib -I${KERNEL_SRCTREE}/tools/include \
    -I${KERNEL_SRCTREE}/tools/perf -I${KERNEL_SRCTREE}/samples \
    ${KERNEL_SRCTREE}/samples/bpf/bpf_load.c loader.c \
    "${LIBBPF}" -lelf
As you can see, the script includes a bunch of headers and the libbpf library from the kernel itself, so it must know where to find the kernel source code. To do that, you can replace $KERNEL_SRCTREE in it or just write that script into a file and use it:
$ ./build-loader.sh /tmp/linux-stable
At this point the loader will have created a loader-bin file that can finally be started along with the BPF program’s ELF file (requires root privileges):
# ./loader-bin bpf_program.o
After the program is loaded and started, it will do 10 dumps, one every second, showing the packet count for each of the three considered protocols. Because the program is attached to the loopback device lo, you can run ping alongside the loader and watch the ICMP counter increase.
So run ping to generate ICMP traffic to localhost:
$ ping -c100 127.0.0.1
This starts pinging localhost 100 times and outputs something like this:
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.093 ms
64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from 127.0.0.1: icmp_seq=5 ttl=64 time=0.105 ms
64 bytes from 127.0.0.1: icmp_seq=6 ttl=64 time=0.093 ms
64 bytes from 127.0.0.1: icmp_seq=7 ttl=64 time=0.104 ms
64 bytes from 127.0.0.1: icmp_seq=8 ttl=64 time=0.142 ms
Then, in another terminal, we can finally run our BPF program:
# ./loader-bin bpf_program.o
It begins dumping out the following:
TCP 0 UDP 0 ICMP 0 packets
TCP 0 UDP 0 ICMP 4 packets
TCP 0 UDP 0 ICMP 8 packets
TCP 0 UDP 0 ICMP 12 packets
TCP 0 UDP 0 ICMP 16 packets
TCP 0 UDP 0 ICMP 20 packets
TCP 0 UDP 0 ICMP 24 packets
TCP 0 UDP 0 ICMP 28 packets
TCP 0 UDP 0 ICMP 32 packets
TCP 0 UDP 0 ICMP 36 packets
At this point, you already know a good amount of what is needed to filter packets on Linux using a socket filter eBPF program. Here’s some big news: that’s not the only way! You might want to hook into the kernel’s packet scheduling subsystem instead of attaching to sockets directly. Just read the next section to learn how.
Traffic Control is the kernel packet scheduling subsystem architecture. It is made of mechanisms and queuing systems that can decide how packets flow and how they are accepted.
Some use cases for Traffic Control include, but are not limited to, the following:
Prioritizing certain kinds of packets
Dropping specific kinds of packets
Distributing bandwidth
Given that in general Traffic Control is the way to go when you need to redistribute network resources in a system, specific Traffic Control configurations should be deployed based on the kind of applications you want to run in order to get the best out of it. Traffic Control provides a programmable classifier, called cls_bpf, to let BPF programs hook into different levels of the scheduling operations, where they can read and update the socket buffer and packet metadata to do things like traffic shaping, tracing, preprocessing, and more.
Support for eBPF in cls_bpf was implemented in kernel 4.1, which means that this kind of program has access to eBPF maps, has tail call support, can access IPv4/IPv6 tunnel metadata, and in general can use the helpers and utilities that come with eBPF.
The tooling used to interact with the networking configuration related to Traffic Control is part of the iproute2 suite, which contains ip and tc, used to manipulate network interfaces and traffic control configuration, respectively.
At this point, learning Traffic Control can be difficult without the proper reference in terms of terminology. The following section can help.
As mentioned, there are interaction points between Traffic Control and BPF programs, so you need to understand some Traffic Control concepts. If you have already mastered Traffic Control, feel free to skip this terminology section and go straight to the examples.
Queuing disciplines (qdisc) define the scheduling objects used to enqueue packets going to an interface by changing the way they are sent; those objects can be classless or classful.
The default qdisc is pfifo_fast, which is classless and enqueues packets into three FIFO (first in, first out) queues that are dequeued based on their priority; this qdisc is not used for virtual devices such as the loopback interface (lo) or Virtual Ethernet devices (veth), which use noqueue instead. Besides being a good default for its scheduling algorithm, pfifo_fast also doesn’t require any configuration to work.
Virtual interfaces can be distinguished from physical interfaces (devices) by asking the /sys pseudo filesystem:
ls -la /sys/class/net
total 0
drwxr-xr-x  2 root root 0 Feb 13 21:52 .
drwxr-xr-x 64 root root 0 Feb 13 18:38 ..
lrwxrwxrwx  1 root root 0 Feb 13 23:26 docker0 -> ../../devices/virtual/net/docker0
lrwxrwxrwx  1 root root 0 Feb 13 23:26 enp0s31f6 -> ../../devices/pci0000:00/0000:00:1f.6/net/enp0s31f6
lrwxrwxrwx  1 root root 0 Feb 13 23:26 lo -> ../../devices/virtual/net/lo
At this point, some confusion is normal. If you’ve never heard about qdiscs, one thing you can do is use the ip a command to show the list of network interfaces configured in the current system:
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s31f6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 8c:16:45:00:a7:7e brd ff:ff:ff:ff:ff:ff
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:38:54:3c:98 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:38ff:fe54:3c98/64 scope link
       valid_lft forever preferred_lft forever
This list already tells us something. Can you find the word qdisc in it? Let’s analyze the situation:
We have three network interfaces in this system: lo, enp0s31f6, and docker0.
The lo interface is a virtual interface, so it has the noqueue qdisc.
enp0s31f6 is a physical interface. Wait, why is its qdisc fq_codel (fair queue controlled delay)? Wasn’t pfifo_fast supposed to be the default? It turns out that the system we’re testing the commands on is running systemd, which sets the default qdisc differently using the kernel parameter net.core.default_qdisc.
The docker0 interface is a bridge interface, so it uses a virtual device and has the noqueue qdisc.
The noqueue qdisc doesn’t have classes, a scheduler, or a classifier. What it does is try to send packets immediately. As stated, noqueue is used by default by virtual devices, but it’s also the qdisc that becomes effective for any interface when you delete its currently associated qdisc.
fq_codel is a classless qdisc that classifies incoming packets using a stochastic model so that it can queue traffic flows in a fair way.
The situation should be clearer now; we used the ip command to find information about qdiscs, but the iproute2 toolbelt also includes a tool called tc, which has a specific subcommand for qdiscs that you can use to list them:
tc qdisc ls
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq_codel 0: dev enp0s31f6 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
qdisc noqueue 0: dev docker0 root refcnt 2
There’s much more going on here! For docker0 and lo we basically see the same information as with ip a, but for enp0s31f6, for example, we additionally see the following:
A limit of 10,240 incoming packets that it can handle
As mentioned, the stochastic model used by fq_codel queues traffic into different flows, and this output tells us how many of them we have: 1,024
Now that the key concepts of qdiscs have been introduced, we can take a closer look at classful and classless qdiscs in the next section to understand their differences and which ones are suitable for BPF programs.
Classful qdiscs allow the definition of classes for different kinds of traffic in order to apply different rules to them. Having a class for a qdisc means that it can contain further qdiscs. With this kind of hierarchy, then, we can use a filter (classifier) to classify the traffic by determining the next class where the packet should be enqueued.
Filters are used to assign packets to a particular class based on their type. They are used inside classful qdiscs to determine the class in which a packet should be enqueued, and two or more filters can map to the same class, as shown in Figure 6-2. Every filter uses a classifier to classify packets based on their information.
As mentioned earlier, cls_bpf
is the classifier that we want to use to write BPF programs for Traffic Control—we have a concrete example in the next sections on how to use it.
Classes are objects that can live only in a classful qdisc; classes are used in Traffic Control to create hierarchies. Complex hierarchies are made possible by the fact that a class can have filters attached to it, which can then be used as an entry point for another class or for a qdisc.
A classless qdisc is a qdisc that can’t have any children because it is not allowed to have any classes associated with it, which means it is not possible to attach filters and classifiers to it. For that reason classless qdiscs are not interesting from a BPF point of view, but they are still useful for simple Traffic Control needs.
After building up some knowledge about qdiscs, filters, and classes, we now show you how to write BPF programs for the cls_bpf classifier.
As we said, Traffic Control is a powerful mechanism made even more powerful by classifiers; among all the classifiers, however, there is one that allows you to program the network data path: the cls_bpf classifier. This classifier is special because it can run BPF programs. What does that mean? It means that cls_bpf allows you to hook your BPF programs directly into the ingress and egress layers, and BPF programs hooked into those layers can access the sk_buff struct for the respective packets.
To better understand this relationship between Traffic Control and BPF programs, see Figure 6-3, which shows how BPF programs are loaded into the cls_bpf classifier. You will also notice that such programs are hooked into the ingress and egress qdiscs. All the other interactions in this context are also described. Taking the network interface as the entry point for network traffic, you will see the following:
The traffic first goes to the Traffic Control’s ingress hook.
Then the kernel executes the BPF program loaded into the ingress from user-space for every incoming request.
After the ingress program is executed, control is given to the networking stack, which informs the user’s application about the networking event.
After the application gives a response, control is passed to the Traffic Control’s egress, where another BPF program executes and, upon completion, gives control back to the kernel.
A response is given to the client.
You can write BPF programs for Traffic Control in C and compile them using LLVM/Clang with the BPF backend.
Ingress and egress qdiscs allow you to hook Traffic Control into inbound (ingress) and outbound (egress) traffic, respectively.
To make this example work, you need to run it on a kernel that has been compiled with cls_bpf
directly or as a module. To verify that you have everything you need, you can do the following:
cat /proc/config.gz | zcat | grep -i BPF
Make sure you get at least the following output, with either y or m:

CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=m
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
Let’s now see how we write the classifier:
SEC("classifier")
static inline int classification(struct __sk_buff *skb)
{
  void *data_end = (void *)(long)skb->data_end;
  void *data = (void *)(long)skb->data;
  struct ethhdr *eth = data;

  __u16 h_proto;
  __u64 nh_off = 0;
  nh_off = sizeof(*eth);

  if (data + nh_off > data_end) {
    return TC_ACT_OK;
  }
The “main” of our classifier is the classification function. This function is annotated with a section header called classifier so that tc knows that this is the classifier to use.
At this point, we need to extract some information from the skb; the data member contains all the data for the current packet and all its protocol details. To let our program know what’s inside of it, we need to cast it to an Ethernet frame (in our case, with the *eth variable). To make the static verifier happy, we need to check that data, added to the size of the eth pointer, does not exceed data_end. After that, we can go one level inward and get the protocol type from the h_proto member in *eth:
  h_proto = eth->h_proto;

  if (h_proto == bpf_htons(ETH_P_IP)) {
    if (is_http(skb, nh_off) == 1) {
      trace_printk("Yes! It is HTTP!\n");
    }
  }

  return TC_ACT_OK;
}
After we have the protocol, we need to convert our IPv4 constant from host to network byte order with bpf_htons so we can check whether the protocol is equal to IPv4, the one we are interested in; if it is, we check whether the inner packet is HTTP using our own is_http function. If it is, we print a debug message stating that we found an HTTP packet:
  void *data_end = (void *)(long)skb->data_end;
  void *data = (void *)(long)skb->data;
  struct iphdr *iph = data + nh_off;

  if (iph + 1 > data_end) {
    return 0;
  }

  if (iph->protocol != IPPROTO_TCP) {
    return 0;
  }
  __u32 tcp_hlen = 0;
The is_http function is similar to our classifier function, but it starts from an skb already knowing the start offset of the IPv4 protocol data. As we did earlier, we need to do a check before accessing the IP protocol data with the *iph variable, to let the static verifier know our intentions.
When that’s done, we check whether the IPv4 header contains a TCP packet so that we can go ahead. If the packet’s protocol is of type IPPROTO_TCP, we need to do some more checks to get the actual TCP header in the *tcph variable:
  plength = ip_total_length - ip_hlen - tcp_hlen;
  if (plength >= 7) {
    unsigned long p[7];
    int i = 0;
    for (i = 0; i < 7; i++) {
      p[i] = load_byte(skb, poffset + i);
    }
    if ((p[0] == 'H') && (p[1] == 'T') && (p[2] == 'T') && (p[3] == 'P')) {
      return 1;
    }
  }

  return 0;
}
Now that the TCP header is ours, we can go ahead and load the first seven bytes from the skb struct at the offset of the TCP payload, poffset. At this point we can check whether the byte array is a sequence saying HTTP; if so, we know the Layer 7 protocol is HTTP and we return 1; otherwise, we return 0.
As you can see, our program is simple. It will basically allow everything, and when receiving an HTTP packet, it will let us know with a debugging message.
You can compile the program with Clang, using the bpf target, as we did before with the socket filter example. The difference for Traffic Control is in the loading step: the resulting ELF file, classifier.o, will be loaded by tc this time and not by our own custom loader:
clang -O2 -target bpf -c classifier.c -o classifier.o
Now we can install the program on the interface we want it to operate on; in our case, that is eth0. The first command replaces the default qdisc for the eth0 device, and the second loads our cls_bpf classifier into that ingress classful qdisc. This means that our program will handle all traffic going into that interface. If we wanted to handle outgoing traffic, we would need to use the egress qdisc instead:
# tc qdisc add dev eth0 handle 0: ingress
# tc filter add dev eth0 ingress bpf obj classifier.o flowid 0:
Our program is now loaded; what we need next is to send some HTTP traffic to that interface. To do that, you need an HTTP server listening on that interface; then you can curl the interface’s IP.
In case you don’t have one, you can get a test HTTP server using Python 3 with the http.server module, which serves a directory listing of the current working directory on port 8000:
python3 -m http.server
At this point you can call the server with curl
:
$ curl http://192.168.1.63:8000
After doing that, you should see the HTTP response from the server. You can now read your debugging messages (created with trace_printk) to confirm it, using the dedicated tc command:
# tc exec bpf dbg
The output will be something like this:
Running! Hang up with ^C!
python3-18456 [000] ..s1 283544.114997: 0: Yes! It is HTTP!
python3-18754 [002] ..s1 283566.008163: 0: Yes! It is HTTP!
Congratulations! You just made your first BPF Traffic Control classifier.
Instead of using a debugging message as we did in this example, you could use a map to communicate to user-space that the interface just received an HTTP packet. We leave this as an exercise for you. If you look at classifier.c in the previous example, you can get an idea of how to do that by looking at how we used the countmap map there.
At this point, what you might want is to unload the classifier. You can do that by deleting the ingress qdisc that you just attached to the interface:
# tc qdisc del dev eth0 ingress
You might have noticed that another object for BPF programs exists, called act_bpf. It turns out that act_bpf is an action, not a classifier. This makes it operationally different: because actions are objects attached to filters, act_bpf cannot perform filtering directly, so Traffic Control needs to consider all the packets first. For this reason, it is usually preferable to use the cls_bpf classifier instead of the act_bpf action.
However, because act_bpf
can be attached to any classifier, there might be cases for which you find it useful to just reuse a classifier you already have and attach a BPF program to it.
Even though Traffic Control cls_bpf programs and XDP programs look very similar, they are pretty different. XDP programs are executed earlier in the ingress data path, before entering the main kernel network stack, so our program does not have access to a socket buffer struct sk_buff as it does with tc. XDP programs instead take a different structure, called xdp_buff, which is an early representation of the packet, without metadata. All of this comes with advantages and disadvantages. For example, being executed even before kernel code, XDP programs can drop packets in a very efficient way. Also, compared to Traffic Control programs, XDP programs can be attached only to traffic in ingress to the system.
At this point, you might be asking yourself when it is an advantage to use one instead of the other. The answer is that, because they do not work with the kernel-enriched data structures and metadata, XDP programs are better suited for use cases covering OSI layers up to Layer 4. But let’s not spoil all the content of the next chapter!
It should now be pretty clear to you that BPF programs are useful for getting visibility and control at different levels of the networking data path. You’ve seen how to take advantage of them to filter packets using high-level tools that generate BPF assembly. Then we loaded a program onto a network socket, and in the end we attached our programs to the Traffic Control ingress qdisc to do traffic classification using BPF programs. In this chapter we also briefly discussed XDP, but be prepared, because in Chapter 7 we cover the topic in its entirety by expanding on how XDP programs are constructed, what kinds of XDP programs there are, and how to write and test them.