
A Practical Guide to Capturing Production Traffic with eBPF

By Guy Arbitman
10 min read

Have you ever worked on an HTTP server and experienced unintended behavior? Then spent hours trying to debug it without any luck?

It can be tough to figure out what’s causing these issues, and a common way to analyze what’s really going on is to capture some traffic and observe it.

By now, you’ve probably tried a few tools for capturing traffic in production, such as tcpdump, which is one of the most common solutions. However, it doesn’t allow application-level filtering (aka L7 filtering), so whenever you try to capture only the relevant HTTP sessions, you end up with hundreds of megabytes (or even gigabytes) of traffic that you need to store and search through.

Another solution would be to add logic to your source code that looks for the relevant HTTP sessions, but this requires instrumenting code in production, and best practices call for non-intrusive observability.

This is precisely where eBPF comes in. It is a mechanism that lets Linux applications execute code in the Linux kernel space. Using eBPF, we can create a magical traffic capturing tool that goes far beyond the standard solutions (like Wireshark, Fiddler, and tcpdump).

eBPF allows you to add multiple filtering layers and it captures traffic directly from the kernel, significantly reducing the volume of output to only relevant data and ensuring you can handle the high throughput of your application traffic.

In this article, we’ll explore what eBPF is and how you can build an eBPF-based protocol tracer to capture production traffic without any hassle.

What exactly is eBPF?

If you've read about tcpdump, you've probably heard of BPF, the mechanism that lets tcpdump filter out irrelevant packets very quickly. But again, tcpdump works at the packet level, and an HTTP session is generally spread over multiple TCP packets.

The Berkeley Packet Filter (BPF) is a technology used in certain computer operating systems for programs that need to, among other things, analyze network traffic.

In recent years, BPF has evolved and a new technology called eBPF (extended BPF) was created. It allows us to easily add hooks to kernel syscalls and functions and observe (or even manipulate) their input parameters. Using eBPF, many companies now offer security and observability features for your servers without any instrumentation or any knowledge of your code.

There are plenty of wonderful articles about eBPF and its full capabilities, so if you are interested in learning more and gaining a deeper understanding of the technology, check out the references section for additional articles.

Now that we’ve touched on the basics, let’s get started on building an eBPF protocol tracer.

Here’s what you’ll need

  • Any Linux machine (Ubuntu, Debian, etc.)

  • BCC tool, follow the installation guide here.

  • Go version 1.16+, follow the installation guide here.

For simplicity, we have created a Docker image (Debian-based) with the dependencies above. Check out the repo for running instructions.

Building your eBPF-based traffic capturer

This walkthrough is inspired by Pixie Labs’ eBPF-based data collector, and the example code snippets are taken from the Pixie tracer public repo.

First, find the syscalls to track

The full code for this workshop can be found here. (The code snippets present only the relevant parts for simplicity; the full code is in the repo.) Let’s assume you have a REST API server, you want to capture its traffic, and you are eager to do it with eBPF but don’t know how. We are here for you!

Assume the following is your REST API server, written in Go.

TL;DR: The following is an example of an HTTP web server that exposes a single POST endpoint and responds with a randomly generated payload.

package main
...
const (
	defaultPort    = "8080"
	maxPayloadSize = 10 * 1024 * 1024 // 10 MB
	// Character set used to build the random payload (the exact set in the repo may differ).
	letterBytes = "abcdefghijklmnopqrstuvwxyz0123456789!@#$"
)
...
// customResponse holds the requested size for the response payload.
type customResponse struct {
	Size int `json:"size"`
}

func postCustomResponse(context *gin.Context) {
	var customResp customResponse
	if err := context.BindJSON(&customResp); err != nil {
		_ = context.AbortWithError(http.StatusBadRequest, err)
		return
	}
	if customResp.Size > maxPayloadSize {
		_ = context.AbortWithError(http.StatusBadRequest, fmt.Errorf("requested size %d is bigger than max allowed %d", customResp.Size, maxPayloadSize))
		return
	}
	context.JSON(http.StatusOK, map[string]string{"answer": randStringBytes(customResp.Size)})
}

func main() {
	engine := gin.New()
	engine.Use(gin.Recovery())
	engine.POST("/customResponse", postCustomResponse)
	port := os.Getenv("PORT")
	if port == "" {
		port = defaultPort
	}
	fmt.Printf("listening on 0.0.0.0:%s\n", port)
	if err := engine.Run(fmt.Sprintf("0.0.0.0:%s", port)); err != nil {
		log.Fatal(err)
	}
}

We can run it using:

go run server.go

And test it using:

curl -X POST http://localhost:8080/customResponse -d '{"size": 100}'

The first thing you need to do is to understand which syscalls are being used, so we will use the strace tool for this.

Run the server as follows:

sudo strace -f -o syscalls_dump.txt go run server.go

We use -f to also capture syscalls from the server’s child threads, and -o to write all of the output into a file.

After that, we will run the above curl command, check the syscalls_dump.txt, and observe the following:

38988 accept4(3, <unfinished ...>
38987 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
38988 <... accept4 resumed>{sa_family=AF_INET, sin_port=htons(57594), sin_addr=inet_addr("127.0.0.1")}, [112->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 7
...
38988 read(7, <unfinished ...>
38987 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
38988 <... read resumed>"POST /customResponse HTTP/1.1\r\nH"..., 4096) = 175
...
38988 write(7, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 237 <unfinished ...>
...
38989 close(7)

So, we can see that, at first, the server used the accept4 syscall to accept a new connection. We can also see that the FD (file descriptor) of the new socket is 7 (the return value of the syscall). Furthermore, for each subsequent syscall, the first argument (which is the fd) is 7, so all the operations are happening on the same socket.

Here is the flow:

  • Accept new connection using the accept4 syscall

  • Read the content from the socket using the read syscall on the socket file descriptor

  • Write the response to the socket using the write syscall on the socket file descriptor

  • And finally, close the file descriptor using the close syscall (the whole flow is sketched in code right below)
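
To make the flow concrete, here is a minimal sketch (not the article’s Gin server, just an illustration) of the same accept4 → read → write → close sequence written directly against Go’s syscall package:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// socket() + bind() + listen() set up the listening FD (3 in the strace dump above).
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fd)
	if err := syscall.Bind(fd, &syscall.SockaddrInet4{Port: 8080}); err != nil {
		panic(err)
	}
	if err := syscall.Listen(fd, 128); err != nil {
		panic(err)
	}

	// accept4() returns a new FD for the connection (7 in the strace dump above).
	connFd, _, err := syscall.Accept4(fd, syscall.SOCK_CLOEXEC)
	if err != nil {
		panic(err)
	}

	// read() the request from the connection FD...
	buf := make([]byte, 4096)
	n, _ := syscall.Read(connFd, buf)
	fmt.Printf("read %d bytes\n", n)

	// ...write() the response on the same FD...
	_, _ = syscall.Write(connFd, []byte("HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"))

	// ...and finally close() it.
	syscall.Close(connFd)
}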

Now that we understand how the server is working, it's showtime!

Planning

We will implement 8 hooks (an entry and an exit hook for each of the accept4, read, write, and close syscalls). The hooks run in the kernel and are written in C. The combination of all the hooks is needed to perform the full capturing process, so we will explain the basics of each hook; you can review the entire kernel code in our repo.

The user-mode client is written in Go. It reads the kernel code from a file and compiles the source code at runtime using clang during start-up.
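
For illustration, the start-up of such a user-mode client might look roughly like this with gobpf (the file name here is illustrative, not necessarily the repo’s layout):

// Read the eBPF kernel code from a file and let BCC compile it with clang at runtime.
source, err := os.ReadFile("sourcecode/socket_trace.c")
if err != nil {
	log.Panic(err)
}
bpfModule := bcc.NewModule(string(source), nil)
defer bpfModule.Close()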

Next, build the eBPF hooks

We will use the BCC framework, as it is the more common option today (although we do suggest looking into libbpf).

In most cases, eBPF code is composed of a kernel agent that performs the hooking and a user mode agent that handles the events being sent from the kernel. There are some other use-cases in which we have only the kernel agent (for example, a simple firewall that blocks irrelevant traffic).

To start, we need to hook the accept4 syscall. In eBPF, we can place a hook on each system call (syscall) at its entry and exit (that is, just before the actual code runs and right after it has run). Why both? The entry is useful for getting the input arguments of the syscall, and the exit tells us whether the syscall worked as expected.

TL;DR: In the following snippet, we declare structs to save the syscall’s input arguments in the entry of the accept4 syscall and use them in the exit of the syscall, where we can know if the syscall succeeded or not.

// Copyright (c) 2018 The Pixie Authors.
// Licensed under the Apache License, Version 2.0 (the "License")
// Original source: https://github.com/pixie-io/pixie/blob/main/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c

// A helper struct that holds the addr argument of the syscall.
struct accept_args_t {
    struct sockaddr_in* addr;
};

// A helper map that caches the input arguments of the accept4 syscall
// between the entry hook and the exit hook.
BPF_HASH(active_accept_args_map, uint64_t, struct accept_args_t);

// Hooking the entry of accept4.
// The signature of the syscall is:
// int accept4(int sockfd, struct sockaddr *addr, socklen_t *addrlen, int flags);
int syscall__probe_entry_accept4(struct pt_regs* ctx, int sockfd, struct sockaddr* addr, socklen_t* addrlen) {
    // Getting a unique ID for the relevant thread in the relevant pid.
    // That way we can link different calls from the same thread.
    uint64_t id = bpf_get_current_pid_tgid();

    // Keep the addr in a map, to be used in the accept4 exit hook.
    struct accept_args_t accept_args = {};
    accept_args.addr = (struct sockaddr_in *)addr;
    active_accept_args_map.update(&id, &accept_args);
    return 0;
}

// Hooking the exit of accept4.
int syscall__probe_ret_accept4(struct pt_regs* ctx) {
    uint64_t id = bpf_get_current_pid_tgid();

    // Pulling the addr from the map.
    struct accept_args_t* accept_args = active_accept_args_map.lookup(&id);
    // If the id exists in the map, we get a non-empty pointer that holds
    // the input address argument from the entry of the syscall.
    if (accept_args != NULL) {
        process_syscall_accept(ctx, id, accept_args);
    }

    // In any case, clean up the map entry in the end.
    active_accept_args_map.delete(&id);
    return 0;
}

The snippet above shows the minimal code for hooking both the entry and the exit of the syscall, and the technique of saving the input arguments at the entry so they can be used at the exit.

Why do we do this? We cannot know whether a syscall will succeed at its entry, and we cannot access the input arguments at its exit, so we need to store the arguments until we know for sure that the syscall succeeded; only then can we perform our logic.

Our logic lives in process_syscall_accept, which checks that the syscall finished successfully and then saves the connection info in a global map so we can use it in the other syscalls (read, write and close).

TL;DR: In the following snippet, we create functions used by the accept4 hooks and register any new connection made to the server in our own mapping.

// Copyright (c) 2018 The Pixie Authors.
// Licensed under the Apache License, Version 2.0 (the "License")
// Original source: https://github.com/pixie-io/pixie/blob/main/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c

// A struct representing a unique ID that is composed of the pid, the file
// descriptor and the creation time of the struct.
struct conn_id_t {
    // Process ID
    uint32_t pid;
    // The file descriptor of the opened network connection.
    int32_t fd;
    // Timestamp at the initialization of the struct.
    uint64_t tsid;
};

// This struct contains information collected when a connection is established,
// via an accept4() syscall.
struct conn_info_t {
    // Connection identifier.
    struct conn_id_t conn_id;
    // The number of bytes written/read on this connection.
    int64_t wr_bytes;
    int64_t rd_bytes;
    // A flag indicating that we identified the connection as HTTP.
    bool is_http;
};

// A struct describing the event that we send to the user mode upon a new connection.
struct socket_open_event_t {
    // The time of the event.
    uint64_t timestamp_ns;
    // A unique ID for the connection.
    struct conn_id_t conn_id;
    // The address of the client.
    struct sockaddr_in addr;
};

// A map of the active connections. The name of the map is conn_info_map,
// the key is of type uint64_t, the value is of type struct conn_info_t,
// and the map holds at most 131,072 entries.
BPF_HASH(conn_info_map, uint64_t, struct conn_info_t, 131072);

// A perf buffer that allows us to send events from kernel to user mode.
// This perf buffer is dedicated to a special type of events - open events.
BPF_PERF_OUTPUT(socket_open_events);

// A helper function that checks if the syscall finished successfully, and if it did,
// saves the new connection in a dedicated map of connections.
static __inline void process_syscall_accept(struct pt_regs* ctx, uint64_t id, const struct accept_args_t* args) {
    // Extracting the return code and checking if it represents a failure;
    // if it does, we abort, as we have nothing to do.
    int ret_fd = PT_REGS_RC(ctx);
    if (ret_fd <= 0) {
        return;
    }

    struct conn_info_t conn_info = {};
    uint32_t pid = id >> 32;
    conn_info.conn_id.pid = pid;
    conn_info.conn_id.fd = ret_fd;
    conn_info.conn_id.tsid = bpf_ktime_get_ns();

    uint64_t pid_fd = ((uint64_t)pid << 32) | (uint32_t)ret_fd;
    // Saving the connection info in a global map, so in the other syscalls
    // (read, write and close) we will know that we have already seen
    // the connection.
    conn_info_map.update(&pid_fd, &conn_info);

    // Sending an open event to the user mode, to let the user mode know that we
    // have identified a new connection.
    struct socket_open_event_t open_event = {};
    open_event.timestamp_ns = bpf_ktime_get_ns();
    open_event.conn_id = conn_info.conn_id;
    bpf_probe_read(&open_event.addr, sizeof(open_event.addr), args->addr);
    socket_open_events.perf_submit(ctx, &open_event, sizeof(struct socket_open_event_t));
}

Up until now, we have only been able to identify new connections and to alert the user-mode about them.

Next, we will hook the read syscall.

TL;DR: In the following snippet, we create the hooks for the read syscall.

// Copyright (c) 2018 The Pixie Authors.
// Licensed under the Apache License, Version 2.0 (the "License")
// Original source: https://github.com/pixie-io/pixie/blob/main/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c

// A helper struct to cache the input arguments of read/write syscalls between the
// entry hook and the exit hook.
struct data_args_t {
    int32_t fd;
    const char* buf;
};

// Helper map to store read syscall arguments between the entry and exit hooks.
BPF_HASH(active_read_args_map, uint64_t, struct data_args_t);

// Original signature: ssize_t read(int fd, void *buf, size_t count);
int syscall__probe_entry_read(struct pt_regs* ctx, int fd, char* buf, size_t count) {
    uint64_t id = bpf_get_current_pid_tgid();

    // Stash arguments.
    struct data_args_t read_args = {};
    read_args.fd = fd;
    read_args.buf = buf;
    active_read_args_map.update(&id, &read_args);
    return 0;
}

int syscall__probe_ret_read(struct pt_regs* ctx) {
    uint64_t id = bpf_get_current_pid_tgid();

    // The return code of the syscall is the number of bytes read.
    ssize_t bytes_count = PT_REGS_RC(ctx);
    struct data_args_t* read_args = active_read_args_map.lookup(&id);
    if (read_args != NULL) {
        // kIngress is an enum value that lets the process_data function
        // know whether the buffer is incoming or outgoing.
        process_data(ctx, id, kIngress, read_args, bytes_count);
    }

    active_read_args_map.delete(&id);
    return 0;
}

You should notice a strong similarity between the first hooks we wrote (for accept4) and these new ones. The structure is identical; the differences are mainly the names of the maps and of the helper function.

Let’s dive into the helper function process_data, which handles both the read and write syscalls.

TL;DR: In the following snippet, we create the helper functions to process the read and write syscalls.

// Copyright (c) 2018 The Pixie Authors.
// Licensed under the Apache License, Version 2.0 (the "License")
// Original source: https://github.com/pixie-io/pixie/blob/main/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c

// Data buffer message size. BPF can submit at most this amount of data to a perf buffer.
// Kernel size limit is 32KiB. See https://github.com/iovisor/bcc/issues/2519 for more details.
#define MAX_MSG_SIZE 30720 // 30KiB

struct socket_data_event_t {
    // We split attributes into a separate struct, because BPF gets upset if you do lots of
    // size arithmetic. This makes it so that it's attributes followed by message.
    struct attr_t {
        // The timestamp when the syscall completed (i.e. when the return probe was triggered).
        uint64_t timestamp_ns;
        // Connection identifier (PID, FD, etc.).
        struct conn_id_t conn_id;
        // The type of the actual data that the msg field encodes, which is used by the caller
        // to determine how to interpret the data.
        enum traffic_direction_t direction;
        // The size of the original message. We use this to truncate the msg field to minimize the amount
        // of data being transferred.
        uint32_t msg_size;
        // A 0-based position number for this event on the connection, in terms of byte position.
        // The position is for the first byte of this message.
        uint64_t pos;
    } attr;
    char msg[MAX_MSG_SIZE];
};

// Perf buffer used to send the data events to the user mode.
BPF_PERF_OUTPUT(socket_data_events);
...
// A helper function that handles the read/write syscalls.
static inline __attribute__((__always_inline__)) void process_data(struct pt_regs* ctx, uint64_t id,
                                                                    enum traffic_direction_t direction,
                                                                    const struct data_args_t* args, ssize_t bytes_count) {
    // Always check pointers before accessing them.
    if (args->buf == NULL) {
        return;
    }

    // For the read and write syscalls, the return code is the number of bytes written or read, so zero means nothing
    // was written or read, and negative means the syscall failed. Either way, we have nothing to do with this syscall.
    if (bytes_count <= 0) {
        return;
    }

    uint32_t pid = id >> 32;
    uint64_t pid_fd = ((uint64_t)pid << 32) | (uint32_t)args->fd;
    struct conn_info_t* conn_info = conn_info_map.lookup(&pid_fd);
    if (conn_info == NULL) {
        // The FD being read/written does not represent an IPv4 socket FD.
        return;
    }

    // Check if the connection was already identified as HTTP; for a new connection,
    // check the protocol and return true if it is HTTP.
    if (is_http_connection(conn_info, args->buf, bytes_count)) {
        // Allocate a new event.
        uint32_t kZero = 0;
        struct socket_data_event_t* event = socket_data_event_buffer_heap.lookup(&kZero);
        if (event == NULL) {
            return;
        }

        // Fill the metadata of the data event.
        event->attr.timestamp_ns = bpf_ktime_get_ns();
        event->attr.direction = direction;
        event->attr.conn_id = conn_info->conn_id;

        // Another helper function that splits the given buffer into chunks if it is too large.
        perf_submit_wrapper(ctx, direction, args->buf, bytes_count, conn_info, event);
    }

    // Update the conn_info total written/read bytes.
    switch (direction) {
        case kEgress:
            conn_info->wr_bytes += bytes_count;
            break;
        case kIngress:
            conn_info->rd_bytes += bytes_count;
            break;
    }
}

So our helper function checks whether the read (or write) syscall finished successfully (by checking the number of bytes read or written), and then checks whether the data being read (or written) looks like HTTP. If it does, we send it to the user mode as an event.
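
For intuition, the kernel-side is_http_connection check is essentially a prefix heuristic. A rough user-space rendering of that kind of check, written here in Go only for readability (the real check is the C code in the repo and may differ), would look like this:

import "bytes"

// looksLikeHTTP reports whether a buffer starts like an HTTP request or response.
// This is only an illustration of the heuristic, not the repo's implementation.
func looksLikeHTTP(buf []byte) bool {
	prefixes := [][]byte{
		[]byte("HTTP/"), []byte("GET "), []byte("POST "),
		[]byte("PUT "), []byte("DELETE "), []byte("HEAD "),
	}
	for _, p := range prefixes {
		if bytes.HasPrefix(buf, p) {
			return true
		}
	}
	return false
}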

Next come the write syscall hooks, which are very similar to the read syscall hooks (you can see the code in the repo).

We just need to handle the close syscall and we are done. Here, too, the hooks are very similar to the others.

Finally, the code can handle a close event.

TL;DR: In the following snippet, we create the helper functions to process the close syscall.

// Copyright (c) 2018 The Pixie Authors.
// Licensed under the Apache License, Version 2.0 (the "License")
// Original source: https://github.com/pixie-io/pixie/blob/main/src/stirling/source_connectors/socket_tracer/bcc_bpf/socket_trace.c

// Struct describing the close event being sent to the user mode.
struct socket_close_event_t {
    // Timestamp of the close syscall.
    uint64_t timestamp_ns;
    // The unique ID of the connection.
    struct conn_id_t conn_id;
    // Total number of bytes written on that connection.
    int64_t wr_bytes;
    // Total number of bytes read on that connection.
    int64_t rd_bytes;
};

// Perf buffer used to send the close events to the user mode.
BPF_PERF_OUTPUT(socket_close_events);

static inline __attribute__((__always_inline__)) void process_syscall_close(struct pt_regs* ctx, uint64_t id,
                                                                             const struct close_args_t* close_args) {
    int ret_val = PT_REGS_RC(ctx);
    // The syscall failed, nothing to do.
    if (ret_val < 0) {
        return;
    }

    uint32_t pid = id >> 32;
    uint64_t pid_fd = ((uint64_t)pid << 32) | (uint32_t)close_args->fd;
    struct conn_info_t* conn_info = conn_info_map.lookup(&pid_fd);
    if (conn_info == NULL) {
        // The FD being closed does not represent an IPv4 socket FD.
        return;
    }

    // Send the user mode an event indicating the connection was closed.
    struct socket_close_event_t close_event = {};
    close_event.timestamp_ns = bpf_ktime_get_ns();
    close_event.conn_id = conn_info->conn_id;
    close_event.rd_bytes = conn_info->rd_bytes;
    close_event.wr_bytes = conn_info->wr_bytes;
    socket_close_events.perf_submit(ctx, &close_event, sizeof(struct socket_close_event_t));

    // Remove the connection from the mapping.
    conn_info_map.delete(&pid_fd);
}

We are officially done with the kernel code!

The user mode is written in Go using the gobpf library.

I’ll just describe the main aspects, which you can find in our example repo.

The first step will be to compile the code:

bpfModule := bcc.NewModule(string(bpfSourceCodeContent), nil)
defer bpfModule.Close()

Then, we create a connection factory responsible for holding all connection instances, printing ready connections, and deleting inactive or malformed ones.

// Create the connection factory and set 1m as the inactivity threshold,
// meaning connections that didn't get any event within the last minute are closed.
connectionFactory := connections.NewFactory(time.Minute)
// A goroutine that runs every 10 seconds, prints ready connections,
// and deletes inactive or malformed connections.
go func() {
	for {
		connectionFactory.HandleReadyConnections()
		time.Sleep(10 * time.Second)
	}
}()

Load the perf buffer handlers:

if err := bpfwrapper.LaunchPerfBufferConsumers(bpfModule, connectionFactory); err != nil {
	log.Panic(err)
}
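
Under the hood, each consumer is wired to one of the perf buffers we declared in the kernel code. A sketch of what such a wrapper might do for the close-events buffer using gobpf (the repo’s bpfwrapper helper is more generic):

// Illustrative only; bpfwrapper.LaunchPerfBufferConsumers wraps this pattern
// for every perf buffer (open, data and close events).
closeEventsChannel := make(chan []byte, 1000)
table := bcc.NewTable(bpfModule.TableId("socket_close_events"), bpfModule)
perfMap, err := bcc.InitPerfMap(table, closeEventsChannel, nil)
if err != nil {
	log.Panic(err)
}
// The handler (shown below) consumes events from the channel.
go socketCloseEventCallback(closeEventsChannel, connectionFactory)
perfMap.Start()
defer perfMap.Stop()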

Here’s a short explanation of a single user-mode perf buffer handler, as an example.

Each handler receives events over a channel (inputChan), and each event is a byte slice ([]byte).

For each event, we convert it to a Go representation of the C struct.

// ConnID is a conversion of the following C struct into Go.
// struct conn_id_t {
//   uint32_t pid;
//   int32_t fd;
//   uint64_t tsid;
// };
type ConnID struct {
	PID  uint32
	FD   int32
	TsID uint64
}
...

We fix the timestamp of the event, since the kernel reports a monotonic clock rather than a real-time clock, and finally we update the connection object’s fields with the new event.

func socketCloseEventCallback(inputChan chan []byte, connectionFactory *connections.Factory) {
	for data := range inputChan {
		if data == nil {
			return
		}
		var event structs.SocketCloseEvent
		if err := binary.Read(bytes.NewReader(data), bpf.GetHostByteOrder(), &event); err != nil {
			log.Printf("Failed to decode received data: %+v", err)
			continue
		}
		event.TimestampNano += settings.GetRealTimeOffset()
		connectionFactory.GetOrCreate(event.ConnID).AddCloseEvent(event)
	}
}
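
As a side note on the timestamp fix: bpf_ktime_get_ns() is based on CLOCK_MONOTONIC, so one plausible way to compute such an offset looks like this (the repo’s settings.GetRealTimeOffset may be implemented differently):

import "golang.org/x/sys/unix"

// getRealTimeOffset is a hypothetical helper that returns the difference between
// the real-time and monotonic clocks; adding it to a kernel timestamp converts it
// to wall-clock time.
func getRealTimeOffset() uint64 {
	var mono, real unix.Timespec
	_ = unix.ClockGettime(unix.CLOCK_MONOTONIC, &mono)
	_ = unix.ClockGettime(unix.CLOCK_REALTIME, &real)
	return uint64(real.Nano() - mono.Nano())
}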

For the last part, attach the hooks.

if err := bpfwrapper.AttachKprobes(bpfModule); err != nil {
	log.Panic(err)
}
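
For reference, attaching a single entry/exit pair with gobpf looks roughly like this (a sketch; the repo’s bpfwrapper.AttachKprobes does this for the whole list of hooks):

// Illustrative only: load the entry probe and attach it to the accept4 syscall.
entryProbe, err := bpfModule.LoadKprobe("syscall__probe_entry_accept4")
if err != nil {
	log.Panic(err)
}
// GetSyscallFnName resolves the arch-specific symbol (e.g. __x64_sys_accept4).
if err := bpfModule.AttachKprobe(bcc.GetSyscallFnName("accept4"), entryProbe, -1); err != nil {
	log.Panic(err)
}
// Same for the exit (return) probe, attached as a kretprobe.
retProbe, err := bpfModule.LoadKprobe("syscall__probe_ret_accept4")
if err != nil {
	log.Panic(err)
}
if err := bpfModule.AttachKretprobe(bcc.GetSyscallFnName("accept4"), retProbe, -1); err != nil {
	log.Panic(err)
}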

Finally, the demonstration

Send the client curl request:

[Screenshot: the client curl request]

The sniffer:

[Screenshot: the sniffer output]

Summary

So we have gone through the process of creating an eBPF-based protocol tracer from zero to a fully working sniffer (although a very limited one, and not flawless 😀).

As you can see, understanding and implementing the hooks for the first syscall was the hardest part, but once you get that, adding more syscalls is much easier!

Takeaways

Here at Seekret, we specialize in API-first practices by bringing production knowledge to local API development.

We are able to:

  • Observe all of your APIs by deploying our magical eBPF solution on your production servers

  • Create an API inventory with history and revisions, listed differences between releases, and even diffs between production and dev inventories

  • Find user flows over your different APIs

  • Integrate into your CI/CD pipeline and notify your developers before they make a breaking change to an API (as well as how that change will impact dependencies)

All of the magic above is due to our traffic capturing solution, which utilizes eBPF as a fast and slim tool that captures application protocols (HTTP, Kafka, etc.).

Learn more about eBPF and join our eBPF IL Slack community!

References