
How We Test Our eBPF Traffic Capturing Tool at Seekret

By Guy Arbitman
8 min read
eBPF, Testing

Creating a production-ready eBPF tool is not an easy task. It requires a clear understanding of the use case, the capabilities and limitations of the technology, and fine craftsmanship to put it all together. No production-ready tool can live without proper testing, so how can we test an eBPF tool?

Bugs? That’s not possible!


Tests. Such a terrifying word to say to a developer. After all, every piece of code we add is no less than perfection. But in rare cases, once in a lifetime, accidentally, the keyboard itself starts creating these annoying bugs, and who takes the blame? We do. What happens next? We'll probably be asked to add tests to the code.

Then we discover an enormous world with plenty of paradigms for implementing tests. We have unit tests (UTs) to check a small function in the code and integration tests to check multiple components working together. Then there are end-to-end (e2e) tests to cover a full flow of the system with all (or most) of the components.

We can also run tests with mocks to mimic the behavior of components and reduce the complexity of the tests. Tests can run manually or as part of the CI/CD pipeline. We split tests into different suites and run each suite when it fits best: PR suites, nightly, pre-release, and so on.

And what about frameworks? Each language has at least 2 to 3 leading frameworks to help us write tests easily.

Each project has its own testing methodology, which is influenced by the tech stack (the language we use, the components we develop), the project leaders, the latest innovation in the testing field, the complexity of testing components, and so on.

A real mess, if you ask me. But you should be grateful! Those are rich-people problems. For most projects, you can always take inspiration from other projects or articles. But there are some domains, including eBPF, that are harder to test because there are no established industry leaders, and everyone needs to reinvent the wheel.

Hopefully, this article will give you some inspiration for testing your own eBPF program.

No man's land


For eBPF projects, there are no testing frameworks at the moment, and testing such a project involves enormous complexity. When building an eBPF program, you probably intend to run the code in the kernel. Today there are plenty of Linux distributions, each distribution ships multiple kernel versions, and every kernel version might contain a change that breaks your eBPF program.

Thus we encounter our first problem - the diversity of distributions, each with a range of kernels. It's not feasible to cover 100% of that matrix, so we have to be realistic and select a small matrix to test on.

The second problem derives from the first. Suppose we have a small and maintainable matrix; we still need to run the tests on an actual kernel. So our testing framework requires a full virtualization solution, for example multiple images on VirtualBox or multiple machines in a cloud environment. But that is hard to scale, heavy to pack and deliver, or expensive to use.

Last but not least, an eBPF program requires triggering events in the kernel (i.e., function calls). Implementing a generic framework for that is possible, but probably complex, as some events cannot be triggered directly from user space (for example, the event raised when a packet arrives at the network card [XDP]).

So what can we do?

Creating a hybrid testing solution

Although we wish to have 100% coverage, that’s not possible, so we do our best. Like many other fields, we can blend different paradigms into a hybrid solution.

Our project

Before I dive into the solution, let’s understand our specific scenario and use case.


At Seekret, we implemented an eBPF-based traffic capturing tool. The tool hooks several system calls in the kernel; those hooks give us visibility into every traffic payload being sent or received, and eventually we are able to assemble the different pieces into complete protocol traffic (HTTP, gRPC, Mongo, Postgres, Kafka, AMQP, etc.).

Blend-in

Our methodology is a combination of different testing paradigms. That way, we are able to cover most areas and feel more comfortable when releasing a new version to our customers. We leverage each paradigm's advantages to get the best testing mix we can.

We blend:

  1. Unit Tests

  2. End-To-End Tests

  3. Real Environment

  4. Performance Testing

Unit tests (UTs)


UTs are the testing paradigm closest to the code we write. We are able to test a unit of the code (a single function or a few functions together) and cover all use cases for that unit, resulting in high confidence in the unit's behavior.

But how can we use UTs when the functions are handling eBPF functionality?

For example, if we have a function that handles a perf event being sent from the kernel as a response to a kernel event, how can we test it?

The easiest solution would be - don’t test it via UTs.

Or…. let’s try and make it happen.

We have two options:

  1. Use mocks

  2. Trigger such an event

For mocks, we can make the function more abstract so it receives events from a channel (in Golang) or from an FD to poll on (in C). Then, in the test setup, we inject dummy events to test the function's behavior.

func TestFunc(t *testing.T) {
    // Set up a mock channel with predefined mock events.
    eventsChannel := setupMockChannelWithEvents()
    // The method that runs our eBPF event handling logic on events raised from the kernel.
    testKernelHookHandler(eventsChannel)
}
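To make this a bit more concrete, here is one possible sketch of the mock setup. The trafficEvent struct and the dummy payloads are assumptions for illustration only, not our actual event format:

// trafficEvent is a hypothetical, simplified stand-in for the struct our eBPF
// program would normally send over the perf buffer.
type trafficEvent struct {
    Syscall string
    Payload []byte
}

// setupMockChannelWithEvents returns a channel pre-filled with dummy events,
// so the handler under test never needs a running kernel probe.
func setupMockChannelWithEvents() <-chan trafficEvent {
    events := make(chan trafficEvent, 2)
    events <- trafficEvent{Syscall: "write", Payload: []byte("GET / HTTP/1.1\r\n")}
    events <- trafficEvent{Syscall: "close"}
    close(events)
    return events
}

The handler under test can then simply range over the channel, exactly as it would over events coming from the real perf buffer.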
For some scenarios, we can trigger a real event. We might need to add more capabilities to our eBPF program (for example, filtering in only events that were triggered by process X), but then we can simply trigger the event from code. For instance, assume our eBPF program sends a perf event for every "close" syscall:
func TestFunc(t *testing.T) {
    // The method that runs our eBPF program and catches kernel events.
    go catchEventAndAssert()
    fd, err := syscall.Open("<file>", syscall.O_RDWR, 0)
    require.NoError(t, err)
    syscall.Close(fd)
    // Wait up to 10 seconds for a status from the goroutine.
    waitForStatus(time.Second * 10)
}
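For completeness, here is a rough sketch of how the goroutine-plus-timeout pattern above could be wired together with a channel. pollNextEvent is a hypothetical wrapper around whatever reads the perf buffer in your project, and imports are omitted for brevity:

func TestCloseSyscallEvent(t *testing.T) {
    status := make(chan error, 1)

    // Hypothetical: read the next event produced by our eBPF program and
    // verify it describes the "close" syscall we are about to trigger.
    go func() {
        event, err := pollNextEvent() // hypothetical wrapper around the perf-buffer reader
        if err != nil {
            status <- err
            return
        }
        if event.Syscall != "close" {
            status <- fmt.Errorf("unexpected event: %+v", event)
            return
        }
        status <- nil
    }()

    // Trigger the syscall our eBPF program hooks.
    f, err := os.CreateTemp("", "ebpf-test")
    require.NoError(t, err)
    defer os.Remove(f.Name())
    require.NoError(t, syscall.Close(int(f.Fd())))

    // Fail the test if no status arrives within 10 seconds.
    select {
    case err := <-status:
        require.NoError(t, err)
    case <-time.After(10 * time.Second):
        t.Fatal("timed out waiting for the kernel event")
    }
}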

End-to-end

Here we are zooming out, and instead of testing a function, we test the entire solution from A to Z.

The first step is to define the testing scenarios. They can be endless, so we have to ask ourselves the following questions:

  1. Is it a flow of the product?

  2. Was it a bug in a customer’s env?

  3. Is it hard to test via UT?

For each test, we set up the external dependencies it needs (Kafka/Mongo/HTTP/gRPC/TLS servers), and then we run our eBPF traffic capturing tool and use the appropriate client to perform a flow.

We let the eBPF program run fully and check whether the outcome matches our expectations, which in our case means checking whether we captured the appropriate traffic.

def test_mongo():
    run_capturing_tool()
    run_mongo_server()
    run_mongo_client()
    assert_captured_traffic()
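As an illustration of the same pattern in Go, here is a sketch using a plain HTTP flow instead of Mongo. runCapturingTool and assertCapturedHTTPTraffic are hypothetical helpers, not our actual API, and imports are omitted for brevity:

func TestHTTPFlowEndToEnd(t *testing.T) {
    // Hypothetical: start the eBPF capturing tool and get back a function that stops it.
    stop := runCapturingTool(t)
    defer stop()

    // The external dependency for this flow: a local HTTP server.
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        _, _ = w.Write([]byte("hello"))
    }))
    defer server.Close()

    // The client performing the flow we expect the tool to capture.
    resp, err := http.Get(server.URL)
    require.NoError(t, err)
    require.NoError(t, resp.Body.Close())

    // Hypothetical assertion: the tool's output should contain this request/response pair.
    assertCapturedHTTPTraffic(t, server.URL)
}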

Real environment

We are closer to releasing a version, and now we can run a limited E2E test suite on multiple real environments.

We have multiple dedicated cloud environments (K8S, ECS, docker-compose on machines) with different Linux distributions and kernel versions, and we run the limited suite on all of them to verify there is no supported distribution we fail to run on.

We choose the distributions carefully and make sure we run on every Linux distribution we have seen at our customers' sites!

Performance and stability testing

The last piece of the puzzle. Our traffic capturing tool runs in our customer's production environment, and we must ensure any new version won’t affect the customer's performance. Thus, we have a dedicated methodology for performance testing.

During that suite, we make sure our CPU and memory consumption under heavy load stays reasonable and never exceeds predefined limits.

We also run our product for long periods to ensure it is stable and won't crash.

Finding performance or stability issues at that point is expensive, but less expensive than finding it in a customer’s environment.

When we find a performance issue, we go back to the code and use benchmarking tools (like Go's built-in benchmark support) to find the bottlenecks affecting our performance.
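As a small illustration, a Go benchmark for a hot path such as event parsing could look like the sketch below; parseEvent and buildSamplePayload are hypothetical names standing in for your own parsing code and test fixture:

func BenchmarkParseEvent(b *testing.B) {
    payload := buildSamplePayload() // hypothetical fixture: a raw kernel event as bytes
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := parseEvent(payload); err != nil {
            b.Fatal(err)
        }
    }
}

Running it with `go test -bench=ParseEvent -benchmem` reports per-operation time and allocation counts, which is usually enough to spot the bottleneck.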

Tips and tricks


Today, I’m smarter than I was yesterday

Our philosophy is not the absolute truth, and even if it fits our needs today, we must keep asking ourselves: is it still relevant? Are we out of date? What was missing from it that let the latest issues slip through? Is our SDLC fast enough? Should we update the distribution matrix? Can we automate more processes?

libbpf + CO-RE

We have found it very useful to use libbpf + CO-RE in our eBPF product. It has also been great for enhancing our testing solution. With libbpf + CO-RE we have fewer portability issues, so adding more distributions and kernels to our testing matrix is easy peasy.

Check out our earlier blog to better understand the portability issues we've faced in the past.

Automation, automation, automation

We run most of our testing suites (PR, nightly, pre-release) using GitHub Actions. The suites contain E2E tests, which means we deploy our eBPF product on GitHub Actions machines and simulate a real installation of our product.

Since the E2E tests run automatically on every push to GitHub, they help us find problems closer to development time and reduce the time spent fixing issues.

For the grand finale -

  1. We have several GitHub Actions machines with different distributions and kernel versions

  2. We have UTs & E2E tests running automatically upon every push

The combination of these two aspects allows us to “overcome” the pitfalls I mentioned in the second section.

For more tips & tricks on testing eBPF tools, or to network with other eBPF enthusiasts, check out our eBPF IL Community Slack channel!
