Popular technologies get more and more attention from hackers who want to spread malware. This affects containers too. Containers are not completely secure – even the ones we create ourselves can be hacked. Malware can be contained in images or downloaded into containers.
Out of the box, containers should keep intruders at bay. However, we can still increase the level of security. One approach is to make use of a well-known Linux feature and tell containers which system calls (syscalls) they are allowed to execute.
What are system calls?
Every Linux process receives a slice of memory when it is started. The code is then free to operate on this memory, for example to perform calculations. For everything else, it must ask the kernel for permission. Some examples:
- Write to a file (write)
- Receive network packets (read)
- Create a directory (mkdir)
- Start a new process (fork)
- Get the time of day (gettimeofday)
The code makes such a request by issuing a syscall to the kernel, which checks the permissions before carrying out the request.
There are about 330 syscalls in Linux, which you can find on the homepage of the Linux Syscall Reference. System calls are independent of the programming language: write remains write, whether the program is written in C or Go.
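To make the idea concrete, here is a minimal sketch in C that sends the same write request to the kernel twice: once through the usual libc wrapper and once by syscall number. The function names write_via_wrapper and write_via_number are made up for this illustration, and the sketch assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Write msg to stdout through the libc wrapper for the write syscall. */
ssize_t write_via_wrapper(const char *msg)
{
    return write(STDOUT_FILENO, msg, strlen(msg));
}

/* Issue the same request by syscall number, bypassing the wrapper.
 * The kernel sees the same syscall in both cases. */
ssize_t write_via_number(const char *msg)
{
    return syscall(SYS_write, STDOUT_FILENO, msg, strlen(msg));
}
```

Both functions return the number of bytes written. A syscall filter cannot tell the two apart, because it only sees the syscall number and its arguments.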
Why filter syscalls?
Roughly speaking, there are three main container attack vectors.
- Backdoors in Docker upstream images
- Exploitable application bugs
- Vulnerable system calls in the Linux kernel
Filtering syscalls can prevent programs and containers from doing anything we do not want them to. For example, if you use Nginx, you can disable system calls that you’re sure Nginx will never need, such as the ones listed in Figure 1. By doing so, you make Nginx more secure without reducing its functionality.
Which calls should we filter?
The example above looks straightforward, but how do you find out which syscalls an application uses? If you filter out too many of them, the application won’t work. But if you don’t filter enough, you leave room for attacks. It’s near impossible to create a perfect list of filters. However, there are five different approaches you can use to approximate a suitable filter list.
Read the source – all of it, including all libraries.
This is basically the only way to be really certain that you have excluded malicious code. However, given how long dependency chains tend to be, this approach is infeasible in practice.
Trial and error
Of course, you can just have a go at setting filters. But with around 330 syscalls there are 2^330 possible allow/deny combinations, so this approach isn’t practical either.
Making an educated guess
A better option is to make a targeted guess. This of course only makes sense if you have extensive knowledge of software design and syscalls. Even then, the likely result is too many or too few filters being set. Making an educated guess is better than nothing but is still a long way from a well-functioning filter.
Analysing the binaries
Theoretically, you could discern which syscalls are being used by analysing the binaries. The advantage is that binaries are always in machine language. However, different languages and compilers produce different machine code. This makes it basically impossible to determine automatically which syscalls are included and which are not.
Call tracing with strace
Syscalls executed by an application can be recorded with a tracing tool, for example during a unit test or in a CI pipeline. Linux offers strace, which is well suited to debugging and troubleshooting and also shows call parameters. Here’s an example in counting mode (-c), which prints a summary table instead of the individual calls:
> strace -c -S name ./helloworld
Hello World!
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                     8           total
Recap
- It’s difficult to write a good filter list.
- If you filter too much, the app stops working.
- Filtering too little leaves the door open for attackers.
- Start by using strace to track all required system calls.
- You can then refine your list with educated guesses.
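Turning a strace summary into a starting list can be automated. The following is a rough sketch, not a finished tool: a C helper that pulls the syscall name out of one row of strace -c output. The name syscall_name_from_row is hypothetical, and the sketch assumes that data rows start with a digit (the "% time" value), so program output that starts with a digit would be misread.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy the syscall name from one row of `strace -c` output into out.
 * Returns 1 for a data row; 0 for header, separator, total, blank or
 * other lines, and when the name does not fit into out. */
int syscall_name_from_row(const char *row, char *out, size_t out_size)
{
    /* Data rows begin, after leading blanks, with a digit. */
    const char *p = row;
    while (*p == ' ' || *p == '\t')
        p++;
    if (!isdigit((unsigned char)*p))
        return 0;

    /* The syscall name is the last whitespace-delimited token. */
    const char *end = row + strlen(row);
    while (end > p && isspace((unsigned char)end[-1]))
        end--;
    const char *start = end;
    while (start > p && !isspace((unsigned char)start[-1]))
        start--;

    size_t len = (size_t)(end - start);
    if (len == 0 || len >= out_size)
        return 0;
    if (len == 5 && strncmp(start, "total", 5) == 0)
        return 0; /* summary row, not a syscall */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Feeding every line of the summary through this helper yields the names seen during the traced run. Code paths that were not exercised during tracing will be missing, which is why the list still needs refining by hand.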
Creating a BPF filter with Seccomp
Linux has the ability to run small state machines before any syscall. These programs must be delivered as Berkeley Packet Filter (BPF) programs. Note that Seccomp accepts only the classic BPF dialect, not the newer “extended Berkeley Packet Filters” (eBPF) that are loaded via the bpf() syscall; a Seccomp filter is installed with prctl() or seccomp().
BPF and eBPF can be used for all kinds of things, e.g. performance measurement, debugging, tracing etc. However, writing BPF programs by hand is complex, and you just want to filter system calls, not learn a new programming language.
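To give an idea of that complexity, here is a minimal sketch of a syscall filter written directly as BPF instructions and installed with prctl(). It makes getpid() fail with EPERM and allows everything else. The function names install_getpid_filter and getpid_fails_in_child are made up for this example. The sketch assumes Linux with the kernel headers installed and, for brevity, omits the architecture check that a real filter needs.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter for the calling process: getpid() fails with EPERM,
 * every other syscall is allowed. Returns 0 on success, -1 on error.
 * Sketch only: a real filter must also check seccomp_data.arch. */
int install_getpid_filter(void)
{
    struct sock_filter insns[] = {
        /* Load the syscall number into the accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* getpid? Continue with the next instruction; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Unprivileged processes must set no_new_privs before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Try the filter in a child process, since a loaded filter cannot be removed.
 * Returns 1 if getpid failed with EPERM, 0 if it did not, -1 on setup errors. */
int getpid_fails_in_child(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (install_getpid_filter() != 0)
            _exit(2);
        errno = 0;
        long rc = syscall(SYS_getpid);
        _exit((rc == -1 && errno == EPERM) ? 0 : 1);
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Even this tiny filter needs hand-counted jump offsets and knowledge of the seccomp_data layout, which is the kind of detail a higher-level library takes care of.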
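Seccomp BPF, used through the libseccomp library, can help with this. Seccomp itself has been part of Linux since 2005; its BPF filter mode, contributed by Google, followed in 2012 with Linux 3.5. The library hides the complexity of BPF from the user and can filter individual system calls. It also offers other possibilities beyond filtering and can: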
- pretend that a syscall was executed although it was not
- return fake results and error numbers
- trigger breakpoints
So Seccomp is a good tool for testing, injecting errors and debugging. Listing 1 shows an example of using Seccomp in C.
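#include <seccomp.h>   /* libseccomp; link with -lseccomp */
#include <unistd.h>

int main(int argc, char *argv[])
{
    /* Default action: allow every syscall that no rule matches. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    /* Kill the process as soon as it calls getpid(). */
    seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(getpid), 0);
    seccomp_load(ctx);
    pid_t pid = getpid();
    /* never reached: process killed */
    return 0;
}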
Seccomp can filter based on the parameters, as Listing 2 shows:
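unsigned char buf[BUF_SIZE];
int fd = open("data.raw", O_RDONLY);
/* Allow read() only when all three argument checks match. */
int rc = seccomp_rule_add(
    ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 3,
    SCMP_A0(SCMP_CMP_EQ, fd),
    SCMP_A1(SCMP_CMP_EQ, (scmp_datum_t)buf),
    SCMP_A2(SCMP_CMP_LE, BUF_SIZE));
Provided the filter’s default action is not SCMP_ACT_ALLOW (libseccomp rejects a rule whose action equals the default), this blocks read() calls that don’t meet all three of the following conditions: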
- The data is read from the exact file descriptor that was previously created with open()
- The read data is stored in the designated memory area buf.
- No more data is read than will fit in buf.
Filtering by parameters can be useful for a lot of reasons, for example to:
- force read-only system calls
- limit reads and writes to STDOUT and STDIN
- limit setuid() to specific User IDs
- forbid sending signals other than SIGHUP
- prevent setting overly generous file permissions
However, parameter filtering has its limits. Only parameters passed by value can be evaluated; pointers are compared as addresses and never dereferenced. That means you can’t look into strings or structures. As an example, you can’t limit open() to certain filenames.
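The by-value limit becomes visible at the level of the raw filter interface. A filter only receives a struct seccomp_data with the syscall number, the architecture, the instruction pointer and six 64-bit argument values. The sketch below restricts write() to stdout and stderr by comparing the first argument. The names restrict_write_to_std_streams and write_filter_works_in_child are hypothetical. The sketch assumes a little-endian machine, checks only the low 32 bits of the argument and skips the architecture check a real filter needs.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Low 32 bits of the first syscall argument (little-endian assumption). */
#define ARG0_LOW offsetof(struct seccomp_data, args[0])

/* Allow write() only on fd 1 and fd 2; other write() calls fail with EBADF.
 * All other syscalls are allowed. Returns 0 on success, -1 on error. */
int restrict_write_to_std_streams(void)
{
    struct sock_filter insns[] = {
        /* 0: load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* 1: not write()? jump to ALLOW at index 5 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 0, 3),
        /* 2: load the fd argument; it is a plain value, so it is visible */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_LOW),
        /* 3: fd == stdout? jump to ALLOW at index 5 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, STDOUT_FILENO, 1, 0),
        /* 4: fd == stderr? fall through to ALLOW, else jump to index 6 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, STDERR_FILENO, 0, 1),
        /* 5 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* 6 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EBADF),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Try the filter in a child process. Returns 1 if a write to another fd
 * was refused with EBADF, 0 if it was not, -1 on setup errors. */
int write_filter_works_in_child(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        int fd = open("/dev/null", O_WRONLY);
        if (fd <= STDERR_FILENO || restrict_write_to_std_streams() != 0)
            _exit(2);
        errno = 0;
        ssize_t rc = write(fd, "x", 1);
        _exit((rc == -1 && errno == EBADF) ? 0 : 1);
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The second argument of write() is a pointer to the data. The filter could compare that address, but it has no instruction for reading the bytes behind it, which is the same reason open() cannot be limited to certain filenames.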
Applied to containers and K8s
The good news is that Seccomp support was added to Docker in v1.10. 44 syscalls are blocked by default, including reboot(). Undesired syscalls will fail, but the program isn’t killed. When writing a custom filter, it is recommended to start with the default filter and adjust it as needed. Custom filters are expressed as JSON files. What does this look like in Docker? Listing 3 has the answer to that question.
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": [
        "accept",
        "access",
        …
      ],
      "action": "SCMP_ACT_ALLOW",
      "args": [],
      "comment": "",
      "includes": {},
      "excludes": {}
    }
  ]
}
The system calls named in Listing 3 are allowed regardless of their parameters. A rule can also depend on conditions such as the kernel version, as Listing 4 shows.
{
  "names": [
    "ptrace"
  ],
  "action": "SCMP_ACT_ALLOW",
  "args": null,
  "comment": "",
  "includes": {
    "minKernel": "4.8"
  },
  "excludes": {}
}
In Listing 4 ptrace() is allowed, but only on kernels newer than or equal to linux-4.8. The filter JSON file (Docker calls it a “seccomp profile”) can be given as a command line parameter:
# docker run -ti --rm --security-opt seccomp=custom_filter.json alpine /bin/sh
Any seccomp profile given will replace the default one, not extend it. The filter will apply to the whole container.
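Because a custom profile replaces the default one, an allowlist profile has to name every syscall the application needs. As a rough sketch, a profile in the shape of Listing 3 can be generated from a list of names, for example one collected with strace. The function name format_profile is made up for this example, only the fields defaultAction, syscalls, names and action are written, and the names are assumed to need no JSON escaping.

```c
#include <stddef.h>
#include <string.h>

/* Append s to buf at offset *used, keeping buf NUL-terminated.
 * Returns 0 on success, -1 if s does not fit. */
static int append(char *buf, size_t size, size_t *used, const char *s)
{
    size_t len = strlen(s);
    if (len >= size - *used)
        return -1; /* no room for s plus the terminating NUL */
    memcpy(buf + *used, s, len + 1);
    *used += len;
    return 0;
}

/* Write a profile that allows exactly the given syscalls and lets all
 * others fail (SCMP_ACT_ERRNO). Returns the text length, or -1 if buf
 * is too small. */
int format_profile(char *buf, size_t size,
                   const char *const names[], size_t count)
{
    size_t used = 0;

    if (append(buf, size, &used,
               "{\n  \"defaultAction\": \"SCMP_ACT_ERRNO\",\n"
               "  \"syscalls\": [\n    {\n      \"names\": [") != 0)
        return -1;

    for (size_t i = 0; i < count; i++) {
        if ((i > 0 && append(buf, size, &used, ", ") != 0) ||
            append(buf, size, &used, "\"") != 0 ||
            append(buf, size, &used, names[i]) != 0 ||
            append(buf, size, &used, "\"") != 0)
            return -1;
    }

    if (append(buf, size, &used,
               "],\n      \"action\": \"SCMP_ACT_ALLOW\"\n    }\n  ]\n}\n") != 0)
        return -1;
    return (int)used;
}
```

The resulting text can be saved as a JSON file and passed to docker run. A list built only from a hello-world trace will be too short for most real containers, so starting from the default profile remains the safer route.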
Syscall filters in Kubernetes
Seccomp syscall filters were added in Kubernetes 1.3 and are supported by most runtimes, not just Docker. Seccomp profiles apply to the entire pod, not just to individual containers. To create custom profiles via Seccomp, you need to enable pod security policies in the K8s cluster, then define a pod security policy that allows seccomp profiles to be used. By creating a RoleBinding, you enable pods to use this policy. To activate pod security policies, add at least one permissive policy and create at least one matching role and a role binding for the kube-system namespace. Otherwise K8s will not be able to start any pods. Then, add PodSecurityPolicy to the list of enabled admission controllers:
kube-apiserver \
    --enable-admission-plugins=PodSecurityPolicy,LimitRanger ...
Next, provide Seccomp profiles by writing profiles and placing them on the worker nodes:
kubelet --seccomp-profile-root= (Default: /var/lib/kubelet/seccomp).
To apply a Seccomp filter to a pod, add annotations to the pod (template):
[…]
metadata:
  labels:
    app: problemsolver
  annotations:
    kubernetes.io/psp: allowseccomp
    seccomp.security.alpha.kubernetes.io/pod: localhost/custom-profile.json
[…]
You can download the example file from GitHub to get started quickly.
Summary
Creating a suitable filter is not easy. If it is too generous, it won’t defend against malware. If it’s too strict, the application may not run. In addition, the default Docker settings are already quite sophisticated.
However, there are some cases in which it makes sense to invest time in creating your own filter. If you know your application well, it can be a quick win. If you need a highly secure Docker environment (e.g. fin-tech), it can be worth the effort. If you are a container host, it’s worthwhile to define for yourself which syscalls are used and which are not.