Elucidating Containers and Container Runtimes!
Not a day goes by without hearing the word "container" :)
When Docker released its first version back in 2013, it triggered a major shift in the way the software industry works. "Lightweight VMs" suddenly caught the attention of the world and opened up seemingly unlimited possibilities. But what does "lightweight VM" really mean, how exactly do containers work, and why are they so important?
What is a container?
Containers are not first-class objects in the Linux kernel. Containers are fundamentally composed of several underlying kernel primitives:
namespaces (who you are allowed to talk to),
cgroups (the amount of resources you are allowed to use),
and LSMs (Linux Security Modules — what you are allowed to do).
Together, these kernel primitives allow us to set up secure, isolated, and metered execution environments for our processes. This is great, but doing all of this manually each time we want to create a new isolated process would be tiresome.
Instead of unshare-ing, cgcreate-ing, and semodule-ing custom namespaces, cgroups, and SELinux policies every time we want to create a new isolated process, these components have been bundled together in a concept called a "container". Tools we call "container runtimes" make it easy to compose these pieces into an isolated, secure execution environment that we can deploy in a repeatable manner.
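To make that concrete, here is a minimal, Linux-only Go sketch of the kind of thing a runtime does under the hood: it launches a shell inside new UTS, PID, and mount namespaces. It needs root privileges, and the command and clone flags are just illustrative; real runtimes add user namespaces, cgroups, pivot_root, and security policies on top.

```go
// A minimal, Linux-only sketch of "unshare-ing" in code: re-run a command
// inside new UTS, PID, and mount namespaces via clone flags.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run /bin/sh, but ask the kernel to clone it into fresh namespaces.
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// New hostname domain, new PID numbering, new mount table.
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Running it as root drops you into a shell whose hostname changes and mount operations no longer leak to the host; that isolation is exactly what a container runtime automates for us.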
What is a container runtime?
A container runtime is software that runs and manages the components required to run containers.
The container runtime is responsible for:
- Consuming the container mount point provided by the Container Engine (can also be a plain directory for testing)
- Consuming the container metadata provided by the Container Engine (this can also be a manually crafted config.json for testing; a minimal example follows this list)
- Communicating with the kernel to start containerized processes (clone system call)
- Setting up cgroups
- Setting up SELinux Policy
- Setting up AppArmor rules
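As a sketch of what that "manually crafted config.json" might look like, here is a small Go program that generates one using the Go bindings published in the OCI runtime-spec repository. The rootfs path, command, and namespace list are purely illustrative, and a real bundle also needs a populated rootfs directory next to the file.

```go
// Generate a minimal OCI config.json: together with a rootfs directory, this is
// all the metadata a low-level runtime needs to start a container.
package main

import (
	"encoding/json"
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	spec := specs.Spec{
		Version: specs.Version, // runtime-spec version this config targets
		Root:    &specs.Root{Path: "rootfs", Readonly: true},
		Process: &specs.Process{
			Terminal: true,
			Args:     []string{"/bin/sh"},
			Env:      []string{"PATH=/usr/bin:/bin"},
			Cwd:      "/",
		},
		Hostname: "sandbox",
		Linux: &specs.Linux{
			Namespaces: []specs.LinuxNamespace{
				{Type: specs.PIDNamespace},
				{Type: specs.MountNamespace},
				{Type: specs.UTSNamespace},
			},
		},
	}

	f, err := os.Create("config.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	enc := json.NewEncoder(f)
	enc.SetIndent("", "  ")
	if err := enc.Encode(&spec); err != nil {
		panic(err)
	}
}
```

Placed next to a populated rootfs directory, the generated config.json forms an OCI bundle that a low-level runtime such as runC can be pointed at directly.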
History of Container Runtimes
Cgroups were added to the Linux kernel back in 2008, and after that a race started! Several projects emerged that took advantage of them to build containerized processes:
Cgroups (control groups) are a feature of the Linux kernel that meters and limits the resource usage of user processes. Processes can be put into control groups, essentially collections of processes that share the same resource limits. A system can have many cgroups, each with its limits enforced by the kernel.
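For a feel of what "setting up cgroups" means in practice, here is a rough Go sketch for a cgroup v2 system: create a group, write a resource limit, and move a process into it. The group name and the 64 MiB limit are illustrative; it needs root privileges and assumes the memory controller is enabled for the parent group.

```go
// Create a cgroup v2 control group, cap its memory, and join it.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	cg := "/sys/fs/cgroup/demo" // new control group (hypothetical name)

	if err := os.MkdirAll(cg, 0755); err != nil {
		panic(err)
	}
	// Cap the memory usage of every process in the group at 64 MiB.
	if err := os.WriteFile(filepath.Join(cg, "memory.max"), []byte("67108864"), 0644); err != nil {
		panic(err)
	}
	// Move the current process into the group; the kernel now enforces the limit.
	pid := []byte(fmt.Sprintf("%d", os.Getpid()))
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), pid, 0644); err != nil {
		panic(err)
	}
}
```

A low-level runtime does essentially this, across more controllers (CPU, I/O, pids), on every container start.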
Linux cgroups paved the way for a technology called Linux Containers (LXC). LXC was really the first major implementation of what we know today as a container, taking advantage of cgroups and namespace isolation to create a virtual environment with separate process and networking spaces.
Systemd also gained similar container support: systemd-nspawn could run namespaced processes, and systemd itself could control cgroups. Neither LXC nor systemd-nspawn really caught on with end users, but they did see some use in other systems. For example, Canonical's Juju and Docker (briefly) were notable tools built on top of LXC.
Docker (at the time, dotCloud) began building tooling around LXC to make containers more developer- and user-friendly. Before long, Docker dropped LXC, created the Open Container Initiative (OCI) to establish container standards (more on this later), and open sourced some of its container components as the libcontainer project.
Google also open sourced a version of their internal container stack, LMCTFY, but abandoned it as Docker gained popularity. Most of the functionality was gradually replicated in Docker’s libcontainer by the LMCTFY developers.
CoreOS, after initially exclusively using Docker in their Container Linux product, created an alternative to Docker called rkt. rkt had features ahead of its time that differentiated it from Docker and the other early runtimes. Notably, it did not need to run everything as root, was daemonless and CLI driven, and had amenities like cryptographic verification and full Docker image compatibility.
Container Runtime Comparison
Open Container Initiative (OCI) Runtimes
Native Runtimes
- runC
- Railcar (deprecated)
- crun
- rkt (deprecated)
runC is the result of all of Docker's work on libcontainer and the OCI. It is the de facto standard low-level container runtime, written in Go and maintained under Docker's open source Moby project.
Railcar was an OCI runtime implementation created by Oracle and written in Rust. It has since been abandoned.
crun is a Red Hat-led OCI implementation that is part of the broader containers project and a sibling to libpod. It is written in C, is performant and lightweight, and was one of the first runtimes to support cgroups v2.
rkt is not an OCI Runtime implementation, but it is a similar low-level container runtime. It supports running Docker and OCI images in addition to appc bundles, but is not interoperable with higher level components which use OCI Runtimes.
As per this Open Source Summit presentation, the performance of the low-level runtime only matters during container creation and deletion; once the process is running, the container runtime is out of the picture.
Sandboxed Runtimes
- gvisor
- nabla-containers
gVisor and Nabla are sandboxed runtimes, which provide further isolation of the host from the containerized process. Instead of sharing the host kernel, the containerized process runs on a unikernel or kernel proxy layer, which then interacts with the host kernel on the container's behalf. Because of this increased isolation, these runtimes have a reduced attack surface and make it less likely that a containerized process can have a harmful effect on the host.
Wait, wait!
What is a unikernel now??
Unikernels have been addressing this problem since the 1990s. The concept is straightforward: take just what you need out of both user space and kernel space, and bake it into a highly customized OS that supports only the needs of your application.
Virtualized Runtimes
- Clear Containers (deprecated)
- runV (deprecated)
- Kata Containers
These are implementations of the OCI runtime spec that are backed by a virtual machine interface rather than the host kernel. runV and Clear Containers have been deprecated and their feature sets absorbed by Kata Containers. They can all run standard OCI container images, although they do so with stronger host isolation: they start a lightweight virtual machine with a standard Linux kernel image and run the "containerized" process inside that virtual machine.
In contrast to native runtimes, sandboxed and virtualized runtimes have performance impacts through the entire life of a containerized process. In sandboxed containers, there is an extra layer of abstraction: the process runs on the sandbox unikernel/proxy, which relays instructions to the host kernel. In virtualized containers, there is a layer of virtualization: the process runs entirely in a virtual machine, which is inherently slower than running natively.
Container Runtime Interface
- containerd
- cri-o
When the Kubernetes container orchestrator was introduced, the Docker runtime was hardcoded into its node agent, the kubelet. However, as Kubernetes rapidly became popular, the community began to need support for alternative runtimes, and the Container Runtime Interface (CRI) was defined to decouple the kubelet from any particular runtime.
The first CRI implementation was the dockershim, which provided the agreed-upon layer of abstraction in front of the Docker engine. As containerd and runC were split out from the core of Docker, though, the dockershim has become less relevant; containerd now provides a full CRI implementation of its own.
containerd is Docker’s high-level runtime, managed and developed out in the open under the Moby project. By default it uses runC under the hood.
cri-o is a slim CRI implementation led by Red Hat, designed specifically for Kubernetes. It is intended to serve as a lightweight bridge between the CRI and a backing OCI runtime.
With the CRI, the Kubernetes developers created a well-defined interface to develop container runtimes against. If a container runtime implements the CRI, it can be used with Kubernetes.
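To give a feel for what that contract looks like, here is an abbreviated, hand-written Go sketch of a few RuntimeService calls. The real CRI is a gRPC service generated from protobuf in the Kubernetes cri-api repository, so the request and response types below are simplified stand-ins, not the actual generated code.

```go
// An abbreviated, hand-written sketch of the kind of contract the CRI defines.
package cri

import "context"

type PodSandboxConfig struct {
	Name      string
	Namespace string
	// ... labels, port mappings, Linux options, and more in the real API
}

type ContainerConfig struct {
	Image   string
	Command []string
	// ... mounts, env vars, resources, and more in the real API
}

// RuntimeService is what the kubelet calls; any runtime that implements the
// CRI (containerd, cri-o, ...) has to answer these kinds of requests.
type RuntimeService interface {
	// Pod-level lifecycle: set up the shared sandbox (network namespace, etc.).
	RunPodSandbox(ctx context.Context, cfg *PodSandboxConfig) (sandboxID string, err error)
	StopPodSandbox(ctx context.Context, sandboxID string) error

	// Container-level lifecycle inside an existing sandbox.
	CreateContainer(ctx context.Context, sandboxID string, cfg *ContainerConfig) (containerID string, err error)
	StartContainer(ctx context.Context, containerID string) error
	RemoveContainer(ctx context.Context, containerID string) error
}
```

containerd and cri-o each implement the real version of this service, which is why the kubelet can drive either of them without caring which low-level OCI runtime sits underneath.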
Container Engines
You may notice in reading the above that Docker is not a CRI or OCI implementation but uses both (via containerd and runC). In fact, it has additional features like image building and signing that are out of scope of either CRI or OCI specs. So where does this fit in?
Docker calls their product the "Docker Engine", and generically these full container tool suites may be referred to as container engines. No one except Docker provides such a full-featured single executable, but we can piece a comparable suite of tools together from the Container Tools project.
The Container Tools project follows the UNIX philosophy of small tools which do one thing well:
- podman — image running
- buildah — image building
- skopeo — image distribution
Taken together, these are an alternative to the standalone Docker stack, and adding the cri-o project as a CRI implementation provides the last missing piece.
The Container Network Interface
The CNI belongs to the CNCF (Cloud Native Computing Foundation) and defines how connectivity among containers, as well as between a container and its host, can be achieved. The CNI is not concerned with the properties or architecture of the container itself, which keeps it narrowly focused and simple to implement. This is reflected in the list of organizations and projects that have committed to the CNI: Kubernetes, OpenShift, Cloud Foundry, Amazon ECS, Calico, and Weave, to name a few. Find the CNI and a more extensive list on GitHub.
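The simplicity shows in the plugin contract itself: the runtime executes a plugin binary, passes the network configuration on stdin and parameters in CNI_* environment variables, and reads a JSON result from stdout. The Go sketch below is a heavily simplified illustration of that flow (error handling and the result fields are nowhere near complete compared to the real spec):

```go
// A minimal sketch of the contract a CNI plugin fulfills.
package main

import (
	"encoding/json"
	"io"
	"os"
)

func main() {
	// Network configuration as handed over by the container runtime.
	confBytes, _ := io.ReadAll(os.Stdin)
	var conf map[string]interface{}
	_ = json.Unmarshal(confBytes, &conf)

	switch os.Getenv("CNI_COMMAND") {
	case "ADD":
		// A real plugin would create a veth pair, move one end into the
		// namespace named by CNI_NETNS, assign an address, and report it back.
		result := map[string]interface{}{
			"cniVersion": conf["cniVersion"],
			"interfaces": []map[string]string{{"name": os.Getenv("CNI_IFNAME")}},
		}
		_ = json.NewEncoder(os.Stdout).Encode(result)
	case "DEL":
		// Tear down whatever ADD created; nothing to do in this sketch.
	}
}
```

Plugins such as the bridge plugin, Calico, or Weave implement exactly this contract, which is why a runtime can swap network providers without changing anything else.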
Summarizing container runtimes:
Low-level OCI runtimes such as runC and crun do the actual work of setting up namespaces, cgroups, and security policies; sandboxed and virtualized runtimes such as gVisor and Kata Containers trade some performance for stronger isolation; CRI implementations such as containerd and cri-o plug those runtimes into Kubernetes; and container engines such as Docker, or podman together with buildah and skopeo, tie everything together for users.
Ohh yes!! That was a lot of input, and I hope you have learned a bunch.
Hope you liked the tutorial. Please let me know your feedback in the responses section.