The magic behind LibOS and Unikernels
It’s been a while since I’ve written anything on Medium. In fact, I have been writing a series of blog posts about cryptography on my personal blog. Since these posts were math-heavy and had a lot of LaTeX equations, I didn’t bother to repost them onto Medium.
But I guess it’s time to come back to Medium every once in a while and write something about software engineering. So here I am, and today let’s take a look at a technology that could be the backbone of tomorrow’s cloud application infrastructure — the Unikernel paradigm and the Library OS (LibOS).
Preface
Recently I was reading up on how Docker containers and cloud virtualization technologies work, and I came across an idea called a Library OS (or LibOS).
A LibOS is basically an entire operating system implemented as a library, so user-space applications can use it directly to access system resources more efficiently. It upends our traditional understanding of how OSes work and introduces a paradigm for running cloud-native applications in virtualized environments that is efficient, low-overhead, and ultra-portable.
Let’s look at a simple example where a LibOS excels.
The Web Server Problem
Normally, when a web server running on Linux sends out a TCP segment, it has to save its application context, issue a system call, context-switch into kernel space, and let the Linux kernel’s TCP/IP stack assemble the segment and send it out on the wire. This approach is great for normal users and developers — the application (the web server, in this case) doesn’t need to care about how TCP segments are assembled and can simply hand the task off to the kernel. This paradigm greatly reduces the burden on the application layer and leaves all the complexity to the kernel of the underlying OS.
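To make this concrete, here’s roughly what that hand-off looks like from the application’s side (a minimal sketch using the standard POSIX socket API, with error handling omitted):

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Minimal sketch: everything interesting happens *below* this code.
 * send() traps into the kernel, which segments the bytes, runs the
 * TCP state machine, and hands the frames to the NIC driver. */
void reply(int client_fd) {
    const char *resp = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok";
    /* One libc call, but under the hood: save the user context,
     * switch into kernel space, run the in-kernel TCP/IP stack. */
    send(client_fd, resp, strlen(resp), 0);
    close(client_fd);
}
```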
While the standard approach works great for most use cases, it isn’t optimal when we want high efficiency. Imagine a production web backend running on the cloud that sends out TCP segments upon user requests. Here we want each segment to go out as quickly as possible, and sometimes we might want to dynamically tweak the congestion-control logic to optimize service quality. Now the networking abstraction that the underlying OS provides becomes a bit annoying — it shields away all the implementation details and introduces extra overhead from context-switching into kernel space.
Additionally, the web server could also run a SQL database, so it would need to perform I/O operations against the filesystem on disk. If you are familiar with how Linux filesystems work, you’d know that Linux represents everything as a file in a tree structure. If there is an external flash or hard drive, the Linux kernel first represents that piece of storage as a block device, which is just another file. Users can then mount a filesystem onto that device and interact with the “blocks” through files and directories. While this design is really elegant (it leverages all kinds of data structures and abstractions), it inadvertently introduces overhead when user-space programs read or write files. If the web server wants to update some information in the DB, it needs to modify the underlying file, which gets translated into a few operations on the block device. Once a request reaches the block device, the kernel takes over and performs it on the hardware. This whole chain can really slow down DB updates, and the inefficiency hides inside the abstraction layers. If MySQL could access the hardware directly, bypassing all the drivers and block-device abstractions, it could potentially be much, much faster.
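As a rough illustration, even a single durable update from user space walks that whole stack. A minimal POSIX sketch, with the layers traced in the comments:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: one small DB-style update and the layers it traverses. */
void update_record(const char *path, const void *rec, size_t len, off_t off) {
    int fd = open(path, O_WRONLY);   /* system call -> VFS path lookup       */
    pwrite(fd, rec, len, off);       /* VFS -> filesystem -> page cache      */
    fsync(fd);                       /* flush: block layer -> driver -> disk */
    close(fd);
}
```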
Moreover, the Linux kernel does more things in the background that we don’t always need — such as managing memory, scheduling multiple processes, and handling disk I/O and peripherals. Even when our web server barely uses these facilities, the Linux kernel still spends a good amount of memory and CPU time maintaining them.
Here we have enumerated a few things (abstractions, mostly) that were meant to do good in the Linux kernel design but inadvertently make the system inefficient. Now imagine we spawn 100 Linux containers, each running a microservice (a fancier, decentralized version of a web server): we are basically 100x’ing the overhead the Linux kernel adds to these network and disk operations. Looking at how many containers run on AWS, GCP, or Azure nowadays, it isn’t hard to imagine how much of these cloud platforms’ computing power is wasted on overhead introduced by the underlying OS. We are essentially wasting electricity (or even worse, fossil fuels) on unnecessary computation, and that isn’t looking great.
LibOS comes into the picture
The solution to the above-mentioned problem is pretty straightforward: if the OS imposes too much overhead, why don’t we, the user-space program, take over control of the machine, so we become the kernel and enter god mode?
The idea is to drop the user-space/kernel-space separation entirely, so our web server becomes the kernel and accesses the network card and disk directly. Of course, it would be unwise for every application to implement its own TCP/IP stack or filesystem, so we still need to keep the OS-level logic somewhere. It turns out we can move all of that logic into a library that our web server can simply import and use directly. But this time there are no context switches or indirect driver layers — just a direct function invocation into the OS library. And this is where it gets its name — a Library OS, or LibOS.
With a LibOS, our web server takes full control of the underlying hardware and uses a library full of OS logic to drive that hardware without system calls or context switches. More importantly, our application blends with the OS and essentially becomes the OS itself. This allows optimal hardware utilization with a small footprint.
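In pseudo-C, the difference looks something like this (the libos_* names are hypothetical, purely to show the shape; real LibOS network stacks expose their own APIs):

```c
#include <stddef.h>

/* Hypothetical LibOS TCP API: illustrative names, not a real library. */
struct tcp_conn;
extern size_t libos_tcp_send(struct tcp_conn *c, const void *buf, size_t len);

void reply_direct(struct tcp_conn *c, const void *buf, size_t len) {
    /* No trap, no context switch: just a plain function call into the
     * TCP/IP stack that is linked into our own address space. */
    libos_tcp_send(c, buf, len);
}
```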
Why LibOS now
Now we’ve seen what LibOS is and its benefits. The natural question that follows is: Why are we considering LibOSes now? Why haven’t we done this before?
In fact, the LibOS isn’t really a new concept. The idea has been around for a while; it just never got wide adoption. However, with cloud computing and containerization becoming the new norm, the idea has been resurrected, and now it’s more promising than ever.
To fully understand the answer to these questions, we have to travel back in time a bit and understand why we introduced OSes in the first place.
A not-so-brief overview of OSes and Kernels
Back in the days when we first invented computers, a computer really was as simple as a processor (CPU), some memory, and a few peripheral devices such as a display or a keyboard.
A classic computer runs programs, which are basically sequences of machine-readable binary instructions packed together. To run a program, the computer places it in memory, and the processor reads the instructions out one by one and performs whatever operations they say.
But when we power on the computer, how does the machine load these programs into memory in the first place? This is usually handled by a bootloader: a fixed block of code (stored in read-only memory, or ROM) at a known location in the memory address space. It contains the instructions for the processor to load the actual program we want to run into memory and then “jump” to it. When the computer gets power, it is hardwired to start executing the bootloader by default.
Usually, the bootloader contains logic such as: load the program stored in disk block #1 into memory, then jump to execute it. This gives us the freedom to put any program at block #1 on the disk, and the computer will then “magically” pick it up and start running.
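In very simplified C, the idea is just this (a toy sketch; disk_read_block stands in for whatever ROM routine reads a sector, and the load address is arbitrary):

```c
/* Toy first-stage bootloader. disk_read_block is a hypothetical ROM helper. */
extern void disk_read_block(int block, void *dst);

#define LOAD_ADDR ((void *)0x100000)   /* arbitrary address to load the program at */

void boot(void) {
    disk_read_block(1, LOAD_ADDR);     /* fetch the program stored at block #1 */
    void (*entry)(void) = (void (*)(void))LOAD_ADDR;
    entry();                           /* "jump" to the loaded program */
}
```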
Now our program is running on the computer, but let’s say we want to access the network card to send out IP packets. Usually, the network card is attached to a shared extension bus (such as PCI Express) and can be accessed through a specific memory address region using its supported protocol. So we write some dedicated logic that converts a sequence of bytes into actual IP packets on the wire, and we encapsulate that logic into functions so it can be reused in many places via function calls. This specialized logic is what we commonly call a driver.
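At its core, a driver is just a set of functions poking memory-mapped device registers. Something like this sketch, where the base address and register layout are entirely made up:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-mapped NIC registers; this layout is made up. */
#define NIC_BASE    0xFE000000UL
#define NIC_TX_DATA (*(volatile uint32_t *)(NIC_BASE + 0x0))
#define NIC_TX_CTRL (*(volatile uint32_t *)(NIC_BASE + 0x4))
#define NIC_TX_GO   0x1u

/* "Driver": turn a buffer of bytes into register writes on the bus. */
void nic_send(const uint8_t *frame, size_t len) {
    for (size_t i = 0; i < len; i++)
        NIC_TX_DATA = frame[i];    /* feed bytes into the device FIFO     */
    NIC_TX_CTRL = NIC_TX_GO;       /* tell the card: put them on the wire */
}
```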
Let’s evolve our computer a bit more. Now it has many drivers controlling all kinds of hardware: network, display, disk, keyboard, webcam, sound, and so on. This finally starts to look like a modern general-purpose personal computer. With these capabilities, we’d like to run more interesting programs. For example, we could run a game that uses the display, the disk, and the keyboard: we flash the game program to block #1 and let the bootloader pick it up automatically. The game then uses all the drivers we defined earlier to talk to the hardware.
While playing this game is fun, nobody buys a computer to run one game forever. Given all this hardware, we want to run different programs on it — for example, a video chat application that uses the webcam, the network card, and the sound card. With the same infrastructure, that means overwriting block #1 on disk with the new application so it can run.
While this is doable, it’s no fun at all — imagine pulling out a set of JTAG cables every time you want to launch a different program on your laptop. Ideally, the machine should let us choose what program to run. A potential solution: instead of booting directly into the target program, we put a second-stage bootloader at block #1 that does more interesting things, such as scanning the disk for all available programs and prompting the user to pick one. This chained-booting idea (chain loading) is indeed a very common approach.
Now let’s upgrade the challenge a bit more. What if the user wants to browse the internet while video chatting? In other words, what if we want to run multiple programs at the same time?
Now it gets complicated really fast. Previously, a program running on the computer solely owned all the hardware and saw memory as one long, contiguous address space. But if we allow multiple programs to run on the device, we need to figure out a way to divide the resources fairly. More importantly, we need to figure out how the processor can juggle multiple things at once without screwing up.
This is where all those OS concepts such as processes, context switching, scheduling, and memory virtualization come in. Indeed, I would say this is the most complicated part of an OS. But fundamentally, we introduced these concepts to solve one big problem: how to distribute resources fairly.
If you take an OS class in school, this is probably what you’d spend most of your time learning. In short, there are two major approaches to the resource-sharing problem.
The first approach is non-preemptive (cooperative). Each program occupies the system resources for as long as it wants and then voluntarily yields them to other programs. This is usually the simplest to implement, but it goes wrong very fast when there are infinite loops or malicious programs.
The second approach is preemptive. Here, something sits above the programs (i.e., the kernel) and takes care of resource distribution for them. That basically means granting all system privileges to the kernel and letting it decide what to do. Additionally, to make sure programs don’t mess with system resources (such as peripheral hardware) when they shouldn’t, we only allow the kernel to access the hardware drivers we defined above. We then define a system call interface through which programs “ask” the kernel to do things for them, and the kernel can vet each request before doing anything.
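Conceptually, the kernel side of that interface is a dispatcher that vets every request before touching the hardware on the caller’s behalf (a toy sketch; all names here are made up):

```c
/* Toy kernel-side system call dispatcher. Every name is hypothetical. */
struct task;                                       /* opaque caller identity */
extern int  task_may_use_net(struct task *t);      /* permission checks      */
extern int  task_may_use_disk(struct task *t);
extern long net_send(const void *buf, unsigned long len);    /* driver entry */
extern long disk_write(const void *buf, unsigned long len);

enum { SYS_NET_SEND = 1, SYS_DISK_WRITE = 2 };

long handle_syscall(struct task *caller, int num,
                    const void *buf, unsigned long len) {
    switch (num) {
    case SYS_NET_SEND:
        if (!task_may_use_net(caller))  return -1;  /* vet, then act */
        return net_send(buf, len);
    case SYS_DISK_WRITE:
        if (!task_may_use_disk(caller)) return -1;
        return disk_write(buf, len);
    default:
        return -1;                                  /* unknown request */
    }
}
```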
The second approach is more like today’s OSes. It’s generally harder to implement but it provides a general-purpose infrastructure to run all kinds of applications on one machine. It also has security protection mechanisms such as ACLs or permissions to make sure that programs with malicious intent cannot damage the device or impact other programs.
Phew… that’s the history of computers and operating systems in a few paragraphs. Next, let’s take a look at the different types of kernels.
Types of Kernels
Now we already know why we need a kernel in the OS. But what should the kernel do?
Turns out, there is no single correct answer here, as different OSes implement their kernels differently. There are a few types of kernels worth mentioning:
First, we have the monolithic kernel, with the Linux kernel as the classic example. A monolithic kernel is one where all the core functionality (scheduling, memory management) and the peripheral services (disk I/O, device drivers) are tightly knit into one giant “program” that runs in privileged space. Basically, all system resources are shielded away from user-space programs: if a program wants to access anything outside itself, it most likely has to go through the kernel.
The problem with monolithic kernels is prominent: if some unimportant service fails (say, a driver for some random piece of hardware), it can crash the entire kernel and cause a system-wide panic. It is also very hard to track down memory leaks or bugs in such a gigantic codebase.
Hence we have a different design, the microkernel. A microkernel is a minimal kernel that runs the core functionality in isolation from the other system services and device drivers, so if something out there goes wrong, it doesn’t bring down the entire system. It moves the less essential logic back into user space and keeps only what’s absolutely necessary in privileged kernel space.
The monolithic kernel and the microkernel are two extremes of a spectrum, and of course there are hybrid kernels that sit somewhere in between. There isn’t a universally correct choice; each OS picks its spot on this spectrum based on its creators’ beliefs and ideologies.
Finally, we also have a very unique candidate — the Unikernel.
Unikernels are specialized, single-address-space machine images constructed by using library operating systems.
If you read that carefully, you’ll realize that the Unikernel gives up all the technologies we have been building into modern OSes to distribute system resources fairly. Instead, it makes the target user-space program the sole owner of the machine and offers OS functionality as a library (the LibOS).
Now this might seem a bit odd. Why would anyone give up all the effort that went into building complex OSes and adopt this simplistic model? It means our system can only run one program, and that program has full control of all the resources. You might feel this sounds nothing short of catastrophic: there are countless scenarios where a poorly implemented or malicious program could do serious damage.
If you feel that way, you are absolutely right, and this is what everyone has believed for decades. But hold on to that thought and let’s see why it’s no longer the case today.
Hypervisors and Virtualized Instances
Fifteen years ago, if you wanted to host your own website, you’d have to rent a server rack in some server farm. The rack contained heavy machinery that probably ran Linux or Windows Server. You’d have to routinely maintain the actual hardware, wipe off dust, and manage the system resources yourself.
However, today is very different. One can easily start a virtual EC2 instance on AWS and instantly gain access to a Linux machine. The only difference is that this machine is no longer a real physical device. Instead, it’s a virtualized instance that probably uses a tiny fraction of AWS’s gigantic computing clusters.
What does this mean? It means that all the hardware resources we use today are already virtualized. We are essentially running a Linux instance inside a shielded VM. Underneath the VM, AWS probably uses Xen or other hypervisor technologies to make sure that the virtual machines are fully contained and cannot break out of their boundaries.
Remember that one of the original purposes of having a privileged kernel in an OS was to protect the machine from malicious programs? Now that the underlying machine is well-virtualized, that job has largely moved down a layer: even if a program gains privileged access inside the VM, it still cannot escape the virtual machine and do anything actually harmful.
This is where the Unikernel starts to shine.
Imagine we were originally running three (containerized) processes in an EC2 Linux instance: a web server, a SQL database, and a chat application. What we can do now is package each application into a separate Unikernel VM by compiling and linking it against a LibOS.
Due to the nature of how compiling and linking work, only the code the program actually uses gets included, and the unnecessary parts are stripped out. We then end up with three ultra-lightweight Unikernel VMs that can run directly on AWS’s Xen hypervisor. They can still communicate with each other via TCP sockets just like before. But what’s different is that we just broke up one fat OS + three programs into three skinny Unikernel OSes, each specialized to do just one thing (run the program it encapsulates).
By doing so, we strip away much of the OS overhead when deploying cloud-native applications on established hypervisor infrastructure. We can now deploy ultra-lightweight, application-specific VM instances that run directly on AWS, where our applications get blazing-fast, privileged access to system resources.
Wide Adoption of Containers
If the previous section still hasn’t convinced you that Unikernels and LibOSes are great for the cloud, here’s one more reason.
One of the biggest doubts about these application-specific Unikernel VMs is that processes now run isolated in different environments. We can no longer share memory or perform IPC via Unix domain sockets.
However, if you have been using Docker containers, you’ll realize that this is not really a problem at all. The wide adoption of containers has already made modern software systems isolate each program into a sandboxed environment.
If I want to start a web server, I would probably use docker-compose to start a cluster of containers. One container could be the MySQL instance, while three others each do their own job as a microservice. So to make these instances talk to each other, I already have to use TCP sockets or other high-level networking mechanisms.
If I switch these container instances over to Unikernel VMs, nothing breaks at all, because the VMs can still communicate with each other exactly as they did when they were encapsulated in Docker containers.
By breaking an application down into separate programs and encapsulating each one in an application-specific Unikernel VM via a LibOS, we get a lightweight, cloud-native, platform-agnostic way to deploy our software anywhere.
Unikernel/LibOS is the holy grail of the cloud computing era.
What can we do with LibOSes?
If you’ve read this far, you should have a rough idea of what a LibOS is and how the Unikernel approach could become the best strategy for deploying software systems on the cloud.
Finally, let’s talk about the other (potential) applications of this technology.
Secure Confidential Computing via SGX
The first application is Confidential Computing, which is a very hot topic right now given all the information security risks out there.
Basically, Confidential Computing means using cryptographic, privacy-preserving technologies to encrypt or hide sensitive data during computation. This ensures that even if the AWS server is compromised (even by its own admins) and the entire memory is dumped out for inspection, our program’s memory still doesn’t reveal any meaningful secrets or user data.
By adopting Confidential Computing, we can make sure that user data is not only stored securely and properly but also processed in a privacy-preserving way.
While many cryptographic primitives already exist (such as Fully Homomorphic Encryption, Private Information Retrieval, and Secure Multi-Party Computation), they are still pretty slow given their convoluted algebraic structures. Today, the most promising and fastest way to achieve Confidential Computing is via specialized hardware known as secure enclaves.
A secure enclave is a special region of processor/memory hardware that can hold secret information which is practically impossible for a malicious attacker to extract physically, even by disassembling and carefully probing the device. In 2015, Intel introduced Software Guard Extensions (SGX), which added a secure enclave to its processors. One can run a secure program inside the enclave, and the outside has no way to extract sensitive information from enclave memory. Instead, SGX offers a standardized interface for “insecure” user-space programs to make API calls into the “secure” programs running in enclave memory.
Although SGX is great, it poses one big challenge: we essentially need to break each application into two parts, an “insecure” part that runs outside the enclave and a “secure” part that runs inside it and handles the sensitive data. This creates a migration barrier for people who want to adopt the technology.
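To give a flavor of that split with Intel’s SGX SDK: the trusted entry points (“ECALLs”) are declared in an EDL file, and the untrusted half calls them through generated proxies. A rough sketch, where ecall_score_record is a hypothetical ECALL of our own and error handling is omitted:

```c
#include <stdio.h>
#include "sgx_urts.h"    /* SGX untrusted runtime (create/destroy enclaves) */
#include "enclave_u.h"   /* generated proxies for our hypothetical ECALLs   */

int main(void) {
    sgx_enclave_id_t eid;
    sgx_launch_token_t token = {0};
    int updated = 0, result = 0;

    /* "Insecure" half: load the signed enclave image into enclave memory. */
    sgx_create_enclave("enclave.signed.so", 1 /* debug */,
                       &token, &updated, &eid, NULL);

    /* Cross the boundary: the sensitive logic runs inside the enclave,
     * and only the result comes back out. */
    ecall_score_record(eid, &result, 42 /* hypothetical record id */);

    printf("score: %d\n", result);
    sgx_destroy_enclave(eid);
    return 0;
}
```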
The other challenge of SGX is that each enclave has only a limited amount of memory, so if the program inside wants OS-level resources or extra memory, it has to communicate with the outside world, enlarging the attack surface.
This is where a LibOS kills two birds with one stone: we can compile the entire original program with a LibOS and load the resulting Unikernel image into the enclave, so the program is entirely self-contained and needs no API calls to the outside. The instance is also lightweight, which fits the enclave’s limited memory. Typical examples include Alibaba’s Occlum and the OSCAR Lab’s Graphene.
Speeding up Serverless Computing
Another cool thing LibOSes can do is to speed up Function-as-a-Service, or Serverless Computing.
One big problem for Serverless is the cold start. If the function we wish to invoke is not actively loaded in the system, the server takes some time to load it before executing it: when the AWS Lambda function we want to call has been evicted from AWS’s function pool, it takes a few seconds to load the function, acquire the necessary resources, and finally do the computation. Users experience this as a delay when loading a Serverless web page after a long period of inactivity.
However, if we encapsulate every Lambda function with a LibOS into a Unikernel VM, the cold-start problem is greatly mitigated. Since these VMs are very lightweight (booting in roughly 100 ms) and self-contained (they don’t require extra resources), the Lambda server can pull up a VM and just run it with hardly any delay.
IoT devices
The third application is IoT devices. IoT devices suffer from privacy and security flaws by nature: they are deployed in the open and built with cheap hardware. Indeed, lots of IoT devices just run Linux and can be easily infiltrated; attackers can then conscript them into DDoS botnets or simply harvest sensitive sensor data from them.
If we build our applications with a LibOS into a Unikernel image, we remove much of the risk that comes from the OS itself: the attack surface of an IoT deployment shrinks dramatically, and all resources are managed solely by the application. Plus, the lightweight image means we can make these devices even cheaper.
Decentralized computing
Finally, the last application could be Decentralized Computing, such as Ethereum. Ethereum miners run the EVM (Ethereum Virtual Machine), which executes smart contracts and validates the world state.
However, smart contracts are highly limited in their functionality to prevent malicious behavior inside the EVM. For example, we cannot allow smart contracts to make system calls or control hardware devices, because it’s hard to keep that secure.
With LibOS, we can compile more advanced programs into Unikernel images and let the miners simply run those images to execute the smart contract programs. Since the entire OS is encapsulated in the image, we don’t need to worry about it breaking the security boundary and damaging the host machine.
To Conclude
I think LibOS and the Unikernel approach to deploying software applications give us a unique perspective on how to run untrusted programs on untrusted or virtualized platforms.
To be honest, as an engineer with an embedded background, I’m amazed, because this is essentially how we program embedded devices — with no OS boundaries, where programs directly control the hardware via driver libraries. It is truly fascinating that the same approach can be applied to modern cloud-native applications to make things run faster and more securely.