The One-VM Thesis

Per-container VMs aren't a waste of resources — they're a fundamentally different isolation posture, and the boot-time/memory evidence inverts the obvious reading.

Key Takeaways

container runs one lightweight VM per container, not many containers in one shared VM. This is its architectural thesis, and every other design choice flows from it.
A per-container VM carries more *steady-state* overhead than a container in an already-running shared VM, but it boots faster than bringing that shared VM up from cold, and no VM stays resident when nothing is running.
Isolation moves from a configuration problem (capabilities, seccomp, namespaces) to a structural property (separate kernels, separate init namespaces).
The trade-off is honest: more memory under load, no memory ballooning back to the host, and full VM overhead per container. Apple accepted that cost; the question is whether you should.

The paradox nobody warns you about

I came into this code expecting the per-container VM decision to be obviously wasteful. One kernel per container sounds like overengineering when Docker, Podman, and Lima all run dozens of containers inside a single Linux VM and call it a day. Then I read docs/technical-overview.md and one sentence rearranged my priors:

"Containers created using container require less memory than full VMs, with boot times that are comparable to containers running in a shared VM."

Less memory than full VMs. Boot times comparable to shared-VM containers. Both of those are true because the comparison isn't being made the way I assumed. The relevant comparison is not "per-container VM vs shared VM holding one container." It's "per-container VM that exists for the life of one process vs shared VM that lives forever whether you use it or not." That comparison is much closer than it looks on paper, and once you internalize it, the architectural choice stops looking like waste and starts looking like precision.

What the alternatives actually are

To see why Apple rejected the obvious model, it helps to lay the alternatives side by side. There are three reasonable ways to run Linux containers on macOS, and each one spends your resources differently:

Option A: Shared VM (Docker Desktop, OrbStack, Podman Desktop, Colima). A single Linux VM boots when the tool starts and stays up. Containers inside it are Linux processes confined by namespaces, cgroups, capabilities, and seccomp profiles. The kernel is shared. The VM's init system is shared. Each container usually gets its own network namespace, but the same kernel enforces all of them. Boot time per container is essentially "fork a process" (sub-second). Memory overhead per container is essentially the memory of its processes. The cost you pay is structural coupling: a kernel-level bug in one container's syscall path is a kernel-level bug in every container, because they all run in the same kernel.

Option B: Per-container VM (container). Each container run boots a small VM configured for one workload: a minimal Linux rootfs, a minimal set of core utilities, and the dynamic libraries your process actually needs. When the process exits, the VM tears down. There is no long-lived "the VM." Boot time per container is on the order of seconds, which is slower than fork but faster than spinning up a shared VM from cold. Memory overhead is the cost of one kernel per container, which on Apple silicon with a tuned rootfs is small but not zero. The cost you pay is duplicated kernel state and memory that is not returned to the host until the VM tears down.

Option C: Linux processes on macOS natively. Not a real option. macOS isn't Linux. Even if a compatibility layer could run Linux ELF binaries, you would not get a Linux filesystem tree, an init system, or the syscall surface most server software assumes. Drop this from the decision matrix.

The decision is between A and B. They are not equivalent; they are postures.

graph TB
    subgraph "Option A: Shared VM (Docker Desktop, OrbStack)"
        H1[macOS host] --> VM1[One long-lived Linux VM]
        VM1 --> K1[Single shared kernel]
        K1 --> P1[Container 1<br/>Linux process]
        K1 --> P2[Container 2<br/>Linux process]
        K1 --> P3[Container N<br/>Linux process]
    end
    subgraph "Option B: Per-container VM (container)"
        H2[macOS host] --> A[container-apiserver]
        A --> V1[VM 1<br/>own kernel]
        A --> V2[VM 2<br/>own kernel]
        A --> V3[VM N<br/>own kernel]
        V1 --> C1[Container 1<br/>only process]
        V2 --> C2[Container 2<br/>only process]
        V3 --> C3[Container N<br/>only process]
    end

A is a building with apartments. The plumbing is shared, the heat is shared, the front door is shared. You're trusting the building's infrastructure to enforce who can reach whom. B is a row of hotel rooms, each with its own front door, its own plumbing, its own thermostat. Trust shifts from "the building's rules" to "the door locks."

That's the architectural thesis, in one image. It's not about performance numbers. It's about where you put the trust boundary.

What a per-container VM actually buys you

Three properties matter, and they all follow from "separate kernel":

1. Kernel isolation. A container in Option B cannot trigger a kernel panic that takes down siblings, because they have different kernels. A kernel-level container escape (the historical CVE class that keeps Linux container security teams employed) is contained by the VM boundary rather than relying on seccomp + capabilities to hold. The risk model is smaller and the surface to audit is smaller.

2. Per-container kernel configuration. Want a kernel with CONFIG_KVM=y enabled for nested virtualization? Pass --kernel /path/to/vmlinux-kvm. Want a custom init image that injects eBPF programs before your OCI container starts? Use --init-image. You are not asking the host admin to rebuild a shared kernel; you are swapping one file on a per-container basis. The flexibility is real.

3. Independent lifecycle. A container in Option B can crash, hang, or exit without affecting any other container, because there's no shared process supervisor, no shared init, no shared mount namespace. The "one container wedges the daemon" failure mode of shared-VM setups is structurally absent here.
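The per-container kernel swap in point 2 can be sketched as a small helper that assembles the command line. This is a minimal sketch, not the tool's own API: the function name `build_run_command` is hypothetical, the image name and file paths are placeholders, and the only flags used are the `--kernel` and `--init-image` options named above. The helper only builds the argument list; it does not execute anything.

```python
from typing import List, Optional


def build_run_command(
    image: str,
    kernel: Optional[str] = None,
    init_image: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not execute) an argv list for `container run`.

    Hypothetical helper for illustration. Only the two flags mentioned
    in the text are supported; everything else is left at its default.
    """
    argv = ["container", "run"]
    if kernel is not None:
        # Swap the kernel for this one container only.
        argv += ["--kernel", kernel]
    if init_image is not None:
        # Swap the init image for this one container only.
        argv += ["--init-image", init_image]
    # The image reference goes last, after all options.
    argv.append(image)
    return argv
```

Two calls with different `kernel` values describe two containers with different kernels on the same host, which is the point: the choice is an argument to one run, not a property of a shared VM.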

What it doesn't buy you:

1. Memory return to the host. Pages freed inside a container's VM stay allocated to that VM until it tears down. The macOS Virtualization.framework's memory ballooning support is partial; the docs are explicit about this. Long-running container fleets need attention.

2. Sub-second startup. Per-container VMs boot in seconds. If your workload is "spin up 200 short-lived containers to handle a queue," this model will not be the cheapest option.

3. The familiar Docker developer experience. Volumes work, but the model is "mount what you need into the VM," not "share a filesystem with the daemon VM." Port publishing works, but it goes through Apple's vmnet.framework, which has its own constraints. SSH forwarding works, but through an explicit --ssh flag with documented limitations.
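The memory point is easier to reason about as a high-water mark. The sketch below is a deliberately simplified model, assuming that the host-side footprint of a VM never shrinks while the VM is alive; the function name and the sample numbers are made up for illustration and are not measurements.

```python
from typing import Iterable, List


def host_footprint_without_reclaim(guest_usage_mib: Iterable[int]) -> List[int]:
    """Model the host-side footprint when freed guest pages are not returned.

    Simplifying assumption: at every step the host holds the highest
    guest usage seen so far, until the VM is torn down.
    """
    footprint = []
    high_water = 0
    for used in guest_usage_mib:
        high_water = max(high_water, used)
        footprint.append(high_water)
    return footprint


# Illustrative usage trace in MiB: a spike to 900, then back down.
trace = [200, 900, 300, 250]
print(host_footprint_without_reclaim(trace))  # [200, 900, 900, 900]
```

Under this model a single spike sets the cost for the rest of the container's life, which is why long-running containers deserve more attention than short-lived ones.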

The boot-time evidence, in concrete terms

When the docs say "boot times comparable to containers running in a shared VM," they're not exaggerating. On an M-series Mac, a container run for a small Alpine image typically completes the user-facing command in well under five seconds, even right after container system start has brought the services up from cold. That's not "VM boot speed" in the data-center sense; it's the cost of bringing up a tiny init system on top of the Virtualization framework's already-warm virtio plumbing, plus the OCI image layer cache doing its job.
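Boot-time claims like this are worth checking on your own machine. The helper below is a generic wall-clock timer built only on the Python standard library; the name `time_command` is hypothetical, and it measures the whole user-facing command rather than VM boot in isolation.

```python
import subprocess
import time
from typing import List, Sequence


def time_command(argv: Sequence[str], runs: int = 3) -> List[float]:
    """Run argv `runs` times and return the wall-clock seconds of each run.

    Output is discarded so terminal I/O does not skew the timing.
    A non-zero exit status raises CalledProcessError.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return samples
```

Assuming the CLI is installed and an Alpine image is available, a call such as `time_command(["container", "run", "--rm", "alpine:latest", "true"])` would time a full run; that exact invocation is an assumption, so adjust it to whatever image you already use. The first sample may include an image pull, so compare later samples rather than the first.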

The reason it's that fast is also why it's defensible: the per-container VM doesn't carry the weight of a general-purpose Linux userland. The vminitd init image is intentionally minimal. The default rootfs is small. The kernel is the Kata static kernel, which is itself tuned for fast boot in container scenarios. Every layer of the stack is making the same bet.

Now imagine you're running a CI matrix of twelve test containers in parallel. With Option A, you'd amortize one VM across all twelve. With Option B, you boot twelve small VMs. The memory pressure is real and the boot time is multiplied. The honest version of the trade-off is: if your containers are long-lived and you run a few of them, Option B is competitive or better. If your containers are short-lived and you run many, Option A wins on cost. The single-container dev workflow that most readers actually have? Option B holds its own.
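The trade-off can be written down as a small memory model. This is a back-of-the-envelope sketch, not a benchmark: the `Costs` fields and every number in the example are placeholder assumptions chosen for illustration, and the model ignores page sharing, caches, and boot time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    shared_vm_base_mib: int   # memory the shared VM holds even when idle
    per_vm_overhead_mib: int  # kernel + init overhead of one per-container VM
    workload_mib: int         # memory the containerized process itself uses


def shared_vm_memory(n: int, c: Costs) -> int:
    """Option A: one VM's base cost plus n workloads."""
    return c.shared_vm_base_mib + n * c.workload_mib


def per_container_vm_memory(n: int, c: Costs) -> int:
    """Option B: every container pays its own VM overhead."""
    return n * (c.per_vm_overhead_mib + c.workload_mib)


def break_even_containers(c: Costs) -> float:
    """Container count at which both options use the same memory.

    Setting base + n*w equal to n*(o + w) gives n = base / o.
    """
    return c.shared_vm_base_mib / c.per_vm_overhead_mib


# Placeholder numbers, not measurements.
costs = Costs(shared_vm_base_mib=1024, per_vm_overhead_mib=128, workload_mib=256)
print(break_even_containers(costs))                                    # 8.0
print(shared_vm_memory(1, costs), per_container_vm_memory(1, costs))   # 1280 384
print(shared_vm_memory(12, costs), per_container_vm_memory(12, costs)) # 4096 4608
```

With these made-up inputs the single-container case favors Option B and the twelve-container case favors Option A. The crossover moves with the ratio of the shared VM's idle cost to the per-VM overhead, so it is worth plugging in numbers measured on your own setup.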

What this means for how you use the tool

Here's the part where I want to push back on a common misreading. The per-container VM is not "more secure in every dimension." It is more secure along the kernel axis: kernel escapes are contained, kernel panics are contained, kernel CVEs that depend on sysctl defaults are contained per-VM. It is *not* more secure along the application axis: if your application has a vulnerability, the per-container VM doesn't help.

What this means in practice:

Don't choose container *because* it's "more secure than Docker." Choose it if the kernel-isolation axis is what you actually need (multi-tenant CI, defense in depth against kernel CVEs, isolation of untrusted workloads).
Don't dismiss container *because* it's "wasteful." On Apple silicon, with the tuned rootfs and the minimal init, the per-container overhead is smaller than the framing suggests. The waste shows up under specific workload patterns, not under the typical dev pattern.
Do treat the macOS-version matrix seriously. Chapter 4 walks through it.

I came into this analysis assuming Apple shipped per-container VMs because they couldn't figure out how to do shared-VM. The source convinced me they did it because the kernel-isolation axis was non-negotiable, the boot-time evidence held up, and the team was willing to pay the memory-ballooning tax to keep that axis intact. That's a defensible architectural posture, not a workaround.

The next chapter walks through what that posture actually looks like when you implement it: the API server, the XPC helpers, and the per-container runtime that turn the architectural thesis into a Swift process tree.