Deep Dive into firecracker-containerd (re:Invent 2019, CON408)

Samuel Karp
December 02, 2019

Last year, we released the Firecracker virtual machine monitor (VMM) built on top of the Linux KVM subsystem, which is optimized for lightweight, container-like “microVMs.” In this session, we dive deep into the architecture of the firecracker-containerd project, which aims to allow portability between standard OCI container images and the larger container ecosystem with Firecracker microVMs. Topics covered include the standard containerd architecture with the reference OCI runtime (runc), challenges adapting containers into microVMs, and the firecracker-containerd suite.

Transcript

  1. Deep dive into firecracker-containerd (CON408-R)
    Samuel Karp, Senior Software Development Engineer, Amazon Web Services
    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Related breakouts
    • CON409 Security and monitoring in a serverless world on AWS Fargate
    • CON423 AWS Fargate under the hood
  3. Agenda
    • Linux containers, virtual machines, and isolation
    • What is a container runtime?
    • What is containerd?
    • The Firecracker virtual machine monitor (VMM)
    • Adapting containerd to Firecracker
    • Current status and roadmap
    • Q&A
  4. A really brief overview of containers
    • A mechanism for running software
    • With some isolation
    • With some repeatability
    • With a standard format for distribution
    • With common tooling
  5. Linux container primitives
    • Namespaces – Visibility restrictions
    • Control groups (cgroups) – Resource limits
    • Capabilities – Permission restrictions
    • Seccomp – Syscall allow/deny lists
    • Linux Security Modules – Resource access control
    • Union filesystems – Image layers
  6. What don’t containers give you?
    • Independent kernel behavior (kernel tuning)
    • Security isolation from other containers
  7. Containers and VMs
    Containers
    • Use Linux primitives to separate processes
    • Share a Linux kernel
    • Fast starts, minimal overhead
    • Flexible configuration
    Virtual Machines
    • Virtualize or emulate hardware components
    • Completely separate kernels (maybe not Linux)
    • Slower starts: must boot kernel and set up hardware
    [Diagram: containers on hardware and a shared Linux kernel (namespaces, cgroups, ...); VM guests on hardware, a Linux kernel with KVM, and virtual hardware per guest]
  8. Why use VMs?
    • Independent Linux kernel in each VM
    • Virtual machine monitor (VMM) is an additional isolation boundary
    • Interface between VM and VMM (hypercalls) is defined by the hypervisor
    • Hardware interfaces are standardized
    • Good for defining trust and resource boundaries
    • Isolating multi-tenant workloads
    • Isolating non-trusted workloads
  9. What do we mean by isolation?
    • Prevent customers from affecting each other
    • Prevent customers from affecting the infrastructure
    • Defense in depth
        • Container security
            • seccomp
            • Linux security modules
            • Capabilities
        • Hypervisor
            • Emulation, virtualization, or pass-through
  10. Common container tooling
    • Docker UX (docker build, docker run)
    • Images and registries for software distribution (docker push, docker pull)
    • Container orchestrators
        • Amazon Elastic Container Service (Amazon ECS)
        • Kubernetes
        • Mesos
    • Open Container Initiative (OCI)
        • Image standard
        • Runtime standard
        • Distribution standard
  11. How you run containers
    Part of stack: example components
    • Cluster orchestrator: Amazon ECS, Kubernetes, Mesos
    • Local management: Docker or containerd
    • Container runtime: runc or Firecracker
  13. Container runtimes
    • Mechanism for starting and managing container workloads
    • (Linux containers) Set up cgroups, namespaces, filesystems, capabilities, seccomp, etc.
    • OCI runtime specification
        • Command-line interface for setting up a container
        • On-disk “bundle”
            • Root filesystem
            • JSON file describing configuration
    • runc
        • Reference implementation
        • Split out from Docker
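
The on-disk “bundle” from slide 13 can be pictured with a small Python sketch that writes a root filesystem directory and a `config.json` into a temporary directory. The field names (`ociVersion`, `process`, `root`, `linux.namespaces`) are taken from the OCI runtime specification; the helper name `write_bundle`, the version string, and the values are illustrative assumptions, and a real bundle needs more configuration (and an actual root filesystem) before a runtime would accept it.

```python
import json
import tempfile
from pathlib import Path


def write_bundle(bundle_dir, args):
    """Write a minimal, illustrative OCI-style bundle (hypothetical helper)."""
    bundle = Path(bundle_dir)
    # The bundle holds the container's root filesystem...
    (bundle / "rootfs").mkdir(parents=True, exist_ok=True)
    # ...and a JSON file describing how to run it.
    config = {
        "ociVersion": "1.0.1",  # assumed spec version, for illustration only
        "process": {"args": list(args), "cwd": "/"},
        "root": {"path": "rootfs", "readonly": True},
        "linux": {
            "namespaces": [{"type": t} for t in ("pid", "mount", "network")]
        },
    }
    (bundle / "config.json").write_text(json.dumps(config, indent=2))
    return config


with tempfile.TemporaryDirectory() as tmp:
    cfg = write_bundle(tmp, ["/bin/sh"])
    on_disk = json.loads((Path(tmp) / "config.json").read_text())
    bundle_entries = sorted(p.name for p in Path(tmp).iterdir())
```

An OCI runtime such as runc is then pointed at the bundle directory rather than at an image; the runtime never needs to know how the root filesystem was assembled.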
  15. containerd
    • Daemon for managing containers
    • Modular framework for container lifecycle workflows
    • Integrates with OCI runtimes and containerd v2 runtimes
  16. The containerd stack
    • gRPC API and services
    • Storage services
        • Content store
        • Snapshotters
    • Runtime (OCI/runc, v2)
    [Diagram labels: gRPC, Metrics, Storage, Content, Snapshot, Diff, Metadata, Images, Containers, Tasks, Events, Runtimes]
  20. Firecracker virtual machine monitor (VMM)
    • KVM-based VMM in Rust
    • Open source
    • Targeted at serverless workloads
    • Not a general-purpose VMM
  21. Firecracker design goals
    Security
    • Very limited device model
    • Very limited feature set
    • Eliminate guest interactions with host kernel
    • Sandbox/jail the VMM
    • Memory-safe programming language
    • Single VM per Firecracker process
    Efficiency
    • Fast boot time
    • Low memory and CPU overhead
    • API driven
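
“API driven” on slide 21 means a microVM is configured and started through HTTP requests sent to the Firecracker process over a Unix socket. The sketch below only builds the request payloads and sends nothing. The resource paths and field names are those of Firecracker's public API as best understood here and may differ between releases; the function name `microvm_requests`, the drive ID, the boot arguments, and the file paths are hypothetical placeholders.

```python
import json


def microvm_requests(kernel_path, rootfs_path, vcpus=1, mem_mib=128):
    """Return (method, path, body) tuples that would configure and boot a VM.

    Nothing is sent; this only builds the payloads. The file paths are
    hypothetical placeholders supplied by the caller.
    """
    return [
        ("PUT", "/machine-config",
         {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source",
         {"kernel_image_path": kernel_path,
          "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("PUT", "/drives/rootfs",
         {"drive_id": "rootfs", "path_on_host": rootfs_path,
          "is_root_device": True, "is_read_only": False}),
        # Devices are declared before boot; the start action comes last.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


requests_to_send = microvm_requests("/tmp/vmlinux", "/tmp/rootfs.ext4")
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body))
```

Because the whole machine is described up front by a handful of small requests, a higher-level component can own the VM lifecycle without a human-facing management layer in between.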
  22. firecracker-containerd goals
    Containers
    • Compatible images
    • Familiar tooling
    • Support existing workflows
    • Allow composition of containers
    • Integrate with orchestrators
    • Minimal additional overhead
    Security
    • Hypervisor-based isolation
    • Limited access to the host
  23. Adapting containers to Firecracker
    Firecracker VMM considerations
    • No filesystem sharing
    • No dynamic device attachments
    • Limited networking options (tap, not veth)
    • Cross-boundary communication with vsock
    Adapting to containerd
    • Block-device snapshotter
    • API to manage VMM lifecycle
    • Split the “shim” into two parts
        • “Runtime” on the host
        • “Agent” inside the VM
            • …that runs containers via runc
    • Network
        • tap device
        • Usable with Container Network Interface (CNI) plugins
  24. firecracker-containerd architecture
    [Architecture diagram. Labels: microVM, containerd, runc, Content store, Disk, Kernel, Internal FC agent, Container, FC control plugin, Block-device snapshotter, Container root fs, Firecracker VMM, FC runtime]
  26. What is a “block device” snapshotter?
    • Store the container image layers
    • Manage writable space for each container
    • Compose container image layers and writable space (snapshot)
    • Treat each snapshot as a device to attach to the VM
    • Inside the VM, mount the device to expose its filesystem
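
The responsibilities listed on slide 26 can be modeled in a few lines of Python. This is a toy, in-memory stand-in and not the project's snapshotter: image layers are plain dictionaries, the class names `Snapshot` and `ToySnapshotter` are invented for illustration, and the device names are hypothetical. It only shows the idea that each snapshot combines shared read-only layers with private writable space and is handed out as its own device.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    key: str
    layers: tuple  # read-only image layers, lowest first
    writes: dict = field(default_factory=dict)  # per-container writable space

    def read(self, path):
        # Writable space wins, then the topmost layer that has the path.
        if path in self.writes:
            return self.writes[path]
        for layer in reversed(self.layers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)


class ToySnapshotter:
    """In-memory stand-in: each snapshot maps to one pretend block device."""

    def __init__(self):
        self.devices = {}

    def prepare(self, key, layers):
        snap = Snapshot(key, tuple(layers))
        device = f"/dev/toy{len(self.devices)}"  # hypothetical device name
        self.devices[device] = snap
        return device


base = {"/etc/os-release": "base", "/bin/app": "v1"}
update = {"/bin/app": "v2"}
snapshotter = ToySnapshotter()
dev_a = snapshotter.prepare("snap1", [base, update])
dev_b = snapshotter.prepare("snap2", [base, update])
snapshotter.devices[dev_a].writes["/tmp/out"] = "only in snap1"
```

Two containers started from the same image share the layers but get separate devices, so a write in one snapshot is never visible in the other.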
  28. What does the “firecracker-control” plugin do?
    • First-class VM construct and API for Firecracker
    • Specify VM-related parameters like the kernel and VM root filesystem
    • Allocate and manage VM resources: block devices, network interfaces, etc.
    • Manages the VM lifecycle
    • Compiled-in plugin
    • gRPC API over the same socket
    • Specific to Firecracker for now
    • Looking for a better generic solution
  30. What does the “runtime” do?
    • Proxies container management commands to agent inside VM
    • Proxies I/O streams
    • Proxies events and metrics from inside the VM back out to containerd
    • Built on containerd’s V2 API
    • ttrpc API for communication
    • Many containers per VM, one runtime
  32. What does the “agent” do?
    • Manages the lifecycle of the containers inside the VM
    • Communicates over vsock
    • Receives container management commands from runtime
    • Proxies I/O streams
    • Proxies events and metrics from inside the VM back out to the runtime
    • Associates each container with the appropriate block device
    • Mounts container filesystems
    • Uses runc to set up cgroups, namespaces, etc.
    • Looks like a standard Linux container to the workload inside
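
The split between the host-side runtime (slide 30) and the in-VM agent (slide 32) is a proxy pattern: the runtime forwards commands across the VM boundary and relays the replies. The sketch below imitates that shape under loud assumptions: `socket.socketpair()` stands in for the vsock connection, newline-delimited JSON stands in for ttrpc, and the names `agent_loop`, `RuntimeProxy`, and the `create`/`state` operations are invented for illustration.

```python
import json
import socket
import threading


def agent_loop(conn, containers):
    """Pretend agent: handle newline-delimited JSON commands until EOF."""
    reader, writer = conn.makefile("r"), conn.makefile("w")
    for line in reader:
        cmd = json.loads(line)
        if cmd["op"] == "create":
            containers[cmd["id"]] = "running"
        state = containers.get(cmd["id"], "unknown")
        writer.write(json.dumps({"id": cmd["id"], "state": state}) + "\n")
        writer.flush()
    reader.close()
    writer.close()
    conn.close()


class RuntimeProxy:
    """Pretend host-side runtime: forwards each command and returns the reply."""

    def __init__(self, conn):
        self._conn = conn
        self._reader = conn.makefile("r")
        self._writer = conn.makefile("w")

    def call(self, op, container_id):
        self._writer.write(json.dumps({"op": op, "id": container_id}) + "\n")
        self._writer.flush()
        return json.loads(self._reader.readline())

    def close(self):
        self._writer.close()
        self._reader.close()
        self._conn.close()


host_end, guest_end = socket.socketpair()  # stand-in for a vsock connection
guest_state = {}
agent = threading.Thread(
    target=agent_loop, args=(guest_end, guest_state), daemon=True
)
agent.start()

runtime = RuntimeProxy(host_end)
created = runtime.call("create", "c1")
queried = runtime.call("state", "c2")
runtime.close()
agent.join(timeout=5)
```

One proxy object serves every container in the VM, which mirrors the “many containers per VM, one runtime” point; the container ID in each message is what lets the agent tell them apart.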
  33. How we run containers with firecracker-containerd
    [Sequence diagram, built up over slides 33–49. Participants, in order: orchestrator, containerd, snapshotter, control plugin, Firecracker, runtime, agent (“for each container”), container process (“for each container”). Each slide below lists only the messages it adds.]
  34. (no new messages)
  35. prepare snapshot / prepare snapshot
  36. snapshot $snap1 / snapshot $snap1
  37. create VM / start / start & boot
  38. VM $vm1
  39. (no new messages)
  40. run container $c1 from $snap1 / run container $c1 from $snap1
  41. attach $snap1
  42. mount $snap1
  43. run container $c1 / start process
  44. container $c1 running / container $c1 running
  45. subscribe events for $c1
  46. After some time, container process exits
  47. container $c1 exited / container $c1 exited / container $c1 exited
  48. stop VM $vm1 / stop VM
  49. (no new messages)
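
The message order in the sequence diagram on slides 33 to 49 (prepare a snapshot, create and boot the VM, attach and mount the snapshot, run the container, observe its exit, stop the VM) can be replayed with a toy state tracker. The class `ToyOrchestration` and its method names are invented for this sketch, and the ordering checks it enforces, such as refusing to stop a VM that still has running containers, are assumptions rather than documented behavior of the project.

```python
class ToyOrchestration:
    """Replays the slide sequence and rejects out-of-order steps."""

    def __init__(self):
        self.snapshots = set()
        self.vms = {}  # VM id -> set of running container ids
        self.log = []

    def prepare_snapshot(self, snap):
        self.snapshots.add(snap)
        self.log.append(f"snapshot {snap} prepared")

    def create_vm(self, vm):
        self.vms[vm] = set()
        self.log.append(f"VM {vm} started and booted")

    def run_container(self, vm, container, snap):
        if snap not in self.snapshots:
            raise RuntimeError(f"snapshot {snap} was never prepared")
        if vm not in self.vms:
            raise RuntimeError(f"VM {vm} does not exist")
        # Attach on the host side, mount inside the VM, then start the process.
        self.log += [
            f"attach {snap}",
            f"mount {snap}",
            f"container {container} running",
        ]
        self.vms[vm].add(container)

    def container_exited(self, vm, container):
        self.vms[vm].remove(container)
        self.log.append(f"container {container} exited")

    def stop_vm(self, vm):
        if self.vms[vm]:
            raise RuntimeError("containers still running")  # assumed rule
        del self.vms[vm]
        self.log.append(f"VM {vm} stopped")


flow = ToyOrchestration()
flow.prepare_snapshot("snap1")
flow.create_vm("vm1")
flow.run_container("vm1", "c1", "snap1")
flow.container_exited("vm1", "c1")
flow.stop_vm("vm1")
```

Calling `run_container` before `create_vm` raises an error in this model, which reflects the point that the VM is a separate construct the orchestrator has to manage explicitly.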
  50. Current status
    • Working toward production readiness
    • Features
        • Multiple containers per VM & a VM-management API
        • Networking with CNI plugins
        • Compatibility with existing container images
    • Active work
        • “Jailing” the Firecracker VMM
    • Still to come
        • Integration with Kubernetes Container Runtime Interface (CRI)
        • Removing or replacing the custom containerd build
        • Polishing (documentation, bug bash, releases)
  51. Kubernetes CRI
    • Need to automatically model VM lifecycle and container grouping
        • Most straightforward approach is to infer from the “pause” container
    • Need to determine the size of the VM
        • CRI doesn’t convey total pod resources at creation time
    • Need to determine a path for volumes
        • Kubernetes expects to mount volumes to containers with things like secrets
    • Need to determine if we can reuse cri-containerd
        • containerd’s CRI implementation can’t currently talk to our VM API
  52. firecracker-control plugin
    • firecracker-control plugin (VM API) needs a custom build
        • Needed to expose VM API over containerd’s domain socket
        • Want to eliminate this requirement
    • VM API is specific to Firecracker
        • Requires integrators to know details about Firecracker
  53. A brief note before we finish
    Session surveys provide valuable information to speakers
    Feedback that is very helpful
    • Topics you were excited to learn about
    • Suggestions for improving understanding and clarity
    Feedback that is extremely unhelpful
    • Comments unrelated to talk content (please refer to the AWS Live Events Code of Conduct)
    The “hallway track” is always open!
    Feedback and questions welcome ([email protected], @samuelkarp)
    For support, use the AWS forums or contact AWS Support
  54. Thank you!
    Samuel Karp, [email protected], @samuelkarp
    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.