
Orchestrating and monitoring Fedora CoreOS auto-updates

Luca Bruno
January 26, 2020

Fedora CoreOS follows the “auto-updates by design” model. It sports a modern architecture, designed around lessons learned from past experience and built with safety in mind.

This talk describes the Fedora CoreOS approach to orchestrating and monitoring auto-updates for a cluster of Linux machines. In particular, it covers the auto-updates protocol (Cincinnati), the OS agent and client-side logic (Zincati), cluster-wide orchestration (Airlock), and overall observability from a single pane of glass (with Prometheus metrics).


Transcript

  1. $ whoami
     “OS engineer, Rust & Go developer, enthusiastic FLOSS supporter”
     • ex-CoreOS, Software Engineer
     • Red Hat, Berlin office
     • Previously: security researcher/engineer
  2. Overview
     • Fedora CoreOS (FCOS)
       ◦ Auto-updates as first-class OS feature
     • Cincinnati
       ◦ Graphs and update-hints
     • Zincati
       ◦ Safety and observability
     • Airlock
       ◦ Fleet-wide reboot orchestration
  3. Overview (cont.)
     • local_exporter
       ◦ Bridging host-local services & cluster monitoring
     • Demo
       ◦ Prometheus & Grafana in action
  4. Overall goals
     • Port the Container Linux model to Fedora CoreOS
     • Continuous auto-updates as a first-class OS feature
     • Atomic OS updates/rollbacks
     • Phased rollouts with multiple update channels
     • Cluster-orchestrated reboots
     • Observability, single pane of glass
     • Safety across the board (Go & Rust)
  5. Phased rollouts
     • Release artifacts are uploaded to the public Web
       ◦ But not immediately announced for auto-updates
     • Release engineers define a timeframe for rollouts
       ◦ 0% -> 100% over a time window (pausable)
     • Updates are gradually exposed to nodes
     [Timeline diagram: exposure ramps from 0% to 100% over time t]
  6. Observability
     Single pane of glass for all the auto-updates machinery:
     • Prometheus for metrics collection
     • Thanos for history recording
     • AlertManager for whitebox alerting (example rule after this list)
     • Grafana for visualization
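     As an example of whitebox alerting on top of this stack, a Prometheus rule can fire when an update agent stops refreshing. A minimal sketch, assuming a hypothetical zincati_update_agent_last_refresh_timestamp gauge (check the agent's exported metrics for the real names):

       groups:
         - name: os-auto-updates
           rules:
             - alert: UpdateAgentStale
               # Hypothetical metric name, for illustration only.
               expr: time() - zincati_update_agent_last_refresh_timestamp > 3600
               for: 15m
               labels:
                 severity: warning
               annotations:
                 summary: "Update agent on {{ $labels.instance }} has not refreshed for over an hour"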
  7. Auto-updates
     [Architecture diagram: on fedoraproject.org infra, Cincinnati and the OSTree repo; on each FCOS host, Zincati drives rpm-ostree; in the local cluster, Airlock (or another coordinator) backed by etcd3 (or another store). Our focus for today: Zincati, Airlock, and their monitoring.]
  8. Cincinnati
     JSON-based protocol for FCOS updates (OpenShift too!); a sample graph document follows this list
     • Provides update-hints as a DAG
     • Backend on fedora-infra (on k8s, in Rust)
     • Scrapes FCOS metadata, builds a graph (per stream)
     • Serves the graph to clients, with specific mutations:
       ◦ Barriers
       ◦ Dead-ends
       ◦ Phased rollouts
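     As a sketch of the wire format: a client fetches a JSON document containing release nodes plus directed edges, with each edge given as a pair of indexes into the node list. The field contents below are illustrative, not an exact FCOS payload:

       {
         "nodes": [
           { "version": "31.20191217.3.0", "payload": "<ostree-commit-checksum>", "metadata": {} },
           { "version": "31.20200108.3.0", "payload": "<ostree-commit-checksum>", "metadata": {} }
         ],
         "edges": [ [0, 1] ]
       }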
  9. Phased rollouts
     Node identity and time influence the (client-observed) graph
     [Diagram: nodes A and B observe different v0 -> v1 graphs at time 0 vs. time 0+x]
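     One way to picture the gating (a simplified sketch in Rust, not the actual Cincinnati/Zincati algorithm): derive a stable per-node threshold from the node identity, and expose the new edge once the elapsed fraction of the rollout window passes that threshold.

       use std::collections::hash_map::DefaultHasher;
       use std::hash::{Hash, Hasher};

       /// Illustrative only: should this node see the update edge yet,
       /// given a rollout window [start, start + duration]?
       fn edge_visible(node_id: &str, release: &str, now: u64, start: u64, duration: u64) -> bool {
           // Fraction of the rollout window that has elapsed, clamped to [0.0, 1.0].
           let elapsed = now.saturating_sub(start) as f64;
           let rollout_fraction = (elapsed / duration as f64).min(1.0);

           // Stable per-(node, release) threshold in [0.0, 1.0), derived from a hash.
           let mut hasher = DefaultHasher::new();
           (node_id, release).hash(&mut hasher);
           let threshold = (hasher.finish() % 10_000) as f64 / 10_000.0;

           threshold < rollout_fraction
       }

       fn main() {
           // Different nodes cross the threshold at different points in the window.
           for node in ["node-a", "node-b"] {
               let visible = edge_visible(node, "31.20200108.3.0", 1_600, 1_000, 2_000);
               println!("{node}: update visible = {visible}");
           }
       }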
  10. GitOps applied to Release Eng
      Checklist-based release-engineering flow:
      • Handled via a git+review process (on GitHub)
      • Metadata for parallel rollouts, pauses/resumes, update barriers, dead-end signaling (illustrative snippet after this list)
      • Automatable
      • Fully public
      • Auditable
      • No sprawl of private DBs
      Less eye-catching, more devops-friendly.
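      Purely for illustration (this is not the actual schema used by Fedora release engineering), the rollout metadata tracked in git could be imagined along these lines:

        # Hypothetical sketch -- not the real release-metadata format.
        [releases."31.20200108.3.0"]
        barrier = false
        dead_end = false

        [releases."31.20200108.3.0".rollout]
        start = 2020-01-20T12:00:00Z
        duration_minutes = 2880
        paused = false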
  11. Zincati
      Update agent:
      • Long-running service (on-host, Rust)
      • Actor-based architecture
      • Checks for auto-updates, triggers reboots
      • TOML configuration, with systemd-style dropins (example dropin after this list)
      • Internal state machine with few possible states (<10)
      • Exposes metrics in Prometheus format
      Written from scratch, but in practice an evolution of update-engine and locksmith.
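      For example, a dropin selecting the cluster-lock reboot strategy and pointing the agent at a reboot coordinator could look roughly like this (key names as documented for Zincati at the time; the URL is an example value):

        # /etc/zincati/config.d/55-updates-strategy.toml
        [updates]
        strategy = "fleet_lock"

        [updates.fleet_lock]
        # Base URL of the reboot coordinator, e.g. an Airlock deployment (example value).
        base_url = "http://10.0.0.15:3333"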
  12. Zincati - noteworthy
      Metrics:
      • Exposed over node-local IPC (Unix-domain socket)
      • Progress of the state machine exposed as state changes and a refresh timestamp
      Errors:
      • A Rust enum (sum type) allows strongly typed, exhaustive error encapsulation
      • Error variant kinds tracked as a label on error metrics (sketch after this list)
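      A simplified sketch of that error pattern (illustrative types, not Zincati's actual ones): each enum variant maps to a stable kind string, which becomes the label value on an error counter.

        /// Illustrative error type; the real agent's variants differ.
        #[derive(Debug)]
        enum AgentError {
            GraphFetch(String),
            StateTransition(String),
            LockAcquire(String),
        }

        impl AgentError {
            /// Stable, low-cardinality kind used as a metric label value.
            fn kind(&self) -> &'static str {
                match self {
                    AgentError::GraphFetch(_) => "graph_fetch",
                    AgentError::StateTransition(_) => "state_transition",
                    AgentError::LockAcquire(_) => "lock_acquire",
                }
            }
        }

        fn record_error(err: &AgentError) {
            // A real agent would increment a Prometheus counter labeled by kind;
            // here we only print the (illustrative) series that would be bumped.
            println!("zincati_errors_total{{kind=\"{}\"}} += 1", err.kind());
        }

        fn main() {
            record_error(&AgentError::GraphFetch("timeout talking to Cincinnati".into()));
        }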
  13. Airlock
      Reboot coordination:
      • Go service (containerized)
      • Counting semaphore with recursive locking
      • Simple HTTPS operations for locking/unlocking (sketched after this list)
      • etcd3 as the backend DB
      • Exposes metrics in Prometheus format
      Decoupled from the OS; we expect communities to adapt it to their preferred non-etcd3 backends.
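      The lock/unlock calls themselves are small JSON-over-HTTP(S) POSTs from the update agent to the coordinator. A rough sketch of the exchange; the endpoint paths, header, and "default" group shown here are my reading of the Zincati/Airlock ("FleetLock") protocol, so verify against its documentation:

        # Acquire a reboot slot before rebooting into the new deployment.
        POST /v1/pre-reboot
        fleet-lock-protocol: true

        { "client_params": { "group": "default", "id": "<node-uuid>" } }

        # Release the slot once the node is healthy on the new version.
        POST /v1/steady-state
        fleet-lock-protocol: true

        { "client_params": { "group": "default", "id": "<node-uuid>" } }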
  14. local_exporter
      • Go application (containerized)
      • Web service, bound to a TCP port on the container network
      • Fans out to local targets
      • TOML configuration, single file
      • Allows defining multiple selectors/endpoints
      It may have rough edges (I’m happy if it finds a new owner).
  15. local_exporter - design
      Heavily inspired by node-exporter, however:
      • Configuration via file only (TOML)
      • Keeps different endpoint metrics separate
      • Can pick up single files from different directories
      • Does not contain content-translation logic
      • Only bridges across “transports”
      • No internal caches
  16. local_exporter - backend selectors
      [Diagram: local_exporter maps each configured selector (A, B, C) to a local target reached over a different transport: TCP (HTTP), Unix socket, DBus endpoint, regular file / local IPC]
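      A hedged sketch of a multi-selector configuration along those lines; only the "uds" kind is taken from the real example on the next slide, while the other kind names and keys are assumptions about local_exporter's configuration surface:

        [bridge.selectors]
        # Taken from the Zincati example: a Unix-domain socket target.
        "zincati" = { kind = "uds", path = "/host/run/zincati/private/metrics.promsock" }
        # Hypothetical selectors illustrating other transports (names/keys are assumptions).
        "someapp" = { kind = "tcp", address = "127.0.0.1:9300" }
        "textfile" = { kind = "file", path = "/host/var/lib/metrics/app.prom" }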
  17. Example with Zincati
      local_exporter.toml (the socket path is reached via a host bind-mount):
        [bridge.selectors]
        "zincati" = { kind = "uds", path = "/host/run/zincati/private/metrics.promsock" }
      prometheus.yml (the scrape job references the "zincati" selector):
        - job_name: 'os_updates'
          metrics_path: '/bridge'
          params:
            selector: [ 'zincati' ]
  18. Demo (recorded)
      Single pane of glass for all the auto-updates machinery:
      • Prometheus for metrics collection
      • Grafana for visualization
      Recorded demo with subtitles: https://youtu.be/_gU1mHKlmQw
      (equivalent screenshots in the backup slides)
  19. References
      • Fedora CoreOS docs: https://docs.fedoraproject.org/en-US/fedora-coreos/
      • Airlock: https://github.com/coreos/airlock
      • Cincinnati: https://github.com/openshift/cincinnati
      • Zincati: https://github.com/coreos/zincati
      • local_exporter: https://github.com/lucab/local_exporter