Lessons from the Kubernetes Adventure (Eric Brewer, Google | UC Berkeley)

After a brief review of Kubernetes, we examine some of the keys to success for a large-scale open-source platform, including making room for a wide range of innovation by a wide range of players, which eventually led to the vibrant ecosystem we see today.

Anyscale

July 19, 2021

Transcript

  1. Lessons from Kubernetes
    Eric Brewer, VP Infrastructure, Fellow
    Ray Summit, June 2021
  2. Goal: “Cloud Native” Applications
    Middle of a great transition...
    • unlimited “ethereal” resources in the Cloud
    • an environment of services not machines
    • thinking in APIs and co-designed services
    • high availability offered and expected
  3. Google has been developing and using containers to manage our applications for over 15 years.
    “billions” launched per week
    • simplifies management
    • performance isolation
    • efficiency
    Images by Connie Zhou
  4. Kubernetes: Higher level of Abstraction
    Don’t Worry About
    • OS details
    • Packages (no conflicts)
    • Machine sizes (much)
    • Mixing languages
    • Port conflicts
    Think About
    • Composition of services
    • Load-balancing
    • Names of services
    • State management
    • Monitoring and Logging
    • Upgrading
  5. Evolution is the Real Value
    Services are Abstract
    • A “Service” is just a long-lived abstract name
    • Varied implementations over time (versions)
    • Kubernetes routes to the right implementation
    Apps Structured as Independent Microservices
    • Encapsulated state with APIs (like “objects”)
    • Mixture of languages
    • Mixture of teams
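
The idea of a service as a long-lived name that is routed to changing implementations can be sketched in a few lines. This is a toy, in-process model and not the Kubernetes API (Kubernetes itself selects backing Pods with label selectors); `ServiceRegistry`, `register`, `route_to`, and the `greeter` service are hypothetical names invented for the illustration.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


class ServiceRegistry:
    """Toy model: a stable service name mapped to versioned implementations."""

    def __init__(self) -> None:
        self._versions: Dict[str, Dict[str, Handler]] = {}
        self._active: Dict[str, str] = {}

    def register(self, name: str, version: str, handler: Handler) -> None:
        # Several implementations can coexist under one abstract name.
        self._versions.setdefault(name, {})[version] = handler

    def route_to(self, name: str, version: str) -> None:
        # Changing the route is an operator action; callers are untouched.
        if version not in self._versions.get(name, {}):
            raise KeyError(f"{name} has no version {version}")
        self._active[name] = version

    def call(self, name: str, request: str) -> str:
        # Callers only ever use the abstract name.
        version = self._active[name]
        return self._versions[name][version](request)


registry = ServiceRegistry()
registry.register("greeter", "v1", lambda req: f"hello {req}")
registry.register("greeter", "v2", lambda req: f"Hello, {req}!")

registry.route_to("greeter", "v1")
first = registry.call("greeter", "ray")

registry.route_to("greeter", "v2")  # upgrade without changing the caller
second = registry.call("greeter", "ray")
```

The caller's code (`registry.call("greeter", ...)`) is identical before and after the upgrade, which is the property the slide attributes to abstract services.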
  6. Lesson: The value of Open Source
    Key decision: Kubernetes should be open source
    1) Even Google needs “fellow travelers” for a mission this big
      ◦ Created the “Cloud Native Computing Foundation” (2015)
    2) The “standard” is the code, not a traditional specification
      Spec-based standards cannot handle high-velocity innovation
    3) Enables broad customized use: on prem, hybrid, multi-cloud, …even Raspberry Pi clusters
  7. Lesson: The Innovation Tree
    Early days: all the work on the core (trunk)
    • Soon new efforts around networking and storage…
    • Eventually large parallel “SIG” structure [special interest group]
    The key is parallel innovation, mostly at the leaves (reduced coordination)
    API infrastructure is a big part of the success
    • Enables custom extensions with consistency
  8. Lesson: Success is an Ecosystem
    The parallel innovation grows into an ecosystem
    CNCF has three levels of project maturity: (innovation subtrees!)
    • Sandbox
    • Incubating
    • Graduated
    Istio is itself now 4 years old
    Many startups created, many companies pivoted
  9. (No text on this slide.)
  10. Lesson: “Chop Wood and Carry Water”
    Lots of the important work is … mundane
    • Bug fixes, security patches
    • Breaking changes in dependencies
    • Documentation, ease of use, ...
    Critical parts need to work and be stable
    Need investments in testing to enable velocity
    • Without good test cases, we can’t tell if changes break stuff!
    • Including conformance testing
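
As a rough illustration of conformance testing, the sketch below runs one shared suite of checks against any implementation of a small interface. It is not taken from the Kubernetes conformance suite; the `Store` interface, `DictStore`, `AppendOnlyStore`, and `run_conformance` are hypothetical names made up for the example.

```python
from typing import Callable, List, Optional, Protocol, Tuple


class Store(Protocol):
    """The contract every implementation is expected to honor."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class DictStore:
    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class AppendOnlyStore:
    """Deliberately buggy: get() returns the oldest value for a key."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        self._items.append((key, value))

    def get(self, key: str) -> Optional[str]:
        for k, v in self._items:
            if k == key:
                return v
        return None


def run_conformance(make_store: Callable[[], Store]) -> List[str]:
    """Run the same checks against any implementation; return the failures."""
    failures: List[str] = []
    store = make_store()
    if store.get("missing") is not None:
        failures.append("missing key must return None")
    store.put("a", "1")
    if store.get("a") != "1":
        failures.append("get must return the value that was put")
    store.put("a", "2")
    if store.get("a") != "2":
        failures.append("put must overwrite an existing key")
    return failures


dict_failures = run_conformance(DictStore)
buggy_failures = run_conformance(AppendOnlyStore)
```

Because the suite is written against the interface rather than one implementation, a new or modified implementation can be checked without writing new tests, which is what lets changes land quickly.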
  11. Summary
    Vision: “Cloud” should run at a higher level of abstraction
    … but still be able to run all the things
    Kubernetes “won”: it’s the platform for modern development
    • Ray should itself be a platform on top of Kubernetes
    Open Source is a key part of driving adoption
    Parallel innovation is the only way
    You have to do the mundane stuff (too)
  12. BACKUP

  13. The beginning: Merging Two Kinds of Containers
    Docker
    • It’s about packaging
    • Control:
      ◦ packages
      ◦ versions
      ◦ (some config)
    • Layered file system
    • ⇒ Prod matches testing
    Linux Containers
    • It’s about isolation … performance isolation
    • not security isolation … use VMs for that
    • Manage CPUs, memory, bandwidth, …
    • Nested groups
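
The "nested groups" point can be made concrete with a small model of hierarchical resource limits. This is a simplified sketch, not the Linux cgroup interface: it assumes the usual hierarchical rule that a child group can never use more than any of its ancestors allows, and the `Group` class and the example group names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    """A resource group with a CPU limit (in cores) and an optional parent."""

    name: str
    cpu_limit: float
    parent: Optional["Group"] = None

    def effective_cpu_limit(self) -> float:
        # Walk up the tree: the tightest limit on the path to the root wins.
        limit = self.cpu_limit
        node = self.parent
        while node is not None:
            limit = min(limit, node.cpu_limit)
            node = node.parent
        return limit


machine = Group("machine", cpu_limit=16.0)
batch = Group("batch", cpu_limit=4.0, parent=machine)
job = Group("job", cpu_limit=8.0, parent=batch)

job_limit = job.effective_cpu_limit()      # capped by "batch"
batch_limit = batch.effective_cpu_limit()  # its own limit is the tightest
```

Here `job` asks for 8 cores but sits inside `batch`, so its effective limit is 4; nesting lets an operator bound a whole subtree of workloads with one setting.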
  14. Istio: insert a services control layer using L7 proxy
    Simple k8s: services have a load balancer
    Istio: services have an extensible L7 proxy
    • Advanced load balancing
    • Telemetry: uniform data collection about services
      ◦ E.g. latency distribution
    • Security: handle auth and access control
    • Quota: limit usage by some callers
    • Uniform policies
    Most important: change policies without changing application code
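
The closing point, changing policies without changing application code, can be sketched as a proxy object that wraps an unmodified service function. This is an in-process toy and not Istio's proxy or configuration model; `Proxy`, `quota_per_caller`, `echo_service`, and the `429` response string are assumptions made for the illustration.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Proxy:
    """Toy request proxy that adds a quota and latency telemetry to a service."""

    def __init__(self, service: Callable[[str], str], quota_per_caller: int) -> None:
        self._service = service
        self.quota_per_caller = quota_per_caller  # policy, adjustable at runtime
        self._calls: DefaultDict[str, int] = defaultdict(int)
        self.latencies: List[float] = []  # uniform telemetry for every request

    def handle(self, caller: str, request: str) -> str:
        # Quota: reject callers that have used up their allowance.
        if self._calls[caller] >= self.quota_per_caller:
            return "429 quota exceeded"
        self._calls[caller] += 1
        start = time.perf_counter()
        try:
            return self._service(request)
        finally:
            # Telemetry: record latency for each forwarded request.
            self.latencies.append(time.perf_counter() - start)


def echo_service(request: str) -> str:
    # The application code knows nothing about quotas or telemetry.
    return f"echo:{request}"


proxy = Proxy(echo_service, quota_per_caller=2)
responses = [proxy.handle("team-a", "ping") for _ in range(3)]

proxy.quota_per_caller = 3  # policy change; echo_service is untouched
after_change = proxy.handle("team-a", "ping")
```

The third request is rejected under the original quota, and the same caller succeeds again once the policy is raised, all without editing `echo_service`.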
  15. Hybrid Cloud and Multi-Cloud
    Strong demand to mix on prem and Cloud(s)
    Open Source makes this vastly easier
    Two models, both are used together:
    • Partition services: run different things in different places
      ◦ Secure bidirectional traffic with direct peering
    • Consistent Environment
      ◦ Run services in either place without code changes
      ◦ May involve some storage replication for latency/cost