Lessons from the Kubernetes Adventure (Eric Brewer, Google | UC Berkeley)

After a brief review of Kubernetes, we examine some of the keys to success for a large-scale open-source platform, including making room for a wide range of innovation by a wide range of players, which eventually led to the vibrant ecosystem we see today.

Anyscale

July 19, 2021

Transcript

  1. Lessons from Kubernetes
    Eric Brewer, VP Infrastructure, Fellow
    Ray Summit, June 2021
  2. Goal: “Cloud Native” Applications
    Middle of a great transition...
    • unlimited “ethereal” resources in the Cloud
    • an environment of services not machines
    • thinking in APIs and co-designed services
    • high availability offered and expected
  3. Google has been developing and using containers to manage our applications for over 15 years.
    Images by Connie Zhou
    “billions” launched per week
    • simplifies management
    • performance isolation
    • efficiency
  4. Kubernetes: Higher level of Abstraction
    Don’t Worry About
    • OS details
    • Packages: no conflicts
    • Machine sizes (much)
    • Mixing languages
    • Port conflicts
    Think About
    • Composition of services
    • Load-balancing
    • Names of services
    • State management
    • Monitoring and Logging
    • Upgrading
  5. Evolution is the Real Value
    Services are Abstract
    • A “Service” is just a long-lived abstract name
    • Varied implementations over time (versions)
    • Kubernetes routes to the right implementation
    Apps Structured as Independent Microservices
    • Encapsulated state with APIs (like “objects”)
    • Mixture of languages
    • Mixture of teams
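The idea of a service as a stable name in front of changing implementations can be sketched in a few lines of Python. This is a toy in-process model, not the Kubernetes API: the `Service` class, its `deploy` and `call` methods, and the `checkout` example are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical handler type: takes a request string, returns a response string.
Handler = Callable[[str], str]


@dataclass
class Service:
    """A long-lived abstract name with interchangeable implementations."""

    name: str
    versions: Dict[str, Handler] = field(default_factory=dict)
    active: str = ""

    def deploy(self, version: str, handler: Handler) -> None:
        # Register a new implementation and route future calls to it.
        self.versions[version] = handler
        self.active = version

    def call(self, request: str) -> str:
        # Callers only know the service name; routing picks the implementation.
        return self.versions[self.active](request)


checkout = Service("checkout")
checkout.deploy("v1", lambda req: f"v1 handled {req}")
first = checkout.call("order-1")

# Upgrade the implementation; the caller's code and the name stay the same.
checkout.deploy("v2", lambda req: f"v2 handled {req}")
second = checkout.call("order-1")
```

The caller invokes `checkout.call(...)` both times; only the implementation behind the name changes, which is the property the slide attributes to Kubernetes routing.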
  6. Lesson: The value of Open Source
    Key decision: Kubernetes should be open source
    1) Even Google needs “fellow travelers” for a mission this big
      ◦ Created the “Cloud Native Computing Foundation” (2015)
    2) The “standard” is the code, not a traditional specification
      ◦ Spec-based standards cannot handle high-velocity innovation
    3) Enables broad customized use: on prem, hybrid, multi-cloud, …even Raspberry Pi clusters
  7. Lesson: The Innovation Tree
    Early days: all the work on the core (trunk)
    • Soon new efforts around networking and storage…
    • Eventually large parallel “SIG” structure [special interest group]
    The key is parallel innovation, mostly at the leaves (reduced coordination)
    API infrastructure is a big part of the success
    • Enables custom extensions with consistency
  8. Lesson: Success is an Ecosystem
    The parallel innovation grows into an ecosystem
    CNCF has three levels of project maturity: (innovation subtrees!)
    • Sandbox
    • Incubating
    • Graduated
    Istio is itself now 4 years old
    Many startups created, many companies pivoted
  9. None
  10. Lesson: “Chop Wood and Carry Water”
    Lots of the important work is … mundane
    • Bug fixes, security patches
    • Breaking changes in dependencies
    • Documentation, ease of use, ...
    Critical parts need to work and be stable
    Need investments in testing to enable velocity
    • Without good test cases, we can’t tell if changes break stuff!
    • Including conformance testing
  11. Summary
    Vision: “Cloud” should run at a higher level of abstraction
    … but still be able to run all the things
    Kubernetes “won”: it’s the platform for modern development
    • Ray should itself be a platform on top of Kubernetes
    Open Source is a key part of driving adoption
    Parallel innovation is the only way
    You have to do the mundane stuff (too)
  12. BACKUP

  13. The beginning: Merging Two Kinds of Containers
    Docker
    • It’s about packaging
    • Control:
      ◦ packages
      ◦ versions
      ◦ (some config)
    • Layered file system
    • ⇒ Prod matches testing
    Linux Containers
    • It’s about isolation … performance isolation
    • not security isolation … use VMs for that
    • Manage CPUs, memory, bandwidth, …
    • Nested groups
  14. Istio: insert a services control layer using L7 proxy
    Simple k8s: services have a load balancer
    Istio: services have an extensible L7 proxy
    • Advanced load balancing
    • Telemetry: uniform data collection about services
      ◦ E.g. latency distribution
    • Security: handle auth and access control
    • Quota: limit usage by some callers
    • Uniform policies
    Most important: change policies without changing application code
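To make "change policies without changing application code" concrete, here is a minimal Python sketch of a proxy placed in front of an unmodified application function. It is an in-process analogy under stated assumptions, not Istio's actual mechanism or configuration format: `PolicyProxy`, `allowed_callers`, `quota_per_caller`, and the status strings are hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


class PolicyProxy:
    """Hypothetical proxy: applies access control, quota, and telemetry
    around an application handler that knows nothing about them."""

    def __init__(self, app: Callable[[str], str],
                 allowed_callers: Set[str], quota_per_caller: int) -> None:
        self.app = app
        self.allowed_callers = set(allowed_callers)
        self.quota_per_caller = quota_per_caller
        self.calls: DefaultDict[str, int] = defaultdict(int)
        self.latencies: List[float] = []  # telemetry: one entry per served call

    def handle(self, caller: str, request: str) -> str:
        # Security: reject callers that are not on the allow list.
        if caller not in self.allowed_callers:
            return "403 forbidden"
        # Quota: limit how many requests each caller may make.
        if self.calls[caller] >= self.quota_per_caller:
            return "429 quota exceeded"
        self.calls[caller] += 1
        start = time.perf_counter()
        try:
            return self.app(request)
        finally:
            # Telemetry: record latency uniformly for every served request.
            self.latencies.append(time.perf_counter() - start)


def app(request: str) -> str:
    # Application code: contains no policy logic at all.
    return f"ok: {request}"


proxy = PolicyProxy(app, allowed_callers={"frontend"}, quota_per_caller=2)
results = [
    proxy.handle("frontend", "a"),
    proxy.handle("frontend", "b"),
    proxy.handle("frontend", "c"),  # over quota
    proxy.handle("batch", "d"),     # not an allowed caller
]

# Policy change: raise the quota. The app function is untouched.
proxy.quota_per_caller = 3
after = proxy.handle("frontend", "e")
```

The quota is raised by changing proxy state only, so the request that was previously rejected now succeeds while `app` stays exactly as written.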
  15. Hybrid Cloud and Multi-Cloud
    Strong demand to mix on prem and Cloud(s)
    Open Source makes this vastly easier
    Two models, both are used together:
    • Partition services: run different things in different places
      ◦ Secure bidirectional traffic with direct peering
    • Consistent Environment
      ◦ Run services in either place without code changes
      ◦ May involve some storage replication for latency/cost