Service Backplanes for the Modern Data Center

NABDConf 2016 - Neil Conway
© 2016 Mesosphere, Inc. All Rights Reserved.

About Me
• 2002-2008: Postgres Developer
• 2006-2008: Stream Processing Startup
• 2008-2014: PhD, Distributed Computing
• 2015-: Mesosphere

Sound and Fury?

One Possible Reaction

Outline
1. What has changed about modern distributed systems?
2. What is a service backplane?
3. How should we build service backplanes?

The Good Old Days

Old-School Clusters
Size / Rate of Change
• Data Sets: Small (TBs)
• Applications: Few (1-10s)
• Machines: Few (~100s)
• User Expertise: High
• Humans involved

Resource Allocation Via Configuration
[Diagram: nodes n1-n9]
Config:
• N1, N4, N7 ➝ Hadoop
• N2, N5, N8 ➝ Postgres
• N3, N6, N9 ➝ NGINX
Manual, static configuration
Applications take resource allocation as an input
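A minimal sketch of the static scheme on this slide, in Python. The node-to-application mapping comes from the slide; the names `STATIC_CONFIG` and `nodes_for` are hypothetical and exist only for illustration.

```python
# Static, hand-maintained configuration: each node is pinned to one application.
# The mapping mirrors the slide; the identifiers are illustrative assumptions.
STATIC_CONFIG = {
    "Hadoop": ["n1", "n4", "n7"],
    "Postgres": ["n2", "n5", "n8"],
    "NGINX": ["n3", "n6", "n9"],
}


def nodes_for(app: str) -> list[str]:
    """Return the fixed set of nodes an application was told to use."""
    return STATIC_CONFIG[app]


if __name__ == "__main__":
    # The application receives its allocation as an input; it never asks for more.
    print(nodes_for("Postgres"))
```

Nothing in this scheme reacts to failures or load: changing the allocation means a human edits the table.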

Analogy: Manual Memory Management
[Diagram: physical memory, 0x0 to 0x8]
Config:
• [0x0,0x1) ➝ calc.exe
• [0x1,0x4) ➝ winmine.exe
• [0x4,0x8) ➝ notepad.exe
Applications take physical memory address range as an input

Consequences
• Utilization: Low ❌
• Deployment Agility: Low ❌
• Elasticity: None ❌
• Test / Dev / Staging Envs: Difficult ❌
• Simplicity: High ✅
… but it basically worked, and it was simple.

Modern Clusters
Size / Rate of Change
• Data Sets: Massive
• Applications: Many
• Machines: Many
• User Expertise: “Less” Expert
• Increasingly automated

Scaling Static Configuration?
Static Config Tool: f(resources, apps) ➝ resource allocation
[Diagram: cluster nodes]
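The function f(resources, apps) on this slide could be sketched as follows. The round-robin policy and the name `allocate` are assumptions made for illustration; the slide does not say how such a tool would decide.

```python
def allocate(resources: list[str], apps: list[str]) -> dict[str, list[str]]:
    """Assign nodes to applications round-robin (an assumed policy)."""
    if not apps:
        raise ValueError("at least one application is required")
    allocation: dict[str, list[str]] = {app: [] for app in apps}
    for i, node in enumerate(resources):
        # Node i goes to application i modulo the number of applications.
        allocation[apps[i % len(apps)]].append(node)
    return allocation


if __name__ == "__main__":
    nodes = [f"n{i}" for i in range(1, 10)]
    print(allocate(nodes, ["Hadoop", "Postgres", "NGINX"]))
```

The result is still a static snapshot: if a node fails or an application is added, the tool has to be re-run and its output redeployed.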

Dynamic Resource Management
[Diagram: “Service Backplane” and nodes n1-n4]
• Replace static configuration with program logic
• Unmodified application software

Architecture
Service Backplanes
• Allow unmodified application software to run at scale
• Interface between application instances and provisioning APIs
[Diagram: Cassandra Backplane, Postgres Backplane, nodes n1-n4]

Key Backplane Functionality
Backplane: interface between application and “cluster context”
Resource Management
• Allocate resources to apps
  ◦ Fairness, utilization, etc.
• Elasticity and auto-scaling
• Oversubscription, perf isolation
• Abstractions for complex resources (e.g., GPUs)
Lifecycle Management
• Replace failed instances
  ◦ Migrate state/data as needed
• Allow machines, racks to be replaced (safely!)
• Allow apps to be upgraded (safely!)
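"Replace failed instances" is commonly implemented as a reconciliation loop that compares the desired and the observed instance count. The sketch below assumes a toy in-memory `Cluster` class standing in for a real provisioning API; every name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy in-memory stand-in for a real provisioning API (hypothetical)."""
    healthy: set[str] = field(default_factory=set)
    next_id: int = 0

    def launch(self) -> str:
        self.next_id += 1
        name = f"instance-{self.next_id}"
        self.healthy.add(name)
        return name

    def fail(self, name: str) -> None:
        self.healthy.discard(name)


def reconcile(cluster: Cluster, desired: int) -> list[str]:
    """Launch replacements until the healthy count reaches `desired`."""
    launched = []
    while len(cluster.healthy) < desired:
        launched.append(cluster.launch())
    return launched


if __name__ == "__main__":
    cluster = Cluster()
    print(reconcile(cluster, 3))   # initial deployment
    cluster.fail("instance-2")     # simulate a crash
    print(reconcile(cluster, 3))   # backplane launches a replacement
```

The sketch leaves out the hard part named on the slide: a real backplane would also have to migrate state or data to the replacement.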

Example: Upgrades at Scale
Upgrading 3-10 Cassandra nodes: annoying but manageable.
Upgrading 25k Cassandra nodes: really hard problem.
Challenges:
• Roll-backs, non-destructive upgrades
• Deploy upgrade to subset of cluster
• Move traffic away to avoid downtime
• Data migration
Hard to solve “inside” the app
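Two of the challenges above, deploying to a subset of the cluster and rolling back, can be sketched as a batched upgrade with a snapshot. This is a simplified illustration: the function name, the health-check callback, and the version strings are all assumptions, and traffic draining and data migration are reduced to a comment.

```python
from typing import Callable


def rolling_upgrade(
    versions: dict[str, str],
    target: str,
    batch_size: int,
    healthy: Callable[[str], bool],
) -> bool:
    """Upgrade nodes batch by batch; roll everything back on the first failure.

    `versions` maps node name -> running version and is updated in place.
    `healthy` is a caller-supplied check run on each node after its upgrade.
    Returns True if every node reached `target`, False if rolled back.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    previous = dict(versions)          # snapshot used for roll-back
    nodes = sorted(versions)
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            versions[node] = target    # stand-in for draining traffic + upgrading
        if not all(healthy(node) for node in batch):
            versions.clear()
            versions.update(previous)  # restore the old versions everywhere
            return False
    return True


if __name__ == "__main__":
    cluster = {"c1": "2.1", "c2": "2.1", "c3": "2.1"}
    ok = rolling_upgrade(cluster, "3.0", batch_size=1, healthy=lambda n: n != "c2")
    print(ok, cluster)
```

Keeping this logic outside the application is the point of the slide: the application itself only ever sees one node being stopped and restarted.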

Not (Just) “Scheduling” or “Container Orchestration”
• Scheduling is important
• But: much more to backplanes than bin-packing or max-min fairness
• Requires deep knowledge of
  ◦ Application semantics
  ◦ Ops procedures
• Goal: transform prepackaged “server software” into “service”
“... there are not very many things that have aged as well as the [Linux] scheduler. Which is just another proof that scheduling is easy.” - Linus Torvalds, 2001
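For reference, the max-min fairness mentioned above fits in a few lines, which supports the claim that scheduling alone is not the hard part. The sketch uses the standard progressive-filling approach for a single resource; the function name and the example demands are assumptions.

```python
def max_min_fair(capacity: float, demands: dict[str, float]) -> dict[str, float]:
    """Split `capacity` among apps so that no app can gain without taking
    from an app that already has an equal or smaller allocation."""
    allocation: dict[str, float] = {}
    remaining = capacity
    # Serve the smallest demands first; each app gets at most an equal
    # share of whatever is still unallocated.
    pending = sorted(demands.items(), key=lambda kv: kv[1])
    for i, (app, demand) in enumerate(pending):
        share = remaining / (len(pending) - i)
        allocation[app] = min(demand, share)
        remaining -= allocation[app]
    return allocation


if __name__ == "__main__":
    # 10 CPUs shared by three applications with different demands.
    print(max_min_fair(10, {"a": 2, "b": 4, "c": 10}))
```

Applications "a" and "b" receive their full demand, and "c" receives the 4 CPUs that are left.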

The State of the Art

What Do People Do Today?
Many organizations already build service backplanes.
… they just don’t know it.

Common Pattern
Goal
• Provide a software service to the rest of the organization
• E.g., object storage, streaming data analysis, batch analytics, ML, etc.
Solution
• Start with off-the-shelf (OSS) software package
• Write “scripts” to deploy, manage, and upgrade instances

Problem #1: Backplanes Are Hard
Building fault-tolerant control planes for cluster services is not easy!
• Often tens of thousands of lines of code (LOC)
• Hard to test and debug
• Maintenance burden
Backplane downtime is service downtime

Problem #2: Not Seen As A Product
• In many cases, the service is the “product”
• Backplane is just a “bunch of scripts”
  ◦ Not a distinct component of the system architecture
• Sometimes built in an ad-hoc way
• Often no rigorous specification or API

Problem #3: Redundancy Between Services
• Many backplanes are similar
• Typically built by different teams that don’t collaborate
  ◦ No opportunity for code reuse
  ◦ No shared infrastructure
• Each backplane cannot examine global cluster state
• Hard to define global policies that apply to all backplanes

Problem #4: Redundancy Between Organizations
• Many organizations have custom-written backplanes for Cassandra, Kafka, HDFS, etc.
• Often tightly coupled to their production environment
  ◦ Result: fragile, not portable to other environments

The Gap From “Done” to “Deployable”
Developer “ships” a release of their software package
• Then >10k LOC is needed to deploy it at scale!
This sucks
• The upstream developer is the domain expert
• Developer ships code their customer can’t (directly) use
Can we standardize the functionality needed for large-scale deployments?
• Allow backplane functionality to move “up” the stack
• Tested and developed as part of the upstream software

Opportunity: Shrink Runbooks
1. Deploy to prod and pray
2. Document best practices (“runbook”)
3. Write scripts to handle common scenarios
4. Encode best practices as a service backplane

Rethinking Service Backplanes
1. Embrace backplanes as a standard component in large-scale distributed systems
   • Not just “a few scripts”
2. Build infrastructure to make writing backplanes easier
3. Define standard APIs for communicating between backplanes and cluster infrastructure
4. Enable upstream software developers to ship backplanes as part of their software packages

Example Architecture
[Diagram: Cluster Operator, Backplane Manager, Cassandra Backplane, Postgres Backplane]
• Abstract away details of cloud or on-prem env.
• Clear API / interface for service backplanes
• Cluster Operator: single operator interface, define global policy

Background: Apache Mesos
• “Manage your data center as a single pool of resources.”
• UC Berkeley: 2008
• Battle-tested at Twitter: 2009-2016
• Other users: Apple, eBay, Netflix, Microsoft, PayPal, Airbnb, Criteo, Yelp, Uber, ...
[Diagram: Mesos Master, Scheduler X, Scheduler Y, Mesos Agent on Machine M, Task Executor]
1. Agent ➝ Master: “I have 8 CPUs, 8 disks, 64GB RAM”
2. Master ➝ Scheduler: “Offer: would you like 8 CPUs, 8 disks, and 64GB of RAM?”
3. Scheduler ➝ Master: “Accept: Launch container X.”
4. Master ➝ Agent: “Launch container X.”
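The offer cycle on this slide can be modeled with a toy simulation. This is not the Mesos API: the classes `Resources`, `Scheduler`, and `Master` and all their methods are hypothetical, and the sketch ignores partial acceptance, multiple tasks per offer, and failure handling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    cpus: int
    disks: int
    ram_gb: int


class Scheduler:
    """Framework-side logic: decides whether to accept an offer (toy model)."""

    def __init__(self, name: str, needs: Resources):
        self.name = name
        self.needs = needs

    def on_offer(self, offer: Resources) -> Optional[str]:
        fits = (offer.cpus >= self.needs.cpus
                and offer.disks >= self.needs.disks
                and offer.ram_gb >= self.needs.ram_gb)
        return f"container-{self.name}" if fits else None


class Master:
    """Relays agent resources to schedulers as offers and records launches."""

    def __init__(self) -> None:
        self.launched: list[tuple[str, str]] = []  # (machine, container) pairs

    def handle_report(self, machine: str, available: Resources,
                      schedulers: list[Scheduler]) -> Optional[str]:
        for scheduler in schedulers:
            task = scheduler.on_offer(available)        # "Offer: would you like ...?"
            if task is not None:                        # "Accept: launch container"
                self.launched.append((machine, task))   # forwarded to the agent
                return task
        return None                                     # every scheduler declined


if __name__ == "__main__":
    master = Master()
    y = Scheduler("Y", Resources(cpus=16, disks=8, ram_gb=64))  # too big, declines
    x = Scheduler("X", Resources(cpus=8, disks=8, ram_gb=64))
    print(master.handle_report("M", Resources(8, 8, 64), [y, x]))
```

The design choice visible even in this toy is that the master pushes offers and each scheduler decides for itself, so placement logic stays with the framework instead of the central allocator.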

Open Question: APIs
• Backplane ↔ backplane manager
• Application ↔ backplane
• Dimensions:
  ◦ Push or pull (offer vs. request)
  ◦ Optimistic or pessimistic
  ◦ Declarative or imperative
  ◦ Narrow or wide
• How to represent cluster resources?

Open Question: Co-Design of Applications and Backplanes
• Where does the functionality live?
  ◦ Application, backplane, or backplane manager
• Does this change how we should build common service features?
  ◦ Security? Logging? Metrics? Fault tolerance? Service discovery? Data migration?

Conclusion
1. Many people are building service backplanes, even if they don’t call them that
2. Driven by industry forces that are likely to persist
3. We should embrace the need for backplanes and figure out how to build them properly
