Service Backplanes for the Modern Data Center

Modern data centers and clouds allow cluster resources to be mapped to applications dynamically, which can dramatically improve utilization and deployment agility. However, deploying off-the-shelf software in this environment is not easy: devops staff typically need to write "backplanes" to deploy, manage, and upgrade each software package that is deployed at scale. Despite their ad-hoc nature, these backplanes quickly grow in complexity and importance: a bug in the application backplane often means a production outage.

What is needed is a consistent framework for designing "backplanes": controllers that mediate between cluster resource APIs and server software. Backplanes capture much of the operational complexity of running a software package in production, allowing off-the-shelf software to be turned into an elastic service. In this talk, we motivate the need for service backplanes, discuss the functionality that belongs in a backplane, and consider how backplanes should be built, using Apache Mesos as an example.

Neil Conway

May 26, 2016

Transcript

  1. © 2016 Mesosphere, Inc. All Rights Reserved. 1
    NABDConf 2016 - Neil Conway
    Service Backplanes for the Modern Data Center

  2. About Me
    2002-2008 Postgres Developer
    2006-2008 Stream Processing Startup
    2008-2014 PhD, Distributed Computing
    2015- Mesosphere

  3. Sound and Fury?

  4. One Possible Reaction

  5. Outline
    1. What has changed about modern distributed systems?
    2. What is a service backplane?
    3. How should we build service backplanes?

  6. The Good Old Days

  7. Old-School Clusters
    (Table columns: Size, Rate of Change)
    Data Sets: Small (TBs)
    Applications: Few (1-10s)
    Machines: Few (~100s)
    User Expertise: High
    Humans involved

  8. Resource Allocation Via Configuration
    [Diagram: cluster nodes n1-n9]
    Config:
    ● N1, N4, N7 ➝ Hadoop
    ● N2, N5, N8 ➝ Postgres
    ● N3, N6, N9 ➝ NGINX
    Manual, static configuration
    Applications take resource allocation as an input
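
A minimal Python sketch of the static scheme on this slide. The `STATIC_CONFIG` table mirrors the slide's config; the `nodes_for` helper and the failure scenario are illustrative assumptions, not part of the talk.

```python
# Hypothetical illustration: a hand-written, static node-to-application table.
STATIC_CONFIG = {
    "Hadoop": ["n1", "n4", "n7"],
    "Postgres": ["n2", "n5", "n8"],
    "NGINX": ["n3", "n6", "n9"],
}


def nodes_for(app, failed=frozenset()):
    """Return the nodes an app may use; the app takes this table as an input."""
    return [node for node in STATIC_CONFIG[app] if node not in failed]


if __name__ == "__main__":
    print(nodes_for("Hadoop"))
    print(nodes_for("Hadoop", failed={"n4"}))
```

In this sketch, when n4 fails Hadoop simply runs on fewer nodes until a human edits the table, and idle nodes assigned to other applications cannot be borrowed.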

  9. Analogy: Manual Memory Management
    [Diagram: physical memory, addresses 0x0 to 0x8]
    Config:
    ● [0x0,0x1) ➝ calc.exe
    ● [0x1,0x4) ➝ winmine.exe
    ● [0x4,0x8) ➝ notepad.exe
    Applications take physical memory address range as an input

  10. Consequences
    Utilization: Low ❌
    Deployment Agility: Low ❌
    Elasticity: None ❌
    Test / Dev / Staging Envs: Difficult ❌
    Simplicity: High ✅
    … but it basically worked, and it was simple.

  11. Modern Clusters
    (Table columns: Size, Rate of Change)
    Data Sets: Massive
    Applications: Many
    Machines: Many
    User Expertise: “Less” Expert
    Increasingly automated

  12. Scaling Static Configuration?
    Static Config Tool
    f(resources, apps) ➝ resource allocation
    [Diagram: cluster nodes]

  13. Dynamic Resource Management
    “Service Backplane”
    [Diagram: cluster nodes n1-n4]
    Replace static configuration with program logic
    Unmodified application software
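
A minimal sketch of what "replace static configuration with program logic" could look like, in Python. The `allocate` function and its first-come policy are assumptions made for illustration; the talk does not prescribe an algorithm.

```python
def allocate(free_nodes, demand):
    """Assign free nodes to apps from current demand instead of a fixed table.

    demand maps app name -> number of nodes wanted. Apps are served in
    sorted-name order, an arbitrary policy chosen only for this sketch.
    """
    pool = sorted(free_nodes)
    allocation = {}
    for app in sorted(demand):
        take = min(demand[app], len(pool))
        allocation[app] = pool[:take]
        pool = pool[take:]
    return allocation


if __name__ == "__main__":
    demand = {"cassandra": 3, "postgres": 2}
    print(allocate({"n1", "n2", "n3", "n4"}, demand))
    # After n2 fails, the allocation is recomputed from what is left.
    print(allocate({"n1", "n3", "n4"}, demand))
```

Because the allocation is computed rather than written down, a node failure or a change in demand only requires calling the function again.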

  14. Service Backplanes
    ● Allow unmodified application software to run at scale
    ● Interface between application instances and provisioning APIs
    Architecture
    [Diagram: Cassandra Backplane, Postgres Backplane, cluster nodes n1-n4]

  15. Key Backplane Functionality
    Resource Management
    ● Allocate resources to apps
    ○ Fairness, utilization, etc.
    ● Elasticity and auto-scaling
    ● Oversubscription, perf isolation
    ● Abstractions for complex resources (e.g., GPUs)
    Lifecycle Management
    ● Replace failed instances
    ○ Migrate state/data as needed
    ● Allow machines, racks to be replaced (safely!)
    ● Allow apps to be upgraded (safely!)
    Backplane: interface between application and “cluster context”
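
The "replace failed instances" item can be sketched as one pass of a control loop. This is a hypothetical Python illustration: `reconcile`, its parameters, and the health-check input are assumptions, and state/data migration is deliberately left out.

```python
def reconcile(desired_count, instances, healthy):
    """One pass of a control loop: drop failed instances, count replacements.

    instances: list of instance ids currently assigned to the service.
    healthy: set of ids that passed a (hypothetical) health check.
    Returns (kept, to_launch), where to_launch is the number of new
    instances needed to get back to desired_count.
    """
    kept = [i for i in instances if i in healthy]
    to_launch = max(0, desired_count - len(kept))
    return kept, to_launch


if __name__ == "__main__":
    print(reconcile(3, ["a", "b", "c"], healthy={"a", "c"}))
```

A real backplane would run this repeatedly and would also have to decide where replacements go and how their data gets there.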

  16. Example: Upgrades at Scale
    Upgrading 3-10 Cassandra nodes: annoying but manageable.
    Upgrading 25k Cassandra nodes: really hard problem.
    Challenges:
    ● Roll-backs, non-destructive upgrades
    ● Deploy upgrade to subset of cluster
    ● Move traffic away to avoid downtime
    ● Data migration
    Hard to solve “inside” the app
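
Two of the challenges above, deploying to a subset of the cluster and rolling back, can be sketched as a batched upgrade. This is a hypothetical Python illustration: `rolling_upgrade` and its `upgrade`, `healthy`, and `rollback` hooks are assumed names, and traffic draining and data migration are not modeled.

```python
def rolling_upgrade(nodes, batch_size, upgrade, healthy, rollback):
    """Upgrade nodes batch by batch; undo everything if a batch is unhealthy.

    upgrade(node) and rollback(node) act on a single node; healthy(node)
    returns True if the node serves correctly after its upgrade. All three
    are caller-supplied hooks. Returns True if every node was upgraded,
    False if the upgrade was rolled back.
    """
    done = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            upgrade(node)
            done.append(node)
        if not all(healthy(node) for node in batch):
            # Roll back in reverse order of upgrade.
            for node in reversed(done):
                rollback(node)
            return False
    return True
```

The batch size bounds how much of the cluster is exposed to a bad release at any one time; at 25k nodes that bound matters far more than at 3-10.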

  17. Not (Just) “Scheduling” or “Container Orchestration”
    ● Scheduling is important
    ● But: much more to backplanes than bin-packing or max-min fairness
    ● Requires deep knowledge of
    ○ Application semantics
    ○ Ops procedures
    ● Goal: transform prepackaged “server software” into “service”
    “... there are not very many things that have aged as well as the [Linux] scheduler. Which is just another proof that scheduling is easy.”
    - Linus Torvalds, 2001

  18. The State of the Art

  19. What Do People Do Today?
    Many organizations already build service backplanes.
    … they just don’t know it.

  20. Common Pattern
    Goal
    Provide a software service to the rest of the organization
    E.g., object storage, streaming data analysis, batch analytics, ML, etc.
    Solution
    ● Start with off-the-shelf (OSS) software package
    ● Write “scripts” to deploy, manage, and upgrade instances

  21. Problem #1: Backplanes Are Hard
    Building fault-tolerant control planes for cluster services is not easy!
    ● Often >10,000 lines of code (LOC)
    ● Hard to test and debug
    ● Maintenance burden
    Backplane downtime is service downtime

  22. Problem #2: Not Seen As A Product
    ● In many cases, the service is the “product”
    ● Backplane is just a “bunch of scripts”
    ○ Not a distinct component of the system architecture
    ● Sometimes built in an ad-hoc way
    ● Often no rigorous specification or API

  23. Problem #3: Redundancy Between Services
    ● Many backplanes are similar
    ● Typically built by different teams that don’t collaborate
    ○ No opportunity for code reuse
    ○ No shared infrastructure
    ● Each backplane cannot examine global cluster state
    ● Hard to define global policies that apply to all backplanes

  24. Problem #4: Redundancy Between Organizations
    ● Many organizations have custom-written backplanes for Cassandra, Kafka, HDFS, etc.
    ● Often tightly coupled to their production environment
    ○ Result: fragile, not portable to other environments

  25. The Gap From “Done” to “Deployable”
    Developer “ships” a release of their software package
    ● Then >10k LOC is needed to deploy it at scale!
    This sucks
    ● The upstream developer is the domain expert
    ● Developer ships code their customer can’t (directly) use
    Can we standardize the functionality needed for large-scale deployments?
    ● Allow backplane functionality to move “up” the stack
    ● Tested and developed as part of the upstream software

  26. Opportunity: Shrink Runbooks
    1. Deploy to prod and pray
    2. Document best practices (“runbook”)
    3. Write scripts to handle common scenarios
    4. Encode best practices as a service backplane

  27. A Modest Proposal

  28. Rethinking Service Backplanes
    1. Embrace backplanes as a standard component in large-scale distributed systems
    ● Not just “a few scripts”
    2. Build infrastructure to make writing backplanes easier
    3. Define standard APIs for communicating between backplanes and cluster infrastructure
    4. Enable upstream software developers to ship backplanes as part of their software packages

  29. Example Architecture
    [Diagram: Backplane Manager, Cassandra Backplane, Postgres Backplane, Cluster Operator]
    Abstract away details of cloud or on-prem env.
    Clear API / interface for service backplanes
    Single operator interface, define global policy

  30. Background: Apache Mesos
    ● “Manage your data center as a single pool of resources.”
    ● UC Berkeley: 2008
    ● Battle-tested at Twitter: 2009-2016
    ● Other users: Apple, eBay, Netflix, Microsoft, PayPal, Airbnb, Criteo, Yelp, Uber, ...
    [Diagram: Mesos Master, Scheduler X, Scheduler Y, and Machine M running a Mesos Agent, Executor, and Task]
    Agent to master: “I have 8 CPUs, 8 disks, 64GB RAM”
    Master to scheduler: “Offer: would you like 8 CPUs, 8 disks, and 64GB of RAM?”
    Scheduler to master: “Accept: Launch container X.”
    Master to agent: “Launch container X.”
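
The offer exchange on this slide can be modeled with a toy Python sketch. It does not use the real Mesos scheduler API: `Offer`, `Scheduler.on_offer`, and `offer_cycle` are hypothetical names, and the accept rule (take the offer if it covers one task) is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Resources one agent reports as available (toy model)."""
    agent: str
    cpus: int
    disks: int
    mem_gb: int


class Scheduler:
    """Toy scheduler: accepts an offer if it covers one task's needs."""

    def __init__(self, name, cpus, mem_gb):
        self.name = name
        self.cpus = cpus
        self.mem_gb = mem_gb

    def on_offer(self, offer):
        """Return a launch request to accept the offer, or None to decline."""
        if offer.cpus >= self.cpus and offer.mem_gb >= self.mem_gb:
            return {"container": self.name, "cpus": self.cpus, "mem_gb": self.mem_gb}
        return None


def offer_cycle(offer, schedulers):
    """Master side: offer an agent's resources to each scheduler in turn.

    Returns the (agent, launch_request) pairs the agent should run.
    """
    launches = []
    remaining = offer
    for scheduler in schedulers:
        request = scheduler.on_offer(remaining)
        if request is None:
            continue
        launches.append((remaining.agent, request))
        # Whatever was not used is offered to the next scheduler.
        remaining = Offer(remaining.agent, remaining.cpus - request["cpus"],
                          remaining.disks, remaining.mem_gb - request["mem_gb"])
    return launches


if __name__ == "__main__":
    machine_m = Offer("M", cpus=8, disks=8, mem_gb=64)
    print(offer_cycle(machine_m, [Scheduler("X", 6, 32), Scheduler("Y", 4, 16)]))
```

The point of the shape is that the master decides who is offered what, while each scheduler decides what to do with an offer.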

  31. Mesos + Service Backplanes?
    [Diagram: Cluster Operator, Mesos Master, Cassandra Framework, Postgres Framework]

  32. Open Question: Software Architecture
    Backplane as part of application, or as separate component?

  33. Open Question: APIs
    ● Backplane ↔ backplane manager
    ● Application ↔ backplane
    ● Dimensions:
    ○ Push or pull (offer vs. request)
    ○ Optimistic or pessimistic
    ○ Declarative or imperative
    ○ Narrow or wide
    ● How to represent cluster resources?
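
The push-or-pull dimension can be made concrete with two toy interfaces in Python. Both classes and their method names are hypothetical, and resources are reduced to a single CPU count to keep the contrast visible.

```python
class PullManager:
    """Pull style (request): the backplane asks for what it wants."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus

    def request(self, cpus):
        """Grant the whole request if it fits; return the CPUs granted."""
        if cpus <= self.free_cpus:
            self.free_cpus -= cpus
            return cpus
        return 0


class PushManager:
    """Push style (offer): the manager shows what is free, the backplane picks."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus

    def offer_to(self, choose):
        """Call choose(free_cpus); the backplane returns how many CPUs it takes."""
        taken = max(0, min(choose(self.free_cpus), self.free_cpus))
        self.free_cpus -= taken
        return taken


if __name__ == "__main__":
    pull = PullManager(8)
    print(pull.request(6), pull.request(4))
    push = PushManager(8)
    print(push.offer_to(lambda free: min(free, 6)),
          push.offer_to(lambda free: min(free, 4)))
```

In the pull version the second backplane gets nothing because its request does not fit; in the push version it sees what is left and can choose to take less.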

  34. Open Question: Co-Design of Applications and Backplanes
    ● Where does the functionality live?
    ○ Application, backplane, or backplane manager
    ● Does this change how we should build common service features?
    ○ Security? Logging? Metrics? Fault tolerance? Service discovery? Data migration?

  35. Conclusion
    1. Many people are building service backplanes, even if they don’t call them that
    2. Driven by industry forces that are likely to persist
    3. We should embrace the need for backplanes and figure out how to build them properly

  36. Thanks!
    [email protected]
    @neil_conway