
From Vertical to Horizontal

A presentation about the challenges of scalability in the cloud, given at LinuxWochen Wien 2017

Pierre-Yves Ritschard

May 05, 2017
Transcript

  1. @pyr
    From Vertical to Horizontal
    The challenges of scalability in the cloud

  2. @pyr
    Four-line bio
    ● CTO & co-founder at Exoscale
    ● Open Source Developer
    ● Monitoring & Distributed Systems Enthusiast
    ● Linux since 1997

  3. @pyr
    Scalability
    “The ability of a system, network, or process to handle
    a growing amount of work in a capable manner or its
    ability to be enlarged to accommodate that growth”
    - Wikipedia

  4. @pyr
    Scalability
    ● Culture
    ● Organization and Process
    ● Technical Architecture
    ● Operations

  5. @pyr
    Scalability
    ● Culture
    ● Organization and Process
    ● Technical Architecture
    ● Operations

  6. @pyr
    Scaling Geometry
    Recent History
    Enter the cloud
    Distributed Headaches
    Architecture Drivers
    Looking forward

  7. Quick Notes
    ● “Cloud” is an umbrella term
    ● Here conflated with public IaaS
    ● Oriented toward web application design

  8. @pyr
    Scaling Geometry
    Vertical, Horizontal, and Diagonal

  9. @pyr
    Vertical (scaling up)
    Adding resources to a single system

  10. @pyr
    Vertical (scaling up)
    This is how you typically approach scaling MySQL

  11. @pyr
    Horizontal (scaling out)
    Accommodate growth by spreading workload over
    several systems

  12. @pyr
    Horizontal (scaling out)
    Typical approach to scaling web servers

  13. @pyr
    Diagonal
    Most common strategy: vertical first, and then
    horizontal

  14. @pyr
    Recent History
    Leading up to IaaS

  15. @pyr
    Whenever possible, a great approach

  16. @pyr
    So, why stop?

  17. @pyr
    Moore’s law
    “Over the history of computing, the number of
    transistors on integrated circuits doubles
    approximately every two years.”

  18. @pyr
    Average core speed has been stable for several years
    Consistent increase in cores per node

  19. @pyr
    “You mean I have to use threads?”

  20. @pyr
    Vertical Scaling Challenges
    (424 pages)

  21. @pyr
    Vertical Scaling Challenges
    Threads?

  22. @pyr
    No more automatic vertical approach

  23. @pyr
    Meanwhile...

  24. @pyr
    “What if I put an API on it?”

  25. @pyr
    Enter: the Cloud

  26. @pyr
    ● IT as a utility
    ● Programmable resources
    ● Decoupling of storage from system resources
    ● Usage-based billing model

  27. @pyr
    ● Much lower capacity planning overhead
    ● OPEX makes accounting department happy
    ● Nobody likes to change disks or rack servers

  28. @pyr
    ● Switches? gone.
    ● VLANs? gone.
    ● IP allocation and translation? gone.
    ● OS partitioning? gone.
    ● OS RAID management? gone.

  29. @pyr
    provider "exoscale" {
      api_key    = "${var.exoscale_api_key}"
      secret_key = "${var.exoscale_secret_key}"
    }

    resource "exoscale_instance" "web" {
      template  = "ubuntu 17.04"
      disk_size = "50g"
      profile   = "medium"
      ssh_key   = "production"
    }

  30. @pyr
    “There is no cloud, there is just someone else’s computer”

  31. @pyr
    ● It’s hard to break out of the big iron mental model
    ● It’s hard to change our trust model
    ○ “I want to be able to see my servers!”
    ● There is still an upper limit on node size
    ● Horizontal-first approach to building infrastructure

  32. @pyr
    Distributed Headaches

  33. @pyr
    Two interacting nodes already imply a distributed system
    Reduces SPOFs, increases the number of failure scenarios

  34. @pyr
    Distributed systems are subject to Brewer’s CAP theorem
    You cannot enjoy all three of Consistency, Availability, and Partition tolerance

  35. @pyr
    ● Consistency: simultaneous requests see a consistent set of data
    ● Availability: each incoming request is acknowledged and receives a success or failure response
    ● Partition Tolerance: the system continues to process incoming requests in the face of failures

  36. @pyr
    Architecture Drivers
    Reducing complexity to focus on higher order problems

  37. @pyr
    Inspectable services
    Queues over RPC
    Degrade gracefully
    Prefer concerned citizens
    Configuration from a service registry
    Nodes as immutable data structures

  38. @pyr
    Inspectable services

  39. @pyr
    Build introspection within services
    Number of acknowledged, processed, failed requests
    Time actions to quickly identify hotspots
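A minimal sketch of such in-service instrumentation, counting acknowledged, processed, and failed requests and timing each action (names are illustrative, not an API the deck prescribes):

```python
import time
from collections import Counter

class Instrumented:
    """Tracks acknowledged/processed/failed counts and per-action timings."""

    def __init__(self):
        self.counts = Counter()
        self.timings = {}  # action name -> total seconds spent

    def timed(self, action, fn, *args, **kwargs):
        """Run fn, counting outcomes and recording elapsed time per action."""
        self.counts["acknowledged"] += 1
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            self.counts["processed"] += 1
            return result
        except Exception:
            self.counts["failed"] += 1
            raise
        finally:
            self.timings[action] = (
                self.timings.get(action, 0.0) + time.monotonic() - start)
```

Exposing `counts` and `timings` over an admin endpoint is what makes hotspots visible at a glance.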

  40. @pyr
    Avoid the monitor effect
    Small unobtrusive probes
    UDP is often sufficient
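A small probe in the statsd wire style shows why UDP is unobtrusive: fire-and-forget, so a lost packet or absent collector costs the service nothing (metric names and the default port are illustrative):

```python
import socket

def send_metric(name, value, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget statsd-style metric over UDP: 'name:value|type'."""
    payload = f"{name}:{value}|{kind}".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

# Usage: send_metric("web.requests", 1)           counter increment
#        send_metric("web.latency_ms", 42, "ms")  timing sample
```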

  41. @pyr
    Leverage proven, existing tools
    Collectd, Syslog-NG, Statsd, Riemann

  42. @pyr
    @wrap_riemann('activate-account')
    def activate_account(self, id):
        self.accounts.by_id(id).try_activate()
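The slide assumes a `wrap_riemann` decorator without showing it; a hedged sketch of what it might do, timing the call and reporting an event through a `send_event` callable you would wire to a real Riemann client:

```python
import time
from functools import wraps

def wrap_riemann(service, send_event=print):
    """Decorator: time the wrapped call and report an event.

    `send_event` stands in for a real Riemann client; it receives a dict
    with the service name, an ok/error state, and elapsed milliseconds.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            state = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                state = "error"
                raise
            finally:
                send_event({"service": service,
                            "state": state,
                            "metric": (time.monotonic() - start) * 1000.0})
        return wrapper
    return decorator
```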

  43. @pyr
    Queues over RPC

  44. @pyr
    RPC couples systems
    Your service’s CAP properties are tied to the RPC provider

  45. @pyr
    Take responsibility out of the callee as soon as
    possible
    Textbook example: SMTP

  46. @pyr
    Queues promote statelessness
    {
      "request_id": "97d4f7b3",
      "host_id": "64e4-41b5",
      "action": "mailout",
      "recipients": ["[email protected]"],
      "content": "..."
    }

  47. @pyr
    Queues help dynamically shape systems
    Queue backlog growing? Spin-up new workers!
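The scaling rule fits in a few lines: size the worker pool from the backlog, clamped between one worker and a ceiling (the thresholds here are illustrative, not from the deck):

```python
import math

def desired_workers(backlog, per_worker=100, max_workers=32):
    """One worker per `per_worker` queued jobs, clamped to [1, max_workers]."""
    return max(1, min(max_workers, math.ceil(backlog / per_worker)))
```

Run this against the queue depth on a timer and reconcile the worker count toward its result.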

  48. @pyr
    Degrade Gracefully

  49. @pyr
    Embrace failure
    Systems will fail. In ways you didn’t expect.

  50. @pyr
    Avoid failure propagation
    Implement backpressure to avoid killing loaded systems. Queues make great pressure valves.
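A bounded queue gives you the pressure valve directly: when it is full, reject up front instead of piling load onto an already struggling backend (sketch; the capacity is illustrative):

```python
import queue

work = queue.Queue(maxsize=1000)  # bounded: the pressure valve

def submit(job):
    """Accept a job if there is capacity; otherwise shed load immediately."""
    try:
        work.put_nowait(job)
        return True   # accepted
    except queue.Full:
        return False  # caller should back off, e.g. answer with a 503
```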

  51. @pyr
    Don’t give up
    Use connection pooling and retry policies.
    Best in class: finagle, cassandra-driver
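The cited drivers implement retry policies for you; the core of one, exponential backoff with a bounded number of attempts, can be sketched as:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1, retry_on=(OSError,),
                 sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * (2 ** attempt))
```

Only retry errors you believe are transient; retrying a poison request just multiplies the damage.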

  52. @pyr
    Keep systems up
    SQL down? No more account creations, still serving existing customers.
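Sketched in code (hypothetical handlers): the one feature that needs SQL degrades to an explicit error, while reads served from cache keep working:

```python
def create_account(db, name):
    """Account creation needs SQL; if it is down, degrade this feature
    instead of taking the whole site down."""
    try:
        db.insert(name)
        return {"status": 201, "body": "created"}
    except ConnectionError:
        return {"status": 503, "body": "signups temporarily disabled"}

def show_account(cache, name):
    """Reads served from a cache keep working while SQL is down."""
    return {"status": 200, "body": cache.get(name, "unknown")}
```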

  53. @pyr
    Prefer Concerned Citizens

  54. @pyr
    All moving parts force new compromises
    This is true of internal and external components

  55. @pyr
    Choose components accordingly

  56. @pyr
    You probably want an AP queueing system
    So please avoid using MySQL as one!
    Candidates: Apache Kafka, RabbitMQ, Redis (to a lesser extent)

  57. @pyr
    Cache locally
    Much higher aggregated cache capacity
    No Huge SPOF
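A per-node cache can be as small as a dict with expiry; every node holds its own copy, so there is no shared cache to fail (sketch; eviction beyond TTL is omitted):

```python
import time

class LocalCache:
    """Tiny per-node TTL cache: no network hop, no shared SPOF."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None or entry[0] < self.clock():
            self.data.pop(key, None)  # expired or missing
            return None
        return entry[1]

    def put(self, key, value):
        self.data[key] = (self.clock() + self.ttl, value)
```

The trade-off is staleness: each node may briefly serve a different value until its TTL expires.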

  58. @pyr
    Choose your storage compromises
    Object Storage, Distributed KV (eventual consistency), SQL (no P or A).

  59. @pyr
    Configuration through service registries

  60. @pyr
    Keep track of node volatility
    Reprovisioning of configuration on cluster topology changes
    Load-balancers make a great interaction point (concentrate changes there)

  61. @pyr
    The service registry is critical
    Ideally needs to be a strongly consistent, distributed system.
    You already have an eventually consistent one: DNS!

  62. @pyr
    ZooKeeper and etcd
    Current best in class. Promote usage in-app as well as distributed locks, barriers, etc.
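In-app, the registry pattern boils down to: watch the membership key, and re-render load-balancer configuration whenever it changes. A sketch of the callback you would wire to a real etcd or ZooKeeper watch (all names here are hypothetical):

```python
def render_upstreams(nodes):
    """Turn registry membership into a load-balancer upstream block."""
    lines = ["upstream web {"]
    lines += [f"    server {n};" for n in sorted(nodes)]
    lines.append("}")
    return "\n".join(lines)

def on_topology_change(nodes, write_config, reload_lb):
    """Registry-watch callback: reprovision the load balancer whenever
    the set of live nodes changes."""
    write_config(render_upstreams(nodes))
    reload_lb()
```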

  63. @pyr
    Immutable Infrastructure

  64. @pyr
    No more fixing nodes
    Human intervention means configuration drift

  65. @pyr
    ● Configuration Drift? Reprovision node.
    ● New version of software? Reprovision node.
    ● Configuration file change? Reprovision node.

  66. @pyr
    Depart from using the machine as the base unit of
    reasoning
    All nodes in clusters should be equivalent

  67. @pyr
    Looking Forward
    The cluster is the computer

  68. @pyr
    A new layer of abstraction
    Virtual resources pooled and orchestrated

  69. @pyr
    Generic platform abstractions
    PaaS solutions are a commodity (cf: OpenShift)
    Generic scheduling and failover frameworks (Mesos, Kubernetes Operators)

  70. @pyr
    Thanks!
    Questions?
