From Vertical to Horizontal

A presentation about the challenges of scalability in the cloud, given at LinuxWochen Wien 2017


Pierre-Yves Ritschard

May 05, 2017


  1. @pyr From Vertical to Horizontal: The challenges of scalability in the cloud
  2. Four-line bio: CTO & co-founder at Exoscale • Open Source Developer • Monitoring & Distributed Systems Enthusiast • Linux since 1997
  3. Scalability: “The ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth” - Wikipedia
  4. Scalability: Culture • Organization and Process • Technical Architecture • Operations
  6. Scaling Geometry • Recent History • Enter the Cloud • Distributed Headaches • Architecture Drivers • Looking Forward
  7. Quick Notes: “Cloud” is an umbrella term • Here conflated with public IaaS • Oriented toward web application design
  8. Scaling Geometry: Vertical, Horizontal, and Diagonal
  9. Vertical (scaling up): Adding resources to a single system
  10. Vertical (scaling up): This is how you typically approach scaling MySQL
  12. Horizontal (scaling out): Accommodate growth by spreading workload over several systems
  13. Horizontal (scaling out): Typical approach to scaling web servers

  16. Diagonal: The most common strategy. Vertical first, and then horizontal.
  17. Recent History: Leading up to IaaS

  19. Whenever possible, a great approach
  20. So, why stop?

  22. Moore’s law: “Over the history of computing, the number of transistors on integrated circuits doubles approximately every two years.”
  23. Average core speed has been stable for several years. Consistent increase in cores per node.
  24. “You mean I have to use threads?”

  25. Vertical Scaling Challenges (424 pages)
  26. Vertical Scaling Challenges: Threads?
  27. No more automatic vertical approach
  28. Meanwhile...
  29. “What if I put an API on it?”
  30. Enter: the Cloud

  31. IT as a utility • Programmable resources • Decoupling of storage from system resources • Usage-based billing model
  32. Upside

  33. Much lower capacity planning overhead • OPEX makes the accounting department happy • Nobody likes to change disks or rack servers
  34. Switches? Gone. • VLANs? Gone. • IP allocation and translation? Gone. • OS partitioning? Gone. • OS RAID management? Gone.
  36. provider "exoscale" {
        api_key    = "${var.exoscale_api_key}"
        secret_key = "${var.exoscale_secret_key}"
      }

      resource "exoscale_instance" "web" {
        template  = "ubuntu 17.04"
        disk_size = "50g"
        profile   = "medium"
        ssh_key   = "production"
      }
  37. Downside

  38. “There is no cloud, there is just someone else’s
  39. It’s hard to break out of the big iron mental model • It’s hard to change our trust model (“I want to be able to see my servers!”) • There is still an upper limit on node size • Horizontal-first approach to building infrastructure
  40. Distributed Headaches
  41. Two nodes interacting imply a distributed system: Reduces single points of failure (SPOF), increases the number of failure scenarios
  42. Distributed systems are subject to Brewer/CAP: Cannot enjoy all three of Consistency, Availability, and Partition tolerance
  43. Consistency: Simultaneous requests see a consistent set of data • Availability: Each incoming request is acknowledged and receives a success or failure response • Partition Tolerance: The system will continue to process incoming requests in the face of failures
  44. Architecture Drivers: Reducing complexity to focus on higher order
  45. Inspectable services • Queues over RPC • Degrade gracefully • Prefer concerned citizens • Configuration from a service registry • Nodes as immutable data structures
  46. Inspectable services
  47. Build introspection within services: Number of acknowledged, processed, failed requests. Time actions to quickly identify hotspots.
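Slide 47's idea of built-in introspection can be sketched roughly as follows. This is only an illustration under assumptions: the `ServiceStats` class and its method names are hypothetical and do not come from the talk. It counts acknowledged, processed, and failed requests and records how long each action takes.

```python
import time
from collections import Counter
from contextlib import contextmanager


class ServiceStats:
    """In-process counters and timings for one service (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()  # keys: acknowledged, processed, failed
        self.timings = {}        # action name -> list of durations in seconds

    @contextmanager
    def track(self, action):
        """Count one request for `action` and record how long it took."""
        self.counts["acknowledged"] += 1
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.counts["failed"] += 1
            raise
        else:
            self.counts["processed"] += 1
        finally:
            elapsed = time.perf_counter() - start
            self.timings.setdefault(action, []).append(elapsed)

    def slowest(self):
        """Return the action with the highest average duration, or None."""
        if not self.timings:
            return None
        return max(
            self.timings,
            key=lambda a: sum(self.timings[a]) / len(self.timings[a]),
        )
```

Wrapping a handler body in `with stats.track("activate-account"):` would be enough to feed the counters; `slowest()` then points at a likely hotspot.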
  48. Avoid the monitor effect: Small unobtrusive probes. UDP is often sufficient.
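As a rough illustration of slide 48's unobtrusive UDP probe, the sketch below sends a single counter increment in the StatsD line format (`name:value|c`). The function name, the metric name, and the default host and port are assumptions for the example, not details from the talk.

```python
import socket


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Send one counter increment as a single UDP datagram, fire-and-forget."""
    payload = f"{name}:{value}|c".encode("ascii")
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
    except OSError:
        # A probe must never take the service down: drop the metric instead.
        pass
    return payload
```

Because UDP needs no connection and no acknowledgement, the caller never blocks on the monitoring system, which is the point of keeping probes small.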
  49. Leverage proven, existing tools: Collectd, Syslog-NG, Statsd, Riemann
  50. @wrap_riemann('activate-account')
      def activate_account(self, id):
          self.accounts.by_id(id).try_activate()
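The talk does not show how `wrap_riemann` is implemented. One possible shape for such a decorator, purely as a sketch, is below. The name `wrap_event` and the pluggable `emit` callback are hypothetical stand-ins for a real Riemann client; the dictionary keys follow Riemann's `service`, `state`, and `metric` event fields.

```python
import functools
import time


def wrap_event(service, emit=print):
    """Decorator factory: time the wrapped call and report one event per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            state = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                state = "critical"
                raise
            finally:
                # Emitted on success and on failure alike.
                emit({
                    "service": service,
                    "state": state,
                    "metric": time.perf_counter() - start,  # seconds
                })
        return wrapper
    return decorator
```

Passing `emit=events.append` with a plain list makes the decorator easy to exercise without any monitoring backend.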

  51. Queues over RPC
  52. RPC couples systems: Your service’s CAP properties are tied to the RPC provider
  53. Take responsibility out of the callee as soon as possible. Textbook example: SMTP
  54. Queues promote statelessness:
      {
        request_id: "97d4f7b3",
        host_id: "64e4-41b5",
        action: "mailout",
        recipients: [ "" ],
        content: "..."
      }
  55. Queues help dynamically shape systems: Queue backlog growing? Spin up new workers!
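The "backlog growing, spin up workers" rule from slide 55 could be expressed as a small sizing function. This is a sketch under assumptions: the function name, the per-worker capacity of 100 messages, and the minimum and maximum bounds are illustrative values, not figures from the talk.

```python
import math


def desired_workers(backlog, per_worker=100, minimum=1, maximum=20):
    """Return how many workers to run for a given queue backlog.

    One worker is assumed to absorb `per_worker` queued messages; the result
    is clamped so the pool never drops to zero or grows without bound.
    """
    wanted = math.ceil(backlog / per_worker)
    return max(minimum, min(maximum, wanted))
```

A scheduler could call this periodically with the current queue depth and start or stop workers to match. Since each message carries everything needed to process it, any worker can pick up any message.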
  56. Degrade Gracefully
  57. Embrace failure: Systems will fail. In ways you didn’t
  58. Avoid failure propagation: Implement backpressure to avoid killing loaded systems. Queues make great pressure valves.
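One simple way to picture slide 58's backpressure is a bounded in-memory queue that refuses work when full instead of letting load pile up. The `BoundedInbox` class below is a hypothetical sketch built on Python's standard `queue` module, not something shown in the talk.

```python
import queue


class BoundedInbox:
    """Accept jobs up to a fixed capacity; refuse the rest immediately."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, job):
        """Return True if the job was queued, False if the inbox is full."""
        try:
            self._queue.put_nowait(job)
            return True
        except queue.Full:
            # Signal the caller to back off rather than overload the workers.
            return False

    def take(self):
        """Remove and return the oldest job; raises queue.Empty when idle."""
        return self._queue.get_nowait()
```

A front end receiving `False` can answer with a "try again later" response, so pressure is pushed back to the edge instead of propagating to already loaded systems.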
  59. Don’t give up: Use connection pooling and retry policies. Best in class: finagle, cassandra-driver
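A retry policy in the spirit of slide 59 might look like the sketch below: exponential backoff with random jitter. The function name and default values are assumptions, and this is not how finagle or cassandra-driver implement their policies.

```python
import random
import time


def retry(operation, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed.

    The wait before retry n is drawn uniformly from 0 to
    min(max_delay, base_delay * 2**n) seconds (backoff with full jitter).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Real code should catch only errors known to be transient.
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

The `sleep` parameter exists only so the waiting can be replaced in tests. The jitter keeps many clients from retrying in lockstep against a service that is just recovering.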
  60. Keep systems up: SQL down? No more account creations, but still serving existing customers.
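Slide 60's example can be sketched as a service that refuses new accounts while SQL is unreachable but keeps answering reads from a local copy. Everything here is hypothetical: the `AccountService` class, the `insert` method assumed on the database object, and the status codes are illustrative choices.

```python
class AccountService:
    """Hypothetical service: reads come from a local copy, writes need SQL."""

    def __init__(self, database, known_accounts):
        self.database = database              # assumed to expose insert(id, data)
        self.known_accounts = known_accounts  # local copy used to serve reads

    def create(self, account_id, data):
        try:
            self.database.insert(account_id, data)
        except ConnectionError:
            # SQL is down: refuse new accounts, but do not crash the service.
            return {"status": 503, "error": "account creation unavailable"}
        self.known_accounts[account_id] = data
        return {"status": 201}

    def get(self, account_id):
        # Existing customers are still served while SQL is unreachable.
        if account_id in self.known_accounts:
            return {"status": 200, "account": self.known_accounts[account_id]}
        return {"status": 404}
```

The failure stays confined to the one feature that truly depends on the database.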
  61. Prefer Concerned Citizens
  62. All moving parts force new compromises: This is true of internal and external components
  63. Choose components accordingly
  64. You probably want an AP queueing system, so please avoid using MySQL as one! Candidates: Apache Kafka, RabbitMQ, Redis (to a lesser extent)
  65. Cache locally: Much higher aggregated cache capacity. No Huge
  66. Choose your storage compromises: Object Storage, Distributed KV (eventual consistency), SQL (no P or A).
  67. Configuration through service registries
  68. Keep track of node volatility: Reprovisioning of configuration on cluster topology changes. Load-balancers make a great interaction point (concentrate changes there).
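To illustrate slide 68's point about concentrating topology changes at the load balancer, the sketch below regenerates a backend section from the current node list and reports whether a reload is needed. The function names, the HAProxy-style output, and the port are assumptions; how the node list is read from a registry is left out.

```python
def render_backends(nodes, port=8080):
    """Render an HAProxy-style backend section from {name: address}."""
    lines = ["backend web"]
    for name, address in sorted(nodes.items()):
        lines.append(f"    server {name} {address}:{port} check")
    return "\n".join(lines) + "\n"


def sync(nodes, current_config):
    """Return (new_config, changed) for the node list seen in the registry."""
    new_config = render_backends(nodes)
    return new_config, new_config != current_config
```

Sorting the nodes keeps the output stable, so the load balancer is reloaded only when the topology really changed.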
  69. The service registry is critical: Ideally needs to be a strongly consistent, distributed system. You already have an eventually consistent one: DNS!
  70. ZooKeeper and etcd: Current best in class. Promote in-app usage as well as distributed locks, barriers, etc.
  71. Immutable Infrastructure
  72. No more fixing nodes: Human intervention means configuration drift
  73. Configuration drift? Reprovision node. • New version of software? Reprovision node. • Configuration file change? Reprovision node.
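The "reprovision on any change" rule from slide 73 can be sketched by fingerprinting the desired node specification and replacing every node built from a different one. The function names and the shape of the spec dictionary are hypothetical, chosen only for this example.

```python
import hashlib
import json


def spec_digest(spec):
    """Fingerprint a node specification (image, software version, config)."""
    canonical = json.dumps(spec, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def nodes_to_replace(desired_spec, running):
    """Return names of nodes whose recorded digest differs from the desired one.

    `running` maps node name -> digest of the spec the node was built from.
    """
    wanted = spec_digest(desired_spec)
    return sorted(name for name, digest in running.items() if digest != wanted)
```

Nothing is patched in place: a node either matches the desired specification or is rebuilt, which is what removes configuration drift.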
  74. Depart from using the machine as the base unit of reasoning: All nodes in clusters should be equivalent
  75. Looking Forward: The cluster is the computer
  76. A new layer of abstraction: Virtual resources pooled and
  77. Generic platform abstractions: PaaS solutions are a commodity (cf. OpenShift). Generic scheduling and failover frameworks (Mesos, Kubernetes Operators).
  78. Thanks! Questions?