
From Vertical to Horizontal

A presentation about the challenges of scalability in the cloud, given at LinuxWochen Wien 2017

Pierre-Yves Ritschard

May 05, 2017
Transcript

  1. @pyr
    From Vertical to Horizontal
    The challenges of scalability in the cloud

  2. @pyr
    Four-line bio
    ● CTO & co-founder at Exoscale
    ● Open Source Developer
    ● Monitoring & Distributed Systems Enthusiast
    ● Linux since 1997

  3. @pyr
    Scalability
    “The ability of a system, network, or process to handle
    a growing amount of work in a capable manner or its
    ability to be enlarged to accommodate that growth”
    - Wikipedia

  4. @pyr
    Scalability
    ● Culture
    ● Organization and Process
    ● Technical Architecture
    ● Operations

  5. @pyr
    Scalability
    ● Culture
    ● Organization and Process
    ● Technical Architecture
    ● Operations

  6. @pyr
    Scaling Geometry
    Recent History
    Enter the cloud
    Distributed Headaches
    Architecture Drivers
    Looking forward

  7. Quick Notes
    ● “Cloud” is an umbrella term
    ● Here conflated with public IaaS
    ● Oriented toward web application design

  8. @pyr
    Scaling Geometry
    Vertical, Horizontal, and Diagonal

  9. @pyr
    Vertical (scaling up)
    Adding resources to a single system

  10. @pyr
    Vertical (scaling up)
    This is how you typically approach scaling MySQL

  11. @pyr
    Horizontal (scaling out)
    Accommodate growth by spreading workload over
    several systems

  12. @pyr
    Horizontal (scaling out)
    Typical approach to scaling web servers

  13. @pyr
    Diagonal
    Most common strategy: vertical first, and then
    horizontal

  14. @pyr
    Recent History
    Leading up to IaaS

  15. @pyr
    Whenever possible, a great approach

  16. @pyr
    So, why stop?

  17. @pyr
    Moore’s law
    “Over the history of computing, the number of
    transistors on integrated circuits doubles
    approximately every two years.”

  18. @pyr
    Average core speed has been stable for several years
    Consistent increase in cores per node

  19. @pyr
    “You mean I have to use threads?”

  20. @pyr
    Vertical Scaling Challenges
    (424 pages)

  21. @pyr
    Vertical Scaling Challenges
    Threads?

  22. @pyr
    No more automatic vertical approach

  23. @pyr
    Meanwhile...

  24. @pyr
    “What if I put an API on it?”

  25. @pyr
    Enter: the Cloud

  26. @pyr
    ● IT as a utility
    ● Programmable resources
    ● Decoupling of storage from system resources
    ● Usage-based billing model

  27. @pyr
    ● Much lower capacity planning overhead
    ● OPEX makes accounting department happy
    ● Nobody likes to change disks or rack servers

  28. @pyr
    ● Switches? gone.
    ● VLANs? gone.
    ● IP allocation and translation? gone.
    ● OS partitioning? gone.
    ● OS RAID management? gone.

  29. @pyr
    provider "exoscale" {
      api_key    = "${var.exoscale_api_key}"
      secret_key = "${var.exoscale_secret_key}"
    }

    resource "exoscale_instance" "web" {
      template  = "ubuntu 17.04"
      disk_size = "50g"
      profile   = "medium"
      ssh_key   = "production"
    }

  30. @pyr
    “There is no cloud, there is just someone else’s computer”

  31. @pyr
    ● It’s hard to break out of the big iron mental model
    ● It’s hard to change our trust model
    ○ “I want to be able to see my servers!”
    ● There is still an upper limit on node size
    ● Horizontal-first approach to building infrastructure

  32. @pyr
    Distributed Headaches

  33. @pyr
    Two interacting nodes already imply a distributed system
    Reduces SPOFs, increases the number of failure scenarios

  34. @pyr
    Distributed systems are subject to Brewer’s CAP theorem
    You cannot enjoy all three of Consistency, Availability, and Partition tolerance

  35. @pyr
    ● Consistency: simultaneous requests see a consistent set of data
    ● Availability: each incoming request is acknowledged and receives a success or failure response
    ● Partition Tolerance: the system continues to process incoming requests in the face of failures

  36. @pyr
    Architecture Drivers
    Reducing complexity to focus on higher order problems

  37. @pyr
    Inspectable services
    Queues over RPC
    Degrade gracefully
    Prefer concerned citizens
    Configuration from a service registry
    Nodes as immutable data structures

  38. @pyr
    Inspectable services

  39. @pyr
    Build introspection within services
    Number of acknowledged, processed, failed requests
    Time actions to quickly identify hotspots
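A minimal sketch of such in-service instrumentation, counting acknowledged, processed, and failed requests and timing each action (names are illustrative, not an API the deck prescribes):

```python
import time
from collections import Counter

class Instrumented:
    """Tracks acknowledged/processed/failed counts and per-action timings."""

    def __init__(self):
        self.counts = Counter()
        self.timings = {}  # action name -> total seconds spent

    def timed(self, action, fn, *args, **kwargs):
        """Run fn, counting outcomes and recording elapsed time per action."""
        self.counts["acknowledged"] += 1
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            self.counts["processed"] += 1
            return result
        except Exception:
            self.counts["failed"] += 1
            raise
        finally:
            self.timings[action] = (
                self.timings.get(action, 0.0) + time.monotonic() - start)
```

Exposing `counts` and `timings` over an admin endpoint is what makes hotspots visible at a glance.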

  40. @pyr
    Avoid the monitor effect
    Small unobtrusive probes
    UDP is often sufficient
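A small probe in the statsd wire style shows why UDP is unobtrusive: fire-and-forget, so a lost packet or absent collector costs the service nothing (metric names and the default port are illustrative):

```python
import socket

def send_metric(name, value, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget statsd-style metric over UDP: 'name:value|type'."""
    payload = f"{name}:{value}|{kind}".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

# Usage: send_metric("web.requests", 1)           counter increment
#        send_metric("web.latency_ms", 42, "ms")  timing sample
```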

  41. @pyr
    Leverage proven, existing tools
    Collectd, Syslog-NG, Statsd, Riemann

  42. @pyr
    @wrap_riemann('activate-account')
    def activate_account(self, id):
        self.accounts.by_id(id).try_activate()
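The slide assumes a `wrap_riemann` decorator without showing it; a hedged sketch of what it might do, timing the call and reporting an event through a `send_event` callable you would wire to a real Riemann client:

```python
import time
from functools import wraps

def wrap_riemann(service, send_event=print):
    """Decorator: time the wrapped call and report an event.

    `send_event` stands in for a real Riemann client; it receives a dict
    with the service name, an ok/error state, and elapsed milliseconds.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            state = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                state = "error"
                raise
            finally:
                send_event({"service": service,
                            "state": state,
                            "metric": (time.monotonic() - start) * 1000.0})
        return wrapper
    return decorator
```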

  43. @pyr
    Queues over RPC

  44. @pyr
    RPC couples systems
    Your service’s CAP properties are tied to the RPC provider

  45. @pyr
    Take responsibility out of the callee as soon as
    possible
    Textbook example: SMTP

  46. @pyr
    Queues promote statelessness
    {
      "request_id": "97d4f7b3",
      "host_id": "64e4-41b5",
      "action": "mailout",
      "recipients": ["[email protected]"],
      "content": "..."
    }

  47. @pyr
    Queues help dynamically shape systems
    Queue backlog growing? Spin-up new workers!
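The scaling rule fits in a few lines: size the worker pool from the backlog, clamped between one worker and a ceiling (the thresholds here are illustrative, not from the deck):

```python
import math

def desired_workers(backlog, per_worker=100, max_workers=32):
    """One worker per `per_worker` queued jobs, clamped to [1, max_workers]."""
    return max(1, min(max_workers, math.ceil(backlog / per_worker)))
```

Run this against the queue depth on a timer and reconcile the worker count toward its result.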

  48. @pyr
    Degrade Gracefully

  49. @pyr
    Embrace failure
    Systems will fail. In ways you didn’t expect.

  50. @pyr
    Avoid failure propagation
    Implement backpressure to avoid killing loaded systems. Queues make great pressure valves.
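A bounded queue gives you the pressure valve directly: when it is full, reject up front instead of piling load onto an already struggling backend (sketch; the capacity is illustrative):

```python
import queue

work = queue.Queue(maxsize=1000)  # bounded: the pressure valve

def submit(job):
    """Accept a job if there is capacity; otherwise shed load immediately."""
    try:
        work.put_nowait(job)
        return True   # accepted
    except queue.Full:
        return False  # caller should back off, e.g. answer with a 503
```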

  51. @pyr
    Don’t give up
    Use connection pooling and retry policies.
    Best in class: finagle, cassandra-driver
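The cited drivers implement retry policies for you; the core of one, exponential backoff with a bounded number of attempts, can be sketched as:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1, retry_on=(OSError,),
                 sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * (2 ** attempt))
```

Only retry errors you believe are transient; retrying a poison request just multiplies the damage.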

  52. @pyr
    Keep systems up
    SQL down? No more account creations, still serving existing customers.
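Sketched in code (hypothetical handlers): the one feature that needs SQL degrades to an explicit error, while reads served from cache keep working:

```python
def create_account(db, name):
    """Account creation needs SQL; if it is down, degrade this feature
    instead of taking the whole site down."""
    try:
        db.insert(name)
        return {"status": 201, "body": "created"}
    except ConnectionError:
        return {"status": 503, "body": "signups temporarily disabled"}

def show_account(cache, name):
    """Reads served from a cache keep working while SQL is down."""
    return {"status": 200, "body": cache.get(name, "unknown")}
```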

  53. @pyr
    Prefer Concerned Citizens

  54. @pyr
    All moving parts force new compromises
    This is true of internal and external components

  55. @pyr
    Choose components accordingly

  56. @pyr
    You probably want an AP queueing system
    So please avoid using MySQL as one!
    Candidates: Apache Kafka, RabbitMQ, Redis (to a lesser extent)

  57. @pyr
    Cache locally
    Much higher aggregated cache capacity
    No Huge SPOF
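A per-node cache can be as small as a dict with expiry; every node holds its own copy, so there is no shared cache to fail (sketch; eviction beyond TTL is omitted):

```python
import time

class LocalCache:
    """Tiny per-node TTL cache: no network hop, no shared SPOF."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None or entry[0] < self.clock():
            self.data.pop(key, None)  # expired or missing
            return None
        return entry[1]

    def put(self, key, value):
        self.data[key] = (self.clock() + self.ttl, value)
```

The trade-off is staleness: each node may briefly serve a different value until its TTL expires.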

  58. @pyr
    Choose your storage compromises
    Object Storage, Distributed KV (eventual consistency), SQL (no P or A).

  59. @pyr
    Configuration through service registries

  60. @pyr
    Keep track of node volatility
    Reprovisioning of configuration on cluster topology changes
    Load-balancers make a great interaction point (concentrate changes there)

  61. @pyr
    The service registry is critical
    Ideally needs to be a strongly consistent, distributed system.
    You already have an eventually consistent one: DNS!

  62. @pyr
    ZooKeeper and etcd
    Current best in class. Promote usage in-app as well as distributed locks, barriers, etc.
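In-app, the registry pattern boils down to: watch the membership key, and re-render load-balancer configuration whenever it changes. A sketch of the callback you would wire to a real etcd or ZooKeeper watch (all names here are hypothetical):

```python
def render_upstreams(nodes):
    """Turn registry membership into a load-balancer upstream block."""
    lines = ["upstream web {"]
    lines += [f"    server {n};" for n in sorted(nodes)]
    lines.append("}")
    return "\n".join(lines)

def on_topology_change(nodes, write_config, reload_lb):
    """Registry-watch callback: reprovision the load balancer whenever
    the set of live nodes changes."""
    write_config(render_upstreams(nodes))
    reload_lb()
```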

  63. @pyr
    Immutable Infrastructure

  64. @pyr
    No more fixing nodes
    Human intervention means configuration drift

  65. @pyr
    ● Configuration Drift? Reprovision node.
    ● New version of software? Reprovision node.
    ● Configuration file change? Reprovision node.

  66. @pyr
    Depart from using the machine as the base unit of
    reasoning
    All nodes in clusters should be equivalent

  67. @pyr
    Looking Forward
    The cluster is the computer

  68. @pyr
    A new layer of abstraction
    Virtual resources pooled and orchestrated

  69. @pyr
    Generic platform abstractions
    PaaS solutions are a commodity (cf: OpenShift)
    Generic scheduling and failover frameworks (Mesos, Kubernetes Operators)

  70. @pyr
    Thanks!
    Questions?
