@pyr
From Vertical to Horizontal
The challenges of scalability in the cloud
Four-line bio
● CTO & co-founder at Exoscale
● Open Source Developer
● Monitoring & Distributed Systems Enthusiast
● Linux since 1997
Scalability
“The ability of a system, network, or process to handle
a growing amount of work in a capable manner or its
ability to be enlarged to accommodate that growth”
- Wikipedia
Scalability
● Culture
● Organization and Process
● Technical Architecture
● Operations
● Scaling Geometry
● Recent History
● Enter the Cloud
● Distributed Headaches
● Architecture Drivers
● Looking Forward
Quick Notes
● “Cloud” is an umbrella term
● Conflated here with public IaaS
● Oriented toward web application design
Scaling Geometry
Vertical, Horizontal, and Diagonal
Vertical (scaling up)
Adding resources to a single system
Vertical (scaling up)
This is how you typically approach scaling MySQL
Horizontal (scaling out)
Accommodate growth by spreading workload over several systems
Horizontal (scaling out)
Typical approach to scaling web servers
Diagonal
Most common strategy: vertical first, and then horizontal
Recent History
Leading up to IaaS
Vertical scaling: whenever possible, a great approach
So, why stop?
Moore’s law
“Over the history of computing, the number of
transistors on integrated circuits doubles
approximately every two years.”
Average core speed has been stable for several years
Consistent increase in cores per node
“You mean I have to use threads?”
Vertical Scaling Challenges
(424 pages)
Vertical Scaling Challenges
Threads?
No more automatic vertical scaling
Meanwhile...
“What if I put an API on it?”
Enter: the Cloud
● IT as a utility
● Programmable resources
● Decoupling of storage from system resources
● Usage-based billing model
Upside
● Much lower capacity planning overhead
● OPEX makes the accounting department happy
● Nobody likes to change disks or rack servers
● Switches? gone.
● VLANs? gone.
● IP allocation and translation? gone.
● OS partitioning? gone.
● OS RAID management? gone.
“There is no cloud, there is just someone else’s computer”
● It’s hard to break out of the big iron mental model
● It’s hard to change our trust model
○ “I want to be able to see my servers!”
● There is still an upper limit on node size
● Horizontal-first approach to building infrastructure
Distributed Headaches
Two interacting nodes already form a distributed system
Reduces SPOFs, but increases the number of failure scenarios
Distributed systems are subject to Brewer’s CAP theorem
You cannot enjoy all three of Consistency, Availability, and Partition tolerance
● Consistency: Simultaneous requests see a consistent set of data
● Availability: Each incoming request is acknowledged and receives a success or failure response
● Partition Tolerance: The system continues to process incoming requests in the face of failures
Architecture Drivers
Reducing complexity to focus on higher-order problems
● Inspectable services
● Queues over RPC
● Degrade gracefully
● Prefer concerned citizens
● Configuration from a service registry
● Nodes as immutable data structures
Inspectable services
Build introspection into services
Count acknowledged, processed, and failed requests
Time actions to quickly identify hotspots
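A minimal sketch of what such introspection can look like in-process. The Metrics class and the incr/timed names are illustrative, not a particular library:

```python
# Minimal in-process introspection: counters plus a timing helper.
import time
from collections import defaultdict
from contextlib import contextmanager

class Metrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, by=1):
        self.counters[name] += by

    @contextmanager
    def timed(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            self.timings[name].append(time.monotonic() - start)

metrics = Metrics()

def handle(request):
    metrics.incr("requests.acknowledged")
    with metrics.timed("requests.handle"):  # timing exposes hotspots
        try:
            ...  # actual work goes here
            metrics.incr("requests.processed")
        except Exception:
            metrics.incr("requests.failed")
            raise
```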
Slide 48
Slide 48 text
@pyr
Avoid the monitor effect
Small unobtrusive probes
UDP is often sufficient
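A sketch of such a probe, statsd-style. The address, port, and wire format are assumptions; a lost datagram costs nothing and the service never blocks on monitoring:

```python
# Fire-and-forget UDP probe (statsd-style counter format).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def probe(name, value=1):
    payload = f"{name}:{value}|c".encode()  # assumed wire format
    try:
        sock.sendto(payload, ("127.0.0.1", 8125))  # assumed collector address
    except OSError:
        pass  # monitoring must never take the service down

probe("requests.processed")
```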

Queues help dynamically shape systems
Queue backlog growing? Spin up new workers!
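A hypothetical scaling loop built on that idea; queue_depth, start_worker, and stop_worker stand in for your broker’s stats API and your cloud’s provisioning API:

```python
# Hypothetical autoscaler: size the worker pool from the queue backlog.
TARGET_BACKLOG_PER_WORKER = 100  # assumption: what one worker can absorb

def autoscale(queue_depth, start_worker, stop_worker, workers):
    wanted = max(1, queue_depth() // TARGET_BACKLOG_PER_WORKER)
    while len(workers) < wanted:
        workers.append(start_worker())  # backlog grew: add capacity
    while len(workers) > wanted:
        stop_worker(workers.pop())      # backlog shrank: release capacity
```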
Degrade Gracefully
Embrace failure
Systems will fail. In ways you didn’t expect.
Avoid failure propagation
Implement backpressure to avoid killing loaded systems. Queues make great pressure valves.
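A minimal sketch of a bounded queue acting as the pressure valve, with explicit rejection once the bound is hit (names and the bound are illustrative):

```python
import queue

work = queue.Queue(maxsize=1000)  # the pressure valve: a hard bound

def submit(item):
    """Accept work if there is room, shed load otherwise."""
    try:
        work.put_nowait(item)
        return True
    except queue.Full:
        # backpressure: tell the caller "not now" instead of drowning
        return False
```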
Don’t give up
Use connection pooling and retry policies.
Best in class: finagle, cassandra-driver
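In the spirit of what finagle and cassandra-driver ship with, a minimal retry policy with exponential backoff and jitter (attempt count and base delay are assumptions):

```python
import random
import time

def with_retries(call, attempts=5, base=0.1):
    """Retry `call` on connection errors, backing off exponentially."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: give up loudly
            # jittered backoff avoids synchronized thundering herds
            time.sleep(base * (2 ** attempt) * random.random())
```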
Keep systems up
SQL down? No more account creations, still serving existing customers.
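A sketch of that partial degradation; sql_up stands in for a real health check, and the handlers are illustrative:

```python
sql_up = False  # imagine the database just went down

def create_account(name):
    if not sql_up:
        # degrade: refuse new signups, do not take the whole site down
        return (503, "account creation temporarily unavailable")
    return (201, f"created {name}")

def serve_customer(user):
    # the existing-customer path reads from caches/replicas and keeps working
    return (200, f"hello {user}")
```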
Prefer Concerned Citizens
All moving parts force new compromises
This is true of internal and external components
Choose components accordingly
You probably want an AP queueing system
So please avoid using MySQL as one!
Candidates: Apache Kafka, RabbitMQ, Redis (to a lesser extent)
Cache locally
Much higher aggregated cache capacity
No huge SPOF
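In-process caching is the simplest form of this; a sketch where fetch_profile stands in for the real storage call:

```python
from functools import lru_cache

def fetch_profile(user_id):
    ...  # stand-in for the real storage call

@lru_cache(maxsize=10_000)
def user_profile(user_id):
    # hits are served from this node's memory: every node adds cache
    # capacity, and no single cache server becomes a giant SPOF
    return fetch_profile(user_id)
```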
Choose your storage compromises
Object Storage, Distributed KV (eventual consistency), SQL (no P or A).
Configuration through service registries
Keep track of node volatility
Reprovision configuration on cluster topology changes
Load-balancers make a great interaction point (concentrate changes there)
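One sketch of that interaction point: poll the registry and reload the balancer only when membership changes; registry_members and reload_lb are stand-ins for your registry client and reload hook:

```python
import time

def converge(registry_members, reload_lb, interval=5):
    """Reprovision load-balancer config whenever topology changes."""
    known = set()
    while True:
        members = set(registry_members("web"))
        if members != known:
            reload_lb(sorted(members))  # regenerate config, reload the LB
            known = members
        time.sleep(interval)
```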
The service registry is critical
Ideally needs to be a strongly consistent, distributed system.
You already have an eventually consistent one: DNS!
Zookeeper and Etcd
Current best in class. They promote in-app usage as well as distributed locks, barriers, etc.
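For ZooKeeper, the kazoo client is one way to do this in-app; the host, paths, and payload below are assumptions for the sketch:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# ephemeral registration: the registry forgets us if the session drops
zk.create("/services/web/node-1", b"10.0.0.1:8080",
          ephemeral=True, makepath=True)

# distributed lock, e.g. to serialize a migration across the cluster
with zk.Lock("/locks/migration", "node-1"):
    pass  # only one node at a time runs this block
```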
Immutable Infrastructure
No more fixing nodes
Human intervention means configuration drift
● Configuration Drift? Reprovision node.
● New version of software? Reprovision node.
● Configuration file change? Reprovision node.
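A hypothetical sketch of the resulting workflow: nodes are rebuilt from a known image, never edited in place; create_node and destroy_node stand in for your cloud’s API:

```python
def roll(cluster, image, create_node, destroy_node):
    """Replace every node with a fresh one built from `image`."""
    for old in list(cluster):
        fresh = create_node(image)  # new node from the golden image
        cluster.append(fresh)       # put the replacement in rotation first
        cluster.remove(old)
        destroy_node(old)           # the old node is never "fixed" by hand
```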
Depart from using the machine as the base unit of reasoning
All nodes in a cluster should be equivalent
Looking Forward
The cluster is the computer
A new layer of abstraction
Virtual resources pooled and orchestrated
Generic platform abstractions
PaaS solutions are a commodity (cf. OpenShift)
Generic scheduling and failover frameworks (Mesos, Kubernetes Operators)