Slide 1

@pyr From Vertical to Horizontal: The challenges of scalability in the cloud

Slide 2

@pyr Four-line bio ● CTO & co-founder at Exoscale ● Open Source Developer ● Monitoring & Distributed Systems Enthusiast ● Linux since 1997

Slide 3

@pyr Scalability “The ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth” - Wikipedia

Slide 4

@pyr Scalability ● Culture ● Organization and Process ● Technical Architecture ● Operations

Slide 5

@pyr Scalability ● Culture ● Organization and Process ● Technical Architecture ● Operations

Slide 6

@pyr ● Scaling Geometry ● Recent History ● Enter the cloud ● Distributed Headaches ● Architecture Drivers ● Looking forward

Slide 7

Quick Notes ● “Cloud” an umbrella term ● Here conflated with public IAAS ● Oriented toward web application design

Slide 8

@pyr Scaling Geometry: Vertical, Horizontal, and Diagonal

Slide 9

@pyr Vertical (scaling up) Adding resources to a single system

Slide 10

@pyr Vertical (scaling up) This is how you typically approach scaling MySQL

Slide 11

@pyr

Slide 12

@pyr Horizontal (scaling out) Accommodate growth by spreading workload over several systems

Slide 13

@pyr Horizontal (scaling out) Typical approach to scaling web servers

Slide 14

@pyr

Slide 15

@pyr

Slide 16

@pyr Diagonal. Most common strategy: vertical first, and then horizontal

Slide 17

@pyr Recent History: Leading up to IAAS

Slide 18

No content

Slide 19

@pyr Whenever possible, a great approach

Slide 20

@pyr So, why stop?

Slide 21

No content

Slide 22

@pyr Moore’s law “Over the history of computing, the number of transistors on integrated circuits doubles approximately every two years.”

Slide 23

@pyr Average core speed has been stable for several years. Consistent increase in cores per node.

Slide 24

@pyr “You mean I have to use threads?”

Slide 25

@pyr Vertical Scaling Challenges (424 pages)

Slide 26

@pyr Vertical Scaling Challenges: Threads?

Slide 27

@pyr No more automatic vertical approach

Slide 28

@pyr Meanwhile...

Slide 29

@pyr “What if I put an API on it?”

Slide 30

@pyr Enter: the Cloud

Slide 31

@pyr ● IT as a utility ● Programmable resources ● Decoupling of storage from system resources ● Usage-based billing model

Slide 32

Upside

Slide 33

@pyr ● Much lower capacity planning overhead ● OPEX makes the accounting department happy ● Nobody likes to change disks or rack servers

Slide 34

@pyr ● Switches? gone. ● VLANs? gone. ● IP allocation and translation? gone. ● OS partitioning? gone. ● OS RAID management? gone.

Slide 35

@pyr

Slide 36

@pyr provider "exoscale" { api_key = "${var.exoscale_api_key}" secret_key = "${var.exoscale_secret_key}" } resource "exoscale_instance" "web" { template = "ubuntu 17.04" disk_size = "50g" template = "ubuntu 17.04" profile = "medium" ssh_key = "production" }

Slide 37

Downside

Slide 38

@pyr “There is no cloud, there is just someone else’s computer”

Slide 39

@pyr ● It’s hard to break out of the big iron mental model ● It’s hard to change our trust model ○ “I want to be able to see my servers!” ● There is still an upper limit on node size ● Horizontal-first approach to building infrastructure

Slide 40

@pyr Distributed Headaches

Slide 41

@pyr Two nodes interacting imply a distributed system. Fewer SPOFs, but many more failure scenarios.

Slide 42

@pyr Distributed systems are subject to Brewer's CAP theorem: you cannot enjoy all three of Consistency, Availability, and Partition tolerance.

Slide 43

@pyr ● Consistency: Simultaneous requests see a consistent set of data ● Availability: Each incoming request is acknowledged and receives a success or failure response ● Partition Tolerance: The system will continue to process incoming requests in the face of failures

Slide 44

@pyr Architecture Drivers: Reducing complexity to focus on higher-order problems

Slide 45

@pyr ● Inspectable services ● Queues over RPC ● Degrade gracefully ● Prefer concerned citizens ● Configuration from a service registry ● Nodes as immutable data structures

Slide 46

@pyr Inspectable services

Slide 47

@pyr Build introspection within services: count acknowledged, processed, and failed requests; time actions to quickly identify hotspots.

Slide 48

@pyr Avoid the monitor effect: small, unobtrusive probes. UDP is often sufficient.

Slide 49

@pyr Leverage proven, existing tools: Collectd, Syslog-NG, Statsd, Riemann

Slide 50

@pyr
@wrap_riemann('activate-account')
def activate_account(self, id):
    self.accounts.by_id(id).try_activate()
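
The wrap_riemann helper itself is not shown in the deck. A minimal sketch of such a timing-and-counting decorator, emitting statsd-style metrics over UDP with only the Python standard library (the 127.0.0.1:8125 address and metric names are assumptions, and this is not the actual Riemann client), could look like this:

# Hypothetical sketch, not the real wrap_riemann: time a call and ship the
# outcome as small fire-and-forget UDP metrics in statsd line format.
import socket
import time
from functools import wraps

STATSD_ADDR = ("127.0.0.1", 8125)  # assumed local statsd/Riemann UDP listener
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def wrap_metrics(name):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                outcome = "success"
                return result
            except Exception:
                outcome = "failure"
                raise
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                # one counter per outcome, one timer per action
                _sock.sendto(f"{name}.{outcome}:1|c".encode(), STATSD_ADDR)
                _sock.sendto(f"{name}:{elapsed_ms:.1f}|ms".encode(), STATSD_ADDR)
        return wrapper
    return decorator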

Slide 51

@pyr Queues over RPC

Slide 52

@pyr RPC couples systems. Your service's CAP properties are tied to the RPC provider.

Slide 53

@pyr Take responsibility out of the callee as soon as possible. Textbook example: SMTP.

Slide 54

@pyr Queues promote statelessness
{
  "request_id": "97d4f7b3",
  "host_id": "64e4-41b5",
  "action": "mailout",
  "recipients": ["foo@example.com"],
  "content": "..."
}
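
To make the hand-off concrete, here is a minimal producer/consumer sketch around a message like the one above, assuming Redis (via the redis-py client) as the broker; the queue name "mailout" and the send_mail helper are purely illustrative:

# Sketch only: the caller enqueues the job and returns immediately; a
# stateless worker picks it up later. Assumes a local Redis and redis-py.
import json
import socket
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

def request_mailout(recipients, content):
    """Enqueue a mailout job instead of performing a blocking RPC."""
    job = {
        "request_id": uuid.uuid4().hex[:8],
        "host_id": socket.gethostname(),
        "action": "mailout",
        "recipients": recipients,
        "content": content,
    }
    r.lpush("mailout", json.dumps(job))
    return job["request_id"]

def worker_loop():
    """Everything the worker needs travels inside the message itself."""
    while True:
        _, raw = r.brpop("mailout")
        job = json.loads(raw)
        send_mail(job["recipients"], job["content"])  # hypothetical mail helper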

Slide 55

@pyr Queues help dynamically shape systems. Queue backlog growing? Spin up new workers!
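
A naive sketch of that feedback loop, reusing the Redis queue from the previous example; provision_worker() stands in for whatever provider API call actually spins up a new worker node:

# Watch the backlog and add capacity when it grows past a threshold.
# Threshold, interval, and provision_worker() are illustrative placeholders.
import time

BACKLOG_THRESHOLD = 1000
CHECK_INTERVAL = 30  # seconds

def autoscale_loop(queue_name="mailout"):
    while True:
        backlog = r.llen(queue_name)  # r: Redis client from the sketch above
        if backlog > BACKLOG_THRESHOLD:
            provision_worker()  # hypothetical: create one more worker instance
        time.sleep(CHECK_INTERVAL)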

Slide 56

@pyr Degrade Gracefully

Slide 57

@pyr Embrace failure. Systems will fail. In ways you didn't expect.

Slide 58

@pyr Avoid failure propagation. Implement backpressure to avoid killing loaded systems. Queues make great pressure valves.
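
One minimal illustration of a pressure valve is a bounded in-process queue from the standard library: when it fills up, the caller gets an immediate back-off signal instead of silently piling more work onto an overloaded system (the capacity of 500 is an arbitrary example):

# Bounded queue as a pressure valve: reject new work with a clear signal
# once the buffer is full, rather than letting load kill the system.
import queue

work = queue.Queue(maxsize=500)  # assumed capacity; tune to the workload

def accept_request(item):
    try:
        work.put_nowait(item)
        return "accepted"
    except queue.Full:
        # Backpressure propagates to the caller (e.g. HTTP 503 + Retry-After)
        return "overloaded, retry later"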

Slide 59

@pyr Don't give up. Use connection pooling and retry policies. Best in class: finagle, cassandra-driver.
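
In the same spirit as the policies shipped by finagle or cassandra-driver, a minimal retry helper with exponential backoff might look like this (attempt count and delays are arbitrary example values):

# Simple retry policy sketch: exponential backoff, capped number of attempts.
# Real drivers ship far richer policies (jitter, idempotency checks, etc.).
import time

def with_retries(fn, attempts=5, base_delay=0.1):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, give up
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...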

Slide 60

@pyr Keep systems up. SQL down? No more account creations, still serving existing customers.
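
A sketch of that kind of partial degradation, with hypothetical helper names: writes that need SQL are refused with a clear error while the read path keeps serving:

# Degrade instead of failing completely: refuse new signups cleanly when the
# relational store is down, but keep serving existing users from a cache.
class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) SQL layer when it cannot be reached."""

accounts_cache = {}  # stand-in for a local cache or read replica

def create_account(request, sql_insert):
    """sql_insert is whatever function performs the actual SQL write."""
    try:
        return sql_insert(request)
    except DatabaseUnavailable:
        return {"status": 503, "error": "signups temporarily unavailable"}

def get_account(account_id):
    # The read path keeps working even while the write path is degraded.
    return accounts_cache.get(account_id)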

Slide 61

@pyr Prefer Concerned Citizens

Slide 62

@pyr All moving parts force new compromises. This is true of internal and external components.

Slide 63

@pyr Choose components accordingly

Slide 64

@pyr You probably want an AP queueing system, so please avoid using MySQL as one! Candidates: Apache Kafka, RabbitMQ, Redis (to a lesser extent).

Slide 65

@pyr Cache locally. Much higher aggregated cache capacity. No huge SPOF.
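
A per-node cache can be as small as a TTL dictionary inside the process; a minimal sketch (the 60-second TTL is an arbitrary example):

# Tiny per-process TTL cache: every node keeps its own copy, so aggregate
# cache capacity grows with the cluster and no single cache node is a SPOF.
import time

_cache = {}

def cached(key, compute, ttl=60.0):
    """Return a cached value for key, recomputing it after ttl seconds."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and now - hit[1] < ttl:
        return hit[0]
    value = compute()
    _cache[key] = (value, now)
    return value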

Slide 66

@pyr Choose your storage compromises: Object Storage, Distributed KV (eventual consistency), SQL (no P or A).

Slide 67

@pyr Configuration through service registries

Slide 68

@pyr Keep track of node volatility. Reprovision configuration on cluster topology changes. Load balancers make a great interaction point (concentrate changes there).

Slide 69

@pyr The service registry is critical. It ideally needs to be a strongly consistent, distributed system. You already have an eventually consistent one: DNS!

Slide 70

@pyr Zookeeper and Etcd: current best in class. Both promote in-app usage as well as distributed locks, barriers, etc.
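
As an in-app illustration, here is a minimal service-registration sketch against ZooKeeper through the kazoo client; the hosts, paths, and address payload are assumptions for the example:

# Register this node under an ephemeral znode and watch cluster topology.
# Ephemeral nodes vanish when the session dies, so the registry tracks only
# live members. Hosts, paths, and the address payload are illustrative.
import socket
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

zk.ensure_path("/services/web")
zk.create("/services/web/" + socket.gethostname(),
          b"10.0.0.12:8080",
          ephemeral=True)

@zk.ChildrenWatch("/services/web")
def on_topology_change(members):
    # e.g. regenerate the load-balancer configuration from the member list
    print("web cluster members:", members)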

Slide 71

@pyr Immutable Infrastructure

Slide 72

@pyr No more fixing nodes. Human intervention means configuration drift.

Slide 73

@pyr ● Configuration Drift? Reprovision node. ● New version of software? Reprovision node. ● Configuration file change? Reprovision node.

Slide 74

@pyr Depart from using the machine as the base unit of reasoning. All nodes in a cluster should be equivalent.

Slide 75

@pyr Looking Forward: The cluster is the computer

Slide 76

@pyr A new layer of abstraction: virtual resources pooled and orchestrated

Slide 77

@pyr Generic platform abstractions: PAAS solutions are a commodity (cf. OpenShift); generic scheduling and failover frameworks (Mesos, Kubernetes Operators).

Slide 78

@pyr Thanks! Questions?