Slide 1


@kavya719 Practical Performance Theory

Slide 2


kavya

Slide 3


applying performance theory to practice

Slide 4


performance capacity
• What’s the additional load the system can support, without degrading response time?
• What are the system’s utilization bottlenecks?
• What’s the impact of a change on response time, maximum throughput?
• How many additional servers are needed to support 10x load?
• Is the system over-provisioned?

Slide 5


#YOLO method load simulation
Stressing the system to empirically determine actual performance characteristics, bottlenecks. Can be incredibly powerful.
performance modeling

Slide 6


performance modeling
real-world system → (model as*) → theoretical model → (analyze) → results → (translate back)
* the model makes assumptions about the system: request arrival rate, service order, service times. You cannot apply the results if your system does not satisfy them!

Slide 7


a single server
  open, closed queueing systems
  utilization law, the P-K formula, Little’s law
  CoDel, adaptive LIFO
a cluster of many servers
  the USL
  scaling bottlenecks
stepping back
  the role of performance modeling

Slide 8


a single server

Slide 9


model I
clients → web server
“how can we improve the mean response time?”
“what’s the maximum throughput of this server, given a response time target?”
[graph: response time (ms) vs. throughput (requests / second), with a response time threshold]

Slide 10


model the web server as a queueing system.
[diagram: request → web server → response]
queueing delay + service time = response time

Slide 11


model the web server as a queueing system.
[diagram: request → web server → response; queueing delay + service time = response time]
assumptions
1. requests are independent and random, and arrive at some “arrival rate”.
2. requests are processed one at a time, in FIFO order; requests queue if the server is busy (“queueing delay”).
3. the “service time” of a request is constant.


Slide 13


model the web server as a queueing system.
[diagram: request → web server → response; queueing delay + service time = response time]
assumptions
1. requests are independent and random, and arrive at some “arrival rate”.
2. requests are processed one at a time, in FIFO order; requests queue if the server is busy (“queueing delay”).
3. the “service time” of a request, i.e. the request size, is constant.

Slide 14


“What’s the maximum throughput of this server?” i.e. given a response time target

Slide 15


“What’s the maximum throughput of this server?” i.e. given a response time target
arrival rate increases → server utilization increases

Slide 16


“What’s the maximum throughput of this server?” i.e. given a response time target
arrival rate increases → server utilization increases linearly
Utilization law: utilization (“busyness”) = arrival rate * service time
[graph: utilization vs. arrival rate, linear]
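
A quick sanity check of the utilization law, as a minimal Python sketch with made-up numbers:

    # utilization = arrival rate * service time: the fraction of time
    # the server is busy, linear in the arrival rate.
    service_time_s = 0.010                    # 10 ms per request
    for arrival_rate in (20, 50, 80):         # requests / second
        print(arrival_rate, arrival_rate * service_time_s)   # 0.2, 0.5, 0.8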

Slide 17


“What’s the maximum throughput of this server?” i.e. given a response time target
arrival rate increases → server utilization increases linearly (Utilization law)
→ P(request has to queue) increases, so mean queue length increases, so mean queueing delay increases.

Slide 18


“What’s the maximum throughput of this server?” i.e. given a response time target
arrival rate increases → server utilization increases linearly (Utilization law)
→ P(request has to queue) increases, so mean queue length increases, so mean queueing delay increases (P-K formula).

Slide 19


Pollaczek-Khinchine (P-K) formula

mean queueing delay = (U * mean service time * (service time variability)²) / (1 - U)

assuming constant service time and so, constant request sizes:

mean queueing delay ∝ U / (1 - U)

[graph: queueing delay vs. utilization (U), growing non-linearly as U → 1]
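
As a minimal runnable sketch (not from the talk), here is the standard M/G/1 form of the P-K formula; the coefficient of variation cv plays the role of “service time variability”, and cv = 0 matches the constant-service-time assumption:

    def pk_mean_queueing_delay(utilization, mean_service_time_ms, cv=0.0):
        # M/G/1 P-K formula: delay = U * E[S] * (1 + cv^2) / (2 * (1 - U)).
        # Either way, delay is proportional to U / (1 - U).
        assert 0 <= utilization < 1, "the queue is unstable at U >= 1"
        return (utilization * mean_service_time_ms * (1 + cv**2)
                / (2 * (1 - utilization)))

    # Delay blows up non-linearly as U approaches 1:
    for u in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(u, round(pk_mean_queueing_delay(u, mean_service_time_ms=10), 1))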

Slide 20


Pollaczek-Khinchine (P-K) formula

mean queueing delay = (U * mean service time * (service time variability)²) / (1 - U)

assuming constant service time and so, constant request sizes:

mean queueing delay ∝ U / (1 - U)

since response time = queueing delay + service time, response time ∝ queueing delay.

[graphs: queueing delay vs. utilization (U); response time vs. utilization (U)]

Slide 21


“What’s the maximum throughput of this server?” i.e. given a response time target
arrival rate increases → server utilization increases linearly (Utilization law)
→ mean queueing delay increases non-linearly (P-K formula); so, response time does too.
[graph: response time (ms) vs. throughput (requests / second), flat in the low utilization regime]

Slide 22


“What’s the maximum throughput of this server?” i.e. given a response time target
arrival rate increases → server utilization increases linearly (Utilization law)
→ mean queueing delay increases non-linearly (P-K formula); so, response time does too.
[graph: response time (ms) vs. throughput (requests / second): flat in the low utilization regime, rising sharply in the high utilization regime; max throughput is where the curve crosses the response time target]

Slide 23


“How can we improve the mean response time?”

Slide 24


“How can we improve the mean response time?”

1. response time ∝ queueing delay: prevent requests from queueing too long.
• Controlled Delay (CoDel), in Facebook’s Thrift framework.
• adaptive or always LIFO, in Facebook’s PHP runtime and Dropbox’s Bandaid reverse proxy.
• set a max queue length: when the queue is full, drop incoming requests.
• client-side timeouts and back-off.

Slide 25


“How can we improve the mean response time?”

1. response time ∝ queueing delay: prevent requests from queueing too long.

key insight: queues are typically empty; this allows short bursts, prevents standing queues.

onNewRequest(req, queue):
  if (queue.lastEmptyTime() < (now - N ms)) {
    // Queue was last empty more than N ms ago;
    // set timeout to M << N ms.
    timeout = M ms
  } else {
    // Else, set timeout to N ms.
    timeout = N ms
  }
  queue.enqueue(req, timeout)

(a runnable sketch of this pseudocode follows after the list below)

• Controlled Delay (CoDel), in Facebook’s Thrift framework.
• adaptive or always LIFO, in Facebook’s PHP runtime and Dropbox’s Bandaid reverse proxy.
• set a max queue length: when the queue is full, drop incoming requests.
• client-side timeouts and back-off.

Slide 26


“How can we improve the mean response time?”

1. response time ∝ queueing delay: prevent requests from queueing too long.

• Controlled Delay (CoDel), in Facebook’s Thrift framework.
  key insight: queues are typically empty; allows short bursts, prevents standing queues.
• adaptive or always LIFO, in Facebook’s PHP runtime and Dropbox’s Bandaid reverse proxy.
  serves the newest requests first, not old requests that are likely to expire.
  helps when the system is overloaded, makes no difference when it’s not.
• set a max queue length: when the queue is full, drop incoming requests.
• client-side timeouts and back-off.

Slide 27


“How can we improve the mean response time?”

2. response time ∝ queueing delay (P-K formula):
mean queueing delay = (U * mean service time * (service time variability)²) / (1 - U)

• decrease the service time, by optimizing application code.
• decrease request / service size variability, for example by batching requests.

Slide 28


model II

the cloud industry site: a server processes data from N sensors.

each sensor runs:

while true:
    // upload synchronously.
    ack = upload(data)
    // update state,
    // sleep for Z seconds.
    deleteUploaded(ack)
    sleep(Z seconds)
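
A minimal runnable version of that client loop in Python (read_data, upload, and delete_uploaded are hypothetical helpers passed in, since the talk doesn’t specify them):

    import time

    Z_SECONDS = 5   # think time between uploads

    def sensor_loop(read_data, upload, delete_uploaded):
        while True:
            data = read_data()
            # upload synchronously: the next request can only start
            # after this response arrives.
            ack = upload(data)
            # update state, then sleep for Z seconds.
            delete_uploaded(ack)
            time.sleep(Z_SECONDS)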

Slide 29


model II: a closed system

[diagram: server ⇄ N clients, request / response]

• requests are synchronized.
• fixed number of clients.

so:
• throughput inversely depends on response time!
• queue length is bounded (<= N), so response time is bounded!

This is called a closed system. It is super different from the previous web server model (an open system).

Slide 30


response time vs. load for closed systems

Slide 31


response time vs. load for closed systems

assuming sleep time (“think time”) is constant and service time is constant, when the number of clients (N) increases:

Slide 32


response time vs. load for closed systems

assuming sleep time (“think time”) is constant and service time is constant, when the number of clients (N) increases:

low utilization regime: response time stays ~same.
high utilization regime: response time grows linearly with N, by Little’s law (see addendum for details).

[graph: response time for a closed system vs. number of clients, linear in the high utilization regime]

Slide 33


response time vs. load for closed systems

way different than for an open system:

[graphs: closed system, response time vs. number of clients, growing linearly in the high utilization regime; open system, response time vs. arrival rate, blowing up in the high utilization regime]

Slide 34


uh oh…

open v/s closed systems

closed systems are very different from open systems in:
• how throughput relates to response time.
• response time versus load, especially in the high load regime.

Slide 35


open v/s closed systems

standard load simulators typically mimic closed systems …but the system with real users may not be one!

So, load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• smaller effects of different scheduling policies
• other differences you probably don’t want to find out in production…

A couple of neat papers on the topic, with workarounds: Open Versus Closed: A Cautionary Tale; How to Emulate Web Traffic Using Standard Load Testing Tools.
for example: scale “think time” along with the number of virtual clients, s.t. the ratio remains constant (see the sketch below).
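
A minimal sketch of that workaround, with made-up baseline numbers: holding N / think-time constant keeps the offered request rate fixed as virtual clients scale, approximating an open system:

    BASE_CLIENTS = 10
    BASE_THINK_S = 1.0

    def think_time_for(n_clients):
        # scale think time proportionally with N, so the offered rate
        # n_clients / think_time stays at BASE_CLIENTS / BASE_THINK_S.
        return BASE_THINK_S * (n_clients / BASE_CLIENTS)

    for n in (10, 50, 100):
        print(n, think_time_for(n), n / think_time_for(n))  # rate stays 10/s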

Slide 36


a cluster of servers

Slide 37


[diagram: clients → load balancer → cluster of web servers]

“How many servers do we need to support a target throughput?” (while keeping response time the same) → capacity planning!

“How can we improve how the system scales?” → scalability

Slide 38


“How many servers do we need to support a target throughput?” while keeping response time the same

max throughput of a cluster of N servers = max single-server throughput * N?

no, systems don’t scale linearly:
• contention penalty (the αN term)
  due to serialization for shared resources.
  examples: database contention, lock contention.
• crosstalk penalty
  due to coordination for coherence.
  examples: servers coordinating to synchronize mutable state.

Slide 39


“How many servers do we need to support a target throughput?” while keeping response time the same

max throughput of a cluster of N servers = max single-server throughput * N?

no, systems don’t scale linearly:
• contention penalty (the αN term)
  due to serialization for shared resources.
  examples: database contention, lock contention.
• crosstalk penalty (the βN² term)
  due to coordination for coherence.
  examples: servers coordinating to synchronize mutable state.

Slide 40


Universal Scalability Law (USL)

throughput of N servers = N / (αN + βN² + C)

[graph: throughput vs. cluster size, three curves]
• N / C: linear scaling
• N / (αN + C): contention
• N / (αN + βN² + C): contention and crosstalk
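
A minimal Python sketch of that curve (the coefficients below are made up for illustration):

    def usl_throughput(n, alpha, beta, c):
        # alpha: contention penalty, beta: crosstalk penalty,
        # c: normalizing constant for single-server throughput.
        return n / (alpha * n + beta * n**2 + c)

    # Throughput peaks and then *degrades* as the crosstalk term
    # (beta * N^2) comes to dominate:
    for n in (1, 8, 32, 64, 128):
        print(n, round(usl_throughput(n, alpha=0.02, beta=0.0005, c=1.0), 1))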

Slide 41


“How can we improve how the system scales?”

Avoid contention (serialization) and crosstalk (synchronization).
• smarter data partitioning, smaller partitions: see Facebook’s TAO cache paper.
• smarter aggregation: see Facebook’s SCUBA data store paper.
• better load balancing strategies: best of two random choices (see the sketch below).
• fine-grained locking
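
A minimal sketch (an illustration, not Facebook’s code) of “best of two random choices”: sample two servers at random and pick the less loaded one, which avoids the crosstalk of polling every server’s load:

    import random

    def pick_server(loads):
        # loads[i] = outstanding requests on server i (hypothetical metric)
        a, b = random.sample(range(len(loads)), 2)
        return a if loads[a] <= loads[b] else b

    loads = [3, 0, 7, 2, 5]
    chosen = pick_server(loads)
    loads[chosen] += 1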

Slide 42


“How can we improve how the system scales?”

Avoid contention (serialization) and crosstalk (synchronization).
• smarter data partitioning, smaller partitions: see Facebook’s TAO cache paper.
• smarter aggregation: see Facebook’s SCUBA data store paper.
• better load balancing strategies: best of two random choices.
• fine-grained locking
• eventually consistent datastores
• etc.

Slide 43


stepping back

Slide 44


the role of performance modeling

“empiricism is queen.”

performance modeling is not a replacement for empirical analysis, i.e. load simulation, benchmarks, experiments…
…and modeling requires assumptions that may be difficult to practically validate.

Slide 45


the role of performance modeling

But modeling gives us a rigorous framework for:

informed experimentation
• determine what experiments to run: run experiments to get data to fit the USL, response time curves.
• interpret and evaluate the results: e.g. why load simulations predicted better results than your system shows.

strategic performance work
• predict future system behavior: how load may affect performance, scalability.
• make highest-impact improvements: improve mean service time, reduce service time variability, remove crosstalk, etc.

Slide 46


the role of performance modeling: most useful in conjunction with empirical analysis.

Slide 47


empiricism is queen. empiricism grounded in theory is queen.

Slide 48


empiricism is queen. empiricism grounded in theory is queen.

@kavya719
speakerdeck.com/kavya719/practical-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

Slide 49


References

Performance Modeling and Design of Computer Systems, Mor Harchol-Balter.
Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz.
Open Versus Closed: A Cautionary Tale.
How to Emulate Web Traffic Using Standard Load Testing Tools.
A General Theory of Computational Scalability Based on Rational Functions.
Queuing Theory, In Practice.
Fail at Scale.
Kraken: Leveraging Live Traffic Tests.
SCUBA: Diving into Data at Facebook.

Special thanks to Eben Freeman for reading drafts of this.

@kavya719
speakerdeck.com/kavya719/practical-performance-theory

Slide 50


addendum

The open system model used is called an M/D/1 system in Kendall notation: we assumed a Poisson arrival process (“M” for memoryless), a deterministic service time distribution (“D”), and a single server (“1”) with an infinite buffer and a First-Come-First-Serve service discipline. The P-K formula assumes a memoryless arrival process and cannot be applied otherwise. In the closed system, load can also be increased by decreasing think time.

Slide 51


On CoDel at Facebook: “An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases.”

On LIFO at Facebook: LIFO is used to select the thread to run next, to reduce mutex, cache thrashing, and context switching overhead.

Slide 52


Experiment: improvements based on the P-K formula

2. response time ∝ queueing delay (P-K formula):
mean queueing delay = (U * mean service time * (service time variability)²) / (1 - U)

• decrease the service time, by optimizing application code (“optimized”).
• decrease request / service size variability, for example by batching requests (“batched”).

[graph: measured response time curves, labeled “optimized” and “batched”]

Slide 53


Derivation: response time vs. load for closed systems

assumptions
1. sleep time (“think time”) is constant.
2. requests are processed one at a time, in FIFO order.
3. service time is constant.

Like earlier, as the number of clients (N) increases, throughput increases to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

[graph: throughput vs. number of clients, with low and high utilization regimes]

What happens to response time in this (high utilization) regime?

Slide 54


Little’s Law for closed systems

the system in this case is the entire loop, i.e. the N clients and the server.

a request can be in one of three states in the system: sleeping (on the device), waiting (in the server queue), or being processed (in the server).

the total number of requests in the system includes requests across the states.

Slide 55


Little’s Law for closed systems

# requests in system = throughput * round-trip time of a request across the whole system

round-trip time = sleep time + response time
response time = queueing delay + service time

applying it in the high utilization regime (constant throughput) and assuming constant sleep time:

N = constant * response time

So, response time only grows linearly with N!
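
A worked example of that rearrangement, with made-up numbers: once the server saturates, throughput is pinned, so Little’s law forces response time to grow linearly with N:

    MAX_THROUGHPUT = 100.0   # requests / second at saturation
    SLEEP_S = 1.0            # constant think time

    def response_time_s(n_clients):
        # Little's law: N = X * (sleep + response)
        # => response = N / X - sleep
        return n_clients / MAX_THROUGHPUT - SLEEP_S

    for n in (150, 200, 400):
        print(n, response_time_s(n))   # 0.5, 1.0, 3.0: linear in N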

Slide 56


response time vs. load for closed systems

Like earlier, as the number of clients (N) increases, throughput increases to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

So, response time for a closed system:
low utilization regime: response time stays ~same.
high utilization regime: grows linearly with N.

[graph: response time vs. number of clients, linear in the high utilization regime]

Slide 57


response time vs. load for closed systems

Like earlier, as the number of clients (N) increases, throughput increases to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

So, response time for a closed system:
low utilization regime: response time stays ~same.
high utilization regime: grows linearly with N.

way different than for an open system:
[graphs: closed system, response time vs. number of clients; open system, response time vs. arrival rate; both marked with the high utilization regime]

Slide 58


Example: using performance theory to evaluate the results of a load test

load simulation results with increasing number of virtual clients (N) = 1, …, 100:
wrong shape for the response time curve! it should look like one of the two curves above (open or closed) …the load simulator hit a bottleneck.

[graphs: expected response time vs. number of clients curves; the observed, wrong-shaped curve]
