Slide 1

The Tail at Scale
Jeffrey Dean & Luiz André Barroso
https://research.google/pubs/pub40801/

Piyush Goel
@pigol1
[email protected]

Slide 2

What is Tail Latency?
● The last 0.X% of the request latency distribution.
● In general, we can take the slowest 1% of response times (the 99th-percentile response time) as the tail latency of a request.
  ○ May vary depending on the scale of the system
  ○ InMobi - 8B ad requests/day
  ○ Capillary - 500M requests/day
● Responsive/Interactive Systems
  ○ Better user experience -- higher engagement
  ○ Monetary value
  ○ 100ms -- large information retrieval systems
  ○ Reads outnumber the writes!

Slide 3

Why should we care about Tail Latency?
● Tail latency becomes important as the scale of the system increases.
  ○ 10k requests/day → 100M requests/day
  ○ Slowest 1% of queries: 100 requests → 1M requests
● "Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts, large online services need to create a predictably responsive whole out of less-predictable parts; we refer to such systems as 'latency tail-tolerant,' or simply 'tail-tolerant.'"
● If we remain ignorant about tail latency, it will eventually come back to bite us.
● Many tail-tolerant methods can leverage the infrastructure already provided for fault tolerance - better utilisation.

Slide 4

Why does such Variability Exist?
● Software Variability
  ○ Shared Resources
    ■ Co-hosted applications
      ● Containers can provide isolation.
      ● Within-app contention still exists.
  ○ Daemons
    ■ Background or batch jobs running periodically.
  ○ Global Resource Sharing
    ■ Network switches, shared file systems, shared disks.
  ○ Maintenance activities (log compaction, data shuffling, etc.)
  ○ Queueing (network buffers, OS kernels, intermediate hops)
● Hardware Variability
  ○ Power limits, disk capacity, etc.
● Garbage Collection
● Energy Management

Slide 5

Variability is amplified by Scale!
● At large scale, the number of components or parallel operations per request increases (fan-out).
  ○ Micro-services
  ○ Data partitions
● This further increases the overall latency of the request.
  ○ Overall latency ≥ latency of the slowest component
● Server with 1 ms avg. but 1 sec 99th-percentile latency:
  ○ 1 server: 1% of requests take ≥ 1 sec
  ○ 100 servers: 63% of requests take ≥ 1 sec
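
A quick back-of-the-envelope check of the last bullet - a minimal sketch that assumes each server lands in its slow tail independently, using the 1% / 100-server figures from this slide:

```python
# Probability that at least one of N fanned-out calls lands in the slow tail,
# assuming each server is slow independently with probability p.
def p_request_slow(p_server_slow: float, fan_out: int) -> float:
    return 1.0 - (1.0 - p_server_slow) ** fan_out

print(p_request_slow(0.01, 1))    # 0.01  -> 1% of requests see the 1-sec tail
print(p_request_slow(0.01, 100))  # ~0.63 -> ~63% of requests see the 1-sec tail
```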

Slide 6

The effect of large fan-out on Latency Distributions.

Slide 7

At Google scale - with a 10ms 99th-percentile latency for any single request, the 99th percentile for all requests to finish is 140ms, and the 95th percentile is 70ms.
  ○ Meaning that waiting for the slowest 5% of the requests to complete is responsible for half of the total 99th-percentile latency.
Techniques that concentrate on these slow outliers can yield dramatic reductions in overall service performance.
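
A small simulation illustrating the same effect. The latency model below is an illustrative assumption, not Google's measured distribution, so the absolute numbers will differ from the 140ms/70ms figures quoted above; what it shows is how the tail of a 100-way fan-out dwarfs the tail of a single leaf:

```python
import random

# Toy latency model: a leaf usually answers in a few ms, but ~1% of calls
# land in a long tail (all values here are illustrative assumptions).
def leaf_latency_ms() -> float:
    if random.random() < 0.01:
        return random.uniform(50.0, 200.0)   # the slow tail
    return random.expovariate(1.0 / 4.0)     # typical fast response

def percentile(samples, q):
    samples = sorted(samples)
    return samples[int(q * (len(samples) - 1))]

single = [leaf_latency_ms() for _ in range(100_000)]
fanout = [max(leaf_latency_ms() for _ in range(100)) for _ in range(10_000)]

print("single leaf p99     :", round(percentile(single, 0.99), 1), "ms")
print("100-way fan-out p99 :", round(percentile(fanout, 0.99), 1), "ms")
print("100-way fan-out p95 :", round(percentile(fanout, 0.95), 1), "ms")
```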

Slide 8

Can we control Variability?

Slide 9

Reducing Component Variability
● Some practices that can be used to reduce variability:
● Prioritize interactive or real-time requests over background requests (see the sketch after this list).
  ○ Differentiate service classes and prefer higher-level queuing (Google File System).
  ○ AWS uses something similar, albeit for load shedding [Ref#2].
● Reduce head-of-line blocking.
  ○ Break long-running requests into a sequence of smaller requests to allow interleaving of the execution of other short-running requests (e.g., search queries).
● Manage background activities and synchronized disruption.
  ○ Throttling, breaking down heavyweight operations into smaller steps.
  ○ For large fan-out services, it is sometimes useful for the system to synchronize the background activity across many different machines. This synchronization enforces a brief burst of activity on each machine simultaneously, slowing only those interactive requests being handled during the brief period of background activity. In contrast, without synchronization, a few machines are always doing some background activity, pushing out the latency tail on all requests.
● Caching doesn't reduce variability much unless the entire working set resides in the cache.
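
A minimal sketch of the first idea - differentiated service classes - using assumed names; real systems implement this inside the RPC layer or storage server rather than in application code:

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1            # lower value = higher priority

class ClassedQueue:
    """Two-level service-class queue: interactive work jumps ahead of batch work."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # preserves FIFO order within a class

    def put(self, service_class: int, request) -> None:
        heapq.heappush(self._heap, (service_class, next(self._seq), request))

    def get(self):
        _, _, request = heapq.heappop(self._heap)
        return request

q = ClassedQueue()
q.put(BATCH, "log-compaction chunk")
q.put(INTERACTIVE, "user search query")
print(q.get())   # -> "user search query": interactive work is served first
```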

Slide 10

Variability is Inevitable!

Slide 11

Living with Variability!
● It is ~impossible to eliminate all latency variability -- especially in the cloud era!
● Develop tail-tolerant techniques that mask or work around temporary latency pathologies, instead of trying to eliminate them altogether.
● Classes of tail-tolerant techniques:
  ○ Within-Request / Immediate-Response Adaptations
    ■ Focus on reducing variation within a single request path.
    ■ Time scale - ~10ms
  ○ Cross-Request Long-Term Adaptations
    ■ Focus on holistic measures to reduce the tail at the system level.
    ■ Time scale - 10 seconds and above.

Slide 12

Within Request Immediate Response Adaptations

Slide 13

Within Request Immediate Response Adaptations
● Cope with a slow subsystem in the context of a higher-level request.
● Time scale == right now
  ○ The user is waiting.
● Rely on multiple replicas, already provisioned for additional throughput capacity.
  ○ And for availability in the presence of failures.
  ○ This approach is particularly effective when most requests operate on largely read-only, loosely consistent datasets.
  ○ Spelling-correction service - written once, read millions of times.
● Two techniques based on how replication (of requests & data) can be used to reduce variability within a single higher-level request:
  ○ Hedged Requests
  ○ Tied Requests

Slide 14

Hedged Requests
● Issue the same request to multiple replicas and use whichever response comes back first.
  ○ Send the request to the most appropriate replica first (what "most appropriate" means is left open).
  ○ If no response arrives within a threshold, issue the request to another replica.
  ○ Cancel the outstanding requests once a result is received.
● Can amplify traffic significantly if not implemented prudently.
  ○ One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. This approach limits the additional load to approximately 5% while substantially shortening the latency tail.
  ○ Mitigates the effect of external interference, not slowness inherent to the request itself.
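
A minimal sketch of a hedged call with the deferred (p95) secondary request. `call_replica`, `replicas`, and `p95_seconds` are assumed names standing in for a real RPC client, and a long-lived ThreadPoolExecutor is assumed so an abandoned straggler does not block the caller:

```python
import concurrent.futures as cf

def hedged_call(pool: cf.ThreadPoolExecutor, call_replica, replicas, p95_seconds: float):
    """Send to the preferred replica; hedge to a second one only after p95."""
    primary = pool.submit(call_replica, replicas[0])
    try:
        return primary.result(timeout=p95_seconds)        # fast path: no hedge sent
    except cf.TimeoutError:
        hedge = pool.submit(call_replica, replicas[1])    # deferred secondary request (~5% extra load)
        done, pending = cf.wait({primary, hedge}, return_when=cf.FIRST_COMPLETED)
        for f in pending:
            f.cancel()                                    # best-effort cancellation of the straggler
        return next(iter(done)).result()                  # first response wins
```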

Slide 15

Hedged Requests
● Google benchmark:
  ○ Read 1,000 keys stored in a BigTable table,
  ○ distributed across 100 different servers.
  ○ Sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests.
● Vulnerabilities:
  ○ Multiple servers might execute the same request - redundant computation.
  ○ The 95th-percentile deferral technique reduces this impact.
  ○ Further reduction requires more aggressive cancellation of requests.

Slide 16

Queueing Delays
● Queueing delays add variability before a request even begins execution.
● Once a request is actually scheduled and begins execution, the variability of its completion time goes down substantially.
● Mitzenmacher* - Allowing a client to choose between two servers based on queue lengths at enqueue time exponentially improves load-balancing performance over a uniform random scheme.

* Mitzenmacher, M. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems 12, 10 (Oct. 2001), 1094–1104.
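
A toy balls-into-bins sketch of the power of two choices (no departures are modeled and the sizes are illustrative), showing how much flatter the queues end up compared with uniform random placement:

```python
import random

def pick_uniform(queues):
    return random.randrange(len(queues))

def pick_two_choices(queues):
    a, b = random.sample(range(len(queues)), 2)   # probe just two queues
    return a if queues[a] <= queues[b] else b     # join the shorter one

def simulate(chooser, n_servers=100, n_requests=100_000):
    queues = [0] * n_servers
    for _ in range(n_requests):
        queues[chooser(queues)] += 1
    return max(queues)                            # load on the most-loaded server

print("max queue, uniform random:", simulate(pick_uniform))
print("max queue, two choices   :", simulate(pick_two_choices))  # much closer to the mean of 1,000
```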

Slide 17

Tied Requests
● Instead of choosing one server, enqueue the request on multiple servers.
● Send the identity of the other servers as part of the request -- tying!
● The server that picks the request off its queue sends a cancellation to the other servers.
● Corner case:
  ○ What if both servers pick up the request while the cancellation messages are still in transit because of network delay?
  ○ Typical under low-traffic scenarios, when server queues are empty.
  ○ The client can introduce a small delay of 2x the average network message delay (1ms or less in modern data-center networks) between the first request and the second request.
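
A toy in-process sketch of tying and cross-server cancellation. The Server class and the shared "cancelled" set are illustrative assumptions, not the paper's implementation; the small stagger between enqueues mirrors the 2x-network-delay mitigation above:

```python
import queue
import threading
import time
import uuid

class Server:
    """Toy worker that skips a tied request its peer has already claimed."""
    def __init__(self, name):
        self.name, self.q, self.cancelled = name, queue.Queue(), set()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            req_id, work, peer = self.q.get()
            if req_id in self.cancelled:
                continue                        # peer already started it; drop our copy
            peer.cancelled.add(req_id)          # the "cancellation message" to the tied copy
            print(f"{self.name} executes: {work}")

s1, s2 = Server("s1"), Server("s2")
req_id = uuid.uuid4()
s1.q.put((req_id, "uncached read of row 42", s2))   # each copy carries its peer's identity
time.sleep(0.002)                                   # ~2x network delay stagger (corner case above)
s2.q.put((req_id, "uncached read of row 42", s1))
time.sleep(0.1)                                     # exactly one server should execute the read
```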

Slide 18

● Tied-request performance on BigTable - uncached read requests served from disk.
● Case 1 - Tests in isolation, with no external interference.
● Case 2 - A concurrent sorting job running alongside the benchmark test.
● In both scenarios, the overhead of tied requests in disk utilization is less than 1%, indicating the cancellation strategy is effective at eliminating redundant reads.
● Tied requests allow such workloads to be consolidated into a single cluster, resulting in dramatic computing-cost reductions.

Slide 19

Remote Queue Probing?
● Probe remote queues first, then submit the request to the least-loaded server.
● Sub-optimal:
  ○ Load levels can change between probe time and request time.
  ○ Request service times can be difficult to estimate due to underlying system and hardware variability.
  ○ Clients can create temporary hot spots by all picking the same (least-loaded) server.
● Distributed Shortest-Positioning Time First system* - the request is sent to one server and forwarded to replicas only if the initial server does not have the data in its cache, using cross-server cancellations.
● These techniques are effective only when the phenomena that cause variability do not tend to affect multiple request replicas simultaneously.

* Lumb, C.R. and Golding, R. D-SPTF: Decentralized request distribution in brick-based storage systems. SIGOPS Operating Systems Review 38, 5 (Oct. 2004), 37–47.

Slide 20

Cross Request Long Term Adaptations

Slide 21

Cross Request Long Term Adaptations
● Reduce latency variability caused by coarse-grained phenomena:
  ○ Load imbalance (unbalanced data partitions)
    ■ Centered on data distribution/placement techniques to optimize reads.
  ○ Service-time variations
    ■ Detecting and mitigating the effects of service-performance variations.
● Simple partitioning
  ○ Assumes partitions have equal cost.
  ○ Static assignment of a partition to a single machine?
  ○ Sub-optimal:
    ■ Performance of the underlying machines is neither uniform nor constant over time (thermal throttling and shared-workload interference).
    ■ Outlier items can cause data-induced load imbalance.
      ● A particular item becomes popular and the load on its partition increases.

Slide 22

Micro-Partitioning
● # Partitions > # Machines
● Partition large datasets into many pieces (10-100 per machine) [BigTable - Ref#3].
● Dynamically assign and load-balance these partitions across machines.
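
A minimal sketch of the idea: keys hash into far more partitions than machines, and a small partition-to-machine map (kept in memory here purely for illustration; a real system stores it in a metadata service) is what the load balancer edits when it moves load:

```python
import hashlib

N_PARTITIONS = 1000                                  # i.e. 10-100 partitions per machine
MACHINES = [f"m{i}" for i in range(50)]

def partition_of(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()   # stable hash of the key
    return int(digest, 16) % N_PARTITIONS

# Initial assignment: spread partitions round-robin across machines.
assignment = {p: MACHINES[p % len(MACHINES)] for p in range(N_PARTITIONS)}

def machine_for(key: str) -> str:
    return assignment[partition_of(key)]

# Rebalancing moves one small (~0.1%-of-data) partition, not a machine's whole share.
assignment[partition_of("hot:item")] = "m49"
print(machine_for("hot:item"))                       # -> m49
```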

Slide 23

Micro-Partitioning*
● Fine-grained load balancing - move partitions from one machine to another.
● With an average of ~20 partitions per machine, the system can shed load in roughly 5% increments and in 1/20th the time it would take if the system had a one-to-one mapping of partitions to machines.
● Using such partitions also leads to an improved failure-recovery rate.
  ○ More nodes are available to take over the work of a failed node.
● Does this sound familiar? (Hint: think Amazon!)
● The Dynamo paper also describes a similar concept of multiple virtual nodes mapped to physical nodes. [Ref#4]

* Stoica, I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of SIGCOMM (San Diego, Aug. 27–31). ACM Press, New York, 2001, 149–160.

Slide 24

Selective Replication
● An enhancement of micro-partitioning.
● Detect or predict items that are likely to cause load imbalance.
● Create additional replicas of these items/micro-partitions.
● Distribute the load among more replicas rather than moving partitions across nodes.
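
A minimal sketch, with illustrative thresholds and counters, of detecting a hot micro-partition and spreading its reads over an extra replica instead of moving the partition:

```python
import random
from collections import defaultdict

HOT_THRESHOLD = 10_000                                   # illustrative cutoff
read_count = defaultdict(int)                            # observed load per partition
replicas = defaultdict(lambda: ["replica-0"])            # every partition starts with one replica

def route_read(partition: int) -> str:
    read_count[partition] += 1
    if read_count[partition] > HOT_THRESHOLD and len(replicas[partition]) == 1:
        replicas[partition].append("replica-1")          # replicate only the hot slice
    return random.choice(replicas[partition])            # spread reads across its replicas
```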

Slide 25

Latency-Induced Probation Techniques
● Servers can sometimes become slow due to:
  ○ possibly data issues,
  ○ most likely interference issues (discussed earlier).
● Intermediate servers observe the latency distribution of the fleet.
● Put a slow server on probation when it is observed to be slow.
  ○ Keep sending it shadow requests to collect statistics.
  ○ Put the node back into rotation once it stabilizes.
● Counterintuitively, removing serving capacity during periods of high load can improve overall latency!
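
A minimal sketch of latency-induced probation, assuming an intermediate server records per-backend latencies; the window size and the 5x "slow" factor are illustrative assumptions:

```python
import statistics
from collections import defaultdict, deque

WINDOW, SLOW_FACTOR = 1000, 5.0
latencies = defaultdict(lambda: deque(maxlen=WINDOW))   # sliding latency window per backend
on_probation = set()

def p99(samples):
    return statistics.quantiles(samples, n=100)[98]     # 99th-percentile cut point

def record(backend: str, latency_ms: float) -> None:
    latencies[backend].append(latency_ms)
    full = [s for s in latencies.values() if len(s) == WINDOW]
    if len(latencies[backend]) < WINDOW or not full:
        return                                          # not enough data yet
    fleet_p99 = statistics.median(p99(s) for s in full)
    if p99(latencies[backend]) > SLOW_FACTOR * fleet_p99:
        on_probation.add(backend)      # live traffic avoids it; keep sending shadow requests
    else:
        on_probation.discard(backend)  # re-enter rotation once it looks healthy again
```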

Slide 26

Large Information Retrieval Systems
● Latency is a key quality metric.
● Returning good results quickly is better than returning the best results slowly.
● Good-enough results?
  ○ Once a sufficient fraction of the corpus has been searched, return the results without waiting for the remaining sub-queries.
  ○ Trade-off between completeness and responsiveness.
● Canary requests (see the sketch below)
  ○ For high fan-out systems.
  ○ Requests may hit untested code paths and crash or degrade many servers simultaneously.
  ○ Forward the request to a limited set of leaf servers first and send it to the full fleet only if the canaries succeed.
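
A minimal sketch of canary requests for a high fan-out query; `send_to` and the `.ok` field are assumed stand-ins for a real leaf RPC and its response type:

```python
def fan_out_with_canary(send_to, leaves, query, n_canaries=2):
    """Try the query on a couple of leaves before unleashing it on the whole fleet."""
    canaries, rest = leaves[:n_canaries], leaves[n_canaries:]
    canary_results = [send_to(leaf, query) for leaf in canaries]
    if not all(resp.ok for resp in canary_results):
        raise RuntimeError("canary failed; not fanning out a potentially bad query")
    return canary_results + [send_to(leaf, query) for leaf in rest]
```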

Slide 27

Mutations/Writes in Data-Intensive Services
● Tolerating latency variability for mutations/writes is relatively easier:
  ○ The scale of latency-critical modifications is generally small.
  ○ Updates can often be performed off the critical path, after responding to the user.
  ○ Many services can be structured to tolerate inconsistent update models for (inherently more latency-tolerant) mutations.
● For services that require consistent updates, the commonly used techniques are quorum-based algorithms (e.g., Paxos*).
  ○ Since these algorithms must commit to only three to five replicas, they are inherently tail-tolerant.

* Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.
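
A minimal quorum-write sketch showing why a small replica group tolerates a slow member - this is a generic majority-ack pattern, not Paxos itself; `pool`, `write_to`, and `replicas` are assumed names:

```python
import concurrent.futures as cf

def quorum_write(pool: cf.ThreadPoolExecutor, write_to, replicas, record) -> bool:
    """Acknowledge the write once a majority of the 3-5 replicas respond."""
    futures = [pool.submit(write_to, replica, record) for replica in replicas]
    needed = len(replicas) // 2 + 1
    acked = 0
    for future in cf.as_completed(futures):
        if future.result():          # replica confirmed the write
            acked += 1
        if acked >= needed:
            return True              # majority reached; stragglers finish off the critical path
    return False
```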

Slide 28

Hardware Trends
● Some bad:
  ○ More aggressive power optimizations to save energy can increase variability, due to the added latency of switching into and out of power-saving modes.
● Some friendly to variability techniques:
  ○ Lower-latency data-center networks can make things like tied-request cancellations work better.

Slide 29

Conclusions
● Living with variability
  ○ Fault-tolerant techniques are needed because guaranteeing fault-free operation isn't feasible beyond a point.
  ○ Likewise, it is not possible to eliminate all sources of latency variability.
● Tail-tolerant techniques
  ○ Will become more important with current hardware trends -- handled at the software level.
  ○ Require some additional resources, but with modest overheads.
  ○ Leverage the capacity already deployed for redundancy and fault tolerance.
  ○ Can drive higher resource utilization overall with better responsiveness.
  ○ Common patterns - easy to bake into baseline libraries.

Slide 30

References
1. Dean, J. and Barroso, L.A. The tail at scale. Communications of the ACM 56, 2 (Feb. 2013), 74–80. https://cseweb.ucsd.edu/~gmporter/classes/sp18/cse124/post/schedule/p74-dean.pdf
2. Using load shedding to avoid overload. Amazon Builders' Library. https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. BigTable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (Seattle, Nov. 2006). USENIX Association, Berkeley, CA, 205–218.
4. DeCandia, G. et al. Dynamo: Amazon's highly available key-value store. In Proceedings of SOSP 2007. https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Slide 31

Questions?

Slide 32

Co-ordinates
● Twitter - @pigol1
● Mail - [email protected]
● LinkedIn - https://www.linkedin.com/in/piyushgoel1/