
The Tail At Scale

August 29, 2020


Talk given at the Papers We Love - Bangalore Chapter on August 29th. Covers "The Tail at Scale" paper by Jeff Dean and Luiz André Barroso.

https://research.google/pubs/pub40801/


Transcript

  1. The Tail at Scale
    Jeffrey Dean & Luiz André Barroso
    https://research.google/pubs/pub40801/
    Piyush Goel - @pigol1 - [email protected]

  2. What is Tail Latency?
    • The last 0.X% of the request latency distribution.
    • In general, we can take the slowest 1% of response times (the 99th-percentile response time) as the tail latency of that request. (A minimal sketch of computing a tail percentile follows this slide.)
      ◦ May vary depending on the scale of the system.
      ◦ InMobi - 8B ad requests/day.
      ◦ Capillary - 500M requests/day.
    • Responsive/interactive systems:
      ◦ Better user experience -- higher engagement.
      ◦ Monetary value.
      ◦ ~100ms budgets -- large information retrieval systems.
      ◦ Reads outnumber the writes!

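A minimal sketch (not from the deck) of how a tail-latency percentile can be computed from a sample of observed request latencies, using the nearest-rank method; the sample values are made up for illustration:

// Computing a tail-latency percentile from a latency sample.
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the latency at quantile q (e.g. 0.99) using the
// nearest-rank method on a copy of the sample.
func percentile(samples []time.Duration, q float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(math.Ceil(q*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	latencies := []time.Duration{
		8 * time.Millisecond, 9 * time.Millisecond, 10 * time.Millisecond,
		12 * time.Millisecond, 950 * time.Millisecond, // one slow outlier drives the tail
	}
	fmt.Println("p99 (tail) latency:", percentile(latencies, 0.99))
}
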
  3. Why should we care about Tail Latency?
    • Tail latency becomes important as the scale of our system increases.
      ◦ 10k requests/day → 100M requests/day
      ◦ Slowest 1% of queries: 100 requests → 1M requests
    • Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts, large online services need to create a predictably responsive whole out of less-predictable parts; we refer to such systems as "latency tail-tolerant," or simply "tail-tolerant."
    • If we remain ignorant of tail latency, it will eventually come back to bite us.
    • Many tail-tolerant methods can leverage the infrastructure already provisioned for fault tolerance - better utilisation.

  4. Why does such Variability Exist?
    • Software variability
      ◦ Shared resources
        ▪ Co-hosted applications: containers can provide isolation, but within-app contention still exists.
      ◦ Daemons: background and batch jobs running alongside the application.
      ◦ Global resource sharing: network switches, shared file systems, shared disks.
      ◦ Maintenance activities (log compaction, data shuffling, etc.).
      ◦ Queueing (network buffers, OS kernels, intermediate hops).
    • Hardware variability
      ◦ Power limits, disk capacity, etc.
    • Garbage collection
    • Energy management

  5. Variability is amplified by Scale!
    • At large scale, the number of components or parallel operations per request increases (fan-out).
      ◦ Micro-services
      ◦ Data partitions
    • Fan-out further increases the overall latency of the request.
      ◦ Overall latency ≥ latency of the slowest component.
    • Server with 1 ms avg. latency but 1 sec 99th-percentile latency (see the worked calculation below):
      ◦ 1 server: 1% of requests take ≥ 1 sec.
      ◦ 100 servers: 63% of requests take ≥ 1 sec.

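A small worked calculation of the fan-out numbers above, assuming each server independently responds within 1 second with probability 0.99, so a request that must wait for all N servers is slow with probability 1 - 0.99^N:

// How fan-out amplifies the tail.
package main

import (
	"fmt"
	"math"
)

func main() {
	pFast := 0.99 // P(a single server responds in < 1 sec)
	for _, n := range []int{1, 100} {
		pSlow := 1 - math.Pow(pFast, float64(n))
		fmt.Printf("fan-out %3d: %.0f%% of requests take >= 1 sec\n", n, pSlow*100)
	}
}
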
  6. At Google Scale
    • With a 10ms 99th-percentile latency for any single request, the 99th-percentile latency for all requests to finish is 140ms, and for 95% of the requests to finish it is 70ms.
      ◦ Meaning that waiting for the slowest 5% of the requests to complete is responsible for half of the total 99th-percentile latency.
    • Techniques that concentrate on these slow outliers can yield dramatic reductions in overall service latency.

  7. Reducing Component Variability
    • Some practices that can be used to reduce variability:
    • Prioritize interactive/real-time requests over background requests. (A minimal sketch of a prioritizing dispatcher follows this slide.)
      ◦ Differentiated service classes and higher-level queuing (Google File System).
      ◦ AWS uses something similar, albeit for load shedding. [Ref#2]
    • Reduce head-of-line blocking.
      ◦ Break long-running requests into a sequence of smaller requests to allow interleaving with other short-running requests (e.g. search queries).
    • Manage background activities and synchronize disruption.
      ◦ Throttling, breaking down heavy service operations.
      ◦ For large fan-out services, it is sometimes useful to synchronize the background activity across many different machines. This synchronization enforces a brief burst of activity on each machine simultaneously, slowing only those interactive requests being handled during that brief period. Without synchronization, a few machines are always doing some background activity, pushing out the latency tail on all requests.
    • Caching doesn't reduce variability much unless the entire working set resides in the cache.

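A minimal sketch of the prioritization idea, assuming a single in-process dispatcher with separate queues for interactive and background requests; the Request type, channel sizes and request IDs are illustrative, not from the deck:

// A dispatcher that always drains interactive requests before background ones.
package main

import (
	"fmt"
	"time"
)

type Request struct{ ID string }

func dispatcher(interactive, background <-chan Request, stop <-chan struct{}) {
	for {
		// Serve any queued interactive request first.
		select {
		case r := <-interactive:
			handle(r, "interactive")
			continue
		default:
		}
		// Nothing interactive is waiting: take whatever arrives next.
		select {
		case r := <-interactive:
			handle(r, "interactive")
		case r := <-background:
			handle(r, "background")
		case <-stop:
			return
		}
	}
}

func handle(r Request, class string) {
	fmt.Println("serving", class, "request", r.ID)
}

func main() {
	interactive := make(chan Request, 16)
	background := make(chan Request, 16)
	stop := make(chan struct{})

	// Enqueue a background job first, then an interactive request;
	// the interactive one is still served first.
	background <- Request{ID: "compaction-1"}
	interactive <- Request{ID: "user-query-1"}

	go dispatcher(interactive, background, stop)
	time.Sleep(10 * time.Millisecond) // let the dispatcher drain the queues
	close(stop)
}
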
  8. Living with Variability!
    • Nearly impossible to eliminate all latency variability -- especially in the cloud era!
    • Develop tail-tolerant techniques that mask or work around temporary latency pathologies, instead of trying to eliminate them altogether.
    • Classes of tail-tolerant techniques:
      ◦ Within-request / immediate-response adaptations
        ▪ Focus on reducing variation within a single request path.
        ▪ Time scale: ~10ms.
      ◦ Cross-request long-term adaptations
        ▪ Focus on holistic measures to reduce the tail at the system level.
        ▪ Time scale: 10 seconds and above.

  9. Within Request Immediate Response Adaptations
    • Cope with a slow subsystem in the context of a higher-level request.
    • Time scale == right now: the user is waiting.
    • Multiple replicas are already deployed for additional throughput capacity and for availability in the presence of failures.
      ◦ This approach is particularly effective when most requests operate on largely read-only, loosely consistent datasets.
      ◦ Spelling-correction service: written once, read millions of times.
    • Based on how replication (of requests and data) can be used to reduce variability within a single higher-level request:
      ◦ Hedged requests
      ◦ Tied requests

  10. Hedged Requests
    • Issue the same request to multiple replicas and use the quickest response. (A minimal sketch follows this slide.)
      ◦ Send the request to the most appropriate replica (the definition of "appropriate" is open).
      ◦ If no response arrives within a threshold, issue the request to another replica.
      ◦ Cancel the outstanding requests once a response arrives.
    • Can amplify traffic significantly if not implemented prudently.
      ◦ One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. This limits the additional load to approximately 5% while substantially shortening the latency tail.
      ◦ Mitigates the effects of external interference, not slowness inherent to the request itself.

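A minimal sketch of a hedged request, assuming a fetch function that honours context cancellation; the replica names, the simulated 50ms backend latency and the 20ms hedging delay are illustrative (the deck suggests using the 95th-percentile expected latency as the deferral threshold):

// Hedged request: send to one replica, hedge to a second after a delay,
// take whichever answers first, cancel the rest.
package main

import (
	"context"
	"fmt"
	"time"
)

func fetch(ctx context.Context, replica string) (string, error) {
	// Placeholder for a real RPC/HTTP call that honours ctx cancellation.
	select {
	case <-time.After(50 * time.Millisecond):
		return "response from " + replica, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func hedgedFetch(ctx context.Context, replicas []string, hedgeAfter time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels any still-outstanding requests

	results := make(chan string, len(replicas))
	send := func(r string) {
		if resp, err := fetch(ctx, r); err == nil {
			results <- resp
		}
	}

	go send(replicas[0])
	select {
	case resp := <-results:
		return resp, nil // primary answered before the hedging deadline
	case <-time.After(hedgeAfter):
		go send(replicas[1]) // hedge: issue the same request to a second replica
	case <-ctx.Done():
		return "", ctx.Err()
	}

	select {
	case resp := <-results:
		return resp, nil // first of the two responses wins
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	resp, err := hedgedFetch(context.Background(),
		[]string{"replica-a", "replica-b"}, 20*time.Millisecond)
	fmt.Println(resp, err)
}
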
  11. Hedged Requests: Google Benchmark
    • Read 1,000 keys stored in a BigTable table, distributed across 100 different servers.
    • Sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests.
    • Vulnerabilities:
      ◦ Multiple servers might execute the same request - redundant computation.
      ◦ The 95th-percentile deferral threshold reduces the impact.
      ◦ Further reduction requires more aggressive cancellation of requests.

  12. Queueing Delays
    • Queueing delays add variability before a request begins execution.
    • Once a request is actually scheduled and begins execution, the variability of its completion time goes down substantially.
    • Mitzenmacher*: allowing a client to choose between two servers based on queue lengths at enqueue time exponentially improves load-balancing performance over a uniform random scheme. (A minimal sketch follows this slide.)

    * Mitzenmacher, M. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems 12, 10 (Oct. 2001), 1094–1104.

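A minimal sketch of the power-of-two-choices idea, assuming the client can observe each server's queue length at enqueue time; the Server type and queue counters are illustrative:

// Power of two choices: sample two servers, enqueue on the shorter queue.
package main

import (
	"fmt"
	"math/rand"
)

type Server struct {
	Name     string
	QueueLen int
}

func pickTwoChoices(servers []*Server) *Server {
	a := servers[rand.Intn(len(servers))]
	b := servers[rand.Intn(len(servers))]
	if b.QueueLen < a.QueueLen {
		return b
	}
	return a
}

func main() {
	servers := []*Server{{Name: "s1"}, {Name: "s2"}, {Name: "s3"}, {Name: "s4"}}
	for i := 0; i < 8; i++ {
		s := pickTwoChoices(servers)
		s.QueueLen++ // enqueue the request on the less-loaded of the two
		fmt.Println("request", i, "->", s.Name)
	}
}
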
  13. Tied Requests
    • Instead of choosing one server, enqueue the request on multiple servers.
    • Send the identity of the other servers as part of the request -- tying! (A minimal sketch follows this slide.)
    • Send a cancellation to the other servers once one server picks the request off its queue.
    • Corner case:
      ◦ What if both servers pick up the request while the cancellation messages are still in transit (network delay)?
      ◦ Typically happens under low traffic, when server queues are empty.
      ◦ The client can introduce a small delay of 2x the average network message delay (1ms or less in modern data-center networks) between sending the first request and the second.

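A minimal in-process sketch of tied requests, with two "servers" sharing a request that carries its peer's identity; the types, channels and request IDs are illustrative, and the race where both servers dequeue before either cancellation lands is exactly the corner case noted above:

// Tied requests: each queued copy knows its peer; the first server to
// dequeue cancels the copy on the other server.
package main

import (
	"fmt"
	"sync"
	"time"
)

type TiedRequest struct {
	ID   string
	Peer *Server // the other server holding a copy of this request
}

type Server struct {
	Name      string
	queue     chan TiedRequest
	cancelled sync.Map // request ID -> struct{}{}
}

func (s *Server) Enqueue(r TiedRequest) { s.queue <- r }
func (s *Server) Cancel(id string)      { s.cancelled.Store(id, struct{}{}) }

func (s *Server) Run(wg *sync.WaitGroup) {
	defer wg.Done()
	for r := range s.queue {
		if _, ok := s.cancelled.Load(r.ID); ok {
			continue // the peer already started executing this request
		}
		// If both servers reach this point at the same time, both execute:
		// that is the corner case a small client-side delay mitigates.
		r.Peer.Cancel(r.ID)
		fmt.Println(s.Name, "executes", r.ID)
	}
}

func main() {
	a := &Server{Name: "server-a", queue: make(chan TiedRequest, 8)}
	b := &Server{Name: "server-b", queue: make(chan TiedRequest, 8)}

	var wg sync.WaitGroup
	wg.Add(2)
	go a.Run(&wg)
	go b.Run(&wg)

	// Tie one request across both servers' queues.
	a.Enqueue(TiedRequest{ID: "req-1", Peer: b})
	b.Enqueue(TiedRequest{ID: "req-1", Peer: a})

	time.Sleep(10 * time.Millisecond)
	close(a.queue)
	close(b.queue)
	wg.Wait()
}
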
  14. Tied Request Performance on BigTable - Uncached Read Requests from Disk
    • Case 1 - tests in isolation, no external interference.
    • Case 2 - a concurrent sorting job running alongside the benchmark.
    • In both scenarios, the overhead of tied requests in disk utilization is less than 1%, indicating the cancellation strategy is effective at eliminating redundant reads.
    • Tied requests allow such workloads to be consolidated into a single cluster, resulting in dramatic computing cost reductions.

  15. Remote Queue Probing?
    • Probe remote queues first, then submit the request to the least-loaded server.
    • Sub-optimal:
      ◦ Load levels can change between probe time and request time.
      ◦ Request service times can be difficult to estimate due to underlying system and hardware variability.
      ◦ Clients can create temporary hot spots by all picking the same (least-loaded) server.
    • Distributed Shortest-Positioning Time First system*: the request is sent to one server and forwarded to replicas only if the initial server does not have the data in its cache; it uses cross-server cancellations.
    • These techniques are effective only when the phenomena that cause variability do not tend to affect multiple request replicas simultaneously.

    * Lumb, C.R. and Golding, R. D-SPTF: Decentralized request distribution in brick-based storage systems. ACM SIGOPS Operating Systems Review 38, 5 (Oct. 2004), 37–47.

  16. Cross Request Long Term Adaptations
    • Reduce latency variability caused by coarse-grained phenomena:
      ◦ Load imbalance (unbalanced data partitions)
        ▪ Centered on data distribution/placement techniques to optimize reads.
      ◦ Service-time variations
        ▪ Detecting and mitigating the effects of service performance variations.
    • Simple partitioning:
      ◦ Assumes partitions have equal cost.
      ◦ Static assignment of a partition to a single machine?
      ◦ Sub-optimal:
        ▪ Performance of the underlying machines is neither uniform nor constant over time (thermal throttling, shared-workload interference).
        ▪ Outlier items can cause data-induced load imbalance, e.g. a particular item becomes popular and the load on its partition increases.

  17. Micro-Partitioning
    • # partitions > # machines.
    • Partition large datasets into many pieces (10-100 per machine). [BigTable - Ref#3]
    • Dynamically assign and load-balance these partitions across machines.

  18. Micro-Partitioning*
    • Fine-grained load balancing: move partitions from one machine to another. (A minimal sketch follows this slide.)
    • With an average of ~20 partitions per machine, the system can shed load in roughly 5% increments, and in 1/20th the time it would take with a one-to-one mapping of partitions to machines.
    • Using such partitions also leads to improved failure recovery: more nodes can take over the work of a failed node.
    • Does this sound familiar? (Hint: think Amazon!)
    • The Amazon Dynamo paper also describes a similar concept of multiple virtual nodes mapped to physical nodes. [Ref#4]

    * Stoica, I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of SIGCOMM (San Diego, Aug. 27–31). ACM Press, New York, 2001, 149–160.

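A minimal sketch of micro-partitioning, assuming a central assignment map of partitions to machines; the machine names, the 100-partition count and the round-robin placement are illustrative:

// Many more partitions than machines, with a dynamic assignment map so
// load can be rebalanced by moving individual partitions.
package main

import "fmt"

type Assignment map[int]string // partition ID -> machine

// assignRoundRobin spreads numPartitions across machines; numPartitions is
// typically 10-100x the machine count.
func assignRoundRobin(numPartitions int, machines []string) Assignment {
	a := make(Assignment, numPartitions)
	for p := 0; p < numPartitions; p++ {
		a[p] = machines[p%len(machines)]
	}
	return a
}

// move reassigns one partition, shedding roughly 1/partitionsPerMachine
// of the source machine's load.
func (a Assignment) move(partition int, to string) {
	a[partition] = to
}

func main() {
	machines := []string{"m1", "m2", "m3", "m4", "m5"}
	assignment := assignRoundRobin(100, machines) // ~20 partitions per machine
	// Shed ~5% of m1's load by moving one of its partitions to m2.
	assignment.move(0, "m2")
	fmt.Println("partition 0 now served by", assignment[0])
}
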
  19. Selective Replication
    • An enhancement of micro-partitioning. (A minimal sketch follows this slide.)
    • Detect or predict items that are likely to cause load imbalance.
    • Create additional replicas of these items/micro-partitions.
    • Distribute the load among more replicas rather than moving partitions across nodes.

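A minimal sketch of selective replication, assuming per-partition load counters; the load threshold, machine names and sample loads are illustrative:

// Track per-partition load and add replicas for hot partitions so their
// traffic spreads across more machines.
package main

import "fmt"

type Partition struct {
	ID       int
	Load     int      // e.g. requests/sec observed for this partition
	Replicas []string // machines currently serving this partition
}

// rebalance adds one more replica to any partition hotter than threshold.
func rebalance(parts []*Partition, threshold int, spareMachines []string) {
	next := 0
	for _, p := range parts {
		if p.Load > threshold && next < len(spareMachines) {
			p.Replicas = append(p.Replicas, spareMachines[next])
			next++
		}
	}
}

func main() {
	parts := []*Partition{
		{ID: 1, Load: 50, Replicas: []string{"m1"}},
		{ID: 2, Load: 900, Replicas: []string{"m2"}}, // hot partition
	}
	rebalance(parts, 500, []string{"m3", "m4"})
	fmt.Println("partition 2 replicas:", parts[1].Replicas)
}
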
  20. Latency Induced Probation Techniques
    • Servers can sometimes become slow due to:
      ◦ possibly data issues;
      ◦ most likely interference issues (discussed earlier).
    • Intermediate servers observe the latency distribution of the fleet. (A minimal sketch follows this slide.)
    • Put a slow server on probation when slowness is detected.
      ◦ Send shadow requests to it to keep collecting statistics.
      ◦ Put the node back into rotation once it stabilizes.
    • Counter-intuitively, temporarily reducing serving capacity under load can improve overall latency!

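A minimal sketch of latency-induced probation, assuming the intermediate server tracks a recent 99th-percentile latency per backend; the 100ms limit and backend names are illustrative:

// Put latency outliers on probation (shadow traffic only) and reinstate
// them once they look healthy again.
package main

import (
	"fmt"
	"time"
)

type Backend struct {
	Name        string
	RecentP99   time.Duration
	OnProbation bool
}

// review updates probation status based on each backend's recent tail latency.
func review(backends []*Backend, limit time.Duration) {
	for _, b := range backends {
		switch {
		case !b.OnProbation && b.RecentP99 > limit:
			b.OnProbation = true // stop live traffic; keep sending shadow requests
		case b.OnProbation && b.RecentP99 <= limit:
			b.OnProbation = false // latency recovered: back into rotation
		}
	}
}

func main() {
	backends := []*Backend{
		{Name: "b1", RecentP99: 12 * time.Millisecond},
		{Name: "b2", RecentP99: 800 * time.Millisecond}, // interference victim
	}
	review(backends, 100*time.Millisecond)
	for _, b := range backends {
		fmt.Println(b.Name, "on probation:", b.OnProbation)
	}
}
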
  21. Large Information Retrieval Systems
    • Latency is a key quality metric.
    • Returning good results quickly is better than returning the best results slowly.
    • Good-enough results?
      ◦ Once a sufficient fraction of the corpus has been searched, return the results without waiting for the remaining leaf queries.
      ◦ Tradeoff between completeness and responsiveness.
    • Canary requests (a minimal sketch follows this slide):
      ◦ Used in high fan-out systems.
      ◦ Requests may hit untested code paths and crash or degrade many servers simultaneously.
      ◦ Forward the request to a limited number of leaf servers first, and send it to the rest of the fleet only if the canary succeeds.

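A minimal sketch of a canary request, assuming a queryLeaf RPC stub; the leaf names and the failure condition are illustrative:

// Canary request: try one leaf first, fan out only if it succeeds.
package main

import (
	"errors"
	"fmt"
)

func queryLeaf(leaf, query string) error {
	// Placeholder for the real leaf RPC; an untested code path could fail here.
	if query == "" {
		return errors.New("bad query")
	}
	return nil
}

func fanOutWithCanary(leaves []string, query string) error {
	// Canary phase: a single leaf absorbs any crash-inducing query.
	if err := queryLeaf(leaves[0], query); err != nil {
		return fmt.Errorf("canary failed, not fanning out: %w", err)
	}
	// Canary succeeded: safe to send to the rest of the fleet.
	for _, leaf := range leaves[1:] {
		if err := queryLeaf(leaf, query); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	leaves := []string{"leaf-1", "leaf-2", "leaf-3"}
	fmt.Println(fanOutWithCanary(leaves, "tail at scale"))
}
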
  22. Mutations/Writes in Data Intensive Services
    • Tolerating latency variability for mutations (writes) is relatively easier:
      ◦ The scale of latency-critical modifications is generally small.
      ◦ Updates can often be performed off the critical path, after responding to the user.
      ◦ Many services can be structured to tolerate inconsistent update models for (inherently more latency-tolerant) mutations.
    • For services that require consistent updates, the commonly used techniques are quorum-based algorithms (e.g. Paxos*). (A minimal quorum-write sketch follows this slide.)
      ◦ Because these algorithms touch only three to five replicas, they are inherently tail-tolerant.

    * Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.

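A minimal sketch of the quorum idea (not Paxos itself), assuming a writeReplica stub with variable latency; returning after a majority of acknowledgements keeps the slowest replicas off the critical path:

// Quorum write: send to all replicas, return once a majority acknowledge.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func writeReplica(name string, acks chan<- string) {
	// Placeholder for a real replication RPC with variable latency.
	time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
	acks <- name
}

func quorumWrite(replicas []string) []string {
	acks := make(chan string, len(replicas))
	for _, r := range replicas {
		go writeReplica(r, acks)
	}
	quorum := len(replicas)/2 + 1
	acked := make([]string, 0, quorum)
	for len(acked) < quorum { // the slowest replicas never block the caller
		acked = append(acked, <-acks)
	}
	return acked
}

func main() {
	fmt.Println("quorum reached with:", quorumWrite([]string{"r1", "r2", "r3", "r4", "r5"}))
}
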
  23. Hardware Trends
    • Some bad:
      ◦ More aggressive power optimizations to save energy can increase variability, due to the added latency of switching in and out of power-saving modes.
    • Variability-technique-friendly trends:
      ◦ Lower-latency data-center networks can make techniques like tied-request cancellation work better.

  24. Conclusions
    • Living with variability:
      ◦ Fault-tolerant techniques are needed because guaranteeing fault-free operation isn't feasible beyond a point.
      ◦ Likewise, it is not possible to eliminate all sources of latency variability.
    • Tail-tolerant techniques:
      ◦ Will become more important given hardware trends; the handling moves to the software level.
      ◦ Require some additional resources, but with modest overheads.
      ◦ Can leverage the capacity already deployed for redundancy and fault tolerance.
      ◦ Can drive higher resource utilization overall, with better responsiveness.
      ◦ Follow common patterns that are easy to bake into baseline libraries.

  25. References
    1. https://cseweb.ucsd.edu/~gmporter/classes/sp18/cse124/post/schedule/p74-dean.pdf
    2. https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
    3. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. Bigtable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (Seattle, Nov. 2006). USENIX Association, Berkeley, CA, 205–218.
    4. https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

  26. Co-ordinates
    • Twitter - @pigol1
    • Mail - [email protected]
    • LinkedIn - https://www.linkedin.com/in/piyushgoel1/