The Tail at Scale - HasGeek Meetup

pigol
September 17, 2020

The Tail at Scale talk at the HasGeek Meetup.




  1. The Tail at Scale • Jeffrey Dean & Luiz André Barroso • Piyush Goel (@pigol1)
  2. What are Percentiles? • A percentile (or a centile) is

    a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found. Equivalently, 80% of the observations are found above the 20th percentile. -- Wikipedia • Percentile Latency - The latency below which a given percentage of requests fall. A 95th-percentile latency of 100ms means that 95% of requests are served in under 100ms.
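A minimal sketch of how a percentile latency can be computed from raw samples, using the nearest-rank definition. The function name `percentile` and the sample data are made up for illustration; monitoring systems often use interpolating or approximate (histogram-based) variants instead.

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical sample: 95 fast requests (20ms) and 5 slow ones (900ms).
samples = [20] * 95 + [900] * 5
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)
```

With this sample the median and the 95th percentile both look healthy, while the 99th percentile exposes the slow requests. That gap is the tail the rest of the talk is about.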
  3. What is Tail Latency? • Last 0.X% of the request

    latency distribution graph. • In general, we can take the slowest 1% of response times, i.e. those beyond the 99th-percentile response time, as the tail latency of a request. ◦ May vary depending on the scale of the system ▪ InMobi - 8B Ad Requests/day (99.5th percentile) ▪ Capillary - 500M Requests/day (99th percentile) • Responsive/Interactive Systems ◦ Better User Experience, Fluidity -- Higher Engagement ◦ 100ms -- Large Information Retrieval Systems ◦ Reads outnumber the writes!
  4. Why should we care about Tail Latency? • Tail latency

    becomes important as the scale of our system increases. ◦ 10K requests/day → 100M requests/day ◦ Slowest 1% queries - 100 requests → 1M requests • Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts, large online services need to create a predictably responsive whole out of less-predictable parts; we refer to such systems as “latency tail-tolerant,” or simply “tail-tolerant.” • Many Tail Tolerant Methods can leverage infra provided for Fault Tolerance - Better Utilisation
  5. Why does such Variability Exist? • Software Variability ◦ Shared

    Resources ▪ Co-hosted applications • Containers can provide isolation. • Within-app contention still exists. ◦ Daemons ▪ Batch jobs or daemons running in the background. ◦ Global Resource Sharing ▪ Network Switches, Shared File Systems, Shared Disks. ◦ Maintenance activities (Log compaction, data shuffling, etc). ◦ Queueing (Network buffers, OS Kernels, intermediate hops) • Other Aspects - Hardware Variability, Garbage Collection, Energy Management
  6. Variability is amplified by Scale! • At large scale, the number of components

    or parallel operations per request increases (fan-out). ◦ Micro-Services ◦ Data Partitions • This further increases the overall latency of the request. ◦ Overall Latency ≥ Latency of Slowest Component • Server with 1 ms avg. but 1 sec 99th-percentile latency ◦ 1 Server: 1% of requests take ≥ 1 sec ◦ 100 Servers: 63% of requests take ≥ 1 sec (Fan-out)
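The 63% figure follows from simple probability, assuming each server is slow independently with probability 1%: a fanned-out request is fast only if every one of its sub-requests is fast. A short sketch (the function name `slow_fraction` is made up for illustration):

```python
def slow_fraction(fan_out, per_server_slow=0.01):
    """Probability that at least one of `fan_out` independent sub-requests
    lands in a server's slow tail."""
    return 1 - (1 - per_server_slow) ** fan_out

one_server = slow_fraction(1)
hundred_servers = slow_fraction(100)
```

For one server the result is the 1% from the slide; for a fan-out of 100 it is 1 - 0.99^100, roughly 0.63.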
  7. The effect of large fan-out on Latency Distributions.

  8. At Google Scale - 10ms 99th-percentile latency for any single

    request, the 99th-percentile latency for all requests to finish is 140ms, and the 99th-percentile latency for 95% of the requests to finish is 70ms. ◦ Meaning that waiting for the slowest 5% of the requests to complete is responsible for half of the total 99th-percentile latency. Techniques that concentrate on these slow outliers can yield dramatic improvements in overall service performance.
  9. Can we control Variability?

  10. Reducing Component Variability - Some Practices • Prioritize Interactive Requests

    or Real-time requests over background requests ◦ Differentiate service classes and prefer higher-level queuing (Google File System) ◦ AWS uses something similar, albeit for Load Shedding [Ref#2] • Reducing head-of-line blocking ◦ Break long-running requests into a sequence of smaller requests to allow interleaving of the execution of other short-running requests. (Search Queries) ◦ Example - Pagination requests when scanning large lists. • Managing background activities and synchronized disruption ◦ Throttling, Service Operation Breakdowns. ◦ Example - Anti-virus, security scans, log compression can be run during off-business hours. • Caching doesn’t reduce variability much unless the entire dataset resides in the cache.
  11. Variability is Inevitable!

  12. Living with Variability! • ~Impossible to eliminate all latency variability

    -- especially in the Cloud Era! • Develop tail-tolerant techniques that mask or work around temporary latency pathologies, instead of trying to eliminate them altogether. • Classes of Tail Tolerant Techniques ◦ Within Request // Immediate Response Adaptations ▪ Focus on reducing variations within a single request path. ▪ Time Scale - 10ms ◦ Cross Request Long Term Adaptations ▪ Focus on holistic measures to reduce the tail at the system level. ▪ Time Scale - 10 seconds and above.
  13. Within Request Immediate Response Adaptations

  14. Within Request Immediate Response Adaptations • Cope with a slow

    subsystem in the context of a higher-level request • Time Scale == Right Now (User is waiting) • Multiple Replicas for additional throughput capacity. ◦ Availability in the presence of failures. ◦ This approach is particularly effective when most requests operate on largely read-only, loosely consistent datasets. ◦ Spelling Correction Service, Contacts Lookup - Write Once, Read Millions of Times • Replication (request & data) can be used to reduce variability in a single higher-level request ◦ Hedged Requests ◦ Tied Requests
  15. Hedged Requests • Issue the same request to multiple replicas & use the

    first (quickest) response. ◦ Send the request to the most appropriate replica (the definition of "appropriate" is open). ◦ If there is no response within a threshold, issue the request to another replica. ◦ Cancel the outstanding requests after receiving the first response. • Can amplify the traffic significantly if not implemented prudently. ◦ One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. This approach limits the additional load to approximately 5% while substantially shortening the latency tail. ◦ Mitigates the effect of external interference, not slowness inherent to the request itself.
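The deferred-hedge flow described on this slide can be sketched with Python's `asyncio`. This is an illustration under stated assumptions, not how any particular RPC library does it: `hedged_request`, `fake_send`, and the replica names are hypothetical, and `hedge_delay_s` stands in for the 95th-percentile expected latency.

```python
import asyncio

async def hedged_request(replicas, send, hedge_delay_s):
    """Send to replicas[0]; if it has not answered within hedge_delay_s,
    also send to replicas[1]. Return the first reply and cancel the rest."""
    primary = asyncio.create_task(send(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # fast path: no extra load generated
    secondary = asyncio.create_task(send(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, secondary}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the outstanding request
    return done.pop().result()

async def fake_send(replica):
    # Hypothetical replicas: "a" is having a slow moment, "b" is healthy.
    delay_s = {"a": 0.5, "b": 0.01}[replica]
    await asyncio.sleep(delay_s)
    return f"reply from {replica}"

result = asyncio.run(hedged_request(["a", "b"], fake_send, hedge_delay_s=0.05))
```

Here the slow primary triggers a hedge after 50ms and the reply comes from the second replica. Error handling and retries are left out to keep the sketch short.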
  16. Hedged Requests • Google Benchmark ◦ Read 1,000 keys stored in a

    BigTable table ◦ Distributed across 100 different servers. ◦ Sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests. • Vulnerabilities ◦ Multiple servers might execute the same request - redundant computation. ◦ Deferring the hedge until the 95th-percentile latency reduces the impact. ◦ Further reduction requires aggressive cancellation of requests.
  17. Hedged Requests - Usable Examples • Twitter Finagle ◦ Server as a Function

    • Envoy Proxy ◦ HTTP routing - hedging • Linkerd ◦ Open Feature Request • Spring Reactor
  18. Queueing Delays • Queueing delays add variability before a request

    begins execution. • Once a request is actually scheduled and begins execution, the variability of its completion time goes down substantially. • Mitzenmacher* - Allowing a client to choose between two servers based on queue lengths at enqueue time exponentially improves load-balancing performance over a uniform random scheme. * Mitzenmacher, M. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems 12, 10 (Oct. 2001), 1094–1104.
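A small simulation can illustrate the "power of two choices" result. It is a simplified sketch: requests are only enqueued and never complete (a balls-into-bins model), and the function names and parameters are made up for illustration.

```python
import random

def pick_two_choices(queue_lengths, rng):
    """Sample two distinct servers and pick the one with the shorter queue."""
    a, b = rng.sample(range(len(queue_lengths)), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b

def simulate(num_servers, num_requests, two_choices, seed=0):
    """Enqueue requests and return the length of the longest queue."""
    rng = random.Random(seed)
    queues = [0] * num_servers
    for _ in range(num_requests):
        if two_choices:
            target = pick_two_choices(queues, rng)
        else:
            target = rng.randrange(num_servers)  # uniform random scheme
        queues[target] += 1
    return max(queues)

max_uniform = simulate(100, 10_000, two_choices=False)
max_two_choices = simulate(100, 10_000, two_choices=True)
```

With an average of 100 requests per server, the longest queue under two choices should stay very close to 100, while uniform random placement typically leaves some queues noticeably longer.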
  19. Tied Requests • Instead of choosing one, enqueue in multiple

    servers. • Send the identity of the other servers as a part of the request -- Tying! • Send a cancellation request to the other servers once a server picks it off the queue. • Corner Case: ◦ What if both servers pick up the request while the cancellation messages are in transit? Network Delay? ◦ Typical in low-traffic scenarios when server queues are empty. ◦ Client can introduce a small delay of 2X the average network message delay (1ms or less in modern data-center networks) between the first request and the second request.
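The tying and cancellation steps can be sketched with a toy in-memory model. Everything here (`Server`, `send_tied`, `run_next`) is hypothetical, and cancellation is delivered instantly, so the in-transit corner case from the slide does not occur in this sketch.

```python
from collections import deque

class Server:
    def __init__(self, name):
        self.name = name
        self.queue = deque()    # (request_id, tied peer) pairs waiting to run
        self.cancelled = set()  # request ids the tied peer already started
        self.executed = []

    def enqueue(self, request_id, peer=None):
        self.queue.append((request_id, peer))

    def cancel(self, request_id):
        self.cancelled.add(request_id)

    def run_next(self):
        """Dequeue and run one request, skipping cancelled copies."""
        while self.queue:
            request_id, peer = self.queue.popleft()
            if request_id in self.cancelled:
                continue  # the tied copy was picked up elsewhere
            if peer is not None:
                peer.cancel(request_id)  # tell the tied copy to stand down
            self.executed.append(request_id)
            return request_id
        return None

def send_tied(request_id, first, second):
    """Enqueue the same request on both servers, each knowing its peer."""
    first.enqueue(request_id, peer=second)
    second.enqueue(request_id, peer=first)

busy, idle = Server("busy"), Server("idle")
busy.enqueue("earlier-work")
send_tied("r1", busy, idle)
idle.run_next()             # idle server starts r1 and cancels it on busy
busy.run_next()             # busy server works through its backlog
leftover = busy.run_next()  # the cancelled copy of r1 is skipped
```

The request runs exactly once, on whichever server reaches it first, which is what keeps the redundant work low.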
  20. • Tied Request Performance on BigTable - Uncached Read Requests

    from Disk. • Case 1 - Tests in Isolation. No external interference. • Case 2 - Concurrent Sorting Job running along with the benchmark test. • In both scenarios, the overhead of tied requests in disk utilization is less than 1%, indicating the cancellation strategy is effective at eliminating redundant reads. • Tied requests allow the workloads to be consolidated into a single cluster, resulting in dramatic computing cost reductions.
  21. Cross Request Long Term Adaptations

  22. Cross Request Long Term Adaptations • Reducing latency variability caused

    by coarse-grained phenomena ◦ Load imbalance. (Unbalanced data partitions) ▪ Centered on data distribution/placement techniques to optimize the reads. ◦ Service-time variations ▪ Detecting and mitigating effects of service performance variations. • Simple Partitioning ◦ Partitions have equal cost. Static assignment of a partition to a single machine? ◦ Sub-Optimal ▪ Performance of the underlying machines is neither uniform nor constant over time (Thermal Throttling and Shared workload Interference) ▪ Outliers/Hot Items can cause data-induced load imbalance • A particular item becomes popular and the load for its partition increases.
  23. Micro-Partitioning • # Partitions > # Machines • Partition large

    datasets into multiple pieces (10-100 per machine) [BigTable - Ref#3] • Dynamic assignment and Load balancing of these partitions to particular machines.
  24. Micro-Partitioning* • Fine grain load balancing - Move the partitions

    from one machine to another. • With an average of ~20 partitions per machine, the system can shed load in roughly 5% increments and in 1/20th the time it would take if the system had a one-to-one mapping of partitions to machines. • Using such partitions also leads to an improved failure recovery rate. ◦ More nodes to take over the work of a failed node. * Stoica, I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of SIGCOMM (San Diego, Aug. 27–31). ACM Press, New York, 2001, 149–160. • Does this sound familiar? (Hint : Think Amazon!) • The Dynamo paper also talks about a similar concept of multiple virtual nodes mapped to physical nodes. [Ref#4]
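A minimal sketch of why ~20 partitions per machine gives 5% increments, assuming every partition carries equal load. The names (`shed_load`, `assignment`, machine and partition ids) are made up for illustration; a real system would weigh partitions by measured load.

```python
def shed_load(assignment, hot_machine, count=1):
    """Move `count` partitions off hot_machine, each to whichever other
    machine currently holds the fewest partitions."""
    for _ in range(count):
        partition = assignment[hot_machine].pop()
        target = min(
            (m for m in assignment if m != hot_machine),
            key=lambda m: len(assignment[m]),
        )
        assignment[target].append(partition)
    return assignment

# Three machines with 20 micro-partitions each.
assignment = {f"m{i}": [f"p{i}-{j}" for j in range(20)] for i in range(3)}
before = len(assignment["m0"])
shed_load(assignment, "m0", count=1)
shed_fraction = (before - len(assignment["m0"])) / before
```

Moving a single partition sheds 1/20 of the hot machine's load. With a one-to-one mapping the only option would be to move everything.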
  25. Micro-Partitioning - Practical Examples • Cassandra - Partition Keys • HBase - Row Keys • Riak - Partition Keys
  26. Selective Replication • An enhancement of Micro-partitioning. • Detection or

    Prediction of items that are likely to cause load imbalance. • Create additional replicas of these items/micro-partitions. • Distribute the load among more replicas rather than moving the partitions across nodes. Practical Examples • HDFS ◦ Replication Factor ◦ Rebalancer (Human Triggered) • Cassandra ◦ Auto-rebalancing on topology changes.
  27. Latency Induced Probation Techniques • Servers can sometimes become slow

    due to: ◦ Possibly data issues ◦ Most likely interference issues (discussed earlier) • Intermediate Servers - Observe the latency distribution of the fleet. • Put a slow server on probation when it is consistently slower than the rest. ◦ Pass shadow requests to collect statistics. ◦ Put the node back into rotation once it stabilizes. • Reducing Server Capacity during load can improve overall latency!
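One way the probation decision could be sketched: compare each server's median latency with the fleet-wide median. The function name, the `factor` threshold, and the sample numbers are assumptions for illustration, not something the talk specifies.

```python
import statistics

def pick_probation(latencies_by_server, factor=3.0):
    """Return servers whose median latency exceeds `factor` times the
    median of all per-server medians."""
    medians = {
        server: statistics.median(samples)
        for server, samples in latencies_by_server.items()
    }
    fleet_median = statistics.median(medians.values())
    return sorted(s for s, m in medians.items() if m > factor * fleet_median)

# Hypothetical recent latencies (ms) observed by an intermediate server.
observed = {
    "s1": [10, 12, 11],
    "s2": [9, 11, 10],
    "s3": [80, 95, 90],  # suffering from interference
    "s4": [10, 10, 13],
}
on_probation = pick_probation(observed)
```

Servers on probation would keep receiving shadow requests, and the same check can be run on those statistics to decide when to put them back into rotation.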
  28. Large Information Retrieval Systems • Latency is a key quality

    metric. • Returning good results quickly is better than returning the best results slowly. • Good Enough Results? ◦ Once a sufficient fraction of the corpus has been searched, return the results without waiting for the remaining sub-queries. ◦ Tradeoff between Completeness vs. Responsiveness • Canary Requests ◦ High Fan out systems. ◦ Requests may hit untested code paths - crash or degrade multiple servers simultaneously. ◦ Forward the request to a limited number of leaf servers first and send it to the rest of the fleet only if that succeeds.
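The canary step can be sketched as follows. The names (`fan_out_with_canary`, `fake_leaf`, the `"poison"` request) are hypothetical, and the fan-out is sequential here for brevity where a real system would query the remaining leaves in parallel.

```python
def fan_out_with_canary(request, leaves, call_leaf, canary_count=1):
    """Try the request on a few canary leaves first. Any exception there
    propagates and the remaining leaves are never contacted."""
    canaries, rest = leaves[:canary_count], leaves[canary_count:]
    results = [call_leaf(leaf, request) for leaf in canaries]
    results += [call_leaf(leaf, request) for leaf in rest]
    return results

touched = []

def fake_leaf(leaf, request):
    # Hypothetical leaf server that crashes on one specific request.
    touched.append(leaf)
    if request == "poison":
        raise RuntimeError(f"{leaf} crashed on {request!r}")
    return f"{leaf}:{request}"

leaves = [f"leaf-{i}" for i in range(5)]
ok_results = fan_out_with_canary("tail latency", leaves, fake_leaf)
touched_ok = list(touched)

touched.clear()
try:
    fan_out_with_canary("poison", leaves, fake_leaf)
except RuntimeError:
    pass
touched_poison = list(touched)
```

A healthy request reaches all five leaves, while the crashing request only ever touches the single canary leaf.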
  29. Canary Requests • Istio • AWS API Gateway • Netflix Zuul
  30. Mutations/Writes in Data Intensive Services • Tolerating latency variability

    for mutations is relatively easier. ◦ The scale of latency-critical modifications is generally small. ◦ Updates can often be performed off the critical path, after responding to the user. ◦ Many services can be structured to tolerate inconsistent update models for (inherently more latency-tolerant) mutations. • For services that require consistent updates, the commonly used techniques are quorum-based algorithms (Paxos*). ◦ Since these algorithms touch only three to five replicas, they are inherently tail-tolerant. *Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.
  31. Hardware Trends • Some Bad: ◦ More aggressive power optimizations

    to save energy can lead to an increase in variability due to added latency in switching from/to power saving mode. • Trends friendly to tail-tolerant techniques: ◦ Lower latency data center networks can make things like Tied Request Cancellations work better.
  32. Conclusions • Living with Variability ◦ Fault-tolerant techniques needed as

    guaranteeing fault-free operation isn’t feasible beyond a point. ◦ Not possible to eliminate all sources of variability. • Tail Tolerant Techniques ◦ Will become more important with the hardware trends -- software level handling. ◦ Require some additional resources but with modest overheads. ◦ Leverage the capacity deployed for redundancy and fault tolerance. ◦ Can drive higher resource utilization overall with better responsiveness. ◦ Common Patterns - easy to bake into baseline libraries.
  33. References & Further Reading 1. 2. 3. Chang,

    F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. BigTable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (Seattle, Nov.). USENIX Association, Berkeley CA, 2006, 205–218. 4. 5. Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints. Image Source - by @glenncarstenspeters
  34. Questions?

  35. Co-ordinates • Twitter - @pigol1 • Mail - •

    LinkedIn -