The Tail at Scale - HasGeek Meetup

pigol
September 17, 2020

The Tail at Scale talk at the HasGeek Meetup.

Transcript

  1. The Tail at Scale
    Jeffrey Dean & Luiz André Barroso
    https://research.google/pubs/pub40801/
    Piyush Goel
    @pigol1
    [email protected]

  2. What are Percentiles?
    ● A percentile (or a centile) is a measure used in statistics indicating the value
    below which a given percentage of observations in a group of observations falls.
    For example, the 20th percentile is the value (or score) below which 20% of the
    observations may be found. Equivalently, 80% of the observations are found
    above the 20th percentile. -- Wikipedia
    ● Percentile Latency - The latency value below which a given percentage of requests fall.
    A 95th percentile of 100 ms means 95% of requests are served in under 100 ms.
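
    As an illustration (not from the deck), a tiny Python sketch of computing a percentile
    latency from a list of observed request latencies, using the nearest-rank method:

# Sketch: computing a percentile latency from observed request latencies.
# The 95th percentile is the latency below which 95% of requests complete.

def percentile(latencies_ms, p):
    """Return the p-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    # Nearest rank: the smallest observation covering p% of the samples.
    rank = max(1, round(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

samples = [12, 15, 11, 90, 14, 13, 250, 16, 12, 18]  # latencies in ms
print("p50 =", percentile(samples, 50), "ms")        # -> 14 ms
print("p95 =", percentile(samples, 95), "ms")        # -> 250 ms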

  3. What is Tail Latency?
    ● The slowest fraction (the last few percent) of the request latency distribution.
    ● In general, the slowest 1% of response times (latencies above the 99th percentile) can be
    taken as the tail latency of a request.
    ○ May vary depending on the scale of the system
    ■ InMobi - 8B ad requests/day (99.5th percentile)
    ■ Capillary - 500M requests/day (99th percentile)
    ● Responsive/Interactive Systems
    ○ Better user experience and fluidity -- higher engagement
    ○ ~100 ms targets -- large information-retrieval systems
    ○ Reads outnumber the writes!

  4. Why should we care about Tail Latency?
    ● Tail latency becomes important as the scale of our system increases.
    ○ 10K requests/day → 100M requests/day
    ○ Slowest 1% queries - 100 requests → 1M requests
    ● Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts,
    large online services need to create a predictably responsive whole out of
    less-predictable parts; we refer to such systems as “latency tail-tolerant,” or simply
    “tail-tolerant.”
    ● Many tail-tolerant techniques can leverage the infrastructure already provisioned for fault tolerance - better utilisation.

  5. Why does such Variability Exist?
    ● Software Variability
    ○ Shared Resources
    ■ Co-hosted applications
    ● Containers can provide isolation.
    ● Contention within the application still exists.
    ○ Daemons
    ■ Background and batch jobs running alongside serving processes.
    ○ Global Resource Sharing
    ■ Network Switches, Shared File Systems, Shared Disks.
    ○ Maintenance activities (Log compaction, data shuffling, etc).
    ○ Queueing (Network buffers, OS Kernels, intermediate hops)
    ● Other Aspects - Hardware Variability, Garbage Collection, Energy Management

  6. Variability is amplified by Scale!
    ● At large scale, the number of components and parallel operations per request increases (fan-out).
    ○ Micro-Services
    ○ Data Partitions
    ● This further increases the overall latency of the request.
    ○ Overall Latency ≥ Latency of Slowest Component
    ● Server with 1 ms avg. but 1 sec 99%ile latency
    ○ 1 Server: 1% of requests take ≥ 1 sec
    ○ 100 Servers: 63% of requests take ≥ 1 sec (Fan-out)
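
    The 63% figure follows from 1 - 0.99^100 ≈ 0.63; a quick Python check (my own
    illustration, not from the slides):

# Sketch: probability that a fanned-out request hits the latency tail.
# The request is fast only if every one of its parallel calls avoids the slow 1%.

def p_request_slow(p_server_slow, fan_out):
    return 1 - (1 - p_server_slow) ** fan_out

print(p_request_slow(0.01, 1))    # ~0.01  -> 1 server: 1% of requests are slow
print(p_request_slow(0.01, 100))  # ~0.634 -> 100 servers: ~63% of requests are slow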

  7. The effect of large fan-out on Latency Distributions.

  8. At Google scale - with a 10 ms 99th-percentile latency for any single request, the 99th
    percentile for all requests in a fan-out to finish is 140 ms, and the 95th percentile is 70 ms.
    ○ meaning that waiting for the slowest 5% of the requests to complete is
    responsible for half of the total 99th-percentile latency. Techniques that
    concentrate on these slow outliers can yield dramatic reductions in overall
    service performance.

  9. Can we control Variability?

  10. Reducing Component Variability - Some Practices
    ● Prioritize Interactive Requests or Real-time requests over background requests
    ○ Differentiate service classes and prefer keeping queues at higher levels (Google File System)
    ○ AWS uses something similar, albeit for Load Shedding [Ref#2]
    ● Reducing head-of-line blocking
    ○ Break long-running requests into a sequence of smaller requests to allow interleaving of the
    execution of other short-running requests. (Search Queries)
    ○ Example - Pagination requests when scanning large lists.
    ● Managing background activities and synchronized disruption
    ○ Throttling, breaking service operations into smaller pieces.
    ○ Example - Anti-virus scans, security scans and log compression can be run during off-business hours.
    ● Caching doesn't reduce variability much unless the entire working set resides in the cache.
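
    An illustrative Python sketch of differentiated service classes (the class names and
    queue shape are my assumptions, not GFS internals): interactive requests are dequeued
    ahead of queued background work:

# Sketch: differentiated service classes. Interactive requests jump ahead of
# background/batch work waiting in the server's queue.
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1      # lower number = higher priority
_counter = itertools.count()   # preserves FIFO order within a class
_queue = []

def enqueue(service_class, request):
    heapq.heappush(_queue, (service_class, next(_counter), request))

def dequeue():
    service_class, _, request = heapq.heappop(_queue)
    return request

enqueue(BATCH, "log-compaction chunk")
enqueue(INTERACTIVE, "user search query")
print(dequeue())   # -> "user search query" is served first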

  11. Variability is Inevitable!

  12. Living with Variability!
    ● Nearly impossible to eliminate all latency variability -- especially in the cloud era!
    ● Develop tail-tolerant techniques that mask or work around temporary latency pathologies, instead of
    trying to eliminate them altogether.
    ● Classes of Tail Tolerant Techniques
    ○ Within Request // Immediate Response Adaptations
    ■ Focus on reducing variations within a single request path.
    ■ Time Scale - 10ms
    ○ Cross Request Long Term Adaptations
    ■ Focus on holistic measures to reduce the tail at the system level.
    ■ Time Scale - 10 seconds and above.

  13. Within Request Immediate Response Adaptations

  14. Within Request Immediate Response Adaptations
    ● Cope with a slow subsystem in the context of a higher-level request.
    ● Time Scale == Right Now ( User is waiting )
    ● Multiple Replicas for additional throughput capacity.
    ○ Availability in the presence of failures.
    ○ This approach is particularly effective when most requests operate on largely read-only,
    loosely consistent datasets.
    ○ Spelling-correction service, contacts lookup - written once, read millions of times.
    ● Replication (request & data) can be used to reduce variability in a single higher-level request
    ○ Hedged Requests
    ○ Tied Requests

  15. Hedged Requests
    ● Issue the same request to multiple replicas and use the quickest response.
    ○ Send the request first to the most appropriate replica (what "appropriate" means is left open).
    ○ If no response arrives within a threshold, issue the request to another replica.
    ○ Cancel the outstanding requests after receiving the first response.
    ● Can amplify the traffic significantly if not implemented prudently.
    ○ One such approach is to defer sending a secondary request until the first request has been
    outstanding for more than the 95th-percentile expected latency for this class of requests.
    This approach limits the additional load to approximately 5% while substantially shortening
    the latency tail.
    ○ Mitigates the effect of external interference, not slowness inherent to the request itself.
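
    A minimal Python sketch of the hedging pattern described above (illustrative only;
    call(), the replica list and the p95 threshold are assumed stand-ins, not an API from
    the paper or this deck):

# Sketch: hedged request. Issue the request to one replica; only if it has been
# outstanding longer than the 95th-percentile latency, hedge to a second replica,
# take whichever answers first and cancel the other (best effort).
import concurrent.futures as cf

def hedged_call(call, replicas, p95_seconds):
    pool = cf.ThreadPoolExecutor(max_workers=2)
    primary = pool.submit(call, replicas[0])
    try:
        result = primary.result(timeout=p95_seconds)   # fast path: ~95% of requests
    except cf.TimeoutError:
        backup = pool.submit(call, replicas[1])        # hedge after the p95 deadline
        done, pending = cf.wait({primary, backup}, return_when=cf.FIRST_COMPLETED)
        for f in pending:
            f.cancel()                                 # best-effort cancellation of the loser
        result = next(iter(done)).result()
    pool.shutdown(wait=False)
    return result

    Deferring the hedge until the p95 deadline has passed is what keeps the extra load to
    roughly 5% of requests.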

  16. Hedged Requests
    ● Google Benchmark
    ○ Read 1,000 keys stored in a BigTable table.
    ○ Distributed across 100 different servers.
    ○ Sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for
    retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests.
    ● Vulnerabilities
    ○ Multiple servers might execute the same request - redundant computation.
    ○ The 95th-percentile deferral technique reduces this impact.
    ○ Further reduction requires more aggressive cancellation of requests.

  17. Hedged Requests - Usable Examples
    ● Twitter Finagle
    ○ https://twitter.github.io/finagle/docs/com/twitter/finagle/context/BackupRequest$.html
    ○ Server as a Function - https://dl.acm.org/doi/abs/10.1145/2626401.2626413
    ● Envoy Proxy
    ○ https://www.envoyproxy.io/docs/envoy/v1.12.0/intro/arch_overview/http/http_routing.html#arch-overview-http-routing-hedging
    ● Linkerd
    ○ Open Feature Request - https://github.com/linkerd/linkerd/issues/1069
    ● Spring Reactor
    ○ https://spring.io/blog/2019/01/23/spring-tips-hedging-client-requests-with-the-reactive-webclient-and-a-service-registry

  18. Queueing Delays
    ● Queueing delays add variability before a request begins execution.
    ● Once a request is actually scheduled and begins execution, the variability of its completion
    time goes down substantially.
    ● Mitzenmacher* - Allowing a client to choose between two servers based on queue lengths at
    enqueue time exponentially improves load-balancing performance over a uniform random
    scheme.
    * Mitzenmacher, M. The power of two choices in randomized load balancing. IEEE Transactions on
    Parallel and Distributed Systems 12, 10 (Oct. 2001), 1094–1104.
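
    A toy Python sketch of the power-of-two-choices idea (server and task counts are made
    up for illustration): sample two servers and enqueue on the one with the shorter queue:

# Sketch: "power of two choices" load balancing. Picking the shorter of two
# randomly sampled queues keeps the worst queue far shorter than purely
# random placement does.
import random

def choose_server(queue_lengths):
    a, b = random.sample(range(len(queue_lengths)), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b

queues = [0] * 100
for _ in range(10_000):                  # place 10,000 tasks on 100 servers
    queues[choose_server(queues)] += 1
print("max queue length:", max(queues))  # stays close to the mean of 100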

  19. Tied Requests
    ● Instead of choosing one, enqueue in multiple servers.
    ● Send the identity of the other server(s) as a part of the request -- tying!
    ● Send a cancellation request to the other servers once a server picks it off the queue.
    ● Corner Case:
    ○ What if both servers pick up the request while the cancellation messages are still in
    transit (network delay)?
    ○ Typical under low-traffic scenarios when server queues are empty.
    ○ Client can introduce a small delay of 2X the average network message delay (1ms or
    less in modern data-center networks) between the first request and the second
    request.
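
    A rough Python sketch of the server-side bookkeeping for a tied request (illustrative;
    the peer stubs and their cancel() call are hypothetical, not an API from the paper):

# Sketch: tied requests. The same request is enqueued on two servers and carries
# the identity of the other; whichever server dequeues it first tells its peer
# to drop it before doing the actual work.
import queue

class TiedRequestWorker:
    def __init__(self, peers):
        self.peers = peers          # {server_id: stub exposing .cancel(request_id)}
        self.inbox = queue.Queue()  # holds (request_id, peer_id, handler) tuples
        self.cancelled = set()      # request ids cancelled by the tied peer

    def cancel(self, request_id):
        # Called remotely by the peer that dequeued the tied copy first.
        self.cancelled.add(request_id)

    def run_once(self):
        request_id, peer_id, handler = self.inbox.get()
        if request_id in self.cancelled:
            return None                          # peer already started it; skip
        self.peers[peer_id].cancel(request_id)   # cancel the tied copy on the peer
        return handler()                         # execute the actual work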

  20. ● Tied Request Performance on BigTable - Uncached Read Requests from Disk.
    ● Case 1 - Tests in Isolation. No external interference.
    ● Case 2 - Concurrent Sorting Job running along with the benchmark test.
    ● In both scenarios, the overhead of tied requests in disk utilization is less than 1%, indicating the
    cancellation strategy is effective at eliminating redundant reads.
    ● Tied requests allow the workloads to be consolidated into a single cluster, resulting in dramatic
    computing cost reductions.

  21. Cross Request Long Term Adaptations

  22. Cross Request Long Term Adaptations
    ● Reducing latency variability caused by coarse-grained phenomena
    ○ Load imbalance. (Unbalanced data partitions)
    ■ Centered on data distribution/placement techniques to optimize the reads.
    ○ Service-time variations
    ■ Detecting and mitigating effects of service performance variations.
    ● Simple Partitioning
    ○ Assume partitions have equal cost and statically assign each partition to a single machine.
    ○ Sub-Optimal
    ■ Performance of the underlying machines is neither uniform nor constant over time
    (Thermal Throttling and Shared workload Interference)
    ■ Outliers/Hot Items can cause data-induced load imbalance
    ● Particular item becomes popular and the load for its partition increases.

  23. Micro-Partitioning
    ● # Partitions > # Machines
    ● Partition large datasets into multiple pieces (10-100 per machine) [BigTable - Ref#3]
    ● Dynamic assignment and Load balancing of these partitions to particular machines.

  24. Micro-Partitioning*
    ● Fine grain load balancing - Move the partitions from one machine to another.
    ● With an average of ~20 partitions per machine, the system can shed load in roughly 5% increments,
    in 1/20th the time it would take with a one-to-one mapping of partitions to machines.
    ● Using such partitions also leads to faster failure recovery.
    ○ More nodes are available to take over the work of a failed node.
    * Stoica I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup
    service for Internet applications. In Proceedings of SIGCOMM (San Diego, Aug. 27–31). ACM Press, New
    York, 2001, 149–160.
    ● Does this sound familiar? (Hint : Think Amazon!)
    ● The Dynamo paper also describes a similar concept of multiple virtual nodes mapped
    to physical nodes. [Ref#4]
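
    An illustrative Python sketch of micro-partitioning (the machine and partition counts
    and helper names are my assumptions): keys hash into many small partitions, and a
    mutable partition-to-machine map allows fine-grained rebalancing:

# Sketch: micro-partitioning. Keys hash into many more partitions than machines
# (~20 per machine); a separate, mutable assignment map lets a balancer shed
# load by moving individual partitions instead of whole nodes.
import hashlib

NUM_MACHINES = 5
NUM_PARTITIONS = NUM_MACHINES * 20   # many more partitions than machines

# Initial round-robin assignment; a load balancer may mutate this map at runtime.
assignment = {p: p % NUM_MACHINES for p in range(NUM_PARTITIONS)}

def partition_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def machine_for(key):
    return assignment[partition_for(key)]

def move_partition(partition, new_machine):
    # Fine-grained rebalancing: each move shifts only ~1% of the data.
    assignment[partition] = new_machine

print(machine_for("user:42"))
move_partition(partition_for("user:42"), 3)
print(machine_for("user:42"))   # reads for this key now go to machine 3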

  25. Micro-Partitioning - Practical Examples
    ● Cassandra - Partition key
    ○ https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key
    ● HBase - Row Keys
    ○ https://medium.com/@ajaygupta.hbti/design-principles-for-hbase-key-and-rowkey-3016a77fc52d
    ● Riak - Partition Keys
    ○ https://docs.riak.com/riak/kv/latest/learn/concepts/clusters/index.html

  26. Selective Replication
    ● An enhancement of Micro-partitioning.
    ● Detection or Prediction of items that are likely to cause load imbalance.
    ● Create additional replicas of these items/micro-partitions.
    ● Distribute the load among more replicas rather than moving the partitions across nodes.
    Practical Examples
    ● HDFS
    ○ Replication Factor
    ○ Rebalancer (Human Triggered)
    ● Cassandra
    ○ Auto-rebalancing on topology changes.
    ○ https://docs.datastax.com/en/opscenter/6.5/opsc/online_help/opscRebalanceCluster.html
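
    A rough Python sketch of selective replication (the threshold, machine names and helper
    names are illustrative assumptions): partitions whose request counts mark them as hot get
    extra replicas, and reads for them are spread across those replicas:

# Sketch: selective replication. Hot partitions (detected from request counts)
# get additional replicas, and reads are spread across all replicas of a
# partition instead of moving the partition elsewhere.
import random
from collections import Counter

request_counts = Counter()                              # partition -> observed requests
replicas = {p: ["m%d" % (p % 5)] for p in range(100)}   # partition -> machines holding it

def record_request(partition):
    request_counts[partition] += 1

def replicate_hot_partitions(threshold, extra_machines):
    for partition, count in request_counts.items():
        if count > threshold and len(replicas[partition]) == 1:
            replicas[partition].extend(extra_machines)  # add replicas, don't move data

def machine_for_read(partition):
    return random.choice(replicas[partition])           # load spreads over the replicas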

  27. Latency Induced Probation Techniques
    ● Servers can sometimes become slow due to:
    ○ Possibly data issues
    ○ More likely interference issues (discussed earlier)
    ● Intermediate servers observe the latency distribution of the fleet.
    ● Temporarily put a slow server on probation (take it out of live rotation).
    ○ Pass it shadow requests to keep collecting latency statistics.
    ○ Put the node back into rotation once it stabilizes.
    ● Interestingly, removing serving capacity during periods of high load can improve overall latency!
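
    An illustrative Python sketch of latency-induced probation (the slowdown factor and window
    size are arbitrary assumptions): backends whose recent median latency drifts well above the
    fleet median receive only shadow traffic until they recover:

# Sketch: latency-induced probation. Track recent latencies per backend; if a
# backend is much slower than the fleet median, stop routing live traffic to it
# and keep measuring it with shadow requests until it stabilizes.
import statistics
from collections import defaultdict, deque

recent = defaultdict(lambda: deque(maxlen=1000))   # backend -> recent latencies (ms)
on_probation = set()

def record(backend, latency_ms):
    recent[backend].append(latency_ms)

def reevaluate(slowdown_factor=3.0):
    medians = {b: statistics.median(v) for b, v in recent.items() if v}
    fleet_median = statistics.median(medians.values())
    for backend, m in medians.items():
        if m > slowdown_factor * fleet_median:
            on_probation.add(backend)        # live traffic bypasses this backend
        else:
            on_probation.discard(backend)    # stabilized: back into rotation

def live_backends(all_backends):
    return [b for b in all_backends if b not in on_probation]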

  28. Large Information Retrieval Systems
    ● Latency is a key quality metric.
    ● Retrieving good results quickly is better than returning the best results slowly.
    ● Good Enough Results?
    ○ Once a sufficient fraction of the corpus has been searched, return the results without
    waiting for the remaining leaf queries (see the sketch below).
    ○ Trade-off between completeness and responsiveness.
    ● Canary Requests
    ○ High Fan out systems.
    ○ Requests may hit untested code paths - crash or degrade multiple servers simultaneously.
    ○ Forward the request to one or two leaf servers first; send it to the rest of the fleet only if that succeeds.
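
    A rough Python sketch of the good-enough-results idea above (illustrative; query_leaf,
    the deadline and the coverage fraction are assumptions): fan out to the leaf servers and
    return at the deadline as long as enough of them have answered:

# Sketch: "good enough" results. Fan out to all leaf servers, but return once a
# deadline expires provided a sufficient fraction of the corpus has answered;
# otherwise wait for the stragglers.
import concurrent.futures as cf

def search(query_leaf, leaves, deadline_s, min_fraction=0.9):
    pool = cf.ThreadPoolExecutor(max_workers=len(leaves))
    futures = [pool.submit(query_leaf, leaf) for leaf in leaves]
    done, pending = cf.wait(futures, timeout=deadline_s)
    if len(done) / len(futures) < min_fraction:
        done, pending = cf.wait(futures)   # not enough coverage: wait for the rest
    for f in pending:
        f.cancel()                         # trade completeness for responsiveness
    pool.shutdown(wait=False)
    return [f.result() for f in done]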

  29. Canary Requests
    ● Istio
    ○ https://istio.io/latest/blog/2017/0.1-canary/
    ● AWS API Gateway
    ○ https://docs.aws.amazon.com/apigateway/latest/developerguide/create-canary-deployment.html
    ● Netflix Zuul
    ○ https://cloud.spring.io/spring-cloud-netflix/multi/multi__router_and_filter_zuul.html

  30. Mutations/Writes in Data Intensive Services
    ● Tolerating latency variability for data mutations is relatively easier:
    ○ The scale of latency-critical modifications is generally small.
    ○ Updates can often be performed off the critical path, after responding to the user.
    ○ Many services can be structured to tolerate inconsistent update models for (inherently more
    latency-tolerant) mutations.
    ● For services that require consistent updates, commonly used techniques are quorum-based
    algorithms (e.g., Paxos*) -- see the sketch below.
    ○ Since these algorithms touch only three to five replicas, they are inherently tail-tolerant.
    *Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.
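
    An illustrative Python sketch of why quorum writes are tail-tolerant (send_write is a
    hypothetical per-replica stub, not Paxos itself): with five replicas the write commits
    once any three acknowledge, so the slowest replicas stay off the critical path:

# Sketch: quorum write over five replicas. The write completes as soon as a
# majority (three) acknowledge, so the one or two slowest replicas never sit
# on the critical path of the request.
import concurrent.futures as cf

def quorum_write(send_write, replicas, record, quorum=3):
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(send_write, r, record) for r in replicas]
    acks = 0
    for f in cf.as_completed(futures):
        if f.result():                     # replica acknowledged the write
            acks += 1
        if acks >= quorum:
            pool.shutdown(wait=False)
            return True                    # committed; stragglers finish in the background
    pool.shutdown(wait=False)
    return False                           # quorum was not reached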

  31. Hardware Trends
    ● Some Bad:
    ○ More aggressive power optimizations to save energy can lead to an increase in
    variability due to added latency in switching from/to power saving mode.
    ● Variability technique friendly trends:
    ○ Lower latency data center networks can make things like Tied Request Cancellations
    work better.

  32. Conclusions
    ● Living with Variability
    ○ Fault-tolerant techniques needed as guaranteeing fault-free operation isn’t feasible beyond a point.
    ○ Not possible to eliminate all sources of variability.
    ● Tail Tolerant Techniques
    ○ Will become more important with current hardware trends -- handling moves to the software level.
    ○ Require some additional resources but with modest overheads.
    ○ Leverage the capacity deployed for redundancy and fault tolerance.
    ○ Can drive higher resource utilization overall with better responsiveness.
    ○ Common Patterns - easy to bake into baseline libraries.

  33. References & Further Reading
    1. https://cseweb.ucsd.edu/~gmporter/classes/sp18/cse124/post/schedule/p74-dean.pdf
    2. https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
    3. Chang F., Dean J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A.,
    and Gruber, R.E. BigTable: A distributed storage system for structured data. In Proceedings of the
    Seventh Symposium on Operating Systems Design and Implementation (Seattle, Nov.). USENIX
    Association, Berkeley CA, 2006, 205–218.
    4. https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
    5. Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints
    Image Source - https://unsplash.com/photos/npxXWgQ33ZQ by @glenncarstenspeters

  34. Questions?

  35. Co-ordinates
    ● Twitter - @pigol1
    ● Mail - [email protected]
    ● LinkedIn - https://www.linkedin.com/in/piyushgoel1/
