
QConPlus'21 - Beating the Speed of Light with Intelligent Request Routing

Network request latency is crucial for many Internet applications. For Netflix it matters even outside video streaming: lower latencies to our AWS cloud endpoints mean a smoother browsing experience for hundreds of millions of members. The catch: the Netflix service is used on hundreds of millions of devices all around the world, connecting to our data centers over the open Internet, an ever-changing global network with many possible paths, distributed ownership, and no centralized control.

This talk covers both API acceleration technology and a data-driven approach to building distributed systems at global scale that are safe to deploy and easy to maintain. While it follows Netflix's journey, the main principles and techniques can be applied by any owner of Internet-based services.

From this talk you’ll learn:

- how to build an Internet latency map for your customers;
- how to leverage knowledge of network protocols and edge infrastructure to do the impossible: beat the speed of light;
- how to use a data-driven approach to evolve your client-server interactions;
- how to do all of this with a small team, on a tight schedule, and with minimal risk to your users.

Sergey Fedorov

May 25, 2021

Transcript

1. INTRO: Why does Network Speed Matter? +100 ms latency = -1% sales. "A 100 ms latency penalty implies a 1% sales loss." (Amazon, 2009. Liddle, J.: Amazon Found Every 100ms of Latency Cost Them 1% in Sales. http://goo.gl/BUJgV)
2. INTRO: Why does Network Speed Matter? +3 s page load = -53% user traffic. "Starting in July 2018, page speed will be a ranking factor for mobile searches." (Google, 2018. https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html)
3. INTRO: Why does Network Speed Matter? "The reduction in latency can improve performance, and even improve profitability and increase sales for some customers." (Hibernia Networks, 2015. https://www.submarinenetworks.com/en/systems/trans-atlantic/project-express/hibernia-express-connects-new-york-to-london-in-under-58-95ms) A $400M investment to reduce New York-to-London latency by 6 ms: that's $66M per ms!
4. Open Connect Appliance (OCA): this rack can serve around 1% of Internet traffic. FreeBSD + Nginx, lots of SSDs/HDDs, optimized for maximum network throughput.
5. INSIDE NETFLIX: Personalization is powered by the AWS cloud, alongside the CDN. Hundreds of microservices: personalization, control plane, telemetry, big data, encoding, HTTP APIs, UI.
6. INSIDE NETFLIX: Personalized network requests contribute up to 40% of the time to render the Netflix home page.
7. PROBLEM STATEMENT: How can we leverage our CDN edge infrastructure to make device-cloud requests faster? For non-cacheable API requests, with minimal impact to device or server teams, and with low operational overhead.
8. SPEAKER INFO: About me. Sergey Fedorov, Director of Engineering, Content Delivery. 8+ years at Netflix: CDN monitoring, FAST.com, API acceleration, QoE optimization. Past: Intel/Microsoft.
9. AGENDA: What will I talk about? (1) How to reduce network latency using edge infrastructure; (2) how we did it at Netflix; (3) how YOU can learn from our experience.
10. Proxying Requests via a CDN Edge Server: the client hops onto the Open Connect backbone through the OCA. The closer the OCA, the greater the acceleration.
11. THEORY: EDGE ACCELERATION. Background: TCP + TLS connection establishment time depends on the distance to the server. With a 100 ms client-to-AWS RTT, the TCP handshake (SYN / SYN-ACK / ACK) costs 100 ms, the TLS handshake (ClientHello + key share / ServerHello + key share + Certificate / Finished) costs another 100 ms, and the HTTP request/response adds a third round trip. Time to first byte: 300 ms + processing.
12. THEORY: EDGE ACCELERATION. Reducing connection establishment overhead with a CDN proxy. With a 30 ms client-to-CDN RTT, TCP completes at 30 ms and TLS at 60 ms; the HTTP request then rides a persistent connection from the CDN to AWS (70 ms RTT). Time to first byte: 160 ms.
13. THEORY: EDGE ACCELERATION. Reducing data transfer times: multiplexing with HTTP/2 at the CDN proxy (serving both HTTP/1.1 and HTTP/2 clients), congestion avoidance on the private backbone, and faster loss recovery.
14. NETFLIX IMPLEMENTATION: MEASUREMENTS. Goal: understand network connectivity between Netflix devices and servers, i.e. build a latency map of the Internet for Netflix users, and use it to compare different routing options.
15. Network measurement system requirements: estimates for our current users; full device coverage; quick results; don't break production.
16. NETFLIX IMPLEMENTATION: MEASUREMENTS. Real User Monitoring of the Netflix graph (app launch, API calls, asset loading, rendering time): 100% geo and 100% device coverage, but a noisy signal and only production traffic.
17. NETFLIX IMPLEMENTATION: MEASUREMENTS. Synthetic monitoring with dedicated test servers (CDN tests, cloud tests, DNS tests): full control and a clean signal, but partial geo and limited device coverage.
18. NETFLIX IMPLEMENTATION: MEASUREMENTS. Probnik = RUM + synthetic: probes run alongside real app flows (app start, discovery, play; app startup and playback requests).
19. NETFLIX IMPLEMENTATION: MEASUREMENTS. Step 2: fetch and measure request time. The client fetches a probe config via an API call (1), then fetches each of targets 1..N and measures request time, repeated × pulses (2).
20. NETFLIX IMPLEMENTATION: MEASUREMENTS. Step 3: upload results. After the config call (1) and the fetch-and-measure pulses (2), the client reports the logged results (3).
21. NETFLIX IMPLEMENTATION: PROTOTYPING. What to use for measurements? Need a quick proof-of-concept. Proxy: implementation and deployment on the CDN. Steering: how to choose the CDN server for a client? Comparison: proxied requests vs AWS-direct.
22. NETFLIX IMPLEMENTATION: PROTOTYPING. Client-to-cloud routing options: direct link. Steering: 3 AWS regions, geoDNS for server selection.
23. NETFLIX IMPLEMENTATION: PROTOTYPING. Client-to-cloud routing options: CDN proxy. Steering: TCP anycast, a single IP for all CDN locations.
24. NETFLIX IMPLEMENTATION: PROTOTYPING. Using the probing system to compare. Recipe: type: HTTP GET; name: steering test; targets: cdn-proxy.test.me/probe, cloud.test.me/probe. Each probe reports time / OK / FAIL per target, CLOUD vs CDN-PROXY.
25. NETFLIX IMPLEMENTATION: PROTOTYPING. Results: no clear winner. [Map: per subregion, the proxy is faster in some and slower in others; lighter color = more clients.]
26. NETFLIX IMPLEMENTATION: PROTOTYPING. Recap: accurate estimate of performance; 100% CDN edge proxying doesn't work; no production impact.
27. NETFLIX IMPLEMENTATION: PROTOTYPING. No single winner, so route based on past performance. Goal: find the fastest possible network path among the available options: cloud-direct or CDN-proxy.
28. NETFLIX IMPLEMENTATION: PROTOTYPING. Intelligent client steering. Goal: choose the fastest path (cloud-direct vs CDN-proxy), with no additional API calls and easy client integration.
29. NETFLIX IMPLEMENTATION: PROTOTYPING. Using DNS: for api.netflix.com, return the IP of a server on the fastest path: an AWS server IP (cloud-direct path) or the TCP anycast IP (CDN-proxy path).
30. NETFLIX IMPLEMENTATION: PROTOTYPING. DNS-resolver-based steering. Challenge: the authoritative DNS can only see the recursive DNS resolver's IP, so it must decide based on aggregate client performance per resolver.
31. NETFLIX IMPLEMENTATION: PROTOTYPING. Step 1: measure network performance for each client and each path, using the probe recipe (type: HTTP GET; name: steering test; targets: cloud.test.me/probe, cdn-proxy.test.me/probe; time / OK / FAIL per target).
32. NETFLIX IMPLEMENTATION: PROTOTYPING. Step 2: aggregate device measurements by DNS resolver. GROUP BY resolver → resolver 1: CDN proxy; resolver 2: AWS; …
33. NETFLIX IMPLEMENTATION: PROTOTYPING. Step 3: load the resolver map into the authoritative DNS and steer based on resolver IP (resolver IP → CDN-proxy or AWS path).
34. NETFLIX IMPLEMENTATION: PROTOTYPING. Automatic traffic steering: probes → data pipeline → client steering. Measure different client routing options, choose the fastest path, use it as the production steering policy.
35. NETFLIX IMPLEMENTATION: PROTOTYPING. Use probes to measure effectiveness (type: HTTP GET; name: steering test; targets: cloud.test.me/probe, dns-steering.test.me/probe; time / OK / FAIL per target, CLOUD vs DNS STEERING).
36. NETFLIX IMPLEMENTATION: PROTOTYPING. Results: equal or better performance. [Map: per subregion, steering is faster or slower; lighter color = more clients.]
37. NETFLIX IMPLEMENTATION: PROTOTYPING. Recap: validated that the DNS-based solution works; tuned the resolver aggregation models; tested full integration of the end-to-end solution; no production impact.
38. PRODUCTIZE: Changing the critical path at Netflix scale: 200+ million users, 1K+ different devices, 10K+ CDN servers, 1K+ locations, 1M+ requests per second, 100+ microservices, 100+ deployments per day.
39. NETFLIX IMPLEMENTATION: PRODUCTIZATION. Changing the critical path with minimal production impact: minimize the scope of failure; embrace failure as part of your design; degrade gracefully.
40. NETFLIX IMPLEMENTATION: PRODUCTIZATION. Deployment principles: small features, frequent deployments, clean metrics. Workflow: 1. probe test; 2. A/B tests or canaries; 3. progressive rollout; 4. done.
41. NETFLIX IMPLEMENTATION: OPERATIONS. At Netflix scale, manual is not an option: 200+ million users, 1K+ different devices, 10K+ CDN servers, 1K+ locations, 1M+ requests per second, 100+ microservices, 100+ deployments per day.
42. NETFLIX IMPLEMENTATION: OPERATIONS. Automating the response to network failures, using the same probes (type: HTTP GET; name: reachability test; targets: cloud.test.me/probe, cdn-proxy.test.me/probe; time / OK / FAIL per target).
43. NETFLIX IMPLEMENTATION: OPERATIONS. cloud: FAIL, cdn-proxy: OK. What's broken? The ISP's connection to AWS. Can we fix it? Yes: route traffic via the CDN proxy.
44. NETFLIX IMPLEMENTATION: OPERATIONS. cloud: OK, cdn-proxy: FAIL. What's broken? The backbone link to AWS. Can we fix it? Yes: move traffic to the cloud-direct path.
45. NETFLIX IMPLEMENTATION: OPERATIONS. cloud: FAIL, cdn-proxy: FAIL. What's broken? An ISP outage or the client's last mile. Can we fix it? No: we don't have a routable path.
46. NETFLIX IMPLEMENTATION: OPERATIONS. Automatic traffic steering: probes → data pipeline → client steering. Measure different client routing options, choose the fastest path, use it as the production steering policy.
47. NETFLIX IMPLEMENTATION: SUMMARY. Recap: device-cloud API acceleration at Netflix. Probes → data pipeline → client steering: 100K measurements per minute, 200K routes, 1M+ production requests per second.
48. NETFLIX IMPLEMENTATION: SUMMARY. Recap: device-cloud API acceleration at Netflix. 10% median acceleration of requests on existing connections; 25% median acceleration of requests on new connections.
49. LEARNINGS. Request routing is only one of many optimizations you can test. Our team of 3 ran and deployed dozens of experiments over 3 years: edge termination + DNS steering, HTTP/2, TLS 1.3, TCP Fast Open, traffic rebalancing, DNS migration, HPACK compression, CDN chaining, IPv6 migration, ...
50. LEARNINGS. Keep the measure-prototype loop short; only productize what works. Measure → prototype → productize → operate: probing system (3 mo), first prototype (1 mo), DNS steering (3 mo), first prod traffic (2 mo), progressive deployment, then 2+ years of operation.
51. LEARNINGS. Embrace failure: fallback and failure avoidance lead to a much better operational experience. Our team gets less than 1 critical alert per week, on average. [Chart: alerts over time, in hours.]