Make the Internet faster. Sleep well at night.

What happens to network requests between your clients and servers? How do you make them faster without breaking things, and keep on-call shifts quiet? This talk shows how to architect, engineer, and operate complex fault-tolerant systems, using Netflix's network optimization work as an example.

Sergey Fedorov

October 30, 2019

Transcript

  1. Make the Internet faster. Sleep well at night. Sergey Fedorov

    DevOops, Saint-Petersburg October 30, 2019 @sfedov
  2. INTRO Why does speed matter? +100ms = 1%

    A 100 ms latency penalty implies a 1% sales loss. Amazon, 2009. Liddle, J.: Amazon Found Every 100ms of Latency Cost Them 1% in Sales. http://goo.gl/BUJgV Chart axes: sales loss, latency.
  3. INTRO Why does speed matter? +3s = -53%

    Starting in July 2018, page speed will be a ranking factor for mobile searches. Google, 2018 https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html Chart axes: user traffic, page load.
  4. INTRO Why does speed matter?

    The reduction in latency can improve performance, and even improve profitability and increase sales for some customers. Hibernia Networks, 2015 https://www.submarinenetworks.com/en/systems/trans-atlantic/project-express/hibernia-express-connects-new-york-to-london-in-under-58-95ms $400M investment to reduce New York to London latency by 6ms. That’s $66M per ms!
  5. INTRO Why does speed matter?

    Rebuffer Rate, Play Delay, Video Quality. We study tradeoffs amongst QoE metrics: do members prefer faster playback start with lower video quality or do they prefer to wait a bit longer but start at a higher quality? Netflix, 2017 https://medium.com/netflix-techblog/a-b-testing-and-beyond-improving-the-netflix-streaming-experience-with-experimentation-and-data-5b0ae9295bdf
  6. INTRO Network latency vs bandwidth.

    After 5 Mbps bandwidth does not materially impact load time for a typical page. Latency affects the load time linearly. Mike Belshe https://hpbn.co/primer-on-web-performance/
  7. AGENDA What is this talk about?

    • How to reduce latency of network requests. • How Netflix does it. • How to sleep well: operational excellence.
  8. AGENDA What will you learn?

    • Background: how network protocols affect request latency. • A practical approach to reducing latency, Netflix style. • How to design, engineer and operate fault-tolerant systems.
  9. Netflix movies and TV shows ABOUT NETFLIX Stranger Things House

    of Cards Black Mirror El Camino A Breaking Bad Movie Daredevil The Crown
  10. INSIDE NETFLIX Thousands of device types Personalized UI 4 client

    teams: iOS, Android, TV, Web Hundreds of AB tests Netflix Client
  11. OCA Open Connect Appliance CDN Edge Server: • FreeBSD •

    Nginx • Lots of SSDs/HDDs • Optimized for max network throughput INSIDE NETFLIX Netflix CDN Open Connect
  12. • NNSU’09 • Microsoft Imagine Cup’09 • Past: Intel/Microsoft •

    Netflix, 7 years ◦ CDN Monitoring ◦ FAST.com ◦ API acceleration ◦ QoE optimization INSIDE NETFLIX About me
  13. INSIDE NETFLIX How can CDN help? Client Client CDN Client

    Client hops onto the OC backbone through the OCA. The closer the OCA, the greater the acceleration.
  14. INTERNET ACCELERATION Background: TCP + TLS

    Sequence diagram, Client to AWS at 100 ms RTT (timeline 0 ms to 300 ms). TCP - 100ms: SYN, SYN ACK, ACK. TLS - 200ms: ClientHello; ServerHello, Certificate, ServerHelloDone; ClientKeyExchange, ChangeCipherSpec, Finished; ChangeCipherSpec, Finished. Then HTTP Request and HTTP Response Data. TTFB: 400 ms
  15. INTERNET ACCELERATION Background: TCP + TLS

    Sequence diagram, Client to CDN at 30 ms RTT (timeline 0 ms to 90 ms), CDN to AWS at 100 ms RTT (persistent connection). TCP - 30ms: SYN, SYN ACK, ACK. TLS - 60ms: ClientHello; ServerHello, Certificate, ServerHelloDone; ClientKeyExchange, ChangeCipherSpec, Finished; ChangeCipherSpec, Finished. Then HTTP Request and HTTP Response Data. TTFB: 220 ms
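
The two TTFB figures above follow from counting round trips. A minimal sketch of that arithmetic, assuming the full handshake shown on the slides (1 RTT for TCP, 2 RTTs for TLS) and an already-established proxy-to-AWS connection; the function and parameter names are illustrative, not from the talk:

```python
def ttfb_ms(client_rtt_ms, backend_rtt_ms=0, tcp_rtts=1, tls_rtts=2):
    """Time to first byte for a fresh HTTPS connection, in milliseconds.

    client_rtt_ms: round trip between the client and whatever terminates
        TCP and TLS (AWS directly, or the CDN proxy).
    backend_rtt_ms: extra round trip from the proxy to AWS over a
        persistent connection (0 when there is no proxy).
    """
    handshake = (tcp_rtts + tls_rtts) * client_rtt_ms
    request = client_rtt_ms + backend_rtt_ms  # HTTP request out, first byte back
    return handshake + request


direct = ttfb_ms(client_rtt_ms=100)                        # client -> AWS
proxied = ttfb_ms(client_rtt_ms=30, backend_rtt_ms=100)    # client -> CDN -> AWS
print(direct, proxied)  # 400 220
```

The handshake cost scales with the client's RTT to the terminating server, which is why moving termination closer to the client helps even though the request still travels to AWS.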
  16. INTERNET ACCELERATION Background: TCP Loss Recovery

    Diagram: Client, ISP, CDN. • Packet drops are more likely to happen on the last mile (wireless network, traffic shaping, crappy end-user router). • It takes at the very least 1 RTT to detect packet loss and at least 1.5 RTT to repair it. During that time the application is not getting new data. • Lower RTT speeds up the recovery and increases the throughput.
  17. INTERNET ACCELERATION Background: Internet Congestion

    Diagram: Client, ISP, CDN. • The long-leg section of the network, between OCA and AWS, is the largest part of the RTT • The client's ISP link is often congested during peak times, competing with video and other sites • OC Backbone has lots of capacity (mostly Fill and stats) and we ensure QoS
  18. INTERNET ACCELERATION Background: HTTP 2 (+) Multiplexing with HTTP/2 Proxy

    HTTP/1.1 Client HTTP/1.1 Client HTTP/2 Client HTTP/2 Client • Better resource utilisation • Less idling • Higher chance to keep congestion window high • Out of the box header compression • No head-of-line blocking at HTTP layer
  19. INSIDE NETFLIX How can CDN help? Client Client CDN Client

    Client hops onto the OC backbone through the OCA. The closer the OCA, the greater the acceleration.
  20. • Speed: will it be faster? • Reliability: does it

    fail more? • Complexity: how to integrate with the client? • Cost: $$$ of extra infrastructure? • ... MEASURE What do we need to know?
  22. • Estimates from our current users. • Full device coverage.

    • Quickly. • Don’t break production. MEASURE Desired results.
  23. NETFLIX GRAPH App Launch API Call Asset Loading Rendering Time

    RUM: Passive Request Monitoring 100% geo 100% devices Noisy signal Only production traffic
  24. MEASURING NETWORK COMPONENTS Synthetic

    Dedicated test servers: CDN Tests, Cloud Tests, DNS Tests. Full control. Clean signal. Partial geo. Limited device coverage.
  25. MEASURING NETWORK COMPONENTS Client Probes = RUM + Synthetic

    Probe: Get Recipe, Fetch & Measure, Report (LOG)
  26. MEASURING NETWORK COMPONENTS Probe Recipe Specifies What to Test

    Config:
    type: HTTP GET
    name: reachability test
    targets: http://target1.test.me/probe http://target2.test.me/probe http://target3.test.me/probe
    payload: 5KB
    pulses: 3
    delay: 5s
  27. MEASURE Probe Recipe Specifies What to Test

    Config:
    type: HTTP GET
    name: reachability test
    targets: http://target1.test.me/probe http://target2.test.me/probe http://target3.test.me/probe
    payload: 5KB
    pulses: 3
    delay: 5s

    type: HTTPS GET
    name: large payload test
    targets: https://target1.test.me/probe https://target2.test.me/probe https://target3.test.me/probe
    payload: 100KB
    pulses: 3
    delay: 10s
    ...
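
One way a client could hold a recipe in memory is sketched below. The `ProbeRecipe` class and its field names are hypothetical; only the values (targets, payload sizes, pulses, delays) come from the slide, and 1 KB is assumed to mean 1024 bytes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProbeRecipe:
    name: str
    method: str              # "HTTP GET" or "HTTPS GET", as on the slide
    targets: Tuple[str, ...]
    payload_bytes: int
    pulses: int
    delay_s: float

    def total_requests(self) -> int:
        # Every target is fetched once per pulse.
        return len(self.targets) * self.pulses


reachability = ProbeRecipe(
    name="reachability test",
    method="HTTP GET",
    targets=tuple(f"http://target{i}.test.me/probe" for i in (1, 2, 3)),
    payload_bytes=5 * 1024,    # assumes 1 KB = 1024 bytes
    pulses=3,
    delay_s=5,
)

large_payload = ProbeRecipe(
    name="large payload test",
    method="HTTPS GET",
    targets=tuple(f"https://target{i}.test.me/probe" for i in (1, 2, 3)),
    payload_bytes=100 * 1024,
    pulses=3,
    delay_s=10,
)

print(reachability.total_requests())  # 9
```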
  28. MEASURE Multiple recipes can be used

    Step 1: the Client makes an API Call to get the Config. Recipes: Reachability Test, DNS Providers Test, CDN Test, Large Payload Test (20%, 10%, 1%, 2%). Choose one, with a given probability
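
"Choose one, with a given probability" can be sketched as a single random draw against cumulative weights. The pairing of percentages to recipes follows the order on the slide and is an assumption, as are the names `RECIPE_WEIGHTS` and `choose_recipe`. Because the weights sum to less than 1, most clients run no probe at all.

```python
import random

# Assumed pairing: percentages matched to recipes in slide order.
RECIPE_WEIGHTS = {
    "reachability test": 0.20,
    "dns providers test": 0.10,
    "cdn test": 0.01,
    "large payload test": 0.02,
}


def choose_recipe(weights, rng=random):
    """Return one recipe name, or None when this client runs no probe."""
    roll = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for name, probability in weights.items():
        cumulative += probability
        if roll < cumulative:
            return name
    return None  # the remaining probability mass: no probe this time


print(choose_recipe(RECIPE_WEIGHTS))
```

Passing `rng` explicitly keeps the selection testable with a fixed random source.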
  29. Client Config API Call 1 Target 1 Target 2 Target

    N 2 Fetch & Measure × pulses MEASURE Step 2: Request and Measure
  30. MEASURE Making Network Calls Pulse 1 Pulse 2 Start End

    Start End Pulse Pause Pulse 3 Start End Pulse Pause Target 1 Target 2 Target N
  31. Response Code: OK / FAIL / ... DNS TCP TLS

    First Byte Download Server MEASURE Collected Measurements
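
The fetch-and-measure loop from the last few slides might look like the sketch below. It assumes targets are fetched one after another within each pulse, which the slides do not spell out, and it takes the actual request as a caller-supplied `fetch(url)` function (hypothetical) that returns timings in milliseconds and raises `OSError` on a network failure.

```python
import time


def run_probe(targets, pulses, delay_s, fetch, sleep=time.sleep):
    """Fetch every target `pulses` times and collect per-request results."""
    results = []
    for pulse in range(pulses):
        if pulse > 0:
            sleep(delay_s)  # pause between pulses
        for url in targets:
            record = {"target": url, "pulse": pulse}
            try:
                # Expected keys mirror the slide: dns, tcp, tls, first_byte, download.
                record.update(fetch(url))
                record["status"] = "OK"
            except OSError as error:
                record["status"] = "FAIL"
                record["error"] = str(error)
            results.append(record)
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP client, for demonstration only."""
    if "down" in url:
        raise OSError("unreachable")
    return {"dns": 5, "tcp": 30, "tls": 60, "first_byte": 120, "download": 10}


report = run_probe(
    ["http://target1.test.me/probe", "http://down.test.me/probe"],
    pulses=3,
    delay_s=0,
    fetch=fake_fetch,
)
print(len(report), report[1]["status"])  # 6 FAIL
```

A failed request is recorded rather than raised, so one unreachable target does not hide the measurements for the others.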
  32. Client Config API Call 1 LOG Logblob 3 Target 1

    Target 2 Target N 2 Report Fetch & Measure × pulses MEASURE Step 3: Upload Results
  33. $1M+ per year for a vendor solution 6K+ probes per

    second 14 recipes 1K+ devices 100M+ locations MEASURING NETWORK COMPONENTS Probe Stats at Netflix
  34. • Proxy: implementation and deployment on CDN • Steering: how

    to choose the CDN server for a client? • Comparison: Proxied requests vs AWS-direct PROTOTYPE What to Use for Measurements? Need a Quick Proof-of-Concept.
  35. MEASURING NETWORK COMPONENTS Proof-of-Concept: CDN Reverse Proxy

    • Go-based • Deployed on each CDN as a static binary • HTTP/2 connection pooling and request multiplexing • No bells and whistles: simple, mostly Go standard library components
  36. Cloud PROTOTYPE Client to AWS Routing Options: Direct Link Cloud

    Steering: • 3 AWS regions • geoDNS for load balancing
  37. IX Cloud Proxy PROTOTYPE Private Backbone Client to AWS Routing

    Options: Proxy via IX CDN IX CDN Steering: • TCP Anycast • Single IP • 70+ sites
  38. ISP IX Cloud Proxy Proxy PROTOTYPE Client to AWS Routing

    Options: Proxy via ISP Private Backbone ISP CDN Steering: • Steer to the same site as video • Based on BGP advertisement + traffic shaping • 1K+ sites
  39. ISP IX Cloud Proxy Proxy PROTOTYPE Client to AWS Routing

    Options: ISP to IX Chain Private Backbone Which path is faster? Which path is more reliable?
  40. ISP IX Cloud CLOUD Proxy IX-CLOUD Proxy ISP-CLOUD type: HTTP

    GET name: steering test targets: cloud.test.me/probe ix-cloud.test.me/probe isp-cloud.test.me/probe PROTOTYPE 3 Network Paths to Reach the Cloud Private Backbone cloud: time / OK / FAIL ix-cloud: time / OK / FAIL isp-cloud: time / OK / FAIL
  41. • Simple proxying is not enough • Accurate estimation with

    the measurement system • No production impact PROTOTYPE Recap
  42. PROTOTYPE: STEERING Goal: find the fastest possible network path among the available options:

    - AWS-direct - IX-proxy (TCP Anycast) - ISP-proxy (Unicast) There is no single winner, so we need to choose based on measured performance
  43. ISP IX Cloud Proxy Proxy PROTOTYPE: STEERING Client Steering Private

    Backbone Client Steering Goal: choose the fastest path • Cloud vs IX vs ISP • No additional API calls • Easy client integration
  44. ISP IX Cloud Proxy Proxy PROTOTYPE: STEERING Client DNS Steering

    Private Backbone DNS Solution: DNS api.netflix.com -> IP of the server on the fastest path • Cloud : AWS Region IP • IX : Anycast IP • ISP : Site Unicast IP
  45. ISP IX Cloud Proxy Proxy PROTOTYPE: STEERING DNS Resolver Based

    Steering Private Backbone Auth DNS Recursive Resolver Challenge: auth DNS can only see recursive DNS resolver IP
  46. ISP IX Cloud Proxy Proxy PROTOTYPE: STEERING DNS Resolver Based

    Steering Private Backbone Auth DNS Recursive Resolver Challenge: need to make a decision based on aggregate client performance
  47. Step 1: measure network for each client and each path

    ISP IX Cloud CLOUD Proxy IX-CLOUD Proxy ISP-CLOUD PROTOTYPE: STEERING Private Backbone type: HTTP GET name: steering test targets: cloud.test.me/probe ix-cloud.test.me/probe isp-cloud.test.me/probe cloud: time / OK / FAIL ix-cloud: time / OK / FAIL isp-cloud: time / OK / FAIL
  48. PROTOTYPE: STEERING Step 2: Aggregate Results by Resolver

    Probe targets: ISP Proxy, IX Proxy, Cloud. GROUP BY resolver: resolver 1: IX Proxy; resolver 2: ISP Proxy (stack #); resolver 3: AWS; resolver 4: IX Proxy; resolver 5: AWS; resolver 6: IX Proxy; …
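
A minimal sketch of the GROUP BY step: probe samples are grouped by recursive resolver, and each resolver is mapped to the path with the lowest typical latency. The median, the 5% failure-rate cutoff, and the fallback to the cloud path are all assumptions for illustration; the talk does not describe the actual aggregation model.

```python
from collections import defaultdict
from statistics import median


def build_resolver_map(samples, max_fail_rate=0.05, default_path="cloud"):
    """samples: iterable of (resolver_ip, path, latency_ms), latency None on FAIL."""
    grouped = defaultdict(lambda: defaultdict(list))
    for resolver, path, latency in samples:
        grouped[resolver][path].append(latency)

    resolver_map = {}
    for resolver, paths in grouped.items():
        best_path, best_latency = default_path, float("inf")
        for path, latencies in paths.items():
            ok = [value for value in latencies if value is not None]
            fail_rate = 1 - len(ok) / len(latencies)
            if not ok or fail_rate > max_fail_rate:
                continue  # too unreliable to steer traffic to
            typical = median(ok)
            if typical < best_latency:
                best_path, best_latency = path, typical
        resolver_map[resolver] = best_path
    return resolver_map


samples = [
    ("192.0.2.1", "cloud", 200), ("192.0.2.1", "ix-cloud", 120), ("192.0.2.1", "isp-cloud", 150),
    ("192.0.2.2", "cloud", 180), ("192.0.2.2", "ix-cloud", None), ("192.0.2.2", "isp-cloud", 90),
    ("192.0.2.3", "cloud", None), ("192.0.2.3", "ix-cloud", None),
]
print(build_resolver_map(samples))
```

Aggregating per resolver reflects the constraint from the earlier slides: the authoritative DNS sees only the resolver's IP, so the decision has to hold for all clients behind it.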
  49. REMEDIATION Step 3: Load the Resolver Map to Auth DNS and Steer Based on Resolver IP

    Diagram: Recursive Resolver, Auth DNS, ISP Proxy, IX Proxy, Cloud, Private Backbone. Resolver Map: Resolver IP -> Path (ISP Proxy, IX Proxy, or AWS)
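
On the authoritative DNS side, steering then reduces to a lookup keyed on the resolver's IP. The sketch below uses documentation-range addresses and a single address per path; both are simplifications (an ISP path would really resolve to a site-specific unicast IP), and `answer_for` is a hypothetical name.

```python
# Hypothetical addresses from documentation ranges, one per path.
PATH_ADDRESSES = {
    "cloud": "203.0.113.10",      # AWS region IP
    "ix-cloud": "203.0.113.20",   # anycast IP
    "isp-cloud": "203.0.113.30",  # unicast IP of an ISP site
}


def answer_for(resolver_ip, resolver_map, path_addresses=PATH_ADDRESSES,
               default_path="cloud"):
    """Pick the A-record value to return to a given recursive resolver."""
    path = resolver_map.get(resolver_ip, default_path)
    return path_addresses[path]


resolver_map = {"192.0.2.1": "ix-cloud", "192.0.2.2": "isp-cloud"}
print(answer_for("192.0.2.1", resolver_map))     # 203.0.113.20
print(answer_for("198.51.100.7", resolver_map))  # 203.0.113.10
```

Resolvers missing from the map fall back to the direct cloud path, so an incomplete map never leaves a client without an answer.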
  50. Client Steering PROTOTYPE: STEERING Automatic Traffic Steering Probes => Data

    Pipeline => Measure different client routing options Choose the fastest/most reliable Use for production steering policy
  51. ISP IX Cloud Proxy Proxy PROTOTYPE: STEERING Use Probes to

    Measure Effectiveness Private Backbone DNS CLOUD STEERING type: HTTP GET name: steering test targets: cloud.test.me/probe dns-steering.test.me/probe cloud: time / OK / FAIL dns-steering: time / OK / FAIL
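
Comparing the two probe targets comes down to summarizing latency and failure rate per target. A sketch under assumed record fields (`target`, `status`, `time_ms`), with the median as an illustrative choice of statistic:

```python
from statistics import median


def compare_paths(results, baseline="cloud", candidate="dns-steering"):
    """Summarize median latency and failure rate for two probe targets."""
    summary = {}
    for name in (baseline, candidate):
        rows = [row for row in results if row["target"] == name]
        ok = [row["time_ms"] for row in rows if row["status"] == "OK"]
        summary[name] = {
            "median_ms": median(ok) if ok else None,
            "fail_rate": 1 - len(ok) / len(rows) if rows else None,
        }
    return summary


results = [
    {"target": "cloud", "status": "OK", "time_ms": 200},
    {"target": "cloud", "status": "OK", "time_ms": 220},
    {"target": "cloud", "status": "FAIL", "time_ms": None},
    {"target": "dns-steering", "status": "OK", "time_ms": 150},
    {"target": "dns-steering", "status": "OK", "time_ms": 170},
]
print(compare_paths(results))
```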
  52. • DNS-based solution • Proved that the resolver steering can

    work • Tuned the resolver aggregation models • Tested full integration of end-to-end solution • No production impact PROTOTYPE: STEERING Recap
  53. CDN HTTP APIs PRODUCTIZE Scope of work 150 Million Users

    1K+ Different Devices 10K+ CDN Servers 1K+ Locations 1M+ Requests per Second Internet Scale Routing 100+ Microservices 100+ Deployments per Day
  54. Minimize the scope of failure Embrace failure as part of

    your design Graceful degradation PRODUCTIZE How to continue to innovate without a risk of prod impact?
  55. Workflow: 1. Probe test 2. AB tests or Canaries 3.

    Progressive rollout 4. Done PRODUCTIZE Small features. Frequent deployments.
  56. PRODUCTIZE Canary your deployments. Site 1 Canary Control Site 2

    Canary Control Site N Canary Control ...
  57. PRODUCTIZE Canary your deployments

    Diagram: servers at Site 1, Site 2, ... Site N are each split across Wave1, Wave2, Wave3 and Wave4.
  58. PRODUCTIZE Clients CDN AWS Kafka Pipeline Hive ES Atlas ETLs

    Client Logs Probes User QoE DNS logs Access logs Proxy logs Access logs Lumen* Cloud Edge * Lumen: Custom, Self-Service Dashboarding For Netflix: https://medium.com/netflix-techblog/lumen-custom-self-service-dashboarding-for-netflix-8c56b541548c Instrument as much as possible.
  59. MONITOR When you change the critical path, you are the first to be blamed for any issue

    150 Million Users 1K+ Different Devices 10K+ CDN Servers 1K+ Locations 1M+ Requests per Second Internet Scale Routing 100+ Microservices 100+ Deployments per Day
  60. MONITOR Monitoring philosophy 1. Detect 2. Triage 3. Scope 4.

    Debug/Troubleshoot 5. Modeling More real-time More Granular Dashboard counts: 40 in Lumen dashboards 15 in Kibana 25 in Tableau
  61. MONITOR Monitoring philosophy 1. Detect 2. Triage 3. Scope 4.

    Debug/Troubleshoot 5. Modeling Time to triage: 1-2 minutes Time to debug/troubleshoot: minutes to hours More real-time More Granular
  62. ISP IX Cloud CLOUD CDN IX-CLOUD CDN ISP-CLOUD MONITOR Probe

    for Reachability Private Backbone type: HTTP GET name: steering test targets: cloud.test.me/probe ix-cloud.test.me/probe isp-cloud.test.me/probe cloud: time / OK / FAIL ix-cloud: time / OK / FAIL isp-cloud: time / OK / FAIL
  63. cloud: FAIL ix-cloud: OK isp-cloud: FAIL What’s broken? - ISP’s

    connection to AWS Can we fix it? - YES - Move traffic via the IX CDN server MONITOR Remediation for Broken Path ISP IX Cloud CLOUD CDN IX-CLOUD CDN ISP-CLOUD Private Backbone
  64. cloud: FAIL ix-cloud: OK isp-cloud: FAIL What’s broken? - Backbone

    link to AWS Can we fix it? - YES - Move traffic via the ISP or AWS-direct MONITOR Remediation for Broken Path ISP IX Cloud CLOUD CDN IX-CLOUD CDN ISP-CLOUD Private Backbone
  65. cloud: FAIL ix-cloud: FAIL isp-cloud: FAIL What’s broken? - ISP

    outage or client last mile Can we fix it? - NO (we don’t have a routable path) MONITOR Remediation for Full Isolation ISP IX Cloud CLOUD CDN IX-CLOUD CDN ISP-CLOUD Private Backbone
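
The remediation logic across these slides can be sketched as: move traffic to any path whose probes still succeed, and accept that nothing can be done when all of them fail. The preference order and the function name below are hypothetical.

```python
def remediation(status, preference=("isp-cloud", "ix-cloud", "cloud")):
    """status: dict mapping a path name to "OK" or "FAIL".

    Returns the path to steer traffic to, or None when no path is routable.
    """
    reachable = [path for path in preference if status.get(path) == "OK"]
    if not reachable:
        return None  # full isolation: ISP outage or client last mile
    return reachable[0]


print(remediation({"cloud": "FAIL", "ix-cloud": "OK", "isp-cloud": "FAIL"}))    # ix-cloud
print(remediation({"cloud": "FAIL", "ix-cloud": "FAIL", "isp-cloud": "FAIL"}))  # None
```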
  66. Client Steering MONITOR Automatic traffic steering Probes => Data Pipeline

    => Measure different client routing options Choose the fastest/most reliable Use for production steering policy
  67. MONITOR Operational Principles

    • Reduce failure impact • Collect metrics • Invest in dashboarding/triage tooling • Auto-correct issues whenever possible • Alert on big incidents
  68. WORKFLOW Workflow Monitor Productize Prototype Measure Probing system First prototype

    DNS Steering First prod traffic Gradual deployment 3mo 1mo 4mo 2mo 2+yr
  69. LEARN Challenge your assumptions every step of the way

    Learn: Measure, Prototype, Productize, Monitor
  70. Question your intuition LEARN Some examples of our surprises: •

    CDN Proxy Latency results • TCP Anycast stability • DNS overhead • TLS impact • Traffic patterns • Resource impact • ...
  71. Learn from production data LEARN Examples of problems in prod:

    • Client integration bugs • URL length limits • Device TLS overhead • DNS behavior • ...
  72. LEARN The only way to see what works for YOU

    is to try and measure We see different patterns from: • Akamai: DNS resolvers stats • Uber: HTTP2 performance • Cloudflare: TLS performance • ...
  73. LEARN Say NO to fancy features, unless you prove you

    need them. What we did NOT do: • ECS EDNS0 subnet • Network architecture • Dynamic PID loop for traffic balancing • Response caching • ...
  74. LEARN Look out for new use cases along the way

    We started with latency, but now we also do: • AWS regional traffic balancing • CDN reliability modeling • DNS configuration changes • TLS/TCP config improvements • ...
  75. Client Steering MONITOR Probes => Data Pipeline => Measure different

    client routing options Choose the fastest/most reliable Use for production steering policy Architecture recap
  76. ISP IX Cloud Proxy Proxy REMEDIATION Architecture recap Private Backbone

    Auth DNS Recursive Resolver Resolver IP -> ISP Proxy IX Proxy AWS Path: Resolver Map
  77. SUMMARY Summary

    • Understand network impact. • Measure what you can control. • Start quick and simple. • Embrace failure. • Control alert fatigue. • Do what works for you, not someone else. • Learn as you go. Learn: Measure, Prototype, Productize, Monitor