
Networking @Scale'19 - Getting a Taste of Your Network - Sergey Fedorov

Sergey Fedorov, Senior Software Engineer at Netflix, describes a client-side network measurement system called "Probnik", and how it can be used to improve performance, reliability and control of client-server network interactions.

Sergey Fedorov

September 09, 2019

Transcript

  1. INTRO: Netflix Network Paths: Scale

    Diagram: Client, DNS, CDN, Video, Small assets, API Acceleration, Private Backbone. Scale figures: 150M+ users, 190+ countries, 1K+ of devices; 4 providers, 100+ zones; 10K+ servers, 1K+ sites; 3 regions, 100+ microservices.
  2. INTRO: What could possibly go wrong

    Client app release/bug/change; Client OS release; DNS issue; Last mile network issue; Internet congestion; Route leak; AWS outage; AWS microservice release; ... Diagram: CDN, API Acceleration, Private Backbone, Video, DNS, Small assets, Client.
  3. INTRO: Basic Troubleshooting Guide

    Flowchart with YES/NO branches: Is it a network problem? Can I fix it? Has the fix worked? Outcomes: Delegate, Delegate.
  4. INTRO: Basic Troubleshooting Guide

    Flowchart with YES/NO branches: Is it a network problem? Can I fix it? Has the fix worked? Could I prevent it? Outcomes: Delegate, Delegate, Try again.
  5. INTRO: Basic Troubleshooting Guide

    Steps: 1 Measure, 2 Troubleshoot, 3 Remediate, 4 Prevent. Flowchart with YES/NO branches: Is it a network problem? Can I fix it? Has the fix worked? Could I prevent it? Outcomes: Delegate, Delegate, Try again.
  6. MEASURING NETWORK COMPONENTS: Measuring Network Components

    Designing a system to measure Netflix network interactions. Steps: 1 Measure, 2 Troubleshoot, 3 Remediate, 4 Prevent.
  7. NETFLIX GRAPH: RUM: Passive Request Monitoring

    Timeline (Time axis): App Launch, API Call, Asset Loading, Rendering. Properties: 100% geo, 100% devices, noisy signal, only production traffic.
  8. MEASURING NETWORK COMPONENTS: Synthetic

    Dedicated test servers run CDN Tests, Cloud Tests and DNS Tests against CDN, Cloud and DNS. Properties: full control, clean signal, partial geo, limited device coverage.
  9. MEASURING NETWORK COMPONENTS: Can we combine the benefits?

    YES, but we had to build that ourselves.
  10. MEASURING NETWORK COMPONENTS: Client Probes = RUM + Synthetic

    Probe: Get Recipe, Fetch & Measure, Report (LOG).
  11. MEASURING NETWORK COMPONENTS: Probe Recipe Specifies What to Test

    Config: type: HTTP GET; name: reachability test; targets: http://target1.test.me/probe, http://target2.test.me/probe, http://target3.test.me/probe; payload: 5KB; pulses: 3; delay: 5s.
  12. MEASURING NETWORK COMPONENTS: Probe Recipe Specifies What to Test

    Config, recipe 1: type: HTTP GET; name: reachability test; targets: http://target1.test.me/probe, http://target2.test.me/probe, http://target3.test.me/probe; payload: 5KB; pulses: 3; delay: 5s. Recipe 2: type: HTTPS GET; name: large payload test; targets: https://target1.test.me/probe, https://target2.test.me/probe, https://target3.test.me/probe; payload: 100KB; pulses: 3; delay: 10s. ...
  13. MEASURING NETWORK COMPONENTS: Multiple recipes can be used

    Client -> Config (1: API Call). Recipes: Reachability Test, DNS Providers Test, CDN Test, Large Payload Test. Probabilities shown: 20%, 10%, 1%, 2%. Choose one, with a given probability.
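    The recipe fields and the "choose one, with a given probability" step can be sketched in Python as follows. This is a minimal illustration, not the Probnik implementation: the `Recipe` class, the `choose_recipe` function, and the pairing of 20% and 2% with the two recipes are assumptions made for the example. Since the probabilities do not need to add up to 100%, the sketch returns `None` when no recipe is selected, so most sessions run no probe at all.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Recipe:
    """One probe recipe; field names follow the slide, the class is hypothetical."""
    type: str
    name: str
    targets: Tuple[str, ...]
    payload: str
    pulses: int
    delay_s: int


REACHABILITY = Recipe(
    type="HTTP GET",
    name="reachability test",
    targets=(
        "http://target1.test.me/probe",
        "http://target2.test.me/probe",
        "http://target3.test.me/probe",
    ),
    payload="5KB",
    pulses=3,
    delay_s=5,
)

LARGE_PAYLOAD = Recipe(
    type="HTTPS GET",
    name="large payload test",
    targets=(
        "https://target1.test.me/probe",
        "https://target2.test.me/probe",
        "https://target3.test.me/probe",
    ),
    payload="100KB",
    pulses=3,
    delay_s=10,
)

# Assumed pairing of probabilities to recipes, for illustration only.
WEIGHTED: List[Tuple[float, Recipe]] = [(0.20, REACHABILITY), (0.02, LARGE_PAYLOAD)]


def choose_recipe(
    weighted: List[Tuple[float, Recipe]],
    rng: Callable[[], float] = random.random,
) -> Optional[Recipe]:
    """Return at most one recipe; None means this session runs no probe."""
    if sum(p for p, _ in weighted) > 1.0 + 1e-9:
        raise ValueError("recipe probabilities must sum to at most 1")
    roll = rng()  # uniform in [0, 1)
    cumulative = 0.0
    for probability, recipe in weighted:
        cumulative += probability
        if roll < cumulative:
            return recipe
    return None
```

    Passing the random source as `rng` keeps the selection deterministic in tests, for example `choose_recipe(WEIGHTED, rng=lambda: 0.5)` selects nothing.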
  14. MEASURING NETWORK COMPONENTS: Step 2: Request and Measure

    1: API Call (Client -> Config). 2: Fetch & Measure × pulses against Target 1, Target 2, ..., Target N.
  15. MEASURING NETWORK COMPONENTS: Making Network Calls

    Pulse 1, Pulse 2 and Pulse 3 each have a Start and an End and are separated by a Pulse Pause. Each pulse calls Target 1, Target 2, ..., Target N.
  16. MEASURING NETWORK COMPONENTS: Collected Measurements

    Response Code: OK / FAIL / ... Timings: DNS, TCP, TLS, First Byte, Download, Server.
  17. MEASURING NETWORK COMPONENTS: Step 3: Upload Results

    1: API Call (Client -> Config). 2: Fetch & Measure × pulses against Target 1, Target 2, ..., Target N. 3: Report (Logblob -> LOG).
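    The fetch, measure and report steps could be wired together roughly as below. This is a simplified sketch under several assumptions: the function names and the recipe dictionary keys are hypothetical, targets are requested one after another within a pulse (the slides do not say whether they run in parallel), and only total request time is measured rather than the DNS/TCP/TLS/First Byte/Download breakdown.

```python
import time
import urllib.request
from typing import Callable, Dict, List


def http_get_status(url: str, timeout_s: float = 10.0) -> int:
    """Fetch a URL, read the body, and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()
        return response.status


def run_probe(
    recipe: Dict,
    fetch: Callable[[str], int] = http_get_status,
    report: Callable[[Dict], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Run every pulse against every target, then report the results once."""
    results: List[Dict] = []
    for pulse in range(recipe["pulses"]):
        if pulse > 0:
            sleep(recipe["delay_s"])  # pause between pulses
        for target in recipe["targets"]:
            started = time.monotonic()
            try:
                status = fetch(target)
                ok = 200 <= status < 300
            except Exception:
                # Any network or HTTP error counts as a failed measurement.
                status, ok = None, False
            duration_ms = (time.monotonic() - started) * 1000.0
            results.append(
                {
                    "pulse": pulse + 1,
                    "target": target,
                    "ok": ok,
                    "status": status,
                    "duration_ms": duration_ms,
                }
            )
    report({"name": recipe["name"], "results": results})
    return results
```

    Injecting `fetch`, `report` and `sleep` lets the same loop run against a fake network in tests and against a real logging endpoint in a client.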
  18. MEASURING NETWORK COMPONENTS: Netflix Use Cases

    Use cases shown: Monitoring, Performance, Client Steering, Traffic Modeling. Other labels: iOS, % of all sessions, Stream/Batch.
  19. MEASURING NETWORK COMPONENTS: Probe Stats at Netflix

    6K+ probes per second; 14 recipes; 1K+ devices; 100M+ locations.
  20. MEASURING NETWORK COMPONENTS: Probe Stats at Netflix

    6K+ probes per second; 14 recipes; 1K+ devices; 100M+ locations. $1M+ per year for a vendor solution.
  21. MEASURING NETWORK COMPONENTS: Detecting Network Issues

    How to use Probes to detect and triage network issues. Steps: 1 Measure, 2 Troubleshoot, 3 Remediate, 4 Prevent.
  22. DETECTING NETWORK ISSUES: Reachability Test Setup

    Diagram: ISP, IX, CDN, Cloud. Recipe: type: HTTP GET; name: reachability test; targets: isp.test.me/probe, ix.test.me/probe, cloud.test.me/probe. Results: isp: OK / FAIL; ix: OK / FAIL; cloud: OK / FAIL.
  23. DETECTING NETWORK ISSUES: Can Drill Down to Various Dimensions of User Connectivity

    Diagram: several copies of the ISP / IX / CDN / Cloud view, plus a per-ISP breakdown (ISP1 CDN, ISP2 CDN, ISP3 CDN).
  24. DETECTING NETWORK ISSUES: Beyond HTTP Reachability: Auth DNS Availability

    Diagram: AuthDNS A, AuthDNS B, AuthDNS C, Cloud (1.2.3.4). Records: probe.dnsA.me -> 1.2.3.4; probe.dnsB.me -> 1.2.3.4; probe.dnsC.me -> 1.2.3.4.
  25. DETECTING NETWORK ISSUES: Beyond HTTP Reachability: Auth DNS Availability

    Diagram: AuthDNS A, AuthDNS B, AuthDNS C, Cloud (1.2.3.4). Recipe: type: HTTP GET; name: DNS test; targets: probe.dnsA.me/probe, probe.dnsB.me/probe, probe.dnsC.me/probe. Records: probe.dnsA.me -> 1.2.3.4; probe.dnsB.me -> 1.2.3.4; probe.dnsC.me -> 1.2.3.4. Results: Auth DNS A: OK / FAIL; Auth DNS B: OK / FAIL; Auth DNS C: OK / FAIL.
  26. DETECTING NETWORK ISSUES: Beyond HTTP Reachability: Auth DNS Availability

    Diagram: AuthDNS A, AuthDNS B, AuthDNS C, Cloud (1.2.3.4). Records: probe.dnsA.me -> 1.2.3.4; probe.dnsB.me -> 1.2.3.4; probe.dnsC.me -> 1.2.3.4.
  27. DETECTING NETWORK ISSUES: Testing Cloud Region Connectivity

    Diagram: Cloud us-east, Cloud us-west, Cloud eu-west. Recipe: type: HTTP GET; name: Cloud region test; targets: us-east.test.me/probe, us-west.test.me/probe, eu-west.test.me/probe. Results: Auth DNS A: OK / FAIL; Auth DNS B: OK / FAIL; Auth DNS C: OK / FAIL.
  28. MEASURING NETWORK COMPONENTS: Remediation

    Using Probe results to know which fix would work best. Steps: 1 Measure, 2 Troubleshoot, 3 Remediate, 4 Prevent.
  29. REMEDIATION: Client to AWS Routing Options: Proxy via IX CDN

    Diagram: ISP, IX, Cloud, CDN, CDN, Private Backbone.
  30. REMEDIATION: Client to AWS Routing Options: Proxy via ISP

    Diagram: ISP, IX, Cloud, CDN, CDN, Private Backbone.
  31. REMEDIATION: Client to AWS Routing Options: ISP to IX Chain

    Diagram: ISP, IX, Cloud, CDN, CDN, Private Backbone.
  32. REMEDIATION: 4 Network Paths to Reach the Cloud

    Diagram: ISP, IX, Cloud, CDN, CDN, Private Backbone. Paths: CLOUD, IX-CLOUD, ISP-CLOUD, ISP-IX-CLOUD. Recipe: type: HTTP GET; name: steering test; targets: cloud.test.me/probe, ix-cloud.test.me/probe, isp-cloud.test.me/probe, isp-ix-cloud.test.me/probe.
  33. REMEDIATION: Probe for Reachability

    Diagram: ISP, IX, Cloud, CDN, CDN, Private Backbone. Paths: CLOUD, IX-CLOUD, ISP-CLOUD, ISP-IX-CLOUD. Recipe: type: HTTP GET; name: steering test; targets: cloud.test.me/probe, ix-cloud.test.me/probe, isp-cloud.test.me/probe, isp-ix-cloud.test.me/probe. Results: cloud: OK / FAIL; ix-cloud: OK / FAIL; isp-cloud: OK / FAIL; isp-ix-cloud: OK / FAIL.
  34. REMEDIATION: Remediation for Broken Path

    Results: cloud: FAIL; ix-cloud: OK; isp-cloud: FAIL; isp-ix-cloud: OK. What’s broken? ISP’s connection to AWS. Can we fix it? YES: move traffic via the IX CDN server. Diagram: ISP, IX, Cloud, CDN, CDN, Private Backbone. Paths: CLOUD, IX-CLOUD, ISP-CLOUD, ISP-IX-CLOUD.
  35. REMEDIATION: Remediation for Full Isolation

    Results: cloud: FAIL; ix-cloud: FAIL; isp-cloud: FAIL; isp-ix-cloud: FAIL. What’s broken? ISP outage or client last mile. Can we fix it? NO (we don’t have a routable path). Diagram: ISP, IX, Cloud, CDN, CDN, Private Backbone. Paths: CLOUD, IX-CLOUD, ISP-CLOUD, ISP-IX-CLOUD.
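    The two remediation cases above amount to picking a path that still works out of the four probed ones. A minimal sketch, assuming results arrive as a mapping from path name to OK/FAIL; the preference order in `PATH_PREFERENCE` (direct path first, then proxied paths) and the function name are assumptions for illustration, not something stated on the slides.

```python
from typing import Dict, Optional

# Assumed preference order: direct to cloud first, then the proxied options.
PATH_PREFERENCE = ("cloud", "isp-cloud", "ix-cloud", "isp-ix-cloud")


def pick_working_path(results: Dict[str, bool]) -> Optional[str]:
    """Return the most preferred path whose probe succeeded, or None."""
    for path in PATH_PREFERENCE:
        if results.get(path, False):
            return path
    return None  # full isolation: no routable path from this client


# Slide 34: the ISP's connection to AWS is broken, the IX paths still work.
broken_path = {"cloud": False, "ix-cloud": True, "isp-cloud": False, "isp-ix-cloud": True}

# Slide 35: every path fails, so there is nothing to steer to.
full_isolation = {"cloud": False, "ix-cloud": False, "isp-cloud": False, "isp-ix-cloud": False}
```

    With these inputs, `pick_working_path(broken_path)` selects `ix-cloud`, matching the "move traffic via the IX CDN server" outcome, and `pick_working_path(full_isolation)` returns `None`.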
  36. REMEDIATION: Automatic traffic steering

    Probes => Data Pipeline => Client Steering. Measure different client routing options; choose the fastest/most reliable; use for production steering policy.
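    One way to read "choose the fastest/most reliable" is to rank paths by success rate first and by median latency of successful probes second. The sketch below assumes that ordering and a flat list of `(path, ok, latency_ms)` samples; both are illustrative choices, and a real pipeline would presumably aggregate per ISP, region or device rather than globally.

```python
from statistics import median
from typing import Dict, List, Tuple

Sample = Tuple[str, bool, float]  # (path, ok, latency_ms)


def rank_paths(samples: List[Sample]) -> List[str]:
    """Order paths by success rate (descending), then median latency (ascending)."""
    by_path: Dict[str, List[Tuple[bool, float]]] = {}
    for path, ok, latency_ms in samples:
        by_path.setdefault(path, []).append((ok, latency_ms))

    stats: Dict[str, Tuple[float, float]] = {}
    for path, rows in by_path.items():
        ok_latencies = [latency for ok, latency in rows if ok]
        success_rate = len(ok_latencies) / len(rows)
        # A path with no successful probes sorts last on latency.
        median_ms = median(ok_latencies) if ok_latencies else float("inf")
        stats[path] = (success_rate, median_ms)

    return sorted(stats, key=lambda p: (-stats[p][0], stats[p][1]))
```

    The first element of the returned list would be the candidate for the production steering policy under these assumptions.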
  37. PREVENTION: Prevention

    Using Probe to find issues before they hit production. Steps: 1 Measure, 2 Troubleshoot, 3 Remediate, 4 Prevent.
  38. PREVENTION: Can You Prevent All Self-Inflicted Failures?

    NO, but you can try. Probing improves your chances.
  39. PREVENTION: Testing DNS Changes with Probes

    Diagram: AuthDNS A, Prod Auth DNS, Cloud (1.2.3.4). Records: probe.prodDNS.test -> 1.2.3.4; probe.dnsA.test -> 1.2.3.4. Test DNS change on Probe traffic before applying to PROD.
  40. PREVENTION: Seeing the impact of the IPv6 deployment

    Diagram: Cloud, ipv4 1.2.3.4, ipv6 1:2:3:4::. Recipe: type: HTTP GET; name: ipv6 test; targets: ipv4.test.me/probe, ipv6.test.me/probe. Compare IPv6 to IPv4 on probe traffic: find differences without PROD impact.
  41. PREVENTION: Netflix Example: Provisioning the Backbone

    Diagram: Internet, IX, Site 1, Site 2, Site 3, Site N, Private Backbone, AWS, AWS Cloud. Prod Traffic: RPS, Gbs In, Gbs Out. Want to move client traffic from client -> cloud to client -> IX -> cloud.
  42. PREVENTION: Netflix Example: Provisioning the Backbone

    Diagram: Internet, IX, Site 1, Site 2, Site 3, Site N, Private Backbone, AWS, AWS Cloud. Prod Traffic: RPS, Gbs In, Gbs Out. Q: How to provision the backbone?
  43. PREVENTION: Building Traffic Model with Probes

    Client timeline: App Start, Discovery, Play. Traffic: App startup requests, Playback Requests, Probe.
  44. PREVENTION: Netflix Example: Provisioning the Backbone

    Diagram: Internet, IX, Site 1, Site 2, Site 3, Site N, Private Backbone, IX-Cloud, AWS, AWS Cloud. Recipe: type: HTTP GET; name: IX Steering; targets: policy.ixaws.me/probe. Prod Traffic: RPS, Gbs In, Gbs Out.
  45. PREVENTION: Netflix Example: Provisioning the Backbone

    Probe results: site1: % probes; site2: % probes; site3: % probes; ...; siteN: % probes. Diagram: Internet, IX, Site 1, Site 2, Site 3, Site N, Private Backbone, IX-Cloud, AWS, AWS Cloud. Recipe: type: HTTP GET; name: IX Steering; targets: policy.ixaws.me/probe. Prod Traffic: RPS, Gbs In, Gbs Out.
  46. PREVENTION: Netflix Example: Provisioning the Backbone

    Probe results: site1: % probes; site2: % probes; site3: % probes; ...; siteN: % probes. Diagram: Internet, IX, Site 1, Site 2, Site 3, Site N, Private Backbone, IX-Cloud, AWS, AWS Cloud. Policies: Client-IX Steering Policy, AWS Region Steering Policy. Recipe: type: HTTP GET; name: IX Steering; targets: policy.ixaws.me/probe. Prod Traffic: RPS, Gbs In, Gbs Out.
  47. PREVENTION: From Probes to Traffic Estimates

    Site % probes × PROD Traffic (RPS, Gbs In, Gbs Out) = Site RPS, Site Gbs In, Site Gbs Out.
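    The multiplication on this slide can be written out directly: each site's share of probes scales the production totals. The function name, the metric keys and the numbers in the example are hypothetical and chosen only to make the arithmetic easy to check.

```python
from typing import Dict


def estimate_site_traffic(
    probe_counts: Dict[str, int],
    prod_traffic: Dict[str, float],
) -> Dict[str, Dict[str, float]]:
    """Scale each production metric by the site's share of probes."""
    total_probes = sum(probe_counts.values())
    if total_probes == 0:
        raise ValueError("no probes recorded")
    return {
        site: {
            metric: value * count / total_probes
            for metric, value in prod_traffic.items()
        }
        for site, count in probe_counts.items()
    }


# Hypothetical inputs: probes landing at each IX site, and production totals.
probe_counts = {"site1": 600, "site2": 300, "site3": 100}
prod_traffic = {"rps": 10000.0, "gbs_in": 20.0, "gbs_out": 200.0}
estimates = estimate_site_traffic(probe_counts, prod_traffic)
```

    With these inputs, site1 receives 60% of probes, so its estimate is 60% of each production metric, for example 6000 of the 10000 requests per second.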
  48. PREVENTION: From Probes to Traffic Estimates

    Inputs: Probe Site %; Backbone Topology; Prod: RPS, Gbs In, Gbs Out; Client to IX Site Steering Policy; IX to AWS Region Steering Policy. Input: Probes + Prod Traffic Variations. Objective(s): min latency; min cost; min risk; ... Output: Backbone link -> <traffic>: link1: <gbs>; link2: <gbs>; link3: <gbs>; ...; linkN: <gbs>.
  49. PREVENTION: Summary

    • Leverage your clients • Sophisticated analysis instead of tooling • Probe design is important • Rich insights with basic measurements • Applications beyond monitoring. Steps: 1 Measure, 2 Troubleshoot, 3 Remediate, 4 Prevent.