Own your reliability - Velocity Amsterdam 2016

Adam Surak
November 07, 2016

Who do you trust? What do you control? What are your dependencies? Reliability on the Internet is an adrenaline-fueled adventure, but we all want a good night's sleep and a working service sometimes. Adam Surák takes a closer look at some reliability nightmares and explains how they can be dealt with, sharing the design lessons from his experience running servers in almost 40 data centers across 15 regions, achieving close to 100% availability globally and 100% in the vast majority of regions.

To demonstrate why our design and operations decisions affect us, Adam quickly reviews the basics before exploring in detail the SLAs that we commit to every day yet only vaguely understand. Adam then offers an overview of blackbox monitoring tools, from very simple, low-precision tools testing traffic to very sophisticated, high-precision tools measuring real-user traffic. Although some see cloud solutions as a silver bullet for everything, Adam explains that this is not the case: the cloud has its own issues. Adam concludes with an overview of commonly underestimated dependencies in our software, infrastructure, and people.

Transcript

  1. Build Unique Search Experiences
    Adam Surak
    SRE & Security Engineer
    [email protected]
    @AdamSurak
    Own your reliability

  2. @AdamSurak
    #id
    Algolia since 2014 (team of 8)
    SRE & Security Engineer
    Responsible for infrastructure
    I like to sleep and break things

  3. @AdamSurak

  4. @AdamSurak
    Algolia Today
    15 regions, 50+ datacenters
    2100+ customers in 100+ countries
    40B+ Write operations per month
    17B+ User-generated queries per month

  5. @AdamSurak
    Who owns your availability?
    YOU

  6. @AdamSurak
    Basic principles
    In series: 99% -> 99%
    T = A^2 = 98%
    In parallel: 99% || 99%
    T = 1-(1-A)^2 = 99.99%
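
The two formulas on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes component failures are independent; the function names `serial` and `parallel` are illustrative and do not come from the talk.

```python
def serial(*availabilities):
    """Availability of a chain in which every component must work."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def parallel(*availabilities):
    """Availability of redundant components where any single one suffices."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a  # probability that this component is down too
    return 1.0 - all_down


print(f"serial:   {serial(0.99, 0.99):.4%}")    # 98.0100%
print(f"parallel: {parallel(0.99, 0.99):.4%}")  # 99.9900%
```

Both functions accept any number of components, so a longer chain such as the one on the "Reality" slide can be evaluated the same way.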

  7. @AdamSurak
    Reality
    99.3% 99.6%
    98%
    99.95%
    99.8%
    99.95%
    external service
    no documentation does what?

  8. @AdamSurak
    Ideal state
    99.3% 99.6%
    98%
    99.95%
    98%
    99.2%
    99.8%
    99%
    99.7%
    98%
    99.95%
    99.95%

  9. @AdamSurak
    What is SLA?

  10. @AdamSurak
    What is SLA?
    “A service level agreement (SLA) is a contract between
    a service provider (either internal or external) and the
    end user that defines the level of service expected
    from the service provider.” by Palo Alto Networks
    Mostly uptime
    In advanced environments - response time, error rate

  11. @AdamSurak
    How much does a minute of downtime cost you?

  12. @AdamSurak
    Common SLA levels
    SLA         Downtime per month       Cost per month ($95/min, $50M/year)   Cost per year
    99%         7 hours 18 minutes       $41,610                               $500,000
    99.9%       43 minutes               $4,085                                $49,000
    99.95%      21 minutes               $1,995                                $24,000
    99.99%      4 minutes                $380                                  $4,560
    99.999%     26 seconds               $41                                   $492
    99.9999%    2.6 seconds              $4                                    $48
    100%        Marketing level
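
The arithmetic behind this table can be sketched as follows. The $95 per minute figure is the slide's; the 43,800-minute average month (365 days / 12) is an assumption chosen because it reproduces the slide's 7 hours 18 minutes for 99%, and the function names are illustrative.

```python
MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # 43,800 minutes in an average month (assumption)


def downtime_minutes_per_month(sla_percent):
    """Downtime allowed per month by an SLA given in percent, e.g. 99.9."""
    return (100.0 - sla_percent) / 100.0 * MINUTES_PER_MONTH


def downtime_cost_per_month(sla_percent, cost_per_minute=95.0):
    """Monthly cost of using up the whole downtime budget."""
    return downtime_minutes_per_month(sla_percent) * cost_per_minute


for sla in (99.0, 99.9, 99.95, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_month(sla)
    cost = downtime_cost_per_month(sla)
    print(f"{sla:>8}%  {minutes:9.3f} min  ${cost:>10,.2f}")
```

For 99% this gives 438 minutes and $41,610, matching the slide. The other rows come out slightly higher than the slide's figures, which appear to truncate the downtime to whole minutes before multiplying.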

  13. @AdamSurak
    SLA tricks
    “100% uptime, 5% refund after each 0.05%”
    ➡ 99.95%
    “99.9% SLA, downtime counts if backend responds with error during 2
    consecutive 90s intervals”
    ➡ 99.8%
    “…downtime counts from the moment of customer’s report”
    ➡ :facepalm:

  14. @AdamSurak
    Where is the SLA monitored from?

  15. @AdamSurak
    Independent monitoring
    Pingdom
    ServerDensity
    ThousandEyes
    CatchPoint
    TurboBytes Pulse
    Kentik
    Custom
    DNS - 1.6k, API Status - 3.3k, Latency - 73k

  16. @AdamSurak
    Who runs a server?
    Assuming the rest is #serverless with abstracted issues

  17. @AdamSurak
    Can you restart any server anytime?

  18. @AdamSurak
    Can you restart any rack anytime?

  19. @AdamSurak
    Can you restart any datacenter anytime?

  20. @AdamSurak
    Underestimated dependencies
    Power/network segments
    • are two adjacent racks really independent?
    • how protected is the network? Rogue DHCP? IP hijack?
    • can you choose rack with your provider?
    • what happens if you order 3 servers at once?
    A/C
    • influences a set of racks
    Network cables and interfaces are always broken
    • is your 1Gbit interface really in 1Gbit mode?
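
One way to answer the last question on a Linux host is to read the negotiated link speed from sysfs. This is a minimal sketch, assuming a Linux kernel that exposes `/sys/class/net/<interface>/speed` in Mbit/s; the function names and the `eth0` interface name are illustrative.

```python
from pathlib import Path


def link_speed_mbits(interface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mbit/s, or None if it is unknown."""
    try:
        raw = (Path(sysfs_root) / interface / "speed").read_text().strip()
        value = int(raw)
    except (OSError, ValueError):
        # Missing interface, link down, or a driver that does not report speed.
        return None
    return value if value > 0 else None  # the kernel reports -1 for "unknown"


def is_degraded(interface, expected_mbits=1000, sysfs_root="/sys/class/net"):
    """True when the link negotiated a lower speed than expected."""
    speed = link_speed_mbits(interface, sysfs_root)
    return speed is not None and speed < expected_mbits


if __name__ == "__main__":
    print("eth0 speed:", link_speed_mbits("eth0"), "Mbit/s")
```

A check like this could run from a monitoring agent, so that a 1Gbit port that silently negotiated 100 Mbit/s raises an alert instead of surfacing as unexplained latency.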

  21. @AdamSurak
    Network maintenance issues
    unplanned
    planned
    • but they forget to tell you
    failing
    • planned, they told you, but you have downtime

  22. – Cloud user
    “But I run on cloud!”

  23. @AdamSurak
    Clouds are not error-proof
    AWS has outages
    • route leak, broken DynamoDB
    GCP has outages
    • sequence of lightning strikes, global network issue
    Azure has outages
    • global multi-hour outage
    Salesforce has outages
    • EU2 read-only, NA14 outage
    …you name it
    you can deploy multi-cloud! => APIs!

  24. @AdamSurak
    Network related issues
    AWS EU-West-1 broke Direct Connect with OVH
    ISP received 0.0.0.0/0 from a new peer => 75% traffic lost
    Malaysia Telecom announced AWS prefixes => US-East-1 unavailable
    ISP of CloudFlare misconfigured router and started to receive all CloudFlare’s worldwide traffic in Doha, Qatar
    TCP proxy becomes your best friend

  25. @AdamSurak
    Servers in San Jose
    Customer in Oregon
    • AWS US-West 2
    21 ms average latency
    Boardman, OR
    Seattle, WA
    San Jose, CA

  26. @AdamSurak
    Boardman, OR
    Seattle, WA
    San Jose, CA
    from 21 ms to 150-300ms
    ?

  27. @AdamSurak
    Boardman, OR
    Seattle, WA
    San Jose, CA
    HOST: ec2-54-186-253-146           Loss%  Snt   Last    Avg   Best   Wrst  StDev
     4. 100.64.16.131                   0.0%   10    0.4    0.6    0.4    1.3    0.3
     5. 205.251.232.62                  0.0%   10    1.3    7.7    0.8   43.9   14.1
     8. 205.251.225.199                 0.0%   10    8.5    8.1    7.1    9.6    0.9
     9. sea-b1-link.telia.net           0.0%   10    7.9    8.1    7.7    9.9    0.6
    10. den-b1-link.telia.net          20.0%   10  109.1  109.4  108.6  110.0    0.5
    11. sjo-b21-link.telia.net         10.0%   10  137.8  137.5  136.7  138.0    0.4
    12. leaseweb-ic-302376-sjo-b21.c   10.0%   10  136.4  137.0  136.4  137.8    0.6
    13. 209.58.135.206                 10.0%   10   47.3   47.5   47.2   47.9    0.3
    14. 209.58.135.201                 44.4%    9   47.3   47.1   46.8   47.3    0.2
    15. 209.58.131.95                  22.2%    9   47.0   47.0   46.8   47.2    0.2
    Denver, CO
    new route via Denver
    20% packet loss on Seattle-Denver
    issue out of AWS network
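
Spotting the suspicious hop in a report like this can be partly automated. The sketch below is only a heuristic: it assumes the plain-text column layout shown on this slide (hop, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev), uses a few hops copied from the slide as sample data, and the names `parse_hops` and `biggest_latency_jump` are illustrative.

```python
import re

# One report line: hop number, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev.
HOP_RE = re.compile(
    r"^\s*(\d+)\.\s+(\S+)\s+([\d.]+)%\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)

REPORT = """\
 8. 205.251.225.199 0.0% 10 8.5 8.1 7.1 9.6 0.9
 9. sea-b1-link.telia.net 0.0% 10 7.9 8.1 7.7 9.9 0.6
10. den-b1-link.telia.net 20.0% 10 109.1 109.4 108.6 110.0 0.5
11. sjo-b21-link.telia.net 10.0% 10 137.8 137.5 136.7 138.0 0.4
"""


def parse_hops(report):
    """Extract hop number, host, loss percentage and average latency per line."""
    hops = []
    for line in report.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append({
                "hop": int(match[1]),
                "host": match[2],
                "loss": float(match[3]),
                "avg": float(match[6]),
            })
    return hops


def biggest_latency_jump(hops):
    """Return (previous, current, delta_ms) for the largest rise in average latency."""
    best = None
    for prev, cur in zip(hops, hops[1:]):
        delta = cur["avg"] - prev["avg"]
        if best is None or delta > best[2]:
            best = (prev, cur, delta)
    return best


prev, cur, delta = biggest_latency_jump(parse_hops(REPORT))
print(f"{prev['host']} -> {cur['host']}: +{delta:.1f} ms, {cur['loss']}% loss")
```

On the sample lines this points at the Seattle to Denver hop, the same segment the slide blames. Per-hop figures should still be read with care, since routers may deprioritize the probe replies they generate themselves.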

  28. –Networking 101
    “What happens when you type www.algolia.com into your browser and hit enter?”

  29. @AdamSurak
    DNS
    No DNS -> no new connections
    Packet loss prone
    Latency of DNS is very tricky
    • language and OS dependent timeouts
    DNS providers are popular DDoS targets
    Having two DNS providers is perfectly doable -> APIs!
    • amazon.com doesn’t use Route53 and has two other providers
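
Because resolver timeouts depend on the language and OS, one defensive option is to put an explicit deadline around the lookup. The following is a minimal sketch rather than a recommended production resolver: `resolve_with_deadline` is a hypothetical helper, and the injectable `resolver` argument exists only to make the behaviour testable.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def resolve_with_deadline(hostname, deadline_s=1.0, resolver=socket.gethostbyname):
    """Resolve hostname, returning None if it fails or exceeds deadline_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, hostname)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        return None  # the lookup is still running; we simply stop waiting for it
    except OSError:
        return None  # resolution failed, e.g. socket.gaierror
    finally:
        # Do not block on the worker thread; it ends when the OS resolver returns.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(resolve_with_deadline("www.algolia.com", deadline_s=0.5))
```

The caller gets an answer or `None` within a known time budget and can then fall back to a cached address or a second hostname, instead of inheriting whatever timeout the platform resolver happens to use.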

  30. @AdamSurak
    Software design
    TCP checksum is not 100% safe
    DNS resolution is not 100% reliable
    HTTP calls don’t always succeed or return 200
    • what is the default timeout of your HTTP client?
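
As one concrete case of the timeout question: Python's `urllib.request.urlopen` waits without a time limit unless a timeout is passed or a global socket default has been set. The sketch below sets an explicit timeout and treats non-200 responses and network errors as ordinary outcomes; `fetch_status` is a hypothetical helper and the URL is only an example.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code, or None if no response arrived in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with a 2xx status
    except OSError:
        # Covers URLError, refused connections, DNS failures and timeouts.
        return None


if __name__ == "__main__":
    status = fetch_status("https://www.algolia.com/", timeout=2.0)
    print("no response" if status is None else f"HTTP {status}")
```

Returning `None` instead of raising forces the caller to decide what happens when the dependency is unavailable, which is the design question this slide raises.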

  31. @AdamSurak
    Software operations
    Package repositories can get broken or out-of-sync
    Can you deploy when GitHub is down?
    Does your CI work when the Docker signature is broken?
    Invest in introducing mistakes!
    iptables -A INPUT -p udp --dport 33434:33523 -j REJECT

  32. @AdamSurak
    People
    Who holds the knowledge about the system?
    Do people know what to do?
    How do you escalate?
    Send people on vacation!

  33. –Sidney Dekker
    “Everything that can break will work and then we will
    make wrong assumptions about the reliability.”

  34. We are hiring in Paris and SF
    THANK YOU!
    Build Unique Search Experiences
    [email protected]
    @AdamSurak
