Own your reliability - Velocity Amsterdam 2016

Adam Surak
November 07, 2016

Who do you trust? What do you control? What are your dependencies? Reliability on the Internet is an adrenaline-fueled adventure, but we all want a good night's sleep and a working service sometimes. Adam Surák takes a closer look at some reliability nightmares and explains how to deal with them, sharing design lessons learned from running servers in almost 40 data centers across 15 regions, achieving close to 100% availability globally and 100% in the vast majority of the regions.

To demonstrate how our design and operations decisions affect us, Adam quickly reviews the basics before exploring in detail the SLAs that we commit to every day yet understand only vaguely. Adam then offers an overview of blackbox monitoring tools, from very simple, low-precision tools testing traffic to very sophisticated, high-precision tools measuring real-user traffic. Although some see cloud solutions as a silver bullet, Adam explains that that's not the case: the cloud has its own issues. Adam concludes with an overview of commonly underestimated dependencies in our software, infrastructure, and people.

Transcript

  1. Build Unique Search Experiences Adam Surak SRE & Security Engineer

    adam.surak@algolia.com @AdamSurak Own your reliability
  2. @AdamSurak #id

    Algolia since 2014 (team of 8)
    SRE & Security Engineer
    Responsible for infrastructure
    I like to sleep and break things
  3. @AdamSurak

  4. @AdamSurak Algolia Today

    15 regions, 50+ datacenters
    2100+ customers in 100+ countries
    40B+ Write operations per month
    17B+ User-generated queries per month
  5. @AdamSurak Who owns your availability? YOU

  6. @AdamSurak Basic principles

    Two 99% components in series: T = A^2 ≈ 98%
    Two 99% components in parallel: T = 1-(1-A)^2 = 99.99%
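The two formulas on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes component failures are independent (an assumption the slide does not state); the function names are illustrative and not from the talk.

```python
from math import prod


def serial_availability(availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """The system is down only when every redundant component is down."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Two 99% components, as on the slide.
    print(f"serial:   {serial_availability([0.99, 0.99]):.2%}")
    print(f"parallel: {parallel_availability([0.99, 0.99]):.2%}")
```

With two 99% components the serial case gives 98.01% (the slide rounds to 98%) and the parallel case gives 99.99%.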
  7. @AdamSurak Reality 99.3% 99.6% 98% 99.95% 99.8% 99.95% external service

    no documentation does what?
  8. @AdamSurak Ideal state 99.3% 99.6% 98% 99.95% 98%

    99.2% 99.8% 99% 99.7% 98% 99.95% 99.95%
  9. @AdamSurak What is SLA?

  10. @AdamSurak What is SLA?

    “A service level agreement (SLA) is a contract between a service provider (either internal or external) and the end user that defines the level of service expected from the service provider.” by Palo Alto Networks
    Mostly uptime
    In advanced environments - response time, error rate
  11. @AdamSurak How much does a minute of downtime cost you?

  12. @AdamSurak Common SLA levels

    SLA      | Downtime per month     | Cost ($95/min, $50M/year) | Cost per year
    99%      | 7 hours and 18 minutes | $41 610                   | $500 000
    99.9%    | 43 minutes             | $4 085                    | $49 000
    99.95%   | 21 minutes             | $1 995                    | $24 000
    99.99%   | 4 minutes              | $380                      | $4 560
    99.999%  | 26 seconds             | $41                       | $492
    99.9999% | 2.6 seconds            | $4                        | $48
    100%     | Marketing level        |                           |
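The downtime and cost figures on this slide follow from simple arithmetic. Below is a minimal sketch, assuming an average month of 43,800 minutes (365 days / 12) and the slide's $95 per minute; the function names are illustrative.

```python
# Average month length in minutes: 365 days * 24 h * 60 min / 12 months.
# Using an average month is an assumption; the slide does not say how it counts.
MINUTES_PER_MONTH = 365 * 24 * 60 / 12


def downtime_minutes_per_month(sla_percent):
    """Downtime budget per month allowed by an uptime SLA given in percent."""
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)


def downtime_cost_per_month(sla_percent, cost_per_minute=95):
    """Cost of using the whole monthly downtime budget."""
    return downtime_minutes_per_month(sla_percent) * cost_per_minute


if __name__ == "__main__":
    for sla in (99, 99.9, 99.95, 99.99, 99.999, 99.9999):
        minutes = downtime_minutes_per_month(sla)
        cost = downtime_cost_per_month(sla)
        print(f"{sla}%: {minutes:.2f} min/month, ${cost:,.0f}/month")
```

For 99% this gives 438 minutes (7 hours and 18 minutes) and $41,610, matching the slide. For the tighter SLAs the slide appears to round the minutes before multiplying (43 × $95 = $4,085), so the computed values differ slightly.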
  13. @AdamSurak SLA tricks

    “100% uptime, 5% refund after each 0.05%” ➡ 99.95%
    “99.9% SLA, downtime counts if backend responds with error during 2 consecutive 90s intervals” ➡ 99.8%
    “…downtime counts from the moment of customer’s report” ➡ :facepalm:
  14. @AdamSurak Where is the SLA monitored from?

  15. @AdamSurak Independent monitoring Pingdom ServerDensity ThousandEyes CatchPoint TurboBytes Pulse Kentik

    Custom DNS - 1.6k, API Status - 3.3k, Latency - 73k
  16. @AdamSurak Who runs a server? Assuming the rest is #serverless

    with abstracted issues
  17. @AdamSurak Can you restart any server anytime?

  18. @AdamSurak Can you restart any rack anytime?

  19. @AdamSurak Can you restart any datacenter anytime?

  20. @AdamSurak Underestimated dependencies

    Power/network segments
    • are two adjacent racks really independent?
    • how protected is the network? Rogue DHCP? IP hijack?
    • can you choose rack with your provider?
    • what happens if you order 3 servers at once?
    A/C
    • influences a set of racks
    Network cables and interfaces are always broken
    • is your 1Gbit interface really in 1Gbit mode?
  21. @AdamSurak Network maintenance issues

    unplanned
    planned
    • but they forget to tell you
    failing
    • planned, they told you, but you have downtime
  22. – Cloud user “But I run on cloud!”

  23. @AdamSurak Clouds are not error-proof

    AWS has outages
    • route leak, broken DynamoDB
    GCP has outages
    • sequence of lightning strikes, global network issue
    Azure has outages
    • global multi-hour outage
    Salesforce has outages
    • EU2 read-only, NA14 outage
    …you name it
    you can deploy multi-cloud! => APIs!
  24. @AdamSurak Network related issues

    AWS EU-West-1 broke Direct Connect with OVH
    ISP received 0.0.0.0/0 from a new peer => 75% traffic lost
    Malaysia Telecom announced AWS prefixes => US-East-1 unavailable
    ISP of CloudFlare misconfigured router and started to receive all CloudFlare’s worldwide traffic in Doha, Qatar
    TCP proxy becomes your best friend
  25. @AdamSurak Servers in San Jose

    Customer in Oregon
    • AWS US-West 2
    21 ms average latency
    [Map: Boardman, OR; Seattle, WA; San Jose, CA]
  26. @AdamSurak from 21 ms to 150-300ms ?

    [Map: Boardman, OR; Seattle, WA; San Jose, CA]
  27. @AdamSurak [Map: Boardman, OR; Seattle, WA; Denver, CO; San Jose, CA]

    HOST: ec2-54-186-253-146           Loss%  Snt   Last    Avg   Best   Wrst  StDev
     4. 100.64.16.131                   0.0%   10    0.4    0.6    0.4    1.3    0.3
     5. 205.251.232.62                  0.0%   10    1.3    7.7    0.8   43.9   14.1
     8. 205.251.225.199                 0.0%   10    8.5    8.1    7.1    9.6    0.9
     9. sea-b1-link.telia.net           0.0%   10    7.9    8.1    7.7    9.9    0.6
    10. den-b1-link.telia.net          20.0%   10  109.1  109.4  108.6  110.0    0.5
    11. sjo-b21-link.telia.net         10.0%   10  137.8  137.5  136.7  138.0    0.4
    12. leaseweb-ic-302376-sjo-b21.c   10.0%   10  136.4  137.0  136.4  137.8    0.6
    13. 209.58.135.206                 10.0%   10   47.3   47.5   47.2   47.9    0.3
    14. 209.58.135.201                 44.4%    9   47.3   47.1   46.8   47.3    0.2
    15. 209.58.131.95                  22.2%    9   47.0   47.0   46.8   47.2    0.2

    new route via Denver
    20% packet loss on Seattle-Denver
    issue out of AWS network
  28. –Networking 101 “What happens when you type www.algolia.com into your browser and hit Enter?”
  29. @AdamSurak DNS

    No DNS -> no new connections
    Packet loss prone
    Latency of DNS is very tricky
    • language and OS dependent timeouts
    DNS providers are popular DDoS targets
    Having two DNS providers is perfectly doable -> APIs!
    • amazon.com doesn’t use Route53 and has two other providers
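As an illustration of the "language and OS dependent timeouts" point: Python's `socket.getaddrinfo` takes no timeout argument, so how long a lookup can block depends on the OS resolver configuration. One way to bound it from the application side is to run the lookup in a worker thread and stop waiting after a deadline. This is a minimal sketch; `resolve_with_deadline` and its default deadline are illustrative names and values, not something from the talk.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def resolve_with_deadline(hostname, deadline_s=1.0):
    """Return the sorted IP addresses for hostname, or None on failure.

    The lookup runs in a worker thread so the caller waits at most
    deadline_s seconds. The worker itself may keep running until the
    OS resolver gives up; only the caller's wait is bounded.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(socket.getaddrinfo, hostname, None)
    try:
        infos = future.result(timeout=deadline_s)
    except (FutureTimeout, socket.gaierror):
        # Deadline exceeded or the name could not be resolved.
        return None
    finally:
        # Do not block the caller on a lookup that is still in flight.
        executor.shutdown(wait=False)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolve_with_deadline("www.algolia.com", deadline_s=0.5))
```

Returning `None` for both a timeout and a resolution error keeps the sketch short; a real client would likely distinguish the two so that it can retry or fall back differently.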
  30. @AdamSurak Software design

    TCP checksum is not 100% safe
    DNS resolving is not 100% working
    HTTP calls don’t always succeed or return 200
    • what is the default timeout of your HTTP client?
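To make the timeout question concrete: in Python's standard library, `urllib.request.urlopen` falls back to the global socket default when no `timeout` is passed, and that default is `None` (block indefinitely) unless the program changes it. The sketch below sets an explicit timeout and treats non-200 responses and network failures as expected outcomes. `fetch_status` and the 2-second default are illustrative choices, not something from the talk.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout_s=2.0):
    """Return the HTTP status code, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except OSError:
        # Network-level failure: DNS error, refused connection, timeout.
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        return None


if __name__ == "__main__":
    status = fetch_status("https://www.algolia.com", timeout_s=2.0)
    print("no response" if status is None else f"HTTP {status}")
```

The `HTTPError` handler must come before the `OSError` handler because `HTTPError` is itself a subclass of `URLError`, which derives from `OSError`.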
  31. @AdamSurak Software operations

    Package repositories can get broken or out-of-sync
    Can you deploy when GitHub is down?
    Does your CI work when the Docker signature is broken?
    Invest in introducing mistakes!
    iptables -A INPUT -p udp --dport 33434:33523 -j REJECT
  32. @AdamSurak People

    Who holds the knowledge about the system?
    Do people know what to do?
    How do you escalate?
    Send people on vacation!
  33. –Sidney Dekker “Everything that can break will work and then

    we will make wrong assumptions about the reliability.”
  34. We are hiring in Paris and SF THANK YOU!

    Build Unique Search Experiences adam@algolia.com @AdamSurak