Slide 1

Slide 1 text

Build Unique Search Experiences Adam Surak SRE & Security Engineer [email protected] @AdamSurak Own your reliability

Slide 2

Slide 2 text

#id
Algolia since 2014 (team of 8)
SRE & Security Engineer
Responsible for infrastructure
I like to sleep and break things

Slide 3

Slide 3 text

@AdamSurak

Slide 4

Slide 4 text

Algolia Today
15 regions, 50+ datacenters
2100+ customers in 100+ countries
40B+ write operations per month
17B+ user-generated queries per month

Slide 5

Slide 5 text

@AdamSurak Who owns your availability? YOU

Slide 6

Slide 6 text

Basic principles
Serial dependency: two components at 99% each give 98% in total, T = A^2
Parallel redundancy: two components at 99% each give 99.99% in total, T = 1-(1-A)^2
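
The two formulas on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes component failures are independent, which real systems rarely guarantee; the function names `serial` and `parallel` are illustrative and not from the talk.

```python
from math import prod

def serial(*availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)

def parallel(*availabilities):
    """Availability of redundant components where one being up is enough."""
    return 1 - prod(1 - a for a in availabilities)

# The slide's example: two components at 99% each.
print(f"serial:   {serial(0.99, 0.99):.2%}")
print(f"parallel: {parallel(0.99, 0.99):.2%}")
```

Both functions accept any number of components, so the same sketch works for longer dependency chains such as the ones on the following slides.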

Slide 7

Slide 7 text

Reality
[Diagram: components annotated with availabilities 99.3%, 99.6%, 98%, 99.95%, 99.8%, 99.95%; other dependencies labeled "external service", "no documentation", "does what?"]

Slide 8

Slide 8 text

Ideal state
[Diagram: every component annotated with its availability: 99.3%, 99.6%, 98%, 99.95%, 98%, 99.2%, 99.8%, 99%, 99.7%, 98%, 99.95%, 99.95%]

Slide 9

Slide 9 text

What is an SLA?

Slide 10

Slide 10 text

What is an SLA?
“A service level agreement (SLA) is a contract between a service provider (either internal or external) and the end user that defines the level of service expected from the service provider.” by Palo Alto Networks
Mostly uptime
In advanced environments: response time, error rate

Slide 11

Slide 11 text

How much does a minute of downtime cost you?

Slide 12

Slide 12 text

Common SLA levels
SLA        Downtime per month      Cost per month ($95/min, $50M/year)  Cost per year
99%        7 hours and 18 minutes  $41 610                              $500 000
99.9%      43 minutes              $4 085                               $49 000
99.95%     21 minutes              $1 995                               $24 000
99.99%     4 minutes               $380                                 $4 560
99.999%    26 seconds              $41                                  $492
99.9999%   2.6 seconds             $4                                   $48
100%       Marketing level
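
The figures for this slide can be reproduced roughly as follows. This is a sketch that assumes an average month of 43 800 minutes and the slide's example rate of $95 per minute; the function names are illustrative.

```python
MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # 43 800, an average month
COST_PER_MINUTE = 95                    # dollars, the slide's example rate

def downtime_minutes_per_month(sla_percent):
    """Minutes of downtime per month that an SLA level still allows."""
    return (1 - sla_percent / 100) * MINUTES_PER_MONTH

def downtime_cost_per_month(sla_percent, cost_per_minute=COST_PER_MINUTE):
    """Cost of using up the whole monthly downtime allowance."""
    return downtime_minutes_per_month(sla_percent) * cost_per_minute

for sla in (99, 99.9, 99.95, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_month(sla)
    cost = downtime_cost_per_month(sla)
    print(f"{sla}%: {minutes:.2f} min/month, ${cost:,.0f}/month, ${cost * 12:,.0f}/year")
```

The slide appears to round the downtime to whole minutes before multiplying, so the computed costs differ slightly from the slide for some rows (for example 43.8 minutes instead of 43 at 99.9%).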

Slide 13

Slide 13 text

SLA tricks
“100% uptime, 5% refund after each 0.05%” ➡ 99.95%
“99.9% SLA, downtime counts if backend responds with error during 2 consecutive 90s intervals” ➡ 99.8%
“…downtime counts from the moment of customer’s report” ➡ :facepalm:

Slide 14

Slide 14 text

@AdamSurak Where is the SLA monitored from?

Slide 15

Slide 15 text

Independent monitoring
Pingdom
ServerDensity
ThousandEyes
CatchPoint
TurboBytes Pulse
Kentik
Custom DNS - 1.6k, API Status - 3.3k, Latency - 73k

Slide 16

Slide 16 text

@AdamSurak Who runs a server? Assuming the rest is #serverless with abstracted issues

Slide 17

Slide 17 text

@AdamSurak Can you restart any server anytime?

Slide 18

Slide 18 text

@AdamSurak Can you restart any rack anytime?

Slide 19

Slide 19 text

@AdamSurak Can you restart any datacenter anytime?

Slide 20

Slide 20 text

Underestimated dependencies
Power/network segments
• are two adjacent racks really independent?
• how protected is the network? Rogue DHCP? IP hijack?
• can you choose the rack with your provider?
• what happens if you order 3 servers at once?
A/C
• influences a set of racks
Network cables and interfaces are always broken
• is your 1Gbit interface really in 1Gbit mode?
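
The last question on this slide can be checked automatically. On Linux the kernel exposes the negotiated link speed in Mbit/s under `/sys/class/net/<interface>/speed`; the sketch below reads it and flags links that negotiated less than expected. The function names and the `sysfs_root` parameter are illustrative assumptions, and the interface name `eth0` in the usage line is only an example.

```python
from pathlib import Path

def link_speed_mbps(interface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mbit/s, or None if it is unknown."""
    try:
        raw = (Path(sysfs_root) / interface / "speed").read_text().strip()
        speed = int(raw)
    except (OSError, ValueError):
        return None  # interface missing, link down, or value unreadable
    return speed if speed > 0 else None  # the kernel reports -1 when unknown

def is_degraded(interface, expected_mbps, sysfs_root="/sys/class/net"):
    """True when the link is up but negotiated less than expected_mbps."""
    speed = link_speed_mbps(interface, sysfs_root)
    return speed is not None and speed < expected_mbps

if __name__ == "__main__":
    print("eth0 degraded:", is_degraded("eth0", expected_mbps=1000))
```

Running a check like this from monitoring on every host catches a 1Gbit port that silently fell back to 100Mbit after a cable or switch problem.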

Slide 21

Slide 21 text

Network maintenance issues
unplanned
planned
• but they forget to tell you
failing
• planned, they told you, but you have downtime

Slide 22

Slide 22 text

– Cloud user “But I run on cloud!”

Slide 23

Slide 23 text

Clouds are not error-proof
AWS has outages
• route leak, broken DynamoDB
GCP has outages
• sequence of lightning strikes, global network issue
Azure has outages
• global multi-hour outage
Salesforce has outages
• EU2 read-only, NA14 outage
…you name it
You can deploy multi-cloud! => APIs!

Slide 24

Slide 24 text

Network related issues
AWS EU-West-1 broke Direct Connect with OVH
ISP received 0.0.0.0/0 from a new peer => 75% traffic lost
Malaysia Telecom announced AWS prefixes => US-East-1 unavailable
ISP of CloudFlare misconfigured a router and started to receive all of CloudFlare’s worldwide traffic in Doha, Qatar
TCP proxy becomes your best friend

Slide 25

Slide 25 text

Servers in San Jose
Customer in Oregon
• AWS US-West 2
21 ms average latency
[Map: Boardman, OR; Seattle, WA; San Jose, CA]

Slide 26

Slide 26 text

From 21 ms to 150-300 ms?
[Map: Boardman, OR; Seattle, WA; San Jose, CA]

Slide 27

Slide 27 text

[Map: Boardman, OR; Seattle, WA; Denver, CO; San Jose, CA]
HOST: ec2-54-186-253-146            Loss%   Snt   Last    Avg   Best   Wrst  StDev
  4. 100.64.16.131                   0.0%    10    0.4    0.6    0.4    1.3    0.3
  5. 205.251.232.62                  0.0%    10    1.3    7.7    0.8   43.9   14.1
  8. 205.251.225.199                 0.0%    10    8.5    8.1    7.1    9.6    0.9
  9. sea-b1-link.telia.net           0.0%    10    7.9    8.1    7.7    9.9    0.6
 10. den-b1-link.telia.net          20.0%    10  109.1  109.4  108.6  110.0    0.5
 11. sjo-b21-link.telia.net         10.0%    10  137.8  137.5  136.7  138.0    0.4
 12. leaseweb-ic-302376-sjo-b21.c   10.0%    10  136.4  137.0  136.4  137.8    0.6
 13. 209.58.135.206                 10.0%    10   47.3   47.5   47.2   47.9    0.3
 14. 209.58.135.201                 44.4%     9   47.3   47.1   46.8   47.3    0.2
 15. 209.58.131.95                  22.2%     9   47.0   47.0   46.8   47.2    0.2
New route via Denver
20% packet loss on Seattle-Denver
Issue is outside the AWS network
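
One way to spot the problematic link in a report like this is to look for the largest rise in average latency between consecutive hops. The sketch below uses the hop numbers, hosts, and Avg values from this slide's report; the function name `largest_jump` is illustrative. Intermediate routers may answer probes slowly or drop them, so the result is a hint for where to look, not proof.

```python
# (hop, host, average latency in ms), taken from the report on this slide
HOPS = [
    (4, "100.64.16.131", 0.6),
    (5, "205.251.232.62", 7.7),
    (8, "205.251.225.199", 8.1),
    (9, "sea-b1-link.telia.net", 8.1),
    (10, "den-b1-link.telia.net", 109.4),
    (11, "sjo-b21-link.telia.net", 137.5),
    (12, "leaseweb-ic-302376-sjo-b21.c", 137.0),
    (13, "209.58.135.206", 47.5),
    (14, "209.58.135.201", 47.1),
    (15, "209.58.131.95", 47.0),
]

def largest_jump(hops):
    """Return (from_host, to_host, increase_ms) for the biggest latency rise."""
    best = None
    for (_, prev_host, prev_avg), (_, host, avg) in zip(hops, hops[1:]):
        delta = avg - prev_avg
        if best is None or delta > best[2]:
            best = (prev_host, host, delta)
    return best

src, dst, delta = largest_jump(HOPS)
print(f"largest increase: {src} -> {dst} (+{delta:.1f} ms)")
```

For this report the largest rise is between the Seattle and Denver hops, which matches the slide's note about the new route via Denver.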

Slide 28

Slide 28 text

– Networking 101 “What happens when you type www.algolia.com into your browser and hit Enter?”

Slide 29

Slide 29 text

DNS
No DNS -> no new connections
Packet loss prone
Latency of DNS is very tricky
• language and OS dependent timeouts
DNS providers are popular DDoS targets
Having two DNS providers is perfectly doable -> APIs!
• amazon.com doesn’t use Route53 and has two other providers
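
Because resolver timeouts depend on the language and the OS, an application can enforce its own deadline instead. The sketch below wraps Python's `socket.getaddrinfo` in a worker thread and gives up after a fixed time. The function name, the 2-second default, and the injectable `resolver` parameter are assumptions for illustration, not something from the talk.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def resolve_with_deadline(host, deadline_s=2.0, resolver=socket.getaddrinfo):
    """Resolve host, giving up after deadline_s seconds instead of the OS default.

    Returns the resolver's result, or None when the deadline passes.
    Resolver errors such as socket.gaierror propagate to the caller.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, host, None)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        return None  # caller decides: retry, use a cached address, or fail fast
    finally:
        pool.shutdown(wait=False)  # do not block on a resolver that is still stuck

if __name__ == "__main__":
    print(resolve_with_deadline("www.algolia.com"))
```

The stuck lookup keeps running in its background thread until the OS gives up, so this bounds the caller's waiting time, not the lookup itself.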

Slide 30

Slide 30 text

Software design
TCP checksum is not 100% safe
DNS resolution does not work 100% of the time
HTTP calls don’t always succeed or return 200
• what is the default timeout of your HTTP client?
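
As an example of the timeout question: Python's `urllib.request.urlopen` waits indefinitely by default unless a global socket timeout has been set. The sketch below always passes an explicit timeout and retries on errors and non-200 responses. The function name `fetch`, the default values, and the injectable `opener` parameter are illustrative assumptions.

```python
import time
import urllib.request

def fetch(url, timeout_s=3.0, attempts=3, backoff_s=0.5, opener=urllib.request.urlopen):
    """GET url with an explicit timeout; retry on errors and non-200 responses."""
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout_s) as response:
                if response.status == 200:
                    return response.read()
                last_error = RuntimeError(f"unexpected status {response.status}")
        except OSError as exc:  # covers URLError, HTTPError, timeouts, resets
            last_error = exc
        if attempt < attempts - 1:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"{url} failed after {attempts} attempts") from last_error
```

Retrying is only safe for idempotent requests, so a sketch like this fits reads better than writes.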

Slide 31

Slide 31 text

Software operations
Package repositories can get broken or out-of-sync
Can you deploy when GitHub is down?
Does your CI work when the Docker signature is broken?
Invest in introducing mistakes!
iptables -A INPUT -p udp --dport 33434:33523 -j REJECT
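
The iptables rule on this slide introduces a mistake at the network level: it rejects incoming UDP on ports 33434 to 33523, a range that starts at the default base port of UDP traceroute. The same idea can be rehearsed in application code. The sketch below is a hypothetical decorator, not something from the talk, that makes a configurable fraction of calls fail so that callers' error handling gets exercised.

```python
import functools
import random

def flaky(failure_rate, rng=None, error=ConnectionError):
    """Decorator that makes a fraction of calls raise, to rehearse failures."""
    rng = rng or random.Random()

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise error(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorate

@flaky(failure_rate=0.2)
def load_profile(user_id):
    # Stand-in for a real dependency call; purely illustrative.
    return {"id": user_id}
```

Passing a seeded `random.Random` as `rng` makes the injected failures reproducible in tests.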

Slide 32

Slide 32 text

People
Who holds the knowledge about the system?
Do people know what to do?
How do you escalate?
Send people on vacation!

Slide 33

Slide 33 text

–Sidney Dekker “Everything that can break will work and then we will make wrong assumptions about the reliability.”

Slide 34

Slide 34 text

We are hiring in Paris and SF
THANK YOU!
Build Unique Search Experiences
[email protected]
@AdamSurak