Reliably Scaling to your First Million Users

Reliably Scaling to your First Million Users with Chaos Engineering
Ho Ming Li, Principal Solutions Architect, Gremlin
[email protected] @horeal
Adobe CE Meetup, May 2019

Ho Ming Li
Principal Solutions Architect, Gremlin
[email protected]
• Assist in “Digital Transformation”
• Share Architectural Best Practices
• Share Operational Best Practices
• Facilitate GameDays
Quite possibly the only Solutions Architect who became a Technical Account Manager at AWS.
@HoReaL @GremlinInc

I’m not Netflix. I’m not Amazon. I’m not _________. I don’t have their problems.
… okay.

Do you want to scale? Do you want to be… the next Amazon, or the next Netflix?

The Journey
1. Build.
2. Scale.
3. Profit.

The Journey (Hardcore aka Reality)
Build “MVP”, find bugs, fix bugs, build new features, find bugs, fix bugs, build new features, P0 hard down, on-call hero saves the day, build new features, P1 incident, fix bugs, P0 hard down, customer complains, fix bugs, new bugs show up, fix new bugs, P2 issue, build new features, P1 incident, fix bugs, product release, P2 issue comes back as P0 hard down… Frustrated customer looks for an alternative, churn rate increases, business struggles…

Build and Operate, so that your service actually works serving your Customer.
Service Down means No Value to customers.
Downtime Sucks!

Can be costly...
The head of San Francisco’s Municipal Transportation Agency is stepping down amid the fallout from a 10-hour meltdown that choked the city on Friday, drawing anger from City Hall.

How do we combat Downtime?
Chaos Engineering

We test proactively, instead of waiting for an outage.

Chaos Engineering
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Like a vaccine, we inject harm to build immunity.

11 Attacks
PAUSE

Let’s begin our Journey.
* disclaimer - user numbers will vary for your particular service

In the beginning...
1 to 100 Users… Maybe?
M-V-P
DEPLOYMENT: Rsync? Heroku?
ENVIRONMENT: Your Laptop
ARCHITECTURE: Monolith

Chaos Engineering? Back Burner (low priority)
Level Undefined
FOCUS: Time to Market
APPROACH: Functional MVP
DESIRE: Proving out the Idea

First taste of scaling...
100 to 1000+ Users
Scale
DEPLOYMENT: CI/CD Pipelines
ENVIRONMENT: Dev → Stage → Prod
ARCHITECTURE: 3-tier (Front End, Back End, Data Store)

• Do we know when a host goes away?
• How do we remove/patch hosts?
• Can we replace a smaller host with a bigger host?
• Can we scale out/in to add/remove hosts?
• Is there any auto-healing mechanism if hosts fail a health check?
• Is “S.T.O.N.I.T.H.” in our toolbox?

CHAOS MONKEY-ESQUE
Host Takedown
Level 0
FOCUS: Detect and Remediate (Manual to Auto)
APPROACH: Random host failing
DESIRE: Ability to replace hosts

Takedown
Shutdown and Reboot a host
$ shutdown -r
$ gremlin shutdown -r    # available with Gremlin
Killing a process
$ pkill httpd
$ gremlin attack process_killer -p httpd
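
Killing a process is only half of the experiment; the other half is checking whether anything brings it back. A minimal sketch, assuming the process is supervised and that `pgrep` is available on the host. The `wait_until` helper is a hypothetical name introduced here, not part of any tool mentioned in the deck.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls COMMAND about once per second until it succeeds or the timeout elapses.
# Returns 0 if COMMAND succeeded in time, 1 otherwise.
wait_until() {
  wu_timeout=$1
  shift
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  while [ "$(date +%s)" -lt "$wu_deadline" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example experiment (commented out so that sourcing this file is harmless):
#   pkill httpd
#   if wait_until 30 pgrep -x httpd; then
#     echo "httpd came back within 30s"
#   else
#     echo "httpd did NOT come back: no auto-healing in place?"
#   fi
```

The same helper works for host takedowns if the probe is swapped for a health-check request against the load balancer.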

General Architectural Guidance
• Set up (and verify!) Monitoring & Alerting
• Leverage Multiple Zones
• Identify Stateful vs Stateless hosts
• Replication of State
• Scale out for Stateless

Operational Challenges
1,000 to 10,000+ Users
Operational Excellence
DEPLOYMENT: CI/CD Pipelines
ENVIRONMENT: Multiple, Mixed, Hybrid
ARCHITECTURE: Monolith, 3-Tier, Managed Services

• Which resource is the workload bound by?
• At what threshold do we trigger scaling?
• How long does it take to scale?
• What is the user experience upon encountering failure?
• How can we improve this user experience?

Other Host Failures
Resource Constraints
Level 1
FOCUS: Alerting and Basic Operations
APPROACH: Disciplined: benchmark, measure
DESIRE: Prepare for host-level failures

CPU
$ gremlin attack cpu    # available with Gremlin
$ while :; do :; done
$ stress --cpu 2 --timeout 60
$ dd if=/dev/zero of=/dev/null conv=sync
$ yes > /dev/null &
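
Most of the do-it-yourself CPU commands above run until someone stops them by hand. A minimal sketch of a time-boxed variant, so the experiment always ends on its own; `burn_cpu` is a hypothetical helper name, and it keeps a single core busy.

```shell
# burn_cpu SECONDS
# Spins one busy loop in the background, then stops it after SECONDS.
burn_cpu() {
  bc_seconds=$1
  ( while :; do :; done ) &      # same busy loop as on the slide
  bc_pid=$!
  sleep "$bc_seconds"
  kill "$bc_pid" 2>/dev/null     # end of the blast window
  wait "$bc_pid" 2>/dev/null || true
  return 0
}

# Example: burn one core for 60 seconds while watching your CPU alert.
#   burn_cpu 60
```

Start one `burn_cpu` per core (each with a trailing `&`) to load a multi-core host.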

Disk (capacity and IO)
$ gremlin attack disk
$ gremlin attack io
$ fallocate -l 10G outfile
$ dd if=/dev/urandom of=/tmp/outfile bs=$((1024*1024)) count=1024
Memory
$ gremlin attack memory
$ stress -m 1 --vm-bytes 1G
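
The `fallocate` and `dd` commands above leave the file behind, so the disk stays full until somebody cleans up. A minimal sketch that fills, holds, and then removes the file; `fill_disk` and the `chaos-fill` file prefix are hypothetical names, and the file is written to `$TMPDIR` (or `/tmp`), which may not be the volume you want to exercise.

```shell
# fill_disk MEGABYTES HOLD_SECONDS
# Writes MEGABYTES of zeros to a temp file, keeps it for HOLD_SECONDS,
# then deletes it so the host returns to its previous state.
fill_disk() {
  fd_megabytes=$1
  fd_hold=$2
  fd_file=$(mktemp "${TMPDIR:-/tmp}/chaos-fill.XXXXXX") || return 1
  dd if=/dev/zero of="$fd_file" bs=$((1024*1024)) count="$fd_megabytes" 2>/dev/null
  fd_status=$?
  fd_size=$(wc -c < "$fd_file" | tr -d ' ')   # bytes actually written
  sleep "$fd_hold"
  rm -f "$fd_file"                            # always clean up
  return "$fd_status"
}

# Example: hold 1 GiB for two minutes while watching the disk-usage alert.
#   fill_disk 1024 120
```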

General Guidance
• Establish incident management practice and process
• Discuss Backup and Recovery (BCP/DR)
• Run those exercises!!!

Dependency Pain
100,000+ Users
“Digital Transformation”
DEPLOYMENT: Centralize, or decentralize?
ENVIRONMENT: Kubernetes is the new hotness
ARCHITECTURE: Heavily adopting microservices

What if THEY fail?
Network Failures
Level 1.5
FOCUS: Error handling
APPROACH: Experiments (GameDay)
DESIRE: Prepare for high impact events

Network
Network Gremlin
$ gremlin attack latency
Traffic Control (TC)
$ tc qdisc add dev eth0 root netem delay 1000ms 500ms
iptables
$ iptables -A OUTPUT -p tcp -d 157.240.0.0/16 -j DROP
PF (Mac)
block quick from any to 157.240.0.0/16
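
The `tc` command above stays in effect until the qdisc is removed, which is easy to forget in the middle of a GameDay. A minimal sketch that pairs the injection with its rollback. `latency_experiment` and the `APPLY` switch are hypothetical names introduced here; by default the function only prints the commands, and actually applying them needs root and a Linux host with `tc`.

```shell
# latency_experiment DEVICE DELAY HOLD_SECONDS
# Adds a netem delay on DEVICE, holds it, then removes the root qdisc.
# Dry run by default: set APPLY=1 to really execute the commands.
latency_experiment() {
  le_dev=$1
  le_delay=$2
  le_hold=$3
  if [ "${APPLY:-0}" = 1 ]; then
    le_run=""        # execute commands
  else
    le_run="echo"    # only print what would be executed
  fi
  $le_run tc qdisc add dev "$le_dev" root netem delay "$le_delay" || return 1
  $le_run sleep "$le_hold"
  $le_run tc qdisc del dev "$le_dev" root    # rollback
}

# Example (prints the three commands without touching the network):
#   latency_experiment eth0 1000ms 60
```

Reviewing the dry-run output with the team before setting `APPLY=1` is a cheap way to agree on the blast radius.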

“Control” Complexity
1 mil Users
The Unknown
DEPLOYMENT: Balancing Agility and Quality
ENVIRONMENT: Mixed, K8S on Multiple Providers
ARCHITECTURE: uServices, OSS, a bit of everything

Putting it all together...
Host Takedown, Resource Limits, Network Failures
Unknown → Known
FOCUS: Business Metrics, User Experience
APPROACH: Automated Experiments, New Manual Experiments
DESIRE: Verifiable Resilience!
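
An automated experiment can be sketched as: confirm the steady state, inject the failure, then confirm the steady state again. The snippet below is one possible shape under that assumption; `run_experiment`, `probe_ok`, and `attack_noop` are hypothetical names, and the probe would normally query a business or user-experience metric rather than return a constant.

```shell
# run_experiment PROBE ATTACK
# PROBE and ATTACK are names of commands or shell functions.
# Returns 0 if the steady state held, 1 if it was lost,
# and 2 if the system was already unhealthy before the attack.
run_experiment() {
  re_probe=$1
  re_attack=$2
  if ! "$re_probe"; then
    echo "ABORT: steady state not met before the attack"
    return 2
  fi
  "$re_attack"
  if "$re_probe"; then
    echo "PASS: steady state held"
    return 0
  fi
  echo "FAIL: steady state lost"
  return 1
}

# Placeholder probe and attack, for illustration only.
probe_ok() { true; }
attack_noop() { :; }

# Example:
#   run_experiment probe_ok attack_noop
```

Refusing to start when the baseline is already unhealthy keeps an automated run from adding harm on top of a real incident.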

Don’t wait till you have a service death star.
Start early. Start small. Start now.

Thank you!
Reliably Yours,
Ho Ming Li
[email protected] @HoReaL @GremlinInc
tinyurl.com/chaoseng
meetup.com/pro/chaos
