It's All About Scale (By: Arslan Arshad) - DevFest 2022

It’s all about scale: Design & Reliability at Planet Scale
Arslan Arshad - SWE / SRE @ Google

Proprietary + Confidential
Agenda
• What is SRE?
• Architectural Principles
  ◦ Learn by building
  ◦ Managing state in a distributed system
  ◦ Iterating on initial design
  ◦ Preparing for launch day
• Fault Handling

Site Reliability Engineering
• Who are we?
  ◦ Software engineers on a unique mission
• What do we do?
  ◦ Keep services running, scale them, and make them reliable
• How do we do it?
  ◦ A mix of proactive and reactive engineering to make our services better
• Why do we do it?
  ◦ Users expect speed, reliability, and correct operation of services

Architectural Principles

Let’s Build - Global Photo Hosting Service
• Users search photos
• Users upload photos
• Everything is public
• Global availability

Let’s Start Simple
[Diagram labels: Machine, Application Code, App Logic, Storage]

Aim For Stateless Service
Any server can handle any incoming request and treats each request independently.
[Diagram labels: App Logic, Storage]
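
One way to read the stateless rule: a handler keeps nothing between requests, and everything it needs arrives with the request or lives in the shared storage. The sketch below is only an illustration. `handle_request`, the status codes it returns, and the in-memory `STORAGE` dict (standing in for the storage tier) are all assumptions, not part of the talk.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the shared storage tier. In a real service this
# would be an external store reachable from every server.
STORAGE: Dict[str, bytes] = {}


def handle_request(
    method: str, photo_id: str, body: Optional[bytes] = None
) -> Tuple[int, bytes]:
    """Handle one request using only its arguments and the shared storage.

    No per-server memory (sessions, caches, counters) is consulted, so any
    server running this function would give the same answer.
    """
    if method == "PUT":
        if body is None:
            return (400, b"missing body")
        STORAGE[photo_id] = body
        return (201, b"")
    if method == "GET":
        data = STORAGE.get(photo_id)
        if data is None:
            return (404, b"")
        return (200, data)
    return (405, b"")
```

Because the function depends on nothing but its inputs and the store, a load balancer is free to send consecutive requests from the same user to different servers.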

Limitation: Less Compute & Poor Reliability
• Can only handle a limited number of requests per second
• Hardware / software failures lead to service unavailability
• Hardware / software upgrades lead to service unavailability
[Diagram: Vertical Scaling (one large failure domain) vs. Horizontal Scaling (multiple App Logic machines, multiple small failure domains)]

Solution: Horizontal Scaling
• More machines => More QPS
• Autoscale horizontally to adjust capacity to demand
• Sharding: Requests spread across multiple servers
[Diagram: Load Balancer, four App Logic machines, Storage]
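
The load balancer's job of spreading requests across servers can be sketched with a round-robin picker. This is one simple policy chosen for illustration; the talk does not say which policy is used. `RoundRobinBalancer` and `total_qps` are hypothetical names, and the capacity formula assumes a perfectly even spread with no shared bottleneck.

```python
import itertools
from typing import List


class RoundRobinBalancer:
    """Spreads requests across a fixed pool of backends in turn."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("need at least one backend")
        self._cycle = itertools.cycle(list(backends))

    def pick(self) -> str:
        """Return the backend that should receive the next request."""
        return next(self._cycle)


def total_qps(instances: int, qps_per_instance: float) -> float:
    """Idealized capacity: more machines => more QPS, assuming an even spread."""
    return instances * qps_per_instance
```

For example, a balancer built over `["a", "b", "c"]` hands out `a`, `b`, `c`, `a`, and so on.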

Limitation: Doing different things at the same time
• Image Uploading Operation:
  ◦ Low QPS
  ◦ Long duration of requests
  ◦ Variable payload, as image sizes / types might vary
  ◦ Optional: Image processing might work better on different hardware, e.g. GPU
• Image Serving Operation:
  ◦ Relatively high QPS
  ◦ Short duration of requests
  ◦ Optimized for latency instead of throughput
• The mix of requests can change, making it hard to plan capacity and operate the server

Solution: Do one thing and do it well
• Consistent behavior per server, easier for resource planning
• Can have more optimized hardware & networking requirements
[Diagram: Load Balancer, Upload Machine pool (three App Logic machines), Search Machine pool (three App Logic machines), Storage]
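
Splitting the service by operation means the front end has to send each request to the pool dedicated to it. A minimal sketch, assuming path-based routing; the pool names, host names, and URL prefixes are made up for illustration.

```python
from typing import Dict, List

# Hypothetical pools and hosts, for illustration only.
POOLS: Dict[str, List[str]] = {
    "upload": ["upload-1", "upload-2"],
    "search": ["search-1", "search-2", "search-3"],
}


def pool_for(path: str) -> str:
    """Map a request path to the pool dedicated to that operation."""
    if path.startswith("/upload"):
        return "upload"
    if path.startswith("/search"):
        return "search"
    raise ValueError(f"no pool serves {path!r}")


def route(path: str, request_number: int) -> str:
    """Pick a host inside the right pool using a simple modulo spread."""
    hosts = POOLS[pool_for(path)]
    return hosts[request_number % len(hosts)]
```

Each pool can then be sized and tuned for its own traffic profile, for example fewer but larger hosts for uploads.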

Replicate Everything
[Diagram: Upload Machine and Search Machine pools of App Logic machines, with three Load Balancers and three Storage instances]

Can’t go stateless: Managing State
• CAP Theorem
  ◦ Pick any two
    ▪ Consistency
    ▪ Availability
    ▪ Partition Tolerance
• Distributed consensus algorithms
  ◦ Protocols: Paxos, Raft, and ZAB
  ◦ Use cases:
    ▪ Leader Election
    ▪ Globally Consistent DBs
    ▪ Reliable Message Queuing
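
The consensus protocols named above share one building block: a decision counts only when a majority of the whole cluster agrees, and since any two majorities overlap, two separated groups cannot both decide. The sketch below shows only that majority rule applied to leader election. It is not an implementation of Paxos, Raft, or ZAB, and the function names are hypothetical.

```python
from typing import Dict, Optional


def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def elect_leader(votes: Dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the candidate backed by a majority of the whole cluster, else None.

    `votes` maps voter id -> candidate id. Nodes cut off by a network
    partition simply do not appear in `votes`.
    """
    tally: Dict[str, int] = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count >= majority(cluster_size):
            return candidate
    return None
```

In a five-node cluster split three and two, only the side with three nodes can elect a leader; the minority side gets `None` and has to stop accepting writes, which is the consistency-over-availability choice during a partition.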

Iterate on the design
• Look for bottlenecks
  ◦ Verify each function can scale horizontally
• Work out the hardware requirements for each function
  ◦ CPU, network, disk accesses, RAM
  ◦ Sanity check expected concurrent connections, etc.
• Look for ways to improve
  ◦ Any single points of failure or other weaknesses?
  ◦ Is it efficient in terms of resources?
  ◦ How can the design accommodate change?
  ◦ Is there a better/simpler design?

Preparation for launch day
• Expected number of photo uploads: 1k/second
• Expected number of thumbnail views: 10k/second
• Expected number of photo views: 100/second
• Resource Planning:
  ◦ Load balancer / networking:
    ▪ Uploads: 1,000 * 4 MB = 4 GB/s
    ▪ Serving: 10,000 * 200 KB + 100 * 4 MB = 2.4 GB/s
  ◦ Number of instances for upload operation:
    ▪ 300 ms per request, with each instance handling 10 requests concurrently
    ▪ 0.3 s * 1,000 requests/s / 10 => We need 30 instances
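
The launch-day arithmetic can be checked with a few lines of Python. The traffic figures, object sizes, latency, and per-instance concurrency come from the slide; the variable names and the use of decimal units (1 MB = 1,000,000 bytes) are assumptions made to match the slide's round numbers.

```python
import math

KB = 1_000          # decimal units, matching the slide's round numbers
MB = 1_000_000
GB = 1_000_000_000

uploads_per_s = 1_000
thumbnail_views_per_s = 10_000
photo_views_per_s = 100
photo_size = 4 * MB
thumbnail_size = 200 * KB

# Bandwidth through the load balancer / network, in bytes per second.
upload_bandwidth = uploads_per_s * photo_size
serving_bandwidth = (
    thumbnail_views_per_s * thumbnail_size + photo_views_per_s * photo_size
)

# Requests in flight = arrival rate * time per request (Little's law).
upload_latency_ms = 300
concurrency_per_instance = 10
in_flight = uploads_per_s * upload_latency_ms / 1000
upload_instances = math.ceil(in_flight / concurrency_per_instance)

print(upload_bandwidth / GB, serving_bandwidth / GB, upload_instances)
```

The 30 instances are the bare minimum for the expected load. The calculation leaves no headroom for failed machines or traffic spikes, so a real plan would add a margin on top.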

Fault Handling

Considering Failure Modes
• One of the challenges when designing highly scalable systems.
• As your service grows, what will start to fail and how will that show itself?
• What failure modes can you prevent before they start?
• How can you keep serving and growing your service in spite of failures?

Component Failures
• This is the most common type of failure
• Hardware failures:
  ◦ Server
  ◦ Network switch
  ◦ Disk / storage
• Ensure multiple failure domains, no single point of failure
• Build software to handle failovers properly
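
Handling failover in software can be as simple as trying the next replica when one is unreachable. A minimal sketch under stated assumptions: `fetch_with_failover` and `AllReplicasFailed` are hypothetical names, the caller supplies the `fetch` function, and only `ConnectionError` is treated as a component failure.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def fetch_with_failover(
    replicas: Sequence[str], fetch: Callable[[str], bytes]
) -> bytes:
    """Try each replica in order and return the first successful response."""
    errors: List[str] = []
    for replica in replicas:
        try:
            return fetch(replica)
        except ConnectionError as exc:
            # Treat this as a component failure and move on to the next replica.
            errors.append(f"{replica}: {exc}")
    raise AllReplicasFailed("; ".join(errors))
```

This only helps if the replicas sit in different failure domains. Three replicas behind the same network switch still share a single point of failure.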

Query Of Death
• A query / request that makes your service crash
  ◦ Deliberate attack
  ◦ Just bad luck
• No great solutions, but possible mitigations:
  ◦ Rate limiting requests per user
  ◦ Limiting the size of requests per user
  ◦ Proactively blocking untrustworthy users
  ◦ Fuzz testing to ensure bad inputs are handled properly
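
Two of the mitigations above, per-user rate limiting and a request size cap, can be sketched with a token bucket. The token bucket is one common technique chosen here for illustration; the talk does not name a specific algorithm. The limits, `admit`, and `TokenBucket` are assumptions, and the clock is passed in explicitly so the behavior is easy to test.

```python
from typing import Dict

MAX_REQUEST_BYTES = 10 * 1024 * 1024  # hypothetical per-request size cap


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


BUCKETS: Dict[str, TokenBucket] = {}


def admit(user: str, size_bytes: int, now: float) -> bool:
    """Reject oversized requests, then apply the per-user rate limit."""
    if size_bytes > MAX_REQUEST_BYTES:
        return False
    if user not in BUCKETS:
        BUCKETS[user] = TokenBucket(rate=1.0, capacity=2.0, now=now)
    return BUCKETS[user].allow(now)
```

With these example limits a user can send two requests back to back and then one more per second. In production, `now` would come from a monotonic clock such as `time.monotonic()`.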

Avoid Global Change
• Design your system and tools to avoid changes that are applied instantaneously everywhere
• Slow rollouts with proper canarying
  ◦ Binaries
  ◦ Configurations
• Rate limiting administrative commands
• Avoid making changes in more than one zone at a time
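
A slow rollout with a canary can be sketched as a loop over stages that halts as soon as a stage looks unhealthy. This is a simplified illustration. `staged_rollout`, the stage layout, and the caller-supplied `apply_change` and `healthy` callbacks are hypothetical, and a real rollout would also wait between stages and roll back on failure.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change one stage at a time and stop at the first unhealthy stage.

    Returns the targets that were updated. The first stage acts as the canary.
    """
    updated: List[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            updated.append(target)
        if not all(healthy(target) for target in stage):
            break  # halt before the change reaches later stages
    return updated
```

Putting a single canary in the first stage and one zone in each later stage keeps a bad binary or configuration from reaching every zone at once.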

Book covers copyright O’Reilly Media. Used with permission. Find Google
SRE publications—including the SRE Books, articles, trainings, and more—for free at sre.google/resources.

Questions?

GDG Lahore