GDG Lahore
December 17, 2022

It's All About Scale (By: Arslan Arshad) - DevFest 2022

Talk by Arslan Arshad (https://twitter.com/SudoArslan) at DevFest Lahore 2022 by GDG Lahore.

Transcript

  1. Proprietary + Confidential
     Agenda
     • What is SRE?
     • Architectural Principles
       ◦ Learn by building
       ◦ Managing state in a distributed system
       ◦ Iterating on initial design
       ◦ Preparing for launch day
     • Fault Handling
  2. Site Reliability Engineering
     • Who are we?
       ◦ Software Engineers on a unique mission
     • What do we do?
       ◦ Keep services running, scale them and make them reliable
     • How do we do it?
       ◦ A mix of proactive and reactive engineering to make our services better
     • Why do we do it?
       ◦ Users expect speed, reliability and correct operation of services
  3. Let’s Build: Global Photo Hosting Service
     • Users Search Photos
     • Users Upload Photos
     • Everything Is Public
     • Global Availability
  4. Aim For Stateless Service
     • Any server can handle any incoming request and treats each request independently
     [Diagram: App Logic, Storage]
  5. Limitation: Less Compute & Poor Reliability
     • Can only handle a limited number of requests per second
     • Hardware / software failures lead to service unavailability
     • Hardware / software upgrades lead to service unavailability
     [Diagram: Vertical Scaling (one large failure domain) vs. Horizontal Scaling (multiple small failure domains)]
  6. Solution: Horizontal Scaling
     • More machines => More QPS
     • Autoscale horizontally to adjust capacity to demand
     • Sharding: Requests spread across multiple servers
     [Diagram: Load Balancer in front of several App Logic machines, backed by Storage]
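
The two ways of spreading requests mentioned on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the names `RoundRobinBalancer` and `pick_shard` are hypothetical, and a real load balancer would also track server health and load.

```python
import hashlib


class RoundRobinBalancer:
    """Hands out servers in turn, so any server may get any request."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._next = 0

    def next_server(self):
        server = self._servers[self._next]
        self._next = (self._next + 1) % len(self._servers)
        return server


def pick_shard(key, servers):
    """Maps a key (for example a photo ID) to one server by hashing it.

    Assumption: simple modulo hashing. Adding or removing a server
    remaps most keys; consistent hashing is a common way to reduce that.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Round robin suits stateless request handling, while hashing on a key is one way to shard work so that the same key always reaches the same server.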
  7. Limitation: Doing different things at the same time
     • Image Uploading Operation:
       ◦ Low QPS
       ◦ Long duration of requests
       ◦ Variable payload, as image sizes / types might vary
       ◦ Optional: Image processing might work better on different hardware, e.g. GPU
     • Image Serving Operation:
       ◦ Relatively High QPS
       ◦ Short duration of requests
       ◦ Optimized for latency instead of throughput
     • Mix of requests can change, making it hard to plan for capacity and operation of servers
  8. Solution: Do one thing and do it well
     • Consistent behavior per server, easier for resource planning
     • Can have more optimized hardware & networking requirements
     [Diagram: Load Balancer routing to separate pools of Upload machines and Search machines, backed by Storage]
  9. Replicate Everything
     [Diagram: multiple Load Balancers, Upload machines, Search machines and Storage replicas]
  10. Can’t go stateless: Managing State
      • CAP Theorem
        ◦ Pick any two
          ▪ Consistency
          ▪ Availability
          ▪ Partition Tolerance
      • Distributed consensus algorithms
        ◦ Protocols: Paxos, Raft, and ZAB
        ◦ Use cases:
          ▪ Leader Election
          ▪ Globally Consistent DBs
          ▪ Reliable Message Queuing
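
Majority-based protocols such as Paxos and Raft only make progress when more than half of the nodes agree. The sketch below illustrates just that quorum arithmetic with a toy vote count. It is not an implementation of any of the protocols named on the slide, and `majority`, `tolerated_failures` and `elect_leader` are hypothetical helper names.

```python
from collections import Counter


def majority(cluster_size):
    """Smallest number of nodes that is more than half the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be positive")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many nodes can be down while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


def elect_leader(votes, cluster_size):
    """Returns the candidate backed by a majority of the cluster, else None.

    `votes` maps node name -> candidate name. Nodes that did not vote
    (for example because they are unreachable) are simply absent.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= majority(cluster_size) else None
```

With five nodes a majority is three, so two nodes can fail. With four nodes a majority is still three, so only one can fail, which is why odd cluster sizes are the usual choice.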
  11. Iterate on the design
      • Look for bottlenecks
        ◦ Verify each function can scale horizontally
      • Work out the hardware requirements for each function
        ◦ CPU, network, disk accesses, RAM
        ◦ Sanity check expected concurrent connections, etc.
      • Look for ways to improve
        ◦ Any single points of failure or other weaknesses?
        ◦ Is it efficient in terms of resources?
        ◦ How can the design accommodate change?
        ◦ Is there a better/simpler design?
  12. Preparation for launch day
      • Expected number of photo uploads: 1k/second
      • Expected number of thumbnail views: 10k/second
      • Expected number of photo views: 100/second
      • Resource Planning:
        ◦ Load balancer / networking:
          ▪ Uploads: 1,000 * 4 MB = 4 GB/s
          ▪ Serving: 10,000 * 200 KB + 100 * 4 MB = 2.4 GB/s
        ◦ Number of instances for upload operation:
          ▪ 300 ms per request, with each instance handling 10 requests concurrently
          ▪ 0.3 * 1,000 requests / 10 => We need 30 instances
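
The resource-planning arithmetic on this slide can be reproduced with a short script. The figures are the ones from the slide. The function names are hypothetical, and the decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) are an assumption that matches the rounded totals shown.

```python
import math

KB = 1_000        # assumption: decimal units, as the slide's totals imply
MB = 1_000_000
GB = 1_000_000_000


def bandwidth_bytes_per_second(flows):
    """Sums rate * payload size over (requests_per_second, bytes) pairs."""
    return sum(rate * size for rate, size in flows)


def instances_needed(requests_per_second, ms_per_request, concurrency):
    """Requests in flight (rate * duration) divided by per-instance concurrency."""
    in_flight = requests_per_second * ms_per_request / 1000
    return math.ceil(in_flight / concurrency)


upload_bw = bandwidth_bytes_per_second([(1_000, 4 * MB)])
serving_bw = bandwidth_bytes_per_second([(10_000, 200 * KB), (100, 4 * MB)])
upload_instances = instances_needed(1_000, 300, 10)

print(f"uploads: {upload_bw / GB} GB/s")
print(f"serving: {serving_bw / GB} GB/s")
print(f"upload instances: {upload_instances}")
```

The instance count is the steady-state minimum. It leaves no headroom for traffic spikes or failed machines, so a real plan would add spare capacity on top of it.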
  13. Considering Failure Modes
      • One of the challenges when designing highly scalable systems
      • As your service grows, what will start to fail and how will that show itself?
      • What failure modes can you prevent before they start?
      • How can you keep serving and growing your service in spite of failures?
  14. Component Failures
      • This is the most common type of failure
      • Hardware failures:
        ◦ Server
        ◦ Network Switch
        ◦ Disk / Storage
      • Ensure multiple failure domains, no single point of failure
      • Build software to handle failovers properly
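
One simple way for software to handle failover is to try the same request against replicas in different failure domains until one of them answers. The sketch below assumes each replica is a callable that raises `ConnectionError` when it is unavailable. `call_with_failover` and `AllReplicasFailed` are hypothetical names used for illustration.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas, request):
    """Tries each replica in order and returns the first successful result.

    Only ConnectionError is treated as "replica unavailable"; any other
    exception propagates, since retrying a bad request elsewhere could
    spread the damage to every replica.
    """
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")
```

For example, with one replica that is down and one that is healthy, the call returns the healthy replica's response and the caller never sees the failure.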
  15. Query Of Death
      • Query / request that is making your service crash
        ◦ Deliberate attack
        ◦ Just bad luck
      • No great solutions, but possible mitigations:
        ◦ Rate limiting requests per user
        ◦ Limiting the size of the request per user
        ◦ Proactively blocking untrustworthy users
        ◦ Fuzz testing to ensure bad inputs are handled properly
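
The first two mitigations, per-user rate limiting and a request size cap, can be sketched with a token bucket per user. The slide does not name an algorithm, so the token bucket is an assumption, and the class names and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity`, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class PerUserLimiter:
    """Rejects oversized requests and users who exceed their request rate."""

    def __init__(self, rate, capacity, max_request_bytes, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._max_request_bytes = max_request_bytes
        self._clock = clock
        self._buckets = {}

    def allow(self, user_id, request_bytes):
        if request_bytes > self._max_request_bytes:
            return False
        if user_id not in self._buckets:
            self._buckets[user_id] = TokenBucket(
                self._rate, self._capacity, self._clock
            )
        return self._buckets[user_id].allow()
```

The injectable `clock` is only there to make the limiter testable without sleeping. In a replicated service the buckets would need to live in shared state or be enforced at the load balancer, which this sketch does not cover.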
  16. Avoid Global Change
      • Design your system and tools to avoid changes that are applied instantaneously everywhere
      • Slow rollouts with proper canarying
        ◦ Binaries
        ◦ Configurations
      • Rate limiting administrative commands
      • Avoid making changes in more than one zone at the same time
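
A slow rollout can be sketched as a loop that changes one zone at a time, treats the first zone as the canary, and undoes everything as soon as a health check fails. The `deploy`, `healthy` and `rollback` callables are hypothetical hooks supplied by the caller, not part of any real deployment tool.

```python
def staged_rollout(zones, deploy, healthy, rollback):
    """Applies a change zone by zone and stops at the first unhealthy zone.

    Returns True if every zone was updated and stayed healthy. On failure,
    rolls back the zones already changed (most recent first) and returns
    False, so the remaining zones are never touched.
    """
    updated = []
    for zone in zones:
        deploy(zone)
        updated.append(zone)
        if not healthy(zone):
            for done in reversed(updated):
                rollback(done)
            return False
    return True
```

In practice the health check would wait for a soak period and compare error rates against the untouched zones before the rollout moves on. The same loop applies to binaries and to configuration changes.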
  17. Book covers copyright O’Reilly Media. Used with permission.
      Find Google SRE publications, including the SRE Books, articles, trainings, and more, for free at sre.google/resources.