
Large scale distributed systems patterns

Ryosuke Iwanaga

September 22, 2025

Transcript

  1. Agenda: visit architecture patterns and talk about problems
     * Web app
     * Cloud native
     * Microservices at scale
     * Resource management
     * Event system
     * Problem 1: Cold start
     * Problem 2: Poison pill
  2. Takeaways
     * Distributed systems always fail => Design for failure
     * One solution introduces another problem => Design exercise
  3. My experience in large scale distributed systems
     * 2009: Mobile browser gaming (SRE / DBA)
       * ~5,000 physical servers
       * MySQL replication, sharding
       * Datacenter operations
     * 2015: Cloud (Solutions architect)
       * Architecture
       * Container
       * Analytics
     * 2018: Distributed datastore (Developer)
       * Distributed system
       * ~50 microservices
       * Horizontal scale
       * Cell-based
     * 2025: AI!
  4. Web app distributed system
     * User -> LB -> App, App, App, App, ...
     * Several DB groups, each with a DB Writer and DB Readers (DB Writer, DB Reader, DB Reader, ...)
     * Payment
     * Typical problems: Replication delay, Deadlock, Write bottleneck, LB's scalability
  5. Web app distributed system
     * LB -> App, App, App, App, ...
     * DB (Write), DB (Read), DB (Read), ...
     * Cache, Cache, Cache, ...
     * Typical problems: Cache invalidation, Cache scalability
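The cache invalidation problem named on this slide can be made concrete with a small model. The sketch below is a minimal cache-aside pattern with invalidate-on-write, assuming a single in-memory dictionary stands in for the DB tier and another for the cache tier. The class name `CacheAsideStore` and its methods are hypothetical and not taken from the deck.

```python
class CacheAsideStore:
    """Cache-aside reads with invalidate-on-write (hypothetical in-memory model)."""

    def __init__(self):
        self.db = {}       # stands in for the DB writer and readers
        self.cache = {}    # stands in for the cache tier
        self.db_reads = 0  # counts how often a read falls through to the DB

    def read(self, key):
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read the DB and populate the cache.
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        self.db[key] = value
        # Invalidate instead of updating, so the next read reloads from the DB.
        self.cache.pop(key, None)


store = CacheAsideStore()
store.write("user:1", "alice")
first = store.read("user:1")   # miss, loads from the DB
second = store.read("user:1")  # hit, served from the cache
```

With several cache nodes and replicated DBs the invalidation has to reach every copy, which is where this simple model stops being sufficient.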
  6. Cloud resource distributed system
     * Resource orchestrator/manager places App1, App2, App3 onto Server 1, Server 2, ..., Server N
     * Examples: Amazon EC2, Eucalyptus, OpenStack; Hadoop, Mesos, YARN, Omega, Borg, k8s
     * Typical problems: Manager’s scalability, Consistency
  7. Event distributed system (e.g. Lambda architecture)
     * App, App, App -> Stream
     * Speed layer: Stream process 1, Stream process 2
     * Batch layer: Object storage, Batch process 1, Batch process 2
     * Typical problems: At least once, At most once, Stream scalability, Back pressure
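The "at least once" problem above means a consumer may see the same event more than once. One common way to cope, shown here only as a sketch, is to make the consumer idempotent by remembering event ids it has already applied. The function name, the `(event_id, payload)` tuple shape, and the in-memory `seen` set are assumptions for illustration; a real stream processor would need durable deduplication state.

```python
def process_at_least_once(events, handler, seen=None):
    """Apply handler once per event id even if the stream redelivers events."""
    seen = set() if seen is None else seen
    applied = []
    for event_id, payload in events:
        if event_id in seen:
            continue  # duplicate delivery: already applied, skip it
        handler(payload)
        seen.add(event_id)
        applied.append(event_id)
    return applied


totals = []
# Events 1 and 2 are delivered twice, as an at-least-once stream may do.
deliveries = [(1, 10), (2, 20), (1, 10), (3, 5), (2, 20)]
applied_ids = process_at_least_once(deliveries, totals.append)
```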
  8. Microservice distributed system
     * User 1, User 2, ... -> LB -> Service A (App), Service B (App), Service C (App)
     * Metadata: User 1 => DB 1, User 2 => DB 2, User 3 => DB 1, ...
     * DB 1: User 1,3,4; DB 2: User 2,5
     * 💀 marks the failure point in the diagram
  9. Microservice distributed system
     * User 1, User 2, ... -> LB -> Service A (App), Service B (App), Service C (App)
     * Metadata: User 1 => DB 1, User 2 => DB 2, User 3 => DB 1, ...
     * DB 1: User 1,3,4; DB 2: User 2,5
     * Cache 😁?
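The "Cache 😁?" idea can be sketched as a small per-app cache in front of the metadata lookup, so that most requests do not touch the metadata service. This is an illustrative sketch only: `MetadataCache`, the `lookup` callable, and the TTL value are hypothetical names and parameters, not something the deck specifies.

```python
import time


class MetadataCache:
    """Per-app cache of the user => DB mapping with a time-to-live (hypothetical)."""

    def __init__(self, lookup, ttl_seconds=60.0, clock=time.monotonic):
        self._lookup = lookup    # callable that queries the metadata service
        self._ttl = ttl_seconds
        self._clock = clock      # injectable clock, handy for tests
        self._entries = {}       # user -> (db_name, expires_at)

    def db_for(self, user):
        now = self._clock()
        entry = self._entries.get(user)
        if entry is not None and entry[1] > now:
            return entry[0]      # fresh entry: no metadata call needed
        db_name = self._lookup(user)  # miss or expired: ask the metadata service
        self._entries[user] = (db_name, now + self._ttl)
        return db_name


mapping = {"user1": "db1", "user2": "db2", "user3": "db1"}
cache = MetadataCache(mapping.__getitem__, ttl_seconds=30.0)
db = cache.db_for("user3")
```

The cache starts empty after every restart, so an app that has just restarted sends all of its lookups to the metadata service again.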
  10. Cold start problem
      * Warm start: one App 🔄 Restart while the other Apps keep running => ✅
      * Cold start: many Apps 🔄 Restart at the same time => 💀
      * Metadata is shared by all Apps
      * https://aws.amazon.com/message/11201/
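To see why restarting everything at once is different from restarting one app, the sketch below counts how many apps hit the metadata service in the busiest time slot after a mass restart. Spreading first lookups over a random delay (jitter) is a commonly used mitigation and is an assumption here, not something the slide proposes. The function name and its parameters are hypothetical.

```python
import random


def peak_metadata_load(num_apps, window_slots, seed=0):
    """Peak number of apps calling metadata in one time slot after a mass restart.

    Each app waits a random number of slots in [0, window_slots) before its
    first lookup. window_slots=1 models a cold start with no jitter at all.
    """
    rng = random.Random(seed)
    slots = [0] * window_slots
    for _ in range(num_apps):
        slots[rng.randrange(window_slots)] += 1
    return max(slots)


no_jitter = peak_metadata_load(num_apps=100, window_slots=1)     # all 100 at once
with_jitter = peak_metadata_load(num_apps=100, window_slots=20)  # spread over 20 slots
```

Jitter lowers the peak but lengthens recovery time, which is one more instance of a solution introducing another problem.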
  11. Poison pill problem
      * 💊 If user 1's requests trigger a bug in the app that crashes the app... 💀
      * LB: Retry, Retry, Retry, Retry, Retry, Retry, Retry => every App 💀
      * User 1 => 0% availability
      * User 2 => 0% availability
      * User 3 => 0% availability
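The retry cascade on this slide can be modeled in a few lines. The sketch assumes the LB sends each retry to the next app in order and that every app receiving the poison request crashes. The function name and the round-robin retry policy are illustrative assumptions.

```python
def simulate_poison_pill(num_apps, max_retries):
    """Return how many apps are still alive after the LB retries a crashing request."""
    alive = [True] * num_apps
    attempts = 0
    for target in range(num_apps):  # each attempt goes to the next app
        if attempts > max_retries:  # the first attempt plus max_retries retries
            break
        alive[target] = False       # the request triggers the bug and the app crashes
        attempts += 1
    return sum(alive)


survivors_few_retries = simulate_poison_pill(num_apps=8, max_retries=2)   # 3 apps crash
survivors_many_retries = simulate_poison_pill(num_apps=8, max_retries=7)  # all 8 crash
```

Once the retry budget is at least the fleet size minus one, a single user's request leaves no app alive, so every user sees 0% availability.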
  12. Naive sharding
      * User 1, User 2, User 3 are each served by one shard of Apps
      * 💊 from User 1 => 💀 for the Apps in its shard, ✅ for the rest
      * Availability per user: 0%, 100%, 0%
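A minimal sketch of naive sharding follows, assuming users are assigned to shards by `user_id % num_shards`. That mapping is an assumption chosen for illustration, since the slide does not say how users map to shards. A poison user takes down its whole shard, so every user on that shard drops to zero.

```python
def naive_shard(user_id, num_shards):
    """Assign a user to one fixed shard (hypothetical modulo mapping)."""
    return user_id % num_shards


def availability_naive(users, poison_user, num_shards):
    """Availability per user after the poison user has crashed its whole shard."""
    dead_shard = naive_shard(poison_user, num_shards)
    return {
        user: 0.0 if naive_shard(user, num_shards) == dead_shard else 1.0
        for user in users
    }


# Users 1 and 3 share a shard under this mapping, user 2 is on the other one.
result = availability_naive(users=[1, 2, 3], poison_user=1, num_shards=2)
```

The blast radius shrinks from the whole fleet to one shard, but an innocent user on the same shard is still fully down.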
  13. Shuffle sharding
      * https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/
      * 💊 from User 1 => 💀 only for the Apps in User 1's server set, ✅ for the rest
      * Availability per user: 50%, 100%, 0%
      * Legend: number of total servers; size of each shard; server set for user 1, 2; overlap between user 1 and 2 (k)
      * Probability of each overlap:
        * 5 overlap (k=5): 0.0000013%
        * 4 overlap (k=4): 0.00063%
        * 3 overlap (k=3): 0.059%
        * 2 overlap (k=2): 1.8%
        * 1 overlap (k=1): 21%
        * 0 overlap (k=0): 77%
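The overlap percentages can be checked with a hypergeometric calculation. The transcript does not state the total number of servers or the shard size, so `n = 100` and `m = 5` below are inferred values, not the author's. With them the formula reproduces the 0 to 4 overlap percentages to rounding and gives about 0.0000013% for a full 5-server overlap. The function names are hypothetical.

```python
from math import comb


def overlap_probability(n, m, k):
    """P(two random m-server shards drawn from n servers share exactly k servers)."""
    return comb(m, k) * comb(n - m, m - k) / comb(n, m)


def availability_after_poison(m, k):
    """Fraction of a user's m servers left when k of them are shared with the poison user."""
    return (m - k) / m


N_SERVERS = 100  # assumption: inferred from the percentages, not stated in the transcript
SHARD_SIZE = 5   # assumption: inferred from the largest listed overlap, k=5

for k in range(SHARD_SIZE + 1):
    p = overlap_probability(N_SERVERS, SHARD_SIZE, k)
    print(f"{k} overlap: {p * 100:.7f}%  availability left: {availability_after_poison(SHARD_SIZE, k):.0%}")
```

Only a full overlap takes an innocent user to 0% availability. Any partial overlap leaves some servers alive, which retries can still reach.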
  14. What’s next?
      * User 1, User 2, ... -> LB
      * Service A, Service B, Service C, Service D, Service E, each with Apps
      * Metadata (shown twice in the diagram), DB 1, DB 2
      * 💀
  15. Takeaways (again)
      * Distributed systems always fail => Design for failure
      * One solution introduces another problem => Design exercise