
The Economics of Scale: Promises and Perils of Going Distributed

Tyler Treat
September 20, 2015

What does it take to scale a system? We'll learn how going distributed can pay dividends in areas like availability and fault tolerance by examining a real-world case study. However, we will also look at the inherent pitfalls. When it comes to distributed systems, for every promise there is a peril.

Transcript

  1. The Economics of Scale: Promises and Perils of Going Distributed. Tyler Treat, Workiva. September 19, 2015
  2. About The Speaker • Backend engineer at Workiva • Messaging platform tech lead • Distributed systems • bravenewgeek.com • @tyler_treat • [email protected]
  3. About The Talk • Why distributed systems? • Case study • Advantages/Disadvantages • Strategies for scaling and resilience patterns • Scaling Workiva
  4. Scale Up vs. Scale Out • Vertical Scaling: ❖ Add resources to a node ❖ Increases node capacity, load is unaffected ❖ System complexity unaffected • Horizontal Scaling: ❖ Add nodes to a cluster ❖ Decreases load, capacity is unaffected ❖ Availability and throughput w/ increased complexity
  5. Prior to 5.5, MySQL's default storage engine (MyISAM) used table-level locking. Since 5.5 the default (InnoDB) uses row-level locking. Either way, lock contention everywhere.
  6. This alleviates lock contention and improves throughput… but fetching timelines is still extremely costly (now a scatter-gather query across multiple DBs).
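A rough sketch of why the sharded read is costly, assuming tweets are partitioned by author across several databases: every timeline fetch has to query each shard and merge the partial results. The shards here are plain in-memory lists, and `SHARDS`, `query_shard`, and `fetch_timeline` are hypothetical names for illustration, not anything shown in the talk.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for database shards; rows are (timestamp, author, text).
SHARDS = [
    [(1, "alice", "a1"), (4, "alice", "a2")],
    [(2, "bob", "b1"), (6, "bob", "b2")],
    [(3, "carol", "c1"), (5, "carol", "c2")],
]

def query_shard(shard, followed, limit):
    """Per-shard query: the newest `limit` tweets written by followed authors."""
    rows = [row for row in shard if row[1] in followed]
    return sorted(rows, reverse=True)[:limit]

def fetch_timeline(followed, limit=3):
    """Scatter the query to every shard, then gather and merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda s: query_shard(s, followed, limit), SHARDS))
    # Each partial result is sorted newest-first, so a k-way merge keeps that order.
    merged = heapq.merge(*partials, reverse=True)
    return [text for _, _, text in list(merged)[:limit]]
```

Even in this toy version the read touches every shard, including shards that hold nothing the user follows, which is the cost the slide points at.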
  7. Ingestion/Fan-Out Process 1. Tweet comes in 2. Query the social graph service for followers 3. Iterate through each follower and insert tweet ID into their timeline (stored in Redis) 4. Store tweet on disk (MySQL)
  8. Ingestion/Fan-Out Process • Lots of processing on ingest, no computation on reads • Redis stores timelines in memory (very fast) • Fetching a timeline involves no queries: get the timeline from the Redis cache and rehydrate with a multi-get on IDs • If a timeline falls out of cache, reconstitute it from disk • O(n) on writes, O(1) on reads • http://www.infoq.com/presentations/Twitter-Timeline-Scalability
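The ingestion steps and the read path above can be sketched as follows. This is a minimal in-memory illustration, not Twitter's implementation: plain dictionaries stand in for the social graph service, Redis, and MySQL, and `TIMELINE_LIMIT`, `ingest`, and `fetch_timeline` are assumed names.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # assumed cap on cached timeline length

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}  # social graph stand-in
tweet_store = {}  # stand-in for durable storage (MySQL in the talk)
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))  # stand-in for Redis
next_id = 0

def ingest(author, text):
    """Write path: work grows with the author's follower count."""
    global next_id
    next_id += 1
    tweet_id = next_id
    for follower in followers.get(author, []):  # steps 2-3: fan out the tweet ID
        timelines[follower].appendleft(tweet_id)
    tweet_store[tweet_id] = (author, text)  # step 4: persist the tweet itself
    return tweet_id

def fetch_timeline(user):
    """Read path: no query, just read the cached IDs and rehydrate them."""
    ids = list(timelines[user])
    return [tweet_store[i] for i in ids]  # plays the role of the multi-get
```

The sketch omits the cache-miss path (rebuilding a timeline from disk); it only shows how the expensive work moves from read time to write time.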
  9. Let’s Recap… • Advantages of single database system: • Simple! • Data and invariants are consistent (ACID transactions) • Disadvantages of single database system: • Slow • Doesn’t scale • Single point of failure
  10. This problem happens all the time on Twitter. For example, you tweet, someone else replies, and I see the reply before the original tweet.
  11. Sure, just coordinate things before proceeding… “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.” “Have you seen this tweet? Okay, good.”
  12. Key Takeaway: strong consistency is slow and distributed coordination is expensive (in terms of latency and throughput).
  13. If we allow writes on both sides of the partition, how do we resolve conflicts when the partition heals?
  14. But lots of good research going on to solve these problems… CRDTs, Lasp, SyncFree, RAMP transactions, etc.
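To make the CRDT idea concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The slides do not show this; it is an illustration of how replicas that accepted writes on both sides of a partition can converge without coordination once they exchange state. The class and method names are assumptions.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged with element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        # A replica only ever updates its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Max is commutative, associative, and idempotent, so merges can
        # arrive in any order, any number of times, and replicas still converge.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

For example, if replica `a` counts 2 and replica `b` counts 3 while partitioned, merging in either direction leaves both at 5.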
  15. Twitter has 316 million monthly active users. Facebook has 1.49 billion monthly active users. Netflix has 62.3 million streaming subscribers.
  16. If an overloaded service is not essential to core business, fail fast to prevent availability or latency problems upstream.
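One way to fail fast is to cap the number of in-flight calls to a non-essential dependency and return a fallback immediately once the cap is reached, instead of queueing. The sketch below assumes a hypothetical `LoadShedder` wrapper; it is one possible pattern, not the mechanism described in the talk.

```python
import threading

class LoadShedder:
    """Reject calls immediately once too many are already in flight."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, fallback):
        # Non-blocking acquire: if no slot is free, do not wait, shed the call.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return fn()
        finally:
            self._slots.release()
```

A caller might wrap a recommendations lookup as `shedder.call(fetch_recommendations, fallback=[])`, where `fetch_recommendations` is a placeholder for any non-critical call, so that overload degrades the feature rather than the whole request.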
  17. Flow-Control Mechanisms • Rate limit • Bound queues/buffers • Backpressure - drop messages on the floor • Increment stat counters for monitoring/alerting • Exponential back-off • Use application-level acks for critical transactions
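Two of the mechanisms listed above, sketched under assumptions: a bounded inbox that drops messages when full and counts the drops for monitoring, and a capped exponential back-off schedule. `BoundedInbox` and `backoff_delays` are hypothetical names, and the default values are arbitrary.

```python
import queue
import random

class BoundedInbox:
    """Bounded buffer that drops new messages when full and counts the drops."""

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # stat counter a monitoring system could scrape

    def offer(self, msg):
        try:
            self._q.put_nowait(msg)
            return True
        except queue.Full:
            self.dropped += 1  # drop on the floor rather than block the producer
            return False

    def size(self):
        return self._q.qsize()

def backoff_delays(base=0.1, cap=5.0, attempts=5, jitter=False):
    """Yield retry delays that double per attempt, capped at `cap`."""
    for attempt in range(attempts):
        delay = min(cap, base * 2 ** attempt)
        # Optional jitter spreads retries out so clients do not retry in lockstep.
        yield random.uniform(0, delay) if jitter else delay
```

Dropping is only acceptable for messages that can be lost; for critical transactions the slide's last bullet (application-level acks) applies instead.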
  18. “You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham