Slide 1

The Economics of Scale: Promises and Perils of Going Distributed
Tyler Treat, Workiva
September 19, 2015

Slide 2

About The Speaker
• Backend engineer at Workiva
• Messaging platform tech lead
• Distributed systems
bravenewgeek.com @tyler_treat [email protected]

Slide 3

About The Talk
• Why distributed systems?
• Case study
• Advantages/Disadvantages
• Strategies for scaling and resilience patterns
• Scaling Workiva

Slide 4

What does it mean to “scale” a system?

Slide 5

Scale Up vs. Scale Out

Vertical Scaling (scale up):
❖ Add resources to a node
❖ Increases node capacity; load is unaffected
❖ System complexity unaffected

Horizontal Scaling (scale out):
❖ Add nodes to a cluster
❖ Decreases load per node; node capacity is unaffected
❖ Improves availability and throughput, with increased complexity

Slide 6

Okay, cool—but what does this actually mean?

Slide 7

Let’s look at a real-world example…

Slide 8

How does Twitter work?

Slide 9

Just write tweets to a big database table.


Slide 11

Getting a timeline is a simple join query.
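To make that concrete, here is a minimal sketch of the naive timeline query, using Python's sqlite3 as a stand-in for MySQL; the tweets/followers schema and column names are hypothetical:

```python
# A minimal sketch of the single-database approach. The schema
# (tweets, followers) and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("twitter.db")

def get_timeline(user_id, limit=50):
    # One join pulls every tweet authored by anyone the user follows.
    return conn.execute(
        """
        SELECT t.id, t.user_id, t.content, t.created_at
        FROM tweets t
        JOIN followers f ON f.followee_id = t.user_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()
```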


Slide 13

But…this is crazy expensive.

Slide 14

Joins are sloooooooooow.

Slide 15

Prior to 5.5, MySQL's default storage engine (MyISAM) used table-level locking. The default is now InnoDB, which uses row-level locking. Either way, lock contention everywhere.

Slide 16

As the table grows larger, lock contention on indexes becomes worse too.

Slide 17

And UPDATEs get higher priority than SELECTs.


Slide 19

So now what?

Slide 20

Distribute!

Slide 21

Specifically, shard.

Slide 22

Partition tweets into different databases using some consistent hash scheme (put a hash ring on it).
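A minimal sketch of what such a hash ring might look like, with hypothetical shard names and virtual nodes to smooth the distribution (illustrative only, not Twitter's actual implementation):

```python
# Consistent-hash-ring sketch. md5 is just a stable hash choice;
# virtual nodes give each shard many positions on the ring.
import bisect
import hashlib

class HashRing:
    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{shard}:{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, tweet_id):
        # Walk clockwise to the first shard position at or after the key.
        idx = bisect.bisect(self.keys, self._hash(str(tweet_id))) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["db0", "db1", "db2", "db3"])
print(ring.shard_for(123456789))  # -> e.g. "db2"
```

Because only keys near a changed ring position move, adding or removing a shard remaps a small fraction of tweets rather than rehashing everything.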


Slide 24

This alleviates lock contention and improves throughput… but fetching timelines is still extremely costly (now a scatter-gather query across multiple DBs).

Slide 25

Observation: Twitter is a consumption mechanism more than an ingestion one… i.e., cheap reads matter more than cheap writes.

Slide 26

Move tweet processing to the write path rather than the read path.

Slide 27

Ingestion/Fan-Out Process
1. Tweet comes in
2. Query the social graph service for followers
3. Iterate through each follower and insert the tweet ID into their timeline (stored in Redis)
4. Store the tweet on disk (MySQL)
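A sketch of that fan-out under stated assumptions: a redis-py client for the timeline store, and a hypothetical social_graph service for step 2; the MySQL write in step 4 is elided to a simple key write here.

```python
# Fan-out-on-write sketch. social_graph.get_followers is hypothetical;
# the final set() stands in for the durable MySQL write.
import json
import redis

r = redis.Redis()

def ingest_tweet(tweet, social_graph):
    # 2. Query the social graph service for the author's followers.
    followers = social_graph.get_followers(tweet["user_id"])
    # 3. Insert the tweet ID into each follower's in-memory timeline.
    for follower_id in followers:
        key = f"timeline:{follower_id}"
        r.lpush(key, tweet["id"])
        r.ltrim(key, 0, 799)  # bound the cached timeline length
    # 4. Store the tweet itself on disk (stand-in for the MySQL write).
    r.set(f"tweet:{tweet['id']}", json.dumps(tweet))
```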


Slide 29

Ingestion/Fan-Out Process
• Lots of processing on ingest, no computation on reads
• Redis stores timelines in memory—very fast
• Fetching a timeline involves no queries—get the timeline from the Redis cache and rehydrate with a multi-get on the IDs
• If a timeline falls out of cache, reconstitute it from disk
• O(n) on writes, O(1) on reads
Source: http://www.infoq.com/presentations/Twitter-Timeline-Scalability
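And the corresponding read path, continuing the sketch above (rebuild_from_disk is a hypothetical cache-miss handler):

```python
# Read-path sketch: timeline reads are a Redis list lookup plus one
# multi-get to rehydrate IDs into tweets. No SQL on the hot path.
def get_timeline(user_id, limit=50):
    tweet_ids = r.lrange(f"timeline:{user_id}", 0, limit - 1)
    if not tweet_ids:
        return rebuild_from_disk(user_id)  # hypothetical cache-miss path
    keys = [f"tweet:{tid.decode()}" for tid in tweet_ids]
    blobs = r.mget(keys)  # single multi-get on the IDs
    return [json.loads(b) for b in blobs if b]
```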

Slide 30

Key Takeaway: think about your access patterns and design accordingly. Optimize for the critical path.

Slide 31

Let’s Recap…
Advantages of a single-database system:
• Simple!
• Data and invariants are consistent (ACID transactions)
Disadvantages of a single-database system:
• Slow
• Doesn’t scale
• Single point of failure

Slide 32

Going distributed solved the problem, but at what cost? (hint: your sanity)

Slide 33

Distributed systems are all about trade-offs.

Slide 34

By choosing availability, we give up consistency.

Slide 35

This problem happens all the time on Twitter. For example, you tweet, someone else replies, and I see the reply before the original tweet.

Slide 36

Can we solve this problem?

Slide 37

Sure, just coordinate things before proceeding…
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”
“Have you seen this tweet? Okay, good.”

Slide 38

Sooo what do you do when Justin Bieber tweets to his 67 million followers?


Slide 40

Coordinating for consistency is expensive when data is distributed because processes can’t make progress independently.



Slide 43

Source: Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design

Slide 44

Key Takeaway: strong consistency is slow and distributed coordination is expensive (in terms of latency and throughput).

Slide 45

Sharing mutable data at large scale is difficult.

Slide 46

If we don’t distribute, we risk scale problems.

Slide 47

Let’s say we want to count the number of times a tweet is retweeted.

Slide 48

A “get, add 1, and put” transaction will not scale.
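A small sketch of why, assuming a redis-py client: two concurrent clients can interleave the get and the put and silently lose increments, and even the atomic fix serializes all updates on a single hot key.

```python
# Why "get, add 1, and put" breaks down: two concurrent clients can
# read the same value and both write back count+1, losing an increment.
import redis

r = redis.Redis()

def retweet_naive(tweet_id):
    count = int(r.get(f"retweets:{tweet_id}") or 0)  # get
    r.set(f"retweets:{tweet_id}", count + 1)         # add 1, put (racy!)

def retweet_atomic(tweet_id):
    r.incr(f"retweets:{tweet_id}")  # atomic, but still one hot key
```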

Slide 49

If we do distribute, we risk consistency problems.


Slide 51

What do we do when our system is partitioned?


Slide 53

If we allow writes on both sides of the partition, how do we resolve conflicts when the partition heals?

Slide 54

Distributed systems are hard!

Slide 55

But there’s lots of good research going on to solve these problems…
• CRDTs
• Lasp
• SyncFree
• RAMP transactions
• etc.
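For example, a grow-only counter (G-Counter), one of the simplest CRDTs, would let each replica count retweets independently and merge later without coordination. A minimal sketch:

```python
# G-Counter CRDT sketch: each replica increments only its own entry,
# and merging is an element-wise max, so concurrent updates never
# conflict and replicas make progress independently.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

a, b = GCounter("a"), GCounter("b")
a.increment(); b.increment(); b.increment()
a.merge(b)
print(a.value())  # 3, with no coordination required
```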

Slide 56

Twitter has 316 million monthly active users. Facebook has 1.49 billion monthly active users. Netflix has 62.3 million streaming subscribers.

Slide 57

How do you build resilient systems at this scale?

Slide 58

Embrace failure.

Slide 59

Provide partial availability.

Slide 60

If an overloaded service is not essential to the core business, fail fast to prevent availability or latency problems upstream.
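One common way to fail fast is a circuit breaker. A minimal sketch, with illustrative thresholds (not any particular library's API):

```python
# Circuit-breaker sketch: after too many consecutive failures, calls
# fail fast instead of queueing behind a sick dependency.
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one call probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        return result
```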



Slide 63

It’s better to fail predictably than fail in unexpected ways.

Slide 64

Use backpressure to reduce load.



Slide 67

Flow-Control Mechanisms
• Rate limit
• Bound queues/buffers
• Backpressure: drop messages on the floor
• Increment stat counters for monitoring/alerting
• Exponential back-off
• Use application-level acks for critical transactions
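A tiny sketch combining a few of these: a bounded buffer that sheds load immediately when full and counts the drops for monitoring (names and limits are illustrative).

```python
# Bounded-buffer sketch: when the queue is full, drop the message
# immediately and make the drop visible, rather than letting the
# backlog (and latency) grow without bound.
import queue

work = queue.Queue(maxsize=1000)
dropped = 0  # stat counter for monitoring/alerting

def submit(message):
    global dropped
    try:
        work.put_nowait(message)  # never block the producer
    except queue.Full:
        dropped += 1              # drop on the floor, but count it
        return False
    return True
```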

Slide 68

Bounding resource utilization and failing fast helps maintain predictable performance and impedes cascading failures.

Slide 69

Going distributed means more wire time. How do you improve performance?


Slide 71

Cache everything.
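For illustration, a minimal time-based cache sketch; the TTL stands in for real invalidation, which, as the next slide reminds us, is the genuinely hard part.

```python
# TTL-cache sketch: serve a cached value until it expires, then
# recompute. Expiry is a blunt stand-in for proper invalidation.
import time

def ttl_cache(ttl_seconds):
    def decorator(fn):
        entries = {}
        def wrapper(*args):
            hit = entries.get(args)
            if hit and time.monotonic() - hit[1] < ttl_seconds:
                return hit[0]  # still fresh
            value = fn(*args)
            entries[args] = (value, time.monotonic())
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=60)
def get_profile(user_id):
    ...  # expensive lookup over the wire
```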


Slide 73

–Phil Karlton “There are only two hard things in computer science: cache invalidation and naming things.”


Slide 75

Embrace asynchrony.



Slide 78

Distributed systems are not just about workload scale; they’re about organizational scale.

Slide 79

In 2010, Workiva released a product to streamline financial reporting.

Slide 80

A specific solution to a very specific problem, originally built by a few dozen engineers.

Slide 81

Fast forward to today: a couple hundred engineers, more users, more markets, more solutions.

Slide 82

How do you ramp up new products quickly?

Slide 83

You stop thinking in terms of products and start thinking in terms of platform.

Slide 84

From Product to Platform

Slide 85

At this point, going distributed is all but necessary.

Slide 86

Service-Oriented Architecture allows us to independently build, deploy, and scale discrete parts of the platform.

Slide 87

Loosely coupled services let us tolerate failure. And things fail constantly.

Slide 88

Shit happens — network partitions, hardware failure, GC pauses, latency, dropped packets…

Slide 89

Build resilient systems.

Slide 90

–Ken Arnold “You have to design distributed systems with the expectation of failure.”

Slide 91

Design for failure.

Slide 92

Consider the trade-off between consistency and availability.

Slide 93

Most important?

Slide 94

Don’t distribute until you have a reason to!

Slide 95

Scale up until you have to scale out.

Slide 96

–Paul Barham “You can have a second computer once you’ve shown you know how to use the first one.”

Slide 97

And when you do distribute, don’t go overboard. Walk before you run.

Slide 98

Remember, when it comes to distributed systems… for every promise there’s a peril.

Slide 99

Thanks!
@tyler_treat
github.com/tylertreat
bravenewgeek.com