Slide 1

Slide 1 text

The World of Distributed Systems!

Slide 2

Slide 2 text

John Allison, @jrallison. Aside: we also run a distributed team, spread across the east and west coasts of the US, London, and Leeds. Picture from a week-long company retreat this summer in the Catskills. Great for our company for a ton of reasons.

Slide 3

Slide 3 text

Email your customers based on what they do (or don’t do) in your app. An email platform for web and mobile apps. We quit our jobs almost two years ago now, and for the last year I’ve been consumed by scaling our backend.

Slide 4

Slide 4 text

Behavioral analytics. Real-time segmentation. Critical path for our customers: email. What do we do? We make a promise to our customers that we’ll track, segment, and email their customers in “real time”. That means a high write load (99% writes)… most of this talk will be from that perspective.

Slide 5

Slide 5 text

What is a distributed system? I’m not an expert, so I would love to learn from anyone with experience while I’m here. I have been thinking about and experimenting with this almost exclusively for the last year. It’s generally what I think about in the shower most days.

Slide 6

Slide 6 text

Diagram: application server, database, cache. A distributed system. Classic starting setup for an MVP or something. Maybe not a cache; perhaps NoSQL, etc.

Slide 7

Slide 7 text

Diagram: database, two app servers, two caches. Starting to scale up, adding additional machines to increase read performance.

Slide 8

Slide 8 text

Diagram: two load balancers in front of four app servers, four caches, and the database. Continuing to add redundancy, adding load balancers for better availability, maybe?

Slide 9

Slide 9 text

Diagram: load balancers, app servers, caches, and a database marked “???”. You can get pretty far without needing to do this if you’re read-heavy. Up to this point, it’s been easy to add capacity and fault tolerance. Here, I make the case that you have to make compromises. The difference is state: once you want distributed consensus, you’re in trouble.

Slide 10

Slide 10 text

(╯°□°)╯︵ ┻━┻ The single biggest lesson: most databases are broken in the distributed case. They make a lot of promises, but most have significant issues. Take the different failure modes into account when modeling your use.

Slide 11

Slide 11 text

CAP Theorem: Consistency, Availability, Partition tolerance. Pick 2. You all know this.

Slide 12

Slide 12 text

CAP Theorem: Consistency, Availability, Partition tolerance. Pick 1, kinda… and hope the store you use handles partitions correctly.

Slide 13

Slide 13 text

What COULD happen? Are any acknowledged writes lost? Is write availability maintained? Are writes visible to others? To myself?

Slide 14

Slide 14 text

ElasticSearch: 3-node cluster, minimum_master_nodes = 2. https://github.com/elasticsearch/elasticsearch/issues/2488 Diagram: nodes 1, 2, 3. ElasticSearch considers itself a CP system: built-in clustering, synchronous replication, etc. minimum_master_nodes is the number of nodes required for the cluster to operate.
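A minimal sketch of that setting, assuming ElasticSearch’s cluster-settings API of the era (it can also live in elasticsearch.yml); host and port are made up:

```python
import requests

# With 3 nodes, requiring 2 for master election means a single
# partitioned node can't promote itself and split the brain.
requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {"discovery.zen.minimum_master_nodes": 2}},
)
```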

Slide 15

Slide 15 text

ElasticSearch: 3-node cluster, minimum_master_nodes = 2. https://github.com/elasticsearch/elasticsearch/issues/2488 If node 2 is the master for a shard and node 3 is a slave, 2 tells 1 it doesn’t have a slave. 3 tells 1 it has lost its master, held an election, and promoted itself to master. 1 evidently says, “go for it, guys!”. Split brain… lost writes… bad times.

Slide 16

Slide 16 text

Riak: 5-node cluster, default settings of “last write wins”. http://aphyr.com/posts/285-call-me-maybe-riak Diagram: nodes 1 through 5. Riak considers itself an AP system. It can allow writes when just one node is up, but generally requires quorum acknowledgement. In either case, it tries to accept all writes and replicate them after a partition heals.

Slide 17

Slide 17 text

Riak: 5-node cluster, default settings of “last write wins”. http://aphyr.com/posts/285-call-me-maybe-riak 71% of acknowledged updates lost, on a healthy cluster. By default, Riak uses “last write wins” based on the vector clock of the update; “older” updates are discarded. Concurrent writes, or writes from a node with an out-of-sync clock, mean lost updates.
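A toy illustration of why “last write wins” drops data (my sketch, not Riak’s actual implementation): with timestamp-style resolution, whichever write carries the higher clock value survives, and the other vanishes even though both were acknowledged:

```python
# Toy last-write-wins conflict resolution: keep the write with the
# higher timestamp. A concurrent writer, or one whose clock runs a
# little behind, silently loses an acknowledged update.
def lww_merge(a, b):
    # a and b are (timestamp_ms, value) versions of the same key
    return a if a[0] >= b[0] else b

node1 = (1000, "cart: [item X]")  # written at t=1000 on node 1
node2 = (990, "cart: [item Y]")   # node 2's clock is 10 ms behind
print(lww_merge(node1, node2))    # (1000, 'cart: [item X]') -- item Y is lost
```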

Slide 18

Slide 18 text

Riak: 5-node cluster, default settings of “last write wins”. http://aphyr.com/posts/285-call-me-maybe-riak 92% of acknowledged updates lost. Even PR=PW=R=W=quorum still leads to losses: the minority side still holds on to failed writes, and will propagate them when the partition heals. Those failed writes can destroy all of the majority side’s successful writes.

Slide 19

Slide 19 text

Redis cluster/sentinel. Diagram: a single node. Redis on one node is CP! Love Redis; we still use it a lot today.

Slide 20

Slide 20 text

Redis cluster/sentinel. Diagram: nodes 1, 2, 3. No guarantees! =/ Replication is all asynchronous, so there’s a window for lost writes; the new “WAIT” command doesn’t help due to partially failed writes; there’s not even an “eventually”, etc.
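A sketch of WAIT via redis-py, assuming a master with one replica on localhost; it only reports replica acknowledgement, it doesn’t make the write safe:

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.set("user:42:plan", "pro")

# WAIT blocks until N replicas ack, or the timeout (in ms) expires, and
# returns how many replicas actually have the write. It cannot roll the
# write back: a failover can still discard it.
acked = r.execute_command("WAIT", 1, 100)
if acked < 1:
    print("write may exist only on the master")
```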

Slide 21

Slide 21 text

Redis cluster/sentinel. https://groups.google.com/forum/#!topic/redis-db/Oazt2k7Lzz4%5B1-25-false%5D A great thread about the disconnect that can happen between database practitioners and people with experience working with distributed systems.

Slide 22

Slide 22 text

Many more! (╯°□°)╯︵ ┻━┻ http://aphyr.com/tags/jepsen Rather than debating proofs and theory, Kyle started testing database models under different network partitions. Enlightening.

Slide 23

Slide 23 text

It kind of left me depressed for a while, especially the seeming lack of agreed-upon terminology for a discussion. What does it mean when a database says it’s distributed, fault-tolerant, highly available, etc.? Most likely none of them mean the same thing.

Slide 24

Slide 24 text

There is HOPE! I came out the other side; there are things we can do to navigate this area. However, it does require a different way of thinking about interacting with data.

Slide 25

Slide 25 text

Immutability! An object’s state cannot be changed after it’s created. All data we store in Riak, for example, is immutable, which gets around the problem of conflicts in an eventually consistent datastore by removing the possibility of conflicts. The only “safe” way to have mutable objects is in a consistent store?
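A minimal sketch of the append-only pattern; `kv` stands in for any key-value store, and the key scheme is made up:

```python
import json
import time
import uuid

def store_event(kv, customer_id, event):
    # Append-only: every write gets a fresh, globally unique key and is
    # never updated afterwards. Concurrent writers can't conflict, so an
    # eventually consistent store has nothing to resolve.
    key = "event:%s:%d:%s" % (customer_id, int(time.time() * 1000), uuid.uuid4())
    kv.put(key, json.dumps(event))
    return key
```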

Slide 26

Slide 26 text

Idempotence! The same operation can be applied multiple times without changing the result. Very helpful in the presence of failures. All of our processing operates this way: if an operation fails for whatever reason, we retry it with exponential backoff.
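A sketch of that retry loop (names and delays are illustrative); it is only safe because the operations themselves are idempotent:

```python
import time

def with_retries(op, attempts=5, base_delay=0.5):
    # Re-running `op` after an ambiguous failure (timeout, dropped
    # connection) can't double-apply anything, because op is idempotent.
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```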

Slide 27

Slide 27 text

Zookeeper! http://aphyr.com/posts/291-call-me-maybe-zookeeper + battle-tested distributed consensus + consistent and partition-tolerant - best for small amounts of coordination - limited dataset & throughput - can be hard to operate or use properly. Some things did pass Kyle’s tests. Zookeeper can be used to do master election for ElasticSearch, removing the split-brain problem, and it handles coordination in many distributed data stores.
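A sketch of master election with the kazoo Python client (hosts, path, and identifier are made up):

```python
from kazoo.client import KazooClient

def lead():
    # Runs only while this process holds leadership; returning gives
    # leadership up and lets another contender win the election.
    print("acting as master")

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

election = zk.Election("/masters/search-cluster", identifier="node-1")
election.run(lead)  # blocks until elected, then invokes lead()
```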

Slide 28

Slide 28 text

Zookeeper! https://github.com/Netflix/exhibitor A ZooKeeper co-process for instance monitoring, backup/recovery, cleanup, and visualization.

Slide 30

Slide 30 text

Not all bad: + Remove the defaults and handle conflict resolution yourself. + Great for immutable data (no updates). + CRDTs: commutative replicated data types. I haven’t experimented much with them yet, but they can be great for specific cases where you’d like to update things, like counters or sets.
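A minimal G-Counter, one of the simplest CRDTs (my illustration, not from any particular library): each node increments only its own slot, and merging takes per-node maxima, so merges commute and replicas converge regardless of delivery order:

```python
class GCounter:
    # Grow-only counter CRDT.
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's local count

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-node max is commutative, associative, and idempotent, so
        # replicas can exchange state in any order and still converge.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```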

Slide 31

Slide 31 text

http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying Distributed messaging through the use of a consistent commit log. + Uses zookeeper for leader election and client log positions. + Great for ensuring many different systems see the same ordering of data.
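The post is about Kafka; here is a sketch of the pattern it describes with the kafka-python client (topic, broker address, and the downstream handler are made up), where every consumer replays the same ordered log:

```python
from kafka import KafkaProducer, KafkaConsumer

def apply_to_local_store(payload):
    # Hypothetical downstream update (cache, search index, DB write, ...)
    print("applying", payload)

# Producers append records to the log...
producer = KafkaProducer(bootstrap_servers="kafka:9092")
producer.send("customer-events", b'{"customer": 42, "event": "signup"}')
producer.flush()

# ...and every consumer of a partition sees records in the same order,
# so different systems can apply identical sequences of changes and
# converge on the same state.
consumer = KafkaConsumer("customer-events", bootstrap_servers="kafka:9092")
for record in consumer:
    apply_to_local_store(record.value)
```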

Slide 32

Slide 32 text

Diagram: load balancers, app servers, caches, and now several databases. So, if you’ve made it through that, it may get you here. Or you may take it further, break your system into smaller pieces, and use multiple data stores for different types of data. Which starts to open up another can of worms…

Slide 33

Slide 33 text

Service discovery. In a world where your system has many different pieces (or services) that need to navigate the network to communicate, in the presence of failures and new instances coming up, configuration can become a big hairy ball really quickly.

Slide 34

Slide 34 text

Options? Restart the world on every configuration change. Round-robin DNS. Internal load balancers everywhere. Each instance responsible for broadcasting its availability. We’ve punted on this for a while. Restart is what we do now: chef scripts manually upload new config, then we do a rolling restart. DNS is slow to update and heavily cached everywhere. Broadcast means updating something like zookeeper with the live nodes of a given service. Internal load balancers mean more complicated configuration.

Slide 35

Slide 35 text

https://github.com/airbnb/synapse https://github.com/airbnb/nerve I discovered these two projects recently and have been fairly happy with the result. Based on haproxy: less configuration to add instances, no need to rewrite everything to “register” itself in zookeeper, etc. They can use zookeeper for coordination. Recently added some docker support for discovering new docker instances across a number of machines.

Slide 36

Slide 36 text

Diagram: a Zookeeper cluster alongside the app, cache, and DB nodes; nerve runs on each node. Each server has an instance of nerve that watches the services on that node. Nerve creates ephemeral nodes in zookeeper when a service is up. If the service goes down, or the node dies, or nerve dies, the ephemeral node is removed.
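The mechanism nerve relies on, sketched with the kazoo client (path and payload are made up): an ephemeral znode disappears when the session that created it dies:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk:2181")
zk.start()

# Ephemeral znodes are deleted automatically when the creating session
# ends, so a dead service (or a dead nerve process) deregisters itself
# without anyone having to notice and clean up.
zk.create(
    "/services/cache/node-1",
    b'{"host": "10.0.0.5", "port": 11211}',
    ephemeral=True,
    makepath=True,
)
```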

Slide 37

Slide 37 text

Diagram: the same cluster, with synapse and haproxy added. Haproxy and synapse run on any nodes that need to communicate. Services connect only to their local instance of haproxy, which synapse keeps configured with the latest live nodes from nerve’s zookeeper data. Haproxy is responsible for navigating the faulty network (connection retries, etc.).

Slide 38

Slide 38 text

@jrallison john@customer.io Thanks! https://speakerdeck.com/jrallison/pivotal-london