
pivotal-london
jrallison
January 22, 2014

Talk about distributed systems for the Pivotal London office.

Transcript

1. John Allison @jrallison. As an aside, we also run a distributed team: east/west coast of the US, London, Leeds? Picture from a week-long company retreat this summer in the Catskills. Great for a ton of reasons for our company.
2. Email your customers based on what they do (or don't do) in your app. An email platform for web and mobile apps. We quit our jobs almost two years ago now; the last year I've been consumed by scaling our backend.
3. Behavioral analytics. Real-time segmentation. Critical path for our customers. Email. What we do: we make a promise to our customers that we'll track, segment, and email their customers in "real-time". High write load (99% writes)… most of this talk will be from that perspective.
4. What is a distributed system? I'm not an expert, so I would love to learn from anyone with experience while I'm here. I have been thinking about and experimenting with this almost exclusively for the last year. It's generally what I think about in the shower most days.
5. Application Server, Database, Cache. A distributed system. Classic starting setup for an MVP or something. Maybe not a cache, perhaps NoSQL, etc.
6. Cache, Cache, Database, App, App. Starting to scale up, adding additional machines to increase read performance.
7. Database, App, App, LB, LB, App, App, Cache, Cache, Cache, Cache. Continuing to add redundancy, adding load balancers for better availability, maybe.
8. Database, LB, LB, ???, App, App, App, App, Cache, Cache, Cache, Cache. You can get pretty far without needing to do this if you're read heavy. Up to this point, it's been easy to add capacity and fault tolerance. Here, I make the case that you have to make compromises. The difference is state. Once you want distributed consensus over that state, you're in trouble.
9. (╯°□°)╯︵ ┻━┻ The single biggest lesson: most databases are broken in the distributed case. They make a lot of promises, but most have significant issues. Take the different failure modes into account when modeling your use.
10. CAP Theorem: Consistency, Availability, Partition Tolerance. Pick one, kinda… partitions will happen whether you like them or not, so you're really choosing between consistency and availability, and hoping the store you use handles partitions correctly.
11. What COULD happen? Are any acknowledged writes lost? Is write availability maintained? Are writes visible to others? To myself?
12. ElasticSearch: 3-node cluster, minimum_master_nodes = 2. https://github.com/elasticsearch/elasticsearch/issues/2488 ElasticSearch considers itself a CP system: built-in clustering, synchronous replication, etc. minimum_master_nodes is the number of master-eligible nodes required for the cluster to operate (config sketch below).
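For reference, the setting being discussed lives in elasticsearch.yml; a minimal sketch for the 3-node cluster above, assuming the zen discovery module that was current at the time:

    # elasticsearch.yml (zen discovery, ES 0.90/1.x era)
    # Require a majority (2 of the 3 master-eligible nodes) before electing a master.
    # Intended to prevent split brain; the linked issue shows it wasn't sufficient then.
    discovery.zen.minimum_master_nodes: 2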
13. ElasticSearch: 3-node cluster, minimum_master_nodes = 2. https://github.com/elasticsearch/elasticsearch/issues/2488 If node 2 is master for a shard and node 3 is a slave, node 2 tells node 1 it doesn't have a slave. Node 3 tells node 1 it has lost its master, held an election, and promoted itself to master. Node 1 evidently says, "go for it, guys!". Split brain… lost writes… bad times.
14. Riak: 5-node cluster, default settings of "last write wins". http://aphyr.com/posts/285-call-me-maybe-riak Riak considers itself an AP system. It can allow writes if just one node is up, but generally requires quorum acknowledgement. In either case, it tries to accept all writes and replicate them after a partition heals.
15. Riak: 5-node cluster, default settings of "last write wins". http://aphyr.com/posts/285-call-me-maybe-riak 71% of acknowledged updates lost, in a healthy cluster. By default, it uses "last write wins" based on the vector clock of the update; "older" updates are discarded. Concurrent writes, or writes from a node with an out-of-sync clock, mean lost updates.
16. Riak: 5-node cluster, default settings of "last write wins". http://aphyr.com/posts/285-call-me-maybe-riak 92% of acknowledged updates lost with PR=PW=R=W=quorum; quorum settings still lead to losses. The minority side still holds on to failed writes, and will propagate them when the partition heals. These failed writes can destroy all of the majority side's successful writes (toy illustration below).
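A toy illustration in plain Python (not the real Riak client) of why read-modify-write under last-write-wins drops concurrent updates, and why keeping siblings and merging them does not:

    import time

    # Two clients each read the same set, add one element, and write back.
    # Under last-write-wins, whichever write carries the later timestamp
    # silently discards the other client's addition.
    stored = {"value": {"a"}, "ts": time.time()}

    def lww_write(new_value):
        global stored
        candidate = {"value": new_value, "ts": time.time()}
        if candidate["ts"] >= stored["ts"]:
            stored = candidate

    read1 = set(stored["value"])
    read2 = set(stored["value"])
    lww_write(read1 | {"b"})
    lww_write(read2 | {"c"})
    print(stored["value"])        # {'a', 'c'} - "b" has been lost

    # Keeping both concurrent versions as "siblings" and merging them
    # (here, set union) preserves both updates instead.
    siblings = [read1 | {"b"}, read2 | {"c"}]
    merged = set().union(*siblings)
    print(merged)                 # {'a', 'b', 'c'}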
17. Redis cluster/sentinel: a single Redis node. Redis on one node is CP! Love Redis, we still use it a lot today.
18. Redis cluster/sentinel: 3 nodes. No guarantees! =/ All replication is asynchronous, so there's a window for lost writes; the new "WAIT" command doesn't help due to partially failed writes (sketch below), there's no "eventually", etc.
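For context on the "WAIT" command: it blocks until a write has been acknowledged by N replicas, but it doesn't undo the write if replication falls short. A hedged redis-py sketch (the host and key are made up):

    import redis

    r = redis.StrictRedis(host="redis-master.internal", port=6379)

    r.set("customer:42:segment", "active")
    # Block up to 100 ms waiting for at least 1 replica to acknowledge prior writes.
    # Returns how many replicas actually acknowledged; the local write is NOT rolled
    # back if replication fails, which is the "partially failed writes" problem.
    acked = r.execute_command("WAIT", 1, 100)
    print(acked)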
19. Redis cluster/sentinel. https://groups.google.com/forum/#!topic/redis-db/Oazt2k7Lzz4%5B1-25-false%5D A great thread about the disconnect that can happen between database practitioners and people with experience working with distributed systems.
20. Many more! (╯°□°)╯︵ ┻━┻ http://aphyr.com/tags/jepsen Rather than debating proofs and theory, Kyle started testing databases under different network partitions. Enlightening.
21. Kind of left me depressed for a while, especially the seeming lack of an agreed-upon terminology for the discussion. What does it mean when a database says it's distributed, fault tolerant, highly available, etc.? Most likely no two of them mean the same thing by it.
22. There is HOPE. I came out the other side; there are things we can do to navigate this area. However, it does require a different way of thinking about interacting with data.
23. Immutability: an object's state cannot be changed after it's created. All data we store in Riak, for example, is immutable, which gets around the problem of conflicts in an eventually consistent datastore by removing the possibility of conflicts (sketch below). The only "safe" way to have mutable objects is in a consistent store?
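A minimal sketch of the idea in Python (the class and key scheme are illustrative, not our actual storage code): every write lands under a fresh key, so there is never an in-place update for an eventually consistent store to conflict over.

    import json
    import time
    import uuid

    class ImmutableEventStore:
        """Append-only store: every write gets a brand-new key, nothing is ever updated."""

        def __init__(self):
            self._data = {}  # stand-in for an eventually consistent key-value store

        def append(self, customer_id, event):
            # Unique key per write: no two writers ever race on the same key,
            # so there is nothing for the datastore to resolve.
            key = f"{customer_id}:{time.time()}:{uuid.uuid4()}"
            self._data[key] = json.dumps(event)
            return key

        def events_for(self, customer_id):
            # Reads return every stored version for the customer, in key order.
            return [json.loads(v) for k, v in sorted(self._data.items())
                    if k.startswith(f"{customer_id}:")]

    store = ImmutableEventStore()
    store.append("cust-1", {"type": "page_view", "path": "/pricing"})
    store.append("cust-1", {"type": "signup"})
    print(store.events_for("cust-1"))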
24. Idempotence: applying the same operation multiple times without changing the result. Very helpful in the presence of failures. All of our processing operates this way; if for whatever reason an operation fails, we retry it with exponential backoff (sketch below).
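A sketch of the retry pattern in Python (the names are illustrative, not our actual code). It only works because the wrapped operation is idempotent, so re-running it after an ambiguous failure is harmless.

    import random
    import time

    def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
        """Retry an idempotent operation with exponential backoff and jitter."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Exponential backoff: 0.5s, 1s, 2s, ... plus jitter to avoid thundering herds.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

    # Safe to retry: setting a key to a value gives the same result no matter
    # how many times it runs (unlike, say, incrementing a counter).
    retry_with_backoff(lambda: print("segment customer 42 as 'active'"))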
25. Zookeeper. http://aphyr.com/posts/291-call-me-maybe-zookeeper + battle-tested distributed consensus. + consistent and partition tolerant. - best for small amounts of coordination. - limited dataset & throughput. - can be hard to operate or use properly. Some things did pass Kyle's tests. Can be used to do master election for ElasticSearch to remove the split-brain problem. Handles coordination in many distributed data stores (election sketch below).
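For a flavour of what "small amounts of coordination" looks like, a hedged sketch using the kazoo Python client (the ensemble hosts and paths are made up): the Election recipe ensures only one participant runs the protected work at a time.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def act_as_master():
        # Only ever runs on the elected node; leadership is released when it returns.
        print("I am the master now")

    # All candidates contend on the same path; ZooKeeper's consensus picks the winner.
    election = zk.Election("/myapp/master-election", "node-1")
    election.run(act_as_master)  # blocks until elected, then runs act_as_master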
26. Not all bad: + remove the defaults, handle conflict resolution yourself. + great for immutable data (no updates). + CRDTs, commutative replicated data types. Haven't experimented much with them yet, but they can be great for specific cases where you'd like to update things, like counters, or sets, etc. (toy counter CRDT below).
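A tiny illustration of the CRDT idea (a grow-only counter in plain Python, not any particular library): each node only increments its own slot and merge is an element-wise max, so replicas converge no matter how often or in what order they exchange state.

    class GCounter:
        """Grow-only counter CRDT: one slot per node, merge by element-wise max."""

        def __init__(self, node_id):
            self.node_id = node_id
            self.counts = {}  # node_id -> count

        def increment(self, amount=1):
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Commutative, associative, idempotent: safe to apply in any order, any number of times.
            for node, count in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), count)

    a, b = GCounter("node-a"), GCounter("node-b")
    a.increment(3)   # updates accepted on both sides of a partition
    b.increment(2)
    a.merge(b); b.merge(a)        # after the partition heals
    print(a.value(), b.value())   # 5 5 - both replicas converge, nothing lost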
27. http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying Distributed messaging through the use of a consistent commit log. + uses zookeeper for leader election and client log position. + great for ensuring many different systems see the same ordering of data (toy model below).
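A toy model of the commit-log idea in Python (not Kafka or any real client): producers append to a single ordered log, and each consumer just remembers its own offset, so every consumer sees exactly the same order of records.

    class CommitLog:
        """Append-only, totally ordered log of records."""

        def __init__(self):
            self.records = []

        def append(self, record):
            self.records.append(record)
            return len(self.records) - 1  # the record's offset

        def read_from(self, offset):
            return list(enumerate(self.records[offset:], start=offset))

    log = CommitLog()
    for event in ("signup", "page_view", "purchase"):
        log.append(event)

    # Two independent consumers (say, a search index and an email sender)
    # track their own positions but always observe the same ordering.
    print(log.read_from(0))  # [(0, 'signup'), (1, 'page_view'), (2, 'purchase')]
    print(log.read_from(1))  # [(1, 'page_view'), (2, 'purchase')]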
28. LB, LB, App, App, App, App, Cache, Cache, Cache, Cache, DB, DB, DB, DB. So, if you've made it through that, it may get you here, or you may take it further and break your system into smaller pieces, with multiple data stores for different types of data. Which starts to open up another can of worms…
29. Service discovery. In a world where you have many different pieces (or services) in your system that need to navigate the network to communicate, in the presence of failures and new instances coming up, it can become a big hairy ball of configuration really quickly.
30. Options? Restart the world on every configuration change. Round-robin DNS. Internal load balancers everywhere. Each instance responsible for broadcasting its availability. We've punted on this for a while. Restart: what we do now; Chef scripts manually upload a new config, then do a rolling restart. DNS: slow to update, heavily cached everywhere. Broadcast: updating something like ZooKeeper with the live nodes of a given service. Internal load balancers: more complicated configuration.
31. https://github.com/airbnb/synapse https://github.com/airbnb/nerve We discovered these two projects recently, and have been fairly happy with the result. Based on HAProxy; less configuration to add instances; you don't need to rewrite everything to "register" itself in ZooKeeper, etc. They can use ZooKeeper for coordination. Recently added some Docker support for discovering new Docker instances across a number of machines.
32. Diagram: a ZooKeeper cluster, plus App, App, App, App, Cache, Cache, Cache, Cache, DB, DB, DB, DB, each paired with nerve. Each server has an instance of nerve that watches the services on that node. Nerve creates ephemeral nodes in ZooKeeper when a service is up. If the service goes down, or the node dies, or nerve dies, the ephemeral node is removed (sketch below).
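Conceptually, the nerve side boils down to something like this kazoo sketch (hosts, paths, and the service name are made up): the ephemeral flag ties the registration to the ZooKeeper session, so a crashed process or machine disappears from discovery automatically.

    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Register this instance of the "app" service. ephemeral=True means ZooKeeper
    # deletes the node itself if our session dies, with no cleanup code needed.
    zk.ensure_path("/services/app/instances")
    zk.create(
        "/services/app/instances/app-",
        json.dumps({"host": "10.0.0.12", "port": 8080}).encode("utf-8"),
        ephemeral=True,
        sequence=True,  # unique suffix per instance
    )

    # A consumer (roughly what synapse does) can watch the children to keep its
    # local HAProxy config in sync with the live instances.
    print(zk.get_children("/services/app/instances"))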
33. Same diagram, now with synapse and HAProxy added alongside nerve. HAProxy and synapse run on any nodes that need to communicate. Services connect only to their local instance of HAProxy, which synapse keeps configured with the latest live nodes from nerve's registrations. HAProxy is responsible for navigating the faulty network (connection retries, etc.).