Elixir Conf 2016 - Building Available and Partition Tolerant Systems with Phoenix Tracker.

Gabi Zuniga
September 02, 2016

VoiceLayer is a real-time voice messaging platform developed using the Phoenix Framework. Its media router core service must be resilient to a wide variety of adverse conditions, including software, hardware and even network failures.
According to the CAP theorem, a system can only meet two of the following three guarantees: Consistency, Availability or Partition Tolerance. We will characterize Available and Partition Tolerant (aka AP) systems and describe different implementation approaches.
One such approach is based on Riak-Core, the distributed systems framework that forms the basis of how Riak distributes data and scales. Although Riak-Core is an impressive framework, at VoiceLayer we took a different route and relied on Phoenix Tracker (a core component of Phoenix PubSub) to meet our requirements. We will describe this approach in detail, contrast it with other solutions and discuss its tradeoffs.

Transcript

  1. Outline
     • Why do we care about available and partition tolerant systems?
     • Review the CAP theorem and characterize AP systems.
     • Explore concepts behind systems like Amazon Dynamo and Riak.
     • Discuss how we leveraged Phoenix Tracker at VoiceLayer to meet our needs.
     • Build an Available and Partition Tolerant Distributed Hash Table (DHT) application.
  2. Recurrent Airline System Failures
     • On August 8, a power outage caused Delta Airlines' computer systems to fail, grounding over 1,000 flights.
     • Southwest also had a massive computer outage on July 20, caused by a faulty switch. It took them days to get their systems fully operational, with losses of tens of millions of dollars.
     • A router malfunction caused United Airlines' systems to fail in July 2015, affecting flights for several hours.
  3. Why Does This Keep Happening?
     • Airline computer systems are always on and highly distributed.
     • Their systems are very complex and integrate multiple vendors and technologies.
     • Network failures can make failover mechanisms ineffective in achieving high availability.
     • It can take days to recover from networking-induced failures.
  4. The CAP Theorem
     The CAP theorem states that it is not possible for a distributed computer system to provide all three of the following guarantees simultaneously:
     • Consistency: all nodes have the same view of the state.
     • Availability: for every request there is always going to be a response.
     • Partition tolerance: the system continues to operate despite arbitrary partitioning due to network failures.
  5. CAP Triangle
     [Diagram: a triangle with Consistency, Availability and Partition Tolerance at its corners and example systems along each edge.]
     • Availability: can always read and write.
     • Consistency: all clients always get the same view of the data.
     • Partition tolerance: the system works despite network partitions.
     • CA: RDBMS (MySQL, Postgres). CP: BigTable, HBase, MongoDB. AP: Dynamo, Riak, Cassandra.
  6. AP Systems
     • AP systems are systems that guarantee availability and partition tolerance at the expense of consistency.
     • These systems will look for inconsistencies and recover from them by resolving conflicts.
     • Sometimes called "Eventually Consistent Systems".
  7. Embracing Partition Tolerance
     • Networking failures like network splits sometimes happen in highly distributed systems.
     • It can be hard for complex consistent systems to recover from network failures.
     • Making individual subsystems partition tolerant when possible increases the robustness of the whole system.
  8. Amazon Dynamo
     • In 2007 Amazon published the influential Dynamo paper.
     • Amazon described the design of a key-value store system with high availability at a massive scale.
     • Intended for applications like their shopping cart system.
     • Dynamo sacrifices consistency under certain failure scenarios.
  9. Techniques used by Dynamo
     • Partitioning - consistent hashing.
     • Highly available writes - vector clocks and reconciliation during reads.
     • Handling temporary failures - sloppy quorum and hinted handoff.
     • Membership and failure detection - gossip-based membership detection to avoid a centralized registry.
     • Recovery from permanent failures - anti-entropy and Merkle trees to synchronize divergent replicas.
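     To make the partitioning technique concrete, here is a minimal consistent-hashing sketch in Elixir. It is an illustration only (Dynamo and Riak use virtual nodes and a more elaborate ring): nodes are hashed onto a numeric ring, and a key belongs to the first node at or after the key's hash.

     defmodule RingSketch do
       @ring_size 1_000_000

       defp position(term), do: :erlang.phash2(term, @ring_size)

       # Place each node at a point on the ring, sorted by position.
       def ring(nodes), do: nodes |> Enum.map(&{position(&1), &1}) |> Enum.sort()

       # A key is owned by the first node at or after its hash,
       # wrapping around to the start of the ring if needed.
       def owner(ring, key) do
         pos = position(key)

         case Enum.find(ring, fn {node_pos, _node} -> node_pos >= pos end) do
           {_node_pos, node} -> node
           nil -> ring |> hd() |> elem(1)
         end
       end
     end

     # ring = RingSketch.ring([:node_a, :node_b, :node_c])
     # RingSketch.owner(ring, "message:123")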
  10. Data Versioning
     • Each write creates a new immutable version of the data with an associated vector clock.
     • Multiple versions of the data can be present in the system simultaneously.
     • When safe, old versions are discarded.
     • Syntactic Reconciliation: the system knows how to merge multiple versions.
     • Semantic Reconciliation: application-assisted conflict resolution.
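     A rough Elixir sketch of the vector-clock bookkeeping described above (an illustration, not Dynamo's implementation): each node increments its own counter on a write, and two versions conflict when neither clock descends from the other.

     defmodule VClockSketch do
       # A vector clock is a map of node => counter.
       def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

       # a descends from b if every counter in b is <= its counterpart in a.
       def descends?(a, b) do
         Enum.all?(b, fn {node, count} -> Map.get(a, node, 0) >= count end)
       end

       # Neither version dominates the other: these are concurrent writes
       # that need syntactic or semantic reconciliation.
       def conflict?(a, b), do: not descends?(a, b) and not descends?(b, a)
     end

     # v1 = VClockSketch.increment(%{}, :node_a)   # %{node_a: 1}
     # v2 = VClockSketch.increment(v1, :node_b)    # %{node_a: 1, node_b: 1}
     # VClockSketch.conflict?(v1, v2)              # false: v2 descends from v1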
  11. Anti-Entropy
     • Anti-entropy means comparing data in all the replicas and updating each replica to the newest version.
     • Dynamo uses Merkle trees to efficiently compare replicas.
     • Each node maintains a Merkle tree for each partition it hosts.
     • When a discrepancy is found between replicas, they are synced.
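     A toy illustration of the Merkle-tree idea (the hashing layout here is made up for the example): hash each key-value pair, hash the results pairwise up to a single root, and compare roots. Identical roots mean the replicas hold the same data; differing roots point to the subtrees, and therefore the key ranges, that need syncing.

     defmodule MerkleSketch do
       # Root hash of a replica: hash each {key, value} leaf, then hash
       # pairwise upwards until a single digest remains.
       def root(kv_pairs) do
         kv_pairs
         |> Enum.sort()
         |> Enum.map(fn {k, v} -> :crypto.hash(:sha256, "#{k}:#{v}") end)
         |> reduce_to_root()
       end

       defp reduce_to_root([single]), do: single

       defp reduce_to_root(hashes) do
         hashes
         |> Enum.chunk_every(2)
         |> Enum.map(&:crypto.hash(:sha256, &1))
         |> reduce_to_root()
       end

       # Cheap check: equal roots mean no further comparison is needed.
       def in_sync?(replica_a, replica_b), do: root(replica_a) == root(replica_b)
     end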
  12. Quorum
     • In a quorum system with N replicas, R nodes need to respond for a read operation to succeed, and W nodes for a write operation to succeed.
     • Choosing R + W <= N increases availability but also the risk of inconsistencies; only R + W > N guarantees that every read quorum overlaps every write quorum.
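     A small worked example, assuming N = 3 replicas: with R = 2 and W = 2 every read quorum intersects every write quorum, so reads see the latest write; with R = 1 and W = 1 the system stays available with fewer healthy nodes, but a read may miss the most recent write.

     # Read and write quorums are guaranteed to intersect only when R + W > N.
     overlaps? = fn n, r, w -> r + w > n end

     overlaps?.(3, 2, 2)   # true  -> consistent reads, lower availability
     overlaps?.(3, 1, 1)   # false -> higher availability, possible stale reads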
  13. Membership
     • Membership determines which nodes own which partitions.
     • Gossip-based membership protocols: nodes update each other about additions and removals of other nodes to the ring by broadcasting changes.
     • All nodes need to agree before a membership change takes effect.
     • When member nodes do not respond they are considered down.
     • Hinted replicas are nodes that cover for a node while it is temporarily down. When the node comes back up, the data gets transferred back to it.
  14. Riak KV
     • Riak KV is a distributed NoSQL key/value database with advanced local and multi-cluster replication that guarantees reads and writes even in the event of hardware failures or network partitions.
     • Riak KV can be considered an implementation in Erlang of the Amazon Dynamo paper.
  15. • Riak Core is an Erlang library that helps build distributed, scalable, failure-tolerant applications.
     • When Basho came up with new products like Riak Search, they refactored the core components into Riak Core so that other applications could take advantage of the core functionality.
  16. Using Riak Core with Phoenix
     • "Experimenting with Superpowered Web Services: Phoenix on Riak_Core" by Ben Tyler
     • http://www.elixirconf.eu/elixirconf2016/ben-tyler
  17. Issues
     • Wanted to explore other membership semantics.
     • Many mechanisms are built into the framework and are hard to modify.
     • Database aspects like buckets leaked into the Riak Core library.
     • Routing is still handled in the service and has not moved to the client yet.
     • Slow to support new Erlang/Elixir releases (R16B).
     • Risk of not getting prompt resolution to potential issues.
  18. Media Cache Requirements
     • High availability: resilience to SW, HW and networking failures.
     • No single point of failure.
     • Very low latency.
     • Voice messages are never lost.
     • Efficient utilization.
     • Auto-scaling.
  19. Single-Node Media Cache
     [Diagram: Phoenix web handlers (Receiver, Sender, Reader) use a Media Cache Registry to reach per-message Message Caches; each Message Cache has a Write Handler and Read Handlers and is backed by Storage.]
  20. Multi-Node Media Cache
     • The message_id space is divided into multiple partitions. Each Registry handles one partition.
     [Diagram: write and read handlers on the Phoenix side reaching message caches spread across multiple media cache nodes, backed by Storage.]
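     A minimal sketch of the kind of mapping this implies; the partition count and the use of :erlang.phash2/2 here are assumptions for illustration, not VoiceLayer's actual scheme.

     defmodule PartitionSketch do
       # Illustrative only: hash a message_id onto one of a fixed number
       # of partitions; each Registry process owns one partition.
       @partitions 64

       def partition_for(message_id), do: :erlang.phash2(message_id, @partitions)
     end

     # PartitionSketch.partition_for("msg-42")   #=> an integer in 0..63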
  21. HA Media Cache
     • The write handler creates two message caches on different nodes. If one media cache fails, the other node continues streaming.
     [Diagram: one Write Handler feeding Message Caches on two media cache nodes, each serving Read Handlers.]
  22. Syncing Hash-Rings
     • Phoenix Tracker keeps all the hash-rings in sync by tracking the presence of the media cache registry processes.
     [Diagram: every web node and media cache node holds a Tracker and a hash-ring; the registries on the media cache nodes are tracked so each node's ring stays in sync.]
  23. Phoenix Tracker
     • Phoenix Tracker is the underlying module that implements Phoenix Presence.
     • It can also be used for other purposes like service discovery and much more.
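     A minimal tracker module, adapted from the phoenix_pubsub documentation; the module name MyTracker and the topic/key strings are placeholders. handle_diff/2 receives the joins and leaves replicated from every node, which is where the hash-rings above would be updated.

     defmodule MyTracker do
       use Phoenix.Tracker

       def start_link(opts) do
         opts = Keyword.merge([name: __MODULE__], opts)
         Phoenix.Tracker.start_link(__MODULE__, opts, opts)
       end

       def init(opts) do
         server = Keyword.fetch!(opts, :pubsub_server)
         {:ok, %{pubsub_server: server, node_name: Phoenix.PubSub.node_name(server)}}
       end

       # Joins and leaves replicated from every node in the cluster.
       def handle_diff(diff, state) do
         for {topic, {joins, leaves}} <- diff do
           for {key, meta} <- joins, do: IO.puts("join #{topic}: #{key} #{inspect(meta)}")
           for {key, meta} <- leaves, do: IO.puts("leave #{topic}: #{key} #{inspect(meta)}")
         end

         {:ok, state}
       end
     end

     # Register a registry process so every node learns about it:
     # Phoenix.Tracker.track(MyTracker, self(), "registries", "cache1:partition3", %{})
     # Phoenix.Tracker.list(MyTracker, "registries")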
  24. Dispatch Library
     • Implements a distributed service registry.
     • Requests are dispatched to one or more services based on keys.
     • Relies on Phoenix Tracker to keep service availability information in sync.
     • Uses consistent hashing to map keys to services.
     • Supports redundancy.
     • The library is open source at github.com/voicelayer/dispatch.git
  25. Dispatch Registry
     Not to be confused with the Media Cache registries.
     • add_service(type, pid)
     • remove_service(type, pid)
     • get_services(type)
     • find_service(type, key)
     • find_multi_service(count, type, key)
  26. Dispatch Service
     The dispatch service module makes it easy to enhance a GenServer to make it trackable.
     • cast(type, key, params)
     • call(type, key, params, timeout \\ 5000)
     • multi_cast(count, type, key, params)
     • multi_call(count, type, key, params)
  27. Distributed Hash Table
     • Let's build an Available and Partition Tolerant DHT using the Dispatch Library.
     • A DHT is a decentralized distributed system that provides a lookup service similar to a hash table.
     • For redundancy, each key-value pair is stored on 2 different nodes.
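     A hedged sketch of what the DHT could look like using the call shapes listed on the previous two slides. The Dispatch.Service module path, the :dht_store service type, and how the store registers itself are assumptions for illustration; consult the library's README for the real API.

     defmodule DHT.Store do
       use GenServer

       # Plain GenServer that holds one node's share of the key-value pairs.
       def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

       def init(state), do: {:ok, state}

       def handle_call({:put, key, value}, _from, state),
         do: {:reply, :ok, Map.put(state, key, value)}

       def handle_call({:get, key}, _from, state),
         do: {:reply, Map.fetch(state, key), state}
     end

     defmodule DHT do
       @replicas 2
       @service :dht_store

       # Write to 2 stores chosen from the hash-ring for `key`
       # (multi_call as listed on the Dispatch Service slide; module path assumed).
       def put(key, value),
         do: Dispatch.Service.multi_call(@replicas, @service, key, {:put, key, value})

       # Read from a store responsible for `key`.
       def get(key),
         do: Dispatch.Service.call(@service, key, {:get, key})
     end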