Elixir Conf 2016 - Building Available and Partition Tolerant Systems with Phoenix Tracker.

Gabi Zuniga
September 02, 2016

VoiceLayer is a real-time voice messaging platform developed using the Phoenix Framework. Its media router core service must be resilient to a wide variety of adverse conditions, including software, hardware and even network failures.
According to the CAP theorem, a system can only meet two of the following three guarantees: Consistency, Availability or Partition Tolerance. We will characterize Available and Partition Tolerant (aka AP) systems and describe different implementation approaches.
One such approach is based on Riak-Core, the distributed systems framework that forms the basis of how Riak distributes data and scales. Although Riak-Core is an impressive framework, at VoiceLayer we took a different route and relied on Phoenix Tracker (a core component of Phoenix PubSub) to meet our requirements. We will describe this approach in detail, contrast it with other solutions and discuss its tradeoffs.

Transcript

  1. Outline
     • Why do we care about available and partition tolerant systems?
     • Review the CAP theorem and characterize AP systems.
     • Explore concepts behind systems like Amazon Dynamo and Riak.
     • Discuss how we leveraged Phoenix Tracker at VoiceLayer to meet our needs.
     • Build an Available and Partition Tolerant Distributed Hash Table (DHT) application.
  2. Recurrent Airline System Failures
     • On August 8, a power outage caused Delta Airlines' computer systems to fail, grounding over 1,000 flights.
     • Southwest also had a massive computer outage on July 20, caused by a faulty switch. It took them days to get their systems fully operational, with losses of tens of millions of dollars.
     • A router malfunction caused United Airlines' systems to fail in July 2015, affecting flights for several hours.
  3. Why Does This Keep Happening?
     • Airline computer systems are always on and highly distributed.
     • Their systems are very complex and integrate multiple vendors and technologies.
     • Network failures can make failover mechanisms ineffective in achieving high availability.
     • It can take days to recover from networking-induced failures.
  4. The CAP Theorem
     The CAP theorem states that it is not possible for a distributed computer system to provide all three of the following guarantees simultaneously:
     • Consistency: all nodes have the same view of the state.
     • Availability: for every request there is always going to be a response.
     • Partition tolerance: the system continues to operate despite arbitrary partitioning due to network failures.
  5. CAP Triangle
     [Diagram: a triangle with Consistency, Availability and Partition Tolerance at its corners and example systems along each edge.]
     • Availability: can always read and write.
     • Consistency: all clients always get the same view of the data.
     • Partition tolerance: the system works despite network partitions.
     • CA: RDBMS (MySQL, Postgres). CP: BigTable, HBase, MongoDB. AP: Dynamo, Riak, Cassandra.
  6. AP Systems
     • AP systems are systems that guarantee availability and partition tolerance at the expense of consistency.
     • These systems will look for inconsistencies and recover from them by resolving conflicts.
     • Sometimes called "Eventually Consistent Systems".
  7. Embracing Partition Tolerance
     • Networking failures like network splits sometimes happen in highly distributed systems.
     • It can be hard for complex consistent systems to recover from network failures.
     • Making individual subsystems partition tolerant when possible increases the robustness of the whole system.
  8. Amazon Dynamo
     • In 2007 Amazon published the influential Dynamo paper.
     • Amazon described the design of a key-value store system with high availability at a massive scale.
     • Intended for applications like their shopping cart system.
     • Dynamo sacrifices consistency under certain failure scenarios.
  9. Techniques used by Dynamo
     • Partitioning - consistent hashing.
     • Highly available writes - vector clocks and reconciliation during reads.
     • Handling temporary failures - sloppy quorum and hinted handoff.
     • Membership and failure detection - gossip-based membership detection to avoid a centralized registry.
     • Recovery from permanent failures - anti-entropy and Merkle trees to synchronize divergent replicas.
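     To make the partitioning technique concrete, here is a minimal consistent-hashing sketch in Elixir. It is an illustration only (Dynamo and Riak use virtual nodes and a more elaborate ring): nodes are hashed onto a numeric ring, and a key belongs to the first node at or after the key's hash.

     defmodule RingSketch do
       @ring_size 1_000_000

       defp position(term), do: :erlang.phash2(term, @ring_size)

       # Place each node at a point on the ring, sorted by position.
       def ring(nodes), do: nodes |> Enum.map(&{position(&1), &1}) |> Enum.sort()

       # A key is owned by the first node at or after its hash,
       # wrapping around to the start of the ring if needed.
       def owner(ring, key) do
         pos = position(key)

         case Enum.find(ring, fn {node_pos, _node} -> node_pos >= pos end) do
           {_node_pos, node} -> node
           nil -> ring |> hd() |> elem(1)
         end
       end
     end

     # ring = RingSketch.ring([:node_a, :node_b, :node_c])
     # RingSketch.owner(ring, "message:123")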
  10. Data Versioning
     • Each write creates a new immutable version of the data with an associated vector clock.
     • Multiple versions of the data can be present in the system simultaneously.
     • When safe, old versions are discarded.
     • Syntactic Reconciliation: the system knows how to merge multiple versions.
     • Semantic Reconciliation: application-assisted conflict resolution.
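     A rough Elixir sketch of the vector-clock bookkeeping described above (an illustration, not Dynamo's implementation): each node increments its own counter on a write, and two versions conflict when neither clock descends from the other.

     defmodule VClockSketch do
       # A vector clock is a map of node => counter.
       def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

       # a descends from b if every counter in b is <= its counterpart in a.
       def descends?(a, b) do
         Enum.all?(b, fn {node, count} -> Map.get(a, node, 0) >= count end)
       end

       # Neither version dominates the other: these are concurrent writes
       # that need syntactic or semantic reconciliation.
       def conflict?(a, b), do: not descends?(a, b) and not descends?(b, a)
     end

     # v1 = VClockSketch.increment(%{}, :node_a)   # %{node_a: 1}
     # v2 = VClockSketch.increment(v1, :node_b)    # %{node_a: 1, node_b: 1}
     # VClockSketch.conflict?(v1, v2)              # false: v2 descends from v1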
  11. Anti-Entropy
     • Anti-entropy means comparing data in all the replicas and updating each replica to the newest version.
     • Dynamo uses Merkle trees to efficiently compare replicas.
     • Each node maintains a Merkle tree for each partition it hosts.
     • When a discrepancy is found between replicas, they are synced.
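     A toy illustration of the Merkle-tree idea (the hashing layout here is made up for the example): hash each key-value pair, hash the results pairwise up to a single root, and compare roots. Identical roots mean the replicas hold the same data; differing roots point to the subtrees, and therefore the key ranges, that need syncing.

     defmodule MerkleSketch do
       # Root hash of a replica: hash each {key, value} leaf, then hash
       # pairwise upwards until a single digest remains.
       def root(kv_pairs) do
         kv_pairs
         |> Enum.sort()
         |> Enum.map(fn {k, v} -> :crypto.hash(:sha256, "#{k}:#{v}") end)
         |> reduce_to_root()
       end

       defp reduce_to_root([single]), do: single

       defp reduce_to_root(hashes) do
         hashes
         |> Enum.chunk_every(2)
         |> Enum.map(&:crypto.hash(:sha256, &1))
         |> reduce_to_root()
       end

       # Cheap check: equal roots mean no further comparison is needed.
       def in_sync?(replica_a, replica_b), do: root(replica_a) == root(replica_b)
     end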
  12. Quorum
     • In a quorum system with N replicas, R nodes need to respond for a read operation to succeed, and W nodes for a write operation to succeed.
     • Choosing R + W <= N increases availability but also the risk of inconsistencies; only R + W > N guarantees that every read quorum overlaps every write quorum.
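     A small worked example, assuming N = 3 replicas: with R = 2 and W = 2 every read quorum intersects every write quorum, so reads see the latest write; with R = 1 and W = 1 the system stays available with fewer healthy nodes, but a read may miss the most recent write.

     # Read and write quorums are guaranteed to intersect only when R + W > N.
     overlaps? = fn n, r, w -> r + w > n end

     overlaps?.(3, 2, 2)   # true  -> consistent reads, lower availability
     overlaps?.(3, 1, 1)   # false -> higher availability, possible stale reads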
  13. Membership
     • Membership determines which nodes own which partitions.
     • Gossip-based membership protocols: nodes update each other about additions and removals of other nodes to the ring by broadcasting changes.
     • All nodes need to agree before a membership change takes effect.
     • When member nodes do not respond they are considered down.
     • Hinted replicas are nodes that cover for a node while it is temporarily down. When the node comes back up, the data gets transferred back to it.
  14. Riak KV
     • Riak KV is a distributed NoSQL key/value database with advanced local and multi-cluster replication that guarantees reads and writes even in the event of hardware failures or network partitions.
     • Riak KV can be considered an implementation in Erlang of the Amazon Dynamo paper.
  15. • Riak Core is an Erlang library that helps build distributed, scalable, failure-tolerant applications.
     • When Basho came up with new products like Riak Search, they refactored the core components into Riak Core so that other applications could take advantage of the core functionality.
  16. Using Riak Core with Phoenix
     • "Experimenting with Superpowered Web Services: Phoenix on Riak_Core" by Ben Tyler
     • http://www.elixirconf.eu/elixirconf2016/ben-tyler
  17. Issues
     • Wanted to explore other membership semantics.
     • Many mechanisms are built into the framework and are hard to modify.
     • Database aspects like buckets leaked into the Riak Core library.
     • Routing is still handled in the service and has not moved to the client yet.
     • Slow to support new Erlang/Elixir releases (R16B).
     • Risk of not getting prompt resolution to potential issues.
  18. Media Cache Requirements
     • High availability: resilience to SW, HW and networking failures.
     • No single point of failure.
     • Very low latency.
     • Voice messages are never lost.
     • Efficient utilization.
     • Auto-scaling.
  19. Single-Node Media Cache
     [Diagram: Phoenix web handlers (Receiver, Sender, Reader) use a Media Cache Registry to reach per-message Message Caches; each Message Cache has a Write Handler and Read Handlers and is backed by Storage.]
  20. Multi-Node Media Cache
     • The message_id space is divided into multiple partitions. Each Registry handles one partition.
     [Diagram: write and read handlers on the Phoenix side reaching message caches spread across multiple media cache nodes, backed by Storage.]
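     A minimal sketch of the kind of mapping this implies; the partition count and the use of :erlang.phash2/2 here are assumptions for illustration, not VoiceLayer's actual scheme.

     defmodule PartitionSketch do
       # Illustrative only: hash a message_id onto one of a fixed number
       # of partitions; each Registry process owns one partition.
       @partitions 64

       def partition_for(message_id), do: :erlang.phash2(message_id, @partitions)
     end

     # PartitionSketch.partition_for("msg-42")   #=> an integer in 0..63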
  21. HA Media Cache
     • The write handler creates two message caches on different nodes. If one media cache fails, the other node continues streaming.
     [Diagram: one Write Handler feeding Message Caches on two media cache nodes, each serving Read Handlers.]
  22. Syncing Hash-Rings
     • Phoenix Tracker keeps all the hash-rings in sync by tracking the presence of the media cache registry processes.
     [Diagram: every web node and media cache node holds a Tracker and a hash-ring; the registries on the media cache nodes are tracked so each node's ring stays in sync.]
  23. Phoenix Tracker
     • Phoenix Tracker is the underlying module that implements Phoenix Presence.
     • It can also be used for other purposes like service discovery and much more.
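     A minimal tracker module, adapted from the phoenix_pubsub documentation; the module name MyTracker and the topic/key strings are placeholders. handle_diff/2 receives the joins and leaves replicated from every node, which is where the hash-rings above would be updated.

     defmodule MyTracker do
       use Phoenix.Tracker

       def start_link(opts) do
         opts = Keyword.merge([name: __MODULE__], opts)
         Phoenix.Tracker.start_link(__MODULE__, opts, opts)
       end

       def init(opts) do
         server = Keyword.fetch!(opts, :pubsub_server)
         {:ok, %{pubsub_server: server, node_name: Phoenix.PubSub.node_name(server)}}
       end

       # Joins and leaves replicated from every node in the cluster.
       def handle_diff(diff, state) do
         for {topic, {joins, leaves}} <- diff do
           for {key, meta} <- joins, do: IO.puts("join #{topic}: #{key} #{inspect(meta)}")
           for {key, meta} <- leaves, do: IO.puts("leave #{topic}: #{key} #{inspect(meta)}")
         end

         {:ok, state}
       end
     end

     # Register a registry process so every node learns about it:
     # Phoenix.Tracker.track(MyTracker, self(), "registries", "cache1:partition3", %{})
     # Phoenix.Tracker.list(MyTracker, "registries")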
  24. Dispatch Library
     • Implements a distributed service registry.
     • Requests are dispatched to one or more services based on keys.
     • Relies on Phoenix Tracker to keep service availability information in sync.
     • Uses consistent hashing to map keys to services.
     • Supports redundancy.
     • The library is open source at github.com/voicelayer/dispatch.git
  25. Dispatch Registry
     Not to be confused with the Media Cache registries.
     • add_service(type, pid)
     • remove_service(type, pid)
     • get_services(type)
     • find_service(type, key)
     • find_multi_service(count, type, key)
  26. Dispatch Service
     The dispatch service module makes it easy to enhance a GenServer to make it trackable.
     • cast(type, key, params)
     • call(type, key, params, timeout \\ 5000)
     • multi_cast(count, type, key, params)
     • multi_call(count, type, key, params)
  27. Distributed Hash Table
     • Let's build an Available and Partition Tolerant DHT using the Dispatch Library.
     • A DHT is a decentralized distributed system that provides a lookup service similar to a hash table.
     • For redundancy, each key-value pair is stored on 2 different nodes.
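     A hedged sketch of what the DHT could look like using the call shapes listed on the previous two slides. The Dispatch.Service module path, the :dht_store service type, and how the store registers itself are assumptions for illustration; consult the library's README for the real API.

     defmodule DHT.Store do
       use GenServer

       # Plain GenServer that holds one node's share of the key-value pairs.
       def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

       def init(state), do: {:ok, state}

       def handle_call({:put, key, value}, _from, state),
         do: {:reply, :ok, Map.put(state, key, value)}

       def handle_call({:get, key}, _from, state),
         do: {:reply, Map.fetch(state, key), state}
     end

     defmodule DHT do
       @replicas 2
       @service :dht_store

       # Write to 2 stores chosen from the hash-ring for `key`
       # (multi_call as listed on the Dispatch Service slide; module path assumed).
       def put(key, value),
         do: Dispatch.Service.multi_call(@replicas, @service, key, {:put, key, value})

       # Read from a store responsible for `key`.
       def get(key),
         do: Dispatch.Service.call(@service, key, {:get, key})
     end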