Slide 1

Slide 1 text

Implementing a Distributed Process Registry on Riak Core Christopher Meiklejohn @cmeik Saturday, September 14, 13

Slide 2

Slide 2 text

cmeiklejohn / @cmeik

Slide 3

Slide 3 text


Slide 4

Slide 4 text


Slide 5

Slide 5 text


Slide 6

Slide 6 text


Slide 7

Slide 7 text

The Goal

Slide 8

Slide 8 text

Build a highly-available, fault-tolerant registry. The Goal

Slide 9

Slide 9 text

Understand the tradeoffs. The Goal

Slide 10

Slide 10 text

Riak Stream The Goal

Slide 11

Slide 11 text

The Problem

Slide 12

Slide 12 text

Highly-available distributed process groups. The Problem

Slide 13

Slide 13 text

Examples: pg2, gproc The Problem

Slide 14

Slide 14 text

Reappearing groups; synchronous global writes. The pg2 Problem

Slide 15

Slide 15 text

Election deadlocks; conflicts; dynamic clusters. The gproc Problem

Slide 16

Slide 16 text

The Challenges

Slide 17

Slide 17 text

Dynamic addition and removal of nodes. The 3 Challenges

Slide 18

Slide 18 text

Coordination of state mutation. The 3 Challenges

Slide 19

Slide 19 text

Resolution of conflicting values. The 3 Challenges

Slide 20

Slide 20 text

Riak PG

Slide 21

Slide 21 text

Dynamic membership through virtual nodes. Riak PG

Slide 22

Slide 22 text

Replicated state; quorum reads and writes. Riak PG
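
A minimal sketch of the quorum idea, not taken from riak_pg: a coordinator sends a request to N replicas and proceeds once enough of them have answered. The module name `quorum_sketch`, the `{Ref, Reply}` message shape, and the one-second timeout are assumptions made for illustration, as are the N, R, and W values in the usage note.

```erlang
-module(quorum_sketch).
-export([overlaps/3, await/2]).

%% True when every read quorum of size R must intersect every write
%% quorum of size W among N replicas.
overlaps(N, R, W) ->
    R + W > N.

%% Collect replies tagged with Ref until Needed of them have arrived.
%% Gives up if no reply arrives within one second (hypothetical value).
await(Ref, Needed) ->
    await(Ref, Needed, []).

await(_Ref, 0, Acc) ->
    {ok, lists:reverse(Acc)};
await(Ref, Needed, Acc) ->
    receive
        {Ref, Reply} ->
            await(Ref, Needed - 1, [Reply | Acc])
    after 1000 ->
        {error, timeout}
    end.
```

With illustrative values N = 3, R = 2, and W = 2, `quorum_sketch:overlaps(3, 2, 2)` holds, so a read quorum always includes at least one replica that acknowledged the latest write.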

Slide 23

Slide 23 text

Conflict-free resolution with CRDTs. Riak PG

Slide 24

Slide 24 text

Eventually consistent; harvest vs. yield tradeoff. Riak PG

Slide 25

Slide 25 text

Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. “Eventual Consistency”, Wikipedia

Slide 26

Slide 26 text

Our approaches tolerate partial failures by emphasizing simple composition mechanisms that promote fault containment, and by translating possible partial failure modes into engineering mechanisms that provide smoothly degrading functionality rather than lack of availability of the service as a whole. “Harvest, Yield, and Scalable Tolerant Systems”, Fox and Brewer

Slide 27

Slide 27 text

The Requirements

Slide 28

Slide 28 text

Structured names. The Requirements

Slide 29

Slide 29 text

Multiple non-unique names per process. The Requirements

Slide 30

Slide 30 text

Dynamic cluster membership. The Requirements

Slide 31

Slide 31 text

Partition tolerance and conflict resolution. The Requirements

Slide 32

Slide 32 text

The Applications

Slide 33

Slide 33 text

Service lookup pattern; publish and subscribe. The Applications

Slide 34

Slide 34 text

Trade consistency for availability. The Applications

Slide 35

Slide 35 text

Riak Core; CRDTs The Background

Slide 36

Slide 36 text

Riak Core

Slide 37

Slide 37 text

Erlang implementation of Dynamo. Riak Core

Slide 38

Slide 38 text

Consistent hashing. Riak Core

Slide 39

Slide 39 text

Hash-space partitioning. Riak Core
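
Consistent hashing over a partitioned hash space can be sketched roughly as follows. This is a simplified illustration rather than Riak Core's code: the module name `ring_sketch`, the numbering of partitions from 0 to Q-1, and the assumption that Q is a power of two are all choices made here, and Riak Core's actual ring differs in detail.

```erlang
-module(ring_sketch).
-export([key_hash/1, partition/2, preflist/3]).

%% Size of the SHA-1 hash space: 2^160.
-define(RING_TOP, (1 bsl 160)).

%% Hash an arbitrary term to an integer in [0, 2^160).
key_hash(Key) ->
    <<Hash:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
    Hash.

%% Map a hash onto one of Q equally sized partitions, numbered 0..Q-1.
%% Assumes Q is a power of two so that it divides the hash space evenly.
partition(Hash, Q) ->
    Hash div (?RING_TOP div Q).

%% The N partitions responsible for Key: the partition its hash falls
%% into, followed by the next N-1 partitions, wrapping around the ring.
preflist(Key, Q, N) ->
    First = partition(key_hash(Key), Q),
    [(First + I) rem Q || I <- lists:seq(0, N - 1)].
```

For example, `ring_sketch:preflist(my_group, 64, 3)` returns three consecutive partition numbers on a 64-partition ring. Because the hash space is fixed and only partition ownership changes, nodes can join or leave without rehashing keys.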

Slide 40

Slide 40 text

Dynamic membership. Riak Core

Slide 41

Slide 41 text

Replication factor. Riak Core

Slide 42

Slide 42 text

Observed-Removed Set

Slide 43

Slide 43 text

CvRDT; bounded join-semilattice. Observed-Removed Set

Slide 44

Slide 44 text

Set; with merge function computing a LUB. Observed-Removed Set

Slide 45

Slide 45 text

Two G-Sets; preserves monotonicity. Observed-Removed Set

Slide 46

Slide 46 text

[ [{1, a}], [] ]

Slide 47

Slide 47 text

[ [{1, a}], [] ]
[ [{1, a}, {2, b}], [] ]

Slide 48

Slide 48 text

[ [{1, a}], [] ]
[ [{1, a}, {2, b}], [] ]
[ [{1, a}, {2, b}], [{1, a}] ]

Slide 49

Slide 49 text

[ [{1, a}], [] ]
[ [{1, a}, {2, b}], [] ]
[ [{1, a}, {2, b}], [{1, a}] ]
[ [{1, a}, {2, b}], [{1, a}] ]

Slide 50

Slide 50 text

[ [{1, a}], [] ]

Slide 51

Slide 51 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]

Slide 52

Slide 52 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]
[ [{1, a}, {2, b}], [] ]

Slide 53

Slide 53 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]
[ [{1, a}, {2, b}], [] ]
[ [{1, a}], [{1, a}] ]

Slide 54

Slide 54 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]
[ [{1, a}, {2, b}], [] ]
[ [{1, a}], [{1, a}] ]
[ [{1, a}, {2, b}], [{1, a}] ]

Slide 55

Slide 55 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]

Slide 56

Slide 56 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]
[ [{1, a}], [{1, a}] ]
[ [{1, a}], [{1, a}] ]

Slide 57

Slide 57 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]
[ [{1, a}], [{1, a}] ]
[ [{1, a}], [{1, a}] ]
[ [{1, a}, {2, a}], [{1, a}] ]

Slide 58

Slide 58 text

[ [{1, a}], [] ]
[ [{1, a}], [] ]
[ [{1, a}], [{1, a}] ]
[ [{1, a}], [{1, a}] ]
[ [{1, a}, {2, a}], [{1, a}] ]
[ [{1, a}, {2, a}], [{1, a}] ]
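
The state pairs on the preceding slides can be reproduced with a small observed-removed set built from two grow-only sets. The following is a sketch under assumptions, not the `riak_dt_vvorset` implementation used later in the talk: the module name `orset_sketch` is hypothetical, and unique tags are supplied by the caller rather than generated.

```erlang
-module(orset_sketch).
-export([new/0, add/3, remove/2, merge/2, value/1]).

%% State is {Adds, Removes}: two grow-only sets of {Tag, Element} pairs.
new() ->
    {ordsets:new(), ordsets:new()}.

%% Add Element under a caller-supplied unique Tag.
add(Tag, Element, {Adds, Removes}) ->
    {ordsets:add_element({Tag, Element}, Adds), Removes}.

%% Remove Element by recording every tag observed for it locally.
%% Adds made elsewhere under unseen tags are unaffected.
remove(Element, {Adds, Removes}) ->
    Observed = [Pair || {_Tag, E} = Pair <- Adds, E =:= Element],
    {Adds, ordsets:union(Removes, ordsets:from_list(Observed))}.

%% Merge two replicas by taking the union of each grow-only set.
%% Union is commutative, associative, and idempotent, so this is a
%% least upper bound and the state only ever grows.
merge({Adds1, Removes1}, {Adds2, Removes2}) ->
    {ordsets:union(Adds1, Adds2), ordsets:union(Removes1, Removes2)}.

%% An element is present if at least one of its adds was not removed.
value({Adds, Removes}) ->
    lists:usort([E || {_Tag, E} <- ordsets:subtract(Adds, Removes)]).
```

Adding `a` with tag 1, adding `b` with tag 2, and then removing `a` yields `{[{1,a},{2,b}], [{1,a}]}`, whose value is `[b]`. Re-adding `a` under a fresh tag 2 after a removal yields `{[{1,a},{2,a}], [{1,a}]}`, whose value is `[a]`: the new add survives because the removal only covered the tag it had observed.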

Slide 59

Slide 59 text

The Implementation

Slide 60

Slide 60 text

Same as pg2; create, join, leave, and members. The Implementation

Slide 61

Slide 61 text

Extended with local and connected members. The Implementation

Slide 62

Slide 62 text

Membership vnode stores registrations. The Implementation

Slide 63

Slide 63 text

Conflict-free resolution with OR-set. The Implementation

Slide 64

Slide 64 text

Process pruning; lack of monitors. The Implementation

Slide 65

Slide 65 text

Code examples. The Implementation

Slide 66

Slide 66 text

The Virtual Node

Slide 67

Slide 67 text

%% @doc Respond to a join request.
handle_command({join, {ReqId, _}, Group, Pid},
               _Sender,
               #state{groups=Groups0, partition=Partition}=State) ->
    %% Find existing list of Pids, and add object to it.
    Pids0 = pids(Groups0, Group, riak_dt_vvorset:new()),
    Pids = riak_dt_vvorset:update({add, Pid}, Partition, Pids0),

    %% Store back into the dict.
    Groups = dict:store(Group, Pids, Groups0),

    %% Return updated groups.
    {reply, {ok, ReqId}, State#state{groups=Groups}};

%% @doc Return pids from the dict.
-spec pids(dict(), atom(), term()) -> term().
pids(Groups, Group, Default) ->
    case dict:find(Group, Groups) of
        {ok, Object} ->
            Object;
        _ ->
            Default
    end.

riak_pg/src/riak_pg_memberships_vnode.erl

Slide 68

Slide 68 text

%% @doc Respond to a leave request.
handle_command({leave, {ReqId, _}, Group, Pid},
               _Sender,
               #state{groups=Groups0, partition=Partition}=State) ->
    %% Find existing list of Pids, and remove object from it.
    Pids0 = pids(Groups0, Group, riak_dt_vvorset:new()),
    Pids = riak_dt_vvorset:update({remove, Pid}, Partition, Pids0),

    %% Store back into the dict.
    Groups = dict:store(Group, Pids, Groups0),

    %% Return updated groups.
    {reply, {ok, ReqId}, State#state{groups=Groups}};

%% @doc Return pids from the dict.
-spec pids(dict(), atom(), term()) -> term().
pids(Groups, Group, Default) ->
    case dict:find(Group, Groups) of
        {ok, Object} ->
            Object;
        _ ->
            Default
    end.

riak_pg/src/riak_pg_memberships_vnode.erl

Slide 69

Slide 69 text

The Write Coordinator

Slide 70

Slide 70 text

%% @doc Execute the request.
execute(timeout, #state{preflist=Preflist,
                        req_id=ReqId,
                        coordinator=Coordinator,
                        group=Group,
                        pid=Pid}=State) ->
    riak_pg_memberships_vnode:join(Preflist, {ReqId, Coordinator}, Group, Pid),
    {next_state, waiting, State}.

%% @doc Attempt to write to every single node responsible for this
%%      group.
waiting({ok, ReqId}, #state{responses=Responses0, from=From}=State0) ->
    Responses = Responses0 + 1,
    State = State0#state{responses=Responses},
    case Responses =:= ?W of
        true ->
            From ! {ReqId, ok},
            {stop, normal, State};
        false ->
            {next_state, waiting, State}
    end.

riak_pg/src/riak_pg_memberships_vnode.erl

Slide 71

Slide 71 text

The Read Coordinator

Slide 72

Slide 72 text

%% @doc Pull a unique list of memberships from replicas, and
%%      relay the message to it.
waiting({ok, _ReqId, IndexNode, Reply},
        #state{from=From,
               req_id=ReqId,
               num_responses=NumResponses0,
               replies=Replies0}=State0) ->
    NumResponses = NumResponses0 + 1,
    Replies = [{IndexNode, Reply}|Replies0],
    State = State0#state{num_responses=NumResponses, replies=Replies},

    case NumResponses =:= ?R of
        true ->
            Pids = riak_dt_vvorset:value(merge(Replies)),
            From ! {ReqId, ok, Pids},

            case NumResponses =:= ?N of
                true ->
                    {next_state, finalize, State, 0};
                false ->
                    {next_state, waiting_n, State}
            end;
        false ->
            {next_state, waiting, State}
    end.

riak_pg/src/riak_pg_members_fsm.erl

Slide 73

Slide 73 text

%% @doc Perform merge of replicas.
merge(Replies) ->
    lists:foldl(fun({_, Pids}, Acc) ->
                    riak_dt_vvorset:merge(Pids, Acc)
                end, riak_dt_vvorset:new(), Replies).

riak_pg/src/riak_pg_members_fsm.erl

Slide 74

Slide 74 text

%% @doc Wait for the remainder of responses from replicas.
waiting_n({ok, _ReqId, IndexNode, Reply},
          #state{num_responses=NumResponses0,
                 replies=Replies0}=State0) ->
    NumResponses = NumResponses0 + 1,
    Replies = [{IndexNode, Reply}|Replies0],
    State = State0#state{num_responses=NumResponses, replies=Replies},

    case NumResponses =:= ?N of
        true ->
            {next_state, finalize, State, 0};
        false ->
            {next_state, waiting_n, State}
    end.

riak_pg/src/riak_pg_members_fsm.erl

Slide 75

Slide 75 text

%% @doc Perform read repair.
finalize(timeout, #state{replies=Replies}=State) ->
    Merged = merge(Replies),
    Pruned = prune(Merged),
    ok = repair(Replies, State#state{pids=Pruned}),
    {stop, normal, State}.

%% @doc If the node is connected, and the process is not alive, prune
%%      it.
prune_pid(Pid) when is_pid(Pid) ->
    lists:member(node(Pid), nodes()) andalso
        (is_process_alive(Pid) =:= false).

%% @doc Based on connected nodes, prune out processes that no longer
%%      exist.
prune(Set) ->
    Pids0 = riak_dt_vvorset:value(Set),
    lists:foldl(fun(Pid, Pids) ->
                    case prune_pid(Pid) of
                        true ->
                            riak_dt_vvorset:update({remove, Pid}, none, Pids);
                        false ->
                            Pids
                    end
                end, Set, Pids0).

riak_pg/src/riak_pg_members_fsm.erl

Slide 76

Slide 76 text

The Evaluation

Slide 77

Slide 77 text

pg2 members vs riak_pg connected members The Evaluation

Slide 78

Slide 78 text

Partitions heal without conflicts. The Evaluation

Slide 79

Slide 79 text

Howl; CloudI Process Groups; Riak Pipe The Related Work

Slide 80

Slide 80 text

The Future Work

Slide 81

Slide 81 text

CRDT garbage collection. The Future Work

Slide 82

Slide 82 text

Active anti-entropy mechanism. The Future Work

Slide 83

Slide 83 text

The Conclusion

Slide 84

Slide 84 text

http://github.com/cmeiklejohn/riak_pg Thanks! Questions?