Slide 1

Slide 1 text

Introduction to Riak
Christopher Meiklejohn
BOBkonf 2015
@cmeik

Slide 2

Slide 2 text

History

Slide 3

Slide 3 text

Amazon Dynamo
Published at SOSP 2007; key-value storage system

Slide 4

Slide 4 text

Amazon Dynamo
Focused on high availability and low latency

Slide 5

Slide 5 text

Amazon Dynamo
Collection of distributed systems techniques

Slide 6

Slide 6 text

Amazon Dynamo
Inspired LinkedIn Voldemort and Facebook Cassandra

Slide 7

Slide 7 text

Basho Riak
Released 2009; Apache 2 licensed Dynamo clone

Slide 8

Slide 8 text

Installing and Using Riak

Slide 9

Slide 9 text

Installing Erlang
$ curl -O http://s3.amazonaws.com/downloads.basho.com/erlang/otp_src_R16B02-basho5.tar.gz
$ tar -xvf otp_src_R16B02-basho5.tar.gz
$ cd otp_src_R16B02-basho5
$ ./configure && make && sudo make install

Slide 10

Slide 10 text

Building Riak
$ git clone https://github.com/basho/riak.git
$ cd riak
$ make all

Slide 11

Slide 11 text

Building a devrel
$ make devrel DEVNODES=5
$ cd dev; ls

Slide 12

Slide 12 text

Starting a devrel
$ for node in dev*; do $node/bin/riak start; done

Slide 13

Slide 13 text

Pinging all nodes in a devrel
$ for node in dev*; do $node/bin/riak ping; done

Slide 14

Slide 14 text

Stage a join
$ dev2/bin/riak-admin cluster join [email protected]
$ dev3/bin/riak-admin cluster join [email protected]
$ dev4/bin/riak-admin cluster join [email protected]
$ dev5/bin/riak-admin cluster join [email protected]

Slide 15

Slide 15 text

View a staged plan
$ dev1/bin/riak-admin cluster plan

Slide 16

Slide 16 text

View a staged plan
=============================== Staged Changes ================================
Action         Nodes(s)
-------------------------------------------------------------------------------
join           '[email protected]'
join           '[email protected]'
join           '[email protected]'
join           '[email protected]'
-------------------------------------------------------------------------------
NOTE: Applying these changes will result in 1 cluster transition
###############################################################################
                         After cluster transition 1/1
###############################################################################
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%     20.3%    '[email protected]'
valid       0.0%     20.3%    '[email protected]'
valid       0.0%     20.3%    '[email protected]'
valid       0.0%     20.3%    '[email protected]'
valid       0.0%     18.8%    '[email protected]'
-------------------------------------------------------------------------------
Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
Transfers resulting from cluster changes: 51
  12 transfers from '[email protected]' to '[email protected]'
  13 transfers from '[email protected]' to '[email protected]'
  13 transfers from '[email protected]' to '[email protected]'
  13 transfers from '[email protected]' to '[email protected]'

Slide 17

Slide 17 text

Commit the plan
$ dev2/bin/riak-admin cluster commit

Slide 18

Slide 18 text

View members of cluster
$ dev1/bin/riak-admin member-status

Slide 19

Slide 19 text

View members of cluster
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      20.3%      --      '[email protected]'
valid      20.3%      --      '[email protected]'
valid      20.3%      --      '[email protected]'
valid      20.3%      --      '[email protected]'
valid      18.8%      --      '[email protected]'
-------------------------------------------------------------------------------
Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Slide 20

Slide 20 text

Storing data via HTTP
$ curl -XPUT http://localhost:10018/buckets/welcome/keys/german \
    -H 'Content-Type: text/plain' \
    -d 'herzlich willkommen'

Slide 21

Slide 21 text

Retrieving data via HTTP
$ curl http://localhost:10018/buckets/welcome/keys/german

Slide 22

Slide 22 text

Storing an image via HTTP
$ curl -XPUT http://localhost:10018/buckets/images/keys/.jpg \
    -H 'Content-Type: image/jpeg' \
    --data-binary @.jpg

Slide 23

Slide 23 text

Retrieving an image via HTTP
$ curl -O http://localhost:10018/buckets/images/keys/.jpg

Slide 24

Slide 24 text

Riak Architecture

Slide 25

Slide 25 text

Consistent Hashing hash(bucket/key)

Slide 26

Slide 26 text

hash ring

Slide 27

Slide 27 text

tokenize it

Slide 28

Slide 28 text

node 0 node 1 node 2 hash(key)

Slide 29

Slide 29 text

node 0 node 1 node 2 Replicas are stored to the N - 1 contiguous partitions

Slide 30

Slide 30 text

node 0 node 1 node 2 hash(companies/cisco) Replicas are stored to the N - 1 contiguous partitions

Slide 31

Slide 31 text

node 0 node 1 node 2 hash(companies/cisco) Replicas are stored to the N - 1 contiguous partitions
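The placement rule on these slides can be sketched in a few lines of Python. This is a minimal illustration, not Riak's implementation: the ring size of 8, the round-robin claim of partitions by nodes, and the helper names `partition_for`, `owner`, and `preference_list` are all assumptions made for the example.

```python
import hashlib

RING_SIZE = 8                      # assumed number of partitions for the sketch
NODES = ["node0", "node1", "node2"]
N_VAL = 3                          # replicas per object

def partition_for(bucket: str, key: str) -> int:
    """Hash bucket/key onto a 160-bit ring and find the partition it falls in."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    position = int.from_bytes(digest, "big")
    return position * RING_SIZE // 2**160

def owner(partition: int) -> str:
    """Assumption: partitions are claimed round-robin by the nodes."""
    return NODES[partition % len(NODES)]

def preference_list(bucket: str, key: str, n: int = N_VAL):
    """First partition plus the N - 1 contiguous partitions that follow it."""
    first = partition_for(bucket, key)
    partitions = [(first + i) % RING_SIZE for i in range(n)]
    return [(p, owner(p)) for p in partitions]

print(preference_list("companies", "cisco"))
```

Because the hash is deterministic, every node computes the same list for `companies/cisco` without any coordination.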

Slide 32

Slide 32 text

node 0 node 1 node 2

Slide 33

Slide 33 text

Scaling out node 0 node 1 node 2 node 3 +

Slide 34

Slide 34 text

Quorum requests N R W PR/PW DW
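One way to see why N, R, and W matter: a read is guaranteed to touch at least one replica that saw the latest successful write exactly when R + W > N. The brute-force check below is only a sketch of that overlap argument; the function name `quorums_overlap` is made up for the example, and the PR/PW and DW variants are not modeled.

```python
from itertools import combinations

def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every set of r readers shares a replica with every set of w writers."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

print(quorums_overlap(3, 2, 2))  # R + W > N
print(quorums_overlap(3, 1, 1))  # R + W <= N
```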

Slide 35

Slide 35 text

Vector Clocks establish temporality
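A minimal sketch of how a vector clock orders two versions, assuming a clock is a plain mapping from actor name to counter. The helper names `increment`, `descends`, and `compare` are illustrative and do not mirror Riak's internal API.

```python
def increment(clock: dict, actor: str) -> dict:
    """Return a copy of the clock with the actor's counter advanced by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def compare(a: dict, b: dict) -> str:
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "after"
    if b_ge:
        return "before"
    return "concurrent"   # neither descends from the other: a conflict

base = increment({}, "a")
left = increment(base, "a")    # written via actor a
right = increment(base, "b")   # written via actor b without seeing left
print(compare(left, base), compare(left, right))
```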

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Anatomy of a Request get(users/clay-davis)

Slide 39

Slide 39 text

Anatomy of a Request get(users/clay-davis) client Riak

Slide 40

Slide 40 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak

Slide 41

Slide 41 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak hash(users/clay-davis) == 10, 11, 12

Slide 42

Slide 42 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak hash(users/clay-davis) == 10, 11, 12 Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring

Slide 43

Slide 43 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak get(users/clay-davis) Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring

Slide 44

Slide 44 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring R=2

Slide 45

Slide 45 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring R=2 obj

Slide 46

Slide 46 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak R=2 obj obj

Slide 47

Slide 47 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak R=2 obj obj

Slide 48

Slide 48 text

Anatomy of a Request get(users/clay-davis) obj

Slide 49

Slide 49 text

Read Repair (Anti-Entropy)

Slide 50

Slide 50 text

replica replica replica

Slide 51

Slide 51 text

replica replica replica X

Slide 52

Slide 52 text

replica replica replica replica replica replica
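The read-repair picture above can be sketched as follows. This is a simplification under stated assumptions: each replica is a dict, and versions are plain integers rather than the vector clocks Riak uses; `read_with_repair` is a hypothetical name.

```python
def read_with_repair(replicas: list, key: str):
    """Read a key from every replica, return the newest value, and fix stale copies."""
    replies = [replica.get(key) for replica in replicas]
    present = [obj for obj in replies if obj is not None]
    if not present:
        return None
    newest = max(present, key=lambda obj: obj[0])   # obj is (version, value)
    for replica, obj in zip(replicas, replies):
        if obj != newest:
            replica[key] = newest                   # repair missing or stale replica
    return newest[1]

replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
print(read_with_repair(replicas, "k"), replicas)
```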

Slide 53

Slide 53 text

Active Anti-Entropy (self healing clusters)

Slide 54

Slide 54 text

real-time updates persistent non-blocking disk-based

Slide 55

Slide 55 text

merkle tree to track changes
coordinated at the vnode level
runs as a background process
exchange with neighbor vnodes for inconsistencies
resolution semantics: trigger read-repair

Slide 56

Slide 56 text

= hashes marked dirty

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

= keys to read-repair
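The hash-tree exchange can be illustrated with a two-level tree: keys are grouped into segments, each segment gets a hash, and the segment hashes roll up into a root. The sketch below is an assumption-laden toy (4 segments, SHA-256, in-memory dicts, hypothetical function names), not the on-disk structure Riak uses.

```python
import hashlib

SEGMENTS = 4   # assumed tree width for the sketch

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def segment_of(key: str) -> int:
    return int(_h(key.encode()), 16) % SEGMENTS

def build_tree(store: dict):
    """Return (root hash, per-segment hashes, per-segment {key: value hash})."""
    leaves = [dict() for _ in range(SEGMENTS)]
    for key, value in store.items():
        leaves[segment_of(key)][key] = _h(value.encode())
    seg_hashes = [_h(repr(sorted(leaf.items())).encode()) for leaf in leaves]
    root = _h("".join(seg_hashes).encode())
    return root, seg_hashes, leaves

def keys_to_repair(store_a: dict, store_b: dict) -> set:
    """Compare roots first, then descend only into segments whose hashes differ."""
    root_a, segs_a, leaves_a = build_tree(store_a)
    root_b, segs_b, leaves_b = build_tree(store_b)
    if root_a == root_b:
        return set()
    differing = set()
    for i in range(SEGMENTS):
        if segs_a[i] != segs_b[i]:
            for key in leaves_a[i].keys() | leaves_b[i].keys():
                if leaves_a[i].get(key) != leaves_b[i].get(key):
                    differing.add(key)
    return differing
```

The keys returned are the ones that would then be handed to read-repair.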

Slide 62

Slide 62 text

Riak and Consistency

Slide 63

Slide 63 text

Riak Object

Slide 64

Slide 64 text

BKey Value

Slide 65

Slide 65 text

Consistent hashing; dynamic membership Data Placement

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

Replication per-value across ring Data Placement

Slide 70

Slide 70 text

Replica Replica Replica

Slide 71

Slide 71 text

High Availability …any non-failing node can respond to any request Gilbert & Lynch

Slide 72

Slide 72 text

Eventual Consistency Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Wikipedia

Slide 73

Slide 73 text

Take the form: {Writer, Value, Time} Concurrent writes

Slide 74

Slide 74 text

[{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] Concurrent writes

Slide 75

Slide 75 text

[{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] [{b, v1, t2}] [{b, v1, t2}] [{b, v1, t2}] Last Writer Wins

Slide 76

Slide 76 text

[{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] Allow Mult

Slide 77

Slide 77 text

User-specified merge
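The three strategies for concurrent writes can be compared side by side. The sketch assumes each write is a `(writer, value, time)` tuple as on the slides, with integer times and set-valued payloads chosen purely for illustration; the union merge stands in for whatever application-specific merge a user would supply.

```python
def last_writer_wins(writes: list) -> list:
    """Keep only the write with the highest timestamp; the other is silently lost."""
    return [max(writes, key=lambda w: w[2])]

def keep_siblings(writes: list) -> list:
    """allow_mult behaviour: keep every concurrent write as a sibling."""
    return list(writes)

def merge_by_union(siblings: list) -> set:
    """Example user-specified merge: union the set-valued payloads."""
    merged = set()
    for _writer, value, _time in siblings:
        merged |= value
    return merged

writes = [("a", {"milk"}, 1), ("b", {"eggs"}, 2)]
print(last_writer_wins(writes))
print(merge_by_union(keep_siblings(writes)))
```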

Slide 78

Slide 78 text

Two Approaches

Slide 79

Slide 79 text

Strong Eventual Consistency

Slide 80

Slide 80 text

Designed for convergence; allows divergence Conflict-free Replicated Data Types
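A grow-only counter is one of the simplest data types of this kind and shows how replicas may diverge yet always converge on merge. The class below is a textbook-style sketch with assumed names, not Riak's counter implementation.

```python
class GCounter:
    """Grow-only counter: one slot per actor, merged by taking the per-actor maximum."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, actor: str, amount: int = 1) -> None:
        self.counts[actor] = self.counts.get(actor, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        actors = self.counts.keys() | other.counts.keys()
        return GCounter({
            actor: max(self.counts.get(actor, 0), other.counts.get(actor, 0))
            for actor in actors
        })

a, b = GCounter(), GCounter()
a.increment("n1"); a.increment("n1")   # replica a diverges
b.increment("n2", 3)                   # replica b diverges
print(a.merge(b).value(), b.merge(a).value())
```

Merge is commutative, associative, and idempotent, so the order and number of merges does not affect the result.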

Slide 81

Slide 81 text

Strong Consistency

Slide 82

Slide 82 text

Provides atomicity and recency Strong Consistency

Slide 83

Slide 83 text

Prohibits partial writes Strong Consistency

Slide 84

Slide 84 text

A A A

Slide 85

Slide 85 text

A A A Val = B

Slide 86

Slide 86 text

A A A Val = B

Slide 87

Slide 87 text

B A A

Slide 88

Slide 88 text

B A A Get Operation with Read Repair

Slide 89

Slide 89 text

B A A Get Operation with Read Repair

Slide 90

Slide 90 text

B A A Get Operation with Read Repair B B

Slide 91

Slide 91 text

Single key atomic operations Strong Consistency

Slide 92

Slide 92 text

Requires read/modify/write cycle (CAS) Strong Consistency

Slide 93

Slide 93 text

Consensus

Slide 94

Slide 94 text

Distributed Consensus The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications. Fischer, Lynch, Paterson

Slide 95

Slide 95 text

Termination, agreement, validity The Consensus Problem

Slide 96

Slide 96 text

All processes eventually decide on a value Termination

Slide 97

Slide 97 text

All processes decide on the same value Agreement

Slide 98

Slide 98 text

Value decided on had to have been proposed Validity

Slide 99

Slide 99 text

Consensus Algorithms

Slide 100

Slide 100 text

Paxos, ZAB, Raft, etc. Consensus Algorithms

Slide 101

Slide 101 text

Coordinated requests with a chosen leader The Paxos Algorithm

Slide 102

Slide 102 text

Node 1 Node 2 Node 3 N++ prepare(N) promise(N, Vb) promise(N, Vc) Vn = f(Va, Vb, Vc) commit(N, Vn) accept(N)
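The message flow on this slide corresponds roughly to single-decree Paxos. The sketch below runs it in one process with no networking or failures; the class and function names are assumptions, and the slide's `f(Va, Vb, Vc)` is taken here to mean "adopt the value accepted under the highest ballot, if any".

```python
class Acceptor:
    def __init__(self):
        self.promised = 0            # highest ballot promised so far
        self.accepted_n = 0          # ballot of the accepted value, 0 if none
        self.accepted_value = None

    def prepare(self, n: int):
        """Promise not to accept ballots below n; report any accepted value."""
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_value)
        return None                  # stale ballot: no promise

    def commit(self, n: int, value) -> bool:
        """Accept the value unless a higher ballot has been promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors: list, n: int, value):
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    chosen = prev_value if prev_n > 0 else value   # must adopt an earlier choice
    acks = sum(a.commit(n, chosen) for a in acceptors)
    return chosen if acks >= majority else None

nodes = [Acceptor(), Acceptor(), Acceptor()]
print(propose(nodes, 1, "A"), propose(nodes, 2, "B"))
```

The second proposal returns the first value: once a value is chosen, later ballots can only re-confirm it.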

Slide 103

Slide 103 text

First request Multi-Paxos

Slide 104

Slide 104 text

Node 1 Node 2 Node 3 N++; I = 0 prepare(N, I) promise(N, I, Vb) promise(N, I, Vc) Vn = f(Va, Vb, Vc) commit(N, I, Vn) accept(N, I)

Slide 105

Slide 105 text

Each additional request Multi-Paxos

Slide 106

Slide 106 text

Node 1 Node 2 Node 3 I++ commit(N, I, V) accept(N, I)

Slide 107

Slide 107 text

Ship entire state! Multi-Paxos

Slide 108

Slide 108 text

Riak

Slide 109

Slide 109 text

Key-value store; keys are independent state Riak

Slide 110

Slide 110 text

Multi-Paxos per key; CAS on isolated state Riak

Slide 111

Slide 111 text

Consensus Groups

Slide 112

Slide 112 text

Participants in decisioning; ensembles Consensus Groups

Slide 113

Slide 113 text

Use the preference list! Consensus Groups

Slide 114

Slide 114 text

preflist

Slide 115

Slide 115 text

No content

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

No content

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

One ensemble per preference list; ring size Consensus Groups

Slide 120

Slide 120 text

Ensembles

Slide 121

Slide 121 text

election of leader; get/put operations Riak Ensembles

Slide 122

Slide 122 text

read local; refresh, if old Get Operations

Slide 123

Slide 123 text

Node 1 Node 2 Node 3 obj.epoch < epoch get(key) reply(Epochb, Seqb, Valb) Val = latest(Vala, Valb, Valc) Val.epoch = epoch write(Epoch, ++Seq, Val) ack(Epoch, Seq) reply(Epochc, Seqc, Valc)

Slide 124

Slide 124 text

Node 1 Node 2 Node 3 obj.epoch == epoch Reply = local_get(Key)

Slide 125

Slide 125 text

Get Operations
Worst Case: 2 roundtrips / read
Best Case: 0 roundtrips / read

Slide 126

Slide 126 text

read local; refresh, modify and commit if old Put Operations

Slide 127

Slide 127 text

Node 1 Node 2 Node 3 obj.epoch < epoch get(key) reply(Epochb, Seqb, Valb) Latest = latest(Vala, Valb, Valc) Val = modify(Latest) write(Epoch, ++Seq, Val) ack(Epoch, Seq) reply(Epochc, Seqc, Valc)

Slide 128

Slide 128 text

Node 1 Node 2 Node 3 obj.epoch == epoch Latest = local_get(Key) Val = modify(Latest) write(Epoch, ++Seq, Val) ack(Epoch, Seq)

Slide 129

Slide 129 text

Put Operations
Worst Case: 2 roundtrips / write
Best Case: 1 roundtrip / write
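The round-trip counts for gets and puts follow from the epoch check the leader performs on its local copy. The sketch below models that logic in one process; `Replica`, `Leader`, and the `roundtrips` counter are hypothetical names, and quorum acknowledgements and failures are deliberately left out.

```python
class Replica:
    def __init__(self):
        self.store = {}               # key -> (epoch, seq, value)

class Leader:
    def __init__(self, local, followers, epoch):
        self.local, self.followers = local, followers
        self.epoch, self.seq = epoch, 0
        self.roundtrips = 0           # counts simulated network round trips

    def _members(self):
        return [self.local] + self.followers

    def _is_current(self, key):
        obj = self.local.store.get(key)
        return obj is not None and obj[0] == self.epoch

    def _refresh(self, key):
        """Round trip: ask peers for their copy and take the latest."""
        self.roundtrips += 1
        found = [r.store[key] for r in self._members() if key in r.store]
        return max(found, key=lambda o: (o[0], o[1]))[2] if found else None

    def _write(self, key, value):
        """Round trip: write the value under the current epoch."""
        self.roundtrips += 1
        self.seq += 1
        for r in self._members():
            r.store[key] = (self.epoch, self.seq, value)

    def get(self, key):
        if self._is_current(key):
            return self.local.store[key][2]      # local read, no round trip
        value = self._refresh(key)
        self._write(key, value)                  # rewrite under current epoch
        return value

    def put(self, key, modify):
        latest = (self.local.store[key][2] if self._is_current(key)
                  else self._refresh(key))
        value = modify(latest)
        self._write(key, value)
        return value
```

A get of an object from an older epoch costs two round trips and leaves the object current, so the next get is served locally and the next put needs only the write.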

Slide 130

Slide 130 text

Elect a new leader; start a new epoch Failed Quorums

Slide 131

Slide 131 text

Cluster Membership

Slide 132

Slide 132 text

Use joint consensus from Multi-Paxos Dynamic Membership

Slide 133

Slide 133 text

Existing Ensemble Joining Ensemble riak_01 riak_02 riak_03 riak_07 riak_08 riak_09 [{riak_01}, {riak_02}, {riak_03}] [{riak_07}, {riak_08}, {riak_09}]

Slide 134

Slide 134 text

Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

Slide 135

Slide 135 text

Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

Slide 136

Slide 136 text

New Ensemble riak_07 riak_08 riak_09 [{riak_07}, {riak_08}, {riak_09}]
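During the joint-consensus phase, a decision is usually required to reach a majority of the old membership and a majority of the new membership at the same time. The check below sketches that rule under this assumption; the function names are illustrative.

```python
def has_majority(acks: set, members: set) -> bool:
    """True if more than half of the members acknowledged."""
    return len(acks & members) > len(members) // 2

def joint_quorum(acks: set, old: set, new: set) -> bool:
    """A joint decision needs a majority in BOTH the old and the new ensemble."""
    return has_majority(acks, old) and has_majority(acks, new)

old = {"riak_01", "riak_02", "riak_03"}
new = {"riak_07", "riak_08", "riak_09"}
print(joint_quorum({"riak_01", "riak_02", "riak_07", "riak_08"}, old, new))
print(joint_quorum({"riak_01", "riak_02", "riak_03"}, old, new))
```

Requiring both majorities means neither the old nor the new ensemble can make a decision the other side has not seen while the transition is in flight.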

Slide 137

Slide 137 text

Distributed batch processing for Riak MapReduce

Slide 138

Slide 138 text

Data locality for map; coordinator for reduce MapReduce

Slide 139

Slide 139 text

No content

Slide 140

Slide 140 text

Create some objects; http://docs.basho.com/riak/latest/dev/using/mapreduce/
$ curl -XPUT http://localhost:10018/buckets/training/keys/foo \
    -H 'Content-Type: text/plain' \
    -d 'caremad data goes here'
$ curl -XPUT http://localhost:10018/buckets/training/keys/bar \
    -H 'Content-Type: text/plain' \
    -d 'caremad caremad caremad caremad'
$ curl -XPUT http://localhost:10018/buckets/training/keys/baz \
    -H 'Content-Type: text/plain' \
    -d 'nothing to see here'
$ curl -XPUT http://localhost:10018/buckets/training/keys/bam \
    -H 'Content-Type: text/plain' \
    -d 'caremad caremad caremad'

Slide 141

Slide 141 text

Run Erlang MapReduce; http://docs.basho.com/riak/latest/dev/using/mapreduce/
> ReFun = fun(O, _, Re) ->
      case re:run(riak_object:get_value(O), Re, [global]) of
          {match, Matches} -> [{riak_object:key(O), length(Matches)}];
          nomatch -> [{riak_object:key(O), 0}]
      end
  end.
> {ok, Re} = re:compile("caremad").
> {ok, Riak} = riakc_pb_socket:start_link("127.0.0.1", 8087).
> riakc_pb_socket:mapred_bucket(Riak, <<"training">>, [{map, {qfun, ReFun}, Re, true}]).
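To make the shape of the job concrete without a running cluster, here is a local Python analogue of the map phase over the four `training` objects. It does not talk to Riak; the reduce step that sums the counts is an addition for illustration and is not part of the job on the slide.

```python
import re

# The four objects stored in the "training" bucket on the previous slide.
objects = {
    "foo": "caremad data goes here",
    "bar": "caremad caremad caremad caremad",
    "baz": "nothing to see here",
    "bam": "caremad caremad caremad",
}

def map_count(key: str, value: str, pattern: re.Pattern) -> list:
    """Map phase: emit (key, number of matches) for one object."""
    return [(key, len(pattern.findall(value)))]

def run_map(objs: dict, pattern: re.Pattern) -> list:
    results = []
    for key, value in objs.items():     # in Riak this runs next to the data
        results.extend(map_count(key, value, pattern))
    return results

def reduce_total(results: list) -> int:
    """Illustrative reduce phase: total matches across all objects."""
    return sum(count for _key, count in results)

mapped = run_map(objects, re.compile("caremad"))
print(sorted(mapped), reduce_total(mapped))
```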

Slide 142

Slide 142 text

Distributed secondary indexing over values Secondary Indexes (2i)

Slide 143

Slide 143 text

Requires LevelDB or memory backend Secondary Indexes (2i)

Slide 144

Slide 144 text

Tag objects; perform equality or range queries Secondary Indexes (2i)

Slide 145

Slide 145 text

Create values with secondary index tags; http://docs.basho.com/riak/latest/dev/using/2i/
$ curl -XPOST localhost:8098/types/mytype/buckets/users/keys/john_smith \
    -H 'x-riak-index-twitter_bin: jsmith123' \
    -H 'x-riak-index-email_bin: [email protected]' \
    -H 'Content-Type: application/json' \
    -d '{"userData":"data"}'

Slide 146

Slide 146 text

Query secondary index; http://docs.basho.com/riak/latest/dev/using/2i/
$ curl http://localhost:10018/buckets/users/index/twitter_bin/jsmith123
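The tagging model behind these two requests can be sketched as an in-memory store that keeps a list of (tag value, key) pairs per index. Everything here is illustrative: the class `IndexedStore`, the `age_int` index, and the `jane_doe` object are made up, and range queries are assumed to be inclusive at both ends.

```python
from collections import defaultdict

class IndexedStore:
    def __init__(self):
        self.objects = {}
        self.indexes = defaultdict(list)    # index name -> [(tag value, key)]

    def put(self, key, value, tags: dict) -> None:
        """Store the object and replace any index entries it had before."""
        for entries in self.indexes.values():
            entries[:] = [(t, k) for t, k in entries if k != key]
        self.objects[key] = value
        for name, tag in tags.items():
            self.indexes[name].append((tag, key))

    def query_eq(self, name: str, tag) -> list:
        """Equality query: keys whose tag equals the given value."""
        return sorted(k for t, k in self.indexes[name] if t == tag)

    def query_range(self, name: str, low, high) -> list:
        """Range query: keys whose tag lies in [low, high]."""
        return sorted(k for t, k in self.indexes[name] if low <= t <= high)

store = IndexedStore()
store.put("john_smith", {"userData": "data"},
          {"twitter_bin": "jsmith123", "age_int": 30})
store.put("jane_doe", {}, {"twitter_bin": "jdoe", "age_int": 41})
print(store.query_eq("twitter_bin", "jsmith123"))
print(store.query_range("age_int", 25, 35))
```

Note that queries return keys, not values: the index is over tags supplied at write time, and the object body is opaque to it.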

Slide 147

Slide 147 text

Riak integration with Solr Distributed Search Riak Search

Slide 148

Slide 148 text

No content

Slide 149

Slide 149 text

Schemas explain how to index fields Riak Search Components

Slide 150

Slide 150 text

Indexes are built and queried against Riak Search Components

Slide 151

Slide 151 text

Bucket-Index associations say when to index Riak Search Components

Slide 152

Slide 152 text

Default schema covers many content-types Riak Search Components

Slide 153

Slide 153 text

Create default index using default schema; http://docs.basho.com/riak/latest/dev/using/search/
$ curl -XPUT http://localhost:10018/search/index/famous

Slide 154

Slide 154 text

Create default index using default schema; http://docs.basho.com/riak/latest/dev/using/search/
$ curl -XPUT http://localhost:10018/search/index/famous \
    -H 'Content-Type: application/json' \
    -d '{"schema":"_yz_default"}'

Slide 155

Slide 155 text

Create bucket type for search; http://docs.basho.com/riak/latest/dev/using/search/
$ riak-admin bucket-type create animals '{"props":{}}'
$ riak-admin bucket-type activate animals

Slide 156

Slide 156 text

Associate bucket, bucket type, and index; http://docs.basho.com/riak/latest/dev/using/search/
$ curl -XPUT http://localhost:10018/types/animals/buckets/cats/props \
    -H 'Content-Type: application/json' \
    -d '{"props":{"search_index":"famous"}}'

Slide 157

Slide 157 text

Store some values; http://docs.basho.com/riak/latest/dev/using/search/
$ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/liono \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Lion-o", "age_i":30, "leader_b":true}'
$ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/cheetara \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Cheetara", "age_i":28, "leader_b":false}'
$ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/snarf \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Snarf", "age_i":43}'
$ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/panthro \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Panthro", "age_i":36}'

Slide 158

Slide 158 text

Perform search queries; http://docs.basho.com/riak/latest/dev/using/search/
$ curl "http://localhost:10018/search/query/famous?wt=json&q=name_s:Lion*" | jsonpp
$ curl "http://localhost:10018/search/query/famous?wt=json&q=age_i:%5B30%20TO%20*%5D" | jsonpp
$ curl "http://localhost:10018/search/query/famous?wt=json&q=leader_b:true%20AND%20age_i:%5B25%20TO%20*%5D" | jsonpp
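The percent-encoded pieces in these URLs are just Solr range queries such as `age_i:[30 TO *]` with the brackets and spaces escaped. The helper below is a small sketch of building such a URL with the standard library; `search_url` and its defaults are assumptions for the example, and it only constructs the string without contacting Riak.

```python
from urllib.parse import quote

def search_url(index: str, query: str,
               host: str = "localhost", port: int = 10018) -> str:
    """Build a search query URL, escaping brackets and spaces in the query."""
    encoded = quote(query, safe=":*")   # keep ':' and '*' readable, as on the slide
    return f"http://{host}:{port}/search/query/{index}?wt=json&q={encoded}"

print(search_url("famous", "name_s:Lion*"))
print(search_url("famous", "age_i:[30 TO *]"))
print(search_url("famous", "leader_b:true AND age_i:[25 TO *]"))
```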

Slide 159

Slide 159 text

Single-key linearizability; reduced availability Strong Consistency

Slide 160

Slide 160 text

Enable strong consistency; http://docs.basho.com/riak/latest/dev/advanced/strong-consistency/
$ riak-admin bucket-type create strongly_consistent \
    '{"props":{"consistent":true}}'
$ riak-admin bucket-type status strongly_consistent
$ riak-admin bucket-type activate strongly_consistent

Slide 161

Slide 161 text

Read and write a value to SC bucket Exercise

Slide 162

Slide 162 text

Conflict-Free Replicated Data Types Strong Eventual Consistency

Slide 163

Slide 163 text

Converge correctly under concurrent ops* Strong Eventual Consistency
* See the next talk from Annette Bieniusa!

Slide 164

Slide 164 text

Create bucket type for data types; http://docs.basho.com/riak/latest/dev/using/data-types/
$ riak-admin bucket-type create maps \
    '{"props":{"datatype":"map"}}'
$ riak-admin bucket-type create sets \
    '{"props":{"datatype":"set"}}'
$ riak-admin bucket-type create counters \
    '{"props":{"datatype":"counter"}}'
$ riak-admin bucket-type status maps
$ riak-admin bucket-type activate maps

Slide 165

Slide 165 text

Operate on counters; http://docs.basho.com/riak/latest/dev/using/data-types/
$ curl -XPOST http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets \
    -H "Content-Type: application/json" \
    -d '{"increment": 1}'
$ curl http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets

Slide 166

Slide 166 text

Operate on sets; http://docs.basho.com/riak/latest/dev/using/data-types/
$ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
    -H "Content-Type: application/json" \
    -d '{"add_all":["Toronto", "Montreal"]}'
$ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
    -H "Content-Type: application/json" \
    -d '{"remove": "Montreal"}'
$ curl http://localhost:10018/types/sets/buckets/travel/datatypes/cities

Slide 167

Slide 167 text

Operate on maps; http://docs.basho.com/riak/latest/dev/using/data-types/
$ curl -XPOST http://localhost:10018/types/maps/buckets/customers/datatypes/ahmed_info \
    -H "Content-Type: application/json" \
    -d '
    {
      "update": {
        "first_name_register": "Ahmed",
        "phone_number_register": "5551234567"
      }
    }'
$ curl -XPOST http://localhost:10018/types/maps/buckets/customers/datatypes/ahmed_info \
    -H "Content-Type: application/json" \
    -d '
    {
      "update": {
        "annika_info_map": {
          "update": {
            "interests_set": { "add": "tango dancing" }
          }
        }
      }
    }'

Slide 168

Slide 168 text

Read and write a value to map Exercise

Slide 169

Slide 169 text

Questions?