Introduction to Riak

BOBkonf, 2015
Berlin, Germany
Tutorial

Christopher Meiklejohn

January 23, 2015

Transcript

  1. Introduction to Riak Christopher Meiklejohn BOBkonf 2015 @cmeik

  2. History

  3. Amazon Dynamo: published at SOSP 2007; key-value storage system

  4. Amazon Dynamo: focused on high availability and low latency

  5. Amazon Dynamo: a collection of distributed systems techniques

  6. Amazon Dynamo: influenced LinkedIn Voldemort and Facebook Cassandra

  7. Basho Riak: released 2009; Apache 2 licensed Dynamo clone

  8. Installing and Using Riak

  9. Installing Erlang

    $ curl -O http://s3.amazonaws.com/downloads.basho.com/erlang/otp_src_R16B02-basho5.tar.gz
    $ tar -xvf otp_src_R16B02-basho5.tar.gz
    $ cd otp_src_R16B02-basho5
    $ ./configure && make && sudo make install

  10. Building Riak

    $ git clone https://github.com/basho/riak.git
    $ cd riak
    $ make all

  11. Building a devrel

    $ make devrel DEVNODES=5
    $ cd dev; ls

  12. Starting a devrel

    $ for node in dev*; do $node/bin/riak start; done

  13. Pinging all nodes in a devrel

    $ for node in dev*; do $node/bin/riak ping; done

  14. Stage a join

    $ dev2/bin/riak-admin cluster join dev1@127.0.0.1
    $ dev3/bin/riak-admin cluster join dev1@127.0.0.1
    $ dev4/bin/riak-admin cluster join dev1@127.0.0.1
    $ dev5/bin/riak-admin cluster join dev1@127.0.0.1

  15. View a staged plan

    $ dev1/bin/riak-admin cluster plan

  16. View a staged plan

    =============================== Staged Changes ================================
    Action         Nodes(s)
    -------------------------------------------------------------------------------
    join           'dev2@127.0.0.1'
    join           'dev3@127.0.0.1'
    join           'dev4@127.0.0.1'
    join           'dev5@127.0.0.1'
    -------------------------------------------------------------------------------

    NOTE: Applying these changes will result in 1 cluster transition

    ###############################################################################
                             After cluster transition 1/1
    ###############################################################################

    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid     100.0%     20.3%    'dev1@127.0.0.1'
    valid       0.0%     20.3%    'dev2@127.0.0.1'
    valid       0.0%     20.3%    'dev3@127.0.0.1'
    valid       0.0%     20.3%    'dev4@127.0.0.1'
    valid       0.0%     18.8%    'dev5@127.0.0.1'
    -------------------------------------------------------------------------------
    Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

    Transfers resulting from cluster changes: 51
      12 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
      13 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
      13 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
      13 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'

  17. Commit the plan

    $ dev2/bin/riak-admin cluster commit

  18. View members of cluster

    $ dev1/bin/riak-admin member-status

  19. View members of cluster

    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid      20.3%      --      'dev1@127.0.0.1'
    valid      20.3%      --      'dev2@127.0.0.1'
    valid      20.3%      --      'dev3@127.0.0.1'
    valid      20.3%      --      'dev4@127.0.0.1'
    valid      18.8%      --      'dev5@127.0.0.1'
    -------------------------------------------------------------------------------
    Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

  20. Storing data via HTTP

    $ curl -XPUT http://localhost:10018/buckets/welcome/keys/german \
        -H 'Content-Type: text/plain' \
        -d 'herzlich willkommen'

  21. Retrieving data via HTTP

    $ curl http://localhost:10018/buckets/welcome/keys/german

  22. Storing an image via HTTP

    $ curl -XPUT http://localhost:10018/buckets/images/keys/<image_name>.jpg \
        -H 'Content-Type: image/jpeg' \
        --data-binary @<image_name>.jpg

  23. Retrieving an image via HTTP

    $ curl -O http://localhost:10018/buckets/images/keys/<image_name>.jpg

  24. Riak Architecture

  25. Consistent Hashing: hash(bucket/key)

  26. hash ring

  27. tokenize it

  28. node 0 node 1 node 2; hash(key)

  29. node 0 node 1 node 2; replicas are stored on the next N - 1 contiguous partitions

  30. node 0 node 1 node 2; hash(companies/cisco); replicas are stored on the next N - 1 contiguous partitions

  31. node 0 node 1 node 2; hash(companies/cisco); replicas are stored on the next N - 1 contiguous partitions

  32. node 0 node 1 node 2
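
The ring slides above can be sketched in a few lines. This is a minimal illustration, not Riak's implementation: the SHA-1 hash space, the round-robin partition assignment, and the names `build_ring` and `preference_list` are assumptions made for the example.

```python
import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 160  # size of the SHA-1 output space (assumed hash function)


def build_ring(nodes, num_partitions=64):
    """Split the hash space into equal partitions and deal them out round-robin."""
    step = RING_SIZE // num_partitions
    return [(i * step, nodes[i % len(nodes)]) for i in range(num_partitions)]


def preference_list(ring, bucket, key, n_val=3):
    """Return the n_val contiguous (partition_start, node) pairs for bucket/key."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    point = int.from_bytes(digest, "big")
    starts = [start for start, _ in ring]
    # First partition clockwise from the hashed point, wrapping around the ring.
    first = bisect_right(starts, point) % len(ring)
    return [ring[(first + i) % len(ring)] for i in range(n_val)]


ring = build_ring(["node0", "node1", "node2"])
for start, node in preference_list(ring, "companies", "cisco"):
    print(node, hex(start))
```

The first entry is the primary partition and the remaining N - 1 entries are its clockwise neighbours, which is why adding a node only moves ownership of some partitions rather than rehashing every key.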

  33. Scaling out: node 0 node 1 node 2 + node 3

  34. Quorum requests: N, R, W, PR/PW, DW
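
A small sketch of how the N, R, and W values relate. The function names are hypothetical; the only rule encoded is the usual one that read and write quorums are guaranteed to intersect when R + W > N.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum must share at least one replica with every write quorum."""
    return r + w > n


def request_satisfied(replies, required):
    """A request can complete once `required` replicas have answered."""
    return replies >= required


# With N=3, R=2 and W=2 overlap, so a read sees at least one up-to-date replica.
print(quorums_overlap(3, 2, 2))
# With N=3, R=1 and W=1 do not overlap: faster, but a read may miss the latest write.
print(quorums_overlap(3, 1, 1))
```

PR/PW apply the same counting to primary replicas only, and DW counts replicas that have written durably to their storage backend.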

  35. Vector Clocks establish temporality
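
To make "establish temporality" concrete, here is a minimal vector clock sketch. It models a clock as a plain dict of actor to counter; the helper names are assumptions for illustration and not Riak's API.

```python
def increment(clock, actor):
    """Return a copy of `clock` with `actor`'s counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify the relationship between two clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"


base = increment({}, "a")
left = increment(base, "a")    # actor a writes again
right = increment(base, "b")   # actor b writes without seeing a's second write
print(compare(left, base), compare(left, right))
```

Two clocks where neither descends from the other mark concurrent writes, which is the case the conflict-handling slides later in the deck deal with.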

  36. None
  37. None
  38. Anatomy of a Request: get(users/clay-davis)

  39. Anatomy of a Request: get(users/clay-davis); client, Riak

  40. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak

  41. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak; hash(users/clay-davis) == 10, 11, 12

  42. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak; hash(users/clay-davis) == 10, 11, 12; coordinating node; cluster; the ring: 6 7 8 9 10 11 12 13 14 15 16

  43. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak; get(users/clay-davis); coordinating node; cluster; the ring: 6 7 8 9 10 11 12 13 14 15 16

  44. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak; coordinating node; cluster; the ring: 6 7 8 9 10 11 12 13 14 15 16; R=2

  45. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak; coordinating node; cluster; the ring: 6 7 8 9 10 11 12 13 14 15 16; R=2; obj

  46. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak; R=2; obj obj

  47. Anatomy of a Request: get(users/clay-davis); Get Handler (FSM); client, Riak; R=2; obj obj

  48. Anatomy of a Request: get(users/clay-davis); obj

  49. Read Repair (Anti-Entropy)

  50. replica replica replica

  51. replica replica replica X

  52. replica replica replica replica replica replica
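
The read-repair slides above can be sketched as follows. This is a simplification under stated assumptions: each replica is a dict, and a plain integer version stands in for the vector clock Riak would compare. `read_with_repair` is a hypothetical name.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, and repair stale copies.

    Each replica maps key -> (version, value); a missing key counts as version 0.
    """
    replies = [replica.get(key, (0, None)) for replica in replicas]
    newest = max(replies, key=lambda reply: reply[0])
    for replica, reply in zip(replicas, replies):
        if reply[0] < newest[0]:
            replica[key] = newest  # write the newest copy back to the stale replica
    return newest[1]


replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
print(read_with_repair(replicas, "k"))
print(replicas)
```

Repair happens as a side effect of the read, so only keys that are actually read get fixed this way.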

  53. Active Anti-Entropy (self healing clusters)

  54. real-time updates persistent non-blocking disk-based

  55. Merkle tree to track changes; coordinated at the vnode level; runs as a background process; exchange with neighbor vnodes for inconsistencies; resolution semantics: trigger read-repair

  56. = hashes marked dirty

  57. None
  58. None
  59. None
  60. None
  61. = keys to read-repair
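
A rough sketch of the hash-tree exchange described above: two replicas compare a root hash, then segment hashes, and only the keys in differing segments are handed to read-repair. The two-level tree, the segment count, and the names `build_tree` and `keys_to_repair` are assumptions for illustration; Riak's on-disk trees are more elaborate.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).hexdigest()


def build_tree(store, segments=8):
    """Hash each key/value into a segment, then hash the segment hashes into a root."""
    buckets = [[] for _ in range(segments)]
    for key in sorted(store):
        seg = int(_h(key.encode()), 16) % segments
        buckets[seg].append((key, _h(store[key].encode())))
    leaves = [_h(repr(bucket).encode()) for bucket in buckets]
    root = _h("".join(leaves).encode())
    return root, leaves, buckets


def keys_to_repair(store_a, store_b, segments=8):
    """Return the keys whose hashes differ between two replicas."""
    root_a, leaves_a, buckets_a = build_tree(store_a, segments)
    root_b, leaves_b, buckets_b = build_tree(store_b, segments)
    if root_a == root_b:
        return set()  # identical roots: nothing to exchange
    differing = set()
    for i in range(segments):
        if leaves_a[i] != leaves_b[i]:
            changed = set(buckets_a[i]) ^ set(buckets_b[i])
            differing |= {key for key, _ in changed}
    return differing


a = {"foo": "1", "bar": "2", "baz": "3"}
b = dict(a, bar="changed")
print(keys_to_repair(a, b))
```

When the replicas agree, the comparison stops at the root, which is what keeps the background exchange cheap.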

  62. Riak and Consistency

  63. Riak Object

  64. BKey Value

  65. Data Placement: consistent hashing; dynamic membership

  66. None
  67. None
  68. None
  69. Data Placement: replication per-value across ring

  70. Replica Replica Replica

  71. High Availability: "…any non-failing node can respond to any request" (Gilbert & Lynch)

  72. Eventual Consistency: "Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value." (Wikipedia)

  73. Concurrent writes: take the form {Writer, Value, Time}

  74. Concurrent writes

    [{a, v1, t1}]  [{b, v1, t2}]  [{a, v1, t1}]

  75. Last Writer Wins

    [{a, v1, t1}]  [{b, v1, t2}]  [{a, v1, t1}]
    [{b, v1, t2}]  [{b, v1, t2}]  [{b, v1, t2}]

  76. Allow Mult

    [{a, v1, t1}]  [{b, v1, t2}]  [{a, v1, t1}]
    [{a, v1, t1}, {b, v1, t2}]  [{a, v1, t1}, {b, v1, t2}]  [{a, v1, t1}, {b, v1, t2}]

  77. User-specified merge
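
The three options for concurrent writes can be compared side by side in a short sketch. Writes are modelled as (writer, value, time) tuples as on the slides; the function names and the set-union merge are illustrative assumptions, not Riak's API.

```python
def last_writer_wins(writes):
    """Keep only the write with the largest timestamp; other writes are discarded."""
    return max(writes, key=lambda write: write[2])


def allow_mult(writes):
    """Keep every distinct concurrent write as a sibling."""
    siblings = []
    for write in writes:
        if write not in siblings:
            siblings.append(write)
    return siblings


def union_merge(siblings):
    """Example user-specified merge: treat each value as a set and take the union."""
    merged = set()
    for _, value, _ in siblings:
        merged |= set(value)
    return merged


writes = [("a", {"milk"}, 1), ("b", {"eggs"}, 2), ("a", {"milk"}, 1)]
print(last_writer_wins(writes))         # writer a's update is lost
print(union_merge(allow_mult(writes)))  # both updates survive the merge
```

Last-writer-wins is simple but silently drops one of the concurrent updates; keeping siblings pushes the decision to application code that knows how to combine them.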

  78. Two Approaches

  79. Strong Eventual Consistency

  80. Conflict-free Replicated Data Types: designed for convergence; allows divergence
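
As a minimal example of a type that is "designed for convergence", here is a grow-only counter (G-Counter). It is the simplest CRDT to show and is only a sketch: the class and method names are assumptions, and Riak's own counters also support decrements.

```python
class GCounter:
    """Grow-only counter: one slot per actor, merged with an element-wise max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, actor, amount=1):
        self.counts[actor] = self.counts.get(actor, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        actors = set(self.counts) | set(other.counts)
        return GCounter({
            actor: max(self.counts.get(actor, 0), other.counts.get(actor, 0))
            for actor in actors
        })


# Two replicas diverge, then converge to the same value in either merge order.
left, right = GCounter(), GCounter()
left.increment("node0", 2)
right.increment("node1", 3)
print(left.merge(right).value(), right.merge(left).value())
```

Because the merge is commutative, associative, and idempotent, replicas can exchange state in any order and any number of times and still agree.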

  81. Strong Consistency

  82. Strong Consistency: provides atomicity and recency

  83. Strong Consistency: prohibits partial writes

  84. A A A

  85. A A A Val = B

  86. A A A Val = B

  87. B A A

  88. Get Operation with Read Repair: B A A

  89. Get Operation with Read Repair: B A A

  90. Get Operation with Read Repair: B A A; B B

  91. Strong Consistency: single-key atomic operations

  92. Strong Consistency: requires read/modify/write cycle (CAS)
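
The read/modify/write cycle can be sketched against a toy in-memory store. `VersionedStore`, `put_if_version`, and `update` are hypothetical names; the point is only the retry loop around a conditional write.

```python
class VersionedStore:
    """Toy single-key store where every write must name the version it read."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        version, _ = self.get(key)
        if version != expected_version:
            return False  # someone else wrote first; caller must re-read
        self._data[key] = (version + 1, value)
        return True


def update(store, key, modify, retries=10):
    """Read, modify, and conditionally write, retrying on conflict."""
    for _ in range(retries):
        version, value = store.get(key)
        if store.put_if_version(key, version, modify(value)):
            return True
    return False


store = VersionedStore()
update(store, "n", lambda v: (v or 0) + 1)
update(store, "n", lambda v: (v or 0) + 1)
print(store.get("n"))
```

A write based on a stale read is rejected rather than merged, which is the behavioural difference from the sibling-based approach.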

  93. Consensus

  94. Distributed Consensus: "The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications." (Fischer, Lynch, Paterson)

  95. The Consensus Problem: termination, agreement, validity

  96. Termination: all processes eventually decide on a value

  97. Agreement: all processes decide on the same value

  98. Validity: the value decided on must have been proposed

  99. Consensus Algorithms

  100. Consensus Algorithms: Paxos, ZAB, Raft, etc.

  101. The Paxos Algorithm: coordinated requests with a chosen leader

  102. Node 1, Node 2, Node 3

    N++
    prepare(N)
    promise(N, Vb)
    promise(N, Vc)
    Vn = f(Va, Vb, Vc)
    commit(N, Vn)
    accept(N)
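
The message flow on the slide can be approximated with a single-decree Paxos sketch. This is a synchronous, in-process simplification with no networking or failures; the `Acceptor` class and `propose` function are illustrative names, and the slide's commit(N, Vn)/accept(N) pair corresponds to the `accept` call and its boolean reply here.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0          # highest proposal number promised so far
        self.accepted = (0, None)  # (proposal number, value) last accepted

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases for proposal n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [acceptor.prepare(n) for acceptor in acceptors]
    promises = [accepted for kind, accepted in replies if kind == "promise"]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, that value must be re-proposed.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    chosen = prior_value if prior_n > 0 else value
    acks = sum(acceptor.accept(n, chosen) for acceptor in acceptors)
    return chosen if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "A"))
print(propose(acceptors, 2, "B"))  # a later proposer learns the earlier choice
```

Once a value is chosen, later proposals can only re-confirm it, which is the agreement property listed earlier.
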
  103. Multi-Paxos: first request

  104. Node 1, Node 2, Node 3

    N++; I = 0
    prepare(N, I)
    promise(N, I, Vb)
    promise(N, I, Vc)
    Vn = f(Va, Vb, Vc)
    commit(N, I, Vn)
    accept(N, I)

  105. Multi-Paxos: each additional request

  106. Node 1, Node 2, Node 3

    I++
    commit(N, I, V)
    accept(N, I)

  107. Multi-Paxos: ship entire state!

  108. Riak

  109. Riak: key-value store; keys are independent state

  110. Riak: Multi-Paxos per key; CAS on isolated state

  111. Consensus Groups

  112. Consensus Groups: participants in decisioning; ensembles

  113. Consensus Groups: use the preference list!

  114. preflist

  115. None
  116. None
  117. None
  118. None
  119. Consensus Groups: one ensemble per preference list; ring size

  120. Ensembles

  121. Riak Ensembles: election of leader; get/put operations

  122. Get Operations: read local; refresh, if old

  123. Node 1, Node 2, Node 3; obj.epoch < epoch

    get(key)
    reply(Epochb, Seqb, Valb)
    reply(Epochc, Seqc, Valc)
    Val = latest(Vala, Valb, Valc)
    Val.epoch = epoch
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)

  124. Node 1, Node 2, Node 3; obj.epoch == epoch

    Reply = local_get(Key)

  125. Get Operations: worst case: 2 roundtrips / write; best case: 0 roundtrips / write

  126. Put Operations: read local; refresh, modify and commit if old

  127. Node 1, Node 2, Node 3; obj.epoch < epoch

    get(key)
    reply(Epochb, Seqb, Valb)
    reply(Epochc, Seqc, Valc)
    Latest = latest(Vala, Valb, Valc)
    Val = modify(Latest)
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)

  128. Node 1, Node 2, Node 3; obj.epoch == epoch

    Latest = local_get(Key)
    Val = modify(Latest)
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)

  129. Put Operations: worst case: 2 roundtrips / write; best case: 1 roundtrip / write
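
The get and put flows above can be traced in a small in-process model that counts round trips. It is a sketch under heavy assumptions: every follower always answers, there is no quorum counting or failure handling, and `Peer`, `Leader`, and their methods are hypothetical names rather than the riak_ensemble API.

```python
class Peer:
    def __init__(self):
        self.store = {}  # key -> (epoch, seq, value)

    def get(self, key):
        return self.store.get(key, (0, 0, None))

    def write(self, key, epoch, seq, value):
        self.store[key] = (epoch, seq, value)


class Leader(Peer):
    def __init__(self, followers, epoch):
        super().__init__()
        self.followers = followers
        self.epoch = epoch
        self.seq = 0
        self.round_trips = 0

    def _latest(self, key):
        """Return (value, fresh). A local copy from the current epoch is trusted."""
        obj_epoch, _, value = self.get(key)
        if obj_epoch == self.epoch:
            return value, True
        replies = [self.get(key)] + [f.get(key) for f in self.followers]
        self.round_trips += 1  # one round trip to collect the followers' copies
        newest = max(replies, key=lambda reply: (reply[0], reply[1]))
        return newest[2], False

    def _commit(self, key, value):
        self.seq += 1
        for peer in [self] + self.followers:
            peer.write(key, self.epoch, self.seq, value)
        self.round_trips += 1  # one round trip to write and collect acks
        return value

    def read(self, key):
        value, fresh = self._latest(key)
        return value if fresh else self._commit(key, value)

    def put(self, key, modify):
        value, _ = self._latest(key)
        return self._commit(key, modify(value))


followers = [Peer(), Peer()]
followers[0].write("k", 1, 5, "written in epoch 1")
leader = Leader(followers, epoch=2)
leader.read("k")            # stale epoch: refresh, then rewrite (2 round trips)
leader.read("k")            # current epoch: served locally (0 round trips)
leader.put("k", str.upper)  # current epoch: modify and commit (1 round trip)
print(leader.round_trips)
```

The counts line up with the best and worst cases on the slides: rewriting an old-epoch object into the current epoch is what makes later reads local.
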
  130. Failed Quorums: elect a new leader; start a new epoch

  131. Cluster Membership

  132. Dynamic Membership: use joint consensus from Multi-Paxos

  133. Existing Ensemble: riak_01 riak_02 riak_03, [{riak_01}, {riak_02}, {riak_03}]; Joining Ensemble: riak_07 riak_08 riak_09, [{riak_07}, {riak_08}, {riak_09}]

  134. Joint-Consensus Ensemble: [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  135. Joint-Consensus Ensemble: [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  136. New Ensemble: riak_07 riak_08 riak_09, [{riak_07}, {riak_08}, {riak_09}]
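
One common reading of joint consensus is that, during the transition, a decision needs a majority of the old membership and a majority of the new membership at the same time. The sketch below encodes that rule as an assumption; the slides do not spell out the quorum rule, and `has_majority` and `joint_quorum` are hypothetical names.

```python
def has_majority(view, acks):
    """True if more than half of the members in `view` have acknowledged."""
    return len(set(view) & set(acks)) > len(view) // 2


def joint_quorum(views, acks):
    """A decision needs a majority in every active view."""
    return all(has_majority(view, acks) for view in views)


old = ["riak_01", "riak_02", "riak_03"]
new = ["riak_07", "riak_08", "riak_09"]

# During the transition both views are active.
print(joint_quorum([old, new], {"riak_01", "riak_02", "riak_07", "riak_08"}))
print(joint_quorum([old, new], {"riak_01", "riak_02", "riak_03"}))
# After the transition only the new view counts.
print(joint_quorum([new], {"riak_07", "riak_08"}))
```

Requiring both majorities means the old and new ensembles can never make conflicting decisions while membership is changing.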

  137. MapReduce: distributed batch processing for Riak

  138. MapReduce: data locality for map; coordinator for reduce

  139. None
  140. Create some objects; http://docs.basho.com/riak/latest/dev/using/mapreduce/

    $ curl -XPUT http://localhost:10018/buckets/training/keys/foo \
        -H 'Content-Type: text/plain' \
        -d 'caremad data goes here'
    $ curl -XPUT http://localhost:10018/buckets/training/keys/bar \
        -H 'Content-Type: text/plain' \
        -d 'caremad caremad caremad caremad'
    $ curl -XPUT http://localhost:10018/buckets/training/keys/baz \
        -H 'Content-Type: text/plain' \
        -d 'nothing to see here'
    $ curl -XPUT http://localhost:10018/buckets/training/keys/bam \
        -H 'Content-Type: text/plain' \
        -d 'caremad caremad caremad'

  141. Run Erlang MapReduce; http://docs.basho.com/riak/latest/dev/using/mapreduce/

    > ReFun = fun(O, _, Re) ->
          case re:run(riak_object:get_value(O), Re, [global]) of
              {match, Matches} -> [{riak_object:key(O), length(Matches)}];
              nomatch -> [{riak_object:key(O), 0}]
          end
      end.
    > {ok, Re} = re:compile("caremad").
    > {ok, Riak} = riakc_pb_socket:start_link("127.0.0.1", 8087).
    > riakc_pb_socket:mapred_bucket(Riak, <<"training">>, [{map, {qfun, ReFun}, Re, true}]).
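
To see what the map phase above computes without a running cluster, the same logic can be run locally over the four example objects. This is an in-memory stand-in, not a Riak client call; `map_count` mirrors the Erlang fun, and the summing reduce step is an addition for illustration that the slides do not show.

```python
import re


def map_count(key, value, pattern):
    """Map phase: emit (key, number of matches) for one object."""
    return [(key, len(pattern.findall(value)))]


def reduce_sum(results):
    """Reduce phase (illustrative addition): total matches across all objects."""
    return sum(count for _, count in results)


objects = {
    "foo": "caremad data goes here",
    "bar": "caremad caremad caremad caremad",
    "baz": "nothing to see here",
    "bam": "caremad caremad caremad",
}
pattern = re.compile("caremad")

mapped = [pair for key, value in objects.items()
          for pair in map_count(key, value, pattern)]
total = reduce_sum(mapped)
print(mapped, total)
```

In Riak the map function runs next to the data on each vnode and only the small per-key results travel to the coordinator for reduction.
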
  142. Secondary Indexes (2i): distributed secondary indexing over values

  143. Secondary Indexes (2i): requires LevelDB or memory backend

  144. Secondary Indexes (2i): tag objects; perform equality or range queries

  145. Create values with secondary index tags; http://docs.basho.com/riak/latest/dev/using/2i/

    $ curl -XPOST localhost:8098/types/mytype/buckets/users/keys/john_smith \
        -H 'x-riak-index-twitter_bin: jsmith123' \
        -H 'x-riak-index-email_bin: jsmith@basho.com' \
        -H 'Content-Type: application/json' \
        -d '{"userData":"data"}'

  146. Query secondary index; http://docs.basho.com/riak/latest/dev/using/2i/

    $ curl http://localhost:10018/buckets/users/index/twitter_bin/jsmith123
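
The tag-then-query model can be illustrated with a toy index for a single field. This is an in-memory sketch using a sorted list; `SecondaryIndex` and its methods are hypothetical names and say nothing about how Riak stores or partitions its indexes.

```python
import bisect


class SecondaryIndex:
    """Toy index for one field: a sorted list of (term, key) pairs."""

    def __init__(self):
        self._entries = []

    def tag(self, key, term):
        bisect.insort(self._entries, (term, key))

    def range(self, low, high):
        """Return keys whose term lies in the inclusive range [low, high]."""
        start = bisect.bisect_left(self._entries, (low,))
        result = []
        for term, key in self._entries[start:]:
            if term > high:
                break
            result.append(key)
        return result

    def equals(self, term):
        return self.range(term, term)


twitter_bin = SecondaryIndex()
twitter_bin.tag("john_smith", "jsmith123")
twitter_bin.tag("jane", "jdoe")
twitter_bin.tag("amy", "alpha")
print(twitter_bin.equals("jsmith123"))
print(twitter_bin.range("a", "je"))
```

Queries return object keys only; fetching the values is a separate step.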

  147. Riak Search: Riak integration with Solr Distributed Search

  148. None
  149. Riak Search Components: schemas explain how to index fields

  150. Riak Search Components: indexes are built and queried against

  151. Riak Search Components: bucket-index associations say when to index

  152. Riak Search Components: default schema covers many content-types

  153. Create default index using default schema; http://docs.basho.com/riak/latest/dev/using/search/

    $ curl -XPUT http://localhost:10018/search/index/famous

  154. Create default index using default schema; http://docs.basho.com/riak/latest/dev/using/search/

    $ curl -XPUT http://localhost:10018/search/index/famous \
        -H 'Content-Type: application/json' \
        -d '{"schema":"_yz_default"}'

  155. Create bucket type for search; http://docs.basho.com/riak/latest/dev/using/search/

    $ riak-admin bucket-type create animals '{"props":{}}'
    $ riak-admin bucket-type activate animals

  156. Associate bucket, bucket type, and index; http://docs.basho.com/riak/latest/dev/using/search/

    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/props \
        -H 'Content-Type: application/json' \
        -d '{"props":{"search_index":"famous"}}'

  157. Store some values; http://docs.basho.com/riak/latest/dev/using/search/

    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/liono \
        -H 'Content-Type: application/json' \
        -d '{"name_s":"Lion-o", "age_i":30, "leader_b":true}'
    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/cheetara \
        -H 'Content-Type: application/json' \
        -d '{"name_s":"Cheetara", "age_i":28, "leader_b":false}'
    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/snarf \
        -H 'Content-Type: application/json' \
        -d '{"name_s":"Snarf", "age_i":43}'
    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/panthro \
        -H 'Content-Type: application/json' \
        -d '{"name_s":"Panthro", "age_i":36}'

  158. Perform search queries; http://docs.basho.com/riak/latest/dev/using/search/

    $ curl "http://localhost:10018/search/query/famous?wt=json&q=name_s:Lion*" | jsonpp
    $ curl "http://localhost:10018/search/query/famous?wt=json&q=age_i:%5B30%20TO%20*%5D" | jsonpp
    $ curl "http://localhost:10018/search/query/famous?wt=json&q=leader_b:true%20AND%20age_i:%5B25%20TO%20*%5D" | jsonpp
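
The hand-escaped query strings above (`%5B`, `%20`, `%5D`) are easy to get wrong. A small helper can build them with the standard library instead. `search_url` is a hypothetical name, and the host, port, and path simply follow the devrel examples on the slides.

```python
from urllib.parse import urlencode, quote


def search_url(index, query, host="localhost", port=10018):
    """Build a search query URL, percent-encoding the Solr query string."""
    params = urlencode({"wt": "json", "q": query}, quote_via=quote)
    return f"http://{host}:{port}/search/query/{index}?{params}"


print(search_url("famous", "name_s:Lion*"))
print(search_url("famous", "age_i:[30 TO *]"))
print(search_url("famous", "leader_b:true AND age_i:[25 TO *]"))
```

The encoder also escapes `:` and `*`, which the slides leave literal; both forms decode to the same query.
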
  159. Strong Consistency: single-key linearizability; reduced availability

  160. Enable strong consistency; http://docs.basho.com/riak/latest/dev/advanced/strong-consistency/

    $ riak-admin bucket-type create strongly_consistent \
        '{"props":{"consistent":true}}'
    $ riak-admin bucket-type status strongly_consistent
    $ riak-admin bucket-type activate strongly_consistent

  161. Exercise: read and write a value to an SC bucket

  162. Strong Eventual Consistency: Conflict-Free Replicated Data Types

  163. Strong Eventual Consistency: converge correctly under concurrent ops*

    * See the next talk from Annette Bieniusa!
  164. Create bucket type for data types; http://docs.basho.com/riak/latest/dev/using/data-types/

    $ riak-admin bucket-type create maps \
        '{"props":{"datatype":"map"}}'
    $ riak-admin bucket-type create sets \
        '{"props":{"datatype":"set"}}'
    $ riak-admin bucket-type create counters \
        '{"props":{"datatype":"counter"}}'
    $ riak-admin bucket-type status maps
    $ riak-admin bucket-type activate maps

  165. Operate on counters; http://docs.basho.com/riak/latest/dev/using/data-types/

    $ curl -XPOST http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets \
        -H "Content-Type: application/json" \
        -d '{"increment": 1}'
    $ curl http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets

  166. Operate on sets; http://docs.basho.com/riak/latest/dev/using/data-types/

    $ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
        -H "Content-Type: application/json" \
        -d '{"add_all":["Toronto", "Montreal"]}'
    $ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
        -H "Content-Type: application/json" \
        -d '{"remove": "Montreal"}'
    $ curl http://localhost:10018/types/sets/buckets/travel/datatypes/cities

  167. Operate on maps; http://docs.basho.com/riak/latest/dev/using/data-types/

    $ curl -XPOST http://localhost:10018/types/maps/buckets/customers/datatypes/ahmed_info \
        -H "Content-Type: application/json" \
        -d '{"update": {"first_name_register": "Ahmed", "phone_number_register": "5551234567"}}'
    $ curl -XPOST http://localhost:8098/types/maps/buckets/customers/datatypes/ahmed_info \
        -H "Content-Type: application/json" \
        -d '{"update": {"annika_info_map": {"update": {"interests_set": {"add": "tango dancing"}}}}}'

  168. Exercise: read and write a value to a map

  169. Questions?