Introduction to Riak

BOBkonf, 2015
Berlin, Germany
Tutorial

Christopher Meiklejohn

January 23, 2015

Transcript

  1. Introduction to Riak
    Christopher Meiklejohn
    BOBkonf 2015
    @cmeik

  2. History

  3. Published SOSP 2007; key-value storage system
    Amazon Dynamo

  4. Focused on high-availability and low-latency
    Amazon Dynamo

  5. Collection of distributed systems techniques
    Amazon Dynamo

  6. LinkedIn Voldemort, Facebook Cassandra
    Amazon Dynamo

  7. Released 2009; Apache2 licensed Dynamo clone
    Basho Riak

  8. Installing and Using Riak

  9. $ curl -O http://s3.amazonaws.com/downloads.basho.com/erlang/otp_src_R16B02-basho5.tar.gz
    $ tar -xvf otp_src_R16B02-basho5.tar.gz
    $ cd otp_src_R16B02-basho5
    $ ./configure && make && sudo make install
    Installing Erlang

  10. $ git clone https://github.com/basho/riak.git
    $ cd riak
    $ make all
    Building Riak

  11. $ make devrel DEVNODES=5
    $ cd dev; ls
    Building a devrel

  12. $ for node in dev*; do $node/bin/riak start; done
    Starting a devrel

  13. $ for node in dev*; do $node/bin/riak ping; done
    Pinging all nodes in a devrel

  14. $ dev2/bin/riak-admin cluster join dev1@127.0.0.1
    $ dev3/bin/riak-admin cluster join dev1@127.0.0.1
    $ dev4/bin/riak-admin cluster join dev1@127.0.0.1
    $ dev5/bin/riak-admin cluster join dev1@127.0.0.1
    Stage a join

  15. $ dev1/bin/riak-admin cluster plan
    View a staged plan

  16. =============================== Staged Changes ================================
    Action         Nodes(s)
    -------------------------------------------------------------------------------
    join           'dev2@127.0.0.1'
    join           'dev3@127.0.0.1'
    join           'dev4@127.0.0.1'
    join           'dev5@127.0.0.1'
    -------------------------------------------------------------------------------
    NOTE: Applying these changes will result in 1 cluster transition
    ###############################################################################
    After cluster transition 1/1
    ###############################################################################
    ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid     100.0%     20.3%    'dev1@127.0.0.1'
    valid       0.0%     20.3%    'dev2@127.0.0.1'
    valid       0.0%     20.3%    'dev3@127.0.0.1'
    valid       0.0%     20.3%    'dev4@127.0.0.1'
    valid       0.0%     18.8%    'dev5@127.0.0.1'
    -------------------------------------------------------------------------------
    Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
    Transfers resulting from cluster changes: 51
    12 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'
    13 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
    13 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
    13 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
    View a staged plan

  17. $ dev2/bin/riak-admin cluster commit
    Commit the plan

  18. $ dev1/bin/riak-admin member-status
    View members of cluster

  19. ================================= Membership ==================================
    Status     Ring    Pending    Node
    -------------------------------------------------------------------------------
    valid      20.3%      --      'dev1@127.0.0.1'
    valid      20.3%      --      'dev2@127.0.0.1'
    valid      20.3%      --      'dev3@127.0.0.1'
    valid      20.3%      --      'dev4@127.0.0.1'
    valid      18.8%      --      'dev5@127.0.0.1'
    -------------------------------------------------------------------------------
    Valid:5 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
    View members of cluster

  20. $ curl -XPUT http://localhost:10018/buckets/welcome/keys/german \
    -H 'Content-Type: text/plain' -d 'herzlich willkommen'
    Storing data via HTTP

  21. $ curl http://localhost:10018/buckets/welcome/keys/german
    Retrieving data via HTTP

  22. $ curl -XPUT http://localhost:10018/buckets/images/keys/.jpg \
    -H 'Content-Type: image/jpeg' \
    --data-binary @.jpg
    Storing an image via HTTP

  23. $ curl -O http://localhost:10018/buckets/images/keys/.jpg
    Retrieving an image via HTTP

  24. Riak Architecture

  25. Consistent Hashing
    hash(bucket/key)

  26. hash ring

  27. tokenize it

  28. node 0
    node 1
    node 2
    hash(key)

  29. node 0
    node 1
    node 2
    Replicas are stored to the N - 1
    contiguous partitions

  30. node 0
    node 1
    node 2
    hash(companies/cisco)
    Replicas are stored to the N - 1
    contiguous partitions

  31. node 0
    node 1
    node 2
    hash(companies/cisco)
    Replicas are stored to the N - 1
    contiguous partitions
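
The placement scheme on the preceding slides can be sketched in a few lines of Python. This is a simplified illustration, not Riak's implementation: the ring size, the node names, the round-robin ownership, and hashing the plain string `bucket/key` are all assumptions made for the example.

```python
import hashlib

RING_TOP = 2 ** 160                    # size of the SHA-1 hash space
RING_SIZE = 64                         # number of partitions (assumed)
NODES = ["node0", "node1", "node2"]    # hypothetical node names
N_VAL = 3                              # replicas per object (assumed)


def partition_for(bucket: str, key: str) -> int:
    """Hash bucket/key onto the ring and return the partition index it lands in."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    position = int.from_bytes(digest, "big")
    return position // (RING_TOP // RING_SIZE)


def owner(partition: int) -> str:
    """Partitions are handed out round-robin, so consecutive ones usually differ in owner."""
    return NODES[partition % len(NODES)]


def preference_list(bucket: str, key: str, n: int = N_VAL):
    """The first partition plus the n - 1 partitions that follow it on the ring."""
    first = partition_for(bucket, key)
    partitions = [(first + i) % RING_SIZE for i in range(n)]
    return [(p, owner(p)) for p in partitions]


print(preference_list("companies", "cisco"))
```

Because the partition count is fixed, adding a node only changes which node owns which partitions; the mapping from key to partition stays the same.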

  32. node 0
    node 1
    node 2

  33. Scaling out
    node 0
    node 1
    node 2
    node 3 +

  34. Quorum
    requests
    N R W PR/PW DW
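
As a rough illustration of how these parameters interact: N is the number of replicas, R and W are the number of replies a read or write waits for, PR/PW count replies from primary partitions only, and DW counts replicas that have written to durable storage. The helper names below are hypothetical; the sketch only shows the arithmetic.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum must share at least one replica with every write quorum."""
    return r + w > n


def quorum_met(responses: int, required: int) -> bool:
    """A request succeeds once enough replicas have replied."""
    return responses >= required


def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


print(quorums_overlap(3, 2, 2), quorums_overlap(3, 1, 1), majority(3))
```

With N=3, choosing R=2 and W=2 makes reads and writes overlap, while R=1 and W=1 favours latency and may return stale data.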

  35. Vector Clocks
    establish temporality
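
A minimal sketch of the idea, assuming a vector clock is a plain dictionary from actor name to counter (Riak's own encoding differs): one clock descends another when it has seen every event the other has, and two clocks are concurrent when neither descends the other.

```python
def increment(clock: dict, actor: str) -> dict:
    """Return a copy of the clock with the actor's counter advanced by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new


def descends(a: dict, b: dict) -> bool:
    """True when clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def concurrent(a: dict, b: dict) -> bool:
    """Neither clock descends the other, so the writes conflict."""
    return not descends(a, b) and not descends(b, a)


def merge(a: dict, b: dict) -> dict:
    """Pairwise maximum: the smallest clock that descends both inputs."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0)) for actor in a.keys() | b.keys()}


base = increment({}, "a")
left = increment(base, "a")    # actor a writes again
right = increment(base, "b")   # actor b writes without seeing a's second write
print(concurrent(left, right), merge(left, right))
```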

  38. Anatomy of a Request
    get(users/clay-davis)

  39. Anatomy of a Request
    get(users/clay-davis)
    client
    Riak

  40. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak

  41. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak
    hash(users/clay-davis)
    == 10, 11, 12

  42. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak
    hash(users/clay-davis)
    == 10, 11, 12
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring

  43. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak
    get(users/clay-davis)
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring

  44. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring
    R=2

  45. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak
    Coordinating node
    Cluster
    6 7 8 9 10 11 12 13 14 15 16
    The Ring
    R=2 obj

  46. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak
    R=2 obj obj

  47. Anatomy of a Request
    get(users/clay-davis)
    Get Handler (FSM)
    client
    Riak
    R=2 obj
    obj

  48. Anatomy of a Request
    get(users/clay-davis)
    obj

  49. Read Repair
    (Anti-Entropy)

  50. replica replica replica

  51. replica replica replica
    X

  52. replica replica replica
    replica replica replica
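
The read path and read repair from the preceding slides can be approximated as follows. This sketch makes several simplifying assumptions: replicas are in-memory dictionaries, an integer `version` stands in for a vector clock, a missing object does not count toward R, and the repair happens synchronously before the reply.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    value: str
    version: int   # stand-in for a vector clock; higher means newer


def get_with_read_repair(replicas: list, key: str, r: int = 2) -> Obj:
    """Read from every replica, require r answers, and repair stale or missing copies."""
    replies = [(replica, replica.get(key)) for replica in replicas]
    found = [obj for _, obj in replies if obj is not None]
    if len(found) < r:
        raise RuntimeError("read quorum not met")
    newest = max(found, key=lambda o: o.version)
    for replica, obj in replies:
        if obj is None or obj.version < newest.version:
            replica[key] = newest          # read repair: push the newest copy back
    return newest


key = "users/clay-davis"
replicas = [{key: Obj("new", 2)}, {key: Obj("old", 1)}, {}]
print(get_with_read_repair(replicas, key).value)
```

After the call, all three replicas hold the newest object, which is the state shown on the last read-repair slide.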

  53. Active Anti-Entropy
    (self healing clusters)

  54. real-time updates
    persistent
    non-blocking
    disk-based

  55. merkle tree to track changes
    coordinated at the vnode level
    runs as a background process
    exchange with
    neighbor vnodes for inconsistencies
    resolution semantics:
    trigger read-repair
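
The exchange described above can be sketched with a two-level hash tree. This is an illustration under stated assumptions, not Riak's on-disk structure: the segment count, the use of SHA-1, and the in-memory dictionaries are all chosen for the example. Two vnodes compare root hashes first, descend only into segments that differ, and end up with the set of keys to read-repair.

```python
import hashlib

SEGMENTS = 4   # number of leaf segments (assumed)


def _h(data: str) -> str:
    return hashlib.sha1(data.encode()).hexdigest()


def segment_of(key: str) -> int:
    """Assign each key to a fixed segment so both sides bucket keys identically."""
    return int(_h(key), 16) % SEGMENTS


def build_tree(store: dict) -> dict:
    """Hash every value, then each segment of key hashes, then the root."""
    keys = [dict() for _ in range(SEGMENTS)]
    for key, value in store.items():
        keys[segment_of(key)][key] = _h(value)
    segments = [_h(repr(sorted(seg.items()))) for seg in keys]
    return {"root": _h("".join(segments)), "segments": segments, "keys": keys}


def exchange(a: dict, b: dict) -> set:
    """Return the keys whose hashes differ, descending only into mismatched segments."""
    if a["root"] == b["root"]:
        return set()
    differing = set()
    for i in range(SEGMENTS):
        if a["segments"][i] != b["segments"][i]:
            ka, kb = a["keys"][i], b["keys"][i]
            differing |= {k for k in ka.keys() | kb.keys() if ka.get(k) != kb.get(k)}
    return differing


vnode_a = {"foo": "1", "bar": "2", "baz": "3"}
vnode_b = {"foo": "1", "bar": "stale", "baz": "3"}
print(exchange(build_tree(vnode_a), build_tree(vnode_b)))
```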

  56. = hashes marked dirty

  61. = keys to read-repair

  62. Riak and Consistency

  63. Riak Object

  64. BKey Value

  65. Consistent hashing; dynamic membership
    Data Placement

  69. Replication per-value across ring
    Data Placement

  70. Replica Replica Replica

  71. High Availability
    …any non-failing node can respond to any request
    Gilbert & Lynch

  72. Eventual Consistency
    Eventual consistency is a consistency model used in
    distributed computing that informally guarantees that, if no
    new updates are made to a given data item, eventually all
    accesses to that item will return the last updated value.
    Wikipedia

  73. Take the form: {Writer, Value, Time}
    Concurrent writes

  74. [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}]
    Concurrent writes

  75. [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}]
    [{b, v1, t2}] [{b, v1, t2}] [{b, v1, t2}]
    Last Writer Wins

  76. [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}]
    [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}]
    Allow Mult

  77. User specified
    Merge
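
The three strategies for concurrent writes shown on the preceding slides can be compared side by side. The sketch below assumes writes are `(writer, value, time)` tuples with integer timestamps where t1 < t2; the function names are hypothetical.

```python
def last_writer_wins(writes):
    """Last Writer Wins: keep only the write with the highest timestamp."""
    return max(writes, key=lambda w: w[2])


def keep_siblings(writes):
    """Allow Mult: keep every distinct concurrent write as a sibling."""
    return sorted(set(writes))


def merge_siblings(siblings, merge_fn):
    """User-specified merge: the application decides how siblings combine."""
    return merge_fn(siblings)


# The three replicas from the slides: two saw writer a, one saw writer b.
writes = [("a", "v1", 1), ("b", "v1", 2), ("a", "v1", 1)]
print(last_writer_wins(writes))
print(keep_siblings(writes))
```

Last Writer Wins silently discards writer a's update, while keeping siblings defers the decision to a merge function supplied by the application, for example a set union over the sibling values.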

  78. Two Approaches

  79. Strong
    Eventual Consistency

  80. Designed for convergence; allows divergence
    Conflict-free Replicated Data Types
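
To make "designed for convergence; allows divergence" concrete, here is a minimal sketch of one well-known CRDT, a PN-counter. It is an illustration of the general technique rather than Riak's data type implementation. Each replica only changes its own entries, and merging takes the per-actor maximum, so replicas that diverge converge again once they exchange state.

```python
class PNCounter:
    """Counter that supports increments and decrements and merges without conflicts."""

    def __init__(self, actor: str):
        self.actor = actor
        self.p = {}   # per-actor increment totals
        self.n = {}   # per-actor decrement totals

    def increment(self, by: int = 1) -> None:
        self.p[self.actor] = self.p.get(self.actor, 0) + by

    def decrement(self, by: int = 1) -> None:
        self.n[self.actor] = self.n.get(self.actor, 0) + by

    def value(self) -> int:
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other: "PNCounter") -> None:
        """Per-actor maximum; commutative, associative, and idempotent."""
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for actor, count in theirs.items():
                mine[actor] = max(mine.get(actor, 0), count)


a, b = PNCounter("a"), PNCounter("b")
a.increment(2)      # replica a and replica b diverge...
b.increment()
b.decrement()
a.merge(b)          # ...and converge after exchanging state
b.merge(a)
print(a.value(), b.value())
```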

  81. Strong
    Consistency

  82. Provides atomicity and recency
    Strong Consistency

  83. Prohibits partial writes
    Strong Consistency

  84. A A A

  85. A A A
    Val = B

  86. A A A
    Val = B

  87. B A A

  88. B A A
    Get Operation with Read Repair

  89. B A A
    Get Operation with Read Repair

  90. B A A
    Get Operation with Read Repair
    B B

  91. Single key atomic operations
    Strong Consistency

  92. Requires read/modify/write cycle (CAS)
    Strong Consistency
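
A read/modify/write cycle with compare-and-swap can be sketched as below. The `VersionedStore` class is a hypothetical in-memory stand-in for a strongly consistent bucket; it only shows why a client must read the current version before it writes and retry if another writer got there first.

```python
class VersionedStore:
    """Hypothetical single-key CAS store: a write succeeds only against the version it read."""

    def __init__(self):
        self._data = {}   # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value) -> bool:
        current, _ = self.get(key)
        if current != expected_version:
            return False              # someone else wrote in between
        self._data[key] = (current + 1, value)
        return True


def update(store, key, modify, retries: int = 5) -> bool:
    """Read, modify, and conditionally write; retry when the version has moved on."""
    for _ in range(retries):
        version, value = store.get(key)
        if store.put_if_version(key, version, modify(value)):
            return True
    return False


store = VersionedStore()
update(store, "visits", lambda v: (v or 0) + 1)
print(store.get("visits"))
```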

  93. Consensus

  94. Distributed Consensus
    The problem of reaching agreement among remote
    processes is one of the most fundamental problems in
    distributed computing and is at the core of many
    algorithms for distributed data processing,
    distributed file management, and fault-tolerant
    distributed applications.
    Fischer, Lynch, Paterson

  95. Termination, agreement, validity
    The Consensus Problem

  96. All processes eventually decide on a value
    Termination

  97. All processes decide on the same value
    Agreement

  98. Value decided on had to have been proposed
    Validity

  99. Consensus Algorithms

  100. Paxos, ZAB, Raft, etc.
    Consensus Algorithms

  101. Coordinated requests with a chosen leader
    The Paxos Algorithm

  102. Node 1 Node 2 Node 3
    N++ prepare(N)
    promise(N, Vb) promise(N, Vc)
    Vn = f(Va, Vb, Vc) commit(N, Vn)
    accept(N)

  103. First request
    Multi-Paxos

  104. Node 1 Node 2 Node 3
    N++; I = 0 prepare(N, I)
    promise(N, I, Vb) promise(N, I, Vc)
    Vn = f(Va, Vb, Vc) commit(N, I, Vn)
    accept(N, I)

  105. Each additional request
    Multi-Paxos

  106. Node 1 Node 2 Node 3
    I++ commit(N, I, V)
    accept(N, I)

  107. Ship entire state!
    Multi-Paxos

  108. Riak

  109. Key-value store; keys are independent state
    Riak

  110. Multi-Paxos per key; CAS on isolated state
    Riak

  111. Consensus Groups

  112. Participants in decisioning; ensembles
    Consensus Groups

  113. Use the preference list!
    Consensus Groups

  114. preflist

  119. One ensemble per preference list; ring size
    Consensus Groups

  120. Ensembles

  121. election of leader; get/put operations
    Riak Ensembles

  122. read local; refresh, if old
    Get Operations

  123. Node 1 Node 2 Node 3
    obj.epoch < epoch get(key)
    reply(Epochb, Seqb, Valb)
    Val = latest(Vala, Valb, Valc)
    Val.epoch = epoch
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)
    reply(Epochc, Seqc, Valc)

  124. Node 1 Node 2 Node 3
    obj.epoch == epoch
    Reply = local_get(Key)

  125. Worst Case: 2 roundtrips / write
    Get Operations
    Best Case: 0 roundtrips / write

  126. read local; refresh, modify and commit if old
    Put Operations

  127. Node 1 Node 2 Node 3
    obj.epoch < epoch get(key)
    reply(Epochb, Seqb, Valb)
    Latest = latest(Vala, Valb, Valc)
    Val = modify(Latest)
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)
    reply(Epochc, Seqc, Valc)

  128. Node 1 Node 2 Node 3
    obj.epoch == epoch
    Latest = local_get(Key)
    Val = modify(Latest)
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)

  129. Worst Case: 2 roundtrips / write
    Put Operations
    Best Case: 1 roundtrip / write
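
The get and put flows on the preceding slides, and their roundtrip counts, can be modelled roughly as follows. This is a sketch under simplifying assumptions: the `Leader` class and its method names are hypothetical, stores are in-memory dictionaries, every follower is always reachable, and quorum failures and leader election are left out.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    epoch: int
    seq: int
    value: object


class Leader:
    def __init__(self, epoch: int, local: dict, followers: list):
        self.epoch = epoch
        self.seq = 0
        self.local = local            # the leader's own copy: key -> Obj
        self.followers = followers    # one dict per follower
        self.roundtrips = 0

    def _quorum_read(self, key):
        """One roundtrip: collect replies and pick the latest (epoch, seq)."""
        self.roundtrips += 1
        stores = [self.local, *self.followers]
        found = [s[key] for s in stores if key in s]
        return max(found, key=lambda o: (o.epoch, o.seq), default=None)

    def _quorum_write(self, key, value):
        """One roundtrip: write(Epoch, ++Seq, Val) to every member."""
        self.roundtrips += 1
        self.seq += 1
        obj = Obj(self.epoch, self.seq, value)
        for store in [self.local, *self.followers]:
            store[key] = obj
        return obj

    def get(self, key):
        obj = self.local.get(key)
        if obj is not None and obj.epoch == self.epoch:
            return obj.value                          # best case: local read
        latest = self._quorum_read(key)               # object is old: refresh
        if latest is None:
            return None
        return self._quorum_write(key, latest.value).value

    def put(self, key, modify):
        obj = self.local.get(key)
        if obj is None or obj.epoch != self.epoch:
            obj = self._quorum_read(key)              # refresh before modifying
        current = obj.value if obj is not None else None
        return self._quorum_write(key, modify(current)).value


followers = [{"k": Obj(1, 5, "fresh")}, {"k": Obj(1, 5, "fresh")}]
leader = Leader(2, {"k": Obj(1, 4, "stale")}, followers)
print(leader.get("k"), leader.roundtrips)
```

The first get after a new epoch pays for a read and a rewrite; later gets are served locally, and later puts need only the write.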

  130. Elect a new leader; start a new epoch
    Failed Quorums

  131. Cluster Membership

  132. Use joint consensus from Multi-Paxos
    Dynamic Membership

  133. Existing Ensemble Joining Ensemble
    riak_01
    riak_02
    riak_03
    riak_07
    riak_08
    riak_09
    [{riak_01}, {riak_02}, {riak_03}] [{riak_07}, {riak_08}, {riak_09}]

  134. Joint-Consensus Ensemble
    [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  135. Joint-Consensus Ensemble
    [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  136. New Ensemble
    riak_07
    riak_08
    riak_09
    [{riak_07}, {riak_08}, {riak_09}]
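
During the joint-consensus phase shown on the preceding slides, a decision needs a majority of the existing ensemble and a majority of the joining ensemble at the same time. A minimal sketch of that rule, with hypothetical function names and the member names taken from the slides:

```python
def has_majority(acks: set, members: list) -> bool:
    """True when more than half of the given members have acknowledged."""
    return len(acks & set(members)) > len(members) // 2


def joint_quorum(acks: set, old: list, new: list) -> bool:
    """Joint consensus: a majority of the old AND of the new ensemble must agree."""
    return has_majority(acks, old) and has_majority(acks, new)


old = ["riak_01", "riak_02", "riak_03"]
new = ["riak_07", "riak_08", "riak_09"]
print(joint_quorum({"riak_01", "riak_02", "riak_07", "riak_08"}, old, new))
```

Requiring both majorities means neither the old nor the new membership can make a decision on its own while the change is in progress.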

  137. Distributed batch processing for Riak
    MapReduce

  138. Data locality for map; coordinator for reduce
    MapReduce

  140. $ curl -XPUT http://localhost:10018/buckets/training/keys/foo \
    -H 'Content-Type: text/plain' \
    -d 'caremad data goes here'
    $ curl -XPUT http://localhost:10018/buckets/training/keys/bar \
    -H 'Content-Type: text/plain' \
    -d 'caremad caremad caremad caremad'
    $ curl -XPUT http://localhost:10018/buckets/training/keys/baz \
    -H 'Content-Type: text/plain' \
    -d 'nothing to see here'
    $ curl -XPUT http://localhost:10018/buckets/training/keys/bam \
    -H 'Content-Type: text/plain' \
    -d 'caremad caremad caremad'
    Create some objects; http://docs.basho.com/riak/latest/dev/using/mapreduce/

  141. > ReFun = fun(O, _, Re) ->
        case re:run(riak_object:get_value(O), Re, [global]) of
            {match, Matches} -> [{riak_object:key(O), length(Matches)}];
            nomatch -> [{riak_object:key(O), 0}]
        end
    end.
    > {ok, Re} = re:compile("caremad").
    > {ok, Riak} = riakc_pb_socket:start_link("127.0.0.1", 8087).
    > riakc_pb_socket:mapred_bucket(Riak, <<"training">>,
        [{map, {qfun, ReFun}, Re, true}]).
    Run Erlang MapReduce; http://docs.basho.com/riak/latest/dev/using/mapreduce/
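
To see what the map phase above computes without a running cluster, the same logic can be simulated locally in Python. This is only a stand-in: the dictionary plays the role of the `training` bucket created on the previous slide, and `run_map` is a hypothetical helper that applies the map function to every object, as Riak would do on the nodes holding the data.

```python
import re

# The four objects stored in the "training" bucket on the previous slide.
training = {
    "foo": "caremad data goes here",
    "bar": "caremad caremad caremad caremad",
    "baz": "nothing to see here",
    "bam": "caremad caremad caremad",
}


def map_count(key: str, value: str, pattern: re.Pattern):
    """Mirror of ReFun: emit (key, number of matches), or (key, 0) when nothing matches."""
    return [(key, len(pattern.findall(value)))]


def run_map(bucket: dict, pattern: re.Pattern):
    """Apply the map function to every object and collect the results."""
    results = []
    for key, value in bucket.items():
        results.extend(map_count(key, value, pattern))
    return sorted(results)


print(run_map(training, re.compile("caremad")))
# [('bam', 3), ('bar', 4), ('baz', 0), ('foo', 1)]
```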

  142. Distributed secondary indexing over values
    Secondary Indexes (2i)

  143. Requires LevelDB or memory backend
    Secondary Indexes (2i)

  144. Tag objects; perform equality or range queries
    Secondary Indexes (2i)

  145. $ curl -XPOST localhost:8098/types/mytype/buckets/users/keys/john_smith \
    -H 'x-riak-index-twitter_bin: jsmith123' \
    -H 'x-riak-index-email_bin: [email protected]' \
    -H 'Content-Type: application/json' \
    -d '{"userData":"data"}'
    Create values with secondary index tags; http://docs.basho.com/riak/latest/dev/using/2i/

  146. $ curl http://localhost:10018/buckets/users/index/twitter_bin/jsmith123
    Query secondary index; http://docs.basho.com/riak/latest/dev/using/2i/

  147. Riak integration with Solr Distributed Search
    Riak Search

  149. Schemas explain how to index fields
    Riak Search Components

  150. Indexes are built and queried against
    Riak Search Components

  151. Bucket-Index associations say when to index
    Riak Search Components

  152. Default schema covers many content-types
    Riak Search Components

  153. $ curl -XPUT http://localhost:10018/search/index/famous
    Create default index using default schema; http://docs.basho.com/riak/latest/dev/using/search/

  154. $ curl -XPUT http://localhost:10018/search/index/famous \
    -H 'Content-Type: application/json' \
    -d '{"schema":"_yz_default"}'
    Create default index using default schema; http://docs.basho.com/riak/latest/dev/using/search/

  155. $ riak-admin bucket-type create animals '{"props":{}}'
    $ riak-admin bucket-type activate animals
    Create bucket type for search; http://docs.basho.com/riak/latest/dev/using/search/

  156. $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/props \
    -H 'Content-Type: application/json' \
    -d '{"props":{"search_index":"famous"}}'
    Associate bucket, bucket type, and index; http://docs.basho.com/riak/latest/dev/using/search/

  157. $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/liono \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Lion-o", "age_i":30, "leader_b":true}'
    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/cheetara \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Cheetara", "age_i":28, "leader_b":false}'
    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/snarf \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Snarf", "age_i":43}'
    $ curl -XPUT http://localhost:10018/types/animals/buckets/cats/keys/panthro \
    -H 'Content-Type: application/json' \
    -d '{"name_s":"Panthro", "age_i":36}'
    Store some values; http://docs.basho.com/riak/latest/dev/using/search/

  158. $ curl "http://localhost:10018/search/query/famous?wt=json&q=name_s:Lion*" | jsonpp
    $ curl "http://localhost:10018/search/query/famous?wt=json&q=age_i:%5B30%20TO%20*%5D" | jsonpp
    $ curl "http://localhost:10018/search/query/famous?wt=json&q=leader_b:true%20AND%20age_i:%5B25%20TO%20*%5D" | jsonpp
    Perform search queries; http://docs.basho.com/riak/latest/dev/using/search/

  159. Single-key linearizability; reduced availability
    Strong Consistency

  160. $ riak-admin bucket-type create strongly_consistent \
    '{"props":{"consistent":true}}'
    $ riak-admin bucket-type status strongly_consistent
    $ riak-admin bucket-type activate strongly_consistent
    Enable strong consistency; http://docs.basho.com/riak/latest/dev/advanced/strong-consistency/

  161. Read and write a value to SC bucket
    Exercise

  162. Conflict-Free Replicated Data Types
    Strong Eventual Consistency

  163. Converge correctly under concurrent ops *
    Strong Eventual Consistency
    * See the next talk from Annette Bieniusa!

  164. $ riak-admin bucket-type create maps \
    '{"props":{"datatype":"map"}}'
    $ riak-admin bucket-type create sets \
    '{"props":{"datatype":"set"}}'
    $ riak-admin bucket-type create counters \
    '{"props":{"datatype":"counter"}}'
    $ riak-admin bucket-type status maps
    $ riak-admin bucket-type activate maps
    Create bucket type for data types; http://docs.basho.com/riak/latest/dev/using/data-types/

  165. $ curl -XPOST http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets \
    -H "Content-Type: application/json" \
    -d '{"increment": 1}'
    $ curl http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets
    Operate on counters; http://docs.basho.com/riak/latest/dev/using/data-types/

  166. $ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
    -H "Content-Type: application/json" \
    -d '{"add_all":["Toronto", "Montreal"]}'
    $ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
    -H "Content-Type: application/json" \
    -d '{"remove": "Montreal"}'
    $ curl http://localhost:10018/types/sets/buckets/travel/datatypes/cities
    Operate on sets; http://docs.basho.com/riak/latest/dev/using/data-types/

  167. $ curl -XPOST http://localhost:10018/types/maps/buckets/customers/datatypes/ahmed_info \
    -H "Content-Type: application/json" \
    -d '
    {
      "update": {
        "first_name_register": "Ahmed",
        "phone_number_register": "5551234567"
      }
    }'
    $ curl -XPOST http://localhost:8098/types/maps/buckets/customers/datatypes/ahmed_info \
    -H "Content-Type: application/json" \
    -d '
    {
      "update": {
        "annika_info_map": {
          "update": {
            "interests_set": {
              "add": "tango dancing"
            }
          }
        }
      }
    }
    '
    Operate on maps; http://docs.basho.com/riak/latest/dev/using/data-types/

  168. Read and write a value to map
    Exercise

  169. Questions?
