Kudos on Operating Riak or Riak CS

UENISHI Kota
January 22, 2015

(The animations I poured my heart into did not survive the upload, though...)

Transcript

  1. Self-introduction
    • @kuenishi
    • GitHub, Twitter, etc.
    • 6 years of experience with distributed systems
    • Here on behalf of Basho Japan
    • Developer of Riak CS
  2. Consistent Hashing & The Ring
    • 160-bit integer keyspace
    • Divided into a fixed number of evenly sized partitions
    • Partitions are claimed by nodes in the cluster
    • Replicas go to the N partitions following the key
    [Figure: a ring of 32 partitions owned by node 0 to node 3, marked at 0, 2^160/4 and 2^160/2; hash("conferences/thoughtworks") lands on the ring and is replicated with N=3.]
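The placement rule on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the round-robin ownership in `owner_of` and hashing the plain string with SHA-1 are simplifications chosen here, not Riak's actual claim algorithm or key encoding, and all function names are hypothetical.

```python
import hashlib

RING_SIZE = 2 ** 160                     # 160-bit integer keyspace
NUM_PARTITIONS = 32                      # fixed number of equal partitions
PARTITION_SIZE = RING_SIZE // NUM_PARTITIONS
NODES = ["node0", "node1", "node2", "node3"]


def owner_of(partition):
    """Assumed claim rule: partitions are dealt out to nodes round-robin."""
    return NODES[partition % len(NODES)]


def key_position(name):
    """Place a key on the ring; plain SHA-1 of the string is an assumption."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def preference_list(name, n=3):
    """Return (partition, node) for the N partitions following the key."""
    first = key_position(name) // PARTITION_SIZE
    following = [(first + i) % NUM_PARTITIONS for i in range(1, n + 1)]
    return [(p, owner_of(p)) for p in following]


if __name__ == "__main__":
    print(preference_list("conferences/thoughtworks"))
```

Because the partition count is fixed, adding a node only changes who owns which partition; the key-to-partition mapping stays the same.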
  3. Quorum
    • Every request contacts all replicas of the key
    • N: number of replicas (default 3)
    • R: read quorum
    • W: write quorum
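A minimal sketch of what the quorum values mean for a single request. The function names are hypothetical, and the `R + W > N` overlap check is the general quorum property rather than something stated on the slide.

```python
def quorum_met(replies, quorum):
    """True once at least `quorum` replicas have answered (None = no answer)."""
    return sum(reply is not None for reply in replies) >= quorum


def read_sees_latest_write(n, r, w):
    """Read and write replica sets must share a member when R + W > N."""
    return r + w > n


if __name__ == "__main__":
    # N=3 replicas contacted, one did not answer, R=2 is still satisfied.
    print(quorum_met(["v2", None, "v2"], quorum=2))
    print(read_sees_latest_write(n=3, r=2, w=2))
```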
  4. Hinted Handoff
    • Node fails
    • Requests go to fallback
    • Node comes back
    • "Handoff": data returns to the recovered node
    • Normal operations resume
    [Figure: hash("conferences/thoughtworks") on the ring, with the failed node's partitions marked X.]
  5. Anatomy of a Request
    [Figure: the client sends get("conferences/thoughtworks") to a coordinating node in the Riak cluster. Its Get Handler (FSM) hashes the key to partitions 10, 11 and 12 on the ring (partitions 6 to 16 shown) and, with R=2, collects the replies v2, v1, v2.]
  6. Read Repair
    [Figure: the same get("conferences/thoughtworks") flow with R=2. One replica answers with the stale v1 while the others answer v2; the client receives v2 and the stale replica is updated from v1 to v2.]
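The read repair flow in the figure can be sketched as follows. This is a simplified model: versions are plain integers here, whereas Riak compares vector clocks, and the function name and data layout are hypothetical.

```python
def get_with_read_repair(replicas, r=2):
    """Read from all replicas, return the newest value, fix stale copies.

    replicas: dict mapping partition id -> (version, value); updated in place.
    """
    if len(replicas) < r:
        raise RuntimeError("read quorum not met")
    newest_version, newest_value = max(replicas.values(), key=lambda pair: pair[0])
    repaired = []
    for partition, (version, _value) in list(replicas.items()):
        if version < newest_version:
            # Stale replica: write the winning value back to it.
            replicas[partition] = (newest_version, newest_value)
            repaired.append(partition)
    return newest_value, repaired


if __name__ == "__main__":
    ring = {10: (2, "v2"), 11: (1, "v1"), 12: (2, "v2")}
    print(get_with_read_repair(ring))
    print(ring)
```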
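  7. Active Anti Entropy
    • Background process that keeps data in an AP-oriented DB from degrading
    • Uses a Merkle tree to compute a "checksum" for each partition
    • When a difference is found, that key is read-repaired, so the time needed for repair is proportional to the amount of difference
    • Random access per key
    [Figure: hash trees labeled hash(vnode=0, pid=0), hash(vnode=1, pid=0), hash(vnode=2, pid=0); pid: partition id]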
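To illustrate why repair time scales with the amount of difference, here is a sketch of a two-level hash tree comparison. It is not Riak's AAE implementation: the segment count, the SHA-1 hashing, and all names are assumptions made for the example.

```python
import hashlib

SEGMENTS = 8  # hypothetical; kept small for readability


def _digest(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def segment_of(key):
    """Assign each key to one leaf segment of the tree."""
    return int(_digest(key), 16) % SEGMENTS


def build_tree(data):
    """data: dict key -> value. Returns (root hash, list of leaf hashes)."""
    segments = [[] for _ in range(SEGMENTS)]
    for key in sorted(data):
        segments[segment_of(key)].append(f"{key}={data[key]}")
    leaves = [_digest("\n".join(items)) for items in segments]
    return _digest("".join(leaves)), leaves


def differing_keys(a, b):
    """Compare two replicas top-down and return only the keys that differ."""
    root_a, leaves_a = build_tree(a)
    root_b, leaves_b = build_tree(b)
    if root_a == root_b:
        return []  # one comparison is enough when replicas agree
    dirty = {i for i in range(SEGMENTS) if leaves_a[i] != leaves_b[i]}
    keys = set(a) | set(b)
    return sorted(k for k in keys if segment_of(k) in dirty and a.get(k) != b.get(k))


if __name__ == "__main__":
    print(differing_keys({"k1": "v2", "k2": "x"}, {"k1": "v1", "k2": "x"}))
```

Only the keys returned by `differing_keys` would then be read-repaired, one random access per key.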
  8. Physical layout
    [Figure: Node1, Node2, Node3 and Node4 each hold a copy of the Ring and their own set of vnodes.]
  9. Logical layout
    • The Ring is the metadata that records both the logical layout and the physical layout
    [Figure: the vnodes arranged in a ring.]
  10. Repair Transfer
    • Copies the data from the vnodes on both sides
    [Figure: an empty vnode in the ring being filled from its two neighboring vnodes.]
  11. Changing the Ring
    • riak-admin cluster plan
      • Shares the "Ring change plan" with all nodes
      • The Ring before and the Ring after
      • Nothing is executed yet at this stage
    • riak-admin cluster commit
      • Starts executing the plan created by plan
      • Transfer begins
    [Figure: plan produces Ring’ next to Ring on every node; commit starts moving vnodes.]
  12. cluster leave [email protected]
    • The node hands all the vnodes it holds over to someone else by Ownership Handoff, then shuts down
    [Figure: the leaving node's vnodes move to the other nodes while each node's Ring is replaced by Ring’.]
  13. cluster force-remove [email protected]
    • The owner of the vnodes is dead (and is not coming back), so new, empty vnodes are created on other existing nodes
    [Figure: the dead node's vnodes reappear as empty vnodes (v) on the surviving nodes, and each Ring becomes Ring’.]
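  14. We now have empty vnodes. What next?
    • AAE
      • Time needed is proportional to the amount of difference
      • Involves Read Repair (random access)
    • VNode Repair
      • Time needed is proportional to the size of the vnode
      • Transfers data in bulk (sequential)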
  15. Case study 1: recovering on the same node
    • Lv1: both the Ring file and the data are fully intact
      • Restart and wait for AAE
      • If the node was down for a long time (if the difference is large), run VNode Repair
    • Lv2: the Ring file is intact
      • Restart and run VNode Repair
    [Figure: Ring and vnodes on the affected node.]
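  16. Case study 1: recovering on the same node
    • Lv3: the Ring is gone but the data is intact
      • join and wait for AAE to repair
      • If the node was down for a long time (if the difference is large), join and run VNode Repair
    • Lv4: neither the Ring nor the data is left
      • join and run VNode Repair
    [Figure: Ring and vnodes on the affected node.]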
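  17. Case study 2: the same node is gone for good
    • Lv1: a replacement node can be prepared right away
      • force-replace, then VNode Repair
      • force-remove followed by join is not recommended
    • Lv2: a replacement node cannot be prepared right away
      • force-remove, then VNode Repair
    [Figure: empty vnodes (v) under Ring’.]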
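The two case studies above amount to a small decision table, sketched here as a function. The function, its parameters, and the returned step strings are hypothetical names for this illustration; only the decisions themselves come from the slides.

```python
def recovery_plan(same_node, ring_file=False, data=False,
                  replacement_ready=False, long_outage=False):
    """Return the recovery steps suggested by the case studies."""
    if same_node:
        after = "vnode repair" if long_outage else "wait for AAE"
        if ring_file and data:      # Case 1, Lv1
            return ["restart", after]
        if ring_file:               # Case 1, Lv2
            return ["restart", "vnode repair"]
        if data:                    # Case 1, Lv3
            return ["join", after]
        return ["join", "vnode repair"]   # Case 1, Lv4
    if replacement_ready:           # Case 2, Lv1
        return ["force-replace", "vnode repair"]
    return ["force-remove", "vnode repair"]   # Case 2, Lv2


if __name__ == "__main__":
    print(recovery_plan(same_node=True, ring_file=True, data=True))
    print(recovery_plan(same_node=False, replacement_ready=False))
```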
  18. When Handoff does not make progress
    • riak-admin cluster [partition|partition-count|partitions]
      • Find out which partition lives where
    • riak-admin handoff [enable|disable|summary|details|config]
      • Control Handoff in detail and check its settings
    (Riak 2.0.4 and later)
  19. Tip: when multiple failures happen
    • Check whether the nodes holding adjacent vnodes are down (losing 3 consecutive adjacent vnodes means data loss)
    • On the 1.4 series:
      $ riak attach
      > riak_core_ring:pretty_print(element(2, riak_core_ring_manager:get_my_ring()), []).
    • On 2.0.4 and later:
      $ riak-admin cluster partitions —[email protected]
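The adjacency check described on this slide can be sketched offline once the partition owners are known, for example copied by hand from the command output. The function name and input format are hypothetical, and N=3 is assumed as on the earlier slides.

```python
def lost_runs(owners, down, n=3):
    """Find starting partitions where n consecutive vnodes are all on down nodes.

    owners: list of node names in ring order (index = partition number).
    down:   set of node names that are currently down.
    """
    size = len(owners)
    return [
        start for start in range(size)
        if all(owners[(start + offset) % size] in down for offset in range(n))
    ]


if __name__ == "__main__":
    ring = ["a", "b", "c", "d"] * 2      # 8 partitions, 4 nodes
    print(lost_runs(ring, down={"a", "c"}))
    print(lost_runs(ring, down={"a", "b", "c"}))
```

An empty result means no run of three adjacent vnodes is unavailable; any entry marks a run of three where every vnode is on a down node.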
  20. Riak CS data structure
    • Everything is represented as Riak key-value pairs
    • http://spambucket.riakcs.net/ham/to/foo.txt
      • Riak CS bucket and Riak CS key: {spambucket, ham/to/foo.txt}
      • manifest (content-size, UUID, content-type, etc.): Riak bucket and Riak key {"0o:"+hash(spambucket), hash(ham/to/foo.txt)}
      • chunk, 1MB (default): Riak bucket and Riak key {"0b:"+hash(spambucket), {UUID, N}}
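A sketch of the mapping on this slide, from one Riak CS object to its Riak keys. The choice of MD5 for `hash`, the hex encoding, and numbering blocks from 0 are assumptions made for the example, not confirmed details of Riak CS; `riak_locations` is a hypothetical name.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1MB chunks (the default named on the slide)


def _hash(text):
    """Stand-in for the slide's hash(); MD5 hex is an assumption."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def riak_locations(cs_bucket, cs_key, content_size, object_uuid):
    """Return the manifest location and the list of block locations."""
    manifest = ("0o:" + _hash(cs_bucket), _hash(cs_key))
    block_count = -(-content_size // BLOCK_SIZE)  # ceiling division
    blocks = [
        ("0b:" + _hash(cs_bucket), (object_uuid, n))
        for n in range(block_count)
    ]
    return manifest, blocks


if __name__ == "__main__":
    manifest, blocks = riak_locations(
        "spambucket", "ham/to/foo.txt", 2 * BLOCK_SIZE + 1, "example-uuid")
    print(manifest)
    print(len(blocks), blocks[-1])
```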
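  21. Object lifecycle
    • Object contents are not corrupted even when multiple update requests arrive at the same time
    • Managing the manifest state makes the file appear consistent
    • Even concurrent updates are resolved by CS logic when the Vector Clocks are read
    [Figure: manifest states writing, active, pending_delete, scheduled_delete, with transitions labeled new UUID, upload complete, delete, overwrite, GC, del by GC.]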
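The manifest states in the figure can be read as a small state machine. The pairing of labels to arrows below is an interpretation of the flattened diagram, not something the transcript states outright, and the final "deleted" state and the `step` function are hypothetical names.

```python
# (current state, event) -> next state; pairing inferred from the figure labels
TRANSITIONS = {
    ("writing", "upload complete"): "active",
    ("active", "delete"): "pending_delete",
    ("active", "overwrite"): "pending_delete",
    ("pending_delete", "GC"): "scheduled_delete",
    ("scheduled_delete", "del by GC"): "deleted",
}


def step(state, event):
    """Advance one manifest state, rejecting transitions not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state!r}") from None


if __name__ == "__main__":
    state = "writing"  # a new UUID starts in the writing state
    for event in ["upload complete", "delete", "GC", "del by GC"]:
        state = step(state, event)
        print(event, "->", state)
```

Because each upload gets its own UUID and its own state, an overwrite never touches the blocks of the version that is still being read.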
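  22. Garbage Collection
    • DELETE only changes the state of the manifest
    • The actual DELETE (on Riak) is started by a GC Worker after leeway seconds
    • Objects are removed in order, starting with the ones deleted first
    • Once all blocks are removed, the Manifest is updated
    • app.config "gc_max_workers"
    • A DELETE in Riak is read-before-put-tombstone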
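The selection rule described above (wait for the leeway, then remove the oldest deletions first) can be sketched as follows. The function name and the `(object_id, deleted_at)` input format are hypothetical; this is not how the Riak CS GC daemon stores its queue.

```python
def gc_candidates(deleted, now, leeway_seconds):
    """Return object ids whose leeway has expired, oldest deletion first.

    deleted: list of (object_id, deleted_at) pairs, timestamps in seconds.
    """
    due = [(ts, oid) for oid, ts in deleted if ts + leeway_seconds <= now]
    return [oid for ts, oid in sorted(due)]


if __name__ == "__main__":
    queue = [("b", 200), ("a", 100), ("c", 950)]
    print(gc_candidates(queue, now=1000, leeway_seconds=100))
```

Until the leeway expires an object stays on disk, which is why disk usage does not drop immediately after a DELETE.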
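  23. FAQ
    • "I DELETEd objects but disk usage does not go down"
      • Is GC keeping up?
    • "Disk load is high even though nothing is going on"
      • Check whether an AAE tree build, GC, Bitcask merge, or VNode Repair is running
      • Are the storage blocks aligned at the correct size from top to bottom?
      • Is the storage striped correctly across all disks?
    • Memory usage of the CS process is high
      • The Manifest may have accumulated a large number of Siblings
      • Check node_get_fsm_siblings_100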
  24. References
    • Explanation of failure handling, in Japanese
      • http://qiita.com/kuenishi/items/d49554874305e34619bc
    • Explanation of the Handoff-related commands
      • http://docs.basho.com/riak/latest/ops/running/handoff/
    • VNode Repair procedure
      • http://docs.basho.com/riak/latest/ops/running/recovery/repairing-partitions/
    • Setting Transfer-related parameters dynamically
      • http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#set