
Kudos on Operating Riak or Riak CS

UENISHI Kota
January 22, 2015

The animations I put so much effort into did not survive the upload, though...

Transcript

  1. Tips for Operating Riak / Riak CS
    2015/1/22 Riak Meetup #5 @Yahoo!JAPAN
    Basho Japan, Uenishi
  2. About me
    • @kuenishi
    • Github, Twitter, etc
    • 6 years in distributed systems
    • From Basho Japan
    • Riak CS development
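  3. Outline
    • Operating Riak CS also requires operating Riak
    • Operating it correctly takes a fair amount of knowledge
    • As the data volume grows, any data store becomes fairly hard to run
    • Where each piece of data is stored is somewhat unusual
    • This talk explains these basics in relation to operations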
  4. Basic mechanisms

  5. Consistent Hashing & The Ring
    • 160-bit integer keyspace
    • Divided into a fixed number of evenly sized partitions
    • Partitions are claimed by nodes in the cluster
    • Replicas go to the N partitions following the key
    [Diagram: ring of 32 partitions claimed by node 0 to node 3, with positions 0, 2^160/4 and 2^160/2 marked; hash("conferences/thoughtworks") lands on the ring and N=3 partitions follow it]
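The ring lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Riak's implementation: the round-robin claim, the node names, and hashing a plain string with SHA-1 are assumptions made for the example (Riak hashes the bucket/key pair and uses its own claim algorithm).

```python
import hashlib

RING_SIZE = 2 ** 160      # 160-bit integer keyspace
NUM_PARTITIONS = 32       # fixed partition count, as on the slide
N_VAL = 3                 # replicas per key
NODES = ["node0", "node1", "node2", "node3"]   # hypothetical node names


def partition_of(key: str) -> int:
    """Index of the partition whose range contains hash(key)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    return position // (RING_SIZE // NUM_PARTITIONS)


def owner_of(partition: int) -> str:
    """Assumed round-robin claim: partition i belongs to node i mod 4."""
    return NODES[partition % len(NODES)]


def preference_list(key: str, n: int = N_VAL):
    """The n partitions following the key's position, with their owners."""
    first = (partition_of(key) + 1) % NUM_PARTITIONS
    partitions = [(first + i) % NUM_PARTITIONS for i in range(n)]
    return [(p, owner_of(p)) for p in partitions]


if __name__ == "__main__":
    print(preference_list("conferences/thoughtworks"))
```

With four nodes and a round-robin claim, any three consecutive partitions belong to three different nodes, which is why replicas end up on distinct machines.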
  6. Quorum
    • Every request contacts all replicas of the key
    • N - number of replicas (default 3)
    • R - read quorum
    • W - write quorum
  7. Hinted Handoff
    • Node fails
    • Requests go to fallback
    • Node comes back
    • "Handoff" - data returns to recovered node
    • Normal operations resume
    [Diagram: hash("conferences/thoughtworks") on the ring, with the failed node's partitions crossed out]
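The fallback step can be sketched as follows. This is a simplified model under stated assumptions: `owners` is a hypothetical list mapping each partition index to a node name, and fallbacks are chosen by walking the ring past the primaries until a live node is found. Riak's actual fallback selection is more involved.

```python
def pick_targets(owners, first, down, n_val=3):
    """Return (partition, node, role) for each of the n_val replicas of a key."""
    size = len(owners)
    cursor = first + n_val          # fallbacks are searched after the primaries
    targets = []
    for i in range(n_val):
        partition = (first + i) % size
        owner = owners[partition]
        if owner not in down:
            targets.append((partition, owner, "primary"))
            continue
        # Owner is down: walk the ring until a live node is found.
        for _ in range(size):
            candidate = owners[cursor % size]
            cursor += 1
            if candidate not in down:
                targets.append((partition, candidate, "fallback"))
                break
        else:
            raise RuntimeError("no live node available")
    return targets


def partitions_to_hand_off(targets):
    """Partitions whose data must return to the owner once it recovers."""
    return [partition for partition, _node, role in targets if role == "fallback"]
```

While the owner is down, writes for the affected partition land on the fallback node; `partitions_to_hand_off` lists what has to be sent back when the owner returns.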
  8. Anatomy of a Request
    [Diagram: the client sends get("conferences/thoughtworks") to Riak; the Get Handler (FSM) on the coordinating node computes hash("conferences/thoughtworks") == 10, 11, 12 and queries those partitions on the Ring (partitions 6 to 16 shown); with R=2, the replicas return v2, v1 and v2]
  9. Read Repair
    [Diagram: the same get("conferences/thoughtworks") with R=2; the Get Handler (FSM) on the coordinating node returns v2 to the client, then the replica that still holds v1 is updated to v2]
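The two request slides can be condensed into one sketch: answer the client once R replies are in, then repair stale replicas. This is a minimal model, assuming a plain integer `version` in place of a vector clock and an in-memory list in place of network calls.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    version: int   # stand-in for a vector clock; higher means newer
    value: str


def quorum_get(replicas, r=2):
    """Contact all replicas, answer after r replies, then repair stale ones."""
    replies = list(replicas)               # every replica is contacted
    if len(replies) < r:
        raise RuntimeError("read quorum not met")
    quorum = replies[:r]                   # the first r replies satisfy the client
    answer_value = max(quorum, key=lambda rep: rep.version).value

    # Read repair: once all replies are in, push the newest value to stale replicas.
    newest = max(replies, key=lambda rep: rep.version)
    repaired = []
    for rep in replies:
        if rep.version < newest.version:
            rep.version, rep.value = newest.version, newest.value
            repaired.append(rep.name)
    return answer_value, repaired
```

The answer is captured before the repair loop runs, so the client sees what the quorum returned at read time, while the repair happens as a side effect of the read.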
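  10. Active Anti Entropy
    • A background process that prevents data degradation in an AP-oriented database
    • Uses a Merkle tree to compute a "checksum" for each partition
    • When a difference is found, that part is read-repaired, so repair time is proportional to the amount of difference
    • Random access for each key
    [Diagram: one hash per (vnode, partition id) pair: hash(vnode=0, pid=0), hash(vnode=1, pid=0), hash(vnode=2, pid=0); pid: partition id]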
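The idea of comparing per-partition checksums and repairing only what differs can be sketched with a two-level hash tree. This is an illustration only: the segment count, the SHA-1 hashing and the dict-based stores are assumptions, and Riak's on-disk hash trees are considerably wider and deeper.

```python
import hashlib

SEGMENTS = 8  # hypothetical segment count, chosen small for readability


def _h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def segment_of(key: str) -> int:
    """Assign a key to one of the tree's leaf segments."""
    return int(_h(key.encode("utf-8")), 16) % SEGMENTS


def build_tree(store: dict) -> dict:
    """Two-level hash tree: one hash per segment plus a root over all segments."""
    segments = [[] for _ in range(SEGMENTS)]
    for key in sorted(store):
        segments[segment_of(key)].append(f"{key}={store[key]}")
    leaves = [_h("|".join(items).encode("utf-8")) for items in segments]
    return {"root": _h("".join(leaves).encode("utf-8")), "leaves": leaves}


def keys_to_repair(store_a: dict, store_b: dict) -> list:
    """Keys that differ between two replicas, found by descending the trees."""
    tree_a, tree_b = build_tree(store_a), build_tree(store_b)
    if tree_a["root"] == tree_b["root"]:
        return []                      # identical replicas: nothing to read
    suspects = {i for i in range(SEGMENTS)
                if tree_a["leaves"][i] != tree_b["leaves"][i]}
    keys = set(store_a) | set(store_b)
    return sorted(k for k in keys
                  if segment_of(k) in suspects and store_a.get(k) != store_b.get(k))
```

Only keys in segments whose hashes disagree are examined, which is why the cost scales with the size of the difference rather than with the size of the partition.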
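  11. VNode Repair
    • A handy feature that is not exposed as a command
    • Bulk-copies all the data a vnode should hold from the nodes on both sides of it
    • Time required is proportional to the vnode size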

  12. Ring
    • The list of members participating in the cluster
    • Which vnode lives on which node
    • During a transfer, both the old and the new vnode placement

  13. Physical layout
    [Diagram: Node1, Node2, Node3 and Node4 each hold a copy of the Ring and a set of vnodes]
  14. Logical layout
    [Diagram: the vnodes arranged in a single ring]
    • The Ring is the metadata that records both the logical layout and the physical layout
  15. Transfer
    • The operation of transferring a vnode to another node
    • Time required is proportional to the vnode size
    • Ownership Transfer
    • Handoff Transfer
    • Repair Transfer

  16. Ownership Transfer
    • The operation of transferring ownership of a vnode to another node
    • Involves a change to the Ring
    [Diagram: a vnode (marked *) moves from one node to another, and the Ring is updated]
  17. Handoff Transfer
    • Returns data that was temporarily held during handoff to its owner
    • Does not involve a change to the Ring
    [Diagram: a temporary vnode (marked *) on one node is sent back to the owning node; the Ring stays the same]
  18. Repair Transfer
    • Copies data from the vnodes on both sides
    [Diagram: a vnode in the ring receives data from its two neighbouring vnodes]
  19. In practice

  20. Commands used in operations
    • riak [start|stop|ping]
    • riak-admin cluster [commit|plan|join]
    • riak-admin cluster [force-remove|force-replace]
    • riak-admin down
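  21. Adding a node
    • riak-admin cluster join <node>
    • Stages the "change" of adding a node
    • Plans the reassignment of ownership across all vnodes
    • Ownership Transfer takes place after commit
    • Join nodes in a batch whenever possible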

  22. Changing the Ring
    • riak-admin cluster plan
      • Shares the "Ring change plan" with all nodes
      • The Ring before and the Ring after
      • Nothing is executed yet at this stage
    • riak-admin cluster commit
      • Starts executing the plan created by plan
      • Transfers begin
    [Diagram: plan produces Ring' alongside Ring; commit starts moving vnodes from the Ring layout to the Ring' layout]
  23. Removing a node
    • riak-admin cluster …
      • leave
      • force-remove
      • force-replace

  24. cluster leave riak@10.1.1.1
    • Hands ownership of every vnode it holds over to other nodes, then shuts down
    [Diagram: the leaving node's vnodes move to the remaining nodes as Ring becomes Ring' on every node]
  25. cluster force-remove riak@10.1.1.1
    • The owner of the vnodes is dead (and is not coming back), so new, empty vnodes are created on other existing nodes
    [Diagram: the dead node's vnodes reappear as empty vnodes (v) spread over the remaining nodes; Ring becomes Ring']
  26. cluster force-replace (from) (to)
    • The owner of the vnodes is dead (and is not coming back), so its vnodes are assigned to a different, new node, where they are created empty
    [Diagram: the vnodes of (from) reappear as empty vnodes (v) on (to); Ring becomes Ring']
  27. Summary
    • leave
      • With Ownership Transfer
    • force-remove
      • No Ownership Transfer; reassignment happens across the whole cluster
      • Empty vnodes are created
    • force-replace
      • No Ownership Transfer; no reassignment happens
      • Empty vnodes are created
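  28. We now have empty vnodes. What next?
    • AAE
      • Time required is proportional to the amount of difference
      • Involves Read Repair (random access)
    • VNode Repair
      • Time required is proportional to the vnode size
      • Transfers data in bulk (sequential)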
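  29. Case study 1: recovering on the same node
    • Lv1: the Ring file and all the data are still there
      • Restart and wait for AAE
      • If the node was down for a long time (that is, the difference is large), run VNode Repair
    • Lv2: the Ring file is still there
      • Restart and run VNode Repair
    [Diagram: a node with its Ring and vnodes, and a node with its Ring and empty vnodes (v)]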
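  30. Case study 1: recovering on the same node
    • Lv3: the Ring is gone but the data is still there
      • join and wait for AAE to repair
      • If the node was down for a long time (that is, the difference is large), join and run VNode Repair
    • Lv4: neither the Ring nor the data is left
      • join and run VNode Repair
    [Diagram: a node with its Ring and vnodes, and a node with its Ring and empty vnodes (v)]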
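  31. Case study 2: the same node no longer exists
    • Lv1: a replacement node can be prepared right away
      • force-replace, then VNode Repair
      • force-remove followed by join is not recommended
    • Lv2: a replacement node cannot be prepared right away
      • force-remove, then VNode Repair
    [Diagram: Ring' with empty vnodes (v)]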
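The two case studies amount to a small decision table, which can be written down as a function. The parameter names and step labels below are hypothetical shorthand for the slide text, not commands to run verbatim; a join, for example, still has to go through plan and commit.

```python
def recovery_plan(same_node, has_ring=False, has_data=False,
                  long_outage=False, replacement_ready=False):
    """Ordered recovery steps for a failed node, following the case studies."""
    if same_node:
        # Case study 1: the original machine comes back.
        steps = ["restart"] if has_ring else ["join"]
        if has_data and not long_outage:
            steps.append("wait for AAE")      # small difference: let AAE catch up
        else:
            steps.append("VNode Repair")      # no data or large difference: bulk copy
        return steps
    # Case study 2: the original machine is gone for good.
    if replacement_ready:
        return ["force-replace", "VNode Repair"]
    return ["force-remove", "VNode Repair"]
```

For example, `recovery_plan(True, has_ring=False, has_data=True)` corresponds to Lv3 of case study 1 and yields a join followed by waiting for AAE.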
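  32. Caveats
    • Plan capacity so that losing one or two nodes at any time is not a problem
    • Make sure the lead time for adding a node plus the time needed to restore the replicas (the time for VNode Repair or AAE to restore everything) is no longer than the MTBF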
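The second caveat is simple arithmetic, sketched below. The throughput-based repair estimate is an assumption added for illustration (the slides only say that VNode Repair time is proportional to vnode size), and all numbers in the usage note are made up.

```python
def repair_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Rough bulk-copy time, assuming a constant transfer rate."""
    return data_gb * 1024 / throughput_mb_s / 3600


def within_mtbf(lead_time_h: float, repair_h: float, mtbf_h: float) -> bool:
    """True when node replacement plus replica recovery fits inside the MTBF."""
    return lead_time_h + repair_h <= mtbf_h
```

With a hypothetical 24-hour lead time, 3600 GB to restore at 1024 MB/s (one hour) and an MTBF of 720 hours, `within_mtbf(24, repair_hours(3600, 1024), 720)` is true; a 700-hour lead time with 30 hours of repair would not fit.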
  33. When handoff does not make progress
    • riak-admin cluster [partition|partition-count|partitions]
      • Find out which partition is where
    • riak-admin handoff [enable|disable|summary|details|config]
      • Control handoff in fine detail and check its settings
    * Riak 2.0.4 and later
  34. How is AAE progressing?
    • Check frequently with riak-admin aae-status (for search, use riak-admin search aae-status)
    • The default Tree Expire of one week is short, so set it longer
      • 2w ~ 8w ?
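  35. Bitcask merge / LevelDB compaction
    • The process that removes logically deleted data from the file system
    • Reads everything once, moves the live data elsewhere, then unlinks the old files
    • For bitcask, tune it with merge_window, max_file_size and similar settings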
  36. Tip: when multiple failures occur
    • Check whether nodes that hold adjacent vnodes are down (if three consecutive adjacent vnodes are lost, data is lost)
    • On the 1.4 series:
      $ riak attach
      > riak_core_ring:pretty_print(element(2, riak_core_ring_manager:get_my_ring()), []).
    • On 2.0.4 and later:
      $ riak-admin cluster partitions --node=riak@10.1.1.10
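The adjacency check can also be done offline once the ownership list has been read from the ring output. A minimal sketch, assuming `owners` is a hypothetical list giving the owning node of each partition in ring order and that three replicas sit on three consecutive partitions:

```python
def lost_ranges(owners, down_nodes, n_val=3):
    """Start indexes of runs of n_val adjacent partitions whose owners are all down."""
    size = len(owners)
    down = set(down_nodes)
    return [start for start in range(size)
            if all(owners[(start + k) % size] in down for k in range(n_val))]
```

An empty result means every key still has at least one replica on a live node; any index returned marks a stretch of the ring where all three replicas were on failed nodes. The modulo makes the check wrap around the end of the ring.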
  37. Basic mechanisms

  38. Riak CS process architecture
    [Diagram: components shown are s3cmd, s3fs, AWS SDK, LB, S3 REST API, Stanchion and PB]
  39. Riak CS data structures
    • Everything is represented as Riak key-value pairs
    • URL: http://spambucket.riakcs.net/ham/to/foo.txt
    • {Riak CS bucket, Riak CS key}: {spambucket, ham/to/foo.txt}
    • Manifest, stored at {Riak bucket, Riak key}: {"0o:"+hash(spambucket), hash(ham/to/foo.txt)}; holds content-size, UUID, content-type, etc.
    • Chunks of 1MB (default), stored at {Riak bucket, Riak key}: {"0b:"+hash(spambucket), {UUID, N}}
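The mapping from a Riak CS object to Riak keys can be sketched as below. The slide does not name the hash function, so MD5 is an assumption here, as are the function names; the tuple used as the block key only mirrors the {UUID, N} notation and is not the real binary encoding.

```python
import hashlib
import uuid

BLOCK_SIZE = 1024 * 1024  # 1MB default chunk size from the slide


def _hash(text: str) -> bytes:
    # Assumption: MD5 stands in for the unnamed hash() on the slide.
    return hashlib.md5(text.encode("utf-8")).digest()


def manifest_location(cs_bucket: str, cs_key: str):
    """(Riak bucket, Riak key) under which the object's manifest is stored."""
    return b"0o:" + _hash(cs_bucket), _hash(cs_key)


def block_locations(cs_bucket: str, object_uuid: uuid.UUID, content_size: int):
    """(Riak bucket, Riak key) for every chunk of an object of content_size bytes."""
    count = -(-content_size // BLOCK_SIZE)      # ceiling division
    riak_bucket = b"0b:" + _hash(cs_bucket)
    return [(riak_bucket, (object_uuid, n)) for n in range(count)]


if __name__ == "__main__":
    print(manifest_location("spambucket", "ham/to/foo.txt"))
    print(len(block_locations("spambucket", uuid.uuid4(), 5 * BLOCK_SIZE)))
```

Because every upload gets its own UUID, the chunks of two versions of the same object never share a Riak key.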
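  40. Object lifecycle
    • Object contents are not corrupted even when several update requests arrive at the same time
    • Managing the manifest state makes files appear consistent
    • Even with concurrent updates, conflicts are resolved by CS logic when the Vector Clocks are read
    [Diagram: manifest states writing, active, pending_delete and scheduled_delete; transitions labelled new UUID, upload complete, delete, overwrite, GC and del by GC]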
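The manifest states can be modelled as a small state machine. The state names come from the slide; the event names, the final `deleted` state and the assignment of labels to arrows are inferred from the flattened diagram and should be read as assumptions.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("writing", "upload_complete"): "active",
    ("active", "delete"): "pending_delete",
    ("active", "overwrite"): "pending_delete",
    ("pending_delete", "gc_schedule"): "scheduled_delete",
    ("scheduled_delete", "gc_delete"): "deleted",
}


def advance(state: str, event: str) -> str:
    """Return the next manifest state, rejecting transitions not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state!r}") from None


def run(events, state="writing"):
    """Apply a sequence of events to a manifest that starts in 'writing'."""
    for event in events:
        state = advance(state, event)
    return state
```

A new upload starts its own manifest in `writing` under a new UUID, so an overwrite never touches the blocks of the version that readers may still be fetching.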
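  41. Garbage Collection
    • DELETE only changes the state of the manifest
    • The actual DELETE (on Riak) is started by a GC Worker after leeway seconds
    • Objects are removed in order, starting with the ones deleted first
    • Once all blocks are removed, the Manifest is updated
    • app.config "gc_max_workers"
    • A DELETE in Riak is read-before-put-tombstone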
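The leeway rule and the oldest-first ordering can be sketched as a selection function. The data layout (a dict of manifest UUID to deletion timestamp in seconds) is a hypothetical simplification of what the GC worker actually scans.

```python
def collectable(scheduled: dict, now: float, leeway_seconds: float) -> list:
    """UUIDs whose leeway has elapsed, ordered from the earliest deletion."""
    due = [(deleted_at, object_uuid)
           for object_uuid, deleted_at in scheduled.items()
           if deleted_at + leeway_seconds <= now]
    return [object_uuid for _deleted_at, object_uuid in sorted(due)]
```

Anything deleted less than `leeway_seconds` ago is skipped, which is one reason disk usage does not drop immediately after a DELETE.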
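  42. FAQ
    • "I DELETEd objects but disk usage does not go down"
      • Is GC keeping up?
    • "Disk load is high even though nothing is running"
      • Check whether an AAE tree build, GC, Bitcask merge or VNode Repair is in progress
      • Are the storage blocks aligned with the correct size from top to bottom?
      • Is the storage striped correctly across all disks?
    • Memory usage of the CS process is high
      • The Manifest may have accumulated a large number of Siblings
      • Check node_get_fsm_siblings_100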
  43. When you are stuck
    • riak-users-jp@lists.basho.com
    • Attach your logs and configuration files
    • If that still does not get you anywhere, try posting the output of riak-debug and riak-cs-debug!

  44. We are hiring!
    • Anyone interested in real-world distributed systems problems!
    • @BashoJapan
    • kota@basho.com

  45. Questions?

  46. References
    • An explanation of failure handling, in Japanese
      • http://qiita.com/kuenishi/items/d49554874305e34619bc
    • Handoff-related commands
      • http://docs.basho.com/riak/latest/ops/running/handoff/
    • VNode Repair procedure
      • http://docs.basho.com/riak/latest/ops/running/recovery/repairing-partitions/
    • Setting transfer-related parameters dynamically
      • http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#set