Slide 1

Slide 1 text

Tips for Operating Riak / Riak CS • 2015/1/22 Riak Meetup #5 @Yahoo!JAPAN • Basho Japan, Uenishi

Slide 2

Slide 2 text

About me • @kuenishi • GitHub, Twitter, etc. • 6 years in distributed systems • From Basho Japan • Riak CS development

Slide 3

Slide 3 text

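Outline • Operating Riak CS also requires operating Riak • Operating it correctly takes a fair amount of knowledge • As data volume grows, any data store becomes reasonably hard to run • Where each piece of data is stored is somewhat unique • This talk explains these basics in relation to operations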

Slide 4

Slide 4 text

Basic mechanisms

Slide 5

Slide 5 text

Consistent Hashing & The Ring • 160-bit integer keyspace • Divided into fixed number of evenly-sized partitions • Partitions are claimed by nodes in the cluster • Replicas go to the N partitions following the key [Figure: a ring of 32 partitions claimed by node 0, node 1, node 2, and node 3, marked at 0, 2^160/4, and 2^160/2; hash("conferences/thoughtworks") lands on the ring and, with N=3, its replicas go to the following three partitions]
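
The four bullets above can be sketched in a few lines of Python. This is a simplified illustration, not Riak's actual implementation: hashing the string "bucket/key" with SHA-1 and the round-robin `owner` function are assumptions made here, chosen only because SHA-1 yields exactly 160 bits and the slide shows four nodes.

```python
import hashlib

RING_SIZE = 2 ** 160          # size of the 160-bit keyspace
NUM_PARTITIONS = 32           # fixed number of equal partitions, as on the slide
NODES = ["node 0", "node 1", "node 2", "node 3"]


def key_hash(bucket: str, key: str) -> int:
    """Map bucket/key onto the 160-bit keyspace (assumed: SHA-1 of "bucket/key")."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def preference_list(bucket: str, key: str, n: int = 3) -> list:
    """Return the n partition indexes that follow the key's position on the ring."""
    partition_width = RING_SIZE // NUM_PARTITIONS
    first = key_hash(bucket, key) // partition_width + 1  # partition after the key
    return [(first + i) % NUM_PARTITIONS for i in range(n)]


def owner(partition: int) -> str:
    """Assumed round-robin claim: partition i belongs to node i mod 4."""
    return NODES[partition % len(NODES)]


partitions = preference_list("conferences", "thoughtworks")
print(partitions, [owner(p) for p in partitions])
```

With a round-robin claim, three consecutive partitions always sit on three different nodes, which is why the replicas of one key end up on distinct machines.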

Slide 6

Slide 6 text

Quorum •Every request contacts all replicas of key •N - number of replicas (default 3) •R - read quorum •W - write quorum
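
A minimal sketch of how N, R, and W interact. The function names are hypothetical; the overlap rule is the general quorum property that a read set and a write set of sizes R and W out of N replicas must share a member when R + W > N.

```python
N = 3  # replicas per key (the default, per the slide)


def quorum_met(responses: list, needed: int) -> bool:
    """True once at least `needed` replicas have answered (None means no answer)."""
    return sum(1 for r in responses if r is not None) >= needed


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Read and write sets share at least one replica when r + w > n."""
    return r + w > n


# All N replicas are contacted, but with R=2 the read succeeds after two answers.
print(quorum_met(["v2", None, "v2"], needed=2))
print(quorums_overlap(N, r=2, w=2))
```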

Slide 7

Slide 7 text

Hinted Handoff • Node fails • Requests go to fallback • Node comes back • "Handoff" - data returns to recovered node • Normal operations resume [Figure: the ring with the failed node's partitions marked X for hash("conferences/thoughtworks")]

Slide 8

Slide 8 text

Anatomy of a Request [Figure: the client sends get("conferences/thoughtworks") to Riak; on the coordinating node a Get Handler (FSM) computes hash("conferences/thoughtworks") == 10, 11, 12 and sends the get to those partitions of the ring (partitions 6 to 16 shown) across the cluster; with R=2, the replicas answer with v1 and v2]

Slide 9

Slide 9 text

Read Repair [Figure: same request as the previous slide, get("conferences/thoughtworks") with R=2; the Get Handler (FSM) on the coordinating node receives v1 and v2 from the replicas, returns v2 to the client, and the stale v1 replica is updated to v2]
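
The read-repair step in the figure can be sketched as follows. Plain integer versions stand in for Riak's vector clocks here, which is a simplifying assumption; the function name and data shape are hypothetical.

```python
def read_repair(replies: dict) -> tuple:
    """replies maps partition -> (version, value).

    Returns the newest value and the partitions whose replica is stale
    and should be overwritten with it.
    """
    newest_version, newest_value = max(replies.values(), key=lambda r: r[0])
    stale = sorted(p for p, (version, _) in replies.items() if version < newest_version)
    return newest_value, stale


# Partitions 10 and 12 hold v2, partition 11 still holds v1.
value, to_repair = read_repair({10: (2, "v2"), 11: (1, "v1"), 12: (2, "v2")})
print(value, to_repair)
```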

Slide 10

Slide 10 text

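Active Anti Entropy • Background process that prevents data degradation in an AP-oriented DB • Uses a Merkle tree to compute a "checksum" for each partition • When a difference is found, that spot is read-repaired → repair time is proportional to the amount of difference • Random access per key [Figure: hash(vnode=0, pid=0), hash(vnode=1, pid=0), and hash(vnode=2, pid=0) are compared; pid: partition id]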
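
A rough sketch of the Merkle-tree comparison idea, assuming a tiny two-level tree (a root hash over four segment hashes). The tree shape, `SEGMENTS`, and the function names are illustrative assumptions, not Riak's actual hashtree layout. The point is that identical replicas are detected by comparing a single root hash, and only differing segments are inspected key by key.

```python
import hashlib

SEGMENTS = 4  # assumed tiny tree width, for illustration only


def _h(data: str) -> str:
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


def _segment_of(key: str) -> int:
    return int(_h(key), 16) % SEGMENTS


def build_tree(objects: dict) -> dict:
    """Two-level hash tree: one hash per segment plus a root over the segments."""
    segments = [[] for _ in range(SEGMENTS)]
    for key in sorted(objects):
        segments[_segment_of(key)].append(f"{key}={_h(objects[key])}")
    leaves = [_h("|".join(seg)) for seg in segments]
    return {"root": _h("".join(leaves)), "leaves": leaves}


def keys_to_repair(a: dict, b: dict) -> list:
    """Keys whose values differ between two replicas of the same partition."""
    tree_a, tree_b = build_tree(a), build_tree(b)
    if tree_a["root"] == tree_b["root"]:
        return []  # replicas agree, no key needs to be touched
    differing = []
    for idx in range(SEGMENTS):
        if tree_a["leaves"][idx] == tree_b["leaves"][idx]:
            continue  # whole segment matches, skip its keys
        for key in sorted(set(a) | set(b)):
            if _segment_of(key) == idx and a.get(key) != b.get(key):
                differing.append(key)
    return sorted(differing)


print(keys_to_repair({"k1": "v2", "k2": "x"}, {"k1": "v1", "k2": "x"}))
```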

Slide 11

Slide 11 text

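VNode Repair • A handy feature that is not exposed as a command • Bulk-copies all the data this vnode should hold from the neighboring nodes on both sides • Time required is proportional to the vnode size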

Slide 12

Slide 12 text

Ring • List of members participating in the cluster • Which vnode lives on which node • During a transfer, both the old and the new vnode placement

Slide 13

Slide 13 text

Physical layout [Figure: Node1, Node2, Node3, and Node4 each hold a copy of the Ring and a set of vnodes]

Slide 14

Slide 14 text

Logical layout • The Ring is the metadata that records both the logical layout and the physical layout [Figure: the vnodes arranged in a circle around the Ring]

Slide 15

Slide 15 text

Transfer • The act of moving a vnode to another node • Time required is proportional to the vnode size • Ownership Transfer • Handoff Transfer • Repair Transfer

Slide 16

Slide 16 text

Ownership Transfer • Transfers ownership of a vnode to another node • Involves a change to the Ring [Figure: a vnode marked * moves from one node to another, and the Ring on both nodes is updated]

Slide 17

Slide 17 text

Handoff Transfer • Returns data that was held temporarily through handoff to its owner • Does not involve a change to the Ring [Figure: a temporary vnode marked * on one node is returned to the owning node]

Slide 18

Slide 18 text

Repair Transfer • Copies data from the vnodes on both sides [Figure: a vnode on the ring receives data from its two neighboring vnodes]

Slide 19

Slide 19 text

In practice

Slide 20

Slide 20 text

Commands used in operations • riak [start|stop|ping] • riak-admin cluster [commit|plan|join] • riak-admin cluster [force-remove|force-replace] • riak-admin down

Slide 21

Slide 21 text

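Adding a node • riak-admin cluster join • Stages the "change" of adding a node • Plans a reassignment of ownership across all vnodes • Ownership Transfer happens after commit • Join several nodes at once where possible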

Slide 22

Slide 22 text

Changing the Ring • riak-admin cluster plan • Shares the "Ring change plan" with all nodes • It holds the Ring before and the Ring after • Nothing is executed at this stage • riak-admin cluster commit • Starts executing the plan created by plan • Transfers begin [Figure: plan produces Ring' alongside Ring on every node; commit starts moving vnodes so that Ring' takes effect]

Slide 23

Slide 23 text

Removing a node • riak-admin cluster … • leave • force-remove • force-replace

Slide 24

Slide 24 text

cluster leave [email protected] • The node hands ownership of every vnode it holds over to other nodes, then shuts down [Figure: the leaving node's vnodes move to the remaining nodes while every node switches from Ring to Ring']

Slide 25

Slide 25 text

cluster force-remove [email protected] • The owner of the vnodes is dead (and is not coming back), so new, empty vnodes are created on other existing nodes [Figure: the dead node's vnodes reappear as empty vnodes (v) on the surviving nodes, which switch from Ring to Ring']

Slide 26

Slide 26 text

cluster force-replace (from) (to) • The owner of the vnodes is dead (and is not coming back), so its vnodes are assigned to a different, new node and created there empty [Figure: the vnodes of (from) reappear as empty vnodes (v) on (to), and the Ring becomes Ring']

Slide 27

Slide 27 text

Summary • leave: Ownership Transfer takes place • force-remove: no Ownership Transfer; reassignment happens across the whole cluster; empty vnodes are created • force-replace: no Ownership Transfer; no reassignment; empty vnodes are created

Slide 28

Slide 28 text

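You now have empty vnodes. What next? • AAE: time required is proportional to the amount of difference; involves Read Repair (random access) • VNode Repair: time required is proportional to the vnode size; transfers data in bulk (sequential)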
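
The two proportionality rules above suggest a back-of-the-envelope comparison. The sketch below is only that: the throughput parameters (`random_reads_per_sec`, `bulk_bytes_per_sec`) are hypothetical inputs to be measured on the actual cluster, not figures from the talk.

```python
def aae_seconds(diff_keys: int, random_reads_per_sec: float) -> float:
    """AAE cost grows with the number of differing keys (one random read each)."""
    return diff_keys / random_reads_per_sec


def vnode_repair_seconds(vnode_bytes: int, bulk_bytes_per_sec: float) -> float:
    """VNode Repair cost grows with the vnode size (sequential bulk copy)."""
    return vnode_bytes / bulk_bytes_per_sec


def faster_method(diff_keys: int, vnode_bytes: int,
                  random_reads_per_sec: float, bulk_bytes_per_sec: float) -> str:
    aae = aae_seconds(diff_keys, random_reads_per_sec)
    repair = vnode_repair_seconds(vnode_bytes, bulk_bytes_per_sec)
    return "AAE" if aae <= repair else "VNode Repair"


# Small difference on a 10 GiB vnode versus an almost empty vnode.
print(faster_method(1_000, 10 * 2**30, 500.0, 50 * 2**20))
print(faster_method(10_000_000, 10 * 2**30, 500.0, 50 * 2**20))
```

An empty vnode differs from its neighbors in every key, so the estimate tends to favor VNode Repair in that case.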

Slide 29

Slide 29 text

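Case study 1: recovering on the same node • Lv1: The Ring file and the data are all still there • Restart and wait for AAE • If the node was down for a long time (that is, the difference is large), run VNode Repair • Lv2: Only the Ring file is left • Restart and run VNode Repair [Figure: a node with its Ring and full vnodes, and a node with its Ring and empty vnodes (v)]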

Slide 30

Slide 30 text

Case study 1: recovering on the same node (continued) • Lv3: The Ring is gone but the data is still there • join and wait for AAE to repair • If the node was down for a long time (that is, the difference is large), join and run VNode Repair • Lv4: Neither the Ring nor the data is left • join and run VNode Repair

Slide 31

Slide 31 text

Case study 2: the same node is gone for good • Lv1: A replacement node can be prepared right away • force-replace, then VNode Repair • force-remove followed by join is not recommended • Lv2: A replacement node cannot be prepared right away • force-remove, then VNode Repair

Slide 32

Slide 32 text

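Cautions • Plan capacity so that the cluster is fine even if one or two nodes disappear at any time • Make sure the lead time to add a node and the time needed to restore the replicas (the time for VNode Repair or AAE to restore everything) stay at or below the MTBF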

Slide 33

Slide 33 text

When handoff is not progressing • riak-admin cluster [partition|partition-count|partitions] • Find out which partition is where • riak-admin handoff [enable|disable|summary|details|config] • Control handoff in detail and check its configuration (* Riak 2.0.4 and later)

Slide 34

Slide 34 text

How is AAE progressing? • Check frequently with riak-admin aae-status (for search, riak-admin search aae-status) • Tree expiry is one week, which is short, so set it longer • 2w to 8w?

Slide 35

Slide 35 text

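Bitcask merge / LevelDB compaction • The process that removes logically deleted data from the file system • Reads everything once, moves the live data elsewhere, then unlinks the old files • For Bitcask, tune it with merge_window, max_file_size, and similar settings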
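
The "read everything, keep what is alive" idea can be sketched with an in-memory append-only log. This is a generic log-compaction illustration, not Bitcask's on-disk format; using `None` as the tombstone marker is an assumption made here.

```python
TOMBSTONE = None  # assumed marker for a logical delete


def merge(log: list) -> list:
    """log holds (key, value) pairs in write order; return only the live entries.

    Every entry is read once; the last write per key wins, and keys whose
    last write is a tombstone are dropped. The old log can then be discarded.
    """
    latest = {}
    for key, value in log:
        latest[key] = value
    return [(key, value) for key, value in latest.items() if value is not TOMBSTONE]


old_log = [("a", "1"), ("b", "2"), ("a", "3"), ("b", TOMBSTONE)]
print(merge(old_log))
```

Because the whole log has to be read, a merge generates disk I/O even when clients are idle.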

Slide 36

Slide 36 text

Tip: when multiple failures occur • Check whether nodes that hold adjacent vnodes are down (if three consecutive adjacent vnodes are lost, data is lost) • On the 1.4 series: $ riak attach > riak_core_ring:pretty_print(element(2, riak_core_ring_manager:get_my_ring()), []). • On 2.0.4 and later: $ riak-admin cluster partitions --[email protected]
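
Once the partition-to-node mapping has been read from the ring, the "three in a row" check can be automated. A minimal sketch, assuming the mapping is available as a plain list ordered by partition and that N=3; the function name and input shape are hypothetical.

```python
def lost_ranges(owners: list, down: set, n: int = 3) -> list:
    """Return the starting partition indexes of every run of n consecutive
    partitions whose owners are all down. The ring wraps around."""
    size = len(owners)
    return [
        start
        for start in range(size)
        if all(owners[(start + i) % size] in down for i in range(n))
    ]


ring = ["a", "b", "c", "d"] * 2          # 8 partitions claimed round-robin
print(lost_ranges(ring, down={"a", "b", "c"}))
print(lost_ranges(ring, down={"a", "c"}))
```

An empty result means every key still has at least one replica on a live node.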

Slide 37

Slide 37 text

Basic mechanisms

Slide 38

Slide 38 text

Riak CS process layout [Figure labels: s3cmd, s3fs, AWS SDK, LB, S3 REST API, Stanchion, PB]

Slide 39

Slide 39 text

Riak CS data structures • Everything is represented as Riak key-value pairs • http://spambucket.riakcs.net/ham/to/foo.txt → {Riak CS bucket, Riak CS key} = {spambucket, ham/to/foo.txt} • manifest: {Riak bucket, Riak key} = {"0o:"+hash(spambucket), hash(ham/to/foo.txt)}, holding content-size, UUID, content-type, etc. • chunk, 1MB (default): {Riak bucket, Riak key} = {"0b:"+hash(spambucket), {UUID, N}}
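
The mapping on this slide can be sketched as two small functions. The slide only says hash(...), so MD5 is used here purely as a stand-in, and the function names and the example UUID are hypothetical.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1MB chunk size (the default, per the slide)


def _hash(name: str) -> str:
    # Stand-in hash: the slide only says hash(...); MD5 is an assumption.
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def manifest_location(cs_bucket: str, cs_key: str) -> tuple:
    """{Riak bucket, Riak key} under which the manifest is stored."""
    return ("0o:" + _hash(cs_bucket), _hash(cs_key))


def block_locations(cs_bucket: str, uuid: str, content_size: int) -> list:
    """{Riak bucket, {UUID, N}} for every chunk of an object."""
    count = -(-content_size // BLOCK_SIZE)  # ceiling division
    return [("0b:" + _hash(cs_bucket), (uuid, n)) for n in range(count)]


print(manifest_location("spambucket", "ham/to/foo.txt"))
print(len(block_locations("spambucket", "example-uuid", 2_500_000)))
```

One S3 object therefore becomes one manifest entry plus one Riak object per chunk, which is why object counts in Riak grow faster than object counts in Riak CS.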

Slide 40

Slide 40 text

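Object lifecycle • Object contents are not corrupted even when several update requests arrive at the same time • Manifest state management makes the file appear consistent • Even with concurrent updates, conflicts are resolved by CS logic when the vector clocks are read [Figure: manifest states writing, active, pending_delete, and scheduled_delete, with the transitions new UUID, upload complete, delete, overwrite, GC, and del by GC]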
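
The manifest states in the figure can be read as a small state machine. The transition table below is one reading of the diagram, and the final "deleted" state name is an assumption added here for the step labeled "del by GC".

```python
# (current state, event) -> next state; one reading of the slide's diagram.
TRANSITIONS = {
    ("writing", "upload complete"): "active",
    ("active", "delete"): "pending_delete",
    ("active", "overwrite"): "pending_delete",
    ("pending_delete", "GC"): "scheduled_delete",
    ("scheduled_delete", "del by GC"): "deleted",  # assumed end state name
}


def next_state(state: str, event: str) -> str:
    """Advance a manifest; reject transitions the diagram does not show."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


state = "writing"  # a new UUID starts a manifest in the writing state
for event in ["upload complete", "overwrite", "GC", "del by GC"]:
    state = next_state(state, event)
print(state)
```

Because each upload gets its own UUID and manifest, an overwrite never touches the blocks of the version that is currently active.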

Slide 41

Slide 41 text

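Garbage Collection • DELETE only changes the state of the manifest • The actual DELETE (on Riak) is started by a GC worker after leeway seconds • Objects are removed in order, starting with the ones deleted first • Once all blocks are removed, the manifest is updated • app.config "gc_max_workers" • A DELETE in Riak is read-before-put-tombstone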
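
The leeway rule and the oldest-first order can be sketched as a filter over deleted manifests. The data shape (a list of UUID and delete-time pairs) and the function name are hypothetical.

```python
def collectable(manifests: list, now: float, leeway_seconds: float) -> list:
    """manifests holds (uuid, delete_time) pairs.

    Return the UUIDs whose leeway period has passed, oldest deletion first.
    """
    due = [(deleted_at, uuid) for uuid, deleted_at in manifests
           if now - deleted_at >= leeway_seconds]
    return [uuid for _, uuid in sorted(due)]


pending = [("b", 100), ("a", 50), ("c", 990)]
print(collectable(pending, now=1000, leeway_seconds=100))
```

Until a manifest shows up in this list, its blocks still occupy disk space, which is the usual reason disk usage does not drop right after a DELETE.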

Slide 42

Slide 42 text

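FAQ • "I DELETEd objects but disk usage does not go down" • Is GC keeping up? • "Disk load is high even though nothing is running" • Check whether AAE tree build, GC, Bitcask merge, or VNode Repair is in progress • Are the storage blocks aligned at the correct size from top to bottom? • Is the storage striped correctly across all disks? • "Memory usage of the CS process is high" • The manifest may have accumulated a large number of siblings • Check node_get_fsm_siblings_100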

Slide 43

Slide 43 text

When you are stuck • [email protected] • Attach your logs and configuration files • If that still does not get you anywhere, try posting the output of riak-debug and riak-cs-debug!

Slide 44

Slide 44 text

We are hiring! • Anyone interested in real-world distributed systems problems! • @BashoJapan • [email protected]

Slide 45

Slide 45 text

Questions?

Slide 46

Slide 46 text

References • Explanation of failure handling (in Japanese) • http://qiita.com/kuenishi/items/d49554874305e34619bc • Explanation of the handoff commands • http://docs.basho.com/riak/latest/ops/running/handoff/ • VNode Repair procedure • http://docs.basho.com/riak/latest/ops/running/recovery/repairing-partitions/ • Setting transfer parameters dynamically • http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#set