Slide 1

Slide 1 text

What is HSE? NAOMASA MATSUBAYASHI

Slide 2

Slide 2 text

Heterogeneous-Memory Storage Engine https://www.micron.com/hse
An open-source key-value store announced by Micron in April 2020.

Slide 3

Slide 3 text

Heterogeneous-Memory Storage Engine: an open-source key-value store
https://www.micron.com/hse
https://github.com/hse-project/hse/wiki
"HSE is an embeddable key-value store built for SSDs that use NAND flash or non-volatile memory. HSE improves performance and endurance by carefully arranging where data is placed, from DRAM through multiple classes of SSD or other solid-state storage."

Slide 4

Slide 4 text

Heterogeneous-Memory Storage Engine
Replacing MongoDB's WiredTiger with HSE reportedly raises YCSB benchmark throughput by 2x to 8x.
https://github.com/hse-project/hse/wiki/MongoDB

Slide 5

Slide 5 text

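Storage engine
Diagram: the I/O stack. User space: MySQL with InnoDB and MongoDB with WiredTiger; InnoDB and WiredTiger are the storage engines ("this"). Kernel space: VFS, file system, page cache, bio, I/O scheduler, device driver.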

Slide 6

Slide 6 text

Role of a storage engine: deciding where on the storage device each piece of data is recorded.

Slide 7

Slide 7 text

Role of a storage engine: caching frequently accessed data.

Slide 8

Slide 8 text

Role of a storage engine: transactions
Diagram: transactions built from BEGIN, PUT, and COMMIT or ABORT, with GETs issued at various points; each GET is labeled with the state it can see.

Slide 9

Slide 9 text

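Role of a storage engine: data never ends up in an invalid state, even if processing is interrupted.
Diagram: the sequence PUT, COMMIT, PUT, PUT, COMMIT, with two annotations reading "if a crash happens here, the state after restart is this one".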
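The crash behavior described on this slide can be sketched as a log replay in which PUTs only take effect once a COMMIT follows them. This is a minimal illustration, not how WiredTiger or HSE implement recovery; `LogRecord`, `Kind`, and `recover` are hypothetical names introduced for the example.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical log record: either a PUT of key/value or a COMMIT marker.
enum class Kind { Put, Commit };
struct LogRecord {
  Kind kind;
  std::string key;
  std::string value;
};

// Rebuild the state after a crash by replaying the log from the start.
// PUTs are buffered and applied only when a COMMIT record is reached.
std::map<std::string, std::string> recover(const std::vector<LogRecord>& log) {
  std::map<std::string, std::string> committed;
  std::map<std::string, std::string> pending;
  for (const auto& r : log) {
    if (r.kind == Kind::Put) {
      pending[r.key] = r.value;
    } else {
      for (auto& kv : pending) committed[kv.first] = std::move(kv.second);
      pending.clear();
    }
  }
  // PUTs still pending at the end were never committed and are dropped.
  return committed;
}
```

Replaying a log that ends with uncommitted PUTs yields the state as of the last COMMIT, which is the "state after restart" the slide points at.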

Slide 10

Slide 10 text

How a storage engine is implemented (the WiredTiger case)
Diagram: pages 0 1 2 4 5 on storage, and a log.

Slide 11

Slide 11 text

How a storage engine is implemented (the WiredTiger case)
Session 0 GETs 1 and 2.
Diagram: pages 0 1 2 4 5 on storage, a log, and session 0 holding root -> 1, 2.

Slide 12

Slide 12 text

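How a storage engine is implemented (the WiredTiger case)
Session 1 PUTs 2'.
Diagram: pages 0 1 2 4 5 on storage, a log, session 0 holding root -> 1, 2, and session 1 holding root(1) -> 2'.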

Slide 13

Slide 13 text

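How a storage engine is implemented (the WiredTiger case)
Session 0 GETs 2, and session 1 GETs 2.
Diagram: pages 0 1 2 4 5 on storage, a log, session 0 holding root -> 1, 2, and session 1 holding root(1) -> 2'.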

Slide 14

Slide 14 text

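How a storage engine is implemented (the WiredTiger case)
Session 1 COMMITs.
Diagram: pages 0 1 2 4 5 on storage; session 0 and session 1 with root -> 1 and root(1) -> 2'; the log gains an entry reading "[image] becomes [image]".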

Slide 15

Slide 15 text

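How a storage engine is implemented (the WiredTiger case)
Session 0 GETs 2, and session 1 GETs 2.
Diagram: pages 0 1 2 4 5 on storage; root -> 1, 2' seen by session 0 and session 1; the log holds the entry "[image] becomes [image]".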

Slide 16

Slide 16 text

The storage engine's garbage collector (the WiredTiger case)
Diagram: pages 0 1 2' 4 5 on storage; root -> 1, 2'; the log holds the entry "[image] becomes [image]".

Slide 17

Slide 17 text

The storage engine's garbage collector (the WiredTiger case)
Diagram: pages 0 1 2' 4 5 on storage; root -> 1, 2'; a log.

Slide 18

Slide 18 text

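How a storage engine is implemented
Item | Storage engine | File system
How data is looked up | Key | File path
Decides the recording location | Yes | Yes
Cache | Yes | Yes
Transactions | Controlled by the application | None
State after an interruption | Only completed transactions are reflected | Some state in which the file system is not corrupted
How that state is recovered | Structured log | Structured log
This resembles a file system journal, but a file system does not provide transactions.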

Slide 19

Slide 19 text

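The result is a log stacked on top of a log.
Diagram: the same I/O stack (user space: MySQL with InnoDB, MongoDB with WiredTiger; kernel space: VFS, file system, page cache, bio, I/O scheduler, device driver), with two logs marked in it.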

Slide 20

Slide 20 text

Running a database on an SSD is no longer unusual.

Slide 21

Slide 21 text

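NAND flash constraints
Erasing drains the charge from every cell in the same block (that is, the whole block is zero-cleared). A block normally consists of multiple pages, so a single page cannot be zero-cleared on its own.
Diagram: a row of cells all reading 0, with 20V and 0V applied.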

Slide 22

Slide 22 text

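NAND flash constraints
After the erase, charge is stored in the cells that should become 1. Every time charge moves in or out, the tunnel oxide film of the cell wears down.
Diagram: cells reading 0 0 1 0 1, with 0V and 20V applied.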

Slide 23

Slide 23 text

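NAND flash constraints
Block 0 holds pages 0 1 2 3 and block 1 holds pages 4 5 6 7. We want to rewrite 2 as 2'. Rewriting 2 in place, as on a hard disk, takes two steps:
1. Zero-clear block 0 (0 1 2 3).
2. Write 0 1 2' 3 back.
Extremely slow.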

Slide 24

Slide 24 text

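NAND flash constraints
Block 0 holds pages 0 1 2 3 and block 1 holds pages 4 5 6 7. Rewriting 2 in place, as on a hard disk, means writing to the same page over and over (2', 2', ...). Page 2 wears out to an extreme degree and becomes unusable.
Extremely fragile.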

Slide 25

Slide 25 text

Flash Translation Layer
The SSD controller keeps zero-clearing free space whenever it is idle.
Diagram: block 0 (0 1 2 3), block 1 (4 5 6 7), block 2 (four empty pages).

Slide 26

Slide 26 text

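Flash Translation Layer
We want to rewrite 2 as 2'. When a write request arrives, the controller writes to a page that has already been zero-cleared.
Diagram: block 0 (0 1 2 3), block 1 (4 5 6 7), block 2 (2' and three empty pages).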

Slide 27

Slide 27 text

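Flash Translation Layer
The SSD keeps a translation table that records which physical address holds the page for each LBA.
Diagram: block 0 (0 1 2 3), block 1 (4 5 6 7), block 2 (2' and three empty pages). Translation table: 2 -> 8.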

Slide 28

Slide 28 text

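Translation table: translating between LBAs and physical addresses
Log:
  LBA 2 moved to physical address 8
  LBA 5 moved to physical address 9
  LBA 1 moved to address 10
  LBA 1 was TRIMmed
  LBA 2 moved to physical address 11
  LBA 3 moved to physical address 12
Translation table:
  LBA | Physical address
  2 | 11
  3 | 12
  5 | 9
The translation table is kept in both the device's RAM and its flash memory. Flash memory cannot be rewritten row by row, so every change is appended as a structured log.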
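How the table follows from the append-only log can be sketched as a replay. This is a simplified model, not the firmware of any real SSD; `MapRecord`, `replay`, and `slide_log` are hypothetical names, and the record layout is an assumption made for the example.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// One record of the append-only mapping log (hypothetical layout).
struct MapRecord {
  std::uint64_t lba;
  std::optional<std::uint64_t> physical;  // std::nullopt means the LBA was TRIMmed
};

// Rebuild the LBA -> physical address table by replaying the log in order.
std::map<std::uint64_t, std::uint64_t> replay(const std::vector<MapRecord>& log) {
  std::map<std::uint64_t, std::uint64_t> table;
  for (const auto& r : log) {
    if (r.physical) {
      table[r.lba] = *r.physical;  // a later record overrides an earlier one
    } else {
      table.erase(r.lba);          // TRIM drops the mapping
    }
  }
  return table;
}

// The six log records shown on this slide.
std::vector<MapRecord> slide_log() {
  return {{2, 8}, {5, 9}, {1, 10}, {1, std::nullopt}, {2, 11}, {3, 12}};
}
```

Replaying `slide_log()` leaves three entries, 2 -> 11, 3 -> 12, and 5 -> 9, matching the table on the slide; LBA 1 disappears because its last record is a TRIM.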

Slide 29

Slide 29 text

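Translation table: translating between LBAs and physical addresses
We want to read LBA 2.
Log:
  LBA 2 moved to physical address 8
  LBA 5 moved to physical address 9
  LBA 1 moved to address 10
  LBA 1 was TRIMmed
  LBA 2 moved to physical address 11
  LBA 3 moved to physical address 12
Translation table:
  LBA | Physical address
  2 | 11
  3 | 12
  5 | 9
Diagram: block 2 (2 5 1 2), block 3 (3 and three empty pages).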

Slide 30

Slide 30 text

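Translation table: translating between LBAs and physical addresses
We want to write to LBA 2.
Log:
  LBA 2 moved to physical address 8
  LBA 5 moved to physical address 9
  LBA 1 moved to address 10
  LBA 1 was TRIMmed
  LBA 2 moved to physical address 11
  LBA 3 moved to physical address 12
  LBA 2 moved to physical address 13
Translation table:
  LBA | Physical address
  2 | 13
  3 | 12
  5 | 9
Diagram: block 2 (2 5 1 2), block 3 (3 2 and two empty pages).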

Slide 31

Slide 31 text

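Translation table: translating between LBAs and physical addresses
We TRIM LBA 5.
Log:
  LBA 2 moved to physical address 8
  LBA 5 moved to physical address 9
  LBA 1 moved to address 10
  LBA 1 was TRIMmed
  LBA 2 moved to physical address 11
  LBA 3 moved to physical address 12
  LBA 2 moved to physical address 13
  LBA 5 was TRIMmed
Translation table:
  LBA | Physical address
  2 | 13
  3 | 12
Diagram: block 2 (2 5 1 2), block 3 (3 2 and two empty pages).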

Slide 32

Slide 32 text

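The FTL's garbage collector
When idle, the SSD controller zero-clears any block whose pages are all no longer referenced from the translation table.
Log:
  LBA 2 moved to physical address 8
  LBA 5 moved to physical address 9
  LBA 1 moved to address 10
  LBA 1 was TRIMmed
  LBA 2 moved to physical address 11
  LBA 3 moved to physical address 12
  LBA 2 moved to physical address 13
  LBA 5 was TRIMmed
Translation table:
  LBA | Physical address
  2 | 13
  3 | 12
Diagram: block 2 (four empty pages), block 3 (3 2 and two empty pages).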

Slide 33

Slide 33 text

The FTL's garbage collector
The case where zero-cleared pages are running low but every block is only partly in use.
Diagram: block 2 (1 2 1 2), block 3 (3 2 4 4).

Slide 34

Slide 34 text

The FTL's garbage collector
Only the valid pages of the partly used blocks are written to a new block.
Diagram: block 2 (1 2 1 2), block 3 (3 2 4 4), block 4 (1 and three empty pages).

Slide 35

Slide 35 text

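The FTL's garbage collector
The original block, now no longer needed, is zero-cleared.
Diagram: block 2 (four empty pages), block 3 (3 2 4 4), block 4 (1 and three empty pages).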
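The copy-then-erase step from the last three slides can be sketched as follows. It is a toy model that tracks only page states, not page contents or the translation table; `Page`, `Block`, and `collect` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// A page is erased, holds live data, or holds data nothing references any more.
enum class Page { Erased, Valid, Stale };
using Block = std::vector<Page>;

// Copy the valid pages of `victim` into erased pages of `spare`, then erase
// `victim` as a whole. Returns the number of pages copied, i.e. the extra
// NAND writes that garbage collection itself caused.
std::size_t collect(Block& victim, Block& spare) {
  std::size_t copied = 0;
  std::size_t dst = 0;
  for (Page& src : victim) {
    if (src != Page::Valid) continue;
    while (dst < spare.size() && spare[dst] != Page::Erased) ++dst;
    if (dst == spare.size()) return copied;  // no room left: victim cannot be erased yet
    spare[dst] = Page::Valid;
    src = Page::Stale;                       // the old copy is now unreferenced
    ++copied;
  }
  for (Page& p : victim) p = Page::Erased;   // erase works only on whole blocks
  return copied;
}
```

The return value is the point of the sketch: every page copied here is a write to NAND that no application asked for.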

Slide 36

Slide 36 text

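Flash Translation Layer
Item | Flash Translation Layer | File system
How data is looked up | LBA | File path
Decides the recording location | Yes | Yes
Cache | Yes | Yes
Transactions | None | None
State after an interruption | Some state in which the address translation table is not corrupted | Some state in which the file system is not corrupted
How that state is recovered | Structured log | Structured log
This resembles a file system journal, but the granularity of the log is one page.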

Slide 37

Slide 37 text

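The result is a log stacked on a log stacked on yet another log.
Diagram: user space (MySQL with InnoDB, MongoDB with WiredTiger) with a log; kernel space (VFS, file system, page cache, bio, device driver) with a log; hardware (Flash Translation Layer, NAND flash memory) with a log.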

Slide 38

Slide 38 text

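We want to write 1 2 3 4, and later we want to TRIM 1 2 3 4. A block whose pages are all TRIMmed together can be zero-cleared right away. Erasures are welcome when they arrive a whole block at a time.
Diagram: blocks 0, 1, and 2 before and after the write; existing pages 5 7 7 9 stay in place, 1 2 3 4 go into empty pages, and block 2 stays empty.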

Slide 39

Slide 39 text

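Log-structured file system
Events that happen to the file system are recorded on storage in chronological order:
  created file hoge
  wrote data to page 0 of file hoge
  wrote data to page 1 of file hoge
  created file fuga
  wrote data to page 0 of file fuga
  deleted file hoge
  wrote data to page 0 of file fuga
  ...
New operations are always appended to the end of the log.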

Slide 40

Slide 40 text

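The garbage collector of a log-structured file system
If writing simply continues, the storage space runs out. So the collector finds the events that no longer affect the current state of the file system:
  created file hoge
  wrote data to page 0 of file hoge
  wrote data to page 1 of file hoge
  created file fuga
  wrote data to page 0 of file fuga
  deleted file hoge
  wrote data to page 0 of file fuga
It then builds a new log into which only the entries that still matter are copied:
  created file fuga
  wrote data to page 0 of file fuga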
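One way to find the events that no longer matter is to scan the log from newest to oldest. The sketch below is an illustration under simplified assumptions (three event kinds, whole-page writes), not the algorithm of any particular file system; `Op`, `Event`, and `compact` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum class Op { Create, Write, Delete };
struct Event {
  Op op;
  std::string file;
  std::size_t page;  // used by Write only
};

// Return a new log holding only the events that still affect the current state.
std::vector<Event> compact(const std::vector<Event>& log) {
  std::set<std::string> closed;                           // files whose older events no longer matter
  std::set<std::pair<std::string, std::size_t>> written;  // pages already covered by a newer write
  std::vector<Event> kept;
  for (auto it = log.rbegin(); it != log.rend(); ++it) {  // newest event first
    if (closed.count(it->file)) continue;
    switch (it->op) {
      case Op::Delete:  // everything older about this file is dead, and so is the delete itself
        closed.insert(it->file);
        break;
      case Op::Create:  // keep the create; nothing older can concern this file
        closed.insert(it->file);
        kept.push_back(*it);
        break;
      case Op::Write:   // keep only the newest write to each page
        if (written.insert({it->file, it->page}).second) kept.push_back(*it);
        break;
    }
  }
  std::reverse(kept.begin(), kept.end());                 // restore chronological order
  return kept;
}
```

Applied to the seven events on the slide, this keeps exactly the two entries shown in the new log: the creation of fuga and the latest write to its page 0.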

Slide 41

Slide 41 text

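In a log-structured file system, a contiguous region is TRIMmed whenever the garbage collector runs. This is kind to the FTL's garbage collector: if that contiguous region is also contiguous in the SSD's physical address space, there is a good chance that whole blocks can be freed immediately.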

Slide 42

Slide 42 text

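Flash-Friendly File System (F2FS)
A log-structured file system that keeps multiple logs.
On-disk layout: SB, CP, SIT, NAT, SSA, Main. The Main area is divided into sections, and each log occupies sections.
Space for the logs is allocated in units of sections. The section size probably matches the block size, so the TRIMs issued during GC conveniently come in block units.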

Slide 43

Slide 43 text

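Three garbage collectors, each running independently, end up stacked.
Diagram: user space (MySQL with InnoDB, MongoDB with WiredTiger) with a GC; kernel space (VFS, file system, page cache, bio, device driver) with a GC; hardware (Flash Translation Layer, NAND flash memory) with a GC.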

Slide 44

Slide 44 text

https://www.usenix.org/node/187064 Don't Stack Your Log On My Log YANG, J., PLASSON, N., GILLIS, G., TALAGALA, N., AND SUNDARARAMAN, S. Don’t stack your log on my log. In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW) (2014).

Slide 45

Slide 45 text

Don't Stack Your Log On My Log (Yang et al., INFLOW 2014) https://www.usenix.org/node/187064
A paper showing that when structured logs are stacked several layers deep, writes to NAND keep growing and performance collapses.

Slide 46

Slide 46 text

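Write Amplification
A structured log has to write metadata in addition to the data it was asked to write.
Diagram: the storage engine wants to write "data"; the file system wants to write "meta0 + data".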

Slide 47

Slide 47 text

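Write Amplification
Metadata from an upper layer is just data to the layer below, so metadata gets metadata of its own.
Diagram: the storage engine wants to write "data"; the file system wants to write "meta0 + data + meta1"; by the time the write reaches the Flash Translation Layer it carries the data together with meta0 through meta5.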

Slide 48

Slide 48 text

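Write Amplification
The log written in section 1 is no longer needed, so it is TRIMmed. If the section size and the block size differ, blocks that are TRIMmed only part of the way appear.
Diagram: the file system's log laid out over sections 0 to 3, and the same log in the Flash Translation Layer laid out over blocks 0 to 2.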
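The effect of a size mismatch can be checked with a little arithmetic: for a TRIMmed range, count how many blocks are covered completely and how many are only partly covered. This is an illustrative calculation in page units; `TrimEffect` and `trim_effect` are hypothetical names, and the sizes used below are made up.

```cpp
#include <cstdint>

struct TrimEffect {
  std::uint64_t whole_blocks;    // blocks that can be erased without copying anything
  std::uint64_t partial_blocks;  // blocks left partly in use
};

// `offset` and `length` describe the TRIMmed range; all values share one unit (e.g. pages).
TrimEffect trim_effect(std::uint64_t offset, std::uint64_t length, std::uint64_t block_size) {
  if (length == 0) return {0, 0};
  const std::uint64_t end = offset + length;                                 // one past the range
  const std::uint64_t first_whole = (offset + block_size - 1) / block_size;  // first block fully inside
  const std::uint64_t last_whole = end / block_size;                         // one past the last block fully inside
  const std::uint64_t touched = (end - 1) / block_size - offset / block_size + 1;
  const std::uint64_t whole = last_whole > first_whole ? last_whole - first_whole : 0;
  return {whole, touched - whole};
}
```

With 3-page sections on 4-page blocks, TRIMming section 1 (pages 3 to 5) frees no block and leaves two blocks partly in use; with 4-page sections, TRIMming section 1 (pages 4 to 7) frees exactly one block and leaves none half-used.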

Slide 49

Slide 49 text

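Write Amplification
To carve zero-cleared space out of blocks that are TRIMmed only part of the way, the garbage collector copies the pages still in use to a new block.
Diagram: the log in the Flash Translation Layer spread over blocks 0 to 4.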

Slide 50

Slide 50 text

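Write Amplification
A storage engine normally does not tell the layers below that a log can be TRIMmed.
Diagram: in the storage engine, pages 0 to 3 of the log are a used-up log. In the file system, the same pages count as in use because the file still exists. In the Flash Translation Layer, the pages are in use, so they are copied to another block.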

Slide 51

Slide 51 text

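Write Amplification
Garbage collection of a structured log takes only the elements that are still valid out of the existing log and copies them to a new log. This produces new writes to the log in the layer below.
Diagram: the upper log 0 1 2 3 4 5 6 is collected into a new log holding 2 and 6; the lower log 0 1 2 3 4 5 6 receives 2 and 6 as fresh writes.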

Slide 52

Slide 52 text

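Write Amplification
If the upper layer's garbage collector runs right after the lower layer's garbage collector has run, it causes a burst of writes to the lower log that garbage collection has only just made room in, and on top of that...
Diagram: the upper log 0 1 2 3 4 5 6 is rewritten as 0 1 2 3 5 6; the lower log, which has just been collected, receives all of 0 1 2 3 5 6 again.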

Slide 53

Slide 53 text

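Write Amplification
...it leaves a large number of elements waiting for garbage collection in the lower log that has only just been cleaned. The lower layer's preceding garbage collection is wiped out in an instant.
Diagram: the upper log 0 1 2 3 4 5 6 is rewritten as 0 1 2 3 5 6; the lower log now holds both the old and the new copies of 0 1 2 3 5 6.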

Slide 54

Slide 54 text

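Write Amplification
When these effects combine, in a situation where multiple structured logs are stacked, the amount of data actually written to NAND swells, in bad cases, to more than twice the size of the data that was requested to be written.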
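The ratio described here is commonly called the write amplification factor. The notation below is introduced for illustration only: $B_{\mathrm{req}}$ is the number of bytes the application asked to write, $B_{\mathrm{NAND}}$ is the number of bytes that actually reach NAND, and $a_i$ is the amplification contributed by layer $i$. The product form is a simplifying assumption that each layer amplifies independently; it is not a result quoted from the paper.

```latex
\mathrm{WAF} = \frac{B_{\mathrm{NAND}}}{B_{\mathrm{req}}},
\qquad
\mathrm{WAF}_{\mathrm{stack}} \approx \prod_{i=1}^{n} a_i
```

Under that assumption, three stacked logs that each write 30% more than they receive would give $1.3^3 \approx 2.2$, which is in the "more than twice" range the slide mentions.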

Slide 55

Slide 55 text

Write Amplification: how to avoid it
1. Do not stack structured logs.
2. If they really have to be stacked, align the block sizes.
3. TRIM logs once you are done with them.

Slide 56

Slide 56 text

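Let's drop the file system
Diagram: the stack (MySQL with InnoDB, MongoDB with WiredTiger; VFS, file system, page cache, bio, device driver; Flash Translation Layer, NAND flash memory) with its three logs. The storage engine's log is needed to implement transactions, the FTL's log is a function of the hardware, and the file system's log is the one we want to throw away.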

Slide 57

Slide 57 text

HSE uses the mpool kernel module.
Diagram: user space (MySQL with InnoDB, MongoDB with WiredTiger, HSE); kernel space (VFS, file system, page cache, bio, device driver, mpool); hardware (Flash Translation Layer, NAND flash memory).

Slide 58

Slide 58 text

mpool runs on top of a block device.
Diagram: user space (MySQL with InnoDB, MongoDB with WiredTiger, HSE); kernel space (VFS, file system, page cache, bio, device driver, mpool); hardware (Flash Translation Layer, NAND flash memory).

Slide 59

Slide 59 text

Create an mpool device by specifying a block device
  root # modprobe mpool
  root # ls /dev/mpool*
  /dev/mpoolctl
  root # mpool create mp1 /dev/nvme0n1 uid=test gid=test mode=0600
  root # ls /dev/mpool*
  /dev/mpoolctl
  /dev/mpool:
  mp1
  root # mpool list
  MPOOL  TOTAL  USED   AVAIL  CAPACITY  LABEL  HEALTH
  mp1    466g   1.16g  441g   0.26%     raw    optimal

Slide 60

Slide 60 text

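mpool provides three features: mblock, mlog, and mcache.
Diagram: in user space, HSE sits on the mpool user-space library, which also provides mdc; in kernel space, the mpool kernel module implements mblock, mlog, and mcache, each reached through ioctl.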

Slide 61

Slide 61 text

mblock API
mpool_open opens the mpool device. An mblock stores in the mpool a byte sequence whose length is an integer multiple of the page size. An mblock can be written only once, at creation time; it cannot be modified or appended to afterwards, but it can be deleted.
  mpool *raw_pool = nullptr;
  SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(), O_RDWR, &raw_pool, nullptr ) );
  std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } );
  uint64_t block_id = 0u;
  mblock_props props;

Slide 62

Slide 62 text

mblock API
mpool_mblock_alloc creates a new mblock. The 64-bit block id returned here is something like a file descriptor.
  mpool *raw_pool = nullptr;
  SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(), O_RDWR, &raw_pool, nullptr ) );
  std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } );
  uint64_t block_id = 0u;
  mblock_props props;
  size_t length = 0;
  if( !params.count( "object" ) ) {
    memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) );
    SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;

Slide 63

Slide 63 text

mblock API
The object id obtained at the same time, on the other hand, is something like a file name. Use the object id when looking this mblock up later.
  size_t length = 0;
  if( !params.count( "object" ) ) {
    memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) );
    SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;

Slide 64

Slide 64 text

mblock API
Data written to an mblock must be aligned to a page boundary. mpool writes do not go through a page cache; the kernel hands the memory allocated here directly to the device driver.
  mblock_props props;
  size_t length = 0;
  if( !params.count( "object" ) ) {
    memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) );
    SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
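The buffer preparation in the excerpt (round the length up to a page multiple, allocate page-aligned memory, zero it, copy the message in) can be isolated into a small helper that runs without mpool. This is a sketch only: `kPageSize`, `round_up_to_page`, `FreeDeleter`, and `make_page_buffer` are names made up for the example, and the page size is assumed to be 4096 bytes rather than queried from the system.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>
#include <string>

// Assumed page size for the example; real code should query it (e.g. sysconf(_SC_PAGESIZE)).
constexpr std::size_t kPageSize = 4096;

// Round n up to the next multiple of the page size.
constexpr std::size_t round_up_to_page(std::size_t n) {
  return (n + kPageSize - 1) / kPageSize * kPageSize;
}

// Memory from std::aligned_alloc must be released with std::free.
struct FreeDeleter {
  void operator()(void* p) const noexcept { std::free(p); }
};
using AlignedBuffer = std::unique_ptr<char, FreeDeleter>;

// Allocate a zero-filled, page-aligned buffer holding `message`.
// The padded size is returned through `size_out`.
AlignedBuffer make_page_buffer(const std::string& message, std::size_t& size_out) {
  size_out = round_up_to_page(message.size());
  if (size_out == 0) size_out = kPageSize;  // avoid a zero-sized allocation
  AlignedBuffer buf(static_cast<char*>(std::aligned_alloc(kPageSize, size_out)));
  if (!buf) throw std::bad_alloc();
  std::memset(buf.get(), 0, size_out);
  std::memcpy(buf.get(), message.data(), message.size());
  return buf;
}
```

The size passed to `std::aligned_alloc` is always a multiple of the alignment here, which is what C++17 expects of that function.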

Slide 65

Slide 65 text

mblock API
mpool_mblock_write writes data to the mblock. By preparing several iovecs, data from multiple memory regions can be combined into a single write.
    SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    length = buf_size;
    SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 ) )
    if( abort_transaction )

Slide 66

Slide 66 text

mblock API
mpool_mblock_commit makes the changes final. If this function is never reached, the preceding mpool_mblock_write calls are treated as if they had never happened. mpool_mblock_abort explicitly discards the changes made so far.
    PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    length = buf_size;
    SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 ) )
    if( abort_transaction )
      SAFE_CALL( mpool_mblock_abort( pool.get(), block_id ) )
    else
      SAFE_CALL( mpool_mblock_commit( pool.get(), block_id ) )
  }
  else {
    uint64_t object_id = params[ "object" ].as< uint64_t >();

Slide 67

Slide 67 text

mblock API
To look up an mblock that has already been written, use mpool_mblock_find_get.
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    length = buf_size;
    SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 ) )
    if( abort_transaction )
      SAFE_CALL( mpool_mblock_abort( pool.get(), block_id ) )
    else
      SAFE_CALL( mpool_mblock_commit( pool.get(), block_id ) )
  }
  else {
    uint64_t object_id = params[ "object" ].as< uint64_t >();
    SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id, &block_id, &props ) )
    length = props.mpr_write_len;
    std::cout << "object id: " << object_id << std::endl;
  }

Slide 68

Slide 68 text

mblock API
mpool_mblock_read reads the data. The buffer used for reading must also be aligned to a page boundary.
    SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id, &block_id, &props ) )
    length = props.mpr_write_len;
    std::cout << "object id: " << object_id << std::endl;
  }
  {
    size_t buf_size = length;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    SAFE_CALL( mpool_mblock_read( pool.get(), block_id, &iov, 1, 0 ) )
    std::cout << "length: " << length << std::endl;
    std::cout << "data: " << buf.get() << std::endl;
  }

Slide 69

Slide 69 text

mblock API
mpool_mblock_delete deletes the specified mblock as a whole.
  {
    size_t buf_size = length;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    SAFE_CALL( mpool_mblock_read( pool.get(), block_id, &iov, 1, 0 ) )
    std::cout << "length: " << length << std::endl;
    std::cout << "data: " << buf.get() << std::endl;
  }
  if( delete_block )
    SAFE_CALL( mpool_mblock_delete( pool.get(), block_id ) )

Slide 70

Slide 70 text

mlog API
Opening the mpool device with mpool_open is the same as for mblock. An mlog stores in the mpool a byte sequence that can be appended to later. The maximum size of an mlog is fixed when it is created; once it has been appended to up to that size, nothing more can be written.
  mpool *raw_pool = nullptr;
  SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(), O_RDWR|O_EXCL, &raw_pool, nullptr ) );
  std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } );
  mlog_capacity cap;
  memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) );

Slide 71

Slide 71 text

mlog API
mpool_mlog_alloc creates a new mlog. The capacity passed in gives the size of the region to use (an integer multiple of the page size).
  std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } );
  mlog_capacity cap;
  memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) );
  std::shared_ptr< mpool_mlog > log;
  if( !params.count( "object" ) ) {
    cap.lcp_captgt = 4 * 1024 * 1024;
    mlog_props props;
    memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) );
    mpool_mlog *raw_log = nullptr;
    SAFE_CALL( mpool_mlog_alloc( pool.get(), &cap, MP_MED_CAPACITY, &props, &raw_log ) );
    log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p ) mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )

Slide 72

Slide 72 text

mlog API
To look up an existing mlog, use mpool_mlog_find_get. Both mpool_mlog_alloc and mpool_mlog_find_get return an mlog_props.
    log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p ) mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
  }
  else {
    mlog_props props;
    mpool_mlog *raw_log = nullptr;
    SAFE_CALL( mpool_mlog_find_get( pool.get(), params[ "object" ].as< uint64_t >(), &props, &raw_log ) )
    log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p ) mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
  }
  uint64_t gen = 0;
  SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )

Slide 73

Slide 73 text

mlog API
With the mpool_mlog in hand, open the log with mpool_mlog_open.
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
  }
  else {
    mlog_props props;
    mpool_mlog *raw_log = nullptr;
    SAFE_CALL( mpool_mlog_find_get( pool.get(), params[ "object" ].as< uint64_t >(), &props, &raw_log ) )
    log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p ) mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
  }
  uint64_t gen = 0;
  SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
  if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string > >() )
      SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(),

Slide 74

Slide 74 text

mlog API
mpool_mlog_append_data appends a byte sequence to the mlog. The bytes being written do not have to be aligned to a page boundary.
    mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
  }
  uint64_t gen = 0;
  SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
  if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string > >() )
      SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
  if( abort_transaction )
    SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) )
  else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
  if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(),

Slide 75

Slide 75 text

mlog API
mpool_mlog_commit makes the changes final. If this function is never reached, the preceding mpool_mlog_append_data calls are treated as if they had never happened. mpool_mlog_abort explicitly discards the changes made so far.
  }
  uint64_t gen = 0;
  SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
  if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string > >() )
      SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
  if( abort_transaction )
    SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) )
  else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
  if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(), erase_log ) )
  bool empty = false;
  SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) )
  std::cout << "empty: " << empty << std::endl;

Slide 76

Slide 76 text

mlog API
mpool_mlog_erase erases the specified mlog in its entirety.
  SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
  if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string > >() )
      SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
  if( abort_transaction )
    SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) )
  else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
  if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(), erase_log ) )
  bool empty = false;
  SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) )
  std::cout << "empty: " << empty << std::endl;
  size_t len = 0;
  SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) )

Slide 77

Slide 77 text

mlog API
mpool_mlog_empty reports whether the mlog is empty. mpool_mlog_len returns the size of the region of the mlog that is already in use.
  else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
  if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(), erase_log ) )
  bool empty = false;
  SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) )
  std::cout << "empty: " << empty << std::endl;
  size_t len = 0;
  SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) )
  std::cout << "length: " << len << std::endl;
  SAFE_CALL( mpool_mlog_read_data_init( pool.get(), log.get() ) )
  while( 1 ) {
    std::array< char, 1024u > buf;
    size_t length = 0u;
    SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(), buf.data(), buf.size() - 1, &length ) );
    if( !length ) break;
    buf[ length ] = '\0';

Slide 78

Slide 78 text

mlog API
mpool_mlog_read_data_init prepares for reading, and mpool_mlog_read_data_next then returns what was written, in order from the beginning.
  std::cout << "empty: " << empty << std::endl;
  size_t len = 0;
  SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) )
  std::cout << "length: " << len << std::endl;
  SAFE_CALL( mpool_mlog_read_data_init( pool.get(), log.get() ) )
  while( 1 ) {
    std::array< char, 1024u > buf;
    size_t length = 0u;
    SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(), buf.data(), buf.size() - 1, &length ) );
    if( !length ) break;
    buf[ length ] = '\0';
    std::cout << "data: " << buf.data() << std::endl;
  }
  SAFE_CALL( mpool_mlog_flush( pool.get(), log.get() ) )
  SAFE_CALL( mpool_mlog_close( pool.get(), log.get() ) )
  if( delete_log )
    SAFE_CALL( mpool_mlog_delete( pool.get(), log.get() ) )

Slide 79

Slide 79 text

mlog API
To delete an mlog, use mpool_mlog_delete.
    SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(), buf.data(), buf.size() - 1, &length ) );
    if( !length ) break;
    buf[ length ] = '\0';
    std::cout << "data: " << buf.data() << std::endl;
  }
  SAFE_CALL( mpool_mlog_flush( pool.get(), log.get() ) )
  SAFE_CALL( mpool_mlog_close( pool.get(), log.get() ) )
  if( delete_log )
    SAFE_CALL( mpool_mlog_delete( pool.get(), log.get() ) )

Slide 80

Slide 80 text

mdc API
MetaData Container, MDC for short: two mlogs combined so that garbage collection becomes possible. Opening the mpool device with mpool_open is the same as for mlog.
  mpool *raw_pool = nullptr;
  SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(), O_RDWR|O_EXCL, &raw_pool, nullptr ) );
  std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } );
  uint64_t log1 = 0;
  uint64_t log2 = 0;
  if( !params.count( "object" ) ) {
    mdc_capacity cap;

Slide 81

Slide 81 text

mdc API
mpool_mdc_alloc creates an mdc. Two mlogs are created and two object ids come back. The mdc_capacity argument specifies the size of each mlog.
  SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(), O_RDWR|O_EXCL, &raw_pool, nullptr ) );
  std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } );
  uint64_t log1 = 0;
  uint64_t log2 = 0;
  if( !params.count( "object" ) ) {
    mdc_capacity cap;
    memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) );
    cap.mdt_captgt = 4 * 1024 * 1024;
    SAFE_CALL( mpool_mdc_alloc( pool.get(), &log1, &log2, MP_MED_CAPACITY, &cap, nullptr ) );
    std::cout << "object id: " << log1 << ":" << log2 << std::endl;
    SAFE_CALL( mpool_mdc_commit( pool.get(), log1, log2 ) )
  }
  else {
    auto v = params[ "object" ].as< std::string >();
    boost::fusion::vector< uint64_t, uint64_t > parsed;
    namespace qi = boost::spirit::qi;
    if( !qi::parse( v.begin(), v.end(), qi::ulong_long >> ':' >> qi::ulong_long, parsed ) ) {

Slide 82

Slide 82 text

mdc API
mpool_mdc_open opens the mdc. An opened mdc is closed with mpool_mdc_close.
    boost::fusion::vector< uint64_t, uint64_t > parsed;
    namespace qi = boost::spirit::qi;
    if( !qi::parse( v.begin(), v.end(), qi::ulong_long >> ':' >> qi::ulong_long, parsed ) ) {
      std::cerr << "invalid object id" << std::endl;
      return 1;
    }
    log1 = boost::fusion::at_c< 0 >( parsed );
    log2 = boost::fusion::at_c< 1 >( parsed );
  }
  mpool_mdc *raw_log = nullptr;
  SAFE_CALL( mpool_mdc_open( pool.get(), log1, log2, 0, &raw_log ) );
  std::shared_ptr< mpool_mdc > log( raw_log, [pool]( mpool_mdc *p ) { if( p ) mpool_mdc_close( p ); } );
  if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string > >() )
      SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
  if( params.count( "compact" ) ) {
    auto v = params[ "compact" ].as< std::vector< std::string > >();

Slide 83

Slide 83 text

mdc API
mpool_mdc_append appends a byte sequence to whichever mlog is active. The bytes being written do not have to be aligned to a page boundary. Of the two mlogs, only one is active at any time.
Diagram: an mdc made of two mlogs; one of them holds records 1 2 3 4 5.
      return 1;
    }
    log1 = boost::fusion::at_c< 0 >( parsed );
    log2 = boost::fusion::at_c< 1 >( parsed );
  }
  mpool_mdc *raw_log = nullptr;
  SAFE_CALL( mpool_mdc_open( pool.get(), log1, log2, 0, &raw_log ) );
  std::shared_ptr< mpool_mdc > log( raw_log, [pool]( mpool_mdc *p ) { if( p ) mpool_mdc_close( p ); } );
  if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string > >() )
      SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
  if( params.count( "compact" ) ) {
    auto v = params[ "compact" ].as< std::vector< std::string > >();
    std::sort( v.begin(), v.end() );
    std::vector< std::vector< char > > bufs;
    SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    while( 1 ) {
      std::vector< char > buf( 4096, 0 );
      size_t size = 0;

Slide 84

Slide 84 text

mdc API
mpool_mdc_rewind moves to the start of the active log. Each call to mpool_mdc_read then returns the next log record in order.
    }
    SAFE_CALL( mpool_mdc_cend( log.get() ) )
  }
  SAFE_CALL( mpool_mdc_rewind( log.get() ) )
  while( 1 ) {
    std::vector< char > buf( 4096, 0 );
    size_t size = 0;
    auto e = mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size );
    if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
      buf.resize( size + 1, 0 );
      SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) );
    }
    else SAFE_CALL( e )
    if( !size ) break;
    std::cout << "data: " << buf.data() << std::endl;
  }
  if( delete_log ) {
    log.reset();
    SAFE_CALL( mpool_mdc_destroy( pool.get(), log1, log2 ) )
  }

Slide 85

Slide 85 text

mdc API
To garbage-collect, first read out the log records that are still valid.
Diagram: an mdc made of two mlogs; the active one holds 1 2 3 4 5, of which 1 and 3 are read out.
  if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string > >() )
      SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
  if( params.count( "compact" ) ) {
    auto v = params[ "compact" ].as< std::vector< std::string > >();
    std::sort( v.begin(), v.end() );
    std::vector< std::vector< char > > bufs;
    SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    while( 1 ) {
      std::vector< char > buf( 4096, 0 );
      size_t size = 0;
      auto e = mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size );
      if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
        buf.resize( size + 1, 0 );
        SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) );
      }
      else SAFE_CALL( e )

Slide 86

Slide 86 text

mdc API
mpool_mdc_cstart switches the active mlog; after that, mpool_mdc_append writes the valid log records.
Diagram: an mdc made of two mlogs; the old one holds 1 2 3 4 5, and 1 and 3 are written into the newly active one.
      if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
        buf.resize( size + 1, 0 );
        SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) );
      }
      else SAFE_CALL( e )
      if( !size ) break;
      if( std::binary_search( v.begin(), v.end(), std::string( buf.data() ) ) ) {
        buf.resize( size );
        bufs.emplace_back( std::move( buf ) );
      }
    }
    SAFE_CALL( mpool_mdc_cstart( log.get() ) )
    for( const auto &buf: bufs ) {
      SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( buf.data() ) ), buf.size(), 0 ) )
    }
    SAFE_CALL( mpool_mdc_cend( log.get() ) )
  }
  SAFE_CALL( mpool_mdc_rewind( log.get() ) )

Slide 87

Slide 87 text

mdc API
Finally, mpool_mdc_cend TRIMs the inactive log.
Diagram: an mdc made of two mlogs; only 1 and 3 remain, in the active one.
      if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
        buf.resize( size + 1, 0 );
        SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) );
      }
      else SAFE_CALL( e )
      if( !size ) break;
      if( std::binary_search( v.begin(), v.end(), std::string( buf.data() ) ) ) {
        buf.resize( size );
        bufs.emplace_back( std::move( buf ) );
      }
    }
    SAFE_CALL( mpool_mdc_cstart( log.get() ) )
    for( const auto &buf: bufs ) {
      SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( buf.data() ) ), buf.size(), 0 ) )
    }
    SAFE_CALL( mpool_mdc_cend( log.get() ) )
  }
  SAFE_CALL( mpool_mdc_rewind( log.get() ) )
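The read, switch, append, discard sequence of the last three slides can be modeled in memory to make the two-log idea concrete. This is a stand-in that never touches mpool; `TwoLogContainer` and its members are hypothetical names, and clearing a vector merely stands in for the TRIM that mpool_mdc_cend is described as performing.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// In-memory model of a container built from two logs, one of which is active.
class TwoLogContainer {
 public:
  // Append a record to the active log.
  void append(std::string record) { logs_[active_].push_back(std::move(record)); }

  // Records currently held by the active log, oldest first.
  const std::vector<std::string>& records() const { return logs_[active_]; }

  // Keep only the records for which `is_valid` returns true.
  void compact(const std::function<bool(const std::string&)>& is_valid) {
    // 1. Read the records that are still valid out of the active log.
    std::vector<std::string> valid;
    for (const auto& r : logs_[active_]) {
      if (is_valid(r)) valid.push_back(r);
    }
    // 2. Switch the active log (the "compaction start" step).
    const std::size_t old_log = active_;
    active_ = 1 - active_;
    logs_[active_].clear();
    // 3. Append the valid records to the newly active log.
    for (auto& r : valid) logs_[active_].push_back(std::move(r));
    // 4. Discard the now inactive log (the "compaction end" step).
    logs_[old_log].clear();
  }

 private:
  std::array<std::vector<std::string>, 2> logs_;
  std::size_t active_ = 0;
};
```

Appending 1 through 5 and compacting with a predicate that accepts only 1 and 3 leaves exactly those two records in the active log, as in the diagram. Because the old log is dropped in one piece, the layer below sees a single large discard rather than scattered small ones.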

Slide 88

Slide 88 text

mdc API
mpool_mdc_destroy deletes the two mlogs together.
    if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
      buf.resize( size + 1, 0 );
      SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) );
    }
    else SAFE_CALL( e )
    if( !size ) break;
    std::cout << "data: " << buf.data() << std::endl;
  }
  if( delete_log ) {
    log.reset();
    SAFE_CALL( mpool_mdc_destroy( pool.get(), log1, log2 ) )
  }

Slide 89

Slide 89 text

mcache API
An mblock has no page cache. When data that is read repeatedly should stay in memory, mcache is used to create a page cache for it. As before, the first step is to open the mpool device with mpool_open.
  mpool *raw_pool = nullptr;
  SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(), O_RDWR, &raw_pool, nullptr ) );
  std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } );
  std::vector< uint64_t > object_ids = params[ "object" ].as< std::vector< uint64_t > >();

Slide 90

Slide 90 text

mcache API
mpool_mcache_mmap takes the object ids of the mblocks to be placed in the mcache. To stop caching, call mpool_mcache_munmap.
    uint64_t block_id = 0;
    SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id, &block_id, &props ) )
    return props;
  } );
  {
    mpool_mcache_map *raw_map;
    SAFE_CALL( mpool_mcache_mmap( pool.get(), object_ids.size(), object_ids.data(), MPC_VMA_WARM, &raw_map ) );
    std::shared_ptr< mpool_mcache_map > map( raw_map, [pool]( mpool_mcache_map *p ) { if( p ) mpool_mcache_munmap( p ); } );
    for( uint64_t cache_id = 0; cache_id != object_ids.size(); ++cache_id ) {
      SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0, props[ cache_id ].mpr_write_len, MADV_WILLNEED ) )
      size_t offset = 0u;

Slide 91

Slide 91 text

mcache API
mpool_mcache_madvise announces that the mblock at index cache_id will be needed soon. mpool_mcache_getpages returns the address of the page cache.
    } );
    for( uint64_t cache_id = 0; cache_id != object_ids.size(); ++cache_id ) {
      SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0, props[ cache_id ].mpr_write_len, MADV_WILLNEED ) )
      size_t offset = 0u;
      void *page = nullptr;
      SAFE_CALL( mpool_mcache_getpages( map.get(), 1, cache_id, &offset, &page ) );
      char *data = reinterpret_cast< char* >( page );
      std::cout << "length: " << props[ cache_id ].mpr_write_len << std::endl;
      std::cout << "data: " << data << std::endl;
    }
  }

Slide 92

Slide 92 text

mcache API

Key point: the application controls when an mcache is created and destroyed, so this cache can be used directly as the storage engine's cache.

          if( p ) mpool_mcache_munmap( p );
        } );
        for( uint64_t cache_id = 0; cache_id != object_ids.size(); ++cache_id ) {
          SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0, props[ cache_id ].mpr_write_len, MADV_WILLNEED ) )
          size_t offset = 0u;
          void *page = nullptr;
          SAFE_CALL( mpool_mcache_getpages( map.get(), 1, cache_id, &offset, &page ) );
          char *data = reinterpret_cast< char* >( page );
          std::cout << "length: " << props[ cache_id ].mpr_write_len << std::endl;
          std::cout << "data: " << data << std::endl;
        }
      }

Slide 93

Slide 93 text

Except for mdc, mpool operations map directly onto ioctls, which in turn call functions in kernel space.

mpool-kmod/src/mpctl.c
static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)

    switch (cmd) {
    case MPIOC_MP_CREATE:
    case MPIOC_MP_ACTIVATE:
    case MPIOC_MP_DESTROY:
    case MPIOC_MP_RENAME:
        err = mpioc_mp_cmd(unit, cmd, argp);
        break;
    case MPIOC_MP_DEACTIVATE:
        err = mpioc_mp_deactivate(unit, cmd, argp);
        break;
    case MPIOC_DRV_ADD:
        err = mpioc_mp_add(unit, cmd, argp);
        break;
    case MPIOC_PARAMS_SET:
        err = mpioc_params_set(unit, cmd, argp);
        break;
    case MPIOC_PARAMS_GET:
        err = mpioc_params_get(unit, cmd, argp);
        break;
    case MPIOC_MP_MCLASS_GET:
        err = mpioc_mp_mclass_get(unit, cmd, argp);
        break;
    case MPIOC_PROP_GET:
        err = mpioc_proplist_get(unit, cmd, argp);
        break;
    case MPIOC_DEVPROPS_GET:
        err = mpioc_devprops_get(unit, argp);
        break;
    case MPIOC_MB_ALLOC:

Slide 94

Slide 94 text

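mpool superblock

The mapping between object IDs and their placement on storage is held in a kernel red-black tree (rbtree).

[Diagram: an rbtree whose nodes 1, 2, 3 refer to objects 1, 2, 3 on storage]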

Slide 95

Slide 95 text

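mpool superblock

Changes to this red-black tree are recorded in an mdc placed at the head of the mpool (mdc0). When the mpool is activated, this log is read through to build the rbtree.

[Diagram: the rbtree, objects 1, 2, 3 on storage, and mdc0]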

Slide 96

Slide 96 text

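Key point

Unlike file system metadata, mdc0 holds little more than the object ID, location, and size. As a result, no matter how much the mdcs other than mdc0 are modified, nothing is appended to the mdc0 log. This avoids stacking one log on top of another.

[Diagram: objects 1, 2, 3 on storage and mdc0]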
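The activate-time rebuild described in the superblock slides can be sketched as a log replay. This is a minimal illustration, not mpool's code: the record layout, the `op_t` enum, and the function name `replay` are all hypothetical, and `std::map` (commonly implemented as a red-black tree) stands in for the kernel rbtree.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record layout. The slides only say that mdc0 holds roughly
// "object id, location, size"; the field names and op_t are assumptions.
enum class op_t { put, erase };

struct placement_t {
  std::uint64_t offset; // location on the storage device
  std::uint64_t size;   // length in bytes
};

struct log_record_t {
  op_t op;
  std::uint64_t object_id;
  placement_t placement; // ignored for erase records
};

using object_map_t = std::map< std::uint64_t, placement_t >;

// Replay the log from oldest to newest record to rebuild the in-memory map.
// Later records for the same object ID override earlier ones.
inline object_map_t replay( const std::vector< log_record_t > &log ) {
  object_map_t m;
  for( const auto &r: log ) {
    if( r.op == op_t::put ) m[ r.object_id ] = r.placement;
    else m.erase( r.object_id );
  }
  return m;
}
```

Because each record carries only an ID, a location, and a size, writing to the objects themselves never adds records to this log; only creating, moving, or deleting an object does.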

Slide 97

Slide 97 text

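Put together like this, the pieces form a storage engine.

[Diagram: the earlier storage-engine picture (blocks 0, 1, 2', 4, 5; root; log) annotated with mcache, mblock, and mdc]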

Slide 98

Slide 98 text

HSE API

hse_kvdb_init prepares HSE for use. Clean up with hse_kvdb_fini.

    HSE_SAFE_CALL( hse_kvdb_init() );
    std::shared_ptr< void > context( nullptr, []( void* ) { hse_kvdb_fini(); } );
    const std::string pool_name = params[ "pool" ].as< std::string >();
    if( create_kvdb )
      HSE_SAFE_CALL( hse_kvdb_make( pool_name.c_str(), nullptr ) );
    hse_kvdb *raw_kvdb = nullptr;
    HSE_SAFE_CALL( hse_kvdb_open( pool_name.c_str(), nullptr, &raw_kvdb ) );
    std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb *p ) { if( p ) hse_kvdb_close( p ); } );
    const std::string kvs_name = params[ "kvs" ].as< std::string >();
    if( create_kvs )
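The `context` object in this excerpt works as a scope guard: a `std::shared_ptr` constructed from `nullptr` plus a deleter still runs the deleter when its last owner goes away, and capturing `context` in the `kvdb` deleter keeps it alive until the handle has been closed. The sketch below shows only that ordering. It does not call HSE; the strings "close" and "fini" and the function name `teardown_order` are stand-ins chosen for illustration.

```cpp
#include <memory>
#include <string>
#include <vector>

// Returns the order in which the two deleters run. "close" stands in for
// hse_kvdb_close and "fini" for hse_kvdb_fini; neither is called here.
inline std::vector< std::string > teardown_order() {
  std::vector< std::string > calls;
  {
    // Owns nullptr together with a deleter, so the deleter still runs once
    // the last copy of `context` is destroyed.
    std::shared_ptr< void > context(
      nullptr, [&calls]( void* ) { calls.push_back( "fini" ); } );
    int dummy = 0; // placeholder for an opened handle
    // Capturing `context` by value means "fini" cannot run before this
    // deleter, and the copy it holds, has been destroyed.
    std::shared_ptr< int > handle(
      &dummy, [context, &calls]( int* ) { calls.push_back( "close" ); } );
  }
  return calls;
}
```

With this arrangement the handle is always closed before the library-wide cleanup runs, regardless of where copies of the handle end up.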

Slide 99

Slide 99 text

HSE API

hse_kvdb_make creates a kvdb in the specified mpool. hse_kvdb_open opens the kvdb. A kvdb can contain multiple kvs (tables).

[Diagram: an mpool containing a kvdb, which contains several kvs, each holding key/data pairs]

    std::shared_ptr< void > context( nullptr, []( void* ) { hse_kvdb_fini(); } );
    const std::string pool_name = params[ "pool" ].as< std::string >();
    if( create_kvdb )
      HSE_SAFE_CALL( hse_kvdb_make( pool_name.c_str(), nullptr ) );
    hse_kvdb *raw_kvdb = nullptr;
    HSE_SAFE_CALL( hse_kvdb_open( pool_name.c_str(), nullptr, &raw_kvdb ) );
    std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb *p ) { if( p ) hse_kvdb_close( p ); } );
    const std::string kvs_name = params[ "kvs" ].as< std::string >();
    if( create_kvs )
      HSE_SAFE_CALL( hse_kvdb_kvs_make( kvdb.get(), kvs_name.c_str(), nullptr ) );
    hse_kvs *raw_kvs;
    HSE_SAFE_CALL( hse_kvdb_kvs_open( kvdb.get(), kvs_name.c_str(), nullptr, &raw_kvs ) );

Slide 100

Slide 100 text

HSE API

hse_kvdb_kvs_make creates a kvs in the specified kvdb. hse_kvdb_kvs_open opens the kvs.

[Diagram: an mpool containing a kvdb, which contains a kvs holding key/data pairs]

    std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb *p ) { if( p ) hse_kvdb_close( p ); } );
    const std::string kvs_name = params[ "kvs" ].as< std::string >();
    if( create_kvs )
      HSE_SAFE_CALL( hse_kvdb_kvs_make( kvdb.get(), kvs_name.c_str(), nullptr ) );
    hse_kvs *raw_kvs;
    HSE_SAFE_CALL( hse_kvdb_kvs_open( kvdb.get(), kvs_name.c_str(), nullptr, &raw_kvs ) );
    std::shared_ptr< hse_kvs > kvs( raw_kvs, [kvdb]( hse_kvs *p ) { if( p ) hse_kvdb_kvs_close( p ); } );
    hse_kvdb_opspec os;
    HSE_KVDB_OPSPEC_INIT( &os );
    std::shared_ptr< hse_kvdb_txn > transaction(
      hse_kvdb_txn_alloc( kvdb.get() ),
      [kvdb]( hse_kvdb_txn *p ) { if( p ) hse_kvdb_txn_free( kvdb.get(), p ); }
    );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );

Slide 101

Slide 101 text

HSE API

hse_kvdb_txn_alloc creates a new transaction. Discard it with hse_kvdb_txn_free.

[Diagram: root, root(1), and the log]

    std::shared_ptr< hse_kvs > kvs( raw_kvs, [kvdb]( hse_kvs *p ) { if( p ) hse_kvdb_kvs_close( p ); } );
    hse_kvdb_opspec os;
    HSE_KVDB_OPSPEC_INIT( &os );
    std::shared_ptr< hse_kvdb_txn > transaction(
      hse_kvdb_txn_alloc( kvdb.get() ),
      [kvdb]( hse_kvdb_txn *p ) { if( p ) hse_kvdb_txn_free( kvdb.get(), p ); }
    );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
    for( const auto &v: put_value ) {
      HSE_SAFE_CALL( hse_kvs_put( kvs.get(), &os, v.first.data(), v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
      std::array< char, 100 > data{ 0 };
      bool found = false;
      size_t length = 0;
      HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(), v.size(), &found, data.data(), data.size(), &length ) );

Slide 102

Slide 102 text

HSE API

hse_kvdb_txn_begin starts the transaction. hse_kvs_put writes a key/value pair.

[Diagram: root, root(1) with key/data pairs, and the log]

        hse_kvdb_txn_free( kvdb.get(), p );
      }
    );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
    for( const auto &v: put_value ) {
      HSE_SAFE_CALL( hse_kvs_put( kvs.get(), &os, v.first.data(), v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
      std::array< char, 100 > data{ 0 };
      bool found = false;
      size_t length = 0;
      HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(), v.size(), &found, data.data(), data.size(), &length ) );
      if( found )
        std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
      HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );

Slide 103

Slide 103 text

HSE API

hse_kvs_get retrieves the value that corresponds to a key.

[Diagram: root, root(1) with key/data pairs, and the log]

        v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
      std::array< char, 100 > data{ 0 };
      bool found = false;
      size_t length = 0;
      HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(), v.size(), &found, data.data(), data.size(), &length ) );
      if( found )
        std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
      HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
    }
    else {
      HSE_SAFE_CALL( hse_kvdb_txn_commit( kvdb.get(), os.kop_txn ) );
    }

Slide 104

Slide 104 text

HSE API

hse_kvdb_txn_commit makes the writes permanent. hse_kvdb_txn_abort cancels the writes made so far.

[Diagram: root, root(1) with key/data pairs, and the log; both receive "insert value"]

        v.size(), &found, data.data(), data.size(), &length ) );
      if( found )
        std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
      HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
    }
    else {
      HSE_SAFE_CALL( hse_kvdb_txn_commit( kvdb.get(), os.kop_txn ) );
    }
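The begin / put / get / commit-or-abort sequence above can be modelled with a toy in-memory store. This is a sketch of the visible behaviour only, under the assumption of a single transaction at a time; `toy_store_t` and its members are hypothetical names, and nothing here reflects how HSE implements transactions internally.

```cpp
#include <map>
#include <string>

// Toy model: writes made inside a transaction are staged, become permanent
// on commit, and are dropped on abort. Single transaction at a time.
class toy_store_t {
public:
  void begin() { staged_.clear(); active_ = true; }
  void put( const std::string &key, const std::string &value ) {
    if( active_ ) staged_[ key ] = value;
    else committed_[ key ] = value;
  }
  // Inside a transaction, reads see that transaction's own staged writes
  // first, then fall back to committed data.
  bool get( const std::string &key, std::string &out ) const {
    if( active_ ) {
      const auto s = staged_.find( key );
      if( s != staged_.end() ) { out = s->second; return true; }
    }
    const auto c = committed_.find( key );
    if( c == committed_.end() ) return false;
    out = c->second;
    return true;
  }
  void commit() {
    for( const auto &kv: staged_ ) committed_[ kv.first ] = kv.second;
    staged_.clear();
    active_ = false;
  }
  void abort() { staged_.clear(); active_ = false; }
private:
  std::map< std::string, std::string > committed_;
  std::map< std::string, std::string > staged_;
  bool active_ = false;
};
```

A real engine also has to isolate concurrent transactions and make commits survive a crash, which is where the log in the diagrams comes in.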

Slide 105

Slide 105 text

Heterogeneous-Memory Storage Engine

HSE aims to support several different kinds of storage device through a common interface:

1. Classic SSDs
2. NVMe SSDs with Zoned Namespace
3. Non-volatile memory devices

Slide 106

Slide 106 text

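Heterogeneous-Memory Storage Engine

HSE aims to support several different kinds of storage device through a common interface:

1. Classic SSDs: available as of version 1.7
2. NVMe SSDs with Zoned Namespace: not implemented
3. Non-volatile memory devices: not implemented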

Slide 107

Slide 107 text

Heterogeneous-Memory Storage Engine

HSE aims to support several different kinds of storage device through a common interface:

1. Classic SSDs: available as of version 1.7
2. NVMe SSDs with Zoned Namespace: not implemented
3. Non-volatile memory devices: not implemented

For non-volatile memory devices, see the explanation I prepared earlier for Kernel/VM:
https://speakerdeck.com/fadis/dian-yuan-woqie-tutemoxiao-enaimemoritofalsefu-kihe-ifang

Slide 108

Slide 108 text

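Zoned Namespace

As the SSD's capacity grows, the translation table grows with it. The table needs RAM amounting to roughly 1/1000 of the SSD's capacity, so the controller of a large-capacity SSD has to carry a large amount of RAM. That hurts.

[Diagram: blocks 0, 1, 2 holding pages 0 to 7 plus empty pages, and a translation table with the entry 2->8]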
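The 1/1000 figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes one 4-byte table entry per 4 KiB page; both numbers are illustrative assumptions rather than values from the slides, and `table_bytes` is a hypothetical helper name.

```cpp
#include <cstdint>

// Assumed parameters (not from the slides): page-granular mapping with
// one 4-byte entry per 4 KiB page.
constexpr std::uint64_t page_size = 4096;
constexpr std::uint64_t entry_size = 4;

// RAM needed for the translation table of a device with the given capacity.
constexpr std::uint64_t table_bytes( std::uint64_t capacity_bytes ) {
  return capacity_bytes / page_size * entry_size;
}
```

Under these assumptions the ratio is 4 / 4096 = 1/1024, so a 1 TiB device needs about 1 GiB of RAM for the table alone.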

Slide 109

Slide 109 text

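Zoned Namespace

Translating addresses at this granularity (the page) makes the translation table too large and leaves blocks that are only partly TRIMmed.

Instead, at this granularity (the block), remember only where each block was allocated and how far it has been written from its start. TRIM always applies to a whole block.

[Diagram: blocks 0, 1, 2 holding pages 0 to 7 plus empty pages]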

Slide 110

Slide 110 text

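Zoned Namespace

- Uses huge block sizes on the order of 100 MB.
- A block can be appended to as long as it has free space.
- To erase what was written, the whole block has to be deleted.

This reduces the work the FTL has to do and exposes some of NAND's constraints directly to the application.

[Diagram: blocks 0, 1, 2, with two of them marked "in use"]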
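The three rules above (append only, bounded by free space, erase the whole block) can be captured in a small in-memory model. This is a sketch for illustration, not an NVMe or Linux interface; `zone_t` and its member names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// In-memory model of one zone: data can only be appended at the write
// pointer, and the only way to reclaim space is to reset the whole zone.
class zone_t {
public:
  explicit zone_t( std::size_t capacity ) : capacity_( capacity ) {}
  // Appends at the write pointer. Returns false, writing nothing, if the
  // data does not fit in the remaining space.
  bool append( const std::vector< std::uint8_t > &data ) {
    if( data.size() > capacity_ - buf_.size() ) return false;
    buf_.insert( buf_.end(), data.begin(), data.end() );
    return true;
  }
  // Discards the entire zone; there is no partial erase.
  void reset() { buf_.clear(); }
  // Offset at which the next append will land.
  std::size_t write_pointer() const { return buf_.size(); }
private:
  std::size_t capacity_;
  std::vector< std::uint8_t > buf_;
};
```

The device then only has to track one write pointer per zone instead of one table entry per page, which is the saving the previous slides describe.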

Slide 111

Slide 111 text

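dm-zoned

dm-zoned is Linux's support for Zoned Namespace. It makes the device look as if it had 4 KiB pages.

[Diagram: the I/O stack. User space: MySQL (InnoDB), MongoDB (WiredTiger). Kernel space: VFS, file system, page cache, bio, dm-zoned, device driver. Hardware: Flash Translation Layer, NAND flash memory. Page sizes are in KiB above dm-zoned and in MiB below it.]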

Slide 112

Slide 112 text

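Yet another log-structured layer has been added.

[Diagram: the same I/O stack (user space: MySQL (InnoDB), MongoDB (WiredTiger); kernel space: VFS, file system, page cache, bio, dm-zoned, device driver; hardware: Flash Translation Layer, NAND flash memory), with four layers each marked "log"]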

Slide 113

Slide 113 text

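What HSE is aiming for

[Diagram: the same I/O stack (user space: MySQL (InnoDB), MongoDB (WiredTiger); kernel space: VFS, file system, page cache, bio, dm-zoned, device driver; hardware: Flash Translation Layer, NAND flash memory), with HSE and mpool shown alongside it. Page sizes: KiB on the conventional path, MiB for HSE and mpool.]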

Slide 114

Slide 114 text

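Summary

- SSDs are tsundere (hard to please).
- Getting good performance out of them means rethinking how they are used, starting from the kernel layers.
- HSE is a KVS that did exactly this and achieved high performance.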