
What is HSE?

Fadis
June 06, 2020

An explanation of the Heterogeneous-Memory Storage Engine (HSE).
These are the slides from a talk given at カーネル/VM探検隊 online part1 on June 6, 2020.

References
Heterogeneous-Memory Storage Engine: https://www.micron.com/hse
Don't stack your log on my log: https://www.usenix.org/node/187064
電源を切っても消えないメモリとの付き合い方 (Living with memory that survives a power cut): https://speakerdeck.com/fadis/dian-yuan-woqie-tutemoxiao-enaimemoritofalsefu-kihe-ifang
Sample code for these slides: https://github.com/Fadis/hse_demo
カーネル/VM探検隊 online part1: https://connpass.com/event/175388/


Transcript

  1. What is HSE?
    NAOMASA MATSUBAYASHI

  2. Heterogeneous-Memory Storage Engine
    https://www.micron.com/hse
    An open-source key-value store announced by Micron in April 2020

  3. Heterogeneous-Memory Storage Engine
    https://www.micron.com/hse
    An open-source key-value store
    https://github.com/hse-project/hse/wiki
    "HSE is an embeddable key-value store built for SSDs that use NAND flash or non-volatile memory. HSE improves performance and endurance by carefully arranging where data is placed, from DRAM through various kinds of SSDs and other solid-state storage."

  4. Heterogeneous-Memory Storage Engine
    https://www.micron.com/hse
    https://github.com/hse-project/hse/wiki
    Replacing MongoDB's WiredTiger with HSE reportedly raises YCSB benchmark throughput by 2x to 8x
    https://github.com/hse-project/hse/wiki/MongoDB

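  5. Storage engine
    [diagram: the storage stack]
    User space: MySQL with InnoDB, MongoDB with WiredTiger (InnoDB and WiredTiger are the storage engines, marked "this")
    Kernel space: VFS, file system, page cache, bio, IO scheduler, device driver

  6. The role of a storage engine
    Deciding where data is recorded on the storage device

  7. The role of a storage engine
    Caching frequently accessed data

  8. The role of a storage engine
    Transactions
    [diagram: timelines of BEGIN, PUT, COMMIT and ABORT operations; callouts reading "this is what is visible" mark which PUT each GET observes]

  9. The role of a storage engine
    Even if processing is interrupted, the data never ends up in an invalid state
    [diagram: COMMIT, PUT, PUT, PUT, COMMIT, with two callouts reading "if a crash happens here, the state after restart is this one"]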
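The transaction behaviour on slides 8 and 9 can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `TinyStore` and its methods are hypothetical names, not part of HSE, WiredTiger or any library mentioned in the talk, and nothing here is persisted to storage.

```cpp
#include <map>
#include <string>

// Hypothetical toy store: writes made inside a transaction are staged and
// only become visible to get() once commit() is called.
class TinyStore {
public:
    // Open a transaction; staged writes from an earlier one are discarded.
    void begin() { pending_.clear(); in_txn_ = true; }

    // Stage a write (or apply it directly when no transaction is open).
    void put(const std::string& key, const std::string& value) {
        if (in_txn_) pending_[key] = value;
        else committed_[key] = value;
    }

    // Publish every staged write in one step.
    void commit() {
        for (const auto& kv : pending_) committed_[kv.first] = kv.second;
        pending_.clear();
        in_txn_ = false;
    }

    // Throw away every staged write.
    void abort() { pending_.clear(); in_txn_ = false; }

    // Readers only ever see committed data. Returns false if the key is absent.
    bool get(const std::string& key, std::string* out) const {
        auto it = committed_.find(key);
        if (it == committed_.end()) return false;
        *out = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> committed_;
    std::map<std::string, std::string> pending_;
    bool in_txn_ = false;
};
```

A GET issued between BEGIN and COMMIT sees the old value, and an ABORT (or, in a real engine, a crash before COMMIT) leaves the committed state untouched.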

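  10. How a storage engine is implemented
    The WiredTiger case
    [diagram: a log holding records 0 1 2 4 5]

  11. How a storage engine is implemented
    The WiredTiger case
    [diagram: session 0 GETs 1 and 2; a tree (root with children 1 and 2) points into the log 0 1 2 4 5]

  12. How a storage engine is implemented
    The WiredTiger case
    [diagram: session 1 PUTs 2'; it gets its own root(1) with child 2', while session 0 keeps root with children 1 and 2]

  13. How a storage engine is implemented
    The WiredTiger case
    [diagram: both sessions GET 2; session 0 goes through root (1, 2), session 1 goes through root(1) (2')]

  14. How a storage engine is implemented
    The WiredTiger case
    [diagram: session 1 COMMITs; a callout reads "... becomes ..." (the two symbols were images and did not survive extraction)]

  15. How a storage engine is implemented
    The WiredTiger case
    [diagram: after the commit, sessions 0 and 1 both GET 2 through root with children 1 and 2'; the same "... becomes ..." callout]

  16. Storage engine garbage collector
    The WiredTiger case
    [diagram: log 0 1 2' 4 5, root with children 1 and 2'; the same "... becomes ..." callout]

  17. Storage engine garbage collector
    The WiredTiger case
    [diagram: log 0 1 2' 4 5, root with children 1 and 2']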

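  18. How a storage engine is implemented
    Item | Storage engine | File system
    How data is looked up | key | file path
    Decides the recording location | yes | yes
    Cache | yes | yes
    Transactions | controlled by the application | none
    State after an interruption | only completed transactions are reflected | some state in which the file system is not corrupted
    How the state is recovered after an interruption | structured log | structured log
    This resembles a file system journal, but a file system does not provide transactions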

  19. We end up with a log stacked on top of a log
    [diagram: the storage stack from slide 5; the storage engines (InnoDB, WiredTiger) keep a log, and the file system below them keeps another log]

  20. Running a database on an SSD is no longer unusual

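  21. Constraints of NAND flash
    [diagram: a row of five cells, all reading 0, with 20V and 0V applied for erasing]
    Charge is drained from every cell in the same block (= the whole block is zero-cleared)
    Because a block normally consists of multiple pages, a single page cannot be zero-cleared on its own

  22. Constraints of NAND flash
    [diagram: cells reading 0 0 1 0 1; 20V is applied only to the cells that become 1]
    After that, charge is stored in the cells that should become 1
    Every time charge moves in or out, the cell's tunnel oxide film wears down

  23. Constraints of NAND flash
    [diagram: block 0 holds pages 0 1 2 3, block 1 holds pages 4 5 6 7]
    We want to rewrite 2 to 2'
    If 2 is rewritten in place, as on a hard disk:
    1. zero-clear block 0
    2. write 0 1 2' 3
    Extremely slow

  24. Constraints of NAND flash
    [diagram: block 0 holds pages 0 1 2 3, block 1 holds pages 4 5 6 7; 2' is written to the same page again and again]
    If 2 is rewritten in place, as on a hard disk, block 0 wears out extremely fast and becomes unusable
    Extremely fragile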

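  25. Flash Translation Layer
    [diagram: block 0 holds 0 1 2 3, block 1 holds 4 5 6 7, block 2 holds four free pages]
    Whenever it is idle, the SSD controller keeps zero-clearing free areas

  26. Flash Translation Layer
    [diagram: block 0 holds 0 1 2 3, block 1 holds 4 5 6 7, block 2 holds 2' and three free pages]
    We want to rewrite 2 to 2'
    When a write request arrives, it is written to a page that has already been zero-cleared

  27. Flash Translation Layer
    [diagram: same blocks as on slide 26]
    The SSD keeps a translation table that records at which physical address the page of each LBA is stored
    Translation table: 2 -> 8

  28. Translation table
    Translation between LBAs and physical addresses
    Log:
    LBA 2 moved to physical address 8
    LBA 5 moved to physical address 9
    LBA 1 moved to address 10
    LBA 1 was TRIMmed
    LBA 2 moved to physical address 11
    LBA 3 moved to physical address 12
    Resulting table:
    LBA | physical address
    2 | 11
    3 | 12
    5 | 9
    The translation table is kept in both the device's RAM and its flash memory
    Because flash memory cannot be rewritten row by row, every change is appended as a structured log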
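How the table on slide 28 falls out of its append-only log can be sketched by replaying the entries in order. This is a minimal illustration, not how any particular SSD firmware stores its mapping; `MapLogEntry`, `replay` and `slide_log` are names made up for this example.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// One entry of the append-only mapping log (hypothetical layout).
struct MapLogEntry {
    enum Kind { Remap, Trim } kind;
    std::uint32_t lba;
    std::uint32_t physical;  // ignored for Trim entries
};

// Rebuild the LBA -> physical address table by replaying the log front to back.
std::map<std::uint32_t, std::uint32_t> replay(const std::vector<MapLogEntry>& log) {
    std::map<std::uint32_t, std::uint32_t> table;
    for (const MapLogEntry& e : log) {
        if (e.kind == MapLogEntry::Remap) table[e.lba] = e.physical;  // later entries win
        else table.erase(e.lba);                                      // TRIM drops the mapping
    }
    return table;
}

// The six log entries listed on slide 28.
std::vector<MapLogEntry> slide_log() {
    return {
        {MapLogEntry::Remap, 2, 8},
        {MapLogEntry::Remap, 5, 9},
        {MapLogEntry::Remap, 1, 10},
        {MapLogEntry::Trim, 1, 0},
        {MapLogEntry::Remap, 2, 11},
        {MapLogEntry::Remap, 3, 12},
    };
}
```

Replaying `slide_log()` leaves exactly the three rows shown on the slide (2 -> 11, 3 -> 12, 5 -> 9). Physical addresses 8 and 10 are no longer referenced, which is what later makes their pages eligible for garbage collection.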

  29. Translation table
    Translation between LBAs and physical addresses
    Log: LBA 2 -> 8, LBA 5 -> 9, LBA 1 -> 10, LBA 1 TRIMmed, LBA 2 -> 11, LBA 3 -> 12
    Table: 2 -> 11, 3 -> 12, 5 -> 9
    [diagram: block 2 holds 2 5 1 2, block 3 holds 3 and three free pages]
    We want to read 2

  30. Translation table
    Translation between LBAs and physical addresses
    Log: LBA 2 -> 8, LBA 5 -> 9, LBA 1 -> 10, LBA 1 TRIMmed, LBA 2 -> 11, LBA 3 -> 12, LBA 2 -> 13
    Table: 2 -> 13, 3 -> 12, 5 -> 9
    [diagram: block 2 holds 2 5 1 2, block 3 holds 3 2 and two free pages]
    We want to write to 2

  31. Translation table
    Translation between LBAs and physical addresses
    Log: LBA 2 -> 8, LBA 5 -> 9, LBA 1 -> 10, LBA 1 TRIMmed, LBA 2 -> 11, LBA 3 -> 12, LBA 2 -> 13, LBA 5 TRIMmed
    Table: 2 -> 13, 3 -> 12
    [diagram: block 2 holds 2 5 1 2, block 3 holds 3 2 and two free pages]
    TRIM 5

  32. Translation table
    The FTL garbage collector
    Log: LBA 2 -> 8, LBA 5 -> 9, LBA 1 -> 10, LBA 1 TRIMmed, LBA 2 -> 11, LBA 3 -> 12, LBA 2 -> 13, LBA 5 TRIMmed
    Table: 2 -> 13, 3 -> 12
    [diagram: block 3 holds 3 2 and two free pages, block 2 holds four free pages]
    When it is idle, the SSD controller zero-clears blocks in which no page is referenced from the translation table any more

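  33. The FTL garbage collector
    [diagram: block 3 holds 3 2 4 4, block 2 holds 1 2 1 2]
    The case where zero-cleared pages are running low but every block is still partly in use

  34. The FTL garbage collector
    [diagram: block 3 holds 3 2 4 4, block 2 holds 1 2 1 2, the new block 4 receives page 1 and is otherwise free]
    Only the valid pages of the partly used blocks are written to a new block

  35. The FTL garbage collector
    [diagram: block 3 holds 3 2 4 4, block 2 holds four free pages, block 4 holds page 1 and is otherwise free]
    The original block, now no longer needed, is zero-cleared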
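The copy-then-erase step on slides 33 to 35 can be sketched as follows. This is a simplified model with assumed names (`Page`, `Block`, `collect`); a real FTL also has to choose victim blocks and update its translation table, which is left out here.

```cpp
#include <cstddef>
#include <vector>

// A page remembers which LBA it holds and whether the translation table
// still points at it (valid) or a newer copy / a TRIM made it stale.
struct Page {
    int lba;
    bool valid;
};
using Block = std::vector<Page>;

// Copy the valid pages of `victim` to the end of `fresh`, then erase `victim`.
// Returns how many pages had to be rewritten, i.e. the extra NAND writes
// caused by garbage collection.
std::size_t collect(Block& victim, Block& fresh) {
    std::size_t copied = 0;
    for (const Page& p : victim) {
        if (p.valid) {
            fresh.push_back(p);
            ++copied;
        }
    }
    victim.clear();  // models zero-clearing the whole block at once
    return copied;
}
```

For a block holding 1 2 1 2 in which only the newer copy of LBA 1 is still valid, `collect` rewrites one page and frees four. The fewer valid pages a victim block holds, the cheaper the collection, which is why the later slides care about TRIMs arriving in whole-block units.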

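  36. Flash Translation Layer
    Item | Flash Translation Layer | File system
    How data is looked up | LBA | file path
    Decides the recording location | yes | yes
    Cache | yes | yes
    Transactions | none | none
    State after an interruption | some state in which the address translation table is not corrupted | some state in which the file system is not corrupted
    How the state is recovered after an interruption | structured log | structured log
    This resembles a file system journal, but the granularity of the log is one page

  37. We end up with a log stacked on a log stacked on a log
    [diagram: the storage stack with three logs]
    User space: MySQL with InnoDB, MongoDB with WiredTiger (log)
    Kernel space: VFS, file system (log), page cache, bio, device driver
    Hardware: Flash Translation Layer (log), NAND flash memory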

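  38. [diagram: blocks 0 and 1 each hold two free pages next to the live pages 5 7 7 9, block 2 is entirely free; we want to write 1 2 3 4 and later TRIM 1 2 3 4; a callout marks a block that "can be zero-cleared right away"]
    Erasures are most welcome when they arrive in whole-block units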

  39. Log-structured file system
    Events that happen to the file system are recorded on storage in chronological order:
    Created file hoge
    Wrote data to page 0 of file hoge
    Wrote data to page 1 of file hoge
    Created file fuga
    Wrote data to page 0 of file fuga
    Deleted file hoge
    Wrote data to page 0 of file fuga
    New operations are always appended at the head of the log

  40. Garbage collector of a log-structured file system
    If we simply keep writing, the storage space runs out
    Find the events that do not affect the current state of the file system (in the log from slide 39)
    Build a new log by copying only the entries that still matter:
    Created file fuga
    Wrote data to page 0 of file fuga
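The compaction on slide 40 can be sketched as a function that keeps only the events still needed to rebuild the current state. The event model below (`Event`, `compact`) is an assumption made for this illustration and does not follow the on-disk format of any real log-structured file system.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Event {
    enum Kind { Create, Write, Delete } kind;
    std::string file;
    int page;  // only meaningful for Write
};

// Return a new log holding only the events that still affect the current
// state: the Create of every live file and the latest Write of each page.
std::vector<Event> compact(const std::vector<Event>& log) {
    std::map<std::string, std::size_t> created_at;                     // live file -> index of its Create
    std::map<std::pair<std::string, int>, std::size_t> last_write;     // (file, page) -> index of latest Write
    for (std::size_t i = 0; i < log.size(); ++i) {
        const Event& e = log[i];
        if (e.kind == Event::Create) {
            created_at[e.file] = i;
        } else if (e.kind == Event::Write) {
            if (created_at.count(e.file)) last_write[std::make_pair(e.file, e.page)] = i;
        } else {  // Delete: the file and all of its writes become garbage
            created_at.erase(e.file);
            for (auto it = last_write.begin(); it != last_write.end();) {
                if (it->first.first == e.file) it = last_write.erase(it);
                else ++it;
            }
        }
    }
    std::set<std::size_t> keep;  // surviving indices, sorted in original order
    for (const auto& kv : created_at) keep.insert(kv.second);
    for (const auto& kv : last_write) keep.insert(kv.second);
    std::vector<Event> out;
    for (std::size_t i : keep) out.push_back(log[i]);
    return out;
}
```

Fed the seven events from slide 39, this returns the two-entry log shown on slide 40. Every surviving entry is written again, so the compaction itself generates new writes for whatever layer sits underneath.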

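  41. In a log-structured file system, a contiguous region is TRIMmed whenever the garbage collector runs
    This is kind to the FTL's garbage collector
    If the contiguous region is also contiguous in the SSD's physical addresses, there is a good chance that blocks can be released immediately

  42. Flash-Friendly File System (F2FS)
    [diagram: on-disk layout SB, CP, SIT, NAT, SSA, Main; the Main area is divided into sections, and several logs each write into their own sections]
    A log-structured file system with multiple logs
    The space used for logs is allocated in units of sections
    The section size probably matches the block size
    TRIMs issued during GC then come in block units, which is welcome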

  43. Three independently running garbage collectors stacked on top of each other
    [diagram: the storage stack with three garbage collectors]
    User space: MySQL with InnoDB, MongoDB with WiredTiger (GC)
    Kernel space: VFS, file system (GC), page cache, bio, device driver
    Hardware: Flash Translation Layer (GC), NAND flash memory

  44. Don't Stack Your Log On My Log
    https://www.usenix.org/node/187064
    YANG, J., PLASSON, N., GILLIS, G., TALAGALA, N., AND SUNDARARAMAN, S. Don't stack your log on my log. In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW) (2014).

  45. Don't Stack Your Log On My Log
    https://www.usenix.org/node/187064
    A paper showing that stacking structured logs in many layers keeps increasing the writes to NAND and makes performance collapse

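  46. Write Amplification
    Storage engine: wants to write [data]
    File system: wants to write [meta0][data]
    A structured log has to write metadata in addition to the data it wants to write

  47. Write Amplification
    Storage engine: wants to write [data]
    File system: wants to write [meta0][data] plus further metadata
    Flash Translation Layer: wants to write all of that plus metadata of its own (the diagram shows meta0 to meta5 around a single piece of data)
    Metadata from an upper layer is plain data to the layer below it, so metadata gets metadata of its own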
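A back-of-the-envelope model of this effect: each layer cuts whatever it receives into records and adds a fixed amount of metadata per record, so the metadata added by upper layers is itself wrapped by the layers below. The record and metadata sizes used here are invented for illustration and are not measurements of any real file system or FTL.

```cpp
#include <cstdint>
#include <vector>

// One logging layer: it stores incoming bytes in records of `unit` bytes
// and adds `meta` bytes of its own metadata per record (assumed model).
struct Layer {
    std::uint64_t unit;
    std::uint64_t meta;
};

// Bytes that reach the bottom of the stack when `payload` bytes enter at the
// top. Layers are listed from top (storage engine) to bottom (FTL).
std::uint64_t bytes_reaching_nand(std::uint64_t payload, const std::vector<Layer>& layers) {
    std::uint64_t bytes = payload;
    for (const Layer& l : layers) {
        const std::uint64_t records = (bytes + l.unit - 1) / l.unit;  // round up
        bytes += records * l.meta;  // this layer's metadata becomes data for the next one
    }
    return bytes;
}
```

With three identical layers of 4096-byte records and 64 bytes of metadata each, a 4096-byte write grows to 4160, then 4288, then 4416 bytes: the second and third layers need two records because the first layer's metadata pushed the data past a record boundary.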

  48. Write Amplification
    [diagram: the file system's log spans sections 0 to 3; the Flash Translation Layer's log spans blocks 0 to 2, whose boundaries do not line up with the sections]
    The log written in section 1 is no longer needed, so it is TRIMmed
    If the section size and the block size differ, blocks that are only partly TRIMmed appear

  49. Write Amplification
    [diagram: Flash Translation Layer log across blocks 0 to 4]
    To turn partly TRIMmed blocks into zero-cleared space, the garbage collector copies the pages still in use to a new block

  50. Write Amplification
    A storage engine normally does not tell the layer below that part of its log could be TRIMmed
    Storage engine: log in pages 0 to 3, part of it already used up
    File system: the file exists, so pages 0 to 3 are in use
    Flash Translation Layer: the pages are in use, so they get copied to another block

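  51. Write Amplification
    [diagram: upper log 0 1 2 3 4 5 6; the still-valid elements 2 and 6 are copied into a new log, which lands in the lower layer]
    Garbage collection of a structured log takes only the still-valid elements out of the existing log and copies them into a new log
    This produces new writes to the log of the layer below

  52. Write Amplification
    [diagram: upper log 0 1 2 3 4 5 6; lower log 0 1 2 3 5 6]
    If the upper layer's garbage collector runs right after the lower layer's garbage collector has run,
    it causes a large volume of writes to the lower log, which had only just freed space through garbage collection, and on top of that

  53. Write Amplification
    [diagram: upper log 0 1 2 3 4 5 6; lower log 0 1 2 3 5 6]
    it creates a large number of elements awaiting garbage collection in the lower log, which had only just freed space through garbage collection
    The lower layer's garbage collection that just ran is ruined in an instant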

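  54. Write Amplification
    As the combined result of these effects, when multiple structured logs are stacked, the amount of data actually written to NAND can, in bad cases, swell to more than twice the size of the data that was requested to be written

  55. Write Amplification
    How to avoid it
    1. Do not stack structured logs
    2. If you really must stack them, align the block sizes
    3. TRIM logs you are done with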

  56. Let's drop the file system
    [diagram: the stack MySQL/MongoDB, InnoDB/WiredTiger, VFS, file system, page cache, bio, device driver, Flash Translation Layer, NAND flash memory, with its three logs annotated]
    Storage engine log: needed to implement transactions
    Flash Translation Layer log: a function of the hardware
    File system log: the one we want to throw away

  57. HSE uses the kernel module mpool
    [diagram: HSE sits in user space next to MySQL/MongoDB with InnoDB/WiredTiger; mpool sits in kernel space next to VFS, file system, page cache, bio and device driver; below them in hardware are the Flash Translation Layer and NAND flash memory]

  58. mpool runs on top of a block device
    [diagram: same stack as on slide 57]

  59. Create an mpool device by specifying a block device
    root # modprobe mpool
    root # ls /dev/mpool*
    /dev/mpoolctl
    root # mpool create mp1 /dev/nvme0n1 uid=test gid=test mode=0600
    root # ls /dev/mpool*
    /dev/mpoolctl
    /dev/mpool:
    mp1
    root # mpool list
    MPOOL TOTAL USED AVAIL CAPACITY LABEL HEALTH
    mp1 466g 1.16g 441g 0.26% raw optimal

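  60. mpool
    mpool provides three features: mblock, mlog and mcache
    [diagram: in user space, HSE sits on the mpool user-space library, which also provides mdc; the library reaches mblock, mlog and mcache in the mpool kernel module through ioctl]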

  61. mpool *raw_pool = nullptr;
    SAFE_CALL( mpool_open( params[ "pool" ].as< std::string
    >().c_str(), O_RDWR, &raw_pool, nullptr ) );
    std::shared_ptr< mpool > pool( raw_pool, []( mpool *p )
    { if( p ) mpool_close( p ); } );
    uint64_t block_id = 0u;
    mblock_props props;
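    mblock API
    mpool_open opens the mpool device
    An mblock stores in the mpool a byte sequence whose size is an integer multiple of the page size
    An mblock can be written only once, at creation time; it cannot be modified or appended to afterwards, but it can be deleted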

  62. mpool *raw_pool = nullptr;
    SAFE_CALL( mpool_open( params[ "pool" ].as< std::string
    >().c_str(), O_RDWR, &raw_pool, nullptr ) );
    std::shared_ptr< mpool > pool( raw_pool, []( mpool *p )
    { if( p ) mpool_close( p ); } );
    uint64_t block_id = 0u;
    mblock_props props;
    size_t length = 0;
    if( !params.count( "object" ) ) {
    memset( reinterpret_cast< void* >( &props ), 0,
    sizeof( props ) );
    SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY,
    false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() %
    PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    mblock API
    mpool_mblock_alloc creates a new mblock
    The 64-bit block id returned here is something like a file descriptor

  63. size_t length = 0;
    if( !params.count( "object" ) ) {
    memset( reinterpret_cast< void* >( &props ), 0,
    sizeof( props ) );
    SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY,
    false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() %
    PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast<
    char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    mblock API
    The object id obtained at the same time, on the other hand, is something like a file name
    Use the object id when looking this mblock up later

  64. mblock_props props;
    size_t length = 0;
    if( !params.count( "object" ) ) {
    memset( reinterpret_cast< void* >( &props ), 0,
    sizeof( props ) );
    SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY,
    false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() %
    PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast<
    char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
    mblock API
    Data written to an mblock must be aligned to a page boundary
    mpool writes do not go through a page cache; the kernel hands the memory allocated here directly to the device driver
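The buffer preparation used on these slides (round the size up to a whole number of pages, allocate page-aligned memory, zero it, copy the message in) can be isolated into a small helper. This is a sketch under assumptions: the 4096-byte page size is assumed rather than queried from the system, and `round_up_to_page`, `FreeDeleter` and `make_page_buffer` are names introduced here, not part of the mpool API or the sample repository.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>
#include <stdlib.h>
#include <string>

constexpr std::size_t kPageSize = 4096;  // assumption; real code should ask the system

// Round n up to the next multiple of the page size.
constexpr std::size_t round_up_to_page(std::size_t n) {
    return (n + kPageSize - 1) / kPageSize * kPageSize;
}

// aligned_alloc memory must be released with free, not delete.
struct FreeDeleter {
    void operator()(void* p) const { free(p); }
};
using PageBuffer = std::unique_ptr<char, FreeDeleter>;

// Copy `message` into a zero-filled, page-aligned buffer.
// `size` receives the padded length, always at least one page.
PageBuffer make_page_buffer(const std::string& message, std::size_t& size) {
    size = round_up_to_page(message.size());
    if (size == 0) size = kPageSize;  // avoid a zero-length allocation
    PageBuffer buf(static_cast<char*>(aligned_alloc(kPageSize, size)));
    if (!buf) throw std::bad_alloc();
    std::memset(buf.get(), 0, size);
    std::memcpy(buf.get(), message.data(), message.size());
    return buf;
}
```

The returned pointer and length are what would go into `iov_base` and `iov_len`. Because the padding is zeroed, a text message read back later is still NUL-terminated as long as its length is not an exact multiple of the page size.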

  65. SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY,
    false, &block_id, &props ) )
    std::cout << "object id: " << props.mpr_objid << std::endl;
    std::string m = params[ "message" ].as< std::string >();
    size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() %
    PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast<
    char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    length = buf_size;
    SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 )
    )
    if( abort_transaction )
    mblock API
    mpool_mblock_write writes data into the mblock
    By preparing several iovecs, data from several memory regions can be combined into one write

  66. PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast<
    char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    std::copy( m.begin(), m.end(), buf.get() );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    length = buf_size;
    SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 )
    )
    if( abort_transaction )
    SAFE_CALL( mpool_mblock_abort( pool.get(), block_id ) )
    else
    SAFE_CALL( mpool_mblock_commit( pool.get(), block_id ) )
    }
    else {
    uint64_t object_id = params[ "object" ].as< uint64_t >();
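    mblock API
    mpool_mblock_commit makes the changes final
    If execution never reaches this function, the mpool_mblock_write calls made up to that point are treated as if they never happened
    mpool_mblock_abort explicitly discards the changes made up to that point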

  67. iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    length = buf_size;
    SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 )
    )
    if( abort_transaction )
    SAFE_CALL( mpool_mblock_abort( pool.get(), block_id ) )
    else
    SAFE_CALL( mpool_mblock_commit( pool.get(), block_id ) )
    }
    else {
    uint64_t object_id = params[ "object" ].as< uint64_t >();
    SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id,
    &block_id, &props ) )
    length = props.mpr_write_len;
    std::cout << "object id: " << object_id << std::endl;
    }
    mblock API
    To look up an mblock that has already been written, use mpool_mblock_find_get

  68. SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id,
    &block_id, &props ) )
    length = props.mpr_write_len;
    std::cout << "object id: " << object_id << std::endl;
    }
    {
    size_t buf_size = length;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast<
    char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    SAFE_CALL( mpool_mblock_read( pool.get(), block_id, &iov, 1, 0
    ) )
    std::cout << "length: " << length << std::endl;
    std::cout << "data: " << buf.get() << std::endl;
    }
    mblock API
    mpool_mblock_read reads the data
    The buffer used for reading must also be aligned to a page boundary

  69. {
    size_t buf_size = length;
    std::unique_ptr< char, free_deleter > buf( reinterpret_cast<
    char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) );
    if( !buf ) throw std::bad_alloc();
    memset( buf.get(), 0, buf_size );
    iovec iov;
    iov.iov_base = buf.get();
    iov.iov_len = buf_size;
    SAFE_CALL( mpool_mblock_read( pool.get(), block_id, &iov, 1, 0
    ) )
    std::cout << "length: " << length << std::endl;
    std::cout << "data: " << buf.get() << std::endl;
    }
    if( delete_block )
    SAFE_CALL( mpool_mblock_delete( pool.get(), block_id ) )
    mblock API
    mpool_mblock_delete deletes the specified mblock in its entirety

  70. mpool *raw_pool = nullptr;
    SAFE_CALL( mpool_open( params[ "pool" ].as< std::string
    >().c_str(), O_RDWR|O_EXCL, &raw_pool, nullptr ) );
    std::shared_ptr< mpool > pool( raw_pool, []( mpool *p )
    { if( p ) mpool_close( p ); } );
    mlog_capacity cap;
    memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) );
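    mlog API
    Opening the mpool device with mpool_open works the same way as for mblock
    An mlog stores in the mpool a byte sequence that can be appended to later
    The maximum size of an mlog is fixed when it is created; once it has been filled up to that size, no more can be written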

  71. std::shared_ptr< mpool > pool( raw_pool, []( mpool *p )
    { if( p ) mpool_close( p ); } );
    mlog_capacity cap;
    memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) );
    std::shared_ptr< mpool_mlog > log;
    if( !params.count( "object" ) ) {
    cap.lcp_captgt = 4 * 1024 * 1024;
    mlog_props props;
    memset( reinterpret_cast< void* >( &props ), 0,
    sizeof( props ) );
    mpool_mlog *raw_log = nullptr;
    SAFE_CALL( mpool_mlog_alloc( pool.get(), &cap,
    MP_MED_CAPACITY, &props, &raw_log ) );
    log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p )
    mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    mlog API
    mpool_mlog_alloc creates a new mlog
    The capacity target (cap.lcp_captgt in the code above) is the size of the region to use, an integer multiple of the page size

  72. log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p )
    mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    }
    else {
    mlog_props props;
    mpool_mlog *raw_log = nullptr;
    SAFE_CALL( mpool_mlog_find_get( pool.get(),
    params[ "object" ].as< uint64_t >(), &props, &raw_log ) )
    log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p )
    mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
    }
    uint64_t gen = 0;
    SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
    mlog API
    To look up an existing mlog, use mpool_mlog_find_get
    mpool_mlog_alloc and mpool_mlog_find_get return an mlog_props

  73. SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    }
    else {
    mlog_props props;
    mpool_mlog *raw_log = nullptr;
    SAFE_CALL( mpool_mlog_find_get( pool.get(),
    params[ "object" ].as< uint64_t >(), &props, &raw_log ) )
    log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p )
    mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
    }
    uint64_t gen = 0;
    SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
    if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector<
    std::string > >() )
    SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(),
    mlog API
    Using the mpool_mlog handle, open the log with mpool_mlog_open

  74. mpool_mlog_close( pool.get(), p ); } );
    uint64_t object_id = props.lpr_objid;
    std::cout << "object id: " << object_id << std::endl;
    }
    uint64_t gen = 0;
    SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
    if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector<
    std::string > >() )
    SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(),
    const_cast< void* >( static_cast< const void* >( a.data() ) ),
    a.size(), 1 ) )
    if( abort_transaction )
    SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) )
    else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(),
    mlog API
    mpool_mlog_append_data appends a byte sequence to the mlog
    The byte sequence being written does not have to be aligned to a page boundary

  75. }
    uint64_t gen = 0;
    SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
    if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector<
    std::string > >() )
    SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(),
    const_cast< void* >( static_cast< const void* >( a.data() ) ),
    a.size(), 1 ) )
    if( abort_transaction )
    SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) )
    else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(),
    erase_log ) )
    bool empty = false;
    SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) )
    std::cout << "empty: " << empty << std::endl;
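    mlog API
    mpool_mlog_commit makes the changes final
    If execution never reaches this function, the mpool_mlog_append_data calls made up to that point are treated as if they never happened
    mpool_mlog_abort explicitly discards the changes made up to that point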

  76. SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
    if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector<
    std::string > >() )
    SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(),
    const_cast< void* >( static_cast< const void* >( a.data() ) ),
    a.size(), 1 ) )
    if( abort_transaction )
    SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) )
    else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(),
    erase_log ) )
    bool empty = false;
    SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) )
    std::cout << "empty: " << empty << std::endl;
    size_t len = 0;
    SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) )
    mlog API
    mpool_mlog_erase wipes the specified mlog in its entirety

  77. else
    SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    if( erase_log != std::numeric_limits< uint64_t >::max() )
    SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(),
    erase_log ) )
    bool empty = false;
    SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) )
    std::cout << "empty: " << empty << std::endl;
    size_t len = 0;
    SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) )
    std::cout << "length: " << len << std::endl;
    SAFE_CALL( mpool_mlog_read_data_init( pool.get(), log.get() ) )
    while( 1 ) {
    std::array< char, 1024u > buf;
    size_t length = 0u;
    SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(),
    buf.data(), buf.size() - 1, &length ) );
    if( !length ) break;
    buf[ length ] = '\0';
    mlog API
    mpool_mlog_empty tells whether the mlog is empty
    mpool_mlog_len returns the size of the used part of the mlog

  78. std::cout << "empty: " << empty << std::endl;
    size_t len = 0;
    SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) )
    std::cout << "length: " << len << std::endl;
    SAFE_CALL( mpool_mlog_read_data_init( pool.get(), log.get() ) )
    while( 1 ) {
    std::array< char, 1024u > buf;
    size_t length = 0u;
    SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(),
    buf.data(), buf.size() - 1, &length ) );
    if( !length ) break;
    buf[ length ] = '\0';
    std::cout << "data: " << buf.data() << std::endl;
    }
    SAFE_CALL( mpool_mlog_flush( pool.get(), log.get() ) )
    SAFE_CALL( mpool_mlog_close( pool.get(), log.get() ) )
    if( delete_log )
    SAFE_CALL( mpool_mlog_delete( pool.get(), log.get() ) )
    mlog API
    mpool_mlog_read_data_init prepares for reading, and mpool_mlog_read_data_next then returns the written contents in order from the beginning

  79. SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(),
    buf.data(), buf.size() - 1, &length ) );
    if( !length ) break;
    buf[ length ] = '\0';
    std::cout << "data: " << buf.data() << std::endl;
    }
    SAFE_CALL( mpool_mlog_flush( pool.get(), log.get() ) )
    SAFE_CALL( mpool_mlog_close( pool.get(), log.get() ) )
    if( delete_log )
    SAFE_CALL( mpool_mlog_delete( pool.get(), log.get() ) )
    mlog API
    To delete an mlog, use mpool_mlog_delete

  80. mpool *raw_pool = nullptr;
    SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(),
    O_RDWR|O_EXCL, &raw_pool, nullptr ) );
    std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p )
    mpool_close( p ); } );
    uint64_t log1 = 0;
    uint64_t log2 = 0;
    if( !params.count( "object" ) ) {
    mdc_capacity cap;
    mdc API
    MetaData Container, MDC for short
    Two mlogs combined so that garbage collection becomes possible
    Opening the mpool device with mpool_open works the same way as for mlog

  81. SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(),
    O_RDWR|O_EXCL, &raw_pool, nullptr ) );
    std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p )
    mpool_close( p ); } );
    uint64_t log1 = 0;
    uint64_t log2 = 0;
    if( !params.count( "object" ) ) {
    mdc_capacity cap;
    memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) );
    cap.mdt_captgt = 4 * 1024 * 1024;
    SAFE_CALL( mpool_mdc_alloc( pool.get(), &log1, &log2, MP_MED_CAPACITY,
    &cap, nullptr ) );
    std::cout << "object id: " << log1 << ":" << log2 << std::endl;
    SAFE_CALL( mpool_mdc_commit( pool.get(), log1, log2 ) )
    }
    else {
    auto v = params[ "object" ].as< std::string >();
    boost::fusion::vector< uint64_t, uint64_t > parsed;
    namespace qi = boost::spirit::qi;
    if( !qi::parse( v.begin(), v.end(), qi::ulong_long >> ':' >>
    qi::ulong_long, parsed ) ) {
    mdc API
    mpool_mdc_alloc creates an mdc
    Two mlogs are created and two object ids are returned
    The mdc_capacity argument specifies the size of each of the two mlogs

  82. boost::fusion::vector< uint64_t, uint64_t > parsed;
    namespace qi = boost::spirit::qi;
    if( !qi::parse( v.begin(), v.end(), qi::ulong_long >> ':' >>
    qi::ulong_long, parsed ) ) {
    std::cerr << "invalid object id" << std::endl;
    return 1;
    }
    log1 = boost::fusion::at_c< 0 >( parsed );
    log2 = boost::fusion::at_c< 1 >( parsed );
    }
    mpool_mdc *raw_log = nullptr;
    SAFE_CALL( mpool_mdc_open( pool.get(), log1, log2, 0, &raw_log ) );
    std::shared_ptr< mpool_mdc > log( raw_log, [pool]( mpool_mdc *p ) { if( p
    ) mpool_mdc_close( p ); } );
    if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string >
    >() )
    SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void*
    >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
    if( params.count( "compact" ) ) {
    auto v = params[ "compact" ].as< std::vector< std::string > >();
    mdc API
    mpool_mdc_open opens the mdc
    An opened mdc is closed with mpool_mdc_close

  83. return 1;
    }
    log1 = boost::fusion::at_c< 0 >( parsed );
    log2 = boost::fusion::at_c< 1 >( parsed );
    }
    mpool_mdc *raw_log = nullptr;
    SAFE_CALL( mpool_mdc_open( pool.get(), log1, log2, 0, &raw_log ) );
    std::shared_ptr< mpool_mdc > log( raw_log, [pool]( mpool_mdc *p ) { if( p
    ) mpool_mdc_close( p ); } );
    if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string >
    >() )
    SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void*
    >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
    if( params.count( "compact" ) ) {
    auto v = params[ "compact" ].as< std::vector< std::string > >();
    std::sort( v.begin(), v.end() );
    std::vector< std::vector< char > > bufs;
    SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    while( 1 ) {
    std::vector< char > buf( 4096, 0 );
    size_t size = 0;
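    mdc API
    mpool_mdc_append appends a byte string to the active mlog.
    The bytes being written do not need to be aligned to a page boundary.
    [Figure: an mdc made of two mlogs; one of them holds records 1 2 3 4 5]
    Only one of the two mlogs is active at any time.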

  84. }
    SAFE_CALL( mpool_mdc_cend( log.get() ) )
    }
    SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    while( 1 ) {
    std::vector< char > buf( 4096, 0 );
    size_t size = 0;
    auto e = mpool_mdc_read( log.get(), buf.data(), buf.size() - 1,
    &size );
    if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
    buf.resize( size + 1, 0 );
    SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1,
    &size ) );
    }
    else SAFE_CALL( e )
    if( !size ) break;
    std::cout << "data: " << buf.data() << std::endl;
    }
    if( delete_log ) {
    log.reset();
    SAFE_CALL( mpool_mdc_destroy( pool.get(), log1, log2 ) )
    }
    mdc API
    mpool_mdc_rewind moves to the head of the active log.
    Each call to mpool_mdc_read returns the next log record in order.
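
The read loop above retries with a larger buffer when a record does not fit. A minimal sketch of that grow-and-retry pattern against an in-memory stand-in follows. toy_log_reader and toy_read are hypothetical, and their exact return conventions are assumptions made for this example rather than documented mpool behavior:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <deque>
#include <string>
#include <vector>

// Hypothetical record source standing in for an opened mdc.
struct toy_log_reader {
  std::deque< std::string > records;
};

// Returns 0 on success. Sets *size to the record length, or to 0 at end of log.
// Returns EOVERFLOW without consuming the record when the buffer is too small.
int toy_read( toy_log_reader &r, char *buf, std::size_t cap, std::size_t *size ) {
  if( r.records.empty() ) { *size = 0; return 0; }
  const std::string &rec = r.records.front();
  *size = rec.size();
  if( rec.size() > cap ) return EOVERFLOW;
  std::memcpy( buf, rec.data(), rec.size() );
  r.records.pop_front();
  return 0;
}

// Read every record, growing the buffer once whenever a record does not fit.
std::vector< std::string > read_all( toy_log_reader &r, std::size_t initial_cap ) {
  std::vector< std::string > out;
  while( true ) {
    std::vector< char > buf( initial_cap );
    std::size_t size = 0;
    int e = toy_read( r, buf.data(), buf.size(), &size );
    if( e == EOVERFLOW ) {
      buf.resize( size ); // size now holds the required length
      e = toy_read( r, buf.data(), buf.size(), &size );
    }
    if( e != 0 || size == 0 ) break; // error or end of log
    out.emplace_back( buf.data(), size );
  }
  return out;
}
```

As in the demo, a zero-length result is treated as the end of the log, so this sketch cannot represent empty records.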

  85. if( params.count( "message" ) )
    for( const auto &a: params[ "message" ].as< std::vector< std::string >
    >() )
    SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void*
    >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) )
    if( params.count( "compact" ) ) {
    auto v = params[ "compact" ].as< std::vector< std::string > >();
    std::sort( v.begin(), v.end() );
    std::vector< std::vector< char > > bufs;
    SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    while( 1 ) {
    std::vector< char > buf( 4096, 0 );
    size_t size = 0;
    auto e = mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size
    );
    if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
    buf.resize( size + 1, 0 );
    SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1,
    &size ) );
    }
    else SAFE_CALL( e )
    mdc API
    [Figure: an mdc with two mlogs; one holds records 1 2 3 4 5, of which 1 and 3 are read out]
    To garbage-collect, first read out the log records that are still valid.

  86. if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
    buf.resize( size + 1, 0 );
    SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1,
    &size ) );
    }
    else SAFE_CALL( e )
    if( !size ) break;
    if( std::binary_search( v.begin(), v.end(), std::string( buf.data() )
    ) ) {
    buf.resize( size );
    bufs.emplace_back( std::move( buf ) );
    }
    }
    SAFE_CALL( mpool_mdc_cstart( log.get() ) )
    for( const auto &buf: bufs ) {
    SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void*
    >( static_cast< const void* >( buf.data() ) ), buf.size(), 0 ) )
    }
    SAFE_CALL( mpool_mdc_cend( log.get() ) )
    }
    SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    mdc API
    [Figure: one mlog holds records 1 2 3 4 5; the valid records 1 and 3 are written to the other mlog]
    mpool_mdc_cstart switches the active mlog.
    Then mpool_mdc_append writes the valid log records.

  87. if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
    buf.resize( size + 1, 0 );
    SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1,
    &size ) );
    }
    else SAFE_CALL( e )
    if( !size ) break;
    if( std::binary_search( v.begin(), v.end(), std::string( buf.data() )
    ) ) {
    buf.resize( size );
    bufs.emplace_back( std::move( buf ) );
    }
    }
    SAFE_CALL( mpool_mdc_cstart( log.get() ) )
    for( const auto &buf: bufs ) {
    SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void*
    >( static_cast< const void* >( buf.data() ) ), buf.size(), 0 ) )
    }
    SAFE_CALL( mpool_mdc_cend( log.get() ) )
    }
    SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    mdc API
    [Figure: an mdc with two mlogs; only records 1 and 3 remain]
    Finally, mpool_mdc_cend TRIMs the inactive log.
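
The three garbage-collection steps (read the valid records, switch the active mlog, TRIM the old one) can be sketched with an in-memory model. toy_mdc is a hypothetical class written for this example; its method names mirror the mpool calls on the slides, but it is not the mpool implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy in-memory model of an mdc: two logs, only one of which is active.
class toy_mdc {
public:
  void append( const std::string &rec ) { logs_[ active_ ].push_back( rec ); }
  void rewind() { cursor_ = 0; }
  // Returns false at the end of the active log.
  bool read( std::string &out ) {
    if( cursor_ >= logs_[ active_ ].size() ) return false;
    out = logs_[ active_ ][ cursor_++ ];
    return true;
  }
  // Start compaction: make the other (empty) log the active one.
  void cstart() { active_ = 1 - active_; cursor_ = 0; }
  // End compaction: discard the now-inactive log (the TRIM step).
  void cend() { logs_[ 1 - active_ ].clear(); }
  std::size_t active_size() const { return logs_[ active_ ].size(); }
  std::size_t inactive_size() const { return logs_[ 1 - active_ ].size(); }
private:
  std::vector< std::string > logs_[ 2 ];
  int active_ = 0;
  std::size_t cursor_ = 0;
};

// Keep only the records for which keep( record ) returns true.
template< typename Pred >
void compact( toy_mdc &mdc, Pred keep ) {
  std::vector< std::string > live;
  std::string rec;
  mdc.rewind();
  while( mdc.read( rec ) )
    if( keep( rec ) ) live.push_back( rec );
  mdc.cstart();
  for( const auto &r : live ) mdc.append( r );
  mdc.cend();
}
```

With records 1 to 5 appended and only 1 and 3 kept, the active log ends up holding 1 and 3 and the other log is empty, matching the figures.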

  88. if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) {
    buf.resize( size + 1, 0 );
    SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1,
    &size ) );
    }
    else SAFE_CALL( e )
    if( !size ) break;
    std::cout << "data: " << buf.data() << std::endl;
    }
    if( delete_log ) {
    log.reset();
    SAFE_CALL( mpool_mdc_destroy( pool.get(), log1, log2 ) )
    }
    mdc API
    mpool_mdc_destroy deletes both mlogs together.

  89. mpool *raw_pool = nullptr;
    SAFE_CALL( mpool_open( params[ "pool" ].as< std::string
    >().c_str(), O_RDWR, &raw_pool, nullptr ) );
    std::shared_ptr< mpool > pool( raw_pool, []( mpool *p )
    { if( p ) mpool_close( p ); } );
    std::vector< uint64_t > object_ids = params[ "object" ].as<
    std::vector< uint64_t > >();
    mcache API
    An mblock has no page cache.
    To keep data that is read repeatedly in memory,
    create a page cache with mcache.
    First, open the mpool device with mpool_open.

  90. uint64_t block_id = 0;
    SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id,
    &block_id, &props ) )
    return props;
    } );
    {
    mpool_mcache_map *raw_map;
    SAFE_CALL( mpool_mcache_mmap( pool.get(), object_ids.size(),
    object_ids.data(), MPC_VMA_WARM, &raw_map ) );
    std::shared_ptr< mpool_mcache_map > map( raw_map, [pool]
    ( mpool_mcache_map *p ) {
    if( p ) mpool_mcache_munmap( p );
    } );
    for( uint64_t cache_id = 0; cache_id != object_ids.size();
    ++cache_id ) {
    SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0,
    props[ cache_id ].mpr_write_len, MADV_WILLNEED ) )
    size_t offset = 0u;
    mcache API
    mpool_mcache_mmap takes the object ids of the mblocks
    to be placed in the mcache.
    To stop caching, call mpool_mcache_munmap.

  91. } );
    for( uint64_t cache_id = 0; cache_id != object_ids.size();
    ++cache_id ) {
    SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0,
    props[ cache_id ].mpr_write_len, MADV_WILLNEED ) )
    size_t offset = 0u;
    void *page = nullptr;
    SAFE_CALL( mpool_mcache_getpages( map.get(), 1, cache_id,
    &offset, &page ) );
    char *data = reinterpret_cast< char* >( page );
    std::cout << "length: " << props[ cache_id ].mpr_write_len
    << std::endl;
    std::cout << "data: " << data << std::endl;
    }
    }
    mcache API
    mpool_mcache_madvise signals that
    the mblock at index cache id will be needed soon.
    mpool_mcache_getpages returns the address of the page cache.

  92. if( p ) mpool_mcache_munmap( p );
    } );
    for( uint64_t cache_id = 0; cache_id != object_ids.size();
    ++cache_id ) {
    SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0,
    props[ cache_id ].mpr_write_len, MADV_WILLNEED ) )
    size_t offset = 0u;
    void *page = nullptr;
    SAFE_CALL( mpool_mcache_getpages( map.get(), 1, cache_id,
    &offset, &page ) );
    char *data = reinterpret_cast< char* >( page );
    std::cout << "length: " << props[ cache_id ].mpr_write_len
    << std::endl;
    std::cout << "data: " << data << std::endl;
    }
    }
    mcache API
    Key point:
    Because the application controls when an mcache is created and destroyed,
    this cache can be used directly
    as the storage engine's cache.

  93. switch (cmd) {
    case MPIOC_MP_CREATE:
    case MPIOC_MP_ACTIVATE:
    case MPIOC_MP_DESTROY:
    case MPIOC_MP_RENAME:
    err = mpioc_mp_cmd(unit, cmd, argp);
    break;
    case MPIOC_MP_DEACTIVATE:
    err = mpioc_mp_deactivate(unit, cmd, argp);
    break;
    case MPIOC_DRV_ADD:
    err = mpioc_mp_add(unit, cmd, argp);
    break;
    case MPIOC_PARAMS_SET:
    err = mpioc_params_set(unit, cmd, argp);
    break;
    case MPIOC_PARAMS_GET:
    err = mpioc_params_get(unit, cmd, argp);
    break;
    case MPIOC_MP_MCLASS_GET:
    err = mpioc_mp_mclass_get(unit, cmd, argp);
    break;
    case MPIOC_PROP_GET:
    err = mpioc_proplist_get(unit, cmd, argp);
    break;
    case MPIOC_DEVPROPS_GET:
    err = mpioc_devprops_get(unit, argp);
    break;
    case MPIOC_MB_ALLOC:
    mpool-kmod/src/mpctl.c
    static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
    Except for mdc, mpool operations
    are mapped directly to ioctls
    that call functions
    in kernel space.

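  94. The mpool superblock
    The mapping from object id to location on storage
    is kept in the kernel's red-black tree (rbtree).
    [Figure: an rbtree with nodes 1, 2 and 3, indexing objects stored in the order 1 3 2]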

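  95. The mpool superblock
    Changes to this red-black tree
    are recorded in the mdc placed at the head of the mpool (mdc0).
    [Figure: the rbtree with nodes 1, 2 and 3, objects stored in the order 1 3 2, and mdc0]
    When the mpool is activated, this log is scanned to build the rbtree.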
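
Rebuilding an index by replaying a log can be sketched as follows. The record layout (op_kind, extent, log_record) is hypothetical and only illustrates the idea; it is not the on-media format of mdc0. std::map, an ordered tree, stands in for the kernel rbtree:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class op_kind { create, destroy };

// Where an object lives on storage.
struct extent {
  std::uint64_t offset;
  std::uint64_t size;
};

// Hypothetical log record: an object was created at an extent, or destroyed.
struct log_record {
  op_kind op;
  std::uint64_t object_id;
  extent where; // ignored for destroy
};

// Replay the log from the beginning to rebuild the object id -> extent index.
std::map< std::uint64_t, extent > replay( const std::vector< log_record > &log ) {
  std::map< std::uint64_t, extent > index;
  for( const auto &r : log ) {
    if( r.op == op_kind::create ) index[ r.object_id ] = r.where;
    else index.erase( r.object_id );
  }
  return index;
}
```

Because each record carries little more than an id, an offset and a size, the log stays small and replaying it at activation is a single linear pass.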

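  96. [Figure: objects stored in the order 1 3 2, and mdc0]
    Key point:
    Unlike file system metadata,
    mdc0 holds little more than object ids, locations and sizes.
    So no matter how much the mdcs other than mdc0 are modified,
    nothing is appended to the log in mdc0.
    This avoids stacking one log on top of another.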

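  97. Put together like this, it becomes a storage engine
    [Figure: a tree with a root and nodes 0 1 2' 4 5, where nodes 1 and 2' are cached, plus a log; the parts are labeled mcache, mblock and mdc]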

  98. HSE_SAFE_CALL( hse_kvdb_init() );
    std::shared_ptr< void > context( nullptr, []( void* )
    { hse_kvdb_fini(); } );
    const std::string pool_name = params[ "pool" ].as< std::string
    >();
    if( create_kvdb )
    HSE_SAFE_CALL( hse_kvdb_make( pool_name.c_str(), nullptr ) );
    hse_kvdb *raw_kvdb = nullptr;
    HSE_SAFE_CALL( hse_kvdb_open( pool_name.c_str(), nullptr,
    &raw_kvdb ) );
    std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb
    *p ) { if( p ) hse_kvdb_close( p ); } );
    const std::string kvs_name = params[ "kvs" ].as< std::string
    >();
    if( create_kvs )
    HSE API
    hse_kvdb_init prepares HSE for use.
    To clean up, call hse_kvdb_fini.

  99. std::shared_ptr< void > context( nullptr, []( void* )
    { hse_kvdb_fini(); } );
    const std::string pool_name = params[ "pool" ].as< std::string
    >();
    if( create_kvdb )
    HSE_SAFE_CALL( hse_kvdb_make( pool_name.c_str(), nullptr ) );
    hse_kvdb *raw_kvdb = nullptr;
    HSE_SAFE_CALL( hse_kvdb_open( pool_name.c_str(), nullptr,
    &raw_kvdb ) );
    std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb
    *p ) { if( p ) hse_kvdb_close( p ); } );
    const std::string kvs_name = params[ "kvs" ].as< std::string
    >();
    if( create_kvs )
    HSE_SAFE_CALL( hse_kvdb_kvs_make( kvdb.get(),
    kvs_name.c_str(), nullptr ) );
    hse_kvs *raw_kvs;
    HSE_SAFE_CALL( hse_kvdb_kvs_open( kvdb.get(), kvs_name.c_str(),
    nullptr, &raw_kvs ) );
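    HSE API
    hse_kvdb_make creates a kvdb in the specified mpool.
    hse_kvdb_open opens the kvdb.
    [Figure: an mpool containing a kvdb, which contains three kvs, each holding key/data pairs; a "this" label marks the part being described]
    A kvdb can contain multiple kvs (tables).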

  100. std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb
    *p ) { if( p ) hse_kvdb_close( p ); } );
    const std::string kvs_name = params[ "kvs" ].as< std::string
    >();
    if( create_kvs )
    HSE_SAFE_CALL( hse_kvdb_kvs_make( kvdb.get(),
    kvs_name.c_str(), nullptr ) );
    hse_kvs *raw_kvs;
    HSE_SAFE_CALL( hse_kvdb_kvs_open( kvdb.get(), kvs_name.c_str(),
    nullptr, &raw_kvs ) );
    std::shared_ptr< hse_kvs > kvs( raw_kvs, [kvdb]( hse_kvs *p )
    { if( p ) hse_kvdb_kvs_close( p ); } );
    hse_kvdb_opspec os;
    HSE_KVDB_OPSPEC_INIT( &os );
    std::shared_ptr< hse_kvdb_txn > transaction( hse_kvdb_txn_alloc(
    kvdb.get() ), [kvdb]( hse_kvdb_txn *p ) { if( p )
    hse_kvdb_txn_free( kvdb.get(), p ); } );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
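    HSE API
    hse_kvdb_kvs_make creates a kvs in the specified kvdb.
    hse_kvdb_kvs_open opens the kvs.
    [Figure: an mpool containing a kvdb, which contains a kvs holding key/data pairs; a "this" label marks the part being described]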

  101. std::shared_ptr< hse_kvs > kvs( raw_kvs, [kvdb]( hse_kvs *p )
    { if( p ) hse_kvdb_kvs_close( p ); } );
    hse_kvdb_opspec os;
    HSE_KVDB_OPSPEC_INIT( &os );
    std::shared_ptr< hse_kvdb_txn > transaction( hse_kvdb_txn_alloc(
    kvdb.get() ), [kvdb]( hse_kvdb_txn *p ) { if( p )
    hse_kvdb_txn_free( kvdb.get(), p ); } );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
    for( const auto &v: put_value ) {
    HSE_SAFE_CALL( hse_kvs_put( kvs.get(), &os, v.first.data(),
    v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
    std::array< char, 100 > data{ 0 };
    bool found = false;
    size_t length = 0;
    HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(),
    v.size(), &found, data.data(), data.size(), &length ) );
    HSE API
    hse_kvdb_txn_alloc creates a new transaction.
    To discard it, call hse_kvdb_txn_free.
    [Figure: root, root(1) and the log; a "this" label marks the part being described]

  102. hse_kvdb_txn_free( kvdb.get(), p ); } );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
    for( const auto &v: put_value ) {
    HSE_SAFE_CALL( hse_kvs_put( kvs.get(), &os, v.first.data(),
    v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
    std::array< char, 100 > data{ 0 };
    bool found = false;
    size_t length = 0;
    HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(),
    v.size(), &found, data.data(), data.size(), &length ) );
    if( found )
    std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
    HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
    HSE API
    hse_kvdb_txn_begin starts the transaction.
    hse_kvs_put writes a key-value pair.
    [Figure: root, root(1) with key/data pairs, and the log; a "this" label marks the part being described]

  103. v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
    std::array< char, 100 > data{ 0 };
    bool found = false;
    size_t length = 0;
    HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(),
    v.size(), &found, data.data(), data.size(), &length ) );
    if( found )
    std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
    HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
    }
    else {
    HSE_SAFE_CALL( hse_kvdb_txn_commit( kvdb.get(),
    os.kop_txn ) );
    }
    HSE API
    hse_kvs_get retrieves the value for a key.
    [Figure: root(1) with key/data pairs, root, and the log]

  104. v.size(), &found, data.data(), data.size(), &length ) );
    if( found )
    std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
    HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
    }
    else {
    HSE_SAFE_CALL( hse_kvdb_txn_commit( kvdb.get(),
    os.kop_txn ) );
    }
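    HSE API
    hse_kvdb_txn_commit finalizes the writes.
    hse_kvdb_txn_abort cancels the writes made so far.
    [Figure: root(1) with key/data pairs, root, and a log holding two "insert value" records; a "this" label marks the part being described]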
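
The begin / put / get / commit-or-abort flow can be sketched with a small in-memory model. toy_kvs is a hypothetical class that buffers writes until commit. It is a deliberately simpler mechanism than the root / root(1) versions in the figures, and it says nothing about how HSE implements transactions:

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy key-value store with a single transaction at a time.
class toy_kvs {
public:
  void begin_txn() { pending_.clear(); in_txn_ = true; }
  void put( const std::string &key, const std::string &value ) {
    // Inside a transaction, writes go to the private buffer.
    ( in_txn_ ? pending_ : committed_ )[ key ] = value;
  }
  bool get( const std::string &key, std::string &value ) const {
    if( in_txn_ ) { // a transaction sees its own uncommitted writes first
      const auto p = pending_.find( key );
      if( p != pending_.end() ) { value = p->second; return true; }
    }
    const auto c = committed_.find( key );
    if( c == committed_.end() ) return false;
    value = c->second;
    return true;
  }
  void commit_txn() { // publish the buffered writes
    for( const auto &kv : pending_ ) committed_[ kv.first ] = kv.second;
    pending_.clear();
    in_txn_ = false;
  }
  void abort_txn() { // throw the buffered writes away
    pending_.clear();
    in_txn_ = false;
  }
private:
  std::map< std::string, std::string > committed_;
  std::map< std::string, std::string > pending_;
  bool in_txn_ = false;
};
```

A value written inside a transaction is visible to that transaction, disappears on abort_txn, and stays after commit_txn.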

  105. Heterogeneous-Memory Storage Engine
    HSE aims to support several different kinds of storage device
    through a common interface:
    1. Conventional SSDs
    2. NVMe SSDs with Zoned Namespaces
    3. Non-volatile memory devices

  106. Heterogeneous-Memory Storage Engine
    HSE aims to support several different kinds of storage device
    through a common interface:
    1. Conventional SSDs: available as of version 1.7
    2. NVMe SSDs with Zoned Namespaces: not yet implemented
    3. Non-volatile memory devices: not yet implemented

  107. Heterogeneous-Memory Storage Engine
    HSE aims to support several different kinds of storage device
    through a common interface:
    1. Conventional SSDs: available as of version 1.7
    2. NVMe SSDs with Zoned Namespaces: not yet implemented
    3. Non-volatile memory devices: not yet implemented
    Non-volatile memory devices are covered
    in a talk prepared for an earlier Kernel/VM event,
    so please see that deck:
    https://speakerdeck.com/fadis/dian-yuan-woqie-tutemoxiao-enaimemoritofalsefu-kihe-ifang

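  108. Zoned Namespace
    [Figure: blocks 0, 1 and 2 divided into pages 0 to 7, four of them empty, and a translation table (marked "this") with the entry 2->8]
    As SSD capacity grows, the translation table grows too.
    This translation table needs RAM amounting to roughly 1/1000 of the SSD's capacity.
    The controller of a large-capacity SSD
    therefore has to carry a large amount of RAM. Painful.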
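
The rough 1/1000 figure is consistent with, for example, one 4-byte table entry per 4 KiB mapping unit. Both numbers are assumptions chosen for illustration; the slides do not state the entry size or the mapping unit. A minimal sketch of the arithmetic, with table_bytes as a hypothetical helper:

```cpp
#include <cassert>
#include <cstdint>

// Size of a flat translation table with one entry per mapping unit.
constexpr std::uint64_t table_bytes( std::uint64_t capacity_bytes,
                                     std::uint64_t unit_bytes,
                                     std::uint64_t entry_bytes ) {
  return ( capacity_bytes / unit_bytes ) * entry_bytes;
}
```

Under these assumptions a 1 TiB device needs table_bytes( 1 TiB, 4096, 4 ) = 1 GiB of RAM, a ratio of 1/1024. Making the mapping unit larger shrinks the table in proportion.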

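  109. Zoned Namespace
    [Figure: blocks 0, 1 and 2 divided into pages 0 to 7, some of them empty]
    Translating addresses at page granularity
    makes the translation table too large,
    and leaves blocks that are only partly TRIMmed.
    Instead, at block granularity,
    remember only where each block was allocated and how far from its head it has been used.
    TRIM always applies to a whole block.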

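  110. Zoned Namespace
    [Figure: blocks 0, 1 and 2, with parts marked "in use"]
    Use huge block sizes, on the order of 100 MB.
    A block can be appended to as long as it has free space.
    To erase what was written, the whole block must be deleted.
    This reduces the work the FTL has to do
    and exposes some of NAND's constraints directly to the application.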
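
The two rules above (append while there is room, erase only the whole block) can be sketched with a small in-memory model. toy_zone is a hypothetical class for illustration, not an NVMe or Linux interface:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one zone: append-only, and erasable only as a whole.
class toy_zone {
public:
  explicit toy_zone( std::size_t capacity ) : capacity_( capacity ) {}
  // Append at the write pointer. On success, offset receives where the data
  // landed. Fails without writing anything if the data does not fit.
  bool append( const char *data, std::size_t len, std::size_t &offset ) {
    if( len > capacity_ - data_.size() ) return false;
    offset = data_.size();
    data_.insert( data_.end(), data, data + len );
    return true;
  }
  // There is no partial erase: resetting drops everything in the zone.
  void reset() { data_.clear(); }
  std::size_t write_pointer() const { return data_.size(); }
private:
  std::size_t capacity_;
  std::vector< char > data_;
};
```

The only per-zone state the device side needs here is the write pointer, which is the "how far from the head has been used" bookkeeping described on the previous slide.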

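  111. dm-zoned
    dm-zoned is Linux's support for Zoned Namespaces.
    It makes the device look as if it had 4KiB pages.
    [Figure: layer diagram. User space: MySQL (InnoDB) and MongoDB (WiredTiger). Kernel space: VFS, file system, page cache, bio, dm-zoned, device driver. Hardware: Flash Translation Layer, NAND flash memory. Two page sizes are labeled, one in KiB and one in MiB.]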

  112. One more log-structured layer
    [Figure: the same layer diagram (MySQL/InnoDB, MongoDB/WiredTiger, VFS, file system, page cache, bio, dm-zoned, device driver, Flash Translation Layer, NAND flash memory), with four layers labeled "log"]

  113. What HSE aims for
    [Figure: the same layer diagram with HSE and mpool added alongside the stack of VFS, file system, page cache, bio, dm-zoned and device driver; page sizes are labeled once in KiB and twice in MiB]

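  114. Summary
    SSDs are tsundere.
    Getting good performance out of them
    requires rethinking how they are used, starting from the kernel layers.
    HSE is a KVS that did exactly this and achieved high performance.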