
What Is HSE?

Fadis
June 06, 2020

An explanation of the Heterogeneous-Memory Storage Engine (HSE).
These are the slides for a talk given at カーネル/VM探検隊 (Kernel/VM Tankentai) online part1 on June 6, 2020.

References
Heterogeneous-Memory Storage Engine: https://www.micron.com/hse
Don't stack your log on my log: https://www.usenix.org/node/187064
電源を切っても消えないメモリとの付き合い方 (How to live with memory that survives power-off): https://speakerdeck.com/fadis/dian-yuan-woqie-tutemoxiao-enaimemoritofalsefu-kihe-ifang
Sample code for these slides: https://github.com/Fadis/hse_demo
カーネル/VM探検隊 online part1: https://connpass.com/event/175388/


Transcript

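  1. The role of a storage engine: transactions

    Diagram labels: PUT, ABORT; PUT, COMMIT; BEGIN, PUT, COMMIT; GET, GET ("this is what is visible"); GET, GET ("this is what is visible"); GET ("this is what is visible").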
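The visibility rule on this slide (a PUT becomes visible to other readers only after COMMIT, and an ABORT discards it) can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `ToyStore` and `Txn` are hypothetical names and are not part of HSE, mpool, or WiredTiger.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy key-value store: writes are buffered per transaction
// and reach the shared, committed map only on commit().
class ToyStore {
public:
  class Txn {
  public:
    explicit Txn(ToyStore &s) : store_(s) {}
    void put(const std::string &k, const std::string &v) { pending_[k] = v; }
    // A transaction sees its own pending writes first, then committed data.
    bool get(const std::string &k, std::string &out) const {
      auto it = pending_.find(k);
      if (it != pending_.end()) { out = it->second; return true; }
      return store_.get(k, out);
    }
    void commit() {
      for (const auto &kv : pending_) store_.committed_[kv.first] = kv.second;
      pending_.clear();
    }
    void abort() { pending_.clear(); }
  private:
    ToyStore &store_;
    std::map<std::string, std::string> pending_;
  };

  // Readers outside any transaction see committed data only.
  bool get(const std::string &k, std::string &out) const {
    auto it = committed_.find(k);
    if (it == committed_.end()) return false;
    out = it->second;
    return true;
  }
private:
  std::map<std::string, std::string> committed_;
};
```

A value written with `Txn::put` stays invisible to `ToyStore::get` until `commit()` runs; after `abort()` it never becomes visible.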
  2. How a storage engine is implemented (the WiredTiger case)

    Diagram labels: pages 0 1 2 4 5 under root; session 0; session 1 PUTs 2'; root(1), 2'; log.
  3. How a storage engine is implemented (the WiredTiger case)

    Diagram labels: pages 0 1 2 4 5 under root; session 0 GETs 2; session 1 (root(1), 2') GETs 2; log.
  4. How a storage engine is implemented (the WiredTiger case)

    Diagram labels: pages 0 1 2 4 5 under root; COMMIT; session 1; root(1), 2'; "[...] becomes [...]"; log; session 0.
  5. How a storage engine is implemented (the WiredTiger case)

    Diagram labels: pages 0 1 2 4 5 under root; session 0; session 1; 2'; "[...] becomes [...]"; log; GET 2; GET 2.
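  6. How a storage engine is implemented

    Comparison (storage engine / file system):
    How data is located: key / file path
    Decides where data is recorded: yes / yes
    Cache: yes / yes
    Transactions: controlled by the application / none
    State after an interruption: only completed transactions are reflected / some state in which the file system is not corrupted
    How the state is recovered after an interruption: structured log / structured log
    This resembles a file system journal, but a file system does not provide transactions.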
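  7. Constraints of [flash memory]

    Diagram: cells reading 0 0 0 0 0, with 20V and 0V applied.
    Charge is drained from every cell in the same block (that is, the whole block is zero-cleared). A block normally consists of several pages, so a single page cannot be zero-cleared on its own.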
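  8. Constraints of [flash memory]

    Diagram: cells reading 0 0 1 0 1, with 0V and 20V applied.
    Afterwards, charge is stored in the cells that should become 1. Every time charge moves in or out, the tunnel oxide film of the cell wears down.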
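  9. Constraints of [flash memory]

    Diagram: block 0 holds pages 0 1 2 3, block 1 holds pages 4 5 6 7; page 2 should be rewritten to 2'.
    If the page is rewritten in place, as on a hard disk: 1. zero-clear block 0 (0 1 2 3); 2. write back 0 1 2' 3. Extremely slow.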
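  10. Constraints of [flash memory]

    Diagram: block 0 holds pages 0 1 2 3, block 1 holds pages 4 5 6 7; 2' is written to the same page over and over.
    If the page is rewritten in place, as on a hard disk, page 2 wears out at an extreme rate and becomes unusable. Extremely fragile.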
  11. Flash Translation Layer

    Diagram: block 0 (0 1 2 3), block 1 (4 5 6 7), block 2 (free, free, free, free).
    The SSD controller keeps zero-clearing free areas whenever it is idle.
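  12. Flash Translation Layer

    Diagram: block 0 (0 1 2 3), block 1 (4 5 6 7), block 2 (2', free, free, free); page 2 should be rewritten to 2'.
    When a write request arrives, the data is written to a page that has already been zero-cleared.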
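  13. Flash Translation Layer

    Diagram: block 0 (0 1 2 3), block 1 (4 5 6 7), block 2 (2', free, free, free). Translation table: 2->8.
    The SSD keeps a translation table that records at which physical address the page of a given LBA is stored.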
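  14. Translation table: mapping LBAs to physical addresses

    Log: LBA2 moved to physical address 8; LBA5 moved to physical address 9; LBA1 moved to address 10; LBA1 was TRIMmed; LBA2 moved to physical address 11; LBA3 moved to physical address 12.
    Resulting table (LBA -> physical address): 2 -> 11, 3 -> 12, 5 -> 9.
    The translation table is kept both in the device's RAM and in flash memory. Flash memory cannot be rewritten row by row, so every change is appended to a structured log.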
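Replaying such an append-only log to rebuild the table can be sketched as follows. This is a minimal illustration, not how any real SSD firmware is written; `FtlLogEntry` and `replay` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One appended record of the (hypothetical) translation log.
struct FtlLogEntry {
  std::uint64_t lba;
  bool trim;               // true: the LBA was TRIMmed
  std::uint64_t physical;  // meaningful only when trim is false
};

// Replays the log from oldest to newest; later entries override earlier ones.
std::map<std::uint64_t, std::uint64_t>
replay(const std::vector<FtlLogEntry> &log) {
  std::map<std::uint64_t, std::uint64_t> table;
  for (const auto &e : log) {
    if (e.trim)
      table.erase(e.lba);          // TRIM removes the mapping
    else
      table[e.lba] = e.physical;   // remap replaces any older mapping
  }
  return table;
}
```

Feeding it the six log entries from the slide yields the table 2 -> 11, 3 -> 12, 5 -> 9; LBA 1 disappears because its last entry is a TRIM.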
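  15. Translation table: the FTL garbage collector

    Log: LBA2 moved to physical address 8; LBA5 moved to physical address 9; LBA1 moved to address 10; LBA1 was TRIMmed; LBA2 moved to physical address 11; LBA3 moved to physical address 12; LBA2 moved to physical address 13; LBA5 was TRIMmed.
    Resulting table (LBA -> physical address): 2 -> 13, 3 -> 12.
    Diagram: block 3 (3, 2, free, free), block 2 (free, free, free, free).
    When it is idle, the SSD controller zero-clears blocks whose pages are no longer referenced by the translation table at all.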
  16. The FTL garbage collector

    Diagram: block 3 (3 2 4 4), block 2 (1 2 1 2).
    The case where zero-cleared pages are running low, yet every block is only partially in use.
  17. The FTL garbage collector

    Diagram: block 3 (3 2), block 2 (4 4 1 2 1 2), block 4 (free, free, 1, free).
    Only the valid pages of the partially used blocks are written to a new block.
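The copy step described on this slide is where extra writes come from. A rough sketch, assuming a simplified model in which a page is free, valid, or stale (`Page`, `Block`, and `collect` are hypothetical names, and real controllers choose victim blocks far more carefully):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A page is free (zero-cleared), valid (live data), or stale (dead data).
enum class Page { Free, Valid, Stale };
using Block = std::vector<Page>;

// Copies the valid pages of every block containing stale pages into `spare`
// (assumed to be fully free and large enough), then zero-clears those blocks.
// Returns the number of page copies: writes the host never asked for.
std::size_t collect(std::vector<Block> &blocks, Block &spare) {
  std::size_t copies = 0;
  std::size_t next = 0;
  for (auto &b : blocks) {
    bool has_stale = false;
    for (Page p : b)
      if (p == Page::Stale) has_stale = true;
    if (!has_stale) continue;            // nothing to reclaim here
    for (Page p : b) {
      if (p != Page::Valid) continue;
      assert(next < spare.size());       // the sketch assumes enough room
      spare[next++] = Page::Valid;       // relocate the live page
      ++copies;
    }
    for (auto &p : b) p = Page::Free;    // erase works on the whole block
  }
  return copies;
}
```

The return value is the garbage collector's contribution to write amplification: the fewer valid pages remain in a block when it is reclaimed, the cheaper the collection.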
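  18. Flash Translation Layer

    Comparison (Flash Translation Layer / file system):
    How data is located: LBA / file path
    Decides where data is recorded: yes / yes
    Cache: yes / yes
    Transactions: none / none
    State after an interruption: some state in which the address translation table is not corrupted / some state in which the file system is not corrupted
    How the state is recovered after an interruption: structured log / structured log
    This resembles a file system journal, but the granularity of the log is one page.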
  19. The result is a log stacked on a log stacked on a log

    Diagram: user space (MySQL, MongoDB, WiredTiger, InnoDB: log); kernel space (VFS, file system, page cache, bio, device driver: log); hardware (Flash Translation Layer, NAND flash memory: log).
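  20. Erasing in block units

    Diagram, before: block 0 (free, free, 5, 7), block 1 (free, free, 7, 9), block 2 (free, free, free, free); pages 1 2 3 4 are to be written.
    Diagram, after: block 0 (3, 4, 5, 7), block 1 (1, 2, 7, 9), block 2 (free, free, free, free); pages 1 2 3 4 are to be TRIMmed.
    A block that can be zero-cleared right away is ideal: erases should preferably arrive in block units.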
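  21. Flash-Friendly File System (F2FS)

    Diagram: on-disk layout SB, CP, SIT, NAT, SSA, Main; the Main area is divided into sections, and several logs are laid out over those sections.
    A log-structured file system with multiple logs. The area used for logs is allocated in units of sections. The section size probably matches the block size. TRIMs issued during GC then arrive in block units, which is what the device wants.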
  22. Three independently running garbage collectors stacked on one another

    Diagram: user space (MySQL, MongoDB, WiredTiger, InnoDB: GC); kernel space (VFS, file system, page cache, bio, device driver: GC); hardware (Flash Translation Layer, NAND flash memory: GC).
  23. https://www.usenix.org/node/187064 Don't Stack Your Log On My Log YANG, J.,

    PLASSON, N., GILLIS, G., TALAGALA, N., AND SUNDARARAMAN, S. Don’t stack your log on my log. In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW) (2014).
  24. Don't Stack Your Log On My Log

    https://www.usenix.org/node/187064 YANG, J., PLASSON, N., GILLIS, G., TALAGALA, N., AND SUNDARARAMAN, S. Don’t stack your log on my log. In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW) (2014).
    A paper showing that when structured logs are stacked in several layers, writes to NAND keep growing and performance collapses.
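  25. Write Amplification

    Diagram: the storage engine wants to write "data"; the file system wants to write "meta0 + data"; the Flash Translation Layer wants to write "meta1 + meta0 + data"; what finally gets written carries meta0 through meta5 around the data.
    The metadata of an upper layer is just data to the layer below, so metadata gets attached to metadata.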
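  26. Write Amplification

    Diagram: file system log laid out over section 0, section 1, section 2, section 3; Flash Translation Layer log laid out over block 0, block 1, block 2.
    The log written in section 1 is no longer needed, so it is TRIMmed. When the section size and the block size differ, this leaves a block that is only partly TRIMmed.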
  27. Write Amplification

    Diagram: Flash Translation Layer log over block 0, block 1, block 2, block 3, block 4.
    To turn a partly TRIMmed block into zero-cleared space, the garbage collector copies the pages still in use to a new block.
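  28. Write Amplification

    A storage engine normally does not tell the lower layers that part of its log can be TRIMmed.
    Diagram: storage engine log (page 0, page 1, page 2, page 3), with some pages marked as used-up log; file system log (page 0, page 1, page 2, page 3): "the file exists, so the pages are in use"; Flash Translation Layer: "the pages are in use, so copy them to another block".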
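  29. Write Amplification

    Diagram: upper log (0 1 2 3 4 5 6) and lower log (0 1 2 3 4 5 6); the upper layer's new log holds 2 6, which also appear as new entries in the lower log.
    Garbage collection of a structured log extracts only the elements of the existing log that are still valid and copies them to a new log. This produces new writes in the log of the layer below.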
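  30. Write Amplification

    Diagram: upper log (0 1 2 3 4 5 6), lower log (0 1 2 3 5 6), and the copies 0 1 2 3 5 6 written again.
    If the upper layer's garbage collector runs right after the lower layer's garbage collector has run, it causes a large number of writes to the lower log, which has only just been freed up by garbage collection, and on top of that ...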
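  31. Write Amplification

    Diagram: upper log (0 1 2 3 4 5 6), lower log (0 1 2 3 5 6), and the copies 0 1 2 3 5 6 written again.
    ... it fills the lower log, which has only just been freed up by garbage collection, with a large number of elements waiting to be garbage collected. The lower layer's garbage collection is undone in an instant.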
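  32. Let's get rid of the file system

    Diagram: MySQL, MongoDB, WiredTiger, InnoDB (log: needed to implement transactions); VFS, file system, page cache, bio, device driver (log: the one we want to throw away); Flash Translation Layer, NAND flash memory (log: a function of the hardware).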
  33. HSE uses the kernel module mpool

    Diagram: user space (MySQL, MongoDB, WiredTiger, InnoDB, HSE); kernel space (VFS, file system, page cache, bio, device driver, mpool); hardware (Flash Translation Layer, NAND flash memory).
  34. mpool runs on top of a block device

    Diagram: user space (MySQL, MongoDB, WiredTiger, InnoDB, HSE); kernel space (VFS, file system, page cache, bio, device driver, mpool); hardware (Flash Translation Layer, NAND flash memory).
  35. Create an mpool device by specifying a block device

    root # modprobe mpool
    root # ls /dev/mpool*
    /dev/mpoolctl
    root # mpool create mp1 /dev/nvme0n1 uid=test gid=test mode=0600
    root # ls /dev/mpool*
    /dev/mpoolctl
    /dev/mpool:
    mp1
    root # mpool list
    MPOOL  TOTAL  USED   AVAIL  CAPACITY  LABEL  HEALTH
    mp1    466g   1.16g  441g   0.26%     raw    optimal
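  36. mpool provides three features

    Diagram: in user space, HSE sits on the mpool user-space library, which offers mblock, mlog, mcache, and mdc; each of mblock, mlog, and mcache reaches the mpool kernel module in kernel space through ioctl.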
  37. mpool *raw_pool = nullptr; SAFE_CALL( mpool_open( params[ "pool" ].as< std::string

    >().c_str(), O_RDWR, &raw_pool, nullptr ) ); std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } ); uint64_t block_id = 0u; mblock_props props;
    mblock API: mpool_open opens the mpool device. An mblock stores in the mpool a byte sequence whose length is an integer multiple of the page size. An mblock can be written only once, at creation time; it cannot be modified or appended to, but it can be deleted.
  38. mpool *raw_pool = nullptr; SAFE_CALL( mpool_open( params[ "pool" ].as< std::string

    >().c_str(), O_RDWR, &raw_pool, nullptr ) ); std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } ); uint64_t block_id = 0u; mblock_props props; size_t length = 0; if( !params.count( "object" ) ) { memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) ); SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) ) std::cout << "object id: " << props.mpr_objid << std::endl; std::string m = params[ "message" ].as< std::string >(); size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE;
    mblock API: mpool_mblock_alloc creates a new mblock. The 64-bit block id returned here plays a role similar to a file descriptor.
  39. size_t length = 0; if( !params.count( "object" ) ) {

    memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) ); SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) ) std::cout << "object id: " << props.mpr_objid << std::endl; std::string m = params[ "message" ].as< std::string >(); size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE; std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) ); if( !buf ) throw std::bad_alloc(); memset( buf.get(), 0, buf_size ); std::copy( m.begin(), m.end(), buf.get() ); iovec iov; iov.iov_base = buf.get(); iov.iov_len = buf_size;
    mblock API: The object id obtained at the same time, on the other hand, plays a role similar to a file name. Use the object id when looking this mblock up later.
  40. mblock_props props; size_t length = 0; if( !params.count( "object" )

    ) { memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) ); SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) ) std::cout << "object id: " << props.mpr_objid << std::endl; std::string m = params[ "message" ].as< std::string >(); size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE; std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) ); if( !buf ) throw std::bad_alloc(); memset( buf.get(), 0, buf_size ); std::copy( m.begin(), m.end(), buf.get() ); iovec iov; iov.iov_base = buf.get();
    mblock API: Data written to an mblock must be aligned to a page boundary. mpool writes do not go through a page cache; the kernel hands the memory allocated here directly to the device driver.
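The buffer preparation on this slide can be isolated into a small helper that does not depend on mpool. The sketch below assumes a 4096-byte page and a platform that provides C11 `aligned_alloc` (as glibc does); `kPageSize`, `round_up_to_page`, `PageBuffer`, and `make_page_buffer` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <stdlib.h>
#include <string>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Same rounding as the slide: the smallest multiple of the page size >= n.
std::size_t round_up_to_page(std::size_t n) {
  return (n / kPageSize + (n % kPageSize ? 1 : 0)) * kPageSize;
}

struct FreeDeleter {
  void operator()(void *p) const { ::free(p); }
};
using PageBuffer = std::unique_ptr<char, FreeDeleter>;

// Returns a zero-filled, page-aligned buffer holding a copy of `m`.
// The result is empty if `m` is empty or the allocation fails.
PageBuffer make_page_buffer(const std::string &m, std::size_t &size_out) {
  size_out = round_up_to_page(m.size());
  if (size_out == 0) return PageBuffer();
  PageBuffer buf(static_cast<char *>(::aligned_alloc(kPageSize, size_out)));
  if (!buf) return buf;
  std::memset(buf.get(), 0, size_out);
  std::memcpy(buf.get(), m.data(), m.size());
  return buf;
}
```

The pointer and the rounded size are what the slide places into `iov_base` and `iov_len`; the rounding also satisfies the `aligned_alloc` requirement that the size be a multiple of the alignment.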
  41. SAFE_CALL( mpool_mblock_alloc( pool.get(), MP_MED_CAPACITY, false, &block_id, &props ) ) std::cout

    << "object id: " << props.mpr_objid << std::endl; std::string m = params[ "message" ].as< std::string >(); size_t buf_size = ( m.size() / PAGE_SIZE + ( m.size() % PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE; std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) ); if( !buf ) throw std::bad_alloc(); memset( buf.get(), 0, buf_size ); std::copy( m.begin(), m.end(), buf.get() ); iovec iov; iov.iov_base = buf.get(); iov.iov_len = buf_size; length = buf_size; SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 ) ) if( abort_transaction )
    mblock API: mpool_mblock_write writes data to the mblock. By preparing several iovec entries, data from several memory regions can be combined into one write.
  42. PAGE_SIZE ? 1 : 0 ) ) * PAGE_SIZE; std::unique_ptr<

    char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) ); if( !buf ) throw std::bad_alloc(); memset( buf.get(), 0, buf_size ); std::copy( m.begin(), m.end(), buf.get() ); iovec iov; iov.iov_base = buf.get(); iov.iov_len = buf_size; length = buf_size; SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 ) ) if( abort_transaction ) SAFE_CALL( mpool_mblock_abort( pool.get(), block_id ) ) else SAFE_CALL( mpool_mblock_commit( pool.get(), block_id ) ) } else { uint64_t object_id = params[ "object" ].as< uint64_t >();
    mblock API: mpool_mblock_commit makes the changes final. If execution never reaches this function, every mpool_mblock_write up to that point is treated as if it had never happened. mpool_mblock_abort explicitly discards the changes made so far.
  43. iovec iov; iov.iov_base = buf.get(); iov.iov_len = buf_size; length =

    buf_size; SAFE_CALL( mpool_mblock_write( pool.get(), block_id, &iov, 1 ) ) if( abort_transaction ) SAFE_CALL( mpool_mblock_abort( pool.get(), block_id ) ) else SAFE_CALL( mpool_mblock_commit( pool.get(), block_id ) ) } else { uint64_t object_id = params[ "object" ].as< uint64_t >(); SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id, &block_id, &props ) ) length = props.mpr_write_len; std::cout << "object id: " << object_id << std::endl; }
    mblock API: To look up an mblock that has already been written, use mpool_mblock_find_get.
  44. SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id, &block_id, &props ) ) length =

    props.mpr_write_len; std::cout << "object id: " << object_id << std::endl; } { size_t buf_size = length; std::unique_ptr< char, free_deleter > buf( reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) ); if( !buf ) throw std::bad_alloc(); memset( buf.get(), 0, buf_size ); iovec iov; iov.iov_base = buf.get(); iov.iov_len = buf_size; SAFE_CALL( mpool_mblock_read( pool.get(), block_id, &iov, 1, 0 ) ) std::cout << "length: " << length << std::endl; std::cout << "data: " << buf.get() << std::endl; }
    mblock API: mpool_mblock_read reads the data. The buffer used for reading must also be aligned to a page boundary.
  45. { size_t buf_size = length; std::unique_ptr< char, free_deleter > buf(

    reinterpret_cast< char* >( aligned_alloc( PAGE_SIZE, buf_size ) ) ); if( !buf ) throw std::bad_alloc(); memset( buf.get(), 0, buf_size ); iovec iov; iov.iov_base = buf.get(); iov.iov_len = buf_size; SAFE_CALL( mpool_mblock_read( pool.get(), block_id, &iov, 1, 0 ) ) std::cout << "length: " << length << std::endl; std::cout << "data: " << buf.get() << std::endl; } if( delete_block ) SAFE_CALL( mpool_mblock_delete( pool.get(), block_id ) )
    mblock API: mpool_mblock_delete deletes the specified mblock as a whole.
  46. mpool *raw_pool = nullptr; SAFE_CALL( mpool_open( params[ "pool" ].as< std::string

    >().c_str(), O_RDWR|O_EXCL, &raw_pool, nullptr ) ); std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } ); mlog_capacity cap; memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) );
    mlog API: As with mblock, mpool_open opens the mpool device. An mlog stores in the mpool a byte sequence that can be appended to later. The maximum size of an mlog is fixed at creation time; once it has been filled up to that size, nothing more can be written to it.
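The capacity rule can be pictured with a small in-memory model. This is a sketch only: `ToyLog` is a hypothetical class, and it ignores whatever per-record overhead a real mlog adds.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical append-only log whose capacity is fixed at construction.
class ToyLog {
public:
  explicit ToyLog(std::size_t capacity) : capacity_(capacity) {}

  // Appends a record; refuses (returns false) once the capacity would be exceeded.
  bool append(const std::string &rec) {
    if (used_ + rec.size() > capacity_) return false;
    used_ += rec.size();
    records_.push_back(rec);
    return true;
  }

  std::size_t length() const { return used_; }
  bool empty() const { return records_.empty(); }
  const std::vector<std::string> &records() const { return records_; }

private:
  std::size_t capacity_;
  std::size_t used_ = 0;
  std::vector<std::string> records_;
};
```

Once `append` starts returning false, the only way to keep writing in this model is to start a new log, which is the gap that the two-log container described later in the deck fills.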
  47. std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) {

    if( p ) mpool_close( p ); } ); mlog_capacity cap; memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) ); std::shared_ptr< mpool_mlog > log; if( !params.count( "object" ) ) { cap.lcp_captgt = 4 * 1024 * 1024; mlog_props props; memset( reinterpret_cast< void* >( &props ), 0, sizeof( props ) ); mpool_mlog *raw_log = nullptr; SAFE_CALL( mpool_mlog_alloc( pool.get(), &cap, MP_MED_CAPACITY, &props, &raw_log ) ); log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p ) mpool_mlog_close( pool.get(), p ); } ); uint64_t object_id = props.lpr_objid; std::cout << "object id: " << object_id << std::endl; SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) )
    mlog API: mpool_mlog_alloc creates a new mlog. cap.lcp_captgt is the size of the area to use (an integer multiple of the page size).
  48. log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p )

    mpool_mlog_close( pool.get(), p ); } ); uint64_t object_id = props.lpr_objid; std::cout << "object id: " << object_id << std::endl; SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) ) } else { mlog_props props; mpool_mlog *raw_log = nullptr; SAFE_CALL( mpool_mlog_find_get( pool.get(), params[ "object" ].as<uint64_t>(), &props, &raw_log ) ) log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p ) mpool_mlog_close( pool.get(), p ); } ); uint64_t object_id = props.lpr_objid; std::cout << "object id: " << object_id << std::endl; } uint64_t gen = 0; SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) )
    mlog API: To look up an existing mlog, use mpool_mlog_find_get. Both mpool_mlog_alloc and mpool_mlog_find_get return an mlog_props.
  49. SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) ) } else { mlog_props

    props; mpool_mlog *raw_log = nullptr; SAFE_CALL( mpool_mlog_find_get( pool.get(), params[ "object" ].as<uint64_t>(), &props, &raw_log ) ) log.reset( raw_log, [pool]( mpool_mlog *p ) { if( p ) mpool_mlog_close( pool.get(), p ); } ); uint64_t object_id = props.lpr_objid; std::cout << "object id: " << object_id << std::endl; } uint64_t gen = 0; SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) ) if( params.count( "message" ) ) for( const auto &a: params[ "message" ].as< std::vector< std::string > >() ) SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(),
    mlog API: Using the mpool_mlog handle, open the log with mpool_mlog_open.
  50. mpool_mlog_close( pool.get(), p ); } ); uint64_t object_id = props.lpr_objid;

    std::cout << "object id: " << object_id << std::endl; } uint64_t gen = 0; SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) ) if( params.count( "message" ) ) for( const auto &a: params[ "message" ].as< std::vector< std::string > >() ) SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) ) if( abort_transaction ) SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) ) else SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) ) if( erase_log != std::numeric_limits< uint64_t >::max() ) SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(),
    mlog API: mpool_mlog_append_data appends a byte sequence to the mlog. The bytes to be written do not have to be aligned to a page boundary.
  51. } uint64_t gen = 0; SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0,

    &gen ) ) if( params.count( "message" ) ) for( const auto &a: params[ "message" ].as< std::vector< std::string > >() ) SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) ) if( abort_transaction ) SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) ) else SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) ) if( erase_log != std::numeric_limits< uint64_t >::max() ) SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(), erase_log ) ) bool empty = false; SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) ) std::cout << "empty: " << empty << std::endl;
    mlog API: mpool_mlog_commit makes the changes final. If execution never reaches this function, every mpool_mlog_append_data up to that point is treated as if it had never happened. mpool_mlog_abort explicitly discards the changes made so far.
  52. SAFE_CALL( mpool_mlog_open( pool.get(), log.get(), 0, &gen ) ) if( params.count(

    "message" ) ) for( const auto &a: params[ "message" ].as< std::vector< std::string > >() ) SAFE_CALL( mpool_mlog_append_data( pool.get(), log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) ) if( abort_transaction ) SAFE_CALL( mpool_mlog_abort( pool.get(), log.get() ) ) else SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) ) if( erase_log != std::numeric_limits< uint64_t >::max() ) SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(), erase_log ) ) bool empty = false; SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) ) std::cout << "empty: " << empty << std::endl; size_t len = 0; SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) )
    mlog API: mpool_mlog_erase erases the specified mlog in its entirety.
  53. else SAFE_CALL( mpool_mlog_commit( pool.get(), log.get() ) ) if( erase_log !=

    std::numeric_limits< uint64_t >::max() ) SAFE_CALL( mpool_mlog_erase( pool.get(), log.get(), erase_log ) ) bool empty = false; SAFE_CALL( mpool_mlog_empty( pool.get(), log.get(), &empty ) ) std::cout << "empty: " << empty << std::endl; size_t len = 0; SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) ) std::cout << "length: " << len << std::endl; SAFE_CALL( mpool_mlog_read_data_init( pool.get(), log.get() ) ) while( 1 ) { std::array< char, 1024u > buf; size_t length = 0u; SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(), buf.data(), buf.size() - 1, &length ) ); if( !length ) break; buf[ length ] = '\0';
    mlog API: mpool_mlog_empty reports whether the mlog is empty. mpool_mlog_len returns the size of the used part of the mlog.
  54. std::cout << "empty: " << empty << std::endl; size_t len

    = 0; SAFE_CALL( mpool_mlog_len( pool.get(), log.get(), &len ) ) std::cout << "length: " << len << std::endl; SAFE_CALL( mpool_mlog_read_data_init( pool.get(), log.get() ) ) while( 1 ) { std::array< char, 1024u > buf; size_t length = 0u; SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(), buf.data(), buf.size() - 1, &length ) ); if( !length ) break; buf[ length ] = '\0'; std::cout << "data: " << buf.data() << std::endl; } SAFE_CALL( mpool_mlog_flush( pool.get(), log.get() ) ) SAFE_CALL( mpool_mlog_close( pool.get(), log.get() ) ) if( delete_log ) SAFE_CALL( mpool_mlog_delete( pool.get(), log.get() ) )
    mlog API: mpool_mlog_read_data_init prepares for reading; mpool_mlog_read_data_next then returns the written contents in order, starting from the beginning.
  55. SAFE_CALL( mpool_mlog_read_data_next( pool.get(), log.get(), buf.data(), buf.size() - 1, &length )

    ); if( !length ) break; buf[ length ] = '\0'; std::cout << "data: " << buf.data() << std::endl; } SAFE_CALL( mpool_mlog_flush( pool.get(), log.get() ) ) SAFE_CALL( mpool_mlog_close( pool.get(), log.get() ) ) if( delete_log ) SAFE_CALL( mpool_mlog_delete( pool.get(), log.get() ) )
    mlog API: To delete an mlog, use mpool_mlog_delete.
  56. mpool *raw_pool = nullptr; SAFE_CALL( mpool_open( params[ "pool" ].as< std::string

    >().c_str(), O_RDWR|O_EXCL, &raw_pool, nullptr ) ); std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } ); uint64_t log1 = 0; uint64_t log2 = 0; if( !params.count( "object" ) ) { mdc_capacity cap;
    mdc API: MetaData Container, MDC for short. It combines two mlogs so that garbage collection becomes possible. As with mlog, mpool_open opens the mpool device.
  57. SAFE_CALL( mpool_open( params[ "pool" ].as< std::string >().c_str(), O_RDWR|O_EXCL, &raw_pool, nullptr

    ) ); std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } ); uint64_t log1 = 0; uint64_t log2 = 0; if( !params.count( "object" ) ) { mdc_capacity cap; memset( reinterpret_cast< void* >( &cap ), 0, sizeof( cap ) ); cap.mdt_captgt = 4 * 1024 * 1024; SAFE_CALL( mpool_mdc_alloc( pool.get(), &log1, &log2, MP_MED_CAPACITY, &cap, nullptr ) ); std::cout << "object id: " << log1 << ":" << log2 << std::endl; SAFE_CALL( mpool_mdc_commit( pool.get(), log1, log2 ) ) } else { auto v = params[ "object" ].as< std::string >(); boost::fusion::vector< uint64_t, uint64_t > parsed; namespace qi = boost::spirit::qi; if( !qi::parse( v.begin(), v.end(), qi::ulong_long >> ':' >> qi::ulong_long, parsed ) ) {
    mdc API: mpool_mdc_alloc creates an mdc. Two mlogs are created and two object ids are returned. The mdc_capacity argument specifies the size of each of the two mlogs.
  58. boost::fusion::vector< uint64_t, uint64_t > parsed; namespace qi = boost::spirit::qi; if(

    !qi::parse( v.begin(), v.end(), qi::ulong_long >> ':' >> qi::ulong_long, parsed ) ) { std::cerr << "invalid object id" << std::endl; return 1; } log1 = boost::fusion::at_c< 0 >( parsed ); log2 = boost::fusion::at_c< 1 >( parsed ); } mpool_mdc *raw_log = nullptr; SAFE_CALL( mpool_mdc_open( pool.get(), log1, log2, 0, &raw_log ) ); std::shared_ptr< mpool_mdc > log( raw_log, [pool]( mpool_mdc *p ) { if( p ) mpool_mdc_close( p ); } ); if( params.count( "message" ) ) for( const auto &a: params[ "message" ].as< std::vector< std::string > >() ) SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) ) if( params.count( "compact" ) ) { auto v = params[ "compact" ].as< std::vector< std::string > >();
    mdc API: mpool_mdc_open opens the mdc. An open mdc is closed with mpool_mdc_close.
  59. return 1; } log1 = boost::fusion::at_c< 0 >( parsed );

    log2 = boost::fusion::at_c< 1 >( parsed ); } mpool_mdc *raw_log = nullptr; SAFE_CALL( mpool_mdc_open( pool.get(), log1, log2, 0, &raw_log ) ); std::shared_ptr< mpool_mdc > log( raw_log, [pool]( mpool_mdc *p ) { if( p ) mpool_mdc_close( p ); } ); if( params.count( "message" ) ) for( const auto &a: params[ "message" ].as< std::vector< std::string > >() ) SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) ) if( params.count( "compact" ) ) { auto v = params[ "compact" ].as< std::vector< std::string > >(); std::sort( v.begin(), v.end() ); std::vector< std::vector< char > > bufs; SAFE_CALL( mpool_mdc_rewind( log.get() ) ) while( 1 ) { std::vector< char > buf( 4096, 0 ); size_t size = 0;
    mdc API: mpool_mdc_append appends a byte sequence to whichever mlog is currently active. The bytes to be written do not have to be aligned to a page boundary. Only one of the two mlogs is active at any time.
    Diagram: an mdc made of two mlogs; the active one holds records 1 2 3 4 5.
  60. } SAFE_CALL( mpool_mdc_cend( log.get() ) ) } SAFE_CALL( mpool_mdc_rewind( log.get()

    ) ) while( 1 ) { std::vector< char > buf( 4096, 0 ); size_t size = 0; auto e = mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ); if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) { buf.resize( size + 1, 0 ); SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) ); } else SAFE_CALL( e ) if( !size ) break; std::cout << "data: " << buf.data() << std::endl; } if( delete_log ) { log.reset(); SAFE_CALL( mpool_mdc_destroy( pool.get(), log1, log2 ) ) }
    mdc API: mpool_mdc_rewind moves to the beginning of the active log. Each call to mpool_mdc_read then returns the next record in order.
  61. if( params.count( "message" ) ) for( const auto &a: params[

    "message" ].as< std::vector< std::string > >() ) SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( a.data() ) ), a.size(), 1 ) ) if( params.count( "compact" ) ) { auto v = params[ "compact" ].as< std::vector< std::string > >(); std::sort( v.begin(), v.end() ); std::vector< std::vector< char > > bufs; SAFE_CALL( mpool_mdc_rewind( log.get() ) ) while( 1 ) { std::vector< char > buf( 4096, 0 ); size_t size = 0; auto e = mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ); if( mpool_errno( e ) == EOVERFLOW && size > buf.size() ) { buf.resize( size + 1, 0 ); SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) ); } else SAFE_CALL( e )
    mdc API: To garbage collect, first read out the records that are still valid.
    Diagram: an mdc made of two mlogs; the active one holds 1 2 3 4 5, of which 1 and 3 are read out.
  62. if( mpool_errno( e ) == EOVERFLOW && size > buf.size()

    ) { buf.resize( size + 1, 0 ); SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) ); } else SAFE_CALL( e ) if( !size ) break; if( std::binary_search( v.begin(), v.end(), std::string( buf.data() ) ) ) { buf.resize( size ); bufs.emplace_back( std::move( buf ) ); } } SAFE_CALL( mpool_mdc_cstart( log.get() ) ) for( const auto &buf: bufs ) { SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( buf.data() ) ), buf.size(), 0 ) ) } SAFE_CALL( mpool_mdc_cend( log.get() ) ) } SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    mdc API: mpool_mdc_cstart switches the active mlog; after that, mpool_mdc_append writes the valid records into it.
    Diagram: one mlog still holds 1 2 3 4 5; the newly active mlog receives 1 3.
  63. if( mpool_errno( e ) == EOVERFLOW && size > buf.size()

    ) { buf.resize( size + 1, 0 ); SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) ); } else SAFE_CALL( e ) if( !size ) break; if( std::binary_search( v.begin(), v.end(), std::string( buf.data() ) ) ) { buf.resize( size ); bufs.emplace_back( std::move( buf ) ); } } SAFE_CALL( mpool_mdc_cstart( log.get() ) ) for( const auto &buf: bufs ) { SAFE_CALL( mpool_mdc_append( log.get(), const_cast< void* >( static_cast< const void* >( buf.data() ) ), buf.size(), 0 ) ) } SAFE_CALL( mpool_mdc_cend( log.get() ) ) } SAFE_CALL( mpool_mdc_rewind( log.get() ) )
    mdc API: Finally, mpool_mdc_cend TRIMs the inactive log.
    Diagram: the active mlog holds 1 3; the other mlog is empty.
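The read, switch, rewrite, and trim sequence of the last three slides can be modelled in memory as follows. This is only a sketch of the idea: `ToyMdc` and `compact` are hypothetical, and the method names merely echo `mpool_mdc_cstart` and `mpool_mdc_cend` without reproducing their signatures or their durability guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical two-log container: exactly one of the two logs is active.
class ToyMdc {
public:
  void append(const std::string &rec) { logs_[active_].push_back(rec); }
  const std::vector<std::string> &records() const { return logs_[active_]; }
  // Compaction start: make the other (empty) log the active one.
  void cstart() { active_ = 1 - active_; }
  // Compaction end: discard the contents of the now-inactive log.
  void cend() { logs_[1 - active_].clear(); }
  std::size_t inactive_size() const { return logs_[1 - active_].size(); }

private:
  std::vector<std::string> logs_[2];
  int active_ = 0;
};

// Keeps only the records for which is_live returns true.
void compact(ToyMdc &mdc,
             const std::function<bool(const std::string &)> &is_live) {
  std::vector<std::string> live;
  for (const auto &r : mdc.records())     // 1. read out the valid records
    if (is_live(r)) live.push_back(r);
  mdc.cstart();                           // 2. switch the active log
  for (const auto &r : live)              // 3. rewrite the valid records
    mdc.append(r);
  mdc.cend();                             // 4. drop the old log
}
```

Because the old log is discarded as a whole, the space it occupied can be released in one large unit, which is the kind of erase pattern the earlier slides say the flash layer prefers.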
  64. if( mpool_errno( e ) == EOVERFLOW && size > buf.size()

    ) { buf.resize( size + 1, 0 ); SAFE_CALL( mpool_mdc_read( log.get(), buf.data(), buf.size() - 1, &size ) ); } else SAFE_CALL( e ) if( !size ) break; std::cout << "data: " << buf.data() << std::endl; } if( delete_log ) { log.reset(); SAFE_CALL( mpool_mdc_destroy( pool.get(), log1, log2 ) ) }
    mdc API: mpool_mdc_destroy deletes both mlogs together.
  65. mpool *raw_pool = nullptr; SAFE_CALL( mpool_open( params[ "pool" ].as< std::string

    >().c_str(), O_RDWR, &raw_pool, nullptr ) ); std::shared_ptr< mpool > pool( raw_pool, []( mpool *p ) { if( p ) mpool_close( p ); } ); std::vector< uint64_t > object_ids = params[ "object" ].as< std::vector< uint64_t > >();
    mcache API: An mblock has no page cache. When data that is read repeatedly should stay in memory, create a page cache with mcache. As before, start by opening the mpool device with mpool_open.
  66. uint64_t block_id = 0; SAFE_CALL( mpool_mblock_find_get( pool.get(), object_id, &block_id, &props

    ) ) return props; } ); { mpool_mcache_map *raw_map; SAFE_CALL( mpool_mcache_mmap( pool.get(), object_ids.size(), object_ids.data(), MPC_VMA_WARM, &raw_map ) ); std::shared_ptr< mpool_mcache_map > map( raw_map, [pool] ( mpool_mcache_map *p ) { if( p ) mpool_mcache_munmap( p ); } ); for( uint64_t cache_id = 0; cache_id != object_ids.size(); ++cache_id ) { SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0, props[ cache_id ].mpr_write_len, MADV_WILLNEED ) ) size_t offset = 0u;
    mcache API: mpool_mcache_mmap takes the object ids of the mblocks to be placed in the mcache. To stop caching, call mpool_mcache_munmap.
  67. } ); for( uint64_t cache_id = 0; cache_id != object_ids.size();

    ++cache_id ) { SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0, props[ cache_id ].mpr_write_len, MADV_WILLNEED ) ) size_t offset = 0u; void *page = nullptr; SAFE_CALL( mpool_mcache_getpages( map.get(), 1, cache_id, &offset, &page ) ); char *data = reinterpret_cast< char* >( page ); std::cout << "length: " << props[ cache_id ].mpr_write_len << std::endl; std::cout << "data: " << data << std::endl; } }
    mcache API: mpool_mcache_madvise announces that the mblock at index cache id will be needed soon. mpool_mcache_getpages returns the address of the page cache.
  68. if( p ) mpool_mcache_munmap( p ); } ); for( uint64_t

    cache_id = 0; cache_id != object_ids.size(); ++cache_id ) { SAFE_CALL( mpool_mcache_madvise( map.get(), cache_id, 0, props[ cache_id ].mpr_write_len, MADV_WILLNEED ) ) size_t offset = 0u; void *page = nullptr; SAFE_CALL( mpool_mcache_getpages( map.get(), 1, cache_id, &offset, &page ) ); char *data = reinterpret_cast< char* >( page ); std::cout << "length: " << props[ cache_id ].mpr_write_len << std::endl; std::cout << "data: " << data << std::endl; } }
    mcache API, key point: the application controls when an mcache is created and destroyed, so this cache can be used as the storage engine's own cache as it is.
  69. switch (cmd) { case MPIOC_MP_CREATE: case MPIOC_MP_ACTIVATE: case MPIOC_MP_DESTROY: case

    MPIOC_MP_RENAME: err = mpioc_mp_cmd(unit, cmd, argp); break; case MPIOC_MP_DEACTIVATE: err = mpioc_mp_deactivate(unit, cmd, argp); break; case MPIOC_DRV_ADD: err = mpioc_mp_add(unit, cmd, argp); break; case MPIOC_PARAMS_SET: err = mpioc_params_set(unit, cmd, argp); break; case MPIOC_PARAMS_GET: err = mpioc_params_get(unit, cmd, argp); break; case MPIOC_MP_MCLASS_GET: err = mpioc_mp_mclass_get(unit, cmd, argp); break; case MPIOC_PROP_GET: err = mpioc_proplist_get(unit, cmd, argp); break; case MPIOC_DEVPROPS_GET: err = mpioc_devprops_get(unit, argp); break; case MPIOC_MB_ALLOC:
    Source: mpool-kmod/src/mpctl.c, static long mpc_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
    Apart from mdc, each mpool operation is mapped directly to an ioctl and becomes a call to a function in kernel space.
  70. HSE_SAFE_CALL( hse_kvdb_init() ); std::shared_ptr< void > context( nullptr, []( void*

    ) { hse_kvdb_fini(); } ); const std::string pool_name = params[ "pool" ].as< std::string >(); if( create_kvdb ) HSE_SAFE_CALL( hse_kvdb_make( pool_name.c_str(), nullptr ) ); hse_kvdb *raw_kvdb = nullptr; HSE_SAFE_CALL( hse_kvdb_open( pool_name.c_str(), nullptr, &raw_kvdb ) ); std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb *p ) { if( p ) hse_kvdb_close( p ); } ); const std::string kvs_name = params[ "kvs" ].as< std::string >(); if( create_kvs )
    HSE API: hse_kvdb_init prepares HSE for use. To clean up, call hse_kvdb_fini.
  71. std::shared_ptr< void > context( nullptr, []( void* ) { hse_kvdb_fini();

    } ); const std::string pool_name = params[ "pool" ].as< std::string >(); if( create_kvdb ) HSE_SAFE_CALL( hse_kvdb_make( pool_name.c_str(), nullptr ) ); hse_kvdb *raw_kvdb = nullptr; HSE_SAFE_CALL( hse_kvdb_open( pool_name.c_str(), nullptr, &raw_kvdb ) ); std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb *p ) { if( p ) hse_kvdb_close( p ); } ); const std::string kvs_name = params[ "kvs" ].as< std::string >(); if( create_kvs ) HSE_SAFE_CALL( hse_kvdb_kvs_make( kvdb.get(), kvs_name.c_str(), nullptr ) ); hse_kvs *raw_kvs; HSE_SAFE_CALL( hse_kvdb_kvs_open( kvdb.get(), kvs_name.c_str(), nullptr, &raw_kvs ) );
    HSE API: hse_kvdb_make creates a kvdb in the specified mpool. hse_kvdb_open opens the kvdb. Several kvs (tables) can be created inside one kvdb.
    Diagram: an mpool containing a kvdb, which contains several kvs, each holding key/data pairs; a "this one" label points into the diagram.
  72. std::shared_ptr< hse_kvdb > kvdb( raw_kvdb, [context]( hse_kvdb *p ) {

    if( p ) hse_kvdb_close( p ); } ); const std::string kvs_name = params[ "kvs" ].as< std::string >(); if( create_kvs ) HSE_SAFE_CALL( hse_kvdb_kvs_make( kvdb.get(), kvs_name.c_str(), nullptr ) ); hse_kvs *raw_kvs; HSE_SAFE_CALL( hse_kvdb_kvs_open( kvdb.get(), kvs_name.c_str(), nullptr, &raw_kvs ) ); std::shared_ptr< hse_kvs > kvs( raw_kvs, [kvdb]( hse_kvs *p ) { if( p ) hse_kvdb_kvs_close( p ); } ); hse_kvdb_opspec os; HSE_KVDB_OPSPEC_INIT( &os ); std::shared_ptr< hse_kvdb_txn > transaction( hse_kvdb_txn_alloc( kvdb.get() ), [kvdb]( hse_kvdb_txn *p ) { if( p ) hse_kvdb_txn_free( kvdb.get(), p ); } ); os.kop_txn = transaction.get(); HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
    HSE API: hse_kvdb_kvs_make creates a kvs in the specified kvdb. hse_kvdb_kvs_open opens the kvs.
    Diagram: an mpool containing a kvdb, which contains a kvs holding key/data pairs; a "this one" label points into the diagram.
  73. HSE API

    hse_kvdb_txn_alloc creates a new transaction. To discard it, call hse_kvdb_txn_free.
    [Diagram: root, root(1) and the log]
    std::shared_ptr< hse_kvs > kvs( raw_kvs, [kvdb]( hse_kvs *p ) { if( p ) hse_kvdb_kvs_close( p ); } );
    hse_kvdb_opspec os;
    HSE_KVDB_OPSPEC_INIT( &os );
    std::shared_ptr< hse_kvdb_txn > transaction(
      hse_kvdb_txn_alloc( kvdb.get() ),
      [kvdb]( hse_kvdb_txn *p ) { if( p ) hse_kvdb_txn_free( kvdb.get(), p ); }
    );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
    for( const auto &v: put_value ) {
      HSE_SAFE_CALL( hse_kvs_put( kvs.get(), &os, v.first.data(), v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
      std::array< char, 100 > data{ 0 };
      bool found = false;
      size_t length = 0;
      HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(), v.size(), &found, data.data(), data.size(), &length ) );
  74. HSE API

    hse_kvdb_txn_begin starts the transaction. hse_kvs_put writes a key/value pair.
    [Diagram: root, root(1), key/data pairs and the log]
    hse_kvdb_txn_free( kvdb.get(), p ); } );
    os.kop_txn = transaction.get();
    HSE_SAFE_CALL( hse_kvdb_txn_begin( kvdb.get(), os.kop_txn ) );
    for( const auto &v: put_value ) {
      HSE_SAFE_CALL( hse_kvs_put( kvs.get(), &os, v.first.data(), v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
      std::array< char, 100 > data{ 0 };
      bool found = false;
      size_t length = 0;
      HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(), v.size(), &found, data.data(), data.size(), &length ) );
      if( found ) std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
      HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
  75. HSE API

    hse_kvs_get retrieves the value that corresponds to a key.
    [Diagram: root, root(1), key/data pairs and the log]
    v.first.size(), v.second.data(), v.second.size() ) );
    }
    for( const auto &v: get_value ) {
      std::array< char, 100 > data{ 0 };
      bool found = false;
      size_t length = 0;
      HSE_SAFE_CALL( hse_kvs_get( kvs.get(), &os, v.data(), v.size(), &found, data.data(), data.size(), &length ) );
      if( found ) std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
      HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
    }
    else {
      HSE_SAFE_CALL( hse_kvdb_txn_commit( kvdb.get(), os.kop_txn ) );
    }
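The slide prints `data.data()` as a C string, which relies on the zero-initialised 100-byte buffer still containing a terminating zero after the read. The sketch below builds the string from an explicit length instead. It assumes that the `length` out-parameter reports the length of the stored value and may be larger than the buffer, which should be checked against the HSE headers. `value_to_string` is a hypothetical helper, not an HSE function.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Builds a std::string from a fixed-size value buffer and the length the
// store reported. At most N bytes are copied, so the result does not depend
// on a trailing '\0' and the buffer is never read past its end.
template< std::size_t N >
std::string value_to_string( const std::array< char, N > &buffer, std::size_t reported_length ) {
  return std::string( buffer.data(), std::min( reported_length, N ) );
}
```

In the loop on the slide this would be used as `std::cout << v << "=" << value_to_string( data, length )`, which also handles values that contain zero bytes.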
  76. HSE API

    hse_kvdb_txn_commit makes the writes permanent. hse_kvdb_txn_abort cancels the writes made so far.
    [Diagram: root, root(1), key/data pairs and the log, with two "insert value" labels]
    v.size(), &found, data.data(), data.size(), &length ) );
      if( found ) std::cout << v << "=" << data.data() << std::endl;
    }
    if( abort_transaction ) {
      HSE_SAFE_CALL( hse_kvdb_txn_abort( kvdb.get(), os.kop_txn ) );
    }
    else {
      HSE_SAFE_CALL( hse_kvdb_txn_commit( kvdb.get(), os.kop_txn ) );
    }
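The begin, put, get, commit and abort steps can be illustrated with a toy in-memory store. This is a sketch of the visible behaviour only, assuming that uncommitted writes are visible to the transaction that made them and to nobody else. It says nothing about how HSE implements transactions, and `toy_store` is a hypothetical class.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy in-memory store: writes made inside a transaction are buffered and
// reach the committed map only on commit().
class toy_store {
public:
  class txn {
  public:
    explicit txn( toy_store &s ) : store_( s ) {}
    void put( const std::string &key, const std::string &value ) { pending_[ key ] = value; }
    // A transaction sees its own pending writes first, then committed data.
    bool get( const std::string &key, std::string &out ) const {
      const auto p = pending_.find( key );
      if( p != pending_.end() ) { out = p->second; return true; }
      return store_.get( key, out );
    }
    // Publishes all buffered writes.
    void commit() {
      for( const auto &kv: pending_ ) store_.committed_[ kv.first ] = kv.second;
      pending_.clear();
    }
    // Drops all buffered writes; committed data is untouched.
    void abort() { pending_.clear(); }
  private:
    toy_store &store_;
    std::map< std::string, std::string > pending_;
  };
  // Reads outside a transaction see committed data only.
  bool get( const std::string &key, std::string &out ) const {
    const auto c = committed_.find( key );
    if( c == committed_.end() ) return false;
    out = c->second;
    return true;
  }
private:
  std::map< std::string, std::string > committed_;
};
```

A value written with `put` is returned by `get` on the same transaction immediately, but a plain `toy_store::get` only returns it after `commit`. After `abort` the committed value is unchanged.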
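  77. Heterogeneous-Memory Storage Engine

    HSE aims to support several different kinds of storage device through a common interface:
    1. Conventional SSDs: usable as of version 1.7
    2. NVMe SSDs with Zoned Namespaces: not yet implemented
    3. Non-volatile memory devices: not yet implemented
    Non-volatile memory devices are covered in a talk prepared for an earlier Kernel/VM event, so please see that: https://speakerdeck.com/fadis/dian-yuan-woqie-tutemoxiao-enaimemoritofalsefu-kihe-ifang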
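  78. Zoned Namespace

    [Diagram: blocks 0 and 1 holding pages 0 to 7, block 2 with empty pages, and a translation table containing the entry 2->8]
    The translation table grows as the SSD's capacity grows. The table needs RAM amounting to roughly 1/1000 of the SSD's capacity, so the controller of a large-capacity SSD has to carry a large amount of RAM. That hurts.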
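The 1/1000 figure can be reproduced with simple arithmetic. The sketch below assumes a flat table with one 4-byte entry per 4 KiB page. These two numbers are a common rule of thumb and are not taken from the slides, and `mapping_table_bytes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of a flat logical-to-physical table with one entry per page.
// page_bytes and entry_bytes are illustrative parameters, not slide values.
constexpr std::uint64_t mapping_table_bytes(
  std::uint64_t capacity_bytes,
  std::uint64_t page_bytes,
  std::uint64_t entry_bytes
) {
  return capacity_bytes / page_bytes * entry_bytes;
}
```

With these assumptions a 1 TiB SSD has 2^28 pages and needs 2^30 bytes (1 GiB) of table, which is 1/1024 of its capacity.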
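  79. Zoned Namespace

    [Diagram: blocks 0 and 1 holding pages 0 to 7, block 2 with empty pages; one label marks the size of a page, another the size of a block]
    Translating addresses in units of this size (a page) makes the translation table too large, and it leaves blocks that have been only partially TRIMmed.
    Instead, in units of this size (a block), remember only where each unit was allocated and how far it has been used from its start. TRIM always applies to a whole block.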
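The bookkeeping described here, one write position per large unit instead of one translation entry per page, can be sketched as follows. `zone_table` is a hypothetical class introduced for illustration. It tracks only how far each zone has been written and omits the record of where each zone was allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-zone bookkeeping: a single write pointer per zone, i.e. how many pages
// have been written from the start of the zone. No per-page entries exist.
class zone_table {
public:
  zone_table( std::size_t zone_count, std::uint32_t pages_per_zone )
    : write_pointer_( zone_count, 0 ), pages_per_zone_( pages_per_zone ) {}
  // Appends one page to the zone and returns its index within the zone,
  // or -1 if the zone is full. Writes are strictly sequential.
  std::int64_t append( std::size_t zone ) {
    if( write_pointer_.at( zone ) == pages_per_zone_ ) return -1;
    return write_pointer_[ zone ]++;
  }
  // Discards the whole zone at once; a partial discard cannot be expressed.
  void reset( std::size_t zone ) { write_pointer_.at( zone ) = 0; }
  // Number of pages written to the zone so far.
  std::uint32_t used( std::size_t zone ) const { return write_pointer_.at( zone ); }
private:
  std::vector< std::uint32_t > write_pointer_;
  std::uint32_t pages_per_zone_;
};
```

The table holds one small integer per zone, so its size depends on the number of zones rather than the number of pages.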
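  80. dm-zoned

    dm-zoned is Linux's support for Zoned Namespaces. It makes the device look as if it had 4KiB pages.
    [Diagram: user space (MySQL with InnoDB, MongoDB with WiredTiger); kernel space (VFS, file system, page cache, bio, dm-zoned, device driver); hardware (Flash Translation Layer, NAND flash memory); page sizes are labelled in KiB and in MiB]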
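  81. Log-structured layers have multiplied

    [Diagram: the same stack of user space (MySQL with InnoDB, MongoDB with WiredTiger), kernel space (VFS, file system, page cache, bio, dm-zoned, device driver) and hardware (Flash Translation Layer, NAND flash memory), with four layers labelled "log"]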
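  82. What HSE aims for

    [Diagram: the same stack of user space (MySQL with InnoDB, MongoDB with WiredTiger), kernel space (VFS, file system, page cache, bio, dm-zoned, device driver) and hardware (Flash Translation Layer, NAND flash memory), now with HSE and mpool added; page sizes are labelled once in KiB and twice in MiB]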