
電源を切っても消えないメモリとの付き合い方 (Getting along with memory that survives power-off)

Fadis
October 19, 2019


A guide to NVDIMM, a next-generation storage device that can be written like memory yet is persisted.
These are the slides for カーネル/VM探検隊@北陸 5回目 (canceled due to a typhoon), which had been scheduled for October 19, 2019.
Sample code: https://github.com/Fadis/kernelvm_20191019_samples


Transcript

  1. Getting along with memory that survives power-off
    NAOMASA MATSUBAYASHI
    Sample code: https://github.com/Fadis/kernelvm_20191019_samples

  2. Registers, L1 cache, L2 cache, L3 cache, DRAM: fast, but expensive per byte.
    SSD, hard disk: slow, but cheap per byte.

  3. [Chart: capacity in bytes (10^0 to 10^13) against write latency in seconds (10^-10 to 10^-1), log-log, plotting registers, L1/L2/L3 cache, DRAM, SSD, and hard disk.]
    Registers through DRAM are not persisted; SSD and hard disk are persisted.

  4. [The same chart.] In other words: persistence = a huge latency penalty.

  5. How flash memory works
    The cell is structured like a field-effect transistor with a floating gate sandwiched inside:
    control gate / insulating film / floating gate / tunnel oxide film, over an N-P-N substrate.
    Normally charge cannot pass through the tunnel oxide; push hard enough and it does.

  6. Program: apply a high voltage (20V) to the control gate and hold everything else at GND (0V).
    Charge punches through the tunnel oxide and accumulates in the floating gate.

  7. With charge stored in the floating gate, applying a modest voltage (5V) to the control gate
    does not form a channel, because the electrons near the P layer are not pulled in.
    Let Vh be the control-gate voltage needed for current to flow through the N-P-N in this state.

  8. Erase: hold the control gate at GND (0V) and apply a high voltage (20V) everywhere else.
    The charge escapes from the floating gate.

  9. With no charge in the floating gate, applying a voltage (5V) to the control gate pulls the
    electrons in the P layer up and a channel forms.
    Let Vl be the control-gate voltage needed for current to flow through the N-P-N in this state.

  10. Connect many cells in series.
    Advantage: higher density than wiring each cell up individually.
    Drawback: cells connected in series cannot be rewritten individually.

  11. To read one cell in a series string, apply Vl to the cell you want to read and Vh to all
    the others. The cells driven at Vh conduct unconditionally; the cell driven at Vl conducts
    or not depending on its state. The target cell's value (here, cell 3's) can therefore be
    read out as the resistance of the whole string.


  12. The charge pump that generates the high programming voltage (20V) cannot respond quickly,
    as a matter of principle: capacitors are switched by a clock between charging and dumping,
    and this is repeated until the required voltage builds up.
    Demanding low latency from it is unreasonable.

  13. Likewise, to rewrite only part of a serial string of cells, every cell's value must be
    read out and written back. Demanding low latency is unreasonable.

  14. Synchronous I/O vs. asynchronous I/O
    (from カーネル/VM探検隊@関西 9回目, 「極めて速いストレージとの付き合い方」)
    Today's SSDs hide this latency by carrying out a large number of writes at the same time,
    so they only perform well when there is a large amount of data to write.

  15. [The same capacity-vs-latency chart, with a gap between DRAM and SSD marked "?".]
    We want something persisted in this region. Research into non-volatile memories to
    replace flash has been pursued on many fronts.

  16. NVDIMM fills that gap: Intel productized Optane DC Persistent Memory, which adopts a
    non-volatile memory that replaces flash.

  17.                 NVMe SSD          Intel Optane DC    DRAM
    Latency           ~300 µs           ~500 ns            ~50 ns
    Persistence       persisted         persisted          not persisted
    Price             ~¥6,000/128 GB    ~¥50,000/128 GB    ~¥400,000/128 GB
    Write granularity pages only        cache lines        cache lines

  18. Ge1Sb2Te4, SeAsGeSi
    Intel has not disclosed the details of 3D XPoint, the non-volatile memory used in
    Optane DC, but findings from companies specializing in semiconductor analysis have
    been published: "3D XPoint: Current Implementations and Future Trends"
    https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/Proceedings_Chrono_2017.html

  19. SeAsGeSi: an ovonic threshold switch, a material that presents high resistance only
    below a certain voltage. It keeps leakage current from making unintended cells respond.
    (Same source as above.)

  20. Ge1Sb2Te4: probably a superlattice phase-change memory stacking GeTe and Sb2Te3.
    Depending on the applied voltage, the material either:
    - does not change (the current state can be read out),
    - changes to amorphous (high resistance), or
    - changes to crystalline (low resistance).
    (Same source as above.)

  21. The point: every operation is performed over just two wires, so density can be raised
    without giving up individual-cell rewrites the way flash does. Writes work below 1 V,
    so no time-consuming high-voltage machinery is needed. The result: capacity on a par
    with flash, latency approaching DRAM, and persistence, all achieved at once.

  22. The problem: how should the OS expose this device to user space?
    As memory? As a block device?

  23. An NVDIMM is a device installed in a DIMM socket alongside DRAM, but because its data
    is persisted, existing applications want to put files on it.
    On Linux, installing an NVDIMM first of all produces a block device:
    $ ls /dev/pmem0 -lha
    brw-rw---- 1 root disk 259, 1 10月 2 03:15 /dev/pmem0

  24. $ mkfs.xfs /dev/pmem0
    meta-data=/dev/pmem0 isize=512 agcount=4, agsize=128896 blks
    = sectsz=4096 attr=2, projid32bit=1
    = crc=1 finobt=1, sparse=1, rmapbt=0
    = reflink=0
    data = bsize=4096 blocks=515584, imaxpct=25
    = sunit=0 swidth=0 blks
    naming =version 2 bsize=4096 ascii-ci=0, ftype=1
    log =internal log bsize=4096 blocks=2560, version=2
    = sectsz=4096 sunit=1 blks, lazy-count=1
    realtime =none extsz=4096 blocks=0, rtextents=0
    $ mount -t xfs /dev/pmem0 /mnt/pmem/
    $ dmesg
(snip)
    [1506131.089817] XFS (pmem0): Mounting V5 Filesystem
    [1506131.094488] XFS (pmem0): Ending clean mount
    $ cd /mnt/pmem/
    $ echo 'Hello, World' >hoge
    $ ls
    hoge
    $ mount|grep pmem0
    /dev/pmem0 on /mnt/pmem type xfs (rw,relatime,attr2,inode64,noquota)
    Create a filesystem, mount it, and read and write.

  25. On Linux, from a file-write request to the data landing on a hard disk:
    [User space] Application: requests the write.
    [Kernel space]
    VFS: abstracts away the differences between filesystems.
    Filesystem: decides where on the storage to write.
    Page cache: accumulates pages to be written until there is a reasonable amount.
    IO scheduler: reorders requests into an order that can be written efficiently.
    bio: abstracts away the differences between hardware.
    Device driver: performs the actual writes to the device.

  26. [The same stack, with the IO scheduler removed.] On an NVDIMM, write ordering does not
    affect write speed, so no scheduling is needed. It was already omitted for NVMe.

  27. The page cache: the CPU cannot read or write a hard disk's contents directly. So it
    temporarily copies the needed data (a partial copy of the disk's contents), reads the
    copy, modifies the data in DRAM, and then syncs the modified contents back to the disk,
    where they are persisted.

  28. mmap: map the page cache into a user-space process's virtual address space so that it
    can be read and written directly.
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

  29. const auto fd = open( filename.c_str(), new_file ? O_RDWR|O_CREAT : O_RDWR, 0644 );
    if( fd < 0 ) {
    std::cerr << strerror( errno ) << std::endl;
    return 1;
    }
    if( new_file ) ftruncate( fd, file_size );
    const auto raw = mmap(
    nullptr, file_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0
    );
    if( raw == MAP_FAILED ) {  // mmap reports failure with MAP_FAILED, not nullptr
    std::cerr << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< char, unmap > mapped(
    reinterpret_cast< char* >( raw ), unmap( mapped_length )
    );
    if( vm.count( "write" ) ) {
    std::copy( new_value.begin(), new_value.end(), mapped.get() );
    mapped.get()[ new_value_size ] = '\0';
    msync( mapped.get(), file_size, MS_SYNC );
    }
    else std::cout << mapped.get() << std::endl;
    }
    Open the file, write a value to the address returned by mmap, then msync.

  30. [ 0.000000] reserve setup_data: [mem 0x0000000000058000-0x0000000000058fff] reserved
    [ 0.000000] reserve setup_data: [mem 0x0000000000059000-0x000000000009efff] usable
    [ 0.000000] reserve setup_data: [mem 0x000000000009f000-0x000000000009ffff] reserved
    [ 0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000009c4d6017] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009c4d6018-0x000000009c4e6c57] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009c4e6c58-0x000000009c4e7017] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009c4e7018-0x000000009c4f7057] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009c4f7058-0x000000009c4f8017] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009c4f8018-0x000000009c518057] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009c518058-0x000000009dc65fff] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009dc66000-0x000000009dc92fff] ACPI data
    [ 0.000000] reserve setup_data: [mem 0x000000009dc93000-0x000000009f7f7fff] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009f7f8000-0x000000009f7f8fff] ACPI NVS
    [ 0.000000] reserve setup_data: [mem 0x000000009f7f9000-0x000000009f822fff] reserved
    [ 0.000000] reserve setup_data: [mem 0x000000009f823000-0x000000009f8c7fff] usable
    [ 0.000000] reserve setup_data: [mem 0x000000009f8c8000-0x00000000a03d8fff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000a03d9000-0x00000000a5952fff] usable
    [ 0.000000] reserve setup_data: [mem 0x00000000a5953000-0x00000000a705afff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000a705b000-0x00000000a707cfff] ACPI data
    [ 0.000000] reserve setup_data: [mem 0x00000000a707d000-0x00000000a7236fff] usable
    [ 0.000000] reserve setup_data: [mem 0x00000000a7237000-0x00000000a786ffff] ACPI NVS
    [ 0.000000] reserve setup_data: [mem 0x00000000a7870000-0x00000000a7ffefff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000a7fff000-0x00000000a7ffffff] usable
    [ 0.000000] reserve setup_data: [mem 0x00000000a8000000-0x00000000a80fffff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
    [ 0.000000] reserve setup_data: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
    [ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000037fffffff] usable
    [ 0.000000] reserve setup_data: [mem 0x0000000380000000-0x00000003ffffffff] persistent (type 12)
    [ 0.000000] reserve setup_data: [mem 0x0000000400000000-0x000000044dffffff] usable
    A portion of the kernel log at boot. The NVDIMM's entire range appears in the
    physical address space:
    [mem 0x0000000380000000-0x00000003ffffffff] persistent (type 12)

  31. The CPU can read and write the NVDIMM's contents directly. If the NVDIMM's latency is
    close to DRAM's, copying data from the NVDIMM into the page cache is pure waste.

  32. Filesystem DAX: at mmap time, map the physical addresses where the file lives directly
    into the process's virtual address space.

  33. $ mount -t xfs -o dax /dev/pmem0 /mnt/pmem/
    $ dmesg
(snip)
    [1686537.353077] XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL,
    use at your own risk
    [1686537.356044] XFS (pmem0): Mounting V5 Filesystem
    [1686537.361297] XFS (pmem0): Ending clean mount
    $ cd /mnt/pmem/
    $ ls
    hoge
    $ mount|grep pmem0
    /dev/pmem0 on /mnt/pmem type xfs
    (rw,relatime,attr2,dax,inode64,noquota)
    Enabling Filesystem DAX: mount a filesystem that supports it with -o dax.

  34. [Stack diagram.] Reads and writes to the mmap'd region now bypass the kernel's block
    layer and go directly to the device.

  35. The NVDIMM's physical address range:
    [mem 0x0000000380000000-0x00000003ffffffff] persistent (type 12)
    Without -o dax:
    $ ./00_get_physical_address -p `pidof 00_mmap` -f /mnt/pmem/fuga
    /mnt/pmem/fuga: VirtualAddress=0x7f9e0086b000 PhysicalAddress=0x41d1d4000
    The physical address behind mmap's return value points somewhere outside the NVDIMM.
    With -o dax:
    $ ./00_get_physical_address -p `pidof 00_mmap` -f /mnt/pmem/fuga
    /mnt/pmem/fuga: VirtualAddress=0x7fae9df25000 PhysicalAddress=0x38220d000
    The physical address behind mmap's return value points 35,704,832 bytes into the NVDIMM.


  36. So existing applications that mmap bypass the kernel and get faster with no changes
    at all. Hooray! ...Except it is not that simple.

  37. (The same mmap/msync code as slide 29.)
    What is this msync call actually doing?

  38. msync
    #include <sys/mman.h>
    int msync(void *addr, size_t length, int flags);
    Reflects the modified portions of an mmap'd region back to the filesystem: it pushes
    data that has stopped in the page cache the rest of the way down to storage.
    But under Filesystem DAX we bypass the page cache and write to the device directly,
    so surely msync is unnecessary?

  39. Between the CPU and the NVDIMM sit the caches: registers, L1, L2, L3.
    Even when the CPU believes a write has completed, the data may still be parked
    somewhere in here. If the power fails in that state, the contents you are sure
    you wrote are lost.

  40. CLFLUSH—Flush Cache Line
    Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that
    contains the linear address specified with the memory operand. If that cache line contains modified
    data at any level of the cache hierarchy, that data is written back to memory. The source operand is a
    byte memory location.
    — from the Intel® 64 and IA-32 Architectures Software Developer's Manual
    Evicts the cache line containing the given address from every cache level, writing it
    back to memory if it holds modified data. The processor stalls until it completes.
    Stop! Stop!

  41. What we want: "write, write, write, ... all written", issuing the stores back to back
    and flushing at the end, instead of CLFLUSH's write-and-wait on every single line.

  42. CLFLUSHOPT—Flush Cache Line Optimized
    (snip)
    to enforce ordering with such an operation, software can insert an SFENCE instruction between
    CFLUSHOPT and that operation.
    — from the Intel® 64 and IA-32 Architectures Software Developer's Manual
    Evicts the cache line containing the given address from every cache level, writing it
    back if modified, but without stalling; issue an SFENCE when you want to wait for
    completion. It still evicts the line, though. Stop!

  43. CLWB—Cache Line Write Back
    Writes back to memory the cache line (if modified) that contains the linear address specified with the
    memory operand from any level of the cache hierarchy in the cache coherence domain. The line may
    be retained in the cache hierarchy in non-modified state.
    — from the Intel® 64 and IA-32 Architectures Software Developer's Manual
    Does not evict the cache line containing the given address even if it is cached;
    writes it back to memory if modified. Issue an SFENCE when you want to wait for
    completion.

  44. So when msync is requested, the kernel must CLWB the modified portions of the mmap'd
    region. Which means the kernel has to know which pages on the NVDIMM have been
    modified. How?

  45. The page-cache case: a process touches its virtual address space and there is no page
    there, so a page fault occurs. The kernel records the correspondence between the page
    and its location on the device in the address_space, allocates a new page-cache page,
    and maps it read-only.

  46. The page-cache case: a write hits the read-only page, causing a page fault. The kernel
    sets the dirty bit on the address_space entry and makes the page writable.

  47. The page-cache case: when msync is requested, the pages in the range whose dirty bit
    is set are copied out to the device.

  48. The DAX case: an access finds no page, causing a page fault. The kernel records the
    page-to-device mapping in the address_space and maps the region on the NVDIMM itself,
    read-only.

  49. The DAX case: a write hits the read-only page, causing a page fault. The kernel sets
    the dirty bit on the address_space entry and makes the page writable.

  50. The DAX case: when msync is requested, CLWB the pages in the range whose dirty bit
    is set.

  51. The DAX case, continued: x86_64's smallest page size is 4096 bytes; its cache line
    size is 64 bytes. A kernel that detects modifications through page faults only knows
    what changed at page granularity, so even if only a single cache line was actually
    modified, 64 CLWBs are issued.

  52. But the user-space application that did the writing knows exactly where it wrote.
    And CLWB is not a privileged instruction, so user space can issue it directly.
    Forget msync: we want to issue CLWB from user space on just the parts we wrote.


  53. Persistent Memory Development Kit (PMDK)
    A family of libraries providing the features you want for exploiting NVDIMMs,
    such as flushing from user space. https://pmem.io/

  54. The PMDK stack: applications use libpmemobj++ (and libpmemblk, libpmemlog,
    libpmemobj, libvmmalloc), all of which are built on libpmem.

  55. libpmem contains the basic operations: pmem_map_file, a platform-independent wrapper
    around mmap, and pmem_persist, which flushes at a finer granularity than msync,
    among others.

  56. const auto raw = pmem_map_file(
    filename.c_str(), file_size, device_dax ? 0 : PMEM_FILE_CREATE,
    0644, &mapped_length, &is_pmem
    );
    if( raw == nullptr ) {
    std::cerr << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< char, unmap_pmem > mapped(
    reinterpret_cast< char* >( raw ), unmap_pmem( mapped_length )
    );
    if( vm.count( "write" ) ) {
    std::copy( new_value.begin(), new_value.end(), mapped.get() );
    mapped.get()[ new_value_size ] = '\0';
    if( is_pmem ) pmem_persist( mapped.get(), new_value_size );
    else {
    if( pmem_msync( mapped.get(), new_value_size ) ) {
    std::cerr << strerror( errno ) << std::endl;
    return 1;
    }
    }
    }
    else std::cout << mapped.get() << std::endl;
    pmem_map_file the file and write the value; then, if it is non-volatile memory,
    pmem_persist; if it is ordinary storage, pmem_msync.

  57. void pmem_persist(const void *addr, size_t len);
    Writes back the caches for the given address range, using whatever mechanism the
    CPU supports.
    int pmem_msync(const void *addr, size_t len);
    Calls msync on the pages containing the given range.
    Unlike msync itself, neither function requires addr to be aligned to the start of
    a page. Under DAX, this operation alone is enough to persist a write.

  58. (The same libpmem code as slide 56.)
    What happens if the power fails right in the middle of this write?

  59. "Hello, W" is written and sits in the cache; "orld!" is written next. The cache has
    no free space, so it flushes the older write (a CLWB-style writeback). If the power
    fails at this moment, only "Hello, W" has been persisted: the data is corrupt.

  60. 64 bits: on a typical x86_64 PC, the CPU and memory are connected by a 64-bit data
    bus. Anything larger than 64 bits is sent in two or more transfers, so after a power
    failure, data larger than 64 bits may have been written only partway.


  61. libpmemobj: builds a journal for writing large data transactionally.
    [PMDK stack diagram with libpmemobj highlighted.]

  62. PMEMobjpool *raw_pool = create ?
    pmemobj_create( filename.c_str(), layout, file_size, 0666 ) :
    pmemobj_open( filename.c_str(), layout );
    if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< PMEMobjpool, close_pmemobj > pool( raw_pool );
    PMEMoid root = pmemobj_root( pool.get(), sizeof( data_t ) );
    auto root_raw = reinterpret_cast< data_t* >( pmemobj_direct( root ) );
    if( !new_value.empty() ) {
    new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) );
    TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range(
    root,offsetof( data_t, message ),
    sizeof( char ) * ( new_value.size() + 1 )
    );
    std::copy( new_value.begin(), new_value.end(), root_raw->message );
    root_raw->message[ new_value.size() ] = '\0';
    } TX_END
    }
    else std::cout << root_raw->message << std::endl;


  63. (The same code.) pmemobj_create creates the pool's superblock: a header, a log area,
    and a data area. The log is how pmemobj records "the writes to this data were cut off
    partway through."

  64. (The same code.) pmemobj_root fetches the pool's root object. PMEMoid is a type that
    represents an offset from the start of the pool; pmemobj_direct converts the location
    a PMEMoid refers to into a virtual address under the current page mapping.

  65. (The same code.) A region registered with pmemobj_tx_add_range is rolled back to its
    pre-TX_BEGIN state if execution never reaches TX_END. A state where change A is
    applied but change B is not can never be observed.

  66. (The same code.) Once execution reaches TX_END, the regions registered with
    pmemobj_tx_add_range are pmem_persist'ed.

  67. The transaction journal, step by step:
    1. start copying the target range into the log
    2. mark the log valid
    3. finish rewriting the target range
    4. mark the log invalid
    5. delete the log

  68. If the crash happens before the log is marked valid: at the next pmemobj_open an
    invalid log is found and simply deleted. The data is in its pre-rewrite state.

  69. If the crash happens while the log is valid: at the next pmemobj_open the valid log
    is copied back over the original location. The data returns to its pre-rewrite state.

  70. If the machine crashes again while the log is being replayed: the log is still marked
    valid, so it is simply replayed afresh at the next pmemobj_open.

  71. Even if the crash happens after the target range has been fully rewritten but before
    the log is marked invalid, the valid log is copied back at the next pmemobj_open and
    the data returns to its pre-rewrite state.

  72. If the crash happens after the log has been marked invalid: at the next pmemobj_open
    the invalid log is deleted. The data is in its post-rewrite state. Either way, the
    data is never observed half-rewritten.


  73. TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) );
    PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 );
    auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) );
    pmemobj_tx_add_range( next, 0, sizeof( data_t ) );
    new(next_raw) data_t();
    std::copy( new_value.begin(), new_value.end(), next_raw->message );
    next_raw->message[ new_value.size() ] = '\0';
    head_raw->next = next;
    } TX_END
    pmemobj_tx_zalloc allocates a region in the pool for the new data.
    If execution never reaches TX_END, the allocation is undone.

  74. TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range( prev, offsetof( data_t, next ), sizeof( PMEMoid ) );
    prev_raw->next = cur_raw->next;
    cur_raw->~data_t();
    pmemobj_tx_free( cur );
    } TX_END
    pmemobj_tx_free releases an allocated region.
    If execution never reaches TX_END, the free is undone.

#include <cstring>
#include <iostream>
#include <memory>
#include <boost/filesystem.hpp>
#include <boost/program_options.hpp>
#include <libpmemobj.h>
    class close_pmemobj {
    public:
    template< typename T >
    void operator()( T *p ) const {
    if( p ) pmemobj_close( p );
    }
    };
    namespace fs = boost::filesystem;
    bool is_special_file( const fs::path &p ) {
    return
    fs::status( p ).type() == fs::file_type::character_file ||
    fs::status( p ).type() == fs::file_type::block_file;
    }
    struct data_t {
    char message[ 1024 ];
    PMEMoid next;
    };
    int main( int argc, const char *argv[] ) {
    namespace po = boost::program_options;
    po::options_description desc( "Options" );


  76. fs::status( p ).type() == fs::file_type::block_file;
    }
    struct data_t {
    char message[ 1024 ];
    PMEMoid next;
    };
    int main( int argc, const char *argv[] ) {
    namespace po = boost::program_options;
    po::options_description desc( "Options" );
    std::string new_value;
    std::string remove_value;
    uint64_t pool_size;
    constexpr const char layout[] = "90d2827d-3742-4054-aea8-7a43068085ac";
    std::string filename;
    desc.add_options()
    ( "help,h", "show this message" )
    ( "create,c", "create" )
    ( "size,s", po::value< size_t >( &pool_size )->default_value( PMEMOBJ_MIN_POOL ), "pool size" )
    ( "filename,f", po::value< std::string >( &filename )->default_value( "/dev/dax0.0" ), "filename" )
    ( "append,a", po::value< std::string >( &new_value ), "append" )
    ( "delete,d", po::value< std::string >( &remove_value ), "delete" )
    ( "list,l", "list" );
    po::variables_map vm;
    po::store( po::parse_command_line( argc, argv, desc ), vm );
    po::notify( vm );
    if( vm.count( "help" ) ) {
    std::cout << desc << std::endl;
    return 0;
    }
    size_t mapped_length = 0u;
    data_t: a node of a singly linked list, holding a string of at most 1024 bytes and
    the offset of the next element.

  77. pmemobj_open( filename.c_str(), layout );
    if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< PMEMobjpool, close_pmemobj > pool( raw_pool );
    PMEMoid root = pmemobj_root( pool.get(), sizeof( data_t ) );
    auto root_raw = reinterpret_cast< data_t* >( pmemobj_direct( root ) );
    if( !new_value.empty() ) {
    auto head = root;
    auto head_raw = root_raw;
    while( 1 ) {
    auto next = reinterpret_cast< data_t* >( pmemobj_direct( head_raw->next ) );
    if( next ) {
    head = head_raw->next;
    head_raw = next;
    }
    else break;
    }
    new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) );
    TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) );
    PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 );
    auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) );
    pmemobj_tx_add_range( next, 0, sizeof( data_t ) );
    new(next_raw) data_t();
    std::copy( new_value.begin(), new_value.end(), next_raw->message );
    next_raw->message[ new_value.size() ] = '\0';
    head_raw->next = next;
    } TX_END
    }
    Walk the list to find the tail node.

  78. auto head_raw = root_raw;
    while( 1 ) {
    auto next = reinterpret_cast< data_t* >( pmemobj_direct( head_raw->next ) );
    if( next ) {
    head = head_raw->next;
    head_raw = next;
    }
    else break;
    }
    new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) );
    TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) );
    PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 );
    auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) );
    pmemobj_tx_add_range( next, 0, sizeof( data_t ) );
    new(next_raw) data_t();
    std::copy( new_value.begin(), new_value.end(), next_raw->message );
    next_raw->message[ new_value.size() ] = '\0';
    head_raw->next = next;
    } TX_END
    }
    if( !remove_value.empty() ) {
    auto prev = root;
    auto prev_raw = root_raw;
    auto cur = prev_raw->next;
    auto cur_raw = reinterpret_cast< data_t* >( pmemobj_direct( cur ) );
    while( cur_raw ) {
    if( strcmp( cur_raw->message, remove_value.data() ) == 0 ) {
    break;
    }
    auto next = reinterpret_cast< data_t* >( pmemobj_direct( cur_raw->next ) );
    if( next ) {
    Log the tail node's next field as a modification target
    Allocate a new node
    Log the new node
    as a modification target
    Write the value into the new node and
    link it to the tail node's next
    Do all of these operations between TX_BEGIN and TX_END

  79. new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) );
    TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) );
    PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 );
    auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) );
    pmemobj_tx_add_range( next, 0, sizeof( data_t ) );
    new(next_raw) data_t();
    std::copy( new_value.begin(), new_value.end(), next_raw->message );
    next_raw->message[ new_value.size() ] = '\0';
    head_raw->next = next;
    } TX_END
    }
    if( !remove_value.empty() ) {
    auto prev = root;
    auto prev_raw = root_raw;
    auto cur = prev_raw->next;
    auto cur_raw = reinterpret_cast< data_t* >( pmemobj_direct( cur ) );
    while( cur_raw ) {
    if( strcmp( cur_raw->message, remove_value.data() ) == 0 ) {
    break;
    }
    auto next = reinterpret_cast< data_t* >( pmemobj_direct( cur_raw->next ) );
    if( next ) {
    prev = cur;
    cur = cur_raw->next;
    prev_raw = cur_raw;
    cur_raw = next;
    }
    else {
    std::cerr << "Not found." << std::endl;
    return 1;
    }
    Create the pool
    Append
    Append
    Append
    Delete
    $ ./03_pmemobj_alloc -c -f test -s 67108864
    $ ./03_pmemobj_alloc -f test -a abcde -l
    abcde
    $ ./03_pmemobj_alloc -f test -a fghij -l
    abcde
    fghij
    $ ./03_pmemobj_alloc -f test -a klmno -l
    abcde
    fghij
    klmno
    $ ./03_pmemobj_alloc -f test -d fghij -l
    abcde
    klmno


  80. [PMDK (Persistent Memory Development Kit) component diagram:
    libpmem, libpmemblk, libpmemlog, libvmmalloc,
    libpmemobj, libpmemobj++, application]
    libpmemobj++
    A C++ wrapper around libpmemobj

  81. #include <iostream>
    #include <array>
    #include <cstring>
    #include <boost/filesystem.hpp>
    #include <boost/program_options.hpp>
    #include <libpmemobj++/p.hpp>
    #include <libpmemobj++/persistent_ptr.hpp>
    #include <libpmemobj++/pool.hpp>
    #include <libpmemobj++/make_persistent.hpp>
    #include <libpmemobj++/transaction.hpp>
    using pmem::obj::p;
    using pmem::obj::persistent_ptr;
    struct data_t {
    persistent_ptr< data_t > next;
    p< std::array< char, 1024 > > data;
    };
    namespace fs = boost::filesystem;
    bool is_special_file( const fs::path &p ) {
    return
    fs::status( p ).type() == fs::file_type::character_file ||
    fs::status( p ).type() == fs::file_type::block_file;
    }
    int main( int argc, const char *argv[] ) {
    namespace po = boost::program_options;
    po::options_description desc( "Options" );
    std::string new_value = "";
    std::string remove_value = "";
    uint64_t pool_size;


  82. #include <libpmemobj++/p.hpp>
    #include <libpmemobj++/persistent_ptr.hpp>
    using pmem::obj::p;
    using pmem::obj::persistent_ptr;
    struct data_t {
    persistent_ptr< data_t > next;
    p< std::array< char, 1024 > > data;
    };
    namespace fs = boost::filesystem;
    bool is_special_file( const fs::path &p ) {
    return
    fs::status( p ).type() == fs::file_type::character_file ||
    fs::status( p ).type() == fs::file_type::block_file;
    }
    int main( int argc, const char *argv[] ) {
    namespace po = boost::program_options;
    po::options_description desc( "Options" );
    std::string new_value = "";
    std::string remove_value = "";
    uint64_t pool_size;
    constexpr const char layout[] = "dd58d49d-4be6-44e0-b160-37e79d94ecf8";
    std::string filename;
    desc.add_options()
    ( "help,h", "show this message" )
    ( "create,c", "create" )
    ( "size,s", po::value< size_t >( &pool_size )->default_value( PMEMOBJ_MIN_POOL ), "pool size" )
    ( "filename,f", po::value< std::string >( &filename )->default_value( "/dev/dax0.0" ), "filename" )
    ( "append,a", po::value< std::string >( &new_value ), "append" )
    ( "delete,d", po::value< std::string >( &remove_value ), "delete" )
    A singly-linked-list node: a string of up to 1024 bytes
    and the offset to the next element

  83. file_size = pool_size;
    create = true;
    }
    namespace pobj = pmem::obj;
    auto pool = create ?
    pobj::pool< data_t >::create( filename.c_str(), layout, file_size, 0666 ) :
    pobj::pool< data_t >::open( filename.c_str(), layout );
    pobj::persistent_ptr< data_t > root = pool.get_root();
    if( !new_value.empty() ) {
    auto next = root->next;
    auto cur = root;
    while( next ) {
    cur = next;
    next = next->next;
    }
    new_value.resize( 1023 );
    std::array< char, 1024 > data;
    std::copy( new_value.begin(), new_value.end(), data.begin() );
    data[ 1023 ] = '\0';
    pmem::obj::transaction::exec_tx( pool, [&] {
    auto new_elem = pmem::obj::make_persistent< data_t >();
    new_elem->data = data;
    cur->next = new_elem;
    } );
    }
    if( !remove_value.empty() ) {
    auto next = root->next;
    auto cur = root;
    while( next ) {
    if( strcmp( next->data.get_ro().data(), remove_value.data() ) == 0 ) {
    const auto data_size = strlen( next->data.get_ro().data() );
    pmem::obj::transaction::exec_tx( pool, [&] {
    Allocate a new node
    Write the data into the new node
    Link the new node to the tail node's next
    Do all of these operations inside the lambda passed to exec_tx

  84. if( !new_value.empty() ) {
    auto next = root->next;
    auto cur = root;
    while( next ) {
    cur = next;
    next = next->next;
    }
    new_value.resize( 1023 );
    std::array< char, 1024 > data;
    std::copy( new_value.begin(), new_value.end(), data.begin() );
    data[ 1023 ] = '\0';
    pmem::obj::transaction::exec_tx( pool, [&] {
    auto new_elem = pmem::obj::make_persistent< data_t >();
    new_elem->data = data;
    cur->next = new_elem;
    } );
    }
    if( !remove_value.empty() ) {
    auto next = root->next;
    auto cur = root;
    while( next ) {
    if( strcmp( next->data.get_ro().data(), remove_value.data() ) == 0 ) {
    const auto data_size = strlen( next->data.get_ro().data() );
    pmem::obj::transaction::exec_tx( pool, [&] {
    cur->next = next->next;
    pmem::obj::delete_persistent< data_t >( next );
    } );
    break;
    }
    cur = next;
    next = next->next;
    }
    $ ./04_pmemobj++ -c -f test -s 67108864
    $ ./04_pmemobj++ -f test -a abcde -l
    abcde
    $ ./04_pmemobj++ -f test -a fghij -l
    abcde
    fghij
    $ ./04_pmemobj++ -f test -a klmno -l
    abcde
    fghij
    klmno
    $ ./04_pmemobj++ -f test -d fghij -l
    abcde
    klmno
    Create the pool
    Append
    Append
    Append
    Delete

  85. [PMDK component diagram: libpmem, libpmemblk, libpmemlog,
    libvmmalloc, libpmemobj, libpmemobj++, application]
    libpmemlog
    Append-only, but simpler to write with than libpmemobj

  86. size_t mapped_length = 0u;
    int is_pmem = 0;
    fs::path path( filename );
    bool device_dax = false;
    size_t file_size = 0u;
    bool create = vm.count( "create" );
    if( fs::exists( path ) ) {
    device_dax = is_special_file( path );
    if( !device_dax ) file_size = fs::file_size( path );
    else file_size = 0;
    }
    else {
    file_size = pool_size;
    create = true;
    }
    PMEMlogpool *raw_pool = create ?
    pmemlog_create( filename.c_str(), file_size, 0666 ) :
    pmemlog_open( filename.c_str() );
    if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< PMEMlogpool, close_pmemlog > pool( raw_pool );
    if( !new_value.empty() )
    pmemlog_append( pool.get(), new_value.data(), new_value.size() );
    if( vm.count( "list" ) ) {
    pmemlog_walk( pool.get(), 0, []( const void *data, size_t length, void* ) -> int {
    std::cout << std::string( reinterpret_cast< const char* >( data ), length ) << std::endl;
    return 0;
    }, nullptr );
    }
    }
    Open
    Append
    Walk the log

  87. size_t mapped_length = 0u;
    int is_pmem = 0;
    fs::path path( filename );
    bool device_dax = false;
    size_t file_size = 0u;
    bool create = vm.count( "create" );
    if( fs::exists( path ) ) {
    device_dax = is_special_file( path );
    if( !device_dax ) file_size = fs::file_size( path );
    else file_size = 0;
    }
    else {
    file_size = pool_size;
    create = true;
    }
    PMEMlogpool *raw_pool = create ?
    pmemlog_create( filename.c_str(), file_size, 0666 ) :
    pmemlog_open( filename.c_str() );
    if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< PMEMlogpool, close_pmemlog > pool( raw_pool );
    if( !new_value.empty() )
    pmemlog_append( pool.get(), new_value.data(), new_value.size() );
    if( vm.count( "list" ) ) {
    pmemlog_walk( pool.get(), 0, []( const void *data, size_t length, void* ) -> int {
    std::cout << std::string( reinterpret_cast< const char* >( data ), length ) << std::endl;
    return 0;
    }, nullptr );
    }
    }
    $ ./05_pmemlog -c -f test
    $ ./05_pmemlog -f test -a abcde -l
    abcde
    $ ./05_pmemlog -f test -a fghij -l
    abcdefghij
    $ ./05_pmemlog -f test -a klmno -l
    abcdefghijklmno
    Create the pool
    Append
    Append
    Append

  88. [PMDK component diagram: libpmem, libpmemblk, libpmemlog,
    libvmmalloc, libpmemobj, libpmemobj++, application]
    libpmemblk
    Only block-granularity writes are possible,
    but simpler to write with than libpmemobj

  89. PMEMblkpool *raw_pool = create ?
    pmemblk_create( filename.c_str(), block_size, file_size, 0666 ) :
    pmemblk_open( filename.c_str(), block_size );
    if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< PMEMblkpool, close_pmemblk > pool( raw_pool );
    const size_t block_count = pmemblk_nblock( pool.get() );
    if( create ) {
    const char buffer[ block_size ] = { 0 };
    for( size_t i = 0; i != block_count; ++i ) {
    pmemblk_write( pool.get(), buffer, i );
    if( i % ( block_count / 10 ) == 0 )
    std::cout << 100 * i / block_count << "%" << std::endl;
    }
    }
    if( !new_value.empty() ) {
    char buffer[ block_size ];
    for( size_t i = 0; i != block_count; ++i ) {
    pmemblk_read( pool.get(), buffer, i );
    if( buffer[ 0 ] == '\0' ) {
    new_value.resize( block_size - 1 );
    std::copy( new_value.begin(), new_value.end(), buffer );
    buffer[ new_value.size() ] = '\0';
    pmemblk_write( pool.get(), buffer, i );
    break;
    }
    }
    }
    if( vm.count( "list" ) ) {
    char buffer[ block_size ];
    Open
    Read a block
    Write a block

  90. PMEMblkpool *raw_pool = create ?
    pmemblk_create( filename.c_str(), block_size, file_size, 0666 ) :
    pmemblk_open( filename.c_str(), block_size );
    if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
    }
    std::unique_ptr< PMEMblkpool, close_pmemblk > pool( raw_pool );
    const size_t block_count = pmemblk_nblock( pool.get() );
    if( create ) {
    const char buffer[ block_size ] = { 0 };
    for( size_t i = 0; i != block_count; ++i ) {
    pmemblk_write( pool.get(), buffer, i );
    if( i % ( block_count / 10 ) == 0 )
    std::cout << 100 * i / block_count << "%" << std::endl;
    }
    }
    if( !new_value.empty() ) {
    char buffer[ block_size ];
    for( size_t i = 0; i != block_count; ++i ) {
    pmemblk_read( pool.get(), buffer, i );
    if( buffer[ 0 ] == '\0' ) {
    new_value.resize( block_size - 1 );
    std::copy( new_value.begin(), new_value.end(), buffer );
    buffer[ new_value.size() ] = '\0';
    pmemblk_write( pool.get(), buffer, i );
    break;
    }
    }
    }
    if( vm.count( "list" ) ) {
    char buffer[ block_size ];
    $ ./06_pmemblk -c -f test
    0%
    9%
    19%
    29%
    39%
    49%
    59%
    69%
    79%
    89%
    99%
    $ ./06_pmemblk -f test -a abcde -l
    abcde
    $ ./06_pmemblk -f test -a fghij -l
    abcde
    fghij
    $ ./06_pmemblk -f test -a klmno -l
    abcde
    fghij
    klmno
    Create the pool
    Append
    Append
    Append

  91. [PMDK component diagram: libpmem, libpmemblk, libpmemlog,
    libvmmalloc, libpmemobj, libpmemobj++, application]
    libvmmalloc
    Replaces the memory-allocation functions (malloc and friends)
    with ones that allocate from an NVDIMM,
    so the NVDIMM can be used as large-capacity volatile memory


  92. Sparse file
    [diagram: a file mapped into a process's virtual address space]
    Only the pages of the file where non-zero data has been written
    are actually recorded on storage

  93. Sparse file
    [diagram: writing into a hole of the mapped file]
    Only the pages of the file where non-zero data has been written
    are actually recorded on storage.
    Writing where there is no backing page allocates a new page,
    and every new page means the filesystem's metadata is modified


  94. Sparse file
    [diagram: the application flushes only the data pages it rewrote]
    The application believes it is only rewriting the file's data,
    so it flushes only those pages.
    Depending on when the system stops, the metadata can be left stale
    and the contents of the newly allocated pages are lost

  95. if (flags & PMEM_FILE_CREATE) {
    /*
    * Always set length of file to 'len'.
    * (May either extend or truncate existing file.)
    */
    if (os_ftruncate(fd, (os_off_t)len) != 0) {
    ERR("!ftruncate");
    goto err;
    }
    if ((flags & PMEM_FILE_SPARSE) == 0) {
    if ((errno = os_posix_fallocate(fd, 0,
    (os_off_t)len)) != 0) {
    ERR("!posix_fallocate");
    goto err;
    }
    }
    } else {
    ssize_t actual_size = util_file_get_size(path);
    if (actual_size < 0) {
    ERR("stat %s: negative size", path);
    errno = EINVAL;
    goto err;
    }
    len = (size_t)actual_size;
    }
    From pmdk-1.4.3/src/libpmem/pmem.c
    When creating a new file,
    the whole range from the start
    to the end of the file is fallocated,
    so files created by
    pmem_map_file never become sparse

  96. Copy on Write
    [diagram: a write to a page of a mapped file]
    Some filesystems always allocate a fresh region
    whenever a page is rewritten,
    so flushes can never be fully handled in user space.
    This is a problem

  97. • We want to handle flushes entirely in user space
    • But the filesystem is managed by the kernel
    Trying to satisfy both at once is the root of all this unhappiness.
    Let's get rid of the filesystem

  98. Device DAX
    Makes the NVDIMM device itself directly mmappable
    into a process's virtual address space,
    rather than a file on a filesystem
    created on the NVDIMM

  99. Device DAX
    Makes the NVDIMM device directly mmappable
    into a process's virtual address space
    Advantage:
    a user-space process can know exactly
    which locations need to be flushed

  100. Device DAX
    Advantages:
    a user-space process can know exactly which locations need flushing;
    the time a write takes becomes predictable

  101. Device DAX
    Advantages:
    a user-space process can know exactly which locations need flushing;
    the time a write takes becomes predictable;
    1GB huge pages can be used to suppress TLB misses

  102. Device DAX
    Advantages:
    a user-space process can know exactly which locations need flushing;
    the time a write takes becomes predictable;
    1GB huge pages can be used to suppress TLB misses
    Drawback:
    no filesystem can be used

  103. ndctl
    https://github.com/pmem/ndctl
    The command that tells the Linux kernel's NVDIMM subsystem
    how an NVDIMM should be used

  104. $ ndctl list
    [
    {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"dev",
    "size":2111832064,
    "uuid":"d8aeb862-2052-4d0e-af2b-4961dfaca8d3",
    "sector_size":512,
    "align":2097152,
    "blockdev":"pmem0"
    }
    ]
    $ umount /mnt/pmem
    $ ndctl disable-namespace namespace0.0
    disabled 1 namespace
    $ ndctl destroy-namespace "namespace0.0"
    destroyed 0 namespaces
    $ ndctl list
    $ ls /dev/pmem0
    ls: cannot access '/dev/pmem0': No such file or directory
    The whole device is assigned to a namespace
    that can use Filesystem DAX, and it is
    currently usable as the block device /dev/pmem0
    Unmount the filesystem
    and destroy the namespace

  105. disabled 1 namespace
    $ ndctl destroy-namespace "namespace0.0"
    destroyed 0 namespaces
    $ ndctl list
    $ ls /dev/pmem0
    ls: cannot access '/dev/pmem0': No such file or directory
    Unmount the filesystem
    and destroy the namespace
    $ ndctl create-namespace -e "namespace0.0" -m devdax -a 1G
    {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":"1024.00 MiB (1073.74 MB)",
    "uuid":"e307a092-8d2d-4d4c-a96e-2163c7d0b770",
    "daxregion":{
    "id":0,
    "size":"1024.00 MiB (1073.74 MB)",
    "align":1073741824,
    "devices":[
    {
    "chardev":"dax0.0",
    "size":"1024.00 MiB (1073.74 MB)",
    "target_node":0,
    "mode":"devdax"
    }
    ]
    },
    "align":1073741824
    }
    Create a new namespace
    in devdax mode
    with a 1GB page size (alignment)

  106. {
    "chardev":"dax0.0",
    "size":"1024.00 MiB (1073.74 MB)",
    "target_node":0,
    "mode":"devdax"
    }
    ]
    },
    "align":1073741824
    }
    Create a new namespace
    ls -lha /dev/dax0.0
    crw------- 1 root root 252, 6 10݄ 19 10:35 /dev/dax0.0
    It's a character device!


  107. Device DAX
    ls -lha /dev/dax0.0
    crw------- 1 root root 252, 6 10݄ 19 10:35 /dev/dax0.0
    This device supports only:
    • open
    • close
    • mmap (map it into the virtual address space)
    • fallocate (punch out part of the mapping)
    fallocate is provided solely to release
    the allocation of specific pages


  108. bool device_dax = false;
    size_t file_size = 0u;
    bool create = vm.count( "create" );
    if( fs::exists( path ) ) {
    device_dax = is_special_file( path );
    if( !device_dax ) file_size = fs::file_size( path );
    else file_size = 0;
    }
    else {
    file_size = pool_size;
    create = true;
    }
    PMEMobjpool *raw_pool = create ?
    pmemobj_create( filename.c_str(), layout, file_size, 0666 ) :
    pmemobj_open( filename.c_str(), layout );
    When creating a pmemobj pool on a Device DAX device,
    pass 0 as the file size to pmemobj_create

  109. ͓·͚
    Intel Optane DC Persistent MemoryΛಈ͔͢ʹ͸
    CascadeLakeϚΠΫϩΞʔΩςΫνϟҎ߱ͷ
    Xeon GoldҎ্ͷϓϩηοα͕ཁΔ
    ௒ߴ͍
    memmap=2G!14G
    ΧʔωϧύϥϝʔλmemmapʹಛผͳࢦఆΛ෇͚ͯLinuxΛىಈ͢Δͱ
    DRAMͷҰ෦ΛNVDIMMͩͱࢥ͍ࠐΉΑ͏ʹͳΔ
    /7%*..ѻ͍͢ΔαΠζ %3".ѻ͍͢ΔαΠζ
    NVDIMMΛ࢖͏ΞϓϦέʔγϣϯͷςετʹศར
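    For example, on a machine with 16 GiB of RAM, the parameter can be added like this (a sketch; the GRUB file paths and regeneration command vary by distribution):

```shell
# /etc/default/grub -- reserve 2 GiB starting at physical offset 14 GiB
# as emulated persistent memory (syntax: memmap=<size>!<start offset>)
GRUB_CMDLINE_LINUX="memmap=2G!14G"

# Regenerate the config and reboot; /dev/pmem0 should then appear.
grub-mkconfig -o /boot/grub/grub.cfg
```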


  110. Summary
    NVDIMM: a new kind of storage that can be written like memory
    Filesystem DAX and Device DAX: bypass the kernel's block layer
    PMDK: flush from user space instead of relying on
    inefficient page-granularity flushes