Slide 1

Slide 1 text

How to Live with Memory That Does Not Vanish When the Power Is Cut
NAOMASA MATSUBAYASHI
Sample code: https://github.com/Fadis/kernelvm_20191019_samples

Slide 2

Slide 2 text

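Registers, L1 cache, L2 cache, L3 cache, DRAM, SSD, hard disk
Toward registers: fast, with a high cost per unit of capacity
Toward the hard disk: slow, with a low cost per unit of capacity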

Slide 3

Slide 3 text

[Figure: write latency [seconds] (10^-10 to 10^-1) plotted against capacity [bytes] (10^0 to 10^13) for registers, L1 cache, L2 cache, L3 cache, DRAM, SSD, and hard disk. Registers through DRAM are labeled "not persistent"; SSD and hard disk are labeled "persistent".]

Slide 4

Slide 4 text

[Figure: write latency [seconds] versus capacity [bytes] for registers, L1 cache, L2 cache, L3 cache, DRAM, SSD, and hard disk.]
Persistence = it takes a very long time

Slide 5

Slide 5 text

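How flash memory works
[Diagram labels: control gate, insulating film, floating gate, tunnel oxide film, N, N, P]
Insulating film: charge cannot pass through.
Tunnel oxide film: charge passes through when pushed hard enough.
The structure is like a field-effect transistor with a floating gate sandwiched inside it.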

Slide 6

Slide 6 text

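[Diagram: control gate at 20V, everything else at 0V]
When a high voltage is applied to the control gate and everything else is tied to GND, charge punches through the tunnel oxide film and accumulates in the floating gate.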

Slide 7

Slide 7 text

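[Diagram: control gate at 5V]
When charge is stored in the floating gate, applying a modest voltage to the control gate does not deplete the electrons near the P layer, so no channel is formed.
Let Vh be the control gate voltage required for current to flow through N-P-N in this state.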

Slide 8

Slide 8 text

[Diagram: control gate at 0V, everything else at 20V]
When the control gate is tied to GND and a high voltage is applied to everything else, charge drains out of the floating gate.

Slide 9

Slide 9 text

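[Diagram: control gate at 5V]
When no charge is stored in the floating gate, applying a voltage to the control gate depletes the electrons near the P layer; electrons in the P layer are pulled in and a channel is formed.
Let Vl be the control gate voltage required for current to flow through N-P-N in this state.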

Slide 10

Slide 10 text

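Connecting many cells in series
Advantage: higher integration density than wiring each cell individually.
Disadvantage: cells connected in series cannot be rewritten individually.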

Slide 11

Slide 11 text

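Connecting many cells in series
Apply Vl to the cell you want to read and Vh to all the others, and the value of the target cell can be read as a resistance.
[Diagram: cells 1 to 5 with gate voltages Vh, Vh, Vl, Vh, Vh]
Cells 1, 2, 4, 5 (Vh): conduct unconditionally.
Cell 3 (Vl): conducts depending on its state.
The value of cell 3 can be read.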
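The read scheme above can be sketched as a small model. This is only an illustration under stated assumptions: the `Cell` type, the function names, and the numeric values of `V_L` and `V_H` are made up here; the only thing taken from the slides is the rule that a charged cell needs Vh to conduct while an uncharged cell conducts at Vl.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed thresholds (illustrative values only, not from the slides).
constexpr double V_L = 1.0;  // gate voltage needed by an uncharged cell
constexpr double V_H = 5.0;  // gate voltage needed by a charged cell

// Hypothetical cell model: it conducts once the gate voltage reaches
// the threshold that corresponds to its floating-gate state.
struct Cell {
    bool charged;  // true if the floating gate holds charge
    bool conducts(double gate_voltage) const {
        return gate_voltage >= (charged ? V_H : V_L);
    }
};

// Apply V_L to the target cell and V_H to every other cell, then report
// whether current flows through the whole series string.
bool string_conducts(const std::vector<Cell>& cells, std::size_t target) {
    for (std::size_t i = 0; i < cells.size(); ++i) {
        const double gate = (i == target) ? V_L : V_H;
        if (!cells[i].conducts(gate)) return false;  // one blocked cell stops the current
    }
    return true;
}

// The string conducts only when the target cell is uncharged.
bool target_is_charged(const std::vector<Cell>& cells, std::size_t target) {
    return !string_conducts(cells, target);
}
```

Calling `target_is_charged(cells, 2)` reads the third cell of the string; the other cells never block the current because V_H is enough for either state.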

Slide 12

Slide 12 text

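[Diagram: control gate at 20V, insulating film, floating gate]
The charge pump that generates the high voltage for writing cannot, in principle, respond quickly.
[Charge pump diagram: V, V, 0, 2V; switched by a clock; charge, then discharge]
This is repeated until the required voltage is reached.
Expecting low latency from it is unreasonable.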

Slide 13

Slide 13 text

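[Diagram: control gate at 20V, insulating film, floating gate]
The charge pump that generates the high voltage for writing cannot, in principle, respond quickly.
To rewrite only some of the cells in a series string, the values of all the cells must be read out and written back.
Expecting low latency from it is unreasonable.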

Slide 14

Slide 14 text

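[Diagram: synchronous I/O versus asynchronous I/O, from "How to Live with Extremely Fast Storage", Kernel/VM Tankentai@Kansai #9]
Today's SSDs cover up the device latency by performing a large number of writes at the same time, so performance suffers unless there is a lot to write.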

Slide 15

Slide 15 text

[Figure: write latency [seconds] versus capacity [bytes] for registers, L1 cache, L2 cache, L3 cache, DRAM, SSD, and hard disk, with a "?" marking an empty region.]
We want something persistent around here.
Research on non-volatile memory to replace flash memory was being carried out in many fields.

Slide 16

Slide 16 text

[Figure: write latency [seconds] versus capacity [bytes] for registers, L1 cache, L2 cache, L3 cache, DRAM, SSD, and hard disk, with NVDIMM placed in the previously empty region.]
We want something persistent around here.
Intel commercialized Optane DC Persistent Memory, which uses a non-volatile memory that replaces flash memory.

Slide 17

Slide 17 text

                    NVMe SSD                   Intel Optane DC             DRAM
Latency             about 300 µs               about 500 ns                about 50 ns
Persistence         persistent                 persistent                  not persistent
Cost per capacity   about 6,000 yen / 128GB    about 50,000 yen / 128GB    about 400,000 yen / 128GB
Write unit          pages only                 cache lines                 cache lines

Slide 18

Slide 18 text

[Image labels: Ge1Sb2Te4, SeAsGeSi]
Intel has not published the details of 3D XPoint, the non-volatile memory used in Optane DC, but findings from a company that specializes in semiconductor analysis have been presented:
3D XPoint: Current Implementations and Future Trends
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/Proceedings_Chrono_2017.html

Slide 19

Slide 19 text

[Image labels: Ge1Sb2Te4, SeAsGeSi]
SeAsGeSi: ovonic threshold switch
A material that shows high resistance only while the voltage is at or below a certain level.
It prevents unintended cells from responding because of leakage current.

Slide 20

Slide 20 text

[Image labels: Ge1Sb2Te4, SeAsGeSi]
Ge1Sb2Te4: probably a superlattice phase-change memory made by stacking GeTe and Sb2Te3.
Depending on the applied voltage, the material behaves in one of three ways:
• it does not change (the current state can be read)
• it changes to amorphous (high resistance)
• it changes to crystalline (low resistance)

Slide 21

Slide 21 text

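[Image labels: Ge1Sb2Te4, SeAsGeSi]
Key points
All operations are done over two wires, so high density is possible without sacrificing individual rewrites the way flash memory does.
Writing is possible at a low voltage of 1V or less, so no time-consuming mechanism for generating a high voltage is needed.
As a result, large capacity, low latency approaching DRAM, and persistence are all achieved.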

Slide 22

Slide 22 text

Problem
How should the OS present this device to user space?
As memory? As a block device?

Slide 23

Slide 23 text

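It is a device that sits in a DIMM socket alongside DRAM, but because its data is persistent, existing applications want to put files on it.
$ ls /dev/pmem0 -lha
brw-rw---- 1 root disk 259, 1 Oct  2 03:15 /dev/pmem0
On Linux, when an NVDIMM is installed, a block device shows up for it to begin with.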

Slide 24

Slide 24 text

Create a filesystem, mount it, and read and write:
$ mkfs.xfs /dev/pmem0
meta-data=/dev/pmem0 isize=512 agcount=4, agsize=128896 blks
         = sectsz=4096 attr=2, projid32bit=1
         = crc=1 finobt=1, sparse=1, rmapbt=0
         = reflink=0
data     = bsize=4096 blocks=515584, imaxpct=25
         = sunit=0 swidth=0 blks
naming   =version 2 bsize=4096 ascii-ci=0, ftype=1
log      =internal log bsize=4096 blocks=2560, version=2
         = sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
$ mount -t xfs /dev/pmem0 /mnt/pmem/
$ dmesg
(omitted)
[1506131.089817] XFS (pmem0): Mounting V5 Filesystem
[1506131.094488] XFS (pmem0): Ending clean mount
$ cd /mnt/pmem/
$ echo 'Hello, World' >hoge
$ ls
hoge
$ mount|grep pmem0
/dev/pmem0 on /mnt/pmem type xfs (rw,relatime,attr2,inode64,noquota)

Slide 25

Slide 25 text

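On Linux: from the request to write a file until the data is written to the hard disk
User space
  Application: requests the write.
Kernel space
  VFS: abstracts the differences between filesystems.
  Page cache: holds the pages to be written until a certain amount has accumulated.
  Filesystem: decides where on the storage to write.
  bio: abstracts the differences between hardware.
  IO scheduler: reorders requests into an order that can be written efficiently.
  Device driver: performs the write to the actual device.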

Slide 26

Slide 26 text

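[Diagram: user space: application; kernel space: VFS, page cache, filesystem, bio, IO scheduler, device driver]
IO scheduler: write order does not affect write speed, so scheduling is unnecessary. This was already being skipped for NVMe.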

Slide 27

Slide 27 text

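Page cache
The CPU cannot read or write the contents of the hard disk directly.
[Diagram: persistent data on disk; a copy of part of the disk contents in DRAM]
1. Temporarily copy the data that is needed.
2. Read the copied data.
3. Modify the data in DRAM.
4. Synchronize the modified contents to the disk.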

Slide 28

Slide 28 text

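mmap
Maps the page cache into the virtual address space of a user-space process so that it can be read and written.
[Diagram: persistent data; a copy of part of the disk contents; the process's virtual address space]
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);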

Slide 29

Slide 29 text

  const auto fd = open( filename.c_str(), new_file ? O_RDWR|O_CREAT : O_RDWR, 0644 );
  if( fd < 0 ) {
    std::cerr << strerror( errno ) << std::endl;
    return 1;
  }
  if( new_file ) ftruncate( fd, file_size );
  const auto raw = mmap( nullptr, file_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0 );
  if( raw == MAP_FAILED ) { // mmap reports failure with MAP_FAILED, not nullptr
    std::cerr << strerror( errno ) << std::endl;
    return 1;
  }
  std::unique_ptr< char, unmap > mapped( reinterpret_cast< char* >( raw ), unmap( mapped_length ) );
  if( vm.count( "write" ) ) {
    std::copy( new_value.begin(), new_value.end(), mapped.get() );
    mapped.get()[ new_value_size ] = '\0';
    msync( mapped.get(), file_size, MS_SYNC );
  }
  else std::cout << mapped.get() << std::endl;
}
Callouts: open the file, mmap it, write the value to the address obtained, then msync.

Slide 30

Slide 30 text

Part of the kernel log at boot:
[ 0.000000] reserve setup_data: [mem 0x0000000000058000-0x0000000000058fff] reserved
[ 0.000000] reserve setup_data: [mem 0x0000000000059000-0x000000000009efff] usable
[ 0.000000] reserve setup_data: [mem 0x000000000009f000-0x000000000009ffff] reserved
[ 0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000009c4d6017] usable
[ 0.000000] reserve setup_data: [mem 0x000000009c4d6018-0x000000009c4e6c57] usable
[ 0.000000] reserve setup_data: [mem 0x000000009c4e6c58-0x000000009c4e7017] usable
[ 0.000000] reserve setup_data: [mem 0x000000009c4e7018-0x000000009c4f7057] usable
[ 0.000000] reserve setup_data: [mem 0x000000009c4f7058-0x000000009c4f8017] usable
[ 0.000000] reserve setup_data: [mem 0x000000009c4f8018-0x000000009c518057] usable
[ 0.000000] reserve setup_data: [mem 0x000000009c518058-0x000000009dc65fff] usable
[ 0.000000] reserve setup_data: [mem 0x000000009dc66000-0x000000009dc92fff] ACPI data
[ 0.000000] reserve setup_data: [mem 0x000000009dc93000-0x000000009f7f7fff] usable
[ 0.000000] reserve setup_data: [mem 0x000000009f7f8000-0x000000009f7f8fff] ACPI NVS
[ 0.000000] reserve setup_data: [mem 0x000000009f7f9000-0x000000009f822fff] reserved
[ 0.000000] reserve setup_data: [mem 0x000000009f823000-0x000000009f8c7fff] usable
[ 0.000000] reserve setup_data: [mem 0x000000009f8c8000-0x00000000a03d8fff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000a03d9000-0x00000000a5952fff] usable
[ 0.000000] reserve setup_data: [mem 0x00000000a5953000-0x00000000a705afff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000a705b000-0x00000000a707cfff] ACPI data
[ 0.000000] reserve setup_data: [mem 0x00000000a707d000-0x00000000a7236fff] usable
[ 0.000000] reserve setup_data: [mem 0x00000000a7237000-0x00000000a786ffff] ACPI NVS
[ 0.000000] reserve setup_data: [mem 0x00000000a7870000-0x00000000a7ffefff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000a7fff000-0x00000000a7ffffff] usable
[ 0.000000] reserve setup_data: [mem 0x00000000a8000000-0x00000000a80fffff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000fe000000-0x00000000fe010fff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000037fffffff] usable
[ 0.000000] reserve setup_data: [mem 0x0000000380000000-0x00000003ffffffff] persistent (type 12)
[ 0.000000] reserve setup_data: [mem 0x0000000400000000-0x000000044dffffff] usable
The entire NVDIMM region appears in the physical address space:
[mem 0x0000000380000000-0x00000003ffffffff] persistent (type 12)

Slide 31

Slide 31 text

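The CPU can read and write the contents of the NVDIMM directly.
When NVDIMM latency is close to DRAM latency, copying data on the NVDIMM into the page cache is wasteful.
[Diagram: persistent data; a copy of part of the disk contents]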

Slide 32

Slide 32 text

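Filesystem DAX
At mmap time, the physical addresses where the file is stored are mapped directly into the virtual address space of the process.
[Diagram: persistent data; the process's virtual address space]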

Slide 33

Slide 33 text

Enabling Filesystem DAX
On a filesystem that supports Filesystem DAX, pass -o dax at mount time.
$ mount -t xfs -o dax /dev/pmem0 /mnt/pmem/
$ dmesg
(omitted)
[1686537.353077] XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[1686537.356044] XFS (pmem0): Mounting V5 Filesystem
[1686537.361297] XFS (pmem0): Ending clean mount
$ cd /mnt/pmem/
$ ls
hoge
$ mount|grep pmem0
/dev/pmem0 on /mnt/pmem type xfs (rw,relatime,attr2,dax,inode64,noquota)

Slide 34

Slide 34 text

[Diagram: user space: application; kernel space: VFS, page cache, filesystem, bio, device driver]
Reads and writes to the mmapped region bypass the kernel's block layer and go directly to the device.

Slide 35

Slide 35 text

Physical address range of the NVDIMM:
[mem 0x0000000380000000-0x00000003ffffffff] persistent (type 12)

Without -o dax:
$ ./00_get_physical_address -p `pidof 00_mmap` -f /mnt/pmem/fuga
/mnt/pmem/fuga: VirtualAddress=0x7f9e0086b000 PhysicalAddress=0x41d1d4000
The physical address behind the virtual address returned by mmap points somewhere outside the NVDIMM.

With -o dax:
$ ./00_get_physical_address -p `pidof 00_mmap` -f /mnt/pmem/fuga
/mnt/pmem/fuga: VirtualAddress=0x7fae9df25000 PhysicalAddress=0x38220d000
The physical address behind the virtual address returned by mmap points 35,704,832 bytes from the start of the NVDIMM.
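The offset quoted on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the range constants are copied from the "persistent (type 12)" log line, while the function name `nvdimm_offset` is made up for illustration. It requires C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Physical range reported as "persistent (type 12)" in the boot log.
constexpr std::uint64_t kPmemBegin = 0x0000000380000000ULL;
constexpr std::uint64_t kPmemEnd   = 0x00000003ffffffffULL;  // inclusive

// Returns the byte offset from the start of the NVDIMM range, or
// std::nullopt if the physical address lies outside that range.
std::optional<std::uint64_t> nvdimm_offset(std::uint64_t phys) {
    if (phys < kPmemBegin || phys > kPmemEnd) return std::nullopt;
    return phys - kPmemBegin;
}
```

With the two addresses from the slide, `nvdimm_offset(0x38220d000)` yields 35,704,832 (0x220d000), and `nvdimm_offset(0x41d1d4000)` yields no value because that address falls in the ordinary "usable" range above the NVDIMM.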

Slide 36

Slide 36 text

Existing applications that use mmap bypass the kernel and get faster with no changes.
Hooray!
...except it is not that simple.

Slide 37

Slide 37 text

(Code as on Slide 29: open, ftruncate, mmap, std::copy, msync.)
Callout on the msync( mapped.get(), file_size, MS_SYNC ) call: what is this doing?

Slide 38

Slide 38 text

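msync
#include <sys/mman.h>
int msync(void *addr, size_t length, int flags);
Reflects the modified parts of an mmapped region to the filesystem.
[Diagram: user space: application; kernel space: VFS, page cache, filesystem, bio, IO scheduler, device driver]
It pushes data that has stopped at the page cache all the way to storage.
With Filesystem DAX we bypass the page cache and write directly to the device, so isn't msync unnecessary?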

Slide 39

Slide 39 text

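[Diagram: registers, L1 cache, L2 cache, L3 cache]
There is cache memory between the CPU and the NVDIMM.
Even if the write looks complete from the CPU's point of view, the data may have stopped somewhere around here.
If power is lost in this state, the contents that were supposedly written are lost.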

Slide 40

Slide 40 text

CLFLUSH: Flush Cache Line
"Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the linear address specified with the memory operand. If that cache line contains modified data at any level of the cache hierarchy, that data is written back to memory. The source operand is a byte memory location."
(from the Intel® 64 and IA-32 Architectures Software Developer's Manual)
Evicts the cache line containing the specified address from every cache.
If that cache line contains modified data, writes it to memory.
The processor waits until this completes.
Callouts ("Stop it!") mark the eviction and the wait.

Slide 41

Slide 41 text

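[Diagram]
CLFLUSH: "write" / "written", "write" / "written", "write" / "written": each flush waits for the previous one to finish.
What we want: "write", "write", "write", then a single "all written".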

Slide 42

Slide 42 text

CLFLUSHOPT: Flush Cache Line Optimized
"(omitted) to enforce ordering with such an operation, software can insert an SFENCE instruction between CLFLUSHOPT and that operation."
(from the Intel® 64 and IA-32 Architectures Software Developer's Manual)
Evicts the cache line containing the specified address from every cache.
If that cache line contains modified data, writes it to memory.
To wait for completion, issue SFENCE.
Callout ("Stop it!") marks the eviction.

Slide 43

Slide 43 text

CLWB: Cache Line Write Back
"Writes back to memory the cache line (if modified) that contains the linear address specified with the memory operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the cache hierarchy in non-modified state."
(from the Intel® 64 and IA-32 Architectures Software Developer's Manual)
A cache line containing the specified address does not have to be evicted from the cache.
If that cache line contains modified data, writes it to memory.
To wait for completion, issue SFENCE.

Slide 44

Slide 44 text

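When msync is requested, the kernel must CLWB the modified parts of the mmapped region.
[Diagram: user space: application; kernel space: VFS, page cache, filesystem, bio, device driver]
The kernel must therefore know which pages on the NVDIMM have been modified. How?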

Slide 45

Slide 45 text

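Page-cache case: first access
User space: the process accesses an address in its virtual address space for which there is no page.
Kernel space (page fault): allocate a new page-cache page and map it read-only; record the correspondence between the page and its location on the device in the address space.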

Slide 46

Slide 46 text

Page-cache case: write
User space: the process writes to the read-only page.
Kernel space (page fault): set the dirty bit on the entry in the address space and make the page writable.

Slide 47

Slide 47 text

Page-cache case: msync
When msync is requested, the pages in the range whose dirty bit is set are written (copied) to the device.

Slide 48

Slide 48 text

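DAX case: first access
User space: the process accesses an address in its virtual address space for which there is no page.
Kernel space (page fault): map the region on the NVDIMM read-only; record the correspondence between the page and its location on the device in the address space.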

Slide 49

Slide 49 text

DAX case: write
User space: the process writes to the read-only page.
Kernel space (page fault): set the dirty bit on the entry in the address space and make the page writable.

Slide 50

Slide 50 text

DAX case: msync
When msync is requested, the pages in the range whose dirty bit is set are CLWBed.

Slide 51

Slide 51 text

DAX case: msync
Smallest page size on x86_64: 4096 bytes
Cache line size on x86_64: 64 bytes
A kernel that detects modifications through page faults can only know what was modified at page granularity.
Even if only a single cache line was modified, 64 CLWBs are required.
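The figure of 64 follows from 4096 / 64. The sketch below compares the number of cache lines an application actually touched with the number that must be written back when only page-granular dirty tracking is available. The function names are hypothetical and nothing here issues a real CLWB; it only counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;     // smallest x86_64 page
constexpr std::size_t kCacheLineSize = 64;  // x86_64 cache line

// Number of cache lines overlapped by the byte range [addr, addr + len).
std::size_t lines_touched(std::uintptr_t addr, std::size_t len) {
    if (len == 0) return 0;
    const std::uintptr_t first = addr / kCacheLineSize;
    const std::uintptr_t last = (addr + len - 1) / kCacheLineSize;
    return static_cast<std::size_t>(last - first + 1);
}

// Number of write-backs needed when modifications are only known per page:
// every cache line of every page that overlaps the range.
std::size_t lines_flushed_with_page_tracking(std::uintptr_t addr, std::size_t len) {
    if (len == 0) return 0;
    const std::uintptr_t first_page = addr / kPageSize;
    const std::uintptr_t last_page = (addr + len - 1) / kPageSize;
    return static_cast<std::size_t>(last_page - first_page + 1) *
           (kPageSize / kCacheLineSize);
}
```

For a 13-byte write at the start of a page, `lines_touched` returns 1 while `lines_flushed_with_page_tracking` returns 64, which is the gap the following slides set out to close.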

Slide 52

Slide 52 text

DAX case: msync
The user-space application that did the write knows exactly which part it wrote.
CLWB is not a privileged instruction, so it can be issued directly from user space.
Instead of calling msync, we want to issue CLWB from user space for just the part that was written.

Slide 53

Slide 53 text

Persistent Memory Development Kit
A set of libraries providing the features you want in order to make use of NVDIMMs, such as flushing from user space.
https://pmem.io/

Slide 54

Slide 54 text

[Diagram: the application and the Persistent Memory Development Kit libraries: libpmem, libpmemblk, libpmemlog, libvmmalloc, libpmemobj, libpmemobj++]

Slide 55

Slide 55 text

libpmem
Contains functions for basic operations, such as pmem_map_file, a platform-independent wrapper around mmap, and pmem_persist, which can flush at a finer granularity than msync.

Slide 56

Slide 56 text

  const auto raw = pmem_map_file( filename.c_str(), file_size, device_dax ? 0 : PMEM_FILE_CREATE, 0644, &mapped_length, &is_pmem );
  if( raw == nullptr ) {
    std::cerr << strerror( errno ) << std::endl;
    return 1;
  }
  std::unique_ptr< char, unmap_pmem > mapped( reinterpret_cast< char* >( raw ), unmap_pmem( mapped_length ) );
  if( vm.count( "write" ) ) {
    std::copy( new_value.begin(), new_value.end(), mapped.get() );
    mapped.get()[ new_value_size ] = '\0';
    // Flush the terminating '\0' as well, hence new_value_size + 1.
    if( is_pmem ) pmem_persist( mapped.get(), new_value_size + 1 );
    else {
      if( pmem_msync( mapped.get(), new_value_size + 1 ) ) {
        std::cerr << strerror( errno ) << std::endl;
        return 1;
      }
    }
  }
  else std::cout << mapped.get() << std::endl;
Callouts: pmem_map_file the file, write the value, then pmem_persist if it is non-volatile memory, or pmem_msync if it is ordinary storage.

Slide 57

Slide 57 text

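void pmem_persist(const void *addr, size_t len);
Writes back the cache for the specified address range, using whatever method the CPU supports.
With DAX, this operation alone makes the write persistent.
int pmem_msync(const void *addr, size_t len);
Calls msync on the pages containing the specified address range.
Unlike msync, neither function requires addr to be aligned to the start of a page.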

Slide 58

Slide 58 text

(Code as on Slide 56: pmem_map_file, std::copy, pmem_persist or pmem_msync.)
Callout on the std::copy( new_value.begin(), new_value.end(), mapped.get() ) call: what happens if power is lost while this write is in progress?

Slide 59

Slide 59 text

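[Diagram: the cache holds "Hello, W" and "orld!"]
The cache has no free space, so the older write ("Hello, W") is flushed with CLWB while "orld!" is still in the cache.
If power is lost at this point, the data is corrupted.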

Slide 60

Slide 60 text

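64bit
On a typical x86_64 PC, the CPU and memory are connected by a 64bit data bus.
Data larger than 64bit is sent in two or more transfers.
After a power loss, data larger than 64bit may have been written only partway.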
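A torn write of this kind can be imitated in ordinary memory. The following is a simplified model, not a description of real hardware: it assumes the data reaches the medium strictly in order, 8 bytes (64 bits) at a time, and that power is lost after a given number of transfers. The function name `torn_write` is made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// `media` is the old persistent content, `data` is what was being written,
// and `units_done` is how many 8-byte transfers completed before power loss.
// Returns what the medium contains afterwards.
std::string torn_write(std::string media, const std::string& data,
                       std::size_t units_done) {
    constexpr std::size_t kUnit = 8;  // one 64-bit transfer
    const std::size_t bytes = std::min(data.size(), units_done * kUnit);
    if (media.size() < bytes) media.resize(bytes);
    std::copy(data.begin(),
              data.begin() + static_cast<std::ptrdiff_t>(bytes),
              media.begin());
    return media;
}
```

Writing "Hello, World!" over thirteen 'X' bytes and stopping after one transfer leaves "Hello, W" followed by five stale 'X' bytes, which matches the half-written state drawn on the previous slide.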

Slide 61

Slide 61 text

libpmemobj
Builds a journal so that large data can be written transactionally.

Slide 62

Slide 62 text

  PMEMobjpool *raw_pool = create ?
    pmemobj_create( filename.c_str(), layout, file_size, 0666 ) :
    pmemobj_open( filename.c_str(), layout );
  if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
  }
  std::unique_ptr< PMEMobjpool, close_pmemobj > pool( raw_pool );
  PMEMoid root = pmemobj_root( pool.get(), sizeof( data_t ) );
  auto root_raw = reinterpret_cast< data_t* >( pmemobj_direct( root ) );
  if( !new_value.empty() ) {
    new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) );
    TX_BEGIN( pool.get() ) {
      pmemobj_tx_add_range( root, offsetof( data_t, message ), sizeof( char ) * ( new_value.size() + 1 ) );
      std::copy( new_value.begin(), new_value.end(), root_raw->message );
      root_raw->message[ new_value.size() ] = '\0';
    } TX_END
  }
  else std::cout << root_raw->message << std::endl;

Slide 63

Slide 63 text

(Code as on Slide 62.)
Callout on pmemobj_create: creates the pool's superblock.
[Diagram: pool layout]
Header: "this is pmemobj"
Log: "the writes to that data and this data were cut off partway"
Data

Slide 64

Slide 64 text

(Code as on Slide 62.)
Callout on pmemobj_root: gets the pool's root object.
PMEMoid: a type that represents an offset from the start of the pool.
Callout on pmemobj_direct: converts the location a PMEMoid points to into a virtual address under the current page mapping.
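Why persistent references are stored as offsets rather than raw pointers can be shown without libpmemobj. The sketch below is an assumption-laden stand-in: `Oid`, `direct`, and `to_oid` are hypothetical names that mimic the role of PMEMoid and pmemobj_direct (the real PMEMoid also carries a pool identifier, which is omitted here), and two plain arrays play the role of the same pool mapped at different addresses in two runs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for PMEMoid: an offset from the start of the pool.
// Offset 0 plays the role of a null object.
struct Oid {
    std::uint64_t off;
};

// Converts an offset into an address under the current mapping.
void* direct(void* pool_base, Oid oid) {
    if (oid.off == 0) return nullptr;
    return static_cast<unsigned char*>(pool_base) + oid.off;
}

// Converts an address inside the pool back into an offset.
Oid to_oid(void* pool_base, void* p) {
    if (p == nullptr) return Oid{0};
    return Oid{static_cast<std::uint64_t>(
        static_cast<unsigned char*>(p) - static_cast<unsigned char*>(pool_base))};
}
```

An `Oid` obtained while the pool is mapped at one base address stays meaningful after the pool is mapped somewhere else, whereas a raw pointer saved in the pool would point into the old mapping.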

Slide 65

Slide 65 text

(Code as on Slide 62.)
If TX_END is not reached, a region registered with pmemobj_tx_add_range returns to its state before TX_BEGIN.
[Diagram: TX_BEGIN, change A, change B, TX_END]
The pool never ends up in a state where only change A is applied and change B is not.

Slide 66

Slide 66 text

(Code as on Slide 62.)
Callout on TX_END: once execution gets this far, the regions registered with pmemobj_tx_add_range are flushed with pmem_persist.

Slide 67

Slide 67 text

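Transaction steps, from start to completion:
1. Copy the target range to the log.
2. Mark the log valid.
3. Rewrite the target range.
4. Mark the log invalid.
5. Delete the log.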

Slide 68

Slide 68 text

(Steps as on Slide 67, with the crash occurring while the log is still invalid.)
[Diagram: log (invalid) and data]
At the next pmemobj_open, an invalid log is deleted.
The data is in its state before the rewrite.

Slide 69

Slide 69 text

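(Steps as on Slide 67, with the crash occurring while the log is valid.)
[Diagram: log (valid) and data]
At the next pmemobj_open, a valid log is copied back to its original location.
The data returns to its state before the rewrite.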

Slide 70

Slide 70 text

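Replaying the log
[Diagram: log (valid) and data; the step list is partly hidden]
If a crash occurs again while the log is being replayed, the log is still valid, so it is replayed again at the next pmemobj_open.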

Slide 71

Slide 71 text

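(Steps as on Slide 67, with the crash occurring while the log is valid.)
[Diagram: log (valid) and data]
At the next pmemobj_open, a valid log is copied back to its original location.
The data returns to its state before the rewrite.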

Slide 72

Slide 72 text

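(Steps as on Slide 67, with the crash occurring after the log has been marked invalid.)
[Diagram: log (invalid) and data]
At the next pmemobj_open, an invalid log is deleted.
The data is in its state after the rewrite.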
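The crash cases walked through on these slides can be replayed with a toy undo log. This is a sketch of the general technique only, not libpmemobj's actual implementation: `Pool`, `update_with_crash`, and `recover` are hypothetical names, ordinary strings stand in for persistent memory, and each of the five steps is assumed to be atomic.

```cpp
#include <cassert>
#include <string>

struct Pool {
    std::string data;        // the persistent data being protected
    std::string log;         // undo log: a copy of the old data
    bool log_valid = false;  // recovery trusts the log only when this is set
};

// Runs the five steps in order and "loses power" after
// `steps_before_crash` of them. A value of 5 or more means no crash.
void update_with_crash(Pool& p, const std::string& new_data, int steps_before_crash) {
    int done = 0;
    auto crashed = [&] { return done++ >= steps_before_crash; };
    if (crashed()) return;
    p.log = p.data;        // 1. copy the target range to the log
    if (crashed()) return;
    p.log_valid = true;    // 2. mark the log valid
    if (crashed()) return;
    p.data = new_data;     // 3. rewrite the target range
    if (crashed()) return;
    p.log_valid = false;   // 4. mark the log invalid
    if (crashed()) return;
    p.log.clear();         // 5. delete the log
}

// What the next open would do with whatever the crash left behind.
void recover(Pool& p) {
    if (p.log_valid) {
        p.data = p.log;    // replay: copy the log back to its original place
        p.log_valid = false;
    }
    p.log.clear();         // an invalid log is simply discarded
}
```

Crashing after 0 to 3 steps and then calling `recover` leaves the old data; crashing after step 4 or later leaves the new data. In no case does a mix of old and new survive, and calling `recover` twice in a row gives the same result as calling it once.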

Slide 73

Slide 73 text

  TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) );
    PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 );
    auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) );
    pmemobj_tx_add_range( next, 0, sizeof( data_t ) );
    new(next_raw) data_t();
    std::copy( new_value.begin(), new_value.end(), next_raw->message );
    next_raw->message[ new_value.size() ] = '\0';
    head_raw->next = next;
  } TX_END
pmemobj_tx_*alloc allocates a region in the pool for new data.
If TX_END is not reached, this allocation is treated as if it never happened.

Slide 74

Slide 74 text

  TX_BEGIN( pool.get() ) {
    pmemobj_tx_add_range( prev, offsetof( data_t, next ), sizeof( PMEMoid ) );
    prev_raw->next = cur_raw->next;
    cur_raw->~data_t();
    pmemobj_tx_free( cur );
  } TX_END
pmemobj_tx_free releases an allocated region.
If TX_END is not reached, this release is treated as if it never happened.

Slide 75

Slide 75 text

#include #include #include #include #include #include class close_pmemobj { public: template< typename T > void operator()( T *p ) const { if( p ) pmemobj_close( p ); } }; namespace fs = boost::filesystem; bool is_special_file( const fs::path &p ) { return fs::status( p ).type() == fs::file_type::character_file || fs::status( p ).type() == fs::file_type::block_file; } struct data_t { char message[ 1024 ]; PMEMoid next; }; int main( int argc, const char *argv[] ) { namespace po = boost::program_options; po::options_description desc( "Options" );

Slide 76

Slide 76 text

fs::status( p ).type() == fs::file_type::block_file; } struct data_t { char message[ 1024 ]; PMEMoid next; }; int main( int argc, const char *argv[] ) { namespace po = boost::program_options; po::options_description desc( "Options" ); std::string new_value; std::string remove_value; uint64_t pool_size; constexpr const char layout[] = "90d2827d-3742-4054-aea8-7a43068085ac"; std::string filename; desc.add_options() ( "help,h", "show this message" ) ( "create,c", "create" ) ( "size,s", po::value< size_t >( &pool_size )->default_value( PMEMOBJ_MIN_POOL ), "pool size" ) ( "filename,f", po::value< std::string >( &filename )->default_value( "/dev/dax0.0" ), "filename" ) ( "append,a", po::value< std::string >( &new_value ), "append" ) ( "delete,d", po::value< std::string >( &remove_value ), "delete" ) ( "list,l", "list" ); po::variables_map vm; po::store( po::parse_command_line( argc, argv, desc ), vm ); po::notify( vm ); if( vm.count( "help" ) ) { std::cout << desc << std::endl; return 0; } size_t mapped_length = 0u;
Callout on struct data_t: a node of a singly linked list that holds a string of at most 1024 bytes and the offset of the next element.

Slide 77

Slide 77 text

pmemobj_open( filename.c_str(), layout ); if( !raw_pool ) { std::cerr << filename << ':' << strerror( errno ) << std::endl; return 1; } std::unique_ptr< PMEMobjpool, close_pmemobj > pool( raw_pool ); PMEMoid root = pmemobj_root( pool.get(), sizeof( data_t ) ); auto root_raw = reinterpret_cast< data_t* >( pmemobj_direct( root ) ); if( !new_value.empty() ) { auto head = root; auto head_raw = root_raw; while( 1 ) { auto next = reinterpret_cast< data_t* >( pmemobj_direct( head_raw->next ) ); if( next ) { head = head_raw->next; head_raw = next; } else break; } new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) ); TX_BEGIN( pool.get() ) { pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) ); PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 ); auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) ); pmemobj_tx_add_range( next, 0, sizeof( data_t ) ); new(next_raw) data_t(); std::copy( new_value.begin(), new_value.end(), next_raw->message ); next_raw->message[ new_value.size() ] = '\0'; head_raw->next = next; } TX_END }
Callout on the while loop: finds the tail node.

Slide 78

Slide 78 text

auto head_raw = root_raw; while( 1 ) { auto next = reinterpret_cast< data_t* >( pmemobj_direct( head_raw->next ) ); if( next ) { head = head_raw->next; head_raw = next; } else break; } new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) ); TX_BEGIN( pool.get() ) { pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) ); PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 ); auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) ); pmemobj_tx_add_range( next, 0, sizeof( data_t ) ); new(next_raw) data_t(); std::copy( new_value.begin(), new_value.end(), next_raw->message ); next_raw->message[ new_value.size() ] = '\0'; head_raw->next = next; } TX_END } if( !remove_value.empty() ) { auto prev = root; auto prev_raw = root_raw; auto cur = prev_raw->next; auto cur_raw = reinterpret_cast< data_t* >( pmemobj_direct( cur ) ); while( cur_raw ) { if( strcmp( cur_raw->message, remove_value.data() ) == 0 ) { break; } auto next = reinterpret_cast< data_t* >( pmemobj_direct( cur_raw->next ) ); if( next ) {
Callouts on the transaction, in order:
Add the tail node's next field to the log as a range to be modified.
Create a new node.
Add the new node to the log as a range to be modified.
Write the value into the new node and link it to the tail node's next.
All of these operations are performed between TX_BEGIN and TX_END.

Slide 79

Slide 79 text

new_value.resize( std::min( new_value.size(), size_t( 1023 ) ) ); TX_BEGIN( pool.get() ) { pmemobj_tx_add_range( head, offsetof( data_t, next ), sizeof( PMEMoid ) ); PMEMoid next = pmemobj_tx_zalloc( sizeof( data_t ), 0 ); auto next_raw = reinterpret_cast< data_t* >( pmemobj_direct( next ) ); pmemobj_tx_add_range( next, 0, sizeof( data_t ) ); new(next_raw) data_t(); std::copy( new_value.begin(), new_value.end(), next_raw->message ); next_raw->message[ new_value.size() ] = '\0'; head_raw->next = next; } TX_END } if( !remove_value.empty() ) { auto prev = root; auto prev_raw = root_raw; auto cur = prev_raw->next; auto cur_raw = reinterpret_cast< data_t* >( pmemobj_direct( cur ) ); while( cur_raw ) { if( strcmp( cur_raw->message, remove_value.data() ) == 0 ) { break; } auto next = reinterpret_cast< data_t* >( pmemobj_direct( cur_raw->next ) ); if( next ) { prev = cur; cur = cur_raw->next; prev_raw = cur_raw; cur_raw = next; } else { std::cerr << "Not found." << std::endl; return 1; }
Create a pool:
$ ./03_pmemobj_alloc -c -f test -s 67108864
Append:
$ ./03_pmemobj_alloc -f test -a abcde -l
abcde
Append:
$ ./03_pmemobj_alloc -f test -a fghij -l
abcde
fghij
Append:
$ ./03_pmemobj_alloc -f test -a klmno -l
abcde
fghij
klmno
Delete:
$ ./03_pmemobj_alloc -f test -d fghij -l
abcde
klmno

Slide 80

Slide 80 text

libpmemobj++
A C++ wrapper for libpmemobj.

Slide 81

Slide 81 text

#include #include #include #include #include #include #include #include #include #include using pmem::obj::p; using pmem::obj::persistent_ptr; struct data_t { persistent_ptr< data_t > next; p< std::array< char, 1024 > > data; }; namespace fs = boost::filesystem; bool is_special_file( const fs::path &p ) { return fs::status( p ).type() == fs::file_type::character_file || fs::status( p ).type() == fs::file_type::block_file; } int main( int argc, const char *argv[] ) { namespace po = boost::program_options; po::options_description desc( "Options" ); std::string new_value = ""; std::string remove_value = ""; uint64_t pool_size;

Slide 82

Slide 82 text

#include #include using pmem::obj::p; using pmem::obj::persistent_ptr; struct data_t { persistent_ptr< data_t > next; p< std::array< char, 1024 > > data; }; namespace fs = boost::filesystem; bool is_special_file( const fs::path &p ) { return fs::status( p ).type() == fs::file_type::character_file || fs::status( p ).type() == fs::file_type::block_file; } int main( int argc, const char *argv[] ) { namespace po = boost::program_options; po::options_description desc( "Options" ); std::string new_value = ""; std::string remove_value = ""; uint64_t pool_size; constexpr const char layout[] = "dd58d49d-4be6-44e0-b160-37e79d94ecf8"; std::string filename; desc.add_options() ( "help,h", "show this message" ) ( "create,c", "create" ) ( "size,s", po::value< size_t >( &pool_size )->default_value( PMEMOBJ_MIN_POOL ), "pool size" ) ( "filename,f", po::value< std::string >( &filename )->default_value( "/dev/dax0.0" ), "filename" ) ( "append,a", po::value< std::string >( &new_value ), "append" ) ( "delete,d", po::value< std::string >( &remove_value ), "delete" )
Callout on struct data_t: a node of a singly linked list that holds a string of at most 1024 bytes and the offset of the next element.

Slide 83

Slide 83 text

file_size = pool_size; create = true; } namespace pobj = pmem::obj; auto pool = create ? pobj::pool< data_t >::create( filename.c_str(), layout, file_size, 0666 ) : pobj::pool< data_t >::open( filename.c_str(), layout ); pobj::persistent_ptr< data_t > root = pool.get_root(); if( !new_value.empty() ) { auto next = root->next; auto cur = root; while( next ) { cur = next; next = next->next; } new_value.resize( 1023 ); std::array< char, 1024 > data; std::copy( new_value.begin(), new_value.end(), data.begin() ); data[ 1023 ] = '\0'; pmem::obj::transaction::exec_tx( pool, [&] { auto new_elem = pmem::obj::make_persistent< data_t >(); new_elem->data = data; cur->next = new_elem; } ); } if( !remove_value.empty() ) { auto next = root->next; auto cur = root; while( next ) { if( strcmp( next->data.get_ro().data(), remove_value.data() ) == 0 ) { const auto data_size = strlen( next->data.get_ro().data() ); pmem::obj::transaction::exec_tx( pool, [&] {
Callouts on the transaction, in order:
Create a new node.
Write the data into the new node.
Link the new node to the tail node's next.
All of these operations are performed inside the lambda passed to exec_tx.

Slide 84

Slide 84 text

  if( !remove_value.empty() ) {
    auto next = root->next;
    auto cur = root;
    while( next ) {
      if( strcmp( next->data.get_ro().data(), remove_value.data() ) == 0 ) {
        const auto data_size = strlen( next->data.get_ro().data() );
        pmem::obj::transaction::exec_tx( pool, [&] {
          cur->next = next->next;
          pmem::obj::delete_persistent< data_t >( next );
        } );
        break;
      }
      cur = next;
      next = next->next;
    }

$ ./04_pmemobj++ -c -f test -s 67108864    # create the pool
$ ./04_pmemobj++ -f test -a abcde -l       # append
abcde
$ ./04_pmemobj++ -f test -a fghij -l       # append
abcde
fghij
$ ./04_pmemobj++ -f test -a klmno -l       # append
abcde
fghij
klmno
$ ./04_pmemobj++ -f test -d fghij -l       # delete
abcde
klmno

Slide 85

Slide 85 text

Persistent Memory Development Kit
[Diagram labels: application, libpmemobj++, libpmemobj, libpmemblk, libpmemlog, libvmmalloc, libpmem]
libpmemlog: append-only, but simpler to write to than libpmemobj

Slide 86

Slide 86 text

  size_t mapped_length = 0u;
  int is_pmem = 0;
  fs::path path( filename );
  bool device_dax = false;
  size_t file_size = 0u;
  bool create = vm.count( "create" );
  if( fs::exists( path ) ) {
    device_dax = is_special_file( path );
    if( !device_dax ) file_size = fs::file_size( path );
    else file_size = 0;
  }
  else {
    file_size = pool_size;
    create = true;
  }
  PMEMlogpool *raw_pool = create ?
    pmemlog_create( filename.c_str(), file_size, 0666 ) :
    pmemlog_open( filename.c_str() );
  if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
  }
  std::unique_ptr< PMEMlogpool, close_pmemlog > pool( raw_pool );
  if( !new_value.empty() )
    pmemlog_append( pool.get(), new_value.data(), new_value.size() );
  if( vm.count( "list" ) ) {
    pmemlog_walk( pool.get(), 0, []( const void *data, size_t length, void* ) -> int {
      std::cout << std::string( reinterpret_cast< const char* >( data ), length ) << std::endl;
      return 0;
    }, nullptr );
  }
}

Open (pmemlog_create / pmemlog_open)
Append (pmemlog_append)
Walk (pmemlog_walk)
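The append and walk semantics used here can be modelled in a few lines of plain C++. The sketch below is an assumption-laden toy, not libpmemlog: `append_only_log` and `chunks_of` are hypothetical names, the storage is a volatile `std::vector`, and the only behaviour it tries to mirror is that a walk with chunk size 0 hands the whole log to the callback in one call, which is how the `pmemlog_walk( pool.get(), 0, ... )` call in the slide is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy model of an append-only log (hypothetical, volatile memory only).
class append_only_log {
public:
  explicit append_only_log( std::size_t capacity ) : capacity_( capacity ) {}

  // Appends length bytes; fails without side effects if they do not fit.
  bool append( const void *data, std::size_t length ) {
    if( length > capacity_ - buffer_.size() ) return false;
    const char *p = static_cast< const char* >( data );
    buffer_.insert( buffer_.end(), p, p + length );
    return true;
  }

  // Calls cb on successive chunks; chunk_size 0 means one call for the
  // whole log. The walk stops early when cb returns false.
  void walk( std::size_t chunk_size,
             const std::function< bool( const char*, std::size_t ) > &cb ) const {
    if( buffer_.empty() ) return;
    if( chunk_size == 0 ) {
      cb( buffer_.data(), buffer_.size() );
      return;
    }
    for( std::size_t off = 0; off < buffer_.size(); off += chunk_size ) {
      const std::size_t n = std::min( chunk_size, buffer_.size() - off );
      if( !cb( buffer_.data() + off, n ) ) return;
    }
  }

private:
  std::size_t capacity_;
  std::vector< char > buffer_;
};

// Helper for experiments: collect the chunks a walk would produce.
std::vector< std::string > chunks_of( const append_only_log &log, std::size_t chunk_size ) {
  std::vector< std::string > out;
  log.walk( chunk_size, [&]( const char *d, std::size_t n ) {
    out.emplace_back( d, n );
    return true;
  } );
  return out;
}
```

Because record boundaries are not stored in this model, appending "abcde" and then "fghij" and walking with chunk size 0 yields the single string "abcdefghij".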

Slide 87

Slide 87 text

$ ./05_pmemlog -c -f test              # create the pool
$ ./05_pmemlog -f test -a abcde -l     # append
abcde
$ ./05_pmemlog -f test -a fghij -l     # append
abcdefghij
$ ./05_pmemlog -f test -a klmno -l     # append
abcdefghijklmno

Slide 88

Slide 88 text

Persistent Memory Development Kit
[Diagram labels: application, libpmemobj++, libpmemobj, libpmemblk, libpmemlog, libvmmalloc, libpmem]
libpmemblk: can only write in units of blocks, but simpler to write to than pmemobj

Slide 89

Slide 89 text

  PMEMblkpool *raw_pool = create ?
    pmemblk_create( filename.c_str(), block_size, file_size, 0666 ) :
    pmemblk_open( filename.c_str(), block_size );
  if( !raw_pool ) {
    std::cerr << filename << ':' << strerror( errno ) << std::endl;
    return 1;
  }
  std::unique_ptr< PMEMblkpool, close_pmemblk > pool( raw_pool );
  const size_t block_count = pmemblk_nblock( pool.get() );
  if( create ) {
    const char buffer[ block_size ] = { 0 };
    for( size_t i = 0; i != block_count; ++i ) {
      pmemblk_write( pool.get(), buffer, i );
      if( i % ( block_count / 10 ) == 0 )
        std::cout << 100 * i / block_count << "%" << std::endl;
    }
  }
  if( !new_value.empty() ) {
    char buffer[ block_size ];
    for( size_t i = 0; i != block_count; ++i ) {
      pmemblk_read( pool.get(), buffer, i );
      if( buffer[ 0 ] == '\0' ) {
        new_value.resize( block_size - 1 );
        std::copy( new_value.begin(), new_value.end(), buffer );
        buffer[ new_value.size() ] = '\0';
        pmemblk_write( pool.get(), buffer, i );
        break;
      }
    }
  }
  if( vm.count( "list" ) ) {
    char buffer[ block_size ];

Open (pmemblk_create / pmemblk_open)
Read a block (pmemblk_read)
Write a block (pmemblk_write)

Slide 90

Slide 90 text

$ ./06_pmemblk -c -f test              # create the pool
0%
9%
19%
29%
39%
49%
59%
69%
79%
89%
99%
$ ./06_pmemblk -f test -a abcde -l     # append
abcde
$ ./06_pmemblk -f test -a fghij -l     # append
abcde
fghij
$ ./06_pmemblk -f test -a klmno -l     # append
abcde
fghij
klmno

Slide 91

Slide 91 text

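Persistent Memory Development Kit
[Diagram labels: application, libpmemobj++, libpmemobj, libpmemblk, libpmemlog, libvmmalloc, libpmem]
libvmmalloc: replaces the memory allocation functions (malloc and friends) with functions that allocate from NVDIMM
This lets NVDIMM be used as large-capacity volatile memory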

Slide 92

Slide 92 text

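Sparse File
[Diagram: a file with pages D D D and a block M, mapped into the process's virtual address space]
Only the pages of the file that have had non-zero values written to them are recorded on storage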

Slide 93

Slide 93 text

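Sparse File
[Diagram: a file with pages D D D D and a block M, mapped into the process's virtual address space; a write arrives at the newly added page]
Only the pages of the file that have had non-zero values written to them are recorded on storage
Writing to a place that has no page allocates a new page
When the number of pages grows, the filesystem metadata changes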

Slide 94

Slide 94 text

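Sparse File
[Diagram: a file with pages D D D D and a block M, mapped into the process's virtual address space]
The application believes it is only rewriting the file's data, so it flushes only the data
Depending on when the system stops, the metadata stays stale and the contents of the new page are lost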

Slide 95

Slide 95 text

  if (flags & PMEM_FILE_CREATE) {
    /*
     * Always set length of file to 'len'.
     * (May either extend or truncate existing file.)
     */
    if (os_ftruncate(fd, (os_off_t)len) != 0) {
      ERR("!ftruncate");
      goto err;
    }
    if ((flags & PMEM_FILE_SPARSE) == 0) {
      if ((errno = os_posix_fallocate(fd, 0, (os_off_t)len)) != 0) {
        ERR("!posix_fallocate");
        goto err;
      }
    }
  } else {
    ssize_t actual_size = util_file_get_size(path);
    if (actual_size < 0) {
      ERR("stat %s: negative size", path);
      errno = EINVAL;
      goto err;
    }
    len = (size_t)actual_size;
  }

From pmdk-1.4.3/src/libpmem/pmem.c
When a new file is created, it is fallocate'd from its beginning to its end (unless PMEM_FILE_SPARSE is set)
A file created with pmem_map_file therefore does not become sparse
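The difference between a sparse and a preallocated file can be observed with plain POSIX calls. The following is a small sketch under stated assumptions: `make_file` and `file_usage` are hypothetical names, the file is a throwaway temporary under `/tmp`, and the block counts reported by `fstat` depend on the filesystem, so they are meant for inspection rather than as guaranteed values.

```cpp
#include <cassert>
#include <cstdlib>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Size and allocated 512-byte blocks of a file, plus a success flag.
struct file_usage {
  off_t size;
  blkcnt_t blocks;
  bool ok;
};

// Creates a temporary file of len bytes. With preallocate == false only
// ftruncate is used, which typically leaves the file sparse. With
// preallocate == true posix_fallocate reserves the whole range, the same
// pair of calls the PMDK excerpt makes through its os_ wrappers.
file_usage make_file( off_t len, bool preallocate ) {
  char path[] = "/tmp/sparse_demo_XXXXXX";
  file_usage result{ 0, 0, false };
  const int fd = mkstemp( path );
  if( fd < 0 ) return result;
  unlink( path );  // the file disappears once fd is closed
  if( ftruncate( fd, len ) == 0 &&
      ( !preallocate || posix_fallocate( fd, 0, len ) == 0 ) ) {
    struct stat st;
    if( fstat( fd, &st ) == 0 ) {
      result.size = st.st_size;
      result.blocks = st.st_blocks;
      result.ok = true;
    }
  }
  close( fd );
  return result;
}
```

Comparing `make_file( 1 << 20, false ).blocks` with `make_file( 1 << 20, true ).blocks` on a filesystem that supports sparse files usually shows few or no blocks for the first and roughly `len / 512` for the second.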

Slide 96

Slide 96 text

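Copy on Write
[Diagram: a file with pages D D D D and a block M, mapped into the process's virtual address space; a write arrives at one page]
Some filesystems always allocate a new region whenever a page is rewritten
The flush can then never be completed in user space alone
That is a problem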

Slide 97

Slide 97 text

• We want to complete the flush in user space
• The filesystem is managed by the kernel
Trying to have both at once is the root of the trouble
Let's stop using a filesystem

Slide 98

Slide 98 text

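Device DAX
Instead of a file on a filesystem created on the NVDIMM, make it possible to mmap the NVDIMM device directly into the process's virtual address space
[Diagram: the process's virtual address space]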

Slide 99

Slide 99 text

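Device DAX
Make it possible to mmap the NVDIMM device directly into the process's virtual address space
[Diagram: address space]
Advantages
The user-space process knows exactly which locations need a flush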

Slide 100

Slide 100 text

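Device DAX
Make it possible to mmap the NVDIMM device directly into the process's virtual address space
[Diagram: address space]
Advantages
The user-space process knows exactly which locations need a flush
The time a write takes is predictable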

Slide 101

Slide 101 text

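Device DAX
Make it possible to mmap the NVDIMM device directly into the process's virtual address space
[Diagram: address space]
Advantages
The user-space process knows exactly which locations need a flush
The time a write takes is predictable
1GB HugePages can be used to reduce TLB misses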

Slide 102

Slide 102 text

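Device DAX
Make it possible to mmap the NVDIMM device directly into the process's virtual address space
[Diagram: address space]
Advantages
The user-space process knows exactly which locations need a flush
The time a write takes is predictable
1GB HugePages can be used to reduce TLB misses
Disadvantage
No filesystem can be used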

Slide 103

Slide 103 text

ndctl
https://github.com/pmem/ndctl
A command that tells the NVDIMM subsystem of the Linux kernel how the NVDIMM is to be used

Slide 104

Slide 104 text

$ ndctl list
[
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"dev",
    "size":2111832064,
    "uuid":"d8aeb862-2052-4d0e-af2b-4961dfaca8d3",
    "sector_size":512,
    "align":2097152,
    "blockdev":"pmem0"
  }
]
$ umount /mnt/pmem
$ ndctl disable-namespace namespace0.0
disabled 1 namespace
$ ndctl destroy-namespace "namespace0.0"
destroyed 0 namespaces
$ ndctl list
$ ls /dev/pmem0
ls: cannot access '/dev/pmem0': No such file or directory

Initial state: the whole device is assigned to a namespace that can use Filesystem DAX and is usable as a block device through /dev/pmem0
Unmount the filesystem and delete the namespace

Slide 105

Slide 105 text

$ ndctl create-namespace -e "namespace0.0" -m devdax -a 1G
{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"dev",
  "size":"1024.00 MiB (1073.74 MB)",
  "uuid":"e307a092-8d2d-4d4c-a96e-2163c7d0b770",
  "daxregion":{
    "id":0,
    "size":"1024.00 MiB (1073.74 MB)",
    "align":1073741824,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"1024.00 MiB (1073.74 MB)",
        "target_node":0,
        "mode":"devdax"
      }
    ]
  },
  "align":1073741824
}

Create a new namespace with usage mode devdax and a page size of 1GB

Slide 106

Slide 106 text

$ ls -lha /dev/dax0.0
crw------- 1 root root 252, 6 Oct 19 10:35 /dev/dax0.0

It's a character device!

Slide 107

Slide 107 text

Device DAX
$ ls -lha /dev/dax0.0
crw------- 1 root root 252, 6 Oct 19 10:35 /dev/dax0.0
The only things this device can do are
• open
• close
• mmap (map it into the virtual address space)
• fallocate (peel off part of what has been mapped)
fallocate is provided solely for removing the allocation of specific pages
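The open, mmap, write, close sequence listed above can be rehearsed on an ordinary file. The sketch below is only a stand-in under stated assumptions: it maps a temporary file under `/tmp` rather than `/dev/dax0.0`, the `mapping` class and `round_trip` function are hypothetical names, and it uses `msync` to push the data out, whereas the point of Device DAX with PMDK is that the flush is done from user space instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

// RAII wrapper around a shared read/write mapping of fd.
class mapping {
public:
  mapping( int fd, std::size_t length )
    : addr_( mmap( nullptr, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 ) ),
      length_( length ) {}
  ~mapping() { if( valid() ) munmap( addr_, length_ ); }
  mapping( const mapping& ) = delete;
  mapping &operator=( const mapping& ) = delete;
  bool valid() const { return addr_ != MAP_FAILED; }
  char *data() const { return static_cast< char* >( addr_ ); }
private:
  void *addr_;
  std::size_t length_;
};

// Writes text through one mapping, then reads it back through a second
// mapping of the same file. Returns an empty string on setup failure.
std::string round_trip( const std::string &text ) {
  char path[] = "/tmp/dax_demo_XXXXXX";  // stand-in for the DAX device
  const int fd = mkstemp( path );
  if( fd < 0 ) return {};
  unlink( path );
  const std::size_t length = 4096;
  std::string result;
  if( text.size() < length && ftruncate( fd, length ) == 0 ) {
    {
      mapping writer( fd, length );
      if( writer.valid() ) {
        std::memcpy( writer.data(), text.data(), text.size() );
        msync( writer.data(), length, MS_SYNC );  // page-granular flush
      }
    }
    mapping reader( fd, length );
    if( reader.valid() ) result.assign( reader.data(), text.size() );
  }
  close( fd );
  return result;
}
```

On a real Device DAX namespace the mapping length would also have to respect the namespace alignment (1GB in the example above), which this file-based sketch does not model.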

Slide 108

Slide 108 text

  bool device_dax = false;
  size_t file_size = 0u;
  bool create = vm.count( "create" );
  if( fs::exists( path ) ) {
    device_dax = is_special_file( path );
    if( !device_dax ) file_size = fs::file_size( path );
    else file_size = 0;
  }
  else {
    file_size = pool_size;
    create = true;
  }
  PMEMobjpool *raw_pool = create ?
    pmemobj_create( filename.c_str(), layout, file_size, 0666 ) :
    pmemobj_open( filename.c_str(), layout );

When creating a pmemobj pool on Device DAX, pass 0 as the file size to pmemobj_create

Slide 109

Slide 109 text

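Bonus
Running Intel Optane DC Persistent Memory requires a processor of Xeon Gold class or higher from the Cascade Lake microarchitecture or later
Extremely expensive
memmap=2G!14G
  2G: size to treat as NVDIMM
  14G: size to treat as DRAM (the offset at which the NVDIMM region starts)
Booting Linux with this special form of the memmap kernel parameter makes the kernel believe that part of the DRAM is NVDIMM
Handy for testing applications that use NVDIMM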
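To make the arithmetic behind `memmap=2G!14G` concrete, here is a small parser sketch. It assumes the `size!start` form with optional K, M or G suffixes only; `pmem_region`, `parse_size` and `parse_memmap` are hypothetical helper names, and the kernel's own parser accepts more than this toy does.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <string>

// Emulated persistent-memory region: size bytes starting at start.
struct pmem_region {
  std::uint64_t size;
  std::uint64_t start;
};

// Parses a number with an optional K/M/G suffix, e.g. "14G".
bool parse_size( const std::string &s, std::uint64_t &out ) {
  if( s.empty() || !std::isdigit( static_cast< unsigned char >( s[ 0 ] ) ) )
    return false;
  char *end = nullptr;
  const std::uint64_t value = std::strtoull( s.c_str(), &end, 0 );
  const std::string suffix( end );
  unsigned shift = 0;
  if( suffix == "K" || suffix == "k" ) shift = 10;
  else if( suffix == "M" || suffix == "m" ) shift = 20;
  else if( suffix == "G" || suffix == "g" ) shift = 30;
  else if( !suffix.empty() ) return false;
  out = value << shift;
  return true;
}

// Parses "memmap=SIZE!START" (the "memmap=" prefix is optional).
bool parse_memmap( std::string arg, pmem_region &out ) {
  const std::string prefix = "memmap=";
  if( arg.compare( 0, prefix.size(), prefix ) == 0 ) arg.erase( 0, prefix.size() );
  const std::size_t bang = arg.find( '!' );
  if( bang == std::string::npos ) return false;
  pmem_region region{};
  if( !parse_size( arg.substr( 0, bang ), region.size ) ) return false;
  if( !parse_size( arg.substr( bang + 1 ), region.start ) ) return false;
  out = region;
  return true;
}
```

For `memmap=2G!14G` this gives a 2 GiB region covering physical addresses from 14 GiB up to 16 GiB, so on a 16 GiB machine the first 14 GiB stay ordinary DRAM.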

Slide 110

Slide 110 text

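Summary
NVDIMM: a new kind of storage that can be written like memory
Filesystem DAX and Device DAX: bypass the kernel's block layer
PMDK: flush in user space instead of using inefficient page-granular flushes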