Slide 1

Slide 1 text

Introduction of System Software for Persistent Memory
Makoto Shimazu @ Reading Circle, 2014/12/18
S. R. Dulloor (1,3), S. Kumar (1), A. Keshavamurthy (2), P. Lantz (1), D. Reddy (1), R. Sankaran (1), J. Jackson (1)
(1) Intel Labs, (2) Intel Corp, (3) Georgia Institute of Technology
EuroSys 2014

Slide 2

Slide 2 text

Contributions
Introduction of pm_wbarrier
File system architecture optimized for PM
  ■ Lightweight, consistent POSIX file system
  ■ Memory-mapped I/O
  ■ Protection from stray writes
Performance evaluation with a PM emulator

Slide 3

Slide 3 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 4

Slide 4 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 5

Slide 5 text

Caching problem in PM
Flushing the cache explicitly (clflush) works well
[Figure labels: load/store to DRAM; read/write to SSD/HDD; load/store to PM; Cache; Volatile Area; Non-volatile Area]
Source of HDD/SSD figure: http://storage-system.fujitsu.com/jp/lib-f/tech/beginner/ssd/

Slide 6

Slide 6 text

Caching problem in PM
Flushing the cache explicitly (clflush) works well
clflush cannot flush data out of the memory controller (MC)
[Figure labels: load/store to DRAM; read/write to SSD/HDD; load/store to PM; Cache; MC; Volatile Area; Non-volatile Area]
Source of HDD/SSD figure: http://storage-system.fujitsu.com/jp/lib-f/tech/beginner/ssd/

Slide 7

Slide 7 text

pm_wbarrier
Feature
  ■ Enforce the durability of a cacheline
Steps of usage
1. clflush A
  ■ Flush the cacheline that contains A
2. sfence
  ■ Ensure the completion of the store
3. pm_wbarrier
  ■ Ensure the durability of every store to PM
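The three steps above can be sketched in C as follows. This is a minimal sketch, not the paper's code: pm_wbarrier is a proposed hardware primitive, so it is modeled here by a hypothetical stub that only counts calls, and the names pm_persist, FLUSH_LINE and STORE_FENCE are assumptions made for illustration. On x86-64 the flush and fence map to the _mm_clflush and _mm_sfence intrinsics; on other targets the sketch falls back to no-op stand-ins so that it still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>                 /* _mm_clflush, _mm_sfence */
#define FLUSH_LINE(p)  _mm_clflush(p)
#define STORE_FENCE()  _mm_sfence()
#else                                  /* stand-ins for non-x86 builds */
#define FLUSH_LINE(p)  ((void)(p))
#define STORE_FENCE()  __sync_synchronize()
#endif

#define CACHELINE 64

/* Hypothetical stub: the real pm_wbarrier would be a hardware primitive.
 * Here it only counts how often it was requested. */
static unsigned long pm_wbarrier_calls;
static void pm_wbarrier(void) { pm_wbarrier_calls++; }

/* Make the byte range [addr, addr + len) durable. */
static void pm_persist(const void *addr, size_t len)
{
    assert(len > 0);
    uintptr_t line = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end  = (uintptr_t)addr + len;

    for (; line < end; line += CACHELINE)
        FLUSH_LINE((const void *)line);   /* 1. clflush every touched line */
    STORE_FENCE();                        /* 2. sfence */
    pm_wbarrier();                        /* 3. pm_wbarrier */
}
```

A caller would store to PM first and then call pm_persist on the modified range; one barrier covers all the lines flushed before it.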

Slide 8

Slide 8 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 9

Slide 9 text

Layout of PMFS

Slide 10

Slide 10 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 11

Slide 11 text

Consistency
Three existing techniques:
  ■ Copy on Write (CoW): used for updates to the data area
  ■ Journaling: used for updates to metadata (inode)
  ■ Log-structured updates
One more PM-specific technique:
  ■ Atomic in-place writes: used for updates to a small portion of data

Slide 12

Slide 12 text

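Copy on Write (Shadow Paging)
Safe and consistent method to modify data
Three steps: Copy, Modify, Refer
1: Copy
2: Modify
3: Refer (repoint the parent to the modified copy)
Recursive copy: repointing modifies the parent, which may itself have to be copied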
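The Copy, Modify, Refer steps can be sketched in C as below. This is an illustrative sketch on ordinary heap memory, not PMFS code: struct parent, cow_update and BLOCK_SIZE are hypothetical names, and a single parent pointer stands in for the tree of references that makes the copy recursive in a real file system.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Hypothetical parent node holding the only reference to a data block. */
struct parent {
    _Atomic(char *) block;
};

/* Returns 0 on success, -1 if allocation fails (the old block stays live). */
static int cow_update(struct parent *p, size_t off, const void *src, size_t len)
{
    assert(off + len <= BLOCK_SIZE);

    char *old  = atomic_load(&p->block);
    char *copy = malloc(BLOCK_SIZE);
    if (copy == NULL)
        return -1;

    memcpy(copy, old, BLOCK_SIZE);     /* 1: Copy */
    memcpy(copy + off, src, len);      /* 2: Modify the private copy */
    /* On real PM the copy would have to be made durable here, before
     * the pointer swap; that step is omitted in this sketch. */
    atomic_store(&p->block, copy);     /* 3: Refer, one atomic pointer swap */

    free(old);
    return 0;
}
```

Until step 3 the old block is untouched, so a failure at any earlier point leaves the original data intact.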

Slide 13

Slide 13 text

Journaling
[Figure: journaling example on hello.txt. Labels: "Hello World!"; 1: WRITE "RINKO"; 2: WRITE "NOW!!!"; Log; Snapshot; CRASH!; "RINKO NXXXX"; "RINKO NOW!!!"]
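One common form of journaling, undo logging, can be sketched as follows. This is a simplified illustration under stated assumptions: the log holds a single entry, the crash is simulated by the hypothetical crash_after parameter, and all names (undo_log, journaled_write, recover) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 32

/* Hypothetical one-entry undo journal: keeps the old bytes of one range. */
struct undo_log {
    int    valid;          /* 1 while an update is in flight */
    size_t off, len;
    char   old[MAX_LEN];
};

/* Write src into file at off. A crash_after smaller than len simulates
 * a crash after that many bytes have reached the file. */
static void journaled_write(struct undo_log *log, char *file, size_t off,
                            const char *src, size_t len, size_t crash_after)
{
    assert(len <= MAX_LEN);

    /* Snapshot the old contents in the log before touching the file. */
    log->off = off;
    log->len = len;
    memcpy(log->old, file + off, len);
    log->valid = 1;

    size_t n = crash_after < len ? crash_after : len;
    memcpy(file + off, src, n);        /* in-place update, possibly torn */

    if (n == len)
        log->valid = 0;                /* commit: the snapshot is obsolete */
}

/* After a crash, roll an unfinished update back to the logged snapshot. */
static void recover(struct undo_log *log, char *file)
{
    if (log->valid) {
        memcpy(file + log->off, log->old, log->len);
        log->valid = 0;
    }
}
```

Every logged byte is written twice, once to the log and once in place, which is the "double writes" cost that journaling pays for large updates.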

Slide 14

Slide 14 text

Hybrid method
Metadata
  ■ Updated by fine-grained logging
Data
  ■ Uses the Copy on Write method
                 Distributed small modification   Centralized large modification
Copy on Write    ☓ (Write amplification)          ◯ (Freely after copy)
Journaling       ◯ (Just append logs)             ☓ (Double writes)

Slide 15

Slide 15 text

Fine-grained Logging
64-byte granularity works well for logging file system metadata
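A 64-byte log entry fits exactly one cacheline. The layout below is a sketch of what such an entry could look like; the field names and the 16-byte header / 48-byte payload split are assumptions for illustration, not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOG_ENTRY_SIZE 64
#define LOG_DATA_SIZE  48

/* Hypothetical layout: 16 bytes of header plus 48 bytes of payload. */
struct log_entry {
    uint64_t addr;                  /* location of the logged bytes */
    uint32_t trans_id;              /* transaction the entry belongs to */
    uint16_t gen_id;
    uint8_t  type;
    uint8_t  size;                  /* payload bytes in use (<= 48) */
    uint8_t  data[LOG_DATA_SIZE];   /* logged metadata bytes */
};

_Static_assert(sizeof(struct log_entry) == LOG_ENTRY_SIZE,
               "a log entry must fill exactly one cacheline");

/* Build an entry that records `size` bytes found at `old`. */
static struct log_entry make_entry(uint64_t addr, uint32_t trans_id,
                                   const void *old, uint8_t size)
{
    assert(size <= LOG_DATA_SIZE);
    struct log_entry e;
    memset(&e, 0, sizeof e);
    e.addr     = addr;
    e.trans_id = trans_id;
    e.size     = size;
    memcpy(e.data, old, size);
    return e;
}
```

Because one entry is one cacheline, a single flush makes it durable, and small metadata fields such as timestamps fit comfortably in the payload.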

Slide 16

Slide 16 text

Extended atomic in-place writes
8 bytes (the same as BPFS)
  ■ Update an inode's access time
16 bytes
  ■ Using the cmpxchg16b instruction
  ■ Update an inode's size and modification time
64 bytes
  ■ Using RTM (introduced in Haswell; has an erratum)
  ■ Update a number of inode fields, as in delete
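The 8-byte case can be sketched with C11 atomics. This is only an illustration of the idea: struct inode_times and set_atime are hypothetical names, and the 16-byte and 64-byte variants, which rely on cmpxchg16b and RTM, are not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* The sketch assumes a target where 8-byte atomics are always lock-free. */
_Static_assert(ATOMIC_LLONG_LOCK_FREE == 2,
               "8-byte atomic stores must be lock-free");

/* Hypothetical inode fragment; the field is naturally 8-byte aligned. */
struct inode_times {
    _Atomic uint64_t atime;
};

/* One 8-byte store: an observer sees either the old or the new value,
 * never a mix of the two, so no log entry is needed for this update. */
static void set_atime(struct inode_times *t, uint64_t now)
{
    atomic_store_explicit(&t->atime, now, memory_order_release);
}
```

On PM the store would still have to be flushed to become durable; atomicity only guarantees that the value is never torn.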

Slide 17

Slide 17 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 18

Slide 18 text

Write Protection
Supervisor Mode Access Protection (SMAP)
  ■ Prohibits supervisor-mode (kernel) writes into the user area
Write windows (introduced in this paper)
  ■ Mount as read-only
  ■ When writing, CR0.WP is set to zero
[Figure (right): protection rings, from http://en.wikipedia.org/wiki/Protection_ring]
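Toggling CR0.WP is only possible in kernel mode, so the write-window idea is illustrated here with a user-space analogue that swaps in a different mechanism: the region is mapped read-only and mprotect briefly opens it for each write. This is explicitly not the mechanism on the slide; pm_map and pm_write are hypothetical names, and an anonymous mapping stands in for PM.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map a region that is read-only by default; returns NULL on failure. */
static void *pm_map(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Open a write window, copy the data, and close the window again.
 * Returns 0 on success, -1 if changing the protection fails. */
static int pm_write(void *pm, size_t len, size_t off, const void *src, size_t n)
{
    assert(off + n <= len);

    if (mprotect(pm, len, PROT_READ | PROT_WRITE) != 0)   /* open window */
        return -1;
    memcpy((char *)pm + off, src, n);
    return mprotect(pm, len, PROT_READ);                  /* close window */
}
```

Outside the window any stray store to the region faults instead of silently corrupting data, which is the property the write windows aim for.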

Slide 19

Slide 19 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 20

Slide 20 text

Implementation on Linux
Execute In Place (XIP)
  ■ Interface for loading data directly from flash in environments with limited RAM
  ■ Used to avoid the block device/page cache layer

Slide 21

Slide 21 text

Testing and Validation
Yat: hypervisor-based validation framework
  ■ Ensures cache flushing and pm_wbarrier are executed in the correct order
  ■ Paper published at USENIX ATC'14

Slide 22

Slide 22 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 23

Slide 23 text

Evaluation
Environment
  ■ PM Emulation Platform (PMEP)
  ■ PM Block Driver (PMBD)
Results
  ■ File-based Access
  ■ Memory-Mapped I/O
  ■ Write Protection

Slide 24

Slide 24 text

Evaluation Settings
PM Emulation Platform (PMEP)
  ■ Configurable latencies and bandwidth for PM
  ■ Configurable pm_wbarrier latency
Environment
Partitioned memory channels
  ■ using custom BIOS?
Latency Emulation
  ■ debug hook and HW counter counting LLC stall cycles
Bandwidth Emulation
  ■ memory controller
Element   Value
CPU       Xeon (2.6GHz), 8 cores x 2 sockets
DRAM      16GB
PM        256GB (disabled NUMA?)

Slide 25

Slide 25 text

PMBD
Persistent Memory Block Driver (PMBD), presented at MSST'14
Introduced for fair comparison
Open-source implementation
  ■ https://github.com/linux-pmbd/pmbd
Partition between DRAM and PM
Use non-temporal stores

Slide 26

Slide 26 text

File-based Access
File I/O (right 4 graphs)
  ■ Single thread
  ■ Single 64GB file
File Utilities (bottom)
  ■ On a Linux kernel tarball

Slide 27

Slide 27 text

In-place updates/Logging
Effect of in-place updates
Compared with fine-grained logging:
  ■ Using 16-byte atomic writes: 1.8X faster
  ■ Using 64-byte atomic writes: 18% faster
Logging Overhead

Slide 28

Slide 28 text

Mmap
Random read/write in a single 64GB file
  ■ Large enough not to fit in the page cache
PMFS-D: default 4kB pages
PMFS-L: 1GB pages
Improvement thanks to omitting the page cache
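From the application's side, memory-mapped I/O looks like the sketch below: the file is mapped once and then accessed with plain loads and stores. This is a generic POSIX example, not benchmark code from the paper; map_file is a hypothetical helper, and page-size selection (4kB versus 1GB) is not shown.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the first len bytes of the file at path for load/store access,
 * creating and sizing the file if needed. Returns NULL on failure. */
static char *map_file(const char *path, size_t len)
{
    assert(len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```

A caller writes with an ordinary assignment such as m[100] = 'R' and calls msync to request durability. On a conventional file system those stores land in page-cache pages that are written back later; a file system that maps PM directly avoids that copy.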

Slide 29

Slide 29 text

Neo4j (user application of mmap)
Dataset
  ■ 10M nodes/100M edges from a Wikipedia dataset
Workload
  ■ Delete: deleting 2000 nodes and associated edges
  ■ Insert: adding back the 2000 nodes and the edges
  ■ Query: selecting two nodes and calculating the shortest path
Improvements from no copy overhead
Improvements from lower synchronous write latency

Slide 30

Slide 30 text

Effect of Write Protection
Multi-threaded workload is serialized by writing the control register

Slide 31

Slide 31 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 32

Slide 32 text

Related Work
Enhance new storage
  ■ DFS[30], Log-structured File System[37], Conquest FS[41]
Hybrid of NVM and Disk or Flash
  ■ Rio File Cache[24], Conquest FS[41]
PM-only Storage
  ■ BPFS[27], SCMFS[43]
High Level API on PM
  ■ Failure-atomic msync[33]
  ■ NV-Heaps[26], Mnemosyne[40]
  ■ Library solutions[39]

Slide 33

Slide 33 text

Outline
Volatile cache problem
Architecture
  ■ Consistency
  ■ Write protection from stray writes
Implementation
Evaluation
Related Work
Conclusion

Slide 34

Slide 34 text

Conclusion
Substantial benefits to legacy applications that use the POSIX API
Well-considered consistency protocol
In-depth evaluation with a PM emulator