Fairer and More Scalable Reader-Writer Locks by Optimizing Queue Management

Slides of my talk at PPoPP 2025

Paper URL: https://doi.org/10.1145/3710848.3710877
Artifact URL: https://doi.org/10.5281/zenodo.14577424

Takashi HOSHINO

March 03, 2025

Transcript

  1. Fairer and More Scalable Reader-Writer Locks by Optimizing Queue Management

    Takashi Hoshino (Cybozu Labs, Inc.) and Kenjiro Taura (The University of Tokyo)
    PPoPP 2025 (static version for PDF)
  2. Background

    • NUMA-awareness for shared-memory locking
      • Reducing cross-NUMA communication is critical for performance
      • Fairness is also necessary for low tail latency
    • Many mutexes for fine-grained locking
      • Heavy mutex usage (e.g., B-Tree nodes) requires more efficiency
      • Reducing contention more carefully is necessary
    • Reader-writer locking
      • Rwlocks are widely used in many systems, particularly in database systems
      • The existence of multiple readers introduces engineering challenges
    [Figure: a B-tree node with its mutex stub, and a two-node NUMA machine with per-node main memory and L3 cache and per-core L1/L2 caches]
  3. Goal, Contributions, and Results

    • Goal
      • Fairer and more scalable reader-writer locking methods
      • Balancing fairness (low tail latency) and scalability (throughput) at a high level
    • Key contributions
      • The Freezer mechanism eliminates mutex stub spinning while preserving stack allocation of requests
      • The Freezer fast path (FFP) does not compromise the fairness policy
      • Request-queue optimizations enable batching and parallel processing of read requests
    • Key results
      • Up to 3.5x higher throughput in B-Tree workloads
      • Up to 3.1x better tail latency compared to conventional methods with fast paths
  4. Outline

    • Background
    • Goals, contributions, and results
    • Conventional methods and their issues
    • Proposed methods
    • Evaluation
    • Conclusion
  5. MCS Lock (Mellor-Crummey+ 1991a)

    • Lock: (1)(2) enqueue and (3) wait if the queue is not empty
    • Unlock: (4) dequeue, by notifying the next request if it exists (see the sketch below)
    • The MCS rwlock (Mellor-Crummey+ 1991b) inherited this structure
    • All threads spin-wait locally, reducing mutex stub contention
    • Issues:
      • Request allocation requires heap memory (for POSIX or C++ STL lock compatibility)
      • Task-fair fairness policy, which is FIFO and NUMA-oblivious
    [Figure: mutex stub with a tail pointer over the owner and waiters linked by next pointers; waiters spin-wait locally and are woken by notification through the queue]
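
    A minimal C++ sketch of the classic MCS lock described above (not from the slides; the MCSLock/QNode names are illustrative), with the steps (1)-(4) marked:

        #include <atomic>

        // One queue node per locking thread; waiters spin only on their own
        // node's flag, so the shared tail is touched once per lock/unlock.
        struct QNode {
            std::atomic<QNode*> next{nullptr};
            std::atomic<bool>   locked{false};
        };

        struct MCSLock {
            std::atomic<QNode*> tail{nullptr};

            void lock(QNode* me) {
                me->next.store(nullptr, std::memory_order_relaxed);
                // (1)(2) Enqueue: atomically become the new tail.
                QNode* prev = tail.exchange(me, std::memory_order_acq_rel);
                if (prev != nullptr) {
                    // (3) Queue was not empty: link in and spin-wait locally.
                    me->locked.store(true, std::memory_order_relaxed);
                    prev->next.store(me, std::memory_order_release);
                    while (me->locked.load(std::memory_order_acquire)) { /* spin */ }
                }
            }

            void unlock(QNode* me) {
                QNode* succ = me->next.load(std::memory_order_acquire);
                if (succ == nullptr) {
                    // No visible successor: try to reset the queue to empty.
                    QNode* expected = me;
                    if (tail.compare_exchange_strong(expected, nullptr,
                                                     std::memory_order_acq_rel))
                        return;
                    // A new waiter is mid-enqueue; wait for its next-link.
                    while ((succ = me->next.load(std::memory_order_acquire)) == nullptr) {}
                }
                // (4) Dequeue: notify the next request, which becomes the owner.
                succ->locked.store(false, std::memory_order_release);
            }
        };
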
  6. LSM (Leader-Spinning-on-Mutex) Lock

    • Two-stage approach to always allocate requests in stack memory (see the sketch below)
      • 1st stage: an ordinary lock (MCS lock or any other)
      • 2nd stage: a simple lock with a lock_word
      • The owner (the lock holder of the 1st stage) spin-waits on the lock_word
    • A fast path (SFP) can skip the 1st stage, but it limits fairness to starvation-free
    • The Shfl rwlock (Kashyap+ 2019) is a typical LSM lock
    • Issues
      • Starvation-free fairness is not enough for low tail latency; bounded-bypass is better
      • Spinning on the mutex stub becomes necessary again
    [Figure: normal path through a tail-pointed queue of waiters whose leader spins on the lock_word held by the current lock holder(s); the fast path bypasses the queue]
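
    A hedged sketch of the two-stage LSM pattern described above, reusing the MCSLock/QNode types from the previous sketch; for brevity this is a plain mutex, whereas the Shfl rwlock adds reader/writer handling and shuffling on top:

        #include <atomic>

        struct LSMMutex {
            MCSLock stage1;                      // 1st stage: any queue lock
            std::atomic<bool> lock_word{false};  // 2nd stage: word in the mutex stub

            void lock() {
                QNode me;                        // request lives on the caller's stack
                stage1.lock(&me);                // become the stage-1 leader
                // Only the leader spins on the shared lock_word (hence the
                // name), but this reintroduces spinning on the mutex stub.
                bool expected = false;
                while (!lock_word.compare_exchange_weak(
                           expected, true, std::memory_order_acquire))
                    expected = false;
                stage1.unlock(&me);              // the request can leave the stack now
            }

            void unlock() { lock_word.store(false, std::memory_order_release); }
        };
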
  7. Request-queue Shuffling (Kashyap+ 2019)

    • An approach for NUMA-awareness (see the sketch below)
      • Write shufflers group writers of the same NUMA node
      • Read shufflers group readers
    • Issues
      • Every request becomes the owner and modifies the mutex stub individually
      • Only a single thread can act as a shuffler at any given time
    [Figure: a shuffler (S) traverses and reorders the queue of read and write requests from different NUMA nodes; the owner (O) sits at the head]
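
    A single-threaded illustration of what a write shuffler conceptually does; the Req type and shuffle_writers helper are hypothetical, and the real Shfl protocol performs this concurrently with enqueues and bounds how far it traverses:

        struct Req {
            Req* next;
            int  numa_node;
            bool is_writer;
        };

        // Stably partition the requests after `shuffler`: write requests from
        // the shuffler's own NUMA node come first so they run back-to-back;
        // everything else follows, both groups in their original order.
        void shuffle_writers(Req* shuffler) {
            Req *grp_head = nullptr, *grp_tail = nullptr;    // same-node writers
            Req *rest_head = nullptr, *rest_tail = nullptr;  // all other requests
            for (Req* cur = shuffler->next; cur != nullptr; ) {
                Req* nxt = cur->next;
                cur->next = nullptr;
                if (cur->is_writer && cur->numa_node == shuffler->numa_node) {
                    if (grp_tail) grp_tail->next = cur; else grp_head = cur;
                    grp_tail = cur;
                } else {
                    if (rest_tail) rest_tail->next = cur; else rest_head = cur;
                    rest_tail = cur;
                }
                cur = nxt;
            }
            if (grp_tail) grp_tail->next = rest_head;
            shuffler->next = grp_head ? grp_head : rest_head;
        }
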
  8. Evolution of Rwlocks

    Name                    Year      Request queue  Request allocation  NUMA-awareness  Fairness
    Spin rwlock             1971      ---            ---                 ---             ---
    Ticket rwlock           1991      ---            ---                 ---             Task-fair
    MCS rwlock              1991      Yes            Heap                ---             Task-fair
    Phase-fair queue lock   2010(*1)  Yes            Heap                ---             Starvation-free
    Cohort rwlock NP/RP/WP  2013(*2)  ---            ---                 Yes             Bounded-bypass/---/---
    CST rwlock              2017(*3)  Yes            Stack               Yes             ---
    Shfl rwlock             2019      Yes            Stack               Yes             Starvation-free
    Frzr rwlock             Proposed  Yes            Stack               Yes             Bounded-bypass

    Fairness levels (from strictest to weakest): Task-fair ⊂ Bounded-bypass ⊂ Starvation-free ⊂ No fairness
    • Starvation-free: overtaken by a finite number of requests (theoretically)
    • Bounded-bypass: overtaken by at most a constant number of requests
    *1: Brandenburg+ 2010  *2: Calciu+ 2013  *3: Kashyap+ 2017
  9. Outline

    • Background
    • Goals, contributions, and results
    • Conventional methods and their issues
    • Proposed methods
    • Evaluation
    • Conclusion
  10. Proposed Methods

    • Freezer mechanism
      • Eliminates mutex stub spinning by temporarily freezing the request queue
    • Freezer Fast Path (FFP)
      • Does not compromise the fairness property provided by the request queue
    • Four request-queue optimizations for readers
      • E.g., batching mutex-stub modifications to reduce contention
  11. Freezer: Eliminating Mutex Stub Spinning

    • Freeze(): (1) save the new head request pointer (implicitly dequeuing the previous ones)
    • Unfreeze(): (2) extract the head pointer and (3) notify the head request as the owner (see the sketch below)
    • Lock holders never keep their request
      • Requests can always be allocated in stack memory
      • No one needs to spin-wait on the mutex stub
    • Local Freezing optimization: uses a per-lock object instead of the head field
    [Figure: mutex stub with tail and head fields over a queue of waiters; (1) freeze saves the head pointer past the lock holder, (2)(3) unfreeze extracts it and notifies the head request]
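
    A hedged sketch of the freeze/unfreeze steps as described above, not the paper's actual code; the type and field names are illustrative, and queue maintenance around the tail field is omitted:

        #include <atomic>

        // Request node; it lives on the locking thread's stack and is
        // abandoned as soon as the thread becomes the lock holder.
        struct FrzNode {
            std::atomic<FrzNode*> next{nullptr};
            std::atomic<bool>     owner{false};  // set when woken as the owner
        };

        struct FreezerStub {
            std::atomic<FrzNode*> tail{nullptr};
            std::atomic<FrzNode*> head{nullptr};  // frozen queue head, if any

            // (1) Called by the new lock holder: record its successor as the
            // frozen head, implicitly dequeuing everything before it. After
            // this, the holder's own stack request is no longer referenced.
            void freeze(FrzNode* successor) {
                head.store(successor, std::memory_order_release);
            }

            // (2)(3) Called at unlock: extract the frozen head and notify it
            // as the next owner. Nobody spin-waits on the stub itself;
            // waiters spin on their own request's owner flag.
            void unfreeze() {
                FrzNode* h = head.exchange(nullptr, std::memory_order_acq_rel);
                if (h != nullptr) h->owner.store(true, std::memory_order_release);
            }
        };
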
  12. Freezer Fast Path (FFP)

    • Locks using Freezer need to distinguish multiple states when the queue is empty
      • An rwlock uses 3 tags stored in the tail field: UNLOCKED, WRITE_LOCKED, and READ_LOCKED
      • FFP adds a FAST_READ tag for an efficient read fast path
    • The fast path of write/read lock/unlock requires a single CAS on the tail field (see the sketch below)
    • Limitation: FFP works only when the mutex is in the unlocked state
      • The SFP read lock, by contrast, also works when the mutex is in the read-locked state
    [Figure: state transitions among tail = UNLOCKED, WRITE_LOCKED, and FAST_READ (each with head: null, rcount: 0), driven by FFP write/read lock and the corresponding unlocks]
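
    A hedged sketch of the tag-based fast path; the tag values and helper names are illustrative, and a real implementation also packs the reader count and falls back to the queue-based normal path when a CAS fails:

        #include <atomic>
        #include <cstdint>

        // Tags kept in the tail word when the queue is empty; otherwise the
        // word holds a pointer to the tail request (pointer alignment keeps
        // the two cases distinguishable).
        enum : std::uintptr_t { UNLOCKED = 0, WRITE_LOCKED = 1,
                                READ_LOCKED = 2, FAST_READ = 3 };

        struct FfpStub {
            std::atomic<std::uintptr_t> tail{UNLOCKED};

            // FFP write lock: a single CAS, UNLOCKED -> WRITE_LOCKED.
            bool try_fast_write_lock() {
                std::uintptr_t e = UNLOCKED;
                return tail.compare_exchange_strong(e, WRITE_LOCKED,
                                                    std::memory_order_acquire);
            }

            // FFP read lock: works only from UNLOCKED (the SFP read lock can
            // also join a read-locked mutex; FFP cannot). This sketch admits
            // a single fast reader; the real design also tracks rcount.
            bool try_fast_read_lock() {
                std::uintptr_t e = UNLOCKED;
                return tail.compare_exchange_strong(e, FAST_READ,
                                                    std::memory_order_acquire);
            }

            bool try_fast_write_unlock() {
                std::uintptr_t e = WRITE_LOCKED;
                return tail.compare_exchange_strong(e, UNLOCKED,
                                                    std::memory_order_release);
            }

            bool try_fast_read_unlock() {
                std::uintptr_t e = FAST_READ;
                return tail.compare_exchange_strong(e, UNLOCKED,
                                                    std::memory_order_release);
            }
        };
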
  13. Request Batching

    • Request Fusion
      • The read owner counts up its contiguous read requests
    • Hierarchical Notification
      • The read owner notifies non-owners through a heap tree to shorten its critical path
      • Request i notifies requests 2i and 2i+1 (both ideas are sketched below)
    [Figure: left, the read owner adds 5 to rcount at once for 5 contiguous read requests (then 3 for the next batch); right, notification fans out over an implicit binary heap of requests 1..7, where request i wakes requests 2i and 2i+1]
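
    A hedged sketch of both ideas, using a shared array in place of the real request queue; the names and sizes are illustrative:

        #include <atomic>
        #include <cstddef>

        constexpr std::size_t kMaxBatch = 64;
        std::atomic<int>  rcount{0};               // reader count in the mutex stub
        std::atomic<bool> granted[kMaxBatch + 1];  // one flag per fused request, 1-based

        // Request Fusion: the read owner counts its k contiguous read
        // requests and updates the stub once instead of k times.
        void owner_grant_batch(std::size_t k) {
            rcount.fetch_add(static_cast<int>(k), std::memory_order_acq_rel);
            granted[1].store(true, std::memory_order_release);  // wake the tree root only
        }

        // Hierarchical Notification: reader i, once granted, wakes readers
        // 2i and 2i+1, so the wake-up chain is O(log k) deep, not O(k) long.
        void reader_wait(std::size_t i, std::size_t k) {
            while (!granted[i].load(std::memory_order_acquire)) { /* spin */ }
            if (2 * i     <= k) granted[2 * i].store(true, std::memory_order_release);
            if (2 * i + 1 <= k) granted[2 * i + 1].store(true, std::memory_order_release);
            // ... read critical section; readers decrement rcount at unlock ...
        }
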
  14. Parallel Queue Traversal

    • Reader-preferring Shuffling
      • A read shuffler and a write shuffler can work concurrently
    • Coin-flip Traversal
      • Each read request independently decides to become a shuffler with a certain probability (see the sketch below)
    [Figure: writer shuffling and reader shuffling proceed concurrently; multiple reader shufflers each handle a queue segment beyond what the previous reader shuffler processed]
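
    A minimal sketch of the coin flip itself; the probability value and helper name are illustrative, and the traversal a successful shuffler then performs is omitted:

        #include <random>

        // Each read request calls this independently; returning true means
        // "act as a reader shuffler for my local segment of the queue".
        bool become_read_shuffler(double p = 0.125) {
            thread_local std::mt19937 rng{std::random_device{}()};
            return std::bernoulli_distribution{p}(rng);
        }
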
  15. Outline

    • Background
    • Goals, contributions, and results
    • Conventional methods and their issues
    • Proposed methods
    • Evaluation
    • Conclusion
  16. Evaluation

    • Key aspects
      • Performance comparison between the proposed and conventional rwlocks
      • Impact of each optimization method and fast path
      • Impact of fairness policies on tail latency
    • Environment
      • Two CPUs (Intel Xeon Platinum 8280: 28 cores / 56 SMT threads each, 112 hardware threads in total)
      • One NUMA node per CPU (2 nodes in total)
      • 192 GB memory
    • Workloads (all run with 110 worker threads)
      • Hash map benchmark with a single (global) mutex
      • B-Tree benchmark with a mutex per tree node (by lock coupling)
      • YCSB variant using transactions with a mutex per record (by two-phase locking)
  17. Performance of Rwlock Methods

    • SFP (lsm-opt-sfp) behaves differently in the two workloads
      • With multiple mutexes, SFP is inefficient due to heavy cache-coherence traffic
    • Our Freezer design outperforms the LSM design in both workloads
      • The difference between frzr-opt and lsm-opt shows the benefit of Freezer
    [Figure: throughput on the hash map benchmark with a single mutex and the B-Tree benchmark with multiple mutexes; proposed methods highlighted]
  18. Effects of Optimization Methods

    • Local Freezing (lf) improves write lock performance
    • Request Fusion (rf) improves read lock performance
    • Reader-preferring Shuffling (sh) improves overall performance
    • Coin-flip Traversal (cf) and Hierarchical Notification (hn) improve read lock performance
    [Figure: B-Tree benchmark with a mutex per tree node; the optimizations combined yield up to 3.5x higher throughput]
  19. Fast Path Effects: YCSB Variant with 50% Reads

    • Under skew 0, both fast paths (FFP and SFP) improve performance
      • Access conflicts mostly do not occur, so the fast path almost always succeeds
    • Under skew 1, the performance gain of the fast paths becomes smaller
      • Access conflicts make the fast path fail and fall back to the normal path
    [Figure: throughput at skew 0 (uniform key distribution) and at Zipf skew 1]
  20. Tail Latency

    • Methods with higher throughput tend to show lower tail latency
    • SFP (lsm-opt-sfp) enlarges tail latency because it provides only starvation-free fairness

    Fairness         Methods
    Task-fair        mcs-rw
    Bounded-bypass   frzr-opt, frzr-opt-ffp, lsm-opt, cohort-np
    Starvation-free  lsm-opt-sfp

    [Figure: tail latency on the B-Tree benchmark with a mutex per tree node, at Read 50% and Read 95%; the proposed methods achieve up to 3.1x better tail latency]
  21. Outline

    • Background
    • Goals, contributions, and results
    • Conventional methods and their issues
    • Proposed methods
    • Evaluation
    • Conclusion
  22. Conclusion

    • Key contributions
      • Freezer mechanism: eliminates mutex stub spinning while preserving stack allocation of requests
      • Freezer fast path (FFP): does not change the fairness policy provided by the request queue
      • Request-queue optimizations: enable batching and parallel processing of read requests
    • Result highlights
      • Up to 3.5x higher throughput in the B-Tree workload
      • Up to 3.1x better tail latency compared to conventional methods with fast paths
    • Future work
      • Integration with reader-counter splitting techniques
      • Application to more complex locking methods and parallel algorithms
  23. References (1)

    • Mellor-Crummey+ 1991a: John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. Comput. Syst. 9, 1 (Feb. 1991), 21–65.
    • Mellor-Crummey+ 1991b: John M. Mellor-Crummey and Michael L. Scott. 1991. Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors. SIGPLAN Not. 26, 7 (April 1991), 106–113.
    • Brandenburg+ 2010: Björn B. Brandenburg and James H. Anderson. 2010. Spin-based reader-writer synchronization for multiprocessor real-time systems. Real-Time Systems 46 (2010), 25–87.
    • Calciu+ 2013: Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. 2013. NUMA-Aware Reader-Writer Locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’13). Association for Computing Machinery, New York, NY, USA, 157–166.
  24. References (2)

    • Kashyap+ 2017: Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. 2017. Scalable NUMA-aware Blocking Synchronization Primitives. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 603–615.
    • Kashyap+ 2019: Sanidhya Kashyap, Irina Calciu, Xiaohe Cheng, Changwoo Min, and Taesoo Kim. 2019. Scalable and Practical Locking with Shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP ’19). Association for Computing Machinery, New York, NY, USA, 586–599.