Slide 1

Fairer and More Scalable Reader-Writer Locks by Optimizing Queue Management
Takashi Hoshino (Cybozu Labs, Inc.) and Kenjiro Taura (The University of Tokyo)
PPoPP 2025
(Static version for PDF)

Slide 2

Background
• NUMA-awareness for shared-memory locking
  • Reducing cross-NUMA communication is critical for performance
  • Fairness is also necessary for low tail latency
• Many mutexes for fine-grained locking
  • Heavy mutex usage (e.g., B-Tree nodes) requires more efficiency
  • Reducing contention more carefully is necessary
• Reader-writer locking
  • Rwlocks are widely used in many systems, particularly in database systems
  • The existence of multiple readers introduces engineering challenges
[Figure: two NUMA nodes, each with cores (L1/L2 caches), a shared L3 cache, and main memory; a B-tree node with an embedded mutex stub]

Slide 3

Goal, Contributions, and Results
• Goal
  • Fairer and more scalable reader-writer locking methods
  • Balancing fairness (low tail latency) and scalability (throughput) at a high level
• Key contributions
  • The Freezer mechanism eliminates mutex stub spinning while preserving stack allocation of requests
  • The Freezer fast path (FFP) does not compromise the fairness policy
  • Request-queue optimizations enable batching and parallel processing of read requests
• Key results
  • Up to 3.5x higher throughput in B-Tree workloads
  • Up to 3.1x better tail latency compared to conventional methods with fast paths

Slide 4

Outline
• Background
• Goals, contributions, and results
• Conventional methods and their issues
• Proposed methods
• Evaluation
• Conclusion

Slide 5

MCS Lock (Mellor-Crummey+ 1991a)
• Lock: (1)(2) enqueue and (3) wait if the queue is not empty
• Unlock: (4) dequeue (by notifying the next request, if it exists)
• The MCS rwlock (Mellor-Crummey+ 1991b) inherited this structure
• All threads spinwait locally, reducing mutex stub contention
• Issues:
  • Request allocation requires heap memory (for POSIX or C++ STL lock compatibility)
  • Task-fair fairness policy, which is FIFO and NUMA-oblivious
[Figure: mutex stub with a tail pointer into a queue of requests (owner followed by waiters); each request holds next/recv fields, and waiters spinwait locally until notified]
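The enqueue/wait/dequeue steps above can be sketched as a minimal MCS lock in C++ atomics (a sketch, not the paper's code). Note that the caller passes the request node explicitly, which is exactly the calling convention that is incompatible with POSIX or C++ STL `lock()`/`unlock()` interfaces and forces heap allocation in practice:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct McsRequest {
    std::atomic<McsRequest*> next{nullptr};
    std::atomic<bool> ready{false};
};

class McsLock {
    std::atomic<McsRequest*> tail{nullptr};
public:
    void lock(McsRequest& req) {
        req.next.store(nullptr, std::memory_order_relaxed);
        req.ready.store(false, std::memory_order_relaxed);
        // (1)(2) enqueue: atomically become the new tail
        McsRequest* prev = tail.exchange(&req, std::memory_order_acq_rel);
        if (prev != nullptr) {
            // (3) the queue was not empty: link in and spinwait locally
            prev->next.store(&req, std::memory_order_release);
            while (!req.ready.load(std::memory_order_acquire)) { /* spin */ }
        }
    }
    void unlock(McsRequest& req) {
        // (4) dequeue: notify the next request if it exists
        McsRequest* nxt = req.next.load(std::memory_order_acquire);
        if (nxt == nullptr) {
            McsRequest* expected = &req;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;  // no successor: the queue is empty again
            // a successor is mid-enqueue; wait for it to link in
            while ((nxt = req.next.load(std::memory_order_acquire)) == nullptr) { }
        }
        nxt->ready.store(true, std::memory_order_release);
    }
};

// Usage: 4 threads each increment a shared counter 1000 times.
int counter_demo() {
    McsLock lk;
    int counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) {
                McsRequest req;  // stack-allocated request node
                lk.lock(req);
                ++counter;       // critical section
                lk.unlock(req);
            }
        });
    for (auto& t : ts) t.join();
    return counter;
}
```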

Slide 6

LSM (Leader-Spinning-on-Mutex) Lock
• Two-stage approach to always allocate requests in stack memory
  • 1st stage: an ordinary lock (MCS lock or any other)
  • 2nd stage: a simple lock with lock_word
  • The owner (lock holder in the 1st stage) spinwaits on lock_word
• A fast path (SFP) can skip the 1st stage, but it limits fairness to starvation-free
• The Shfl rwlock (Kashyap+ 2019) is a typical LSM lock
• Issues
  • Starvation-free fairness is not enough for low tail latency; bounded-bypass is better
  • Spinning on the mutex stub is necessary again
[Figure: request queue (tail → waiters → owner) feeding a lock_word held by the lock holder(s); the fast path skips the queue while the normal path goes through it]
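A minimal sketch of the two-stage LSM structure, under the assumption that `std::mutex` stands in for the 1st-stage queue lock (the real Shfl lock uses an MCS-style queue with shuffling):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

class LsmLock {
    std::mutex stage1;                  // 1st stage: orders the requests
    std::atomic<bool> lock_word{false}; // 2nd stage: simple lock on the mutex stub
public:
    void lock() {
        stage1.lock();  // become the 1st-stage owner
        // The owner spinwaits on lock_word in the mutex stub -- the very
        // stub spinning that the Freezer mechanism later eliminates.
        while (lock_word.exchange(true, std::memory_order_acquire)) { /* spin */ }
        stage1.unlock();  // let the next queued request become the owner
    }
    void unlock() {
        lock_word.store(false, std::memory_order_release);
    }
};

// Usage: 4 threads each increment a shared counter 1000 times.
int lsm_demo() {
    LsmLock lk;
    int counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) {
                lk.lock();
                ++counter;  // critical section
                lk.unlock();
            }
        });
    for (auto& t : ts) t.join();
    return counter;
}
```

Because the 1st stage admits only one owner at a time, at most one thread ever spins on `lock_word`, yet that spinning still touches the shared mutex stub.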

Slide 7

Request-queue Shuffling (Kashyap+ 2019)
• An approach for NUMA-awareness
• Write shufflers group writers of the same NUMA node
• Read shufflers group readers
• Issues
  • Every request becomes the owner and modifies the mutex stub individually
  • Only a single thread can act as a shuffler at any given time
[Figure: a shuffler (S) traverses and reorders the queue behind the owner (O), grouping read requests and write requests from different NUMA nodes]

Slide 8

Evolution of Rwlocks

| Name | Year | Request queue | Request allocation | NUMA-awareness | Fairness |
| --- | --- | --- | --- | --- | --- |
| Spin rwlock | 1971 | --- | --- | --- | --- |
| Ticket rwlock | 1991 | --- | --- | --- | Task-fair |
| MCS rwlock | 1991 | Yes | Heap | --- | Task-fair |
| Phase-fair queue lock | 2010 (*1) | Yes | Heap | --- | Starvation-free |
| Cohort rwlock NP/RP/WP | 2013 (*2) | --- | --- | Yes | Bounded-bypass / --- / --- |
| CST rwlock | 2017 (*3) | Yes | Stack | Yes | --- |
| Shfl rwlock | 2019 | Yes | Stack | Yes | Starvation-free |
| Frzr rwlock | Proposed | Yes | Stack | Yes | Bounded-bypass |

Fairness levels: Task-fair ⊂ Bounded-bypass ⊂ Starvation-free ⊂ No fairness
• Starvation-free: overtaken by a finite number of requests (theoretically)
• Bounded-bypass: overtaken by at most a constant number of requests
*1: Brandenburg+ 2010  *2: Calciu+ 2013  *3: Kashyap+ 2017

Slide 9

Outline
• Background
• Goals, contributions, and results
• Conventional methods and their issues
• Proposed methods
• Evaluation
• Conclusion

Slide 10

Proposed Methods
• Freezer mechanism
  • Eliminates mutex stub spinning by temporarily freezing the request queue
• Freezer Fast Path (FFP)
  • Does not compromise the fairness property provided by the request queue
• Four request-queue optimizations for readers
  • E.g., batching mutex-stub modifications to reduce contention

Slide 11

Freezer: Eliminating Mutex Stub Spinning
• Freeze()
  • (1) Save the new head request pointer (and dequeue the previous ones implicitly)
• Unfreeze()
  • (2) Extract the head pointer and (3) notify the head request as the owner
• Lock holders never keep their request
  • Requests can always be allocated in stack memory
  • No thread needs to spinwait on the mutex stub
• Local Freezing optimization: uses a per-lock object instead of the head field
[Figure: mutex stub with tail and head fields; (1) freezing saves the new head pointer past the lock holder, (2)(3) unfreezing extracts the head and notifies it as the owner]
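The two operations above can be sketched single-threaded on the stub's head field (an illustrative sketch only: the queue links, the tail handling, and the concurrent lock-acquisition protocol around Freeze/Unfreeze are omitted, and the names are assumptions, not the paper's code):

```cpp
#include <atomic>
#include <cassert>

struct Request {
    Request* next = nullptr;
    bool notified_as_owner = false;
};

struct FreezerStub {
    std::atomic<Request*> head{nullptr};

    // Freeze(): (1) save the new head request pointer. Everything before it
    // is dequeued implicitly, so a lock holder never keeps its request and
    // can safely let the stack-allocated node go out of scope.
    void freeze(Request* new_head) {
        head.store(new_head, std::memory_order_release);
    }

    // Unfreeze(): (2) extract the saved head pointer and (3) notify that
    // request as the next owner. No thread ever spinwaits on the stub.
    Request* unfreeze() {
        Request* h = head.exchange(nullptr, std::memory_order_acq_rel);
        if (h != nullptr) h->notified_as_owner = true;
        return h;
    }
};

// Usage: a holder freezes the queue at its successor b, then unfreezing
// hands ownership to b.
bool freezer_demo() {
    Request a, b, c;
    a.next = &b;
    b.next = &c;
    FreezerStub stub;
    stub.freeze(&b);                // a is dequeued implicitly
    Request* owner = stub.unfreeze();
    return owner == &b && b.notified_as_owner && stub.unfreeze() == nullptr;
}
```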

Slide 12

Freezer Fast Path (FFP)
• Locks using Freezer need to distinguish multiple states when the queue is empty
  • A rwlock uses 3 tags stored in the tail field: UNLOCKED, WRITE_LOCKED, READ_LOCKED
  • FFP adds a FAST_READ tag for an efficient read fast path
• The fast path of write/read lock/unlock requires a single CAS on the tail field
• Limitation: FFP works only when the mutex is in the unlocked state
  • In contrast, an SFP read lock also works when the mutex is in the read-locked state
[Figure: state diagram over (tail, head, rcount) with head null and rcount 0 throughout: UNLOCKED ↔ WRITE_LOCKED via FFP write lock/unlock, and UNLOCKED ↔ FAST_READ via FFP read lock/unlock]
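A deliberately simplified sketch of the FFP state transitions (assumptions: tags are plain integers here, whereas the real lock packs them into the tail pointer word, and a failed CAS would fall back to the normal queue path, which is omitted):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Tag values stored in the tail field while the request queue is empty.
enum TailTag : uint64_t { UNLOCKED = 0, WRITE_LOCKED = 1,
                          READ_LOCKED = 2, FAST_READ = 3 };

struct FfpStub {
    std::atomic<uint64_t> tail{UNLOCKED};

    // FFP write lock: a single CAS, possible only from the unlocked state.
    bool try_write_lock() {
        uint64_t exp = UNLOCKED;
        return tail.compare_exchange_strong(exp, WRITE_LOCKED);
    }
    bool write_unlock() {
        uint64_t exp = WRITE_LOCKED;
        return tail.compare_exchange_strong(exp, UNLOCKED);
    }
    // FFP read lock: also only from UNLOCKED (unlike SFP, which can also
    // join an already read-locked mutex).
    bool try_read_lock() {
        uint64_t exp = UNLOCKED;
        return tail.compare_exchange_strong(exp, FAST_READ);
    }
    bool read_unlock() {
        uint64_t exp = FAST_READ;
        return tail.compare_exchange_strong(exp, UNLOCKED);
    }
};
```

A failed CAS (e.g., a read lock attempted while the tag is WRITE_LOCKED) is the signal to take the normal queue path, which is where the request-queue fairness policy is preserved.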

Slide 13

Request Batching
• Request Fusion
  • The read owner counts up its contiguous read requests
• Hierarchical Notification
  • The read owner notifies non-owners through a heap tree to shorten the critical path
[Figure: request fusion adds 5 contiguous read requests to rcount at once instead of one by one; hierarchical notification arranges requests 1..7 as a heap tree in which request i notifies requests 2i and 2i+1]
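The heap-tree index scheme from the slide (request i notifies 2i and 2i+1) can be sketched as follows. In the real lock each notified request performs its own two notifications concurrently, shrinking the owner's critical path from O(n) sequential notifications to O(log n); this sequential sketch only demonstrates the indexing, and the names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Request i is granted the read lock and then notifies its two "child"
// requests 2i and 2i+1 in the implicit heap tree.
void notify_subtree(std::vector<bool>& granted, std::size_t i) {
    if (i >= granted.size()) return;
    granted[i] = true;                   // request i is granted
    notify_subtree(granted, 2 * i);      // notify left child request
    notify_subtree(granted, 2 * i + 1);  // notify right child request
}

// Returns true iff all requests 1..n end up notified when the read owner
// starts the cascade at request 1.
bool all_notified(std::size_t n) {
    std::vector<bool> granted(n + 1, false);
    notify_subtree(granted, 1);
    for (std::size_t i = 1; i <= n; ++i)
        if (!granted[i]) return false;
    return true;
}
```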

Slide 14

Parallel Queue Traversal
• Reader-preferring Shuffling
  • A read shuffler and a write shuffler can work concurrently
• Coin-flip Traversal
  • Each read request independently decides to become a shuffler with a certain probability
[Figure: writer shuffling and reader shuffling proceed concurrently over the queue; under coin-flip traversal several read requests act as shufflers, each skipping the part already processed by a previous reader shuffler]
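The coin-flip decision can be sketched as an independent Bernoulli trial per read request (the probability value, seed, and helper name are illustrative assumptions, not the paper's constants):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Each read request independently flips a biased coin with probability p;
// heads means it also acts as a shuffler, so with n waiting readers roughly
// n*p shufflers traverse different parts of the queue in parallel.
std::vector<bool> pick_shufflers(std::size_t n_readers, double p, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(p);
    std::vector<bool> is_shuffler(n_readers);
    for (std::size_t i = 0; i < n_readers; ++i)
        is_shuffler[i] = coin(rng);  // independent decision per request
    return is_shuffler;
}
```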

Slide 15

Outline
• Background
• Goals, contributions, and results
• Conventional methods and their issues
• Proposed methods
• Evaluation
• Conclusion

Slide 16

Evaluation
• Key aspects
  • Performance comparison between the proposed and conventional rwlocks
  • Impact of each optimization method and fast path
  • Impact of fairness policies on tail latency
• Environment
  • Two CPUs (Intel Xeon Platinum 8280: 28 cores / 56 SMT threads each, 112 hardware threads in total)
  • One NUMA node per CPU (2 nodes in total)
  • 192 GB memory
• Workloads (110 worker threads run for all workloads)
  • Hash map benchmark with a single (global) mutex
  • B-Tree benchmark with a mutex per tree node (using lock coupling)
  • YCSB variant using transactions with a mutex per record (using two-phase locking)

Slide 17

Performance of Rwlock Methods
• SFP (lsm-opt-sfp) affects the two workloads differently
  • With multiple mutexes, SFP is inefficient due to large cache-coherence traffic
• Our Freezer design outperforms the LSM design in both workloads
  • The difference between frzr-opt and lsm-opt shows the benefit of Freezer
[Charts: hash map benchmark with a single mutex; B-Tree benchmark with multiple mutexes; proposed methods highlighted]

Slide 18

Effects of Optimization Methods
• Local Freezing (lf) improves write lock performance
• Request Fusion (rf) improves read lock performance
• Reader-preferring Shuffling (sh) improves overall performance
• Coin-flip Traversal (cf) and Hierarchical Notification (hn) improve read lock performance
[Chart: B-Tree benchmark with a mutex per tree node; up to 3.5x improvement]

Slide 19

Fast Path Effects: YCSB Variant with 50% Reads
• Under skew 0, both fast paths (FFP and SFP) improve performance
  • Access conflicts mostly do not occur, so the fast path almost always succeeds
• Under skew 1, the performance gain of the fast path becomes smaller
  • Access conflicts cause the fast path to fail and fall back to the normal path
[Charts: skew 0 (uniform key distribution) and Zipf skew 1]

Slide 20

Tail Latency
• Methods with higher throughput tend to show lower tail latency
• SFP (lsm-opt-sfp) increases tail latency due to its starvation-free behavior

| Fairness | Methods |
| --- | --- |
| Task-fair | mcs-rw |
| Bounded-bypass | frzr-opt, frzr-opt-ffp, lsm-opt, cohort-np |
| Starvation-free | lsm-opt-sfp |

[Charts: B-Tree benchmark with a mutex per tree node, read 50% and read 95%; proposed methods achieve up to 3.1x better tail latency (lower is better)]

Slide 21

Outline
• Background
• Goals, contributions, and results
• Conventional methods and their issues
• Proposed methods
• Evaluation
• Conclusion

Slide 22

Conclusion
• Key contributions
  • Freezer mechanism: eliminates mutex stub spinning while preserving stack allocation of requests
  • Freezer fast path (FFP): does not change the fairness policy provided by the request queue
  • Request-queue optimizations: enable batching and parallel processing of read requests
• Result highlights
  • Up to 3.5x higher throughput in B-Tree workloads
  • Up to 3.1x better tail latency compared to conventional methods with fast paths
• Future work
  • Integration with reader-counter splitting techniques
  • Application to more complex locking methods and parallel algorithms

Slide 23

References (1)
• Mellor-Crummey+ 1991a: John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. Comput. Syst. 9, 1 (Feb. 1991), 21–65.
• Mellor-Crummey+ 1991b: John M. Mellor-Crummey and Michael L. Scott. 1991. Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors. SIGPLAN Not. 26, 7 (April 1991), 106–113.
• Brandenburg+ 2010: Björn B. Brandenburg and James H. Anderson. 2010. Spin-based reader-writer synchronization for multiprocessor real-time systems. Real-Time Systems 46 (2010), 25–87.
• Calciu+ 2013: Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. 2013. NUMA-Aware Reader-Writer Locks. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13). Association for Computing Machinery, New York, NY, USA, 157–166.

Slide 24

References (2)
• Kashyap+ 2017: Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. 2017. Scalable NUMA-aware Blocking Synchronization Primitives. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 603–615.
• Kashyap+ 2019: Sanidhya Kashyap, Irina Calciu, Xiaohe Cheng, Changwoo Min, and Taesoo Kim. 2019. Scalable and Practical Locking with Shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). Association for Computing Machinery, New York, NY, USA, 586–599.

Slide 25

Thank you