Constant Integer Division and Modulo Optimization Revisited (English)

herumi

February 08, 2026

Transcript

  1. Overview
     Faster 32-bit Constant Division for 64-bit CPUs
     For C code such as uint32_t divd(uint32_t x) { return x / 7; }, we propose a new method
     that is 36-39% faster on x86-64 and Apple M4 than the code generated by GCC, Clang, and
     MSVC, which all use the GM (Granlund-Montgomery) method.
     - Uses only one 64-bit integer multiplication instruction
     - Avoids 128-bit integer shifts
     Smaller Multipliers for Constant Modulus
     For moduli such as the prime 3329 used in MLKEM, etc., we propose a method using smaller
     multipliers than the existing Barrett reduction for modulo operations.
     - Requires one additional select instruction
  2. Preliminaries
     Notation
     - M: maximum value of the dividend, d: constant divisor, x: dividend
     - Quotient: q = floor(x/d), remainder: r = x - q*d
     - M_d: largest integer not exceeding M whose remainder when divided by d is d-1
     - select(cond, x, y): function that returns x if condition cond is true, otherwise y
     - u32, u64, etc.: unsigned 32/64-bit integers
     Case Classification for d
     - When d is a power of 2, d = 2^s, then q = x >> s and r = x & (d-1)
     - When d > M/2, then q = select(x >= d, 1, 0)
     - Otherwise → the main topic of this presentation
  3. Formulation of GM Method
     Theorem 1
     For d, let c = ceil(2^a / d) and e = c*d - 2^a; if e * M_d < 2^a, then for any x <= M,
     floor(x*c / 2^a) = floor(x/d) holds.
     - A different but equivalent formulation to the original GM; cf. Theorem 2 below
     Case of M = 2^32 - 1
     - There exists a such that the condition holds
     - When c < 2^32: both x and c are u32 variables and the multiplication result fits in
       u64, so one mul is sufficient
     - When c >= 2^32 (e.g. d = 7): the GM method proposes a way to keep intermediate
       variable values within 32 bits (next slide)
  4. Status of x/7 Code Generated by Apple/x64 Compilers
     Code generated by Apple Clang:
       div7:
         mov   w8, #18725          ; =0x4925
         movk  w8, #9362, lsl #16  ; w8 = cL
         umull x8, w0, w8          ; x8 = x * cL
         lsr   x8, x8, #32         ; y = x8 = (x * cL) >> 32
         sub   w9, w0, w8          ; w9 = x - y
         add   w8, w8, w9, lsr #1  ; t = ((x - y) >> 1) + y
         lsr   w0, w8, #2          ; t >> (a - 33)
         ret
     Code generated by gcc-14 for x64:
       div7:
         mov  eax, edi
         imul rax, rax, 613566757
         shr  rax, 32
         sub  edi, eax
         shr  edi
         add  eax, edi
         shr  eax, 2
         ret
     Rewritten in C code (c = 0x124924925, a = 35):
       u32 udivd(u32 x) {
         u32 cL = c & 0xffffffff;
         u32 y = (u64(x) * cL) >> 32;
         u32 t = ((x - y) >> 1) + y;
         return t >> (a - 33);
       }
  5. Fitting Intermediate Variables After MUL into 32 Bits
     x * c does not fit in u64 when c is a 33-bit constant
     Decompose into 3 stages of shifts: let c = 2^32 + cL (where cL = c mod 2^32), then
       (x * c) >> a = (x + ((x * cL) >> 32)) >> (a - 32)
     - Calculate y = (x * cL) >> 32 using 32-bit x 32-bit -> 64-bit (mul32)
     - Computing x + y does not fit in u32
     - Using t = ((x - y) >> 1) + y = (x + y) >> 1, the intermediate stays within 32 bits
     [Figure: the 33-bit multiplier c split into its top 1 bit and the 32-bit cL]
  6. Preparation for 64-bit CPU Improvements
     No Need to Fit Intermediate Variables into 32 Bits
     1. Use 64-bit x 64-bit MUL (mul64) and u128 shift (mul64+sft)
        Directly compute (x * c) >> a:
          u32 udivd1(u32 x) { return (x * u128(c)) >> a; }
        - x64 can obtain 128 bits with a single imul instruction
        - M4 (AArch64) obtains the upper/lower 64 bits with two instructions: umulh / mul
     2. Calculate y = (x * cL) >> 32 using the mul32, then compute the rest with u64
          u32 udivd2(u32 x) {
            u64 cL = c & 0xffffffff;
            u64 y = (x * cL) >> 32;
            return (x + y) >> (a - 32); // Simpler since there's no 32-bit overflow handling
          }
  7. Benchmark Method
     Evaluation Method
     - Inline expansion instead of function calls due to the small target size
     - Process multiple times and evaluate differences
     - lp = 1, 2, 3 represent the number of times processed within a processing unit
       (see proceedings for details)
     - The difference in proc. time as lp increases roughly corresponds to the latency
       u32 loop(FuncType f) {
         u32 sum = 0;
         // This loop is not unrolled
         for (u32 x = 0; x < N; x++) {
           u32 t = x;
           // Embed code for f(t) = t/d lp times
           for (int i = 0; i < lp; i++) {
             sum += f(t);
             t += sum;
           }
           ...
         }
         return sum;
       }
  8. Benchmark Results
     Comparison with Clang, udivd1, and udivd2
     Xeon w9-3495X - processing time (unit: msec)
       Method     lp=1   lp=2    lp=3    lat.
       clang      60.62  239.26  452.89  196.14
       mul64+sft  42.61  219.96  408.95  183.17
       mul32      52.03  195.53  363.37  155.67
     Apple M4 Pro
       Method     lp=1   lp=2    lp=3    lat.
       clang      40.94  267.47  511.61  235.34
       mul64+sft  33.29  201.42  382.63  174.67
       mul32      33.33  223.04  423.37  195.02
     lat. = ((lp=3) - (lp=2)) / 2
     - On x86-64, the 128-bit integer shift (shrd) is slow, so mul32 has smaller latency
     - On M4, the 128-bit integer shift (extr) is fast, so mul64+sft has smaller latency
     x64 instruction costs:
       Instruction  Latency  Throughput
       add/sub/or   1        0.25
       shr          1        0.5
       imul         3        1
       shrd         3        1
  9. Proposed Method
     Considerations
     - Want to use 128-bit multiplication (64-bit x 64-bit -> 128-bit)
     - Want to avoid 128-bit shifts
     Proposed Method
     - Remove the shift by increasing the multiplier: let c' = c << (64 - a), then
       q = (x * c') >> 64, i.e. the upper 64 bits of the 128-bit product
     - Since a 128-bit integer is represented by two 64-bit integers, the cost of
       extracting the upper 64-bit integer is 0
     - Becomes one multiplication instruction in assembly on x64/M4, excluding value setting
       div7_optimized:
         mov  edx, edi
         mov  rax, 0x24924924a0000000
         mulx rax, rax, rax
         ret
  10. Proposed Method and Benchmark Results
      Added to the Previous Comparison
      Xeon w9-3495X - processing time (unit: msec)
        Method     lp=1   lp=2    lp=3    lat.
        clang      60.62  239.26  452.89  196.14
        mul64+sft  42.61  219.96  408.95  183.17
        mul32      52.03  195.53  363.37  155.67
        mul64      43.61  158.61  283.94  120.16
      Apple M4 Pro
        Method     lp=1   lp=2    lp=3    lat.
        clang      40.94  267.47  511.61  235.34
        mul64+sft  33.29  201.42  382.63  174.67
        mul32      33.33  223.04  423.37  195.02
        mul64      30.21  156.40  289.56  129.68
      - Achieved a 39% speedup on x86-64 and 36% on Apple M4
      - Latency speedup of 60-80%
      An implementation of this method is at github.com/herumi/constdiv
  11. Constant Integer Modulus
      Barrett Reduction
      Let c = floor(2^a / d) and q = (x * c) >> a; this q is slightly smaller than
      floor(x/d), so if r = x - q*d is greater than or equal to d, subtract d:
        u32 barrett(u32 x) {
          u32 q = (u64(x) * c) >> a;
          u32 r = x - q * d;
          while (r >= d) r -= d; // Loop may execute at most 2 times depending on constant selection
          return r;
        }
      In the MLKEM768 implementation mlkem-native, for d = 3329 and c = 20159, a = 26:
        s16 ref3329(s16 x) {
          const int d = 3329;
          int t = (x * 20159 + (1 << 25)) >> 26;
          return x - t * d;
        }
  12. Proposed Method
      Theorem 2
      For d, let c = ceil(2^a / d), e = c*d - 2^a, r_0 = M mod d; if
      "e*M + r_0*2^a < 2*d*2^a" and "e*M_d + (d-1)*2^a < 2*d*2^a", then for all x <= M,
      floor(x*c / 2^a) = floor(x/d) or floor(x/d) + 1 holds.
      - The condition is relaxed from that of Theorem 1, making c smaller; the division
        error is at most 1
      - q = (x*c) >> a is at most 1 larger than floor(x/d), so if r = x - q*d is
        negative, add d
      Implementation
      For u32 x with a = 32, c = ceil(2^32 / d) always fits in 32 bits (see below):
        u32 uremd_mywork(u32 x) {
          u32 q = (u64(x) * c) >> a;
          s32 r = x - q * d; // signed: may be negative when q overshoots by 1
          if (r < 0) r += d;
          return r;
        }
  13. c of Th. 2 is Smaller than in the GM Method or Barrett Reduction
      Some examples of c and a:
        d           c (Th. 1)    a (Th. 1)  c (Th. 2)   a (Th. 2)
        7           0x124924925  35         0x24924925  32
        824480341   0x14d653545  62         3           31
        1239864366  0x1bb6663f9  63         4           32
      Distribution of c (Th. 2):
        c         0    1    2    3     4    5     6    7    8    9    10   11   12   13   14   15   16+   Total
        Ratio(%)  0.0  0.0  0.0  48.7  5.8  19.8  0.0  3.5  0.0  6.6  0.0  3.1  0.0  1.3  0.0  0.4  11.0  100.0
      - 89% of divisors d have c < 16; c <= 9 alone accounts for 84%
      - When c is small, x*c can be optimized: e.g. for c = 3, (x*3) >> a becomes
        (x + (x << 1)) >> a; the multiplication is replaced with a bit shift + one
        addition/subtraction
  14. Case of MLKEM Prime (d = 3329)
      Barrett Reduction Method (Revisited) and Montgomery Reduction
        const int q = 3329;
        s16 ref3329(s16 x) { // ((q-1)/2)^2 * c > 2^32
          int t = (x * 20159 + (1 << 25)) >> 26;
          return x - t * q;
        }
        int mont(int x, int y) {
          int t = x * y;
          int m = ((t & 0xffff) * 3327) & 0xffff;
          t = (t + m * q) >> 16;
          if (t >= q) t -= q;
          return t;
        }
      Proposed Method (Th. 2)
      - Applying Th. 2 to d = 3329 gives c = 5, a = 14
      - Using (x * (5<<2)) >> 16 enables SIMD (PMULHW), which returns the upper u16 of a
        u16 x u16 MUL
      - Since c = 5 is small, the multiplication cost may be reduced in embedded systems
        s16 my3329(s16 x) { // Perfectly matches output of ref3329()
          int t = (x * 5) >> 14; // (x*20)>>16 also works
          t = x - t * q;
          if (t > q/2) t -= q; // To fit within [-(q/2), q/2]
          return t;
        }
  15. Proof of Th. 2
      Proof
      For d, let c = ceil(2^a / d) and e = c*d - 2^a.
      For x <= M, let q = floor(x/d) and r = x - q*d; then
        x*c = (q*d + r)*c = q*(2^a + e) + r*c = q*2^a + (q*e + r*c)
      Therefore, if q*e + r*c < 2*2^a, then floor(x*c / 2^a) = q or q + 1 holds.
      e and c are non-negative constants; consider q*e + r*c when x varies (and q, r vary
      accordingly). Using d*(q*e + r*c) = e*x + r*2^a, the maximum is given by x = M or
      x = M_d.
      - From x = M (r = r_0), we get e*M + r_0*2^a < 2*d*2^a
      - From x = M_d (r = d-1), we get e*M_d + (d-1)*2^a < 2*d*2^a
      Therefore, if both conditions hold, then q*e + r*c < 2*2^a, and
      floor(x*c / 2^a) = q or q + 1 holds.
      [Figure: e*x + r*2^a over 0 <= x <= M, with candidate maxima at (M, r_0) and (M_d, d-1)]
  16. Summary
      Constant Integer Division Optimization
      - Proposed a method for u32 / u32 division using only one u64 MUL, improving on the
        traditional GM method
      - Achieved a 39% speedup on x86-64 and 36% on Apple M4 compared to the existing
        compilers GCC, Clang, and MSVC
      Improvement of Constant Integer Modulo Operations
      - Proposed a variant of Barrett reduction with smaller multipliers than existing
        methods
      - Easier to apply in SIMD environments since 64-bit multiplication can be avoided
      - Possibility of cheaper modulo operations for primes used in MLKEM