Fast random number generation in Python and NumPy by Bernardt Duvenhage

Fast random number generation Bernardt Duvenhage Feersum Engine, Praekelt Consulting

Outline

How to efficiently generate a random number.
How to do it in Python and NumPy.

Generating a random number

A fair (unbiased) die can generate a uniform random number.
Dice come with the numbers 1-3, 1-4, 1-5, 1-6, 1-10, 1-12, 1-20, …
Given these dice, how does one generate a uniform random number in [1, 15]?
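One possible answer can be sketched in Python. The slides leave the question open, so this is an illustration rather than the speaker's solution: combine a 1-3 die with a 1-5 die. The helper name `roll_1_to_15` is made up for this example.

```python
import random
from collections import Counter


def roll_1_to_15(d3, d5):
    """Map a 1-3 roll and a 1-5 roll to one number in [1, 15]."""
    return 5 * (d3 - 1) + d5


# Every (d3, d5) pair maps to a distinct value, so two fair dice
# give a uniform result over 1..15.
counts = Counter(roll_1_to_15(a, b) for a in range(1, 4) for b in range(1, 6))
print(sorted(counts) == list(range(1, 16)))  # True

# A simulated roll, with Python's random module standing in for real dice.
print(roll_1_to_15(random.randint(1, 3), random.randint(1, 5)))
```

Each of the 15 outcomes occurs for exactly one pair of rolls, which is what makes the result uniform.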

Generating a random number

One can use a computer program to generate pseudo-random numbers.
Generators: LCG, Mersenne Twister, XorShift, PCG.
Outputs are numbers in [0, 2^16), [0, 2^32), [0, 2^64), …
Given a random number in [0, 2^32), how does one now generate a uniform random number in [1, 15]?

Linear Congruential Generator

X_{n+1} = (a·X_n + c) mod m.

If m is 2^32:
X_{n+1} = 1664525 * X_n + 1013904223   # Numerical Recipes
X_{n+1} = 1103515245 * X_n + 12345     # glibc
X_{n+1} = 214013 * X_n + 2531011       # MSVS

LCG128 >> 64 (a 128-bit LCG that outputs its top 64 bits) passes TestU01's BigCrush! *

* [http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html]
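A minimal Python sketch of the recurrence above, using the Numerical Recipes constants from the slide as defaults. The generator name `lcg32` and its parameters are illustrative, not taken from any library.

```python
def lcg32(seed, a=1664525, c=1013904223):
    """Yield successive states X_{n+1} = (a*X_n + c) mod 2^32."""
    mask = (1 << 32) - 1  # mod 2^32 is just a bitwise AND
    x = seed & mask
    while True:
        x = (a * x + c) & mask
        yield x


gen = lcg32(seed=0)
first_two = [next(gen), next(gen)]
print(first_two)  # [1013904223, 1196435762]
```

Because m is a power of two, the modulo reduces to a mask, which is part of why an LCG step is so cheap.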

Multiplicative Congruential Generator

X_{n+1} = (a·X_n + 0) mod m

MCG128 >> 64 passes TestU01's BigCrush! *
m = 2^128
a = 92563704562804186071655587898373606109 **

* [http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html]
** [Pierre L'Ecuyer. 1999. Tables of Linear Congruential Generators of Different …]
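The MCG128 >> 64 idea can be sketched with Python's arbitrary-precision integers. This is only an illustration of the recurrence with the multiplier from the slide; the name `mcg128` and the way the seed is forced odd are assumptions of this sketch, and a real implementation would use native 128-bit arithmetic in C.

```python
MULTIPLIER = 92563704562804186071655587898373606109  # a, from the slide
MASK128 = (1 << 128) - 1                              # mod 2^128 as a mask


def mcg128(seed):
    """Yield 64-bit outputs: the top half of a 128-bit multiplicative state."""
    state = (seed | 1) & MASK128  # with m = 2^128 the state should be odd
    while True:
        state = (state * MULTIPLIER) & MASK128
        yield state >> 64         # keep only the high 64 bits


gen = mcg128(seed=42)
sample = [next(gen) for _ in range(5)]
print(all(0 <= v < 2**64 for v in sample))  # True
```

Only the high bits are returned because the low bits of a power-of-two-modulus congruential generator have short periods.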

Mersenne Twister, XorShift & PCG

Mersenne Twister - 1997. Very popular, but relatively large memory footprint.
XorShift - 2003. Much smaller memory footprint.
Permuted Congruential Generator (PCG) - 2014. Small memory footprint & faster than MT and XorShift. Melissa O'Neill @ Stanford - https://www.youtube.com/watch?v=45Oet5qjlms

Where are PRNGs used?

When you don't actually have a die!
Any computational simulation of some physical system.
Searching of high-dimensional spaces.

Given a random number in [0, 2^32), how does one generate a uniform random number in any interval?

Random number r_n in [0, s)

Given X_n in [0, 2^L), generate a uniform random number r_n in [0, s).
We know that generating X_n in [0, 2^L) takes 3 - 10 clock cycles.
Generating r_n in [0, s) should not take much longer.

r_n = X_n mod s

X_n mod 3, for example, gives 1,0,0,2,0,1,2,1,0,1,2,…
If X_n in [0, 2^3) and s = 3:
X_n : 0,1,2,3,4,5,6,7 => r_n : 0,1,2,0,1,2,0,1
0 and 1 appear more often than 2, so r_n is not uniform!
How efficient is this approach? The mod/div can take 20 - 50+ cycles.
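The bias in the slide's example can be checked by enumerating all eight inputs. A small sketch, with `L` and `s` as in the example:

```python
from collections import Counter

L, s = 3, 3
# Count how often each remainder occurs over every X_n in [0, 2^L).
counts = Counter(x % s for x in range(2**L))
print(dict(counts))  # {0: 3, 1: 3, 2: 2}
```

In general the first (2^L mod s) remainders each get one extra hit, so the bias is small when s is much smaller than 2^L and grows as s approaches 2^L.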

r_n = X_n mod s with rejection

If X_n in [0, 2^3) and s = 3:
X_n : 0,1,2,3,4,5 | 6,7 => r_n : 0,1,2,0,1,2 | 0,1
While (X_n >= 2^L - (2^L mod s)) reject the sample.
Finally, r_n = X_n mod s.
How efficient is this approach? Two mods, and 2/8 of the samples are rejected in this case.
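A minimal sketch of mod with rejection. The function name `mod_with_rejection` and the `next_bits` hook are made up for this example; `next_bits` defaults to `random.getrandbits` and can be swapped for a deterministic source, as in the check below.

```python
import itertools
import random
from collections import Counter


def mod_with_rejection(s, L=32, next_bits=random.getrandbits):
    """Return a uniform integer in [0, s) built from L-bit samples."""
    limit = 2**L - (2**L % s)  # first mod: largest multiple of s <= 2^L
    x = next_bits(L)
    while x >= limit:          # reject samples in the incomplete last block
        x = next_bits(L)
    return x % s               # second mod


# Check the slide's example (L = 3, s = 3) by cycling through 0..7.
source = itertools.cycle(range(8))
counts = Counter(
    mod_with_rejection(3, L=3, next_bits=lambda bits: next(source))
    for _ in range(600)
)
print(dict(counts))  # {0: 200, 1: 200, 2: 200}

print(mod_with_rejection(15) + 1)  # a number in [1, 15]
```

Adding an offset to the result, as in the last line, covers intervals that do not start at zero.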

Masking + rejection sampling

Generate X_n* in [0, 2^M) for M such that 2^M >= s:
X_n* = X_n & (2^M - 1)
While (X_n* >= s) reject the sample.
How efficient is rejection sampling + masking?
Worst case s = 2^(M-1) + 1, e.g. 1025 if M = 11 => approx. 50% rejected.
No mods.
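The masking approach and its rejection rate can be sketched as follows. Both function names are hypothetical, and the sketch assumes a 32-bit source (`random.getrandbits(32)`) and picks the smallest M with 2^M >= s.

```python
import random


def masked_rejection(s, next_bits=lambda: random.getrandbits(32)):
    """Return a uniform integer in [0, s) for 1 <= s <= 2^32, using no mod."""
    mask = (1 << (s - 1).bit_length()) - 1  # 2^M - 1 for the smallest 2^M >= s
    x = next_bits() & mask
    while x >= s:                            # reject anything outside [0, s)
        x = next_bits() & mask
    return x


def rejection_rate(s):
    """Fraction of masked samples that are rejected for a given s."""
    size = 1 << (s - 1).bit_length()         # 2^M
    return (size - s) / size


print(rejection_rate(1025))  # 0.49951171875
print(rejection_rate(1024))  # 0.0
```

The two printed values show the spread: just above a power of two nearly half the samples are thrown away, while an exact power of two rejects nothing.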

Integer scaling

Generate X_n in [0, 2^32) and calculate Q_n = X_n × s.
Finally, r_n = Q_n >> 32.
How efficient is this approach? 64-bit multiplication is very efficient on modern CPUs.
Does this generate a uniform random number r_n in [0, s)?
No, some numbers in [0, s) will appear more often than others.

Integer scaling (cont.)

E.g. X_n in [0, 2^L) and calculate Q_n = X_n × s for s = 3 and L = 3.
X_n ∈ 0,1,2,3,4,5,6,7
Q_n = X_n × 3 in [0, 24): 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
Q_n ÷ 2^3 in [0, 3):
{0,1,2,3,4,5,6,7} => 0
{8,9,10,11,12,13,14,15} => 1
{16,17,18,19,20,21,22,23} => 2
Only the multiples of 3 actually occur as Q_n (0,3,6 | 9,12,15 | 18,21), so 2 appears less often than 0 and 1!
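The same enumeration in Python confirms the count for plain integer scaling. A small sketch, with `L` and `s` as in the example:

```python
from collections import Counter

L, s = 3, 3
# r_n = (X_n * s) >> L for every X_n in [0, 2^L).
counts = Counter((x * s) >> L for x in range(2**L))
print(dict(counts))  # {0: 3, 1: 3, 2: 2}
```

The multiply and shift replace the expensive division, but the result is exactly as biased as the plain mod.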

Integer scaling + rejection sampling*

E.g. X_n in [0, 2^L) and calculate Q_n = X_n × s for s = 3 and L = 3.
Q_n ÷ 2^3 in [0, 3):
{0,1,|2,3,4,5,6,7} => 0
{8,9,|10,11,12,13,14,15} => 1
{16,17,|18,19,20,21,22,23} => 2
Reject if Q_n is in the first (2^L mod s) positions of its bin.
2 will now appear as often as 0 and 1!
How efficient is this approach? One mod, and the chance of rejection is (2^L mod s)/2^L < s/2^L.

* Daniel Lemire. 2018. Fast Random Integer Generation in an Interval. ACM Transactions on Modeling and Computer Simulation (to appear)

Integer scaling + rejection sampling

Mod/divide by 2^L is just an and/shift and therefore efficient.
s is an upper bound of t (the rejection threshold, 2^L mod s) and is used so that we only do a mod with chance s/(2^L - 1).
t depends only on s and not on the sample x, so there is no need to recalculate t whenever a number is rejected!
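The points above can be put together in a Python sketch of integer scaling with rejection. The function name `lemire_bounded` and the `next_bits` hook are made up for this example, and the sketch follows the description on these slides rather than the RandomGen C code.

```python
import itertools
import random
from collections import Counter


def lemire_bounded(s, L=32, next_bits=random.getrandbits):
    """Return a uniform integer in [0, s), for 0 < s <= 2^L."""
    mask = (1 << L) - 1
    q = next_bits(L) * s      # Q_n = X_n * s
    low = q & mask            # Q_n mod 2^L: position inside the bin
    if low < s:               # cheap pre-check, since t < s
        t = (1 << L) % s      # the only mod, computed at most once per call
        while low < t:        # reject the first t positions of each bin
            q = next_bits(L) * s
            low = q & mask
    return q >> L             # Q_n div 2^L: the bin index


# Check the slide's example (L = 3, s = 3) by cycling through 0..7.
source = itertools.cycle(range(8))
counts = Counter(
    lemire_bounded(3, L=3, next_bits=lambda bits: next(source))
    for _ in range(600)
)
print(dict(counts))  # {0: 200, 1: 200, 2: 200}
```

For most calls the `low < s` test fails immediately, so the common path is one multiply, one mask, one compare and one shift.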


Python and NumPy

Python uses mod+rejection OR masking+rejection.
NumPy uses masking+rejection.
I did a NumPy PR in mid August, but Charles Harris pointed me to NEP-19 and the RandomGen project.
Kevin Sheppard from RandomGen seemed keen.
So I did a RandomGen PR to optionally use Lemire's algorithm for unsigned integers in an interval (busy testing 64 bit).

NEP-19 https://www.numpy.org/neps/nep-0019-rng-policy.html

RandomGen

https://github.com/bashtage/randomgen - "Provides access to more modern PRNGs that support modern features such as streams and easy advancement so that they can be easily used on 1000s of nodes."
May also be used as an alternative to the Python random module.
PRNGs and rejection sampling are implemented in C.
Provides a number of random number generators: MT, XorShift, PCG, ThreeFry, Philox.

Results - Masked rejection vs. Lemire

Best case performance is slightly worse.
Reduces average execution time by 54%.
Reduces worst case execution time by 73%.

Results - Approx. average case

Speed-up relative to NumPy MT:

Generator       With Lemire rejection    Original masked rejection
                32-bit      64-bit       32-bit      64-bit
MT19937          85.0%       96.8%        -2.6%       -4.0%
PCG32           200.7%      215.6%        46.5%       29.5%
PCG64           188.3%      260.5%        33.2%       25.0%
ThreeFry         71.5%       61.7%       -12.1%       -5.4%
Xoroshiro128    258.0%      396.9%        65.8%       56.0%
Xorshift1024    209.6%      331.8%        41.3%       30.9%

Results - Confirmation

http://www.pcg-random.org/posts/bounded-rands.html - 2018-07-22
"From our benchmarks, we can see that switching from a commonly-used PRNG (e.g., the 32-bit Mersenne Twister) to a faster PRNG reduced the execution time of our benchmarks by 45%. But switching from a commonly used method for finding a number in a range to our fastest method reduced our benchmark time by about 66%, in other words reducing execution time to one third of the original time."

Questions?
