Pycon ZA
October 12, 2018

Fast random number generation in Python and NumPy by Bernardt Duvenhage

A fast Random Number Generator (RNG) is key to Monte Carlo simulation, efficiently initialising machine learning models, shuffling long sequences of numbers and many other tasks in scientific computing. CPython and NumPy use implementations of the Mersenne Twister RNG together with rejection sampling to generate random numbers in an interval. The NumPy implementation trades more samples for a cheaper division by a power of two: it is very efficient when the size of the random interval is a power of two, but on average it generates many more samples than the GNU C or Java algorithms do. The Python RNG uses an algorithm similar to GNU C's.

A recent paper by Daniel Lemire (https://arxiv.org/abs/1805.10941) describes an efficient, division-light method for generating uniform random numbers in an interval. This method is apparently used in the Go language. On 64-bit platforms there are also fast alternative RNGs that perform comparably in statistical examinations, passing test suites like BigCrush. The splitmix64 (part of the standard Java API) and lehmer64 RNGs, for example, require approximately 1.5 cycles to generate 32 random bits (without using SIMD), while the 32-bit Mersenne Twister requires approximately 10 cycles per 32 bits.

This talk will discuss the inclusion of Lemire's method in NumPy as an alternative to the current rejection sampling implementation. My implementation results in a 2x-3x improvement in the performance of generating sequences of random numbers. Using the splitmix64 or lehmer64 RNGs in NumPy instead of the Mersenne Twister yields a further 2x performance improvement.

The random module in Python does not do its rejection sampling in C as NumPy does; much of the time taken to get a random number is therefore spent in Python code. This talk will also discuss a fast random Python module that implements Lemire's method instead of the current rejection sampling, provides alternative RNGs, and moves more of the code into C.

I'm considering doing pull requests for both the NumPy modification and the Python fast random module and will present a detailed analysis of the proposed modifications.

Transcript

  2. Generating a random number. A fair/unbiased die can generate a
     uniform random number: 1-3, 1-4, 1-5, 1-6, 1-10, 1-12, 1-20, …
     Given these dice, how does one generate a uniform random number
     in [1,15]?
  5. Generating a random number. One can use a computer program to
     generate pseudo-random numbers: LCG, Mersenne Twister, XorShift,
     PCG. These produce numbers in [0, 2^16), [0, 2^32), [0, 2^64), …
     Given a random number in [0, 2^32), how does one now generate a
     uniform random number in [1,15]?
  8. Linear Congruential Generator: X_{n+1} = (a*X_n + c) mod m.
     If m is 2^32:
       X_{n+1} = 1664525 * X_n + 1013904223    # numerical recipes
       X_{n+1} = 1103515245 * X_n + 12345      # glibc
       X_{n+1} = 214013 * X_n + 2531011        # msvs
     LCG128 >> 64 passes TestU01's BigCrush! *
     * [http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html]
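The LCG recurrence above is only a few lines of code. A minimal Python sketch using the Numerical Recipes constants from the slide (the class name and seed handling are illustrative, not from the talk):

```python
class LCG32:
    """32-bit linear congruential generator: X_{n+1} = (a*X_n + c) mod 2^32."""

    def __init__(self, seed=0):
        self.state = seed & 0xFFFFFFFF

    def next(self):
        # Numerical Recipes constants; mod 2^32 is just a 32-bit mask,
        # so no division is needed.
        self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
        return self.state
```

Swapping in the glibc or msvs constants only changes a and c.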
  11. Multiplicative Congruential Generator: X_{n+1} = (a*X_n + 0) mod m.
      MCG128 >> 64 passes TestU01's BigCrush! *
        m = 2^128
        a = 92563704562804186071655587898373606109 **
      ** [Pierre L'ECUYER. 1999. Tables of Linear Congruential Generators of Different …]
      * [http://www.pcg-random.org/posts/does-it-beat-the-minimal-standard.html]
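As a sketch (not from the slides), the MCG above is easy to express with Python's arbitrary-precision integers; the class name is illustrative:

```python
MASK128 = (1 << 128) - 1

class MCG128:
    """Multiplicative congruential generator with m = 2^128,
    returning the top 64 bits of the state (MCG128 >> 64)."""

    # L'Ecuyer (1999) multiplier from the slide
    A = 92563704562804186071655587898373606109

    def __init__(self, seed=1):
        # the state of a multiplicative generator mod 2^128 should be
        # odd, since even states lose their low bits over time
        self.state = (seed | 1) & MASK128

    def next64(self):
        self.state = (self.state * self.A) & MASK128
        return self.state >> 64
```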
  14. Mersenne Twister, XorShift & PCG.
      Mersenne Twister (1997): very popular, but relatively large memory footprint.
      XorShift (2003): much smaller memory footprint.
      Permuted Congruential Generator (PCG, 2014): small memory footprint
      and faster than MT and XorShift. Melissa O'Neill @ Stanford -
      https://www.youtube.com/watch?v=45Oet5qjlms
  16. Where are PRNGs used? When you don't actually have a die!
      Any computational simulation of some physical system.
      Searching of high-dimensional spaces.
  17. Generating a random number. One can use a computer program to
      generate pseudo-random numbers: LCG, Mersenne Twister, XorShift,
      PCG. These produce numbers in [0, 2^16), [0, 2^32), [0, 2^64), …
      Given a random number in [0, 2^32), how does one generate a
      uniform random number in any interval?
  20. Random number r_n in [1, s). Given X_n in [0, 2^L), generate a
      uniform random number r_n in [1, s). We know X_n in [0, 2^L)
      takes 3-10 clock cycles, so r_n in [1, s) should not take much
      longer.
  26. r_n = X_n mod s. X_n mod 3, for example, gives
      1,0,0,2,0,1,2,1,0,1,2,…
      If X_n is in [0, 2^3) and s = 3:
        X_n: 0,1,2,3,4,5,6,7  =>  r_n: 0,1,2,0,1,2,0,1
      0 and 1 appear more often than 2, so r_n is not uniform!
      How efficient is this approach? The mod/div can take 20-50+ cycles.
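The bias in this toy example is easy to verify directly:

```python
from collections import Counter

# Map every X_n in [0, 2^3) to r_n = X_n mod 3 and count the results.
counts = Counter(x % 3 for x in range(2**3))
print(counts)  # 0 and 1 each appear three times, 2 only twice
```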
  31. r_n = X_n mod s with rejection. If X_n is in [0, 2^3) and s = 3:
        X_n: 0,1,2,3,4,5|,6,7  =>  r_n: 0,1,2,0,1,2|,0,1
      While (X_n >= 2^L - (2^L mod s)) reject the sample.
      Finally, r_n = X_n mod s.
      How efficient is this approach? Two mods, and 2/8 of the samples
      are rejected in this case.
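A sketch of mod + rejection in Python (the function name and the injectable `rand` parameter are illustrative additions, the latter just to make the rejection path testable):

```python
import random

def bounded_mod_reject(s, L=32, rand=None):
    """Uniform r_n in [0, s): reject X_n >= 2^L - (2^L mod s), which
    discards the incomplete final bin, then take X_n mod s."""
    if rand is None:
        rand = lambda: random.getrandbits(L)
    limit = 2**L - (2**L % s)
    x = rand()
    while x >= limit:   # sample fell in the incomplete bin: reject
        x = rand()
    return x % s
```

With L = 3 and s = 3 the limit is 6, matching the slide: samples 6 and 7 are rejected.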
  36. Masking + rejection sampling. Generate X_n* in [0, 2^M) for M
      such that 2^M >= s:
        X_n* = X_n & (2^M - 1)
      While (X_n* >= s) reject the sample.
      How efficient is rejection sampling + masking? Worst case is
      s = 2^(M-1) + 1, e.g. 1025 if M = 11 => approx. 50% rejected.
      No mods.
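A sketch of masking + rejection as described above (this is the general technique, not NumPy's actual C source; the standalone function is illustrative):

```python
import random

def bounded_mask_reject(s, L=32):
    """Uniform r_n in [0, s) via masking + rejection."""
    m = (s - 1).bit_length()            # smallest M with 2^M >= s
    mask = (1 << m) - 1
    while True:
        x = random.getrandbits(L) & mask   # X_n* = X_n & (2^M - 1)
        if x < s:                          # otherwise reject and retry
            return x
```

For s = 1025 the mask keeps 11 bits, so roughly half of all draws are rejected, matching the worst case on the slide.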
  41. Integer scaling. Generate X_n in [0, 2^32) and calculate
      Q_n = X_n * s. Finally, r_n = Q_n >> 32.
      How efficient is this approach? 64-bit multiplication is very
      efficient on modern CPUs.
      Does this generate a uniform random number r_n in [1, s)?
      No, some numbers in [1, s) will appear more often than others.
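The multiply + shift itself is a one-liner (illustrative sketch):

```python
import random

def bounded_scale(s):
    """r_n = (X_n * s) >> 32 for X_n in [0, 2^32).
    Fast (one 64-bit multiply and a shift) but, as the next slide
    shows, slightly biased."""
    x = random.getrandbits(32)
    return (x * s) >> 32
```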
  46. Integer scaling (cont.) E.g. X_n in [0, 2^L), Q_n = X_n * s,
      for s = 3 and L = 3:
        X_n ∈ 0,1,2,3,4,5,6,7
        Q_n = X_n * 3 lies in [0, 24): Q_n ∈ 0,3,6,9,12,15,18,21
        Q_n ÷ 2^3 bins [0, 24) into [0, 3):
          {0,1,2,3,4,5,6,7} => 0, {8,9,10,11,12,13,14,15} => 1,
          {16,17,18,19,20,21,22,23} => 2
      Three Q_n values land in the 0 bin and three in the 1 bin, but
      only two (18 and 21) land in the 2 bin, so 2 appears less often
      than 0 and 1!
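The slide's worked example can be checked directly:

```python
from collections import Counter

# L = 3, s = 3: Q_n = X_n * 3 lies in [0, 24); dividing by 2^3
# (a right shift by 3) bins the values into [0, 3).
counts = Counter((x * 3) >> 3 for x in range(2**3))
print(counts)  # 0 and 1 each appear three times, 2 only twice
```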
  53. Integer scaling + rejection sampling.* E.g. X_n in [0, 2^L),
      Q_n = X_n * s, for s = 3 and L = 3:
        Q_n ÷ 2^3 in [0, 3):
          {0,1,|2,3,4,5,6,7} => 0, {8,9,|10,11,12,13,14,15} => 1,
          {16,17,|18,19,20,21,22,23} => 2
      Reject if Q_n is in the first (2^L mod s) positions of an s bin.
      2 will now appear as often as 0 and 1!
      How efficient is this approach? One mod, and the chance of
      rejection is (2^L mod s)/2^L < s/2^L.
      * Daniel Lemire. 2018. Fast Random Integer Generation in an
        Interval. ACM Transactions on Modeling and Computer Simulation
        (to appear)
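A straightforward, unoptimised sketch of the method described above (the function name is illustrative; Lemire's paper gives an optimised C version):

```python
import random

def bounded_lemire(s, L=32):
    """Integer scaling + rejection: uniform r_n in [0, s).

    The low L bits of Q_n = X_n * s give Q_n's position within its
    output bin; rejecting positions below 2^L mod s leaves every bin
    the same size."""
    t = (1 << L) % s          # positions to reject per bin
    mask = (1 << L) - 1
    while True:
        q = random.getrandbits(L) * s
        if (q & mask) >= t:   # not in the first t positions: accept
            return q >> L
```

For L = 3 and s = 3 this rejects exactly the positions {0,1}, {8,9} and {16,17} marked on the slide, leaving each output value with two accepted positions.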
  55. Integer scaling + rejection sampling. Mod/divide by 2^L is just
      an AND/shift and therefore efficient. s is an upper bound on the
      rejection threshold t and is compared against first, so that we
      only do a mod (to compute t) with chance s/(2^L - 1). t depends
      only on s, and not on X_n as well, so there is no need to
      recalculate t whenever a number is rejected!
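Putting both observations together gives the fast form of the algorithm, here as an illustrative Python sketch (Lemire's paper and the RandomGen PR do this in C on unsigned integers):

```python
import random

def bounded_lemire_fast(s, L=32):
    """Lemire's method with the cheap pre-test: since t = 2^L mod s is
    always less than s, compare the low bits against s first and
    compute t (the only mod) only in that rare case."""
    mask = (1 << L) - 1
    q = random.getrandbits(L) * s
    low = q & mask
    if low < s:                # rare: we may be in the reject region
        t = (1 << L) % s       # the real threshold; computed once
        while low < t:         # reject and redraw until out of it
            q = random.getrandbits(L) * s
            low = q & mask
    return q >> L
```

If the low bits are >= s they are certainly >= t, so the common path is one multiply, one AND, one compare and one shift, with no division at all.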
  58. Python and NumPy. Python uses mod+rejection OR masking+rejection;
      NumPy uses masking+rejection. I did a NumPy PR in mid August, but
      Charles Harris pointed me to NEP-19 and the RandomGen project.
      Kevin Sheppard from RandomGen seemed keen, so I did a RandomGen
      PR to optionally use Lemire's algorithm for unsigned integers in
      an interval (busy testing 64-bit).
  59. RandomGen. https://github.com/bashtage/randomgen - "Provides
      access to more modern PRNGs that support modern features such as
      streams and easy advancement so that they can be easily used on
      1000s of nodes." May also be used as an alternative to the Python
      random module. PRNGs and rejection sampling are implemented in C.
      Provides a number of random number generators: MT, XorShift, PCG,
      ThreeFry, Philox.
  61. Results - masked rejection vs. Lemire. Best case performance is
      slightly worse. Reduces average execution time by 54%. Reduces
      worst case execution time by 73%.
  62. Results - approx. average case, speed-up relative to NumPy MT:

      With Lemire rejection       32-bit    64-bit
        MT19937                    85.0%     96.8%
        PCG32                     200.7%    215.6%
        PCG64                     188.3%    260.5%
        ThreeFry                   71.5%     61.7%
        Xoroshiro128              258.0%    396.9%
        Xorshift1024              209.6%    331.8%

      Original masked rejection   32-bit    64-bit
        MT19937                    -2.6%     -4.0%
        PCG32                      46.5%     29.5%
        PCG64                      33.2%     25.0%
        ThreeFry                  -12.1%     -5.4%
        Xoroshiro128               65.8%     56.0%
        Xorshift1024               41.3%     30.9%
  63. Results - confirmation. http://www.pcg-random.org/posts/bounded-rands.html
      (2018-07-22): "From our benchmarks, we can see that switching
      from a commonly-used PRNG (e.g., the 32-bit Mersenne Twister) to
      a faster PRNG reduced the execution time of our benchmarks by
      45%. But switching from a commonly used method for finding a
      number in a range to our fastest method reduced our benchmark
      time by about 66%, in other words reducing execution time to one
      third of the original time."