How to calculate a variance of floating point numbers

Introduces enumerable-statistics.gem and explains the algorithm this library uses to calculate variance.

Kenta Murata

August 20, 2016

Transcript

  1. @mrkn
    • Kenta Murata
    • Ruby
      ‣ CRuby committer, bigdecimal
      ‣ SciRuby contributor
    • Work
      ‣ Software & Data Engineer
  2. enumerable-statistics.gem

    require 'enumerable/statistics'
    require 'csv'

    m, v = CSV.foreach(ARGV[0]).mean_variance do |row|
      row[1].to_f * row[2].to_f
    end

    puts "mean: #{m}"
    puts "variance: #{v}"
  3. enumerable-statistics.gem
    • Adds the following methods to Enumerable
      ‣ mean, variance, stdev, etc.
        - scan items only once
        - preserve precision as much as possible for Float values
      ‣ sum (only for Ruby < 2.4)
        - almost the same algorithm as Ruby trunk
    • Very fast implementation
  4. Performance of variance

    [Bar chart: iterations per second (axis from 0k to 120k; larger is better) for sum, mean, and variance, comparing inject, while, and enum-stat]

    NOTE: "inject" and "while" don't preserve precision
  5. enumerable-statistics.gem
    • Adds the following methods to Enumerable
      ‣ mean, variance, stdev
        - scan items only once
        - preserve precision as much as possible for Float values
      ‣ sum (only for Ruby < 2.4)
        - almost the same algorithm as Ruby trunk
    • Very fast implementation
  6. What is a variance?

    k   |   1    2    3    4    5    6    7    8    9   10
    x_k | 162  155  149  153  146  165  153  157  166  157

    mean: $\mu = 156.3$
  7. How to calculate a mean?

    $$
    \mu = E[x] = \frac{1}{n} \sum_{k=1}^{n} x_k = \frac{1}{n} (x_1 + x_2 + \cdots + x_n)
    $$
  8. What is a variance?

    k   |   1    2    3    4    5    6    7    8    9   10
    x_k | 162  155  149  153  146  165  153  157  166  157

    mean: $\mu = 156.3$

    Distance from the mean: $|x_6 - \mu| = 8.7$, $|x_5 - \mu| = 10.3$
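  9. How to calculate variance?

    $$
    \begin{aligned}
    \sigma^2 = \mathrm{Var}[x] &= E[(x - \mu)^2] = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 \\
    &= \frac{1}{n} \sum_{k=1}^{n} (x_k^2 - 2 \mu x_k + \mu^2) \\
    &= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2 \mu \frac{1}{n} \sum_{k=1}^{n} x_k + \mu^2 \frac{1}{n} \sum_{k=1}^{n} 1 \\
    &= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2 \mu^2 + \mu^2 \frac{1}{n} \times n \\
    &= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - \mu^2 \\
    &= E[x^2] - E[x]^2
    \end{aligned}
    $$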
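  11. How to calculate variance?

    1st formula (need to scan twice):

    $$
    \sigma^2 = E[(x - \mu)^2] = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 = \frac{1}{n} \sum_{k=1}^{n} \left( x_k - \frac{1}{n} \sum_{j=1}^{n} x_j \right)^2
    $$

    2nd formula (enough to scan once; online algorithm):

    $$
    \sigma^2 = E[x^2] - E[x]^2 = \frac{1}{n} \sum_{k=1}^{n} x_k^2 - \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right)^2
    $$

    The 2nd formula is better than the 1st for large populations. Really?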
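
To make the difference between the two formulas concrete, here is a minimal Ruby sketch of both, applied to the ten values from the earlier table. The method names `two_pass_variance` and `one_pass_variance` are hypothetical and are not part of enumerable-statistics.gem; empty input is not handled.

```ruby
# 1st formula: two passes (the mean first, then the squared deviations).
def two_pass_variance(xs)
  n = xs.size
  mean = xs.sum / n.to_f
  xs.sum { |x| (x - mean)**2 } / n
end

# 2nd formula: one pass, accumulating the sum and the sum of squares together.
def one_pass_variance(enum)
  n = 0
  sum = 0.0
  sum_sq = 0.0
  enum.each do |x|
    n += 1
    sum += x
    sum_sq += x * x
  end
  sum_sq / n - (sum / n)**2
end

# Values from the example table (k = 1..10).
xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
puts "two-pass: #{two_pass_variance(xs)}"
puts "one-pass: #{one_pass_variance(xs)}"
```

On well-scaled data like this the two results agree closely; the difference only shows up when the spread is tiny relative to the mean.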
  12. Experiment

    m = 0
    n = 5
    10_000_000.times do
      xs = Array.new(n) { 1.0 + 1e-6 * (0.5 - rand) }
      sq_mean = xs.map {|x| x**2 }.sum / n
      mean_sq = (xs.sum / n)**2
      var = sq_mean - mean_sq
      p [m += 1, var] if var.negative?
    end
  13. Result

    $ ruby-trunk ex1.rb
    [1, -1.1102230246251565e-16]
    [2, -1.1102230246251565e-16]
    [3, -2.220446049250313e-16]
    [4, -1.1102230246251565e-16]
    [5, -1.1102230246251565e-16]
    [6, -2.220446049250313e-16]
    [7, -1.1102230246251565e-16]
    [8, -1.1102230246251565e-16]
    [9, -1.1102230246251565e-16]
    [10, -2.220446049250313e-16]
    :
    :

    • The 2nd formula occasionally yields negative values
    • This is due to rounding errors in floating-point arithmetic
    • We cannot calculate the standard deviation if the variance is negative
  14. How to calculate variance in floating-point arithmetic?
    • Use the 2-pass formula when n is small

    $$
    \sigma^2 = E[(x - \mu)^2] = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2
    $$

    • Use the 1-pass formula for shifted values

    $$
    \sigma^2 = E[(x - \hat{x})^2] - E[x - \hat{x}]^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x})^2 - \left( \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x}) \right)^2
    $$

    • Use the recurrence relation formula
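
The shifted 1-pass formula can be sketched in Ruby as follows. This is an illustration under stated assumptions, not the gem's implementation: the method name `shifted_variance` is hypothetical, and the shift (the x̂ in the formula) is taken to be the first item, which is one common choice since variance does not change when a constant is subtracted from every value.

```ruby
# One-pass population variance on shifted values.
# Assumption: the shift x_hat is simply the first item seen.
def shifted_variance(enum)
  x_hat = nil
  n = 0
  sum = 0.0
  sum_sq = 0.0
  enum.each do |x|
    x_hat ||= x          # fix the shift at the first item
    d = x - x_hat        # shifted value, small when the data cluster together
    n += 1
    sum += d
    sum_sq += d * d
  end
  return Float::NAN if n.zero?
  sum_sq / n - (sum / n)**2
end

# Same kind of data as the experiment: values very close to 1.0.
xs = Array.new(5) { 1.0 + 1e-6 * (0.5 - rand) }
naive = xs.sum { |x| x**2 } / xs.size - (xs.sum / xs.size)**2
puts "naive:   #{naive}"
puts "shifted: #{shifted_variance(xs)}"
```

Because the shifted values are of the same order as the spread itself, the subtraction at the end cancels far fewer significant digits than in the unshifted version.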
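  15. Recurrence relation, e.g. online mean

    Mean of the first n items:

    $$
    \begin{aligned}
    \bar{x}_n &= \frac{1}{n} \sum_{k=1}^{n} x_k = \frac{1}{n} \left( x_n + \sum_{k=1}^{n-1} x_k \right) \\
    &= \frac{1}{n} \left\{ x_n + (n - 1) \bar{x}_{n-1} \right\} \\
    &= \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
    \end{aligned}
    $$

    Previous term: $\bar{x}_{n-1}$. Updating term: $\frac{x_n - \bar{x}_{n-1}}{n}$.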
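
The online mean recurrence translates almost directly into code. The following is a minimal sketch; `online_mean` is a hypothetical name used only for illustration.

```ruby
# Online mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
def online_mean(enum)
  n = 0
  mean = 0.0
  enum.each do |x|
    n += 1
    mean += (x - mean) / n   # previous term plus updating term
  end
  n.zero? ? Float::NAN : mean
end

puts online_mean([162, 155, 149, 153, 146, 165, 153, 157, 166, 157])
```

Starting from `mean = 0.0` is safe because the first update, with n = 1, sets the mean to the first item exactly.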
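  16. Recurrence relation formula for sum of squares

    Sum of squares of the first n items:

    $$
    \begin{aligned}
    S_1^2 &= 0, \\
    S_n^2 &= \sum_{k=1}^{n} (x_k - \bar{x}_n)^2 \\
    &= \sum_{k=1}^{n} (x_k - \bar{x}_{n-1} + \bar{x}_{n-1} - \bar{x}_n)^2 \\
    &= \sum_{k=1}^{n} (x_k - \bar{x}_{n-1})^2 + 2 \sum_{k=1}^{n} (x_k - \bar{x}_{n-1})(\bar{x}_{n-1} - \bar{x}_n) + \sum_{k=1}^{n} (\bar{x}_{n-1} - \bar{x}_n)^2 \\
    &\;\;\vdots \quad \text{(snip)} \\
    &= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - \frac{1}{n} (x_n - \bar{x}_{n-1})^2 \\
    &= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - (x_n - \bar{x}_{n-1})(\bar{x}_n - \bar{x}_{n-1}) \\
    &= S_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)
    \end{aligned}
    $$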
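  17. Recurrence formula for mean and variance

    Mean:

    $$
    \bar{x}_1 = x_1, \quad \bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
    $$

    Sum of squares:

    $$
    S_1^2 = 0, \quad S_n^2 = S_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)
    $$

    Sample variance:

    $$
    s_n^2 = \frac{S_n^2}{n - 1}
    $$

    Population variance:

    $$
    \sigma_n^2 = \frac{S_n^2}{n}
    $$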
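
Put together, the recurrences for the mean and the sum of squares give a single-pass computation of both statistics. The sketch below is an illustration of those formulas only; `online_mean_variance` and its `population:` keyword are hypothetical names, and the gem's actual implementation may differ in detail.

```ruby
# Single-pass mean and variance using the recurrences for the mean
# and the sum of squared deviations (s2 plays the role of S_n^2).
def online_mean_variance(enum, population: false)
  n = 0
  mean = 0.0
  s2 = 0.0
  enum.each do |x|
    n += 1
    delta = x - mean           # x_n - mean_{n-1}
    mean += delta / n          # mean_n
    s2 += delta * (x - mean)   # (x_n - mean_{n-1}) * (x_n - mean_n)
  end
  return [Float::NAN, Float::NAN] if n.zero?
  denom = population ? n : n - 1
  variance = denom.zero? ? Float::NAN : s2 / denom
  [mean, variance]
end

xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
m, v = online_mean_variance(xs)
puts "mean: #{m}"
puts "sample variance: #{v}"
```

Each update adds a product of two deviations with the same sign, so the accumulated sum of squares is not obtained by subtracting two large, nearly equal numbers, which is the step that made the 2nd formula go negative in the experiment.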
  18. Wrap up
    • enumerable-statistics.gem provides methods to calculate statistical summaries of Enumerable
    • Use the recurrence relation formula for 1-pass, precision-preserving calculation
    • Fast calculation without method calls
    • gem install enumerable-statistics