How to calculate a variance of floating point numbers

Introduces enumerable-statistics.gem and explains the algorithm the library uses to calculate variance.

Kenta Murata

August 20, 2016

Transcript

  1. @mrkn
     • Kenta Murata
     • Ruby
       ‣ CRuby committer, bigdecimal
       ‣ SciRuby contributor
     • Work
       ‣ Software & Data Engineer
  2. enumerable-statistics.gem

         require 'enumerable/statistics'
         require 'csv'

         m, v = CSV.foreach(ARGV[0]).mean_variance do |row|
           row[1].to_f * row[2].to_f
         end

         puts "mean: #{m}"
         puts "variance: #{v}"
  3. enumerable-statistics.gem
     • Adds the following methods to Enumerable
       ‣ mean, variance, stdev, etc.
         - scan the items only once
         - preserve as much precision as possible for Float values
       ‣ sum (only for Ruby < 2.4)
         - almost the same algorithm as Ruby trunk
     • Very fast implementation
  4. Performance of variance
     [Bar chart: iterations per second (axis from 0k to 120k; larger is better) for sum, mean, and variance, comparing inject, while, and enum-stat]
     NOTE: "inject" and "while" don't preserve precision
  5. enumerable-statistics.gem
     • Adds the following methods to Enumerable
       ‣ mean, variance, stdev
         - scan the items only once
         - preserve as much precision as possible for Float values
       ‣ sum (only for Ruby < 2.4)
         - almost the same algorithm as Ruby trunk
     • Very fast implementation
  6. What is a variance?

       k    |   1    2    3    4    5    6    7    8    9   10
       x_k  | 162  155  149  153  146  165  153  157  166  157

     [Plot of x_k against k, with the mean marked]
     mean: µ = 156.3
  7. How to calculate a mean?

     $$\mu = E[x] = \frac{1}{n} \sum_{k=1}^{n} x_k = \frac{1}{n} (x_1 + x_2 + \cdots + x_n)$$
  8. What is a variance?

       k    |   1    2    3    4    5    6    7    8    9   10
       x_k  | 162  155  149  153  146  165  153  157  166  157

     [Plot of x_k against k, with the mean and each point's distance from the mean marked]
     mean: µ = 156.3
     distance from mean: |x_6 - µ| = 8.7, |x_5 - µ| = 10.3
  9. How to calculate variance?

     $$
     \begin{aligned}
     \sigma^2 = \mathrm{Var}[x] &= E[(x - \mu)^2] \\
       &= \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 \\
       &= \frac{1}{n} \sum_{k=1}^{n} (x_k^2 - 2 \mu x_k + \mu^2) \\
       &= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2 \mu \frac{1}{n} \sum_{k=1}^{n} x_k + \mu^2 \frac{1}{n} \sum_{k=1}^{n} 1 \\
       &= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2 \mu^2 + \mu^2 \frac{1}{n} \times n \\
       &= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - \mu^2 \\
       &= E[x^2] - E[x]^2
     \end{aligned}
     $$
  10. How to calculate variance?

     $$\sigma^2 = \mathrm{Var}[x] = E[(x - \mu)^2] = E[x^2] - E[x]^2$$
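As a numerical check of the final identity, here are the ten values from the "What is a variance?" slide plugged into it. The intermediate sums are worked out here and are not shown in the deck.

$$
\begin{aligned}
E[x^2] &= \frac{1}{10} \sum_{k=1}^{10} x_k^2 = \frac{244683}{10} = 24468.3 \\
E[x]^2 &= 156.3^2 = 24429.69 \\
\sigma^2 &= E[x^2] - E[x]^2 = 24468.3 - 24429.69 = 38.61
\end{aligned}
$$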
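  11. How to calculate variance?
     1st formula (need to scan twice):

     $$\sigma^2 = E[(x - \mu)^2] = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 = \frac{1}{n} \sum_{k=1}^{n} \left( x_k - \frac{1}{n} \sum_{j=1}^{n} x_j \right)^2$$

     2nd formula (enough to scan once; online algorithm):

     $$\sigma^2 = E[x^2] - E[x]^2 = \frac{1}{n} \sum_{k=1}^{n} x_k^2 - \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right)^2$$

     The 2nd formula is better than the 1st for large populations.
     Really?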
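The two formulas can be sketched in plain Ruby as follows. This is only an illustration: the method names `variance_two_pass` and `variance_one_pass` are made up for this sketch and are not part of enumerable-statistics.gem, and `Array#sum` assumes Ruby 2.4 or later. The sample values are the ten from the "What is a variance?" slide.

```ruby
# 1st formula: needs two scans (one for the mean, one for the deviations).
def variance_two_pass(xs)
  n = xs.size
  mean = xs.sum / n.to_f
  xs.sum { |x| (x - mean)**2 } / n
end

# 2nd formula, E[x^2] - E[x]^2: both sums are accumulated in a single scan.
def variance_one_pass(xs)
  n = 0
  sum = 0.0
  sum_sq = 0.0
  xs.each do |x|
    n += 1
    sum += x
    sum_sq += x * x
  end
  sum_sq / n - (sum / n)**2
end

samples = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
puts variance_two_pass(samples)
puts variance_one_pass(samples)
```

Both methods return the population variance (division by n). On well-scaled data like this they agree closely; the one-pass version only starts to lose digits when the mean is large relative to the spread.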
  12. Experiment

         m = 0   # number of negative variances seen so far
         n = 5   # sample size
         10_000_000.times do
           # n values scattered within 5e-7 of 1.0
           xs = Array.new(n) { 1.0 + 1e-6 * (0.5 - rand) }
           sq_mean = xs.map {|x| x**2 }.sum / n   # E[x^2]
           mean_sq = (xs.sum / n)**2              # E[x]^2
           var = sq_mean - mean_sq
           p [m += 1, var] if var.negative?
         end
  13. Result

         $ ruby-trunk ex1.rb
         [1, -1.1102230246251565e-16]
         [2, -1.1102230246251565e-16]
         [3, -2.220446049250313e-16]
         [4, -1.1102230246251565e-16]
         [5, -1.1102230246251565e-16]
         [6, -2.220446049250313e-16]
         [7, -1.1102230246251565e-16]
         [8, -1.1102230246251565e-16]
         [9, -1.1102230246251565e-16]
         [10, -2.220446049250313e-16]
         :
         :

     • The 2nd formula occasionally yields negative values
     • This is due to rounding errors in floating-point arithmetic
     • We cannot calculate the standard deviation if the variance is negative
  14. How to calculate variance on floating-point arithmetic?
     • Use the 2-pass formula when n is small

       $$\sigma^2 = E[(x - \mu)^2] = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2$$

     • Use the 1-pass formula for shifted values

       $$\sigma^2 = E[(x - \hat{x})^2] - E[x - \hat{x}]^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x})^2 - \left( \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x}) \right)^2$$

     • Use the recurrence relation formula
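A minimal sketch of the shifted 1-pass formula, assuming the shift x̂ is simply the first item. The deck does not say how the shift is chosen, so that choice, like the method name `variance_shifted`, is an assumption of this example.

```ruby
# Shifted one-pass variance: subtract a constant `shift` (the x-hat of the
# formula) from every item before accumulating, so the sums stay small.
# Defaulting the shift to the first item is an assumption of this sketch.
def variance_shifted(xs, shift = xs.first)
  n = 0
  sum = 0.0
  sum_sq = 0.0
  xs.each do |x|
    d = x - shift
    n += 1
    sum += d
    sum_sq += d * d
  end
  sum_sq / n - (sum / n)**2
end

samples = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
puts variance_shifted(samples)
```

Any constant shift leaves the exact variance unchanged. A shift close to the mean keeps the two subtracted terms small, which is what limits the cancellation error.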
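  15. Recurrence relation, e.g. online mean
     Mean of the first n items:

     $$
     \begin{aligned}
     \bar{x}_n &= \frac{1}{n} \sum_{k=1}^{n} x_k \\
       &= \frac{1}{n} \left( x_n + \sum_{k=1}^{n-1} x_k \right) \\
       &= \frac{1}{n} \left\{ x_n + (n - 1) \bar{x}_{n-1} \right\} \\
       &= \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
     \end{aligned}
     $$

     Previous term: $\bar{x}_{n-1}$
     Updating term: $\frac{x_n - \bar{x}_{n-1}}{n}$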
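The recurrence translates almost directly into a loop. The method name `online_mean` is illustrative only, and returning 0.0 for an empty input is a choice made for this sketch.

```ruby
# Online mean: each step adds the updating term (x_n - mean_{n-1}) / n
# to the previous mean, so no running total of the items is kept.
def online_mean(xs)
  n = 0
  mean = 0.0
  xs.each do |x|
    n += 1
    mean += (x - mean) / n
  end
  mean # 0.0 for an empty input (a choice made for this sketch)
end

samples = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
puts online_mean(samples)
```

Because only `n` and `mean` are carried between steps, the same loop works on any Enumerable that can be scanned once, such as rows streamed from a file.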
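  16. Recurrence relation formula for sum of squares
     Sum of squares of the first n items:

     $$
     \begin{aligned}
     S_1^2 &= 0, \\
     S_n^2 &= \sum_{k=1}^{n} (x_k - \bar{x}_n)^2 \\
       &= \sum_{k=1}^{n} (x_k - \bar{x}_{n-1} + \bar{x}_{n-1} - \bar{x}_n)^2 \\
       &= \sum_{k=1}^{n} (x_k - \bar{x}_{n-1})^2 + 2 \sum_{k=1}^{n} (x_k - \bar{x}_{n-1})(\bar{x}_{n-1} - \bar{x}_n) + \sum_{k=1}^{n} (\bar{x}_{n-1} - \bar{x}_n)^2 \\
       &\;\;\vdots \quad \text{(snip)} \\
       &= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - \frac{1}{n} (x_n - \bar{x}_{n-1})^2 \\
       &= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - (x_n - \bar{x}_{n-1})(\bar{x}_n - \bar{x}_{n-1}) \\
       &= S_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)
     \end{aligned}
     $$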
  17. Recurrence formula for mean and variance
     Mean:

     $$\bar{x}_1 = x_1, \quad \bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$

     Sum of squares:

     $$S_1^2 = 0, \quad S_n^2 = S_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)$$

     Sample variance:

     $$s_n^2 = \frac{S_n^2}{n - 1}$$

     Population variance:

     $$\sigma_n^2 = \frac{S_n^2}{n}$$
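Put together, the recurrences give a single-scan routine along these lines. This is a sketch of the formulas on the slide, not the gem's actual source; the method name `mean_variance_online`, the returned array layout, and the use of NaN for undefined results are all assumptions of this example.

```ruby
# One scan over xs, updating the mean and the sum of squares S^2_n
# with the recurrence formulas.
# Returns [mean, sample_variance, population_variance].
def mean_variance_online(xs)
  n = 0
  mean = 0.0
  sum_sq = 0.0 # S^2_n
  xs.each do |x|
    n += 1
    delta = x - mean             # x_n - mean_{n-1}
    mean += delta / n            # mean_n
    sum_sq += delta * (x - mean) # (x_n - mean_{n-1}) * (x_n - mean_n)
  end
  # NaN marks results that are undefined for too few items (sketch choice).
  sample = n > 1 ? sum_sq / (n - 1) : Float::NAN
  population = n > 0 ? sum_sq / n : Float::NAN
  [mean, sample, population]
end

samples = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
mean, sample_var, population_var = mean_variance_online(samples)
puts "mean: #{mean}"
puts "sample variance: #{sample_var}"
puts "population variance: #{population_var}"
```

In exact arithmetic each update adds a non-negative product to `sum_sq`, because `x - mean` after the update has the same sign as `delta`. That is why this form avoids the negative variances produced by subtracting two large sums.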
  18. Wrap up
     • enumerable-statistics.gem provides methods to calculate statistical summaries of an Enumerable
     • Use the recurrence relation formula for 1-pass, precision-preserving calculation
     • Fast calculation without method calls
     • gem install enumerable-statistics