Kenta Murata
August 20, 2016

# How to calculate a variance of floating point numbers

Introduces enumerable-statistics.gem and explains the algorithm this library uses to calculate variance.

## Transcript

1. ### How to calculate a variance of floating point numbers

Kenta Murata, 2016.08.20, #kwsk01
2. ### @mrkn

- Kenta Murata
- Ruby
  - CRuby committer, bigdecimal
  - SciRuby contributor
- Work
  - Software & Data Engineer

4. ### enumerable-statistics.gem

       require 'enumerable/statistics'
       require 'csv'

       m, v = CSV.foreach(ARGV[0]).mean_variance do |row|
         row[1].to_f * row[2].to_f
       end

       puts "mean: #{m}"
       puts "variance: #{v}"
5. ### enumerable-statistics.gem

- Adds the following methods to Enumerable
  - mean, variance, stdev, etc.
    - scan items only once
    - preserve precision as far as possible for Float values
  - sum (only for Ruby < 2.4)
    - almost the same algorithm as Ruby trunk
- Very fast implementation
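As an illustration of what "preserve precision" can mean for `sum`, here is a minimal sketch of Kahan-Babuska (Neumaier) compensated summation. It assumes that "same algorithm as Ruby trunk" refers to the compensated summation that Ruby's built-in `sum` is documented to use for Float values. The method name `compensated_sum` is hypothetical, and this is not the gem's actual code.

```ruby
# Sketch of Kahan-Babuska (Neumaier) summation.
# `compensated_sum` is a hypothetical name, not part of the gem or of Ruby.
def compensated_sum(values)
  sum = 0.0
  c = 0.0 # running compensation: the rounding error lost by each addition
  values.each do |x|
    t = sum + x
    if sum.abs >= x.abs
      c += (sum - t) + x # low-order bits of x were lost
    else
      c += (x - t) + sum # low-order bits of sum were lost
    end
    sum = t
  end
  sum + c
end

tenths = [0.1] * 10
puts tenths.inject(:+)       # plain left-to-right addition
puts compensated_sum(tenths) # compensated addition
```

Plain addition accumulates the rounding error of each step, while the compensated version carries that error along and adds it back at the end.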
6. ### Performance of variance

[Bar chart: iterations per second, axis from 0k to 120k (larger is better); labels: sum, mean, variance; inject, while, enum-stat]

NOTE: "inject" and "while" don't preserve precision
7. ### enumerable-statistics.gem

- Adds the following methods to Enumerable
  - mean, variance, stdev
    - scan items only once
    - preserve precision as far as possible for Float values
  - sum (only for Ruby < 2.4)
    - almost the same algorithm as Ruby trunk
- Very fast implementation
8. ### What is a variance?

| $k$ | $x_k$ |
|-----|-------|
| 1   | 162   |
| 2   | 155   |
| 3   | 149   |
| 4   | 153   |
| 5   | 146   |
| 6   | 165   |
| 7   | 153   |
| 8   | 157   |
| 9   | 166   |
| 10  | 157   |

Mean: $\mu = 156.3$
9. ### How to calculate a mean?

$$\mu = E[x] = \frac{1}{n}\sum_{k=1}^{n} x_k = \frac{1}{n}\,(x_1 + x_2 + \cdots + x_n)$$
10. ### What is a variance?

Same data as slide 8, with mean $\mu = 156.3$. Distance from the mean:

$$|x_6 - \mu| = 8.7, \qquad |x_5 - \mu| = 10.3$$
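The distances from the mean can be checked with a few lines of Ruby. This is only a sketch: the array is the ten data points from the slide, and the variable names are illustrative.

```ruby
# The ten data points x_1..x_10 from the slide.
xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]

mean = xs.sum.fdiv(xs.size)                      # arithmetic mean
deviations = xs.map { |x| (x - mean).round(1) }  # signed distance from the mean

puts "mean: #{mean}"
xs.each_with_index do |x, i|
  puts format("x%-2d = %d  x - mean = %+.1f", i + 1, x, deviations[i])
end
```

Points above the mean get a positive deviation and points below it a negative one, which is why the variance squares them before averaging.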
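11. ### How to calculate variance?

$$
\begin{aligned}
\sigma^2 = \mathrm{Var}[x] &= E[(x - \mu)^2] \\
&= \frac{1}{n}\sum_{k=1}^{n}(x_k - \mu)^2 \\
&= \frac{1}{n}\sum_{k=1}^{n}\left(x_k^2 - 2\mu x_k + \mu^2\right) \\
&= \frac{1}{n}\sum_{k=1}^{n} x_k^2 - 2\mu\,\frac{1}{n}\sum_{k=1}^{n} x_k + \mu^2\,\frac{1}{n}\sum_{k=1}^{n} 1 \\
&= \frac{1}{n}\sum_{k=1}^{n} x_k^2 - 2\mu^2 + \mu^2\,\frac{1}{n}\times n \\
&= \frac{1}{n}\sum_{k=1}^{n} x_k^2 - \mu^2 \\
&= E[x^2] - E[x]^2
\end{aligned}
$$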
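The identity $E[(x-\mu)^2] = E[x^2] - E[x]^2$ holds exactly in exact arithmetic. The sketch below checks it on the ten data points from slide 8 using Ruby's `Rational`, so no rounding is involved. The variable names are illustrative.

```ruby
# Exact check of Var[x] = E[x^2] - E[x]^2 with rational numbers.
xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157].map(&:to_r)
n  = xs.size

mean         = xs.sum / n                            # E[x]
var_two_pass = xs.sum { |x| (x - mean)**2 } / n      # E[(x - mu)^2]
var_one_pass = xs.sum { |x| x**2 } / n - mean**2     # E[x^2] - E[x]^2

puts var_two_pass                  # exact value as a fraction
puts var_two_pass == var_one_pass  # both forms agree exactly
```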
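13. ### How to calculate variance?

1st formula (need to scan twice):

$$\sigma^2 = E[(x - \mu)^2] = \frac{1}{n}\sum_{k=1}^{n}(x_k - \mu)^2 = \frac{1}{n}\sum_{k=1}^{n}\left(x_k - \frac{1}{n}\sum_{j=1}^{n} x_j\right)^2$$

2nd formula (enough to scan once; online algorithm):

$$\sigma^2 = E[x^2] - E[x]^2 = \frac{1}{n}\sum_{k=1}^{n} x_k^2 - \left(\frac{1}{n}\sum_{k=1}^{n} x_k\right)^2$$

The 2nd formula is better than the 1st for large populations. Really?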
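To make the "scan twice" versus "scan once" distinction concrete, here is a minimal sketch of both formulas in plain Ruby. The method names `two_pass_variance` and `one_pass_variance` are hypothetical and are not the gem's API.

```ruby
# 1st formula: the data is read twice (once for the mean, once for deviations).
def two_pass_variance(xs)
  n = xs.size
  mean = xs.sum / n.to_f
  xs.sum { |x| (x - mean)**2 } / n
end

# 2nd formula: n, sum(x), and sum(x^2) are accumulated in a single scan,
# so the input may be a stream that can only be read once.
def one_pass_variance(enum)
  n = 0
  sum = 0.0
  sum_sq = 0.0
  enum.each do |x|
    n += 1
    sum += x
    sum_sq += x * x
  end
  sum_sq / n - (sum / n)**2
end

xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
puts two_pass_variance(xs)
puts one_pass_variance(xs.each) # an Enumerator, consumed once
```

On well-scaled data like this both methods agree to many digits; the interesting cases are those where the mean is large relative to the spread.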
14. ### Experiment

        m = 0
        n = 5

        10_000_000.times do
          xs = Array.new(n) { 1.0 + 1e-6 * (0.5 - rand) }
          sq_mean = xs.map {|x| x**2 }.sum / n
          mean_sq = (xs.sum / n)**2
          var = sq_mean - mean_sq
          p [m += 1, var] if var.negative?
        end
15. ### Result

        $ ruby-trunk ex1.rb
        [1, -1.1102230246251565e-16]
        [2, -1.1102230246251565e-16]
        [3, -2.220446049250313e-16]
        [4, -1.1102230246251565e-16]
        [5, -1.1102230246251565e-16]
        [6, -2.220446049250313e-16]
        [7, -1.1102230246251565e-16]
        [8, -1.1102230246251565e-16]
        [9, -1.1102230246251565e-16]
        [10, -2.220446049250313e-16]
        :
        :

- In rare cases the 2nd formula yields negative values
- This is due to rounding errors in floating-point arithmetic
- We cannot calculate the standard deviation if the variance is negative
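The experiment above depends on random input. The sketch below is a deterministic variant of the same effect, using four values near 1e9 that were chosen for illustration and are not from the talk. Their squares are near 1e18, where adjacent double-precision values are 128 apart, so the subtraction in the 2nd formula cannot recover the true variance of 22.5.

```ruby
# Four values near 1e9; the exact population variance is 22.5.
xs = [4, 7, 13, 16].map { |d| 1e9 + d }
n  = xs.size

# 2nd (1-pass) formula: E[x^2] - E[x]^2
sq_mean  = xs.map { |x| x**2 }.sum / n
mean_sq  = (xs.sum / n)**2
one_pass = sq_mean - mean_sq

# 1st (2-pass) formula: E[(x - mu)^2]
mean     = xs.sum / n
two_pass = xs.sum { |x| (x - mean)**2 } / n

puts "1-pass: #{one_pass}"
puts "2-pass: #{two_pass}"
```

Here the 2-pass result is exact because the deviations from the mean are small integers, while the 1-pass result is off by at least the size of the variance itself.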
16. ### How to calculate variance on floating-point arithmetic?

- Use 2-pass formula when n is small

$$\sigma^2 = E[(x - \mu)^2] = \frac{1}{n}\sum_{k=1}^{n}(x_k - \mu)^2$$

- Use 1-pass formula for shifted values

$$\sigma^2 = E[(x - \hat{x})^2] - E[x - \hat{x}]^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{x})^2 - \left(\frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{x})\right)^2$$

- Use recurrence relation formula
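17. ### Recurrence relation, e.g. online mean

Mean of the first n items:

$$
\begin{aligned}
\bar{x}_n &= \frac{1}{n}\sum_{k=1}^{n} x_k \\
&= \frac{1}{n}\left(x_n + \sum_{k=1}^{n-1} x_k\right) \\
&= \frac{1}{n}\left\{x_n + (n-1)\,\bar{x}_{n-1}\right\} \\
&= \underbrace{\bar{x}_{n-1}}_{\text{previous term}} + \underbrace{\frac{x_n - \bar{x}_{n-1}}{n}}_{\text{updating term}}
\end{aligned}
$$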
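The recurrence translates directly into a loop that keeps only the previous mean and the item count. The sketch below returns every intermediate mean so the update is visible; `running_means` is a hypothetical helper name.

```ruby
# Online mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
def running_means(enum)
  mean = 0.0
  enum.each_with_index.map do |x, i|
    n = i + 1
    mean += (x - mean) / n # previous term + updating term
  end
end

p running_means([1, 2, 3, 4]) # => [1.0, 1.5, 2.0, 2.5]
```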
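18. ### Recurrence relation formula for sum of squares

Sum of squares of the first n items:

$$
\begin{aligned}
S_1^2 &= 0, \\
S_n^2 &= \sum_{k=1}^{n}(x_k - \bar{x}_n)^2 \\
&= \sum_{k=1}^{n}(x_k - \bar{x}_{n-1} + \bar{x}_{n-1} - \bar{x}_n)^2 \\
&= \sum_{k=1}^{n}(x_k - \bar{x}_{n-1})^2 + 2\sum_{k=1}^{n}(x_k - \bar{x}_{n-1})(\bar{x}_{n-1} - \bar{x}_n) + \sum_{k=1}^{n}(\bar{x}_{n-1} - \bar{x}_n)^2 \\
&\;\;\vdots \quad \text{(snip)} \\
&= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - \frac{1}{n}(x_n - \bar{x}_{n-1})^2 \\
&= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - (x_n - \bar{x}_{n-1})(\bar{x}_n - \bar{x}_{n-1}) \\
&= S_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)
\end{aligned}
$$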
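The snipped steps can be filled in as follows. This is one possible route, using only the definitions on these slides and assuming $n \ge 2$. The first $n-1$ terms of $\sum_{k=1}^{n}(x_k - \bar{x}_{n-1})^2$ sum to $S_{n-1}^2$, and $\sum_{k=1}^{n}(x_k - \bar{x}_{n-1}) = n\bar{x}_n - n\bar{x}_{n-1} = n(\bar{x}_n - \bar{x}_{n-1})$, so the three sums become:

$$
\begin{aligned}
S_n^2 &= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - 2n(\bar{x}_n - \bar{x}_{n-1})^2 + n(\bar{x}_n - \bar{x}_{n-1})^2 \\
&= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - n(\bar{x}_n - \bar{x}_{n-1})^2 \\
&= S_{n-1}^2 + (x_n - \bar{x}_{n-1})^2 - \frac{1}{n}(x_n - \bar{x}_{n-1})^2
\end{aligned}
$$

The last step uses the online mean update $\bar{x}_n - \bar{x}_{n-1} = \dfrac{x_n - \bar{x}_{n-1}}{n}$.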
19. ### Recurrence formula for mean and variance

Mean:

$$\bar{x}_1 = x_1, \qquad \bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$

Sum of squares:

$$S_1^2 = 0, \qquad S_n^2 = S_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)$$

Sample variance:

$$s_n^2 = \frac{S_n^2}{n - 1}$$

Population variance:

$$\sigma_n^2 = \frac{S_n^2}{n}$$
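Put together, the recurrences give a single-pass loop. The following is a sketch in plain Ruby under the definitions above, not the gem's actual implementation; `online_mean_variance` is a hypothetical name.

```ruby
# Single-pass mean and variance using the recurrence formulas.
# Returns [mean, population variance, sample variance], or nil for empty input.
def online_mean_variance(enum)
  n = 0
  mean = 0.0
  s2 = 0.0 # S^2_n: running sum of squared deviations
  enum.each do |x|
    n += 1
    delta = x - mean         # x_n - mean_{n-1}
    mean += delta / n        # mean_n
    s2 += delta * (x - mean) # (x_n - mean_{n-1}) * (x_n - mean_n)
  end
  return nil if n.zero?

  sample = n > 1 ? s2 / (n - 1) : Float::NAN # undefined for a single item
  [mean, s2 / n, sample]
end

xs = [4, 7, 13, 16].map { |d| 1e9 + d }
mean, population, sample = online_mean_variance(xs)
puts "mean: #{mean}"
puts "population variance: #{population}"
puts "sample variance: #{sample}"
```

Each update works with differences from the running mean rather than with raw squares, which is what avoids the cancellation seen in the $E[x^2] - E[x]^2$ form.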
20. ### Wrap up

- enumerable-statistics.gem provides methods to calculate statistical summaries of Enumerable
- Use recurrence relation formula for 1-pass and precision-preserving calculation
- Fast calculation without method calls
- gem install enumerable-statistics