Slide 1

Slide 1 text

How to calculate the variance of floating-point numbers
Kenta Murata
2016.08.20 #kwsk01

Slide 2

Slide 2 text

@mrkn
• Kenta Murata
• Ruby
  ‣ CRuby committer, bigdecimal
  ‣ SciRuby contributor
• Work
  ‣ Software & Data Engineer

Slide 3

Slide 3 text

enumerable-statistics.gem https://github.com/mrkn/enumerable-statistics

Slide 4

Slide 4 text

enumerable-statistics.gem

    require 'enumerable/statistics'
    require 'csv'

    m, v = CSV.foreach(ARGV[0]).mean_variance do |row|
      row[1].to_f * row[2].to_f
    end

    puts "mean: #{m}"
    puts "variance: #{v}"

Slide 5

Slide 5 text

enumerable-statistics.gem
• Adds the following methods to Enumerable
  ‣ mean, variance, stdev, etc.
    - scan the items only once
    - preserve as much precision as possible for Float values
  ‣ sum (only for Ruby < 2.4)
    - almost the same algorithm as Ruby trunk
• Very fast implementation

Slide 6

Slide 6 text

Performance of variance
[Bar chart: iterations per second (larger is better), axis from 0k to 120k, for sum, mean, and variance, comparing inject, while, and enum-stat]
NOTE: "inject" and "while" don't preserve precision

Slide 7

Slide 7 text

enumerable-statistics.gem
• Adds the following methods to Enumerable
  ‣ mean, variance, stdev
    - scan the items only once
    - preserve as much precision as possible for Float values
  ‣ sum (only for Ruby < 2.4)
    - almost the same algorithm as Ruby trunk
• Very fast implementation

Slide 8

Slide 8 text

What is a variance?

    k     1    2    3    4    5    6    7    8    9    10
    x_k   162  155  149  153  146  165  153  157  166  157

mean: $\mu = 156.3$

Slide 9

Slide 9 text

How to calculate a mean?

$$
\mu = E[x] = \frac{1}{n} \sum_{k=1}^{n} x_k = \frac{1}{n} (x_1 + x_2 + \cdots + x_n)
$$

Slide 10

Slide 10 text

What is a variance?

    k     1    2    3    4    5    6    7    8    9    10
    x_k   162  155  149  153  146  165  153  157  166  157

mean: $\mu = 156.3$

distance from mean: $|x_6 - \mu| = 8.7$, $|x_5 - \mu| = 10.3$

Slide 11

Slide 11 text

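How to calculate variance?

$$
\begin{aligned}
\sigma^2 = \mathrm{Var}[x] &= E[(x - \mu)^2] \\
&= \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 \\
&= \frac{1}{n} \sum_{k=1}^{n} (x_k^2 - 2\mu x_k + \mu^2) \\
&= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2\mu \cdot \frac{1}{n} \sum_{k=1}^{n} x_k + \mu^2 \cdot \frac{1}{n} \sum_{k=1}^{n} 1 \\
&= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2\mu^2 + \mu^2 \cdot \frac{1}{n} \times n \\
&= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - \mu^2 \\
&= E[x^2] - E[x]^2
\end{aligned}
$$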

Slide 12

Slide 12 text

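How to calculate variance?

$$
\begin{aligned}
\sigma^2 = \mathrm{Var}[x] &= E[(x - \mu)^2] \\
&= \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 \\
&= \frac{1}{n} \sum_{k=1}^{n} (x_k^2 - 2\mu x_k + \mu^2) \\
&= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2\mu \cdot \frac{1}{n} \sum_{k=1}^{n} x_k + \mu^2 \cdot \frac{1}{n} \sum_{k=1}^{n} 1 \\
&= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - 2\mu^2 + \mu^2 \cdot \frac{1}{n} \times n \\
&= \frac{1}{n} \sum_{k=1}^{n} x_k^2 - \mu^2 \\
&= E[x^2] - E[x]^2
\end{aligned}
$$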
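As a sanity check on the derivation, here is a small sketch that evaluates both ends of the chain, $E[(x - \mu)^2]$ and $E[x^2] - E[x]^2$, on the ten values from the "What is a variance?" slide. It uses Ruby's `Rational` so that no rounding occurs; the variable names are illustrative and not taken from the talk.

```ruby
# Data from the "What is a variance?" slide, kept as Integers so that
# Rational arithmetic stays exact.
xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
n  = xs.size

mean = Rational(xs.sum, n)                                  # E[x]

# Definition: E[(x - mu)^2]
var_by_definition = xs.sum { |x| (x - mean)**2 } / n

# Identity: E[x^2] - E[x]^2
var_by_identity = Rational(xs.sum { |x| x**2 }, n) - mean**2

puts "mean:     #{mean.to_f}"
puts "variance: #{var_by_definition.to_f}"
puts "equal:    #{var_by_definition == var_by_identity}"
```

In exact arithmetic the two expressions agree, which is what the derivation states. The differences discussed in the talk only appear once the same expressions are evaluated with Float.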

Slide 13

Slide 13 text

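How to calculate variance?

1st formula (need to scan twice):

$$
\sigma^2 = E[(x - \mu)^2] = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2 = \frac{1}{n} \sum_{k=1}^{n} \left( x_k - \frac{1}{n} \sum_{j=1}^{n} x_j \right)^2
$$

2nd formula (enough to scan once; online algorithm):

$$
\sigma^2 = E[x^2] - E[x]^2 = \frac{1}{n} \sum_{k=1}^{n} x_k^2 - \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right)^2
$$

The 2nd formula is better than the 1st for large populations. Really?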
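The two formulas can be sketched in plain Ruby as follows. This is only an illustration of the scan counts, not the gem's implementation; the method names `variance_two_pass` and `variance_one_pass` are made up for this example, and both assume a non-empty collection of numbers.

```ruby
# 1st formula: the first pass computes the mean, the second pass sums
# the squared deviations from it.
def variance_two_pass(xs)
  n = xs.size
  mean = xs.sum / n.to_f
  xs.sum { |x| (x - mean)**2 } / n
end

# 2nd formula: accumulate sum(x) and sum(x^2) together in a single scan.
def variance_one_pass(xs)
  n = 0
  sum = 0.0
  sum_sq = 0.0
  xs.each do |x|
    n += 1
    sum += x
    sum_sq += x * x
  end
  sum_sq / n - (sum / n)**2
end

xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
puts variance_two_pass(xs)
puts variance_one_pass(xs)
```

The one-pass version only needs `each`, so it also works on streams that cannot be rewound, which is the appeal of the 2nd formula.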

Slide 14

Slide 14 text

Experiment

    m = 0   # counts how many trials produced a negative variance
    n = 5
    10_000_000.times do
      # five values very close to 1.0
      xs = Array.new(n) { 1.0 + 1e-6 * (0.5 - rand) }
      sq_mean = xs.map {|x| x**2 }.sum / n   # E[x^2]
      mean_sq = (xs.sum / n)**2              # E[x]^2
      var = sq_mean - mean_sq
      p [m += 1, var] if var.negative?
    end

Slide 15

Slide 15 text

Result

    $ ruby-trunk ex1.rb
    [1, -1.1102230246251565e-16]
    [2, -1.1102230246251565e-16]
    [3, -2.220446049250313e-16]
    [4, -1.1102230246251565e-16]
    [5, -1.1102230246251565e-16]
    [6, -2.220446049250313e-16]
    [7, -1.1102230246251565e-16]
    [8, -1.1102230246251565e-16]
    [9, -1.1102230246251565e-16]
    [10, -2.220446049250313e-16]
      :
      :

• In rare cases the 2nd formula yields a negative value
• This is caused by rounding errors in floating-point arithmetic
• We cannot calculate the standard deviation if the variance is negative

Slide 16

Slide 16 text

How to calculate variance on floating-point arithmetic?

• Use the 2-pass formula when n is small

$$
\sigma^2 = E[(x - \mu)^2] = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)^2
$$

• Use the 1-pass formula for shifted values

$$
\sigma^2 = E[(x - \hat{x})^2] - E[x - \hat{x}]^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x})^2 - \left( \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x}) \right)^2
$$

• Use the recurrence relation formula
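The shifted 1-pass formula could look like the sketch below. The slide does not say how to pick the shift $\hat{x}$; taking the first item is an assumption made here for illustration, and `variance_shifted` is a hypothetical name, not a method of the gem.

```ruby
# 1-pass variance on shifted values (x - shift).
# ASSUMPTION: the first item is used as the shift; any value close to the
# mean would do, and the talk does not prescribe a choice.
def variance_shifted(xs)
  shift = nil
  n = 0
  sum = 0.0
  sum_sq = 0.0
  xs.each do |x|
    shift = x if shift.nil?
    d = x - shift          # shifted value, small when x is close to shift
    n += 1
    sum += d
    sum_sq += d * d
  end
  sum_sq / n - (sum / n)**2
end

# Values clustered around 1.0, similar in spirit to the experiment above.
xs = [1.0 + 1e-7, 1.0 - 2e-7, 1.0 + 3e-7, 1.0, 1.0 - 1e-7]
puts variance_shifted(xs)
```

Shifting keeps the accumulated squares on the same scale as the variance itself, so the final subtraction no longer cancels almost all of the significant digits.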

Slide 17

Slide 17 text

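Recurrence relation, e.g. online mean

Mean of the first n items:

$$
\begin{aligned}
\bar{x}_n &= \frac{1}{n} \sum_{k=1}^{n} x_k \\
&= \frac{1}{n} \left( x_n + \sum_{k=1}^{n-1} x_k \right) \\
&= \frac{1}{n} \left\{ x_n + (n - 1) \bar{x}_{n-1} \right\} \\
&= \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
\end{aligned}
$$

Previous term: $\bar{x}_{n-1}$. Updating term: $\dfrac{x_n - \bar{x}_{n-1}}{n}$.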
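The recurrence maps directly onto a loop that keeps only the previous mean and a counter. A minimal sketch follows; `online_mean` is an illustrative name, and returning 0.0 for an empty input is a choice made for this example only.

```ruby
# Online mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
def online_mean(xs)
  n = 0
  mean = 0.0
  xs.each do |x|
    n += 1
    mean += (x - mean) / n   # previous term plus updating term
  end
  mean
end

puts online_mean([162, 155, 149, 153, 146, 165, 153, 157, 166, 157])
```

For the first item the update reduces to `mean = x`, so the arbitrary starting value 0.0 does not affect the result.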

Slide 18

Slide 18 text

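Recurrence relation formula for sum of squares

Sum of squares of the first n items:

$$
\begin{aligned}
S^2_1 &= 0, \\
S^2_n &= \sum_{k=1}^{n} (x_k - \bar{x}_n)^2 \\
&= \sum_{k=1}^{n} (x_k - \bar{x}_{n-1} + \bar{x}_{n-1} - \bar{x}_n)^2 \\
&= \sum_{k=1}^{n} (x_k - \bar{x}_{n-1})^2 + 2 \sum_{k=1}^{n} (x_k - \bar{x}_{n-1})(\bar{x}_{n-1} - \bar{x}_n) + \sum_{k=1}^{n} (\bar{x}_{n-1} - \bar{x}_n)^2 \\
&\;\;\vdots \quad \text{(snip)} \\
&= S^2_{n-1} + (x_n - \bar{x}_{n-1})^2 - \frac{1}{n} (x_n - \bar{x}_{n-1})^2 \\
&= S^2_{n-1} + (x_n - \bar{x}_{n-1})^2 - (x_n - \bar{x}_{n-1})(\bar{x}_n - \bar{x}_{n-1}) \\
&= S^2_{n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)
\end{aligned}
$$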

Slide 19

Slide 19 text

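Recurrence formula for mean and variance

Mean:

$$
\bar{x}_1 = x_1, \quad \bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}
$$

Sum of squares:

$$
S^2_1 = 0, \quad S^2_n = S^2_{n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)
$$

Sample variance:

$$
s^2_n = \frac{S^2_n}{n - 1}
$$

Population variance:

$$
\sigma^2_n = \frac{S^2_n}{n}
$$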
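Put together, the recurrences give a single-pass loop like the sketch below. It is written in plain Ruby to mirror the formulas and is not the gem's actual code; the method name `mean_and_variance` and the `population:` keyword are assumptions made for this example.

```ruby
# Single-pass mean and variance using the recurrence formulas.
# Returns [mean, variance]; the sample variance (divide by n - 1) is the
# default, and population: true divides by n instead.
def mean_and_variance(xs, population: false)
  n = 0
  mean = 0.0
  s = 0.0                      # running sum of squares S^2_n
  xs.each do |x|
    n += 1
    delta = x - mean           # x_n - mean_{n-1}
    mean += delta / n          # mean_n
    s += delta * (x - mean)    # (x_n - mean_{n-1}) * (x_n - mean_n)
  end
  divisor = population ? n : n - 1
  [mean, s / divisor]          # NaN when the divisor is zero
end

xs = [162, 155, 149, 153, 146, 165, 153, 157, 166, 157]
m, v = mean_and_variance(xs)
puts "mean: #{m}"
puts "sample variance: #{v}"
```

Each term added to `s` is a product of two differences that should share the same sign, so in this sketch the running sum is not expected to go negative the way the $E[x^2] - E[x]^2$ form did in the experiment.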

Slide 20

Slide 20 text

Wrap up
• enumerable-statistics.gem provides methods to calculate statistical summaries of an Enumerable
• Use the recurrence relation formula for a 1-pass, precision-preserving calculation
• Fast calculation without method calls
• gem install enumerable-statistics

Slide 21

Slide 21 text

Appendix

Slide 22

Slide 22 text

Chan, T. F., et al. (1983).