Slide 1


STATISTICAL INFERENCE ON LARGE DATA SETS

Slide 2


Outline ‣ Introduction ‣ The Proposed Estimation Procedure ‣ Statistical Applications ‣ Example: Internet Traffic Data

Slide 3


Introduction

Slide 4


In the past decade, we have witnessed a revolution in information technology. The amount of stored data has increased dramatically. For example:
• Barclaycard (UK): 350 million transactions a year
• Wal-Mart: 7 billion transactions a year
• AT&T: 70 billion long-distance calls annually
TOO LARGE FOR THE PRIMARY MEMORY (RAM) OF A COMPUTER

Slide 5


Your Computer, Your Data: storable on a single computer.

Slide 6


Your Computer, Your Data: not storable on a single computer; other computers are needed.

Slide 7


The Proposed Estimation Procedure

Slide 8


The Proposed Estimation Procedure 1. Overview and an example 2. Sampling Properties 3. Choice of αn

Slide 9


Overview and an example

Slide 10


The Procedure
To estimate a parameter θ(F) of a population F:
• Read in the whole data set sequentially, block by block
• Compute an estimate of θ(F) within each block
• Take the average of the block estimates

Slide 11


• Suppose that x1, ..., xn are iid from a population F.
• We are interested in estimating a parameter θ(F) of the population.
We rewrite the sample as a βn × αn array:
x11 … x1αn
⋮ ⋱ ⋮
xβn1 … xβnαn
‣ αn: the block size
‣ βn: the number of blocks
Note that n = αnβn.

Slide 12


How to estimate θ(F):
Block 1: θ̂1n
⋮
Block i: θ̂in
⋮
Block βn: θ̂βnn
These block estimates are combined into θ̄ = (1/βn) ∑_{i=1}^{βn} θ̂in.
• It will be shown that the resulting estimate is robust to the choice of αn.
• The choice of αn will be discussed later.
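As a concrete illustration, here is a minimal sketch of the procedure in Python (not the authors' code; the read_block accessor and the use of the median as the within-block estimator are assumptions made for the example):

    import numpy as np

    def block_average_estimate(read_block, beta_n, estimator=np.median):
        """read_block(i) must return the alpha_n observations of block i as an array."""
        block_estimates = np.array([estimator(read_block(i)) for i in range(beta_n)])
        # theta_bar is the average of the within-block estimates theta_hat_in
        return block_estimates.mean(), block_estimates

Only one block of raw data needs to be in memory at any time (inside read_block), which is the point of the procedure.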

Slide 13


An Example
• N = 1,000 data sets of 8 million iid observations ∼ χ²(1)
• n = 8,000,000
• αn = 8,000 ≈ √n log(log(n))
• βn = 1,000
We illustrate our approach by estimating various percentiles of the population.
Let's have a look at the little demo software I made.
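An illustrative, self-contained version of this example (a single simulated data set rather than the demo software; the particular percentile levels are my choice):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha_n, beta_n = 8_000, 1_000          # n = alpha_n * beta_n = 8,000,000
    probs = [25, 50, 75, 90, 99]            # percentile levels to estimate

    block_percentiles = np.empty((beta_n, len(probs)))
    for i in range(beta_n):                 # simulate/read one block at a time
        block = rng.chisquare(df=1, size=alpha_n)
        block_percentiles[i] = np.percentile(block, probs)

    theta_bar = block_percentiles.mean(axis=0)   # averaged percentile estimates
    print(dict(zip(probs, theta_bar.round(4))))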

Slide 14


Sampling Properties

Slide 15


• Suppose that x1, ..., xn are iid from a population F.
• We are interested in estimating a parameter θ(F) of the population.
We rewrite the sample as a βn × αn array:
x11 … x1αn
⋮ ⋱ ⋮
xβn1 … xβnαn
‣ αn: the block size
‣ βn: the number of blocks
Note that n = αnβn.
The proposed estimate: θ̄ = (1/βn) ∑_{i=1}^{βn} θ̂in, where θ̂in is the estimate computed from the i-th block.

Slide 16


Proposition 1. For any positive integer values of αn and βn:
(a) if θ̂in is affine equivariant, then so is θ̄;
(b) if θ̂in is an unbiased estimator of θ, then so is θ̄.

Proposition 2. Suppose that x1, ..., xn are iid, and αn → ∞ and βn → ∞ as n → ∞:
• if the θ̂in's converge weakly to the true value θ, then so does θ̄;
• if the θ̂in's converge in L2 to the true value θ, then so does θ̄;
• if the θ̂in's converge strongly to the true value θ, then so does θ̄.

Denote μn = E[θ̂in] and σn² = var(θ̂in).
To establish the asymptotic normality of θ̄, we need one of the following two conditions.
Condition (a): αn is a constant independent of n and σn² < ∞.
Condition (b): αn → ∞ and βn → ∞ as n → ∞, and E|θ̂in − μn|^{2+δ} / (βn^{δ/2} σn^{2+δ}) → 0 as n → ∞, for some δ > 0.
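A short justification of Proposition 1(b), and of the 1/βn variance rate that the normality conditions exploit (standard algebra, not spelled out on the slide): by linearity of expectation and independence across blocks,

    E[θ̄] = (1/βn) ∑_{i=1}^{βn} E[θ̂in] = μn,    var(θ̄) = (1/βn²) ∑_{i=1}^{βn} var(θ̂in) = σn²/βn,

so θ̄ is unbiased whenever each θ̂in is, and its standard deviation is σn/√βn.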

Slide 17


Theorem 1. Suppose that x1, ..., xn are iid. If either condition (a) or (b) holds, then
√βn (θ̄ − μn) / σn ⟶ N(0,1) in distribution as n → ∞.
The proof is based on Lyapunov's condition, which can be seen as a generalization of the central limit theorem.
Remark: when αn is a fixed finite number, μn and σn² do not depend on n and can be denoted by μ and σ²; then
√βn (θ̄ − θ) / σ ⟶ N(0,1)
holds if and only if the θ̂in's are unbiased estimators of θ. If θ̂in is a biased estimator, the resulting estimator is not consistent, because the bias μ − θ is a constant.
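An illustrative Monte Carlo check of Theorem 1 (my own sketch, not from the slides), using the sample median of Exp(1) blocks as θ̂in; the standardized statistic should have mean close to 0 and standard deviation close to 1:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha_n, beta_n, n_rep = 500, 200, 500    # block size, blocks per data set, replications

    # Reference values mu_n = E[block median] and sigma_n = sd(block median)
    ref = np.median(rng.exponential(1.0, size=(10_000, alpha_n)), axis=1)
    mu_n, sigma_n = ref.mean(), ref.std(ddof=1)

    z = np.empty(n_rep)
    for r in range(n_rep):
        blocks = rng.exponential(1.0, size=(beta_n, alpha_n))
        theta_bar = np.median(blocks, axis=1).mean()      # average of the block medians
        z[r] = np.sqrt(beta_n) * (theta_bar - mu_n) / sigma_n

    print(z.mean(), z.std(ddof=1))   # both should be close to 0 and 1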

Slide 18


Choice of αn

Slide 19


When αn → ∞, it can be shown that
√βn (θ̄ − θ) / σn ⟶ N(0,1)    (1)
holds if and only if μn − θ = o(σn/√βn).
• If θ̂in is unbiased, then μn − θ = 0 and (1) holds.
• In most cases, σn = O(1/√αn), so if μn − θ = o(1/√n) then (1) holds.
• For a biased estimator of θ in parametric settings, we usually have μn − θ = O(1/αn); thus it is necessary that αn/√n → ∞.
• In practice, we suggest taking αn = O(√n log log(n)).
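A quick numerical check of this suggested order for the block size (assuming natural logarithms, consistent with the αn ≈ 8,000 quoted earlier for n = 8 million):

    import math

    n = 8_000_000
    alpha_n = math.sqrt(n) * math.log(math.log(n))
    print(round(alpha_n), round(n / alpha_n))   # ~7823 and ~1023; rounding alpha_n to 8,000 gives beta_n = 1,000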

Slide 20


Statistical Applications

Slide 21


Applications 1. Statistical Inference 2. Non-parametric kernel density estimation

Slide 22


Statistical Inference

Slide 23


Confidence Interval
The standard error of θ̄ is
SE(θ̄) = σ̂n / √βn, where σ̂n = { 1/(βn − 1) ∑_{i=1}^{βn} (θ̂in − θ̄)² }^{1/2}
If the asymptotic normality (1) holds, then an approximate 100(1−α)% confidence interval is
CI(1−α) = [ θ̄ − Φ⁻¹(1−α/2) SE(θ̄) ; θ̄ + Φ⁻¹(1−α/2) SE(θ̄) ]
where Φ is the standard normal cumulative distribution function.
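A minimal sketch of this computation from the βn block estimates (my own code, not the authors'):

    import numpy as np
    from scipy.stats import norm

    def block_confidence_interval(block_estimates, alpha=0.05):
        """block_estimates: the beta_n within-block estimates theta_hat_in."""
        theta_hat = np.asarray(block_estimates, dtype=float)
        theta_bar = theta_hat.mean()
        se = theta_hat.std(ddof=1) / np.sqrt(theta_hat.size)   # sigma_hat_n / sqrt(beta_n)
        z = norm.ppf(1 - alpha / 2)                             # Phi^{-1}(1 - alpha/2)
        return theta_bar, se, (theta_bar - z * se, theta_bar + z * se)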

Slide 24


Hypothesis Testing
Similarly, for testing the hypothesis H0: θ = θ0 vs H1: θ ≠ θ0, the test statistic is
T = (θ̄ − θ0) / SE(θ̄)
and the rejection region is {|T| > Φ⁻¹(1−α/2)}, with significance level α.
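The corresponding two-sided test, continuing the sketch above (block_test is a name I introduce):

    import numpy as np
    from scipy.stats import norm

    def block_test(block_estimates, theta0, alpha=0.05):
        theta_hat = np.asarray(block_estimates, dtype=float)
        theta_bar = theta_hat.mean()
        se = theta_hat.std(ddof=1) / np.sqrt(theta_hat.size)
        T = (theta_bar - theta0) / se
        return T, abs(T) > norm.ppf(1 - alpha / 2)   # reject H0 when the second value is True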

Slide 25


Non-parametric kernel density estimation

Slide 26


A Few Recalls
Definition: let (x1, x2, …, xn) be an iid sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is
f̂h(x) = (1/(nh)) ∑_{i=1}^{n} K((x − xi)/h)
‣ where K(•) is the kernel (a symmetric but not necessarily positive function that integrates to one)
‣ h > 0 is a smoothing parameter called the bandwidth
‣ we'll use the Gaussian kernel: K(x) = ϕ(x), where ϕ is the standard normal density function
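A small from-scratch version of this estimator with the Gaussian kernel (an illustrative sketch; in practice a library routine such as scipy.stats.gaussian_kde would typically be used):

    import numpy as np

    def kde_gaussian(x_grid, sample, h):
        """Evaluate the kernel density estimate f_hat_h at the points x_grid."""
        u = (np.asarray(x_grid, float)[:, None] - np.asarray(sample, float)[None, :]) / h
        K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel K = phi
        return K.mean(axis=1) / h                      # (1/(n h)) * sum_i K((x - x_i)/h)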

Slide 27


Explanation with a graphic:

Slide 28


Bandwidth Selection
The kernel density estimator for a standard normal sample (figure):
• Grey: true density (standard normal)
• Red: KDE with h = 0.05
• Green: KDE with h = 2
• Black: KDE with h = 0.337

Slide 29


Optimal Bandwidth
‣ The most common optimality criterion used to select the bandwidth is the expected L2 risk function (Mean Integrated Squared Error):
MISE(h) = E ∫ ( f̂h(x) − ƒ(x) )² dx
‣ Under weak assumptions on ƒ and K:
MISE(h) = AMISE(h) + o(1/(nh) + h⁴), where AMISE is the Asymptotic MISE.
‣ We have
AMISE(h) = R(K)/(nh) + (1/4) m2(K)² h⁴ R(ƒ'')
where R(g) = ∫ g(x)² dx for a function g, and m2(K) = ∫ x² K(x) dx.
‣ The minimum of the AMISE is obtained by setting its derivative to zero:
d/dh AMISE(h) = −R(K)/(nh²) + m2(K)² h³ R(ƒ'') = 0

Slide 30


Optimal Bandwidth
Hence the optimal bandwidth is
hAMISE = [ R(K) / ( m2(K)² R(ƒ'') n ) ]^{1/5}
• Remark: neither the AMISE nor the hAMISE formula can be used directly, since they involve the unknown density function ƒ or its second derivative ƒ''.
• A variety of automatic, data-based methods have been developed for selecting the bandwidth:
‣ ROT: the Rule of Thumb (Silverman, 1986)
‣ Plug-in selectors
‣ Cross-validation selectors
• Remark: substituting any bandwidth h of the same asymptotic order n^{−1/5} as hAMISE into the AMISE gives AMISE(h) = O(n^{−4/5}).
• It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator.
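A worked example under a normal reference density (an assumption added for illustration): with the Gaussian kernel, R(K) = 1/(2√π), m2(K) = 1 and R(ƒ'') = 3/(8√π σ⁵), so hAMISE reduces to (4/3)^{1/5} σ n^{−1/5} ≈ 1.06 σ n^{−1/5}, which is where the rule of thumb comes from.

    import numpy as np

    def h_amise_normal_reference(sigma, n):
        R_K = 1.0 / (2.0 * np.sqrt(np.pi))              # R(K) for the Gaussian kernel
        m2_K = 1.0                                      # second moment of the Gaussian kernel
        R_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)  # R(f'') for a N(0, sigma^2) density
        return (R_K / (m2_K**2 * R_f2 * n)) ** 0.2

    print(h_amise_normal_reference(1.0, 10_000))        # ~0.168, i.e. 1.06 * 10000**(-1/5)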

Slide 31


Back to the Proposed Approach
Let's see the influence of αn on the estimation with an example:
‣ 1 million iid observations ∼ 0.5 N(−2,1) + 0.5 N(2,1)
‣ We want to estimate the density of the mixture based on the random sample.
‣ αn = 500, 1,000 and 5,000
‣ Gaussian kernel
‣ The bandwidth was selected using the rule of thumb hROT = 0.9 × 1.06 × σ × n^{−1/5}, where σ is the population standard deviation.
‣ In our simulation, σ is substituted by its robust estimate, the mean absolute deviation (MAD) of the sub-sample xi1, ..., xiαn within each block.
Let's try the demo software (a sketch of the per-block estimation follows below).
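A sketch of this per-block estimation (my reading of the slide: the rule-of-thumb constant 0.9 × 1.06 is applied with the block size as n, and the mean absolute deviation is taken about the block mean; this is not the demo software itself):

    import numpy as np

    def block_kde(data, alpha_n, x_grid):
        """Average of within-block Gaussian KDEs over blocks of size alpha_n."""
        blocks = data.reshape(-1, alpha_n)                   # beta_n x alpha_n
        f_bar = np.zeros_like(x_grid, dtype=float)
        for block in blocks:
            scale = np.mean(np.abs(block - block.mean()))    # mean absolute deviation
            h = 0.9 * 1.06 * scale * alpha_n ** (-0.2)       # rule-of-thumb bandwidth
            u = (x_grid[:, None] - block[None, :]) / h
            f_bar += np.exp(-0.5 * u**2).sum(axis=1) / (np.sqrt(2 * np.pi) * alpha_n * h)
        return f_bar / len(blocks)                           # average of the block estimates

    rng = np.random.default_rng(1)
    n = 1_000_000
    data = rng.normal(np.where(rng.random(n) < 0.5, -2.0, 2.0), 1.0)   # 0.5 N(-2,1) + 0.5 N(2,1)
    x_grid = np.linspace(-6.0, 6.0, 400)
    f_hat = block_kde(data, alpha_n=5_000, x_grid=x_grid)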

Slide 32


Back to the Proposed Approach

Slide 33


To assess the performance for different αn, we define the Root of the Average of Squared Errors (RASE):
RASE = { (1/ngrid) ∑_{j=1}^{ngrid} ( f̂(xj) − ƒ(xj) )² }^{1/2}
where:
‣ xj are the grid points at which the density was computed
‣ ngrid = 400
We obtain:
‣ RASE500 = 11 × 10⁻⁴
‣ RASE1000 = 9.12 × 10⁻⁴
‣ RASE5000 = 5.94 × 10⁻⁴
It is seen that the performance becomes better as αn increases.
Remark: the performance of the estimator is not very sensitive to the choice of αn.
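The same criterion in code, applied to the sketch above (the helper names are mine; the true density here is the known mixture):

    import numpy as np

    def true_density(x):
        phi = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
        return 0.5 * phi(x + 2.0) + 0.5 * phi(x - 2.0)   # 0.5 N(-2,1) + 0.5 N(2,1)

    def rase(f_hat, x_grid):
        return np.sqrt(np.mean((f_hat - true_density(x_grid)) ** 2))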

Slide 34


Internet Traffic Data

Slide 35


A Few Things about Packets

Slide 36


In this section, we analyse an internet traffic data set. The original data file includes three fields:
‣ Time of the packet (in seconds)
‣ Direction of the packet
‣ Size of the packet
The variable under study is the throughput, defined by
throughput = size of the packet (in bytes) / time between two packets
The data set consists of 8.1 million nonzero throughputs (packet size per second).
‣ αn = 8,000 (≈ √n log(log(n)))
A preprocessing sketch follows below.
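A hypothetical preprocessing sketch showing how the throughput variable could be derived from such a file (the file name, column layout and pandas-based reading are assumptions; the slide does not give them):

    import pandas as pd

    packets = pd.read_csv("traffic.csv", names=["time", "direction", "size"])  # assumed file/layout
    dt = packets["time"].diff()                           # time between consecutive packets, in seconds
    valid = dt > 0                                        # drop the first row and zero inter-arrival times
    throughput = packets.loc[valid, "size"] / dt[valid]   # bytes per second, the variable under study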

Slide 37


The plot of the estimated density curve of the internet traffic data, together with the estimated quartiles of the internet traffic data (figure).

Slide 38


• The density graphic shows that there are 3 typical values of throughput: one close to 0, and the other two correspond to large throughputs.
• A common unit for measuring throughput is the Mbps (megabits per second). Here our values are expressed in bytes instead of bits, so we multiply by 8 for the conversion.
• First quartile: 1.8 Mbps
• Second quartile: 5 Mbps
• Third quartile: 8.3 Mbps
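A worked version of the bytes-to-bits conversion (the bytes-per-second values below are back-computed from the slide's Mbps figures for illustration; they are not taken from the data):

    quartiles_bytes_per_s = [225_000, 625_000, 1_037_500]          # roughly 1.8, 5 and 8.3 Mbps
    quartiles_mbps = [q * 8 / 1e6 for q in quartiles_bytes_per_s]  # multiply by 8, scale to megabits
    print(quartiles_mbps)                                          # [1.8, 5.0, 8.3]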

Slide 39


Conclusion
‣ We have proposed an estimation procedure for large data sets
‣ It significantly reduces the required amount of computing memory
‣ Without loss of efficiency in many situations
‣ It is applicable to both point estimation and density estimation
‣ Asymptotic properties have been studied
‣ Asymptotic normality has been established
‣ A standard error formula has been proposed and empirically tested
‣ Simulation studies and an example with internet traffic data have been presented

Slide 40


THE END