Xi'an
November 23, 2013

# Thomas Ounas' reading classics seminar on Statistical inference on massive datasets

Presentation on Nov. 18, 2013, as part of the TSI Master courses, on this paper suggested by Thomas.

## Transcript

2. ### Outline

‣ Introduction
‣ The Proposed Estimation Procedure
‣ Statistical Applications
‣ Example: Internet Traffic Data

4. ### In the past decade, we have witnessed a revolution in information technology

The amount of stored data has increased dramatically. For example:

• Barclaycard (UK): 350 million transactions a year
• Wal-Mart: 7 billion transactions a year
• AT&T: 70 billion long-distance calls annually

Such data sets are too large for the primary memory (RAM) of a computer.

Computers

8. ### The Proposed Estimation Procedure

1. Overview and an example
2. Sampling Properties
3. Choice of $\alpha_n$

10. ### The Procedure

To estimate a parameter $\theta(F)$ of a population $F$:

• Read in the whole data set sequentially, block by block
• Compute an estimate of $\theta(F)$ within each block
• Take the average of the block estimates

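
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the speaker's program: the helper names `iter_blocks` and `block_average_estimate` are hypothetical, and an in-memory NumPy array stands in for data that would really be read from disk one block at a time.

```python
import numpy as np

def iter_blocks(data, block_size):
    """Yield consecutive blocks of length block_size (hypothetical helper).

    Stands in for sequential reads from disk. Trailing observations that do
    not fill a whole block are dropped, so that n = block_size * n_blocks.
    """
    n_blocks = len(data) // block_size
    for i in range(n_blocks):
        yield data[i * block_size:(i + 1) * block_size]

def block_average_estimate(blocks, estimator):
    """Apply `estimator` to each block in turn and average the results."""
    total = 0.0
    count = 0
    for block in blocks:
        total += estimator(np.asarray(block))  # only one block is held at a time
        count += 1
    if count == 0:
        raise ValueError("no blocks were supplied")
    return total / count

# Illustration: estimate the median of a N(1, 2^2) population.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
theta_bar = block_average_estimate(iter_blocks(x, 1_000), np.median)
```

Because only a running sum and a counter are kept between blocks, the memory footprint is that of a single block, whatever the total sample size.
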
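11. ### Suppose that $x_1, \dots, x_n$ are iid from a population $F$

• We are interested in estimating a parameter $\theta(F)$ of the population.

We rewrite the sample as:

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1\alpha_n} \\
\vdots & \ddots & \vdots \\
x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n}
\end{pmatrix}
$$

‣ $\alpha_n$: the block size
‣ $\beta_n$: the number of blocks

Note that $n = \alpha_n \beta_n$.
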
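12. ### How to estimate $\theta(F)$

Each block $i = 1, \dots, \beta_n$ yields its own estimate $\hat\theta_{in}$ (block 1: $\hat\theta_{1n}$, ..., block $i$: $\hat\theta_{in}$, ..., block $\beta_n$: $\hat\theta_{\beta_n n}$), and the block estimates are averaged:

$$
\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}
$$

• It will be shown that the resulting estimate is robust to the choice of $\alpha_n$
• The choice of $\alpha_n$ will be discussed later
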
13. ### An Example

We illustrate our approach by estimating various percentiles of the population.

• $N = 1000$ data sets of 8 million iid observations $\sim \chi^2(1)$
• $n = 8\,000\,000$
• $\alpha_n = 8\,000 \approx \sqrt{n}\,\log(\log(n))$
• $\beta_n = 1000$

Let's have a look at the baby software I made.
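
As a rough stand-in for that demo, the percentile example can be sketched as below. The function names are hypothetical, and the simulated sample is deliberately smaller than the 8 million observations on the slide so that the sketch runs quickly.

```python
import numpy as np

def suggested_block_size(n):
    """Block size of order sqrt(n) * log(log(n)), the rate quoted on the slide."""
    return int(np.sqrt(n) * np.log(np.log(n)))

def blockwise_percentile(x, block_size, q):
    """Average of the per-block q-th percentiles (q given in [0, 100])."""
    n_blocks = len(x) // block_size
    blocks = x[:n_blocks * block_size].reshape(n_blocks, block_size)
    return np.percentile(blocks, q, axis=1).mean()

# For n = 8,000,000 the suggested rate gives a block size close to 8,000.
alpha_slide = suggested_block_size(8_000_000)

# Smaller simulation (ASSUMPTION: 1,000,000 draws, blocks of 1,000).
rng = np.random.default_rng(1)
x = rng.chisquare(df=1, size=1_000_000)
median_blockwise = blockwise_percentile(x, 1_000, 50)
median_full = np.percentile(x, 50)  # needs the whole sample in memory
```

Comparing `median_blockwise` with `median_full` gives a quick feel for how little is lost by never holding more than one block in memory.
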

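15. ### Suppose that $x_1, \dots, x_n$ are iid from a population $F$

• We are interested in estimating a parameter $\theta(F)$ of the population.

We rewrite the sample as:

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1\alpha_n} \\
\vdots & \ddots & \vdots \\
x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n}
\end{pmatrix}
$$

‣ $\alpha_n$: the block size
‣ $\beta_n$: the number of blocks

Note that $n = \alpha_n \beta_n$.

The proposed estimate:

$$
\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}
$$
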
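16. ### Proposition 1. For any positive integer values of $\alpha_n$ and $\beta_n$:

(a) if $\hat\theta_{in}$ is affine equivariant, then so is $\bar\theta$;
(b) if $\hat\theta_{in}$ is an unbiased estimator of $\theta$, then so is $\bar\theta$.

Proposition 2. Suppose that $x_1, \dots, x_n$ are iid, and $\alpha_n \to \infty$ and $\beta_n \to \infty$ as $n \to \infty$:

• if the $\hat\theta_{in}$'s converge weakly to the true value $\theta$, then so does $\bar\theta$;
• if the $\hat\theta_{in}$'s converge in $L_2$ to the true value $\theta$, then so does $\bar\theta$;
• if the $\hat\theta_{in}$'s converge strongly to the true value $\theta$, then so does $\bar\theta$.

Denote $\mu_n = E[\hat\theta_{in}]$ and $\sigma_n^2 = \operatorname{var}(\hat\theta_{in})$.

To establish the asymptotic normality of $\bar\theta$, we need one of the following two conditions.

Condition (a): $\alpha_n$ is a constant independent of $n$ and $\sigma_n^2 < \infty$.

Condition (b): $\alpha_n \to \infty$ and $\beta_n \to \infty$ as $n \to \infty$, and, for some $\delta > 0$,

$$
\frac{E\,|\hat\theta_{in} - \mu_n|^{2+\delta}}{\beta_n^{\delta/2}\, \sigma_n^{2+\delta}} \longrightarrow 0 \quad \text{as } n \to \infty .
$$
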
17. ### Theorem 1. Suppose that $x_1, \dots, x_n$ are iid. If either condition (a) or (b) holds, then

$$
\frac{\sqrt{\beta_n}\,(\bar\theta - \mu_n)}{\sigma_n} \longrightarrow N(0,1)
$$

in distribution as $n \to \infty$.

The proof is based on Lyapunov's condition, which leads to a generalization of the central limit theorem.

Remark: When $\alpha_n$ is a fixed finite number, $\mu_n$ and $\sigma_n^2$ do not depend on $n$ and can be denoted by $\mu$ and $\sigma^2$, and then

$$
\frac{\sqrt{\beta_n}\,(\bar\theta - \theta)}{\sigma} \longrightarrow N(0,1)
$$

holds if and only if the $\hat\theta_{in}$'s are unbiased estimators of $\theta$. If $\hat\theta_{in}$ is a biased estimator, the resulting estimator is not consistent, because the bias $\mu - \theta$ is a constant.

19. ### When $\alpha_n \to \infty$, it can be shown that

$$
\frac{\sqrt{\beta_n}\,(\bar\theta - \theta)}{\sigma_n} \longrightarrow N(0,1) \qquad (1)
$$

holds if and only if $\mu_n - \theta = o(\sigma_n / \sqrt{\beta_n})$.

• If $\hat\theta_{in}$ is unbiased, then $\mu_n - \theta = 0$ and (1) holds.
• In most cases, $\sigma_n = O(1/\sqrt{\alpha_n})$ ⇒ if $\mu_n - \theta = o(1/\sqrt{n})$ then (1) holds.
• For a biased estimator of $\theta$ in parametric settings, we usually have $\mu_n - \theta = O(1/\alpha_n)$. Thus it is necessary that $\alpha_n / \sqrt{n} \to \infty$.
• In practice, we suggest taking $\alpha_n = O(\sqrt{n}\,\log\log(n))$.
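
The step from the bias order to the growth condition on $\alpha_n$ can be spelled out as follows. This is a sketch that takes $\sigma_n$ to be of exact order $1/\sqrt{\alpha_n}$ and the bias to be of exact order $1/\alpha_n$, and uses $n = \alpha_n \beta_n$:

$$
\frac{\sigma_n}{\sqrt{\beta_n}} \asymp \frac{1}{\sqrt{\alpha_n \beta_n}} = \frac{1}{\sqrt{n}},
\qquad
\frac{|\mu_n - \theta|}{\sigma_n / \sqrt{\beta_n}} \asymp \frac{1/\alpha_n}{1/\sqrt{n}} = \frac{\sqrt{n}}{\alpha_n} .
$$

The ratio on the right tends to 0 exactly when $\alpha_n / \sqrt{n} \to \infty$. With a block size of order $\sqrt{n}\,\log\log(n)$, the ratio is of order $1/\log\log(n)$, which does tend to 0, although slowly.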

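23. ### Confidence Interval

The standard error of $\bar\theta$:

$$
\mathrm{SE}(\bar\theta) = \frac{\hat\sigma_n}{\sqrt{\beta_n}},
\qquad \text{where} \qquad
\hat\sigma_n = \left\{ \frac{1}{\beta_n - 1} \sum_{i=1}^{\beta_n} (\hat\theta_{in} - \bar\theta)^2 \right\}^{1/2} .
$$

If the asymptotic normality (1) holds, then a $100(1-\alpha)\%$ confidence interval is approximately:

$$
\mathrm{CI}(1-\alpha) = \left[\, \bar\theta - \Phi^{-1}(1-\alpha/2)\,\mathrm{SE}(\bar\theta) \;;\; \bar\theta + \Phi^{-1}(1-\alpha/2)\,\mathrm{SE}(\bar\theta) \,\right]
$$

where $\Phi$ is the normal cumulative distribution function.
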
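
A minimal sketch of this interval, computed from the vector of block estimates. The function name is hypothetical, and the code assumes that the spread of the block estimates is measured by their sample standard deviation with divisor $\beta_n - 1$.

```python
import numpy as np
from statistics import NormalDist

def block_confidence_interval(block_estimates, level=0.95):
    """Normal-approximation confidence interval built from block estimates."""
    est = np.asarray(block_estimates, dtype=float)
    beta_n = est.size
    theta_bar = est.mean()
    # ASSUMPTION: sample standard deviation with divisor beta_n - 1.
    se = est.std(ddof=1) / np.sqrt(beta_n)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # Phi^{-1}(1 - alpha/2)
    return theta_bar - z * se, theta_bar + z * se

lower, upper = block_confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0])
```

The interval only requires the $\beta_n$ block estimates, so it comes at no extra pass over the data.
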
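24. ### Testing Hypotheses

Similarly, for testing the hypothesis

$$
H_0 : \theta = \theta_0 \quad \text{vs} \quad H_1 : \theta \neq \theta_0 ,
$$

the test statistic is given by:

$$
T = \frac{\bar\theta - \theta_0}{\mathrm{SE}(\bar\theta)}
$$

and the rejection region is $\{\, |T| > \Phi^{-1}(1-\alpha/2) \,\}$ with significance level $\alpha$.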
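
The test can be sketched in the same way from the block estimates. The function name is hypothetical, and the standard error is again assumed to be the sample standard deviation of the block estimates divided by $\sqrt{\beta_n}$.

```python
import numpy as np
from statistics import NormalDist

def block_z_test(block_estimates, theta0, alpha=0.05):
    """Two-sided test of H0: theta = theta0 based on block estimates.

    Returns the statistic T and whether H0 is rejected at level alpha.
    """
    est = np.asarray(block_estimates, dtype=float)
    # ASSUMPTION: SE = sample std (divisor beta_n - 1) / sqrt(beta_n).
    se = est.std(ddof=1) / np.sqrt(est.size)
    t_stat = (est.mean() - theta0) / se
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # Phi^{-1}(1 - alpha/2)
    return float(t_stat), bool(abs(t_stat) > critical)

t_null, reject_null = block_z_test([1.0, 2.0, 3.0, 4.0, 5.0], theta0=3.0)
t_far, reject_far = block_z_test([1.0, 2.0, 3.0, 4.0, 5.0], theta0=0.0)
```
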

26. ### A Few Reminders

Definition: Let $(x_1, x_2, \dots, x_n)$ be an iid sample drawn from some distribution with an unknown density $f$. We are interested in estimating the shape of this function $f$. Its kernel density estimator is:

$$
\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)
$$

‣ where $K(\cdot)$ is the kernel (a symmetric but not necessarily positive function that integrates to one)
‣ $h > 0$ is a smoothing parameter called the bandwidth
‣ we'll use the Gaussian kernel: $K(x) = \phi(x)$, where $\phi$ is the standard normal density function
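
A direct NumPy transcription of the kernel density estimator with the Gaussian kernel might look as follows. The function name `kde_gaussian` is hypothetical, and the sketch evaluates the full grid-by-sample matrix, so it is meant for moderate sample sizes only.

```python
import numpy as np

def kde_gaussian(sample, grid, h):
    """Gaussian-kernel density estimate of `sample`, evaluated on `grid`."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # u[j, i] = (grid_j - x_i) / h for every grid point and observation.
    u = (grid[:, None] - sample[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # phi(u)
    return kernel.sum(axis=1) / (sample.size * h)

rng = np.random.default_rng(3)
sample = rng.normal(size=500)
grid = np.linspace(-6.0, 6.0, 601)
density = kde_gaussian(sample, grid, h=0.337)
```
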

28. ### Bandwidth selection

The kernel density estimator for several bandwidths (figure):

• Grey: true density (standard normal)
• Red: KDE with $h = 0.05$
• Green: KDE with $h = 2$
• Black: KDE with $h = 0.337$

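29. ### Optimal Bandwidth

‣ The most common optimality criterion used to select the bandwidth is the expected $L_2$ risk function (Mean Integrated Squared Error):

$$
\mathrm{MISE}(h) = E \int \left( \hat f_h(x) - f(x) \right)^2 dx
$$

Under weak assumptions on $f$ and $K$:

$$
\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left( \frac{1}{nh} + h^4 \right)
$$

where AMISE is the Asymptotic MISE. We have:

$$
\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')
$$

where $R(g) = \int g(x)^2\, dx$ for a function $g$, and $m_2(K) = \int x^2 K(x)\, dx$.

The minimum of this AMISE is the solution to this differential equation:

$$
\frac{\partial}{\partial h} \mathrm{AMISE}(h) = -\frac{R(K)}{n h^2} + m_2(K)^2\, h^3\, R(f'') = 0
$$
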
30. ### Optimal Bandwidth

Hence the optimal bandwidth is:

$$
h_{\mathrm{AMISE}} = \frac{R(K)^{1/5}}{m_2(K)^{2/5}\, R(f'')^{1/5}\, n^{1/5}}
$$

• Remark: neither the AMISE nor the $h_{\mathrm{AMISE}}$ formula can be used directly, since they involve the unknown density function $f$ or its second derivative $f''$.
• A variety of automatic, data-based methods have been developed for selecting the bandwidth:
  ‣ ROT: the Rule of Thumb - Silverman (1986)
  ‣ Plug-in selectors
  ‣ Cross-validation selectors
• Remark: Substituting any bandwidth $h$ which has the same asymptotic order $n^{-1/5}$ as $h_{\mathrm{AMISE}}$ into the AMISE gives $\mathrm{AMISE}(h) = O(n^{-4/5})$.
• It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator.

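
As a worked example of where a rule of thumb comes from, take the standard expression $h_{\mathrm{AMISE}} = \left[ R(K) / (m_2(K)^2 R(f'') n) \right]^{1/5}$, with $R(g) = \int g^2$ and $m_2(K) = \int x^2 K(x)\,dx$, and assume (for illustration only) that $f$ is a normal density with standard deviation $\sigma$ and that $K$ is the Gaussian kernel:

$$
R(K) = \frac{1}{2\sqrt{\pi}}, \qquad m_2(K) = 1, \qquad R(f'') = \frac{3}{8\sqrt{\pi}\,\sigma^5},
$$

$$
h_{\mathrm{AMISE}} = \left( \frac{R(K)}{m_2(K)^2\, R(f'')\, n} \right)^{1/5}
= \left( \frac{4}{3} \right)^{1/5} \sigma\, n^{-1/5} \approx 1.06\, \sigma\, n^{-1/5} .
$$

Replacing $\sigma$ by an estimate computed from the data turns this into a usable, if crude, bandwidth choice under the normal reference assumption.
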
31. ### Back to the Proposed Approach

Let's see the influence of $\alpha_n$ on the estimation with an example:

‣ 1 million iid observations $\sim 0.5\, N(-2,1) + 0.5\, N(2,1)$; we want to estimate the density of the mixture based on the random sample.
‣ $\alpha_n$ = 500, 1 000 and 5 000
‣ Gaussian kernel
‣ The bandwidth was selected using the rule of thumb $h_{\mathrm{rot}} = 0.9 \times 1.06 \times \sigma \times n^{-1/5}$, where $\sigma$ is the population standard deviation
‣ In our simulation, $\sigma$ is replaced by its robust estimate, the mean of absolute deviation (MAD) of the sub-sample $x_{i1}, \dots, x_{i\alpha_n}$ within each block.

Let's try the baby software.
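
A scaled-down sketch of this experiment is given below; it is not the speaker's program. Several details are assumptions made for the sake of a runnable example: the sample has 50,000 rather than 1 million points, the $n$ in the rule of thumb is taken to be the block size, and $\sigma$ is replaced by the unscaled mean absolute deviation around the block mean. All function names are hypothetical.

```python
import numpy as np

def kde_gaussian(sample, grid, h):
    """Gaussian-kernel density estimate of `sample`, evaluated on `grid`."""
    u = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))

def blockwise_kde(x, block_size, grid):
    """Average of per-block Gaussian KDEs, each with its own bandwidth."""
    n_blocks = len(x) // block_size
    total = np.zeros(grid.shape, dtype=float)
    for i in range(n_blocks):
        block = x[i * block_size:(i + 1) * block_size]
        # ASSUMPTION: sigma replaced by the unscaled mean absolute deviation.
        scale = np.mean(np.abs(block - block.mean()))
        # ASSUMPTION: the n in the rule of thumb is the block size.
        h = 0.9 * 1.06 * scale * block_size ** (-1 / 5)
        total += kde_gaussian(block, grid, h)
    return total / n_blocks

# Simulate the mixture 0.5 N(-2, 1) + 0.5 N(2, 1).
rng = np.random.default_rng(2)
n = 50_000
x = rng.choice([-2.0, 2.0], size=n) + rng.normal(size=n)
grid = np.linspace(-6.0, 6.0, 400)
density = blockwise_kde(x, 500, grid)
```

Each block contributes one density curve on the common grid, so only the running sum of curves has to be kept between blocks.
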

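33. ### To assess the performance for different $\alpha_n$, we define the Root of Average of Squared Errors (RASE):

$$
\mathrm{RASE} = \left\{ \frac{1}{n_{\mathrm{grid}}} \sum_{j=1}^{n_{\mathrm{grid}}} \left( \hat f(x_j) - f(x_j) \right)^2 \right\}^{1/2}
$$

where:

‣ $x_j$ are the grid points at which the density was computed
‣ $n_{\mathrm{grid}} = 400$

We have:

‣ $\mathrm{RASE}_{500} = 11 \times 10^{-4}$
‣ $\mathrm{RASE}_{1000} = 9.12 \times 10^{-4}$
‣ $\mathrm{RASE}_{5000} = 5.94 \times 10^{-4}$

It is seen that the performance becomes better as $\alpha_n$ increases.

Remark: The performance of the estimator is not very sensitive to the choice of $\alpha_n$.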
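
The criterion itself is a one-liner. The sketch below assumes that RASE is the root of the mean squared difference between estimated and true density over the grid points, and it includes the true mixture density $0.5\,N(-2,1) + 0.5\,N(2,1)$ for reference; both function names are hypothetical.

```python
import numpy as np

def rase(estimated, true):
    """Root of the average squared error between two curves on a common grid."""
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((estimated - true) ** 2)))

def mixture_density(x):
    """True density of the mixture 0.5 N(-2, 1) + 0.5 N(2, 1)."""
    x = np.asarray(x, dtype=float)
    phi = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return 0.5 * phi(x + 2.0) + 0.5 * phi(x - 2.0)

grid = np.linspace(-6.0, 6.0, 400)   # n_grid = 400, as on the slide
truth = mixture_density(grid)
perfect = rase(truth, truth)         # an exact estimate has zero error
```
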