Thomas Ounas' reading classics seminar on Statistical inference on massive datasets

Xi'an
November 23, 2013

Presentation on Nov. 18, 2013, as part of the TSI Master courses, on this paper suggested by Thomas.

Transcript

  1. STATISTICAL INFERENCE ON LARGE DATA SETS

  2. Outline
    ‣ Introduction
    ‣ The Proposed Estimation Procedure
    ‣ Statistical Applications
    ‣ Example: Internet Traffic Data
  3. Introduction

  4. In the past decade, we have witnessed a revolution in information technology. The amount of stored data has increased dramatically. For example:
    • Barclaycard (UK): 350 million transactions a year
    • Wal-Mart: 7 billion transactions a year
    • AT&T: 70 billion long-distance calls annually
    TOO LARGE FOR THE PRIMARY MEMORY (RAM) OF A COMPUTER
  5. Your Computer. Your Data: storable on a computer.

  6. Your Computer. Your Data: not storable on a computer. Other Computers.
  7. The Proposed Estimation Procedure

  8. The Proposed Estimation Procedure
    1. Overview and an example
    2. Sampling Properties
    3. Choice of αn
  9. Overview and an example

  10. The Procedure
    To estimate a parameter θ(F) of a population F:
    • Read in the whole data set sequentially, block by block
    • Compute an estimate of θ(F) within each block
    • Take the average of the block estimates
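The three steps can be sketched as follows. This is a minimal illustration, not the presenter's program: the helper names `read_blocks` and `block_average` are hypothetical, and dropping an incomplete last block (so that n = αnβn exactly) is an assumption.

```python
import statistics
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def read_blocks(stream: Iterable[float], block_size: int) -> Iterator[List[float]]:
    """Yield consecutive blocks of `block_size` observations from a stream."""
    it = iter(stream)
    while True:
        block = list(islice(it, block_size))
        if len(block) < block_size:
            return  # assumption: an incomplete trailing block is dropped
        yield block


def block_average(
    stream: Iterable[float],
    block_size: int,
    estimator: Callable[[List[float]], float],
) -> Tuple[float, List[float]]:
    """Average the per-block estimates; only one block is held in memory at a time."""
    estimates = [estimator(block) for block in read_blocks(stream, block_size)]
    if not estimates:
        raise ValueError("the stream holds fewer observations than one block")
    return statistics.fmean(estimates), estimates


# Toy usage: 12 observations, blocks of 4, the block estimator is the median.
data = (float(i) for i in range(1, 13))
theta_bar, block_estimates = block_average(data, 4, statistics.median)
```

Only the βn block estimates are kept, which is what makes the memory footprint small compared with loading all n observations.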
  11. • Suppose that $x_1, \ldots, x_n$ are iid from a population F
    • We are interested in estimating a parameter θ(F) of the population.
    We rewrite the sample as:
    $$\begin{pmatrix} x_{11} & \cdots & x_{1\alpha_n} \\ \vdots & \ddots & \vdots \\ x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n} \end{pmatrix}$$
    ‣ αn: the block size
    ‣ βn: the number of blocks
    Note that n = αnβn
  12. How to estimate θ(F):
    Block 1: $\hat\theta_{1n}$, ..., Block i: $\hat\theta_{in}$, ..., Block βn: $\hat\theta_{\beta_n n}$, and then
    $$\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}$$
    • It will be shown that the resulting estimate is robust to the choice of αn
    • The choice of αn will be discussed later
  13. An Example
    N = 1000 data sets of 8 million iid ∼ χ²(1) observations
    • n = 8 000 000
    • αn = 8 000 ≈ $\sqrt{n}\,\log(\log(n))$
    • βn = 1000
    We illustrate our approach by estimating various percentiles of the population.
    Let's have a look at the small demo program I made.
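A scaled-down version of this experiment might look like the sketch below. It is not the demo program shown in the talk: the function names are hypothetical, the order-statistic rule used for the block quantile is an assumption, and the simulation uses a much smaller n so that it runs in about a second.

```python
import math
import random
import statistics


def suggested_block_size(n: int) -> float:
    """The slide's rule: sqrt(n) * log(log(n)), with natural logarithms."""
    return math.sqrt(n) * math.log(math.log(n))


def sample_quantile(block, p):
    """Empirical p-quantile of one block (a simple order-statistic rule)."""
    ordered = sorted(block)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]


def block_averaged_quantile(n, alpha_n, p, rng):
    """Generate chi2(1) data block by block and average the block quantiles."""
    beta_n = n // alpha_n
    estimates = []
    for _ in range(beta_n):
        # The square of a standard normal draw follows a chi2(1) distribution.
        block = [rng.gauss(0.0, 1.0) ** 2 for _ in range(alpha_n)]
        estimates.append(sample_quantile(block, p))
    return statistics.fmean(estimates)


rng = random.Random(0)
alpha_slide = suggested_block_size(8_000_000)  # about 7.8e3, rounded to 8 000 on the slide
median_hat = block_averaged_quantile(n=200_000, alpha_n=1_000, p=0.5, rng=rng)
```

The true median of χ²(1) is about 0.455, so `median_hat` should land close to that value.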
  14. Sampling Properties

  15. • Suppose that $x_1, \ldots, x_n$ are iid from a population F
    • We are interested in estimating a parameter θ(F) of the population.
    We rewrite the sample as:
    $$\begin{pmatrix} x_{11} & \cdots & x_{1\alpha_n} \\ \vdots & \ddots & \vdots \\ x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n} \end{pmatrix}$$
    ‣ αn: the block size
    ‣ βn: the number of blocks
    Note that n = αnβn
    The proposed estimate:
    $$\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}$$
  16. Proposition 1. For any positive integer values of αn and βn:
    (a) if $\hat\theta_{in}$ is affine equivariant, then so is $\bar\theta$;
    (b) if $\hat\theta_{in}$ is an unbiased estimator of θ, then so is $\bar\theta$.
    Proposition 2. Suppose that $x_1, \ldots, x_n$ are iid, and αn → ∞ and βn → ∞ as n → ∞:
    • if the $\hat\theta_{in}$'s converge weakly to the true value θ, then so does $\bar\theta$;
    • if the $\hat\theta_{in}$'s converge in L2 to the true value θ, then so does $\bar\theta$;
    • if the $\hat\theta_{in}$'s converge strongly to the true value θ, then so does $\bar\theta$.
    Denote $\mu_n = E[\hat\theta_{in}]$ and $\sigma_n^2 = \mathrm{var}(\hat\theta_{in})$.
    To establish the asymptotic normality of $\bar\theta$, we need one of the following two conditions.
    Condition (a): αn is a constant independent of n and $\sigma_n^2 < \infty$.
    Condition (b): αn → ∞ and βn → ∞ as n → ∞, and
    $$\frac{E\,|\hat\theta_{in} - \mu_n|^{2+\delta}}{\beta_n^{\delta/2}\, \sigma_n^{2+\delta}} \longrightarrow 0 \quad \text{as } n \to \infty, \text{ for some } \delta > 0$$
  17. Theorem 1. Suppose that $x_1, \ldots, x_n$ are iid. If either condition (a) or (b) holds, then
    $$\frac{\sqrt{\beta_n}\,(\bar\theta - \mu_n)}{\sigma_n} \longrightarrow N(0,1)$$
    in distribution as n → ∞.
    The proof is based on Lyapunov's condition, which gives a generalization of the central limit theorem.
    Remark: When αn is a fixed finite number, μn and σn² do not depend on n and can be denoted by μ and σ², and then
    $$\frac{\sqrt{\beta_n}\,(\bar\theta - \theta)}{\sigma} \longrightarrow N(0,1)$$
    holds if and only if the $\hat\theta_{in}$'s are unbiased estimators of θ. If $\hat\theta_{in}$ is a biased estimator, the resulting estimator is not consistent, because the bias μ − θ is a constant.
  18. Choice of αn

  19. When αn → ∞, it can be shown that
    $$\frac{\sqrt{\beta_n}\,(\bar\theta - \theta)}{\sigma_n} \longrightarrow N(0,1) \qquad (1)$$
    holds if and only if $\mu_n - \theta = o(\sigma_n / \sqrt{\beta_n})$.
    • If $\hat\theta_{in}$ is unbiased, then μn − θ = 0 and (1) holds.
    • In most cases, $\sigma_n = O(1/\sqrt{\alpha_n})$ ⇒ if $\mu_n - \theta = o(1/\sqrt{n})$, then (1) holds.
    • For a biased estimator of θ in parametric settings, we usually have $\mu_n - \theta = O(1/\alpha_n)$. Thus it is necessary that $\alpha_n / \sqrt{n} \to \infty$.
    • In practice, we suggest taking $\alpha_n = O(\sqrt{n}\,\log\log(n))$.
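The chain of conditions on this slide can be written out step by step. The derivation below is a sketch under one extra assumption that the slide does not state: σn is taken to be of exact order $1/\sqrt{\alpha_n}$, and the bias of exact order $1/\alpha_n$.

```latex
\begin{align*}
\frac{\sqrt{\beta_n}\,(\bar\theta - \theta)}{\sigma_n}
  &= \underbrace{\frac{\sqrt{\beta_n}\,(\bar\theta - \mu_n)}{\sigma_n}}_{\to\, N(0,1) \text{ by Theorem 1}}
   + \frac{\sqrt{\beta_n}\,(\mu_n - \theta)}{\sigma_n}, \\
\text{so (1) holds}
  &\iff \frac{\sqrt{\beta_n}\,(\mu_n - \theta)}{\sigma_n} \to 0
   \iff \mu_n - \theta = o\!\left(\frac{\sigma_n}{\sqrt{\beta_n}}\right). \\
\text{With } \sigma_n \asymp \frac{1}{\sqrt{\alpha_n}}:\quad
\frac{\sigma_n}{\sqrt{\beta_n}}
  &\asymp \frac{1}{\sqrt{\alpha_n \beta_n}} = \frac{1}{\sqrt{n}}, \\
\text{and with } \mu_n - \theta \asymp \frac{1}{\alpha_n}:\quad
\frac{1}{\alpha_n} = o\!\left(\frac{1}{\sqrt{n}}\right)
  &\iff \frac{\alpha_n}{\sqrt{n}} \to \infty .
\end{align*}
```

The suggested $\alpha_n = \sqrt{n}\,\log\log(n)$ satisfies this, since $\alpha_n/\sqrt{n} = \log\log(n) \to \infty$, although very slowly.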
  20. Statistical Applications

  21. Applications
    1. Statistical Inference
    2. Non-parametric kernel density estimation

  22. Statistical Inference

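  23. Confidence Interval
    The standard error of $\bar\theta$:
    $$SE(\bar\theta) = \frac{\hat\sigma_n}{\sqrt{\beta_n}}, \quad \text{where } \hat\sigma_n = \left\{ \frac{1}{\beta_n - 1} \sum_{i=1}^{\beta_n} (\hat\theta_{in} - \bar\theta)^2 \right\}^{1/2}$$
    If the asymptotic normality (1) holds, then a 100(1−α)% confidence interval is approximately:
    $$CI(1-\alpha) = \left[\, \bar\theta - \Phi^{-1}(1-\alpha/2)\, SE(\bar\theta) \;;\; \bar\theta + \Phi^{-1}(1-\alpha/2)\, SE(\bar\theta) \,\right]$$
    where Φ is the standard normal cumulative distribution function.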
  24. Testing Hypothesis
    Similarly, for testing the hypothesis H0: θ = θ0 vs H1: θ ≠ θ0, the test statistic is given by:
    $$T = \frac{\bar\theta - \theta_0}{SE(\bar\theta)}$$
    and the rejection region is $\{\, |T| > \Phi^{-1}(1-\alpha/2) \,\}$, with significance level α.
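Both the confidence interval and the test only need the list of block estimates. The sketch below assumes the standard error is built from the sample standard deviation of the block estimates (divisor βn − 1); the function name `block_inference` and the toy numbers are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def block_inference(estimates, theta0, alpha=0.05):
    """Normal-approximation confidence interval and two-sided test from block estimates."""
    beta_n = len(estimates)
    theta_bar = statistics.fmean(estimates)
    sigma_hat = statistics.stdev(estimates)  # assumption: divisor is beta_n - 1
    se = sigma_hat / math.sqrt(beta_n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    t_stat = (theta_bar - theta0) / se
    return {
        "theta_bar": theta_bar,
        "se": se,
        "ci": (theta_bar - z * se, theta_bar + z * se),
        "T": t_stat,
        "reject": abs(t_stat) > z,
    }


# Toy usage with five made-up block estimates.
toy_estimates = [0.9, 1.1, 1.0, 1.2, 0.8]
result_null = block_inference(toy_estimates, theta0=1.0)
result_far = block_inference(toy_estimates, theta0=2.0)
```

With real data βn would be in the hundreds or thousands, which is what makes the normal approximation reasonable.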
  25. Non-parametric kernel density estimation

  26. A Few Reminders
    Definition: Let (x1, x2, …, xn) be an iid sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is:
    $$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
    ‣ where K(•) is the kernel (a symmetric but not necessarily positive function that integrates to one)
    ‣ h > 0 is a smoothing parameter called the bandwidth
    ‣ we'll use the Gaussian kernel: K(x) = ϕ(x), where ϕ is the standard normal density function
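As a small illustration, the standard kernel density estimator, $\hat f_h(x) = \frac{1}{nh}\sum_i K((x - x_i)/h)$ with a Gaussian kernel, can be coded directly. The function names and the three-point toy sample are hypothetical.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x: float, sample, h: float) -> float:
    """Kernel density estimate at x with bandwidth h > 0."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


# Toy usage: three observations, bandwidth 1.
toy_sample = [-1.0, 0.0, 1.0]
value_at_zero = kde(0.0, toy_sample, h=1.0)

# Riemann-sum check that the estimate integrates to about one.
step = 0.01
total_mass = step * sum(kde(-8.0 + k * step, toy_sample, 1.0) for k in range(1601))
```

Each observation contributes one bump of width h; the estimate is the average of those bumps.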
  27. Explanation with a figure:

  28. Bandwidth selection
    The kernel density estimator:
    $$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
    • Grey: true density (standard normal). Red: KDE with h = 0.05. Green: KDE with h = 2. Black: KDE with h = 0.337.
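  29. Optimal Bandwidth
    The most common optimality criterion used to select the bandwidth is:
    ‣ The expected L2 risk function (Mean Integrated Squared Error):
    $$\mathrm{MISE}(h) = E\left[ \int \big(\hat f_h(x) - f(x)\big)^2 \, dx \right]$$
    Under weak assumptions on ƒ and K:
    $$\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left(\frac{1}{nh} + h^4\right)$$
    where AMISE is the Asymptotic MISE. We have:
    $$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')$$
    where $R(g) = \int g(x)^2 \, dx$ for a function g, and $m_2(K) = \int x^2 K(x) \, dx$.
    The minimum of this AMISE is the solution to this equation:
    $$\frac{\partial}{\partial h} \mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2\, h^3\, R(f'') = 0$$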
  30. Optimal Bandwidth
    Hence the optimal bandwidth is:
    $$h_{AMISE} = \frac{R(K)^{1/5}}{m_2(K)^{2/5}\, R(f'')^{1/5}\, n^{1/5}}$$
    • Remark: neither the AMISE nor the $h_{AMISE}$ formulas can be used directly, since they involve the unknown density function ƒ or its second derivative ƒ''.
    • A variety of automatic, data-based methods have been developed for selecting the bandwidth:
      ‣ ROT: the Rule of Thumb, Silverman (1986)
      ‣ Plug-in selectors
      ‣ Cross-validation selectors
    • Remark: Substituting any bandwidth h which has the same asymptotic order $n^{-1/5}$ as $h_{AMISE}$ into the AMISE gives $\mathrm{AMISE}(h) = O(n^{-4/5})$.
    • It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator.
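To see where a rule-of-thumb constant can come from, one can evaluate the standard AMISE-optimal bandwidth, $h_{AMISE} = \big(R(K) / (m_2(K)^2 R(f'') n)\big)^{1/5}$, in the one case where ƒ is known. The sketch below assumes a Gaussian kernel and pretends the true density is standard normal (a normal-reference assumption); the helper names and the midpoint-rule integration are illustrative choices.

```python
import math


def phi(x: float) -> float:
    """Standard normal density: the kernel K and, by assumption, the true density f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def phi_second_derivative(x: float) -> float:
    """Second derivative of the standard normal density."""
    return (x * x - 1.0) * phi(x)


def integrate(g, lo=-10.0, hi=10.0, steps=20_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (k + 0.5) * width) for k in range(steps))


R_K = integrate(lambda x: phi(x) ** 2)                       # R(K)
m2_K = integrate(lambda x: x * x * phi(x))                   # m_2(K)
R_f2 = integrate(lambda x: phi_second_derivative(x) ** 2)    # R(f'')


def h_amise(n: int) -> float:
    """AMISE-optimal bandwidth under the normal-reference assumption."""
    return (R_K / (m2_K ** 2 * R_f2 * n)) ** 0.2


constant = h_amise(1)  # the factor multiplying n**(-1/5)
```

Under these assumptions the constant equals $(4/3)^{1/5} \approx 1.06$, the factor that appears in normal-reference rules of thumb.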
  31. Back to the Proposed Approach
    Let's see the influence of αn on the estimation with an example.
    ‣ 1 million iid ∼ 0.5 N(-2,1) + 0.5 N(2,1); we want to estimate the density of the mixture based on the random sample.
    ‣ αn = 500, 1 000 and 5 000
    ‣ Gaussian kernel
    ‣ The bandwidth was selected using the rule of thumb $h_{rot} = 0.9 \times 1.06 \times \sigma \times n^{-1/5}$, where σ is the population standard deviation
    ‣ In our simulation, σ is substituted by its robust estimate, the mean of absolute deviation (MAD) of the sub-sample $x_{i1}, \ldots, x_{i\alpha_n}$ within each block.
    Let's try the small demo program.
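A scaled-down sketch of this experiment is given below. It is not the demo program from the talk. It follows the slide literally in using the total sample size n in the bandwidth and the mean absolute deviation of each block in place of σ; the function names, the coarse grid and the reduced sample size are assumptions made to keep the run short.

```python
import math
import random
import statistics


def mean_abs_deviation(block):
    """Mean absolute deviation around the block mean."""
    center = statistics.fmean(block)
    return statistics.fmean(abs(x - center) for x in block)


def rot_bandwidth(block, n):
    """Rule of thumb from the slide: 0.9 * 1.06 * sigma * n**(-1/5)."""
    return 0.9 * 1.06 * mean_abs_deviation(block) * n ** (-0.2)


def gaussian_kde(x, sample, h):
    """Gaussian kernel density estimate at x."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return total / (len(sample) * h * math.sqrt(2.0 * math.pi))


def mixture_density(x):
    """True density of 0.5 N(-2,1) + 0.5 N(2,1)."""
    norm = math.sqrt(2.0 * math.pi)
    return 0.5 * (math.exp(-0.5 * (x + 2.0) ** 2) + math.exp(-0.5 * (x - 2.0) ** 2)) / norm


def block_averaged_kde(n, alpha_n, grid, rng):
    """Average the per-block density estimates over beta_n = n // alpha_n blocks."""
    beta_n = n // alpha_n
    averaged = [0.0] * len(grid)
    for _ in range(beta_n):
        # Draw one block from the mixture: pick a component mean, then add N(0,1) noise.
        block = [rng.gauss(rng.choice((-2.0, 2.0)), 1.0) for _ in range(alpha_n)]
        h = rot_bandwidth(block, n)
        for j, x in enumerate(grid):
            averaged[j] += gaussian_kde(x, block, h) / beta_n
    return averaged


rng = random.Random(1)
grid = [-5.0 + 0.5 * j for j in range(21)]
estimate = block_averaged_kde(n=10_000, alpha_n=500, grid=grid, rng=rng)
max_error = max(abs(e - mixture_density(x)) for e, x in zip(estimate, grid))
```

Averaging over blocks reduces the variance of the density estimate but not its bias, which is one reason to choose the bandwidth from n rather than from the block size.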
  32. Back to the Proposed Approach

  33. To assess the performance for different αn, we define the Root of Average of Squared Errors (RASE):
    $$\mathrm{RASE} = \left\{ \frac{1}{n_{grid}} \sum_{j=1}^{n_{grid}} \big( \hat f(x_j) - f(x_j) \big)^2 \right\}^{1/2}$$
    where:
    ‣ $x_j$ are the grid points at which the density was computed
    ‣ $n_{grid}$ = 400
    We have:
    ‣ $\mathrm{RASE}_{500} = 11 \times 10^{-4}$
    ‣ $\mathrm{RASE}_{1000} = 9.12 \times 10^{-4}$
    ‣ $\mathrm{RASE}_{5000} = 5.94 \times 10^{-4}$
    It is seen that the performance becomes better as αn increases.
    Remark: The performance of the estimator is not very sensitive to the choice of αn.
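Read literally, the Root of Average of Squared Errors is the square root of the mean squared gap between the estimated and the true density over the grid. A minimal sketch, with a hypothetical function name and made-up toy values:

```python
import math


def rase(estimated, true):
    """Root of the average of squared errors over a common grid."""
    if len(estimated) != len(true) or not estimated:
        raise ValueError("both inputs must be non-empty and of equal length")
    squared = sum((e - t) ** 2 for e, t in zip(estimated, true))
    return math.sqrt(squared / len(estimated))


# Toy usage on a two-point grid: errors are 0 and 2, so RASE = sqrt(4 / 2).
toy_rase = rase([1.0, 2.0], [1.0, 4.0])
```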
  34. Internet Traffic Data

  35. A Few Things about Packets
  36. In this section, we analyse an internet traffic data set.
    The original data file includes three fields:
    ‣ Time of the packet (in seconds)
    ‣ Direction of the packet
    ‣ Size of the packet
    The variable under study is the throughput, defined by:
    $$\text{throughput} = \frac{\text{size of packet in bytes}}{\text{time between two packets}}$$
    The data set consists of 8.1 million nonzero throughputs (packet size per second).
    ‣ αn = 8000 (≈ $\sqrt{n}\,\log(\log(n))$)
  37. The plot of the estimated density curve of the internet traffic data.
    Estimated quartiles of the internet traffic data.
  38. • The density plot shows that there are 3 typical values of throughput: one close to 0, and two others with a large throughput.
    • A common unit for measuring the throughput is the Mbps (megabits per second).
    • Here we have expressed our values in bytes instead of bits, so we should multiply by 8 for the conversion.
    • First Quartile: 1.8 Mbps
    • Second Quartile: 5 Mbps
    • Third Quartile: 8.3 Mbps
  39. Conclusion
    • We have proposed an estimation procedure for large data sets
      ‣ Significantly reduces the required amount of computing memory
      ‣ Without loss of efficiency in many situations
    • It is applicable to both point and density estimation
    • Asymptotic properties have been studied
      ‣ Asymptotic normality has been established
      ‣ A standard error formula has been proposed and empirically tested
    • Simulation studies and an example of internet data have been seen
  40. THE END