Thomas Ounas' reading classics seminar on Statistical inference on massive datasets

STATISTICAL INFERENCE ON LARGE DATA SETS

Outline
‣ Introduction
‣ The Proposed Estimation Procedure
‣ Statistical Applications
‣ Example: Internet Traffic Data

Introduction

In the past decade, we have witnessed a revolution in information technology. The amount of stored data has increased dramatically. For example:
• Barclaycard (UK): 350 million transactions a year
• Wal-Mart: 7 billion transactions a year
• AT&T: 70 billion long-distance calls annually
Such data sets are TOO LARGE FOR THE PRIMARY MEMORY (RAM) OF A COMPUTER.

[Figure: your computer and your data. The data can be stored on your computer.]

[Figure: your computer, your data and other computers. The data cannot be stored on your computer.]

The Proposed Estimation Procedure

The Proposed Estimation Procedure
1. Overview and an example
2. Sampling Properties
3. Choice of αn

Overview and an example

The Procedure

To estimate a parameter θ(F) of a population F:
• Read in the whole data set sequentially, block by block
• Compute an estimate of θ(F) within each block
• Take the average of the block estimates

• Suppose that $x_1, \dots, x_n$ are iid from a population F
• We are interested in estimating a parameter θ(F) of the population.

We rewrite the sample as:
$$\begin{matrix} x_{11} & \cdots & x_{1\alpha_n} \\ \vdots & \ddots & \vdots \\ x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n} \end{matrix}$$
‣ αn: the block size
‣ βn: the number of blocks
Note that n = αnβn.

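How to estimate θ(F):
‣ Block 1: $\hat\theta_{1n}$
‣ Block i: $\hat\theta_{in}$
‣ Block βn: $\hat\theta_{\beta_n n}$
The block estimates are then averaged:
$$\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}$$
• It will be shown that the resulting estimate is robust to the choice of αn
• The choice of αn will be discussed later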
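The procedure above can be sketched in a few lines. This is only a minimal illustration, assuming the data arrive as a Python iterable; the names `block_average` and `estimator` are hypothetical and do not come from the talk.

```python
from itertools import islice
from statistics import mean, median

def block_average(stream, block_size, estimator):
    """Average an estimator over consecutive blocks of a data stream.

    Only one block is held in memory at a time. A trailing block shorter
    than block_size is dropped so that every block has the same size.
    """
    iterator = iter(stream)
    block_estimates = []
    while True:
        block = list(islice(iterator, block_size))
        if len(block) < block_size:
            break
        block_estimates.append(estimator(block))
    if not block_estimates:
        raise ValueError("stream shorter than one block")
    return mean(block_estimates), block_estimates

# Usage: 12 observations read in blocks of 4, with the median as block estimate.
theta_bar, estimates = block_average(range(1, 13), 4, median)
```

Dropping an incomplete last block is a design choice of this sketch, made so that n = αnβn holds exactly.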

An Example

We illustrate our approach by estimating various percentiles of the population.
• N = 1000 data sets of 8 million iid observations ∼ Chi2(1)
• n = 8 000 000
• αn = 8 000 ≈ √n log(log(n))
• βn = 1000
Let's have a look at the small demo program I made.
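A scaled-down version of this experiment can be run as follows. It is a sketch under stated assumptions, not the program shown in the talk: the sample size is reduced to 200 000, only the median is estimated, and a Chi2(1) draw is generated as the square of a standard normal draw. The true median of Chi2(1) is about 0.455.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the run is reproducible

n, block_size = 200_000, 2_000   # scaled-down stand-ins for n and alpha_n
n_blocks = n // block_size       # beta_n = 100

block_medians = []
for _ in range(n_blocks):
    # A Chi2(1) draw is the square of a standard normal draw.
    block = [random.gauss(0.0, 1.0) ** 2 for _ in range(block_size)]
    block_medians.append(median(block))

# Block-averaged estimate of the population median.
theta_bar = mean(block_medians)
print(f"block-averaged median: {theta_bar:.4f}")
```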

Sampling Properties

• Suppose that $x_1, \dots, x_n$ are iid from a population F
• We are interested in estimating a parameter θ(F) of the population.

We rewrite the sample as:
$$\begin{matrix} x_{11} & \cdots & x_{1\alpha_n} \\ \vdots & \ddots & \vdots \\ x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n} \end{matrix}$$
‣ αn: the block size
‣ βn: the number of blocks
Note that n = αnβn.

The proposed estimate:
$$\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}$$

Proposition 1. For any positive integer values of αn and βn:
(a) if $\hat\theta_{in}$ is affine equivariant, then so is $\bar\theta$;
(b) if $\hat\theta_{in}$ is an unbiased estimator of θ, then so is $\bar\theta$.

Proposition 2. Suppose that $x_1, \dots, x_n$ are iid, and αn → ∞ and βn → ∞ as n → ∞.
• If the $\hat\theta_{in}$'s converge weakly to the true value θ, then so does $\bar\theta$;
• if the $\hat\theta_{in}$'s converge in L2 to the true value θ, then so does $\bar\theta$;
• if the $\hat\theta_{in}$'s converge strongly to the true value θ, then so does $\bar\theta$.

Denote $\mu_n = E[\hat\theta_{in}]$ and $\sigma_n^2 = \mathrm{var}(\hat\theta_{in})$.

To establish the asymptotic normality of $\bar\theta$, we need one of the following two conditions.
Condition (a): αn is a constant independent of n and $\sigma_n^2 < \infty$.
Condition (b): αn → ∞ and βn → ∞ as n → ∞, and
$$\frac{E\,|\hat\theta_{in} - \mu_n|^{2+\delta}}{\beta_n^{\delta/2}\, \sigma_n^{2+\delta}} \longrightarrow 0 \quad \text{as } n \to \infty, \text{ for some } \delta > 0.$$

Theorem 1. Suppose that $x_1, \dots, x_n$ are iid. If either condition (a) or (b) holds, then
$$\frac{\sqrt{\beta_n}\,(\bar\theta - \mu_n)}{\sigma_n} \longrightarrow N(0,1)$$
in distribution as n → ∞.

The proof is based on Lyapunov's condition, which gives a generalization of the classical central limit theorem.

Remark: When αn is a fixed finite number, μn and σn² do not depend on n and can be denoted by μ and σ², and then
$$\frac{\sqrt{\beta_n}\,(\bar\theta - \theta)}{\sigma} \longrightarrow N(0,1)$$
holds if and only if the $\hat\theta_{in}$'s are unbiased estimators of θ. If $\hat\theta_{in}$ is a biased estimator, the resulting estimator is not consistent, because the bias μ - θ is a constant.
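The remark on fixed block sizes can be checked numerically. The sketch below is an illustration chosen for this purpose, not an example from the talk: it uses the variance estimate that divides by the block size, which is biased, on N(0,1) data with blocks of 5 observations. Its expectation is μ = 4/5, so the block average settles near 0.8 rather than near the true variance 1, however many blocks are used.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is reproducible

def biased_variance(block):
    """Variance estimate dividing by the block size (biased downward)."""
    m = mean(block)
    return sum((x - m) ** 2 for x in block) / len(block)

block_size, n_blocks = 5, 20_000   # fixed alpha_n, large beta_n
estimates = [
    biased_variance([random.gauss(0.0, 1.0) for _ in range(block_size)])
    for _ in range(n_blocks)
]

# The average approaches mu = (block_size - 1) / block_size = 0.8, not 1.
theta_bar = mean(estimates)
print(f"block average: {theta_bar:.3f}")
```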

Choice of αn

When αn → ∞, it can be shown that
$$\frac{\sqrt{\beta_n}\,(\bar\theta - \theta)}{\sigma_n} \longrightarrow N(0,1) \qquad (1)$$
holds if and only if $\mu_n - \theta = o(\sigma_n / \sqrt{\beta_n})$.
• If $\hat\theta_{in}$ is unbiased, then μn - θ = 0 and (1) holds.
• In most cases, σn is of order 1/√αn, so σn/√βn is of order 1/√n (since n = αnβn) ⇒ if μn - θ = o(1/√n), then (1) holds.
• For a biased estimator of θ in parametric settings, we usually have μn - θ = O(1/αn). Thus it is necessary that αn/√n → ∞.
• In practice, we suggest taking αn = O(√n log log(n)).
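The suggested order of αn is easy to evaluate. The sketch below simply computes √n log(log(n)) for the sample size of the earlier example; the function name `suggested_block_size` is hypothetical, and the constant in front of the O(·) is taken to be 1 as an assumption.

```python
import math

def suggested_block_size(n):
    """Return sqrt(n) * log(log(n)), the order suggested for alpha_n."""
    if n < 3:
        raise ValueError("n must be at least 3 so that log(log(n)) > 0")
    return math.sqrt(n) * math.log(math.log(n))

n = 8_000_000
alpha_n = suggested_block_size(n)   # roughly 7.8 thousand
beta_n = n / alpha_n                # corresponding number of blocks
ratio = alpha_n / math.sqrt(n)      # equals log(log(n)), which diverges slowly
```

The ratio αn/√n equals log(log(n)), which tends to infinity as required, although very slowly.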

Statistical Applications

Applications
1. Statistical Inference
2. Non-parametric kernel density estimation

Statistical Inference

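Confidence Interval

The standard error of $\bar\theta$:
$$SE(\bar\theta) = \frac{\hat\sigma_n}{\sqrt{\beta_n}}, \quad \text{where} \quad \hat\sigma_n = \left\{ \frac{1}{\beta_n - 1} \sum_{i=1}^{\beta_n} (\hat\theta_{in} - \bar\theta)^2 \right\}^{1/2}$$
is the sample standard deviation of the βn block estimates.

If the asymptotic normality (1) holds, then a 100(1-α)% confidence interval is approximately:
$$CI(1-\alpha) = \left[\, \bar\theta - \Phi^{-1}(1-\alpha/2)\, SE(\bar\theta) \;;\; \bar\theta + \Phi^{-1}(1-\alpha/2)\, SE(\bar\theta) \,\right]$$
where Φ is the standard normal cumulative distribution function.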
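The interval can be computed directly from the block estimates. This is a minimal sketch, assuming the sample standard deviation of the block estimates (divisor βn - 1) is used for the estimate of σn; the function name and the toy input values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def block_confidence_interval(block_estimates, level=0.95):
    """Normal-approximation confidence interval from block estimates."""
    beta_n = len(block_estimates)
    theta_bar = mean(block_estimates)
    sigma_hat = stdev(block_estimates)            # divides by beta_n - 1
    se = sigma_hat / math.sqrt(beta_n)            # standard error of theta_bar
    z = NormalDist().inv_cdf(1 - (1 - level) / 2) # normal quantile
    return theta_bar - z * se, theta_bar + z * se

# Toy block estimates, for illustration only.
low, high = block_confidence_interval([0.9, 1.1, 1.0, 1.2, 0.8])
```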

Hypothesis Testing

Similarly, for testing the hypothesis
H0: θ = θ0 vs H1: θ ≠ θ0
the test statistic is given by:
$$T = \frac{\bar\theta - \theta_0}{SE(\bar\theta)}$$
and the rejection region is $\{ |T| > \Phi^{-1}(1-\alpha/2) \}$ with significance level α.
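The test can be sketched in the same way. The function name `block_z_test` and the toy values are hypothetical, and the standard error is computed from the sample standard deviation of the block estimates as an assumption.

```python
import math
from statistics import NormalDist, mean, stdev

def block_z_test(block_estimates, theta_0, alpha=0.05):
    """Two-sided test of H0: theta = theta_0 from block estimates."""
    se = stdev(block_estimates) / math.sqrt(len(block_estimates))
    t_stat = (mean(block_estimates) - theta_0) / se
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Reject H0 when |T| exceeds the normal critical value.
    return t_stat, abs(t_stat) > critical

# Toy block estimates, for illustration only.
t_stat, reject = block_z_test([0.9, 1.1, 1.0, 1.2, 0.8], theta_0=1.5)
```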

Non-parametric kernel density estimation

A Few Reminders

Definition: Let (x1, x2, …, xn) be an iid sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is:
$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)$$
‣ where K(•) is the kernel (a symmetric but not necessarily positive function that integrates to one)
‣ h > 0 is a smoothing parameter called the bandwidth
‣ we'll use the Gaussian kernel: K(x) = ϕ(x), where ϕ is the standard normal density function
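A kernel density estimator with the Gaussian kernel takes only a few lines. The sketch below assumes the standard form of the estimator, the average of K((x - xi)/h)/h over the sample; the function name and the three-point sample are hypothetical.

```python
import math

def gaussian_kde(sample, h):
    """Return the kernel density estimate x -> f_hat_h(x), Gaussian kernel."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        # Sum of Gaussian bumps centred at the observations.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return f_hat

# Toy sample, for illustration only.
f_hat = gaussian_kde([-1.0, 0.0, 1.0], h=0.5)
value_at_zero = f_hat(0.0)
```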

Explanation with a graph:

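Bandwidth selection

The kernel density estimator:
$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)$$
In the figure:
• Grey: true density (standard normal)
• Red: KDE with h = 0.05
• Green: KDE with h = 2
• Black: KDE with h = 0.337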

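Optimal Bandwidth

‣ The most common optimality criterion used to select the bandwidth is the expected L2 risk function (Mean Integrated Squared Error):
$$\mathrm{MISE}(h) = E \int \left( \hat f_h(x) - f(x) \right)^2 dx$$
Under weak assumptions on ƒ and K:
$$\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left( \frac{1}{nh} + h^4 \right)$$
where AMISE is the Asymptotic MISE. We have:
$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')$$
where $R(g) = \int g(x)^2\, dx$ for a function g, and $m_2(K) = \int x^2 K(x)\, dx$.
The minimum of this AMISE is the solution of the first-order condition:
$$\frac{\partial}{\partial h} \mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2\, h^3\, R(f'') = 0$$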

Optimal Bandwidth

Hence the optimal bandwidth is:
$$h_{\mathrm{AMISE}} = \frac{R(K)^{1/5}}{m_2(K)^{2/5}\, R(f'')^{1/5}\, n^{1/5}}$$
• Remark: neither the AMISE nor the $h_{\mathrm{AMISE}}$ formula can be used directly, since they involve the unknown density function ƒ or its second derivative ƒ''.
• A variety of automatic, data-based methods have been developed for selecting the bandwidth:
‣ ROT: the Rule of Thumb - Silverman (1986)
‣ Plug-in selectors
‣ Cross-validation selectors
• Remark: Substituting any bandwidth h which has the same asymptotic order $n^{-1/5}$ as $h_{\mathrm{AMISE}}$ into the AMISE gives $\mathrm{AMISE}(h) = O(n^{-4/5})$.
• It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator.
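To see where a rule-of-thumb constant such as 1.06 comes from, the optimal bandwidth can be evaluated in a reference case. The sketch below assumes the standard closed form h^5 = R(K) / (m2(K)² R(ƒ'') n), a Gaussian kernel, and a normal density ƒ with standard deviation σ, for which R(ƒ'') = 3 / (8 √π σ^5). These reference values are assumptions of the sketch, not statements from the talk.

```python
import math

# Gaussian kernel: R(K) = 1 / (2 sqrt(pi)) and m2(K) = 1.
R_K = 1.0 / (2.0 * math.sqrt(math.pi))
m2_K = 1.0

def h_amise_normal(sigma, n):
    """AMISE-optimal bandwidth when f is a normal density with sd sigma."""
    # For a normal density, R(f'') = 3 / (8 sqrt(pi) sigma^5).
    R_f2 = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)
    return (R_K / (m2_K ** 2 * R_f2 * n)) ** 0.2

# With sigma = 1 and n = 1 only the constant remains: (4/3)^(1/5), about 1.06.
constant = h_amise_normal(sigma=1.0, n=1)
```

In this reference case the bandwidth is (4/3)^(1/5) σ n^(-1/5), which is the familiar 1.06 σ n^(-1/5).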

Back to the Proposed Approach

Let's see the influence of αn on the estimation with an example.
‣ 1 million iid observations ∼ 0.5 N(-2,1) + 0.5 N(2,1); we want to estimate the density of the mixture based on the random sample.
‣ αn = 500, 1 000 and 5 000
‣ Gaussian kernel
‣ Bandwidth was selected using the rule of thumb: $h_{rot} = 0.9 \times 1.06 \times \sigma \times n^{-1/5}$, where σ is the population standard deviation
‣ In our simulation, σ is replaced by its robust estimate, the mean of absolute deviation (MAD) of the sub-sample $x_{i1}, \dots, x_{i\alpha_n}$ within each block.
Let's try the small demo program.
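Applied to density estimation, the block procedure averages the per-block density estimates pointwise on a grid. The sketch below is a scaled-down illustration, not the program from the talk: it uses 4 000 observations instead of 1 million, a grid on [-6, 6], and a fixed placeholder bandwidth of 0.4 passed in as `bandwidth_rule` rather than the rule of thumb. All names are hypothetical.

```python
import math
import random

def gaussian_kde_on_grid(sample, h, grid):
    """Gaussian kernel density estimate of `sample` evaluated on `grid`."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
        for x in grid
    ]

def block_averaged_density(data, block_size, bandwidth_rule, grid):
    """Average the per-block density estimates pointwise on the grid."""
    n_blocks = len(data) // block_size
    total = [0.0] * len(grid)
    for i in range(n_blocks):
        block = data[i * block_size:(i + 1) * block_size]
        h = bandwidth_rule(block)   # bandwidth chosen within each block
        for j, value in enumerate(gaussian_kde_on_grid(block, h, grid)):
            total[j] += value
    return [t / n_blocks for t in total]

random.seed(2)  # fixed seed so the run is reproducible
# Scaled-down mixture sample: 0.5 N(-2, 1) + 0.5 N(2, 1).
data = [random.gauss(random.choice((-2.0, 2.0)), 1.0) for _ in range(4_000)]
grid = [-6.0 + 0.25 * j for j in range(49)]
# Placeholder bandwidth rule (assumption): a constant h = 0.4 for every block.
density = block_averaged_density(data, 500, lambda block: 0.4, grid)
```

Passing the bandwidth rule as a function keeps the sketch independent of any particular selector.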

Back to the Proposed Approach

To assess the performance for different αn, we define the Root of Average of Squared Errors (RASE):
$$\mathrm{RASE} = \left\{ \frac{1}{n_{grid}} \sum_{j=1}^{n_{grid}} \left( \hat f(x_j) - f(x_j) \right)^2 \right\}^{1/2}$$
where:
‣ $x_j$ are the grid points at which the density was computed
‣ $n_{grid}$ = 400
We have:
‣ RASE500 = 11 × 10^{-4}
‣ RASE1000 = 9.12 × 10^{-4}
‣ RASE5000 = 5.94 × 10^{-4}
It is seen that the performance becomes better as αn increases.
Remark: The performance of the estimator is not very sensitive to the choice of αn.
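The RASE is straightforward to compute once the estimated and true densities are available on the same grid. The sketch below assumes the usual root-mean-square form, uses a grid of 400 points on [-6, 6] (the range is an assumption), and feeds in a toy "estimate" equal to the true mixture density shifted by 0.001, only to show the call.

```python
import math

def rase(estimated, true_values):
    """Root of the average of squared errors over the grid points."""
    if len(estimated) != len(true_values):
        raise ValueError("both sequences must cover the same grid")
    squared = [(e - t) ** 2 for e, t in zip(estimated, true_values)]
    return math.sqrt(sum(squared) / len(squared))

def mixture_density(x):
    """True density of 0.5 N(-2, 1) + 0.5 N(2, 1)."""
    def phi(z):
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return 0.5 * phi(x + 2.0) + 0.5 * phi(x - 2.0)

grid = [-6.0 + 12.0 * j / 399 for j in range(400)]   # n_grid = 400 points
truth = [mixture_density(x) for x in grid]
shifted = [t + 0.001 for t in truth]   # toy estimate with a constant offset
error = rase(shifted, truth)           # equals the offset, up to rounding
```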

Internet Traffic Data

A Few Things about Packets

In this section, we analyse an internet traffic data set. The original data file includes three fields:
‣ Time of the packet (in seconds)
‣ Direction of the packet
‣ Size of the packet
The variable under study is the throughput, defined by:
$$\text{throughput} = \frac{\text{size of packet in bytes}}{\text{time between two packets}}$$
The data set consists of 8.1 million nonzero throughputs (packet size per second).
‣ αn = 8000 (≈ √n log(log(n)))
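The throughput variable can be derived from the raw records as sketched below. Several points are assumptions of this sketch: each packet's size is paired with the time elapsed since the previous packet, the direction field is ignored, and records with a zero gap or a zero size are skipped. The records shown are hypothetical and not taken from the data set.

```python
def throughputs(packets):
    """Bytes per second between consecutive packets.

    `packets` is a list of (time_in_seconds, size_in_bytes) pairs sorted by
    time. The size of each packet is divided by the time elapsed since the
    previous packet; zero gaps and zero throughputs are skipped.
    """
    result = []
    for (t_prev, _), (t_cur, size) in zip(packets, packets[1:]):
        gap = t_cur - t_prev
        if gap > 0 and size > 0:
            result.append(size / gap)
    return result

# Hypothetical records, for illustration only.
packets = [(0.0, 1500), (0.5, 1000), (0.5, 400), (2.5, 0), (3.0, 600)]
values = throughputs(packets)
```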

[Figure: plot of the estimated density curve of the internet traffic data]

[Figure: estimated quartiles of the internet traffic data]

• The density plot shows that there are 3 typical values of throughput: one close to 0, and two others corresponding to large throughputs.
• A common unit for measuring throughput is the Mbps (megabits per second).
• Here our values are expressed in bytes instead of bits, so we should multiply by 8 for the conversion.
Estimated quartiles after conversion (× 8):
• First quartile: 1.8 Mbps
• Second quartile: 5 Mbps
• Third quartile: 8.3 Mbps
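The unit conversion is a one-line computation. The sketch below assumes throughputs measured in bytes per second and the decimal convention 1 megabit = 10^6 bits; the function name and the input value are hypothetical.

```python
def bytes_per_second_to_mbps(value):
    """Convert bytes per second to megabits per second (1 Mbit = 10**6 bits)."""
    return value * 8 / 1_000_000

# 625 000 bytes per second, an illustrative value.
example = bytes_per_second_to_mbps(625_000)
```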

Conclusion
‣ We have proposed an estimation procedure for large data sets
‣ It significantly reduces the required amount of computing memory
‣ Without loss of efficiency in many situations
‣ It is applicable to both point and density estimation
‣ Asymptotic properties have been studied
‣ Asymptotic normality has been established
‣ A standard error formula has been proposed and empirically tested
‣ Simulation studies and an example of internet data have been presented

THE END
