information technology. The amount of stored data has increased dramatically. For example:
• Barclaycard (UK): 350 million transactions a year
• Wal-Mart: 7 billion transactions a year
• AT&T: 70 billion long-distance calls annually
Such data sets are too large for the primary memory (RAM) of a computer.
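a population F
• We are interested in estimating a parameter $\theta(F)$ of the population.
We rewrite the sample as:
$$\begin{pmatrix} x_{11} & \cdots & x_{1\alpha_n} \\ \vdots & & \vdots \\ x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n} \end{pmatrix}$$
‣ $\alpha_n$: the block size
‣ $\beta_n$: the number of blocks
Note that $n = \alpha_n \beta_n$.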
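the choice of $\alpha_n$
• The choice of $\alpha_n$ will be discussed later.
How to estimate $\theta(F)$: compute an estimate on each block,
‣ Block 1: $\hat\theta_{1n}$
‣ Block i: $\hat\theta_{in}$
‣ Block $\beta_n$: $\hat\theta_{\beta_n n}$
and average the block estimates:
$$\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}$$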
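The averaging scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's software: the function name `block_average_estimate` and the toy data are hypothetical, and the sample is assumed to be split into consecutive blocks of equal size.

```python
from statistics import mean, median
from typing import Callable, Sequence


def block_average_estimate(
    sample: Sequence[float],
    block_size: int,
    estimator: Callable[[Sequence[float]], float],
) -> float:
    """Average a per-block estimator over consecutive blocks of the sample."""
    if block_size <= 0 or len(sample) % block_size != 0:
        raise ValueError("block_size must be positive and divide the sample size")
    n_blocks = len(sample) // block_size  # this is beta_n
    block_estimates = [
        estimator(sample[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return mean(block_estimates)


# Toy data: two blocks of size 3, estimating the median.
data = [1.0, 2.0, 9.0, 4.0, 5.0, 6.0]
print(block_average_estimate(data, block_size=3, estimator=median))
```

For a data set that does not fit in memory, the list slicing would be replaced by reading one block at a time from disk, so that only a single block and the running sum of estimates are held in RAM.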
000 ≈ $\sqrt{n}\,\log(\log(n))$
• $\beta_n = 1000$
N = 1000 data sets of 8 million iid observations ∼ $\chi^2(1)$.
We illustrate our approach by estimating various percentiles of the population.
Let's have a look at the small demo program I made.
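A scaled-down version of this simulation can be sketched as follows. It is only an illustration under stated assumptions: the block size (1000), the number of blocks (100), the seed, and the helper names are all hypothetical choices made so that the script runs in about a second, and the per-block percentile is taken to be a simple order statistic.

```python
import math
import random
from statistics import NormalDist, mean


def sample_quantile(block, p):
    """Empirical p-quantile: the ceil(p*m)-th order statistic of the block."""
    ordered = sorted(block)
    return ordered[max(math.ceil(p * len(ordered)) - 1, 0)]


def block_average_quantile(draw, p, block_size, n_blocks):
    """Average the per-block p-quantile; only one block is held in memory."""
    return mean(
        sample_quantile([draw() for _ in range(block_size)], p)
        for _ in range(n_blocks)
    )


def chi2_1_quantile(p):
    """Exact p-quantile of chi-square(1): the square of a normal quantile."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2


rng = random.Random(0)  # fixed seed, for reproducibility only
draw_chi2_1 = lambda: rng.gauss(0.0, 1.0) ** 2  # square of a standard normal

for p in (0.25, 0.5, 0.75):
    estimate = block_average_quantile(draw_chi2_1, p, block_size=1000, n_blocks=100)
    print(f"p={p}: estimate={estimate:.4f}, true={chi2_1_quantile(p):.4f}")
```

Printing the exact quantile next to each estimate gives a quick check of how close the block average is at this small scale.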
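a population F
• We are interested in estimating a parameter $\theta(F)$ of the population.
We rewrite the sample as:
$$\begin{pmatrix} x_{11} & \cdots & x_{1\alpha_n} \\ \vdots & & \vdots \\ x_{\beta_n 1} & \cdots & x_{\beta_n \alpha_n} \end{pmatrix}$$
‣ $\alpha_n$: the block size
‣ $\beta_n$: the number of blocks
Note that $n = \alpha_n \beta_n$.
The proposed estimate:
$$\bar\theta = \frac{1}{\beta_n} \sum_{i=1}^{\beta_n} \hat\theta_{in}$$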
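βn
(a) if $\hat\theta_{in}$ is affine equivariant, then so is $\bar\theta$;
(b) if $\hat\theta_{in}$ is an unbiased estimator of $\theta$, then so is $\bar\theta$.
Proposition 2. Suppose that $x_1, \dots, x_n$ are iid, and $\alpha_n \to \infty$ and $\beta_n \to \infty$ as $n \to \infty$.
• if the $\hat\theta_{in}$ converge weakly to the true value $\theta$, then so does $\bar\theta$;
• if the $\hat\theta_{in}$ converge in $L^2$ to the true value $\theta$, then so does $\bar\theta$;
• if the $\hat\theta_{in}$ converge strongly to the true value $\theta$, then so does $\bar\theta$.
Denote $\mu_n = E[\hat\theta_{in}]$ and $\sigma_n^2 = \mathrm{var}(\hat\theta_{in})$.
To establish the asymptotic normality of $\bar\theta$, we need one of the following two conditions.
Condition (a): $\alpha_n$ is a constant independent of $n$ and $\sigma_n^2 < \infty$.
Condition (b): $\alpha_n \to \infty$ and $\beta_n \to \infty$ as $n \to \infty$, and
$$\frac{E\,|\hat\theta_{in} - \mu_n|^{2+\delta}}{\beta_n^{\delta/2}\,\sigma_n^{2+\delta}} \longrightarrow 0 \quad \text{as } n \to \infty, \text{ for some } \delta > 0.$$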
condition (a) or (b) holds, then
$$\sqrt{\beta_n}\,\frac{\bar\theta - \mu_n}{\sigma_n} \longrightarrow N(0,1)$$
in distribution as $n \to \infty$.
The proof is based on Lyapunov's condition, which leads to a generalization of the central limit theorem.
Remark: When $\alpha_n$ is a fixed finite number, $\mu_n$ and $\sigma_n^2$ do not depend on $n$ and can be denoted by $\mu$ and $\sigma^2$, and then
$$\sqrt{\beta_n}\,\frac{\bar\theta - \theta}{\sigma} \longrightarrow N(0,1)$$
holds if and only if the $\hat\theta_{in}$ are unbiased estimators of $\theta$.
If $\hat\theta_{in}$ is a biased estimator, the resulting estimator is not consistent, because the bias $\mu - \theta$ is a constant.
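$$\sqrt{\beta_n}\,\frac{\bar\theta - \theta}{\sigma_n} \longrightarrow N(0,1) \qquad (1)$$
holds if and only if $\mu_n - \theta = o(\sigma_n/\sqrt{\beta_n})$.
• If $\hat\theta_{in}$ is unbiased, then $\mu_n - \theta = 0$ and (1) holds.
• In most cases, $\sigma_n = O(1/\sqrt{\alpha_n})$, so $\sigma_n/\sqrt{\beta_n}$ is of order $1/\sqrt{\alpha_n \beta_n} = 1/\sqrt{n}$ ⇒ if $\mu_n - \theta = o(1/\sqrt{n})$ then (1) holds.
• For a biased estimator of $\theta$ in parametric settings, usually we have $\mu_n - \theta = O(1/\alpha_n)$. Thus it is necessary that $\alpha_n/\sqrt{n} \to \infty$.
• In practice, we suggest taking $\alpha_n = O(\sqrt{n}\,\log\log(n))$.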
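To get a feel for the suggested order of the block size, the short sketch below evaluates $\sqrt{n}\,\log\log(n)$ for a few sample sizes. The function name is hypothetical and the constant factor is assumed to be 1, which the $O(\cdot)$ statement does not fix.

```python
import math


def suggested_block_size(n: int) -> float:
    """Block size of order sqrt(n) * log(log(n)), with constant factor 1."""
    return math.sqrt(n) * math.log(math.log(n))


for n in (10**6, 8 * 10**6, 10**9):
    alpha = suggested_block_size(n)
    # alpha / sqrt(n) equals log(log(n)), which grows without bound (slowly).
    print(n, round(alpha), round(alpha / math.sqrt(n), 3))
```

For n = 8 million this gives a block size of roughly 7,800. The ratio $\alpha_n/\sqrt{n} = \log\log(n)$ tends to infinity, as the bias condition requires, but it does so very slowly.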
$\theta = \theta_0$ vs $H_1 : \theta \neq \theta_0$
The test statistic is given by:
$$T = \frac{\bar\theta - \theta_0}{\widehat{SE}(\bar\theta)}$$
and the rejection region is $\{\,|T| > \Phi^{-1}(1 - \alpha/2)\,\}$ with significance level $\alpha$.
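The test above can be sketched as follows. One assumption is made loudly here: the standard error of $\bar\theta$ is estimated by the sample standard deviation of the block estimates divided by $\sqrt{\beta_n}$, which is a natural choice for an average of iid block estimates but is not necessarily the formula used in the presentation. The function name and the toy block estimates are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def block_z_test(block_estimates, theta0, alpha=0.05):
    """Two-sided z-test of H0: theta = theta0 from per-block estimates.

    Assumption: SE(theta_bar) is estimated by the sample standard deviation
    of the block estimates divided by sqrt(beta_n).
    """
    beta_n = len(block_estimates)
    theta_bar = mean(block_estimates)
    se = stdev(block_estimates) / math.sqrt(beta_n)
    t_stat = (theta_bar - theta0) / se
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # Phi^{-1}(1 - alpha/2)
    return t_stat, critical, abs(t_stat) > critical


# Made-up block estimates, for illustration only.
estimates = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
print(block_z_test(estimates, theta0=1.0))  # statistic, critical value, reject?
print(block_z_test(estimates, theta0=2.0))
```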
an iid sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is:
$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
‣ where K(•) is the kernel (a symmetric but not necessarily positive function that integrates to one)
‣ h > 0 is a smoothing parameter called the bandwidth
‣ we'll use the Gaussian kernel: K(x) = ϕ(x), where ϕ is the standard normal density function
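A direct, unoptimized sketch of the kernel density estimator with the Gaussian kernel is given below, using the standard form $\hat f_h(x) = \frac{1}{nh}\sum_i K((x - x_i)/h)$. The function names, the toy sample, and the bandwidth value are illustrative assumptions.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Standard normal density phi(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x: float, sample, h: float) -> float:
    """Kernel density estimate at x: (1/(n*h)) * sum_i K((x - x_i)/h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


# Toy sample and bandwidth, for illustration only.
sample = [-1.2, -0.4, 0.0, 0.3, 1.1]
for x in (-1.0, 0.0, 1.0):
    print(x, round(kde(x, sample, h=0.5), 4))
```

Each evaluation costs one pass over the sample, which is why evaluating the estimator on a grid for millions of observations motivates working block by block.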
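Squared Error): The most common optimality criterion used to select the bandwidth is:
$$\mathrm{MISE}(h) = E \int \left(\hat f_h(x) - f(x)\right)^2 dx$$
Under weak assumptions on ƒ and K:
$$\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left(\frac{1}{nh} + h^4\right)$$
where AMISE is the Asymptotic MISE. We have:
$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')$$
where $R(g) = \int g(x)^2\,dx$ for a function g, and $m_2(K) = \int x^2 K(x)\,dx$.
The minimum of this AMISE is the solution to this differential equation:
$$\frac{\partial}{\partial h}\,\mathrm{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2\, h^3\, R(f'') = 0, \quad \text{that is,} \quad h_{AMISE} = \frac{R(K)^{1/5}}{m_2(K)^{2/5}\, R(f'')^{1/5}\, n^{1/5}}$$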
: neither the AMISE nor the $h_{AMISE}$ formulas can be used directly, since they involve the unknown density function ƒ or its second derivative ƒ''.
• A variety of automatic, data-based methods have been developed for selecting the bandwidth:
‣ ROT: the Rule of Thumb, Silverman (1986)
‣ Plug-in selectors
‣ Cross-validation selectors
• Remark: Substituting any bandwidth h which has the same asymptotic order $n^{-1/5}$ as $h_{AMISE}$ into the AMISE gives $\mathrm{AMISE}(h) = O(n^{-4/5})$.
• It can be shown that, under weak assumptions, there cannot exist a nonparametric estimator that converges at a faster rate than the kernel estimator.
of $\alpha_n$ on the estimation with an example
‣ 1 million iid observations ∼ 0.5 N(−2,1) + 0.5 N(2,1). We want to estimate the density of the mixture based on the random sample.
‣ $\alpha_n$ = 500, 1 000 and 5 000
‣ Gaussian kernel
‣ Bandwidth was selected using the rule of thumb $h_{rot} = 0.9 \times 1.06 \times \sigma \times n^{-1/5}$, where σ is the population standard deviation
‣ In our simulation, σ is replaced by its robust estimate, the mean of absolute deviation (MAD) of the sub-sample $x_{i1}, \dots, x_{i\alpha_n}$ within each block.
Let's try the small demo program.
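A scaled-down sketch of this experiment is shown below: the density is estimated within each block and the block estimates are averaged on a grid. Several choices are assumptions made for illustration: the sample is much smaller than 1 million (20 blocks of 500), the grid is coarse, σ is replaced by the unscaled mean absolute deviation around the block mean, and the n in the rule of thumb is taken to be the block size. All function names are hypothetical.

```python
import math
import random
from statistics import mean


def rot_bandwidth(block):
    """Rule-of-thumb bandwidth 0.9 * 1.06 * sigma_hat * m**(-1/5).

    Assumptions: sigma_hat is the mean absolute deviation of the block
    around its mean, and m is the block size.
    """
    center = mean(block)
    sigma_hat = mean(abs(x - center) for x in block)
    return 0.9 * 1.06 * sigma_hat * len(block) ** (-0.2)


def block_kde(grid, block):
    """Gaussian kernel density estimate of one block on the grid points."""
    h = rot_bandwidth(block)
    norm = len(block) * h * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in block) / norm
        for g in grid
    ]


def block_average_kde(grid, draw, block_size, n_blocks):
    """Average the per-block density estimates, one block at a time."""
    total = [0.0] * len(grid)
    for _ in range(n_blocks):
        block = [draw() for _ in range(block_size)]
        total = [t + d for t, d in zip(total, block_kde(grid, block))]
    return [t / n_blocks for t in total]


rng = random.Random(0)  # fixed seed, for reproducibility only
# One draw from 0.5 N(-2,1) + 0.5 N(2,1): pick a component, then sample it.
draw_mixture = lambda: rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 1.0)

grid = [-6.0 + 0.5 * k for k in range(25)]  # coarse grid on [-6, 6]
density = block_average_kde(grid, draw_mixture, block_size=500, n_blocks=20)
for g, d in zip(grid, density):
    print(f"{g:5.1f}  {d:.4f}")
```

Only the running sum over the grid and the current block are kept in memory, so the memory footprint does not grow with the number of blocks.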
Root of Average of Squared Errors (RASE):
$$\mathrm{RASE} = \sqrt{\frac{1}{n_{grid}} \sum_{j=1}^{n_{grid}} \left(\hat f(x_j) - f(x_j)\right)^2}$$
where:
‣ $x_j$ are the grid points at which the density was computed
‣ $n_{grid} = 400$
We have:
‣ $\mathrm{RASE}_{500} = 11 \times 10^{-4}$
‣ $\mathrm{RASE}_{1000} = 9.12 \times 10^{-4}$
‣ $\mathrm{RASE}_{5000} = 5.94 \times 10^{-4}$
The performance improves as $\alpha_n$ increases.
Remark: The performance of the estimator is not very sensitive to the choice of $\alpha_n$.
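The RASE criterion can be computed as in the sketch below. The grid range [−6, 6] and the "estimate" (the true density shifted by a constant) are made up purely to exercise the function; they are not the presenter's settings or results.

```python
import math


def rase(estimated, true):
    """Root of the average of squared errors over the grid points."""
    if len(estimated) != len(true) or not estimated:
        raise ValueError("both sequences must be non-empty and of equal length")
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(estimated)
    )


def mixture_density(x):
    """True density of 0.5 N(-2,1) + 0.5 N(2,1)."""
    phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return 0.5 * phi(x + 2.0) + 0.5 * phi(x - 2.0)


grid = [-6.0 + 12.0 * j / 399 for j in range(400)]  # n_grid = 400 points
true_values = [mixture_density(x) for x in grid]
shifted = [v + 0.001 for v in true_values]  # a made-up estimate, off by 0.001
print(rase(shifted, true_values))
```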
original data file includes three fields:
‣ Time of the packet (in seconds)
‣ Direction of the packet
‣ Size of the packet
The variable under study is the throughput, defined by:
$$\text{throughput} = \frac{\text{size of packet in bytes}}{\text{time between two packets}}$$
The data set consists of 8.1 million nonzero throughputs (packet size per second)
‣ $\alpha_n = 8000$ ($\approx \sqrt{n}\,\log(\log(n))$)
values of throughput
One is close to 0 and the other two are large.
A common unit for measuring throughput is mbps (megabits per second).
• Here we have expressed our values in bytes instead of bits. We should multiply by 8 for the conversion.
• First Quartile: 1.8 mbps
• Second Quartile: 5 mbps
• Third Quartile: 8.3 mbps
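The throughput definition and the byte-to-bit conversion can be sketched as follows. Two assumptions are made: each time gap is paired with the size of the later packet of the pair, and a megabit is taken as 10^6 bits. The packet records are made up and the function names are hypothetical.

```python
def throughputs_bytes_per_second(timestamps, sizes):
    """Throughput of each packet: its size divided by the time since the
    previous packet. Packets with a zero or negative gap are skipped."""
    result = []
    for previous, current, size in zip(timestamps, timestamps[1:], sizes[1:]):
        gap = current - previous
        if gap > 0:
            result.append(size / gap)
    return result


def bytes_per_second_to_mbps(value):
    """Convert bytes/s to megabits/s: times 8 bits per byte, divided by 10**6."""
    return value * 8 / 1_000_000


# Made-up packet records (time in seconds, size in bytes), for illustration only.
times = [0.000, 0.002, 0.004, 0.010]
sizes = [1500, 1500, 500, 1500]
for value in throughputs_bytes_per_second(times, sizes):
    print(round(bytes_per_second_to_mbps(value), 3), "mbps")
```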
data sets
• Significantly reduces the required amount of computing memory
• Without loss of efficiency in many situations
• It is applicable to both point and density estimation
• Asymptotic properties have been studied
• Asymptotic normality has been established
• A standard error formula has been proposed and empirically tested
• Simulation studies and an example of internet data have been presented