Count-Min Sketch: 10 Years Later

Muthu (S. Muthukrishnan)

Puzzle: The NM Game
- Circular table with seats numbered $1, \ldots, n$ arranged around it; player N starts at seat 1.
- In each round:
  - Player N calls how many places he would like to move.
  - Player M determines whether N will move clockwise or counterclockwise.
[Figure: circular table with seats labeled 1, 2, ..., n]

NM Game: Contd
- Problem: What is $f(n)$, the largest number of seats player N can sit in?
  - Assume full knowledge of the table, infinitely smart players, infinite time.
- $f(2) = 2$; $f(3) \ne 3$; $f(4) = 4$; ...
- Z. Nedev and S. Muthukrishnan: Theor. Comput. Sci. 393(1-3): 124-132 (2008).
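The small values of $f(n)$ on the slide can be checked by brute force. The following is only a sketch and is not taken from the talk or the cited paper: it assumes seats are 0-indexed, that N may call any distance from 1 to n-1, and that the starting seat counts as visited. The name `nm_game_value` is hypothetical.

```python
from functools import lru_cache


def nm_game_value(n):
    """Number of seats N can guarantee to visit on an n-seat table (n >= 2)."""

    @lru_cache(maxsize=None)
    def solve(visited):
        # visited is a bitmask of seats already sat in.
        # Returns value[p] for each seat p: the final seat count N can force
        # when standing on p (only meaningful when p is in visited).
        size = bin(visited).count("1")
        value = [size] * n  # N can always settle for what is already visited

        def after(q):
            # Outcome once N lands on seat q.
            if (visited >> q) & 1:
                return value[q]
            return solve(visited | (1 << q))[q]

        changed = True
        while changed:  # least fixed point: N must force progress in finitely many moves
            changed = False
            for p in range(n):
                if not (visited >> p) & 1:
                    continue
                # N picks the distance k, then M picks the worse direction for N.
                best = max(
                    min(after((p + k) % n), after((p - k) % n))
                    for k in range(1, n)
                )
                if best > value[p]:
                    value[p] = best
                    changed = True
        return tuple(value)

    return solve(1)[0]  # N starts on seat 0 with only that seat visited


for n in range(2, 7):
    print(n, nm_game_value(n))
```

The search is exponential in n, so it is only useful for very small tables.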

Big Data Examples
- Examples of data: cellphone call logs, web logs, internet packets.

Big Data Summary
- Examples of data: cellphone call logs, web logs, internet packets.
- At least 3 distinct algorithmic theories:
  - Sequential algorithms on 1 machine (cellphone log data).
  - Streaming algorithms (internet packet data).
  - MapReduce algorithms on 10k's of machines (web data).
- Many other examples.
- Non-examples of massive data: Paul Erdős.

A Basic Problem in Streams: Indexing
- Imagine a virtual array $F[1 \cdots n]$.
- Updates: $F[i]{+}{+}$, $F[i]{-}{-}$.
- Assume $F[i] \ge 0$ at all times.
- Query: $F[i] = ?$
- Key: Use $o(n)$ space, maybe $O(\log n)$ space.

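Count-Min Sketch
- For each update $F[i]{+}{+}$,
  - for each $j = 1, \ldots, \log(1/\delta)$, update $cm[h_j(i)]{+}{+}$.
- Estimate $\tilde{F}[i] = \min_{j = 1, \ldots, \log(1/\delta)} cm[h_j(i)]$.
[Figure: the cm array, with rows 1, 2, ..., log(1/δ) and columns 1, 2, ..., e/ε; an update F[i]++ adds +1 to cell h_j(i) in each row j.]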
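The update and estimate rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes integer keys, 0 < delta < 1, and hash functions of the form ((a*x + b) mod p) mod width, which is one common pairwise-independent family. The class and method names are hypothetical.

```python
import math
import random


class CountMinSketch:
    """Minimal Count-Min sketch: ceil(ln(1/delta)) rows, ceil(e/eps) columns."""

    PRIME = (1 << 61) - 1  # assumption: keys are integers smaller than this prime

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        rng = random.Random(seed)
        # One hash function h_j per row, described by a pair (a, b).
        self.coeffs = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(self.depth)
        ]
        self.cm = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, j, i):
        a, b = self.coeffs[j]
        return ((a * i + b) % self.PRIME) % self.width

    def update(self, i, c=1):
        # F[i] += c: add c to one counter in every row.
        for j in range(self.depth):
            self.cm[j][self._bucket(j, i)] += c

    def estimate(self, i):
        # Minimum over rows; collisions can only inflate a counter when F >= 0.
        return min(self.cm[j][self._bucket(j, i)] for j in range(self.depth))


sketch = CountMinSketch(eps=0.01, delta=0.01)
for item in [7, 7, 7, 42, 7, 13]:
    sketch.update(item)
print(sketch.estimate(7))  # never below the true count of 4
```

Taking the minimum rather than an average is what makes the estimate one-sided: every row overcounts, so the smallest row is the least contaminated.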

Count-Min Sketch
- Claim: $F[i] \le \tilde{F}[i]$.
- Claim: With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{k \ne i} F[k]$.
- Space used is $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$.
- Time per update is $O(\log \frac{1}{\delta})$. Independent of $n$.
- G. Cormode and S. Muthukrishnan: An improved data stream summary: count-min sketch and its applications. Journal of Algorithms, 55(1): 58-75 (2005).
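To make the space bound concrete, the sketch dimensions can be computed directly from the two parameters. The helper name `cm_dimensions` is hypothetical, and the rounding (ceilings, natural logarithm) is an assumption about how the asymptotic bound would be instantiated.

```python
import math


def cm_dimensions(eps, delta):
    """Return (columns, rows) for a Count-Min sketch with error eps, failure prob delta."""
    width = math.ceil(math.e / eps)          # columns per row
    depth = math.ceil(math.log(1.0 / delta))  # number of rows / hash functions
    return width, depth


for eps, delta in [(0.01, 0.01), (0.001, 0.001)]:
    width, depth = cm_dimensions(eps, delta)
    print(f"eps={eps} delta={delta}: {width} x {depth} = {width * depth} counters")
```

Neither dimension depends on the universe size n, which is the point of the last bullet.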

Count-Min Sketch: The Proof
- Claim: With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{k \ne i} F[k]$.
- $X_{i,j}$ is the contribution of the other items $F[k]$, $k \ne i$, to the bucket containing $i$ under hash $h_j$. For any $h_j$, $E(X_{i,j}) \le \frac{\varepsilon}{e} \sum_{k \ne i} F[k]$.
- Consider $\Pr(\tilde{F}[i] > F[i] + \varepsilon \sum_{k \ne i} F[k])$:
  $\Pr(\cdot) = \Pr(\forall j:\ F[i] + X_{i,j} > F[i] + \varepsilon \sum_{k \ne i} F[k]) \le \Pr(\forall j:\ X_{i,j} \ge e\,E(X_{i,j})) \le e^{-\log(1/\delta)} = \delta$
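The slide skips the two steps that make the bound work. A hedged reconstruction, assuming each $h_j$ is pairwise independent with range $e/\varepsilon$, the rows use independent hash functions, and writing $S = \sum_{k \ne i} F[k] > 0$ (the symbol $S$ is introduced here only for brevity):

```latex
\begin{align*}
E(X_{i,j}) &= \sum_{k \ne i} \Pr\bigl(h_j(k) = h_j(i)\bigr)\, F[k]
            \;\le\; \frac{\varepsilon}{e} \sum_{k \ne i} F[k] = \frac{\varepsilon}{e}\, S
            && \text{(collision probability at most } \varepsilon/e\text{)} \\
\Pr\bigl(X_{i,j} > \varepsilon S\bigr) &\le \frac{E(X_{i,j})}{\varepsilon S} \;\le\; \frac{1}{e}
            && \text{(Markov's inequality, since } X_{i,j} \ge 0\text{)} \\
\Pr\bigl(\forall j:\ X_{i,j} > \varepsilon S\bigr) &\le \Bigl(\frac{1}{e}\Bigr)^{\log(1/\delta)} = \delta
            && \text{(independence across the } \log(1/\delta) \text{ rows)}
\end{align*}
```

The estimate is the minimum over rows, so it exceeds $F[i] + \varepsilon S$ only if every row does, which is the event bounded in the last line.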

Improve Count-Min Sketch?
- Index Problem:
  - ALICE has an $n$-bit string and sends messages to BOB, who wishes to compute the $i$th bit.
  - Needs $\Omega(n)$ bits of communication.
- Reduction of estimating $F[i]$ in the data stream model:
  - Bitstring $I[1 \cdots 1/(2\varepsilon)]$ such that
    - $I[i] = 1 \rightarrow F[i] = 2$
    - $I[i] = 0 \rightarrow F[i] = 0$; $F[0] \leftarrow F[0] + 2$
  - Observe that $\|F\| = \sum_i F[i] = 1/\varepsilon$.
  - Estimating $F[i] \le \tilde{F}[i] \le F[i] + \varepsilon \|F\|$ implies
    - $I[i] = 0 \rightarrow F[i] = 0 \rightarrow 0 \le \tilde{F}[i] \le 1$
    - $I[i] = 1 \rightarrow F[i] = 2 \rightarrow 2 \le \tilde{F}[i] \le 3$
    and reveals $I[i]$.
  - Therefore, an $\Omega(1/\varepsilon)$ space lower bound follows from the index problem.
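The reduction can be simulated end to end. In the sketch below, `estimator_with_guarantee` is a stand-in for any summary that satisfies the stated error bound (it simply adds random one-sided noise to the true value); it is not a Count-Min sketch, and all function names are hypothetical.

```python
import random


def alice_encode(bits):
    """Turn a bitstring I[1..m] into frequencies F[0..m] with total mass 2m."""
    freq = [0] * (len(bits) + 1)
    for i, bit in enumerate(bits, start=1):
        if bit:
            freq[i] = 2
        else:
            freq[0] += 2  # padding slot keeps ||F|| fixed at 2m = 1/eps
    return freq


def estimator_with_guarantee(freq, i, eps, rng):
    """Any value in [F[i], F[i] + eps * ||F||]; here chosen at random."""
    slack = eps * sum(freq)
    return freq[i] + rng.uniform(0.0, slack)


def bob_decode(estimate):
    """Estimates in [0, 1] mean bit 0; estimates in [2, 3] mean bit 1."""
    return 1 if estimate > 1.5 else 0


rng = random.Random(0)
m = 50
eps = 1 / (2 * m)  # so the bitstring has length 1/(2*eps)
bits = [rng.randint(0, 1) for _ in range(m)]
freq = alice_encode(bits)
decoded = [
    bob_decode(estimator_with_guarantee(freq, i, eps, rng))
    for i in range(1, m + 1)
]
print(decoded == bits)
```

Because every bit is recovered, whatever summary backs the estimator must carry enough information to answer the index problem on 1/(2ε) bits.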

Count-Min Sketch, The Challenges
- Not all projections, dimensionality reduction are the same:
  - All prior work: $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss.
- Not all hashing algorithms are the same:
  - Pairwise independence.
- Not all approximations are sampling.
  - Recovering $F[i]$ to $\pm 0.1\,|F|$ accuracy will retrieve each item precisely.

Using Count-Min Sketch
- For each $i$, determine $\tilde{F}[i]$.
- Keep the set $S$ of heavy hitters ($\tilde{F}[i] \ge 2\varepsilon \|F\|$).
  - Guaranteed that $S$ contains $i$ such that $F[i] \ge 2\varepsilon \|F\|$ and no $F[i] \le \varepsilon \|F\|$.
  - Extra $\log n$ factor for answering $n$ queries.
- Problem is of database interest.
- Faster recovery: Hash into buckets such that in each bucket, recover majority $i$ ($F[i] > \sum_{j \text{ in same bucket as } i} F[j]/2$).
  - Takes $O(\log n)$ extra time, space.
  - Gives compressed sensing in $L_1$: $\|F - \tilde{F}_k\|_1 \le \|F - F^*_k\|_1 + \varepsilon \|F\|_1$
- Sparse recovery experiments: http://groups.csail.mit.edu/toc/sparse/wiki/index.php?title=Sparse_Recovery_Experiments
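The simple heavy-hitter recipe (query every item, keep those whose estimate clears the threshold) can be sketched as follows. This is an illustrative sketch, not the faster bucket-and-majority recovery: it assumes an insert-only stream of integer keys, a small known universe, and the same hypothetical hash family ((a*x + b) mod p) mod width used for illustration; `build_estimator` and `heavy_hitters` are made-up names.

```python
import math
import random


def build_estimator(stream, eps, delta, seed=1):
    """Feed the stream into a Count-Min table and return its estimate function."""
    width = math.ceil(math.e / eps)
    depth = math.ceil(math.log(1.0 / delta))
    prime = (1 << 61) - 1
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(depth)]
    cm = [[0] * width for _ in range(depth)]
    for i in stream:
        for row, (a, b) in zip(cm, hashes):
            row[(a * i + b) % prime % width] += 1

    def estimate(i):
        return min(row[(a * i + b) % prime % width] for row, (a, b) in zip(cm, hashes))

    return estimate


def heavy_hitters(stream, universe, eps, delta):
    """Items whose estimated count is at least 2 * eps * ||F||."""
    estimate = build_estimator(stream, eps, delta)
    total = len(stream)  # ||F|| for an insert-only stream
    return {i for i in universe if estimate(i) >= 2 * eps * total}


rng = random.Random(0)
stream = [3] * 300 + [8] * 200 + [rng.randrange(100) for _ in range(500)]
rng.shuffle(stream)
found = heavy_hitters(stream, range(100), eps=0.05, delta=0.01)
print(sorted(found))
```

Since estimates never undercount, every item with true count at least 2ε‖F‖ (here items 3 and 8, against a threshold of 100) is certain to be reported; false positives are the only possible error.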

Count-Min Sketch: Summary
- Solves many problems:
  - Heavy hitters, compressed sensing, inner products, quantiles, least squares regression, low rank matrix approximation, ...
- Applications to other CS/EE areas:
  - NLP, ML, password checking, secure privacy, ...
- Systems, code, hardware:
  - Gigascope, CMON, Sawzall, MillWheel, ...
- Wiki: http://sites.google.com/site/countminsketch/

How is CM Sketch Used in Systems?
- Linearity: $\widehat{(F + G)}[i] = \min_{j = 1, \ldots, \log(1/\delta)} \bigl( cm_F[h_j(i)] + cm_G[h_j(i)] \bigr)$
  - Good estimate since $cm_{F+G} = cm_F + cm_G$.
- GS (Gigascope) at AT&T (pure DSMS system):
  - Ex: For each src IP, find heavy hitter destination IP.
  - Two-level arch: fast lightweight low level; high level expensive.
  - Parallelize by hashing on distinct groupbys, heartbeats, load shedding.
  - http://www.corp.att.com/attlabs/docs/att_gigascope_factsheet_071405.pdf
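Linearity is what lets sketches built on separate machines be combined. The sketch below illustrates it under the assumption that both sketches share the same dimensions and hash functions (here enforced through a shared seed); the class, its `merge` method, and the hash family are hypothetical, not an API from any of the systems named above.

```python
import random


class CountMinSketch:
    PRIME = (1 << 61) - 1  # assumption: integer keys below this prime

    def __init__(self, width, depth, seed=0):
        self.width, self.depth, self.seed = width, depth, seed
        rng = random.Random(seed)
        self.hashes = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]
        self.cm = [[0] * width for _ in range(depth)]

    def _col(self, a, b, i):
        return (a * i + b) % self.PRIME % self.width

    def update(self, i, c=1):
        for row, (a, b) in zip(self.cm, self.hashes):
            row[self._col(a, b, i)] += c

    def estimate(self, i):
        return min(row[self._col(a, b, i)] for row, (a, b) in zip(self.cm, self.hashes))

    def merge(self, other):
        """Sketch of F + G: add the two counter tables cell by cell."""
        if (self.width, self.depth, self.seed) != (other.width, other.depth, other.seed):
            raise ValueError("sketches must share dimensions and hash functions")
        merged = CountMinSketch(self.width, self.depth, self.seed)
        merged.cm = [
            [x + y for x, y in zip(r1, r2)] for r1, r2 in zip(self.cm, other.cm)
        ]
        return merged


stream_f = [10, 10, 20, 30]
stream_g = [10, 40, 40]

sketch_f = CountMinSketch(width=64, depth=4)
sketch_g = CountMinSketch(width=64, depth=4)
direct = CountMinSketch(width=64, depth=4)
for x in stream_f:
    sketch_f.update(x)
    direct.update(x)
for x in stream_g:
    sketch_g.update(x)
    direct.update(x)

merged = sketch_f.merge(sketch_g)
print(merged.cm == direct.cm)
```

The merged table is identical to the table built from the concatenated stream, so merging loses nothing relative to sketching all the data in one place.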

Algorithms in the Field
- Algorithms research:
  - Theory
  - Applied
  - Field
- GS Application:
  - High speed memory is expensive, ns update times.
  - Large universe.
  - $1/\varepsilon^2$ space is prohibitive.
- Extensions: skipping over stream (CMON at Sprint), distributed (Sawzall), ...

Finale
- I didn't talk about: sketch of sketches; graph, geometric, probabilistic streams; distributed, continual streaming.
- Big Data Program at Simons Theory Center at Berkeley

Timon Karnezos
