Slide 1

Count-Min Sketch: 10 Years Later
Muthu

Slide 2

Puzzle: The NM Game
- Circular table with seats numbered 1, ..., n arranged around it; player N starts at seat 1.
- In each round:
  - Player N calls out how many places he would like to move.
  - Player M decides whether N moves clockwise or counterclockwise.
[Figure: circular table with seats 1, 2, ..., n]

Slide 3

NM Game: Contd
- Problem: What is f(n), the largest number of seats player N can sit in?
- Both players have full knowledge of the table, are infinitely smart, and have infinite time.

Slide 4

NM Game: Contd
- Problem: What is f(n), the largest number of seats player N can sit in?
- Both players have full knowledge of the table, are infinitely smart, and have infinite time.
- f(2) = 2, f(3) ≠ 3, f(4) = 4, ...

Slide 5

NM Game: Contd
- Problem: What is f(n), the largest number of seats player N can sit in?
- Both players have full knowledge of the table, are infinitely smart, and have infinite time.
- f(2) = 2, f(3) ≠ 3, f(4) = 4, ...
Z. Nedev and S. Muthukrishnan: Theor. Comput. Sci. 393(1-3): 124-132 (2008).

Slide 6

Big Data Examples
- Examples of data: cellphone call logs, web logs, internet packets.

Slide 7

Big Data Summary
- Examples of data: cellphone call logs, web logs, internet packets.
- At least 3 distinct algorithmic theories:
  - Sequential algorithms on 1 machine (cellphone log data).
  - Streaming algorithms (internet packet data).
  - MapReduce algorithms on 10k's of machines (web data).
- Many other examples.
- Non-examples of massive data: Paul Erdős.

Slide 8

A Basic Problem in Streams: Indexing
- Imagine a virtual array F[1 · · · n].
- Updates: F[i]++, F[i]--.
- Assume F[i] ≥ 0 at all times.
- Query: F[i] = ?
- Key: use o(n) space, maybe O(log n) space.

Slide 9

Count-Min Sketch
- For each update F[i]++:
  - for each j = 1, ..., log(1/δ), update cm[j, h_j(i)]++.
- Estimate F̃[i] = min_{j=1,...,log(1/δ)} cm[j, h_j(i)].
[Figure: the cm array, log(1/δ) rows by e/ε columns; an update F[i]++ adds +1 to bucket h_j(i) in every row j.]
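The update and estimate rules above fit in a few lines of Python. This is a minimal illustrative sketch, not the authors' code; the md5-seeded hash is an assumption standing in for the pairwise-independent hash family the analysis requires.

```python
import hashlib
import math

class CountMinSketch:
    """Minimal Count-Min sketch: d = log(1/delta) rows of w = e/eps counters."""

    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)            # buckets per row
        self.d = math.ceil(math.log(1.0 / delta))   # rows: one hash h_j per row
        self.cm = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, i):
        # Deterministic seeded hash standing in for a pairwise-independent h_j
        digest = hashlib.md5(f"{j}:{i}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def update(self, i, c=1):
        # F[i] += c: add c to bucket h_j(i) in every row j
        for j in range(self.d):
            self.cm[j][self._h(j, i)] += c

    def estimate(self, i):
        # F~[i] = min_j cm[j, h_j(i)]; never an underestimate
        return min(self.cm[j][self._h(j, i)] for j in range(self.d))
```

With eps = 0.01 and delta = 0.01 this is 5 rows of 272 counters, independent of the universe size n.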

Slide 10

Count-Min Sketch
- Claim: F[i] ≤ F̃[i].
- Claim: With probability at least 1 − δ,
  F̃[i] ≤ F[i] + ε Σ_{k ≠ i} F[k].
- Space used is O((1/ε) log(1/δ)).
- Time per update is O(log(1/δ)). Independent of n.
G. Cormode and S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1): 58-75 (2005).

Slide 11

Count-Min Sketch: The Proof
- With probability at least 1 − δ,
  F̃[i] ≤ F[i] + ε Σ_{k ≠ i} F[k].

Slide 12

Count-Min Sketch: The Proof
- With probability at least 1 − δ,
  F̃[i] ≤ F[i] + ε Σ_{k ≠ i} F[k].
- X_{i,j} is the contribution of the items other than i to the bucket h_j(i) containing i in row j. With e/ε buckets per row,
  E(X_{i,j}) = (ε/e) Σ_{k ≠ i} F[k].

Slide 13

Count-Min Sketch: The Proof
- With probability at least 1 − δ,
  F̃[i] ≤ F[i] + ε Σ_{k ≠ i} F[k].
- X_{i,j} is the contribution of the items other than i to the bucket h_j(i) containing i in row j. With e/ε buckets per row,
  E(X_{i,j}) = (ε/e) Σ_{k ≠ i} F[k].
- Consider Pr(F̃[i] > F[i] + ε Σ_{k ≠ i} F[k]):
  Pr(·) = Pr(∀j: F[i] + X_{i,j} > F[i] + ε Σ_{k ≠ i} F[k])
        = Pr(∀j: X_{i,j} > e · E(X_{i,j}))
        < e^{−log(1/δ)} = δ.
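The last step compresses a Markov-inequality argument; spelled out (a reconstruction of the standard argument, in the slide's notation):

```latex
\Pr\big(\forall j:\ X_{i,j} > e\,\mathbb{E}[X_{i,j}]\big)
  \;=\; \prod_{j=1}^{\log(1/\delta)} \Pr\big(X_{i,j} > e\,\mathbb{E}[X_{i,j}]\big)
  \;\le\; \Big(\tfrac{1}{e}\Big)^{\log(1/\delta)}
  \;=\; \delta .
```

The equality uses independence of the hash functions across rows; each factor is bounded by Markov's inequality, Pr(X ≥ e·E[X]) ≤ 1/e for nonnegative X.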

Slide 14

Improve Count-Min Sketch?
- Index Problem:
  - ALICE has an n-bit string and sends messages to BOB, who wishes to compute the ith bit.
  - Needs Ω(n) bits of communication.
- Reduction to estimating F[i] in the data stream model. Given I[1 · · · 1/(2ε)], define:
  - I[i] = 1 → F[i] = 2
  - I[i] = 0 → F[i] = 0, and F[0] ← F[0] + 2
- Observe that ||F||₁ = Σ_i F[i] = 1/ε.

Slide 15

Improve Count-Min Sketch?
- Index Problem:
  - ALICE has an n-bit string and sends messages to BOB, who wishes to compute the ith bit.
  - Needs Ω(n) bits of communication.
- Reduction to estimating F[i] in the data stream model. Given I[1 · · · 1/(2ε)], define:
  - I[i] = 1 → F[i] = 2
  - I[i] = 0 → F[i] = 0, and F[0] ← F[0] + 2
- Observe that ||F||₁ = Σ_i F[i] = 1/ε.
- Estimating F̃[i] with F[i] ≤ F̃[i] ≤ F[i] + ε||F||₁ implies:
  - I[i] = 0 → F[i] = 0 → 0 ≤ F̃[i] ≤ 1
  - I[i] = 1 → F[i] = 2 → 2 ≤ F̃[i] ≤ 3
  and so reveals I[i].
- Therefore, an Ω(1/ε) space lower bound, from the Index problem.

Slide 16

Count-Min Sketch, The Challenges
- Not all projections / dimensionality reductions are the same:
  - All prior work used Ω(1/ε²) space, via Johnson-Lindenstrauss.

Slide 17

Count-Min Sketch, The Challenges
- Not all projections / dimensionality reductions are the same:
  - All prior work used Ω(1/ε²) space, via Johnson-Lindenstrauss.
- Not all hashing algorithms are the same:
  - Pairwise independence.

Slide 18

Count-Min Sketch, The Challenges
- Not all projections / dimensionality reductions are the same:
  - All prior work used Ω(1/ε²) space, via Johnson-Lindenstrauss.
- Not all hashing algorithms are the same:
  - Pairwise independence.
- Not all approximation is sampling.
  - Recovering F[i] to ±0.1||F|| accuracy will retrieve each item precisely.

Slide 19

Using Count-Min Sketch
- For each i, determine F̃[i].
- Keep the set S of heavy hitters (F̃[i] ≥ 2ε||F||).
- Guaranteed that S contains every i with F[i] ≥ 2ε||F|| and no i with F[i] ≤ ε||F||.
- Extra log n factor for answering n queries. The problem is of database interest.

Slide 20

Using Count-Min Sketch
- For each i, determine F̃[i].
- Keep the set S of heavy hitters (F̃[i] ≥ 2ε||F||).
- Guaranteed that S contains every i with F[i] ≥ 2ε||F|| and no i with F[i] ≤ ε||F||.
- Extra log n factor for answering n queries. The problem is of database interest.
- Faster recovery: hash into buckets such that in each bucket, the majority item i (F[i] > Σ_{j in the same bucket as i} F[j] / 2) can be recovered.
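The per-bucket majority condition above can be illustrated with a Boyer-Moore majority vote. This is a hypothetical illustration of the condition only, not the paper's recovery construction (which works from counters, not from the bucket's raw items):

```python
def majority_candidate(items):
    """Boyer-Moore vote: returns the majority element of `items` if one
    exists (count > len(items)/2), in any order; caller must verify."""
    cand, count = None, 0
    for x in items:
        if count == 0:
            cand, count = x, 1
        elif x == cand:
            count += 1
        else:
            count -= 1
    return cand
```

If item i carries more than half the frequency mass of its bucket, a single pass over the bucket recovers it regardless of arrival order.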

Slide 21

Using Count-Min Sketch
- For each i, determine F̃[i].
- Keep the set S of heavy hitters (F̃[i] ≥ 2ε||F||).
- Guaranteed that S contains every i with F[i] ≥ 2ε||F|| and no i with F[i] ≤ ε||F||.
- Extra log n factor for answering n queries. The problem is of database interest.
- Faster recovery: hash into buckets such that in each bucket, the majority item i (F[i] > Σ_{j in the same bucket as i} F[j] / 2) can be recovered.
- Takes O(log n) extra time and space.
- Gives compressed sensing in L1:
  ||F − F̃_k||₁ ≤ ||F − F*_k||₁ + ε||F||₁,
  where F*_k is the best k-term approximation to F.
Sparse recovery experiments: http://groups.csail.mit.edu/toc/sparse/wiki/index.php?title=Sparse_Recovery_Experiments
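The heavy-hitter scan can be sketched as follows. This is a hypothetical illustration: it brute-forces a known candidate set rather than using the O(log n) recovery, and the md5-seeded hash is an assumption standing in for a pairwise-independent family.

```python
import hashlib
import math

def h(j, i, w):
    # Deterministic seeded hash standing in for a pairwise-independent h_j
    return int(hashlib.md5(f"{j}:{i}".encode()).hexdigest(), 16) % w

def build_sketch(stream, eps, delta):
    """Count-Min sketch over an iterable of items: d rows of w counters."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1.0 / delta))
    cm = [[0] * w for _ in range(d)]
    for item in stream:
        for j in range(d):
            cm[j][h(j, item, w)] += 1
    return cm

def estimate(cm, i):
    w = len(cm[0])
    return min(row[h(j, i, w)] for j, row in enumerate(cm))

def heavy_hitters(cm, candidates, total, eps):
    # Keep S = { i : F~[i] >= 2*eps*||F||_1 }.  Every i with
    # F[i] >= 2*eps*||F|| lands in S (estimates never undercount);
    # items with F[i] <= eps*||F|| are excluded with high probability.
    return {i for i in candidates if estimate(cm, i) >= 2 * eps * total}
```
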

Slide 22

Count-Min Sketch: Summary
- Solves many problems:
  - Heavy hitters, compressed sensing, inner products, quantiles, least squares regression, low rank matrix approximation, ...
- Applications to other CS/EE areas:
  - NLP, ML, password checking, secure privacy, ...
- Systems, code, hardware:
  - Gigascope, CMON, Sawzall, MillWheel, ...
Wiki: http://sites.google.com/site/countminsketch/

Slide 23

How is CM Sketch Used in Systems?
- Linearity: the estimate of (F + G)[i] is min_{j=1,...,log(1/δ)} (cm_F[h_j(i)] + cm_G[h_j(i)]).
- Good estimate, since cm_{F+G} = cm_F + cm_G.
GS at AT&T (pure DSMS system):
- Ex: For each src IP, find heavy-hitter destination IPs.
- Two-level architecture: fast, lightweight low level; expensive high level.
- Parallelize by hashing on distinct group-bys, heartbeats, load shedding.
- http://www.corp.att.com/attlabs/docs/att_gigascope_factsheet_071405.pdf
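The linearity property can be checked directly: two sketches built with the same dimensions and the same hash functions add entrywise into the sketch of the combined stream. An illustrative sketch (assumed md5-seeded hashes; not Gigascope code):

```python
import hashlib
import math

# Shared dimensions and hashes: sketches can only be added if they were
# built with identical w, d, and hash functions.
EPS, DELTA = 0.01, 0.01
W = math.ceil(math.e / EPS)
D = math.ceil(math.log(1.0 / DELTA))

def h(j, i):
    # Deterministic seeded hash standing in for a shared pairwise-independent h_j
    return int(hashlib.md5(f"{j}:{i}".encode()).hexdigest(), 16) % W

def sketch(stream):
    cm = [[0] * W for _ in range(D)]
    for item in stream:
        for j in range(D):
            cm[j][h(j, item)] += 1
    return cm

def add(cm_f, cm_g):
    # Linearity: cm_{F+G} = cm_F + cm_G, entrywise
    return [[a + b for a, b in zip(rf, rg)] for rf, rg in zip(cm_f, cm_g)]

def estimate(cm, i):
    return min(cm[j][h(j, i)] for j in range(D))
```

Merging sketches built at two monitoring points thus yields exactly the sketch of the combined traffic, which is what makes distributed deployments work.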

Slide 24

Algorithms in the Field
Algorithms research:
- Theory
- Applied
- Field

Slide 25

Algorithms in the Field
Algorithms research:
- Theory
- Applied
- Field
GS Application:
- High-speed memory is expensive; ns update times.
- Large universe.
- 1/ε² space is prohibitive.

Slide 26

Algorithms in the Field
Algorithms research:
- Theory
- Applied
- Field
GS Application:
- High-speed memory is expensive; ns update times.
- Large universe.
- 1/ε² space is prohibitive.
- Extensions: skipping over the stream (CMON at Sprint), distributed (Sawzall), ...

Slide 27

Finale
- I didn't talk about: sketch of sketches; graph, geometric, probabilistic streams; distributed, continual streaming.
Big Data Program at the Simons Theory Center at Berkeley.