Count-Min Sketch: 10 Years Later

S. Muthu Muthukrishnan's talk at the AK Data Science Summit on Streaming and Sketching in Big Data and Analytics on 06/20/2013 at 111 Minna.

For more information: http://blog.aggregateknowledge.com/ak-data-science-summit-june-20-2013

Timon Karnezos

June 20, 2013

Transcript

  1. Puzzle: The NM Game

    - Circular table with seats numbered $1, \ldots, n$ arranged around it; player N starts at seat 1.
    - In each round:
      - Player N calls how many places he would like to move.
      - Player M determines whether N will move clockwise or counterclockwise.

    [Figure: circular table with seats labeled 1, 2, ..., n]
  2. NM Game: Contd

    - Problem: What is $f(n)$, the largest number of seats player N can sit in?
    - Full knowledge of the table, infinitely smart, infinite time.
  3. NM Game: Contd

    - Problem: What is $f(n)$, the largest number of seats player N can sit in?
    - Full knowledge of the table, infinitely smart, infinite time.
    - $f(2) = 2$; $f(3) \ne 3$; $f(4) = 4$; $\ldots$
  4. NM Game: Contd

    - Problem: What is $f(n)$, the largest number of seats player N can sit in?
    - Full knowledge of the table, infinitely smart, infinite time.
    - $f(2) = 2$; $f(3) \ne 3$; $f(4) = 4$; $\ldots$

    Z. Nedev and S. Muthukrishnan: Theor. Comput. Sci. 393(1-3): 124-132 (2008).
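For small tables the puzzle can be checked by brute force. The sketch below is not the method of the cited paper; it is a plain minimax search under two assumptions: seats are renumbered 0 to n-1 (seat 1 on the slide becomes seat 0), and the starting seat counts as visited. The names `nm_game_value`, `values`, and `after_move` are hypothetical. The search is exponential in n, so it is only practical for small n.

```python
from functools import lru_cache

def nm_game_value(n):
    """Largest number of seats N can guarantee to sit in on a table of n seats."""

    @lru_cache(maxsize=None)
    def values(visited):
        # visited is a bitmask of seats already used; val[p] is what N can
        # guarantee when standing on seat p with this visited set.
        seats = [p for p in range(n) if (visited >> p) & 1]
        val = {p: len(seats) for p in seats}
        changed = True
        while changed:  # iterate up to the least fixed point
            changed = False
            for p in seats:
                for k in range(1, n):
                    # M picks the direction that is worse for N.
                    outcome = min(after_move(visited, val, (p + k) % n),
                                  after_move(visited, val, (p - k) % n))
                    if outcome > val[p]:
                        val[p] = outcome
                        changed = True
        return val

    def after_move(visited, val, q):
        if (visited >> q) & 1:
            return val[q]                       # landed on an old seat
        return values(visited | (1 << q))[q]    # landed on a new seat

    return values(1)[0]

small_values = [nm_game_value(n) for n in (2, 3, 4)]
```

Here `small_values` can be compared against the values quoted on the slide for n = 2, 3, 4.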
  5. Big Data Summary

    - Examples of data: cellphone call logs, web logs, internet packets.
    - At least 3 distinct algorithmic theories:
      - Sequential algorithms on 1 machine (cellphone log data).
      - Streaming algorithms (internet packet data).
      - MapReduce algorithms on 10k's of machines (web data).
    - Many other examples.
    - Non-examples of massive data: Paul Erdos.
  6. A Basic Problem in Streams: Indexing

    - Imagine a virtual array $F[1 \cdots n]$
    - Updates: F[i]++, F[i]--
    - Assume $F[i] \ge 0$ at all times
    - Query: $F[i] = ?$
    - Key: Use $o(n)$ space, maybe $O(\log n)$ space
  7. Count-Min Sketch

    - For each update F[i]++, for each $j = 1, \ldots, \log(1/\delta)$, update $cm[h_j(i)]$++.
    - Estimate $\tilde{F}[i] = \min_{j = 1, \ldots, \log(1/\delta)} cm[h_j(i)]$.

    [Figure: the cm array with $\log(1/\delta)$ rows and $e/\varepsilon$ columns; an update F[i]++ adds +1 to cell $h_j(i)$ in each row $j$.]
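The update and estimate rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes integer item ids, a cm array with ceil(ln(1/δ)) rows and ceil(e/ε) columns (one row per hash function), and hash functions of the form ((a·x + b) mod p) mod width with a fixed Mersenne prime p. The class and method names are hypothetical.

```python
import math
import random

class CountMinSketch:
    """Count-Min sketch over non-negative integer item ids."""

    PRIME = (1 << 61) - 1  # assumed larger than any item id

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        rng = random.Random(seed)
        # One (a, b) pair per row defines h_j(x) = ((a*x + b) mod p) mod width.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.depth)]
        self.cm = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.PRIME) % self.width

    def update(self, i, c=1):
        """Apply F[i] += c by adding c to one cell in every row."""
        for j in range(self.depth):
            self.cm[j][self._bucket(j, i)] += c

    def estimate(self, i):
        """Return the minimum over rows of the cell that i hashes to."""
        return min(self.cm[j][self._bucket(j, i)] for j in range(self.depth))

sketch = CountMinSketch(eps=0.01, delta=0.01)
stream = [7] * 50 + [3] * 20 + list(range(100, 200))
for item in stream:
    sketch.update(item)
```

With only increments, `sketch.estimate(7)` can never fall below the true count of 50, because every cell that item 7 touches receives all of its updates plus whatever collides with it.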
  8. Count-Min Sketch

    - Claim: $F[i] \le \tilde{F}[i]$.
    - Claim: With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{j \ne i} F[j]$.
    - Space used is $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$.
    - Time per update is $O(\log \frac{1}{\delta})$. Independent of $n$.

    G. Cormode and S. Muthukrishnan: An improved data stream summary: count-min sketch and its applications. Journal of Algorithms, 55(1): 58-75 (2005).
  9. Count-Min Sketch: The Proof

    - With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{j \ne i} F[j]$.
    - $X_{i,j}$ is the expected contribution of $F[j]$ to the bucket containing $i$, for any $h$.

      $E(X_{i,j}) = \frac{\varepsilon}{e} \sum_{j \ne i} F[j]$.
  10. Count-Min Sketch: The Proof

    - With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{j \ne i} F[j]$.
    - $X_{i,j}$ is the expected contribution of $F[j]$ to the bucket containing $i$, for any $h$.

      $E(X_{i,j}) = \frac{\varepsilon}{e} \sum_{j \ne i} F[j]$.
    - Consider $\Pr(\tilde{F}[i] > F[i] + \varepsilon \sum_{j \ne i} F[j])$:

      $\Pr(\cdot) = \Pr(\forall j,\ F[i] + X_{i,j} > F[i] + \varepsilon \sum_{j \ne i} F[j]) = \Pr(\forall j,\ X_{i,j} \ge e\,E(X_{i,j})) < e^{-\log(1/\delta)} = \delta$
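The slide compresses two steps that may be worth spelling out: where the $\varepsilon/e$ factor comes from, and why the per-row failure probability is at most $1/e$. The derivation below is a sketch under stated assumptions that the slide does not make explicit: each $h_j$ is pairwise independent and maps into $w = e/\varepsilon$ buckets, the rows use independent hash functions, logarithms are natural, and $X_{i,j}$ is read as the excess in row $j$'s bucket for item $i$. The summation index is renamed to $k$ to avoid clashing with the row index $j$.

```latex
\begin{align*}
\Pr[h_j(k) = h_j(i)] &\le \frac{1}{w} = \frac{\varepsilon}{e}
    \qquad (k \ne i), \\
E(X_{i,j}) &= \sum_{k \ne i} F[k] \, \Pr[h_j(k) = h_j(i)]
    \le \frac{\varepsilon}{e} \sum_{k \ne i} F[k], \\
\Pr\Bigl[X_{i,j} > \varepsilon \sum_{k \ne i} F[k]\Bigr]
    &\le \frac{E(X_{i,j})}{\varepsilon \sum_{k \ne i} F[k]} \le \frac{1}{e}
    \qquad \text{(Markov, since } X_{i,j} \ge 0\text{)}, \\
\Pr\Bigl[\tilde{F}[i] > F[i] + \varepsilon \sum_{k \ne i} F[k]\Bigr]
    &= \prod_{j=1}^{\ln(1/\delta)} \Pr\Bigl[X_{i,j} > \varepsilon \sum_{k \ne i} F[k]\Bigr]
    \le e^{-\ln(1/\delta)} = \delta .
\end{align*}
```

The product in the last line uses the fact that the minimum over rows exceeds the bound only if every row does.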
  11. Improve Count-Min Sketch?

    - Index Problem:
      - ALICE has a bitstring of length $n$ and sends messages to BOB, who wishes to compute the $i$th bit.
      - Needs $\Omega(n)$ bits of communication.
    - Reduction to estimating $F[i]$ in the data stream model.
      - $I[1 \cdots 1/(2\varepsilon)]$ such that
        - $I[i] = 1 \rightarrow F[i] = 2$
        - $I[i] = 0 \rightarrow F[i] = 0;\ F[0] \leftarrow F[0] + 2$
      - Observe that $\|F\| = \sum_i F[i] = 1/\varepsilon$
  12. Improve Count-Min Sketch?

    - Index Problem:
      - ALICE has a bitstring of length $n$ and sends messages to BOB, who wishes to compute the $i$th bit.
      - Needs $\Omega(n)$ bits of communication.
    - Reduction to estimating $F[i]$ in the data stream model.
      - $I[1 \cdots 1/(2\varepsilon)]$ such that
        - $I[i] = 1 \rightarrow F[i] = 2$
        - $I[i] = 0 \rightarrow F[i] = 0;\ F[0] \leftarrow F[0] + 2$
      - Observe that $\|F\| = \sum_i F[i] = 1/\varepsilon$
      - Estimating $F[i] \le \tilde{F}[i] \le F[i] + \varepsilon \|F\|$ implies
        - $I[i] = 0 \rightarrow F[i] = 0 \rightarrow 0 \le \tilde{F}[i] \le 1$
        - $I[i] = 1 \rightarrow F[i] = 2 \rightarrow 2 \le \tilde{F}[i] \le 3$

        and reveals $I[i]$.
    - Therefore, $\Omega(1/\varepsilon)$ space lower bound for index problem.
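The reduction can be traced in code. The sketch below is only an illustration of the encoding and decoding steps: `noisy_estimate` is a hypothetical stand-in for any estimator that satisfies F[i] ≤ estimate ≤ F[i] + ε‖F‖, not a real sketch, and the function names and the choice ε = 0.05 are assumptions.

```python
import random

def encode(bits):
    """Alice: map bits I[1..m] to frequencies F; F[0] absorbs the padding."""
    freq = {0: 0}
    for i, bit in enumerate(bits, start=1):
        if bit == 1:
            freq[i] = 2
        else:
            freq[i] = 0
            freq[0] += 2
    return freq

def noisy_estimate(freq, i, eps, rng):
    """Stand-in estimator: true count plus an error in [0, eps * ||F||]."""
    norm = sum(freq.values())
    return freq[i] + rng.uniform(0.0, eps * norm)

def decode(estimate):
    """Bob: estimates in [0, 1] mean bit 0, estimates in [2, 3] mean bit 1."""
    return 1 if estimate >= 2 else 0

rng = random.Random(1)
eps = 0.05
m = round(1 / (2 * eps))  # bitstring length 1/(2*eps)
bits = [rng.randint(0, 1) for _ in range(m)]
freq = encode(bits)
recovered = [decode(noisy_estimate(freq, i, eps, rng)) for i in range(1, m + 1)]
```

Because every bit contributes exactly 2 to the total (either to F[i] or to F[0]), ‖F‖ is always 2m = 1/ε, so the allowed error ε‖F‖ is 1 and the two cases never overlap.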
  13. Count-Min Sketch, The Challenges

    - Not all projections, dimensionality reduction are the same:
      - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss
  14. Count-Min Sketch, The Challenges

    - Not all projections, dimensionality reduction are the same:
      - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss
    - Not all hashing algorithms are the same:
      - Pairwise independence
  15. Count-Min Sketch, The Challenges

    - Not all projections, dimensionality reduction are the same:
      - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss
    - Not all hashing algorithms are the same:
      - Pairwise independence
    - Not all approximations are sampling.
      - Recovering $F[i]$ to $\pm 0.1 |F|$ accuracy will retrieve each item precisely.
  16. Using Count-Min Sketch

    - For each $i$, determine $\tilde{F}[i]$
    - Keep the set $S$ of heavy hitters ($\tilde{F}[i] \ge 2\varepsilon \|F\|$).
    - Guaranteed that $S$ contains $i$ such that $F[i] \ge 2\varepsilon \|F\|$ and no $i$ with $F[i] \le \varepsilon \|F\|$
    - Extra $\log n$ factor for answering $n$ queries

    Problem is of database interest.
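The query-every-item approach on this slide can be sketched as follows. This is an illustration under assumptions, not the system described in the talk: the universe is small enough to enumerate, item ids are integers, the hash family is ((a·x + b) mod p) mod width, and the names `CountMinSketch` and `heavy_hitters` are hypothetical.

```python
import math
import random

class CountMinSketch:
    PRIME = (1 << 61) - 1  # assumed larger than any item id

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        rng = random.Random(seed)
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.depth)]
        self.cm = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # running ||F||, the sum of all counts

    def _bucket(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.PRIME) % self.width

    def update(self, i, c=1):
        self.total += c
        for j in range(self.depth):
            self.cm[j][self._bucket(j, i)] += c

    def estimate(self, i):
        return min(self.cm[j][self._bucket(j, i)] for j in range(self.depth))

def heavy_hitters(sketch, universe, eps):
    """Return every i in the universe whose estimate is at least 2*eps*||F||."""
    threshold = 2 * eps * sketch.total
    return {i for i in universe if sketch.estimate(i) >= threshold}

eps = 0.05
sketch = CountMinSketch(eps=eps, delta=0.01)
stream = [1] * 400 + [2] * 300 + list(range(1000, 1300))
for item in stream:
    sketch.update(item)
S = heavy_hitters(sketch, range(1300), eps)
```

Items 1 and 2 always land in `S` because the estimate never undercounts. Excluding every light item is only guaranteed with high probability, so the sketch does not assert that `S` contains nothing else.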
  17. Using Count-Min Sketch

    - For each $i$, determine $\tilde{F}[i]$
    - Keep the set $S$ of heavy hitters ($\tilde{F}[i] \ge 2\varepsilon \|F\|$).
    - Guaranteed that $S$ contains $i$ such that $F[i] \ge 2\varepsilon \|F\|$ and no $i$ with $F[i] \le \varepsilon \|F\|$
    - Extra $\log n$ factor for answering $n$ queries

    Problem is of database interest.

    - Faster recovery: Hash into buckets such that in each bucket, recover majority $i$ ($F[i] > \sum_{j \text{ same bucket as } i} F[j] / 2$)
  18. Using Count-Min Sketch

    - For each $i$, determine $\tilde{F}[i]$
    - Keep the set $S$ of heavy hitters ($\tilde{F}[i] \ge 2\varepsilon \|F\|$).
    - Guaranteed that $S$ contains $i$ such that $F[i] \ge 2\varepsilon \|F\|$ and no $i$ with $F[i] \le \varepsilon \|F\|$
    - Extra $\log n$ factor for answering $n$ queries

    Problem is of database interest.

    - Faster recovery: Hash into buckets such that in each bucket, recover majority $i$ ($F[i] > \sum_{j \text{ same bucket as } i} F[j] / 2$)
    - Takes $O(\log n)$ extra time, space
    - Gives compressed sensing in $L_1$: $\|F - \tilde{F}_k\|_1 \le \|F - F^*_k\|_1 + \varepsilon \|F\|_1$

    Sparse recovery experiments: http://groups.csail.mit.edu/toc/sparse/wiki/index.php?title=Sparse_Recovery_Experiments
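One way to recover the majority item of a single bucket with O(log n) extra counters is to keep one counter per bit position of the item id. The slide does not say which recovery scheme it has in mind, so the bit-counter approach here is an assumption, and `MajorityBucket` is a hypothetical name. A bit of the majority item is 1 exactly when more than half of the bucket's weight has that bit set.

```python
class MajorityBucket:
    """Counters for one hash bucket: a total plus one counter per id bit."""

    def __init__(self, bits):
        self.total = 0
        self.bit_counts = [0] * bits

    def update(self, i, c=1):
        """Add c occurrences of item i (ids must fit in the chosen bit width)."""
        self.total += c
        for b in range(len(self.bit_counts)):
            if (i >> b) & 1:
                self.bit_counts[b] += c

    def majority(self):
        """Return the majority id, assuming some item holds more than half
        of the bucket's total; otherwise the result is meaningless."""
        candidate = 0
        for b, count in enumerate(self.bit_counts):
            if 2 * count > self.total:
                candidate |= 1 << b
        return candidate

bucket = MajorityBucket(bits=8)
bucket.update(37, 60)           # item 37 carries 60 of 100 updates
for other in range(100, 140):   # 40 light items, one update each
    bucket.update(other)
recovered = bucket.majority()
```

Each update touches one counter per bit, which matches the O(log n) extra time and space per bucket quoted on the slide.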
  19. Count-Min Sketch: Summary

    - Solves many problems:
      - Heavy hitters, compressed sensing, inner products, quantiles, least squares regression, low rank matrix approximation, ...
    - Applications to other CS/EE areas:
      - NLP, ML, Password checking, Secure privacy, ...
    - Systems, code, hardware.
      - Gigascope, CMON, Sawzall, MillWheel, ...

    Wiki: http://sites.google.com/site/countminsketch/
  20. How is CM Sketch Used in Systems?

    - Linearity: $\widehat{(F + G)}[i] = \min_{j = 1, \ldots, \log(1/\delta)} \left( cm_F[h_j(i)] + cm_G[h_j(i)] \right)$
    - Good estimate since $cm_{F+G} = cm_F + cm_G$

    GS at AT&T (pure DSMS system):

    - Ex: For each src IP, find heavy hitter destination IP.
    - Two level arch: fast lightweight low level; high level expensive.
    - Parallelize by hashing on distinct groupbys, heartbeats, load shedding.
    - http://www.corp.att.com/attlabs/docs/att_gigascope_factsheet_071405.pdf
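Linearity is easy to check directly: two sketches built with the same hash functions can be added cell by cell, and the result equals the sketch of the combined stream. The sketch below assumes integer item ids and the hash family ((a·x + b) mod p) mod width. All function names are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # assumed larger than any item id

def make_hashes(depth, seed=0):
    """Draw one (a, b) pair per row; both sketches must share these."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(depth)]

def build_sketch(stream, hashes, width):
    """Build the cm array for a stream of unit increments."""
    cm = [[0] * width for _ in hashes]
    for i in stream:
        for row, (a, b) in zip(cm, hashes):
            row[((a * i + b) % PRIME) % width] += 1
    return cm

def merge(cm_f, cm_g):
    """Cell-wise sum of two sketches that use the same hash functions."""
    return [[x + y for x, y in zip(row_f, row_g)]
            for row_f, row_g in zip(cm_f, cm_g)]

def estimate(cm, hashes, width, i):
    return min(row[((a * i + b) % PRIME) % width]
               for row, (a, b) in zip(cm, hashes))

width, depth = 64, 4
hashes = make_hashes(depth, seed=42)
stream_f = [7] * 30 + list(range(100, 150))
stream_g = [7] * 12 + list(range(200, 260))
merged = merge(build_sketch(stream_f, hashes, width),
               build_sketch(stream_g, hashes, width))
combined = build_sketch(stream_f + stream_g, hashes, width)
```

This is what lets a system sketch partitions of a stream on separate machines and combine the results afterwards, provided every machine uses the same hash functions and dimensions.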
  21. Algorithms in the Field

    Algorithms research:

    - Theory
    - Applied
    - Field

    GS Application:

    - High speed memory is expensive, ns update times.
    - Large universe
    - $1/\varepsilon^2$ space is prohibitive
  22. Algorithms in the Field

    Algorithms research:

    - Theory
    - Applied
    - Field

    GS Application:

    - High speed memory is expensive, ns update times.
    - Large universe
    - $1/\varepsilon^2$ space is prohibitive
    - Extensions: skipping over stream (CMON at Sprint), distributed (Sawzall).
  23. Finale

    - I didn't talk about: Sketch of sketches; Graph, geometric, probabilistic streams; Distributed, continual streaming

    Big Data Program at Simons Theory Center at Berkeley