Timon Karnezos
June 20, 2013

# Count-Min Sketch: 10 Years Later

S. Muthu Muthukrishnan's talk at the AK Data Science Summit on Streaming and Sketching in Big Data and Analytics on 06/20/2013 at 111 Minna.


## Transcript

1. Count-Min Sketch: 10 Years Later
Muthu

2. Puzzle: The NM Game
- Circular table with seats numbered $1, \ldots, n$ arranged around it; player N starts at seat 1.
- In each round:
  - Player N calls how many places he would like to move.
  - Player M determines whether N will move clockwise or counterclockwise.

[Figure: circular table with seats labeled $1, 2, \ldots, n$]

3. NM Game: Contd
- Problem: What is $f(n)$, the largest number of seats player N can sit in?
- Full knowledge of the table, infinitely smart, infinite time.

4. NM Game: Contd
- Problem: What is $f(n)$, the largest number of seats player N can sit in?
- Full knowledge of the table, infinitely smart, infinite time.
- $f(2) = 2$; $f(3) \neq 3$; $f(4) = 4$; $\ldots$

5. NM Game: Contd
- Problem: What is $f(n)$, the largest number of seats player N can sit in?
- Full knowledge of the table, infinitely smart, infinite time.
- $f(2) = 2$; $f(3) \neq 3$; $f(4) = 4$; $\ldots$

Z. Nedev and S. Muthukrishnan: Theor. Comput. Sci. 393(1-3): 124-132 (2008).

6. Big Data Examples
- Examples of data: cellphone call logs, web logs, internet packets.

7. Big Data Summary
- Examples of data: cellphone call logs, web logs, internet packets.
- At least 3 distinct algorithmic theories:
  - Sequential algorithms on 1 machine (cellphone log data).
  - Streaming algorithms (internet packet data).
  - MapReduce algorithms on 10k's of machines (web data).
- Many other examples.
- Non-examples of massive data: Paul Erdős.

8. A Basic Problem in Streams: Indexing
- Imagine a virtual array $F[1 \cdots n]$.
- Updates: $F[i]{+}{+}$, $F[i]{-}{-}$.
- Assume $F[i] \geq 0$ at all times.
- Query: $F[i] = ?$
- Key: Use $o(n)$ space, maybe $O(\log n)$ space.

9. Count-Min Sketch
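- For each update $F[i]{+}{+}$:
  - for each $j = 1, \ldots, \log(1/\delta)$, update $cm[h_j(i)]{+}{+}$.
- Estimate $\tilde{F}(i) = \min_{j = 1, \ldots, \log(1/\delta)} cm[h_j(i)]$.

[Figure: the cm array, with $\log(1/\delta)$ rows and $e/\varepsilon$ columns. An update $F[i]{+}{+}$ is hashed by $h_1(i), h_2(i), \ldots$ to one cell in each row, and each of those cells gets $+1$.]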
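
The update and estimate rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch`, the modulus `PRIME`, and the hash family $h_j(x) = ((a_j x + b_j) \bmod p) \bmod w$ are assumptions chosen for the example, and items are assumed to be non-negative integers.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min sketch: a depth x width array of counters."""

    PRIME = 2**61 - 1  # assumed modulus for the example hash family

    def __init__(self, epsilon, delta, seed=0):
        self.width = math.ceil(math.e / epsilon)               # e/epsilon columns
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))  # ln(1/delta) rows
        rng = random.Random(seed)
        # One hash per row: h_j(x) = ((a*x + b) mod PRIME) mod width.
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.depth)]
        self.cm = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, j, i):
        a, b = self.coeffs[j]
        return ((a * i + b) % self.PRIME) % self.width

    def update(self, i, count=1):
        """Apply F[i] += count by touching one cell in every row."""
        for j in range(self.depth):
            self.cm[j][self._bucket(j, i)] += count

    def estimate(self, i):
        """Return the minimum, over rows, of the cell that i hashes to."""
        return min(self.cm[j][self._bucket(j, i)] for j in range(self.depth))


sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for item in [7, 7, 7, 42, 42, 99]:
    sketch.update(item)
print(sketch.estimate(7))  # never below the true count of 3
```

Because every cell only ever accumulates counts from the items hashed to it, the estimate can overshoot the true count but never undershoot it while all updates are increments.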

10. Count-Min Sketch
- Claim: $F[i] \leq \tilde{F}[i]$.
- Claim: With probability at least $1 - \delta$, $\tilde{F}[i] \leq F[i] + \varepsilon \sum_{j \neq i} F[j]$.
- Space used is $O\bigl(\frac{1}{\varepsilon} \log \frac{1}{\delta}\bigr)$.
- Time per update is $O\bigl(\log \frac{1}{\delta}\bigr)$. Independent of $n$.

G. Cormode and S. Muthukrishnan: An improved data stream summary: count-min sketch and its applications. Journal of Algorithms, 55(1): 58-75 (2005).
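
To make the space bound concrete, the following sketch computes the array dimensions for a given $\varepsilon$ and $\delta$. The helper name `cm_dimensions` is hypothetical, and the constants ($e/\varepsilon$ columns, $\ln(1/\delta)$ rows) are taken from the figure on the previous slide; real implementations may round or size differently.

```python
import math


def cm_dimensions(epsilon, delta):
    """Return (width, depth) of the counter array for the given guarantees."""
    width = math.ceil(math.e / epsilon)        # columns: e / epsilon
    depth = math.ceil(math.log(1.0 / delta))   # rows: ln(1 / delta)
    return width, depth


width, depth = cm_dimensions(epsilon=0.001, delta=0.01)
cells = width * depth
print(width, depth, cells)
```

For $\varepsilon = 0.001$ and $\delta = 0.01$ this gives a few thousand columns and a handful of rows, and the total does not depend on the universe size $n$.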

11. Count-Min Sketch: The Proof
- With probability at least $1 - \delta$, $\tilde{F}[i] \leq F[i] + \varepsilon \sum_{j \neq i} F[j]$.

12. Count-Min Sketch: The Proof
- With probability at least $1 - \delta$, $\tilde{F}[i] \leq F[i] + \varepsilon \sum_{j \neq i} F[j]$.
- $X_{i,j}$ is the expected contribution of $F[j]$ to the bucket containing $i$, for any $h$.

$$E(X_{i,j}) = \frac{\varepsilon}{e} \sum_{j \neq i} F[j].$$

13. Count-Min Sketch: The Proof
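- With probability at least $1 - \delta$, $\tilde{F}[i] \leq F[i] + \varepsilon \sum_{j \neq i} F[j]$.
- $X_{i,j}$ is the expected contribution of $F[j]$ to the bucket containing $i$, for any $h$.

$$E(X_{i,j}) = \frac{\varepsilon}{e} \sum_{j \neq i} F[j].$$

- Consider $\Pr\bigl(\tilde{F}[i] > F[i] + \varepsilon \sum_{j \neq i} F[j]\bigr)$:

$$\Pr(\cdot) = \Pr\Bigl(\forall j:\ F[i] + X_{i,j} > F[i] + \varepsilon \sum_{j \neq i} F[j]\Bigr) = \Pr\bigl(\forall j:\ X_{i,j} \geq e\,E(X_{i,j})\bigr) < e^{-\log(1/\delta)} = \delta$$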
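
The final inequality compresses a Markov-inequality step and an independence step. A worked version might read as follows. It is a sketch under stated assumptions rather than the talk's own derivation: the symbols $w = e/\varepsilon$ (columns), $d = \ln(1/\delta)$ (rows), and the summation index $k$ are introduced here, each $h_j$ is assumed pairwise independent, the $d$ hash functions are assumed independent of one another, and $\sum_{k \neq i} F[k] > 0$.

```latex
\begin{align*}
X_{i,j} &= \sum_{k \neq i} F[k]\,\mathbf{1}[h_j(k) = h_j(i)] \\
\mathbb{E}(X_{i,j}) &\leq \frac{1}{w} \sum_{k \neq i} F[k]
  = \frac{\varepsilon}{e} \sum_{k \neq i} F[k] \\
\Pr\Bigl(X_{i,j} > \varepsilon \sum_{k \neq i} F[k]\Bigr)
  &\leq \frac{\mathbb{E}(X_{i,j})}{\varepsilon \sum_{k \neq i} F[k]}
  \leq \frac{1}{e} \\
\Pr\Bigl(\forall j:\ X_{i,j} > \varepsilon \sum_{k \neq i} F[k]\Bigr)
  &\leq \Bigl(\frac{1}{e}\Bigr)^{d} = e^{-\ln(1/\delta)} = \delta
\end{align*}
```

The third line is Markov's inequality applied to the non-negative variable $X_{i,j}$; the fourth uses the fact that the estimate is a minimum, so it is bad only if every row is bad.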

14. Improve Count-Min Sketch?
- Index Problem:
  - ALICE has an $n$-long bitstring and sends messages to BOB, who wishes to compute the $i$th bit.
  - Needs $\Omega(n)$ bits of communication.
- Reduction of estimating $F[i]$ in data stream model.
  - $I[1 \cdots 1/(2\varepsilon)]$ such that
    - $I[i] = 1 \rightarrow F[i] = 2$
    - $I[i] = 0 \rightarrow F[i] = 0;\ F[0] \leftarrow F[0] + 2$
  - Observe that $\|F\| = \sum_i F[i] = 1/\varepsilon$.

15. Improve Count-Min Sketch?
- Index Problem:
  - ALICE has an $n$-long bitstring and sends messages to BOB, who wishes to compute the $i$th bit.
  - Needs $\Omega(n)$ bits of communication.
- Reduction of estimating $F[i]$ in data stream model.
  - $I[1 \cdots 1/(2\varepsilon)]$ such that
    - $I[i] = 1 \rightarrow F[i] = 2$
    - $I[i] = 0 \rightarrow F[i] = 0;\ F[0] \leftarrow F[0] + 2$
  - Observe that $\|F\| = \sum_i F[i] = 1/\varepsilon$.
  - Estimating $F[i] \leq \tilde{F}[i] \leq F[i] + \varepsilon \|F\|$ implies
    - $I[i] = 0 \rightarrow F[i] = 0 \rightarrow 0 \leq \tilde{F}[i] \leq 1$
    - $I[i] = 1 \rightarrow F[i] = 2 \rightarrow 2 \leq \tilde{F}[i] \leq 3$

    and reveals $I[i]$.
- Therefore, $\Omega(1/\varepsilon)$ space lower bound for index problem.
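
The reduction can be checked on a small instance. The sketch below is illustrative only: the function names `encode_bits` and `decode_bit` are hypothetical, and instead of running a real sketch it plugs in the worst estimate the guarantee allows, $F[i] + \varepsilon \|F\|$, to show that the two cases still separate.

```python
def encode_bits(bits):
    """Alice's side: turn the bitstring I into the frequency vector F."""
    # F[0] is the padding counter; positions 1..len(bits) mirror I[1..].
    F = [0] * (len(bits) + 1)
    for i, bit in enumerate(bits, start=1):
        if bit == 1:
            F[i] = 2
        else:
            F[0] += 2
    return F


def decode_bit(estimate):
    """Bob's side: an estimate in [0, 1] means bit 0, in [2, 3] means bit 1."""
    return 1 if estimate >= 2 else 0


epsilon = 0.125
bits = [1, 0, 0, 1]        # length 1/(2*epsilon) = 4
F = encode_bits(bits)
total = sum(F)             # equals 1/epsilon
# Worst case permitted by the guarantee: every estimate inflated by epsilon * ||F||.
worst_estimates = [F[i] + epsilon * total for i in range(1, len(bits) + 1)]
recovered = [decode_bit(est) for est in worst_estimates]
print(total, recovered)
```

Since every bit contributes exactly 2 to $\|F\|$, the additive error $\varepsilon \|F\|$ is 1, which is why the intervals $[0, 1]$ and $[2, 3]$ never overlap.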

16. Count-Min Sketch, The Challenges
- Not all projections, dimensionality reduction are the same:
  - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss.

17. Count-Min Sketch, The Challenges
- Not all projections, dimensionality reduction are the same:
  - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss.
- Not all hashing algorithms are the same:
  - Pairwise independence.

18. Count-Min Sketch, The Challenges
- Not all projections, dimensionality reduction are the same:
  - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss.
- Not all hashing algorithms are the same:
  - Pairwise independence.
- Not all approximations are sampling.
  - Recovering $F[i]$ to $\pm 0.1|F|$ accuracy will retrieve each item precisely.

19. Using Count-Min Sketch
- For each $i$, determine $\tilde{F}[i]$.
- Keep the set $S$ of heavy hitters ($\tilde{F}[i] \geq 2\varepsilon \|F\|$).
  - Guaranteed that $S$ contains $i$ such that $F[i] \geq 2\varepsilon \|F\|$ and no $F[i] \leq \varepsilon \|F\|$.
  - Extra $\log n$ factor for answering $n$ queries.

Problem is of database interest.

20. Using Count-Min Sketch
- For each $i$, determine $\tilde{F}[i]$.
- Keep the set $S$ of heavy hitters ($\tilde{F}[i] \geq 2\varepsilon \|F\|$).
  - Guaranteed that $S$ contains $i$ such that $F[i] \geq 2\varepsilon \|F\|$ and no $F[i] \leq \varepsilon \|F\|$.
  - Extra $\log n$ factor for answering $n$ queries.

Problem is of database interest.

- Faster recovery: Hash into buckets such that in each bucket, recover majority $i$ ($F[i] > \sum_{j \text{ in same bucket as } i} F[j]/2$).

21. Using Count-Min Sketch
- For each $i$, determine $\tilde{F}[i]$.
- Keep the set $S$ of heavy hitters ($\tilde{F}[i] \geq 2\varepsilon \|F\|$).
  - Guaranteed that $S$ contains $i$ such that $F[i] \geq 2\varepsilon \|F\|$ and no $F[i] \leq \varepsilon \|F\|$.
  - Extra $\log n$ factor for answering $n$ queries.

Problem is of database interest.

- Faster recovery: Hash into buckets such that in each bucket, recover majority $i$ ($F[i] > \sum_{j \text{ in same bucket as } i} F[j]/2$).
  - Takes $O(\log n)$ extra time, space.
  - Gives compressed sensing in $L_1$: $\|F - \tilde{F}_k\|_1 \leq \|F - F^*_k\|_1 + \varepsilon \|F\|_1$.

Sparse recovery experiments: http://groups.csail.mit.edu/toc/sparse/wiki/index.php?title=Sparse_Recovery_Experiments
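
The basic heavy-hitter recipe from this slide (query every $i$ and keep those above the $2\varepsilon \|F\|$ threshold) can be sketched as below. This is the simple query-everything variant, not the faster bucketed recovery. All helper names, the dictionary layout, and the hash family are assumptions made for the example, and items are assumed to be non-negative integers.

```python
import math
import random

P = 2**61 - 1  # assumed prime modulus for the illustrative hash family


def make_sketch(epsilon, delta, seed=1):
    """Build an empty sketch with e/epsilon columns and ln(1/delta) rows."""
    width = math.ceil(math.e / epsilon)
    depth = max(1, math.ceil(math.log(1.0 / delta)))
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(depth)]
    return {"width": width, "hashes": hashes,
            "rows": [[0] * width for _ in range(depth)]}


def bucket(sketch, j, i):
    a, b = sketch["hashes"][j]
    return ((a * i + b) % P) % sketch["width"]


def update(sketch, i):
    for j, row in enumerate(sketch["rows"]):
        row[bucket(sketch, j, i)] += 1


def estimate(sketch, i):
    return min(row[bucket(sketch, j, i)] for j, row in enumerate(sketch["rows"]))


def heavy_hitters(sketch, universe, total, epsilon):
    """Keep every i whose estimate is at least 2 * epsilon * ||F||."""
    threshold = 2 * epsilon * total
    return {i for i in universe if estimate(sketch, i) >= threshold}


epsilon, delta = 0.05, 0.01
# Two frequent items plus 200 items that appear once each.
stream = [1] * 500 + [2] * 300 + list(range(100, 300))
random.Random(0).shuffle(stream)
sketch = make_sketch(epsilon, delta)
for item in stream:
    update(sketch, item)
S = heavy_hitters(sketch, universe=range(1000), total=len(stream), epsilon=epsilon)
print(sorted(S))
```

Items 1 and 2 are always reported because estimates never fall below the true counts; whether any light item slips in depends on the hash draws, which is the probabilistic part of the guarantee.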

22. Count-Min Sketch: Summary
- Solves many problems:
  - Heavy hitters, compressed sensing, inner products, quantiles, least squares regression, low rank matrix approximation, ...
- Applications to other CS/EE areas:
  - NLP, ML, password checking, secure privacy, ...
- Systems, code, hardware.
  - Gigascope, CMON, Sawzall, MillWheel, ...

23. How is CM Sketch Used in Systems?
- Linearity: $\widehat{(F + G)}[i] = \min_{j = 1, \ldots, \log(1/\delta)} \bigl(cm_F[h_j(i)] + cm_G[h_j(i)]\bigr)$
- Good estimate since $cm_{F+G} = cm_F + cm_G$.

Gigascope (GS) at AT&T (pure DSMS system):

- Ex: For each src IP, find heavy hitter destination IP.
- Two level arch: fast lightweight low level; high level expensive.
- Parallelize by hashing on distinct shedding.
- http://www.corp.att.com/attlabs/docs/att_gigascope_factsheet_071405.pdf
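
The linearity property on this slide is easy to check numerically: if two sketches share the same hash functions, adding them cell by cell gives exactly the sketch of the combined stream. The code below is a minimal sketch with assumed names (`sketch_of`, `merge`) and fixed illustrative dimensions; it is not taken from Gigascope or any other system named in the talk.

```python
import random

P = 2**61 - 1          # assumed prime modulus for the illustrative hash family
WIDTH, DEPTH = 64, 4   # illustrative dimensions, shared by all sketches

_rng = random.Random(42)
HASHES = [(_rng.randrange(1, P), _rng.randrange(P)) for _ in range(DEPTH)]


def bucket(j, i):
    a, b = HASHES[j]
    return ((a * i + b) % P) % WIDTH


def sketch_of(stream):
    """Build the counter array for a stream of non-negative integer items."""
    rows = [[0] * WIDTH for _ in range(DEPTH)]
    for i in stream:
        for j in range(DEPTH):
            rows[j][bucket(j, i)] += 1
    return rows


def merge(cm_f, cm_g):
    """Cell-wise sum: cm_{F+G} = cm_F + cm_G."""
    return [[x + y for x, y in zip(row_f, row_g)]
            for row_f, row_g in zip(cm_f, cm_g)]


def estimate(rows, i):
    return min(rows[j][bucket(j, i)] for j in range(DEPTH))


stream_f = [5, 5, 9, 13]
stream_g = [5, 9, 9, 21]
merged = merge(sketch_of(stream_f), sketch_of(stream_g))
same = merged == sketch_of(stream_f + stream_g)
print(same, estimate(merged, 5))
```

This is what lets separate nodes or time windows each keep a small sketch and combine them later, provided they agree on the hash functions up front.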

24. Algorithms in the Field
Algorithms research:
- Theory
- Applied
- Field

25. Algorithms in the Field
Algorithms research:
- Theory
- Applied
- Field

GS Application:

- High speed memory is expensive, ns update times.
- Large universe.
- $1/\varepsilon^2$ space is prohibitive.

26. Algorithms in the Field
Algorithms research:
- Theory
- Applied
- Field

GS Application:

- High speed memory is expensive, ns update times.
- Large universe.
- $1/\varepsilon^2$ space is prohibitive.
- Extensions: skipping over stream (CMON at Sprint), distributed (Sawzall).

27. Finale
- I didn't talk about: Sketch of sketches; Graph, geometric, probabilistic streams; Distributed, continual streaming.

Big Data Program at Simons Theory Center at Berkeley