Count-Min Sketch: 10 Years Later

S. Muthu Muthukrishnan's talk at the AK Data Science Summit on Streaming and Sketching in Big Data and Analytics on 06/20/2013 at 111 Minna.

For more information: http://blog.aggregateknowledge.com/ak-data-science-summit-june-20-2013

Timon Karnezos

June 20, 2013

Transcript

  1. Count-Min Sketch: 10 Years Later
    Muthu

  2. Puzzle: The NM Game
    - Circular table with seats numbered $1, \ldots, n$ arranged around it; player N starts at seat 1.
    - In each round:
      - Player N calls how many places he would like to move.
      - Player M decides whether N moves clockwise or counterclockwise.
    [Figure: circular table with seats labeled 1, 2, ..., n]

  3. NM Game: Contd
    - Problem: What is $f(n)$, the largest number of seats player N can sit in?
      - Full knowledge of the table, infinitely smart, infinite time.

  4. NM Game: Contd
    - Problem: What is $f(n)$, the largest number of seats player N can sit in?
      - Full knowledge of the table, infinitely smart, infinite time.
    - $f(2) = 2$; $f(3) \neq 3$; $f(4) = 4$; ...

  5. NM Game: Contd
    - Problem: What is $f(n)$, the largest number of seats player N can sit in?
      - Full knowledge of the table, infinitely smart, infinite time.
    - $f(2) = 2$; $f(3) \neq 3$; $f(4) = 4$; ...
    Z. Nedev and S. Muthukrishnan: Theor. Comput. Sci. 393(1-3): 124-132 (2008).

  6. Big Data Examples
    - Examples of data: cellphone call logs, web logs, internet packets.

  7. Big Data Summary
    - Examples of data: cellphone call logs, web logs, internet packets.
    - At least 3 distinct algorithmic theories:
      - Sequential algorithms on 1 m/c (cellphone log data).
      - Streaming algorithms (internet packet data).
      - MapReduce algorithms on 10k's of m/cs (web data).
    - Many other examples.
    - Non-examples of massive data: Paul Erdős.

  8. A Basic Problem in Streams: Indexing
    - Imagine a virtual array $F[1 \cdots n]$
    - Updates: $F[i]$++, $F[i]$--
    - Assume $F[i] \ge 0$ at all times
    - Query: $F[i] = ?$
    - Key: Use $o(n)$ space, maybe $O(\log n)$ space

  9. Count-Min Sketch
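    - For each update $F[i]$++:
      - for each $j = 1, \ldots, \log(1/\delta)$, update $cm[h_j(i)]$++.
    - Estimate $\tilde{F}(i) = \min_{j = 1, \ldots, \log(1/\delta)} cm[h_j(i)]$.
    [Figure: the cm array, with $\log(1/\delta)$ rows and $e/\varepsilon$ columns; an update $F[i]$++ adds +1 to cell $h_j(i)$ of each row $j$]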
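
The update and estimate rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the class name `CountMinSketch`, the Carter-Wegman style hash `((a*i + b) mod p) mod width`, the choice of prime, and the use of the natural logarithm for the row count are all assumptions made here.

```python
import math
import random


class CountMinSketch:
    """Count-Min sketch: `depth` rows of `width` counters, one hash per row."""

    PRIME = (1 << 61) - 1  # assumed modulus; keys must be ints in [0, PRIME)

    def __init__(self, epsilon, delta, seed=0):
        self.width = math.ceil(math.e / epsilon)                  # e/eps columns
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))     # ln(1/delta) rows
        rng = random.Random(seed)
        # One (a, b) pair per row for h_j(i) = ((a*i + b) mod p) mod width.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.PRIME) % self.width

    def update(self, i, count=1):
        """F[i] += count: add count to one cell in every row."""
        for j in range(self.depth):
            self.table[j][self._bucket(j, i)] += count

    def estimate(self, i):
        """Return the minimum over rows of the cell that i hashes to."""
        return min(self.table[j][self._bucket(j, i)] for j in range(self.depth))
```

With non-negative updates every cell that `i` touches holds at least `F[i]`, so `estimate(i)` never undercounts; the minimum picks the row with the least collision noise.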

  10. Count-Min Sketch
    - Claim: $F[i] \le \tilde{F}[i]$.
    - Claim: With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{j \neq i} F[j]$.
    - Space used is $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$.
    - Time per update is $O(\log \frac{1}{\delta})$, independent of $n$.
    G. Cormode and S. Muthukrishnan: An improved data stream summary: count-min sketch and its applications. Journal of Algorithms, 55(1): 58-75 (2005).

  11. Count-Min Sketch: The Proof
    - With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{j \neq i} F[j]$.

  12. Count-Min Sketch: The Proof
    - With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{j \neq i} F[j]$.
    - $X_{i,j}$ is the expected contribution of $F[j]$ to the bucket containing $i$, for any $h$: $E(X_{i,j}) = \frac{\varepsilon}{e} \sum_{j \neq i} F[j]$.

  13. Count-Min Sketch: The Proof
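    - With probability at least $1 - \delta$, $\tilde{F}[i] \le F[i] + \varepsilon \sum_{j \neq i} F[j]$.
    - $X_{i,j}$ is the expected contribution of $F[j]$ to the bucket containing $i$, for any $h$: $E(X_{i,j}) = \frac{\varepsilon}{e} \sum_{j \neq i} F[j]$.
    - Consider $\Pr(\tilde{F}[i] > F[i] + \varepsilon \sum_{j \neq i} F[j])$:
      $\Pr(\cdot) = \Pr(\forall j,\ F[i] + X_{i,j} > F[i] + \varepsilon \sum_{j \neq i} F[j])$
      $= \Pr(\forall j,\ X_{i,j} \ge e\,E(X_{i,j})) < e^{-\log(1/\delta)} = \delta$.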
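
The last line of the proof compresses two steps: Markov's inequality within one row, then independence across rows. A hedged expansion follows, assuming each row has $w = e/\varepsilon$ buckets, the row hashes $h_j$ are pairwise independent and chosen independently of each other, $\sum_{k \neq i} F[k] > 0$, and logarithms are natural. The symbols $w$ and $k$ are notation introduced here (with $k$ indexing colliding items so that $j$ can index rows); they are not on the slide.

```latex
\begin{align*}
X_{i,j} &= \sum_{k \neq i} F[k]\,\mathbf{1}\bigl[h_j(k) = h_j(i)\bigr]
  && \text{(collision noise in row } j\text{)} \\
E(X_{i,j}) &\le \frac{1}{w} \sum_{k \neq i} F[k]
  = \frac{\varepsilon}{e} \sum_{k \neq i} F[k]
  && \text{(pairwise independence)} \\
\Pr\Bigl(X_{i,j} \ge \varepsilon \sum_{k \neq i} F[k]\Bigr)
  &\le \frac{E(X_{i,j})}{\varepsilon \sum_{k \neq i} F[k]} \le \frac{1}{e}
  && \text{(Markov)} \\
\Pr\Bigl(\forall j:\ X_{i,j} \ge \varepsilon \sum_{k \neq i} F[k]\Bigr)
  &\le \Bigl(\frac{1}{e}\Bigr)^{\ln(1/\delta)} = \delta
  && \text{(independent rows)}
\end{align*}
```

The estimate exceeds the bound only if every row does, which is why taking the minimum over $\ln(1/\delta)$ rows drives the failure probability down to $\delta$.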

  14. Improve Count-Min Sketch?
    - Index Problem:
      - ALICE has an $n$-long bitstring and sends messages to BOB, who wishes to compute the $i$th bit.
      - Needs $\Omega(n)$ bits of communication.
    - Reduction to estimating $F[i]$ in the data stream model:
      - $I[1 \cdots 1/(2\varepsilon)]$ such that
        - $I[i] = 1 \to F[i] = 2$
        - $I[i] = 0 \to F[i] = 0;\ F[0] \leftarrow F[0] + 2$
      - Observe that $\|F\| = \sum_i F[i] = 1/\varepsilon$

  15. Improve Count-Min Sketch?
    - Index Problem:
      - ALICE has an $n$-long bitstring and sends messages to BOB, who wishes to compute the $i$th bit.
      - Needs $\Omega(n)$ bits of communication.
    - Reduction to estimating $F[i]$ in the data stream model:
      - $I[1 \cdots 1/(2\varepsilon)]$ such that
        - $I[i] = 1 \to F[i] = 2$
        - $I[i] = 0 \to F[i] = 0;\ F[0] \leftarrow F[0] + 2$
      - Observe that $\|F\| = \sum_i F[i] = 1/\varepsilon$
      - Estimating $F[i] \le \tilde{F}[i] \le F[i] + \varepsilon \|F\|$ implies
        - $I[i] = 0 \to F[i] = 0 \to 0 \le \tilde{F}[i] \le 1$
        - $I[i] = 1 \to F[i] = 2 \to 2 \le \tilde{F}[i] \le 3$
        and reveals $I[i]$.
    - Therefore, $\Omega(1/\varepsilon)$ space lower bound for index problem.
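
The reduction can be checked with a small example. The helper names `encode` and `decode` are hypothetical, and the "estimator" is simulated by adding any error in $[0, 1]$ to the true count, which is what the $\varepsilon \|F\|$ guarantee allows when $\|F\| = 1/\varepsilon$.

```python
def encode(bits):
    """Alice's encoding: F[i] = 2 if I[i] = 1, else 0 with 2 added to F[0].

    Index 0 is the padding slot, so the total mass is always 2 * len(bits),
    i.e. 1/eps when len(bits) = 1/(2*eps).
    """
    F = [0] + [2 * b for b in bits]
    F[0] = 2 * len(bits) - sum(F)
    return F


def decode(estimate):
    """Bob's decoding: an estimate in [0, 1] means bit 0, in [2, 3] means bit 1."""
    return 1 if estimate >= 2 else 0


bits = [1, 0, 0, 1, 1]          # Alice's string I[1..5], so eps = 1/10
F = encode(bits)                # [4, 2, 0, 0, 2, 2], total mass 10 = 1/eps
recovered = [decode(F[i] + 1.0) for i in range(1, len(F))]  # worst-case error
```

Because the two estimate ranges never overlap, any sketch meeting the guarantee lets Bob read every bit, so it cannot be smaller than the communication bound for Index.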

  16. Count-Min Sketch, The Challenges
    - Not all projections, dimensionality reduction are the same:
      - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss

  17. Count-Min Sketch, The Challenges
    - Not all projections, dimensionality reduction are the same:
      - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss
    - Not all hashing algorithms are the same:
      - Pairwise independence

  18. Count-Min Sketch, The Challenges
    - Not all projections, dimensionality reduction are the same:
      - All prior work $\Omega(1/\varepsilon^2)$ space, via Johnson-Lindenstrauss
    - Not all hashing algorithms are the same:
      - Pairwise independence
    - Not all approximations are sampling.
      - Recovering $F[i]$ to $\pm 0.1|F|$ accuracy will retrieve each item precisely.

  19. Using Count-Min Sketch
    - For each $i$, determine $\tilde{F}[i]$
    - Keep the set $S$ of heavy hitters ($\tilde{F}[i] \ge 2\varepsilon \|F\|$).
      - Guaranteed that $S$ contains every $i$ with $F[i] \ge 2\varepsilon \|F\|$ and no $i$ with $F[i] \le \varepsilon \|F\|$
      - Extra $\log n$ factor for answering $n$ queries
    Problem is of database interest.

  20. Using Count-Min Sketch
    - For each $i$, determine $\tilde{F}[i]$
    - Keep the set $S$ of heavy hitters ($\tilde{F}[i] \ge 2\varepsilon \|F\|$).
      - Guaranteed that $S$ contains every $i$ with $F[i] \ge 2\varepsilon \|F\|$ and no $i$ with $F[i] \le \varepsilon \|F\|$
      - Extra $\log n$ factor for answering $n$ queries
    Problem is of database interest.
    - Faster recovery: Hash into buckets such that in each bucket, recover majority $i$ ($F[i] > \sum_{j \text{ in same bucket as } i} F[j]/2$)

  21. Using Count-Min Sketch
    - For each $i$, determine $\tilde{F}[i]$
    - Keep the set $S$ of heavy hitters ($\tilde{F}[i] \ge 2\varepsilon \|F\|$).
      - Guaranteed that $S$ contains every $i$ with $F[i] \ge 2\varepsilon \|F\|$ and no $i$ with $F[i] \le \varepsilon \|F\|$
      - Extra $\log n$ factor for answering $n$ queries
    Problem is of database interest.
    - Faster recovery: Hash into buckets such that in each bucket, recover majority $i$ ($F[i] > \sum_{j \text{ in same bucket as } i} F[j]/2$)
      - Takes $O(\log n)$ extra time, space
      - Gives compressed sensing in $L_1$: $\|F - \tilde{F}_k\|_1 \le \|F - F^*_k\|_1 + \varepsilon \|F\|_1$
    Sparse recovery experiments: http://groups.csail.mit.edu/toc/sparse/wiki/index.php?title=Sparse_Recovery_Experiments
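
One way to recover the majority item of a single bucket with $O(\log n)$ extra counters is to keep a total count plus one counter per bit of the item identifier. The slide does not say which decoding is meant, so the bit-counter scheme below, the function name `recover_majority`, and the `(item, count)` update format are assumptions for illustration.

```python
def recover_majority(updates, n_bits):
    """Recover the item holding more than half of a bucket's total count.

    Keeps 1 + n_bits counters: the bucket total, and for each bit position
    the total count of items whose identifier has that bit set.
    The result is only meaningful if a strict majority item exists.
    """
    total = 0
    bit_counts = [0] * n_bits
    for item, count in updates:
        total += count
        for b in range(n_bits):
            if (item >> b) & 1:
                bit_counts[b] += count

    # A bit of the majority item is 1 exactly when that bit's counter
    # exceeds half of the bucket total.
    candidate = 0
    for b in range(n_bits):
        if 2 * bit_counts[b] > total:
            candidate |= 1 << b
    return candidate


# Bucket holding items 5, 3 and 9 with counts 6, 2 and 3 (total 11).
majority = recover_majority([(5, 6), (3, 2), (9, 3)], n_bits=4)
```

Item 5 holds 6 of the 11 counts, so each of its set bits dominates its counter and the decoder reads the identifier off bit by bit, without querying all $n$ items.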

  22. Count-Min Sketch: Summary
    - Solves many problems:
      - Heavy hitters, compressed sensing, inner products, quantiles, least squares regression, low rank matrix approximation, ...
    - Applications to other CS/EE areas:
      - NLP, ML, Password checking, Secure privacy, ...
    - Systems, code, hardware.
      - Gigascope, CMON, Sawzall, MillWheel, ...
    Wiki: http://sites.google.com/site/countminsketch/

  23. How is CM Sketch Used in Systems?
    - Linearity: $\widehat{(F + G)}[i] = \min_{j = 1, \ldots, \log(1/\delta)} \left( cm_F[h_j(i)] + cm_G[h_j(i)] \right)$
    - Good estimate since $cm_{F+G} = cm_F + cm_G$
    GS at AT&T (pure DSMS system):
    - Ex: For each src IP, find heavy hitter destination IP.
    - Two level arch: fast lightweight low level; high level expensive.
    - Parallelize by hashing on distinct groupbys, heartbeats, load shedding.
    - http://www.corp.att.com/attlabs/docs/att_gigascope_factsheet_071405.pdf
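
The linearity property on this slide is what makes sketches mergeable across machines or time windows. A minimal sketch of it follows, assuming both sketches share the same hash functions; the constants `WIDTH`, `DEPTH`, `PRIME`, the seed, and the helper names are illustrative choices made here, not taken from Gigascope.

```python
import random

WIDTH, DEPTH, PRIME = 64, 4, (1 << 31) - 1
_rng = random.Random(7)
# Both sketches must use the same (a, b) pairs for the merge to be valid.
HASHES = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(DEPTH)]


def bucket(j, i):
    a, b = HASHES[j]
    return ((a * i + b) % PRIME) % WIDTH


def sketch(stream):
    """Build the cm array for a stream of non-negative integer items."""
    cm = [[0] * WIDTH for _ in range(DEPTH)]
    for i in stream:
        for j in range(DEPTH):
            cm[j][bucket(j, i)] += 1
    return cm


def merge(cm_f, cm_g):
    """Cell-wise sum: cm_{F+G} = cm_F + cm_G."""
    return [[x + y for x, y in zip(row_f, row_g)]
            for row_f, row_g in zip(cm_f, cm_g)]


def estimate(cm, i):
    return min(cm[j][bucket(j, i)] for j in range(DEPTH))


stream_f = [1, 2, 2, 3, 3, 3]
stream_g = [3, 3, 4, 5]
merged = merge(sketch(stream_f), sketch(stream_g))
```

Here `merged` is identical to the sketch of the concatenated stream, so a high-level node can combine low-level sketches without seeing the raw data.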

  24. Algorithms in the Field
    Algorithms research:
    - Theory
    - Applied
    - Field

  25. Algorithms in the Field
    Algorithms research:
    - Theory
    - Applied
    - Field
    GS Application:
    - High speed memory is expensive, ns update times.
    - Large universe
    - $1/\varepsilon^2$ space is prohibitive

  26. Algorithms in the Field
    Algorithms research:
    - Theory
    - Applied
    - Field
    GS Application:
    - High speed memory is expensive, ns update times.
    - Large universe
    - $1/\varepsilon^2$ space is prohibitive
    - Extensions: skipping over stream (CMON at Sprint), distributed (Sawzall).

  27. Finale
    - I didn't talk about: Sketch of sketches; Graph, geometric, probabilistic streams; Distributed, continual streaming
    Big Data Program at Simons Theory Center at Berkeley
