Aggregators: modeling data queries functionally

This talk introduces Aggregators, an abstraction that makes it easy to build composable queries against big data sets. The Algebird library gives you access to many powerful aggregators that you can compose and use with Scalding, Spark or in your own code.

P. Oscar Boykin

January 10, 2015
Transcript

  1. Aggregators:
    modeling data queries functionally
    Oscar Boykin, Twitter
    @posco

  2. Or:
    Aggregators:
    composable aggregation for scalding,
    spark, summingbird, and plain scala

  3. How to compute the size of a list in Map/Reduce?
    2 3 5 7 11 13 17

  4. How to compute the size of a list in Map/Reduce?
    2 3 5 7 11 13 17
    1 1 1 1 1 1 1
    map(x => 1)

  5. How to compute the size of a list in Map/Reduce?
    2 3 5 7 11 13 17
    1 1 1 1 1 1 1
    [Diagram: reduction tree with partial sums 2, 2, 2, 3, 4 and final result 7]
    reduce {(x, y) => x+y}
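
The two steps on this slide can be tried on an in-memory Scala list. This is only a minimal sketch: the object name `ListSize` and its fields are made up for illustration and are not part of the talk.

```scala
object ListSize {
  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // Map step: replace every element with the constant 1.
  val ones: List[Int] = primes.map(x => 1)

  // Reduce step: add the ones together. Addition is associative,
  // so the partial sums may be grouped in any order.
  val size: Int = ones.reduce { (x, y) => x + y }
}
```

Here `ListSize.size` evaluates to 7, the length of the list.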

  6. Associative functions:
    f(a,f(b,c)) == f(f(a,b),c)
    also called “semigroups”
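
As a rough illustration of the associativity law, the sketch below defines a stripped-down `Semigroup` trait (a hypothetical stand-in, much smaller than Algebird's real one) and checks the law for a single triple of values. Checking one triple can refute associativity but cannot prove it.

```scala
object SemigroupSketch {
  // Hypothetical minimal trait for illustration only.
  trait Semigroup[T] {
    def plus(a: T, b: T): T
  }

  val intAddition: Semigroup[Int] = new Semigroup[Int] {
    def plus(a: Int, b: Int): Int = a + b
  }

  // Subtraction is deliberately included as a counterexample: it is not associative.
  val intSubtraction: Semigroup[Int] = new Semigroup[Int] {
    def plus(a: Int, b: Int): Int = a - b
  }

  // Checks f(a, f(b, c)) == f(f(a, b), c) for one specific triple.
  def isAssociativeFor[T](sg: Semigroup[T])(a: T, b: T, c: T): Boolean =
    sg.plus(a, sg.plus(b, c)) == sg.plus(sg.plus(a, b), c)
}
```

Addition passes the check for (2, 3, 5), while subtraction fails it, which is why a reduce with subtraction could give different answers depending on how the work is split.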

  7. we want
    map+semigroup in one
    abstraction!

  8. Getting the average
    2 3 5 7 11 13 17

  9. Getting the average
    2 3 5 7 11 13 17
    (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
    map(x => (1,x))

  10. Getting the average
    2 3 5 7 11 13 17
    (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
    [Diagram: reduction tree with partial (count, sum) pairs (2,5), (2,12), (2,24), (4,17), (3,41) and final result (7,58)]
    reduce(Semigroup.plus)

  11. Getting the average
    2 3 5 7 11 13 17
    (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
    (7,58) → 8.285
    map { case (c, s) => s / c.toDouble }
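
The three steps for the average (map, reduce, map) can be sketched with plain Scala collections. The slide relies on Algebird's `Semigroup.plus` to add tuples; the sketch below spells out the component-wise addition by hand instead, and the object name `AverageExample` is invented for illustration.

```scala
object AverageExample {
  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // First map: pair each value with a count of 1.
  val pairs: List[(Int, Int)] = primes.map(x => (1, x))

  // Reduce: add counts and sums component-wise, which is associative.
  val (count, sum) = pairs.reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }

  // Final map: turn the (count, sum) pair into the average.
  val average: Double = sum / count.toDouble
}
```

The reduce step yields the pair (7, 58), and the final step divides 58 by 7.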

  12. We really want
    map+semigroup+map
    in one abstraction!

  13. trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
    }
    https://github.com/twitter/algebird
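
To see how the three methods fit together, here is a self-contained sketch that does not depend on Algebird. The `Semigroup` trait is a hypothetical minimal stand-in, and the `apply` helper that runs the aggregator over a non-empty in-memory collection is an assumption of this sketch, not a quote of the library's code.

```scala
object AggregatorSketch {
  // Hypothetical minimal Semigroup, standing in for Algebird's.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Assumed helper: map with prepare, reduce with the semigroup, then present.
    // Requires a non-empty collection because reduce has no zero value.
    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))
  }

  // The average from the earlier slides, packaged as one value.
  val average: Aggregator[Int, (Int, Int), Double] =
    new Aggregator[Int, (Int, Int), Double] {
      def prepare(i: Int): (Int, Int) = (1, i)
      val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
        def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) =
          (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Int, Int)): Double = m._2 / m._1.toDouble
    }
}
```

With this in place, `AggregatorSketch.average(List(2, 3, 5, 7, 11, 13, 17))` performs the whole map, reduce, map pipeline in one call.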

  14. How do we use this?

  19. Not such a new idea. Scalding had a
    mapReduceMap function in the first
    release:

  20. But why should we be excited?

  21. map (prepare)
    reduce (semigroup)
    map (present)

  22. “Does not compose”
    is the new
    “is a piece of crap”
    paraphrasing Dan Rosen @mergeconflict

  23. Aggregators Compose
    !=
    Aggregator

  24. map (prepare)
    reduce (semigroup)
    map (present)

  25. map (prepare)
    reduce (semigroup)
    map (present)
    composePrepare

  26. map (prepare)
    reduce (semigroup)
    map (present)
    composePrepare
    Function + Aggregator = Aggregator
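
One way to picture `composePrepare` is as a function applied before `prepare`, leaving the reduce and present steps untouched. The sketch below is a simplified re-implementation under that assumption; the traits, the `apply` helper, and the `sumInts` and `totalLength` values are all illustrative rather than Algebird's own code.

```scala
object ComposePrepareSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Run fn first, then the original prepare; the other steps are reused as-is.
    def composePrepare[In2](fn: In2 => In): Aggregator[In2, Middle, Out] =
      new Aggregator[In2, Middle, Out] {
        def prepare(i: In2): Middle = self.prepare(fn(i))
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out = self.present(m)
      }
  }

  val sumInts: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
    def present(m: Int): Int = m
  }

  // Function + Aggregator = Aggregator: sum the lengths of strings.
  val totalLength: Aggregator[String, Int, Int] =
    sumInts.composePrepare[String](s => s.length)
}
```

The new aggregator accepts strings even though `sumInts` only knows about integers.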

  27. map (prepare)
    reduce (semigroup)
    map (present)

  28. map (prepare)
    reduce (semigroup)
    map (present)
    andThenPresent

  29. map (prepare)
    reduce (semigroup)
    map (present)
    andThenPresent
    Aggregator + Function = Aggregator
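
Symmetrically, `andThenPresent` can be read as a function applied after `present`. The following is again a simplified sketch with hypothetical traits and example values, not the library's implementation.

```scala
object AndThenPresentSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Keep prepare and the semigroup; post-process the presented result with fn.
    def andThenPresent[Out2](fn: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = fn(self.present(m))
      }
  }

  val sumInts: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
    def present(m: Int): Int = m
  }

  // Aggregator + Function = Aggregator: format the total as text.
  val describedSum: Aggregator[Int, Int, String] =
    sumInts.andThenPresent(total => s"total=$total")
}
```

Only the final step changes, so the reduce still runs over plain integers.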

  30. map (prepare)
    reduce (semigroup)
    map (present)

  31. map (prepare)
    reduce (semigroup)
    map (present)
    Aggregator 1 Aggregator 2

  32. map (prepare)
    reduce (semigroup)
    map (present)
    Joined Aggregator
    Aggregator * Aggregator = Aggregator
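
Joining two aggregators can be sketched by pairing each of the three steps, so both queries are answered in a single pass over the data. The traits and the `fromReduce` helper below are hypothetical and only meant to show the shape of the idea.

```scala
object JoinSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Pair up prepare, semigroup, and present of both aggregators.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(a: (Middle, M2), b: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(a._1, b._1), that.semigroup.plus(a._2, b._2))
        }
        def present(m: (Middle, M2)): (Out, O2) =
          (self.present(m._1), that.present(m._2))
      }
  }

  // Hypothetical helper: build an Int aggregator from an associative binary operation.
  def fromReduce(op: (Int, Int) => Int): Aggregator[Int, Int, Int] =
    new Aggregator[Int, Int, Int] {
      def prepare(i: Int): Int = i
      val semigroup: Semigroup[Int] = new Semigroup[Int] {
        def plus(a: Int, b: Int): Int = op(a, b)
      }
      def present(m: Int): Int = m
    }

  val minAgg: Aggregator[Int, Int, Int] = fromReduce((a, b) => math.min(a, b))
  val maxAgg: Aggregator[Int, Int, Int] = fromReduce((a, b) => math.max(a, b))

  // Aggregator * Aggregator = Aggregator: minimum and maximum together.
  val minAndMax: Aggregator[Int, (Int, Int), (Int, Int)] = minAgg.join(maxAgg)
}
```

Running `minAndMax` over the list of primes returns both extremes as a pair.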

  33. Aggregators are Applicative Functors
    Functor: has a map method:
    def map(t: A[T])(fn: T => U): A[U]
    Applicative: has a join method:
    def join(t: A[T], u: A[U]): A[(T, U)]
    Monad: has a flatMap method:
    def flatMap(t: A[T])(fn: T => A[U]): A[U]
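
In practice, the functor and applicative operations are what let small aggregators be combined into larger ones. The sketch below builds a mean from a count and a sum using a join followed by a map over the result. Everything here is a simplified, hypothetical re-implementation (the map over the output is named `andThenPresent` to match the earlier slides), not Algebird's code.

```scala
object ApplicativeSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Functor-style map over the output.
    def andThenPresent[Out2](fn: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = fn(self.present(m))
      }

    // Applicative-style join of two aggregators over the same input.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(a: (Middle, M2), b: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(a._1, b._1), that.semigroup.plus(a._2, b._2))
        }
        def present(m: (Middle, M2)): (Out, O2) =
          (self.present(m._1), that.present(m._2))
      }
  }

  // Hypothetical helper: sum the result of applying prep to each input.
  def summing(prep: Int => Int): Aggregator[Int, Int, Int] =
    new Aggregator[Int, Int, Int] {
      def prepare(i: Int): Int = prep(i)
      val semigroup: Semigroup[Int] = new Semigroup[Int] {
        def plus(a: Int, b: Int): Int = a + b
      }
      def present(m: Int): Int = m
    }

  val count: Aggregator[Int, Int, Int] = summing(x => 1)
  val sum: Aggregator[Int, Int, Int] = summing(x => x)

  // join, then map the (count, sum) pair to the mean.
  val mean: Aggregator[Int, (Int, Int), Double] =
    count.join(sum).andThenPresent { case (c, s) => s / c.toDouble }
}
```

No `flatMap` is needed for this kind of composition: the structure of the combined query is fixed up front and does not depend on intermediate results.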

  35. Let’s go to the REPL
    http://bit.ly/AggregatingWithAlice
    https://gist.github.com/johnynek/814fc1e77aad1d295bb7

  36. Aggregators “just work” with scala collections
    Aggregators are built into Scalding
    Aggregators are easy to use with Spark

  37. Algebird with Spark:
    https://github.com/twitter/algebird/pull/397

  39. Key Points
    1) Aggregators encapsulate very general query
    logic independent of how it is executed (in
    memory, Scalding, Spark, you name it)
    2) Aggregators compose, so you can define the parts
    you use and easily glue them together
    3) Algebird has many advanced, well-tested
    Aggregators: TopK, HyperLogLog,
    CountMinSketch, Mean, Stddev, …

  40. Oscar Boykin @posco
    Algebird has these aggregators and more:
    https://github.com/twitter/algebird
