Aggregators: modeling data queries functionally

This talk introduces Aggregators, an abstraction that makes it easy to build composable queries against big data sets. The Algebird library gives you access to many powerful aggregators that you can compose and use with Scalding, Spark or in your own code.

P. Oscar Boykin

January 10, 2015

Transcript

  1. @Twitter How to compute the size of a list in Map/Reduce?

     [Slide diagram: each element of the list 2, 3, 5, 7, 11, 13, 17 is mapped to 1 via map(x => 1).]
  2. @Twitter How to compute the size of a list in Map/Reduce?

     [Slide diagram: the 1s are summed pairwise with reduce { (x, y) => x + y }, giving the size 7. A runnable sketch of this counting step appears after the transcript.]
  3. @Twitter Getting the average

     [Slide diagram: each element of 2, 3, 5, 7, 11, 13, 17 becomes a (count, value) pair: (1, 2), (1, 3), (1, 5), (1, 7), (1, 11), (1, 13), (1, 17), via map(x => (1, x)).]
  4. @Twitter Getting the average

     [Slide diagram: the (count, sum) pairs are combined with reduce(Semigroup.plus); partial results such as (2, 5), (4, 17), and (3, 41) merge into the final pair (7, 58).]
  5. @Twitter Getting the average

     [Slide diagram: the final pair (7, 58) is presented as 58 / 7 ≈ 8.285 via map { case (c, s) => s / c.toDouble }. A runnable sketch of this averaging pipeline appears after the transcript.]
  6. trait Aggregator[In, Middle, Out] {
       def prepare(i: In): Middle
       def semigroup: Semigroup[Middle]
       def present(m: Middle): Out
     }
     https://github.com/twitter/algebird
     (A self-contained sketch of this trait, with an averaging instance, appears after the transcript.)
  7. “Does not compose” is the new “is a piece of crap.”
     (paraphrasing Dan Rosen, @mergeconflict)
  8. Aggregators are Applicative Functors
     Functor: has a map method: def map(t: A[T])(fn: T => U): A[U]
     Applicative: has a join method: def join(t: A[T], u: A[U]): A[(T, U)]
     Monad: has a flatMap method: def flatMap(t: A[T])(fn: T => A[U]): A[U]
     (A composition sketch using join appears after the transcript.)
  9. Aggregators “just work” with Scala collections
     Aggregators are built into Scalding
     Aggregators are easy to use with Spark
     (A usage sketch with one of Algebird's built-in aggregators appears after the transcript.)
  11. Key Points 1) Aggregators encapsulate very general query logic independent

    of how it is executed (in memory, scalding, spark, you name it) 2) Aggregators compose so you can define parts you use, and easily glue them together 3) Algebird has many advanced, well tested Aggregators: TopK, HyperLogLog, CountMinSketch, Mean, Stddev, …