Aggregators: modeling data queries functionally

This talk introduces Aggregators, an abstraction that makes it easy to build composable queries against big data sets. The Algebird library gives you access to many powerful aggregators that you can compose and use with Scalding, Spark or in your own code.

P. Oscar Boykin

January 10, 2015

Transcript

  1. Aggregators: modeling data queries functionally Oscar Boykin, Twitter @posco

  2. Or: Aggregators: composable aggregation for Scalding, Spark, Summingbird, and plain Scala

  3. How to compute the size of a list in Map/Reduce?

    2 3 5 7 11 13 17
  4. How to compute the size of a list in Map/Reduce?

    2 3 5 7 11 13 17
    map(x => 1)
    1 1 1 1 1 1 1
  5. How to compute the size of a list in Map/Reduce?

    2 3 5 7 11 13 17
    map(x => 1)
    1 1 1 1 1 1 1
    reduce {(x, y) => x+y}
    partial sums: 2 2 2, then 4 and 3, then 7
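The two steps on this slide can be tried on a plain Scala list. This is a minimal sketch, not the Map/Reduce job itself; the object name `SizeByMapReduce` is made up for illustration.

```scala
object SizeByMapReduce {
  val primes = List(2, 3, 5, 7, 11, 13, 17)

  // map step: replace every element with 1
  val ones: List[Int] = primes.map(x => 1)

  // reduce step: addition is associative, so the partial sums may be
  // grouped in any way (which is what lets a cluster do it in parallel)
  val size: Int = ones.reduce((x, y) => x + y)

  def main(args: Array[String]): Unit =
    println(size) // prints 7
}
```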
  6. Associative functions: f(a,f(b,c)) == f(f(a,b),c) also called “semigroups”
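As a rough illustration of the associativity law, here is a self-contained sketch. The `Semigroup` trait below is a hypothetical minimal stand-in written for this example, not Algebird's actual definition, and `isAssociativeOn` only spot-checks the law for one triple of values.

```scala
object SemigroupSketch {
  // Hypothetical minimal type class: a single associative operation
  trait Semigroup[T] {
    def plus(a: T, b: T): T
  }

  val intAddition: Semigroup[Int] = new Semigroup[Int] {
    def plus(a: Int, b: Int): Int = a + b
  }

  // A pair of semigroups is again a semigroup, combining component-wise
  def pair[A, B](sa: Semigroup[A], sb: Semigroup[B]): Semigroup[(A, B)] =
    new Semigroup[(A, B)] {
      def plus(x: (A, B), y: (A, B)): (A, B) =
        (sa.plus(x._1, y._1), sb.plus(x._2, y._2))
    }

  // Checks f(a, f(b, c)) == f(f(a, b), c) for one triple of values
  def isAssociativeOn[T](sg: Semigroup[T])(a: T, b: T, c: T): Boolean =
    sg.plus(a, sg.plus(b, c)) == sg.plus(sg.plus(a, b), c)
}
```

The pair construction is what the averaging slides rely on: (count, sum) pairs are added component-wise.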

  7. we want map+semigroup in one abstraction!

  8. Getting the average

    2 3 5 7 11 13 17
  9. Getting the average

    2 3 5 7 11 13 17
    map(x => (1,x))
    (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
  10. Getting the average

    2 3 5 7 11 13 17
    (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
    reduce(Semigroup.plus)
    partial sums: (2,5) (2,12) (2,24), then (4,17) and (3,41), then (7,58)
  11. Getting the average

    2 3 5 7 11 13 17
    (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
    (7,58)
    map { case (c, s) => s/c.toDouble }
    8.285…
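The three steps of the averaging slides (map, reduce, map) can be run on an in-memory list. This is a sketch in plain Scala, with the component-wise pair addition written out by hand rather than taken from a library; the object name is made up.

```scala
object AverageByMapReduceMap {
  val primes = List(2, 3, 5, 7, 11, 13, 17)

  // map: pair each value with a count of 1
  val prepared: List[(Int, Int)] = primes.map(x => (1, x))

  // reduce: add counts and sums component-wise (an associative operation)
  val (count, sum) = prepared.reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }

  // map: turn the final (count, sum) pair into the average
  val average: Double = sum / count.toDouble

  def main(args: Array[String]): Unit =
    println(s"count=$count sum=$sum average=$average")
}
```

Here `count` is 7 and `sum` is 58, so `average` is 58/7, roughly 8.2857.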
  12. We really want map+semigroup+map in one abstraction!

  13. trait Aggregator[In, Middle, Out] {
        def prepare(i: In): Middle
        def semigroup: Semigroup[Middle]
        def present(m: Middle): Out
      }

    https://github.com/twitter/algebird
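To see how the three members fit together, here is a self-contained sketch that copies the trait's shape from the slide. The `Semigroup` trait and the `apply` method are assumptions added for this example so it runs without any dependency; they are not claimed to match Algebird's own definitions.

```scala
object AggregatorSketch {
  // Hypothetical minimal semigroup, standing in for Algebird's
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Assumed helper: map (prepare), reduce (semigroup), map (present).
    // Requires a non-empty input, since a semigroup has no zero element.
    def apply(inputs: Iterable[In]): Out =
      present(inputs.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))
  }

  // The averaging example from the earlier slides, packaged as one value
  val average: Aggregator[Int, (Long, Long), Double] =
    new Aggregator[Int, (Long, Long), Double] {
      def prepare(i: Int): (Long, Long) = (1L, i.toLong)
      val semigroup: Semigroup[(Long, Long)] = new Semigroup[(Long, Long)] {
        def plus(a: (Long, Long), b: (Long, Long)): (Long, Long) =
          (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Long, Long)): Double = m._2 / m._1.toDouble
    }
}
```

With this sketch, `AggregatorSketch.average(List(2, 3, 5, 7, 11, 13, 17))` evaluates to 58/7.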
  14. How do we use this?

  19. Not such a new idea. Scalding had a mapReduceMap function in the first release:

  20. But why should we be excited?

  21. map (prepare) reduce (semigroup) map (present)

  22. “Does not compose” is the new “is a piece of crap” (paraphrasing Dan Rosen, @mergeconflict)

  23. Aggregators Compose != Aggregator

  24. map (prepare) reduce (semigroup) map (present)

  25. map (prepare) reduce (semigroup) map (present) composePrepare

  26. map (prepare) reduce (semigroup) map (present)

    composePrepare
    Function + Aggregator = Aggregator
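"Function + Aggregator = Aggregator" can be sketched as follows. `Agg` is a hypothetical stand-in for the Aggregator trait in which the semigroup is just an associative function `plus`; only the name `composePrepare` comes from the slides, and its signature here is an assumption.

```scala
object ComposePrepareSketch {
  // Hypothetical stand-in for an Aggregator
  final case class Agg[In, Middle, Out](
      prepare: In => Middle,
      plus: (Middle, Middle) => Middle,
      present: Middle => Out) {

    // Runs the aggregator over a non-empty collection
    def apply(inputs: Iterable[In]): Out =
      present(inputs.map(prepare).reduce(plus))

    // Run f before prepare; the reduce and present steps are unchanged
    def composePrepare[In2](f: In2 => In): Agg[In2, Middle, Out] =
      Agg(f.andThen(prepare), plus, present)
  }

  val sum: Agg[Int, Int, Int] = Agg[Int, Int, Int](x => x, _ + _, x => x)

  // Reuse the Int aggregator on strings by first mapping each to its length
  val totalLength: Agg[String, Int, Int] =
    sum.composePrepare((s: String) => s.length)
}
```

For example, `ComposePrepareSketch.totalLength(List("ab", "cde"))` evaluates to 5.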
  27. map (prepare) reduce (semigroup) map (present)

  28. map (prepare) reduce (semigroup) map (present) andThenPresent

  29. map (prepare) reduce (semigroup) map (present)

    andThenPresent
    Aggregator + Function = Aggregator
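"Aggregator + Function = Aggregator" works the same way on the output side. In this sketch `Agg` is again a hypothetical stand-in for the Aggregator trait; only the name `andThenPresent` comes from the slides, and its signature here is an assumption.

```scala
object AndThenPresentSketch {
  // Hypothetical stand-in for an Aggregator
  final case class Agg[In, Middle, Out](
      prepare: In => Middle,
      plus: (Middle, Middle) => Middle,
      present: Middle => Out) {

    // Runs the aggregator over a non-empty collection
    def apply(inputs: Iterable[In]): Out =
      present(inputs.map(prepare).reduce(plus))

    // Run f after present; the prepare and reduce steps are unchanged
    def andThenPresent[Out2](f: Out => Out2): Agg[In, Middle, Out2] =
      Agg(prepare, plus, present.andThen(f))
  }

  val size: Agg[String, Long, Long] =
    Agg[String, Long, Long](_ => 1L, _ + _, n => n)

  // Post-process the count into a human-readable label
  val sizeLabel: Agg[String, Long, String] =
    size.andThenPresent(n => s"$n items")
}
```

For example, `AndThenPresentSketch.sizeLabel(List("a", "b", "c"))` evaluates to `"3 items"`.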
  30. map (prepare) reduce (semigroup) map (present)

  31. map (prepare) reduce (semigroup) map (present) Aggregator 1 Aggregator 2

  32. map (prepare) reduce (semigroup) map (present)

    Joined Aggregator
    Aggregator * Aggregator = Aggregator
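Joining two aggregators means running both over the same input in a single pass and returning both results. In this sketch `Agg` is a hypothetical stand-in for the Aggregator trait, and `join` is written here under the assumption that it pairs up the intermediate values and the outputs.

```scala
object JoinSketch {
  // Hypothetical stand-in for an Aggregator
  final case class Agg[In, Middle, Out](
      prepare: In => Middle,
      plus: (Middle, Middle) => Middle,
      present: Middle => Out) {

    // Runs the aggregator over a non-empty collection
    def apply(inputs: Iterable[In]): Out =
      present(inputs.map(prepare).reduce(plus))

    // Pair the intermediate values, combine them component-wise,
    // and present each component with its own aggregator
    def join[M2, O2](that: Agg[In, M2, O2]): Agg[In, (Middle, M2), (Out, O2)] =
      Agg[In, (Middle, M2), (Out, O2)](
        in => (prepare(in), that.prepare(in)),
        (a, b) => (plus(a._1, b._1), that.plus(a._2, b._2)),
        m => (present(m._1), that.present(m._2)))
  }

  val size = Agg[Int, Long, Long](_ => 1L, _ + _, n => n)
  val max = Agg[Int, Int, Int](x => x, (a, b) => math.max(a, b), x => x)

  // One pass over the data yields both the size and the maximum
  val sizeAndMax: Agg[Int, (Long, Int), (Long, Int)] = size.join(max)
}
```

For example, `JoinSketch.sizeAndMax(List(2, 3, 5, 7, 11, 13, 17))` evaluates to `(7, 17)`.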
  33. Aggregators are Applicative Functors

    Functor: has a map method: def map(t: A[T])(fn: T => U): A[U]
    Applicative: has a join method: def join(t: A[T], u: A[U]): A[(T, U)]
    Monad: has a flatMap method: def flatMap(t: A[T])(fn: T => A[U]): A[U]
  35. Let’s go to the REPL: http://bit.ly/AggregatingWithAlice https://gist.github.com/johnynek/814fc1e77aad1d295bb7

  36. Aggregators “just work” with Scala collections

    Aggregators are built in to Scalding
    Aggregators are easy to use with Spark
  37. Algebird with Spark: https://github.com/twitter/algebird/pull/397

  39. Key Points

    1) Aggregators encapsulate very general query logic independent of how it is executed (in memory, Scalding, Spark, you name it)
    2) Aggregators compose, so you can define the parts you use and easily glue them together
    3) Algebird has many advanced, well-tested Aggregators: TopK, HyperLogLog, CountMinSketch, Mean, Stddev, …
  40. Oscar Boykin @posco / oscar@twitter.com

    Algebird has these aggregators and more: https://github.com/twitter/algebird