Slide 1

Slide 1 text

Aggregators: modeling data queries functionally
Oscar Boykin, Twitter
@posco

Slide 2

Slide 2 text

Or: Aggregators: composable aggregation for Scalding, Spark, Summingbird, and plain Scala

Slide 3

Slide 3 text

@Twitter
How to compute the size of a list in Map/Reduce?
Input: 2 3 5 7 11 13 17

Slide 4

Slide 4 text

How to compute the size of a list in Map/Reduce?
Input: 2 3 5 7 11 13 17
map(x => 1): 1 1 1 1 1 1 1

Slide 5

Slide 5 text

How to compute the size of a list in Map/Reduce?
Input: 2 3 5 7 11 13 17
map(x => 1): 1 1 1 1 1 1 1
reduce {(x, y) => x+y}: pairwise partial sums 2, 2, 2; then 4 and 3; result 7
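The map and reduce steps from the last two slides can be sketched in plain Scala. The object name `SizeByMapReduce` is only for illustration, and nothing here depends on Algebird.

```scala
object SizeByMapReduce {
  val input: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // map: replace every element with 1
  val ones: List[Int] = input.map(_ => 1)

  // reduce: add the ones together; since + is associative,
  // the grouping of the additions does not change the result
  val size: Int = ones.reduce((x, y) => x + y)
}
```

Note that `reduce` throws on an empty list, so this sketch assumes a non-empty input.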

Slide 6

Slide 6 text

Associative functions: f(a,f(b,c)) == f(f(a,b),c)
A type together with such a function is called a “semigroup”
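As a small illustration of the law above, the following sketch checks it for one triple of integers and shows that a left-to-right and a right-to-left reduction agree for addition. The helper `holdsFor` is a hypothetical name, and checking one triple is of course not a proof of associativity.

```scala
object Associativity {
  val plus: (Int, Int) => Int = _ + _
  val minus: (Int, Int) => Int = _ - _

  // Checks f(a, f(b, c)) == f(f(a, b), c) for one concrete triple
  def holdsFor(f: (Int, Int) => Int, a: Int, b: Int, c: Int): Boolean =
    f(a, f(b, c)) == f(f(a, b), c)

  val plusHolds: Boolean = holdsFor(plus, 2, 3, 5)   // 2+(3+5) and (2+3)+5 are both 10
  val minusHolds: Boolean = holdsFor(minus, 2, 3, 5) // 2-(3-5) is 4, (2-3)-5 is -6

  // For an associative function, the reduction order does not matter
  val xs: List[Int] = List(2, 3, 5, 7, 11, 13, 17)
  val fromLeft: Int = xs.reduceLeft(plus)
  val fromRight: Int = xs.reduceRight(plus)
}
```

This freedom to regroup is what lets a Map/Reduce system combine partial results in whatever order they become available.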

Slide 7

Slide 7 text

We want map+semigroup in one abstraction!

Slide 8

Slide 8 text

Getting the average
Input: 2 3 5 7 11 13 17

Slide 9

Slide 9 text

Getting the average
Input: 2 3 5 7 11 13 17
map(x => (1,x)): (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)

Slide 10

Slide 10 text

Getting the average
Input: 2 3 5 7 11 13 17
map(x => (1,x)): (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
reduce(Semigroup.plus): partial results (2,5) (2,12) (2,24); then (4,17) and (3,41); result (7,58)

Slide 11

Slide 11 text

Getting the average
Input: 2 3 5 7 11 13 17
map(x => (1,x)): (1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
reduce(Semigroup.plus): (7,58)
map { case (c, s) => s / c.toDouble }: 8.2857
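The three steps for the average (map, reduce, map) can be written out in plain Scala as a rough sketch. The pairwise tuple addition below stands in for `Semigroup.plus` on pairs; the object name is only for illustration.

```scala
object AverageByMapReduceMap {
  val input: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // map: pair every value with a count of 1
  val prepared: List[(Int, Int)] = input.map(x => (1, x))

  // reduce: add counts and sums componentwise (an associative operation)
  val reduced: (Int, Int) =
    prepared.reduce((a, b) => (a._1 + b._1, a._2 + b._2))

  // map: turn the (count, sum) pair into the average
  val average: Double = reduced match {
    case (c, s) => s / c.toDouble
  }
}
```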

Slide 12

Slide 12 text

We really want map+semigroup+map in one abstraction!

Slide 13

Slide 13 text

trait Aggregator[In, Middle, Out] {
  def prepare(i: In): Middle
  def semigroup: Semigroup[Middle]
  def present(m: Middle): Out
}
https://github.com/twitter/algebird

Slide 14

Slide 14 text

How do we use this?
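One way to picture the usage is the self-contained sketch below. It assumes a simplified, stand-in `Semigroup` and `Aggregator` (Algebird's real classes have more methods), and the `apply` runner is written here for illustration only. It builds the average from the earlier slides as a single value and applies it to a list.

```scala
object AggregatorUsage {
  // Simplified stand-ins for illustration, not Algebird's actual definitions
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Illustrative runner: map (prepare), reduce (semigroup), map (present).
    // Assumes a non-empty input, because reduce fails on empty collections.
    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))
  }

  val average: Aggregator[Int, (Int, Int), Double] =
    new Aggregator[Int, (Int, Int), Double] {
      def prepare(i: Int): (Int, Int) = (1, i)
      val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
        def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) =
          (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Int, Int)): Double = m._2 / m._1.toDouble
    }

  val result: Double = average(List(2, 3, 5, 7, 11, 13, 17))
}
```

The point of the design is that `average` is an ordinary value: it can be passed around, stored, and reused on any collection of `Int`.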

Slide 15

Slide 15 text


Slide 16

Slide 16 text


Slide 17

Slide 17 text


Slide 18

Slide 18 text


Slide 19

Slide 19 text

Not such a new idea. Scalding had a mapReduceMap function in the first release:

Slide 20

Slide 20 text

But why should we be excited?

Slide 21

Slide 21 text

map (prepare) reduce (semigroup) map (present)

Slide 22

Slide 22 text

“Does not compose” is the new “is a piece of crap” paraphrasing Dan Rosen @mergeconflict

Slide 23

Slide 23 text

Aggregators Compose != Aggregator

Slide 24

Slide 24 text

map (prepare) reduce (semigroup) map (present)

Slide 25

Slide 25 text

map (prepare) reduce (semigroup) map (present)
composePrepare

Slide 26

Slide 26 text

map (prepare) reduce (semigroup) map (present)
composePrepare
Function + Aggregator = Aggregator
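A minimal sketch of "Function + Aggregator = Aggregator", assuming the same simplified stand-in trait as on the `Aggregator` slide plus an illustrative `apply` runner. The `composePrepare` body below is one plausible way to write it and is not copied from Algebird.

```scala
object ComposePrepareSketch {
  // Simplified stand-ins for illustration, not Algebird's actual definitions
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))

    // Run fn in front of prepare; the reduce and present steps are unchanged
    def composePrepare[In2](fn: In2 => In): Aggregator[In2, Middle, Out] =
      new Aggregator[In2, Middle, Out] {
        def prepare(i: In2): Middle = self.prepare(fn(i))
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out = self.present(m)
      }
  }

  val intSum: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
    def present(m: Int): Int = m
  }

  // An aggregator over Int becomes an aggregator over String
  val totalLength: Aggregator[String, Int, Int] =
    intSum.composePrepare[String](s => s.length)

  val result: Int = totalLength(List("map", "reduce", "map"))
}
```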

Slide 27

Slide 27 text

map (prepare) reduce (semigroup) map (present)

Slide 28

Slide 28 text

map (prepare) reduce (semigroup) map (present)
andThenPresent

Slide 29

Slide 29 text

map (prepare) reduce (semigroup) map (present)
andThenPresent
Aggregator + Function = Aggregator
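The mirror image, "Aggregator + Function = Aggregator", might look like the sketch below. It again assumes a simplified stand-in trait and an illustrative `apply` runner; the `andThenPresent` body is one plausible implementation rather than Algebird's source.

```scala
object AndThenPresentSketch {
  // Simplified stand-ins for illustration, not Algebird's actual definitions
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))

    // Run fn after present; the prepare and reduce steps are unchanged
    def andThenPresent[Out2](fn: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = fn(self.present(m))
      }
  }

  val size: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = 1
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
    def present(m: Int): Int = m
  }

  // Post-process the final count into a description
  val described: Aggregator[Int, Int, String] =
    size.andThenPresent(n => s"$n items")

  val result: String = described(List(2, 3, 5, 7, 11, 13, 17))
}
```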

Slide 30

Slide 30 text

map (prepare) reduce (semigroup) map (present)

Slide 31

Slide 31 text

map (prepare) reduce (semigroup) map (present)
Aggregator 1
Aggregator 2

Slide 32

Slide 32 text

map (prepare) reduce (semigroup) map (present)
Joined Aggregator
Aggregator * Aggregator = Aggregator
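Joining two aggregators so that both run in a single pass can be sketched as follows. This assumes the simplified stand-in trait and an illustrative `apply` runner; for brevity, the `join` here requires both aggregators to share the same input type, which is a simplification made for this example.

```scala
object JoinSketch {
  // Simplified stand-ins for illustration, not Algebird's actual definitions
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    def apply(items: Iterable[In]): Out =
      present(items.iterator.map(i => prepare(i)).reduce((a, b) => semigroup.plus(a, b)))

    // Pair up the intermediate values and combine them componentwise
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(a: (Middle, M2), b: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(a._1, b._1), that.semigroup.plus(a._2, b._2))
        }
        def present(m: (Middle, M2)): (Out, O2) =
          (self.present(m._1), that.present(m._2))
      }
  }

  val intPlus: Semigroup[Int] = new Semigroup[Int] {
    def plus(a: Int, b: Int): Int = a + b
  }

  val size: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = 1
    def semigroup: Semigroup[Int] = intPlus
    def present(m: Int): Int = m
  }

  val sum: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    def semigroup: Semigroup[Int] = intPlus
    def present(m: Int): Int = m
  }

  // One pass over the data computes both results
  val sizeAndSum: Aggregator[Int, (Int, Int), (Int, Int)] = size.join(sum)
  val result: (Int, Int) = sizeAndSum(List(2, 3, 5, 7, 11, 13, 17))
}
```

The joined value is itself an aggregator, so it can be joined again or post-processed like any other.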

Slide 33

Slide 33 text

Aggregators are Applicative Functors
Functor: has a map method: def map(t: A[T])(fn: T => U): A[U]
Applicative: has a join method: def join(t: A[T], u: A[U]): A[(T, U)]
Monad: has a flatMap method: def flatMap(t: A[T])(fn: T => A[U]): A[U]

Slide 35

Slide 35 text

Let’s go to the REPL
http://bit.ly/AggregatingWithAlice
https://gist.github.com/johnynek/814fc1e77aad1d295bb7

Slide 36

Slide 36 text

Aggregators “just work” with Scala collections
Aggregators are built in to Scalding
Aggregators are easy to use with Spark

Slide 37

Slide 37 text

Algebird with Spark: https://github.com/twitter/algebird/pull/397

Slide 39

Slide 39 text

Key Points
1) Aggregators encapsulate very general query logic, independent of how it is executed (in memory, Scalding, Spark, you name it)
2) Aggregators compose, so you can define the parts you use and easily glue them together
3) Algebird has many advanced, well-tested Aggregators: TopK, HyperLogLog, CountMinSketch, Mean, Stddev, …
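To illustrate the first key point, the sketch below runs one aggregator through two different execution strategies: a single sequential pass, and a chunked pass that reduces each chunk separately before combining the partial results, loosely imitating how a distributed system works. The trait is a simplified stand-in, and `runSequential` and `runChunked` are hypothetical helpers written for this example.

```scala
object ExecutionIndependence {
  // Simplified stand-ins for illustration, not Algebird's actual definitions
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out
  }

  // Strategy 1: one pass over the whole (non-empty) input
  def runSequential[I, M, O](agg: Aggregator[I, M, O], items: Seq[I]): O =
    agg.present(items.map(i => agg.prepare(i)).reduce((a, b) => agg.semigroup.plus(a, b)))

  // Strategy 2: reduce each chunk on its own, then combine the partial results
  def runChunked[I, M, O](agg: Aggregator[I, M, O], items: Seq[I], chunkSize: Int): O = {
    val partials: List[M] = items
      .grouped(chunkSize)
      .map(chunk => chunk.map(i => agg.prepare(i)).reduce((a, b) => agg.semigroup.plus(a, b)))
      .toList
    agg.present(partials.reduce((a, b) => agg.semigroup.plus(a, b)))
  }

  val average: Aggregator[Int, (Int, Int), Double] =
    new Aggregator[Int, (Int, Int), Double] {
      def prepare(i: Int): (Int, Int) = (1, i)
      val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
        def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) =
          (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Int, Int)): Double = m._2 / m._1.toDouble
    }

  val input: Seq[Int] = Seq(2, 3, 5, 7, 11, 13, 17)
  val sequential: Double = runSequential(average, input)
  val chunked: Double = runChunked(average, input, 3)
}
```

Both strategies give the same answer because the semigroup is associative; the aggregator itself never needs to know which strategy was used.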

Slide 40

Slide 40 text

Oscar Boykin
@posco / oscar@twitter.com
Algebird has these aggregators and more: https://github.com/twitter/algebird