P. Oscar Boykin
January 10, 2015

# Aggregators: modeling data queries functionally

This talk introduces Aggregators, an abstraction that makes it easy to build composable queries against big data sets. The Algebird library gives you access to many powerful aggregators that you can compose and use with Scalding, Spark or in your own code.

## Transcript

1. Aggregators:
modeling data queries functionally
@posco

2. Or:
Aggregators:
composable aggregation for scalding,
spark, summingbird, and plain scala

3. How to compute size of a list in Map/Reduce?
2 3 5 7 11 13 17

4. How to compute size of a list in Map/Reduce?
2 3 5 7 11 13 17
1 1 1 1 1 1 1
map(x => 1)

5. How to compute size of a list in Map/Reduce?
2 3 5 7 11 13 17
1 1 1 1 1 1 1
reduce {(x, y) => x+y}
partial sums: 2 2 2, then 4 3, then 7
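
The two steps on this slide can be sketched in plain Scala. This is only an illustration on an in-memory list; the value names (`primes`, `ones`, `size`) are assumptions made for the example.

```scala
// The list from the slides.
val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

// Map step: replace every element with the constant 1.
val ones: List[Int] = primes.map(x => 1)

// Reduce step: add the ones pairwise until a single number is left.
val size: Int = ones.reduce { (x, y) => x + y }
```

Because addition is associative, the pairs can be combined in any grouping, which is what lets a cluster do the reduce in parallel.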

6. Associative functions:
f(a,f(b,c)) == f(f(a,b),c)
also called “semigroups”
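
A small sketch of why associativity matters: partial results from separate chunks can be combined and still agree with a sequential pass. The `Semigroup` trait here is a minimal stand-in written for this example, not the Algebird definition, and `intAdd`, `sequential` and `chunked` are hypothetical names.

```scala
// Minimal stand-in for a semigroup: a type with one associative `plus`.
trait Semigroup[T] { def plus(a: T, b: T): T }

val intAdd: Semigroup[Int] = new Semigroup[Int] {
  def plus(a: Int, b: Int): Int = a + b
}

val ones: List[Int] = List(1, 1, 1, 1, 1, 1, 1)

// One worker combining left to right.
val sequential: Int = ones.reduceLeft((a, b) => intAdd.plus(a, b))

// Chunks of three are reduced separately, then the partial results are combined.
val chunked: Int =
  ones.grouped(3).map(chunk => chunk.reduce((a, b) => intAdd.plus(a, b))).reduce((a, b) => intAdd.plus(a, b))

// Subtraction is not associative, so regrouping changes the answer.
val leftFirst: Int = (10 - 4) - 3
val rightFirst: Int = 10 - (4 - 3)
```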

7. we want
map+semigroup in one
abstraction!

8. Getting the average
2 3 5 7 11 13 17

9. Getting the average
2 3 5 7 11 13 17
(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
map(x => (1,x))

10. Getting the average
2 3 5 7 11 13 17
(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
reduce(Semigroup.plus)
partial sums: (2,5) (2,12) (2,24), then (4,17) (3,41), then (7,58)

11. Getting the average
2 3 5 7 11 13 17
(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)
(7,58)
map { case (c, s) => s/c.toDouble }
8.285
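
The three steps for the average can be sketched on an in-memory list as follows. The pairwise addition is written out by hand here instead of calling `Semigroup.plus`, so the block has no library dependency; the value names are assumptions for the example.

```scala
val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

// Map: pair each value with a count of 1.
val pairs: List[(Int, Int)] = primes.map(x => (1, x))

// Reduce: add counts and sums component-wise (an associative operation).
val total: (Int, Int) = pairs.reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }

// Final map: divide the sum by the count.
val average: Double = total._2 / total._1.toDouble
```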

12. We really want
map+semigroup+map
in one abstraction!

13. trait Aggregator[In, Middle, Out] {
  def prepare(i: In): Middle
  def semigroup: Semigroup[Middle]
  def present(m: Middle): Out
}

14. How do we use this?
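
One possible way to use the trait, sketched without any library: the `Semigroup` and `Aggregator` traits below are stand-ins that mirror the slide, and the `apply` helper is a hypothetical addition for this example (it assumes a non-empty input).

```scala
// Stand-ins mirroring the trait on the slide; not the Algebird source.
trait Semigroup[T] { def plus(a: T, b: T): T }

trait Aggregator[In, Middle, Out] {
  def prepare(i: In): Middle
  def semigroup: Semigroup[Middle]
  def present(m: Middle): Out

  // Hypothetical helper: run map, reduce, map over a non-empty collection.
  def apply(items: Iterable[In]): Out =
    present(items.map(i => prepare(i)).reduce((a: Middle, b: Middle) => semigroup.plus(a, b)))
}

// The average as a single value: prepare to (count, sum), add pairs, then divide.
val average: Aggregator[Int, (Int, Int), Double] =
  new Aggregator[Int, (Int, Int), Double] {
    def prepare(i: Int): (Int, Int) = (1, i)
    val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
      def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)
    }
    def present(m: (Int, Int)): Double = m._2 / m._1.toDouble
  }

val result: Double = average(List(2, 3, 5, 7, 11, 13, 17))
```

The query is now an ordinary value that can be passed around and reused, independent of where the data lives.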

19. Not such a new idea. Scalding had a
mapReduceMap function in the first
release:

20. But why should we be excited?

21. map (prepare)
reduce (semigroup)
map (present)

22. “Does not compose”
is the new
“is a piece of crap”
paraphrasing Dan Rosen @mergeconflict

23. Aggregators Compose
!=
Aggregator

24. map (prepare)
reduce (semigroup)
map (present)

25. map (prepare)
reduce (semigroup)
map (present)
composePrepare

26. map (prepare)
reduce (semigroup)
map (present)
composePrepare
Function + Aggregator = Aggregator
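
A minimal sketch of what composing a function in front of `prepare` could look like. `Agg` is a hypothetical, function-based stand-in for the slide's trait (its `plus` field plays the role of the semigroup); only the method name `composePrepare` is taken from the slide.

```scala
// Hypothetical stand-in: three functions instead of the trait on the slide.
final case class Agg[In, Mid, Out](prepare: In => Mid, plus: (Mid, Mid) => Mid, present: Mid => Out) {
  // Run map, reduce, map over a non-empty collection.
  def apply(items: Iterable[In]): Out = present(items.map(prepare).reduce(plus))

  // Pre-compose a function onto prepare; plus and present stay the same.
  def composePrepare[A](f: A => In): Agg[A, Mid, Out] = Agg(f.andThen(prepare), plus, present)
}

val average: Agg[Int, (Int, Int), Double] =
  Agg[Int, (Int, Int), Double](x => (1, x), (a, b) => (a._1 + b._1, a._2 + b._2), m => m._2 / m._1.toDouble)

// An average over Ints becomes an average of string lengths.
val averageLength: Agg[String, (Int, Int), Double] = average.composePrepare((s: String) => s.length)

val result: Double = averageLength(List("map", "reduce", "present"))
```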

27. map (prepare)
reduce (semigroup)
map (present)

28. map (prepare)
reduce (semigroup)
map (present)
andThenPresent

29. map (prepare)
reduce (semigroup)
map (present)
andThenPresent
Aggregator + Function = Aggregator
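
The mirror image, sketched under the same assumptions: `Agg` is a hypothetical stand-in for the slide's trait, and only the name `andThenPresent` comes from the slide.

```scala
// Hypothetical stand-in: three functions instead of the trait on the slide.
final case class Agg[In, Mid, Out](prepare: In => Mid, plus: (Mid, Mid) => Mid, present: Mid => Out) {
  // Run map, reduce, map over a non-empty collection.
  def apply(items: Iterable[In]): Out = present(items.map(prepare).reduce(plus))

  // Post-compose a function onto present; prepare and plus stay the same.
  def andThenPresent[B](f: Out => B): Agg[In, Mid, B] = Agg(prepare, plus, present.andThen(f))
}

val average: Agg[Int, (Int, Int), Double] =
  Agg[Int, (Int, Int), Double](x => (1, x), (a, b) => (a._1 + b._1, a._2 + b._2), m => m._2 / m._1.toDouble)

// Round the presented average to two decimal places.
val roundedAverage: Agg[Int, (Int, Int), Double] =
  average.andThenPresent(d => math.round(d * 100) / 100.0)

val result: Double = roundedAverage(List(2, 3, 5, 7, 11, 13, 17))
```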

30. map (prepare)
reduce (semigroup)
map (present)

31. map (prepare)
reduce (semigroup)
map (present)
Aggregator 1 Aggregator 2

32. map (prepare)
reduce (semigroup)
map (present)
Joined Aggregator
Aggregator * Aggregator = Aggregator
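
Joining two aggregators can be sketched by pairing up their middle values, so that both results come out of one pass over the data. As before, `Agg` is a hypothetical stand-in for the slide's trait, and this `join` is an illustration rather than the library's implementation.

```scala
// Hypothetical stand-in: three functions instead of the trait on the slide.
final case class Agg[In, Mid, Out](prepare: In => Mid, plus: (Mid, Mid) => Mid, present: Mid => Out) {
  // Run map, reduce, map over a non-empty collection.
  def apply(items: Iterable[In]): Out = present(items.map(prepare).reduce(plus))

  // Pair up the middle values so both aggregations share a single pass.
  def join[Mid2, Out2](that: Agg[In, Mid2, Out2]): Agg[In, (Mid, Mid2), (Out, Out2)] =
    Agg[In, (Mid, Mid2), (Out, Out2)](
      in => (prepare(in), that.prepare(in)),
      (a, b) => (plus(a._1, b._1), that.plus(a._2, b._2)),
      m => (present(m._1), that.present(m._2))
    )
}

val minAgg: Agg[Int, Int, Int] = Agg[Int, Int, Int](x => x, (a, b) => math.min(a, b), x => x)
val maxAgg: Agg[Int, Int, Int] = Agg[Int, Int, Int](x => x, (a, b) => math.max(a, b), x => x)

// One aggregator that yields (min, max) together.
val minAndMax: Agg[Int, (Int, Int), (Int, Int)] = minAgg.join(maxAgg)

val result: (Int, Int) = minAndMax(List(2, 3, 5, 7, 11, 13, 17))
```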

33. Aggregators are Applicative Functors
Functor: has a map method:
def map(t: A[T])(fn: T => U): A[U]
Applicative: has a join method:
def join(t: A[T], u: A[U]): A[(T, U)]
Monad: has a flatMap method (not needed for an Applicative):
def flatMap(t: A[T])(fn: T => A[U]): A[U]

34. Aggregators are Applicative Functors
Functor: has a map method:
def map(t: A[T])(fn: T => U): A[U]
Applicative: has a join method:
def join(t: A[T], u: A[U]): A[(T, U)]
Monad: has a flatMap method (not needed for an Applicative):
def flatMap(t: A[T])(fn: T => A[U]): A[U]

35. Let’s go to the REPL
http://bit.ly/AggregatingWithAlice
https://gist.github.com/johnynek/

36. Aggregators “just work” with scala collections
Aggregators are built in to Scalding
Aggregators are easy to use with Spark

37. Algebird with spark:

38. Algebird with spark:

39. Key Points
1) Aggregators encapsulate very general query
logic independent of how it is executed (in
memory, scalding, spark, you name it)
2) Aggregators compose so you can define parts
you use, and easily glue them together
3) Algebird has many advanced, well tested
Aggregators: TopK, HyperLogLog,
CountMinSketch, Mean, Stddev, …

40. Oscar Boykin @posco
Algebird has these aggregators and more: