Sriram
November 02, 2017

# Using Monoids for large scale aggregation - Scala.io, Lyon 2017

In this talk, you will see how Monoids act as a powerful abstraction for building a distributed stats aggregation system. You will also see the high-level architecture of Abel, an in-house system built on this premise.

## Transcript

1. Using Monoids for Large Scale Aggregates
Scala.io 2017

2. Sriram Ramachandrasekaran
@brewkode
Principal Engineer, Indix

3. Ashwanth Kumar
@_ashwanthkumar
Principal Engineer, Indix

4. class Crawler {
  def crawl(url: String): Unit = {
    agent.doCrawl(url)
    // Emit a count of 1 for every URL crawled.
    metric.count("urls_crawled", 1L)
  }
}

5. [Diagram: nine counts of 1, to be combined with +]
1 1 1 1 1 1 1 1 1

6. [Diagram: the first four 1s are added into 4]
1 1 1 1 1 1 1 1 1
4 1 1 1 1 1

7. [Diagram: the next four 1s are added, giving 8, then the last 1]
1 1 1 1 1 1 1 1 1
4 1 1 1 1 1
8 1
= 9 (Total URLs Crawled)

8. 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1
8 1
= 9 (Total URLs Crawled)
(1+1+1+1)+(1+1+1+1)+1 = 4+4+1
(4+5) = (8+1) = 9

9. Associativity
1 1 1 1 1 1 1 1 1
4 1 1 1 1 1
8 1
= 9 (Total URLs Crawled)
(1+1+1+1)+(1+1+1+1)+1 = 4+4+1
(4+5) = (8+1) = 9
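
The grouping on this slide can be checked in a few lines of Scala. This is only a minimal sketch: `AssociativeCount` and its members are illustrative names, not code from the talk.

```scala
object AssociativeCount {
  // Nine crawl events, each contributing a count of 1.
  val events: List[Long] = List.fill(9)(1L)

  // Sum everything in a single pass.
  val allAtOnce: Long = events.sum

  // Sum in groups of four, then sum the partial results:
  // (1+1+1+1) + (1+1+1+1) + 1
  val partials: List[Long] = events.grouped(4).map(_.sum).toList
  val grouped: Long = partials.sum

  def main(args: Array[String]): Unit =
    println(s"partials=$partials allAtOnce=$allAtOnce grouped=$grouped")
}
```

Because addition is associative, `allAtOnce` and `grouped` agree no matter how the events are chunked.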

10. class Crawler {
  def crawl(url: String): Unit = {
    val page = agent.doCrawl(url)
    // Record the response time of each crawled page for averaging.
    metric.average("response_times", page.responseTime)
  }
}

11. 1200 3600 4800

12. [Diagram: each response time becomes a [sum, count] pair]
1200 3600 4800
[1200,1] [3600,1] [4800,1]

13. 1200 3600 4800
[1200,1] [3600,1] [4800,1]
[4800,2] [4800,1]

14. 1200 3600 4800
[1200,1] [3600,1] [4800,1]
[4800,2] [4800,1]
[9600,3]
= 3200 (Average Response Time)
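
The same steps can be written as a small Scala sketch. `AverageDemo` and `Avg` are hypothetical names chosen for illustration, and returning 0.0 for an empty average is an assumption of this sketch, not something the slides specify.

```scala
object AverageDemo {
  // An average is carried as (sum, count); only the final step divides.
  final case class Avg(sum: Long, count: Long) {
    def plus(that: Avg): Avg = Avg(sum + that.sum, count + that.count)
    // Assumption: an empty average is reported as 0.0.
    def value: Double = if (count == 0) 0.0 else sum.toDouble / count
  }

  // Response times from the slide, each lifted to a single observation.
  val observations: List[Avg] = List(1200L, 3600L, 4800L).map(Avg(_, 1L))

  // [1200,1] + [3600,1] = [4800,2]; then + [4800,1] = [9600,3].
  val combined: Avg = observations.reduce(_ plus _)

  def main(args: Array[String]): Unit = println(combined.value)
}
```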

15. Generalizing Sum and Average
● Takes 2 numbers and produces another number (binary operation)
- Add: simple addition of two numbers
- Average: maintains two values, sum and count, and “adds” each of them
● Ordering of operations doesn’t matter (commutative)
● Grouping of operations doesn’t matter (associative)
● Ignores 0s (zero is the identity element)

16. Abstraction
● We are dealing with Sets
● Associative binary operations
● Identity element exists (for additions - it’s zero)

17. Abstraction
● We are dealing with Sets
● Associative binary operations
● Identity element exists (for additions - it’s zero)
= Monoid

18. Abstraction
● We are dealing with Sets
● Associative binary operations
● Identity element exists (for additions - it’s zero)
● Add Commutativity to the mix
= Commutative Monoid
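
As a rough illustration of this abstraction, a monoid can be modelled as a small Scala trait. The names below (`MonoidSketch`, `Monoid`, `combineAll`) are assumptions made for this sketch. Libraries such as twitter/algebird, which the slides reference, ship their own, more complete `Monoid` type.

```scala
object MonoidSketch {
  // A monoid: an associative `combine` plus an identity element `empty`.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // Sum of counts: identity is 0.
  val longSum: Monoid[Long] = new Monoid[Long] {
    val empty: Long = 0L
    def combine(x: Long, y: Long): Long = x + y
  }

  // (sum, count) pairs combine component-wise; the identity is (0, 0).
  val average: Monoid[(Long, Long)] = new Monoid[(Long, Long)] {
    val empty: (Long, Long) = (0L, 0L)
    def combine(x: (Long, Long), y: (Long, Long)): (Long, Long) =
      (x._1 + y._1, x._2 + y._2)
  }

  // Any monoid can fold a collection; an empty collection yields the identity.
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A =
    xs.foldLeft(m.empty)(m.combine)
}
```

Both instances here are also commutative, so the order in which values arrive does not change the result.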

19. Aggregations at Scale
● Associative and Commutative
○ Makes it an EMBARRASSINGLY PARALLEL* problem
● User Queries are handled via Scatter Gather
○ Reduce on individual nodes
○ Re-reduce on the results and return as the response
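
A minimal sketch of the reduce and re-reduce steps, assuming per-key counts held as maps. The node contents and the names `ScatterGather`, `Counts` and `plus` are invented for illustration and do not come from Abel.

```scala
object ScatterGather {
  type Counts = Map[String, Long]

  // Merging two count maps key by key is associative and commutative,
  // with the empty map as identity.
  def plus(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // Hypothetical raw events held by three nodes.
  val nodes: List[List[(String, Long)]] = List(
    List("a" -> 1L, "b" -> 1L, "a" -> 1L),
    List("c" -> 1L, "a" -> 1L),
    List("b" -> 1L)
  )

  // Reduce: each node folds its own events into a partial result.
  val partials: List[Counts] = nodes.map { events =>
    events.foldLeft(Map.empty[String, Long]) { case (acc, (k, v)) => plus(acc, Map(k -> v)) }
  }

  // Re-reduce: the coordinator merges the partial results into the response.
  val response: Counts = partials.foldLeft(Map.empty[String, Long])(plus)
}
```

Since `plus` is associative and commutative, the coordinator can merge partial results in whatever order the nodes reply.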

20. Talk is cheap, show me the code!

21. Introducing Abel

22. Abel
● Monoid-based aggregations
● Durable delivery via Kafka
● Persistence via RocksDB
● User queries handled via Scatter Gather
● Scala all the way
twitter/algebird
ashwanthkumar/suuchi

23. Abel Data Flow
[Diagram: the metrics Count("a", 1L), Count("b", 1L), Count("c", 1L), Unique("ua", "a") and Average("a", 5682) flowing into stats.service.ix]

24. Abel Internals
Metric = Key * Aggregate (Monoid)
case class Metric[T <: Aggregate[T]] (key: Key, value: T with Aggregate[T])

25. Abel Internals
Metric = Key * Aggregate (Monoid)
trait Aggregate[T <: Aggregate[_]] { self: T =>
def plus(another: T): T
def show: JsValue
}
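
To make the trait concrete, here is a simplified, self-contained sketch. It departs from the slide in a few labelled ways: `show` returns a `String` rather than a `JsValue` (the slides do not say which JSON library is used), the bound is written as `Aggregate[T]`, the key is a plain `String`, and `Count`, `Average` and `merge` are hypothetical implementations, not Abel's own.

```scala
object AggregateSketch {
  // Simplified from the slide: `show` returns a String here instead of a JsValue.
  trait Aggregate[T <: Aggregate[T]] { self: T =>
    def plus(another: T): T
    def show: String
  }

  // Hypothetical aggregate: a running count.
  final case class Count(value: Long) extends Aggregate[Count] {
    def plus(another: Count): Count = Count(value + another.value)
    def show: String = value.toString
  }

  // Hypothetical aggregate: an average kept as (sum, count).
  final case class Average(sum: Long, count: Long) extends Aggregate[Average] {
    def plus(another: Average): Average =
      Average(sum + another.sum, count + another.count)
    def show: String = if (count == 0) "NaN" else (sum.toDouble / count).toString
  }

  // Assumption: the key is a plain String to keep the sketch short.
  final case class Metric[T <: Aggregate[T]](key: String, value: T)

  // Two metrics with the same key merge by combining their aggregates.
  def merge[T <: Aggregate[T]](a: Metric[T], b: Metric[T]): Metric[T] = {
    require(a.key == b.key, "metrics must share a key")
    Metric[T](a.key, a.value.plus(b.value))
  }
}
```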

26. Abel Internals
Key = Name * Tags * Time
case class Time(time: Long, granularity: Long)
case class Key(name: String, tags: SortedSet[String], time: Time = Time.Forever)

27. Abel Internals
client.send(Metric(Key(
  name = "unique-url-per-hour",
  tags = SortedSet("www.amazon.com"),
  time = Time.ThisHour
), UniqueCount("http://...")))

28. Abel Internals
To find the unique count of URLs crawled per site, for every day and forever:
client.send(Metrics(
  "unique-urls",
  tag("site:www.amazon.com") *
    (perday | forever) * now,
  UniqueCount("http://...")
))

29. Abel Internals
To find the unique count of URLs crawled per site and across sites, for every day and forever:
client.send(Metrics(
  "unique-urls",
  (tag("site:www.amazon.com") | `#`) *
    (perday | forever) * now,
  UniqueCount("http://...")
))

30. Abel Internals
To find the unique count of URLs crawled per site and across sites, for every day and forever:
client.send(Metrics(
  "unique-urls",
  (tag("site:www.amazon.com") | `#`) *
    (perday | forever) * now,
  UniqueCount("http://...")
))
It is implemented as a Ring.
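
One way to picture how `|` and `*` could expand into concrete keys is the sketch below. It is a guess at the idea, not Abel's implementation: `KeyExpr`, `part` and the reading of `#` as "all sites" are assumptions. Strictly speaking this model is only a semiring, because set union has no additive inverse.

```scala
object KeyAlgebra {
  // Hypothetical model: an expression is a set of alternatives, and each
  // alternative is the set of parts that make up one concrete key.
  final case class KeyExpr(alternatives: Set[Set[String]]) {
    // `|` plays the role of addition: keep the alternatives from both sides.
    def |(that: KeyExpr): KeyExpr = KeyExpr(alternatives ++ that.alternatives)

    // `*` plays the role of multiplication: pair every alternative on the
    // left with every alternative on the right.
    def *(that: KeyExpr): KeyExpr =
      KeyExpr(for (a <- alternatives; b <- that.alternatives) yield a ++ b)
  }

  def part(name: String): KeyExpr = KeyExpr(Set(Set(name)))

  val site: KeyExpr = part("site:www.amazon.com")
  val all: KeyExpr = part("#") // assumed to mean "across all sites"
  val perday: KeyExpr = part("perday")
  val forever: KeyExpr = part("forever")
  val now: KeyExpr = part("now")

  // Distributivity expands this single expression into four concrete keys.
  val keys: KeyExpr = (site | all) * (perday | forever) * now
}
```

With this model a single `send` fans out to one key per alternative, so the per-site and cross-site aggregates for both granularities are updated together.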

31. Abel v1
[Diagram: the metrics Count("a", 1L), Count("c", 1L) and Average("a", 5682) flowing into stats.service.ix]

32. Abel v1
[Diagram: the metrics Count("a", 1L), Count("b", 1L), Count("c", 1L), Unique("ua", "a") and Average("a", 5682) flowing into stats.service.ix]

33. Abel in Distributed Mode
[Diagram: the metrics Count("a", 1L), Average("a", 5682), Count("c", 1L), Count("b", 1L) and Unique("ua", "a") are sent to stats.service.ix. DNS based load balancing (A stats.service.ix: 1.1.1.1, 1.1.1.2, 1.1.1.3) spreads them over three nodes, and each node applies aggregate.plus.]

34. Scatter Gather - Average
Reduce: (123, 8)
Reduce: (3, 1)
Reduce: (12303, 24)

35. Scatter Gather - Average
(123, 8)
(3, 1)
(12303, 24)
Re-reduce: (12429, 33) = 376.6
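
The re-reduce step with the slide's numbers, as a short sketch (`ReReduceAverage` and its members are illustrative names). It also shows why the nodes return (sum, count) pairs and not finished averages: averaging the per-node averages gives a different, wrong answer.

```scala
object ReReduceAverage {
  // Partial (sum, count) results returned by the three nodes on the slide.
  val partials: List[(Long, Long)] = List((123L, 8L), (3L, 1L), (12303L, 24L))

  // Re-reduce: add sums and counts component-wise.
  val (sum, count) = partials.foldLeft((0L, 0L)) {
    case ((s, c), (ps, pc)) => (s + ps, c + pc)
  }

  // Divide only once, at the very end.
  val average: Double = sum.toDouble / count

  // For contrast: the mean of the per-node averages ignores how many
  // observations each node saw.
  val averageOfAverages: Double =
    partials.map { case (s, c) => s.toDouble / c }.sum / partials.size

  def main(args: Array[String]): Unit =
    println(f"average=$average%.1f averageOfAverages=$averageOfAverages%.1f")
}
```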

36. Abel
● Monoid-based aggregations
● Durable delivery via Kafka
● Persistence via RocksDB
● User queries handled via Scatter Gather
● Scala all the way
twitter/algebird
ashwanthkumar/suuchi

37. Monoid Cheatsheet

| Stat / Metric Type | Abstraction |
| --- | --- |
| Count of URLs | Sum |
| Average Response Time | Sum with Count & Total |
| Unique count of URLs crawled | HyperLogLog |
| HTTP Response Code Distribution | Count-Min Sketch |
| Top K Websites with poor response time | Heap with K elements |
| Website response time percentiles | QTree (loosely based on q-digest) |
| Histogram of response times | Array (to model bins) and slotwise Sum |
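
Two of the simpler rows can be sketched directly. The function names are hypothetical, and the Top K version uses a sorted list for brevity where the cheatsheet suggests a heap.

```scala
object CheatsheetSketch {
  // Histogram: bins are slots of a vector; combining adds slot by slot.
  // Assumes both histograms use the same bins.
  def plusHistogram(a: Vector[Long], b: Vector[Long]): Vector[Long] = {
    require(a.length == b.length, "histograms must have the same bins")
    a.zip(b).map { case (x, y) => x + y }
  }

  // Top K: merge both candidate lists and keep the K largest values.
  // A sorted list stands in for the heap named in the cheatsheet.
  def plusTopK(k: Int)(a: List[Long], b: List[Long]): List[Long] =
    (a ++ b).sorted(Ordering[Long].reverse).take(k)
}
```

Both operations are associative and commutative, so they fit the same reduce and re-reduce flow as counts and averages.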

38. Credits
VinothKumar Raman
@eventaken
Swathi Ravichandran
@swathrav
Thank you

39. Meta

40. Ring
● Abelian group under Addition
○ Associative
○ Commutative
○ Identity
○ Inverse
● Monoid under multiplication
○ Associative
○ Multiplicative Identity
● Multiplication is distributive with respect to addition
○ (a + b) · c = (a · c) + (b · c) (right distributivity)
○ a · (b + c) = (a · b) + (a · c) (left distributivity)

41. suuchi
toolkit for building distributed
function shipping applications
github.com/ashwanthkumar/suuchi

42. Slides designed by
www.swathiravichandran.com