Using Monoids for large scale aggregation - Scala.io, Lyon 2017
In this talk, you will see how Monoids act as a powerful abstraction for building a distributed stats aggregation system. You will also see the high-level architecture of Abel, an in-house system built on this premise.
Generalizing Sum and Average
● Takes 2 numbers and produces another number (a binary operation)
  - Add: simply add the two numbers
  - Average: maintain two values, sum and count, and "add" each of them
● Ordering of operations doesn't matter (commutative)
● Grouping of operations doesn't matter (associative)
● Ignores 0s (adding zero leaves the value unchanged)
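A minimal Scala sketch of Average as such an operation (the Average case class and its of helper are illustrative names, not from the talk):

case class Average(sum: Double, count: Long) {
  // binary operation: add the sums and add the counts
  def +(other: Average): Average = Average(sum + other.sum, count + other.count)
  // divide only when the value is read
  def value: Double = if (count == 0) 0.0 else sum / count
}

object Average {
  val zero = Average(0.0, 0L)                  // the "ignored" zero: adding it changes nothing
  def of(x: Double): Average = Average(x, 1L)  // lift a single observation
}

Reordering or regrouping the additions never changes the result, e.g. Average.of(1) + (Average.of(2) + Average.of(3)) equals (Average.of(3) + Average.of(1)) + Average.of(2).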
Abstraction
● We are dealing with sets
● Associative binary operation
● Identity element exists (for addition, it's zero)
● Add commutativity to the mix = Commutative Monoid
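The abstraction written as a Scala type class (Algebird ships an equivalent Monoid; the CommutativeMonoid trait below is only a sketch, and later examples reuse it):

trait CommutativeMonoid[A] {
  def zero: A              // identity element
  def plus(x: A, y: A): A  // associative and commutative binary operation
}

object CommutativeMonoid {
  // the simplest instance: Sum over Long
  implicit val longSum: CommutativeMonoid[Long] = new CommutativeMonoid[Long] {
    def zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  def sum[A](xs: Iterable[A])(implicit m: CommutativeMonoid[A]): A =
    xs.foldLeft(m.zero)(m.plus)
}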
Aggregations at Scale
● Associative and Commutative
  ○ Makes it an embarrassingly parallel problem
● User queries are handled via Scatter-Gather
  ○ Reduce on individual nodes
  ○ Re-reduce the partial results and return that as the response
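A sketch of that scatter-gather reduce, reusing the CommutativeMonoid sketch above; in Abel each shard would live on a different node, while here the shards are plain in-memory sequences:

def scatterGather[A](shards: Seq[Seq[A]])(implicit m: CommutativeMonoid[A]): A = {
  val partials = shards.map(shard => shard.foldLeft(m.zero)(m.plus)) // reduce on individual nodes
  partials.foldLeft(m.zero)(m.plus)                                  // re-reduce and return as the response
}

Because plus is associative and commutative, any split of the data across shards and any merge order produces the same answer, which is what makes the problem embarrassingly parallel.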
Abel
● Monoid-based aggregations
● Durable delivery via Kafka
● Persistence via RocksDB
● User queries handled via Scatter-Gather
● Scala all the way
Built on twitter/algebird and ashwanthkumar/suuchi
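The write path these bullets describe can be sketched as a read-merge-write loop. This is a hypothetical sketch, not the actual Abel code: KVStore and Aggregator are illustrative names standing in for the RocksDB store and the handler of messages delivered from Kafka, and it reuses the CommutativeMonoid sketch from earlier:

trait KVStore[K, V] {
  def get(key: K): Option[V]
  def put(key: K, value: V): Unit
}

class Aggregator[K, V](store: KVStore[K, V])(implicit m: CommutativeMonoid[V]) {
  // called for every metric delivered from the Kafka topic
  def ingest(key: K, value: V): Unit = {
    val merged = m.plus(store.get(key).getOrElse(m.zero), value) // merge with the stored aggregate
    store.put(key, merged)
  }
}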
Abel Internals
Key = Name * Tags * Time

case class Time(time: Long, granularity: Long)
case class Key(name: String, tags: SortedSet[String], time: Time = Time.Forever)
Abel Internals

client.send(Metric(
  Key(
    name = "unique-url-per-hour",
    tags = SortedSet("www.amazon.com"),
    time = Time.ThisHour
  ),
  UniqueCount("http://...")
))
Abel Internals
To find the unique count of URLs crawled per site, for every day and forever:

client.send(Metrics(
  "unique-urls",
  tag("site:www.amazon.com") * (perday | forever) * now,
  UniqueCount("http://...")
))
Abel Internals
To find the unique count of URLs crawled per site and across all sites, for every day and forever. The key expression is implemented as a Ring:

client.send(Metrics(
  "unique-urls",
  (tag("site:www.amazon.com") | `#`) * (perday | forever) * now,
  UniqueCount("http://...")
))
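A hedged sketch of how such a key expression can expand: | behaves like ring addition (union of alternatives) and * like multiplication (cross product), with * distributing over |. KeyExpr and its part helper are illustrative, not the actual Abel DSL:

case class KeyExpr(parts: Set[List[String]]) {
  def |(other: KeyExpr): KeyExpr =
    KeyExpr(parts ++ other.parts)                            // addition: union of alternatives
  def *(other: KeyExpr): KeyExpr =
    KeyExpr(for (a <- parts; b <- other.parts) yield a ++ b) // multiplication: cross product
}

object KeyExpr {
  def part(p: String): KeyExpr = KeyExpr(Set(List(p)))
}

import KeyExpr.part
// (site | #) * (perday | forever) distributes into four concrete keys:
// [site, perday], [site, forever], [#, perday], [#, forever]
val expanded = (part("site:www.amazon.com") | part("#")) * (part("perday") | part("forever"))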
Monoid Cheatsheet (Stat / Metric → Type Abstraction)
● Count of URLs → Sum
● Average response time → Sum with count & total
● Unique count of URLs crawled → HyperLogLog
● HTTP response code distribution → Count-Min Sketch
● Top K websites with poor response time → Heap with K elements
● Website response time percentiles → QTree (loosely based on q-digest)
● Histogram of response times → Array (to model bins) with slot-wise Sum
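As an example of one row, the unique count of URLs crawled can be sketched with twitter/algebird's HyperLogLogMonoid (the bit width and URLs below are illustrative):

import com.twitter.algebird.HyperLogLogMonoid

val hll = new HyperLogLogMonoid(12)                           // 12 bits, roughly 1.6% standard error
val urls = Seq("http://a", "http://b", "http://a")
val sketches = urls.map(u => hll.create(u.getBytes("UTF-8"))) // one sketch per observed URL
val combined = hll.sum(sketches)                              // monoid sum of the sketches
println(combined.approximateSize)                             // approximately 2 unique URLs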
Ring
● Abelian group under addition
  ○ Associative
  ○ Commutative
  ○ Identity
  ○ Inverse
● Monoid under multiplication
  ○ Associative
  ○ Multiplicative identity
● Multiplication is distributive with respect to addition
  ○ (a + b) · c = (a · c) + (b · c)  (right distributivity)
  ○ a · (b + c) = (a · b) + (a · c)  (left distributivity)
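The same structure as a Scala type class sketch (Algebird ships its own Ring; this one is only illustrative):

trait Ring[A] {
  def zero: A              // additive identity
  def one: A               // multiplicative identity
  def plus(x: A, y: A): A  // associative and commutative
  def negate(x: A): A      // additive inverse
  def times(x: A, y: A): A // associative, distributes over plus
}

// Int under ordinary + and * is the canonical instance
val intRing: Ring[Int] = new Ring[Int] {
  def zero = 0
  def one = 1
  def plus(x: Int, y: Int) = x + y
  def negate(x: Int) = -x
  def times(x: Int, y: Int) = x * y
}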