Using Monoids for Large Scale Aggregates

Co-presented the talk with @brewkode at Scala.io 2017.
Video - https://www.youtube.com/watch?v=UW3Z_rIPn3w&list=PLjkHSzY9VuL9UI2oYMc4HKKu_Dl9TOnPL&t=0s&index=61

Ashwanth Kumar

November 02, 2017

Transcript

  1. 4.
  2. 7.

    [Diagram: a row of nine 1s, one per crawled URL; the first four are added to give 4, leaving 4 + 1 + 1 + 1 + 1 + 1.]
  3. 8.

    [Diagram: nine 1s reduce to 4 + 1 + 1 + 1 + 1 + 1, then to 8 + 1 = 9 Total URLs Crawled.]
  4. 9.

    [Diagram: nine 1s reduce to 4 + 1 + 1 + 1 + 1 + 1, then to 8 + 1 = 9 Total URLs Crawled.]
    (1+1+1+1) + (1+1+1+1) + 1 = 4 + 4 + 1
    (4 + 5) = (8 + 1) = 9
  5. 10.

    Associativity
    [Diagram: nine 1s reduce to 4 + 1 + 1 + 1 + 1 + 1, then to 8 + 1 = 9 Total URLs Crawled.]
    (1+1+1+1) + (1+1+1+1) + 1 = 4 + 4 + 1
    (4 + 5) = (8 + 1) = 9
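The regrouping on this slide can be checked directly. The following is a minimal sketch, not code from the talk; the object and value names are made up for illustration.

```scala
object GroupingDemo {
  // One 1 per crawled URL, as on the slide.
  val counts: List[Int] = List.fill(9)(1)

  // Plain left-to-right sum.
  val sequential: Int = counts.foldLeft(0)(_ + _)

  // Sum in chunks of four first, then add the partial sums:
  // (1+1+1+1) + (1+1+1+1) + 1 = 4 + 4 + 1
  val partials: List[Int] = counts.grouped(4).map(_.sum).toList
  val regrouped: Int = partials.sum
}
```

Because addition is associative, `sequential` and `regrouped` agree no matter how the chunks are chosen.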
  6. 11.

    class Crawler {
      def crawl(url: String): Unit = {
        val page = agent.doCrawl(url)
        // Report this page's response time to the running average
        metric.average("response_times", page.responseTime)
      }
    }
  7. 16.

    Generalizing Sum and Average
    • Takes 2 numbers and produces another number (binary operation)
      - Add: simple add of two numbers
      - Average: maintain two values, sum and count, and "add" each of them
    • Ordering of operations doesn't matter (commutative)
    • Grouping of operations doesn't matter (associative)
    • Ignores 0s
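To make the "maintain two values" idea concrete, here is a small sketch of an average carried as a (sum, count) pair. The `Avg` type and its members are hypothetical names chosen for this example, not Abel's or Algebird's API.

```scala
object AverageDemo {
  // An average is carried as (sum, count) so that two partial results
  // can be "added" field by field.
  final case class Avg(sum: Double, count: Long) {
    def plus(other: Avg): Avg = Avg(sum + other.sum, count + other.count)
    // The actual average is only computed when it is read.
    def value: Double = if (count == 0) 0.0 else sum / count
  }

  object Avg {
    // The "zero" that is ignored when added.
    val zero: Avg = Avg(0.0, 0L)
    def of(x: Double): Avg = Avg(x, 1L)
  }

  // Hypothetical response times in milliseconds.
  val times: List[Double] = List(100.0, 200.0, 600.0)
  val combined: Avg = times.map(Avg.of).foldLeft(Avg.zero)(_ plus _)
}
```

Averaging the averages directly would give a different answer ((100 + 200) / 2 = 150, then (150 + 600) / 2 = 375), which is why the pair is kept until the end.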
  8. 17.

    Abstraction
    • We are dealing with Sets
    • Associative binary operations
    • Identity element exists (for addition, it's zero)
  9. 18.

    Abstraction
    • We are dealing with Sets
    • Associative binary operations
    • Identity element exists (for addition, it's zero)
    = Monoid
  10. 19.

    Abstraction
    • We are dealing with Sets
    • Associative binary operations
    • Identity element exists (for addition, it's zero)
    • Add Commutativity to the mix
    = Commutative Monoid
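The three ingredients (a set of values, an associative binary operation, an identity element) fit in a small trait. This is a hand-rolled sketch for illustration; Algebird ships its own `Monoid`, whose exact API is not reproduced here.

```scala
object MonoidSketch {
  // A set of values A, an associative `plus`, and an identity `zero`.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  // Addition on Long: zero is 0, and it also happens to be commutative.
  val longSum: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Fold any sequence down to a single value using the monoid.
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A =
    xs.foldLeft(m.zero)(m.plus)

  // Spot checks of the laws on sample values (not a proof).
  def isAssociative[A](m: Monoid[A])(a: A, b: A, c: A): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

  def hasIdentity[A](m: Monoid[A])(a: A): Boolean =
    m.plus(a, m.zero) == a && m.plus(m.zero, a) == a
}
```

For example, `MonoidSketch.combineAll(Seq(1L, 2L, 3L))(MonoidSketch.longSum)` folds the three values into their sum, and an empty sequence folds to `zero`.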
  11. 20.

    Aggregations at Scale
    • Associative and Commutative
      ◦ Makes it an EMBARRASSINGLY PARALLEL* problem
    • User Queries are handled via Scatter Gather
      ◦ Reduce on individual nodes
      ◦ Re-reduce on the results and return as the response
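The reduce and re-reduce steps can be simulated in a single process. In this sketch the shards, keys, and the `merge` helper are all hypothetical; a real deployment would run the per-node step on separate machines.

```scala
object ScatterGatherSketch {
  type Counts = Map[String, Long]

  // Merge two count maps by adding the values of matching keys.
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // Hypothetical events as seen by three different nodes.
  val shards: List[List[String]] =
    List(List("a", "b", "a"), List("b", "c"), List("a"))

  // Reduce on individual nodes: each shard produces its own partial counts.
  val partials: List[Counts] =
    shards.map(_.foldLeft(Map.empty[String, Long])((acc, k) => merge(acc, Map(k -> 1L))))

  // Re-reduce on the partial results to build the response.
  val result: Counts = partials.foldLeft(Map.empty[String, Long])(merge)

  // Gathering in a different order gives the same answer.
  val resultReversed: Counts = partials.reverse.foldLeft(Map.empty[String, Long])(merge)
}
```

Since `merge` is associative and commutative, the coordinator does not need to care which node answers first.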
  12. 23.

    Abel
    • Monoid based aggregations
    • Durable delivery via Kafka
    • Persistence via RocksDB
    • User queries handled via Scatter Gather
    • Scala all the way
    twitter/algebird
    ashwanthkumar/suuchi
  13. 25.

    Abel Internals
    Metric = Key * Aggregate (Monoid)
    case class Metric[T <: Aggregate[T]](key: Key, value: T with Aggregate[T])
  14. 26.

    Abel Internals
    Metric = Key * Aggregate (Monoid)
    trait Aggregate[T <: Aggregate[_]] { self: T =>
      def plus(another: T): T
      def show: JsValue
    }
  15. 27.

    Abel Internals
    Key = Name * Tags * Time
    case class Time(time: Long, granularity: Long)
    case class Key(name: String, tags: SortedSet[String], time: Time = Time.Forever)
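To see how the pieces on the Abel Internals slides fit together, here is a self-contained sketch modelled on them. Several details are assumptions: `show: JsValue` is left out to avoid a JSON dependency, the value of `Time.Forever` is invented, and `Count`, `rollup`, and the `"urls-crawled"` key are hypothetical names that do not come from Abel.

```scala
import scala.collection.immutable.SortedSet

object AbelModelSketch {
  final case class Time(time: Long, granularity: Long)
  object Time {
    // Assumption: the slides do not show how Forever is defined.
    val Forever: Time = Time(0L, 0L)
  }

  final case class Key(name: String, tags: SortedSet[String], time: Time = Time.Forever)

  // Simplified version of the slide's trait (no `show`).
  trait Aggregate[T <: Aggregate[T]] { self: T =>
    def plus(another: T): T
  }

  // Hypothetical aggregate: a plain counter.
  final case class Count(value: Long) extends Aggregate[Count] {
    def plus(another: Count): Count = Count(value + another.value)
  }

  final case class Metric[T <: Aggregate[T]](key: Key, value: T)

  // Metrics that share a key are combined with `plus`.
  def rollup[T <: Aggregate[T]](metrics: Seq[Metric[T]]): Map[Key, T] =
    metrics.groupBy(_.key).map { case (key, ms) =>
      key -> ms.map(_.value).reduce[T](_ plus _)
    }

  val siteKey: Key = Key("urls-crawled", SortedSet("site:www.amazon.com"))
  val merged: Map[Key, Count] =
    rollup(Seq(Metric(siteKey, Count(1L)), Metric(siteKey, Count(1L))))
}
```

The key decides which values meet; the aggregate decides how they combine.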
  16. 29.

    Abel Internals
    client.send(Metrics(
      "unique-urls",
      tag("site:www.amazon.com") * (perday | forever) * now,
      UniqueCount("http://...")
    ))
    To find Unique count of URLs crawled per site for every day and forever.
  17. 30.

    Abel Internals
    client.send(Metrics(
      "unique-urls",
      (tag("site:www.amazon.com") | `#`) * (perday | forever) * now,
      UniqueCount("http://...")
    ))
    To find Unique count of URLs crawled per site and across sites for every day and forever.
  18. 31.

    Abel Internals
    client.send(Metrics(
      "unique-urls",
      (tag("site:www.amazon.com") | `#`) * (perday | forever) * now,
      UniqueCount("http://...")
    ))
    To find Unique count of URLs crawled per site and across sites for every day and forever.
    It is implemented as a Ring.
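The slides do not show how the key expression is evaluated, only that it is implemented as a Ring. One way to picture it, purely as an assumption, is to treat `|` as addition (keep both alternatives) and `*` as multiplication (combine every alternative on the left with every alternative on the right). The `Dims` type and `dim` helper below are hypothetical, and since set union has no inverse this sketch only illustrates the distributive part.

```scala
object KeyExpansionSketch {
  // Each term is one concrete combination, e.g. List("site:www.amazon.com", "perday").
  final case class Dims(terms: Set[List[String]]) {
    // "|" acts like addition: keep the alternatives from both sides.
    def |(other: Dims): Dims = Dims(terms ++ other.terms)

    // "*" acts like multiplication: pair every left term with every right term.
    def *(other: Dims): Dims =
      Dims(for { l <- terms; r <- other.terms } yield l ++ r)
  }

  def dim(value: String): Dims = Dims(Set(List(value)))

  val site: Dims = dim("site:www.amazon.com")
  val allSites: Dims = dim("#")
  val perday: Dims = dim("perday")
  val forever: Dims = dim("forever")

  // Mirrors the shape of the slide's expression (without `now`).
  val expanded: Dims = (site | allSites) * (perday | forever)

  // Right distributivity on this example: (a | b) * c == (a * c) | (b * c)
  val distributes: Boolean =
    ((site | allSites) * perday) == ((site * perday) | (allSites * perday))
}
```

Under this reading, a single `send` fans out into four keys: per site per day, per site forever, all sites per day, and all sites forever.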
  19. 34.

    Abel in Distributed Mode
    [Diagram: DNS based Load Balancing. A stats.service.ix -> 1.1.1.1, 1.1.1.2, 1.1.1.3; each node runs aggregate.plus. Metrics shown: Count("a", 1L), Count("b", 1L), Count("c", 1L), Average("a", 5682), Unique("ua", "a").]
  20. 36.
  21. 37.

    Abel
    • Monoid based aggregations
    • Durable delivery via Kafka
    • Persistence via RocksDB
    • User queries handled via Scatter Gather
    • Scala all the way
    twitter/algebird
    ashwanthkumar/suuchi
  22. 38.

    Monoid Cheatsheet

    Stat / Metric Type                        Abstraction
    Count of Urls                             Sum
    Average Response Time                     Sum with Count & Total
    Unique count of urls crawled              HyperLogLog
    HTTP Response Code Distribution           Count-Min Sketch
    Top K Websites with poor response time    Heap with K elements
    Website response times percentiles        QTree (loosely based on q-digest)
    Histogram of response times               Array (to model bins) and slotwise Sum
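As an illustration of the last row of the cheatsheet, a histogram can be modelled as a fixed-size array of bins that is added slot by slot. The bin boundaries and sample values below are hypothetical.

```scala
object HistogramSketch {
  // Hypothetical bins in milliseconds: [0, 100), [100, 500), [500, infinity).
  val upperBounds: Vector[Long] = Vector(100L, 500L)

  // The identity element: a histogram with every bin at zero.
  val empty: Vector[Long] = Vector.fill(upperBounds.size + 1)(0L)

  // A single observation is a histogram with a 1 in exactly one bin.
  def single(responseTimeMs: Long): Vector[Long] = {
    val idx = upperBounds.indexWhere(responseTimeMs < _)
    val bin = if (idx == -1) upperBounds.size else idx
    empty.updated(bin, 1L)
  }

  // Slotwise sum: bins at the same position are added together.
  def plus(a: Vector[Long], b: Vector[Long]): Vector[Long] =
    a.zip(b).map { case (x, y) => x + y }

  val histogram: Vector[Long] =
    List(20L, 150L, 90L, 2000L).map(single).foldLeft(empty)(plus)
}
```

This only works when every node uses the same bin layout, because the slots are matched by position.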
  23. 40.
  24. 41.

    Ring
    • Abelian group under Addition
      ◦ Associative
      ◦ Commutative
      ◦ Identity
      ◦ Inverse
    • Monoid under multiplication
      ◦ Associative
      ◦ Multiplicative Identity
    • Multiplication is distributive with respect to addition
      ◦ (a + b) . c = (ac + bc) Right Distributivity
      ◦ a . (b + c) = (ab + ac) Left Distributivity
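The ring laws listed above can be spot-checked on ordinary integers. The trait below is a hand-rolled sketch for illustration and is not Algebird's `Ring`; the checks test sample values only and prove nothing in general.

```scala
object RingLawsSketch {
  trait Ring[A] {
    def zero: A                // additive identity
    def one: A                 // multiplicative identity
    def plus(x: A, y: A): A
    def negate(x: A): A        // additive inverse
    def times(x: A, y: A): A
  }

  val longRing: Ring[Long] = new Ring[Long] {
    val zero: Long = 0L
    val one: Long = 1L
    def plus(x: Long, y: Long): Long = x + y
    def negate(x: Long): Long = -x
    def times(x: Long, y: Long): Long = x * y
  }

  // Right and left distributivity for one sample triple.
  def distributes[A](r: Ring[A])(a: A, b: A, c: A): Boolean =
    r.times(r.plus(a, b), c) == r.plus(r.times(a, c), r.times(b, c)) &&
      r.times(a, r.plus(b, c)) == r.plus(r.times(a, b), r.times(a, c))

  // Adding the inverse gives back the additive identity.
  def hasInverse[A](r: Ring[A])(a: A): Boolean =
    r.plus(a, r.negate(a)) == r.zero

  // Multiplying by one leaves the value unchanged.
  def hasOne[A](r: Ring[A])(a: A): Boolean =
    r.times(a, r.one) == a && r.times(r.one, a) == a
}
```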
  25. 42.
  26. 43.