Using Monoids for large scale aggregation - Scala.io, Lyon 2017

Using Monoids for Large Scale Aggregates Scala.io 2017

Sriram Ramachandrasekaran @brewkode Principal Engineer, Indix

Ashwanth Kumar @_ashwanthkumar Principal Engineer, Indix

class Crawler {
  def crawl(url: String): Unit = {
    agent.doCrawl(url)
    metric.count("urls_crawled", 1L)
  }
}

[Diagram: nine crawl events, each counted as 1, are added up in groups]
(1+1+1+1) + (1+1+1+1) + 1 = 4 + 4 + 1
4 + (4 + 1) = (4 + 4) + 1, that is (4 + 5) = (8 + 1) = 9 Total URLs Crawled
Associativity
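The grouping shown above can be sketched in a few lines of Scala. This is only an illustration; the object and value names are made up for the example and are not part of the talk's code.

```scala
object AssociativeSum {
  // Nine crawl events, each contributing a count of 1.
  val events: List[Long] = List.fill(9)(1L)

  // Sum everything in one pass.
  val sequential: Long = events.sum

  // Sum in chunks of four (as separate workers might),
  // then sum the partial results.
  val partials: List[Long] = events.grouped(4).map(_.sum).toList
  val regrouped: Long = partials.sum
}
```

Because addition is associative, `sequential` and `regrouped` agree no matter how the events are chunked.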

class Crawler {
  def crawl(url: String): Unit = {
    val page = agent.doCrawl(url)
    metric.average("response_times", page.responseTime)
  }
}

Response times: 1200, 3600, 4800
As [sum, count] pairs: [1200,1] + [3600,1] + [4800,1]
[1200,1] + [3600,1] = [4800,2]
[4800,2] + [4800,1] = [9600,3], and 9600 / 3 = 3200 Average Response Time
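A minimal sketch of the same arithmetic in Scala, assuming a hypothetical `Avg` case class that carries the running sum and the sample count (the name and shape are illustrative, not Abel's actual type):

```scala
object AverageExample {
  // Hypothetical aggregate: a running sum plus the number of samples.
  final case class Avg(sum: Long, count: Long) {
    // "Adding" two averages adds the sums and the counts separately.
    def plus(other: Avg): Avg = Avg(sum + other.sum, count + other.count)
    // The actual average is only computed when it is read.
    def value: Double = if (count == 0) 0.0 else sum.toDouble / count
  }

  val samples: List[Avg] = List(1200L, 3600L, 4800L).map(Avg(_, 1L))
  val firstTwo: Avg = samples(0).plus(samples(1))
  val all: Avg = firstTwo.plus(samples(2))
  val average: Double = all.value
}
```

Keeping the sum and count apart is what makes the average mergeable; averaging two averages directly would give the wrong answer when the counts differ.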

Generalizing Sum and Average
• Takes 2 numbers and produces another number (binary operation)
  - Add: simple add of two numbers
  - Average: maintains two values, sum and count, and "adds" each of them
• Ordering of operations doesn't matter (commutative)
• Grouping of operations doesn't matter (associative)
• Ignores 0s (adding zero changes nothing)

Abstraction
• We are dealing with Sets
• Associative binary operations
• Identity element exists (for addition, it's zero)
= Monoid
• Add Commutativity to the mix
= Commutative Monoid
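The abstraction can be written down as a small Scala trait. This is a hand-rolled sketch for illustration, not the definition from twitter/algebird; the names `Monoid`, `longSum`, `average` and `combineAll` are assumptions of the example.

```scala
object MonoidSketch {
  // A set A with an associative binary operation and an identity element.
  trait Monoid[A] {
    def zero: A                 // identity: plus(zero, x) == x
    def plus(x: A, y: A): A     // must be associative
  }

  // Plain addition over Long; the identity is 0.
  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Average modelled as (sum, count); the identity is (0, 0).
  val average: Monoid[(Long, Long)] = new Monoid[(Long, Long)] {
    def zero: (Long, Long) = (0L, 0L)
    def plus(x: (Long, Long), y: (Long, Long)): (Long, Long) =
      (x._1 + y._1, x._2 + y._2)
  }

  // Any monoid can be folded the same way, starting from its identity.
  def combineAll[A](values: Seq[A])(m: Monoid[A]): A =
    values.foldLeft(m.zero)(m.plus)
}
```

Both instances here are also commutative, so the order in which values arrive does not affect the result.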

Aggregations at Scale
• Associative and Commutative
  ◦ Makes it an EMBARRASSINGLY PARALLEL* problem
• User Queries are handled via Scatter Gather
  ◦ Reduce on individual nodes
  ◦ Re-reduce on the results and return as the response

Talk is cheap, show me the code!

Introducing Abel

• Monoid based aggregations
• Durable delivery via Kafka
• Persistence via RocksDB
• User queries handled via Scatter Gather
• Scala all the way
Abel, twitter/algebird, ashwanthkumar/suuchi

Abel Data Flow
[Diagram: Count("a", 1L), Count("b", 1L), Count("c", 1L), Unique("ua", "a") and Average("a", 5682) are sent to stats.service.ix]

Abel Internals
Metric = Key * Aggregate (Monoid)
  case class Metric[T <: Aggregate[T]](key: Key, value: T with Aggregate[T])

Abel Internals
Metric = Key * Aggregate (Monoid)
  trait Aggregate[T <: Aggregate[_]] { self: T =>
    def plus(another: T): T
    def show: JsValue
  }
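To make the trait concrete, here is a hedged sketch of one possible aggregate. It uses a simplified stand-in for the trait on the slide: `show` returns a `String` instead of `JsValue` so the example needs no JSON library, and the `Count` case class is hypothetical rather than Abel's own implementation.

```scala
object AggregateSketch {
  // Simplified stand-in for the slide's Aggregate trait
  // (String instead of JsValue, to stay dependency-free).
  trait Aggregate[T <: Aggregate[T]] { self: T =>
    def plus(another: T): T
    def show: String
  }

  // Hypothetical count aggregate: plus is ordinary addition.
  final case class Count(value: Long) extends Aggregate[Count] {
    def plus(another: Count): Count = Count(value + another.value)
    def show: String = value.toString
  }

  // Folding from Count(0) uses the identity element of addition.
  val total: Count =
    List(Count(1L), Count(1L), Count(1L)).foldLeft(Count(0L))(_ plus _)
}
```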

Abel Internals
Key = Name * Tags * Time
  case class Time(time: Long, granularity: Long)
  case class Key(name: String, tags: SortedSet[String], time: Time = Time.Forever)

Abel Internals
  client.send(Metric(
    Key(
      name = "unique-url-per-hour",
      tags = SortedSet("www.amazon.com"),
      time = Time.ThisHour
    ),
    UniqueCount("http://...")
  ))

Abel Internals
To find the unique count of URLs crawled per site, for every day and forever:
  client.send(Metrics(
    "unique-urls",
    tag("site:www.amazon.com") * (perday | forever) * now,
    UniqueCount("http://...")
  ))

Abel Internals
To find the unique count of URLs crawled per site and across sites, for every day and forever:
  client.send(Metrics(
    "unique-urls",
    (tag("site:www.amazon.com") | `#`) * (perday | forever) * now,
    UniqueCount("http://...")
  ))
It is implemented as a Ring.
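One way to picture how such an expression expands into several keys is to treat `|` as a union of alternatives and `*` as a cross product. The sketch below is an assumption-laden illustration, not Abel's implementation: the `Dims` class and `dim` helper are hypothetical, tags are plain strings, and the `now` term is left out.

```scala
object KeyExpansionSketch {
  // A set of alternatives; each alternative is a list of key parts.
  final case class Dims(alternatives: Set[List[String]]) {
    // `|` offers either side: the union of the alternatives.
    def |(other: Dims): Dims = Dims(alternatives ++ other.alternatives)
    // `*` pairs every alternative on the left with every one on the right.
    def *(other: Dims): Dims =
      Dims(for (a <- alternatives; b <- other.alternatives) yield a ++ b)
  }

  // Hypothetical helper that wraps a single key part.
  def dim(name: String): Dims = Dims(Set(List(name)))

  val site: Dims = dim("site:www.amazon.com")
  val allSites: Dims = dim("#")
  val perDay: Dims = dim("perday")
  val forever: Dims = dim("forever")

  // Expands to four combinations: {site, #} x {perday, forever}.
  val keys: Dims = (site | allSites) * (perDay | forever)
}
```

In this sketch `*` distributes over `|`, which mirrors the distributivity law of a ring and is why one `send` call can fan out to every site/time combination.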

Abel v1
[Diagram: Count("a", 1L), Count("b", 1L), Count("c", 1L), Unique("ua", "a") and Average("a", 5682) are sent to stats.service.ix]

Abel in Distributed Mode
[Diagram: Count("a", 1L), Count("b", 1L), Count("c", 1L), Unique("ua", "a") and Average("a", 5682) are sent to stats.service.ix; DNS based Load Balancing spreads them across nodes 1.1.1.1, 1.1.1.2 and 1.1.1.3, and each node applies aggregate.plus]

Scatter Gather - Average
Reduce (on each node): (123, 8)   (3, 1)   (12303, 24)
Re-reduce: (12429, 33) = 376.6
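The re-reduce step can be sketched as a fold over the per-node partial results. The numbers are the ones from the slide; the `SumCount` alias and the object name are assumptions made for this example.

```scala
object ScatterGatherAverage {
  // (sum of response times, number of samples)
  type SumCount = (Long, Long)

  def plus(a: SumCount, b: SumCount): SumCount = (a._1 + b._1, a._2 + b._2)

  // Partial results, one per node, produced by the local reduce step.
  val nodeResults: List[SumCount] = List((123L, 8L), (3L, 1L), (12303L, 24L))

  // Re-reduce: merge the partials with the same plus used on each node.
  val combined: SumCount = nodeResults.foldLeft((0L, 0L))(plus)

  // Divide only once, after everything has been merged.
  val average: Double = combined._1.toDouble / combined._2
}
```

Since `plus` is associative and commutative, the node that gathers the results can merge them in whatever order the responses arrive.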


Monoid Cheatsheet

Stat / Metric Type                          Abstraction
Count of URLs                               Sum
Average Response Time                       Sum with Count & Total
Unique count of URLs crawled                HyperLogLog
HTTP Response Code Distribution             Count-Min Sketch
Top K Websites with poor response time      Heap with K elements
Website response times percentiles          QTree (loosely based on q-digest)
Histogram of response times                 Array (to model bins) and slotwise Sum
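As an illustration of the last row, a histogram can be merged by adding bins slot by slot. This is a minimal sketch; the bin boundaries and all names are hypothetical and chosen only for the example.

```scala
object HistogramSketch {
  // Hypothetical bin upper bounds in milliseconds;
  // one extra trailing bin catches everything larger.
  val upperBounds: Vector[Long] = Vector(1000L, 2000L, 4000L)

  // Identity element: every bin is empty.
  val zero: Vector[Long] = Vector.fill(upperBounds.size + 1)(0L)

  // A single observation is a histogram with one bin set to 1.
  def single(responseTime: Long): Vector[Long] = {
    val idx = upperBounds.indexWhere(responseTime <= _)
    val slot = if (idx == -1) upperBounds.size else idx
    zero.updated(slot, 1L)
  }

  // Slotwise sum: add the bins position by position.
  def plus(a: Vector[Long], b: Vector[Long]): Vector[Long] =
    a.zip(b).map { case (x, y) => x + y }

  val histogram: Vector[Long] =
    List(1200L, 3600L, 4800L).map(single).foldLeft(zero)(plus)
}
```

With these bounds the three sample response times land in the second, third and overflow bins respectively.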

Credits
VinothKumar Raman @eventaken
Swathi Ravichandran @swathrav
Thank you

Ring
• Abelian group under Addition
  ◦ Associative
  ◦ Commutative
  ◦ Identity
  ◦ Inverse
• Monoid under Multiplication
  ◦ Associative
  ◦ Multiplicative Identity
• Multiplication is distributive with respect to addition
  ◦ (a + b) . c = (ac + bc) Right Distributivity
  ◦ a . (b + c) = (ab + ac) Left Distributivity

suuchi toolkit for building distributed function shipping applications github.com/ashwanthkumar/suuchi

Slides designed by www.swathiravichandran.com
