ashwanth kumar
@_ashwanthkumar
principal engineer
using monoids for
large scale
business stats
Slide 2
Slide 2 text
overview
- Stats for Batch Jobs
- Stats for Streaming Workloads
- Generalizing aggregations
with Monoids
- Abel
- Some cool logos!
Slide 4
Slide 4 text
but stats for MR jobs?
Slide 9
Slide 9 text
Used Scalding (from Twitter)
+ Simple to express aggregations
- Have to include intermediary data in the output just for stats
- Have to think about writing stats after writing production code
- Not updated until at least the next run (not “realtime”)
stats as map-reduce jobs
Slide 10
Slide 10 text
but what about
services running
forever?
Slide 11
Slide 11 text
(riemann.io)
Slide 12
Slide 12 text
Service A
Service B
Service C
Slide 16
Slide 16 text
stats for streaming workloads
Push computed rollups to InfluxDB
+ Realtime
+ Allows arbitrary functions as rollups - code as config
- Not distributable - since it allows arbitrary functions
- Stats emission and rollups are at separate places - making it difficult
to test and keep them in sync
Slide 17
Slide 17 text
lessons so far
Slide 18
Slide 18 text
stats for map-reduce jobs
- Have to include intermediary data in the output just for stats
- Have to think about writing stats after writing production code
- Not updated until at least the next run (not “realtime”)
stats for streaming workloads
- Not distributable, since it allows arbitrary functions
- Stats emission and rollups are at separate places, making it difficult to test and keep them in sync
Slide 19
Slide 19 text
what do we really
want in stats?
Slide 20
Slide 20 text
aggregations
distribution / parallelism
real time
Slide 21
Slide 21 text
aggregates as monoids
Slide 23
Slide 23 text
A type with a binary operation (written . below) forms a monoid if:
(x . y) . z = x . (y . z)
(associativity aka semigroup)
identity . x = x . identity = x
(identity)
monoids
trait Semigroup[T] {
def plus(left: T, right: T): T
}
trait Monoid[T] extends Semigroup[T] {
def zero: T
}
Slide 24
Slide 24 text
A monoid can also be commutative
x . y = y . x
monoids
The commutative property of monoids is used for parallel processing on large datasets: partial results can be combined in any order
Slide 25
Slide 25 text
monoids - count / sum
sum is associative
sum(sum(2, 6), 6) == sum(2, sum(6, 6))
sum(8, 6) == sum(2, 12)
14 == 14
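As a minimal sketch of the sum example, the Semigroup and Monoid traits from the earlier slide can be instantiated for Long addition. The names LongSum and sumAll are illustrative and not taken from any library.

```scala
trait Semigroup[T] {
  def plus(left: T, right: T): T
}

trait Monoid[T] extends Semigroup[T] {
  def zero: T
}

// Hypothetical instance: Long addition with 0 as the identity.
object LongSum extends Monoid[Long] {
  def zero: Long = 0L
  def plus(left: Long, right: Long): Long = left + right
}

// Generic fold that works for any monoid (helper name is illustrative).
def sumAll[T](values: Seq[T])(m: Monoid[T]): T =
  values.foldLeft(m.zero)(m.plus)

val grouped1 = LongSum.plus(LongSum.plus(2L, 6L), 6L) // (2 + 6) + 6
val grouped2 = LongSum.plus(2L, LongSum.plus(6L, 6L)) // 2 + (6 + 6)
```

Because zero is the identity, sumAll also has a well-defined answer for an empty input.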
Slide 26
Slide 26 text
monoids - average
Average of an average is not an average, aka, not associative
avg(avg(2, 6), 6) != avg(2, avg(6, 6))
avg(4, 6) != avg(2, 6)
5 != 4
Slide 27
Slide 27 text
monoids - average
But Average can be associative, if we have total & count individually
case class Average(total: Double, count: Long) {
def toAvg: Double = total / count
}
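A possible way to make Average mergeable is sketched below. The plus method, the zero value and the of helper are assumptions added for illustration; only the case class fields and toAvg come from the slide.

```scala
case class Average(total: Double, count: Long) {
  def toAvg: Double = total / count

  // Merge two partial averages by adding totals and counts.
  def plus(other: Average): Average =
    Average(total + other.total, count + other.count)
}

object Average {
  // Identity element; note that zero.toAvg is NaN (0.0 / 0).
  val zero: Average = Average(0.0, 0L)

  // Wrap a single observation (hypothetical helper).
  def of(value: Double): Average = Average(value, 1L)
}

val a = Average.of(2.0)
val b = Average.of(6.0)
val c = Average.of(6.0)

val leftFirst  = a.plus(b).plus(c)  // (a + b) + c
val rightFirst = a.plus(b.plus(c))  // a + (b + c)
```

Both groupings give Average(14.0, 3), so the grouping no longer matters. With arbitrary Double values, floating-point rounding can still make the two totals differ slightly.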
Slide 28
Slide 28 text
monoids : themes
Slide 29
Slide 29 text
monoids : parallelism
Slide 30
Slide 30 text
parallel aggregations
input:   3 4 ... 7 2 1 3 ... 8 7 5 ... 1
chunk A: 3 4 ... 7    -> Σ_A
chunk B: 2 1 3 ... 8  -> Σ_B
chunk C: 7 5 ... 1    -> Σ_C
Σ = Σ_A + Σ_B + Σ_C
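The picture above can be sketched with plain Scala futures: split the input, sum each chunk independently, then add the partial sums. The input data and the chunk size are arbitrary choices for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val numbers: Vector[Long] = (1L to 1000L).toVector

// Split the input into chunks (the chunk size is arbitrary here).
val chunks: List[Vector[Long]] = numbers.grouped(250).toList

// Each chunk is summed independently, potentially on a different thread.
val partials: Future[List[Long]] =
  Future.sequence(chunks.map(chunk => Future(chunk.sum)))

// The partial sums are combined with the same operation.
val total: Long = Await.result(partials, 10.seconds).sum
```

Associativity is what guarantees that the chunked total equals the sequential one.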
Slide 31
Slide 31 text
monoids : approximates
Slide 34
Slide 34 text
- While sum is accurate, distinct counts in constant memory are not
- Approximate structures like HyperLogLog can find unique counts in
constant memory (and a known error bound)
- Two or more HLLs can be merged, and the merge is both associative and
commutative, so it can be expressed as a monoid
monoids - approximates
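Merging two HyperLogLog sketches takes the register-wise maximum, which is why the merge is associative and commutative. The toy below models only that merge step. It is not a real HLL (no hashing, no cardinality estimate), and the Registers type is hypothetical; a production system would use a library implementation such as the one in twitter/algebird.

```scala
// Toy stand-in for an HLL sketch: only the register array, no hashing or estimation.
case class Registers(values: Vector[Int]) {
  // Merge by taking the larger value in each register position.
  def plus(other: Registers): Registers = {
    require(values.length == other.values.length, "sketches must have the same size")
    Registers(values.zip(other.values).map { case (l, r) => math.max(l, r) })
  }
}

object Registers {
  // Identity element: all registers empty.
  def zero(size: Int): Registers = Registers(Vector.fill(size)(0))
}

val x = Registers(Vector(1, 0, 3, 2))
val y = Registers(Vector(0, 4, 1, 2))
val z = Registers(Vector(2, 2, 0, 5))
```

Any grouping or ordering of x, y and z merges to the same registers, so sketches built on different machines can be combined freely.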
Slide 35
Slide 35 text
approximate stats now
is better than accurate
stats tomorrow
Slide 38
Slide 38 text
- Stats can naturally be expressed as monoids
- Because monoids are associative (and some are also commutative), we can
exploit them for massively parallel processing
- We need our stats to be real time, even if they are approximate for
some metrics, as long as the error bounds are known
learnings so far
Slide 39
Slide 39 text
abel
Slide 40
Slide 40 text
Written in Scala
Backed by RocksDB
Uses twitter/algebird for HLL
Uses Kafka for stats delivery
abel
Consumes stats in (near) Realtime
Exposes aggregations over HTTP
Crunches 1M events in less than
15 seconds on 1 machine
Slide 41
Slide 41 text
stats.service.ix
abel data flow
Count(“a”, 1L) Unique(“a”, 1L) Count(“c”, 1L)
Unique(“a”, 1L)
Count(“b”, 1L)
Slide 42
Slide 42 text
abel: internals
Slide 43
Slide 43 text
case class Metric[T <: Aggregate[T]] (key: Key, value: T with Aggregate[T])
abel internals
Metric = Key * Aggregate (Semigroup)
Slide 44
Slide 44 text
trait Aggregate[T <: Aggregate[_]] { self: T =>
def plus(another: T): T
def show: JsValue
}
abel internals
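To show how such an aggregate might be merged per key, here is a self-contained sketch. It repeats the trait's shape but returns a String from show so that no JSON library is needed (an assumption made for this example). Count, the string keys and the merge helper are hypothetical and are not abel's actual code.

```scala
// Same shape as the Aggregate trait above, except that `show` returns a
// String instead of JsValue so the sketch needs no JSON library (assumption).
trait Aggregate[T <: Aggregate[_]] { self: T =>
  def plus(another: T): T
  def show: String
}

// Hypothetical counter aggregate.
case class Count(value: Long) extends Aggregate[Count] {
  def plus(another: Count): Count = Count(value + another.value)
  def show: String = value.toString
}

// Hypothetical in-memory store: merge an incoming value into whatever is
// already stored under the key, or store it as-is if the key is new.
def merge[T <: Aggregate[T]](store: Map[String, T], key: String, value: T): Map[String, T] =
  store.updated(key, store.get(key).map(_.plus(value)).getOrElse(value))

val events = Seq("a" -> Count(1L), "b" -> Count(1L), "a" -> Count(1L))
val store = events.foldLeft(Map.empty[String, Count]) {
  case (acc, (key, value)) => merge(acc, key, value)
}
```

Since the first value for a key is stored directly, only plus is required, which fits the slide's description of the aggregate as a semigroup rather than a full monoid.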
Slide 45
Slide 45 text
case class Time(time: Long, granularity: Long)
case class Key(name:String, tags:SortedSet[String],
time:Time = Time.Forever)
abel internals
Key = Name * Tags * Granularity * Timestamp
Slide 46
Slide 46 text
client.send(Metric(Key(
  name = "unique-ups-per-hour",
  tags = SortedSet("site:www.amazon.com"),
  time = Time.ThisHour
), UniqueCount("825633348769")))
abel internals
Slide 47
Slide 47 text
Let’s find the unique count of UPCs occurring per site and
across all sites, at the granularities of every hour, every
day and overall. That is 2 scopes x 3 granularities, so 6 metrics per record.
abel internals
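One way to picture the 6 metrics is to enumerate the keys for a single record. The sketch below uses simplified stand-ins for Key and Time. The bucket helper, the millisecond granularities, the value chosen for Time.Forever and the metric name "unique-upcs" are all assumptions for illustration.

```scala
import scala.collection.immutable.SortedSet

// Simplified stand-ins for abel's Key and Time; the granularity values are assumptions.
case class Time(time: Long, granularity: Long)

object Time {
  val HourMs: Long = 60L * 60 * 1000
  val DayMs: Long  = 24 * HourMs
  val Forever: Time = Time(0L, Long.MaxValue)

  // Truncate a timestamp to the start of its bucket.
  def bucket(timestampMs: Long, granularityMs: Long): Time =
    Time(timestampMs - (timestampMs % granularityMs), granularityMs)
}

case class Key(name: String, tags: SortedSet[String], time: Time = Time.Forever)

// 2 tag scopes (one site, all sites) x 3 granularities (hour, day, overall) = 6 keys.
def keysFor(site: String, timestampMs: Long): Seq[Key] = {
  val scopes = Seq(SortedSet(s"site:$site"), SortedSet.empty[String])
  val times = Seq(
    Time.bucket(timestampMs, Time.HourMs),
    Time.bucket(timestampMs, Time.DayMs),
    Time.Forever
  )
  for (tags <- scopes; time <- times) yield Key("unique-upcs", tags, time)
}

val keys = keysFor("www.amazon.com", 1500000000000L)
```

Each of the 6 keys would receive the same unique-count value for the record, and the store merges it into the matching bucket.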
abel: distributed (experimental)
made possible with suuchi
(github.com/ashwanthkumar/suuchi)
Slide 50
Slide 50 text
- Peer-to-Peer system built using Suuchi
- The Kafka consumer automatically rebalances partitions across instances
- Uses the scatter-gather primitive in Suuchi to perform query-time
reductions of the metrics before serving them to users
distributed abel
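The query-time reduction can be sketched without Suuchi itself: ask every node for its partial aggregate, then fold the answers with the monoid's plus. The Node class and scatterGather function below are hypothetical and use plain Scala futures, not Suuchi's API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical node: each instance holds partial counts for the keys it consumed.
case class Node(partials: Map[String, Long]) {
  def query(key: String): Future[Option[Long]] = Future(partials.get(key))
}

// Scatter the query to every node, then gather and reduce the partial results.
def scatterGather(nodes: Seq[Node], key: String): Future[Option[Long]] =
  Future.sequence(nodes.map(_.query(key)))
    .map(_.flatten.reduceOption(_ + _))

val nodes = Seq(
  Node(Map("a" -> 2L)),
  Node(Map("a" -> 3L, "b" -> 1L)),
  Node(Map("c" -> 4L))
)

val countA  = Await.result(scatterGather(nodes, "a"), 10.seconds)
val missing = Await.result(scatterGather(nodes, "z"), 10.seconds)
```

Because the reduction is associative, it does not matter which node's partial result arrives first or how the partitions were assigned.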
Slide 51
Slide 51 text
distributed abel architecture
stats.service.ix (A) -> 1.1.1.1, 1.1.1.2, 1.1.1.3 (DNS based load balancing)
Count(“a”, 1L) Unique(“a”, 1L) Count(“c”, 1L) Unique(“a”, 1L) Count(“b”, 1L)
monoid.plus | monoid.plus | monoid.plus
suuchi
toolkit for building distributed
function shipping applications
github.com/ashwanthkumar/suuchi
Slide 56
Slide 56 text
rocksdb
Slide 57
Slide 57 text
Open sourced by Facebook
Fast persistent KV store
Server Workloads
Embeddable
Optimized for SSDs
rocksdb
Fork of LevelDB
Modelled after BigTable
LSM Tree based SST files
Written in C++
Slide 58
Slide 58 text
Simple C++ API
Has bindings in
C
Java
Go
Python
rocksdb
Slide 59
Slide 59 text
rocksdb @indix
Slide 60
Slide 60 text
- Serving our API in production for 3+ years
- Search on hierarchical documents
- Dynamic fields didn’t scale well on Solr
- Brand / Store / Category Counts for a filter
- Price History Service
- Stores more than a billion prices and serves them online to REST queries
rocksdb @indix
Slide 61
Slide 61 text
- Stats (as Monoids) Storage System
  - All we wanted was approximate aggregates in real time
- HTML Archive System
  - Stores ~120TB of URL and timestamp indexed HTML pages
- Real-time scheduler for our crawlers
  - Finds which 20 URLs to crawl now out of 3+ billion URLs
  - Helps the crawler crawl 20+ million URLs every day
rocksdb @indix
Slide 62
Slide 62 text
recursive reduction
Slide 63
Slide 63 text
- sum / multiplication
- (sorted) top-K elements
- operations on a graph, e.g. link reach on the Twitter graph
- the function should be associative and optionally
commutative
recursive reduction
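As an illustration of the (sorted) top-K case, the sketch below merges two partial results by concatenating, sorting and truncating. TopK and its of helper are hypothetical names. The merge is associative and commutative, so partial results can be reduced recursively in any grouping.

```scala
// Hypothetical top-K aggregate: keeps the k largest values seen so far, sorted descending.
case class TopK(k: Int, values: Vector[Long]) {
  def plus(other: TopK): TopK = {
    require(k == other.k, "both sides must use the same k")
    TopK(k, (values ++ other.values).sorted(Ordering[Long].reverse).take(k))
  }
}

object TopK {
  // Wrap a single observation.
  def of(k: Int, value: Long): TopK = TopK(k, Vector(value))
}

val parts = Seq(7L, 2L, 9L, 4L, 8L, 1L).map(TopK.of(3, _))

// Reduce in two different groupings.
val leftToRight = parts.reduceLeft(_ plus _)
val rightToLeft = parts.reduceRight(_ plus _)
```

Both reductions keep the same three largest values, which is the property a recursive (tree-shaped) reduction relies on.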