Slide 1

Slide 1 text

Using Monoids for Business Metrics. Ashwanth Kumar, Principal Engineer (@_ashwanthkumar)

Slide 2

Slide 2 text

wassup?

Slide 3

Slide 3 text

Business Metrics

Slide 4

Slide 4 text

Everyday Rollups

Slide 5

Slide 5 text

Unique commits per day. Source: github.com/ashwanthkumar

Slide 6

Slide 6 text

we saw aggregates

Slide 7

Slide 7 text

we saw aggregates & unique counts

Slide 8

Slide 8 text

we saw aggregates & unique counts rolled up at various intervals

Slide 9

Slide 9 text

we didn’t see them work at GitHub scale

Slide 10

Slide 10 text

aggregates (aka) counts

Slide 11

Slide 11 text

counts

Slide 12

Slide 12 text

counts

Slide 13

Slide 13 text

+ = counts

Slide 14

Slide 14 text

+ = counts

Slide 15

Slide 15 text

counts @ scale

Slide 16

Slide 16 text

uniques

Slide 17

Slide 17 text

uniques

Slide 18

Slide 18 text

uniques

Slide 19

Slide 19 text

U = uniques

Slide 20

Slide 20 text

U = uniques

Slide 21

Slide 21 text

Uniques @ scale Host 1 Host 2 Host 3 Host 4

Slide 22

Slide 22 text

Uniques @ scale Host 1 Host 2 Host 3 Host 4 Reduce at individual hosts

Slide 23

Slide 23 text

Uniques @ scale Master Host * Not the right way to perform uniques, but the simplest

Slide 24

Slide 24 text

Uniques @ scale Master Host

Slide 25

Slide 25 text

we saw how aggregates & unique counts can be computed at scale

Slide 26

Slide 26 text

Wouldn’t it be nice if we could express both kinds of aggregation in a consistent way?

Slide 27

Slide 27 text

Monoid is the hero we need & deserve!

Slide 28

Slide 28 text

monoid

A type with an operation ( . ) is considered a monoid if:

(x . y) . z = x . (y . z)          (associativity, aka semigroup)
identity . x = x . identity = x    (identity)

trait Semigroup[T] { def plus(left: T, right: T): T }
trait Monoid[T] extends Semigroup[T] { def zero: T }
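A minimal sketch of two instances of these traits, assuming counts are Long values combined by addition and exact uniques are sets combined by union. The names MonoidSketch, countMonoid, uniqueMonoid and combineAll are illustrative and not from the talk.

```scala
trait Semigroup[T] { def plus(left: T, right: T): T }
trait Monoid[T] extends Semigroup[T] { def zero: T }

object MonoidSketch {
  // Counts: addition on Long, with 0 as the identity.
  val countMonoid: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(left: Long, right: Long): Long = left + right
  }

  // Exact uniques: set union, with the empty set as the identity.
  def uniqueMonoid[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def zero: Set[A] = Set.empty[A]
    def plus(left: Set[A], right: Set[A]): Set[A] = left union right
  }

  // Fold any sequence of values with the given monoid, starting from zero.
  def combineAll[T](values: Seq[T], m: Monoid[T]): T =
    values.foldLeft(m.zero)(m.plus)
}
```

Both kinds of aggregation go through the same combineAll call; only the monoid instance changes.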

Slide 29

Slide 29 text

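Because monoids are associative (and sometimes commutative), partial results can be grouped and combined in any order of evaluation, which is what gets exploited for massively parallel processing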
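A small sketch of why associativity matters here: the input can be cut into chunks, each chunk reduced on its own, and the partial results combined afterwards, with the same answer as a single sequential fold. The helper name reduceInChunks and its parameters are assumptions made for illustration.

```scala
object ChunkedReduce {
  // Reduce each chunk independently (this is the part that could run in
  // parallel), then combine the partial results with the same operation.
  // chunkSize must be greater than zero.
  def reduceInChunks[T](values: Seq[T], chunkSize: Int, zero: T)(plus: (T, T) => T): T = {
    val partials = values.grouped(chunkSize).map(_.foldLeft(zero)(plus)).toSeq
    partials.foldLeft(zero)(plus)
  }
}
```

For an associative plus with identity zero, the chunk size does not affect the result, so the chunks can live on different hosts.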

Slide 30

Slide 30 text

approx. monoids

➔ A sum can be maintained in constant memory; an exact distinct count cannot.
➔ Approximate structures like HyperLogLog (HLL) can estimate unique counts in constant memory, within a known error bound.
➔ Two or more HLLs can be merged, and HLLs under this merge form a monoid.
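A simplified HyperLogLog sketch to illustrate the merge, not a production implementation and not necessarily what the talk's system uses. It assumes 2^10 registers, a 32-bit MurmurHash3 of the item string, and only the small-range correction; the names Hll, P, M, of, plus and zero are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

// Each register keeps the largest "rank" (leading zeros + 1) seen for its bucket.
final case class Hll(registers: Vector[Int]) {
  def add(item: String): Hll = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - Hll.P)   // top P bits choose the register
    val rest = h << Hll.P          // remaining bits, left-aligned
    val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - Hll.P) + 1
    Hll(registers.updated(idx, math.max(registers(idx), rank)))
  }

  // Raw HLL estimate, falling back to linear counting for small cardinalities.
  def estimate: Double = {
    val m = Hll.M.toDouble
    val alpha = 0.7213 / (1 + 1.079 / m)
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m / zeros) else raw
  }
}

object Hll {
  val P: Int = 10
  val M: Int = 1 << P

  // Monoid identity: a sketch that has seen nothing.
  val zero: Hll = Hll(Vector.fill(M)(0))

  // Monoid plus: register-wise maximum, which is associative and commutative.
  def plus(left: Hll, right: Hll): Hll =
    Hll(left.registers.zip(right.registers).map { case (a, b) => math.max(a, b) })

  def of(items: Iterable[String]): Hll = items.foldLeft(zero)(_ add _)
}
```

Merging the sketches built on two hosts gives the same registers as one sketch built over the union of their items, which is why the per-host results can be re-reduced later.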

Slide 31

Slide 31 text

approximate stats now is better than accurate stats tomorrow

Slide 32

Slide 32 text

Aggregations in a distributed environment are computed using the scatter and gather primitives.
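A single-process sketch of scatter and gather for per-key counts, assuming events are (key, count) pairs and hosts are simulated as partitions of a sequence. The round-robin scatter and the names ScatterGather, scatter, reduceOnHost, gather and mergeCounts are illustrative choices, not taken from the talk.

```scala
object ScatterGather {
  // Merge two per-key count maps by adding the counts of matching keys.
  def mergeCounts(left: Map[String, Long], right: Map[String, Long]): Map[String, Long] =
    right.foldLeft(left) { case (acc, (key, n)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + n)
    }

  // Scatter: spread events over `hosts` partitions (round-robin here).
  def scatter(events: Seq[(String, Long)], hosts: Int): Seq[Seq[(String, Long)]] =
    events.zipWithIndex
      .groupBy { case (_, i) => i % hosts }
      .values
      .map(_.map(_._1))
      .toSeq

  // Reduce at an individual host.
  def reduceOnHost(events: Seq[(String, Long)]): Map[String, Long] =
    events.foldLeft(Map.empty[String, Long]) { case (acc, (key, n)) =>
      mergeCounts(acc, Map(key -> n))
    }

  // Gather: the master re-reduces the per-host partial results.
  def gather(partials: Seq[Map[String, Long]]): Map[String, Long] =
    partials.foldLeft(Map.empty[String, Long])(mergeCounts)
}
```

Because mergeCounts is associative and commutative, the master gets the same totals regardless of how the events were scattered.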

Slide 33

Slide 33 text

Uniques @ scale Host 1 Host 2 Host 3 Host 4 Scatter

Slide 34

Slide 34 text

Uniques @ scale Host 1 Host 2 Host 3 Host 4 Reduce at individual hosts using HLL Scatter

Slide 35

Slide 35 text

Uniques @ scale Master Host Re-Reduce at Master Host using HLL Gather

Slide 36

Slide 36 text

Uniques @ scale Master Host Gather Re-Reduce at Master Host using HLL

Slide 37

Slide 37 text

We built a system called Abel using these ideas @indix

Slide 38

Slide 38 text

distributed abel architecture

[Diagram: DNS A record stats.service.ix resolves to 1.1.1.1, 1.1.1.2, 1.1.1.3 (DNS based Load Balancing). Events such as Count("a", 1L), Unique("a", 1L), Count("c", 1L), Unique("a", 1L), Count("b", 1L) arrive at the hosts, and each host combines them with monoid.plus.]

Slide 39

Slide 39 text

We are looking forward to open-sourcing it soon.

Slide 40

Slide 40 text

Thank you! Image credits: jdhancock.com