Using Monoids for Business Metrics

Presentation given at TW Geeknight Chennai, October '17.
Video can be found at https://www.youtube.com/watch?v=RJepu3sbmkU

Ashwanth Kumar

October 26, 2017
Transcript

  1. Using Monoids for Business Metrics Ashwanth Kumar Principal Engineer @_ashwanthkumar

  2. wassup?

  3. Business Metrics

  4. Everyday Rollups

  5. Unique Commits per day Source - github.com/ashwanthkumar

  6. we saw aggregates

  7. we saw aggregates & unique counts

  8. we saw aggregates & unique counts rolled up at various intervals

  9. we didn’t see them work at GitHub scale

  10. aggregates (aka) counts

  11. counts

  12. counts

  13. + = counts

  14. + = counts

  15. counts @ scale

  16. uniques

  17. uniques

  18. uniques

  19. U = uniques

  20. U = uniques

  21. Uniques @ scale Host 1 Host 2 Host 3 Host 4

  22. Uniques @ scale Host 1 Host 2 Host 3 Host 4 Reduce at individual hosts

  23. Uniques @ scale Master Host * Not the right way to perform uniques, but the simplest

  24. Uniques @ scale Master Host

  25. we saw how aggregates & unique counts can be computed at scale

  26. Wouldn’t it be nice if we could express both kinds of aggregations in a consistent way?

  27. Monoid is the hero we need & deserve!

  28. monoid: An operation ( . ) is considered a monoid if:

      (x . y) . z = x . (y . z)          (associativity aka semigroup)
      identity . x = x . identity = x    (identity)

      trait Semigroup[T] { def plus(left: T, right: T): T }
      trait Monoid[T] extends Semigroup[T] { def zero: T }

  29. Monoids, being associative & sometimes commutative, are exploited for massively parallel processing

  30. approx. monoids

      ➔ While a sum can be computed in constant memory, distinct counts cannot be.
      ➔ Approximate structures like HyperLogLog (HLL) can find unique counts in constant memory (under a known error bound).
      ➔ 2 or more HLLs can be merged, and the merge operation forms a monoid.

  31. approximate stats now is better than accurate stats tomorrow

  32. Aggregations in a distributed environment are computed using the scatter and gather primitives.

  33. Uniques @ scale Host 1 Host 2 Host 3 Host 4 Scatter

  34. Uniques @ scale Host 1 Host 2 Host 3 Host 4 Reduce at individual hosts using HLL Scatter

  35. Uniques @ scale Master Host Re-Reduce at Master Host using HLL Gather

  36. Uniques @ scale Master Host Gather Re-Reduce at Master Host using HLL

  37. We built a system using these @indix called Abel

  38. distributed abel architecture: hosts 1.1.1.1, 1.1.1.2, 1.1.1.3; DNS based Load Balancing (A stats.service.ix -> 1.1.1.1, 1.1.1.2, 1.1.1.3); incoming Count("a", 1L), Unique("a", 1L), Count("c", 1L), Unique("a", 1L), Count("b", 1L) combined on each host with monoid.plus

  39. We are looking forward to open-sourcing it soon.

  40. Thank you! Image credits jdhancock.com
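
The ideas on slides 28 to 36 can be sketched in a few lines of Scala. This is only an illustration, not Abel's code: the `Semigroup` and `Monoid` traits are the ones shown on slide 28, while `longSum`, `setUnion`, `reduceAll` and the sample host data are hypothetical names and values made up for this sketch. An exact `Set` union stands in for HyperLogLog here; it is not constant-memory, but its merge obeys the same monoid laws, so the scatter (reduce per host) and gather (re-reduce at the master) steps have the same shape.

```scala
// Traits as shown on slide 28.
trait Semigroup[T] { def plus(left: T, right: T): T }
trait Monoid[T] extends Semigroup[T] { def zero: T }

object MonoidSketch {
  // Counts: addition with 0 as the identity (hypothetical instance name).
  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(left: Long, right: Long): Long = left + right
  }

  // Uniques: set union with the empty set as the identity.
  // Assumption: an exact Set is used as a stand-in for an HLL sketch.
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def zero: Set[A] = Set.empty[A]
    def plus(left: Set[A], right: Set[A]): Set[A] = left union right
  }

  // Fold any sequence of values with the given monoid.
  def reduceAll[T](values: Seq[T])(m: Monoid[T]): T =
    values.foldLeft(m.zero)(m.plus)

  // Hypothetical input: commit ids observed on each of four hosts.
  val hosts: Seq[Seq[String]] = Seq(
    Seq("c1", "c2"),
    Seq("c2", "c3"),
    Seq("c1"),
    Seq("c4", "c3")
  )

  // Scatter: every host reduces its own events independently.
  val partialCounts: Seq[Long] =
    hosts.map(events => reduceAll(events.map(_ => 1L))(longSum))
  val partialUniques: Seq[Set[String]] =
    hosts.map(events => reduceAll(events.map(Set(_)))(setUnion[String]))

  // Gather: the master re-reduces the partial results with the same monoid.
  val totalCount: Long = reduceAll(partialCounts)(longSum)
  val uniqueCount: Int = reduceAll(partialUniques)(setUnion[String]).size
}
```

Because `plus` is associative, the grouping of events across hosts does not change the final result, which is what makes the per-host reduce followed by a re-reduce at the master safe. Swapping `setUnion` for an HLL-backed monoid would keep this structure and trade exactness for bounded memory.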