
Evolution of stats @ Indix


At Indix we collect and process a lot of data. We monitor the correct behaviour of our system by collecting business metrics. Over time, we moved most of our system from batch map-reduce jobs to Kafka stream tasks, so our stats had to become more real time as well. We built a system called Abel, which aggregates the millions of events it receives and computes stats from them.


VinothKumar Raman

February 13, 2017

Transcript

  1. Evolution of Stats @ Indix

  2. Existing stats and Cockpit

  3. Stats as MapReduce jobs

  4. + Scalding - and it’s simple

  5. - Have to include intermediary data also into output

  6. - Have to think about writing stats after writing prod code

  7. - Not updated at least until the next run (Not “realtime”)

  8. Remember the days when we had to log something to get alerts via Splunk?

  9. Riemann

  10. + “Realtime”

  11. + Allows arbitrary functions as rollups - You really can do anything, of course as long as it’s in Clojure

  12. - Not distributable/reliable, since it allows arbitrary functions

  13. - Emission and rollups live in two different places, hence not so easy to test and they go out of sync

  14. - Well, it’s primarily meant for system monitoring.

  15. What do we really want in stats? - Aggregates

  16. Approximate stats now is better than accurate stats tomorrow.

  17. Aggregates in general

  18. Monoids

  19. Binary op - Closure

  20. Associativity

  21. Identity

  22. Commutative Monoids

  23. Semigroups

  24. Abstract crap

  25. Approximate structures

  26. Unique Counts, Percentiles

  27. Abel

  28. Metric = Key * Aggregate (Semigroup)

  30. Key = Name * Tags * Granularity * Timestamp

  32. Partition-Id and query time scans

  33. Unique count of UPCs per site, every hour, every day and overall is 6 metrics per record

  34. Explosions

  36. Sum, Product, Nothing and Others

  37. Around 1 M events crunched in less than 15 seconds running on 1 machine

  38. Suuchi Scans and Marathon deployment

  39. Over and Out
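
Slides 18 to 34 compress the core idea into a few words: a metric is a key paired with a semigroup aggregate, and each incoming record "explodes" into several keys. The following is a minimal sketch of that idea in Python, not Abel's actual code. All names here (`Sum`, `UniqueCount`, `Key`, `explode`, `aggregate`) are hypothetical. The unique count uses an exact set for clarity, where an approximate structure would normally be swapped in. The split into "with site tag" and "without site tag" times three granularities is one reading of the "6 metrics per record" slide.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Sum:
    """Semigroup under addition; Sum() is the identity, so it is also a monoid."""
    value: int = 0

    def combine(self, other: "Sum") -> "Sum":
        return Sum(self.value + other.value)


@dataclass(frozen=True)
class UniqueCount:
    """Exact unique count (assumption: a real system would use an approximate sketch)."""
    seen: FrozenSet[str] = frozenset()

    def combine(self, other: "UniqueCount") -> "UniqueCount":
        # Set union is associative and commutative, so merge order does not matter.
        return UniqueCount(self.seen | other.seen)

    @property
    def count(self) -> int:
        return len(self.seen)


@dataclass(frozen=True)
class Key:
    """Key = Name * Tags * Granularity * Timestamp (slide 30)."""
    name: str
    tags: Tuple[Tuple[str, str], ...]
    granularity: str
    timestamp: int  # start of the time bucket in epoch seconds; 0 for "forever"


# Hypothetical granularities: bucket size in seconds, None meaning "overall".
BUCKET_SECONDS = {"hour": 3600, "day": 86400, "forever": None}


def bucket(ts: int, granularity: str) -> int:
    size = BUCKET_SECONDS[granularity]
    return 0 if size is None else ts - ts % size


def explode(name: str, tags: Dict[str, str], ts: int) -> Iterable[Key]:
    """Yield one key per (tag choice, granularity): 2 * 3 = 6 keys per record."""
    tag_choices = [(), tuple(sorted(tags.items()))]  # overall, and per tag set
    for tag_choice, granularity in product(tag_choices, BUCKET_SECONDS):
        yield Key(name, tag_choice, granularity, bucket(ts, granularity))


def aggregate(events) -> dict:
    """Fold (name, tags, ts, aggregate) events into a key -> aggregate store."""
    store = {}
    for name, tags, ts, agg in events:
        for key in explode(name, tags, ts):
            store[key] = store[key].combine(agg) if key in store else agg
    return store


events = [
    ("unique_upcs", {"site": "a.com"}, 18010, UniqueCount(frozenset({"upc1"}))),
    ("unique_upcs", {"site": "a.com"}, 18020, UniqueCount(frozenset({"upc2"}))),
    ("unique_upcs", {"site": "b.com"}, 108000, UniqueCount(frozenset({"upc1"}))),
]
store = aggregate(events)
overall = store[Key("unique_upcs", (), "forever", 0)]
print(overall.count)
```

Because `combine` is associative, partial results computed on different partitions can be merged in any grouping, which is what makes this style of aggregate easy to distribute.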