Three weird tips for high performance analytics applications (DataEngConf 2018)

Imply
April 18, 2018

On the Druid project, our goal is to build the engine that powers your analytics applications. Apps like interactive dashboards, slice-and-dice tools, and real-time monitoring systems all need to respond quickly to be effective. This talk is about three virtues that keep analytic apps running smoothly. (And they're not that weird, we promise!)

In this talk, we'll discuss:

1. The virtue of brevity: it's best known as the soul of wit, but is also your gateway to the performance-improving benefits of compression.
2. The virtue of foresight: thinking ahead and storing rollups of your data saves work at query time.
3. The virtue of flexibility: sketches and approximate algorithms can get you results fast, cheap, and 98% right.

Transcript

  1. Gian Merlino (gian@imply.io): Three weird tips for high performance analytics applications
  2. Who am I? Gian Merlino. Committer & PMC member on [logo]. Cofounder & CTO at [logo].
  3. Agenda
    • Intro to Druid
    • The virtue of brevity
    • The virtue of foresight
    • The virtue of flexibility
  4. Open source, high-performance, column-oriented, distributed data store

  5. What is Druid?
    • “high performance”: low query latency, high ingest rates
    • “column-oriented”: best possible scan rates
    • “distributed”: deployed in clusters, typically 10s–100s of nodes
    • “data store”: the cluster stores a copy of your data
  6. The Problem
  7. Powered by Druid. Source: http://druid.io/druid-powered.html (Not an endorsement.)
  8. Powered by Druid. From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.” Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
  9. Brevity

  10. Columnar storage
    Dictionary
    • 0 → Justin Bieber, 1 → Ke$ha
    Data
    • page → [0 0 0 1 1 1]
    • language → [0 0 0 0 0 0]
    Compression
    • Dictionary
    • Bit packing (future: PFOR, RLE, …)
    • LZ4 (optional)
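
The dictionary and bit-packing steps on the columnar storage slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Druid's storage format: the function names `dictionary_encode` and `bit_pack` are hypothetical, and the `"en"` language value is an assumed placeholder, since the slide shows only the encoded IDs.

```python
def dictionary_encode(values):
    """Replace each distinct string with a small integer ID (first-seen order)."""
    dictionary = {}
    ids = []
    for v in values:
        # A new value gets the next unused ID; a known value reuses its ID.
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, ids


def bit_pack(ids, cardinality):
    """Pack IDs into one integer, using only as many bits per ID as needed."""
    width = max(1, (cardinality - 1).bit_length())
    packed = 0
    for i in ids:
        packed = (packed << width) | i
    return packed, width


# The "page" column from the slide: three rows of each artist.
page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
page_dict, page_ids = dictionary_encode(page)
page_packed, page_width = bit_pack(page_ids, len(page_dict))

# Assumed single-valued "language" column (the value itself is a placeholder).
language = ["en"] * 6
lang_dict, lang_ids = dictionary_encode(language)
```

With two distinct values, each row of `page` needs a single bit, so six rows fit in six bits before any general-purpose compression such as LZ4 is applied.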
  11. Indexing
    Query
    • Justin Bieber OR Ke$ha → [111111]
    Index
    • Justin Bieber → [0, 1, 2] → [111000]
    • Ke$ha → [3, 4, 5] → [000111]
    Compression
    • Roaring (https://roaringbitmap.org/)
    • CONCISE
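
As a rough sketch of the indexing slide, the bitmap index below keeps one bitmap per distinct value and answers the OR query with a single bitwise operation. Plain Python integers stand in for the bitmaps; real systems use compressed formats such as Roaring or CONCISE, which this sketch does not attempt to model. The helper names are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap with bit `row` set where it occurs."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def to_bit_string(bitmap, n_rows):
    """Render a bitmap with row 0 on the left, as on the slide."""
    return "".join("1" if (bitmap >> row) & 1 else "0" for row in range(n_rows))


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
index = build_bitmap_index(page)

# "Justin Bieber OR Ke$ha" is a bitwise OR of the two per-value bitmaps.
matches = index["Justin Bieber"] | index["Ke$ha"]
```

Because the filter is resolved on the bitmaps alone, the query only has to read rows whose bit is set.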
  12. Foresight
  13. Rollup
    • Projection of a table
    • OLAP cube
    • Serious reduction in row count (2x – 100x)
  14. Rollup
    time                 make    model    year  sale_price  sale_fee
    2016-01-28 10:12:00  Honda   Civic    2009  10000       200
    2016-01-28 13:14:00  Honda   CRV      2009  9000        150
    2016-01-28 15:12:40  Honda   Civic    2009  9500        320
    2016-01-28 18:11:40  Toyota  Prius    2011  9000        100
    2016-01-28 20:35:40  Toyota  Corolla  2010  11000       130
    2016-01-28 22:42:40  Toyota  Corolla  2010  10000       200
    2016-01-28 23:12:40  Honda   Civic    2009  7000        70
  15. Rollup
    time                 make    model    year  sale_price  sale_fee
    2016-01-28 10:12:00  Honda   Civic    2009  10000       200
    2016-01-28 13:14:00  Honda   CRV      2009  9000        150
    2016-01-28 15:12:40  Honda   Civic    2009  9500        320
    2016-01-28 18:11:40  Toyota  Prius    2011  9000        100
    2016-01-28 20:35:40  Toyota  Corolla  2010  11000       130
    2016-01-28 22:42:40  Toyota  Corolla  2010  10000       200
    2016-01-28 23:12:40  Honda   Civic    2009  7000        70

    time                 make    model    year  count  sum_sale_price  sum_sale_fee  min_sale_fee
    2016-01-28 00:00:00  Honda   Civic    2009  3      26500           590           70
    2016-01-28 00:00:00  Honda   CRV      2009  1      9000            150           150
    2016-01-28 00:00:00  Toyota  Prius    2011  1      9000            100           100
    2016-01-28 00:00:00  Toyota  Corolla  2010  2      21000           330           130
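
The rollup on this slide can be reproduced with a small group-by: truncate each timestamp to the day, group on the remaining dimensions, and keep running aggregates. The sketch below is a plain-Python illustration of that idea, not how Druid performs rollup at ingestion; the function name `rollup` is hypothetical.

```python
from datetime import datetime

# Raw events from the slide: (time, make, model, year, sale_price, sale_fee).
EVENTS = [
    ("2016-01-28 10:12:00", "Honda", "Civic", 2009, 10000, 200),
    ("2016-01-28 13:14:00", "Honda", "CRV", 2009, 9000, 150),
    ("2016-01-28 15:12:40", "Honda", "Civic", 2009, 9500, 320),
    ("2016-01-28 18:11:40", "Toyota", "Prius", 2011, 9000, 100),
    ("2016-01-28 20:35:40", "Toyota", "Corolla", 2010, 11000, 130),
    ("2016-01-28 22:42:40", "Toyota", "Corolla", 2010, 10000, 200),
    ("2016-01-28 23:12:40", "Honda", "Civic", 2009, 7000, 70),
]

FMT = "%Y-%m-%d %H:%M:%S"


def rollup(events):
    """Group by (day, make, model, year); keep count, sums, and min fee."""
    groups = {}
    for ts, make, model, year, price, fee in events:
        # Truncate the timestamp to day granularity.
        day = datetime.strptime(ts, FMT).replace(hour=0, minute=0, second=0)
        key = (day.strftime(FMT), make, model, year)
        if key not in groups:
            groups[key] = [1, price, fee, fee]
        else:
            agg = groups[key]
            agg[0] += 1                # count
            agg[1] += price            # sum_sale_price
            agg[2] += fee              # sum_sale_fee
            agg[3] = min(agg[3], fee)  # min_sale_fee
    return [key + tuple(agg) for key, agg in groups.items()]


rolled_up = rollup(EVENTS)
```

Seven raw rows become four stored rows here. Queries that only need counts, sums, or minimums at day granularity can then read the smaller table.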
  16. Flexibility
  17. Approximation
    • Ranking
    • Distinct count
    • Distinct count with set operations
    • Histograms and quantiles
  18. Approximation
    • Memory bounded (or slow growing)
    • Error sometimes bounded
  19. Example: HyperLogLog
    • Count distinct
    • Based on bit-independence of hashing
  20. Example: HyperLogLog
    • 50% start with: 1…
    • 25% start with: 01…
    • 12.5% start with: 001…
    • 6.25% start with: 0001…
    • 3.125% start with: 00001…
  21. Example: HyperLogLog
    • 50% start with: 1… ~2 uniques
    • 25% start with: 01… ~4 uniques
    • 12.5% start with: 001… ~8 uniques
    • 6.25% start with: 0001… ~16 uniques
    • 3.125% start with: 00001… ~32 uniques
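
The halving pattern above suggests a very crude estimator: hash every value, remember the position of the leftmost 1 bit seen furthest into any hash, and guess roughly 2 to that power. The sketch below illustrates only that intuition. The names `hash32`, `rank`, and `crude_estimate` are hypothetical, and using SHA-1 as the hash is an assumption made for convenience, not something the talk specifies.

```python
import hashlib

HASH_BITS = 32


def hash32(value):
    """Hash a value to a 32-bit integer (SHA-1 prefix; an assumed choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rank(h, bits=HASH_BITS):
    """1-based position of the leftmost 1 bit: 1… → 1, 01… → 2, 001… → 3."""
    return bits - h.bit_length() + 1


def crude_estimate(values):
    """Guess the distinct count as 2 ** (largest rank seen); 0 if empty."""
    max_rank = max((rank(hash32(v)) for v in values), default=0)
    return 2 ** max_rank if max_rank else 0
```

Duplicates never change the answer, because the same value always hashes to the same bits. The estimate is very noisy, though, since a single unlucky hash can double it. Splitting the hashes into buckets and averaging is what makes the approach usable.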
  22. Example: HyperLogLog
    000001010001010110010101000101011
    k = 6 (64 buckets); look at leading-zero distribution here
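
A minimal HyperLogLog sketch with k = 6 (64 buckets), as on the slide, might look like the following. It is a textbook-style illustration under stated assumptions and not Druid's implementation: the class and function names are hypothetical, SHA-1 is an assumed hash choice, the constant 0.709 is the standard bias correction for 64 buckets, and the small-range branch uses the usual linear-counting fallback.

```python
import hashlib
import math

K = 6                # bits used to pick a bucket
M = 1 << K           # 64 buckets
REST_BITS = 64 - K   # bits left over for the leading-zero count


def hash64(value):
    """Hash a value to a 64-bit integer (SHA-1 prefix; an assumed choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self):
        self.registers = [0] * M

    def add(self, value):
        h = hash64(value)
        bucket = h >> REST_BITS              # first K bits choose the bucket
        rest = h & ((1 << REST_BITS) - 1)    # remaining bits
        # 1-based position of the leftmost 1 bit in the remaining bits.
        r = REST_BITS - rest.bit_length() + 1
        if r > self.registers[bucket]:
            self.registers[bucket] = r

    def estimate(self):
        alpha = 0.709  # bias-correction constant for 64 buckets
        raw = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros:
            # Small-range correction: fall back to linear counting.
            return M * math.log(M / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
approx = hll.estimate()
```

The sketch holds only 64 small registers regardless of how many values are added. With 64 buckets the typical relative error is on the order of 1.04 / sqrt(64), or about 13%, so more buckets are needed for tighter estimates.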
  23. Try this at home
  24. Download
    • Imply distribution: https://imply.io/get-started
    • Druid community site: http://druid.io/
  25. Contribute
    • http://druid.io/community
    • https://github.com/druid-io/druid
  26. Contribute: Druid has recently begun migration to the Apache Incubator. Apache Druid is coming soon!
  27. We’re hiring! https://imply.io/careers