Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Three weird tips for high performance analytics applications (DataEngConf 2018)

Imply
April 18, 2018

Three weird tips for high performance analytics applications (DataEngConf 2018)

On the Druid project, our goal is to build the engine that powers your analytics applications. Apps like interactive dashboards, slice-and-dice tools, and real-time monitoring systems, all need to respond quickly in order to be effective. This talk is about three of the virtues that keep analytic apps running smoothly. (And they're not that weird, we promise!)

In this talk, we'll discuss:

1. The virtue of brevity: it's best known as the soul of wit, but is also your gateway to the performance-improving benefits of compression.
2. The virtue of foresight: thinking ahead and storing rollups of your data saves work at query time.
3. The virtue of flexibility: sketches and approximate algorithms can get you results fast, cheap, and 98% right.

Imply

April 18, 2018
Tweet

More Decks by Imply

Other Decks in Technology

Transcript

  1. Agenda Intro to Druid The virtue of brevity The virtue

    of foresight The virtue of flexibility 3
  2. What is Druid? • “high performance”: low query latency, high

    ingest rates • “column-oriented”: best possible scan rates • “distributed”: deployed in clusters, typically 10s–100s of nodes • “data store”: the cluster stores a copy of your data 5
  3. Powered by Druid “The performance is great ... some of

    the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.” 8 Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html From Yahoo:
  4. Columnar storage 10 Dictionary • 0 → Justin Bieber, 1

    → Ke$ha Data • page → [0 0 0 1 1 1] • language → [0 0 0 0 0 0] Compression • Dictionary • Bit packing (future: PFOR, RLE, …) • LZ4 (optional)
  5. Indexing 11 Query • Justin Bieber OR Ke$ha → [111111]

    Index • Justin Bieber → [0, 1, 2] → [111000] • Ke$ha → [3, 4, 5] → [000111] Compression • Roaring (https://roaringbitmap.org/) • CONCISE
  6. Rollup • Projection of a table • OLAP cube •

    Serious reduction in row count (2x – 100x) 13
  7. Rollup 14 time make model year sale_price sale_fee 2016-01-28 10:12:00

    Honda Civic 2009 10000 200 2016-01-28 13:14:00 Honda CRV 2009 9000 150 2016-01-28 15:12:40 Honda Civic 2009 9500 320 2016-01-28 18:11:40 Toyota Prius 2011 9000 100 2016-01-28 20:35:40 Toyota Corolla 2010 11000 130 2016-01-28 22:42:40 Toyota Corolla 2010 10000 200 2016-01-28 23:12:40 Honda Civic 2009 7000 70
  8. Rollup 15 time make model year sale_price sale_fee 2016-01-28 10:12:00

    Honda Civic 2009 10000 200 2016-01-28 13:14:00 Honda CRV 2009 9000 150 2016-01-28 15:12:40 Honda Civic 2009 9500 320 2016-01-28 18:11:40 Toyota Prius 2011 9000 100 2016-01-28 20:35:40 Toyota Corolla 2010 11000 130 2016-01-28 22:42:40 Toyota Corolla 2010 10000 200 2016-01-28 23:12:40 Honda Civic 2009 7000 70 time make model year count sum_sale_price sum_sale_fee min_sale_fee 2016-01-28 00:00:00 Honda Civic 2009 3 26500 590 70 2016-01-28 00:00:00 Honda CRV 2009 1 9000 150 150 2016-01-28 00:00:00 Toyota Prius 2011 1 9000 100 100 2016-01-28 00:00:00 Toyota Corolla 2010 2 21000 330 130
  9. Approximation • Ranking • Distinct count • Distinct count with

    set operations • Histograms and quantiles 17
  10. Example: HyperLogLog 20 50% start with: 1… 25% start with:

    01… 12.5% start with: 001… 6.25% start with: 0001… 3.125% start with: 00001…
  11. Example: HyperLogLog 21 50% start with: 1… ~2 uniques 25%

    start with: 01… ~4 uniques 12.5% start with: 001… ~8 uniques 6.25% start with: 0001… ~16 uniques 3.125% start with: 00001… ~32 uniques
  12. Contribute 26 Druid has recently begun migration to the Apache

    Incubator. Apache Druid is coming soon!