Slide 1

Slide 1 text

Gian Merlino [email protected] Three weird tips for high performance analytics applications

Slide 2

Slide 2 text

Who am I?
Gian Merlino
Committer & PMC member on
Cofounder & CTO at

Slide 3

Slide 3 text

Agenda
● Intro to Druid
● The virtue of brevity
● The virtue of foresight
● The virtue of flexibility

Slide 4

Slide 4 text

open source, high-performance, column-oriented, distributed data store

Slide 5

Slide 5 text

What is Druid?
● “high performance”: low query latency, high ingest rates
● “column-oriented”: best possible scan rates
● “distributed”: deployed in clusters, typically 10s–100s of nodes
● “data store”: the cluster stores a copy of your data

Slide 6

Slide 6 text

The Problem

Slide 7

Slide 7 text

Powered by Druid
Source: http://druid.io/druid-powered.html
Not an endorsement.

Slide 8

Slide 8 text

Powered by Druid
From Yahoo:
“The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

Slide 9

Slide 9 text

Brevity

Slide 10

Slide 10 text

Columnar storage
Dictionary
● 0 → Justin Bieber, 1 → Ke$ha
Data
● page → [0 0 0 1 1 1]
● language → [0 0 0 0 0 0]
Compression
● Dictionary
● Bit packing (future: PFOR, RLE, …)
● LZ4 (optional)
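The dictionary and bit-packing steps above can be sketched in a few lines of Python. This is an illustration only, not Druid's actual storage format: the helper names (`dictionary_encode`, `bits_needed`, `bit_pack`, `bit_unpack`) are hypothetical, and the packing layout (first row in the highest bits of one integer) is an assumption chosen for readability.

```python
def dictionary_encode(values):
    """Return (dictionary, ids): sorted distinct values and one integer id per row."""
    dictionary = sorted(set(values))
    lookup = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [lookup[value] for value in values]


def bits_needed(cardinality):
    """Smallest bit width that can hold every id in range(cardinality)."""
    return max(1, (cardinality - 1).bit_length())


def bit_pack(ids, width):
    """Pack ids into a single integer, first row in the highest bits (assumed layout)."""
    packed = 0
    for i in ids:
        packed = (packed << width) | i
    return packed


def bit_unpack(packed, width, count):
    """Inverse of bit_pack: recover `count` ids of `width` bits each."""
    mask = (1 << width) - 1
    return [(packed >> (width * (count - 1 - k))) & mask for k in range(count)]


# The "page" column from the slide: three rows of each value.
page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
dictionary, ids = dictionary_encode(page)
width = bits_needed(len(dictionary))
packed = bit_pack(ids, width)
```

With two distinct values, each row needs a single bit, so the six-row column fits in six bits before any general-purpose compression such as LZ4 is applied.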

Slide 11

Slide 11 text

Indexing
Query
● Justin Bieber OR Ke$ha → [111111]
Index
● Justin Bieber → [0, 1, 2] → [111000]
● Ke$ha → [3, 4, 5] → [000111]
Compression
● Roaring (https://roaringbitmap.org/)
● CONCISE
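A minimal sketch of the bitmap index above, using plain Python integers as uncompressed bitmaps. The function names are hypothetical. A real deployment would use a compressed format such as Roaring or CONCISE rather than raw integers.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap with bit `row` set where the value occurs."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def to_bit_string(bitmap, num_rows):
    """Render a bitmap with row 0 on the left, matching the slide's notation."""
    return "".join("1" if (bitmap >> row) & 1 else "0" for row in range(num_rows))


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
index = build_bitmap_index(page)

# "Justin Bieber OR Ke$ha" becomes a bitwise OR of the two bitmaps.
matches = index["Justin Bieber"] | index["Ke$ha"]
matching_rows = [row for row in range(len(page)) if (matches >> row) & 1]
```

The filter is resolved entirely on the bitmaps. Only the rows whose bit is set need to be read from the data columns afterwards.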

Slide 12

Slide 12 text

Foresight

Slide 13

Slide 13 text

Rollup
● Projection of a table
● OLAP cube
● Serious reduction in row count (2x – 100x)

Slide 14

Slide 14 text

Rollup

time                 make    model    year  sale_price  sale_fee
2016-01-28 10:12:00  Honda   Civic    2009  10000       200
2016-01-28 13:14:00  Honda   CRV      2009  9000        150
2016-01-28 15:12:40  Honda   Civic    2009  9500        320
2016-01-28 18:11:40  Toyota  Prius    2011  9000        100
2016-01-28 20:35:40  Toyota  Corolla  2010  11000       130
2016-01-28 22:42:40  Toyota  Corolla  2010  10000       200
2016-01-28 23:12:40  Honda   Civic    2009  7000        70

Slide 15

Slide 15 text

Rollup

time                 make    model    year  sale_price  sale_fee
2016-01-28 10:12:00  Honda   Civic    2009  10000       200
2016-01-28 13:14:00  Honda   CRV      2009  9000        150
2016-01-28 15:12:40  Honda   Civic    2009  9500        320
2016-01-28 18:11:40  Toyota  Prius    2011  9000        100
2016-01-28 20:35:40  Toyota  Corolla  2010  11000       130
2016-01-28 22:42:40  Toyota  Corolla  2010  10000       200
2016-01-28 23:12:40  Honda   Civic    2009  7000        70

time                 make    model    year  count  sum_sale_price  sum_sale_fee  min_sale_fee
2016-01-28 00:00:00  Honda   Civic    2009  3      26500           590           70
2016-01-28 00:00:00  Honda   CRV      2009  1      9000            150           150
2016-01-28 00:00:00  Toyota  Prius    2011  1      9000            100           100
2016-01-28 00:00:00  Toyota  Corolla  2010  2      21000           330           130
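The rollup shown above amounts to truncating each timestamp to the day and grouping on the remaining dimension columns. The following Python sketch reproduces it for the seven example rows. The `rollup` function is hypothetical and the day-level granularity is taken from the example output. In Druid itself, rollup is configured at ingestion time rather than written by hand like this.

```python
from datetime import datetime

# The seven raw rows: (time, make, model, year, sale_price, sale_fee).
raw_rows = [
    ("2016-01-28 10:12:00", "Honda", "Civic", 2009, 10000, 200),
    ("2016-01-28 13:14:00", "Honda", "CRV", 2009, 9000, 150),
    ("2016-01-28 15:12:40", "Honda", "Civic", 2009, 9500, 320),
    ("2016-01-28 18:11:40", "Toyota", "Prius", 2011, 9000, 100),
    ("2016-01-28 20:35:40", "Toyota", "Corolla", 2010, 11000, 130),
    ("2016-01-28 22:42:40", "Toyota", "Corolla", 2010, 10000, 200),
    ("2016-01-28 23:12:40", "Honda", "Civic", 2009, 7000, 70),
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def rollup(rows):
    """Group rows by (day, make, model, year) and pre-aggregate the metrics."""
    rolled = {}
    for ts, make, model, year, price, fee in rows:
        # Truncate the timestamp to day granularity.
        day = datetime.strptime(ts, TIME_FORMAT).replace(hour=0, minute=0, second=0)
        key = (day.strftime(TIME_FORMAT), make, model, year)
        agg = rolled.get(key)
        if agg is None:
            rolled[key] = {
                "count": 1,
                "sum_sale_price": price,
                "sum_sale_fee": fee,
                "min_sale_fee": fee,
            }
        else:
            agg["count"] += 1
            agg["sum_sale_price"] += price
            agg["sum_sale_fee"] += fee
            agg["min_sale_fee"] = min(agg["min_sale_fee"], fee)
    return rolled


rolled = rollup(raw_rows)
```

Seven input rows become four stored rows. The trade-off is that only the chosen aggregates (count, sums, min) can be queried afterwards, since the individual sale rows are no longer kept.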

Slide 16

Slide 16 text

Flexibility

Slide 17

Slide 17 text

Approximation
● Ranking
● Distinct count
● Distinct count with set operations
● Histograms and quantiles

Slide 18

Slide 18 text

Approximation
● Memory bounded (or slow growing)
● Error sometimes bounded

Slide 19

Slide 19 text

Example: HyperLogLog
● Count distinct
● Based on bit-independence of hashing

Slide 20

Slide 20 text

Example: HyperLogLog
50% start with: 1…
25% start with: 01…
12.5% start with: 001…
6.25% start with: 0001…
3.125% start with: 00001…

Slide 21

Slide 21 text

Example: HyperLogLog
50% start with: 1…         ~2 uniques
25% start with: 01…        ~4 uniques
12.5% start with: 001…     ~8 uniques
6.25% start with: 0001…    ~16 uniques
3.125% start with: 00001…  ~32 uniques

Slide 22

Slide 22 text

Example: HyperLogLog
000001010001010110010101000101011
k = 6 (64 buckets)
look at leading-zero distribution here
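The bucketing idea above can be sketched as a small HyperLogLog in Python. This is a textbook-style illustration, not Druid's implementation. The assumptions are: SHA-1 truncated to 64 bits as the hash, k = 6 as on the slide, the standard bias constant 0.709 for 64 buckets, and a linear-counting correction for small cardinalities. The names `hash64` and `HyperLogLog` are hypothetical.

```python
import hashlib
import math

K = 6                # number of leading hash bits used to pick a bucket
M = 1 << K           # 64 buckets
HASH_BITS = 64       # assumed hash width
REST_BITS = HASH_BITS - K


def hash64(value):
    """64-bit hash taken from the first 8 bytes of SHA-1 (an assumption for this sketch)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self):
        # Each register keeps the largest "position of the first 1 bit" seen so far.
        self.registers = [0] * M

    def add(self, value):
        h = hash64(value)
        bucket = h >> REST_BITS                 # first K bits choose the bucket
        rest = h & ((1 << REST_BITS) - 1)       # remaining bits feed the estimate
        # 1-based position of the first 1 bit, i.e. leading zeros + 1.
        rank = REST_BITS - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self):
        alpha = 0.709  # standard bias-correction constant for 64 buckets
        raw = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return M * math.log(M / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add("user-%d" % i)
first_estimate = hll.estimate()

# Re-adding the same values must not change the registers or the estimate.
for i in range(10000):
    hll.add("user-%d" % i)
second_estimate = hll.estimate()
```

With 64 buckets the expected relative error is roughly 1.04 / sqrt(64), about 13%, while the state stays at 64 small registers no matter how many values are added. Using more buckets lowers the error at the cost of more memory.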

Slide 23

Slide 23 text

Try this at home

Slide 24

Slide 24 text

Download
Imply distribution: https://imply.io/get-started
Druid community site: http://druid.io/

Slide 25

Slide 25 text

Contribute
http://druid.io/community
https://github.com/druid-io/druid

Slide 26

Slide 26 text

Contribute
Druid has recently begun migration to the Apache Incubator.
Apache Druid is coming soon!

Slide 27

Slide 27 text

We’re hiring!
https://imply.io/careers