Three weird tips for high performance analytics applications (DataEngConf 2018)

Gian Merlino

Who am I?
Gian Merlino
• Committer & PMC member on Druid
• Cofounder & CTO at Imply

Agenda
• Intro to Druid
• The virtue of brevity
• The virtue of foresight
• The virtue of flexibility

open source, high-performance, column-oriented, distributed data store

What is Druid?
• “high performance”: low query latency, high ingest rates
• “column-oriented”: best possible scan rates
• “distributed”: deployed in clusters, typically 10s–100s of nodes
• “data store”: the cluster stores a copy of your data

The Problem

Powered by Druid
Source: http://druid.io/druid-powered.html
Not an endorsement.

Powered by Druid
From Yahoo:
“The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

Brevity

Columnar storage
Dictionary
• 0 → Justin Bieber, 1 → Ke$ha
Data
• page → [0 0 0 1 1 1]
• language → [0 0 0 0 0 0]
Compression
• Dictionary
• Bit packing (future: PFOR, RLE, …)
• LZ4 (optional)
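The dictionary and bit-packing steps on this slide can be sketched in a few lines of Python. This is only an illustration, not Druid's storage code: the function names `dictionary_encode`, `bit_pack`, and `bit_unpack` are hypothetical, the sorted dictionary is an assumption of the sketch, and a single Python integer stands in for a packed byte buffer.

```python
def dictionary_encode(values):
    """Return (dictionary, ids): the distinct values and one integer ID per row."""
    dictionary = sorted(set(values))  # assumption: IDs follow sorted order
    lookup = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [lookup[value] for value in values]


def bit_pack(ids, cardinality):
    """Pack IDs into one integer, using the fewest bits that fit every ID."""
    width = max(1, (cardinality - 1).bit_length())
    packed = 0
    for position, value_id in enumerate(ids):
        packed |= value_id << (position * width)
    return packed, width


def bit_unpack(packed, width, count):
    """Recover the list of IDs from a packed integer."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
dictionary, ids = dictionary_encode(page)
packed, width = bit_pack(ids, len(dictionary))

print(dictionary)                         # ['Justin Bieber', 'Ke$ha']
print(ids)                                # [0, 0, 0, 1, 1, 1]
print(width)                              # 1 bit per row for a 2-value column
print(bit_unpack(packed, width, len(ids)))  # [0, 0, 0, 1, 1, 1]
```

With two distinct values, each row needs a single bit, so the six-row page column fits in one byte before any general-purpose compression such as LZ4 is applied.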

Indexing
Query
• Justin Bieber OR Ke$ha → [111111]
Index
• Justin Bieber → [0, 1, 2] → [111000]
• Ke$ha → [3, 4, 5] → [000111]
Compression
• Roaring (https://roaringbitmap.org/)
• CONCISE
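A rough sketch of the bitmap index on this slide, using plain Python integers as uncompressed bitmaps. It is an assumption-laden illustration rather than Druid's index code: `build_bitmap_index` and `render` are hypothetical names, and real deployments would use a compressed format such as Roaring or CONCISE instead of raw integers.

```python
def build_bitmap_index(column):
    """Map each distinct value to an integer whose bit i is set when row i matches."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def render(bitmap, row_count):
    """Render a bitmap as a string with row 0 on the left, as on the slide."""
    return "".join("1" if (bitmap >> row) & 1 else "0" for row in range(row_count))


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
index = build_bitmap_index(page)

# A filter "Justin Bieber OR Ke$ha" becomes a bitwise OR of two bitmaps.
matches = index["Justin Bieber"] | index["Ke$ha"]

print(render(index["Justin Bieber"], len(page)))  # 111000
print(render(index["Ke$ha"], len(page)))          # 000111
print(render(matches, len(page)))                 # 111111
```

The point of the design is that boolean filters reduce to bitwise operations over bitmaps, so the rows to scan are known before any column data is read.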

Foresight

Rollup
• Projection of a table
• OLAP cube
• Serious reduction in row count (2x – 100x)

Rollup

time                 make    model    year  sale_price  sale_fee
2016-01-28 10:12:00  Honda   Civic    2009  10000       200
2016-01-28 13:14:00  Honda   CRV      2009  9000        150
2016-01-28 15:12:40  Honda   Civic    2009  9500        320
2016-01-28 18:11:40  Toyota  Prius    2011  9000        100
2016-01-28 20:35:40  Toyota  Corolla  2010  11000       130
2016-01-28 22:42:40  Toyota  Corolla  2010  10000       200
2016-01-28 23:12:40  Honda   Civic    2009  7000        70

Rollup

The same seven rows, rolled up to one row per day, make, model, and year:

time                 make    model    year  count  sum_sale_price  sum_sale_fee  min_sale_fee
2016-01-28 00:00:00  Honda   Civic    2009  3      26500           590           70
2016-01-28 00:00:00  Honda   CRV      2009  1      9000            150           150
2016-01-28 00:00:00  Toyota  Prius    2011  1      9000            100           100
2016-01-28 00:00:00  Toyota  Corolla  2010  2      21000           330           130
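The rollup on this slide can be reproduced with a short group-by. The sketch below is illustrative only: `rollup` is a hypothetical helper, not a Druid API, and it assumes day granularity with `count`, `sum`, and `min` aggregators, which is what the slide's output columns suggest.

```python
from datetime import datetime

RAW = [
    ("2016-01-28 10:12:00", "Honda", "Civic", 2009, 10000, 200),
    ("2016-01-28 13:14:00", "Honda", "CRV", 2009, 9000, 150),
    ("2016-01-28 15:12:40", "Honda", "Civic", 2009, 9500, 320),
    ("2016-01-28 18:11:40", "Toyota", "Prius", 2011, 9000, 100),
    ("2016-01-28 20:35:40", "Toyota", "Corolla", 2010, 11000, 130),
    ("2016-01-28 22:42:40", "Toyota", "Corolla", 2010, 10000, 200),
    ("2016-01-28 23:12:40", "Honda", "Civic", 2009, 7000, 70),
]
FORMAT = "%Y-%m-%d %H:%M:%S"


def rollup(rows):
    """Truncate time to the day, group by the dimensions, and pre-aggregate."""
    groups = {}
    for ts, make, model, year, price, fee in rows:
        day = datetime.strptime(ts, FORMAT).replace(hour=0, minute=0, second=0)
        key = (day.strftime(FORMAT), make, model, year)
        agg = groups.setdefault(
            key,
            {"count": 0, "sum_sale_price": 0, "sum_sale_fee": 0, "min_sale_fee": fee},
        )
        agg["count"] += 1
        agg["sum_sale_price"] += price
        agg["sum_sale_fee"] += fee
        agg["min_sale_fee"] = min(agg["min_sale_fee"], fee)
    return groups


rolled = rollup(RAW)
for key, agg in rolled.items():
    print(*key, *agg.values())
```

Seven input rows become four stored rows here. The trade-off is that individual sale times and prices are no longer recoverable, so the aggregators have to be chosen before ingestion, which is where the foresight comes in.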

Flexibility

Approximation
• Ranking
• Distinct count
• Distinct count with set operations
• Histograms and quantiles

Approximation
• Memory bounded (or slow growing)
• Error sometimes bounded

Example: HyperLogLog
• Count distinct
• Based on bit-independence of hashing

Example: HyperLogLog
• 50% start with: 1…
• 25% start with: 01…
• 12.5% start with: 001…
• 6.25% start with: 0001…
• 3.125% start with: 00001…

Example: HyperLogLog
• 50% start with: 1… ~2 uniques
• 25% start with: 01… ~4 uniques
• 12.5% start with: 001… ~8 uniques
• 6.25% start with: 0001… ~16 uniques
• 3.125% start with: 00001… ~32 uniques
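The percentages on this slide can be checked empirically. The sketch below hashes 100,000 distinct integers and measures how often each leading-zero prefix appears. It assumes SHA-256 truncated to 64 bits as the hash, which is a choice made for this illustration and not something the talk specifies; `hash64` and `leading_zeros` are hypothetical helper names.

```python
import hashlib


def hash64(value):
    """Hash any value to a 64-bit integer (SHA-256 truncated; an assumption)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leading_zeros(x, bits=64):
    """Number of zero bits before the first 1 bit in a fixed-width integer."""
    return bits - x.bit_length()


N = 100_000
counts = {}
for i in range(N):
    zeros = leading_zeros(hash64(i))
    counts[zeros] = counts.get(zeros, 0) + 1

# Fraction of hashes with exactly z leading zeros, for the prefixes on the slide.
fractions = {z: counts.get(z, 0) / N for z in range(5)}
# Seeing a hash with z leading zeros suggests roughly 2^(z+1) distinct values.
implied_uniques = {z: 2 ** (z + 1) for z in range(5)}

for z in range(5):
    prefix = "0" * z + "1"
    print(f"{prefix}…  {fractions[z]:.3%}  ~{implied_uniques[z]} uniques")
```

Each additional leading zero halves the fraction of hashes that qualify, so observing a rare prefix is evidence that many distinct values have been hashed.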

Example: HyperLogLog
000001010001010110010101000101011
• k = 6 (64 buckets)
• look at leading-zero distribution here
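Putting the pieces together, a minimal HyperLogLog with k = 6 might look like the sketch below. It follows the standard published formulation (harmonic mean of bucket registers, bias constant 0.709 for 64 buckets, linear counting for small ranges) and is not Druid's implementation. The hash choice (SHA-256 truncated to 64 bits), the use of the first k bits as the bucket index, and the names `hash64` and `hll_estimate` are all assumptions of this example.

```python
import hashlib
import math

K = 6           # bits used to pick a bucket
M = 1 << K      # 64 buckets
ALPHA = 0.709   # bias-correction constant for 64 buckets


def hash64(value):
    """Hash any value to a 64-bit integer (SHA-256 truncated; an assumption)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_estimate(values):
    """Estimate the number of distinct values using 64 small registers."""
    registers = [0] * M
    rest_bits = 64 - K
    for value in values:
        h = hash64(value)
        bucket = h >> rest_bits                   # first K bits choose the bucket
        rest = h & ((1 << rest_bits) - 1)         # remaining 58 bits
        rank = rest_bits - rest.bit_length() + 1  # leading zeros in rest, plus 1
        registers[bucket] = max(registers[bucket], rank)

    raw = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    empty = registers.count(0)
    if raw <= 2.5 * M and empty:
        # Small-range correction: fall back to linear counting.
        return M * math.log(M / empty)
    return raw


distinct = list(range(100_000))
estimate = hll_estimate(distinct)
print(round(estimate))
# Duplicates do not move the estimate: each register only keeps a maximum.
print(hll_estimate(distinct + distinct) == estimate)  # True
```

With 64 buckets the expected standard error is about 1.04 / sqrt(64), or roughly 13%, while the state is only 64 small counters no matter how many values are seen. More buckets trade memory for accuracy.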

Try this at home

Download
• Imply distribution: https://imply.io/get-started
• Druid community site: http://druid.io/

Contribute
• http://druid.io/community
• https://github.com/druid-io/druid

Contribute
Druid has recently begun migration to the Apache Incubator. Apache Druid is coming soon!

We’re hiring!
https://imply.io/careers
