Setting the stage for fast analytics with Druid

Surekha Saharan, Benjamin Hopp

Agenda
• Introduction to Apache Druid
• Data modeling optimizations
• Query optimizations

The Problem
• Data exploration
• Data ingestion
• Data availability

Where Druid fits in
[Diagram labels: Data lakes, Message buses, Raw data, Storage, Analyze, Application]

What is Druid?
• open source
• distributed
• column-oriented data store
• fast aggregation
• slice-n-dice
• fault tolerant
• event-driven data

What is Druid?
Search platform
• Real-time ingestion
• Flexible schema
• Full text search
OLAP
• Batch ingestion
• Efficient storage
• Fast analytic queries
Timeseries database
• Optimized for time-based datasets
• Time-based functions
Druid combines ideas to power a new type of analytics application.

Powered by Druid
Source: http://druid.io/druid-powered.html
+ many more!

“Bad programmers worry about the code. Good programmers worry about
data structures and their relationships.” ― Linus Torvalds

How data is structured
• Druid stores data in immutable segments
• Column-oriented compressed format
• Dictionary-encoded at column level
• Secondary indexes on individual columns
• Bitmap index compression: Concise and Roaring
• Rollup (partial aggregation)

Druid’s logical data model
• Timestamp
• Dimensions
• Metrics

Druid Segments
timestamp             page           city  added  deleted
Segment 2011-01-01T00/2011-01-01T01
2011-01-01T00:01:35Z  Justin Bieber  SF    10     5
2011-01-01T00:03:45Z  Justin Bieber  LA    25     37
2011-01-01T00:05:62Z  Justin Bieber  SF    15     19
Segment 2011-01-01T01/2011-01-01T02
2011-01-01T01:06:33Z  Ke$ha          LA    30     45
2011-01-01T01:08:51Z  Ke$ha          LA    16     8
2011-01-01T01:09:17Z  Miley Cyrus    DC    75     10
Segment 2011-01-01T02/2011-01-01T03
2011-01-01T02:23:30Z  Miley Cyrus    DC    22     12
2011-01-01T02:49:33Z  Miley Cyrus    DC    90     41
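
As a rough illustration of time-based segmenting, the sketch below buckets a few of the rows above into hourly intervals. It is plain Python, not Druid code; the helper name `segment_interval` and the hourly granularity are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def segment_interval(ts: str) -> str:
    """Return the hourly 'start/end' interval that contains an ISO timestamp."""
    t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    start = t.replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    fmt = "%Y-%m-%dT%H"
    return f"{start.strftime(fmt)}/{end.strftime(fmt)}"

# A subset of the rows from the slide: (timestamp, page, city, added, deleted)
rows = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "SF", 10, 5),
    ("2011-01-01T01:06:33Z", "Ke$ha", "LA", 30, 45),
    ("2011-01-01T01:08:51Z", "Ke$ha", "LA", 16, 8),
    ("2011-01-01T02:23:30Z", "Miley Cyrus", "DC", 22, 12),
]

# Group rows by the interval of the segment they would fall into.
segments = defaultdict(list)
for row in rows:
    segments[segment_interval(row[0])].append(row)

for interval in sorted(segments):
    print(interval, len(segments[interval]))
```

Each key has the same `start/end` shape as the segment labels on the slide.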

Anatomy of a Druid Segment
[Figure labels: Dict encoded (sorted), Bitmap index (stored compressed)]

Filter Query Path
timestamp             page
2011-01-01T00:01:35Z  Justin Bieber
2011-01-01T00:03:45Z  Justin Bieber
2011-01-01T00:05:62Z  Justin Bieber
2011-01-01T00:06:33Z  Ke$ha
2011-01-01T00:08:51Z  Ke$ha
Justin Bieber  [1 1 1 0 0]
Ke$ha          [0 0 0 1 1]
JB or KS       [1 1 1 1 1]
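
The filter path above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the bitmaps are plain uncompressed lists rather than Concise or Roaring bitmaps, and the function names are made up for the example.

```python
def build_bitmap_index(values):
    """Dictionary-encode a column and build one bitmap per distinct value."""
    dictionary = sorted(set(values))                 # sorted dictionary of distinct values
    ids = {v: i for i, v in enumerate(dictionary)}   # value -> integer id
    encoded = [ids[v] for v in values]               # the column stored as ids
    bitmaps = {v: [1 if x == v else 0 for x in values] for v in dictionary}
    return dictionary, encoded, bitmaps

def bitmap_or(a, b):
    """Combine two bitmaps with a bitwise OR (rows matching either value)."""
    return [x | y for x, y in zip(a, b)]

page = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
dictionary, encoded, bitmaps = build_bitmap_index(page)

# Rows where page is 'Justin Bieber' OR 'Ke$ha'
matches = bitmap_or(bitmaps["Justin Bieber"], bitmaps["Ke$ha"])
print(encoded)   # [0, 0, 0, 1, 1]
print(matches)   # [1, 1, 1, 1, 1]
```

The filter is answered from the dictionary and bitmaps alone; the row values are only read afterwards, for the rows whose bit is set.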

Get your data model right
• Prepare your data before ingestion
• Optimize segment size
• Partition data right
• Choose column types and indexes
• To join or not
• To roll-up or not
• Approx algorithms

Optimize segment size
Ideally 300-700 MB (about 5 million rows)
To control segment size:
• Alter segment granularity
• Specify partition spec
• Use automatic compaction

Partitioning beyond time
• Druid always partitions by time
• Decide which dimension to partition on… next
• Partition by some dimension you often filter on
• Improves locality, compression, storage size, query performance

Modeling data for fast search
Exact match or prefix filtering
◦ Uses binary search
◦ Only dictionary + index section of dimension is needed
◦ Example: if you frequently search on the last 4 digits of an SSN such as 123-45-6789, store it backwards so the search becomes a prefix match.
Prefix match:
select count(*) from wikiticker where "comment" like 'A%'
Substring match, for comparison:
select count(*) from wikiticker where "comment" like '%A%'
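
To show why prefix filters suit a sorted dictionary, here is a minimal Python sketch of a prefix lookup with binary search, followed by the reversed-SSN trick. It is not Druid's implementation; `prefix_matches` is a hypothetical helper, and every SSN other than 123-45-6789 is made up for the example.

```python
from bisect import bisect_left

def prefix_matches(sorted_dict, prefix):
    """Return the entries of a sorted dictionary that start with prefix.

    Binary search finds the first candidate; matching entries are
    contiguous in sorted order, so we scan forward until one fails.
    """
    lo = bisect_left(sorted_dict, prefix)
    hi = lo
    while hi < len(sorted_dict) and sorted_dict[hi].startswith(prefix):
        hi += 1
    return sorted_dict[lo:hi]

# Store SSNs reversed so a "last 4 digits" search becomes a prefix search.
ssns = ["123-45-6789", "987-65-6789", "555-12-0000"]   # illustrative values
reversed_dict = sorted(s[::-1] for s in ssns)

last4 = "6789"
hits = [s[::-1] for s in prefix_matches(reversed_dict, last4[::-1])]
print(sorted(hits))   # ['123-45-6789', '987-65-6789']
```

A substring filter such as `'%A%'` gets no such shortcut, because matching entries can sit anywhere in the sorted dictionary.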

Joins/Lookups
• Alternative to reloading segments for slowly changing dimensions
• Allows enrichment of data
• When possible, use at ingestion

Rollup
• Pre-aggregation at ingestion time
• Saves space, better compression
• Query performance boost

Rollup
Raw data:
timestamp             page           city  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  SF    10     5
2011-01-01T00:03:45Z  Justin Bieber  SF    25     37
2011-01-01T00:05:62Z  Justin Bieber  SF    15     19
2011-01-01T00:06:33Z  Ke$ha          LA    30     45
2011-01-01T00:08:51Z  Ke$ha          LA    16     8
2011-01-01T00:09:17Z  Miley Cyrus    DC    75     10
2011-01-01T00:11:25Z  Miley Cyrus    DC    11     25
2011-01-01T00:23:30Z  Miley Cyrus    DC    22     12
2011-01-01T00:49:33Z  Miley Cyrus    DC    90     41
Rolled up:
timestamp             page           city  count  sum_added  sum_deleted
2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61
2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53
2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88
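
The rollup shown above can be reproduced with a short Python sketch. It assumes hourly truncation of the timestamp (the slide does not state the granularity) and truncates by string slicing, so the timestamps are used exactly as they appear on the slide. The function name `rollup` is illustrative only.

```python
from collections import defaultdict

# Raw events copied from the slide: (timestamp, page, city, added, deleted)
raw = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "SF", 10, 5),
    ("2011-01-01T00:03:45Z", "Justin Bieber", "SF", 25, 37),
    ("2011-01-01T00:05:62Z", "Justin Bieber", "SF", 15, 19),
    ("2011-01-01T00:06:33Z", "Ke$ha", "LA", 30, 45),
    ("2011-01-01T00:08:51Z", "Ke$ha", "LA", 16, 8),
    ("2011-01-01T00:09:17Z", "Miley Cyrus", "DC", 75, 10),
    ("2011-01-01T00:11:25Z", "Miley Cyrus", "DC", 11, 25),
    ("2011-01-01T00:23:30Z", "Miley Cyrus", "DC", 22, 12),
    ("2011-01-01T00:49:33Z", "Miley Cyrus", "DC", 90, 41),
]

def rollup(rows):
    """Group by (truncated timestamp, page, city); keep count and sums."""
    out = defaultdict(lambda: [0, 0, 0])          # count, sum_added, sum_deleted
    for ts, page, city, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"                # assumed hourly granularity
        agg = out[(hour, page, city)]
        agg[0] += 1
        agg[1] += added
        agg[2] += deleted
    return dict(out)

for key, (count, sum_added, sum_deleted) in rollup(raw).items():
    print(*key, count, sum_added, sum_deleted)
```

Nine raw rows collapse into three stored rows. The trade-off is that the individual events can no longer be retrieved.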

Roll-up vs no roll-up
Do roll-up
• Working with space constraints.
• No need to retain high cardinality dimensions (like user id, precise location information).
• Maximize price / performance.
Don’t roll-up
• Need the ability to retrieve individual events.
• May need to group or filter on any column.

Choose column types carefully
String columns and numeric columns are compared on three properties:
• Indexed
• Fast aggregation
• Fast grouping

Approx Algorithms
• Data sketches are lossy data structures
• Trade off accuracy for reduced storage and improved performance
• Summarize data at ingestion time using sketches
• Improve roll-up, reduce memory footprint

Summarize with data sketches
Raw data:
timestamp             page           userid  city  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  user11  SF    10     5
2011-01-01T00:03:45Z  Justin Bieber  user22  SF    25     37
2011-01-01T00:05:62Z  Justin Bieber  user11  SF    15     19
2011-01-01T00:06:33Z  Ke$ha          user33  LA    30     45
2011-01-01T00:08:51Z  Ke$ha          user33  LA    16     8
2011-01-01T00:09:17Z  Miley Cyrus    user11  DC    75     10
2011-01-01T00:11:25Z  Miley Cyrus    user44  DC    11     25
2011-01-01T00:23:30Z  Miley Cyrus    user44  DC    22     12
2011-01-01T00:49:33Z  Miley Cyrus    user55  DC    90     41
Rolled up with a sketch column:
timestamp             page           city  count  sum_added  sum_deleted  userid_sketch
2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61           sketch_obj
2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53           sketch_obj
2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88           sketch_obj

The highlights
• Return only the data that you immediately need (smaller, more frequent queries rather than all data at once)
• Timestamps are magical in Druid - harness their power
• Use approximate algorithms whenever you can
• SQL makes Druid easier
• When you can’t use SQL, be sure to choose the best query type

The magical 4th dimension
• Always filter by timestamp - this is pushed down and limits the segments that get scanned
• Aggregate to the interval you need within Druid
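
A minimal sketch of why a timestamp filter limits the segments scanned: only segments whose interval overlaps the query interval need to be read. This is a conceptual model in plain Python, not Druid's planner. The half-open interval handling and the helper names are assumptions for the example.

```python
from datetime import datetime

def parse_hour(ts: str) -> datetime:
    """Parse an hour-resolution timestamp such as '2011-01-01T01'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H")

def overlapping_segments(segments, query_start, query_end):
    """Return the 'start/end' segments that overlap [query_start, query_end)."""
    qs, qe = parse_hour(query_start), parse_hour(query_end)
    hits = []
    for seg in segments:
        s, e = (parse_hour(part) for part in seg.split("/"))
        if s < qe and qs < e:          # half-open intervals overlap
            hits.append(seg)
    return hits

segments = [
    "2011-01-01T00/2011-01-01T01",
    "2011-01-01T01/2011-01-01T02",
    "2011-01-01T02/2011-01-01T03",
]

# A query restricted to the 01:00 hour touches one segment out of three.
print(overlapping_segments(segments, "2011-01-01T01", "2011-01-01T02"))
```

Without a time filter, every segment of the datasource is a candidate for scanning.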

Use Druid SQL
• Easier to learn/more familiar
• Will attempt to make intelligent query type choices (timeseries vs topN vs groupBy)
• There are some limitations, such as multi-value dimensions, and not all aggregations are supported

Native JSON

When close enough is good enough
Approximate queries can provide up to 99% accuracy while greatly improving performance
• Bloom Filters
  ◦ Self Joins
• Theta Sketches
  ◦ Union/Intersection/Difference
• HLL Sketches
  ◦ Count Distinct
• Quantile Sketches
  ◦ Median, percentiles

When close enough is good enough
• Hashes can be calculated at query time or ingestion
  ◦ Pre-computed hashes can save up to 50% query time
• K value determines precision and performance
• Default values will count within 5% accuracy 99% of the time.
• HLL and Theta sketch can both provide COUNT DISTINCT support, but HLL will do it faster and more accurately with a smaller data footprint.
• Theta Sketches are more flexible, but require more storage.
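
To make the role of K concrete, here is a toy K-minimum-values (KMV) estimator in Python. It is a simplified relative of theta sketches, swapped in for illustration; it is not the DataSketches implementation that Druid uses, and the class and method names are hypothetical. A larger `k` keeps more hashes, which costs memory but tightens the estimate.

```python
import hashlib

def hash01(value: str) -> float:
    """Map a value to a roughly uniform float in [0, 1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class KMVSketch:
    """Keeps only the k smallest hash values seen so far."""

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()

    def add(self, value: str) -> None:
        self.hashes.add(hash01(value))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))   # drop the largest hash

    def union(self, other: "KMVSketch") -> "KMVSketch":
        """Merge two sketches, e.g. sketches stored on different rolled-up rows."""
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))         # fewer than k distinct hashes: exact
        return (self.k - 1) / max(self.hashes)     # standard KMV estimator

# One sketch per rolled-up row, using the userids from the slide.
bieber, kesha, miley = KMVSketch(), KMVSketch(), KMVSketch()
for u in ["user11", "user22", "user11"]:
    bieber.add(u)
for u in ["user33", "user33"]:
    kesha.add(u)
for u in ["user11", "user44", "user44", "user55"]:
    miley.add(u)

# Distinct users across all pages, computed from the sketches alone.
print(bieber.union(kesha).union(miley).estimate())   # 5.0
```

Because sketches merge, the distinct count survives rollup: the raw `userid` column can be dropped at ingestion while the query still works across rows.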

Pick your query carefully
• Timeseries - When you don’t want to group by dimension
• TopN - When you want to group by a single dimension
• GroupBy - Least performant/most flexible
• Scan - For returning streaming raw data
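
As a hedged illustration of how the query types differ, the Python sketch below builds native JSON bodies for a timeseries query and a topN query. The field names follow Druid's native query format as commonly documented, but check them against the documentation for your Druid version. The `wikiticker` datasource and the `added` and `page` columns are taken from the earlier slides; the helper functions are hypothetical.

```python
import json

# Shared aggregation: sum the 'added' column.
SUM_ADDED = {"type": "longSum", "name": "added", "fieldName": "added"}

def timeseries_query(datasource, interval, granularity="hour"):
    """No grouping dimension: one result row per time bucket."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": granularity,
        "aggregations": [SUM_ADDED],
    }

def topn_query(datasource, interval, dimension, threshold=10):
    """Group by a single dimension, ranked by a metric."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "dimension": dimension,
        "metric": "added",
        "threshold": threshold,
        "aggregations": [SUM_ADDED],
    }

interval = "2011-01-01T00/2011-01-01T03"
print(json.dumps(timeseries_query("wikiticker", interval), indent=2))
print(json.dumps(topn_query("wikiticker", interval, "page", threshold=3), indent=2))
```

Both bodies carry an explicit `intervals` entry, in line with the advice to always filter by timestamp.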

Download
Apache Druid community site (new): https://druid.apache.org/
Apache Druid community site (legacy): http://druid.io/
Imply distribution: https://imply.io/get-started

Contribute
https://github.com/apache/druid

Stay in touch
Join the community! http://druid.apache.org/
Follow the Druid project on Twitter! @druidio
