Setting the stage for fast analytics with Druid

Imply
May 22, 2019

Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets. This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit. Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics. Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack.

The most important contributor to a fast analytical setup is getting the data model right. The talk centers on the choices you can make when preparing your data to get the best possible query performance.

Transcript

  1. Setting the stage for fast analytics with Druid
    Surekha Saharan, surekha.saharan@imply.io
    Benjamin Hopp, benjamin.hopp@imply.io
  2. Agenda
    • Introduction to Apache Druid
    • Data modeling optimizations
    • Query optimizations
  3. The Problem
    Data Exploration, Data Ingestion, Data Availability
  4. Where Druid fits in
    [Diagram labels: Data lakes, Message buses, Raw data, Storage, Analyze, Application]
  5. What is Druid?
    open source, distributed, column-oriented data store, fast aggregation, slice-n-dice, fault tolerant, event-driven data
  6. What is Druid?
    Search platform
    • Real-time ingestion
    • Flexible schema
    • Full text search
    OLAP
    • Batch ingestion
    • Efficient storage
    • Fast analytic queries
    Timeseries database
    • Optimized for time-based datasets
    • Time-based functions
    Druid combines ideas to power a new type of analytics application.
  7. Powered by Druid
    Source: http://druid.io/druid-powered.html + many more!
  8. “Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” ― Linus Torvalds
  9. How data is structured
    • Druid stores data in immutable segments
    • Column-oriented compressed format
    • Dictionary-encoded at column level
    • Secondary indexes on individual columns
    • Bitmap index compression: Concise and Roaring
    • Rollup (partial aggregation)
  10. Druid’s logical data model
    Timestamp, Dimensions, Metrics
  11. Druid Segments
    timestamp             page           city  added  deleted
    Segment 2011-01-01T00/2011-01-01T01
    2011-01-01T00:01:35Z  Justin Bieber  SF    10     5
    2011-01-01T00:03:45Z  Justin Bieber  LA    25     37
    2011-01-01T00:05:62Z  Justin Bieber  SF    15     19
    Segment 2011-01-01T01/2011-01-01T02
    2011-01-01T01:06:33Z  Ke$ha          LA    30     45
    2011-01-01T01:08:51Z  Ke$ha          LA    16     8
    2011-01-01T01:09:17Z  Miley Cyrus    DC    75     10
    Segment 2011-01-01T02/2011-01-01T03
    2011-01-01T02:23:30Z  Miley Cyrus    DC    22     12
    2011-01-01T02:49:33Z  Miley Cyrus    DC    90     41
  12. Anatomy of Druid Segment
    [Figure labels: Dict encoded (sorted); Bitmap index (stored compressed)]
  13. Filter Query Path
    timestamp             page
    2011-01-01T00:01:35Z  Justin Bieber
    2011-01-01T00:03:45Z  Justin Bieber
    2011-01-01T00:05:62Z  Justin Bieber
    2011-01-01T00:06:33Z  Ke$ha
    2011-01-01T00:08:51Z  Ke$ha
    Justin Bieber  [1 1 1 0 0]
    Ke$ha          [0 0 0 1 1]
    JB or KS       [1 1 1 1 1]
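The filter path above can be sketched in a few lines of Python. This is only an illustration of the idea, not Druid's implementation: plain Python integers stand in for the compressed bitmaps, and the function names `build_index` and `matching_rows` are made up for this example.

```python
# Sketch of a dictionary-encoded column with one bitmap per distinct value.
# Python ints act as uncompressed bitmaps; Druid stores them compressed.

def build_index(values):
    """Return (sorted dictionary, encoded column, one bitmap per dictionary id)."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    encoded = [ids[v] for v in values]
    bitmaps = [0] * len(dictionary)
    for row, value_id in enumerate(encoded):
        bitmaps[value_id] |= 1 << row  # set the bit for this row
    return dictionary, encoded, bitmaps


def matching_rows(bitmap, num_rows):
    """Expand a bitmap into the list of row numbers whose bit is set."""
    return [row for row in range(num_rows) if (bitmap >> row) & 1]


pages = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
dictionary, encoded, bitmaps = build_index(pages)

jb = bitmaps[dictionary.index("Justin Bieber")]
ks = bitmaps[dictionary.index("Ke$ha")]
either = jb | ks  # an OR filter becomes a bitwise OR of the two bitmaps

print(matching_rows(jb, len(pages)))
print(matching_rows(ks, len(pages)))
print(matching_rows(either, len(pages)))
```

Only the rows selected by the combined bitmap need to be read from the other columns.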
  14. Agenda
    • Introduction to Apache Druid
    • Data modeling optimizations
    • Query optimizations
  15. Get your data model right
    • Prepare your data before ingestion
    • Optimize segment size
    • Partition data right
    • Choose column types and indexes
    • To join or not
    • To roll-up or not
    • Approx algorithms
  16. Optimize segment size
    Ideally 300-700 MB (~5 million rows)
    To control segment size
    • Alter segment granularity
    • Specify partition spec
    • Use Automatic Compaction
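As a rough back-of-the-envelope aid, the row target above can be turned into a partition count per time chunk. The helper name `shards_needed` and the sample row counts are hypothetical; only the 5 million row figure comes from the slide.

```python
import math

TARGET_ROWS_PER_SEGMENT = 5_000_000  # rule-of-thumb figure from the slide


def shards_needed(rows_in_time_chunk, target_rows=TARGET_ROWS_PER_SEGMENT):
    """Estimate how many segments one time chunk should be split into."""
    return max(1, math.ceil(rows_in_time_chunk / target_rows))


# Hypothetical row counts for one segment-granularity interval.
print(shards_needed(1_200_000))   # well under the target: one segment
print(shards_needed(42_000_000))  # over the target: several partitions
```

If most time chunks come out far below the target, a coarser segment granularity may be the better lever; if they come out far above it, a partition spec or compaction can split them.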
  17. Partitioning beyond time
    • Druid always partitions by time
    • Decide which dimension to partition on… next
    • Partition by some dimension you often filter on
    • Improves locality, compression, storage size, query performance
  18. Modeling data for fast search
    Exact match or prefix filtering
    ◦ Uses binary search
    ◦ Only the dictionary + index section of the dimension is needed
    ◦ Example: if you frequently search on the last 4 digits, store the SSN backwards (123-45-6789 becomes 9876-54-321)
    Prefix filter: select count(*) from wikiticker where "comment" like 'A%'
    Contains filter: select count(*) from wikiticker where "comment" like '%A%'
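The reversed-SSN trick can be sketched as follows: once values are stored reversed in a sorted dictionary, a "last 4 digits" search becomes a prefix search that binary search can answer. The sample SSNs and the helper `prefix_range` are invented for illustration, and this is a model of the idea rather than Druid's actual lookup code.

```python
from bisect import bisect_left


def prefix_range(sorted_dictionary, prefix):
    """Return (start, end) of entries starting with prefix, using binary search."""
    if not prefix:
        return 0, len(sorted_dictionary)
    start = bisect_left(sorted_dictionary, prefix)
    # Smallest string that sorts after every string beginning with prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(sorted_dictionary, upper)
    return start, end


# Hypothetical values; each is stored reversed so the last 4 digits come first.
ssns = ["123-45-6789", "987-65-4321", "555-12-6789", "111-22-3333"]
reversed_dictionary = sorted(s[::-1] for s in ssns)

last_four = "6789"
start, end = prefix_range(reversed_dictionary, last_four[::-1])
matches = [s[::-1] for s in reversed_dictionary[start:end]]
print(matches)
```

A suffix or "contains" pattern on the unreversed values has no such contiguous range in the sorted dictionary, which is why it cannot take the same shortcut.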
  19. Joins/Lookups
    • Alternative to reloading segments for slowly changing dimensions
    • Allows enrichment of data
    • When possible, use at ingestion
  20. Rollup
    • Pre-aggregation at ingestion time
    • Saves space, better compression
    • Query performance boost
  21. Rollup
    Raw events:
    timestamp             page           city  added  deleted
    2011-01-01T00:01:35Z  Justin Bieber  SF    10     5
    2011-01-01T00:03:45Z  Justin Bieber  SF    25     37
    2011-01-01T00:05:62Z  Justin Bieber  SF    15     19
    2011-01-01T00:06:33Z  Ke$ha          LA    30     45
    2011-01-01T00:08:51Z  Ke$ha          LA    16     8
    2011-01-01T00:09:17Z  Miley Cyrus    DC    75     10
    2011-01-01T00:11:25Z  Miley Cyrus    DC    11     25
    2011-01-01T00:23:30Z  Miley Cyrus    DC    22     12
    2011-01-01T00:49:33Z  Miley Cyrus    DC    90     41
    After rollup:
    timestamp             page           city  count  sum_added  sum_deleted
    2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61
    2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53
    2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88
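The rollup on this slide can be reproduced with a small Python sketch. It assumes hourly query granularity (the slide does not say which granularity was used) and copies the event rows verbatim from the slide; the function name `rollup` is made up for this example.

```python
from collections import defaultdict

# Rows copied verbatim from the slide: (timestamp, page, city, added, deleted).
raw_events = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "SF", 10, 5),
    ("2011-01-01T00:03:45Z", "Justin Bieber", "SF", 25, 37),
    ("2011-01-01T00:05:62Z", "Justin Bieber", "SF", 15, 19),
    ("2011-01-01T00:06:33Z", "Ke$ha", "LA", 30, 45),
    ("2011-01-01T00:08:51Z", "Ke$ha", "LA", 16, 8),
    ("2011-01-01T00:09:17Z", "Miley Cyrus", "DC", 75, 10),
    ("2011-01-01T00:11:25Z", "Miley Cyrus", "DC", 11, 25),
    ("2011-01-01T00:23:30Z", "Miley Cyrus", "DC", 22, 12),
    ("2011-01-01T00:49:33Z", "Miley Cyrus", "DC", 90, 41),
]


def rollup(events):
    """Group by (truncated timestamp, page, city); keep count and sums."""
    totals = defaultdict(lambda: [0, 0, 0])
    for timestamp, page, city, added, deleted in events:
        hour = timestamp[:13] + ":00:00Z"  # assumption: truncate to the hour
        row = totals[(hour, page, city)]
        row[0] += 1        # count
        row[1] += added    # sum_added
        row[2] += deleted  # sum_deleted
    return {key: tuple(value) for key, value in totals.items()}


rolled = rollup(raw_events)
for key, value in sorted(rolled.items()):
    print(key, value)
```

Nine raw rows collapse into three stored rows, which is where the space savings and the query speedup come from.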
  22. Roll-up vs no roll-up
    Do roll-up
    • Working under space constraints.
    • No need to retain high-cardinality dimensions (like user id, precise location information).
    • Maximize price / performance.
    Don’t roll-up
    • Need the ability to retrieve individual events.
    • May need to group or filter on any column.
  23. Choose column types carefully
    Compared for String columns and Numeric columns: indexed, fast aggregation, fast grouping
  24. Approx Algorithms
    • Data sketches are lossy data structures
    • Trade off accuracy for reduced storage and improved performance.
    • Summarize data at ingestion time using sketches
    • Improves roll-up, reduces memory footprint
  25. Summarize with data sketches
    Raw events:
    timestamp             page           userid  city  added  deleted
    2011-01-01T00:01:35Z  Justin Bieber  user11  SF    10     5
    2011-01-01T00:03:45Z  Justin Bieber  user22  SF    25     37
    2011-01-01T00:05:62Z  Justin Bieber  user11  SF    15     19
    2011-01-01T00:06:33Z  Ke$ha          user33  LA    30     45
    2011-01-01T00:08:51Z  Ke$ha          user33  LA    16     8
    2011-01-01T00:09:17Z  Miley Cyrus    user11  DC    75     10
    2011-01-01T00:11:25Z  Miley Cyrus    user44  DC    11     25
    2011-01-01T00:23:30Z  Miley Cyrus    user44  DC    22     12
    2011-01-01T00:49:33Z  Miley Cyrus    user55  DC    90     41
    After rollup:
    timestamp             page           city  count  sum_added  sum_deleted  userid_sketch
    2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61           sketch_obj
    2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53           sketch_obj
    2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88           sketch_obj
  26. Agenda
    • Introduction to Apache Druid
    • Data modeling optimizations
    • Query optimizations
  27. None
  28. The highlights
    • Return only the data that you immediately need (smaller, more frequent queries rather than all data at once)
    • Timestamps are magical in Druid - harness their power
    • Use approximate algorithms whenever you can
    • SQL makes Druid easier
    • When you can’t use SQL, be sure to choose the best query type
  29. The magical 4th dimension
    • Always filter by timestamp - this is pushed down and limits the segments that get scanned
    • Aggregate to the interval you need within Druid
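Why a timestamp filter limits the segments scanned can be shown with a simple interval-overlap check. This is a conceptual sketch only: the segment interval strings follow the format shown on the "Druid Segments" slide, and the helper names `parse_hour` and `segments_to_scan` are hypothetical.

```python
from datetime import datetime


def parse_hour(text):
    """Parse an hour-resolution timestamp such as '2011-01-01T01'."""
    return datetime.strptime(text, "%Y-%m-%dT%H")


def segments_to_scan(segment_intervals, query_start, query_end):
    """Keep only segments whose [start, end) interval overlaps the query range."""
    selected = []
    for interval in segment_intervals:
        start_text, end_text = interval.split("/")
        start, end = parse_hour(start_text), parse_hour(end_text)
        if start < query_end and query_start < end:
            selected.append(interval)
    return selected


segments = [
    "2011-01-01T00/2011-01-01T01",
    "2011-01-01T01/2011-01-01T02",
    "2011-01-01T02/2011-01-01T03",
]

hits = segments_to_scan(segments, parse_hour("2011-01-01T01"), parse_hour("2011-01-01T02"))
print(hits)
```

A query without a time filter would have to consider all three segments; with the filter, two of them are never touched.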
  30. Use Druid SQL
    • Easier to learn/more familiar
    • Will attempt to make intelligent query type choices (timeseries vs topN vs groupBy)
    • There are some limitations, such as multi-value dimensions, and not all aggregations are supported
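A minimal sketch of preparing a Druid SQL request over HTTP, combining a time filter with aggregation to the hour. The router address `localhost:8888`, the time range, and the output column names are assumptions for illustration; `wikiticker` is the datasource named earlier in the deck. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumed local router address; adjust to your deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Filter on __time and aggregate to the hour inside Druid.
SQL = """
SELECT FLOOR(__time TO HOUR) AS "hour", COUNT(*) AS "events"
FROM wikiticker
WHERE __time >= TIMESTAMP '2011-01-01 00:00:00'
  AND __time <  TIMESTAMP '2011-01-02 00:00:00'
GROUP BY 1
"""


def build_request(sql, url=DRUID_SQL_URL):
    """Build (but do not send) a POST request carrying the SQL as JSON."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(SQL)
print(request.get_method(), request.full_url)
# Against a live cluster: urllib.request.urlopen(request).read()
```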
  31. SQL

  32. Native JSON

  33. When close enough is good enough
    Approximate queries can provide up to 99% accuracy while greatly improving performance
    • Bloom Filters
      ◦ Self Joins
    • Theta Sketches
      ◦ Union/Intersection/Difference
    • HLL Sketches
      ◦ Count Distinct
    • Quantile Sketches
      ◦ Median, percentiles
  34. When close enough is good enough
    • Hashes can be calculated at query time or at ingestion
      ◦ Pre-computed hashes can save up to 50% query time
    • K value determines precision and performance
    • Default values will count within 5% accuracy 99% of the time.
    • HLL and Theta sketches can both provide COUNT DISTINCT support, but HLL will do it faster and more accurately with a smaller data footprint.
    • Theta sketches are more flexible, but require more storage.
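To make the role of the K value concrete, here is a toy K-minimum-values (KMV) sketch, a simpler relative of the Theta sketch idea. It is not the DataSketches implementation Druid uses; the class name `KmvSketch`, the choice of SHA-256, and the sample user ids are all assumptions for illustration.

```python
import hashlib


class KmvSketch:
    """Toy K-minimum-values sketch: keeps the k smallest hash values seen."""

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)

    def add(self, value):
        self.hashes.add(self._hash(value))
        if len(self.hashes) > self.k:
            self.hashes.discard(max(self.hashes))  # drop the largest hash

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k distinct values: exact
        return (self.k - 1) / max(self.hashes)

    def union(self, other):
        merged = KmvSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged


# Two overlapping sets of hypothetical user ids; the true union has 9000 users.
a, b = KmvSketch(), KmvSketch()
for user in range(0, 6000):
    a.add(f"user{user}")
for user in range(3000, 9000):
    b.add(f"user{user}")

print(round(a.union(b).estimate()))
```

A larger k keeps more hashes, so the estimate tightens while the sketch gets bigger and slower to merge, which is the precision versus performance trade-off the slide describes.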
  35. Pick your query carefully
    • Timeseries - When you don’t want to group by dimension
    • TopN - When you want to group by a single dimension
    • GroupBy - Least performant/most flexible
    • Scan - For returning streaming raw data
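The rule of thumb on this slide can be written down as a small decision helper. The function `choose_query_type` and its parameters are hypothetical; the returned strings match the native query type names listed above.

```python
def choose_query_type(group_by_dimensions, raw_rows=False):
    """Pick a native query type following the slide's rule of thumb."""
    if raw_rows:
        return "scan"        # streaming raw rows, no aggregation
    if len(group_by_dimensions) == 0:
        return "timeseries"  # aggregate over time only
    if len(group_by_dimensions) == 1:
        return "topN"        # grouping by a single dimension
    return "groupBy"         # most flexible, least performant


print(choose_query_type([]))
print(choose_query_type(["page"]))
print(choose_query_type(["page", "city"]))
print(choose_query_type([], raw_rows=True))
```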
  36. Download
    Apache Druid community site (new): https://druid.apache.org/
    Apache Druid community site (legacy): http://druid.io/
    Imply distribution: https://imply.io/get-started
  37. Contribute
    https://github.com/apache/druid
  38. Stay in touch
    Follow the Druid project on Twitter! @druidio
    Join the community! http://druid.apache.org/