Slide 1

Slide 1 text

Druid 2030
The Origin and Evolution of Real-Time Analytics Applications
Darin Briskman
[email protected]

Slide 2

Slide 2 text

©2023, Imply
A few things before we start
● I’m a member of the Druid Community, giving my own opinions and thoughts. I do not speak for the Community, nor for Imply Data.
● Thanks to the RTA organizers for inviting me and Imply to participate in today’s events.
● Pinot and Druid are open source cousins, and we’re all working together to advance real-time analytics for everyone.

Slide 3

Slide 3 text

About me and Imply
Imply was founded by the creators of Apache Druid® to help developers get to successful projects faster.

Slide 4

Slide 4 text

Druid today is very different from how it began

Slide 5

Slide 5 text

In the Beginning
● Needed fast processing and rollups for AdTech
● Requirement: ingest a billion rows in under 1 minute & query them in under 1 second
● Tried and failed with RDBMS: Greenplum, Postgres, MySQL, and InfoBright
● Tried and failed with NoSQL
● Decided to create a new database. How hard could it be?
● Announced Druid on 30 Apr 2011

Slide 6

Slide 6 text

Evolution and Open Source
● Rapid iteration as Druid drove the AdTech business
● After getting interest from Netflix and others, released Druid as open source in 2012
● Moved to the Apache license in 2015
● Promoted to Apache Software Foundation top-level project in 2017
● Used by over 1,900 organizations

Slide 7

Slide 7 text

Global and vibrant Community
● Companies using Druid: 1,900+
● YoY Increase in Community Activity: 150%
● Community Members: 14,000+
● Active Contributors: 500+

Slide 8

Slide 8 text

Druid Today: a Real-Time Analytics Database
For analytics applications that require:
1. Sub-second queries at any scale: interactive analytics on TB-PBs of data
2. High concurrency at the lowest cost: 100s to 1000s QPS via a highly efficient engine
3. Real-time and historical insights: true stream ingestion for Kafka and Kinesis
Plus, non-stop reliability with automated fault tolerance and continuous backup

Slide 9

Slide 9 text

The right database for analytics apps makes a difference
Without Druid OR With Druid

Slide 10

Slide 10 text

Druid is built for the intersection of analytics and applications.
[Diagram labels: Analytics, Applications, Analytics Applications, Apache Druid]

Slide 11

Slide 11 text

Real-Time Analytics Requires a Real-Time Database
● Analytics (Data Warehouses): BI Reporting, Monthly Reporting, Static Dashboards
● Applications (Transactional Databases): ACID Compliance, Small Data, Write-optimized
● Real-Time Analytics Applications (Real-time Analytics Database): ✓ Read-optimized, ✓ TB-PBs of Data, ✓ High Cardinality, ✓ Sub-Sec Response, ✓ High Concurrency, ✓ Real-time Data

Slide 12

Slide 12 text

Examples of real-time analytics applications
● Use cases: Operational Visibility at Scale, Rapid Data Exploration, Customer-facing Analytics, Real-time Decisioning
● Examples (powered by Druid): ICE Security Ops Platform, Citrix Analytics Service, Salesforce Edge Intelligence, Reddit Real-Time Ads

Slide 13

Slide 13 text

What makes Druid different
Storage Segmentation Maturity Scale Community Success

Slide 14

Slide 14 text

Focus on maturity
Data Warehousing

Slide 15

Slide 15 text

Focus on community success
Retail, Financial, Gaming, Networking/Energy, Technology, Security, Ad Tech, Media
And many more!

Slide 16

Slide 16 text

Coming in May: Druid 26.0
● Schema auto-discovery: get both high performance & flexibility
● More ANSI SQL compatibility: Unnest & Arrays
● Shuffle Joins: simpler data ingestion
Plus additional features including Sessionization, Interpolation, and Advanced Dictionary Compression

Slide 17

Slide 17 text

With Druid you now get both high performance & flexibility
● Strong Data Types: high performance
● Data Type Discovery (new): ease of use like schemaless databases

Slide 18

Slide 18 text

IoT Use Case
Voice Assistant → Streaming Pipeline → Analytics Database → Data Analyst
The voice assistant sends data about each request plus periodic status updates.

Slide 19

Slide 19 text

IoT Use Case: Adding Humidity & Temperature Data
Voice Assistant (temperature & humidity sensors enabled) → Streaming Pipeline → Analytics Database → Data Analyst
● Auto-discover and add new data field & type
● No broken schemas: new columns for temperature & humidity are automatically discovered and added with the right data type to the Druid table
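
The discovery step on this slide can be pictured with a small Python sketch. It is a conceptual illustration only, not Druid's implementation: the type names, the `infer_type` and `discover_schema` helpers, and the sample events are all assumptions made for the example.

```python
def infer_type(value):
    """Map a Python value to a hypothetical column type name."""
    # bool is checked first because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "LONG"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"


def discover_schema(events, schema=None):
    """Return a schema extended with any new columns seen in the events."""
    schema = dict(schema or {})
    for event in events:
        for column, value in event.items():
            if value is None:
                continue  # a null carries no type information
            new_type = infer_type(value)
            old_type = schema.get(column)
            if old_type is None:
                schema[column] = new_type  # new column discovered
            elif old_type != new_type:
                # Widen LONG + DOUBLE to DOUBLE, otherwise fall back to STRING
                if {old_type, new_type} == {"LONG", "DOUBLE"}:
                    schema[column] = "DOUBLE"
                else:
                    schema[column] = "STRING"
    return schema


# Hypothetical events: the second one arrives after the sensors are enabled
events = [
    {"device_id": "va-1", "request_ms": 120},
    {"device_id": "va-2", "request_ms": 95, "temperature": 21.5, "humidity": 40},
]
schema = discover_schema(events)
```

The second event adds `temperature` and `humidity` without any change to the first event's columns, which is the "no broken schemas" behavior the slide describes.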

Slide 20

Slide 20 text

Joins Prepare Incoming Data for Fast Analytics in Druid
[Diagram labels: Table 1, Table 2, Table 3; Partition 1, Partition 2, ..., Partition n]

Slide 21

Slide 21 text

Pre-joining Data was Necessary to Bring Datasets Together
[Diagram labels: Third Party Tools, Fact Table A, Ingest, Store, Query, Real-time Analytics]

Slide 22

Slide 22 text

Ingestion Becomes Simpler, Easier, and Less Expensive
Druid Can Now Join Datasets at Ingestion
[Diagram labels: Fact Table A, Ingest, Store, Query]
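
As a rough picture of what joining at ingestion means, the sketch below enriches incoming fact rows with a small dimension table before they are stored. It is a generic hash join in Python, not Druid's join engine; the `join_at_ingestion` helper, column names, and sample rows are hypothetical.

```python
def join_at_ingestion(fact_rows, dimension_rows, key):
    """Inner-join fact rows to dimension rows on `key` as rows arrive."""
    # Build a hash lookup once from the (smaller) dimension table
    lookup = {row[key]: row for row in dimension_rows}
    joined = []
    for fact in fact_rows:
        match = lookup.get(fact[key])
        if match is not None:
            # Merge dimension columns into the fact row; fact values win on conflict
            joined.append({**match, **fact})
    return joined


# Hypothetical AdTech-style data
clicks = [
    {"campaign_id": 7, "clicks": 3},
    {"campaign_id": 9, "clicks": 1},
    {"campaign_id": 8, "clicks": 5},
]
campaigns = [
    {"campaign_id": 7, "advertiser": "acme"},
    {"campaign_id": 8, "advertiser": "globex"},
]
enriched = join_at_ingestion(clicks, campaigns, "campaign_id")
```

The stored rows already carry the `advertiser` column, so queries do not need to join at read time.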

Slide 23

Slide 23 text

Extend Standard SQL Features
Keeping up with Evolving ANSI SQL Standards
● Unnest: “I want to UNNEST this repeated record into its own little temporary table.”
● Array: “I wish I could do a SQL join without getting duplicate rows back.”
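
To make the UNNEST idea concrete, here is a minimal Python sketch of what the operation does to rows that hold an array: each array element becomes its own row. The `unnest` helper and the sample orders are hypothetical, and dropping rows with empty arrays is a choice made for this sketch rather than a statement about Druid's semantics.

```python
def unnest(rows, array_column, alias):
    """Expand each element of `array_column` into its own output row."""
    out = []
    for row in rows:
        for item in row.get(array_column) or []:
            # Copy every other column and expose the element under `alias`
            new_row = {k: v for k, v in row.items() if k != array_column}
            new_row[alias] = item
            out.append(new_row)
    return out


# Hypothetical rows with a repeated field
orders = [
    {"order_id": 1, "items": ["milk", "bread"]},
    {"order_id": 2, "items": ["apples"]},
    {"order_id": 3, "items": []},
]
flat = unnest(orders, "items", "item")
```

Once the array is flattened this way, ordinary GROUP BY style aggregation can run over the individual elements.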

Slide 24

Slide 24 text

What is the average basket size with and without groceries? And how is it trending over the last 18 months?
● Billions of transactions per month
● A standard query takes too long and isn’t necessary

Slide 25

Slide 25 text

Statistics teaches us: when there is too much data, sample
What is the average basket size with and without groceries? And how is it trending over the last 18 months?
● Billions of transactions per month
● Sample GROUP BY: query a subset of the data
● Druid ensures the sample is statistically valid
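
The reasoning behind sampling can be shown with a short Python sketch: estimate a mean from a random subset and report a margin of error. The slide does not say how Druid validates its samples, so this uses plain simple random sampling with a normal-approximation 95% interval as an assumption; the `estimate_mean` helper and the synthetic basket sizes are hypothetical.

```python
import math
import random
import statistics


def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * stderr


# Synthetic basket sizes standing in for a much larger transaction table
rng = random.Random(42)
baskets = [rng.randint(1, 40) for _ in range(200_000)]

true_mean = statistics.fmean(baskets)
estimate, margin = estimate_mean(baskets, sample_size=5_000)
print(f"full scan: {true_mean:.2f}, sample: {estimate:.2f} +/- {margin:.2f}")
```

Reading 2.5% of the rows gives an estimate whose margin of error shrinks with the square root of the sample size, which is why a full scan is often unnecessary for trend questions.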

Slide 26

Slide 26 text

What was the impact on sales from an approaching hurricane?
● Automatically figure out time boundaries of a data set
● Find day parts & start/end of impactful events
● Time-weighted averages and interpolation
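
Time-weighted averages and interpolation can be sketched in a few lines of Python. This is a generic illustration using linear interpolation and the trapezoidal rule, not Druid's functions; the helper names and the sample readings are hypothetical, and timestamps are assumed to be sorted and distinct.

```python
def interpolate(points, t):
    """Linearly interpolate a value at time t from sorted (time, value) points."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the observed time range")


def time_weighted_average(points):
    """Average that weights each interval by its duration (trapezoidal rule)."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += (v0 + v1) / 2 * (t1 - t0)
    return area / (points[-1][0] - points[0][0])


# Hypothetical irregular readings: (hour, sales)
sales = [(0, 100.0), (1, 100.0), (4, 40.0), (6, 80.0)]

mid_value = interpolate(sales, 2.5)      # value between the hour-1 and hour-4 readings
twa = time_weighted_average(sales)       # accounts for uneven gaps between readings
plain_mean = sum(v for _, v in sales) / len(sales)
```

With unevenly spaced readings the time-weighted average differs from the plain mean of the readings, because long intervals count for more than short ones.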

Slide 27

Slide 27 text

Advanced String Dictionary Compression
● 100 TBs → 70 TBs: saving up to 30% on string storage with ZERO impact on performance
● Efficiency of querying numbers while retaining all the flexibility of human language
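
The "querying numbers" point refers to dictionary encoding: each distinct string is stored once and rows hold small integer ids. The Python sketch below shows that idea, plus front coding of the sorted dictionary as one well-known way to shrink it further. The slide does not name the compression technique, so front coding is an assumption here, and all helper names and sample strings are hypothetical.

```python
import os


def dictionary_encode(values):
    """Replace each string with an integer id into a sorted dictionary."""
    dictionary = sorted(set(values))
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def front_code(sorted_strings):
    """Store each string as (shared prefix length with previous, remaining suffix)."""
    coded = []
    previous = ""
    for s in sorted_strings:
        shared = len(os.path.commonprefix([previous, s]))
        coded.append((shared, s[shared:]))
        previous = s
    return coded


def front_decode(coded):
    """Rebuild the original sorted strings from front-coded entries."""
    out = []
    previous = ""
    for shared, suffix in coded:
        s = previous[:shared] + suffix
        out.append(s)
        previous = s
    return out


# Hypothetical column values with long shared prefixes
pages = ["example.com/cart", "example.com/checkout", "example.com/cart", "example.com/home"]

dictionary, ids = dictionary_encode(pages)
coded = front_code(dictionary)
raw_chars = sum(len(s) for s in dictionary)
coded_chars = sum(len(suffix) for _, suffix in coded)
```

Filters and group-bys can then compare the integer ids, while the front-coded dictionary keeps the original strings recoverable.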

Slide 28

Slide 28 text

Coming Later this Year: Async Queries + Cold Storage

Slide 29

Slide 29 text

Streams Everywhere

Slide 30

Slide 30 text

More Streaming = More Real-Time Data

Slide 31

Slide 31 text

Next Generation Infrastructure
● 2023-24: Memory Pooling. Effective 50%+ reduction in costs of database infrastructure
● 2025-26: Data Processing Accelerators. Co-processors that improve database performance per CPU by 3x-20x

Slide 32

Slide 32 text

Cloud Maturity
● Today: ~45% of global IT infrastructure on the Cloud
● End of 2027: ~85% of global IT infrastructure on the Cloud
● Streaming Data and Real-Time Analytics Applications are the default on the Cloud
● Managed services become even more dominant as a delivery model

Slide 33

Slide 33 text

RTA and Machine Learning
● AI/ML becomes just another piece of the analytic toolset, like regressions and sketches.
● By 2025, all data-center class CPUs include packaged GPUs for machine learning inference acceleration.
● Active GANs tune segmentation and query optimization.
● LLMs automate data gathering and (ironically) data quality.

Slide 34

Slide 34 text

Real-Time Analytics Applications in 2030
● EB scale and beyond
● Still subsecond response
● Ubiquitous streaming
● Most queries still SQL

Slide 35

Slide 35 text

Real-Time Analytics Applications in 2030
● Open Source wins
● Druid is one of the projects that lead Real-Time Analytics databases (no single winner)
● Streaming and Analytics delivered as managed services
● Mix of edge and centralized computing
+ ???

Slide 36

Slide 36 text

Questions? … maybe Answers

Slide 37

Slide 37 text

Darin Briskman, Director of Technology
[email protected]
● Join the Druid Community! https://druid.apache.org
● Druid Architecture and Concepts: https://imply.io/druid-architecture-concepts/
● Building Real-Time Analytics Applications: https://bit.ly/40AIlB6
● Try Polaris, the Druid DBaaS: https://imply.io/polaris