Apache Druid 2030 (Darin Briskman, Imply) | RTA Summit 2023

Druid 2030: The Origin and Evolution of Real-Time Analytics Applications
Darin Briskman [email protected]

©2023, Imply
A few things before we start
I’m a member of the Druid Community, giving my own opinions and thoughts. I do not speak for the Community, nor for Imply Data. Thanks to the RTA organizers for inviting me and Imply to participate in today’s events. Pinot and Druid are open source cousins, and we’re all working together to advance real-time analytics for everyone.

About me and Imply
Imply was founded by the creators of Apache Druid® to help developers get to successful projects faster

Druid today is very different from how it began

In the Beginning
• Needed fast processing and rollups for AdTech
• Requirement: ingest a billion rows in under 1 minute & query them in under 1 second
• Tried and failed with RDBMS: Greenplum, Postgres, MySQL, and InfoBright
• Tried and failed with NoSQL
• Decided to create a new database. How hard could it be?
• Announced Druid on 30 Apr 2011

Evolution and Open Source
• Rapid iteration as Druid drove the AdTech business
• After getting interest from Netflix and others, released Druid as Open Source in 2012
• Moved to Apache license in 2015
• Promoted to Apache Software Foundation top-level in 2017
• Used by over 1,900 organizations

Global and vibrant Community
• Companies using Druid: 1,900+
• Active Contributors: 500+
• YoY Increase in Community Activity: 150%
• Community Members: 14,000+

Druid Today: a Real-Time Analytics Database
For analytics applications that require:
1. Sub-second queries at any scale: interactive analytics on TB-PBs of data
2. High concurrency at the lowest cost: 100s to 1000s QPS via a highly efficient engine
3. Real-time and historical insights: true stream ingestion for Kafka and Kinesis
Plus, non-stop reliability with automated fault tolerance and continuous backup

The right database for analytics apps makes a difference
Without Druid OR With Druid

Druid is built for the intersection of analytics and applications.
[Diagram labels: Analytics, Applications, Apache Druid]

Real-Time Analytics Requires a Real-Time Database
• Analytics (Data Warehouses): BI Reporting, Monthly Reporting, Static Dashboards
• Applications (Transactional Databases): ACID Compliance, Small Data, Write-optimized
• Real-Time Analytics Applications (Real-time Analytics Database): ✓ Read-optimized ✓ TB-PBs of Data ✓ High Cardinality ✓ Sub-Sec Response ✓ High Concurrency ✓ Real-time Data

Examples of real-time analytics applications
• Operational Visibility at Scale
• Rapid Data Exploration
• Customer-facing Analytics
• Real-time Decisioning
Powered by Druid: ICE Security Ops Platform, Citrix Analytics Service, Salesforce Edge Intelligence, Reddit Real-Time Ads

What makes Druid different
Storage Segmentation Maturity Scale Community Success

Focus on community success
Retail, Financial, Gaming, Networking/Energy, Technology, Security, Ad Tech, Media
And many more!

Coming in May: Druid 26.0
• Schema auto-discovery: get both high performance & flexibility
• More ANSI SQL Compatibility: Unnest & Arrays
• Shuffle Joins: simpler data ingestion
Plus additional features including Sessionization, Interpolation, and Advanced Dictionary Compression

With Druid you now get both high performance & flexibility
• Strong Data Types: High Performance
• Data Type Discovery (New): Ease of Use like schemaless databases

IoT Use Case
Voice Assistant: Sends data about each request plus periodic status updates
[Diagram: Voice Assistant, Streaming Pipeline, Analytics Database, Data Analyst]

IoT Use Case: Adding Humidity & Temperature Data
Voice Assistant: Temperature & Humidity sensors enabled
[Diagram: Voice Assistant, Temp & Humidity Data, Streaming Pipeline, Analytics Database, Data Analyst]
• No Broken Schemas: new columns for temperature & humidity are automatically discovered and added with the right data type to the Druid table
• Auto-discover and add New Data Field & Type
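
The auto-discovery idea on this slide can be illustrated with a small Python sketch. This is not Druid's implementation: the function names, the type labels, and the widening rule (long plus double becomes double, any other conflict falls back to string) are assumptions made for illustration. It only shows how new fields such as temperature and humidity could be picked up with a type, without a predeclared schema.

```python
from typing import Any, Dict, Iterable


def infer_type(value: Any) -> str:
    """Map a Python value to an illustrative column type label."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array"
    return "string"


def discover_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Build a column -> type mapping as records arrive (hypothetical helper)."""
    schema: Dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue  # a null carries no type information
            seen = infer_type(value)
            known = schema.get(column)
            if known is None:
                schema[column] = seen  # new column discovered
            elif known != seen:
                # Assumed widening rule: long + double -> double, else string.
                numeric = {known, seen} == {"long", "double"}
                schema[column] = "double" if numeric else "string"
    return schema


events = [
    {"device": "assistant-1", "request_ms": 120},
    # Sensors enabled: two new fields appear mid-stream.
    {"device": "assistant-1", "request_ms": 95, "temperature": 21.5, "humidity": 40},
    {"device": "assistant-2", "request_ms": 101, "temperature": 22, "humidity": 41.2},
]
schema = discover_schema(events)
print(schema)
```

The first event defines only two columns; the later events add temperature and humidity, and the mixed integer and float readings are widened to a single numeric type rather than breaking the schema.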

Joins Prepare Incoming Data for Fast Analytics in Druid
[Diagram: Table 1, Table 2, Table 3; Partition 1, Partition 2, ... Partition n]

Pre-joining Data was Necessary to Bring Datasets Together
[Diagram: Third Party Tools, Fact Table A, Ingest, Store, Query, Real-time Analytics]

Ingestion Becomes Simpler, Easier, and Less Expensive
Druid Can Now Join Datasets at Ingestion
[Diagram: Fact Table A, Ingest, Store, Query]

Extend Standard SQL Features: Keeping up with Evolving ANSI SQL Standards
• Unnest: “I want to UNNEST this repeated record into its own little temporary table.”
• Array: “I wish I could do a SQL join without getting duplicate rows back.”
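
To make the UNNEST request concrete, here is a minimal Python sketch of the semantics: each element of an array-valued column becomes its own row, with the other columns repeated. The function name, the column names, and the sample rows are hypothetical. Rows with an empty or missing array produce no output, which mirrors how a cross join against an unnested array usually behaves in SQL.

```python
from typing import Any, Dict, Iterable, Iterator, List


def unnest(
    rows: Iterable[Dict[str, Any]], array_column: str, output_column: str
) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per element of `array_column` (illustrative only)."""
    for row in rows:
        for element in row.get(array_column) or []:
            # Copy every column except the array, then attach the element.
            flat = {k: v for k, v in row.items() if k != array_column}
            flat[output_column] = element
            yield flat


orders: List[Dict[str, Any]] = [
    {"order_id": 1, "items": ["milk", "bread"]},
    {"order_id": 2, "items": ["batteries"]},
    {"order_id": 3, "items": []},  # empty array: contributes no rows
]
flat_rows = list(unnest(orders, "items", "item"))
for flat_row in flat_rows:
    print(flat_row)
```

Keeping the repeated values in one array column, and expanding them only when a query needs per-element rows, is what avoids the duplicate rows that a join against a separate child table would return.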

What is the average basket size with and without groceries? And how is it trending over the last 18 months?
Billions of transactions/month
Standard Query: takes too long and isn’t necessary

Statistics teaches us: when there is too much data, sample
What is the average basket size with and without groceries? And how is it trending over the last 18 months?
Billions of transactions/month
• Sample GROUP BY: query a subset of data
• Druid ensures it is statistically valid
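
The sampling argument can be sketched in a few lines of Python. The slide does not say how Druid draws or validates its sample, so this sketch assumes a plain uniform random sample over synthetic data; the population size, sample size, and basket-size distributions are all made up for illustration. It compares a grouped average computed on a 10% sample with the same average computed on every row.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Transaction = Tuple[bool, int]  # (basket contains groceries, basket size)


def average_by_group(transactions: Iterable[Transaction]) -> Dict[bool, float]:
    """GROUP BY has_groceries, AVG(basket_size), written out by hand."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, count]
    for has_groceries, basket_size in transactions:
        totals[has_groceries][0] += basket_size
        totals[has_groceries][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic population: grocery baskets are assumed to be larger on average.
population = []
for _ in range(100_000):
    has_groceries = rng.random() < 0.5
    basket_size = rng.randint(1, 20) if has_groceries else rng.randint(1, 8)
    population.append((has_groceries, basket_size))

sample = rng.sample(population, 10_000)  # uniform sample without replacement

exact = average_by_group(population)
estimate = average_by_group(sample)
for group in (True, False):
    print(f"groceries={group}: exact={exact[group]:.2f} sample={estimate[group]:.2f}")
```

With thousands of sampled rows per group, the standard error of each average is small compared with the average itself, which is why scanning a subset is usually enough for a trend question like this one.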

What was the impact on sales from an approaching hurricane?
• Automatically figure out time boundaries of a data set
• Find day parts & start/end of impactful events
• Time weighted averages and interpolation
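
As a rough illustration of the last bullet, the following Python sketch computes a time-weighted average and a linearly interpolated value from irregularly spaced readings. It is a generic textbook formulation (trapezoid rule over linear interpolation), not Druid's function set; the function names and the sample readings are assumptions. Timestamps are assumed to be sorted and strictly increasing.

```python
from typing import List, Tuple

Reading = Tuple[float, float]  # (timestamp in seconds, value)


def time_weighted_average(samples: List[Reading]) -> float:
    """Average weighted by how long each value was in effect (trapezoid rule)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    area = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        area += (t1 - t0) * (v0 + v1) / 2
    return area / (samples[-1][0] - samples[0][0])


def interpolate(samples: List[Reading], t: float) -> float:
    """Linearly interpolate the value at time t between neighbouring samples."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the sampled range")


# Hypothetical sales rate: flat, then a jump that persists for a long interval.
sales = [(0, 100.0), (60, 100.0), (120, 400.0), (600, 400.0)]

plain_mean = sum(v for _, v in sales) / len(sales)
weighted = time_weighted_average(sales)
midpoint = interpolate(sales, 90)
print(plain_mean, weighted, midpoint)
```

The plain mean treats every reading equally, while the time-weighted average gives the long stretch at the higher value its proper share, which matters when readings arrive at uneven intervals around an event.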

Advanced String Dictionary Compression
100 TBs to 70 TBs
• Saving up to 30% on string storage w/ ZERO impact on performance
• Efficiency of querying numbers while retaining all the flexibility of the human language
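
The idea of querying numbers while keeping strings can be sketched in Python with two generic techniques: dictionary encoding (each distinct string is stored once and rows hold integer ids) and front coding (each sorted dictionary entry stores only the part that differs from the previous entry). The slide does not say which compression scheme Druid uses, so front coding here is an assumption chosen for illustration, and all function names are hypothetical.

```python
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (sorted dictionary, per-row integer ids)."""
    dictionary = sorted(set(values))
    ids = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [ids[v] for v in values]


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(sorted_strings: List[str]) -> List[Tuple[int, str]]:
    """Store each entry as (prefix length shared with previous entry, suffix)."""
    coded = []
    previous = ""
    for s in sorted_strings:
        k = shared_prefix_len(previous, s)
        coded.append((k, s[k:]))
        previous = s
    return coded


def front_decode(coded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original sorted strings from their front-coded form."""
    out = []
    previous = ""
    for k, suffix in coded:
        previous = previous[:k] + suffix
        out.append(previous)
    return out


column = [
    "https://imply.io/polaris",
    "https://druid.apache.org",
    "https://imply.io/polaris",
    "https://imply.io/druid-architecture-concepts/",
]
dictionary, encoded = dictionary_encode(column)
coded = front_code(dictionary)

# A filter on a string value becomes an integer comparison on the ids.
target_id = dictionary.index("https://imply.io/polaris")
matches = sum(1 for value_id in encoded if value_id == target_id)
print(encoded, coded, matches)
```

Rows are compared as integers at query time, and the dictionary itself shrinks because long shared prefixes are stored only once.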

Next Generation Infrastructure
• 2023-24: Memory Pooling. Effective 50%+ reduction in costs of database infrastructure
• 2025-26: Data Processing Accelerators. Co-processors that improve database performance per CPU by 3x - 20x

Cloud Maturity
• Today: ~45% of global IT infrastructure on the Cloud
• End of 2027: ~85% of global IT infrastructure on the Cloud
• Streaming Data and Real-Time Analytics Applications are the default on the Cloud
• Managed services become even more dominant as a delivery model

RTA and Machine Learning
• AI/ML becomes just another piece of the analytic toolset, like regressions and sketches.
• By 2025, all data-center class CPUs include packaged GPUs for machine learning inference acceleration.
• Active GANs optimize segmentation and query optimization.
• LLMs automate data gathering and (ironically) data quality

Real-Time Analytics Applications in 2030
• EB scale and beyond
• Still subsecond response
• Ubiquitous streaming
• Most queries still SQL

Real-Time Analytics Applications in 2030
• Open Source wins
• Druid is one of the projects that lead Real-Time Analytics databases (no single winner)
• Streaming and Analytics delivered as managed services
• Mix of edge and centralized computing
• + ???

Darin Briskman
Director of Technology
[email protected]
• Join the Druid Community! https://druid.apache.org
• Druid Architecture and Concepts https://imply.io/druid-architecture-concepts/
• Building Real-Time Analytics Applications https://bit.ly/40AIlB6
• Try Polaris, the Druid DBaaS https://imply.io/polaris
