Apache Druid 2030 (Darin Briskman, Imply) | RTA Summit 2023

First created in 2011, Apache® Druid is now in its second decade of empowering real-time analytics. How does Druid provide subsecond queries at petabyte scale, high concurrency, and combined stream and batch analytics? What have we learned from the first decade of Druid, and what is changing? What will Druid look like in 2030?

StarTree

May 23, 2023

Transcript

  1. Druid 2030
    The Origin and Evolution of Real-Time Analytics Applications
    Darin Briskman
    [email protected]

  2. A few things before we start
    ©2023, Imply
    I’m a member of the Druid Community, giving my own opinions and
    thoughts.
    I do not speak for the Community, nor for Imply Data.
    Thanks to the RTA organizers for inviting me and Imply to participate
    in today’s events. Pinot and Druid are open source cousins, and
    we’re all working together to advance real-time analytics for
    everyone.

  3. About me and Imply
    Imply was founded by the creators of Apache Druid® to help developers
    get to successful projects faster

  4. Druid today is very different from how it began

  5. In the Beginning
    ● Needed fast processing and rollups for AdTech
    ● Requirement: ingest a billion rows in under 1 minute & query them in under 1 second
    ● Tried and failed with RDBMS: Greenplum, Postgres, MySQL, and InfoBright
    ● Tried and failed with NoSQL
    ● Decided to create a new database. How hard could it be?
    ● Announced Druid on 30 Apr 2011

  6. Evolution and Open Source
    Rapid iteration as Druid drove the AdTech Business
    After getting interest from Netflix and others, released Druid as Open Source in 2012.
    Moved to Apache license in 2015.
    Promoted to Apache Software Foundation top-level in 2017.
    Used by over 1900 organizations.

  7. Global and vibrant Community
    Companies using Druid: 1,900+
    Active Contributors: 500+
    YoY Increase in Community Activity: 150%
    Community Members: 14,000+

  8. Druid Today: a Real-Time Analytics Database
    For analytics applications that require:
    1. Sub-second queries at any scale
       Interactive analytics on TB-PBs of data
    2. High concurrency at the lowest cost
       100s to 1000s QPS via a highly efficient engine
    3. Real-time and historical insights
       True stream ingestion for Kafka and Kinesis
    Plus, non-stop reliability with automated fault tolerance and continuous backup

  9. The right database for analytics apps makes a difference
    Without Druid OR With Druid

  10. Druid is built for the intersection of analytics and applications.
    Diagram labels: Applications, Analytics Applications, Apache Druid

  11. Real-Time Analytics Requires a Real-Time Database
    Analytics (Data Warehouses): BI Reporting, Monthly Reporting, Static Dashboards
    Applications (Transactional Databases): ACID Compliance, Small Data, Write-optimized
    Real-Time Analytics Applications (Real-time Analytics Database): Read-optimized, TB-PBs of Data, High Cardinality, Sub-Sec Response, High Concurrency, Real-time Data

  12. Examples of real-time analytics applications
    Operational Visibility at Scale
    Rapid Data Exploration
    Customer-facing Analytics
    Real-time Decisioning
    Examples: ICE Security Ops Platform, Citrix Analytics Service, Salesforce Edge Intelligence, Reddit Real-Time Ads

  13. What makes Druid different
    Storage Segmentation
    Maturity
    Scale
    Community Success

  14. Focus on maturity
    Data Warehousing

  15. Focus on community success
    Retail, Financial, Gaming, Networking/Energy, Technology, Security, Ad Tech, Media
    And many more!

  16. Coming in May: Druid 26.0
    Schema auto-discovery: Get both high performance & flexibility
    More ANSI SQL Compatibility: Unnest & Arrays
    Simpler Data Ingestion: Shuffle Joins
    Plus additional features including Sessionization, Interpolation, and
    Advanced Dictionary Compression

  17. With Druid you now get both high performance & flexibility
    Strong Data Types
    High Performance
    Data Type Discovery (New)
    Ease of Use like schemaless databases

  18. IoT Use Case
    Voice Assistant
    Sends data about each request plus periodic status updates
    Streaming Pipeline
    Analytics Database
    Data Analyst

  19. IoT Use Case
    Voice Assistant
    Temperature & Humidity sensors enabled
    Temp & Humidity Data
    Streaming Pipeline
    Analytics Database
    Data Analyst
    Adding Humidity & Temperature Data: No Broken Schemas
    New columns for temperature & humidity are automatically discovered and
    added with the right data type to the Druid table
    Auto-discover and add New Data Field & Type
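
The auto-discovery behaviour on this slide can be pictured with a minimal Python sketch. It is not Druid's implementation: the function names (`infer_type`, `widen`, `discover_schema`), the three type names, and the widening rule are all assumptions chosen for illustration.

```python
# Hypothetical sketch of schema auto-discovery; not Druid's actual code.
# Type names and the widening order are illustrative assumptions.
WIDEN_ORDER = ["long", "double", "string"]


def infer_type(value):
    """Map a Python value to one of the illustrative column types."""
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def widen(current, seen):
    """Pick the more general of two types (long < double < string)."""
    return max(current, seen, key=WIDEN_ORDER.index)


def discover_schema(events, schema=None):
    """Return a column -> type mapping, adding columns as they first appear."""
    schema = dict(schema or {})
    for event in events:
        for column, value in event.items():
            if value is None:
                continue  # a null tells us nothing about the column type
            seen = infer_type(value)
            schema[column] = widen(schema[column], seen) if column in schema else seen
    return schema


events = [
    {"device_id": "va-1", "request_ms": 120},
    # The device later starts sending temperature and humidity readings.
    {"device_id": "va-1", "request_ms": 95, "temperature": 21.5, "humidity": 40},
]
print(discover_schema(events))
```

In this sketch the second event adds `temperature` and `humidity` to the schema without any change to the columns already discovered, which is the "no broken schemas" idea from the slide.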

  20. Joins Prepare Incoming Data for Fast Analytics in Druid
    Table 1: Partition 1, Partition 2, ... Partition n
    Table 2: Partition 1, Partition 2, ... Partition n
    Table 3
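
The diagram on this slide, with two partitioned tables feeding a third, resembles a hash-partitioned (shuffle) join. The Python sketch below shows that general technique only. It assumes an inner equi-join on a single key, and the names `shuffle` and `shuffle_join` are hypothetical; it does not describe how Druid executes joins internally.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffle_join(left, right, key, num_partitions=4):
    """Inner join: rows with equal keys land in the same partition pair."""
    joined = []
    left_parts = shuffle(left, key, num_partitions)
    right_parts = shuffle(right, key, num_partitions)
    for left_part, right_part in zip(left_parts, right_parts):
        # Build a lookup on the right side of this partition only.
        index = defaultdict(list)
        for row in right_part:
            index[row[key]].append(row)
        # Probe it with the left side of the same partition.
        for row in left_part:
            for match in index.get(row[key], []):
                joined.append({**match, **row})
    return joined


orders = [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}, {"sku": "z", "qty": 9}]
products = [{"sku": "a", "name": "apple"}, {"sku": "b", "name": "bread"}]
table_3 = sorted(shuffle_join(orders, products, "sku"), key=lambda r: r["sku"])
print(table_3)
```

Because each partition pair can be joined independently, the work can be spread across workers, which is why this pattern suits large ingestion-time joins.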

  21. Pre-joining Data was Necessary to Bring Datasets Together
    Fact Table A
    Ingest Store Query
    Third Party Tools Real-time Analytics

  22. Ingestion Becomes Simpler, Easier, and Less Expensive
    Fact Table A
    Ingest Store Query
    Druid Can Now Join Datasets at Ingestion

  23. Extend Standard SQL Features
    Keeping up with Evolving ANSI SQL Standards
    Unnest: “I want to UNNEST this repeated record into its own little temporary table.”
    Array: “I wish I could do a SQL join without getting duplicate rows back.”
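
To make the UNNEST idea concrete, here is a small Python sketch of the semantics: one output row per element of an array-valued column. The function name `unnest` and the sample rows are hypothetical, and this only imitates the relational behaviour; it is not Druid SQL.

```python
def unnest(rows, array_column, alias):
    """Yield one row per element of `array_column`, stored under `alias`."""
    for row in rows:
        base = {k: v for k, v in row.items() if k != array_column}
        for element in row.get(array_column) or []:
            yield {**base, alias: element}


baskets = [
    {"order_id": 1, "items": ["milk", "bread"]},
    {"order_id": 2, "items": []},  # empty array: contributes no rows here
]
for flat_row in unnest(baskets, "items", "item"):
    print(flat_row)
```

Keeping the values in an array until they are needed is also what avoids the duplicate rows mentioned in the second quote: the order stays a single row until it is explicitly unnested.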

  24. What is the average basket size with and without groceries? And how is
    it trending over the last 18 months?
    Billions of transactions/month
    Standard Query: takes too long + isn’t necessary

  25. Statistics teaches us: when there is too much data, sample
    Sample GROUP BY
    Query a subset of data
    Druid ensures it is statistically valid
    Billions of transactions/month
    What is the average basket size with and without groceries? And how is it
    trending over the last 18 months?
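
The statistical idea behind sampling can be sketched in Python: estimate an average from a simple random sample and report an approximate 95% margin of error. This is a textbook estimator, not Druid's sampling method, and the function name `estimate_mean`, the synthetic basket sizes, and the sample size are assumptions for illustration.

```python
import math
import random
import statistics


def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * stderr


# Synthetic stand-in for one group's basket sizes (hypothetical data).
data_rng = random.Random(42)
basket_sizes = [data_rng.uniform(5, 100) for _ in range(100_000)]

estimate, margin = estimate_mean(basket_sizes, sample_size=2_000)
print(f"estimated average basket size: {estimate:.2f} +/- {margin:.2f}")
```

Running the same estimator once per group (with groceries, without groceries) and per month would give the trend asked for on the slide while reading only a fraction of the rows.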

  26. What was the impact on sales from an approaching hurricane?
    Automatically figure out time boundaries of a data set
    Find day parts & start/end of impactful events
    Time weighted averages and interpolation
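
A short Python sketch may help show what time-weighted averages and interpolation mean for irregularly spaced readings. It assumes linear interpolation and a trapezoidal average over `(time, value)` points sorted by strictly increasing time; the function names are hypothetical and this is not Druid's implementation.

```python
def interpolate(points, t):
    """Linearly interpolate the value at time t between observed points."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the observed time range")


def time_weighted_average(points):
    """Average weighted by how long each value was in effect (trapezoid rule)."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += (v0 + v1) / 2 * (t1 - t0)
    return area / (points[-1][0] - points[0][0])


# (minutes, sales per minute): readings are not evenly spaced.
readings = [(0, 10.0), (10, 20.0), (40, 20.0)]
print(interpolate(readings, 5))          # 15.0
print(time_weighted_average(readings))   # 18.75
```

A plain mean of the three readings would be about 16.67, while the time-weighted average is 18.75 because the value stayed at 20 for most of the interval.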

  27. Advanced String Dictionary Compression
    100 TBs → 70 TBs
    Saving up to 30% on string storage w/ ZERO impact on performance
    Efficiency of querying numbers while retaining all the flexibility of the
    human language
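
One common way to shrink a sorted string dictionary is front coding, where each entry stores only the length of the prefix it shares with the previous entry plus the remaining suffix. The slide does not name the technique, so treating it as front coding is an assumption, and the Python sketch below (with hypothetical function names) shows the general idea rather than Druid's storage format.

```python
def front_encode(sorted_values):
    """Encode each string as (shared_prefix_length, suffix) vs. the previous one."""
    encoded = []
    previous = ""
    for value in sorted_values:
        shared = 0
        limit = min(len(previous), len(value))
        while shared < limit and previous[shared] == value[shared]:
            shared += 1
        encoded.append((shared, value[shared:]))
        previous = value
    return encoded


def front_decode(encoded):
    """Rebuild the original strings from (shared_prefix_length, suffix) pairs."""
    values = []
    previous = ""
    for shared, suffix in encoded:
        value = previous[:shared] + suffix
        values.append(value)
        previous = value
    return values


dictionary = sorted(["druid", "druidism", "drum", "drummer"])
encoded = front_encode(dictionary)
print(encoded)  # [(0, 'druid'), (5, 'ism'), (3, 'm'), (4, 'mer')]
```

Queries still operate on the integer dictionary ids, which matches the slide's point about querying numbers while keeping the original strings recoverable.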

  28. Coming Later this Year: Async Queries + Cold Storage

  29. Streams Everywhere

  30. More Streaming = More Real-Time Data

  31. Next Generation Infrastructure
    2023-24: Memory Pooling
    Effective 50%+ reduction in costs of database infrastructure
    2025-26: Data Processing Accelerators
    Co-processors that improve database performance per CPU by 3x - 20x

  32. Cloud Maturity
    Today: ~45% of global IT infrastructure on the Cloud
    End of 2027: ~85% of global IT infrastructure on the Cloud
    Streaming Data and Real-Time Analytics Applications are the default on the Cloud
    Managed services become even more dominant as a delivery model

  33. RTA and Machine Learning
    AI/ML becomes just another piece of the analytic toolset, like
    regressions and sketches.
    By 2025, all data-center class CPUs include packaged GPUs for machine
    learning inference acceleration.
    Active GANs optimize segmentation and query optimization.
    LLMs automate data gathering and (ironically) data quality

  34. Real-Time Analytics Applications in 2030
    EB scale and beyond
    Still subsecond response
    Ubiquitous streaming
    Most queries still SQL

  35. Real-Time Analytics Applications in 2030
    Open Source wins
    Druid is one of the projects that lead Real-Time Analytics databases (no single winner)
    Streaming and Analytics delivered as managed services
    Mix of edge and centralized computing
    + ???

  36. Questions?
    … maybe Answers

  37. Darin Briskman
    Director of Technology
    [email protected]
    Join the Druid Community!
    https://druid.apache.org
    Druid Architecture and Concepts
    https://imply.io/druid-architecture-concepts/
    Building Real-Time Analytics Applications
    https://bit.ly/40AIlB6
    Try Polaris, the Druid DBaaS
    https://imply.io/polaris
