A Tale of Three Real-Time OLAP Databases (Neha Pawar & Chinmay Soman, StarTree)

A Tale of Three Real-Time OLAP Databases (Neha Pawar & Chinmay Soman, StarTree) | RTA Summit 2023

by StarTree

Slide 1

Slide 1 text

A Tale of Three Real-Time OLAP Databases Neha Pawar Chinmay Soman

Slide 2

Slide 2 text

● Overview of real-time OLAP ● High level architecture overview ● Layers of a Real-time OLAP Database ● Summary and deep-dive into each layer ● Concluding remarks and takeaways Talk Structure

Slide 3

Slide 3 text

Blog: A Tale of Three … stree.ai/compare

Slide 4

Slide 4 text

Rise of Real-time OLAP Latency: seconds to minutes Freshness: Hours Concurrency: 100s of users Latency: Milliseconds Freshness: seconds Concurrency: Millions of users

Slide 5

Slide 5 text

The three real-time OLAP databases Open Source Database Apache Pinot Apache Druid ClickHouse

Slide 6

Slide 6 text

The three real-time OLAP databases Open Source Database Apache Pinot Apache Druid ClickHouse Version 0.12.0 Jan 19 2023 25.0.0 Jan 4 2023 23.1 Jan 26 2023

Slide 7

Slide 7 text

The three real-time OLAP databases Open Source Database Apache Pinot Apache Druid ClickHouse Version 0.12.0 Jan 19 2023 25.0.0 Jan 4 2023 23.1 Jan 26 2023 Commercial DBaaS Product Offering StarTree Cloud Imply ClickHouse Cloud

Slide 8

Slide 8 text

High-level Architecture Overview

Slide 9

Slide 9 text

Apache Pinot Apache Druid ClickHouse

Slide 10

Slide 10 text

Why this matters Less components Easy getting started experience Separation of concerns Better scalability Inbuilt functionality for management and operability More operator friendly Better no-code experience Architecture Overview Summary

Slide 11

Slide 11 text

Layer Deep-dives

Slide 12

Slide 12 text

Layers of a Real-time OLAP Database

Slide 13

Slide 13 text

Data Sources

Slide 14

Slide 14 text

Data Sources Apache Kafka Amazon Kinesis Google PubSub Apache Pulsar Change Data Capture (CDC) S3 GCS ADLS HDFS Databricks DeltaLake Apache Iceberg Apache Hudi Snowﬂake BigQuery MySQL Postgres StarTree only StarTree only StarTree only StarTree only

Slide 15

Slide 15 text

Ingestion

Slide 16

Slide 16 text

Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored in batch sources Pre-Processing Filtering Transformation Rollups

Slide 17

Slide 17 text

Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored in batch sources Pre-Processing Filtering Transformation Rollups

Slide 18

Slide 18 text

Why this matters Pull Model Faster, efﬁcient way of ingesting from real-time sources Push Model Required to get around connectivity issues or direct application writes Experimental Scalability Ingest at high message rate Limited Out of order Handle late arriving messages Exactly once guarantees Data, Query consistency Complex Setup Real-Time Ingestion

Slide 19

Slide 19 text

Kafka Consumer Low-level ● Accurate ● Flexible ● Scalable and fault tolerant High Level / Consumer group ● Duplication ● Scalability challenges ● Not robust Ingestion Latency Very low: process tiny batches of data Moderate -> High: Large message batches recommended Scalability Easy to scale out Limited by Kafka partitions Data duplication (Ingestion) Cannot happen Can happen Replication complexity Easy: relying on Kafka offsets Difﬁcult: Need Zookeeper/ClickHouse Keeper in the fast path Cloud Readiness Available in StarTree Cloud Not available in ClickHouse Cloud Real-Time Ingestion: Native Kafka support

Slide 20

Slide 20 text

Why this matters Full Row upserts Data and Query accuracy Limited: async (sync is not recommended) Partial row upsert: Upsert certain columns based on primary key Data and Query accuracy De-duplication Data and Query accuracy Scalability Ability to handle high ingestion rate and lots of primary keys without sacriﬁcing performance Real-Time Upserts

Slide 21

Slide 21 text

Real-Time Upserts: Apache Pinot & StarTree Design Highlights ● Low Query overhead: Reconciled at ingestion time ● Query Performance: Per segment bitmaps help in filtering out obsolete records in an efficient manner ● Scalability: Billions of primary keys per server and reduced heap usage Coming Soon ! ● TTL support ● Compaction ● Native Bootstrap / Backfill support

Slide 22

Slide 22 text

Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored in batch sources Pre-Processing Filtering Transformation Rollups

Slide 23

Slide 23 text

Why this matters Pull Model Native support for ingesting data from S3, GCS, ADLS Push Model Ability to push large batches of data from Spark/Flink/… Data Format support Ability to handle wide variety of data format Exactly once / Ease of backﬁll Ability to replace historical data Batch Ingestion

Slide 24

Slide 24 text

Why this matters Native bootstrap / One time ingestion ETL from SQL sources into real-time OLAP (one time) Native incremental ingestion Incremental ETL from SQL sources into real-time OLAP (hourly/daily/…) SQL Ingestion

Slide 25

Slide 25 text

Batch/SQL support in Apache Pinot and StarTree Design Highlights ● Scalable: Auto scaling for minions ● Robust: Ingestion tasks are checkpointed and retried upon failures ● Easy to use: No external dependecy

Slide 26

Slide 26 text

Ingestion Real-Time Pull or push Upserts Batch/SQL Ingest data stored in batch sources Pre-Processing Filtering Transformation Rollups

Slide 27

Slide 27 text

Pre-processing Why this matters Value/type transformation Filter, transform, ﬂatten (nested) values before being ingested to speed up query performance Real-time Rollups Rollup real-time data to reduce data granularity for space and query optimization Batch Rollups Rollup batch data to reduce data granularity for space and query optimization SQL based transforms Use SQL to pre-process data Denormalization Combine data from multiple sources before ingestion

Slide 28

Slide 28 text

Storage

Slide 29

Slide 29 text

Data Layout within Server Data Layout within Segment Partitioning Sorting Columnar Encoding and compression Data Layout within Cluster Storage tiers Storage Segment to server assignment

Slide 30

Slide 30 text

Data Layout in Segment & Servers Why this matters Custom Partitioning Pruning at segment / server level Custom Sorting Better data locality, smarter encoding Can only be primary time column Smart data assignment on serving layer Reduce query fanout High level objective: Better p99 latency, increased throughput

Slide 31

Slide 31 text

Why Data Layout helps Pinot Broker Query 1 2 5 6 9 10 Server 1 Server 2 Server 3 3 4 7 8 11 12 1 2 3 4 5 6 7 8 9 10 11 12 Broker level pruning 5 6 7 8 9 10 11 12 Server level pruning 1 2 3 4 5 6 7 8 9 10 11 12 Total segments to process Server 1 Server 2 Server 3

Slide 32

Slide 32 text

Data Layout on Cluster - Tiered Storage 32 Server 1 Server 2 SSD SSD Server 1 Server 2 HDD HDD Tiered Storage using compute node pools and local volumes Tiered, but still tightly coupled Server 1 Server 2 SSD SSD Cloud Object Storage Tiered Storage using Cloud Object Storage Fully decoupled

Slide 33

Slide 33 text

Tiered Storage for Apache Pinot in StarTree Cloud Brokers Brokers Server 1 Server 2 Disk/SSD Disk/SSD Fully tightly-coupled Server 3 Server 4 Cloud Object Storage Fully decoupled Hybrid 33

Slide 34

Slide 34 text

Tiered Storage using Cloud Object Storage Attributes of tiered storage Columnar / block fetch Prefetching Data altering Indexing Column level caching Index level caching

Slide 35

Slide 35 text

Indexing

Slide 36

Slide 36 text

Filter and aggregation optimizations Pinot Broker Query 1 2 5 6 9 10 Server 1 Server 2 Server 3 3 4 7 8 11 12 1 2 3 4 5 6 7 8 9 10 11 12 Broker level pruning 5 6 7 8 9 10 11 12 Server level pruning Filter optimizations Aggregation optimizations 1 2 3 4 5 6 7 8 9 10 11 12 Total segments to process Server 1 Server 2 Server 3

Slide 37

Slide 37 text

Indexing ● Minimize data scans ● Keep query performance optimal and predictable ○ High throughput ○ Growing data size

Slide 38

Slide 38 text

Index Purpose Inverted Fast filter Sorted Personalized analytics Range Range queries JSON Semi-structured data Geospatial Location queries Timestamp Timestamp rollups Text Log analytics Bloom Approximate filter Sparse Approximate filter

Slide 39

Slide 39 text

Power of Indexing

Slide 40

Slide 40 text

40 Star-tree Index name env_id type value ts cpu_core a357 ... 20 ... mem 57f ... 70 ... instance 57f ... 12 ... cpu_core 57f ... 6 ... cpu_core a357 ... 9 ... … … ... ... ... Star-tree index Dimension split order = name, env_id Aggregations = sum(value) Max leaf records = 10 * cpu_core mem instance * name a357 57f * 57f a357 * < 10 records env_id 100 386 486 250 300 550 sum(value)

Slide 41

Slide 41 text

41 Star-tree Index select sum(value) from table where name = cpu_core and env_id = a357 Star-tree index Dimension split order = name, env_id Aggregations = sum(value) Max leaf records = 10 cpu_core mem instance * * a357 57f * 57f a357 * 386 486 250 300 550 100

Slide 42

Slide 42 text

Materialized View vs Star-tree Index Star-tree index Materialized View Storage overhead Moderate: ● Allows tuning space overhead High: ● Materialize all combinations of selected dimensions Creation overhead Low: ● Natively built like an index High: ● Could need external data preprocessing workflow Backfill / sync overhead Low: ● Can be created anytime ● Easy to keep in sync High: ● Could need external data preprocessing workflow ● Challenges with keeping data in sync Query overhead Low: ● Engine picks right index under the hood Moderate: ● Queries need to target a specific view

Slide 43

Slide 43 text

Query

Slide 44

Slide 44 text

Query All aspects related to querying a database Query Caching Optimization to speedup repeated expensive queries Complex Query Constructs Constructs needed to run complex / multi-stage queries Basic Query Constructs Constructs needed to run basic OLAP queries

Slide 45

Slide 45 text

Query: Basic Query Constructs Constructs needed to run basic OLAP queries Why this matters Filter Aggregation Group by Order by Bare minimum constructs needed for real-time OLAP Sketches Fast approximate counting on large distinct values

Slide 46

Slide 46 text

Query: Complex Query Constructs Constructs needed to run complex / multi-stage queries Why this matters Pluggable UDFs Plug in custom UDFs Multi-stage query engine Execute complex queries without bottlenecks on single stage scatter gather Only for ingestion Window functions Aggregations over a window Joins Combine tables from multiple data sources during query

Slide 47

Slide 47 text

Joins

Slide 48

Slide 48 text

Query: Query Caching Optimization to speedup repeated expensive queries Why this matters Broker cache Per query result cache Server cache Per segment result cache Eviction Evict stale entries as soon as data changes

Slide 49

Slide 49 text

Query Performance How Zomato uses Apache Pinot Real-time Security Insights: Apache Pinot at Conﬂuera Webex: Real-time Observability and Analytics with Apache Pinot Pinot in YouGov Chart comparing latency in ms

Slide 50

Slide 50 text

Cluster Management

Slide 51

Slide 51 text

Cluster Management Cluster operations Availability of easy mechanisms to perform common cluster operations Cluster setup Cluster setup options for isolation, high availability All aspects related to cluster operability

Slide 52

Slide 52 text

Cluster Management: Cluster Setup Cluster setup options for isolation, high availability Why this matters Multi tenancy Host tables of multiple usecases on same cluster, with physical / logical isolations High availability Keep service available through partial component failures Separation of concerns using separate components, deep store

Slide 53

Slide 53 text

Cluster Management: Cluster Operations Availability of easy mechanisms to support common cluster operations Why this matters Scale up / Scale out Make capacity changes to cluster and re-distribute data Node maintenance Replacing nodes / taking nodes down for maintenance Disaster recovery Permanent backup and recovery mechanisms to recover from cluster / node failures

Slide 54

Slide 54 text

Security

Slide 55

Slide 55 text

Purpose Authentication OIDC based authentication Kerberos based authentication Authorization ABAC RBAC Encryption Data encryption Storage Storage Storage + inbuilt Network encryption (SSL / TLS) Security

Slide 56

Slide 56 text

Cloud Deployment Model

Slide 57

Slide 57 text

Cloud Deployment Model Bring Your Own Cloud Cluster provisioned in customer’s cloud account Saas Cluster provisioned in vendor’s cloud account Ease of deploying as a fully-managed cloud solution, via commercial DBaaS providers

Slide 58

Slide 58 text

Cloud Deployment Model Why this matters Saas Cluster provisioned in vendor’s cloud account - Customer ships data to this cluster Bring Your Own Cloud Cluster provisioned in customer’s cloud account - Customer retains full control of data Supported w/ limitations: Needs VPC Peering

Slide 59

Slide 59 text

StarTree’s Bring Your Own Cloud ● No VPC peering ● Principle of least privilege ● Data Security / Governance ● SLA ● Fully managed ● Cost effective