A Tale of Three Real-Time OLAP Databases (Neha Pawar & Chinmay Soman, StarTree) | RTA Summit 2023

A Tale of Three Real-Time OLAP Databases
Neha Pawar, Chinmay Soman

Talk Structure
• Overview of real-time OLAP
• High-level architecture overview
• Layers of a real-time OLAP database
• Summary and deep-dive into each layer
• Concluding remarks and takeaways

Blog: A Tale of Three … stree.ai/compare

Rise of Real-time OLAP
• From: Latency: seconds to minutes; Freshness: hours; Concurrency: 100s of users
• To: Latency: milliseconds; Freshness: seconds; Concurrency: millions of users

The three real-time OLAP databases
Open Source Database: Apache Pinot, Apache Druid, ClickHouse


Open Source Database: Apache Pinot | Apache Druid | ClickHouse
Version: 0.12.0 (Jan 19 2023) | 25.0.0 (Jan 4 2023) | 23.1 (Jan 26 2023)
Commercial DBaaS Product Offering: StarTree Cloud | Imply | ClickHouse Cloud

High-level Architecture Overview

[Architecture diagrams: Apache Pinot, Apache Druid, ClickHouse]

Architecture Overview Summary
Why this matters:
• Fewer components: Easy getting started experience
• Separation of concerns: Better scalability
• Inbuilt functionality for management and operability: More operator friendly; better no-code experience

Layer Deep-dives

Layers of a Real-time OLAP Database

Data Sources

Data Sources
• Apache Kafka, Amazon Kinesis, Google PubSub, Apache Pulsar, Change Data Capture (CDC)
• S3, GCS, ADLS, HDFS
• Databricks DeltaLake, Apache Iceberg, Apache Hudi
• Snowflake, BigQuery, MySQL, Postgres
• Comparison notes: "StarTree only" appears on four of the sources above

Ingestion

Ingestion
• Real-Time: Pull or push; Upserts
• Batch/SQL: Ingest data stored in batch sources
• Pre-Processing: Filtering; Transformation; Rollups

Real-Time Ingestion
Why this matters:
• Pull Model: Faster, efficient way of ingesting from real-time sources
• Push Model: Required to get around connectivity issues or direct application writes (comparison note: Experimental)
• Scalability: Ingest at high message rate (comparison note: Limited)
• Out of order: Handle late arriving messages
• Exactly once guarantees: Data, Query consistency (comparison note: Complex Setup)

Real-Time Ingestion: Native Kafka support
• Kafka Consumer: Low-level (accurate; flexible; scalable and fault tolerant) vs. High Level / Consumer group (duplication; scalability challenges; not robust)
• Ingestion Latency: Very low (process tiny batches of data) vs. Moderate -> High (large message batches recommended)
• Scalability: Easy to scale out vs. Limited by Kafka partitions
• Data duplication (Ingestion): Cannot happen vs. Can happen
• Replication complexity: Easy (relying on Kafka offsets) vs. Difficult (need ZooKeeper/ClickHouse Keeper in the fast path)
• Cloud Readiness: Available in StarTree Cloud vs. Not available in ClickHouse Cloud
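The "cannot happen" duplication claim for the low-level consumer rests on tracking offsets per partition. A minimal sketch of that idea follows, assuming a toy in-memory log; `PartitionConsumer` and its methods are hypothetical names for illustration, not the Kafka client API or any database's ingestion code.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionConsumer:
    """Toy low-level consumer that owns one partition and tracks its own offset."""
    committed_rows: list = field(default_factory=list)
    committed_offset: int = 0              # next offset to read after the last commit
    buffer: list = field(default_factory=list)

    def consume(self, log, upto):
        # Continue from the committed offset plus whatever is already buffered.
        start = self.committed_offset + len(self.buffer)
        self.buffer.extend(log[start:upto])

    def commit(self):
        # Rows and offset advance together, so a commit is all-or-nothing.
        self.committed_rows.extend(self.buffer)
        self.committed_offset += len(self.buffer)
        self.buffer.clear()

    def crash(self):
        # Uncommitted rows are lost; the committed offset is unchanged.
        self.buffer.clear()


log = ["e0", "e1", "e2", "e3", "e4", "e5"]
consumer = PartitionConsumer()
consumer.consume(log, 3)
consumer.commit()
consumer.consume(log, 5)
consumer.crash()               # e3 and e4 were read but never committed
consumer.consume(log, 6)       # resumes from offset 3, so nothing is skipped or repeated
consumer.commit()
```

Because the restart point is the committed offset, replayed messages replace the lost buffer rather than adding to rows that were already committed.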

Real-Time Upserts
Why this matters:
• Full Row upserts: Data and Query accuracy (comparison note: Limited: async; sync is not recommended)
• Partial row upsert (upsert certain columns based on primary key): Data and Query accuracy
• De-duplication: Data and Query accuracy
• Scalability: Ability to handle high ingestion rate and lots of primary keys without sacrificing performance

Real-Time Upserts: Apache Pinot & StarTree
Design Highlights
• Low query overhead: Reconciled at ingestion time
• Query performance: Per-segment bitmaps help filter out obsolete records efficiently
• Scalability: Billions of primary keys per server and reduced heap usage
Coming Soon!
• TTL support
• Compaction
• Native Bootstrap / Backfill support
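The "reconciled at ingestion time" and "per-segment bitmaps" points can be illustrated with a small sketch. It assumes a simplified model in which the newest arrival for a primary key always wins; `UpsertTable` and its fields are hypothetical, and a Python set stands in for a real bitmap.

```python
class UpsertTable:
    """Toy upsert table: one valid-doc set per segment, updated while ingesting."""

    def __init__(self):
        self.segments = []    # each segment is a list of rows
        self.valid = []       # per segment: doc ids that are still current
        self.latest = {}      # primary key -> (segment index, doc id)

    def new_segment(self):
        self.segments.append([])
        self.valid.append(set())

    def ingest(self, pk, row):
        seg = len(self.segments) - 1
        doc = len(self.segments[seg])
        self.segments[seg].append(row)
        old = self.latest.get(pk)
        if old is not None:
            # Reconcile now: mark the previous version obsolete in its own segment.
            self.valid[old[0]].discard(old[1])
        self.valid[seg].add(doc)
        self.latest[pk] = (seg, doc)

    def scan(self):
        # Queries only visit doc ids that are still marked valid.
        return [self.segments[s][d]
                for s in range(len(self.segments))
                for d in sorted(self.valid[s])]


table = UpsertTable()
table.new_segment()
table.ingest("order-1", {"id": "order-1", "status": "placed"})
table.new_segment()
table.ingest("order-1", {"id": "order-1", "status": "delivered"})
table.ingest("order-2", {"id": "order-2", "status": "placed"})
rows = table.scan()
```

The query path does no merging of versions; it only skips doc ids that ingestion already cleared, which is the source of the low query overhead.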

Batch Ingestion
Why this matters:
• Pull Model: Native support for ingesting data from S3, GCS, ADLS
• Push Model: Ability to push large batches of data from Spark/Flink/…
• Data Format support: Ability to handle a wide variety of data formats
• Exactly once / Ease of backfill: Ability to replace historical data

SQL Ingestion
Why this matters:
• Native bootstrap / One time ingestion: ETL from SQL sources into real-time OLAP (one time)
• Native incremental ingestion: Incremental ETL from SQL sources into real-time OLAP (hourly/daily/…)

Batch/SQL support in Apache Pinot and StarTree
Design Highlights
• Scalable: Auto scaling for minions
• Robust: Ingestion tasks are checkpointed and retried upon failures
• Easy to use: No external dependency

Pre-processing
Why this matters:
• Value/type transformation: Filter, transform, flatten (nested) values before being ingested to speed up query performance
• Real-time Rollups: Roll up real-time data to reduce data granularity for space and query optimization
• Batch Rollups: Roll up batch data to reduce data granularity for space and query optimization
• SQL based transforms: Use SQL to pre-process data
• Denormalization: Combine data from multiple sources before ingestion
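A rollup as described above can be sketched as grouping events by dimension and time bucket before they are stored. The field names (`name`, `ts`, `value`) and the 60-second bucket are assumptions chosen for illustration, not a configuration from any of the three databases.

```python
from collections import defaultdict


def rollup(events, bucket_seconds=60):
    """Collapse raw events into one row per (name, time bucket)."""
    totals = defaultdict(lambda: [0, 0])   # (name, bucket) -> [sum, count]
    for e in events:
        bucket = e["ts"] - e["ts"] % bucket_seconds   # truncate to bucket start
        key = (e["name"], bucket)
        totals[key][0] += e["value"]
        totals[key][1] += 1
    return [
        {"name": n, "ts": b, "value_sum": s, "count": c}
        for (n, b), (s, c) in sorted(totals.items())
    ]


events = [
    {"name": "cpu_core", "ts": 0, "value": 20},
    {"name": "cpu_core", "ts": 30, "value": 9},
    {"name": "cpu_core", "ts": 61, "value": 6},
    {"name": "mem", "ts": 10, "value": 70},
]
rolled = rollup(events)
```

Four raw events become three stored rows here; the saving grows with the number of events that share a bucket, at the cost of losing per-event granularity.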

Storage

Storage
• Areas: Data Layout within Segment; Data Layout within Server; Data Layout within Cluster
• Topics: Partitioning; Sorting; Columnar; Encoding and compression; Storage tiers; Segment to server assignment

Data Layout in Segment & Servers
Why this matters:
• Custom Partitioning: Pruning at segment / server level
• Custom Sorting: Better data locality, smarter encoding (comparison note: Can only be primary time column)
• Smart data assignment on serving layer: Reduce query fanout
High level objective: Better p99 latency, increased throughput

Why Data Layout helps
[Figure: a query arrives at the Pinot Broker and fans out to Server 1, Server 2, and Server 3, which hold segments 1 to 12. Stages shown: total segments to process, broker level pruning, server level pruning.]
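The two pruning stages in the figure can be sketched with segment metadata alone. This is a simplified illustration under stated assumptions: the `user_id` partition column, the modulo partition function, the time ranges, and the segment-to-server assignment are all hypothetical.

```python
from dataclasses import dataclass

NUM_PARTITIONS = 4  # assumed partition count for a hypothetical user_id column


@dataclass(frozen=True)
class Segment:
    name: int
    server: str
    partition: int
    min_ts: int
    max_ts: int


def broker_prune(segments, user_id):
    # Broker level: keep only segments of the partition that can contain user_id.
    wanted = user_id % NUM_PARTITIONS
    return [s for s in segments if s.partition == wanted]


def server_prune(segments, ts_from, ts_to):
    # Server level: skip segments whose time range cannot overlap the query.
    return [s for s in segments if s.max_ts >= ts_from and s.min_ts <= ts_to]


# Twelve segments; all segments of one partition live on the same server.
segments = [
    Segment(
        name=i,
        server=f"Server {(i - 1) % NUM_PARTITIONS % 3 + 1}",
        partition=(i - 1) % NUM_PARTITIONS,
        min_ts=(i - 1) // NUM_PARTITIONS * 100,
        max_ts=(i - 1) // NUM_PARTITIONS * 100 + 99,
    )
    for i in range(1, 13)
]

after_broker = broker_prune(segments, user_id=6)
servers_hit = sorted({s.server for s in after_broker})
after_server = server_prune(after_broker, ts_from=100, ts_to=150)
```

In this toy layout the query touches one server and one segment instead of three servers and twelve segments, which is the fanout reduction the slide attributes to partitioning and smart assignment.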

Data Layout on Cluster - Tiered Storage
• Tiered Storage using compute node pools and local volumes (servers with SSD, servers with HDD): Tiered, but still tightly coupled
• Tiered Storage using Cloud Object Storage (servers with SSD, plus Cloud Object Storage): Fully decoupled

Tiered Storage for Apache Pinot in StarTree Cloud
[Figure: Brokers in front of Server 1 and Server 2 with Disk/SSD (fully tightly-coupled) and Server 3 and Server 4 backed by Cloud Object Storage (fully decoupled); the combined deployment is labeled Hybrid.]

Tiered Storage using Cloud Object Storage
Attributes of tiered storage:
• Columnar / block fetch
• Prefetching
• Data altering
• Indexing
• Column level caching
• Index level caching

Indexing

Filter and aggregation optimizations
[Figure: the same Pinot Broker and Server 1 to Server 3 layout with segments 1 to 12. Stages shown: total segments to process, broker level pruning, server level pruning, followed by filter optimizations and aggregation optimizations.]

Indexing
• Minimize data scans
• Keep query performance optimal and predictable
  ◦ High throughput
  ◦ Growing data size

Index: Purpose
• Inverted: Fast filter
• Sorted: Personalized analytics
• Range: Range queries
• JSON: Semi-structured data
• Geospatial: Location queries
• Timestamp: Timestamp rollups
• Text: Log analytics
• Bloom: Approximate filter
• Sparse: Approximate filter

Power of Indexing

Star-tree Index
Sample records (name, env_id, type, value, ts):
• cpu_core, a357, ..., 20, ...
• mem, 57f, ..., 70, ...
• instance, 57f, ..., 12, ...
• cpu_core, 57f, ..., 6, ...
• cpu_core, a357, ..., 9, ...
• …
Star-tree index:
• Dimension split order = name, env_id
• Aggregations = sum(value)
• Max leaf records = 10
[Figure: star-tree with a first level split on name (cpu_core, mem, instance, *), a second level split on env_id (a357, 57f, *), leaves holding < 10 records, and pre-aggregated sum(value) at the nodes (values shown: 100, 386, 486, 250, 300, 550).]

Star-tree Index
Query: select sum(value) from table where name = 'cpu_core' and env_id = 'a357'
Star-tree index:
• Dimension split order = name, env_id
• Aggregations = sum(value)
• Max leaf records = 10
[Figure: the star-tree with name nodes (cpu_core, mem, instance, *), env_id nodes (a357, 57f, *), and sum(value) node values 386, 486, 250, 300, 550, 100.]
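How a star-tree answers the query above without scanning records can be sketched in a few lines. This is a simplified illustration, not Pinot's implementation: `build` and `query_sum` are hypothetical helpers, nodes are plain dicts, and star nodes keep raw records instead of rolled-up ones. The five sample records come from the slide; `max_leaf=1` is chosen here only so that this tiny dataset actually splits.

```python
def build(records, dims, max_leaf):
    """Build a node with a pre-aggregated sum, splitting on dims in order."""
    node = {"sum": sum(r["value"] for r in records), "dim": None,
            "children": None, "records": None}
    if not dims or len(records) <= max_leaf:
        node["records"] = records          # leaf: small enough to scan
        return node
    dim, rest = dims[0], dims[1:]
    groups = {}
    for r in records:
        groups.setdefault(r[dim], []).append(r)
    node["dim"] = dim
    node["children"] = {v: build(rs, rest, max_leaf) for v, rs in groups.items()}
    node["children"]["*"] = build(records, rest, max_leaf)  # dim aggregated away
    return node


def query_sum(node, filters):
    """Answer sum(value) for equality filters given as {dimension: value}."""
    if not filters:
        return node["sum"]                 # pre-aggregated, no record scan
    if node["children"] is None:
        return sum(r["value"] for r in node["records"]
                   if all(r[k] == v for k, v in filters.items()))
    dim = node["dim"]
    if dim in filters:
        child = node["children"].get(filters[dim])
        rest = {k: v for k, v in filters.items() if k != dim}
    else:
        child, rest = node["children"]["*"], filters   # take the star branch
    return 0 if child is None else query_sum(child, rest)


records = [
    {"name": "cpu_core", "env_id": "a357", "value": 20},
    {"name": "mem", "env_id": "57f", "value": 70},
    {"name": "instance", "env_id": "57f", "value": 12},
    {"name": "cpu_core", "env_id": "57f", "value": 6},
    {"name": "cpu_core", "env_id": "a357", "value": 9},
]
tree = build(records, dims=["name", "env_id"], max_leaf=1)
both = query_sum(tree, {"name": "cpu_core", "env_id": "a357"})
env_only = query_sum(tree, {"env_id": "57f"})
```

The `env_only` query has no filter on `name`, so it follows the `*` branch at the first level; that branch is what lets one tree serve queries on any subset of the split dimensions.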

Materialized View vs Star-tree Index
• Storage overhead
  ◦ Star-tree index: Moderate. Allows tuning space overhead
  ◦ Materialized View: High. Materialize all combinations of selected dimensions
• Creation overhead
  ◦ Star-tree index: Low. Natively built like an index
  ◦ Materialized View: High. Could need external data preprocessing workflow
• Backfill / sync overhead
  ◦ Star-tree index: Low. Can be created anytime; easy to keep in sync
  ◦ Materialized View: High. Could need external data preprocessing workflow; challenges with keeping data in sync
• Query overhead
  ◦ Star-tree index: Low. Engine picks right index under the hood
  ◦ Materialized View: Moderate. Queries need to target a specific view

Query: All aspects related to querying a database
• Basic Query Constructs: Constructs needed to run basic OLAP queries
• Complex Query Constructs: Constructs needed to run complex / multi-stage queries
• Query Caching: Optimization to speed up repeated expensive queries

Query: Basic Query Constructs
Constructs needed to run basic OLAP queries
Why this matters:
• Filter, Aggregation, Group by, Order by: Bare minimum constructs needed for real-time OLAP
• Sketches: Fast approximate counting on large distinct values

Query: Complex Query Constructs
Constructs needed to run complex / multi-stage queries
Why this matters:
• Pluggable UDFs: Plug in custom UDFs
• Multi-stage query engine: Execute complex queries without bottlenecks on single stage scatter gather
• Window functions: Aggregations over a window
• Joins: Combine tables from multiple data sources during query
Comparison note on this slide: Only for ingestion

Query: Query Caching
Optimization to speed up repeated expensive queries
Why this matters:
• Broker cache: Per query result cache
• Server cache: Per segment result cache
• Eviction: Evict stale entries as soon as data changes
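The server-side cache and eviction rows can be sketched as a per-segment result cache that drops entries when a segment changes. `SegmentResultCache` and its method names are hypothetical, and the query is treated as an opaque string; this is not the caching API of any of the three databases.

```python
class SegmentResultCache:
    """Toy per-segment result cache with eviction on data change."""

    def __init__(self):
        self._results = {}    # (segment name, query text) -> partial result
        self.hits = 0

    def get_or_compute(self, segment_name, query, compute):
        key = (segment_name, query)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        result = compute()
        self._results[key] = result
        return result

    def on_segment_changed(self, segment_name):
        # Evict every cached result that was computed from the changed segment.
        for key in [k for k in self._results if k[0] == segment_name]:
            del self._results[key]


segments = {"seg_1": [1, 2, 3], "seg_2": [10, 20]}
cache = SegmentResultCache()


def total(query="sum(value)"):
    # Combine per-segment partial results, reusing cached ones where possible.
    return sum(
        cache.get_or_compute(name, query, lambda rows=rows: sum(rows))
        for name, rows in segments.items()
    )


first = total()                       # computes both segments
second = total()                      # served entirely from the cache
segments["seg_2"] = [10, 20, 30]      # new data lands in seg_2
cache.on_segment_changed("seg_2")
third = total()                       # seg_1 from cache, seg_2 recomputed
```

Caching at segment granularity means a data change invalidates only the partial results for the affected segment, so results for untouched segments remain reusable.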

Query Performance
• How Zomato uses Apache Pinot
• Real-time Security Insights: Apache Pinot at Confluera
• Webex: Real-time Observability and Analytics with Apache Pinot
• Pinot in YouGov
[Chart comparing latency in ms]

Cluster Management

Cluster Management: All aspects related to cluster operability
• Cluster setup: Cluster setup options for isolation, high availability
• Cluster operations: Availability of easy mechanisms to perform common cluster operations

Cluster Management: Cluster Setup
Cluster setup options for isolation, high availability
Why this matters:
• Multi tenancy: Host tables of multiple use cases on the same cluster, with physical / logical isolation
• High availability: Keep service available through partial component failures (comparison note: Separation of concerns using separate components, deep store)

Cluster Management: Cluster Operations
Availability of easy mechanisms to support common cluster operations
Why this matters:
• Scale up / Scale out: Make capacity changes to cluster and re-distribute data
• Node maintenance: Replacing nodes / taking nodes down for maintenance
• Disaster recovery: Permanent backup and recovery mechanisms to recover from cluster / node failures

Security

Security
Purpose:
• Authentication: OIDC based authentication; Kerberos based authentication
• Authorization: ABAC; RBAC
• Encryption: Data encryption (comparison cells: Storage; Storage; Storage + inbuilt); Network encryption (SSL / TLS)

Cloud Deployment Model

Cloud Deployment Model
Ease of deploying as a fully-managed cloud solution, via commercial DBaaS providers
• Bring Your Own Cloud: Cluster provisioned in customer’s cloud account
• SaaS: Cluster provisioned in vendor’s cloud account

Cloud Deployment Model
Why this matters:
• SaaS: Cluster provisioned in vendor’s cloud account. Customer ships data to this cluster
• Bring Your Own Cloud: Cluster provisioned in customer’s cloud account. Customer retains full control of data (comparison note: Supported w/ limitations: Needs VPC Peering)

StarTree’s Bring Your Own Cloud
• No VPC peering
• Principle of least privilege
• Data Security / Governance
• SLA
• Fully managed
• Cost effective

Outbound Connectors

Connector: JDBC, Presto, Trino, Tableau, Looker, PowerBI, Grafana, SuperSet

Summary of Comparison

Thanks!
