Accelerating Data Analytics with StarRocks

Confidential │ ©2024 CelerData, Inc.

Agenda
• What is StarRocks
• Deep Dive into StarRocks
• Use Cases

Explosive Growth Community
• Over 500 large enterprises worldwide use StarRocks in production
• Tens of thousands of engineers worldwide participate in the StarRocks community
• 9.7K stars on GitHub
• 400+ contributors from dozens of companies, e.g. Airbnb, Pinterest, Atlassian, Alibaba, Tencent
• 3.8K community users in the StarRocks Slack workspace

Native Vectorized SQL Engine
• The vectorized engine is implemented natively in C++
• A single instruction can work on multiple rows of data
• All operators are fully vectorized
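The batch-at-a-time execution model behind a vectorized engine can be illustrated with a small sketch. Python cannot issue SIMD instructions, so this only shows the columnar layout and the "one operator call per batch" shape that lets a C++ engine apply a single instruction to many rows. The column names, values, and helper functions are made up for illustration and are not StarRocks APIs.

```python
from array import array

# Columnar batch ("chunk"): each column is a contiguous typed array.
# Column names and values are hypothetical.
price = array("d", [10.0, 25.5, 7.25, 40.0])
quantity = array("q", [3, 1, 4, 2])


def filter_ge(column, threshold):
    """Evaluate `column >= threshold` for the whole batch and return a selection mask."""
    return [value >= threshold for value in column]


def multiply(left, right):
    """Multiply two columns element-wise for the whole batch."""
    return array("d", (l * r for l, r in zip(left, right)))


def masked_sum(column, mask):
    """Sum only the entries whose mask entry is True."""
    return sum(value for value, keep in zip(column, mask) if keep)


# Roughly: SELECT SUM(price * quantity) WHERE price >= 10
# Each operator is invoked once per batch rather than once per row.
mask = filter_ge(price, 10.0)
revenue = multiply(price, quantity)
total = masked_sum(revenue, mask)
print(total)
```

In a native engine, the tight loops inside each operator run over contiguous memory, which is what makes them candidates for SIMD.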

Massively Parallel Processing (MPP)
• Parallel execution of logical execution units fully utilizes the resources of all CPU cores and physical machines, and speeds up queries.
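A minimal sketch of the idea, assuming a two-stage aggregation: each worker aggregates its own partition in parallel, then a final step merges the partial results. Threads stand in for cores and machines here, and the rows, partitioning, and function names are hypothetical rather than anything taken from StarRocks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows (city, amount), pre-split into partitions the way a
# distributed table would be spread across nodes.
partitions = [
    [("tokyo", 5), ("paris", 3)],
    [("tokyo", 2), ("lima", 7)],
    [("paris", 4), ("lima", 1)],
]


def partial_aggregate(rows):
    """Stage 1: each worker aggregates only its own partition."""
    sums = Counter()
    for city, amount in rows:
        sums[city] += amount
    return sums


def merge(partials):
    """Stage 2: combine the per-partition partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Roughly: SELECT city, SUM(amount) GROUP BY city
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_aggregate, partitions))

result = merge(partials)
print(result)
```

Because stage 1 touches only local data, adding workers shortens it without changing the merge logic.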

CBO Query Optimization
• There can be thousands of possible ways to execute a query
• A cost-based optimizer (CBO) searches for the execution plan with the lowest estimated cost
• The StarRocks CBO is based on the Cascades framework
• Deeply customized to leverage the vectorized engine
• Highly optimized for the StarRocks scheduler
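The core idea of cost-based plan selection can be sketched as follows. This toy version exhaustively enumerates left-deep join orders and scores each with a made-up cost model (total rows produced by the joins). It is not how a Cascades-style optimizer searches, which uses rules and memoization, and the table sizes and selectivities are hypothetical statistics.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical row counts and join selectivities (made-up statistics).
rows = {"orders": 1_000_000, "customers": 50_000, "nations": 25}
selectivity = {
    frozenset(("orders", "customers")): Fraction(1, 50_000),
    frozenset(("customers", "nations")): Fraction(1, 25),
}


def plan_cost(order):
    """Toy cost: total rows produced by each join in a left-deep plan."""
    joined = {order[0]}
    size = Fraction(rows[order[0]])
    cost = Fraction(0)
    for table in order[1:]:
        # Apply the selectivity of every predicate linking `table` to the
        # tables joined so far; no predicate means a cross join (factor 1).
        factor = Fraction(1)
        for other in joined:
            factor *= selectivity.get(frozenset((table, other)), Fraction(1))
        size = size * rows[table] * factor
        cost += size
        joined.add(table)
    return cost


candidates = list(permutations(rows))
best = min(candidates, key=plan_cost)
print(len(candidates), best, plan_cost(best))
```

Even with three tables the estimated costs differ widely: plans that start with the small `customers`/`nations` join avoid the large intermediate result that a cross join of `orders` and `nations` would produce.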

Intelligent Materialized View
• Query auto-rewrite
• Pre-computation

StarRocks is a Real-Time Analytics Database
[Diagram labels: Message Streams; Query and Storage Engines; SQL over MySQL Protocol]

Optimization on data lake

Data Optimization
• Parquet Reader Optimization
◦ Operate on encoded data
◦ Leverage page index
• ORC Reader Optimization
◦ Optimization for small stripes
◦ Predicate pushdown
• Generic IO Optimization
◦ IO merge
◦ Late materialization

Metadata Optimization
• Distributed metadata plan
◦ Parse metadata in a distributed way
• Manifest cache
◦ Cache parsed manifest information
• Adaptive strategy
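The slide does not spell out what "IO merge" does. A common reading, assumed here, is that nearby small reads are coalesced into fewer, larger requests, which matters on object storage where each request carries a fixed latency. The sketch below shows that idea; the function name, the `max_gap` parameter, and the offsets are hypothetical.

```python
def merge_ranges(ranges, max_gap):
    """Coalesce (offset, length) reads whose gap is at most `max_gap` bytes."""
    merged = []
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous read: extend it instead of
            # issuing another request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical column-chunk reads within one file: (offset, length) in bytes.
reads = [(0, 100), (120, 80), (5000, 300), (210, 40)]
print(merge_ranges(reads, max_gap=64))
```

The trade-off is reading a few unneeded bytes in the gaps in exchange for fewer round trips, so the gap threshold would be tuned to the storage system.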

Advanced Caching System
• Native cache support without operational cost
• Covers both metadata and data
• Two-level cache: in-memory and on-disk
• SLRU strategy with automatic capacity control
• Automatic warm-up and manual `cache select`
[Diagram labels: Two-level cache; Fully utilize local disk and network bandwidth]
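For readers unfamiliar with SLRU (segmented LRU), here is a minimal textbook-style sketch. New keys enter a probationary segment and are promoted to a protected segment on a second access, so a one-off scan cannot push out data that is actually re-used. The slides do not describe StarRocks' own implementation, so the class name, segment sizes, and promotion rule below are assumptions.

```python
from collections import OrderedDict


class SLRUCache:
    """Segmented LRU with a probationary and a protected segment."""

    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()
        self.protected = OrderedDict()
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)  # second access: promote
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation.pop(key)
            self._promote(key, value)
        else:
            self._insert_probation(key, value)

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # Demote the least recently used protected entry instead of dropping it.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict the oldest probationary entry


cache = SLRUCache(probation_size=2, protected_size=1)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # second touch: "a" moves to the protected segment
cache.put("c", 3)
cache.put("d", 4)   # probation overflows: "b" is evicted, "a" is untouched
print(cache.get("a"), cache.get("b"))
```

A plain LRU of the same total size would also have kept "a" in this tiny trace; the difference shows up under long scans, where SLRU confines churn to the probationary segment.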

Iceberg Benchmark
• One copy of the Parquet files
• TPC-DS SF1000
• StarRocks 3.2.6 vs. Trino 445/449
• StarRocks is 3.29x faster than Trino

Use Case: Insights at Pinterest

Partner Insight is a customer-facing dashboard service for advertisers.

Background and Pain Points
• Pinterest serves over 500 million users, with advertisers relying on real-time insights to optimize their campaigns.
• The existing Druid-based system struggled with growing data scale and high costs, and lacked advanced SQL support.
• Complex, multi-dimensional queries and the need for real-time aggregations created performance challenges.

StarRocks Solution
• Real-time aggregation: aggregate tables
• Query performance: vectorized query engine, query caching, materialized views
• Data ingestion: uses ANSI SQL and eliminates JSON configs

Benefits
• Performance gains: 50% reduction in p90 latency with only 32% of the previous infrastructure, tripling cost-performance efficiency; ~1200 QPS.
• Data freshness: 10-second data freshness with streamlined ingestion.
• Cost comparison: 300 Druid nodes vs. 81 StarRocks nodes, over $2M saved.

Medium blog post by Pinterest Engineering
StarRocks x Pinterest meetup: Customer Facing Analytics that Pays

Use Case: Self-service analytics at Naver

Pain Points in the New Scenario
• Limited query flexibility and performance due to ClickHouse's weak JOIN capabilities.
• Scaling challenges with a fast-growing business.
• Inefficient upsert/delete for real-time workloads.
• Underutilized internal data resources for analytics.

StarRocks Solution
• Out-of-the-box multi-table queries: no need for denormalized tables.
• Federated analytics: seamless integration with Iceberg, Hive, MySQL, etc.
• Superior aggregated query performance: matches or exceeds ClickHouse, with materialized views (MV) for further acceleration.
• Cloud-native scalability: decoupled storage and compute for flexible scaling.

Result
• Total nodes: 155 (FE + BE + CN)
• Unlocks new use cases: enables new business scenarios and expands analytics.
• Unified query access: integrates internal data sources, expanding use cases.
• Real-time insights: empowers faster data-driven decisions across the ecosystem.

Blog post: How JOIN Changed How We Approach Data Infra At NAVER

Use Case: Trust Analytics at Airbnb

Background and Pain Points
To enhance security, Airbnb needs a real-time fraud detection system that identifies various attacks and takes action as soon as possible. This requires analyzing real-time transactional data and historical data together.

StarRocks Solution
• StarRocks native primary key tables for real-time transactional data.
• Ad hoc queries run directly against open data.
• StarRocks provides query latency on the order of seconds for complex joins.
• Alerting can be achieved by simply running a SQL query on a schedule.

Benefits
• Improved data freshness from 1 day to 1 minute
• Significant cost savings for the company

Blog post: Airbnb Builds a Fast Analytics Experience with StarRocks
StarRocks x Airbnb Meetup: Real-time analytics at Airbnb with StarRocks

Simo Wang

What is StarRocks

Trusted by Enterprises Worldwide

StarRocks Architecture

Colocate Join: Reduce Data Shuffle

StarRocks as a lakehouse engine
