Accelerating Data Analytics with StarRocks

Simo Wang

March 26, 2025

Transcript

  1. Confidential │ ©2024 CelerData, Inc. Explosive Growth Community

    • Over 500 large enterprises worldwide use StarRocks in production environments
    • Tens of thousands of engineers worldwide participate in the StarRocks community
    • 9.7K stars on GitHub
    • 400+ contributors from dozens of companies, e.g. Airbnb, Pinterest, Atlassian, Alibaba, Tencent
    • 3.8K community users in the StarRocks Slack workspace
  2. Native Vectorized SQL Engine

    • The vectorized engine is implemented natively in C++
    • A single instruction can work on multiple rows of data
    • All operators are fully vectorized
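
The contrast between row-at-a-time and batch-at-a-time execution can be sketched as follows. This is a conceptual illustration only: Python has no SIMD, the function names and the `price`/`qty` columns are hypothetical, and the code is not taken from StarRocks. It shows the shape of the idea, where an operator processes a whole column batch per call and passes a selection vector to the next step.

```python
from typing import Dict, List

Row = Dict[str, float]


def filter_sum_row_at_a_time(rows: List[Row], threshold: float) -> float:
    """Row-oriented execution: the whole expression is evaluated once per row."""
    total = 0.0
    for row in rows:
        if row["price"] > threshold:
            total += row["price"] * row["qty"]
    return total


def filter_sum_columnar(price: List[float], qty: List[float],
                        threshold: float, batch_size: int = 4096) -> float:
    """Column-oriented execution: each step runs over a batch of values.

    In a native engine the per-batch loops below would compile to SIMD
    instructions; here they only illustrate the batch structure.
    """
    total = 0.0
    for start in range(0, len(price), batch_size):
        p = price[start:start + batch_size]
        q = qty[start:start + batch_size]
        # Step 1: the filter produces a selection vector of matching positions.
        selected = [i for i, value in enumerate(p) if value > threshold]
        # Step 2: the projection and aggregation touch only selected positions.
        total += sum(p[i] * q[i] for i in selected)
    return total
```

Both functions compute the same result; the columnar version simply organizes the work so that each inner loop is a tight pass over contiguous values.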
  3. Massively Parallel Processing (MPP)

    Logical execution units run in parallel across all CPU cores and physical machines, which fully utilizes cluster resources and accelerates queries.
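
A minimal sketch of the MPP pattern, assuming a simple grouped sum: the input is split into fragments, each fragment is aggregated independently, and the partial results are merged. Threads stand in for cores and machines here, and `partial_aggregate` and `parallel_group_sum` are hypothetical names, not StarRocks APIs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def partial_aggregate(fragment: List[Tuple[str, int]]) -> Counter:
    """Aggregate one fragment on its own, with no shared state."""
    partial: Counter = Counter()
    for key, value in fragment:
        partial[key] += value
    return partial


def parallel_group_sum(rows: List[Tuple[str, int]], workers: int = 4) -> Dict[str, int]:
    """Split rows into fragments, aggregate them in parallel, then merge."""
    # Round-robin split; a real engine would partition by data location or hash.
    fragments = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, fragments))
    # Final merge step: add up the partial sums per key.
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)
    return dict(merged)
```

The two-phase shape (local partial aggregation, then a merge) is what lets the work spread across many execution units without coordination during the first phase.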
  4. CBO Query Optimization

    • There can be thousands of potential ways to execute a query
    • A cost-based optimizer (CBO) finds the lowest-cost execution plan
    • The StarRocks CBO is based on the Cascades framework
    • Deeply customized to leverage the vectorized engine
    • Highly optimized for the StarRocks scheduler
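
To make the idea of cost-based plan selection concrete, here is a toy sketch that enumerates every left-deep join order and keeps the cheapest one. It is a brute-force search with a made-up cost model (sum of intermediate result sizes under one uniform selectivity), not the Cascades framework and not the StarRocks optimizer; the table names and row counts in the usage note are hypothetical.

```python
from itertools import permutations
from typing import Dict, Tuple


def plan_cost(order: Tuple[str, ...], row_counts: Dict[str, int],
              selectivity: float) -> float:
    """Toy cost: total rows produced by the intermediate joins of a left-deep plan."""
    rows = float(row_counts[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Assumed model: join output = left rows * right rows * selectivity.
        rows = rows * row_counts[table] * selectivity
        cost += rows
    return cost


def cheapest_join_order(row_counts: Dict[str, int],
                        selectivity: float = 0.001) -> Tuple[str, ...]:
    """Enumerate all join orders and return one with the lowest estimated cost."""
    return min(permutations(row_counts),
               key=lambda order: plan_cost(order, row_counts, selectivity))
```

For example, `cheapest_join_order({"orders": 1_000_000, "customers": 10_000, "nations": 25})` joins the two small tables first and the large one last. The number of candidate orders grows factorially with the number of tables, which is why production optimizers rely on memoization and pruning rather than exhaustive enumeration.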
  5. StarRocks is a Real-Time Analytics Database

    [Architecture diagram. Labels: Message Streams; Query and Storage Engines; SQL over MySQL Protocol]
  6. Optimization on data lake

    Data Optimization
    • Parquet Reader Optimization
      ◦ Operations on encoded data
      ◦ Leverage page index
    • ORC Reader Optimization
      ◦ Optimization for small stripes
      ◦ Predicate pushdown
    • Generic IO Optimization
      ◦ IO merge
      ◦ Late materialization

    Metadata Optimization
    • Distributed metadata plan
      ◦ Parse metadata in a distributed way
    • Manifest cache
      ◦ Cache parsed manifest information
    • Adaptive strategy
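
The "IO merge" item can be illustrated with a small sketch that coalesces nearby byte ranges into fewer, larger reads, which matters on object storage where each request has a fixed latency cost. The function name, the `(offset, length)` representation, and the `max_gap` threshold are assumptions made for this example, not StarRocks internals.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def merge_io_ranges(ranges: List[Range], max_gap: int = 4096) -> List[Range]:
    """Coalesce ranges whose gap is at most max_gap bytes into single reads."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset - last_end <= max_gap:
                # Close enough: extend the previous read to cover this range.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged
```

The trade-off is that a merged read may fetch a few unneeded bytes in the gap in exchange for fewer round trips, so the threshold would normally be tuned to the storage system's latency and bandwidth.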
  7. Advanced Caching System

    • Native cache support without operational cost
    • Covers both metadata and data cache
    • Two-level cache: in-memory and on-disk
    • SLRU strategy; automatic capacity control
    • Automatic warm-up and manual `cache select`

    [Diagram. Labels: Two-level cache; Fully utilize local disk and network bandwidth]
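
The SLRU (segmented LRU) strategy named on this slide can be sketched in its generic textbook form: new entries land in a probationary segment, and only entries that are read again move to a protected segment, so a one-off scan cannot evict hot data. The class below is an illustrative in-memory version with hypothetical names and sizes; it does not reflect how StarRocks implements its cache.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class SLRUCache:
    """Segmented LRU with a probationary and a protected segment."""

    def __init__(self, probation_size: int, protected_size: int) -> None:
        self.probation_size = probation_size
        self.protected_size = protected_size
        self.probation: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.protected: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)  # second access: move to protected
            return value
        return None

    def put(self, key: Hashable, value: Any) -> None:
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        else:
            # New keys, and rewrites of probationary keys, stay on probation.
            self._insert_probation(key, value)

    def _insert_probation(self, key: Hashable, value: Any) -> None:
        self.probation[key] = value
        self.probation.move_to_end(key)
        while len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict least recently used

    def _promote(self, key: Hashable, value: Any) -> None:
        self.protected[key] = value
        while len(self.protected) > self.protected_size:
            # Overflow from protected is demoted, not dropped.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)
```

In this sketch, an entry that has been read twice survives any number of single-use insertions, because those insertions only churn the probationary segment.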
  8. Iceberg Benchmark

    • One copy of the Parquet files
    • TPC-DS SF1000
    • StarRocks 3.2.6 vs. Trino 445/449
    • StarRocks is 3.29x faster than Trino
  9. Use Case: Insights at Pinterest

    Partner Insights is a customer-facing dashboard service for advertisers.

    Background and Pain Points
    • Pinterest serves over 500 million users, with advertisers relying on real-time insights to optimize their campaigns.
    • The existing Druid-based system struggled with increasing data scale and high costs, and lacked advanced SQL support.
    • Complex, multi-dimensional queries and the need for real-time aggregations created performance challenges.

    StarRocks Solution
    • Real-time aggregation: aggregate tables
    • Query performance: vectorized query engine, query caching, materialized views
    • Data ingestion: uses ANSI SQL and eliminates JSON configs

    Benefits
    • Performance gains: 50% reduction in p90 latency with only 32% of the previous infrastructure, tripling cost-performance efficiency. ~1200 QPS.
    • Data freshness: achieved 10-second data freshness and streamlined ingestion.
    • Cost comparison: Druid 300 nodes vs. StarRocks 81 nodes, over $2M saved.

    Medium blog post by Pinterest Engineering
    StarRocks x Pinterest meetup: Customer Facing Analytics that Pays
  10. Use Case: Self-serving analytics at Naver

    Pain Points in the New Scenario
    • Limited query flexibility and performance due to ClickHouse's weak JOIN capabilities.
    • Scaling challenges with a fast-growing business.
    • Inefficient upsert/delete for real-time workloads.
    • Underutilized internal data resources for analytics.

    StarRocks Solution
    • Out-of-the-box multi-table queries: no need for denormalized tables.
    • Federated analytics: seamless integration with Iceberg, Hive, MySQL, etc.
    • Superior aggregated query performance: matches or exceeds ClickHouse, with materialized views (MV) for further acceleration.
    • Cloud-native scalability: decoupled storage and compute for flexible scaling.

    Result
    • Total nodes: 155 (FE + BE + CN)
    • Unlocks new use cases: enables new business scenarios and expands analytics.
    • Unified query access: integrates internal data sources, expanding use cases.
    • Real-time insights: empowers faster data-driven decisions across the ecosystem.

    Blog post: How JOIN Changed How We Approach Data Infra At NAVER
  11. Use Case: Trust Analytics at Airbnb

    Background and Pain Points
    To enhance security, Airbnb needs a real-time fraud detection system that identifies various attacks and takes action as soon as possible. This requires analyzing real-time transactional data and historical data together.

    StarRocks Solution
    • StarRocks native primary key tables for real-time transactional data.
    • Ad-hoc queries run directly against open data.
    • StarRocks provides second-level query latency for complex joins.
    • Alerting can be achieved by running a SQL query on a regular schedule.

    Benefits
    • Improved data freshness from 1 day to 1 minute
    • Significant cost savings for the company

    Blog post: Airbnb Builds a Fast Analytics Experience with StarRocks
    StarRocks x Airbnb Meetup: Real-time analytics at Airbnb with StarRocks