A Case Study in API Cost of Running Presto in the Cloud at Scale (Hope Wang & Chunxu Tang, Alluxio) | RTA Summit 2024

The migration of data-intensive analytics applications to cloud-native environments promises enhanced scalability and flexibility but introduces complex cost models that pose new challenges to traditional optimization strategies. While on-premises setups focused on speed, cloud deployments require a more nuanced approach, factoring in cloud storage operations costs, which can escalate rapidly in real-world scenarios.

In this presentation, Hope Wang and Chunxu Tang will analyze these challenges through a case study of a large Presto deployment. They will present their findings on the unexpected cost implications of standard I/O optimizations, such as table scans, filters, and broadcast joins, when implemented in cloud environments. They will also highlight the need for a paradigm shift in optimizing data-intensive applications for the cloud and advocate for new I/O strategies that balance performance and cost while remaining tailored to the unique demands of cloud ecosystems.

By attending this session, you will:

• Understand the complexities and cost challenges of optimizing data-intensive applications in cloud-native environments.

• Gain insights from an in-depth case study on the economic implications of traditional I/O optimizations in cloud deployments.

• Learn about the need for a paradigm shift towards developing I/O strategies that are both performance-efficient and cost-effective in the cloud.

StarTree

May 16, 2024

Transcript

  1. A Case Study in API Cost of Running Presto in the Cloud at Scale
     Hope Wang, Developer Advocate @ Alluxio
     Bin Fan, VP of Technology @ Alluxio
  2. The Evolution of the Modern Data Stack
     • 15 years ago: tightly-coupled MapReduce & HDFS, on-prem HDFS, sequential access, row-based files
     • Today: compute-storage separation, cloud data lakes, columnar files in lakes
     • More elastic, cheaper, easier to manage, more scalable, more efficient
  3. The Evolution of the Modern Data Stack
     • 10+ years ago: tightly-coupled MapReduce & HDFS, on-prem HDFS, sequential access, row-based files
     • Today: compute-storage separation, cloud data lakes, columnar files in lakes
     • ⚠ Loss of data locality  ⚠ Different cost model  ⚠ Less sequential data access
  4. In the On-Prem Days… Data Requests Are “Free”
     • Organizations pay upfront for on-premises infrastructure
     • There is no additional cost per query or per data retrieval
  5. Data Request Cost of Cloud Storage
     Data transfer/egress pricing (cross-region, hybrid/multi-cloud); ingress is free for all three providers (a rough cost example follows below):
     • Amazon S3: 100 GB - 10 TB $0.09/GB; 10 - 50 TB $0.085/GB; 50 - 150 TB $0.07/GB; > 150 TB $0.05/GB; > 500 TB contact sales
     • Google Cloud Storage: 0 - 1 TB $0.12/GB; 1 - 10 TB $0.11/GB; > 10 TB $0.08/GB; customized: contact sales
     • Azure Blob Storage: 100 GB - 10 TB $0.087/GB; 10 - 50 TB $0.083/GB; 50 - 150 TB $0.07/GB; 150 - 500 TB $0.05/GB; > 500 TB contact sales
     (The slide also lists cloud storage operations pricing.)
     Notes: pricing applies to the AWS US East (Ohio) region, GCP North America regions, and Azure North America regions, as of May 1, 2024.
     Sources: https://aws.amazon.com/s3/pricing/, https://cloud.google.com/storage/pricing, https://azure.microsoft.com/en-us/pricing/details/bandwidth/, https://azure.microsoft.com/en-us/pricing/details/storage/blobs/
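For a sense of scale, here is a quick illustrative calculation against the S3 egress tiers above, assuming the tiers apply marginally; the 20 TB/month cross-region transfer volume is a made-up figure, not from the slide:

```python
# Rough cross-region egress estimate using the S3 tiers listed above.
# Assumptions: tiers apply marginally, the first 100 GB is free, and the
# 20 TB/month transfer volume is purely illustrative (not from the slide).
free_gb = 100
tier1_rate, tier1_cap_gb = 0.09, 10_000    # 100 GB - 10 TB at $0.09/GB
tier2_rate = 0.085                         # 10 - 50 TB at $0.085/GB

monthly_egress_gb = 20_000                 # assume ~20 TB leaves the region per month
tier1_gb = min(monthly_egress_gb, tier1_cap_gb) - free_gb
tier2_gb = max(0, monthly_egress_gb - tier1_cap_gb)
cost = tier1_gb * tier1_rate + tier2_gb * tier2_rate
print(f"~${cost:,.0f}/month")              # about $1,741/month under these assumptions
```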
  6. Data Request Size of a Presto Worker
     • Large volume: a Presto cluster can process 1 ~ 10 PB per day
     • Small data requests dominate the I/O
       ◦ 50% of requests < 10 KB
       ◦ 90% of requests < 1 MB
     (Cumulative distribution function over ~5 days)
     Source: Rethinking the Cloudonomics of Efficient I/O for Data-Intensive Analytics Applications
  7. A Back-of-the-Envelope Calculation
     1 PB of data access per day (a medium-sized Presto cluster) ÷ 100 KB per request on average ≈ 10 billion requests per day
     10 billion requests/day × $0.0004 per 1,000 GET requests (Amazon S3 pricing) × 365 days ≈ $1.4 million per year
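The same estimate written out as a small Python sketch; the inputs are exactly the figures from the slide:

```python
# Back-of-the-envelope estimate of annual S3 GET-request cost for a Presto cluster,
# using the figures from the slide: 1 PB accessed per day, ~100 KB per request,
# and Amazon S3 GET pricing of $0.0004 per 1,000 requests.
DAILY_BYTES = 1e15              # 1 PB of data accessed per day
AVG_REQUEST_BYTES = 100e3       # ~100 KB per request on average
GET_PRICE_PER_1000 = 0.0004     # USD per 1,000 GET requests

requests_per_day = DAILY_BYTES / AVG_REQUEST_BYTES           # ~10 billion
daily_cost = requests_per_day / 1000 * GET_PRICE_PER_1000    # ~$4,000
annual_cost = daily_cost * 365                               # ~$1.46 million

print(f"requests/day: {requests_per_day:,.0f}")
print(f"annual GET cost: ${annual_cost:,.0f}")
```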
  8. 10% of your data is hot data: add a caching layer between compute & storage
     Source: Alluxio
  9. Cost Reduction by Adding Cache
     • Without cache: Presto frequently retrieves data from AWS S3 (us-east-1, us-west-1) = high GET/PUT operations costs & data transfer costs
     • With cache: fast access with hot data cached; data is retrieved from S3 only when necessary = lower S3 costs
  10. Performance: Reduce I/O Time
     • Without cache: every task alternates I/O and compute, paying I/O time on each remote read
     • With cache: I/O happens only the first time remote data is retrieved; subsequent reads hit the cache, so total job run time is reduced
  11. Alluxio: A Caching Framework to Fit Different Needs
     • Alluxio Edge Cache (Local Cache)
       ◦ Runs as a library in the application processes (Presto, Trino, HDFS DataNode)
       ◦ Leverages local NVMe disks or memory
       ◦ Use when the hot data fits on local disks
     • Alluxio Distributed Cache
       ◦ Standalone cache service shareable across applications
       ◦ Cache capacity scales horizontally
  12. Alluxio Edge Cache in Presto
     Inside the Presto server JVM, a Presto worker issues HDFS API calls to the Alluxio Caching File System: on a cache hit it reads from local cache storage (managed by the Alluxio Cache Manager); on a cache miss it falls through to the external file system and external storage (a rough sketch of this flow follows below).
     • Battle-tested at Uber, Meta, TikTok, etc.
     • Supports Iceberg, Hudi, Delta Lake, and Hive tables
     • Supports various file formats such as Parquet, ORC, and CSV
     • Fully optimized for local NVMe storage
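As a rough, hypothetical illustration of that hit/miss flow (this is not the Alluxio API; the class and method names are made up for the sketch):

```python
import os

class ReadThroughCache:
    """Hypothetical sketch of the cache-hit / cache-miss flow described above:
    serve hits from local cache storage, fall through to the external file
    system on a miss, then populate the local cache."""

    def __init__(self, cache_dir, external_read):
        self.cache_dir = cache_dir          # local NVMe/SSD directory
        self.external_read = external_read  # callable: path -> bytes (remote GET)
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, key):
        # A real cache manages pages and hashes keys; this is simplified.
        return os.path.join(self.cache_dir, key.replace("/", "_"))

    def read(self, key):
        local = self._local_path(key)
        if os.path.exists(local):           # cache hit: no cloud API call
            with open(local, "rb") as f:
                return f.read()
        data = self.external_read(key)      # cache miss: one remote request
        with open(local, "wb") as f:        # populate the cache for next time
            f.write(data)
        return data
```

Every avoided fall-through is one fewer billable request against the object store, which is where the cost savings in the earlier slides come from.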
  13. Alluxio Edge Cache Data Management
     Alluxio Edge Cache provides cache eviction and admission (a minimal sketch of these policies follows below):
     • Supports LRU and FIFO cache eviction policies
     • Supports customized cache admission policies
     • Supports TTL
     • Supports data quota
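For intuition only, here is a minimal LRU cache with a TTL and an entry quota; it is a sketch of the listed policies, not Alluxio's implementation:

```python
import time
from collections import OrderedDict

class LruTtlCache:
    """Minimal sketch: LRU eviction plus a TTL and a max-entry quota."""

    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._data = OrderedDict()           # key -> (value, insert_time)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None                      # miss
        value, ts = item
        if time.time() - ts > self.ttl:      # TTL expired: treat as a miss
            del self._data[key]
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, time.time())
        while len(self._data) > self.max_entries:  # enforce the quota (evict LRU)
            self._data.popitem(last=False)
```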
  14. Challenges of Running Presto @ Uber with Alluxio
     • Real-time partition updates
     • Cluster membership changes
     • Cache size restriction
     Source: Speed Up Presto at Uber with Alluxio Local Cache
  15. Challenge #1: Real-time Partition Updates
     • Many tables/partitions are constantly changing
       ◦ Hudi tables
       ◦ Queries constantly performing upserts
     • This causes outdated partitions in the cache
       ◦ A partition may have changed on HDFS while Alluxio still caches the outdated version
     Source: Speed Up Presto at Uber with Alluxio Local Cache
  16. Challenge #1: Real-time Partition Updates
     • Solution: add the Hive partition's latest modification time to the caching key (sketched below)
       ◦ hdfs://<path> → hdfs://<path><mod time>
     • The partition with the latest modification time gets cached under a new key
       ◦ Each update creates a separate cache copy
     • Trade-off: the outdated partition stays in the cache, wasting cache space until it is evicted
     Source: Speed Up Presto at Uber with Alluxio Local Cache
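A minimal sketch of this keying scheme; the function name and the example path and timestamp are made up for illustration:

```python
def cache_key(path: str, hive_modification_time: int) -> str:
    """Append the partition's latest Hive modification time to the cache key:
    hdfs://<path> -> hdfs://<path><mod time>. An updated partition then maps
    to a new key, so the stale cached copy is simply never looked up again."""
    return f"{path}{hive_modification_time}"

# Hypothetical example: once the partition is rewritten with a newer timestamp,
# the old entry keyed by the earlier mod time just stops being referenced.
key = cache_key("hdfs://warehouse/table_bar/dt=2024-05-01/", 1714586400)
```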
  17. Challenge #2: Cluster Membership Change
     • A file is pinned to a set of worker nodes for cache efficiency
       ◦ SOFT_AFFINITY in Presto
       ◦ By default, a modulo hash function maps partitions to nodes
     • Presto worker nodes may go up/down due to operational activities
       ◦ Node crash / maintenance
       ◦ Ad-hoc node restart
     • The hash function maps keys to the wrong nodes when membership changes, e.g.
       ◦ With 3 nodes, key#4 → 4 mod 3 = node#1
       ◦ With one node down, key#4 → 4 mod 2 = node#0
         ▪ Wrong node, cache miss
     Source: Speed Up Presto at Uber with Alluxio Local Cache
  18. Challenge #2: Cluster Membership Change
     Solution: consistent hashing (a sketch follows below)
     • All nodes are placed on a virtual ring
     • The relative ordering of nodes on the ring does not change when membership changes
     • Look up the key (file) on the ring
     • Replication for better robustness
     Source: Speed Up Presto at Uber with Alluxio Local Cache
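For intuition, a minimal consistent-hashing ring with virtual nodes; this is an illustrative sketch, not the hashing scheme Presto or Alluxio actually ships:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hashing ring with virtual nodes. When one node
    leaves, only the keys that mapped to it move; keys elsewhere on the
    ring keep their assignments (unlike modulo hashing)."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []                                  # sorted (hash, node) pairs
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node: str, vnodes: int = 100):
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node0", "node1", "node2"])
before = ring.get_node("hdfs://warehouse/table_bar/part-004")
ring.remove_node("node1")                                # one node goes down
after = ring.get_node("hdfs://warehouse/table_bar/part-004")
# Unless node1 owned this key, before == after: most cached data stays on its node.
```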
  19. Challenge #3: Cache Size Limitation
     • PBs read by Presto queries >> disk space available on worker nodes
       ◦ ~74 PB of total data accessed daily vs. ~120 TB of disk space per cluster
     • Cache inefficiency due to:
       ◦ High eviction rate
       ◦ High cache miss rate
     Source: Speed Up Presto at Uber with Alluxio Local Cache
  20. Challenge #3: Cache Size Limitation
     • Solution: selective caching (an example admission check is sketched below)
       ◦ Cache only a selected subset of the data
         ▪ tables to cache + partitions to cache
         ▪ based on traffic pattern analysis
     • Greatly increased the cache hit rate
       ◦ from ~65% to >90%
     Example configuration:
       { "databases": [{
           "name": "database_foo",
           "tables": [{
             "name": "table_bar",
             "maxCachedPartitions": 100 }] }] }
     Source: Speed Up Presto at Uber with Alluxio Local Cache
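A small sketch of how such a configuration could drive cache admission; the config shape follows the slide, but the should_cache helper and its loading logic are hypothetical:

```python
import json

# Config shape taken from the slide; the admission helper below is hypothetical.
SELECTIVE_CACHE_CONFIG = json.loads("""
{ "databases": [{ "name": "database_foo",
                  "tables": [{ "name": "table_bar",
                               "maxCachedPartitions": 100 }] }] }
""")

def should_cache(database: str, table: str, cached_partition_count: int,
                 config: dict = SELECTIVE_CACHE_CONFIG) -> bool:
    """Admit data only for configured tables, up to maxCachedPartitions."""
    for db in config.get("databases", []):
        if db["name"] != database:
            continue
        for tbl in db.get("tables", []):
            if tbl["name"] == table:
                return cached_partition_count < tbl.get("maxCachedPartitions", 0)
    return False  # not listed: bypass the cache and read straight from remote storage

print(should_cache("database_foo", "table_bar", cached_partition_count=42))   # True
print(should_cache("database_foo", "other_table", cached_partition_count=0))  # False
```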
  21. Cloud Cost Saving
     In the dashboard shown while the TPC-DS tests were running:
     • Left side: total # of cloud API calls / total # of API calls × 100% (i.e., the API call saving)
     • Right side: total volume read from the cloud / total volume read × 100% (i.e., the read volume saving)
  22. TPC-DS Benchmark of Presto + Alluxio Edge Cache
     Comparison of query execution times for TPC-DS Query 81 through Query 99, without and with the Presto local cache.
  23. Choose the Right Cache
     1. Metastore Cache: slow planning time; slow Hive Metastore; large tables with hundreds of partitions
     2. List File Cache: overloaded HDFS NameNode; overloaded object store like S3
     3. Fragment Result Cache: duplicated queries
     4. Alluxio Edge Cache: slow or unstable external storage
     5. Alluxio Distributed Cache: cross-region, multi-cloud, hybrid-cloud; data sharing with other compute engines
  24. THANK YOU!
     Hope Wang, Alluxio ([email protected])
     Bin Fan, Alluxio ([email protected])
     Scan the QR code for a Linktree with great learning resources, exciting meetups & a community of data & AI infra experts!