Slide 1

A Case Study in API Cost of Running Presto in the Cloud at Scale
Hope Wang, Developer Advocate @ Alluxio
Bin Fan, VP of Technology @ Alluxio

Slide 2

PART 1: MOTIVATION
Why are we discussing this and why does it matter to you?

Slide 3

The Evolution of the Modern Data Stack
15 yr ago: tightly-coupled MapReduce & HDFS; on-prem HDFS; sequential access; row-based files
Today: compute-storage separation; cloud data lake; columnar files in lakes
Today's stack is more elastic, cheaper, easier to manage, more scalable, and more efficient.

Slide 4

The Evolution of the Modern Data Stack
10+ yr ago: tightly-coupled MapReduce & HDFS; on-prem HDFS; sequential access; row-based files
Today: compute-storage separation; cloud data lake; columnar files in lakes
⚠ Loss of data locality  ⚠ Different cost model  ⚠ Less sequential data access

Slide 5

In the On-Prem Days… Data Requests Are “Free”
● Organizations pay for on-premises infrastructure as an upfront investment
● There is no additional cost per query or per data retrieval

Slide 6

Data Request Cost of Cloud Storage

Egress/data transfer pricing per month (ingress is free for all three providers):
● Amazon S3: 100 GB - 10 TB $0.09/GB; 10 - 50 TB $0.085/GB; 50 - 150 TB $0.07/GB; > 150 TB $0.05/GB; > 500 TB contact sales
● Google Cloud Storage: 0 - 1 TB $0.12/GB; 1 - 10 TB $0.11/GB; > 10 TB $0.08/GB; customized volumes, contact sales
● Azure Blob Storage: 100 GB - 10 TB $0.087/GB; 10 - 50 TB $0.083/GB; 50 - 150 TB $0.07/GB; 150 - 500 TB $0.05/GB; > 500 TB contact sales

Cloud storage costs include both operations pricing (per request) and data transfer/egress pricing (cross region, hybrid/multi cloud).

Notes:
● Pricing applies to the AWS US East (Ohio) region, GCP North America regions, and Azure North America regions, as of May 1, 2024.
● Sources: https://aws.amazon.com/s3/pricing/, https://cloud.google.com/storage/pricing, https://azure.microsoft.com/en-us/pricing/details/bandwidth/, https://azure.microsoft.com/en-us/pricing/details/storage/blobs/

Slide 7

Data Request Size of a Presto Worker
● Large volume: a Presto cluster can process 1 ~ 10 PB per day.
● Small data requests dominate the I/O:
○ 50% of requests < 10 KB
○ 90% of requests < 1 MB
[Chart: cumulative distribution function of request sizes over ~5 days]
Source: Rethinking the Cloudonomics of Efficient I/O for Data-Intensive Analytics Applications

Slide 8

A Back-of-the-Envelope Calculation
1 PB of data access per day (a medium-sized Presto cluster), at ~100 KB per request on average, is about 10 billion requests per day. At Amazon S3 GET request pricing of $0.0004 per 1,000 requests:
10 billion requests/day × $0.0004 / 1,000 requests × 365 days ≈ $1.4 million per year
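
To sanity-check the arithmetic, here is a minimal Java sketch of the same back-of-the-envelope estimate; the daily volume, average request size, and S3 GET price are the assumptions stated on this slide:

    // Back-of-the-envelope estimate of annual S3 GET-request cost for a
    // medium-sized Presto cluster, using the figures from the slide.
    public class RequestCostEstimate {
        public static void main(String[] args) {
            double bytesPerDay = 1e15;               // 1 PB of data access per day
            double avgRequestBytes = 100e3;          // ~100 KB per request on average
            double requestsPerDay = bytesPerDay / avgRequestBytes;  // ~10 billion requests/day
            double pricePerRequest = 0.0004 / 1000;  // S3 GET pricing: $0.0004 per 1,000 requests
            double costPerYear = requestsPerDay * pricePerRequest * 365;
            System.out.printf("~%.0f billion requests/day -> ~$%.2f million/year%n",
                    requestsPerDay / 1e9, costPerYear / 1e6);
        }
    }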

Slide 9

10% of your data is hot data
Add a caching layer between compute & storage
Source: Alluxio

Slide 10

Caching’s Values
● Boost performance
● Save costs
● Prevent network congestion
● Offload the under storage

Slide 11

Cost Reduction by Adding Cache
Without cache: Presto frequently retrieves data from AWS S3 (us-east-1, us-west-1) = high GET/PUT operation costs & data transfer costs.
With cache: hot data is cached for fast access, and Presto only retrieves data from S3 when necessary = lower S3 costs.

Slide 12

Performance: Reduce I/O Time
Without cache: every query pays I/O time before compute.
With cache: only the first retrieval of remote data pays the I/O cost; subsequent reads hit the cache, so total job run time is reduced.
[Diagram: I/O vs. compute timelines without and with cache]

Slide 13

Alluxio: A Caching Framework to Fit Different Needs
Alluxio Edge Cache (Local Cache):
● Runs as a library in the application process (Presto, Trino, HDFS DataNode)
● Leverages local NVMe disks or memory
● For when the size of the hot data fits local disks
Alluxio Distributed Cache:
● Standalone cache service shareable across applications
● Cache capacity scales horizontally

Slide 14

PART 2: A CASE STUDY USING CACHING
Challenges and solutions in production at scale

Slide 15

The Easy Part …

Slide 16

Multi-level Caching: L1 Edge Cache + L2 Distributed Cache
[Diagram: Alluxio Edge Cache (L1) → Alluxio Distributed Cache (L2) → HDFS/S3]

Slide 17

Alluxio Edge Cache in Presto
[Diagram: inside the Presto Server JVM, the Presto worker issues HDFS API calls to the Alluxio Caching File System; on a cache hit, the Alluxio Cache Manager serves data from local cache storage; on a cache miss, the external file system reads from external storage]
● Battle tested at Uber, Meta, TikTok, etc.
● Supports Iceberg, Hudi, Delta Lake, and Hive tables
● Supports varied file formats such as Parquet, ORC, and CSV
● Fully optimized for local NVMe storage
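
To make the on-hit/on-miss flow above concrete, here is a minimal sketch of a caching file system that wraps an external store; the class and method names (CachingFileSystem, readPage, ExternalStore) are illustrative assumptions, not the actual Alluxio API:

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative read path: serve page-sized reads from the local cache on a hit;
    // on a miss, read from external storage and populate the cache for later reads.
    public class CachingFileSystem {
        public interface ExternalStore {
            byte[] readPage(String path, long pageIndex) throws IOException;
        }

        private final Map<String, byte[]> localCache = new ConcurrentHashMap<>(); // stand-in for NVMe-backed pages
        private final ExternalStore external;                                     // e.g. an HDFS or S3 client

        public CachingFileSystem(ExternalStore external) {
            this.external = external;
        }

        public byte[] readPage(String path, long pageIndex) throws IOException {
            String key = path + "#" + pageIndex;
            byte[] page = localCache.get(key);
            if (page != null) {
                return page;                               // cache hit: no call to external storage
            }
            page = external.readPage(path, pageIndex);     // cache miss: one call to external storage
            localCache.put(key, page);                     // populate the cache for subsequent reads
            return page;
        }
    }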

Slide 18

Alluxio Edge Cache Data Management
Alluxio Edge Cache provides cache eviction and admission:
● Supports LRU and FIFO cache eviction policies
● Supports customized cache admission policies
● Supports TTL
● Supports data quotas
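
For intuition on the eviction side, a minimal LRU cache sketch using java.util.LinkedHashMap; this only illustrates the policy and is not Alluxio's implementation (a FIFO variant would pass accessOrder = false and evict in insertion order):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache: an access-ordered LinkedHashMap evicts the least recently
    // used entry once the configured capacity is exceeded.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public LruCache(int capacity) {
            super(16, 0.75f, true);   // accessOrder = true: get() moves an entry to the tail
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the eldest (least recently used) entry
        }
    }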

Slide 19

The Not-So-Easy Part …

Slide 20

Challenges of Running Presto @ Uber with Alluxio
● Realtime partition updates
● Cluster membership changes
● Cache size restriction
Source: Speed Up Presto at Uber with Alluxio Local Cache

Slide 21

Challenge #1: Realtime Partition Updates
● Many tables/partitions are constantly changing:
○ Hudi tables
○ Queries constantly performing upserts
● This causes outdated partitions in the cache:
○ A partition may have changed on HDFS while Alluxio still caches the outdated version
Source: Speed Up Presto at Uber with Alluxio Local Cache

Slide 22

Challenge #1: Realtime Partition Updates
● Solution: add the Hive partition's latest modification time to the caching key (see the sketch below)
○ hdfs://<path> → hdfs://<path> + <latest modification time>
● The new partition with the latest modification time gets cached
○ Each update creates a separate cache copy
● Tradeoff: the outdated partition is still present in the cache, wasting cache space until it is evicted
Source: Speed Up Presto at Uber with Alluxio Local Cache
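
A small sketch of the cache-key idea from this slide; the key format, helper name, and example path are hypothetical, and the point is only that folding the partition's latest modification time into the key makes an updated partition cache as a fresh entry:

    // Hypothetical cache-key construction: append the Hive partition's latest
    // modification time so that an updated partition is cached under a new key
    // instead of serving a stale copy. The old entry lingers until evicted.
    public final class CacheKeys {
        private CacheKeys() {}

        public static String cacheKey(String hdfsPath, long lastModifiedMillis) {
            return hdfsPath + "@" + lastModifiedMillis;
        }

        public static void main(String[] args) {
            String path = "hdfs://warehouse/database_foo/table_bar/dt=2024-05-01"; // illustrative path
            System.out.println(cacheKey(path, 1714550400000L)); // before an upsert
            System.out.println(cacheKey(path, 1714636800000L)); // after an upsert: a different key
        }
    }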

Slide 23

Challenge #2: Cluster Membership Change
● A file is pinned to a set of worker nodes for cache efficiency
○ SOFT_AFFINITY in Presto
○ By default, a modulo hash function maps partitions to nodes
● Presto worker nodes may go up/down due to operational activities
○ Node crashes/maintenance
○ Ad-hoc node restarts
● The modulo hash function maps keys to the wrong nodes when membership changes, e.g.
○ With 3 nodes, key#4 → 4 mod 3 = node#1
○ With one node down, key#4 → 4 mod 2 = node#0
■ Wrong node, cache miss
Source: Speed Up Presto at Uber with Alluxio Local Cache

Slide 24

Challenge #2: Cluster Membership Change
Solution: consistent hashing (see the sketch below)
● All nodes are placed on a virtual ring
● The relative ordering of nodes on the ring does not change
● Look up the key (file) on the ring
● Replication for better robustness
[Diagram: partitions mapped to Node0, Node1, Node2 on the hash ring]
Source: Speed Up Presto at Uber with Alluxio Local Cache
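
A minimal consistent-hashing sketch in the spirit of this slide, using a sorted ring (java.util.TreeMap) with virtual nodes; the class name, hash choice, and number of virtual nodes are illustrative assumptions rather than the actual Presto/Alluxio node-affinity code:

    import java.nio.charset.StandardCharsets;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.zip.CRC32;

    // Illustrative hash ring: nodes are placed on a ring of hash values, and a key
    // is owned by the first node clockwise from its hash. Adding or removing a node
    // only remaps the keys adjacent to it on the ring.
    public class HashRing {
        private final TreeMap<Long, String> ring = new TreeMap<>();
        private final int virtualNodes;

        public HashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

        private static long hash(String s) {
            CRC32 crc = new CRC32();                       // simple, stable hash for the sketch
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }

        public void addNode(String node) {
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }

        public void removeNode(String node) {
            for (int i = 0; i < virtualNodes; i++) {
                ring.remove(hash(node + "#" + i));
            }
        }

        public String lookup(String key) {                 // first node clockwise from the key's hash
            SortedMap<Long, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }

        public static void main(String[] args) {
            HashRing ring = new HashRing(100);
            ring.addNode("node0");
            ring.addNode("node1");
            ring.addNode("node2");
            String owner = ring.lookup("hdfs://warehouse/table_bar/part-0004");
            ring.removeNode("node1");                       // most keys keep their owner after a node leaves
            String ownerAfter = ring.lookup("hdfs://warehouse/table_bar/part-0004");
            System.out.println(owner + " -> " + ownerAfter);
        }
    }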

Slide 25

Challenge #3: Cache Size Limitation
● PBs read by Presto queries >> disk space available on worker nodes
○ 74 PB of total data accessed daily vs. ~120 TB of disk space per cluster
● Cache inefficiency due to:
○ High eviction rate
○ High cache miss rate
Source: Speed Up Presto at Uber with Alluxio Local Cache

Slide 26

Challenge #3: Cache Size Limitation
● Solution: selective caching (see the sketch below)
○ Cache a subset of selected data:
■ tables to cache + partitions to cache
■ based on traffic pattern analysis
● Greatly increased the cache hit rate
○ from ~65% to >90%
Example configuration:
{
  "databases": [{
    "name": "database_foo",
    "tables": [{
      "name": "table_bar",
      "maxCachedPartitions": 100
    }]
  }]
}
Source: Speed Up Presto at Uber with Alluxio Local Cache
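
A rough sketch of how a selective-cache rule like the configuration above could be consulted on the read path; the class, methods, and in-memory representation are assumptions for illustration, not the actual Presto/Alluxio implementation:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical admission check driven by a selective-cache configuration:
    // only listed tables are cached, and each table caches at most
    // maxCachedPartitions distinct partitions; everything else bypasses the cache.
    public class SelectiveCacheFilter {
        private final Map<String, Integer> maxCachedPartitionsByTable = new HashMap<>(); // "db.table" -> quota
        private final Map<String, Set<String>> admittedPartitions = new HashMap<>();

        public void allowTable(String database, String table, int maxCachedPartitions) {
            maxCachedPartitionsByTable.put(database + "." + table, maxCachedPartitions);
        }

        public boolean shouldCache(String database, String table, String partition) {
            String qualified = database + "." + table;
            Integer quota = maxCachedPartitionsByTable.get(qualified);
            if (quota == null) {
                return false;                                  // table not selected for caching
            }
            Set<String> admitted = admittedPartitions.computeIfAbsent(qualified, k -> new HashSet<>());
            if (admitted.contains(partition)) {
                return true;                                   // partition already admitted
            }
            if (admitted.size() >= quota) {
                return false;                                  // partition quota reached
            }
            admitted.add(partition);
            return true;
        }

        public static void main(String[] args) {
            SelectiveCacheFilter filter = new SelectiveCacheFilter();
            filter.allowTable("database_foo", "table_bar", 100); // mirrors the configuration above
            System.out.println(filter.shouldCache("database_foo", "table_bar", "dt=2024-05-01"));   // true
            System.out.println(filter.shouldCache("database_foo", "other_table", "dt=2024-05-01")); // false
        }
    }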

Slide 27

Test Results

Slide 28

Cloud Cost Saving
Dashboard shown while the TPC-DS tests were running:
● Left side: total # of cloud API calls / total # of API calls × 100% (i.e., the API-call saving)
● Right side: total volume read from the cloud / total volume read × 100% (i.e., the read-volume saving)

Slide 29

TPC-DS Benchmark of Presto + Alluxio Edge Cache
[Chart: comparison of query execution times for TPC-DS Query 81 through Query 99, without and with the Presto local cache]

Slide 30

Key Takeaways

Slide 31

Choose the Right Cache
1. Metastore Cache — when to use: slow planning time; slow Hive Metastore; large tables with hundreds of partitions
2. List File Cache — when to use: overloaded HDFS NameNode; overloaded object store like S3
3. Fragment Result Cache — when to use: duplicated queries
4. Alluxio Edge Cache — when to use: slow or unstable external storage
5. Alluxio Distributed Cache — when to use: cross-region, multi-cloud, hybrid-cloud; data sharing with other compute engines

Slide 32

THANK YOU!
Hope Wang, Alluxio ([email protected])
Bin Fan, Alluxio ([email protected])
Scan the QR code for a Linktree including great learning resources, exciting meetups & a community of data & AI infra experts!