An update on the Apache Gluten project (incubator) and its use of Velox

This talk will provide a technical overview of the project, with an emphasis on experience working with customers around the globe to get their Spark workloads up and running with Gluten and Velox. The talk will also cover Gluten’s recent acceptance as an Apache Incubator project and will close with some details on what’s next.

Binwei Yang
Founder and Technical Lead of the Gluten project at Intel

Ali LeClerc

April 05, 2024

Transcript

  1. Legal Disclaimers

    Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. No computer system can be absolutely secure. Intel, the Intel logo, Xeon, Intel vPro, Intel Xeon Phi, Look Inside., are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. © 2017 Intel Corporation.
  2. Agenda

    INTEL CONFIDENTIAL
    • Introduction
    • Current Status
    • Performance & Characteristic
    • Plan
    • Q&A
  3. Turbocharging Spark performance

    [Architecture diagram: Spark Scale Out Framework + Optimal Native Library]
    • Cluster: Driver Node and Worker Nodes; each Worker Node runs Executors, each with Tasks and a Block manager
    • Task thread: Gluten Plugin, JNI Bindings, Native Library
    • Per operator (Operator 1, Operator 2, Operator 3): "Op. is native?" Yes: run in the native library. No: fall back to the JVM SQL Engine (Operators, Expression JIT, Whole Stage Code Gen)
    • Native library backends: Velox, Apache Arrow Compute Engine, ClickHouse, 3rd Party Engine
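
The per-operator "Op. is native?" decision in the diagram can be sketched as below. This is a minimal illustration only, not Gluten's planner: the operator names and the `NATIVE_OPS` set are hypothetical, and the real plugin works on Spark physical plans rather than on lists of strings. The sketch also counts engine switches, since each switch between the native path and the JVM path needs a columnar/row conversion.

```python
from dataclasses import dataclass

# Hypothetical set of operators the native backend can run (illustration only).
NATIVE_OPS = {"scan", "filter", "project", "hash_aggregate"}


@dataclass(frozen=True)
class PlannedOp:
    name: str
    engine: str  # "native" (offloaded) or "jvm" (fallback)


def plan_operators(ops):
    """Tag each operator as native (offloaded) or jvm (fallback)."""
    return [PlannedOp(op, "native" if op in NATIVE_OPS else "jvm") for op in ops]


def count_transitions(planned):
    """Count engine switches between neighbouring operators."""
    return sum(1 for a, b in zip(planned, planned[1:]) if a.engine != b.engine)


plan = plan_operators(["scan", "filter", "python_udf", "hash_aggregate"])
for op in plan:
    print(f"{op.name:15s} -> {op.engine}")
print("engine transitions:", count_transitions(plan))
```

A single unsupported operator in the middle of a stage produces two transitions, which is one reason operator and function coverage matters for end-to-end speedup.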
  4. Gluten Journey

    [Timeline figure; year axis: 2019-2020, 2021, 2022, 2023, 2024, 2025. Events as labeled on the slide:]
    • 2019-2020: Gazelle development
    • Dec.: Gluten GitHub repo setup
    • Jun.: TPCH passed, 1.61x boost
    • Sep.: TPCDS passed
    • Dec.: TPCH 2.34x, TPCDS 1.35x
    • Apr.: Spark 3.2/3.3 UT passed; Decimal added; TPCH 2.81x, TPCDS 2.1x; 0.5 beta release
    • Jul.: 1.0 release; TPCH 3.18x, TPCDS 2.67x; 168 funcs, 20 ops
    • Oct.: Switch to upstream Velox; 3.4 pass H/DS
    • Dec.: DataLake support; Spill support; UDF framework
    • Mar.: Contributed to Apache; 3.4 UT passed; TPCH 3.23x, TPCDS 2.73x; 229 funcs, 29 ops
    • Oct.: Photon released
  5. Gluten Current Coverage

    | Mar. ’24 | All | Customer required | Gluten Supported/common | Notes |
    |---|---|---|---|---|
    | Operators | 90 | 37 | 29 (78%) | |
    | Functions* | 361 | 289 | 229 (79%) | |
    | Data type | 18 | 16 | 15 (93%) | Array/Map/Struct not fully tested |
    | Data source** | 10 | 7 | 5 (71%) | |
    | File-format*** | 6 | 4 | 2 (50%) | |
    | Spark Version | 3.2/3.3/3.4/3.5 | 3.2/3.3/3.4/3.5 | 3.2/3.3 | UT not passed on 3.5 |
    | Customer-specific | | | | Customized Spark support from customer |

    *: Spark Functions use Spark 3.2 as the baseline
    **: Data source includes localfs, hdfs, S3, GCS, Hive, abfs, deltalake, iceberg, hudi, alluxio
    ***: File Format includes parquet, orc, text(hive), csv, json
    Unsupported cases fall back to Vanilla Spark
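
The percentages in the "Gluten Supported/common" column look like the supported count divided by the customer-required count. The short check below reproduces them under that reading; truncating to a whole percent is an assumption made here because 15/16 is 93.75% and the slide shows 93%.

```python
# (customer_required, gluten_supported) pairs copied from the coverage slide.
coverage = {
    "Operators": (37, 29),
    "Functions": (289, 229),
    "Data type": (16, 15),
    "Data source": (7, 5),
    "File-format": (4, 2),
}


def coverage_pct(required, supported):
    """Supported share of customer-required items, truncated to a whole percent."""
    return 100 * supported // required


percentages = {name: coverage_pct(req, sup) for name, (req, sup) in coverage.items()}
for name, pct in percentages.items():
    print(f"{name:12s} {pct}%")
```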
  6. Key Features

    • Parquet Scan
      • All complex data types fall back
      • Timestamp falls back
      • 505 Spark UT fallback
    • C2R, R2C (columnar-to-row, row-to-columnar)
      • Offload to Velox
      • Fuzzer tests including all data types passed
  7. Key Features

    • Shuffle
      • Only module implemented in Gluten
      • Split Velox vectors into Arrow buffers, compress them, then flush to file
      • Merge small buffers of a RowVector into a single larger one for better compression
      • Push-based shuffle added
      • Remote shuffle implemented: Celeborn
      • Currently hash-based; sort-based shuffle is WIP by the community
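
The point about merging small buffers before compression can be illustrated with a toy hash-based shuffle writer. This is a sketch under stated assumptions, not Gluten's shuffle code: `serialize` is a hypothetical stand-in for the Arrow buffer layout, partitioning uses a plain modulo on an integer key, and `zlib` stands in for whatever codec is configured.

```python
import zlib


def serialize(rows):
    """Stand-in for turning a batch into a byte buffer (not the Arrow format)."""
    return "".join(f"{k},{v}\n" for k, v in rows).encode()


def partition_batches(batches, num_partitions):
    """Hash-partition each batch; one small buffer per (batch, partition)."""
    out = {p: [] for p in range(num_partitions)}
    for batch in batches:
        split = {p: [] for p in range(num_partitions)}
        for key, value in batch:
            split[key % num_partitions].append((key, value))
        for p, rows in split.items():
            if rows:
                out[p].append(serialize(rows))
    return out


def compressed_size_separate(buffers):
    """Compress every small buffer on its own and add up the sizes."""
    return sum(len(zlib.compress(b)) for b in buffers)


def compressed_size_merged(buffers):
    """Merge the small buffers first, then compress once."""
    return len(zlib.compress(b"".join(buffers)))


# 100 input batches of 16 rows each, split across 4 partitions.
batches = [[(i, "value") for i in range(s, s + 16)] for s in range(0, 1600, 16)]
buffers = partition_batches(batches, num_partitions=4)[0]
print("small buffers:", len(buffers))
print("separate:", compressed_size_separate(buffers))
print("merged:  ", compressed_size_merged(buffers))
```

With many partitions each input batch contributes only a few rows per partition, so compressing per batch pays the codec's fixed overhead repeatedly and gives it little context to find repetition.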
  8. Key Features

    • Memory Management
      • Reuses Vanilla Spark’s memory management; counted as off-heap memory
      • On-heap/off-heap memory management
      • Spill support
      • https://github.com/apache/incubator-gluten/issues/3030
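
The idea of counting native allocations against one shared off-heap budget, with spilling when the budget is exhausted, can be sketched as a toy accounting pool. The class and method names here (`OffHeapPool`, `Consumer`, `reserve`, `spill`) are hypothetical and are not Spark or Gluten APIs; the sketch only loosely follows the general pattern of asking other memory consumers to spill before failing a reservation.

```python
class OffHeapPool:
    """Toy accounting pool: all reservations count against one off-heap budget."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.consumers = []

    def register(self, consumer):
        self.consumers.append(consumer)

    def reserve(self, requester, nbytes):
        """Grant nbytes to requester, asking other consumers to spill if needed."""
        for consumer in self.consumers:
            if self.used + nbytes <= self.capacity:
                break
            if consumer is not requester:
                self.used -= consumer.spill()
        if self.used + nbytes > self.capacity:
            raise MemoryError(f"cannot reserve {nbytes} bytes")
        self.used += nbytes
        requester.held += nbytes


class Consumer:
    """An operator that holds memory and can release it by spilling."""

    def __init__(self, name, pool):
        self.name, self.pool, self.held, self.spilled = name, pool, 0, 0
        pool.register(self)

    def spill(self):
        """Pretend to write held data to disk and report the bytes released."""
        freed, self.held = self.held, 0
        self.spilled += freed
        return freed


pool = OffHeapPool(capacity=100)
agg, sort = Consumer("hash_aggregate", pool), Consumer("sort", pool)
pool.reserve(agg, 70)
pool.reserve(sort, 60)  # exceeds the budget, so the aggregate is asked to spill
print(agg.held, agg.spilled, sort.held, pool.used)
```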
  9. Key Features

    • DataLake support -> most work done by community
      • https://github.com/apache/incubator-gluten/issues/3378
      • Iceberg, Delta, Hudi (WIP)
      • COW (Copy on Write), MOR (Merge on Read)
      • Write not supported, falls back
      • Apache Paimon (WIP)
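
The difference between COW and MOR tables matters for the scan path, because a MOR read has to merge deltas with the base data. A minimal sketch of the two strategies follows, modelling a table as a dict keyed by primary key. This is a conceptual illustration only; Iceberg, Delta Lake and Hudi each use their own file layouts and metadata, and none of the function names below come from those projects.

```python
def cow_write(base, updates):
    """Copy on write: rewrite the whole base file with the updates applied."""
    new_base = dict(base)
    new_base.update(updates)
    return new_base, []  # new base file, no pending deltas


def mor_write(base, deltas, updates):
    """Merge on read: leave the base file alone and append a delta."""
    return base, deltas + [updates]


def read(base, deltas):
    """Readers merge the base with any deltas, later deltas winning."""
    merged = dict(base)
    for delta in deltas:
        merged.update(delta)
    return merged


base = {1: "a", 2: "b"}
updates = {2: "B", 3: "c"}

cow_base, cow_deltas = cow_write(base, updates)
mor_base, mor_deltas = mor_write(base, [], updates)

print(read(cow_base, cow_deltas))
print(read(mor_base, mor_deltas))
```

Both reads return the same rows; COW pays the merge cost at write time, while MOR defers it to every read.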
  10. MicroBenchmark

    • Framework to reproduce a task in gbenchmark
    • https://github.com/apache/incubator-gluten/blob/main/docs/developers/MicroBenchmarks.md
    • Simulate in a single thread or multiple threads
    • Debug, performance optimization
    • Input: Substrait plan in JSON format, configs, split for first stage, data for reducer stage
    • Output: print, discard, write to Parquet, shuffle writer
    • Spill not supported yet
  11. Fuzzer Test (WIP)

    • Debug process:
      • Issue seen on cluster, simplify the query, reproduce on single node, create micro benchmark, debug in micro
      • Hard to reproduce, easy to fix
    • Automatically generate Parquet and SQL, or copy from the other project; compare Gluten/Velox vs. Vanilla Spark
    • Velox
      • Spark Connect, pass SQL/plan to Spark server, get Arrow
      • Compare with Velox result
      • Single stage
    • Gluten
      • Spark vs. Gluten
      • Multiple stage
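
The "generate inputs, run both engines, compare results" loop described on this slide is a form of differential testing. The sketch below shows the pattern with two tiny in-process functions standing in for Vanilla Spark (the reference) and the accelerated path (the candidate). All function names are hypothetical, and a real fuzzer would generate Parquet files and SQL rather than Python tuples.

```python
import random


def reference_sum_where(rows, threshold):
    """Trusted engine: sum of column b for rows with a > threshold."""
    return sum(b for a, b in rows if a > threshold)


def candidate_sum_where(rows, threshold):
    """Engine under test; stands in for the accelerated path."""
    total = 0
    for a, b in rows:
        if a > threshold:
            total += b
    return total


def buggy_sum_where(rows, threshold):
    """Deliberately wrong (>= instead of >) to show a detected mismatch."""
    return sum(b for a, b in rows if a >= threshold)


def fuzz(candidate, iterations=200, seed=0):
    """Generate random inputs and queries; return the first mismatching case."""
    rng = random.Random(seed)
    for _ in range(iterations):
        size = rng.randint(0, 20)
        rows = [(rng.randint(0, 9), rng.randint(1, 100)) for _ in range(size)]
        threshold = rng.randint(0, 9)
        expected = reference_sum_where(rows, threshold)
        actual = candidate(rows, threshold)
        if expected != actual:
            return rows, threshold, expected, actual
    return None


print("correct candidate mismatch:", fuzz(candidate_sum_where))
print("buggy candidate caught:", fuzz(buggy_sum_where) is not None)
```

Returning the failing input rather than just a flag matches the debug process above: the smallest reproducing case is what gets turned into a micro benchmark.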
  12. Gluten Performance

    Gen-to-gen performance comparison at SF3T. Normalized performance, higher is better. SPR = XEON 8480+, EMR = XEON 8592+.

    | Workload | Engine | SPR (XEON 8480+) | EMR (XEON 8592+) |
    |---|---|---|---|
    | TPCH Like | Spark | 1.00 | 1.04 |
    | TPCH Like | Gluten | 3.12 | 3.49 |
    | TPCDS Like | Spark | 1.00 | 1.09 |
    | TPCDS Like | Gluten | 3.09 | 3.30 |

    Annotation on the TPCH Like chart: 1.12x
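
The ratios behind this slide can be recomputed from the bar values. The sketch assumes the bars read as Spark 1.00/1.04 and Gluten 3.12/3.49 for TPCH Like, and Spark 1.00/1.09 and Gluten 3.09/3.30 for TPCDS Like, on XEON 8480+ and XEON 8592+ respectively. Under that reading the 1.12x annotation matches the Gluten gen-to-gen gain on TPCH Like; that attribution is an inference from the arithmetic, not something the transcript states.

```python
# Normalized performance read from the charts (Spark on XEON 8480+ = 1.00).
results = {
    "TPCH Like": {
        "spark": {"8480+": 1.00, "8592+": 1.04},
        "gluten": {"8480+": 3.12, "8592+": 3.49},
    },
    "TPCDS Like": {
        "spark": {"8480+": 1.00, "8592+": 1.09},
        "gluten": {"8480+": 3.09, "8592+": 3.30},
    },
}


def gen_to_gen(bench, engine):
    """Speedup of the newer CPU (8592+) over the older one (8480+)."""
    r = results[bench][engine]
    return r["8592+"] / r["8480+"]


def gluten_vs_spark(bench, cpu):
    """Speedup of Gluten over vanilla Spark on the same CPU."""
    r = results[bench]
    return r["gluten"][cpu] / r["spark"][cpu]


for bench in results:
    print(
        f"{bench}: gen-to-gen spark {gen_to_gen(bench, 'spark'):.2f}x, "
        f"gluten {gen_to_gen(bench, 'gluten'):.2f}x, "
        f"gluten vs spark on 8592+ {gluten_vs_spark(bench, '8592+'):.2f}x"
    )
```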
  13. Batch Based Processing

    • Frequent memory access + simple data processing
    • IPC is sensitive to loaded latency

    [Diagram: operator pipeline with Scan, Filter (f), Project, Join, Aggregate (Σ) and Sort, passing columnar batches of columns c1 to c4, 4K rows per batch; other labels: "Hash Table" (twice), "Cached"]

    | Metric | Value |
    |---|---|
    | cpu% | 86% |
    | cpu freq | 2,569 |
    | ipc | 1.20 |
    | l3 stall | 13% |
    | mem stall | 82% |
    | issue stall | 73% |
    | mem_bw_rd | 154,956 |
    | mem_bw_wr | 83,681 |
    | remote ratio | 0.12 |
    | local_mem_lat(ns) | 193 |
    | remote_mem_lat(ns) | 283 |
    | l3 load miss/Kinst | 0.6 |
    | l3 miss/Kinst | 5.23 |
    | l2 miss/Kinst | 9.30 |
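
The batch-at-a-time columnar pipeline in the diagram can be sketched in a few lines. This is a simplified illustration using Python lists as columns, not Velox's vector classes; the column names follow the slide (c1 to c4), while `c5`, the predicate and the projection are made up for the example. Each operator does very little work per value and touches whole columns of a 4K-row batch, which is the "frequent memory access + simple data processing" pattern noted above.

```python
BATCH_ROWS = 4096  # "4K Row" batches, as on the slide


def scan(columns, batch_rows=BATCH_ROWS):
    """Yield column batches: a dict of equal-length column slices."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_rows):
        yield {name: col[start:start + batch_rows] for name, col in columns.items()}


def filter_batch(batch, column, predicate):
    """Keep rows where predicate(column value) holds, column by column."""
    keep = [i for i, v in enumerate(batch[column]) if predicate(v)]
    return {name: [col[i] for i in keep] for name, col in batch.items()}


def project_batch(batch, name, fn, *inputs):
    """Append a derived column computed from the input columns."""
    out = dict(batch)
    out[name] = [fn(*vals) for vals in zip(*(batch[c] for c in inputs))]
    return out


def run(columns):
    """Scan, Filter, Project, Aggregate: one batch at a time."""
    total = 0
    for batch in scan(columns):
        batch = filter_batch(batch, "c1", lambda v: v % 2 == 0)
        batch = project_batch(batch, "c5", lambda a, b: a * b, "c2", "c3")
        total += sum(batch["c5"])
    return total


n = 10_000
table = {"c1": list(range(n)), "c2": [2] * n, "c3": list(range(n)), "c4": [0] * n}
print("batches:", len(list(scan(table))))
print("sum(c2 * c3) where c1 is even:", run(table))
```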
  14. Cache Behavior

    [Chart: share of loads by where they hit]
    • L1 Hit: 96.29%
    • FB Hit: 1.54%
    • L2 Hit: 1.71%
    • L3 Hit: 0.12%
    • L3 Miss: 0.32%
  15. Loaded Latency

    [Chart: Throughput vs. latency curve. Memory Latency (ns), 0 to 400, against Memory Throughput (MB/s), 0 to 500,000. Series: 8592+, 8480+, Gluten TPCH Avg]
    • Memory access is more bursty over time
  16. Roadmap in ’24

    [Roadmap figure; milestones along an arrow labeled SCALE]
    • Beta-2, Nov 30 ’23: Beta 0.7 (GH 1.1), Medium Touch Engagements
      • Upstreaming to Velox for lower maintenance burden & agile release (newly added)
      • Dynamic Memory Allocation
      • More stability to avoid OOM, JVM crash
      • Kicked off Apache incubation of Gluten project
    • March ’24: GA 1.0 (GH 1.1.1), Medium Touch Engagements
      • 75% common functions implemented
      • One-Click Installation & Improved Documentation
      • 10 customer adoption & Apache incubation
      • Linux/Spark: Ubuntu 20/22, CentOS 8/9; Spark 3.2/3.3/3.4
      • Submit SPIP to Spark community
    • Sept. ’24: GA 1.5 (Apache 1.2), Low Touch High Scale
      • 85% common functions implemented
      • Product level Documentation
      • Multiple Spark version support (3.3, 3.4, 3.5)
      • UDF codegen via LLM & Spark.AI natural language to SQL POC
      • PySpark support
    • Dec. ’24: GA 1.6 (Apache 1.3), Low Touch High Scale
      • 90% common functions implemented, 10% more performance optimization
      • Included in AI Kit selector for AI Kit 2025.0. Reference kits for UDF codegen and Spark.AI natural language to SQL
      • 60% ready for Apache incubation graduation
      • Accelerator support
      • Spark.AI
  17. Beyond ’24

    • Performance optimization for customer workloads. Large room to improve
    • Spark.AI is gaining traction in the Spark community: Text2SQL, porting Java/Scala UDFs, LLM inference as UDF, auto tune, log analysis, cost model
    • Better accelerator support, including QAT, IAA, DSA, and CXL; should also extend to GPU integration
  18. Gluten 2024

    | Major Tasks | Target timeline |
    |---|---|
    | Migrate to Apache incubator project | Q1 |
    | Release: Mar (before contribution to Apache), Sep (Apache release), may add several minor releases | Q1, Q2, Q3, Q4 |
    | Improve quality by adding more tests (new CI/CD, pre-built docker image) | Q1, Q2 |
    | Easier installation (pip, conda, docker, use static building to support more OS) | Q1, Q2, Q3, Q4 |
    | More Spark version support (Spark 3.4/3.5/4.0) | Q1, Q2, Q3, Q4 |
    | Increase coverage for Spark functions/operators (90% on common functions, 100% on common operators) | Q1, Q2, Q3, Q4 |
    | PySpark support (PyArrow UDF, ML workloads) | Q1, Q2, Q3, Q4 |
    | Enhance spill support (sort, hash join, hash aggregation) | Q1, Q2, Q3, Q4 |
    | Support more file formats (CSV, JSON, Hive Text) | Q1, Q2, Q3, Q4 |
    | Datalake support (Iceberg, Delta Lake, Hudi) | Q1, Q2, Q3, Q4 |
    | Remote shuffle service support (Apache Uniffle) | Q1 |
    | Columnar shuffle improvement (sort-based shuffle, performance improvement) | Q1, Q2 |
    | More data source connectors (abfs, gcs, hdfs viewfs/federation, Alluxio) | Q1, Q2, Q3 |
    | POC for more accelerators (GPU, FPGA) | Q1, Q2, Q3, Q4 |
    | Gluten-indicator: tools to check potential benefits from Gluten vs. vanilla Spark && auto turning on Spark configurations | Q1, Q2, Q3, Q4 |
  19. © 2024 Intel Corporation

    Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Java is a registered trademark of Oracle and/or its affiliates. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.