Accelerating Spark at Microsoft using Gluten & Velox

Microsoft Fabric is a cornerstone big data solution that runs Spark workloads. To improve Spark performance, we have invested substantially in query optimization and execution to meet our customers’ needs. While exploring faster query execution engines, we evaluated existing solutions such as Weld. In this presentation, we explain why we adopted the Velox and Gluten stack as our native query execution engine for Spark. We cover how we integrated it within the Azure Fabric ecosystem, including ABFS support and integration with the read cache. This work has yielded performance gains of up to 2x on TPC-DS benchmarks. The gains are not limited to industry benchmarks; they are also evident in testing done with internal customers. Join us as we share insights, lessons learned, and the impact of using the Velox and Gluten stack within Microsoft Fabric.

Zhen Li
Software Engineer at Microsoft

Swinky Mann
Software Engineer at Microsoft

Ali LeClerc

April 05, 2024

Transcript

  1. Who We Are
    Zhen Li, Software Engineer, Spark Runtime Team, Fabric @Microsoft
    Swinky Mann, Software Engineer
  2. Outline
    • Introduction
    • MS Fabric and Internal Spark
    • Integration with Gluten and Velox
    • Optimizing Performance
    • Conclusion
    Microsoft Confidential
  3. Microsoft Fabric: The data platform for the era of AI
    OneLake, Data Factory, Synapse Data Warehousing, Synapse Real Time Analytics, Power BI, Synapse Data Engineering, Synapse Data Science, Data Activator
  4. Internal Spark (without Gluten & Velox)
    • Internal Spark – Apache Spark + our optimizations.
    • 1 TB TPCDS - All 99 queries
    • Spark 3.4, ABFS, parquet
    • Lower is better.
    • Internal Spark 2x faster than Apache Spark.
    [Chart: relative execution time, 100% vs. 50%]
  5. Integration of Velox-Gluten
    • ABFS Support
    • Added ABFS (Azure Blob Filesystem) storage adapter
    • OneLake integration, Auth, etc.
    • Support for Spark operators
    • Operators: Expand, BroadcastNestedLoopJoin, CartesianProduct, RollupHashAggregation.
    • 20+ Spark Functions: uuid, date_from_unix_date, from_utc_timestamp etc.
    • INT96/INT64 Timestamp in Velox parquet scan.
    • Spark scan with metadata columns in Gluten.
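
Slide 5 lists `date_from_unix_date` and INT96 timestamp support in the Parquet scan. As a rough illustration of what those two pieces have to compute, here is a minimal Python sketch. It is not the Velox or Gluten code. It assumes the commonly used INT96 layout (8 little-endian bytes of nanoseconds within the day, then 4 little-endian bytes of Julian day number), and the helper name `decode_int96_timestamp` is hypothetical.

```python
import struct
from datetime import date, datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_from_unix_date(days: int) -> date:
    """Mirror Spark SQL's date_from_unix_date: a date from days since 1970-01-01."""
    return date(1970, 1, 1) + timedelta(days=days)


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 timestamp into a UTC datetime (hypothetical helper).

    Assumed layout, little-endian: 8 bytes of nanoseconds within the day,
    followed by 4 bytes holding the Julian day number.
    """
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1000)
```

For example, `date_from_unix_date(1)` gives 1970-01-02, and an INT96 value packed from Julian day 2440589 with one hour of nanoseconds decodes to 1970-01-02 01:00 UTC.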
  6. Integration of Velox-Gluten
    • Reliability Improvement:
    • UT for Spark 3.3, Spark 3.4: 300+ UTs fixed from 40 suites.
    • Committed to making changes for Spark 3.5.
    • Delta Integration:
    • Support for Delta Update, Delete, Merge, Convert To Delta Commands
    • Reimplementation of unsupported UDFs to avoid fallbacks
    • Columnar implementation for Delta Optimized Write
    • Fallbacks for unsupported scenarios (Delta log checkpoint, Deletion Vectors)
    • Delta UTs coverage and testing.
  7. Optimizing Performance
    • Scan
    • Data Reading in Parallel:
    • Concurrent Data Reading Support in the ‘preadv’.
    • Split preloading in Gluten.
    • Fabric intelligent cache integration.
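
The idea behind split preloading on slide 7 can be sketched in a few lines of Python: while the consumer works on one split, the next ones are already being loaded in the background. This is only an illustration of the general technique, not Gluten's implementation. The function name `iter_with_preload` and the `depth` parameter are assumptions made for the example.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # marks the end of the split iterator


def iter_with_preload(splits, load, depth=2):
    """Yield load(split) for each split, in order, keeping up to `depth` loads in flight."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    remaining = iter(splits)
    pending = deque()
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Prime the pipeline with the first `depth` splits.
        for split in remaining:
            pending.append(pool.submit(load, split))
            if len(pending) == depth:
                break
        while pending:
            # Wait for the oldest load, then start the next one
            # before handing the result to the consumer.
            loaded = pending.popleft().result()
            upcoming = next(remaining, _DONE)
            if upcoming is not _DONE:
                pending.append(pool.submit(load, upcoming))
            yield loaded
```

A caller would pass an iterable of split descriptors (for example byte ranges) and a function that reads one split. Results come back in the original order, so downstream operators do not need to change.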
  8. Optimizing Performance
    • Hash join & Hash aggregation
    • Avoid re-computing normalized keys in HashTable::groupProbe (PR:6406 – query67).
    • Improve normalized key join probe (PR:6695 – query64).
    • Store duplicate row address in vector for join probe (PR:9079 – query72).
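
Two ideas named on slide 8 can be illustrated with a toy hash join in Python: packing several key columns into one integer (a "normalized key") that is computed once per row, and keeping all build-side rows that share a key in one list so the probe can walk duplicates directly. This is a simplified sketch and does not reproduce the referenced PRs or Velox's `HashTable`. The fixed bit widths and the names `normalize_key`, `build_table`, and `probe` are assumptions for the example.

```python
from typing import Dict, List, Sequence, Tuple


def normalize_key(values: Sequence[int], widths: Sequence[int]) -> int:
    """Pack small non-negative integers into one integer, widths[i] bits each."""
    key = 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        key = (key << width) | value
    return key


def build_table(rows: Sequence[Sequence[int]], widths: Sequence[int]) -> Dict[int, List[int]]:
    """Map each normalized key to the list of build-row indices that carry it."""
    table: Dict[int, List[int]] = {}
    for row_index, key_columns in enumerate(rows):
        table.setdefault(normalize_key(key_columns, widths), []).append(row_index)
    return table


def probe(table: Dict[int, List[int]], probe_rows: Sequence[Sequence[int]],
          widths: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (probe_index, build_index) pairs for every matching row."""
    matches: List[Tuple[int, int]] = []
    for probe_index, key_columns in enumerate(probe_rows):
        key = normalize_key(key_columns, widths)  # computed once per probe row
        for build_index in table.get(key, ()):    # duplicates sit in one list
            matches.append((probe_index, build_index))
    return matches
```

With build rows `[(1, 2), (3, 4), (1, 2)]` and 8-bit widths, probing `(1, 2)` returns both build rows 0 and 2 from a single lookup, since comparing one packed integer replaces comparing each key column.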
  9. Conclusion
    • Working towards making Gluten & Velox more reliable, robust and performant.
    • Actively contributing to Gluten & Velox community.