Prestissimo at IBM

In this talk we give an overview of all Prestissimo-related activity at IBM since the last VeloxCon. This includes: (i) feature enhancements for the Prestissimo tech preview on IBM watsonx.data, (ii) TPC-DS updates, (iii) Presto 2.0 plans, and (iv) the Connector SPI.

Aditi Pandit
Software Engineer at IBM

Ali LeClerc

April 05, 2024

  1. Talk Outline • Ahana -> IBM Data & AI • IBM Watsonx • Prestissimo will be a core engine for the Watsonx.data open data lakehouse • Updates • Prestissimo in 2023 • Prestissimo in 2024 • Presto 2.0?
  2. Prestissimo introduction • C++ worker • Built over the Velox library • SIMD • Runtime optimizations • Smart I/O prefetching/caching • Memory arbitration features
  3. Benefits of Prestissimo/Velox • Huge performance boost ◦ Query processing can be done with much smaller clusters. • Avoids performance cliffs ◦ No Java processes, JVM or garbage collection. ◦ Memory management and SIMD are explicitly controlled in C++. ◦ Memory arbitration improves efficiency. • Easier to build and operate at scale ◦ Reusable and extensible primitives across data engines (like Spark). ◦ Performance can be better understood.
  4. Prestissimo in 2023 (post IBM acquisition) • Goals: Inner-source Prestissimo in the IBM tech stack. Make it available to broader teams. • Watsonx.data dev team, DevOps, testing, docs, research (DB and storage), customer demos • Build a solid Presto OSS team: S/W dev, DevOps
  5. Features • CTAS • S3-compatible Reader/Writer • Parquet Reader/Writer • Expand SQL coverage • TPC-H and TPC-DS for 1K and 10K • Full Presto SQL SELECT statement syntax supported • Type/Function coverage • JWT and cert-based TLS
  6. Prestissimo in 2024 • Iceberg Reader • Prestissimo production readiness • Metrics collection with Prometheus • New Velox system implementation • AsyncDataCache • Spilling • Memory arbitration • TPC-DS benchmark runs
  7. Performance testing and tracking framework • Pbench tool -> Automates deployments, choosing workloads and running them. • Performance dashboards • Plan comparison tool
  8. TPC-DS 1K results (Presto OSS 0.287) • Native: 33.6 mins • Java: 1.15 hrs • Cluster: r5.4xlarge, 8 workers (vCPU: 128, memory: 1024 GB)
  9. TPC-DS 10K results (Presto OSS 0.287) • Native: 1.6 hrs • Java: 2.83 hrs • Cluster: r5.8xlarge, 16 workers (vCPU: 512, memory: 4096 GB)
  10. Top issues from TPC-DS runs • Control Velox memory usage: https://github.com/facebookincubator/velox/discussions/9008 • HashProbe performance: listJoinResults is very slow for joins on keys that have many matches: https://github.com/facebookincubator/velox/issues/9078 • Exchange performance
  11. Presto 2.0 Native Engine SQL feature complete Production readiness Performance improvements Lakehouse formats Iceberg (Reader/Writer), Hudi, Delta Connector SPI Feature lockdown, UDF, UDAs Connector Optimizer Expand Presto optimizer for enterprise use-cases with Prestissimo Open search space. Multi-query block merges, Comprehensive Join enumeration, Cost based logical tx. CTEs. Cardinality estimation. New theoretically sound architecture. Add histogram estimators. HBO. Plan lockdown. Prestissimo focused physical plans.
  12. Team acknowledgements • Ahana: Ethan Zhang (mgr), Aditi Pandit, Deepak Majeti, Ying Su, Tim Meehan, Karteek Murthy, Pramod Satya • IBM: Ashok Kumar (mgr), Christian Zentgraf, Michael Ohsaka, Minhan Cao, Sujit Madiraju, Prateek Dabre, Soumya Duriseti, Manoj Negi…
  13. Please join our community! • Presto Native Worker Working Group • Prestissimo: GitHub, Slack, LinkedIn, Meetup • Velox: GitHub, Slack, Website
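
As a rough check on the TPC-DS timings in slides 8 and 9, the sketch below converts the reported wall-clock times to minutes and computes the Java-to-native ratio, along with the per-worker resources implied by the cluster totals. It assumes that "W" on those slides stands for workers; the dictionary layout and the function names `speedup` and `per_worker` are illustrative only and are not part of any Presto or Pbench tooling.

```python
# Figures copied from slides 8 and 9 (Presto OSS 0.287).
# Assumption: "8 W" / "16 W" on the slides means 8 / 16 worker nodes.
RUNS = {
    "TPC-DS 1K": {
        "native_min": 33.6,        # reported as 33.6 mins
        "java_min": 1.15 * 60,     # reported as 1.15 hrs
        "workers": 8,
        "total_vcpu": 128,
        "total_mem_gb": 1024,
    },
    "TPC-DS 10K": {
        "native_min": 1.6 * 60,    # reported as 1.6 hrs
        "java_min": 2.83 * 60,     # reported as 2.83 hrs
        "workers": 16,
        "total_vcpu": 512,
        "total_mem_gb": 4096,
    },
}


def speedup(run):
    """Ratio of Java wall-clock time to native wall-clock time."""
    return run["java_min"] / run["native_min"]


def per_worker(run):
    """(vCPU, memory in GB) per worker, derived from the cluster totals."""
    return (
        run["total_vcpu"] / run["workers"],
        run["total_mem_gb"] / run["workers"],
    )


for name, run in RUNS.items():
    vcpu, mem_gb = per_worker(run)
    print(
        f"{name}: native is {speedup(run):.2f}x faster than Java "
        f"on {run['workers']} workers ({vcpu:.0f} vCPU, {mem_gb:.0f} GB each)"
    )
```

With the slide figures this works out to roughly 2.05x at the 1K scale and 1.77x at the 10K scale. These are whole-benchmark wall-clock ratios on different cluster sizes, so they say nothing about individual queries.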