Prestissimo at IBM

In this talk we give an overview of all Prestissimo-related activity at IBM since the last VeloxCon. This includes: i) feature enhancements for the Prestissimo tech preview on IBM watsonx.data; ii) TPC-DS updates; iii) Presto 2.0 plans; iv) the Connector SPI.

Aditi Pandit
Software Engineer at IBM

Ali LeClerc

April 05, 2024

Transcript

  1. Talk Outline
    • Ahana -> IBM Data & AI
    • IBM Watsonx
    • Prestissimo will be a core engine for the Watsonx.data open data lakehouse
    • Updates
    • Prestissimo in 2023
    • Prestissimo in 2024
    • Presto 2.0?
  2. Prestissimo introduction
    • C++ worker
    • Built over the Velox library
    • SIMD
    • Runtime optimizations
    • Smart I/O prefetching/caching
    • Memory arbitration features
  3. Benefits of Prestissimo/Velox
    • Huge performance boost
      ◦ Query processing can be done with much smaller clusters.
    • Avoids performance cliffs
      ◦ No Java processes, JVM, or garbage collection.
      ◦ Memory management and SIMD are explicitly controlled in C++.
      ◦ Memory arbitration improves efficiency.
    • Easier to build and operate at scale
      ◦ Reusable and extensible primitives across data engines (like Spark).
      ◦ Performance can be better understood.
  4. Prestissimo in 2023 (post IBM acquisition)
    Goals: Inner-source Prestissimo in the IBM tech stack. Make it available to broader teams.
    • Watsonx.data dev team, DevOps, testing, docs, research (DB and Storage), customer demos
    Build a solid Presto OSS team: S/W Dev, DevOps
  5. Features
    • CTAS
    • S3-compatible Reader/Writer
    • Parquet Reader/Writer
    • Expand SQL coverage
    • TPC-H and TPC-DS for 1K and 10K
    • Full Presto SQL SELECT statement syntax supported
    • Type/Function coverage
    • JWT and cert-based TLS
  6. Prestissimo in 2024
    • Iceberg Reader
    • Prestissimo production readiness
    • Metrics collection with Prometheus
    • New Velox system implementation
    • AsyncDataCache
    • Spilling
    • Memory arbitration
    • TPC-DS benchmark runs
  7. Performance testing and tracking framework
    • Pbench tool: automates deployments, choosing workloads, and running them.
    • Performance dashboards
    • Plan comparison tool
  8. TPC-DS 1K results (Presto OSS 0.287)
    Native: 33.6 mins
    Java: 1.15 hrs
    Cluster: r5.4xlarge, 8 workers (vCPU: 128, Memory: 1024 GB)
  9. TPC-DS 10K results (Presto OSS 0.287)
    Native: 1.6 hrs
    Java: 2.83 hrs
    Cluster: r5.8xlarge, 16 workers (vCPU: 512, Memory: 4096 GB)
  10. Top issues from TPC-DS runs
    • Control Velox memory usage: https://github.com/facebookincubator/velox/discussions/9008
    • HashProbe performance: listJoinResults is very slow for joins whose keys have many matches: https://github.com/facebookincubator/velox/issues/9078
    • Exchange performance
  11. Presto 2.0
    • Native Engine
    • SQL feature complete
    • Production readiness
    • Performance improvements
    • Lakehouse formats
    • Iceberg (Reader/Writer), Hudi, Delta
    • Connector SPI
    • Feature lockdown, UDF, UDAs
    • Connector Optimizer
    • Expand Presto optimizer for enterprise use-cases with Prestissimo
    • Open search space. Multi-query block merges, Comprehensive Join enumeration, Cost based logical tx. CTEs.
    • Cardinality estimation. New theoretically sound architecture. Add histogram estimators. HBO.
    • Plan lockdown.
    • Prestissimo focused physical plans.
  12. Team acknowledgements
    • Ahana: Ethan Zhang (mgr), Aditi Pandit, Deepak Majeti, Ying Su, Tim Meehan, Karteek Murthy, Pramod Satya
    • IBM: Ashok Kumar (mgr), Christian Zentgraf, Michael Ohsaka, Minhan Cao, Sujit Madiraju, Prateek Dabre, Soumya Duriseti, Manoj Negi…
  13. Please join our community!
    Presto Native Worker Working Group
    Prestissimo: GitHub, Slack, LinkedIn, Meetup
    Velox: GitHub, Slack, Website
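
The TPC-DS timings on slides 8 and 9 can be turned into native-over-Java speedup ratios. The snippet below is a minimal sketch of that arithmetic, not something from the deck: the `speedup` helper and variable names are hypothetical, and it assumes the reported figures are end-to-end wall-clock times for the full benchmark run on the same cluster.

```python
def speedup(java_minutes: float, native_minutes: float) -> float:
    """Return how many times faster the native run is than the Java run."""
    return java_minutes / native_minutes


# Timings as reported on slides 8 and 9 (Presto OSS 0.287), converted to minutes.
tpcds_1k = speedup(java_minutes=1.15 * 60, native_minutes=33.6)
tpcds_10k = speedup(java_minutes=2.83 * 60, native_minutes=1.6 * 60)

print(f"TPC-DS 1K:  {tpcds_1k:.2f}x")   # prints "TPC-DS 1K:  2.05x"
print(f"TPC-DS 10K: {tpcds_10k:.2f}x")  # prints "TPC-DS 10K: 1.77x"
```

By this measure the native engine finishes the 1K run in roughly half the Java time, and the 10K run in a little under 60% of it.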