Daft: Python Distributed DataFrame for Complex Data

If you are an ML practitioner, learn how you can:

Crunch time series and forecasting data at massive scale with Ray (from Nixtla)
Conduct distributed data processing with Daft, a Python dataframe library that uses Ray as its distributed compute engine
Both are exclusive Ray community user talks, and the Ray team is delighted to have them share their Ray use cases and journeys with the community.

Agenda
(The times are not strict; they may vary slightly.)
Talk 0: Welcome remarks & upcoming Ray announcements - Jules Damji, Anyscale
Talk 1 (30-35 mins): Forecasting at Scale with Nixtla and Ray - Max Mergenthaler & Frederico Ramirez, Nixtla
Talk 2 (30-35 mins): Daft: The Ray-native Python dataframe for Complex Data - Jay Chia, Eventual

The notebook URL: https://github.com/Eventual-Inc/Daft/blob/14fc9bcea6e63cf03ec4886f9aceffeeae3f6207/tutorials/image_training/coco-dataset.ipynb

Anyscale

March 23, 2023

Transcript

  1. About Me: Jay Chia
     • Co-founder at Eventual Computing
     • Software Lead at Freenome and Lyft L5
     • Distributed data systems for ML
     • Maintainer of www.getdaft.io - ❤ dataframes ❤
  2. What is Daft? (www.getdaft.io)
     1. Python library (pip install getdaft)
     2. Dataframe (tables, rows, columns)
     3. Distributed (with Ray!)
     4. Complex Data (images, video, tensors, Python objects, …)
  3. Complex Data
     Primitive: 42, "foo", 3.14, True
     Complex: LatLong(lat=40.741895, long=-73.989308), Person(name="jay", age=35)
  4. Complex Data
     Primitive: 42, "foo", 3.14, True
     Complex: LatLong(lat=40.741895, long=-73.989308), Person(name="jay", age=35)
  5. Complex Data
     Primitive: 42, "foo", 3.14, True
     Complex: LatLong(lat=40.741895, long=-73.989308), Person(name="jay", age=35), Image(height=128, width=128, data=[243, …])
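The primitive-versus-complex split on these slides can be sketched in plain Python. This is only an illustration using the standard library, not Daft's API: the class definitions are hypothetical (field names are taken from the slide), `is_primitive` is a helper invented here, and the image pixel buffer is filled with a placeholder value because the slide elides it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatLong:
    lat: float
    long: float


@dataclass
class Person:
    name: str
    age: int


@dataclass
class Image:
    height: int
    width: int
    data: List[int]  # flattened pixel values


# Types the slide labels "primitive".
PRIMITIVE_TYPES = (int, str, float, bool)


def is_primitive(value: object) -> bool:
    """Return True for values a conventional dataframe column handles natively."""
    return isinstance(value, PRIMITIVE_TYPES)


# One hypothetical row mixing primitive and complex values.
row = {
    "count": 42,
    "label": "foo",
    "score": 3.14,
    "flag": True,
    "location": LatLong(lat=40.741895, long=-73.989308),
    "person": Person(name="jay", age=35),
    # Placeholder pixels: the slide shows only the first value (243).
    "image": Image(height=128, width=128, data=[243] * (128 * 128)),
}

complex_columns = [name for name, value in row.items() if not is_primitive(value)]
```

Here `complex_columns` collects the columns holding Python objects, which are the values a dataframe aimed at complex data needs to store and process alongside the primitive ones.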
  6. What makes data complex?
     1. Composed of more primitives (> 1 Mpixels)
     2. Domain-specific semantics
     3. Hard to represent
     Examples shown: videos; LatLong(lat=40.741895, long=-73.989308)
  7. How does Daft help?
     1. More primitives
        a. Leverage Ray for distributed computing and GPUs 🔥🔥🔥
        b. Lazy execution + query optimization
     2. Domain-specific semantics
        a. Python user-defined functions (@udf)
        b. Complex data kernels written in Rust [coming soon!]
     3. Hard to represent
        a. Daft-native Complex DataTypes - images, tensors, documents etc [coming soon!]
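To make "lazy execution" and "Python user-defined functions" concrete, here is a toy sketch in plain Python. It is not Daft's implementation or API: `LazyFrame`, `where`, `with_column`, and `collect` are names invented for this illustration, everything runs in a single process (no Ray), and no query optimization is attempted. It shows only that operations are recorded first and that the user function runs when results are requested.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


class LazyFrame:
    """Toy lazy dataframe: records operations and runs them only on collect()."""

    def __init__(self, rows: List[Row], plan: Tuple[Tuple[str, Any], ...] = ()):
        self._rows = rows
        self._plan = plan

    def where(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Record the filter; nothing is evaluated yet.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name: str, fn: Callable[[Row], Any]) -> "LazyFrame":
        # Record a user-defined function that computes a new column.
        return LazyFrame(self._rows, self._plan + (("with_column", (name, fn)),))

    def collect(self) -> List[Row]:
        # Execute the recorded plan step by step, in order.
        rows = [dict(r) for r in self._rows]
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                name, fn = arg
                rows = [{**r, name: fn(r)} for r in rows]
        return rows


calls: List[int] = []


def pixel_count(row: Row) -> int:
    """User-defined function; logs each call so laziness is observable."""
    calls.append(row["id"])
    return row["height"] * row["width"]


images = [
    {"id": 1, "height": 128, "width": 128},
    {"id": 2, "height": 2000, "width": 1000},
    {"id": 3, "height": 64, "width": 64},
]

lazy = (
    LazyFrame(images)
    .where(lambda r: r["height"] >= 100)
    .with_column("pixels", pixel_count)
)
calls_before_collect = list(calls)  # still empty: nothing has run
result = lazy.collect()             # the plan executes here
```

Because the filter is recorded before the user function, `pixel_count` runs only on the two rows that survive it. A real engine can go further and reorder or fuse steps before execution, which is the part this sketch leaves out.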
  8. Demo: using Daft with Ray to train on COCO
     [Diagram labels: Data Storage; Query (joins, filters); Data Processing and ETL; Analytics; Interactive Data Science; ray.data; ray.tune]
  9. Recap: using Daft with Ray
     [Diagram labels: Data Storage; Query (joins, filters); Data Processing and ETL; Analytics; Interactive Data Science; ray.data; ray.*]
  10. Roadmap
     1. Rust - performance, I/O, serialization
     2. Native complex types and kernels
        a. df["img"].image.to_embedding("clip")
        b. df["t1"].tensor.cosine_sim(df["t2"])
     3. End-to-end lazy partition evaluation into Ray Datasets
     pip install getdaft
     Contact us! [email protected]
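The roadmap names a tensor kernel, `cosine_sim`, as a planned expression. As a rough idea of what such a kernel would compute for each pair of rows, here is a plain-Python sketch under the standard definition of cosine similarity. The function below is a hypothetical stand-in and says nothing about how Daft implements or exposes the kernel; the sample vectors are made up.

```python
import math
from typing import List, Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Two hypothetical tensor columns, compared row by row.
t1: List[List[float]] = [[1.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
t2: List[List[float]] = [[0.0, 1.0], [2.0, 4.0], [-3.0, -4.0]]

sims = [cosine_sim(a, b) for a, b in zip(t1, t2)]
```

The three pairs are orthogonal, parallel, and opposite, so the similarities come out near 0, 1, and -1 respectively (up to floating-point rounding).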
  11. Like what you see? Come chat with us - we'd love to collaborate!
     Catch us at Ray Summit 2023! We'll be talking more about Daft architecture and how we built this on Ray - see you then!