Python distributed dataframe for complex data
www.getdaft.io
Ray meetup - March 22 2023
Slide 2
About Me
Jay Chia
• Co-founder at Eventual Computing
• Software Lead at Freenome and Lyft L5
• Distributed data systems for ML
• Maintainer of www.getdaft.io - ❤ dataframes ❤
Slide 3
What is Daft?
1. Python library (pip install getdaft)
2. Dataframe (tables, rows, columns)
3. Distributed (with Ray!)
4. Complex Data (images, video, tensors, Python objects, …)
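The points above can be made concrete with a toy sketch. This is not Daft's API or implementation: it uses only the Python standard library, a thread pool stands in for Ray workers, and the names `partition`, `add_payload_size`, and `map_partitions` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# A "dataframe" here is just a dict of equal-length columns.
table = {
    "id": [1, 2, 3, 4, 5, 6],
    # Byte strings stand in for complex data such as encoded images.
    "payload": [b"a", b"bb", b"ccc", b"dddd", b"eeeee", b"ffffff"],
}


def partition(table, num_partitions):
    """Split a column-oriented table into row-wise chunks."""
    n = len(next(iter(table.values())))
    size = -(-n // num_partitions)  # ceiling division
    return [
        {name: col[start:start + size] for name, col in table.items()}
        for start in range(0, n, size)
    ]


def add_payload_size(part):
    """Per-partition work: derive a new column from the payload column."""
    return {**part, "size": [len(p) for p in part["payload"]]}


def map_partitions(parts, fn, max_workers=2):
    """Run fn on every partition in parallel, preserving partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, parts))


parts = partition(table, num_partitions=3)
result = map_partitions(parts, add_payload_size)
sizes = [s for part in result for s in part["size"]]
print(sizes)
```

Each partition is processed independently, which is what lets the same per-partition function be shipped to remote workers instead of local threads.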
Slide 5
“Complex” Data
Slide 7
Complex Data
Complex
Primitive: 42, “foo”, 3.14, True
What makes data complex?
1. Composed of more primitives
a. > 1 Mpixels
2. Domain-specific semantics
3. Hard to represent
Videos:
LatLong(lat=40.741895, long=-73.989308)
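As a minimal sketch of the second point, the `LatLong` value above can be modelled as a small Python type. The two floats only make sense together and carry rules (valid ranges, distance on a sphere) that a plain pair of numbers does not. The range checks, the `distance_km` method, and the Earth radius constant are assumptions added for illustration and are not taken from the talk.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


@dataclass(frozen=True)
class LatLong:
    lat: float   # degrees north, valid range [-90, 90]
    long: float  # degrees east, valid range [-180, 180]

    def __post_init__(self):
        # Semantics a bare pair of floats cannot enforce on its own.
        if not -90.0 <= self.lat <= 90.0:
            raise ValueError(f"latitude out of range: {self.lat}")
        if not -180.0 <= self.long <= 180.0:
            raise ValueError(f"longitude out of range: {self.long}")

    def distance_km(self, other: "LatLong") -> float:
        """Great-circle distance using the haversine formula."""
        phi1, phi2 = math.radians(self.lat), math.radians(other.lat)
        dphi = phi2 - phi1
        dlam = math.radians(other.long - self.long)
        a = (math.sin(dphi / 2) ** 2
             + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
        return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


here = LatLong(lat=40.741895, long=-73.989308)  # the point from the slide
equator = LatLong(lat=0.0, long=-73.989308)     # hypothetical second point
print(round(here.distance_km(equator), 1))
```

Subtracting two such values as if they were ordinary numbers would be meaningless, which is why domain-specific operations need to travel with the type.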
Slide 12
How does Daft help?
1. More primitives
a. Leverage Ray for distributed computing and GPUs 🔥🔥🔥
b. Lazy execution + query optimization
2. Domain-specific semantics
a. Python user-defined functions (@udf)
b. Complex data kernels written in Rust [coming soon!]
3. Hard to represent
a. Daft-native complex DataTypes - images, tensors, documents, etc. [coming soon!]
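Point 1b (lazy execution plus query optimization) can be illustrated with a toy, standard-library-only sketch. It is not Daft's planner: the `LazyFrame` class and its methods are hypothetical, and the only "optimization" shown is moving a filter ahead of an expensive row-wise function that does not produce the filtered column.

```python
class LazyFrame:
    """Toy lazy table: records operations and runs them only on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = tuple(plan)

    def with_column(self, name, fn):
        # fn takes a whole row (dict) and returns the new column value.
        return LazyFrame(self._rows, self._plan + (("map", name, fn),))

    def where(self, column, predicate):
        # predicate takes the value of one column and returns a bool.
        return LazyFrame(self._rows, self._plan + (("filter", column, predicate),))

    def optimized_plan(self):
        """Move each filter ahead of preceding maps that do not write its column."""
        plan = list(self._plan)
        changed = True
        while changed:
            changed = False
            for i in range(1, len(plan)):
                prev, cur = plan[i - 1], plan[i]
                if cur[0] == "filter" and prev[0] == "map" and prev[1] != cur[1]:
                    plan[i - 1], plan[i] = cur, prev
                    changed = True
        return plan

    def collect(self):
        rows = [dict(r) for r in self._rows]
        for kind, column, fn in self.optimized_plan():
            if kind == "map":
                for r in rows:
                    r[column] = fn(r)
            else:
                rows = [r for r in rows if fn(r[column])]
        return rows


calls = []


def expensive(row):
    """Stand-in for costly work such as decoding an image or running a model."""
    calls.append(row["id"])
    return row["id"] * 10


df = (
    LazyFrame([{"id": i} for i in range(6)])
    .with_column("feature", expensive)
    .where("id", lambda v: v % 2 == 0)
)
calls_before_collect = list(calls)  # still empty: nothing has executed yet
rows = df.collect()
print(rows, calls)
```

Because the plan is only a description until `collect()` runs, the filter can be applied first and the expensive function is called only for the rows that survive it.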
Slide 13
Distributed Dataframes with Ray
Slide 14
Distributed Daft with Ray
Slide 16
Distributed Daft with Ray
Ray Dataset
Slide 17
Demo: using Daft with Ray to train on COCO
Data Storage
Query (joins, filters)
Data Processing and ETL
Analytics
Interactive Data Science
ray.data ray.tune
Slide 18
Recap: using Daft with Ray
Data Storage
Query (joins, filters)
Data Processing and ETL
Analytics
Interactive Data Science
ray.data ray.*
Slide 19
Roadmap
1. Rust - performance, I/O, serialization
2. Native complex types and kernels
a. df["img"].image.to_embedding("clip")
b. df["t1"].tensor.cosine_sim(df["t2"])
3. End-to-end lazy partition evaluation into Ray Datasets
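Roadmap item 2b names a column-wise cosine similarity. The `df[...]` expressions above are planned API, so they should not be relied on as written. The sketch below only shows the arithmetic such a kernel would compute, in plain Python, with hypothetical helper names `cosine_sim` and `cosine_sim_columns`.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_sim_columns(col_a, col_b):
    """Row-wise cosine similarity between two columns of vectors."""
    return [cosine_sim(a, b) for a, b in zip(col_a, col_b)]


t1 = [[1.0, 0.0], [1.0, 2.0, 3.0], [1.0, 0.0]]
t2 = [[0.0, 1.0], [2.0, 4.0, 6.0], [-1.0, 0.0]]
print(cosine_sim_columns(t1, t2))
```

A native kernel would run the same computation over contiguous tensor buffers rather than Python lists, which is where a Rust implementation would be expected to help.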
www.getdaft.io
pip install getdaft
Contact us!
[email protected]
Slide 20
Like what you see?
Come chat with us - we'd love to collaborate!
Catch us at Ray Summit 2023! We'll be talking more about Daft architecture and how we built this on Ray - see you then!