Slide 1

Slide 1 text

Python distributed dataframe for complex data
www.getdaft.io
Ray meetup - March 22 2023

Slide 2

Slide 2 text

About Me
Jay Chia
• Co-founder at Eventual Computing
• Software Lead at Freenome and Lyft L5
• Distributed data systems for ML
• Maintainer of www.getdaft.io - ❤ dataframes ❤

Slide 3

Slide 3 text

What is Daft?
1. Python library (pip install getdaft)
2. Dataframe (tables, rows, columns)
3. Distributed (with Ray!)
4. Complex Data (images, video, tensors, Python objects, …)

Slide 4

Slide 4 text

What is Daft?

Slide 5

Slide 5 text

“Complex” Data

Slide 6

Slide 6 text

Complex Data
Diagram labels: Complex, Primitive

Slide 7

Slide 7 text

Complex Data
Diagram labels: Complex, Primitive
42, "foo", 3.14, True

Slide 8

Slide 8 text

Complex Data
Diagram labels: Complex, Primitive
42, "foo", 3.14, True
LatLong(lat=40.741895, long=-73.989308)
Person(name="jay", age=35)

Slide 9

Slide 9 text

Complex Data
Diagram labels: Complex, Primitive
42, "foo", 3.14, True
LatLong(lat=40.741895, long=-73.989308)
Person(name="jay", age=35)

Slide 10

Slide 10 text

Complex Data
Diagram labels: Complex, Primitive
Image(height=128, width=128, data=[243, …])
42, "foo", 3.14, True
LatLong(lat=40.741895, long=-73.989308)
Person(name="jay", age=35)

Slide 11

Slide 11 text

What makes data complex?
1. Composed of more primitives (> 1 Mpixels)
2. Domain-specific semantics
3. Hard to represent
Videos:
LatLong(lat=40.741895, long=-73.989308)
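Points 1 and 2 can be made concrete with a minimal Python sketch. This is illustrative only and is not Daft's API: the `LatLong` dataclass mirrors the example on the slide, while `haversine_km`, `primitive_count`, and the 3-channel default are assumptions introduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LatLong:
    """Two floats that only make sense together as a point on Earth."""
    lat: float
    long: float


def haversine_km(a: LatLong, b: LatLong) -> float:
    """Great-circle distance in km: a domain-specific operation on LatLong.

    Hypothetical helper; uses a mean Earth radius of 6371 km (approximation).
    """
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.long - a.long)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def primitive_count(height: int, width: int, channels: int = 3) -> int:
    """Number of primitive values in an uncompressed image (hypothetical helper)."""
    return height * width * channels
```

Subtracting two `LatLong` values field by field would be meaningless, which is the "domain-specific semantics" point; a 1024 x 1024 RGB image already holds millions of primitive values, which is the "more primitives" point.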

Slide 12

Slide 12 text

How does Daft help?
1. More primitives
   a. Leverage Ray for distributed computing and GPUs 🔥🔥🔥
   b. Lazy execution + query optimization
2. Domain-specific semantics
   a. Python user-defined functions (@udf)
   b. Complex data kernels written in Rust [coming soon!]
3. Hard to represent
   a. Daft-native Complex DataTypes - images, tensors, documents etc. [coming soon!]
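To illustrate what "lazy execution + query optimization" can mean in general, here is a toy sketch in plain Python. It is not Daft's implementation or API: `LazyFrame`, `with_column`, `where`, and `optimized_ops` are hypothetical names, and the only optimization shown is moving a filter ahead of a column computation it does not depend on, so the expensive function runs on fewer rows.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class LazyFrame:
    """Hypothetical toy dataframe: records operations instead of running them."""
    rows: List[dict]
    ops: List[Tuple] = field(default_factory=list)

    def with_column(self, name: str, fn: Callable[[dict], object]) -> "LazyFrame":
        # Record a column computation; nothing is executed yet.
        return LazyFrame(self.rows, self.ops + [("map", name, fn)])

    def where(self, predicate: Callable[[dict], bool], uses: Set[str]) -> "LazyFrame":
        # Record a filter plus the set of columns the predicate reads.
        return LazyFrame(self.rows, self.ops + [("filter", uses, predicate)])

    def optimized_ops(self) -> List[Tuple]:
        """Move each filter ahead of any map whose output it does not read."""
        ops = list(self.ops)
        changed = True
        while changed:
            changed = False
            for i in range(1, len(ops)):
                prev, cur = ops[i - 1], ops[i]
                if cur[0] == "filter" and prev[0] == "map" and prev[1] not in cur[1]:
                    ops[i - 1], ops[i] = cur, prev
                    changed = True
        return ops

    def collect(self) -> List[dict]:
        """Run the optimized plan row by row and return the surviving rows."""
        plan = self.optimized_ops()
        out = []
        for original in self.rows:
            row = dict(original)
            keep = True
            for kind, arg, fn in plan:
                if kind == "map":
                    row[arg] = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out
```

Because nothing runs until `collect()`, the whole plan is visible up front and can be reordered before any row is touched.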

Slide 13

Slide 13 text

Distributed Dataframes with Ray

Slide 14

Slide 14 text

Distributed Daft with Ray

Slide 15

Slide 15 text

Distributed Daft with Ray

Slide 16

Slide 16 text

Distributed Daft with Ray
Ray Dataset

Slide 17

Slide 17 text

Demo: using Daft with Ray to train on COCO
Diagram labels: Data Storage; Query (joins, filters); Data Processing and ETL; Analytics; Interactive Data Science; ray.data; ray.tune

Slide 18

Slide 18 text

Recap: using Daft with Ray
Diagram labels: Data Storage; Query (joins, filters); Data Processing and ETL; Analytics; Interactive Data Science; ray.data; ray.*

Slide 19

Slide 19 text

Roadmap
1. Rust - performance, I/O, serialization
2. Native complex types and kernels
   a. df["img"].image.to_embedding("clip")
   b. df["t1"].tensor.cosine_sim(df["t2"])
3. End-to-end lazy partition evaluation into Ray Datasets
pip install getdaft
Contact us! [email protected]
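The roadmap names a planned `cosine_sim` tensor kernel without defining it. As a rough sketch of what a column-wise cosine similarity computes, here is a plain-Python version. It is not Daft code: `cosine_sim_column` and the choice to raise on zero vectors are assumptions made for this example.

```python
import math
from typing import List, Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: treat zero vectors as an error.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_sim_column(col_a: Sequence[Sequence[float]],
                      col_b: Sequence[Sequence[float]]) -> List[float]:
    """Apply cosine_sim row by row across two columns of vectors (hypothetical helper)."""
    return [cosine_sim(a, b) for a, b in zip(col_a, col_b)]
```

A native kernel would presumably do the same arithmetic over contiguous buffers rather than Python lists, which is where the Rust work in item 1 would matter.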

Slide 20

Slide 20 text

Like what you see? Come chat with us - we'd love to collaborate!
Catch us at Ray Summit 2023! We'll be talking more about Daft architecture and how we built this on Ray - see you then!
pip install getdaft
Contact us! [email protected]