Daft: Python Distributed DataFrame for Complex Data

This meetup is for ML practitioners who want to learn how to:

Crunch time-series and forecasting data at massive scale with Ray, presented by Nixtla.
Conduct distributed data processing with Daft, a Python dataframe library that uses Ray as its distributed compute engine.
Both are exclusive Ray community user talks, and the Ray team is delighted to have them share their Ray use cases and journeys with the community.

Agenda
(The times are not strict; they may vary slightly.)
Talk 0: Welcome remarks & upcoming Ray announcements - Jules Damji, Anyscale
Talk 1 (30-35 mins): Forecasting at Scale with Nixtla and Ray - Max Mergenthaler & Frederico Ramirez, Nixtla
Talk 2 (30-35 mins): Daft: The Ray-native Python dataframe for Complex Data - Jay Chia, Eventual

The notebook URL: https://github.com/Eventual-Inc/Daft/blob/14fc9bcea6e63cf03ec4886f9aceffeeae3f6207/tutorials/image_training/coco-dataset.ipynb

Anyscale

March 23, 2023

Transcript

  1. Python distributed dataframe
    for complex data
    www.getdaft.io
    Ray meetup - March 22 2023

  2. About Me
    Jay Chia
    • Co-founder at Eventual Computing
    • Software Lead at Freenome and Lyft L5
    • Distributed data systems for ML
    • Maintainer of www.getdaft.io - ❤ dataframes ❤

  3. What is Daft?
    1. Python library (pip install getdaft)
    2. Dataframe (tables, rows, columns)
    3. Distributed (with Ray!)
    4. Complex Data (images, video, tensors, Python objects, …)
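
The "Dataframe" and "Distributed" points can be illustrated with a small sketch. This is not Daft or Ray code: a `ThreadPoolExecutor` stands in for a distributed engine, and `split_into_partitions`, `map_partitions`, and `add_area` are hypothetical helpers invented for the example. The idea it shows is that a table is cut into row partitions and a function is applied to each partition independently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Table = Dict[str, list]  # column name -> column values


def split_into_partitions(table: Table, num_partitions: int) -> List[Table]:
    """Split a column-oriented table into contiguous row ranges."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    num_rows = len(next(iter(table.values())))
    size = max(1, -(-num_rows // num_partitions))  # ceiling division
    return [
        {name: values[start:start + size] for name, values in table.items()}
        for start in range(0, num_rows, size)
    ]


def map_partitions(partitions: List[Table], fn: Callable[[Table], Table]) -> List[Table]:
    """Apply fn to every partition in parallel, preserving partition order."""
    # Assumption: a local thread pool plays the role of the cluster here.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, partitions))


def add_area(partition: Table) -> Table:
    """Add an 'area' column computed from 'height' and 'width'."""
    areas = [h * w for h, w in zip(partition["height"], partition["width"])]
    return {**partition, "area": areas}


table = {"height": [128, 256, 512, 64, 32], "width": [128, 256, 512, 64, 32]}
partitions = map_partitions(split_into_partitions(table, num_partitions=2), add_area)
areas = [a for p in partitions for a in p["area"]]
```

Because each partition is processed on its own, the same pattern scales from threads on one machine to tasks on a cluster.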

  4. What is Daft?

  5. “Complex” Data

  6. Complex
    Primitive
    Complex Data

  7. Complex
    Primitive
    Complex Data
    42
    “foo”
    3.14
    True

  8. Complex
    Primitive
    Complex Data
    42
    “foo”
    3.14
    True
    LatLong(
    lat=40.741895,
    long= -73.989308,
    )
    Person(
    name=“jay”,
    age=35,
    )

  9. Complex
    Primitive
    Complex Data
    42
    “foo”
    3.14
    True
    LatLong(
    lat=40.741895,
    long= -73.989308,
    )
    Person(
    name=“jay”,
    age=35,
    )

  10. Complex
    Primitive
    Complex Data
    Image(
    height=128,
    width=128,
    data=[243, …]
    )
    42
    “foo”
    3.14
    True
    LatLong(
    lat=40.741895,
    long= -73.989308,
    )
    Person(
    name=“jay”,
    age=35,
    )
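
The primitive versus complex split on this slide can be sketched in plain Python. This is only an illustration, not Daft's type system: the class names mirror the slide, while the `count_primitives` helper, the flat `data` layout, and the placeholder pixel values are assumptions made for the example.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import List


@dataclass
class LatLong:
    lat: float
    long: float


@dataclass
class Person:
    name: str
    age: int


@dataclass
class Image:
    height: int
    width: int
    data: List[int]  # flat list of pixel values (assumed layout)


def count_primitives(value) -> int:
    """Count the primitive leaves (int, float, str, bool) inside a value."""
    if is_dataclass(value):
        return sum(count_primitives(getattr(value, f.name)) for f in fields(value))
    if isinstance(value, (list, tuple)):
        return sum(count_primitives(v) for v in value)
    return 1


location = LatLong(lat=40.741895, long=-73.989308)
person = Person(name="jay", age=35)
# Placeholder pixels: the slide elides the real data, so every pixel is 243 here.
image = Image(height=128, width=128, data=[243] * (128 * 128))
```

A single primitive counts as one leaf, `LatLong` and `Person` as two each, and even a small 128x128 image as more than sixteen thousand, which is one way to read "more complex" along the slide's axis.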

  11. What makes data complex?
    1. Composed of more primitives
       Images: > 1 Mpixels
    2. Domain-specific semantics
       LatLong(lat=40.741895, long=-73.989308)
    3. Hard to represent
       Videos

  12. How does Daft help?
    1. More primitives
       a. Leverage Ray for distributed computing and GPUs 🔥🔥🔥
       b. Lazy execution + query optimization
    2. Domain-specific semantics
       a. Python user-defined functions (@udf)
       b. Complex data kernels written in Rust [coming soon!]
    3. Hard to represent
       a. Daft-native complex DataTypes - images, tensors, documents, etc. [coming soon!]
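
To make "lazy execution + query optimization" and user-defined functions concrete, here is a toy sketch in plain Python. It is not Daft's implementation or its optimizer: `LazyFrame`, `optimized_plan`, and `expensive_udf` are hypothetical names chosen for the example. Operations are only recorded until `collect()` is called, and a single rewrite rule moves filters ahead of column computations they do not depend on, so the Python function runs on fewer rows.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


class LazyFrame:
    """Toy lazy dataframe: records operations and runs them only on collect()."""

    def __init__(self, rows: List[Row], plan: Tuple = ()):
        self._rows = rows
        self._plan = plan  # steps of the form (kind, column_name, function)

    def with_column(self, name: str, fn: Callable[[Row], Any]) -> "LazyFrame":
        return LazyFrame(self._rows, self._plan + (("with_column", name, fn),))

    def where(self, column: str, predicate: Callable[[Any], bool]) -> "LazyFrame":
        return LazyFrame(self._rows, self._plan + (("filter", column, predicate),))

    def optimized_plan(self) -> List[Tuple]:
        """Move each filter ahead of with_column steps that do not produce its column."""
        plan = list(self._plan)
        for i in range(len(plan)):
            j = i
            while (
                j > 0
                and plan[j][0] == "filter"
                and plan[j - 1][0] == "with_column"
                and plan[j - 1][1] != plan[j][1]
            ):
                plan[j - 1], plan[j] = plan[j], plan[j - 1]
                j -= 1
        return plan

    def collect(self) -> List[Row]:
        """Execute the optimized plan on copies of the input rows."""
        rows = [dict(r) for r in self._rows]
        for kind, name, fn in self.optimized_plan():
            if kind == "filter":
                rows = [r for r in rows if fn(r[name])]
            else:
                for r in rows:
                    r[name] = fn(r)
        return rows


calls = []  # records which rows the user-defined function actually saw


def expensive_udf(row: Row) -> int:
    calls.append(row["id"])
    return len(row["text"])


df = LazyFrame([
    {"id": 1, "text": "ray", "keep": True},
    {"id": 2, "text": "daft", "keep": False},
    {"id": 3, "text": "dataframe", "keep": True},
])
lazy = df.with_column("text_len", expensive_udf).where("keep", lambda v: v)
result = lazy.collect()
```

Although the filter was written after the column computation, the optimized plan runs it first, so `expensive_udf` never sees the row that is filtered out.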

  13. Distributed Dataframes with Ray

  14. Distributed Daft with Ray

  15. Distributed Daft with Ray

  16. Distributed Daft with Ray
    Ray Dataset

  17. Demo: using Daft with Ray to train on COCO
    Data Storage
    Query (joins, filters)
    Data Processing and ETL
    Analytics
    Interactive Data Science
    ray.data ray.tune

  18. Recap: using Daft with Ray
    Data Storage
    Query (joins, filters)
    Data Processing and ETL
    Analytics
    Interactive Data Science
    ray.data ray.*

  19. Roadmap
    1. Rust - performance, I/O, serialization
    2. Native complex types and kernels
       a. df["img"].image.to_embedding("clip")
       b. df["t1"].tensor.cosine_sim(df["t2"])
    3. End-to-end lazy partition evaluation into Ray Datasets
    www.getdaft.io
    pip install getdaft
    Contact us!
    [email protected]
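
The roadmap lists a `cosine_sim` tensor kernel as planned work. As a rough idea of what such a kernel computes per row, here is a plain-Python sketch. It is not the Daft API: the standalone `cosine_sim` function and the two example columns `t1` and `t2` are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Two hypothetical tensor columns, one vector per row.
t1 = [[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
t2 = [[0.0, 1.0], [2.0, 2.0], [-3.0, -4.0]]
similarities = [cosine_sim(a, b) for a, b in zip(t1, t2)]
```

The three rows cover orthogonal, parallel, and opposite vectors, which give similarities close to 0, 1, and -1 respectively.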

  20. Like what you see?
    Come chat with us - we'd love to collaborate!
    Catch us at Ray Summit 2023!
    We'll be talking more about Daft architecture and how we built this on Ray - see you then!
