Scaling Interactive Data Science with Modin and Ray (Devin Petersohn, UC Berkeley)

Interactive data science at scale with Modin. Includes a set of comparisons and benchmarks against existing systems/solutions.

Anyscale

July 15, 2021

Transcript

  1. About me
    • Retired U.S. Marine
      ◦ Korean Crypto-Linguist
      ◦ 3d Radio BN (2008-2012)
    • BS - University of Missouri (2016)
    • MS - UC Berkeley (2018)
    • 6th year PhD student at UC Berkeley (2021)
      ◦ Anthony Joseph
      ◦ Started with work in Genomics
      ◦ Learned how to communicate with Scientists
      ◦ Started the dataframe effort in RISELab
  2. Data Science has a scalability problem
    MORE DATA SCIENTISTS != MORE INSIGHTS
    MORE DATA SCIENTISTS != MORE PRODUCTION MODELS
  3. Many organizations look like this
    [Diagram labels: Laptop/Workstation, Small Cluster, Large Cluster; Prototyping, Testing, Exploring, Production; New Data Source, New spec, New requirements, Feedback, Rewrite]
  8. Data Science scalability is a human scalability issue
    Human time is the limit here, not the scalability of individual components
  9. Shifting the focus from the machine to the user
    Tools should work for data scientists
    Data Scientists shouldn’t have to work for their tools
  10. My work: Transparently scale existing systems
    Abstract away all of the components of the system that data scientists don’t care about; expose only the details they do care about.
  12. Dataframe formal definition
    Let $\Sigma^*$ be the set of finite strings over an alphabet $\Sigma$. Let $\mathit{Dom}$ be a finite set of domains $\{dom_1, dom_2, \ldots\}$. Let each $dom_i \in \mathit{Dom}$ have a mapping $p_i : \Sigma^* \to dom_i$. A dataframe is a tuple $(A_{mn}, R_m, C_n, D_n)$, where $A_{mn}$ is an arrangement of entries from $\Sigma^*$ in $m$ rows and $n$ columns, $R_m$ is a vector of $m$ row labels from $\Sigma^*$, $C_n$ is a vector of $n$ column labels from $\Sigma^*$, and $D_n$ is a vector of $n$ domains from $\mathit{Dom}$, one per column, each of which can also be left unspecified. We call $D_n$ the schema of the dataframe. If any of the $n$ entries within $D_n$ is left unspecified, that domain can be induced by applying a schema induction function $S(\cdot)$ to the corresponding column of $A_{mn}$. The schema induction function $S : (\Sigma^*)^m \to \mathit{Dom}$ assigns an arrangement of $m$ strings to a domain in $\mathit{Dom}$.
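The definition above can be made concrete with a small sketch. The names `Dataframe`, `PARSERS`, and `induce_schema` are hypothetical, as are the three-domain set (`int`, `float`, `str`) and the "first domain that parses every entry" induction rule; this illustrates the tuple and the schema induction function, and is not Modin's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical parsers p_i : Sigma* -> dom_i, one per domain in Dom.
PARSERS = {"int": int, "float": float, "str": str}


def induce_schema(column: List[str]) -> str:
    """Toy S(.): return the first domain whose parser accepts every entry."""
    for domain in ("int", "float", "str"):
        try:
            for entry in column:
                PARSERS[domain](entry)
            return domain
        except ValueError:
            continue
    return "str"


@dataclass
class Dataframe:
    A: List[List[str]]      # m x n arrangement of string entries
    R: List[str]            # m row labels
    C: List[str]            # n column labels
    D: List[Optional[str]]  # n domains; None means "left unspecified"

    def schema(self) -> List[str]:
        """Fill each unspecified domain by applying S to that column of A."""
        columns = [list(col) for col in zip(*self.A)]
        return [
            d if d is not None else induce_schema(columns[j])
            for j, d in enumerate(self.D)
        ]


df = Dataframe(
    A=[["1", "2.5", "x"], ["2", "3", "y"]],
    R=["r0", "r1"],
    C=["a", "b", "c"],
    D=[None, None, "str"],  # only the last column has a declared domain
)
print(df.schema())  # ['int', 'float', 'str']
```

Every entry is stored as a string, so the schema can be left open and induced only when an operation needs typed values.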
  13. Dataframe algebraic operators
    • mask • filter_by_types • map • filter • explode • reduce • window • groupby • infer_types • join • concat • transpose • to_labels • from_labels • sort_by
  14. Decomposition Rules -> Formalize parallelism
    • Cell-wise: An operator can be applied to each “unit dataframe” (a single cell) independently
    • Row-wise: An operator can be applied to each row independently
    • Column-wise: An operator can be applied to each column independently
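The three decomposition rules can be sketched with the standard library. The helper names (`apply_cellwise`, `apply_rowwise`, `apply_columnwise`) and the thread pool are assumptions made for illustration; the point is only that each rule determines which pieces of the table can be processed as independent tasks.

```python
from concurrent.futures import ThreadPoolExecutor

table = [[1, 2, 3],
         [4, 5, 6]]


def apply_cellwise(fn, rows, pool):
    """Cell-wise: every cell is an independent task."""
    if not rows:
        return []
    width = len(rows[0])
    flat = list(pool.map(fn, (cell for row in rows for cell in row)))
    # Reassemble the flat results into the original row layout.
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def apply_rowwise(fn, rows, pool):
    """Row-wise: every row is an independent task."""
    return list(pool.map(fn, rows))


def apply_columnwise(fn, rows, pool):
    """Column-wise: every column is an independent task."""
    columns = [list(col) for col in zip(*rows)]
    return list(pool.map(fn, columns))


with ThreadPoolExecutor(max_workers=4) as pool:
    doubled = apply_cellwise(lambda x: x * 2, table, pool)
    row_sums = apply_rowwise(sum, table, pool)
    col_max = apply_columnwise(max, table, pool)

print(doubled)   # [[2, 4, 6], [8, 10, 12]]
print(row_sums)  # [6, 15]
print(col_max)   # [4, 5, 6]
```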
  15. So now we know:
    • What a dataframe is (formally)
    • What operators a dataframe supports
    • How these operators map back to pandas
    • How to handle dataframe types
    • How to decouple logical and physical order
  16. How Modin is being used in practice:
    [Diagram labels: Laptop/Workstation, Small Cluster, Large Cluster; Prototyping, Testing, Exploring, Production; New Data Source, New spec, New requirements, Feedback, Rewrite]
  17. Dask - Functional Evaluation
    • Row-store (cannot support most columnar operations)
    • Sorts the dataframe implicitly based on row labels (index)
    • No support for MultiIndex
    • Supports most relational database-style queries (join, groupby)
    • Fairly rigid partitioning, more flexible than Spark
  18. Koalas - Functional Evaluation
    • Row-store (cannot support most columnar operations)
    • Only supports dataframe operators that directly translate to ANSI SQL
    • Supports only relational database-style queries (join, groupby)
    • Missing support for categorical data
    • Fairly rigid partitioning
  19. Modin - Functional Evaluation
    • Supports operators on both axes (columns and rows)
    • Not all translations to low-level algebra finished
    • Not all low-level algebraic operators fully optimized (uncounted)
    • Fall back to pandas in case something is not yet supported
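The fallback idea in the last bullet can be illustrated with a toy dispatcher. Everything here is hypothetical (`PARALLEL_IMPLS`, `REFERENCE_IMPLS`, `run`), and plain Python functions stand in for pandas. It shows only the general pattern: use a partition-aware implementation when one exists, otherwise gather the data and run a single-node reference implementation. It does not reproduce Modin's internals.

```python
import statistics
import warnings

# Hypothetical registry: operations that already have a partitioned implementation.
PARALLEL_IMPLS = {
    "sum": lambda partitions: sum(sum(p) for p in partitions),
}

# Hypothetical single-node reference implementations (stand-in for pandas).
REFERENCE_IMPLS = {
    "sum": sum,
    "median": statistics.median,
}


def run(op, partitions):
    """Return (result, path) where path records which implementation ran."""
    if op in PARALLEL_IMPLS:
        return PARALLEL_IMPLS[op](partitions), "parallel"
    # Not yet supported in parallel: gather all partitions and fall back.
    warnings.warn(f"{op!r} has no parallel implementation; falling back")
    gathered = [value for part in partitions for value in part]
    return REFERENCE_IMPLS[op](gathered), "fallback"


partitions = [[1, 2], [3, 4, 5]]
print(run("sum", partitions))     # (15, 'parallel')
print(run("median", partitions))  # (3, 'fallback')
```

The caller gets a correct answer either way; only performance differs, which is why unsupported operations do not break existing code.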
  20. Modin is the only distributed system capable of handling all dataframe operations
    (Because we derived a theoretical basis for the implementation)
  21. Performance (after tuning Spark and Dask)
    • NYC Taxi data
    • 20 columns, 100m rows
    • Reported in log-log scale
    • Neither Spark nor Dask could run queries with default parameters
      ◦ Modified partitioning with various layouts and reported best times
    • Modin was run with defaults
      ◦ Native Ray impl.
  22. Performance (after tuning Spark and Dask)
    • NYC Taxi data
    • 20 columns, 100m rows
    • Reported in log-log scale
    • Neither Spark nor Dask could run queries with default parameters
      ◦ Modified partitioning with various layouts and reported best times
    • Modin was run with defaults
    [Chart annotation: 100x]
  23. Modin is modular, so as new components are developed, we can incorporate them
    Performance can only get better
  24. BYODP (Bring your own data processor)
    • Proof of concept implemented with OmnisciDB
      ◦ Compiles Modin’s Dataframe DSL to Apache Calcite
      ◦ Execution is handed off to OmnisciDB
      ◦ Supports the parts of the pandas API that the Omnisci implementation supports
    Do you have a data processing system you wish would expose the pandas API? Let us know!
  25. Thank you!
    • Open source at github.com/modin-project/modin
    • Install with pip install modin
    • Email the developer mailing list: [email protected]
    • Visit the documentation: modin.readthedocs.io
    Devin Petersohn [email protected]
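After installation, Modin's documentation describes usage as replacing `import pandas as pd` with `import modin.pandas as pd`. The sketch below wraps that import so it also runs where Modin is not installed. The helper names `load_dataframe_library` and `column_total` are hypothetical, and a working execution engine such as Ray is assumed whenever Modin itself is used.

```python
import importlib


def load_dataframe_library():
    """Prefer modin.pandas, fall back to pandas, else return None."""
    for name in ("modin.pandas", "pandas"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def column_total(pd):
    """The same dataframe code runs unchanged on either library."""
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    return int(df["a"].sum())


if __name__ == "__main__":
    pd = load_dataframe_library()
    if pd is None:
        print("Neither modin nor pandas is installed")
    else:
        print(pd.__name__, column_total(pd))
```

The body of `column_total` never mentions which library it received, which is the drop-in property the talk relies on.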
  26. Modin Open source project updates
    • Hundreds of daily users
    • Used in many companies, government organizations, and research labs
    • Major contributions from outside of Berkeley
      ◦ Intel - 10+ full time engineers actively contributing (daily)
      ◦ Georgia Tech - 4 students added NVIDIA GPU support (RAPIDS)
      ◦ MindsDB - added SQL language support
      ◦ GitHub contributions - XGBoost support, Functionality, enhancements
    • Most users are replacing pandas + Spark in their DS workloads with Modin