Scaling Interactive Data Science with Modin and Ray (Devin Petersohn, UC Berkeley)

Scaling Interactive Data Science with Modin and Ray Devin Petersohn

2 About me
• Retired U.S. Marine
◦ Korean Crypto-Linguist
◦ 3d Radio BN (2008-2012)
• BS - University of Missouri (2016)
• MS - UC Berkeley (2018)
• 6th year PhD student at UC Berkeley (2021)
◦ Anthony Joseph
◦ Started with work in Genomics
◦ Learned how to communicate with Scientists
◦ Started the dataframe effort in RISELab

3 Data Science problems ...but real problems

4 Data Science has a scalability problem

5 Data Science has a scalability problem MOAR MACHINES!!!!1!!

8 Data Science has a scalability problem
MORE DATA SCIENTISTS != MORE INSIGHTS

9 Data Science has a scalability problem
MORE DATA SCIENTISTS != MORE INSIGHTS
MORE DATA SCIENTISTS != MORE PRODUCTION MODELS

10 Many organizations look like this
[Diagram. Environments: Laptop/Workstation, Small Cluster, Large Cluster. Stages: Prototyping, Testing, Exploring, Production. Other labels: New Data Source, New spec, New requirements, Feedback, Rewrite]

15 Data Science scalability is a human scalability issue
Human time is the limit here, not the scalability of individual components

16 Shifting the focus from the machine to the user
Tools should work for data scientists. Data scientists shouldn’t have to work for their tools.

17 My work: Transparently scale existing systems
Abstract away all of the components of the system that data scientists don’t care about; only expose details they do care about.


19 First Steps: Formalize the Dataframe

20 Dataframe formal definition
Let $\Sigma$ be an alphabet and $\Sigma^*$ the set of finite strings over $\Sigma$. Let $Dom$ be a finite set of domains $\{dom_1, dom_2, \ldots\}$. Let each $dom_i \in Dom$ have a mapping $p_i : \Sigma^* \to dom_i$. A dataframe is a tuple $(A_{mn}, R_m, C_n, D_n)$, where $A_{mn}$ is an arrangement of entries in $m$ rows and $n$ columns from the domain $\Sigma^*$, $R_m$ is a vector of $m$ row labels from $\Sigma^*$, $C_n$ is a vector of $n$ column labels from $\Sigma^*$, and $D_n$ is a vector of $n$ domains from the finite set of domains $Dom$, one per column, each of which can also be left unspecified. We call $D_n$ the schema of the dataframe. If any of the $n$ entries within $D_n$ is left unspecified, then that domain can be induced by applying a schema induction function $S(\cdot)$ to the corresponding column of $A_{mn}$. The schema induction function $S : (\Sigma^*)^m \to Dom$ assigns an arrangement of $m$ strings to a domain in $Dom$.
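The definition above can be made concrete with a small sketch. This is only an illustration of the tuple and of schema induction, not Modin's implementation: the class name `Dataframe`, the `PARSERS` table, and the three domains `int`, `float`, and `str` are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Dom with its parsing functions p_i : string -> dom_i (illustrative choice of domains).
PARSERS: Dict[str, Callable[[str], object]] = {"int": int, "float": float, "str": str}


def induce_domain(column: List[str]) -> str:
    """Schema induction S: map a column of m strings to one domain in Dom."""
    for name in ("int", "float"):
        parse = PARSERS[name]
        try:
            for entry in column:
                parse(entry)
        except ValueError:
            continue  # some entry does not parse, try the next domain
        return name
    return "str"  # every string is a valid member of the string domain


@dataclass
class Dataframe:
    A: List[List[str]]      # m x n arrangement of string entries
    R: List[str]            # m row labels
    C: List[str]            # n column labels
    D: List[Optional[str]]  # n domains; None means "left unspecified"

    def column(self, j: int) -> List[str]:
        return [row[j] for row in self.A]

    def schema(self) -> List[str]:
        # Unspecified domains are induced from the corresponding column of A.
        return [d if d is not None else induce_domain(self.column(j))
                for j, d in enumerate(self.D)]

    def parsed_column(self, j: int) -> List[object]:
        parse = PARSERS[self.schema()[j]]
        return [parse(entry) for entry in self.column(j)]


df = Dataframe(
    A=[["1", "2.5", "x"], ["3", "4.0", "y"]],
    R=["r0", "r1"],
    C=["a", "b", "c"],
    D=[None, None, "str"],
)
print(df.schema())
print(df.parsed_column(1))
```

Here the first two domains are left unspecified and are induced from the data, while the third is declared up front.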

Dataframe algebraic operators
• mask • filter_by_types • map • filter • explode • reduce • window • groupby • infer_types • join • concat • transpose • to_labels • from_labels • sort_by
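To give a feel for how some of these operators relate to everyday pandas calls, here is a rough sketch. The pairing of operator names with pandas methods in the comments is an assumption made for illustration (the slides do not spell out the mapping), and the toy columns `city`, `fare`, and `tags` are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "SF", "NYC"],
    "fare": [10.0, 20.0, 30.0],
    "tags": [["a", "b"], ["c"], []],
})

mapped = df["fare"].map(lambda x: x * 2)      # map: apply a function to each entry
filtered = df[df["fare"] > 15]                # filter: keep rows matching a predicate
numeric = df.select_dtypes(include="number")  # filter_by_types: keep columns by type
exploded = df.explode("tags")                 # explode: one row per list element
grouped = df.groupby("city")["fare"].sum()    # groupby: aggregate within groups
labeled = df.set_index("city")                # to_labels: move a data column into the row labels
unlabeled = labeled.reset_index()             # from_labels: move the row labels back into the data
transposed = numeric.T                        # transpose: swap rows and columns
```

Note that `to_labels` and `from_labels` treat row labels as data that can move in and out of the frame, which is one of the places where dataframes differ from relational tables.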

Proof by exhaustion that all pandas APIs are covered

Decomposition Rules -> Formalize parallelism
Cell-wise: An operator can be applied to each “unit dataframe” independently
Row-wise: An operator can be applied to each row independently
Column-wise: An operator can be applied to each column independently
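The three rules can be sketched on a plain list-of-lists frame. This is a toy illustration under the assumption that each independent unit becomes one task; the helper names `cell_wise`, `row_wise`, and `column_wise` are hypothetical and are not Modin APIs, and threads are used only to show the task structure.

```python
from concurrent.futures import ThreadPoolExecutor

frame = [[1, 2, 3],
         [4, 5, 6]]  # 2 rows x 3 columns


def cell_wise(op, frame):
    # Every cell is independent, so any partitioning of the frame is valid.
    # Done sequentially here for brevity.
    return [[op(x) for x in row] for row in frame]


def row_wise(op, frame):
    # One task per row: the operator needs a whole row, but rows are independent.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(op, frame))


def column_wise(op, frame):
    # One task per column: the operator needs a whole column at once.
    columns = [list(col) for col in zip(*frame)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(op, columns))


doubled = cell_wise(lambda x: x * 2, frame)
row_sums = row_wise(sum, frame)
column_maxes = column_wise(max, frame)
```

The rule an operator falls under determines how the frame may be partitioned without changing the result.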

24 Formalization of other components of the dataframe
Order Semantics
Type System

25 So now we know:
• What a dataframe is (formally)
• What operators a dataframe supports
• How these operators map back to pandas
• How to handle dataframe types
• How to decouple logical and physical order

26 Modin: A dataframe implementation grounded in theory

27 Modin: A dataframe implementation grounded in theory
Pluggable APIs

28 Modin: A dataframe implementation grounded in theory
Common Execution Specification

29 Modin: A dataframe implementation grounded in theory
Native implementations for multiple environments

30 How Modin is being used in practice:
[Diagram. Environments: Laptop/Workstation, Small Cluster, Large Cluster. Stages: Prototyping, Testing, Exploring, Production. Other labels: New Data Source, New spec, New requirements, Feedback, Rewrite]

31 Common Question: How is Modin different from _____?

32 Modin vs Dask vs Koalas - Functional Evaluation

33 Dask - Functional Evaluation
• Row-store (cannot support most columnar operations)
• Sorts the dataframe implicitly based on row labels (index)
• No support for MultiIndex
• Supports most relational database-style queries (join, groupby)
• Fairly rigid partitioning, more flexible than Spark

34 Koalas - Functional Evaluation
• Row-store (cannot support most columnar operations)
• Only supports dataframe operators that directly translate to ANSI SQL
• Supports only relational database-style queries (join, groupby)
• Missing support for categorical data
• Fairly rigid partitioning

35 Modin - Functional Evaluation
• Supports operators on both axes (columns and rows)
• Not all translations to low-level algebra finished
• Not all low-level algebraic operators fully optimized (uncounted)
• Falls back to pandas when something is not yet supported

36 Modin is the only distributed system capable of handling all dataframe operations
(because we derived a theoretical basis for the implementation)

37 Performance (after tuning Spark and Dask)
• NYC Taxi data
• 20 columns, 100M rows
• Reported in log-log scale
• Neither Spark nor Dask could run queries with default parameters
◦ Modified partitioning with various layouts and reported best times
• Modin was run with defaults
◦ Native Ray impl.

38 Performance (after tuning Spark and Dask)
• NYC Taxi data
• 20 columns, 100M rows
• Reported in log-log scale
• Neither Spark nor Dask could run queries with default parameters
◦ Modified partitioning with various layouts and reported best times
• Modin was run with defaults
100x

39 Performance of calculating median for every column
Parallelism limited by number of columns (Recall formalism)
5.6x
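Why the median is capped by the column count can be seen in a small sketch: a median needs the whole column, so under the column-wise rule there is at most one task per column no matter how many rows or workers are available. The function name `column_medians` and the `max_workers` parameter are hypothetical, and Python threads are used here only to show the task structure, not to measure speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def column_medians(rows, max_workers=8):
    """Return per-column medians and the number of workers that can be kept busy."""
    columns = list(zip(*rows))
    # Extra workers beyond the number of columns would have nothing to do.
    useful_workers = max(1, min(max_workers, len(columns)))
    with ThreadPoolExecutor(max_workers=useful_workers) as pool:
        medians = list(pool.map(median, columns))
    return medians, useful_workers


rows = [[1, 10], [3, 30], [2, 20]]  # 3 rows x 2 columns
medians, workers = column_medians(rows, max_workers=8)
print(medians, workers)
```

With 2 columns, only 2 of the 8 requested workers are useful, regardless of how many rows are added.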

40 Modin is modular, so as new components are developed, we can incorporate them
Performance can only get better

41 BYODP (Bring your own data processor)
• Proof of concept implemented with OmnisciDB
◦ Compiles Modin’s Dataframe DSL to Apache Calcite
◦ Execution is handed off to OmnisciDB
◦ Supports the parts of the pandas API that the Omnisci implementation supports
Do you have a data processing system you wish would expose the pandas API? Let us know!

• Open source at github.com/modin-project/modin
• Install with pip install modin
• Email the developer mailing list: [email protected]
• Visit the documentation: modin.readthedocs.io
Thank you! Devin Petersohn [email protected]
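After installing, existing pandas code is typically switched over by changing the import line to `modin.pandas`. The sketch below is a minimal illustration: the toy columns `passenger_count` and `fare` are made up, the fallback to plain pandas is added only so the snippet runs where Modin is not installed, and it assumes an execution engine such as Ray is available when Modin is used.

```python
try:
    import modin.pandas as pd  # same API surface as pandas, executed by Modin
except ImportError:
    import pandas as pd  # fallback so this sketch still runs without Modin

# Hypothetical toy data; the rest of the code is ordinary pandas.
df = pd.DataFrame({
    "passenger_count": [1, 2, 1, 3],
    "fare": [7.5, 12.0, 5.0, 20.0],
})

mean_fare = df.groupby("passenger_count")["fare"].mean()
print(mean_fare)
```

Nothing after the import needs to change, which is the point of the "transparently scale existing systems" goal described earlier.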

43 Modin Open source project updates
• Hundreds of daily users
• Used in many companies, government organizations, and research labs
• Major contributions from outside of Berkeley
◦ Intel - 10+ full time engineers actively contributing (daily)
◦ Georgia Tech - 4 students added NVIDIA GPU support (RAPIDS)
◦ MindsDB - added SQL language support
◦ GitHub contributions - XGBoost support, Functionality, enhancements
• Most users are replacing pandas + Spark in their DS workloads with Modin
