Scaling Interactive Data Science with Modin and Ray (Devin Petersohn, UC Berkeley)

Slide 1

Slide 1 text

Scaling Interactive Data Science with Modin and Ray Devin Petersohn

Slide 2

Slide 2 text

2 About me
● Retired U.S. Marine
  ○ Korean Crypto-Linguist
  ○ 3d Radio BN (2008-2012)
● BS - University of Missouri (2016)
● MS - UC Berkeley (2018)
● 6th year PhD student at UC Berkeley (2021)
  ○ Anthony Joseph
  ○ Started with work in Genomics
  ○ Learned how to communicate with Scientists
  ○ Started the dataframe effort in RISELab

Slide 3

Slide 3 text

3 Data Science problems ...but real problems

Slide 4

Slide 4 text

4 Data Science has a scalability problem

Slide 5

Slide 5 text

5 Data Science has a scalability problem MOAR MACHINES!!!!1!!

Slide 6

Slide 6 text

6 Data Science has a scalability problem

Slide 7

Slide 7 text

7 Data Science has a scalability problem

Slide 8

Slide 8 text

8 Data Science has a scalability problem MORE DATA SCIENTISTS != MORE INSIGHTS

Slide 9

Slide 9 text

9 Data Science has a scalability problem MORE DATA SCIENTISTS != MORE INSIGHTS MORE DATA SCIENTISTS != MORE PRODUCTION MODELS

Slide 10

Slide 10 text

10 Many organizations look like this
[Diagram labels: environments Laptop/Workstation, Small Cluster, Large Cluster; stages Prototyping, Testing, Exploring, Production; transitions New Data Source, New spec, New requirements, Feedback, Rewrite]

Slide 11

Slide 11 text

11 Many organizations look like this
[Diagram labels: environments Laptop/Workstation, Small Cluster, Large Cluster; stages Prototyping, Testing, Exploring, Production; transitions New Data Source, New spec, New requirements, Feedback, Rewrite]

Slide 12

Slide 12 text

12 Many organizations look like this
[Diagram labels: environments Laptop/Workstation, Small Cluster, Large Cluster; stages Prototyping, Testing, Exploring, Production; transitions New Data Source, New spec, New requirements, Feedback, Rewrite]

Slide 13

Slide 13 text

13 Many organizations look like this
[Diagram labels: environments Laptop/Workstation, Small Cluster, Large Cluster; stages Prototyping, Testing, Exploring, Production; transitions New Data Source, New spec, New requirements, Feedback, Rewrite]

Slide 14

Slide 14 text

Slide 15

Slide 15 text

15 Data Science scalability is a human scalability issue
Human time is the limit here, not the scalability of individual components.

Slide 16

Slide 16 text

16 Shifting the focus from the machine to the user
Tools should work for data scientists. Data scientists shouldn’t have to work for their tools.

Slide 17

Slide 17 text

17 My work: Transparently scale existing systems
Abstract away all of the components of the system that data scientists don’t care about; expose only the details they do care about.

Slide 18

Slide 18 text

18 My work: Transparently scale existing systems
Abstract away all of the components of the system that data scientists don’t care about; expose only the details they do care about.

Slide 19

Slide 19 text

19 First Steps: Formalize the Dataframe

Slide 20

Slide 20 text

20 Dataframe formal definition
Let $\Sigma^*$ be the set of finite strings over an alphabet $\Sigma$. Let $Dom$ be a finite set of domains $\{dom_1, dom_2, \ldots\}$. Let each $dom_i \in Dom$ have a mapping $p_i : \Sigma^* \to dom_i$. A dataframe is a tuple $(A_{mn}, R_m, C_n, D_n)$, where $A_{mn}$ is an arrangement of entries in $m$ rows and $n$ columns from the domain $\Sigma^*$, $R_m$ is a vector of row labels from $\Sigma^*$, $C_n$ is a vector of column labels from $\Sigma^*$, and $D_n$ is a vector of $n$ domains from $Dom$, one per column, each of which can also be left unspecified. We call $D_n$ the schema of the dataframe. If any of the $n$ entries within $D_n$ is left unspecified, then that domain can be induced by applying a schema induction function $S(\cdot)$ to the corresponding column of $A_{mn}$. The schema induction function $S : (\Sigma^*)^m \to Dom$ assigns an arrangement of $m$ strings to a domain in $Dom$.
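The definition above can be made concrete with a small sketch. This is only an illustration, not Modin's internal representation: the three-domain set (`int`, `float`, `str`), the `PARSERS` table, and the "first parser that accepts every string wins" induction rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical domain set Dom, each with a parsing function p_i: Sigma* -> dom_i.
# Order matters: narrower domains are tried first.
PARSERS: Dict[str, Callable[[str], object]] = {
    "int": int,
    "float": float,
    "str": str,
}


def induce_domain(column: List[str]) -> str:
    """Schema induction S: map a column of m strings to one domain in Dom."""
    for name, parse in PARSERS.items():
        try:
            for value in column:
                parse(value)
            return name  # every string in the column parsed in this domain
        except ValueError:
            continue
    return "str"


@dataclass
class Dataframe:
    A: List[List[str]]      # m x n arrangement of raw string entries
    R: List[str]            # m row labels
    C: List[str]            # n column labels
    D: List[Optional[str]]  # n domains; None means "left unspecified"

    def schema(self) -> List[str]:
        """Return D with every unspecified entry filled in by induction."""
        columns = [list(col) for col in zip(*self.A)]
        return [
            d if d is not None else induce_domain(col)
            for d, col in zip(self.D, columns)
        ]


df = Dataframe(
    A=[["1", "2.5", "x"], ["2", "3", "y"]],
    R=["r0", "r1"],
    C=["a", "b", "c"],
    D=[None, None, "str"],
)
print(df.schema())  # ['int', 'float', 'str']
```

Keeping the entries as strings and deferring parsing mirrors the definition, where the schema may be left unspecified and induced only when needed.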

Slide 21

Slide 21 text

Dataframe algebraic operators
● mask
● filter_by_types
● map
● filter
● explode
● reduce
● window
● groupby
● infer_types
● join
● concat
● transpose
● to_labels
● from_labels
● sort_by
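Three of these operators have no direct relational counterpart, so a sketch may help. The slide does not spell out their semantics; the code below assumes, from the names, that `transpose` swaps rows and columns along with their labels, `to_labels` promotes a data column to the row labels, and `from_labels` moves the row labels back into a data column. The `(A, R, C)` triple is a simplified stand-in for a dataframe, not Modin's API.

```python
from typing import List, Tuple

# Simplified dataframe: (entries A, row labels R, column labels C).
Frame = Tuple[List[List[str]], List[str], List[str]]


def transpose(frame: Frame) -> Frame:
    """Swap the two axes; row labels become column labels and vice versa."""
    A, R, C = frame
    return [list(col) for col in zip(*A)], list(C), list(R)


def to_labels(frame: Frame, column: str) -> Frame:
    """Remove one data column and use its values as the new row labels."""
    A, R, C = frame
    j = C.index(column)
    new_R = [row[j] for row in A]
    new_A = [row[:j] + row[j + 1:] for row in A]
    return new_A, new_R, C[:j] + C[j + 1:]


def from_labels(frame: Frame, column: str) -> Frame:
    """Insert the row labels as the first data column; reset labels to positions."""
    A, R, C = frame
    new_A = [[label] + row for label, row in zip(R, A)]
    new_R = [str(i) for i in range(len(A))]
    return new_A, new_R, [column] + C


frame: Frame = ([["1", "x"], ["2", "y"]], ["r0", "r1"], ["id", "v"])
labeled = to_labels(frame, "id")
print(labeled)                     # ([['x'], ['y']], ['1', '2'], ['v'])
print(from_labels(labeled, "id"))  # ([['1', 'x'], ['2', 'y']], ['0', '1'], ['id', 'v'])
print(transpose(frame))            # ([['1', '2'], ['x', 'y']], ['id', 'v'], ['r0', 'r1'])
```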

Slide 22

Slide 22 text

Proof by exhaustion that all pandas APIs are covered

Slide 23

Slide 23 text

Decomposition Rules -> Formalize parallelism
● Cell-wise: An operator can be applied to a “unit dataframe” independently
● Row-wise: An operator can be applied to each row independently
● Column-wise: An operator can be applied to each column independently
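The three rules can be sketched with the standard library. This is a toy illustration under stated assumptions, not how Modin schedules work: the table is a plain list of lists, and a thread pool stands in for a distributed engine such as Ray. The point is only which unit of data each task receives.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

table = [[1, 2, 3],
         [4, 5, 6]]


def cell_wise(fn, rows):
    """Each cell is an independent task (e.g. an elementwise map)."""
    with ThreadPoolExecutor() as pool:
        return [list(pool.map(fn, row)) for row in rows]


def row_wise(fn, rows):
    """Each row is an independent task (e.g. a per-row reduction)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, rows))


def column_wise(fn, rows):
    """Each column is an independent task (e.g. a per-column median)."""
    columns = [list(col) for col in zip(*rows)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, columns))


print(cell_wise(lambda x: x * 10, table))      # [[10, 20, 30], [40, 50, 60]]
print(row_wise(sum, table))                    # [6, 15]
print(column_wise(statistics.median, table))   # [2.5, 3.5, 4.5]
```

In the column-wise case the number of independent tasks equals the number of columns, so a per-column median cannot use more workers than there are columns.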

Slide 24

Slide 24 text

24 Formalization of other components of the dataframe
● Order Semantics
● Type System

Slide 25

Slide 25 text

25 So now we know:
● What a dataframe is (formally)
● What operators a dataframe supports
● How these operators map back to pandas
● How to handle dataframe types
● How to decouple logical and physical order

Slide 26

Slide 26 text

26 Modin: A dataframe implementation grounded in theory

Slide 27

Slide 27 text

27 Modin: A dataframe implementation grounded in theory Pluggable APIs

Slide 28

Slide 28 text

28 Modin: A dataframe implementation grounded in theory Common Execution Specification

Slide 29

Slide 29 text

29 Modin: A dataframe implementation grounded in theory Native implementations for multiple environments

Slide 30

Slide 30 text

30 How Modin is being used in practice:
[Diagram labels: environments Laptop/Workstation, Small Cluster, Large Cluster; stages Prototyping, Testing, Exploring, Production; transitions New Data Source, New spec, New requirements, Feedback, Rewrite]

Slide 31

Slide 31 text

31 Common Question: How is Modin different from _____?

Slide 32

Slide 32 text

32 Modin vs Dask vs Koalas - Functional Evaluation

Slide 33

Slide 33 text

33 Dask - Functional Evaluation
● Row-store (cannot support most columnar operations)
● Sorts the dataframe implicitly based on row labels (index)
● No support for MultiIndex
● Supports most relational database-style queries (join, groupby)
● Fairly rigid partitioning, more flexible than Spark

Slide 34

Slide 34 text

34 Koalas - Functional Evaluation
● Row-store (cannot support most columnar operations)
● Only supports dataframe operators that directly translate to ANSI SQL
● Supports only relational database-style queries (join, groupby)
● Missing support for categorical data
● Fairly rigid partitioning

Slide 35

Slide 35 text

35 Modin - Functional Evaluation
● Supports operators on both axes (columns and rows)
● Not all translations to low-level algebra finished
● Not all low-level algebraic operators fully optimized (uncounted)
● Fall back to pandas in case something is not yet supported

Slide 36

Slide 36 text

36 Modin is the only distributed system capable of handling all dataframe operations (Because we derived a theoretical basis for the implementation)

Slide 37

Slide 37 text

37 Performance (after tuning Spark and Dask)
● NYC Taxi data
● 20 columns, 100M rows
● Reported in log-log scale
● Neither Spark nor Dask could run queries with default parameters
  ○ Modified partitioning with various layouts and reported best times
● Modin was run with defaults
  ○ Native Ray implementation

Slide 38

Slide 38 text

38 Performance (after tuning Spark and Dask)
● NYC Taxi data
● 20 columns, 100M rows
● Reported in log-log scale
● Neither Spark nor Dask could run queries with default parameters
  ○ Modified partitioning with various layouts and reported best times
● Modin was run with defaults
100x

Slide 39

Slide 39 text

39 Performance of calculating median for every column
Parallelism limited by number of columns (recall the formalism)
5.6x

Slide 40

Slide 40 text

40 Modin is modular, so as new components are developed, we can incorporate them.
Performance can only get better.

Slide 41

Slide 41 text

41 BYODP (Bring your own data processor)
● Proof of concept implemented with OmniSciDB
  ○ Compiles Modin’s Dataframe DSL to Apache Calcite
  ○ Execution is handed off to OmniSciDB
  ○ Supports the parts of the pandas API that the OmniSci implementation supports
Do you have a data processing system you wish would expose the pandas API? Let us know!

Slide 42

Slide 42 text

• Open source at github.com/modin-project/modin
• Install with pip install modin
• Email the developer mailing list: [email protected]
• Visit the documentation: modin.readthedocs.io
Thank you! Devin Petersohn [email protected]
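For readers following the install step above, a minimal sketch of typical usage: Modin is used by changing the pandas import to `modin.pandas` and leaving the rest of the code as is. The toy data, the `total_fare_by_vendor` helper, and the fallback to plain pandas when Modin is not installed are assumptions made for this example only.

```python
# Modin is intended as a drop-in replacement: only the import line changes.
try:
    import modin.pandas as pd  # needs `pip install modin` plus an engine such as Ray
except ImportError:
    try:
        import pandas as pd    # assumption for this sketch: use pandas if Modin is absent
    except ImportError:
        pd = None              # neither library is available


def total_fare_by_vendor(frame_module):
    """Hypothetical helper: same code path whether the module is pandas or Modin."""
    df = frame_module.DataFrame(
        {"vendor": ["a", "b", "a"], "fare": [10.0, 7.5, 12.5]}
    )
    return df.groupby("vendor")["fare"].sum().to_dict()


if __name__ == "__main__" and pd is not None:
    print(total_fare_by_vendor(pd))
```

Passing the module as a parameter keeps the example neutral about which library is installed; in real code the single import line is normally the only change.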

Slide 43

Slide 43 text

43 Modin Open source project updates
● Hundreds of daily users
● Used in many companies, government organizations, and research labs
● Major contributions from outside of Berkeley
  ○ Intel - 10+ full time engineers actively contributing (daily)
  ○ Georgia Tech - 4 students added NVIDIA GPU support (RAPIDS)
  ○ MindsDB - added SQL language support
  ○ GitHub contributions - XGBoost support, Functionality, enhancements
● Most users are replacing pandas + Spark in their DS workloads with Modin