Slide 1

Slide 1 text

daskperiment Reproducibility for Humans Masaaki Horikoshi @ ARISE analytics

Slide 2

Slide 2 text

Self-introduction • Masaaki Horikoshi • ARISE analytics • Data analytics, etc. • A member of core developers: • GitHub: https://github.com/sinhrks

Slide 3

Slide 3 text

Contents • Experiment Management • daskperiment

Slide 4

Slide 4 text

Reproducibility • The closeness of the agreement between the results of measurements of the same measurand carried out with the same methodology described in the corresponding scientific evidence. (Wikipedia) • Without reproducibility: • Unable to replicate the previous experiment trial. • Fruitless investigation and collation work are required.

Slide 5

Slide 5 text

Data Analytics Process (CRISP-DM) https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining • Lots of trial and error is necessary, especially in the initial phase.

Slide 6

Slide 6 text

Data Analytics Pipeline (diagram: Data → Data preparation → Modeling → Evaluation → Deployment → Trained Model) • Varies along with trial and error • Data dependent • CACE (Change Anything, Change Everything)

Slide 7

Slide 7 text

Dependencies • Data: Data definitions, periods, target domains… • Logic: Feature engineering, ML algorithms, hyperparameters… • Environment: Host, interpreter, packages… (diagram: Environment wraps the pipeline Data → Data preparation → Modeling → Evaluation → Deployment → Trained Model)

Slide 8

Slide 8 text

Reproducibility • The closeness of the agreement between the results of measurements of the same measurand carried out with the same methodology described in the corresponding scientific evidence. (Wikipedia) • Experiment dependencies must be totally managed to guarantee reproducibility.

Slide 9

Slide 9 text

Dependency Management • These trials and errors are too "small" to fit version control systems. • Minor data modifications, feature engineering, new package trials… • Interactive trial and error sometimes leads to human errors. • Unintended variable overwrites, modified functions not properly applied…

Slide 10

Slide 10 text

Experiment Management • There are lots of tools for logging, but… • How often is it checked? • Unable to notice the loss of reproducibility. • Needs "Management", in addition to "Logging". • A tool for humans?

Slide 11

Slide 11 text

daskperiment Reproducibility for Humans https://github.com/sinhrks/daskperiment

Slide 12

Slide 12 text

Usage: Jupyter Notebook • Usable in your interpreter with minimum modification. • Naturally splits definition and execution.

Standard (define and run experiment):

def prepare_data(a, b):
    return a + b

def calculate_score(s):
    return 10 / s

a = 1
b = 2
d = prepare_data(a, b)
calculate_score(d)

daskperiment (define experiment, define parameters, define functions, then run experiment):

ex = daskperiment.Experiment(id='demo_pj')
a = ex.parameter('a')
b = ex.parameter('b')

@ex
def prepare_data(a, b):
    return a + b

@ex.result
def calculate_score(s):
    return 10 / s

d = prepare_data(a, b)
s = calculate_score(d)

ex.set_parameters(a=1, b=2)
s.compute()

Slide 13

Slide 13 text

Basic Functionality • Logging history in every experiment trial. • Notify users if required.

ex.set_parameters(a=1, b=2)
s.compute()

ex.set_parameters(a=1, b=2)
s.compute()
...[WARNING] Experiment step result is changed with the same input: (step: calculate_score, args: (7,), kwargs: {})
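The warning above can be understood with a minimal sketch (this is not daskperiment's actual implementation): remember each step's result per input, and warn when a later trial disagrees for the same input. The function name `check_step_result` and the in-memory `_history` dict are illustrative assumptions.

```python
import logging

# Hypothetical sketch: map (step, input) -> last observed result.
_history = {}

def check_step_result(step, args, result):
    """Record a step result; warn when the same input yields a new result."""
    key = (step, args)
    changed = key in _history and _history[key] != result
    if changed:
        logging.warning(
            "Experiment step result is changed with the same input: "
            "(step: %s, args: %s)", step, args)
    _history[key] = result
    return changed
```

Running the same step twice with the same input and result is silent; returning a different result for the same input triggers the warning, which is the reproducibility signal the slide describes.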

Slide 14

Slide 14 text

Internal Expressions • Experiment definitions are expressed as a computation graph. • Parameters are changed per trial.

ex = daskperiment.Experiment(id='demo_pj')
a = ex.parameter('a')
b = ex.parameter('b')

@ex
def prepare_data(a, b):
    return a + b

@ex.result
def calculate_score(s):
    return 10 / s

d = prepare_data(a, b)
s = calculate_score(d)

ex.set_parameters(a=1, b=2)
s.compute()
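The "computation graph" idea can be sketched in the style Dask itself uses internally: a plain dict mapping keys to either literal values or task tuples of `(callable, *dependency_keys)`. The tiny `get` executor here is an illustrative stand-in, not Dask's scheduler; note how the parameter entries `'a'` and `'b'` are just dict values that can be swapped per trial.

```python
from operator import add

def calculate_score(s):
    return 10 / s

def get(graph, key):
    """Tiny recursive executor for a dict-based task graph (sketch only)."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *deps = task
        return func(*(get(graph, d) for d in deps))
    return task  # literal value (e.g. a parameter)

graph = {
    'a': 1,  # parameter: replaced per trial
    'b': 2,  # parameter: replaced per trial
    'prepare_data': (add, 'a', 'b'),
    'calculate_score': (calculate_score, 'prepare_data'),
}
```

Calling `get(graph, 'calculate_score')` walks the dependencies, computes `prepare_data` first, and feeds the result into `calculate_score`, mirroring how the decorated functions above are wired together before `compute()` runs them.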

Slide 15

Slide 15 text

Demonstrations https://github.com/sinhrks/daskperiment/blob/master/notebook/quickstart.ipynb

Slide 16

Slide 16 text

Functionalities Introduced in Demo • Experiment management • Manage trial results and their (hyper)parameters • Save intermediate results and metrics • Detect code and environment changes • Validate function purity • Feed random seed • Dashboard • Command line tool (CLI)

Slide 17

Slide 17 text

Usage: CLI • The same definition can be used for CLI execution. • Parameters can be provided via the CLI.

Jupyter Notebook:

ex = daskperiment.Experiment(id='demo_pj')
a = ex.parameter('a')
b = ex.parameter('b')

@ex
def prepare_data(a, b):
    return a + b

@ex.result
def calculate_score(s):
    return 10 / s

d = prepare_data(a, b)
s = calculate_score(d)

ex.set_parameters(a=1, b=2)
s.compute()

CLI (no set_parameters / compute in the script):

ex = daskperiment.Experiment(id='demo_pj')
a = ex.parameter('a')
b = ex.parameter('b')

@ex
def prepare_data(a, b):
    return a + b

@ex.result
def calculate_score(s):
    return 10 / s

d = prepare_data(a, b)
s = calculate_score(d)

Slide 18

Slide 18 text

Note: Data Management • Can cover basic usages, but… • Detecting external data source changes. • Loading persisted data into another experiment. • Using a data versioning tool is recommended.

@ex
def load_user_data(filename):
    return pd.read_csv(filename)

@ex_modeling
def load_data(trial_id):
    return ex_preprocess.get_persisted('prepare_data', trial_id=trial_id)
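One common way a tool can notice that an external data source changed between trials (absent a real data versioning tool) is to fingerprint the raw bytes and compare hashes. This is a hedged sketch, not daskperiment's mechanism; the helper names are made up for illustration.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash of a data source, recorded once per trial (sketch)."""
    return hashlib.sha256(data).hexdigest()

def data_changed(previous_hash, data: bytes) -> bool:
    """True if `data` no longer matches the hash recorded in the last trial."""
    return previous_hash is not None and previous_hash != fingerprint(data)
```

A single changed byte in the source file flips the hash, so the next trial can be flagged as non-comparable even though the code and parameters are identical.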

Slide 19

Slide 19 text

Internal Overview (daskperiment) • Data / Environment → Experiment management DB • Computation: Dask • Data collection: Collector • Input/output: Backend • Visualization: Board

Slide 20

Slide 20 text

Dask • A flexible parallel computing library for analytics. • Mainly focused on numeric computations. • Provides data structures which are subsets of NumPy and pandas. • Blocked algorithm: a Dask DataFrame operation (e.g. Sum) runs per-partition pandas DataFrame sums, then concatenates and sums the partial results.
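The blocked-algorithm idea can be shown in plain Python without Dask: split the data into partition-sized chunks, reduce each chunk independently (these are the parallelizable tasks), then combine the partial results.

```python
def blocked_sum(values, chunksize=3):
    """Sum via a blocked algorithm: sum each chunk, then sum the partials.

    Mirrors how a Dask DataFrame sum is a tree of per-partition
    (pandas-sized) sums followed by a combining step.
    """
    chunks = [values[i:i + chunksize]
              for i in range(0, len(values), chunksize)]
    partials = [sum(c) for c in chunks]  # one task per block, parallelizable
    return sum(partials)                 # combine step ("Concat" + final "Sum")
```

Each chunk's sum is independent of the others, which is exactly what lets Dask schedule the per-partition tasks in parallel before the single combining task runs.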

Slide 21

Slide 21 text

(Incomplete) List of OSS Using Dask • TFLearn: Deep learning library featuring a higher-level API for TensorFlow. • Airflow (distributed scheduler): A platform to author, schedule and monitor workflows. • scikit-image: Image Processing SciKit. • xarray: N-D labeled arrays and datasets in Python. • RAPIDS: Executes end-to-end data science and analytics pipelines entirely on GPUs.

Slide 22

Slide 22 text

Dask Computation Graph • Task dependency analysis • Remove unnecessary tasks (nodes) • Merge identical tasks • Static/dynamic computation order analysis • Graph optimization • Inlining • Assigning dependent tasks to the same worker
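"Remove unnecessary tasks" (often called culling) can be sketched over the same dict-style task graph: keep only the keys reachable from the requested output and drop everything else. This is an illustrative reimplementation, not Dask's optimizer.

```python
from operator import add

def cull(graph, output):
    """Keep only the tasks the requested `output` actually depends on."""
    needed, stack = set(), [output]
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed.add(key)
        task = graph[key]
        if isinstance(task, tuple) and callable(task[0]):
            # Task tuples are (callable, *dependency_keys); follow the keys.
            stack.extend(d for d in task[1:] if d in graph)
    return {k: graph[k] for k in needed}

# A graph with one branch that the requested output never uses.
g = {
    'a': 1,
    'b': 2,
    'total': (add, 'a', 'b'),
    'unused': (add, 'a', 'a'),
}
```

`cull(g, 'total')` drops the `'unused'` node, so the scheduler never spends a worker on a result nobody asked for.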

Slide 23

Slide 23 text

Experiment to Computation Graph • Dask.Delayed: Dask API to build a computation graph from arbitrary functions. • Experiment definitions are expressed as a computation graph with Dask.Delayed. • Pre/post processing is inserted per experiment and trial. • Parameters are provided to the computation graph per trial. • Dask executes the computation graph in parallel (if possible).

Slide 24

Slide 24 text

Collector • Collects environment info and stores it in the experiment DB via a backend.

Category: Description
• Code: Code context
• Platform: Information collected by the "platform" module
• CPU: Information collected by the "cpuinfo" package
• Python Interpreter: Python version, venv, installed packages…
• Package detail: NumPy, SciPy, pandas, conda details
• Git: Git repository, working branch…
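A minimal sketch of a collector using only the standard library (the table mentions the third-party "cpuinfo" package; here the stdlib `platform.processor()` stands in for it, and the function name is an assumption, not daskperiment's API):

```python
import platform
import sys

def collect_environment():
    """Gather environment facts keyed by category, as in the table above."""
    return {
        'Platform': platform.platform(),      # OS / kernel / architecture
        'CPU': platform.processor(),          # stdlib stand-in for "cpuinfo"
        'Python Interpreter': sys.version,    # interpreter version string
    }
```

Storing a snapshot like this with every trial lets later trials be diffed against it, which is how an environment change (new package, new interpreter) becomes detectable.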

Slide 25

Slide 25 text

Backend • Users can specify the location to store experiment info as a Backend. • Previous experiment history is loaded if it exists. • The following backends are supported: • File • Redis • MongoDB

daskperiment.Experiment('redis_uri_backend', backend='redis://localhost:6379/0')
daskperiment.Experiment('mongo_uri_backend', backend='mongodb://localhost:27017/test_db')
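Dispatching on the URI scheme, as the `redis://` and `mongodb://` examples above suggest, can be sketched with the standard library. The `select_backend` function and the string return values are illustrative assumptions, not daskperiment's internals.

```python
from urllib.parse import urlparse

def select_backend(uri: str) -> str:
    """Pick a backend implementation from the URI scheme (sketch)."""
    scheme = urlparse(uri).scheme
    if scheme == 'redis':
        return 'RedisBackend'
    if scheme == 'mongodb':
        return 'MongoBackend'
    return 'FileBackend'  # no scheme: treat the string as a local path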

Slide 26

Slide 26 text

Remainings • Collaboration in a team • Optimize computation graph between trials • Extensibility • Fix Collector and Backend API • Support deep learning frameworks

Slide 27

Slide 27 text

Conclusions • It is difficult to guarantee reproducibility in data analytics. • Needs “Management”, in addition to “Logging”. • Built a tool named “daskperiment” • Track and detect anything unexpected. • Experiment steps are automatically parallelized.