Reproducibility • The closeness of the agreement between the results of measurements of the same measurand carried out with the same methodology described in the corresponding scientific evidence. (Wikipedia) • Without reproducibility: • Unable to replicate previous experiment trials. • Fruitless investigation and collation work is required.
Data Analytics Process https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining • Lots of trial and error is necessary, especially in the initial phases. [Figure: CRISP-DM process diagram]
Data Analytics Pipeline [Figure: Data → Data preparation → Modeling → Evaluation → Deployment → Trained Model] • Varies along with trial and error • Data dependent • CACE (Change Anything, Change Everything)
Reproducibility • The closeness of the agreement between the results of measurements of the same measurand carried out with the same methodology described in the corresponding scientific evidence. (Wikipedia) • Experiment dependencies must be totally managed to guarantee reproducibility.
Dependency Management • These trials and errors are too "small" to fit version control systems. • Minor data modifications, feature engineering, new package trials… • Interactive trial and error sometimes leads to human errors. • Unintended variable overwrites, a modified function not properly applied…
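The "unintended variable overwrite" hazard above is easy to reproduce in any interactive session. A minimal illustrative snippet (not from the slides' code) showing how a function can be silently clobbered:

```python
def score(x):
    return x * 2

result = score(10)   # 20
score = result       # oops: the function name is silently overwritten by an int
# score(5) would now raise TypeError: 'int' object is not callable
```

In a long notebook session, errors like this go unnoticed until a later cell fails, or worse, until a stale value quietly poisons downstream results.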
Experiment Management • There are lots of tools for logging, but… • How often are the logs checked? • Unable to notice the loss of reproducibility. • Needs "Management", in addition to "Logging". • A tool for humans?
Usage: Jupyter Notebook • Usable in your interpreter with minimum modification.

Standard:

    def prepare_data(a, b):
        return a + b

    def calculate_score(s):
        return 10 / s

    # Define and run experiment
    a = 1
    b = 2
    d = prepare_data(a, b)
    calculate_score(d)

daskperiment:

    # Define experiment
    ex = daskperiment.Experiment(id='demo_pj')

    # Define parameters
    a = ex.parameter('a')
    b = ex.parameter('b')

    # Define functions
    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    # Run experiment
    ex.set_parameters(a=1, b=2)
    s.compute()

• Naturally splits definition and execution.
Basic Functionality • Logging history in every experiment trial. • Notify users if required.

    ex.set_parameters(a=1, b=2)
    s.compute()

    ex.set_parameters(a=1, b=2)
    s.compute()
    ...[WARNING] Experiment step result is changed with the same input: (step: calculate_score, args: (7,), kwargs: {})
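The warning above fires when the same step, fed the same inputs, produces a different result across trials, i.e. the step is not pure. A simplified sketch of how such a check could work (illustrative only; `ResultTracker` is a hypothetical name, not daskperiment's internals):

```python
import warnings

class ResultTracker:
    """Record each step's result per input, and warn when a rerun
    with identical inputs yields a different result (an impure step)."""

    def __init__(self):
        self._seen = {}

    def check(self, step, args, result):
        key = (step, args)
        if key in self._seen and self._seen[key] != result:
            warnings.warn("Experiment step result is changed with the "
                          f"same input: (step: {step}, args: {args})")
        self._seen[key] = result

tracker = ResultTracker()
tracker.check('calculate_score', (7,), 10 / 7)   # first trial: recorded
tracker.check('calculate_score', (7,), 10 / 7)   # same result: silent
```

A third call with the same `(7,)` input but a different result would trigger the warning, surfacing the lost reproducibility instead of merely logging it.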
Internal Expressions • Experiment definitions are expressed as a computation graph.

daskperiment:

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    ex.set_parameters(a=1, b=2)
    s.compute()

• Parameters are changed per trial.
Functionalities Introduced in Demo • Experiment management • Manage trial results and their (hyper)parameters • Save intermediate results and metrics • Detect code and environment changes • Validate function purity • Feed random seed • Dashboard • Command line tool (CLI)
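"Feed random seed" matters because any step that draws random numbers is irreproducible unless the seed is fixed per trial. A minimal sketch of the idea using the standard library (not daskperiment's API):

```python
import random

def run_trial(seed):
    random.seed(seed)                       # feed a fixed seed per trial
    return [random.random() for _ in range(3)]

# With the seed recorded alongside the trial, reruns are reproducible.
assert run_trial(42) == run_trial(42)
```

Recording the seed with the rest of the trial's parameters turns a nondeterministic step back into a replayable one.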
Usage: CLI • The same definition can be used for CLI execution.

Jupyter Notebook:

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    ex.set_parameters(a=1, b=2)
    s.compute()

CLI:

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

• Parameters can be provided via CLI.
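To make the idea of "parameters provided via CLI" concrete, here is a generic sketch using the standard library's `argparse`; the flag names and parsing style are illustrative assumptions, not daskperiment's actual command-line syntax:

```python
import argparse

# Hypothetical sketch: feeding the parameters 'a' and 'b' from the
# command line instead of calling ex.set_parameters(a=1, b=2).
parser = argparse.ArgumentParser()
parser.add_argument('--a', type=int, required=True)
parser.add_argument('--b', type=int, required=True)

# In a real script this would be parser.parse_args() over sys.argv[1:].
args = parser.parse_args(['--a', '1', '--b', '2'])
params = vars(args)   # {'a': 1, 'b': 2}
```

Because the experiment definition never hard-codes values, the same script runs unchanged in a notebook (with `set_parameters`) or from a shell (with flags).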
Note: Data Management • Can cover basic usages, but… • Detect external data source changes. • Load persisted data into another experiment. • Using a data versioning tool is recommended.

    @ex
    def load_user_data(filename):
        return pd.read_csv(filename)

    @ex_modeling
    def load_data(trial_id):
        return ex_preprocess.get_persisted('prepare_data', trial_id=trial_id)
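Detecting an external data source change typically reduces to fingerprinting the file's content and comparing across trials. A minimal standard-library sketch of that idea (illustrative, not daskperiment's implementation):

```python
import hashlib

def fingerprint(path):
    """Content hash of an external data file; if the hash recorded in
    one trial differs from the next, the data source has changed."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()
```

Storing the fingerprint alongside the trial's parameters is enough to flag silent upstream edits, though a dedicated data versioning tool also lets you recover the old data, not just detect that it changed.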
Dask • A flexible parallel computing library for analytics. • Mainly focused on numeric computations. • Provides data structures which are subsets of NumPy and pandas. [Figure: Blocked algorithm — a Dask DataFrame is split into pandas DataFrame blocks; per-block Sums are Concat-ed, then Sum-ed into the final result]
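The blocked algorithm in the figure can be sketched in a few lines of plain Python, assuming a simple sum reduction (Dask does the same over pandas/NumPy blocks, in parallel):

```python
def blocked_sum(values, blocksize=3):
    """Blocked algorithm: split the input into blocks, reduce each
    block independently, then reduce the partial results."""
    blocks = [values[i:i + blocksize]
              for i in range(0, len(values), blocksize)]
    partials = [sum(b) for b in blocks]   # per-block Sum (parallelizable)
    return sum(partials)                  # Concat + final Sum
```

Because each block's reduction is independent, the per-block step can be scheduled across cores or machines, which is exactly what makes the Dask collections scale beyond memory.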
(Incomplete) List of OSS Using Dask • Airflow (Dask distributed scheduler): a platform to author, schedule and monitor workflows. • TFLearn: deep learning library featuring a higher-level API for TensorFlow. • scikit-image: Image Processing SciKit. • xarray: N-D labeled arrays and datasets in Python. • RAPIDS: executes end-to-end data science and analytics pipelines entirely on GPUs.
Experiment to Computation Graph • Dask.Delayed: a Dask API to build computation graphs from arbitrary functions. • Experiment definitions are expressed as a computation graph with Dask.Delayed. • Pre/post processing is inserted per experiment and trial. • Parameters are provided to the computation graph per trial. • Dask executes the computation graph in parallel (if possible).
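The core idea of Dask.Delayed — calling a wrapped function records a graph node instead of executing it, and `compute()` later walks the graph — can be shown with a toy reimplementation. This is a deliberately simplified sketch, not the real Dask API or daskperiment's internals:

```python
class Delayed:
    """Toy graph node: holds a function and its (possibly delayed) args."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def compute(self):
        # Recursively evaluate delayed arguments, then apply the function.
        args = [a.compute() if isinstance(a, Delayed) else a
                for a in self.args]
        return self.func(*args)

def delayed(func):
    """Wrap a function so calling it builds a graph node instead of running."""
    return lambda *args: Delayed(func, args)

@delayed
def prepare_data(a, b):
    return a + b

@delayed
def calculate_score(s):
    return 10 / s

s = calculate_score(prepare_data(1, 2))   # graph built; nothing has run yet
```

Only `s.compute()` triggers execution, which is what lets a scheduler inspect the whole graph first, inject per-trial parameters, and run independent branches in parallel.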
Backend • Users can specify the location to store experiment info as a Backend. • Loads previous experiment history if it exists. • The following backends are supported: • File • Redis • MongoDB

    daskperiment.Experiment('redis_uri_backend', backend='redis://localhost:6379/0')
    daskperiment.Experiment('mongo_uri_backend', backend='mongodb://localhost:27017/test_db')
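Dispatching on the URI scheme is the natural way to pick a backend from strings like the ones above. An illustrative sketch using the standard library (the `choose_backend` function and backend names are assumptions, not daskperiment's actual internals):

```python
from urllib.parse import urlparse

def choose_backend(uri):
    """Pick a backend from the scheme of the user-supplied URI;
    anything without a recognized scheme is treated as a file path."""
    scheme = urlparse(uri).scheme
    backends = {'redis': 'RedisBackend', 'mongodb': 'MongoBackend'}
    return backends.get(scheme, 'FileBackend')
```

This keeps the user-facing API to a single `backend=` string while leaving room to add new storage schemes without changing call sites.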
Remaining Work • Collaboration in a team • Optimizing the computation graph between trials • Extensibility • Fixing the Collector and Backend APIs • Supporting deep learning frameworks
Conclusions • It is difficult to guarantee reproducibility in data analytics. • Needs “Management”, in addition to “Logging”. • Built a tool named “daskperiment” • Track and detect anything unexpected. • Experiment steps are automatically parallelized.