Reproducibility • The closeness of the agreement between the results of measurements of the same measurand carried out with the same methodology described in the corresponding scientific evidence. (Wikipedia) • Without reproducibility: • Unable to replicate previous experiment trials. • Fruitless investigation and collation work is required.
Data Analytics Process https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining • Lots of trial and error is necessary, especially in the initial phases. [Figure: CRISP-DM process diagram]
Data Analytics Pipeline [Figure: Data → Data preparation → Modeling → Evaluation → Deployment → Trained Model] • Varies along with trial and error • Data dependent • CACE (Change Anything, Change Everything)
Reproducibility • The closeness of the agreement between the results of measurements of the same measurand carried out with the same methodology described in the corresponding scientific evidence. (Wikipedia) • Experiment dependencies must be totally managed to guarantee reproducibility.
Dependency Management • These trials and errors are too "small" to fit version control systems. • Minor data modifications, feature engineering, new package trials… • Interactive trial and error sometimes leads to human errors. • Unintended variable overwrites, a modified function not properly applied…
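The "unintended variable overwrite" hazard above is easy to reproduce in any interactive session. A minimal illustrative snippet (not from the slides' code) showing how a function can be silently clobbered:

```python
def score(x):
    return x * 2

result = score(10)   # 20
score = result       # oops: the function name is silently overwritten by an int
# score(5) would now raise TypeError: 'int' object is not callable
```

In a long notebook session, errors like this go unnoticed until a later cell fails, or worse, until a stale value quietly poisons downstream results.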
Experiment Management • There are lots of tools for logging, but… • How often are the logs checked? • Unable to notice the loss of reproducibility. • Needs "Management", in addition to "Logging". • A tool for humans?
Usage: Jupyter Notebook • Usable in your interpreter with minimum modification.

Standard:

    def prepare_data(a, b):
        return a + b

    def calculate_score(s):
        return 10 / s

    # Define and run experiment
    a = 1
    b = 2
    d = prepare_data(a, b)
    calculate_score(d)

daskperiment:

    # Define experiment
    ex = daskperiment.Experiment(id='demo_pj')

    # Define parameters
    a = ex.parameter('a')
    b = ex.parameter('b')

    # Define functions
    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    # Run experiment
    ex.set_parameters(a=1, b=2)
    s.compute()

• Naturally splits definition and execution.
Basic Functionality • Logging history in every experiment trial. • Notify users if required.

    ex.set_parameters(a=1, b=2)
    s.compute()

    ex.set_parameters(a=1, b=2)
    s.compute()
    ...[WARNING] Experiment step result is changed with the same input: (step: calculate_score, args: (7,), kwargs: {})
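The warning above fires when the same step, fed the same inputs, produces a different result across trials, i.e. the step is not pure. A simplified sketch of how such a check could work (illustrative only; `ResultTracker` is a hypothetical name, not daskperiment's internals):

```python
import warnings

class ResultTracker:
    """Record each step's result per input, and warn when a rerun
    with identical inputs yields a different result (an impure step)."""

    def __init__(self):
        self._seen = {}

    def check(self, step, args, result):
        key = (step, args)
        if key in self._seen and self._seen[key] != result:
            warnings.warn("Experiment step result is changed with the "
                          f"same input: (step: {step}, args: {args})")
        self._seen[key] = result

tracker = ResultTracker()
tracker.check('calculate_score', (7,), 10 / 7)   # first trial: recorded
tracker.check('calculate_score', (7,), 10 / 7)   # same result: silent
```

A third call with the same `(7,)` input but a different result would trigger the warning, surfacing the lost reproducibility instead of merely logging it.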
Internal Expressions • Experiment definitions are expressed as a computation graph.

daskperiment:

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    ex.set_parameters(a=1, b=2)
    s.compute()

• Parameters are changed per trial.
Functionalities Introduced in Demo • Experiment management • Manage trial results and their (hyper)parameters • Save intermediate results and metrics • Detect code and environment changes • Validate function purity • Feed random seed • Dashboard • Command line tool (CLI)
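"Feed random seed" matters because any step that draws random numbers is irreproducible unless the seed is fixed per trial. A minimal sketch of the idea using the standard library (not daskperiment's API):

```python
import random

def run_trial(seed):
    random.seed(seed)                       # feed a fixed seed per trial
    return [random.random() for _ in range(3)]

# With the seed recorded alongside the trial, reruns are reproducible.
assert run_trial(42) == run_trial(42)
```

Recording the seed with the rest of the trial's parameters turns a nondeterministic step back into a replayable one.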
Usage: CLI • The same definition can be used for CLI execution.

Jupyter Notebook:

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    ex.set_parameters(a=1, b=2)
    s.compute()

CLI:

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

• Parameters can be provided via CLI.
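To make the idea of "parameters provided via CLI" concrete, here is a generic sketch using the standard library's `argparse`; the flag names and parsing style are illustrative assumptions, not daskperiment's actual command-line syntax:

```python
import argparse

# Hypothetical sketch: feeding the parameters 'a' and 'b' from the
# command line instead of calling ex.set_parameters(a=1, b=2).
parser = argparse.ArgumentParser()
parser.add_argument('--a', type=int, required=True)
parser.add_argument('--b', type=int, required=True)

# In a real script this would be parser.parse_args() over sys.argv[1:].
args = parser.parse_args(['--a', '1', '--b', '2'])
params = vars(args)   # {'a': 1, 'b': 2}
```

Because the experiment definition never hard-codes values, the same script runs unchanged in a notebook (with `set_parameters`) or from a shell (with flags).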
Note: Data Management • Can cover basic usages, but… • Detect external data source changes. • Load persisted data into another experiment. • Using a data versioning tool is recommended.

    @ex
    def load_user_data(filename):
        return pd.read_csv(filename)

    @ex_modeling
    def load_data(trial_id):
        return ex_preprocess.get_persisted('prepare_data', trial_id=trial_id)
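Detecting an external data source change typically reduces to fingerprinting the file's content and comparing across trials. A minimal standard-library sketch of that idea (illustrative, not daskperiment's implementation):

```python
import hashlib

def fingerprint(path):
    """Content hash of an external data file; if the hash recorded in
    one trial differs from the next, the data source has changed."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()
```

Storing the fingerprint alongside the trial's parameters is enough to flag silent upstream edits, though a dedicated data versioning tool also lets you recover the old data, not just detect that it changed.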
Dask • A flexible parallel computing library for analytics. • Mainly focused on numeric computations. • Provides data structures which are subsets of NumPy and pandas. [Figure: Blocked algorithm — a Dask DataFrame is split into pandas DataFrame blocks; per-block Sums are Concat-ed, then Sum-ed into the final result]
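The blocked algorithm in the figure can be sketched in a few lines of plain Python, assuming a simple sum reduction (Dask does the same over pandas/NumPy blocks, in parallel):

```python
def blocked_sum(values, blocksize=3):
    """Blocked algorithm: split the input into blocks, reduce each
    block independently, then reduce the partial results."""
    blocks = [values[i:i + blocksize]
              for i in range(0, len(values), blocksize)]
    partials = [sum(b) for b in blocks]   # per-block Sum (parallelizable)
    return sum(partials)                  # Concat + final Sum
```

Because each block's reduction is independent, the per-block step can be scheduled across cores or machines, which is exactly what makes the Dask collections scale beyond memory.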
(Incomplete) List of OSS Using Dask • Airflow (Dask distributed scheduler): a platform to author, schedule and monitor workflows. • TFLearn: deep learning library featuring a higher-level API for TensorFlow. • scikit-image: Image Processing SciKit. • xarray: N-D labeled arrays and datasets in Python. • RAPIDS: executes end-to-end data science and analytics pipelines entirely on GPUs.
Experiment to Computation Graph • Dask.Delayed: a Dask API to build computation graphs from arbitrary functions. • Experiment definitions are expressed as a computation graph with Dask.Delayed. • Pre/post processing is inserted per experiment and trial. • Parameters are provided to the computation graph per trial. • Dask executes the computation graph in parallel (if possible).
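The core idea of Dask.Delayed — calling a wrapped function records a graph node instead of executing it, and `compute()` later walks the graph — can be shown with a toy reimplementation. This is a deliberately simplified sketch, not the real Dask API or daskperiment's internals:

```python
class Delayed:
    """Toy graph node: holds a function and its (possibly delayed) args."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def compute(self):
        # Recursively evaluate delayed arguments, then apply the function.
        args = [a.compute() if isinstance(a, Delayed) else a
                for a in self.args]
        return self.func(*args)

def delayed(func):
    """Wrap a function so calling it builds a graph node instead of running."""
    return lambda *args: Delayed(func, args)

@delayed
def prepare_data(a, b):
    return a + b

@delayed
def calculate_score(s):
    return 10 / s

s = calculate_score(prepare_data(1, 2))   # graph built; nothing has run yet
```

Only `s.compute()` triggers execution, which is what lets a scheduler inspect the whole graph first, inject per-trial parameters, and run independent branches in parallel.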
Backend • Users can specify the location to store experiment info as a Backend. • Loads previous experiment history if it exists. • The following backends are supported: • File • Redis • MongoDB

    daskperiment.Experiment('redis_uri_backend', backend='redis://localhost:6379/0')
    daskperiment.Experiment('mongo_uri_backend', backend='mongodb://localhost:27017/test_db')
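Dispatching on the URI scheme is the natural way to pick a backend from strings like the ones above. An illustrative sketch using the standard library (the `choose_backend` function and backend names are assumptions, not daskperiment's actual internals):

```python
from urllib.parse import urlparse

def choose_backend(uri):
    """Pick a backend from the scheme of the user-supplied URI;
    anything without a recognized scheme is treated as a file path."""
    scheme = urlparse(uri).scheme
    backends = {'redis': 'RedisBackend', 'mongodb': 'MongoBackend'}
    return backends.get(scheme, 'FileBackend')
```

This keeps the user-facing API to a single `backend=` string while leaving room to add new storage schemes without changing call sites.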
Remaining Work • Collaboration in a team • Optimizing the computation graph between trials • Extensibility • Fixing the Collector and Backend APIs • Supporting deep learning frameworks
Conclusions • It is difficult to guarantee reproducibility in data analytics. • Needs “Management”, in addition to “Logging”. • Built a tool named “daskperiment” • Track and detect anything unexpected. • Experiment steps are automatically parallelized.