daskperiment: Reproducibility for Humans

Sinhrks
April 24, 2019

@SciPy Japan 2019/4/24

Transcript

  1. daskperiment: Reproducibility for Humans
    Masaaki Horikoshi @ ARISE analytics

  2. Self-introduction
    • Masaaki Horikoshi
    • ARISE analytics
    • Data analytics, etc
    • A member of core developers:
    • GitHub: https://github.com/sinhrks
  3. Contents
    • Experiment Management
    • daskperiment

  4. Reproducibility
    • The closeness of the agreement between the results of measurements of the same measurand carried out with same methodology described in the corresponding scientific evidence. (Wikipedia)
    • Without reproducibility:
      • Unable to replicate the previous experiment trial.
      • Fruitless investigation and collation work is required.
  5. Data Analytics Process
    • A lot of trial and error is necessary, especially in the initial phase.
    [Figure: CRISP-DM, https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining]
  6. Data Analytics Pipeline
    • Varies along with the trial and error
    • Data dependent
    • CACE (Change Anything, Change Everything)
    [Figure labels: Data, Data preparation, Modeling, Evaluation, Deployment, Trained Model]
  7. Dependencies
    • Data: Data definitions, periods, target domains…
    • Logic: Feature engineering, ML algorithms, hyperparameters…
    • Environment: Host, interpreter, packages…
    [Figure labels: Environment, Data, Data preparation, Modeling, Evaluation, Deployment, Trained Model]
  8. Reproducibility
    • The closeness of the agreement between the results of measurements of the same measurand carried out with same methodology described in the corresponding scientific evidence. (Wikipedia)
    • Experiment dependencies must be totally managed to guarantee reproducibility.
  9. Dependency Management
    • These trial-and-error changes are too “small” to fit well with version control systems.
      • Minor data modification, feature engineering, new package trial…
    • Interactive trial and error sometimes leads to human errors.
      • An unintended variable is overwritten, a modified function is not properly applied…
  10. Experiment Management
    • There are lots of tools for logging, but…
      • How often are the logs checked?
      • Users cannot become aware of the loss of reproducibility.
    • Needs “Management”, in addition to “Logging”.
    • A tool for humans?
  11. daskperiment: Reproducibility for Humans
    https://github.com/sinhrks/daskperiment

  12. Usage: Jupyter Notebook
    • Usable in your interpreter with minimum modification.
    • Naturally splits definition and execution.

    Standard:

        # Define functions
        def prepare_data(a, b):
            return a + b

        def calculate_score(s):
            return 10 / s

        # Define parameters
        a = 1
        b = 2

        # Define and run experiment
        d = prepare_data(a, b)
        calculate_score(d)

    daskperiment:

        import daskperiment

        # Define parameters
        ex = daskperiment.Experiment(id='demo_pj')
        a = ex.parameter('a')
        b = ex.parameter('b')

        # Define functions
        @ex
        def prepare_data(a, b):
            return a + b

        @ex.result
        def calculate_score(s):
            return 10 / s

        # Define experiment
        d = prepare_data(a, b)
        s = calculate_score(d)

        # Run experiment
        ex.set_parameters(a=1, b=2)
        s.compute()
  13. Basic Functionality
    • Logs history in every experiment trial.
    • Notifies users if required.

        ex.set_parameters(a=1, b=2)
        s.compute()

        ex.set_parameters(a=1, b=2)
        s.compute()

        ...[WARNING] Experiment step result is changed with the same input: (step: calculate_score, args: (7,), kwargs: {})
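
The kind of check behind such a warning can be illustrated with a minimal, standard-library-only sketch. This is not daskperiment's implementation: `StepHistory`, `record`, and the pickle-based fingerprint are hypothetical names and choices made for illustration. The idea is to remember a fingerprint of each step's result keyed by its inputs, and to warn when the same inputs later produce a different result.

```python
import hashlib
import pickle
import warnings


class StepHistory:
    """Hypothetical helper: remembers one result fingerprint per (step, inputs)."""

    def __init__(self):
        self._seen = {}

    @staticmethod
    def _fingerprint(obj):
        # Pickle-based hashing is enough for a sketch; it is not a
        # general-purpose way to compare arbitrary Python objects.
        return hashlib.sha256(pickle.dumps(obj)).hexdigest()

    def record(self, step, args, kwargs, result):
        """Return True if the result matches history, else warn and return False."""
        key = (step, self._fingerprint((args, sorted(kwargs.items()))))
        digest = self._fingerprint(result)
        # The first result seen for these inputs becomes the reference.
        previous = self._seen.setdefault(key, digest)
        if previous != digest:
            warnings.warn(
                "step result changed with the same input: "
                f"(step: {step}, args: {args}, kwargs: {kwargs})"
            )
            return False
        return True


history = StepHistory()
first_ok = history.record("calculate_score", (3,), {}, 10 / 3)
```

A step that depends on hidden state (a global variable, an unseeded random generator, an edited function body) would return a different value for the same arguments and trigger the warning on its next call.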
  14. Internal Expressions
    • Experiment definitions are expressed as a computation graph.
    • Parameters are changed per trial.

    daskperiment:

        import daskperiment

        ex = daskperiment.Experiment(id='demo_pj')
        a = ex.parameter('a')
        b = ex.parameter('b')

        @ex
        def prepare_data(a, b):
            return a + b

        @ex.result
        def calculate_score(s):
            return 10 / s

        d = prepare_data(a, b)
        s = calculate_score(d)

        ex.set_parameters(a=1, b=2)
        s.compute()
  15. Demonstrations https://github.com/sinhrks/daskperiment/blob/master/notebook/quickstart.ipynb

  16. Functionalities Introduced in Demo
    • Experiment management
    • Manage trial results and their (hyper)parameters
    • Save intermediate results and metrics
    • Detect code and environment changes
    • Validate function purity
    • Feed random seed
    • Dashboard
    • Command line tool (CLI)
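
Why feeding a seed per trial matters can be shown with a short standard-library sketch. How daskperiment itself feeds the seed is not shown in the slides, so `run_trial` and its arguments are hypothetical; the point is only that a trial whose randomness comes from a generator seeded with a recorded value can be replayed exactly.

```python
import random


def run_trial(seed, n=5):
    """Draw n pseudo-random numbers from a generator seeded for this trial."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return [rng.random() for _ in range(n)]


# Replaying a trial with its recorded seed reproduces the same numbers.
same_seed_matches = run_trial(42) == run_trial(42)
different_seed_differs = run_trial(42) != run_trial(43)
```

Using a local `random.Random` instance rather than `random.seed` keeps one trial from disturbing the random state seen by other code.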
  17. Usage: CLI
    • The same definition can be used for CLI execution.
    • Parameters can be provided via CLI.

    CLI:

        import daskperiment

        ex = daskperiment.Experiment(id='demo_pj')
        a = ex.parameter('a')
        b = ex.parameter('b')

        @ex
        def prepare_data(a, b):
            return a + b

        @ex.result
        def calculate_score(s):
            return 10 / s

        d = prepare_data(a, b)
        s = calculate_score(d)

    Jupyter Notebook:

        import daskperiment

        ex = daskperiment.Experiment(id='demo_pj')
        a = ex.parameter('a')
        b = ex.parameter('b')

        @ex
        def prepare_data(a, b):
            return a + b

        @ex.result
        def calculate_score(s):
            return 10 / s

        d = prepare_data(a, b)
        s = calculate_score(d)

        ex.set_parameters(a=1, b=2)
        s.compute()
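
The slide does not show the command line itself, so the following is only a sketch of how a runner could turn command-line tokens into trial parameters. The `KEY=VALUE` syntax and the `parse_parameters` helper are assumptions for illustration, not the daskperiment CLI.

```python
import argparse
import ast


def parse_parameters(argv):
    """Parse hypothetical KEY=VALUE tokens into a parameter dict."""
    parser = argparse.ArgumentParser(description="hypothetical experiment runner")
    parser.add_argument("params", nargs="*", metavar="KEY=VALUE")
    namespace = parser.parse_args(argv)

    parameters = {}
    for item in namespace.params:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            parser.error(f"expected KEY=VALUE, got {item!r}")
        try:
            # Interpret numbers, booleans, lists, etc. as Python literals.
            parameters[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            parameters[key] = raw  # keep non-literal values as plain strings
    return parameters


params = parse_parameters(["a=1", "b=2"])
```

The resulting dict has the same shape as the keyword arguments passed to `ex.set_parameters(a=1, b=2)` in the notebook version, which is what lets one experiment definition serve both entry points.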
  18. Note: Data Management
    • Can cover basic usages, but…
    • Detect external data source changes.
    • Load persisted data into another experiment.
    • Using a data versioning tool is recommended.

        import pandas as pd

        # ex, ex_preprocess and ex_modeling are Experiment instances defined elsewhere.
        @ex
        def load_user_data(filename):
            return pd.read_csv(filename)

        @ex_modeling
        def load_data(trial_id):
            return ex_preprocess.get_persisted('prepare_data', trial_id=trial_id)
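
One common way to notice that an external data file has changed between trials is to compare content hashes. The sketch below uses only the standard library; `file_fingerprint` is a hypothetical helper, and whether daskperiment detects changes this way is not stated in the slides.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Simulate an input file that is silently edited between two trials.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "users.csv"
    csv_path.write_text("id,score\n1,10\n")
    before = file_fingerprint(csv_path)
    csv_path.write_text("id,score\n1,11\n")
    after = file_fingerprint(csv_path)

data_changed = before != after
```

Storing the fingerprint next to each trial's parameters makes a changed input visible even when the file name stays the same. A dedicated data versioning tool covers the harder cases, such as databases or very large datasets.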
  19. Internal Overview
    [Figure: architecture of daskperiment. Labels: Data, Environment, Experiment management DB, Computation (Dask), Data collection (Collector), Input/output (Backend), Visualization (Board)]
  20. Dask
    • A flexible parallel computing library for analytics.
    • Mainly focused on numeric computations.
    • Provides data structures that implement a subset of the NumPy and pandas APIs.
    [Figure: Blocked Algorithm. Labels: Dask DataFrame, pandas DataFrame, Sum, Concat]
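
The blocked-algorithm idea can be sketched without Dask: split the data into blocks, reduce each block independently, then reduce the partial results. In this standard-library illustration a thread pool stands in for Dask workers, and `split_into_blocks` and `blocked_sum` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(values, block_size):
    """Split a sequence into consecutive blocks of at most block_size items."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def blocked_sum(values, block_size):
    """Sum each block independently, then sum the partial results."""
    blocks = split_into_blocks(values, block_size)
    with ThreadPoolExecutor() as pool:
        partial_sums = list(pool.map(sum, blocks))
    return sum(partial_sums)


data = list(range(1, 11))
total = blocked_sum(data, block_size=3)
```

Because each block is reduced on its own, no single step needs the whole dataset in memory at once. This is the same property that lets a Dask DataFrame operate on many pandas DataFrames.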
  21. (Incomplete) List of OSS That Use Dask
    • (TFLearn) Deep learning library featuring a higher-level API for TensorFlow.
    • Airflow (Distributed Scheduler): A platform to author, schedule and monitor workflows.
    • Image Processing SciKit.
    • N-D labeled arrays and datasets in Python.
    • Executes end-to-end data science and analytics pipelines entirely on GPUs.
  22. Dask Computation Graph
    • Task dependency analysis
    • Remove unnecessary tasks (nodes)
    • Merge identical tasks
    • Static/dynamic computation order analysis
    • Graph optimization
    • Inlining
    • Assigning dependent tasks to the same worker
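
Removing unnecessary tasks can be illustrated on a small graph. A Dask graph is a dict that maps keys to values or to task tuples of the form `(callable, arg1, arg2, ...)`. The sketch below mimics that layout in plain Python, with string keys only. The `dependencies`, `cull`, and `evaluate` functions are simplified illustrations, not Dask's optimizer or scheduler.

```python
from operator import add, mul


def is_task(value):
    """A task is a tuple whose first element is callable."""
    return isinstance(value, tuple) and bool(value) and callable(value[0])


def dependencies(graph, value):
    """Collect the graph keys that a task refers to in its arguments."""
    if not is_task(value):
        return set()
    return {arg for arg in value[1:] if isinstance(arg, str) and arg in graph}


def cull(graph, targets):
    """Keep only the tasks needed to compute the target keys."""
    needed, stack = set(), list(targets)
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed.add(key)
        stack.extend(dependencies(graph, graph[key]))
    return {key: value for key, value in graph.items() if key in needed}


def evaluate(graph, key):
    """Naive recursive evaluation of a single key."""
    value = graph[key]
    if not is_task(value):
        return value
    func, *args = value
    resolved = [
        evaluate(graph, arg) if isinstance(arg, str) and arg in graph else arg
        for arg in args
    ]
    return func(*resolved)


graph = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),
    "unused": (mul, "a", 100),  # never needed for "score"
    "score": (mul, "sum", 10),
}
small = cull(graph, ["score"])
result = evaluate(small, "score")
```

Because dependencies are explicit in the graph, the `unused` task can be dropped before anything runs.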
  23. Experiment to Computation Graph
    • Dask.Delayed: Dask API to build a computation graph from arbitrary functions.
    • Experiment definitions are expressed as a computation graph with Dask.Delayed.
    • Pre/post processing is inserted per experiment and per trial.
    • Parameters are provided to the computation graph per trial.
    • Dask executes the computation graph in parallel (if possible).
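
The separation between defining a graph and running it with per-trial parameters can be shown with a toy stand-in written in plain Python. It is not Dask.Delayed and not daskperiment: `Parameter`, `Deferred`, and `deferred` are hypothetical names, and evaluation here is sequential rather than parallel.

```python
class Parameter:
    """Placeholder whose value is supplied per trial."""

    def __init__(self, name):
        self.name = name


class Deferred:
    """A graph node: a function plus its (possibly deferred) arguments."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def compute(self, params):
        # Resolve arguments depth-first, then call the wrapped function.
        resolved = [_resolve(arg, params) for arg in self.args]
        return self.func(*resolved)


def _resolve(node, params):
    if isinstance(node, Deferred):
        return node.compute(params)
    if isinstance(node, Parameter):
        return params[node.name]
    return node


def deferred(func):
    """Decorator: calling the function records a node instead of running it."""
    def wrapper(*args):
        return Deferred(func, args)
    return wrapper


@deferred
def prepare_data(a, b):
    return a + b


@deferred
def calculate_score(s):
    return 10 / s


# Definition: builds the graph, nothing is executed yet.
score = calculate_score(prepare_data(Parameter("a"), Parameter("b")))

# Execution: the same graph is evaluated with different parameters per trial.
trial_1 = score.compute({"a": 1, "b": 2})
trial_2 = score.compute({"a": 3, "b": 2})
```

Keeping the graph as data is what makes it possible to wrap each node with extra steps such as logging or result comparison, and to hand independent nodes to a parallel scheduler.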
  24. Collector
    • Collects environment info and stores it in the experiment DB via the backend.

        Category            Description
        Code                Code context
        Platform            Information collected by the “platform” module
        CPU                 Information collected by the “cpuinfo” package
        Python Interpreter  Python version, venv, installed packages…
        Package detail      NumPy, SciPy, pandas, conda details
        Git                 Git repository, working branch…
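
A collector of this kind can be approximated with the standard library alone. The sketch below gathers platform, interpreter, and installed-package information; CPU details from the third-party `cpuinfo` package and Git details are left out to keep it self-contained. `collect_environment` and its dictionary keys are hypothetical, not daskperiment's Collector API.

```python
import platform
import sys
from importlib import metadata


def collect_environment():
    """Gather a small, JSON-friendly snapshot of the running environment."""
    packages = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages.append(f"{name}=={dist.version}")

    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "python_executable": sys.executable,
        # True when running inside a venv-style virtual environment.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "packages": sorted(packages),
    }


snapshot = collect_environment()
```

Comparing the snapshot of the current trial with the stored one from a previous trial is enough to flag a changed interpreter or a silently upgraded package.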
  25. Backend
    • Users can specify the location to store experiment info as a Backend.
    • Loads the previous experiment history if it exists.
    • The following backends are supported:
      • File
      • Redis
      • MongoDB

        daskperiment.Experiment('redis_uri_backend', backend='redis://localhost:6379/0')
        daskperiment.Experiment('mongo_uri_backend', backend='mongodb://localhost:27017/test_db')
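
How a backend string might be mapped to a storage type can be sketched with `urllib.parse`. The slides do not show how daskperiment resolves the `backend` argument, so `describe_backend` and the rule "no recognized scheme means a file path" are assumptions for illustration.

```python
from urllib.parse import urlparse


def describe_backend(backend):
    """Classify a backend string by URI scheme; anything else is a file path."""
    parsed = urlparse(backend)
    if parsed.scheme in ("redis", "mongodb"):
        return {
            "kind": parsed.scheme,
            "host": parsed.hostname,
            "port": parsed.port,
            # Redis DB index or MongoDB database name, taken from the URI path.
            "database": parsed.path.lstrip("/"),
        }
    return {"kind": "file", "path": backend}


redis_info = describe_backend("redis://localhost:6379/0")
mongo_info = describe_backend("mongodb://localhost:27017/test_db")
file_info = describe_backend("daskperiment_cache")
```

Selecting the backend from a single string keeps the experiment definition unchanged when the storage location moves from a local directory to a shared server.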
  26. Remaining Work
    • Collaboration in a team
    • Optimize computation graph between trials
    • Extensibility
    • Fix Collector and Backend API
    • Support deep learning frameworks
  27. Conclusions
    • It is difficult to guarantee reproducibility in data analytics.
    • Needs “Management”, in addition to “Logging”.
    • Built a tool named “daskperiment”:
      • Tracks and detects anything unexpected.
      • Experiment steps are automatically parallelized.