
daskperiment: Reproducibility for Humans

Sinhrks
April 24, 2019

@Scipy Japan 2019/4/24

  1. daskperiment
    Reproducibility for Humans
    Masaaki Horikoshi @ ARISE analytics

  2. Self-introduction
    • Masaaki Horikoshi
    • ARISE analytics
    • Data analytics, etc.
    • A member of core developers of several OSS projects
    • GitHub: https://github.com/sinhrks

  3. Contents
    Experiment Management
    daskperiment

  4. Reproducibility
    • The closeness of the agreement between the results of
    measurements of the same measurand carried out with the same
    methodology described in the corresponding scientific
    evidence. (Wikipedia)
    • Without reproducibility:
    • Unable to replicate previous experiment trials.
    • Fruitless investigation and collation work are required.

  5. Data Analytics Process
    https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
    • Lots of trial and error is necessary, especially in the initial phase.
    CRISP-DM

  6. Data Analytics Pipeline
    (Diagram: Data → Data preparation → Modeling → Evaluation →
    Deployment → Trained Model)
    • Varies along with trial and error
    • Data dependent
    • CACE (Change Anything, Change Everything)

  7. Dependencies
    • Data: data definitions, periods, target domains…
    • Logic: feature engineering, ML algorithms, hyper-parameters…
    • Environment: host, interpreter, packages…
    (Diagram: the pipeline Data → Data preparation → Modeling →
    Evaluation → Deployment → Trained Model, each stage running inside
    an Environment)

  8. Reproducibility
    • The closeness of the agreement between the results of
    measurements of the same measurand carried out with the same
    methodology described in the corresponding scientific
    evidence. (Wikipedia)
    Experiment dependencies must be totally managed to guarantee
    reproducibility.

  9. Dependency Management
    • These trials and errors are too “small” to fit version control
    systems.
    • Minor data modifications, feature engineering, new package
    trials…
    • Interactive trial and error sometimes leads to human errors.
    • Unintended variable overwrites, modified functions not
    properly applied…

  10. Experiment Management
    • There are lots of tools for logging, but…
    • How often is it checked?
    • Users cannot notice the loss of reproducibility.
    • Needs “Management”, in addition to “Logging”.
    • A tool for humans?

  11. daskperiment
    Reproducibility for Humans
    https://github.com/sinhrks/daskperiment

  12. Usage: Jupyter Notebook
    • Usable in your interpreter with minimal modification.

    Standard (define and run experiment together):

    def prepare_data(a, b):
        return a + b

    def calculate_score(s):
        return 10 / s

    a = 1
    b = 2
    d = prepare_data(a, b)
    calculate_score(d)

    daskperiment (naturally splits definition and execution):

    ex = daskperiment.Experiment(id='demo_pj')

    # Define parameters
    a = ex.parameter('a')
    b = ex.parameter('b')

    # Define functions
    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    # Define experiment
    d = prepare_data(a, b)
    s = calculate_score(d)

    # Run experiment
    ex.set_parameters(a=1, b=2)
    s.compute()

  13. Basic Functionality
    • Logs history in every experiment trial.
    • Notifies users if required.

    ex.set_parameters(a=1, b=2)
    s.compute()
    ex.set_parameters(a=1, b=2)
    s.compute()

    ...[WARNING] Experiment step result is changed with the same input:
    (step: calculate_score, args: (7,), kwargs: {})

  14. Internal Expressions
    • Experiment definitions are expressed as a computation graph.

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    ex.set_parameters(a=1, b=2)
    s.compute()

    Parameters are changed per trial.

  15. Demonstrations
    https://github.com/sinhrks/daskperiment/blob/master/notebook/quickstart.ipynb

  16. Functionalities Introduced in Demo
    • Experiment management
    • Manage trial results and their (hyper)parameters
    • Save intermediate results and metrics
    • Detect code and environment change
    • Validate function purity
    • Feed random seed
    • Dashboard
    • Command line tool (CLI)

  17. Usage: CLI
    • The same definitions can be used for CLI execution.

    Jupyter Notebook:

    ex = daskperiment.Experiment(id='demo_pj')
    a = ex.parameter('a')
    b = ex.parameter('b')

    @ex
    def prepare_data(a, b):
        return a + b

    @ex.result
    def calculate_score(s):
        return 10 / s

    d = prepare_data(a, b)
    s = calculate_score(d)

    ex.set_parameters(a=1, b=2)
    s.compute()

    CLI: the same script without the final ex.set_parameters(...) and
    s.compute(); parameters can be provided via the CLI instead.

  18. Note: Data Management
    • Can cover basic usages:
    • Detect external data source changes.
    • Load persisted data into another experiment.
    • For more, using a data versioning tool is recommended.

    @ex
    def load_user_data(filename):
        return pd.read_csv(filename)

    @ex_modeling
    def load_data(trial_id):
        return ex_preprocess.get_persisted('prepare_data', trial_id=trial_id)
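    One way to detect an external data source change, as a sketch: fingerprint the file contents and compare the hash across trials. `file_fingerprint` is a hypothetical helper for illustration, not part of daskperiment.

```python
# Sketch: detect external data source changes by hashing file contents;
# a trial can store this fingerprint and compare it on the next run.
import hashlib

def file_fingerprint(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()
```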

  19. Internal Overview
    daskperiment internals:
    • Computation: Dask
    • Data collection: Collector (gathers Data and Environment info)
    • Input/output: Backend (stores to the experiment management DB)
    • Visualization: Board

  20. Dask
    • A flexible parallel computing library for analytics.
    • Mainly focused on numeric computations.
    • Provides data structures which are subsets of NumPy and pandas.

    (Diagram: blocked algorithm; a Dask DataFrame is a collection of
    pandas DataFrames, and a Sum is computed as per-block Sums, a
    Concat, then a final Sum)
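    The blocked Sum above can be sketched in pure Python; this only illustrates the idea, as Dask itself builds a task graph and runs it on its own schedulers.

```python
# Pure-Python sketch of the blocked algorithm: split the data into
# blocks, sum each block in parallel (one "Sum" per block), then
# combine the partial results ("Concat" + final "Sum").
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))
blocks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, blocks))  # per-block Sums

total = sum(partial_sums)                       # final combine
```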

  21. (Incomplete) List of OSS uses Dask
    • TFLearn: deep learning library featuring a higher-level API
    for TensorFlow.
    • Airflow (via the distributed scheduler): a platform to author,
    schedule and monitor workflows.
    • scikit-image: Image Processing SciKit.
    • xarray: N-D labeled arrays and datasets in Python.
    • RAPIDS: executes end-to-end data science and analytics pipelines
    entirely on GPUs.

  22. Dask Computation Graph
    • Task dependency analysis
    • Remove unnecessary tasks (nodes)
    • Merge identical tasks
    • Static/dynamic computation order analysis
    • Graph optimization
    • Inlining
    • Assign dependent tasks to the same worker

  23. Experiment to Computation Graph
    • Dask.Delayed: a Dask API to build a computation graph from
    arbitrary functions.
    • Experiment definitions are expressed as a computation graph with
    Dask.Delayed.
    • Pre/post processing is inserted per experiment and trial.
    • Parameters are provided to the computation graph per trial.
    • Dask executes the computation graph in parallel (when possible).
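    A toy illustration of the Dask.Delayed idea (a simplified stand-in, not Dask's implementation): calling a wrapped function builds a graph node instead of computing, and compute() later walks the graph. This mirrors how the experiment definition and execution are split.

```python
# Toy stand-in for dask.delayed: calls build a graph; compute() runs it.
class Delayed:
    def __init__(self, func, args):
        self.func, self.args = func, args

    def compute(self):
        # Evaluate dependencies first, then apply this node's function.
        resolved = [a.compute() if isinstance(a, Delayed) else a
                    for a in self.args]
        return self.func(*resolved)

def delayed(func):
    # The wrapper returns a graph node instead of the result.
    return lambda *args: Delayed(func, args)

@delayed
def prepare_data(a, b):
    return a + b

@delayed
def calculate_score(s):
    return 10 / s

s = calculate_score(prepare_data(1, 2))  # definition: graph only
result = s.compute()                     # execution: 10 / (1 + 2)
```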

  24. Collector
    • Collects environment info and stores it in the experiment DB via
    the backend.

    Category            Description
    Code                Code context
    Platform            Information collected by the “platform” module
    CPU                 Information collected by the “cpuinfo” package
    Python interpreter  Python version, venv, installed packages…
    Package detail      NumPy, SciPy, pandas, conda details
    Git                 Git repository, working branch…
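    As a rough sketch, the stdlib “platform” module already covers several of the categories above; the dict layout here is illustrative, not daskperiment's schema, and the third-party “cpuinfo” package and Git info are omitted.

```python
# Collect a few environment facts with the stdlib "platform" module,
# in the spirit of the Collector categories in the table above.
import platform
import sys

env = {
    'platform': platform.platform(),                     # OS / kernel info
    'python_version': platform.python_version(),         # interpreter version
    'implementation': platform.python_implementation(),  # CPython, PyPy, ...
    'executable': sys.executable,                        # hints at the venv in use
}
```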

  25. Backend
    • Users can specify the location to store experiment info as a Backend.
    • Previous experiment history is loaded if it exists.
    • The following backends are supported:
    • File
    • Redis
    • MongoDB

    daskperiment.Experiment('redis_uri_backend', backend='redis://localhost:6379/0')
    daskperiment.Experiment('mongo_uri_backend', backend='mongodb://localhost:27017/test_db')
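    A hedged sketch of how such backend URIs could be dispatched by scheme; the function name and return values are illustrative, not daskperiment's actual API.

```python
# Illustrative dispatch: choose a storage backend from the URI scheme.
from urllib.parse import urlparse

def pick_backend(uri):
    scheme = urlparse(uri).scheme
    if scheme == 'redis':
        return 'redis'
    if scheme == 'mongodb':
        return 'mongodb'
    return 'file'   # plain paths fall back to local file storage

pick_backend('redis://localhost:6379/0')           # redis backend
pick_backend('mongodb://localhost:27017/test_db')  # mongodb backend
pick_backend('./experiments')                      # file backend
```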

  26. Remaining Work
    • Collaboration in a team
    • Optimize computation graph between trials
    • Extensibility
    • Fix Collector and Backend API
    • Support deep learning frameworks

  27. Conclusions
    • It is difficult to guarantee reproducibility in data analytics.
    • Needs “Management”, in addition to “Logging”.
    • Built a tool named “daskperiment”:
    • Tracks and detects anything unexpected.
    • Experiment steps are automatically parallelized.
