
ODSC East: Kedro + MLflow Reproducible and versioned data pipelines at scale

Tom Goldenberg

February 19, 2020


Transcript

  1. Kedro + MLflow Reproducible and versioned data pipelines at scale

    Tom Goldenberg @tomgoldenberg #kedro | ODSC East 2020
  2. All content copyright © 2019 QuantumBlack, a McKinsey company

     Kedro is a Python library that bridges machine learning and software engineering. It is a development workflow framework that implements software engineering best practice for data pipelines, with an eye towards productionising machine learning models.
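Kedro's central idea is a pipeline of pure-function nodes wired together by named datasets in a catalog. As a rough stdlib-only illustration of that concept (a simplified stand-in, not Kedro's actual API, which lives in `kedro.pipeline` and `kedro.io`):

```python
# Miniature stand-in for Kedro's node/pipeline idea (not Kedro's real API):
# each node is a pure function with named inputs and outputs, and the
# pipeline resolves datasets by name from a catalog dict.

def make_node(func, inputs, outputs):
    """Bundle a function with the dataset names it reads and writes."""
    return {"func": func, "inputs": inputs, "outputs": outputs}

def run_pipeline(nodes, catalog):
    """Execute nodes in order, reading and writing named datasets."""
    for node in nodes:
        args = [catalog[name] for name in node["inputs"]]
        catalog[node["outputs"]] = node["func"](*args)
    return catalog

# Two toy nodes: clean raw data, then compute a feature from it.
def clean(raw):
    return [x for x in raw if x is not None]

def total(cleaned):
    return sum(cleaned)

catalog = {"raw_data": [1, None, 2, 3]}
pipeline = [
    make_node(clean, ["raw_data"], "cleaned_data"),
    make_node(total, ["cleaned_data"], "feature_total"),
]
result = run_pipeline(pipeline, catalog)
print(result["feature_total"])  # 6
```

Because nodes only declare dataset names, the same pipeline code can run against local files in development and production storage later, which is the separation the slide describes.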
  3. Why does Kedro exist?

     Clean code is expected. A successful project does not only entail having a model run in production; our success is a client that can maintain their own data pipeline when we leave.

     A larger team increases workflow variance. Our data scientists, data engineers and machine learning engineers struggled to collaborate on a single code base.

     Efficiency when delivering production-ready code. We have time for code and model optimisation, but not for refactoring, so we needed a seamless way to move quickly from the experimentation phase into production-ready code.

     Reduced learning curve. Our teams come from many different backgrounds, with varying experience of software engineering principles. It is with empathy that we ask, "How can we tweak your workflow so that our coding standards are the same?"
  4. Experiment tracking is necessary during model development and post-deployment.

     Pain points, with examples:
     - Proving the value of experiments: without a baseline it is hard to show progress. "How do I demonstrate the value and results of the different experiments I have run?"
     - Lost work: it is not easy to recover the parameters of previous runs. "I can't remember what features I used to train my best-performing model from a week ago!"
     - Collaboration: referencing and accessing models and artifacts without repository bloat. "How can I share my findings with other team members so they can build on the insights I have found?"

     Addressed by:
     - Tracking run parameters
     - Visualisation of data to bring transparency into project progress
     - Syncing and collaborating across multiple workstreams
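The "lost work" pain point is what run tracking solves: every run's parameters and metrics are recorded so the best model's configuration is always recoverable. A minimal stdlib sketch of that idea (MLflow's real API does this with `mlflow.start_run()`, `mlflow.log_param()` and `mlflow.log_metric()`; this in-memory store is only an illustration):

```python
# Minimal stand-in for experiment tracking: each run's parameters and
# metrics are appended to a store, so past runs can be compared and the
# configuration behind the best model can be looked up later.

runs = []

def log_run(params, metrics):
    """Record one experiment run's parameters and resulting metrics."""
    runs.append({"params": params, "metrics": metrics})

def best_run(metric):
    """Return the recorded run with the highest value for a metric."""
    return max(runs, key=lambda r: r["metrics"][metric])

# Two hypothetical training runs with different hyperparameters.
log_run({"n_estimators": 100, "max_depth": 5}, {"accuracy": 0.81})
log_run({"n_estimators": 300, "max_depth": 8}, {"accuracy": 0.86})

best = best_run("accuracy")
print(best["params"])  # the features/parameters behind the best model survive
```

A real tracking server additionally persists artifacts (the model files themselves) outside the git repository, which addresses the "repo bloat" point above.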
  5. Kedro and MLflow work together in both model development and deployment.

     Kedro provides seamless local development: data abstraction and code organisation that follow software engineering best practices. MLflow provides integrated model tracking: model run comparisons, parameter tracking and model versioning.

     The workflow spans local development, version control, batch model training, model serving, and consumption by the end-user squad.
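A common way to combine the two (a sketch of the general pattern, not the talk's exact integration code) is to wrap each pipeline execution so that its parameters, metrics and duration are logged automatically; with the real libraries the wrapper would open an `mlflow` run around Kedro's runner:

```python
import time

# Sketch of the integration pattern: wrap a pipeline run so that its
# parameters and resulting metrics are recorded on every execution.
# With the real libraries this wrapper would call mlflow.start_run()
# and log via mlflow.log_param()/mlflow.log_metric().

tracked_runs = []

def tracked_run(pipeline_func, params):
    """Run a pipeline function and record params, metrics and duration."""
    start = time.time()
    metrics = pipeline_func(**params)
    tracked_runs.append({
        "params": params,
        "metrics": metrics,
        "duration_s": time.time() - start,
    })
    return metrics

# Hypothetical training pipeline: takes hyperparameters, returns metrics.
def train_pipeline(learning_rate, epochs):
    return {"loss": round(1.0 / (learning_rate * epochs), 4)}

metrics = tracked_run(train_pipeline, {"learning_rate": 0.1, "epochs": 100})
print(metrics)  # {'loss': 0.1}
```

Hooking tracking in at the runner level, rather than inside individual nodes, is what keeps the pipeline code itself free of experiment-tracking concerns.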