Automate Your Boilerplate

Chris
October 25, 2017

Project templating and scaffolding tools like the cookiecutter Python package can be a great help when starting a new project. They provide a way of generating a predefined layout of files and directories, and can also be parameterised to accept arguments, such as the name of the new project, at generation time.
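To make the idea of a parameterised template concrete, here is a minimal sketch of how a variable tag in a file name or file body could be filled in from a set of defaults. It is a deliberately simplified stand-in written with the standard library `re` module, not cookiecutter's actual Jinja2 rendering; the `render` helper and the example values are assumptions made for illustration.

```python
import json
import re

# Matches tags such as {{cookiecutter.project_name}} or {{ cookiecutter.project_name }}.
TAG = re.compile(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}")


def render(text, context):
    """Replace each cookiecutter-style tag in `text` with its value from `context`.

    Hypothetical helper: real cookiecutter templates are rendered by Jinja2,
    which supports far more than plain variable substitution.
    """
    return TAG.sub(lambda match: str(context[match.group(1)]), text)


# Defaults as they might appear in a template's cookiecutter.json (made-up values).
defaults = json.loads('{"project_name": "example-project", "author": "Chris"}')

# A value supplied at generation time overrides the default.
context = {**defaults, "project_name": "churn-analysis"}

# The same tag can appear in directory names and in file content.
directory_name = render("{{cookiecutter.project_name}}", context)
readme_line = render("# {{ cookiecutter.project_name }} by {{ cookiecutter.author }}", context)

print(directory_name)
print(readme_line)
```

The point of the sketch is only that one small mapping of names to values drives both the layout and the content of the generated project.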

Creating such a template takes some effort, but it means quicker startup times on future projects, less boilerplate code to write, more consistent project layouts, and even automation of common setup tasks. However, such templates work best for setting up highly repeatable project structures (like when writing plugins or small command line applications).

Can we make use of these for Data Science projects? I will share my experience of doing so in this talk, along with the benefits and drawbacks I have found in trying to automate away as much of the boilerplate and manual effort as possible when starting a new Data Science project!

I will then present how I have eased some of the pain points by breaking down a typical project into several templating layers, and then developing a command line tool to help manage and apply these template layers to a project as and when needed.
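One piece of managing template layers is working out which layers must be applied first. The sketch below shows one possible way to do that with a depth-first walk over a dependency mapping. The mapping mirrors the `metadata.json` example shown later in the transcript; the function names `resolve_layers` and `layers_to_apply` are hypothetical and are not taken from the talk's tool.

```python
def resolve_layers(requested, dependencies):
    """Return `requested` and everything it depends on, dependencies first."""
    ordered = []
    in_progress = set()

    def visit(layer):
        if layer in ordered:
            return
        if layer in in_progress:
            raise ValueError(f"circular dependency involving {layer!r}")
        in_progress.add(layer)
        for required in dependencies.get(layer, []):
            visit(required)
        in_progress.discard(layer)
        ordered.append(layer)

    visit(requested)
    return ordered


def layers_to_apply(requested, dependencies, applied):
    """Drop layers that the project's stored state says were already applied."""
    return [
        layer
        for layer in resolve_layers(requested, dependencies)
        if layer not in applied
    ]


# Same shape as the metadata.json example in the transcript.
metadata = {"default": "package", "dependencies": {"pytest": ["package"]}}

# On a fresh project, adding "pytest" pulls in "package" first.
fresh = layers_to_apply("pytest", metadata["dependencies"], applied=[])

# If "package" is already recorded as applied, only "pytest" remains.
existing = layers_to_apply("pytest", metadata["dependencies"], applied=["package"])

print(fresh)
print(existing)
```

Keeping a record of applied layers alongside the project is what lets a layer be added "as and when needed" without re-rendering what is already there.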

As part of the talk I will give an overview of cookiecutter templates, and how they can be built upon to achieve this approach of composable project templating.

Transcript

  1. Introduction
     • Data Scientist / Developer / DevOps
     • Resident Python Specialist
     • I work hard to be lazy!
  2. Data Science Projects
     No standard path to get from A to B.
     Common issues:
     • Consistency
     • Reproducibility
     • Reusability
  3. Consistency Helps
     • Using standard layouts
     • Analysis is not a package
     • Useful to automate this boilerplate setup

     example-project
     ├── analysis
     │   ├── 01-Explore.ipynb
     │   └── source.py
     ├── data
     │   ├── interim
     │   ├── prepared
     │   └── raw
     ├── output
     ├── Makefile
     ├── README.md
     └── environment.yml
  4. • Uses jinja2 templating engine
     • Allows pre/post hooks (.py or .sh)
     • Language agnostic
     • Many existing templates
       – https://drivendata.github.io/cookiecutter-data-science/
       – https://github.com/audreyr/cookiecutter
  5. • Variable tag in files/directory names and content:
       {{cookiecutter.my_variable_name}}
     • Replace with variables in a json file:
       { "my_variable_name": "default value" }
  6. Usage
     Installation:
       pip install cookiecutter
     Create new project with:
       cookiecutter path/to/template
     Or:
       cookiecutter gh:path/to/github/repo
  7. Project Templating
     Pro
     • Faster project startup time
     • Fewer distractions, more focus on the issue
     • More consistent layouts
     • Collaborators will like you more!
     Con
     • Time and effort to create/debug etc.
  8. Data Science Projects
     • Different sizes and mixture of: Exploration, Construction
     • Different needs at different times in a project's lifetime.
  9. Good First Step, but…
     May later want to add:
     • Python package
     • Tests
     • Documentation
     • Dockerfile
     • CI build
     • conda-recipe
     • Logging
     • Command line API
     • …
  10. Issue
      • Templates generated only once
        – Need to know what you want upfront
      • Many project needs = many templates?
        – Maintenance issues, more updating/debugging
        – Many repeated sections
      • Or one large monolithic template?
        – Long survey, no multi-select
        – Unused files are confusing/distracting
        – Harder to onboard new team members
  11. What if we start out with the bare minimum and add in extra components as needed?
      • Pros:
        – Greater flexibility
        – Easier comprehension
        – Multiple checkpoints
      • Cons:
        – Time in upfront design
        – Longer dev/debug times
        – Much more complexity
  12. Would need
      1. A way to render each template layer
      2. A way to handle dependencies between components
      3. A way to store state information for the project
      4. A tool to set up/add templates as needed
  13. Projects as Layered Components
      Keep it as simple as possible
      • Minimise changes
      • Use independent files
      • Favor appending text
  14. Masonry
      Installation:
        pip install masonry
      Create new project:
        mason init path/to/masonry-templates
      Add a new layer:
        mason add template_layer
  15. Masonry Template Collection
      Cookiecutter template per directory

      path/to/masonry-templates
      │
      ├── package
      │   ├── {{cookiecutter.project_name}}
      │   ├── hooks
      │   └── cookiecutter.json
      │
      ├── pytest
      │   ├── {{cookiecutter.project_name}}
      │   ├── hooks
      │   └── cookiecutter.json
      │
      └── metadata.json

      metadata.json:
      { "default": "package", "dependencies": { "pytest": ["package"] } }
  16. Masonry Features
      • Cookiecutter
        – pre and post hooks supported
        – variables remembered and reused
      • Interactively select project and templates
      • Easy pre/post fixing to text and .py files
      • Git commits after each added layer
      • Automated check on all the template layers
  17. Summary
      • Consistency in DS project layout helps
      • Cookiecutter: create custom templates and reduce startup times.
      • Multiple needs throughout a project:
        – Consider a layered templating approach
        – Upfront design really pays off
  18. Roadmap
      • Aim to support all cookiecutter features
      • Improve UI feedback
      • Improve logging for hook scripts and error feedback
      • pytest-plugin?
      • Include utilities for common post hook tasks e.g. run subprocess
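The last roadmap item mentions utilities for common post hook tasks such as running a subprocess. As a rough illustration of what such a utility might look like, the sketch below wraps the standard library's `subprocess.run` so that a hook script fails loudly with the command's error output. The `run` helper is hypothetical and is not part of cookiecutter or of the tool described in the talk.

```python
import subprocess
import sys


def run(command, cwd=None):
    """Run `command` (a list of arguments) and return its standard output.

    Raises RuntimeError carrying the captured stderr when the command exits
    with a non-zero status, so a hook script stops instead of failing silently.
    """
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{' '.join(command)} failed: {result.stderr.strip()}")
    return result.stdout


# Example call: invoke the current Python interpreter as a stand-in for a real
# setup command that a post hook might run in the generated project.
output = run([sys.executable, "-c", "print('layer added')"])
print(output.strip())
```

A post hook script could call a helper like this for each setup command it needs, keeping error handling in one place.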