Port Harcourt School of AI Workshop

Software Engineering Best Practices for Data Scientists and Machine Learning
Engineers Port Harcourt School of AI Practical Machine Learning Course September 05, 2020

Eseoghene Emuraye • Statistical Consultant - Provide consultation in programming
and machine learning Graduate Student, Data Science The George Washington University • Cognitive Engineer Consultant Intern - Worked with the New York State Government • Graduate Teaching Assistant (Natural Language Processing) - Assist students and grade assignments/exams • Hiking/exploring nature • Playing the piano Favorite way to pass time @eemuraye linkedin.com/in/eserichard

Outline • Learning objectives • Why Data Scientists/Machine Engineers should
adopt software engineering best practices • Scope of work for Data Scientists/ML Engineers • Model Experimentation/Development phase • MLOps phase • Case study: Goodreads Average Rating Prediction • Key Takeaways

Learning Objectives • Understand the roles played by data scientists/machine
engineers in a team • Gain basic understanding of essential software engineering best practices

Why Data Scientists/Machine Engineers should adopt software engineering best practices
Source: Drew Conway – The Data Science Venn Diagram

Why Data Scientists/Machine Engineers should adopt software engineering best practices
Core team UI/UX Data Engineer Data Scientist/ML Engineer Software Engineer (Frontend/Backend/DevOps) Project Manager Team Lead Business Analysts Solution Architect Working Product

Phase I: Model Development and Prototyping Phase II: Deployment and
MLOps • Deploying model prototype at scale • Data Feedback loop • Model Monitoring • Continuous Delivery • Analyzing data • Deriving business insights • Prototyping models • Testing and Evaluation Scope of work for Data Scientists/ML Engineers

• Readable, efficient and maintainable code • Proper documentation •
Proper versioning • Good dependency management • Efficient testing • Efficient CI/CD implementation Essential software engineering best practices for Data Scientists/ML Engineers Objective: To write clean, stable, reusable code that can be maintained for years and is easily scalable

Readable and maintainable code • Use a style guide (e.g.
pylint, yapf, pep8), Google Style Guide • Naming conventions • Classes, variables and parameters should be nouns • Methods should be verbs • Avoid abbreviations except it is universally known • Methods • Should be an easily described functionality • Should be reusable throughout the project • Should handle a single thing • Should provide a useful output • Classes/Objects • Avoid unrelated functionality in your class (SOLID) – Single Responsibility Principle • Use functions for functionality unrelated to any object/class

Documentation • Use of docstrings • Classes • Brief description
of the class • Docstring for constructor method and for each method in the class • Methods and functions • Brief description of the functionality • Description of the arguments, with type and default value • Description of outputs, with their type • Description of error/exception with reasons NB: Splinx can be used to auto-generate HTML-based documentation for consumers/users of API • Repository/Component level documentation • The readme.md should contain the purpose of the respository/component and how to use it

Dependency management • It is important for builds to run
anywhere and stable overtime across environments • Pip is the most popular dependency manager for python • Use requirements.txt file to record all project dependencies so others can install them easily. The exact version of the package should be specified alongside • While pip is the common dependency manager, poetry is becoming widespread because of its simplicity. [auto creates a virtual env, adds/removes packages from pyproject.toml, akin to requirements.txt for pip] • Config files should be .env or .json

Versioning • Machine Learning System entails: data + source code
+ model • Source code versioning: version control e.g. git. • Data versioning: Data Versioning Control (DVC) or MLflow • Model Versioning: MLflow • NB: Use .gitignore to exclude files which are not relevant to the project (e.g. IDE configuration) • Cloud solutions have self-managed services that handle code, data and model versioning

Testing • Unit testing: Involves testing individual units of a
software to validate that the tested unit meets design specification • Integration testing: Involves testing the combined individual units of a software as a unit/group • System testing: Involves end-to-end testing of system specifications • Monitoring: Drift testing, Seam testing (unit testing our data inputs and outputs to ensure model lies within tolerance)

CI/CD (Continuous Integration/Continuous Deployment) • Continuous Integration • New builds
and automated testing is initiated as soon as code is merged • CD begins if all tests passes • Continuous Deployment • Deploys changes automatically to production • Trained model is served as a prediction service for predictions • Monitoring: Track summary statistics of the live data and monitor online performance of your model to send triggers or rollback to previous version when values deviate from expectations • CI/CD is implemented to automatically build, test and safely deploy your application, so you can iterate quickly when developing new software.

Case Study: Goodreads Average Rating Prediction Goodreads is an American
social cataloging website that allows individuals to search freely its database of books, annotations, quotes, and reviews. About Goodreads This project involves the use of book features from the goodreads datasets to predict the average rating of a book. The average rating of a book ranges from 1 to 5. Several features from the books datasets and generated features have been explored to carry out a classification task Scope

Cloud-based Self-Managed Services

Logical Structure of Sagemaker Source: AWS re:Invent 2018

Key Takeaways • Data Scientists/Machine Learning Engineers work in a
team with other engineers and managers to deliver a complete solution • Data Scientists/Machine Learning Engineers should be able to write clean, reusable code that can be maintained for a long time and can scale easily

Thank You

Questions

Port Harcourt School of AI Workshop

Port Harcourt School of AI Workshop

Ese

More Decks by Ese

Other Decks in Technology

Featured

Transcript

Software Engineering Best Practices for Data Scientists and Machine Learning

Eseoghene Emuraye • Statistical Consultant - Provide consultation in programming

Outline • Learning objectives • Why Data Scientists/Machine Engineers should

Learning Objectives • Understand the roles played by data scientists/machine

Why Data Scientists/Machine Engineers should adopt software engineering best practices

Why Data Scientists/Machine Engineers should adopt software engineering best practices

Phase I: Model Development and Prototyping Phase II: Deployment and

• Readable, efficient and maintainable code • Proper documentation •

Readable and maintainable code • Use a style guide (e.g.

Documentation • Use of docstrings • Classes • Brief description

Dependency management • It is important for builds to run

Versioning • Machine Learning System entails: data + source code

Testing • Unit testing: Involves testing individual units of a

CI/CD (Continuous Integration/Continuous Deployment) • Continuous Integration • New builds

Case Study: Goodreads Average Rating Prediction Goodreads is an American

Cloud-based Self-Managed Services

Logical Structure of Sagemaker Source: AWS re:Invent 2018

Key Takeaways • Data Scientists/Machine Learning Engineers work in a

Thank You

Questions