ML Integrations on Application Development

Call me Ninz!
- Software Engineer @ Hacarus
- Studying MSCS @ Ateneo de Manila University
- Music enthusiast
- Physics and astronomy are <3
- I love data!
- Curious, always.

Let’s begin!
• Introduction
• Integration Challenges
• Architectural and Development Approach

• Product delivery
• Code quality
• Maintainability
• Architecture/Boilerplates
• User oriented
• App development
• Product delivery
• Model accuracy
• Visualization of data
• Notebooks
• Data-driven
• Mathematical computations

The software engineer
• Code from data scientists is SOMETIMES not efficient and clean.
• Models are based on limited training and testing data.
• Data formats, input and output.
• DOCUMENTATION!

The Challenges
- Making sure code is efficient and maintainable
  - Resource limitations
  - Code quality
  - Error handling
- Data integrity for both input and results
- Proper feedback loop for data scientists and developers

Efficiency and Maintainability
This is crucial for both the machine learning and the application development side. The focus is usually on the following:
- Resource Limitations
- Code Quality
- Error Handling

Resource Limitations
- Data scientists usually work in environments with limited resources.
- Good for creating and verifying models.

Resource Limitations
- Real-world applications should scale with user demand
- Scale with the right amount of resources
- Some applications have specific memory and resource constraints
- Software developers should cater to both

Code Quality
Remember: Data scientists are NOT software developers.

Code Quality
- Don’t expect 100% code quality
- Responsibility for the quality of the codebase falls on software developers
- Type hints are very useful
- Tools that improve code readability are highly encouraged

Error Handling
model.fit()
model.predict()
Code coming from data scientists is usually abstracted and high-level. Data scientists and application developers must agree on how to handle errors.

Data Integrity
- Pre-processing of data inputs
- Consistency between expected inputs and outputs
- Making sure the right results are displayed on the application side
- Making sure the right data is passed to the machine learning side

Feedback Loop
- DOCUMENT as many things as you can
- Agree on implementation key points such as
  - Release versions and deployment
  - Data pipelines
  - Validation, etc.
- Regular meetings with the data science team are a must!

Solutions and Approach (what we’ve learned in our team)

Proper Resource Handling
- Memory vs CPU
  - CPU needed for training
  - Memory needed for model storage
- Consider the kind of algorithm the model uses
- Sparse modeling usually performs well in smaller resource setups

Proper Resource Handling
- Type of deployment
  - Mobile?
  - Cloud?
  - Local?
- Multithreading vs multiprocessing
- Usually there is a thin layer of Python interface in between.
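The multithreading vs multiprocessing choice can be sketched with the standard library's `concurrent.futures`. This is only an illustration: `predict_batch` is a hypothetical stand-in for a CPU-bound model call and is not from the deck.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import List, Sequence, Type


def predict_batch(batch: Sequence[float]) -> float:
    """Hypothetical stand-in for a CPU-bound model call."""
    return sum(x * x for x in batch)


def predict_all(
    executor_cls: Type[Executor],
    batches: List[Sequence[float]],
    workers: int = 2,
) -> List[float]:
    """Run predict_batch over every batch with the given executor type."""
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order, so results line up with batches.
        return list(pool.map(predict_batch, batches))


# Threads share one copy of the model in memory, but in CPython they
# contend for the interpreter lock during CPU-bound work.
results = predict_all(ThreadPoolExecutor, [[1.0, 2.0], [3.0]])
```

Passing `concurrent.futures.ProcessPoolExecutor` instead runs the same function in separate processes, which avoids the interpreter lock at the cost of each worker holding its own copy of the model. On platforms that spawn worker processes, that call should sit under an `if __name__ == "__main__":` guard.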

Code Quality
- Data scientists and the application team should use linters and automatic code formatting tools.
- Agree on conventions for function definitions and interfaces.
- Code reviews
- Use type hints and other tools that IDEs utilize

Sample Interface
    from abc import ABC, abstractmethod
    from typing import Any, Iterable

    # FeatureData and LabelData are not defined on the slide;
    # these aliases are placeholders so the snippet runs.
    FeatureData = Any
    LabelData = Any

    class ModelInterface(ABC):
        @abstractmethod
        def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
            # Throw error when not implemented
            raise NotImplementedError()

        @abstractmethod
        def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
            # Throw error when not implemented
            raise NotImplementedError()
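To show how such an interface might be used, here is a minimal sketch of a concrete model, with `fit` and `predict` as the two abstract methods. `MajorityLabelModel` is a hypothetical toy class, and the `FeatureData`/`LabelData` aliases are placeholders because the deck does not define them.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Any, Iterable, List, Optional

FeatureData = Any  # placeholder types, not defined in the deck
LabelData = Any


class ModelInterface(ABC):
    @abstractmethod
    def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
        ...

    @abstractmethod
    def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
        ...


class MajorityLabelModel(ModelInterface):
    """Hypothetical toy model: always predicts the most common training label."""

    def __init__(self) -> None:
        self._majority: Optional[LabelData] = None

    def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
        labels = list(y)
        if not labels:
            raise ValueError("no training labels")
        self._majority = Counter(labels).most_common(1)[0][0]

    def predict(self, X: Iterable[FeatureData]) -> List[LabelData]:
        # One prediction per input row.
        return [self._majority for _ in X]
```

Because the base class uses `abstractmethod`, a subclass that forgets to implement `predict` cannot be instantiated, so the application team finds out at construction time rather than at the first prediction.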

Error Handling
- Standardize errors
- Meaningful errors
- Warnings vs Errors vs Fatal Errors
- Continuous integration and automated tests

Specific Errors
    # List errors thrown by models
    class NotLatestVersionException(Exception):
        """Version mismatch for algorithm"""

    class NotFittedException(Exception):
        """Model is not fitted"""

    class DataSizeException(Exception):
        """Invalid data size for training"""

    class NoTrainDataException(Exception):
        """No training data"""
- Errors are clear and descriptive
- Case-by-case basis
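A sketch of how the model side might raise these errors and the application side might translate them. `MeanModel`, `MIN_SAMPLES`, `safe_predict` and the response keys are all hypothetical names chosen for illustration; only the exception names come from the slide.

```python
from typing import List, Optional, Sequence


class NotFittedException(Exception):
    """Model is not fitted"""


class DataSizeException(Exception):
    """Invalid data size for training"""


class MeanModel:
    """Hypothetical model: predicts the mean of the training targets."""

    MIN_SAMPLES = 2  # assumed threshold, for illustration only

    def __init__(self) -> None:
        self._mean: Optional[float] = None

    def fit(self, y: Sequence[float]) -> None:
        if len(y) < self.MIN_SAMPLES:
            raise DataSizeException(
                f"need at least {self.MIN_SAMPLES} samples, got {len(y)}"
            )
        self._mean = sum(y) / len(y)

    def predict(self, n: int) -> List[float]:
        if self._mean is None:
            raise NotFittedException("call fit() before predict()")
        return [self._mean] * n


def safe_predict(model: MeanModel, n: int) -> dict:
    """Application-side wrapper: turn a model error into a response payload."""
    try:
        return {"ok": True, "result": model.predict(n)}
    except NotFittedException as exc:
        return {"ok": False, "error": "model_not_ready", "detail": str(exc)}
```

Catching the specific exception type, rather than a bare `Exception`, keeps unexpected failures visible while still giving the user a meaningful message for the cases both teams agreed on.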

Error Handling
Continuous integration and automated tests are important in making sure errors are handled right away.
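As one possible shape for such automated tests, the sketch below uses the standard `unittest` module to pin down the agreed error behaviour. `StubModel` is a hypothetical stand-in for a real model; a CI job would run tests like these on every change.

```python
import unittest


class NotFittedException(Exception):
    """Model is not fitted"""


class StubModel:
    """Hypothetical model used only to demonstrate the tests."""

    def __init__(self) -> None:
        self.fitted = False

    def fit(self, X, y) -> None:
        self.fitted = True

    def predict(self, X):
        if not self.fitted:
            raise NotFittedException("model is not fitted")
        return [0 for _ in X]


class ModelContractTest(unittest.TestCase):
    def test_predict_before_fit_raises(self):
        # The agreed contract: predicting with an unfitted model is an error.
        with self.assertRaises(NotFittedException):
            StubModel().predict([[1.0]])

    def test_predict_returns_one_label_per_row(self):
        model = StubModel()
        model.fit([[1.0], [2.0]], [0, 1])
        self.assertEqual(len(model.predict([[3.0], [4.0], [5.0]])), 3)
```

Saved as a test module, this can be run with `python -m unittest`, and the same command can be wired into whichever CI service the team uses.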

Data Integrity
- Create data classes for strict type implementation
- Pre-processing should be atomic in nature.
  - Single operation per data only
- Data output and results must be stored as granularly as possible

Data Class
    from typing import Iterable, Optional

    # Label is assumed to be an enum defined elsewhere with members GOOD and BAD.
    class TraningData:
        """TraningData represents labeled dataset"""

        def __init__(self, data: Iterable, label: Optional["Label"] = None,
                     metadata=None) -> None:
            """
            Parameters
            ----------
            data : Iterable (shape, matrix)
            label : Label.GOOD | Label.BAD
            metadata : other info
            """
            self.data = data
            self.label = label
            self.metadata = metadata
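One way to make a data class stricter is to validate on construction. The sketch below uses the standard `dataclasses` module; `Sample` is a hypothetical name, and the `Label` member values are assumed (only the names `GOOD` and `BAD` appear in the docstring above).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional, Sequence


class Label(Enum):
    # Member names come from the slide's docstring; the values are assumed.
    GOOD = "good"
    BAD = "bad"


@dataclass(frozen=True)
class Sample:
    """Hypothetical strict data class: checks its fields when constructed."""

    data: Sequence[float]
    label: Optional[Label] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if len(self.data) == 0:
            raise ValueError("data must not be empty")
        if self.label is not None and not isinstance(self.label, Label):
            raise TypeError(
                f"label must be a Label, got {type(self.label).__name__}"
            )
```

With `frozen=True` the fields cannot be reassigned after construction, so a sample that passed validation at the boundary stays valid as it moves between the application and the model.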

Data Integrity: Atomic Operations
- Data Cleaning
- Feature Reduction
- Principal Component Analysis
- Training/Prediction
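A minimal sketch of atomic pre-processing, assuming each step is a small function that does one thing and returns new data instead of mutating its input. The function names are hypothetical, and `center` only stands in for the first move of a PCA step rather than implementing PCA.

```python
from typing import Callable, List, Optional, Sequence

Row = List[float]
Step = Callable[[List[Row]], List[Row]]


def drop_missing(rows: Sequence[Sequence[Optional[float]]]) -> List[Row]:
    """Data cleaning: keep only rows with no missing values."""
    return [list(r) for r in rows if all(v is not None for v in r)]


def keep_columns(indices: Sequence[int]) -> Step:
    """Feature reduction: build a step that keeps only the given columns."""
    def step(rows: List[Row]) -> List[Row]:
        return [[r[i] for i in indices] for r in rows]
    return step


def center(rows: List[Row]) -> List[Row]:
    """Subtract each column's mean (stand-in for a PCA step, not PCA itself)."""
    if not rows:
        return []
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(r, means)] for r in rows]


def run_pipeline(rows, steps: Sequence[Step]) -> List[Row]:
    """Apply each single-purpose step in order."""
    for step in steps:
        rows = step(rows)
    return rows


raw = [[1.0, 10.0, 5.0], [None, 20.0, 6.0], [3.0, 30.0, 7.0]]
prepared = run_pipeline(raw, [drop_missing, keep_columns([0, 2]), center])
```

Keeping each step separate means a step can be tested, reordered or replaced on its own, and the output of any stage can be stored for inspection.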

Data Integrity: Granularity of Data Results
- Raw Image
- Annotations
- Black and White Image
- Processed Image
- Raw Image with result overlay

Feedback Loop
Without a proper feedback loop and communication, it is very difficult to work with machine learning developers and data scientists.

Feedback Loop: Proactive Documentation
- When app developers notice something missing, we inform the data science team right away
- Documentation in advance, even if the feature is still being developed

Feedback Loop: Version Handling
- ML libraries and applications use different versioning
- One application might use a different version of the ML
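One way to surface a version mismatch early is to check it when the application loads a model. This sketch assumes dotted numeric version strings; `parse_version` and `check_model_version` are hypothetical helpers, and only `NotLatestVersionException` comes from the deck.

```python
from typing import Tuple


class NotLatestVersionException(Exception):
    """Version mismatch for algorithm"""


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as '1.4.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def check_model_version(model_version: str, required: str) -> None:
    """Raise if the model was built with an older algorithm than required."""
    if parse_version(model_version) < parse_version(required):
        raise NotLatestVersionException(
            f"model built with {model_version}, application requires >= {required}"
        )
```

Comparing tuples of integers rather than raw strings matters here: as strings, "1.10.0" sorts before "1.9.0", which would reject a newer model.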

Feedback Loop: Deployment
- Software developers must give the ML team the capability to deploy new versions of algorithms
- Deployment must be reversible and backwards compatible
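Reversible deployment can be sketched as a registry that remembers the order of deployments. `ModelRegistry` is a hypothetical in-memory illustration, not the team's actual tooling; a real setup would persist this state.

```python
from typing import Dict, List, Optional


class ModelRegistry:
    """Hypothetical in-memory registry tracking which model version is live."""

    def __init__(self) -> None:
        self._models: Dict[str, object] = {}
        self._history: List[str] = []  # deployment order, newest last

    def deploy(self, version: str, model: object) -> None:
        """Make the given version live while keeping earlier ones available."""
        self._models[version] = model
        self._history.append(version)

    def rollback(self) -> str:
        """Revert to the previously deployed version and return its name."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active_version(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def active_model(self) -> object:
        if not self._history:
            raise RuntimeError("nothing deployed yet")
        return self._models[self._history[-1]]
```

Because older models stay in the registry, the ML team can deploy a new algorithm version and undo it without involving the application developers.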

Feedback Loop: Team Building Activities (for nerdy people)
- Kaggle Challenge
- Software engineers doing ML exercises with data scientists and vice versa.
- Solving online challenges, etc.
- Makes it easier to align with the ML team.

Questions

Hacarus Inc.
