Slide 1

Slide 1 text

ML Integrations on Application Development

Slide 2

Slide 2 text

Call me Ninz! - Software Engineer @ Hacarus - Studying MSCS @ Ateneo de Manila University - Music enthusiast - Physics and astronomy are <3 - I love data! - Curious, always.

Slide 3

Slide 3 text

Let’s begin! ● Introduction ● Integration Challenges ● Architectural and Development Approach

Slide 4

Slide 4 text

VS

Slide 5

Slide 5 text

Software engineers: ● Product delivery ● Code quality ● Maintainability ● Architecture/Boilerplate ● User oriented ● App development
VS
Data scientists: ● Product delivery ● Model accuracy ● Visualization of data ● Notebooks ● Data-driven ● Mathematical computations

Slide 6

Slide 6 text

The software engineer ● Code from data scientists is SOMETIMES not efficient and clean. ● Models are based on limited training and testing data. ● Data formats, input and output. ● D O C U M E N T A T I O N !

Slide 7

Slide 7 text

The Challenges - Making sure code is efficient and maintainable - Resource limitations - Code quality - Error handling - Data integrity for both input and results - Proper feedback loop for data scientists and developers

Slide 8

Slide 8 text

Efficiency and Maintainability This is crucial for both the machine learning and application development sides. The focus is usually on the following: - Resource Limitations - Code Quality - Error Handling

Slide 9

Slide 9 text

Resource Limitations - Data scientists usually work in environments with limited resources. - Good for creating and verifying models.

Slide 10

Slide 10 text

Resource Limitations - Real-world applications should scale depending on user demands - Scale with the right amount of resources - Some applications have specific memory and resource constraints - Software developers should cater to both

Slide 11

Slide 11 text

Code Quality Remember: Data scientists are NOT software developers.

Slide 12

Slide 12 text

Code Quality - Don’t expect 100% code quality - The quality of the codebase falls on software developers - Type hints are very useful - Tools that improve code readability are highly encouraged.

Slide 13

Slide 13 text

Error Handling model.fit() model.predict() Code coming from data scientists is usually abstracted and high level. Data scientists and application developers must agree on how to handle errors.
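A minimal sketch of such an agreement; all names here are illustrative, not from the slides. The application layer wraps the high-level model calls and translates internal failures into an error type both teams agreed on:

```python
class ModelNotFittedError(Exception):
    """Agreed application-level error: predict() called before fit()."""


class Model:
    """Stand-in for a data-science model exposing fit()/predict()."""

    def __init__(self):
        self._fitted = False

    def fit(self, X):
        self._fitted = True

    def predict(self, X):
        if not self._fitted:
            # Internal failure, opaque to the application side
            raise RuntimeError("invalid model state")
        return [0 for _ in X]


def safe_predict(model, X):
    """Translate internal errors into the agreed application-level error."""
    try:
        return model.predict(X)
    except RuntimeError as exc:
        raise ModelNotFittedError("call fit() before predict()") from exc
```

The application only ever sees `ModelNotFittedError`, so the data science team is free to change internal error handling without breaking the app.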

Slide 14

Slide 14 text

Data Integrity - Pre-processing of data inputs - Consistency between expected inputs and outputs - Making sure the right results are displayed on the application side - Making sure the right data is passed to the machine learning side

Slide 15

Slide 15 text

Feedback loop - DOCUMENT as many things as you can - Agree on implementation key points such as - Release versions and deployment - Data pipelines - Validation, etc. - Regular meetings with the data science team are a must!

Slide 16

Slide 16 text

Solutions and Approach (what we’ve learned in our team)

Slide 17

Slide 17 text

Proper Resource Handling - Memory vs CPU - CPU needed for training - Memory needed for model storage - Consider the kind of algorithm the model uses - Sparse modeling usually performs well in smaller resource setups

Slide 18

Slide 18 text

Proper Resource Handling - Type of deployment - Mobile? - Cloud? - Local? - Multithreading VS Multiprocessing - Usually there is a thin Python interface layer in between.

Slide 19

Slide 19 text

Code Quality - Data scientists and the application team should use linters and automatic code formatting tools. - Agree on conventions for function definitions and interfaces. - Code reviews - Use type hints and other tools that IDEs utilize
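A tiny sketch of why the type-hint bullet pays off; the function and names are illustrative:

```python
from typing import Iterable, List


def normalize(values: Iterable[float], scale: float = 1.0) -> List[float]:
    """Scale each value.

    The hints let IDEs and checkers such as mypy flag a caller that
    passes, say, a single float where an iterable is expected, before
    the code ever runs.
    """
    return [v * scale for v in values]
```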

Slide 20

Slide 20 text

Sample Interface

from abc import ABC, abstractmethod
from typing import Iterable

class ModelInterface(ABC):
    @abstractmethod
    def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
        # Throw error when not implemented
        raise NotImplementedError()

    @abstractmethod
    def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
        # Throw error when not implemented
        raise NotImplementedError()
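A runnable sketch of a concrete class behind this interface. `FeatureData` and `LabelData` are not defined on the slide, so simple placeholder aliases are assumed here, and the toy model itself is purely illustrative:

```python
from abc import ABC, abstractmethod
from typing import Iterable, List

# Illustrative aliases; the slide's FeatureData/LabelData types are assumed
FeatureData = float
LabelData = int


class ModelInterface(ABC):
    @abstractmethod
    def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
        raise NotImplementedError()

    @abstractmethod
    def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
        raise NotImplementedError()


class MeanThresholdModel(ModelInterface):
    """Toy model: predicts 1 when a feature exceeds the training mean."""

    def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
        xs = list(X)
        self.threshold = sum(xs) / len(xs)

    def predict(self, X: Iterable[FeatureData]) -> List[LabelData]:
        return [1 if x > self.threshold else 0 for x in X]
```

Because both teams code against `ModelInterface`, the application can swap models without touching the integration code.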

Slide 21

Slide 21 text

Error Handling - Standardize Errors - Meaningful errors - Warnings vs Errors vs Fatal Errors - Continuous integration and automated tests

Slide 22

Slide 22 text

Specific Errors

# List errors thrown by models
class NotLatestVersionException(Exception):
    """Version mismatch for algorithm"""

class NotFittedException(Exception):
    """Model is not fitted"""

class DataSizeException(Exception):
    """Invalid data size for training"""

class NoTrainDataException(Exception):
    """No training data"""

- Errors are clear and descriptive
- Case-to-case basis
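A sketch of how the application side can consume such descriptive exceptions; the `predict`/`handle_request` functions are illustrative stand-ins, only the exception name follows the slide:

```python
class NotFittedException(Exception):
    """Model is not fitted"""


def predict(fitted: bool, X):
    """Stand-in predict() that raises the agreed error when unfitted."""
    if not fitted:
        raise NotFittedException("fit the model before calling predict")
    return [x * 2 for x in X]


def handle_request(fitted: bool, X):
    """Application layer: descriptive errors map cleanly to responses."""
    try:
        return {"status": "ok", "result": predict(fitted, X)}
    except NotFittedException as exc:
        return {"status": "error", "message": str(exc)}
```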

Slide 23

Slide 23 text

Error Handling Continuous integration and automated tests are important in making sure errors are handled right away.

Slide 24

Slide 24 text

Data Integrity - Create data classes for strict type implementation - Pre-processing should be atomic in nature - One operation per data item only - Data outputs and results must be stored as granularly as possible

Slide 25

Slide 25 text

Data Class

from typing import Iterable

class TrainingData:
    """TrainingData represents a labeled dataset"""

    def __init__(self, data: Iterable, label: Label = None, metadata=None) -> None:
        """
        Parameters
        ----------
        data : Iterable (e.g. a matrix)
        label : Label.GOOD | Label.BAD
        metadata : other info
        """
        self.data = data
        self.label = label
        self.metadata = metadata
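The same strict typing can be had with less boilerplate via the standard-library `dataclasses` module. A sketch, assuming a `Label` enum with the GOOD/BAD values mentioned in the docstring (the enum itself is not shown on the slide):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Label(Enum):
    GOOD = "good"
    BAD = "bad"


@dataclass
class TrainingData:
    """Labeled dataset sample; @dataclass generates __init__/__repr__/__eq__."""

    data: Sequence[float]
    label: Optional[Label] = None
    metadata: Optional[dict] = None
```

The generated `__eq__` and `__repr__` also make test failures and logs easier to read, which helps the feedback loop with the data science team.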

Slide 26

Slide 26 text

Data Integrity: Atomic Operations Data Cleaning → Feature Reduction (Principal Component Analysis) → Training/Prediction
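A sketch of what "atomic" buys here: each step performs exactly one operation, and the steps compose into the pipeline. All functions are illustrative, and a trivial column-selection stands in for PCA:

```python
def clean(rows):
    """Data cleaning: drop rows containing None (one operation only)."""
    return [row for row in rows if None not in row]


def reduce_features(rows):
    """Feature reduction stand-in for PCA: keep the first two columns."""
    return [row[:2] for row in rows]


def run_pipeline(rows, steps):
    """Apply each atomic step in order."""
    for step in steps:
        rows = step(rows)
    return rows
```

Because each step does one thing, a bad output can be traced to a single stage, and individual steps can be unit-tested in isolation.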

Slide 27

Slide 27 text

Data Integrity: Granularity of Data Results - Raw Image - Annotations - Black and White Image - Processed Image - Raw Image with result overlay
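A sketch of storing every intermediate artifact so the application can later display any granularity; the keys mirror the slide's stages, while the thresholding logic and dictionary layout are purely illustrative:

```python
def process_image(raw):
    """raw is a matrix of 0-255 grayscale pixel values."""
    results = {"raw": raw}
    # Black and white via a simple threshold (illustrative)
    results["black_and_white"] = [
        [1 if px > 128 else 0 for px in row] for row in raw
    ]
    # Further processing steps would go here; store each result separately
    results["processed"] = results["black_and_white"]
    # Overlay pairs the raw image with the result mask
    results["overlay"] = {"base": raw, "mask": results["processed"]}
    return results
```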

Slide 28

Slide 28 text

Feedback Loop Without a proper feedback loop and communication, it is very difficult to work with machine learning developers and data scientists.

Slide 29

Slide 29 text

Feedback Loop Proactive Documentation - When app developers notice something missing, we inform the data science team right away - Document in advance even if the feature is still being developed

Slide 30

Slide 30 text

Feedback Loop Version Handling - ML libraries and applications use different versioning schemes - One application might use a different version of the ML library
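A hypothetical sketch of making that mismatch explicit: the application records which model versions it was tested against and rejects others at load time (the version strings and names are illustrative):

```python
# Versions of the ML library/model this application release was tested with
SUPPORTED_MODEL_VERSIONS = {"1.2", "1.3"}


def check_model_version(model_version: str) -> bool:
    """True when the deployed model version is one the app supports."""
    return model_version in SUPPORTED_MODEL_VERSIONS
```

Failing fast on an unsupported version turns a silent accuracy regression into a clear deployment error.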

Slide 31

Slide 31 text

Feedback Loop Deployment - Software developers must give the ML team the capability to deploy new versions of algorithms - Deployments must be reversible and backwards compatible

Slide 32

Slide 32 text

Feedback Loop Team Building Activities (for nerdy people) - Kaggle Challenges - Software engineers doing ML exercises with data scientists, and vice versa - Solving online challenges, etc. - Makes it easier to align with the ML team.

Slide 33

Slide 33 text

Questions