The Software Engineering Part of Data Science

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Call me Ninz! - Software Engineer @ Hacarus - Studying MSCS @ Ateneo de Manila University - Music enthusiast - Physics and astronomy is <3 - I love data! - Curious, always.

Slide 3

Slide 3 text

Let’s begin! ● Introduction ● Identifying development challenges ● Defining solutions ● Software engineer to data scientist ● Improvements from Python

Slide 4

Slide 4 text

Slide 5

Slide 5 text

● Product delivery ● Product delivery

Slide 6

Slide 6 text

● Product delivery ● Code quality ● Maintainability ● Architecture/Boiler plates ● User oriented ● App development ● Product delivery ● Model accuracy ● Visualization of data ● Notebooks ● Data-driven ● Mathematical computations

Slide 7

Slide 7 text

The software engineer ● Code from data scientists is SOMETIMES NOT EFFICIENT or CLEAN! ● Models are based on limited training and testing data. ● Data formats, input and output. ● DOCUMENTATION!

Slide 8

Slide 8 text

Models

Slide 9

Slide 9 text

Python bridges the gap ● Common language ● Easy to understand ● Consistency of setup (pyenv, pipenv, etc) ● Faster prototyping

Slide 10

Slide 10 text

Development Challenges

Slide 11

Slide 11 text

Starting the Project ● Data scientists and software engineers must align their requirements ○ Avoid high-level planning ○ Be very specific in resource requirements ● “What data are we using?” ● “Who will be using the data?”

Slide 12

Slide 12 text

Code standards ● Most data scientists don’t use linters! UGHHH SAD! ● When errors are raised, make them straight to the point ● Don’t expect 100% code quality from data scientists ● Utilize language features ○ Type hints in Python ○ Data classes

Slide 13

Slide 13 text

Architectural Challenges ● How different algorithms consume resources ○ CPU and memory limitations ○ Storage options ○ Multiprocessing vs multithreading ● How models will be used by applications ○ Interfaces ○ Communication protocols

Slide 14

Slide 14 text

Deployment ● Identify deployment methods ○ Cloud or Local ○ Mobile ○ Micro Controllers, etc ● Scaling up cloud deployments ● Scaling down for local deployments (minimal resource environment) ● Continuous Integration

Slide 15

Slide 15 text

What we’ve learned (Solutions)

Slide 16

Slide 16 text

Data Driven Development ● Application structure and requirements will depend on: ○ What are the data inputs for the algorithms? ○ What are the data outputs/results? ○ How will the data be transformed throughout its lifecycle? ● Python provides several features and libraries, such as data classes, that can be utilized in data-driven development
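A minimal sketch of how data classes can pin down the inputs and outputs of an algorithm before any application code is written. The names `InspectionInput`, `InspectionResult`, and `run_inspection` are hypothetical and do not come from the talk, and the averaging logic only stands in for a real model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class InspectionInput:
    """Hypothetical input record handed to an algorithm."""
    sample_id: str
    features: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class InspectionResult:
    """Hypothetical output record produced by an algorithm."""
    sample_id: str
    score: float
    is_defect: bool


def run_inspection(item: InspectionInput, threshold: float = 0.5) -> InspectionResult:
    # Placeholder "model": the mean of the features stands in for a real prediction.
    score = sum(item.features) / len(item.features)
    return InspectionResult(item.sample_id, score, score > threshold)


result = run_inspection(InspectionInput("sample-001", [0.2, 0.9, 0.7]))
print(result)
```

Because both records are frozen, neither side can silently change a field after it has been handed over, which keeps the contract between data scientists and the application team explicit.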

Slide 17

Slide 17 text

Data Driven Development [Diagram] Pipeline: Data inputs → Transformation / processing → Apply Model → Output. Other labels: Data sources, Data storage, Data workflow

Slide 18

Slide 18 text

Code Standards ● Code Quality ● Error Handling ● Data Integrity

Slide 19

Slide 19 text

Code Quality Remember: Data scientists are NOT software developers.

Slide 20

Slide 20 text

Code Quality - Data scientists and Application team should use linters and automatic code formatting tools. - Conventions on function definitions and interfaces. - Code reviews - Use Type Hints and other tools that IDEs utilize

Slide 21

Slide 21 text

Type hints

    from typing import List

    class User:
        def __init__(self, name: str, age: int, alive: bool,
                     hobbies: List[str]) -> None:  # __init__ always returns None
            print(f"My name is {name}")

Slide 22

Slide 22 text

Error Handling - Standardize Errors - Meaningful errors - Warnings vs Errors vs Fatal Errors

Slide 23

Slide 23 text

Specific Errors

    # List errors thrown by models
    class NotLatestVersionException(Exception):
        """Version mismatch for algorithm"""

    class NotFittedException(Exception):
        """Model is not fitted"""

    class DataSizeException(Exception):
        """Invalid data size for training"""

    class NoTrainDataException(Exception):
        """No training data"""

- Errors are clear and descriptive - Case-by-case basis
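One way such exception classes might be used in practice is sketched below. `ThresholdModel` is a hypothetical toy model invented for this illustration, and two of the exceptions are declared again only so the block runs on its own.

```python
from typing import List, Optional


class NotFittedException(Exception):
    """Model is not fitted"""


class NoTrainDataException(Exception):
    """No training data"""


class ThresholdModel:
    """Hypothetical toy model: predicts 1 when a value exceeds the training mean."""

    def __init__(self) -> None:
        self.mean: Optional[float] = None

    def fit(self, values: List[float]) -> None:
        if not values:
            raise NoTrainDataException("fit() was called with an empty dataset")
        self.mean = sum(values) / len(values)

    def predict(self, values: List[float]) -> List[int]:
        if self.mean is None:
            raise NotFittedException("call fit() before predict()")
        return [int(v > self.mean) for v in values]


model = ThresholdModel()
try:
    model.predict([1.0, 2.0])
except NotFittedException as exc:
    # The application can react to this exact failure instead of a generic error.
    print(f"Cannot predict yet: {exc}")

model.fit([1.0, 2.0, 3.0])
print(model.predict([1.0, 4.0]))
```

Catching the specific class lets the application decide per error whether to retry, ask for more data, or stop, which a bare `Exception` would not allow.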

Slide 24

Slide 24 text

Data Integrity - Create data classes for strict type implementation - Preprocessing should be atomic in nature. - A single operation on the data per step - Data output and results must be stored as granularly as possible

Slide 25

Slide 25 text

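Class Definition

    from enum import Enum
    from typing import Iterable, Optional

    class Label(Enum):
        # Assumed definition; the slide only references Label.GOOD and Label.BAD
        GOOD = "good"
        BAD = "bad"

    class TrainingData:
        """TrainingData represents a labeled dataset"""

        def __init__(self, data: Iterable, label: Optional[Label] = None,
                     metadata=None) -> None:
            """
            Parameters
            ----------
            data : Iterable (shape, matrix)
            label : Label.GOOD | Label.BAD
            metadata : other info
            """
            self.data = data
            self.label = label
            self.metadata = metadata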

Slide 26

Slide 26 text

Data Integrity: Atomic Operations [Diagram] Data Cleaning → Principal Component Analysis → Feature Reduction → Training/Prediction
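The idea of atomic steps can be sketched as a chain of small functions that each do one thing and return new data. The step names `drop_missing`, `min_max_scale`, and `keep_columns` are hypothetical stand-ins chosen to keep the example dependency-free; they are not the cleaning, PCA, or feature reduction steps from the slide.

```python
from typing import Callable, List, Optional, Sequence

Row = List[float]


def drop_missing(rows: Sequence[Sequence[Optional[float]]]) -> List[Row]:
    """Cleaning step: keep only rows without missing values."""
    return [list(row) for row in rows if None not in row]


def min_max_scale(rows: List[Row]) -> List[Row]:
    """Scaling step: rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def keep_columns(indices: List[int]) -> Callable[[List[Row]], List[Row]]:
    """Feature reduction step: keep only the selected columns."""
    def step(rows: List[Row]) -> List[Row]:
        return [[row[i] for i in indices] for row in rows]
    return step


def run_pipeline(rows, steps):
    """Apply each step in order and keep every intermediate result."""
    history = [rows]
    for step in steps:
        history.append(step(history[-1]))
    return history


raw = [[1.0, 10.0, 5.0], [2.0, None, 6.0], [3.0, 30.0, 7.0]]
history = run_pipeline(raw, [drop_missing, min_max_scale, keep_columns([0, 1])])
final = history[-1]
print(final)
```

Keeping every intermediate result in `history` also makes it possible to store outputs at each stage rather than only the final one.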

Slide 27

Slide 27 text

Data Integrity: Granularity of Data Results ● Raw Image ● Annotations ● Black and White Image ● Processed Image ● Raw Image with result overlay

Slide 28

Slide 28 text

Architectural Challenges ● Memory vs CPU ● CPU needed for training ● Memory needed for model storage ● Consider the kind of algorithm the model uses ● Sparse modeling usually performs well in smaller resource setups

Slide 29

Slide 29 text

Architectural Challenges ● Multithreading ○ Good for memory bound algorithms (Decision Trees) ○ Easy to implement, much simpler ● Multiprocessing ○ Provides more throughput ○ Better for algorithms with high CPU consumption (NN) ○ A lot more difficult to implement
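A rough sketch of the two options using the standard library's `concurrent.futures`. The function `cpu_heavy` is a hypothetical workload standing in for training or inference, and the worker count of 4 is an arbitrary choice for the example.

```python
import math
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List


def cpu_heavy(n: int) -> float:
    """Hypothetical CPU-bound task standing in for model training or inference."""
    return sum(math.sqrt(i) for i in range(n))


def run_with_threads(inputs: List[int]) -> List[float]:
    # Threads share memory and are simple to set up, but in standard CPython
    # the GIL keeps pure-Python CPU-bound work from running in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(cpu_heavy, inputs))


def run_with_processes(inputs: List[int]) -> List[float]:
    # Processes can use separate cores, at the cost of pickling arguments and
    # results between the parent and the workers.
    with ProcessPoolExecutor(max_workers=4) as pool:
        return list(pool.map(cpu_heavy, inputs))


if __name__ == "__main__":
    work = [200_000] * 4
    assert run_with_threads(work) == run_with_processes(work)
    print("both executors returned the same results")
```

Both functions return the same results; what differs is how the work is scheduled. Libraries that release the GIL inside native code can still benefit from threads, so it is worth measuring both on the actual model.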

Slide 30

Slide 30 text

Architectural Challenges ● Software engineer must provide the “glue code” for data scientists ○ Utilize interfaces in Python ○ Class definitions

Slide 31

Slide 31 text

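Sample Interface

    from __future__ import annotations  # keeps the undefined type names below as plain annotations

    from abc import ABC, abstractmethod
    from typing import Iterable

    # FeatureData and LabelData are project-specific types defined elsewhere.
    class ModelInterface(ABC):
        @abstractmethod
        def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
            # Throw error when not implemented
            raise NotImplementedError()

        @abstractmethod
        def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
            # Throw error when not implemented
            raise NotImplementedError()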

Slide 32

Slide 32 text

Sample Interface

    from sklearn.ensemble import RandomForestClassifier

    class MyModel(ModelInterface):
        def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
            self.model = RandomForestClassifier().fit(X, y)
            # Do some other stuff

        def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
            result = self.model.predict(X)
            # Process result here
            return result
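To show how application code can depend only on the interface, here is a self-contained sketch. `MajorityModel`, `IncompleteModel`, and `train_and_predict` are hypothetical names made up for this example, and the interface is declared again with simplified types so the block runs without scikit-learn.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import List, Sequence


class ModelInterface(ABC):
    @abstractmethod
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> None: ...

    @abstractmethod
    def predict(self, X: Sequence[Sequence[float]]) -> List[int]: ...


class MajorityModel(ModelInterface):
    """Hypothetical baseline: always predicts the most common training label."""

    def fit(self, X, y) -> None:
        self.label = Counter(y).most_common(1)[0][0]

    def predict(self, X) -> List[int]:
        return [self.label for _ in X]


class IncompleteModel(ModelInterface):
    """Forgets to implement predict(), so it cannot be instantiated."""

    def fit(self, X, y) -> None:
        pass


def train_and_predict(model: ModelInterface, X, y, X_new) -> List[int]:
    # Application code only relies on the interface, not on a concrete model.
    model.fit(X, y)
    return model.predict(X_new)


print(train_and_predict(MajorityModel(), [[0.0], [1.0], [2.0]], [1, 1, 0], [[5.0], [6.0]]))

try:
    IncompleteModel()
except TypeError as exc:
    print(f"Rejected at construction time: {exc}")
```

The abstract base class rejects an incomplete model as soon as it is constructed, so a missing method surfaces early rather than in the middle of a prediction request.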

Slide 33

Slide 33 text

Deployment ● Cloud has “infinite” resources ● Local deployments have limited resources

Slide 34

Slide 34 text

Deployment As software engineers, we need to do the following: - Identify whether the model will work better in the cloud or locally - Provide inference so data scientists can adjust

Slide 35

Slide 35 text

Deployment Scaling up - Take note of algorithm runtime complexity - Anything worse than O(n^2) won’t work even if you scale.
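A back-of-the-envelope calculation illustrates why scaling alone struggles against steep complexity. It assumes perfect linear speedup from added hardware and ignores constant factors; `extra_compute_factor` is a hypothetical helper, not something from the talk.

```python
def extra_compute_factor(data_growth: float, exponent: float) -> float:
    """How much more work an O(n^exponent) algorithm does when n grows by data_growth."""
    return data_growth ** exponent


for exponent in (1, 2, 3):
    factor = extra_compute_factor(10, exponent)
    print(f"O(n^{exponent}): 10x more data needs about {factor:,.0f}x more compute")
```

Under these assumptions, ten times more data costs ten times the hardware for a linear algorithm but a thousand times the hardware for a cubic one.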

Slide 36

Slide 36 text

Deployment Scaling down - Identify space complexity for setup - O(F * n log n) algorithms should be the target (F is the size of the feature set)

Slide 37

Slide 37 text

Deployment Continuous integration and automated tests are important in making sure errors are handled right away.

Slide 38

Slide 38 text

I want to be a Data Scientist! (♫♪Like no one ever was)

Slide 39

Slide 39 text

Shifting to Data Science ● Formal study! Data Science has LOTS OF MATH! ○ Grad school offers data science courses ○ Usually one year straight courses ● Find related jobs ○ Internships ○ Data engineering is not data science but close enough :P

Slide 40

Slide 40 text

Shifting to Data Science ● Do research ○ A lot of research grants involve training, etc. ○ Good mentors in the form of PhD students ** Taking online tutorials and lessons is often not that good **

Slide 41

Slide 41 text

All thanks to Python

Slide 42

Slide 42 text

Software development Continuous addition of new features and usage - import importlib.metadata, from concurrent.futures import ThreadPoolExecutor - Paradigm shifts - Functional support - Object oriented support - Rich library for application development - Micro service friendly structures

Slide 43

Slide 43 text

Data Science Continuous addition of new features and usage - Data classes - Support for data driven development - SUPER Extensive data science libraries - Easy to use for data scientists

Slide 44

Slide 44 text

Questions Code samples will be available later today @ https://github.com/pprmint/pycontw