Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The Software Engineering Part of Data Science

227382dbd5e033db211c159edf32853c?s=47 Hacarus Inc.
September 21, 2019

The Software Engineering Part of Data Science

Presentation slides at PyCon Taiwan 2019


Hacarus Inc.

September 21, 2019


  1. The Software Engineering Part of Data Science

  2. Call me Ninz! - Software Engineer @ Hacarus - Studying

    MSCS @ Ateneo de Manila University - Music enthusiast - Physics and astronomy is <3 - I love data! - Curious, always.
  3. Let’s begin! • Introduction • Identifying development challenges • Defining

    solutions • Software engineer to data scientist • Improvements from Python
  4. VS

  5. • Product delivery • Product delivery

  6. • Product delivery • Code quality • Maintainability • Architecture/Boiler

    plates • User oriented • App development • Product delivery • Model accuracy • Visualization of data • Notebooks • Data-driven • Mathematical computations
  7. The software engineer • Code from data scientist SOMETIMES ARE

    NOT EFFICIENT and CLEAN! • Models are based on limited training and testing data. • Data formats, input and output. • D O C U M E N T A T I O N !
  8. Models

  9. Python bridges the gap • Common language • Easy to

    understand • Consistency of setup (pyenv, pipenv, etc) • Faster prototyping
  10. Development Challenges

  11. Starting the Project • Data scientists and software engineers must

    align their requirements ◦ Avoid high level planning ◦ Be very specific in resource requirements • “What data are we using?” • “Who will be using the data?”
  12. Code standards • Most data scientists don’t use linters! UGHHH

    SAD! • If there will be errors, make it straight to the point • Don’t expect 100% code quality from Data Scientists • Utilize language feature ◦ Type hints in python ◦ Data classes
  13. Architectural Challenges • How different algorithms consume resource ◦ CPU

    and memory limitations ◦ Storage options ◦ Multiprocessing vs multithreading • How models will be used by applications ◦ Interfaces ◦ Communication protocols
  14. Deployment • Identify deployment methods ◦ Cloud or Local ◦

    Mobile ◦ Micro Controllers, etc • Scaling up cloud deployments • Scaling down for local deployments (minimal resource environment) • Continuous Integration
  15. What we’ve learned (Solutions)

  16. Data Driven Development • Application structure and requirements will depend

    on: ◦ What are the data inputs for the algorithms? ◦ What are the data outputs/results? ◦ How will the data be transformed throughout its lifecycle • Python provides several libraries that can be utilized in data driven development like Data Classes
  17. Data Driven Development Data inputs Transformation / processing Apply Model

    Output P i p e l i n e Data sources Data storage Data workflow
  18. Code Standards • Code Quality • Error Handling • Data

  19. Code Quality Remember: Data scientists are NOT software developers.

  20. Code Quality - Data scientists and Application team should use

    linters and automatic code formatting tools. - Conventions on function definitions and interfaces. - Code reviews - Use Type Hints and other tools that IDEs utilize
  21. Type hints from typing import List class User: def __init__(self,

    name: str, age: int, alive: bool, hobbies: List[str]) -> object: print(f"My name is {name}")
  22. Error Handling - Standardize Errors - Meaningful errors - Warnings

    vs Errors vs Fatal Errors
  23. Specific Errors # List errors thrown by models class NotLatestVersionException(Exception):

    """Version mismatch for algorithm""" class NotFittedException(Exception): """Model is not fitted""" class DataSizeException(Exception): """Invalid data size for training""" class NoTrainDataException(Exception): """No training data""" - Errors are clear and descriptive - Case to case basis
  24. Data Integrity - Create data classes for strict type implementation

    - Pre processing should be atomic in nature. - Single operation per data only - Data output and results must be stored as granular as possible
  25. Class Definition class TraningData: """TraningData represents labeled dataset""" def __init__(self,

    data: Iterable, label: Label = None, metadata=None) -> None: """ Parameters ---------- data : Iterable, shape, matrix) label : Label.GOOD | Label.BAD metadata : other info """ self.data = data self.label = label self.metadata = metadata
  26. Data Integrity: Atomic Operations Data Cleaning Principal Component Analysis Feature

    Reduction Training/Prediction
  27. Data Integrity: Granularity of Data Results Raw Image Annotations Black

    and White Image Processed Image Raw Image with result overlay
  28. Architectural Challenges • Memory vs CPU • CPU needed for

    training • Memory needed for model storage • Consider the kind of algorithm the model uses • Sparse modeling usually performs well in smaller resource setup
  29. Architectural Challenges • Multithreading ◦ Good for memory bound algorithms

    (Decision Trees) ◦ Easy to implement, much simpler • Multiprocessing ◦ Provides more throughput ◦ Better for algorithms with high CPU consumption (NN) ◦ A lot more difficult to implement
  30. Architectural Challenges • Software engineer must provide the “glue code”

    for data scientists ◦ Utilize interfaces in python ◦ Class definitions
  31. Sample Interface from abc import ABC, abstractmethod class ModelInterface(ABC): @abstractmethod

    def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None: # Throw error when not implemented raise NotImplementedError() @abstractmethod def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]: # Throw error when not implemented raise NotImplementedError()
  32. Sample Interface class MyModel(ModelInterface): def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData])

    -> None: self.model = RandomForestClassfier().fit(X, y) # Do some other stuff def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]: result = self.model.predict(X) # Process result here return result
  33. Deployment • Cloud has “infinite” resources • Local deployments have

    limited resources
  34. Deployment As software engineers, we need to do the ff:

    - Identify if the model will work better in cloud or local - Provide inference so data scientist can adjust
  35. Deployment Scaling up - Take note of algo runtime -

    Anything greater than O(n^2) won’t work even if you scale.
  36. Deployment Scaling down - Identify space complexity for setup -

    O( F * n logn) algorithms should be the target *F is size of feature set
  37. Deployment Continuous integration and automated tests are important in making

    sure errors are handled right away.
  38. I want to be a Data Scientist! (♫♪Like no one

    ever was)
  39. Shifting to Data Science • Formal study! Data Science has

    LOTS OF MATH! ◦ Grad school offers data science courses ◦ Usually one year straight courses • Find related jobs ◦ Internships ◦ Data engineering is not data science but close enough :P
  40. Shifting to Data Science • Do research ◦ A lot

    of research grants involve trainings, etc ◦ Good mentors in the form of PhD students ** Taking online tutorials and lessons are often not that good **
  41. All thanks to Python

  42. Software development Continuous addition of new features and usage -

    import importlib.metadata, from concurrent.futures import ThreadPoolExecutor - Paradigm shifts - Functional support - Object oriented support - Rich library for application development - Micro service friendly structures
  43. Data Science Continuous addition of new features and usage -

    Data classes - Support for data driven development - SUPER Extensive data science libraries - Easy to use for data scientist
  44. Questions Code samples will be available later today @ https://github.com/pprmint/pycontw