
The Software Engineering Part of Data Science

Hacarus Inc.
September 21, 2019

Presentation slides at PyCon Taiwan 2019
https://tw.pycon.org/2019/

Transcript

  1. Call me Ninz!
     - Software Engineer @ Hacarus
     - Studying MSCS @ Ateneo de Manila University
     - Music enthusiast
     - Physics and astronomy are <3
     - I love data!
     - Curious, always.

  2. Let's begin!
     • Introduction
     • Identifying development challenges
     • Defining solutions
     • Software engineer to data scientist
     • Improvements from Python

  3. Software engineers VS data scientists

  4. Software engineer:
     • Product delivery
     • Code quality
     • Maintainability
     • Architecture / boilerplates
     • User oriented
     • App development

     Data scientist:
     • Product delivery
     • Model accuracy
     • Visualization of data
     • Notebooks
     • Data-driven
     • Mathematical computations

  5. The software engineer
     • Code from data scientists is SOMETIMES NOT EFFICIENT or CLEAN!
     • Models are based on limited training and testing data.
     • Data formats, input and output.
     • D O C U M E N T A T I O N !

  6. Python bridges the gap
     • Common language
     • Easy to understand
     • Consistency of setup (pyenv, pipenv, etc.)
     • Faster prototyping

  7. Starting the Project
     • Data scientists and software engineers must align their requirements
       ◦ Avoid purely high-level planning
       ◦ Be very specific about resource requirements
     • "What data are we using?"
     • "Who will be using the data?"

  8. Code standards
     • Most data scientists don't use linters! UGHHH SAD!
     • If there are errors, make them straight to the point
     • Don't expect 100% code quality from data scientists
     • Utilize language features
       ◦ Type hints in Python (see the sketch after this list)
       ◦ Data classes

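     To make the type-hints bullet concrete, here is a minimal sketch (not from the deck) of the kind of bug a static checker such as mypy catches before runtime; the mean function is hypothetical:

         from typing import List

         def mean(values: List[float]) -> float:
             """Arithmetic mean of a list of numbers."""
             return sum(values) / len(values)

         mean([1.0, 2.0, 3.0])   # OK
         mean(["1.0", "2.0"])    # fails at runtime inside sum(), but a checker
                                 # such as mypy flags the str elements statically
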
  9. Architectural Challenges
     • How different algorithms consume resources
       ◦ CPU and memory limitations
       ◦ Storage options
       ◦ Multiprocessing vs multithreading
     • How models will be used by applications
       ◦ Interfaces
       ◦ Communication protocols

  10. Deployment
     • Identify deployment methods
       ◦ Cloud or local
       ◦ Mobile
       ◦ Microcontrollers, etc.
     • Scaling up cloud deployments
     • Scaling down for local deployments (minimal-resource environments)
     • Continuous integration

  11. Data Driven Development
     • Application structure and requirements will depend on:
       ◦ What are the data inputs for the algorithms?
       ◦ What are the data outputs/results?
       ◦ How will the data be transformed throughout its lifecycle?
     • Python provides several libraries that can be utilized in data-driven development, like data classes (a minimal sketch follows below)

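     A minimal sketch of the data-classes idea; the field names are hypothetical, not from the deck:

         from dataclasses import dataclass, field
         from typing import List, Optional

         @dataclass
         class Sample:
             """One record flowing through the data pipeline."""
             features: List[float]
             label: Optional[str] = None              # unlabeled until annotated
             metadata: dict = field(default_factory=dict)

         record = Sample(features=[0.1, 0.2, 0.3])
         record.label = "GOOD"
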
  12. Data Driven Development
     [Diagram: the pipeline: data sources -> data inputs -> transformation / processing -> apply model -> output, backed by data storage and a data workflow]

  13. Code Quality
     - Data scientists and the application team should use linters and automatic code-formatting tools.
     - Conventions on function definitions and interfaces.
     - Code reviews
     - Use type hints and other tools that IDEs utilize

  14. Type hints

     from typing import List

     class User:
         def __init__(self, name: str, age: int, alive: bool,
                      hobbies: List[str]) -> None:  # __init__ returns None, not object
             print(f"My name is {name}")

  15. Specific Errors

     # List errors thrown by models
     class NotLatestVersionException(Exception):
         """Version mismatch for algorithm"""

     class NotFittedException(Exception):
         """Model is not fitted"""

     class DataSizeException(Exception):
         """Invalid data size for training"""

     class NoTrainDataException(Exception):
         """No training data"""

     - Errors are clear and descriptive
     - Case-to-case basis

  16. Data Integrity
     - Create data classes for strict type implementation
     - Preprocessing should be atomic in nature.
     - Single operation per data only (see the sketch after this list)
     - Data output and results must be stored at as granular a level as possible

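     A minimal sketch of atomic preprocessing: each function performs exactly one operation, so every intermediate result can be stored and inspected (the function names are hypothetical):

         from typing import List

         def clip_outliers(values: List[float], limit: float = 100.0) -> List[float]:
             """Single operation: clamp values to [-limit, limit]."""
             return [max(-limit, min(limit, v)) for v in values]

         def normalize(values: List[float]) -> List[float]:
             """Single operation: rescale values into [0, 1]."""
             lo, hi = min(values), max(values)
             return [(v - lo) / (hi - lo) for v in values]

         raw = [1.0, 5.0, 250.0, 3.0]
         clipped = clip_outliers(raw)   # store this intermediate result
         scaled = normalize(clipped)    # and this one, as granular as possible
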
  17. Class Definition

     from typing import Iterable  # Label is an enum defined elsewhere in the project

     class TrainingData:
         """TrainingData represents a labeled dataset"""

         def __init__(self, data: Iterable, label: Label = None, metadata=None) -> None:
             """
             Parameters
             ----------
             data : Iterable (matrix-shaped)
             label : Label.GOOD | Label.BAD
             metadata : other info
             """
             self.data = data
             self.label = label
             self.metadata = metadata

  18. Data Integrity: Granularity of Data Results
     [Figure: the same sample stored at each stage: raw image, annotations, black and white image, processed image, raw image with result overlay]

  19. Architectural Challenges
     • Memory vs CPU
     • CPU needed for training
     • Memory needed for model storage
     • Consider the kind of algorithm the model uses
     • Sparse modeling usually performs well in smaller-resource setups

  20. Architectural Challenges
     • Multithreading (see the sketch after this list)
       ◦ Good for memory-bound algorithms (decision trees)
       ◦ Easy to implement, much simpler
     • Multiprocessing
       ◦ Provides more throughput
       ◦ Better for algorithms with high CPU consumption (NN)
       ◦ A lot more difficult to implement

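     A minimal sketch of both options using the standard library's concurrent.futures; score_batch is a hypothetical CPU-heavy function, not from the deck:

         from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

         def score_batch(n: int) -> int:
             """Hypothetical CPU-heavy work, e.g. scoring one batch of samples."""
             return sum(i * i for i in range(n))

         if __name__ == "__main__":
             inputs = [100_000] * 8

             # Threads: simple and memory-friendly, but the GIL limits
             # pure-Python CPU-bound speedups
             with ThreadPoolExecutor(max_workers=4) as pool:
                 thread_results = list(pool.map(score_batch, inputs))

             # Processes: more throughput for CPU-heavy algorithms, at the
             # cost of pickling overhead and a trickier programming model
             with ProcessPoolExecutor(max_workers=4) as pool:
                 process_results = list(pool.map(score_batch, inputs))
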
  21. Architectural Challenges
     • Software engineers must provide the "glue code" for data scientists
       ◦ Utilize interfaces in Python
       ◦ Class definitions

  22. Sample Interface

     from abc import ABC, abstractmethod
     from typing import Iterable

     class ModelInterface(ABC):
         @abstractmethod
         def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
             # Raise an error when not implemented
             raise NotImplementedError()

         @abstractmethod
         def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
             # Raise an error when not implemented
             raise NotImplementedError()

  23. Sample Interface

     from sklearn.ensemble import RandomForestClassifier

     class MyModel(ModelInterface):
         def fit(self, X: Iterable[FeatureData], y: Iterable[LabelData]) -> None:
             self.model = RandomForestClassifier().fit(X, y)
             # Do some other stuff

         def predict(self, X: Iterable[FeatureData]) -> Iterable[LabelData]:
             result = self.model.predict(X)
             # Process result here
             return result

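     Once both sides agree on the interface, application code can stay agnostic about the concrete model; a brief usage sketch (the dataset variables are placeholders):

         def run_inference(model: ModelInterface, X_train, y_train, X_new):
             """Application-side glue that works with any ModelInterface implementation."""
             model.fit(X_train, y_train)
             return model.predict(X_new)

         # predictions = run_inference(MyModel(), X_train, y_train, X_new)
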
  24. Deployment
     As software engineers, we need to do the following:
     - Identify whether the model will work better in the cloud or locally
     - Provide inference results so data scientists can adjust

  25. Deployment: Scaling up
     - Take note of the algorithm's runtime
     - Anything greater than O(n^2) won't work even if you scale.

  26. Deployment: Scaling down
     - Identify the space complexity for the setup
     - O(F * n log n) algorithms should be the target (*F is the size of the feature set; a back-of-the-envelope sketch follows)

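     A small sketch of why these targets matter; the numbers here are illustrative only, not from the deck:

         import math

         n = 1_000_000    # number of samples
         F = 20           # size of the feature set

         quadratic = n ** 2                 # ~1e12 operations: impractical even when scaled up
         target = F * n * math.log2(n)      # ~4e8 operations: tractable on modest hardware

         print(f"O(n^2):       {quadratic:.1e} ops")
         print(f"O(F n log n): {target:.1e} ops")
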
  27. Shifting to Data Science
     • Formal study! Data science has LOTS OF MATH!
       ◦ Grad schools offer data science courses
       ◦ Usually one-year straight courses
     • Find related jobs
       ◦ Internships
       ◦ Data engineering is not data science, but close enough :P

  28. Shifting to Data Science
     • Do research
       ◦ A lot of research grants involve training, etc.
       ◦ Good mentors in the form of PhD students
     ** Taking online tutorials and lessons is often not that good **

  29. Software development
     Continuous addition of new features and usage:
     - New standard-library features (import importlib.metadata, from concurrent.futures import ThreadPoolExecutor)
     - Paradigm shifts
     - Functional support
     - Object-oriented support
     - Rich library ecosystem for application development
     - Microservice-friendly structures

  30. Data Science
     Continuous addition of new features and usage:
     - Data classes
     - Support for data-driven development
     - SUPER extensive data science libraries
     - Easy to use for data scientists