
DevOpsPorto Meetup27: Performing Analytics ASAP by Diego Reiriz Cores

Talk delivered by Diego Reiriz Cores

DevOpsPorto

May 16, 2019

Transcript

  1. DataOps
    Creating Data-Based Solutions ASAP


  2. Who am I, in a nutshell?
    - Data/ML/Meme Engineer @
    - AI Master's student
    - Co-organizer of the VigoBrain AI Meetup


  3. GRADIANT SPACE


  4. (image slide)

  5. (image slide)


  6. What Is DataOps?



  7. What Is DataOps?
    DataOps is an automated, process-oriented methodology, used
    by analytic and data teams, to improve the quality and reduce
    the cycle time of data analytics ...
    DataOps applies to the entire data lifecycle from data
    preparation to reporting, and recognizes the interconnected
    nature of the data analytics team and IT operations.
    DataOps - Wikipedia


  8. DataOps applies 3 methodologies...
    - DevOps
    - Agile
    - SPC (Statistical Process Control)


  9. Lean Manufacturing - SPC
    A systematic method for minimizing
    waste (muda) within a manufacturing
    system without sacrificing productivity


  10. Manifesto


  11. Manifesto
    1. Continually satisfy your customer
    2. Value working analytics
    3. Embrace change
    9. Analytics is code
    10. Make it reproducible
    16. Monitor quality and performance



  12. How many times have you
    seen all these methodologies
    applied to data-based solutions?


  13. When you work with data...


  14. Deployments...
    Holden Karau @holdenkarau
    ● Works with Google on the Apache Beam project
    ● Apache Spark committer
    ● Co-author of O'Reilly's Learning Spark and High Performance Spark


  15. So I Tricked You with this talk


  16. My Team's Journey


  17. Team Background
    ● Strong software engineering skills
    ● We use Gitflow as our repository workflow
    ● We package all our work
    ● We embrace TDD and DDD
    ● Everything we code goes through CI/CD
    ● We encourage clean & reusable code
    ● We usually use Scrum


  18. We automated tons of things in our software development lifecycle
    - code formatting → we run a linter on each commit (a sketch follows below)
    - feature checking → we embrace TDD, so almost all our code is tested by default
    - code quality → static code analysis with SonarQube
    - deployments → almost all are done with Docker/k8s
    - monitoring → we have automatic alerts
    - BI dashboard generation → we use tools like Metabase/Superset
    I usually have more confidence in my automated processes than in myself.
    Good software engineering practices mean being lazy.
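
    A minimal sketch of the "linter on each commit" idea: a Git pre-commit
    hook that shells out to a linter and aborts the commit on failure.
    flake8 and the .py-only filter are illustrative assumptions, not
    necessarily the team's actual setup.

        #!/usr/bin/env python3
        """Git pre-commit hook: lint staged Python files before committing."""
        import subprocess
        import sys

        # Ask git for the paths of staged (added/copied/modified) files.
        staged = subprocess.run(
            ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
            capture_output=True, text=True, check=True,
        ).stdout.split()

        py_files = [f for f in staged if f.endswith(".py")]
        if py_files:
            # flake8 is an assumed linter choice; any linter works here.
            result = subprocess.run(["flake8", *py_files])
            if result.returncode != 0:
                print("Lint errors found; commit aborted.")
                sys.exit(1)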


  19. That allows us to spend time on
    - automating more things I don't want to spend my time on
    - creating more data pipelines or enriching current pipelines
    - doing more analytics
    - exploring ML/DL models
    - improving current models' metrics
    - improving current system quality
    - researching more ways to be lazy


  20. [Architecture diagram; labels: Engines · Analytics POCs & Reports ·
    Testing and Production Environment · Visualization Layer · Data Layer ·
    Backend (plumber)]


  21. There's Pain & Tears behind all
    those technologies


  22. Be careful with notebook
    environments
    It's really easy to pollute your
    notebook environment with other
    people's dependencies and
    configurations
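
    One defensive habit, sketched here with a hypothetical venv path: assert
    at the top of a notebook that the kernel runs in the environment you
    expect, so a polluted or wrong kernel fails fast.

        # Fail fast if this notebook's kernel is not the project's own venv.
        # The path below is an assumed layout, not a real convention here.
        import pathlib
        import sys

        EXPECTED_ENV = pathlib.Path.home() / ".venvs" / "my-project"

        actual = pathlib.Path(sys.prefix).resolve()
        if actual != EXPECTED_ENV.resolve():
            raise RuntimeError(
                f"Kernel is using {actual}, expected {EXPECTED_ENV}; "
                "select the project kernel before running this notebook."
            )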


  23. We are using a bunch of technologies, so there are tons of points of
    failure (I)
    Backend
    - If something goes wrong on the R side, it can take down our k8s pod
    - We need brute-force strategies to scale this
    - It's hard to test the R side


  24. We are using a bunch of technologies, so there are tons of points of
    failure (II)
    Analytics Backend
    - We detected memory usage problems in plumber's HTTP request parsing
    - We have tests on both backends
    [Diagram labels: Backend · HTTP · plumber · Monitoring]
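
    A minimal sketch of what a test against the HTTP backend could look
    like; the service URL and the /health and /score routes are illustrative
    assumptions, not the team's real plumber API.

        # Smoke tests for the analytics backend over HTTP (run with pytest).
        import requests

        BASE_URL = "http://analytics-backend:8000"  # hypothetical address

        def test_backend_is_alive():
            resp = requests.get(f"{BASE_URL}/health", timeout=5)
            assert resp.status_code == 200

        def test_scoring_endpoint_returns_json():
            # Hypothetical scoring route with a toy payload.
            resp = requests.post(
                f"{BASE_URL}/score", json={"values": [1.0, 2.0]}, timeout=10
            )
            assert resp.status_code == 200
            assert "result" in resp.json()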


  25. Serving a DL model over Spark, what could go wrong...
    [Pipeline diagram; labels: Engines · Data Pipeline · Data Layer ·
    Autoencoder Training · Shared FS · weights.h5 · arch.json]
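
    The flow on the slide (training writes weights.h5 and arch.json to a
    shared filesystem; Spark executors load them to score data) could be
    sketched roughly as below. The /shared paths, the toy input data, and
    the reconstruction-error scoring are assumptions for illustration.

        # Score records with a Keras autoencoder inside Spark executors.
        from pyspark.sql import SparkSession

        def score_partition(rows):
            # Load the model once per partition, not once per record.
            import numpy as np
            from tensorflow.keras.models import model_from_json

            with open("/shared/arch.json") as f:
                model = model_from_json(f.read())
            model.load_weights("/shared/weights.h5")

            batch = np.array(list(rows))
            if batch.size == 0:
                return iter([])
            recon = model.predict(batch)
            # Reconstruction error as the autoencoder's anomaly score.
            errors = np.mean((batch - recon) ** 2, axis=1)
            return iter(errors.tolist())

        spark = SparkSession.builder.appName("ae-scoring").getOrCreate()
        data = spark.sparkContext.parallelize([[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]])
        scores = data.mapPartitions(score_partition).collect()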


  26. If you want to embrace DataOps, you may need new roles


  27. Data Scientist
    Responsibilities
    - Create advanced analytics
    - Interact with the business and help them
    - Create reports
    - Research on AI
    Abilities
    - Math & statistics background
    - Creates insights using business domain knowledge
    - Good communication skills (verbal & visual)
    Weaknesses
    - Programming skills
    - System creation/management skills
    https://www.oreilly.com/ideas/data-engineers-vs-data-scientists


  28. Data Engineer
    Responsibilities
    - Create data pipelines
    - Choose the right tools for data processing
    - Combine multiple technologies to create solutions
    Abilities
    - Programming background
    - Knowledge of distributed systems
    - System creation and management
    Weaknesses
    - Not a systems person
    - Weaker analytics skills (compared to Data Scientists)
    https://www.oreilly.com/ideas/data-engineers-vs-data-scientists


  29. ML Engineer
    Responsibilities
    - Operationalizing data scientists' work
    - Optimizing ML
    Abilities
    - Data engineering abilities
    - Strong data scientist abilities
    - Strong engineering principles
    Weaknesses
    - Knows too many things
    https://www.oreilly.com/ideas/data-engineers-vs-data-scientists


  30. https://www.oreilly.com/ideas/data-engineers-vs-data-scientists


  31. [Architecture diagram, repeated from slide 20; labels: Engines ·
    Analytics POCs & Reports · Testing and Production Environment ·
    Visualization Layer · Data Layer · Backend (plumber)]


  32. Things we are thinking about
    - Use DVC to version data and experiments
    - Waste fewer resources
    - JupyterHub
    - Automatic scaling for Spark and Flink clusters
    - Have a good VCS for notebooks:
      - manage versions, diffs, pull requests
    - Automate notebook validation → automatic tests on notebooks? (see the
      sketch below)
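
    One possible answer to "automatic tests on notebooks", assuming a tool
    choice the talk doesn't make: execute each notebook end-to-end with
    papermill inside a pytest suite, so any failing cell fails the test.
    The notebooks/ directory is a hypothetical layout.

        # Treat "the notebook executes cleanly" as an automated test.
        import pathlib

        import papermill as pm
        import pytest

        NOTEBOOKS = sorted(pathlib.Path("notebooks").glob("*.ipynb"))

        @pytest.mark.parametrize("nb", NOTEBOOKS, ids=lambda p: p.name)
        def test_notebook_runs(nb, tmp_path):
            # execute_notebook raises if any cell errors out.
            pm.execute_notebook(str(nb), str(tmp_path / nb.name))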


  33. Questions?


  34. (image slide)