ProTalk #2 | Productivity in Data Environment

Proclub Telkom University

March 15, 2019

Transcript

  1. Productivity in Data Environment: what, how, and why we need to be productive on data
  2. Profile
    Name: Kadek Byan Prihandana Jati
    Company: KMKLabs
    Position: Software Engineer - Data Engineer
  3. Productivity

  4. Why do we need productivity? We need to be productive because:
    - Everything changes so fast
    - We need feedback from our system as soon as possible
    - We have complex problems
    We need to be productive in a data environment because the value of data lies in giving feedback to its users (whether a system or people) within the time they need it.
  5. Example: things change so fast. Imagine that in the next 5 minutes a popular soccer game starts, and a lot of people will access our site, vidio.com, to watch their favorite team. <insert play-count on vidio site here>
  6. Data Team Responsibilities
    Users:
    - Site Users
    - Engineers
    - Management
    Product:
    - Numbers (Play Count, Report for Partners, ML Models)
    - Engineers (Video Player Error Rate)
    - Management (Daily, Weekly Report)
  7. Data Team Responsibilities
    - Deliver (develop + deploy) the data product as soon as possible
    - Data must be integrated from the data warehouse into reporting
    - Minimize report errors and make sure all data is reliable
    - Monitor core services
  8. Imagine without Productivity
    - Daily reports become weekly reports
    - Data that must be available next week gets delayed
    - Feedback for users becomes slower, and decisions (whether business decisions or feature-fix decisions) are all delayed
    - Doom
  9. How to be productive
    - Development Best Practices [Works Best in Detail]
    - Data Product Automation (ETL) [Automation]
    - Job Scheduling & Monitoring [Maintenance]
    - Collaboration [Insights about what’s next]
  10. Development Best Practices
    - Test Driven Development
    - Continuous Deployment
    - Team Autonomy
    - XP Culture
  11. Development Best Practices: Impact on Productivity
    - Engineers are confident about their code
    - Engineers validate design and performance in the testing phase
    - The system becomes more stable and predictable (testing + continuous deployment)
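As an illustration of the test-first idea on slides 10 and 11, here is a minimal sketch in plain Python. The function `count_plays` and the event fields `type` and `video_id` are hypothetical names chosen for this example; they are not taken from the talk, and the team's actual code is written in Scala on Spark.

```python
from collections import Counter


def count_plays(events):
    """Count "play" events per video id, skipping rows without a video id."""
    counts = Counter()
    for event in events:
        video_id = event.get("video_id")
        if event.get("type") == "play" and video_id is not None:
            counts[video_id] += 1
    return dict(counts)


def test_count_plays():
    # The test states the expected behavior before the report logic grows.
    events = [
        {"type": "play", "video_id": "match-1"},
        {"type": "play", "video_id": "match-1"},
        {"type": "pause", "video_id": "match-1"},  # not a play event
        {"type": "play", "video_id": "match-2"},
        {"type": "play"},  # malformed row: no video id
    ]
    assert count_plays(events) == {"match-1": 2, "match-2": 1}
    assert count_plays([]) == {}


test_count_plays()
```

A check like this can run in CI on every change, which is one way the "confident about their code" point could be put into practice.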
  12. Development Best Practices - Key Takeaway
    - To be productive is to get rid of details when you already know where to go
    - To be productive is to work best in details and make yourself accountable
    - Working with industry best practices and keeping the “cowboy” culture out of production saves you and your team a lot of time and gets you to your goal faster
    - XP culture tears down silos, fear of collaboration, and other “personal” problems
  13. Data Product Automation (diagram labels): Data Product; Periodic Report; Machine Learning Model; Dashboard Visualization; Tracking System; Report Automation Program (ETL)
  14. Data Product Automation (diagram labels): Machine Learning Model; Real Data (Feedback from Users); Dev & Deployment Pipeline; Report Automation Program (ETL); Dev & Deployment Pipeline; Datawarehouse Generated Report
  15. Data Product Automation - ETL Diagram (diagram labels): Spark Code; Dataproc Clusters; Cluster Worker (x3); Expected Data; BigQuery Storage
  16. Data Product Automation - Productivity Tools
    - Apache Spark (Scala)
    - Google Cloud Dataproc
    - Google Cloud Storage
    - Google Cloud BigQuery
    - CI (Jenkins)
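The extract, transform, load flow behind slides 15 and 16 can be sketched with the standard library alone. This is a stand-in under stated assumptions, not the pipeline from the talk: the real one runs Spark code on Dataproc and writes to BigQuery, while here the "warehouse" is a plain dict and the input format (`date,video_id,plays`), the table name `daily_play_count`, and all function names are hypothetical.

```python
from collections import defaultdict


def extract(raw_lines):
    """Parse 'date,video_id,plays' lines into row dicts."""
    rows = []
    for line in raw_lines:
        date, video_id, plays = line.strip().split(",")
        rows.append({"date": date, "video_id": video_id, "plays": int(plays)})
    return rows


def transform(rows):
    """Aggregate plays per (date, video_id), sorted for a stable report."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["date"], row["video_id"])] += row["plays"]
    return [
        {"date": date, "video_id": video_id, "plays": plays}
        for (date, video_id), plays in sorted(totals.items())
    ]


def load(report_rows, warehouse, table):
    """Overwrite the target table in the stand-in warehouse; return row count."""
    warehouse[table] = list(report_rows)
    return len(warehouse[table])


def run_etl(raw_lines, warehouse, table="daily_play_count"):
    return load(transform(extract(raw_lines)), warehouse, table)


warehouse = {}  # stand-in for a real data warehouse
raw = [
    "2019-03-15,match-1,10",
    "2019-03-15,match-1,5",
    "2019-03-15,match-2,7",
]
loaded = run_etl(raw, warehouse)
print(loaded, warehouse["daily_play_count"])
```

Keeping the three stages as separate functions means each one can be tested on its own, and the whole run can be triggered by a scheduler instead of by hand.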
  17. Data Product Automation - Why? Why do we need this automation?
    - Speed
    - Time Efficiency
    - Reduce things to be managed
  18. Data Product Automation - Key Takeaways
    - To be productive is to know what to manage and what to remove from our scope
    - If we can delegate a system or a task to a machine, we should give that type of work to the machine
    - Try to learn what users need, aggregate that knowledge, and automate it; you will be more productive than by doing one thing again and again. Repetitive work uses up your creative capacity.
  19. Scheduling & Monitoring
    We need to monitor each data product that has been deployed to the production system. A small error in one data product might disrupt our whole process, which hurts productivity. Let's say we have 20 - 30 tables created by our code, and we schedule them as cron jobs. This leads to productivity problems, such as:
    - We don’t know which jobs fail or when they fail
    - We must read logs on the cron job system
    - While we fix things, data still needs to be delivered as soon as possible
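The visibility gap described on slide 19 can be illustrated with a small job runner that records the outcome of every job instead of leaving it in cron logs. This is a sketch only and does not use Airflow's API; `JobResult`, `run_job`, `run_all`, `failed_jobs`, and the two example jobs are hypothetical names introduced for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class JobResult:
    name: str
    status: str            # "success" or "failed"
    attempts: int
    error: str = ""
    duration_s: float = 0.0


def run_job(name, func, retries=1):
    """Run one job, retrying on failure, and always return a result record."""
    start = time.monotonic()
    error = ""
    for attempt in range(1, retries + 2):
        try:
            func()
            return JobResult(name, "success", attempt, "", time.monotonic() - start)
        except Exception as exc:  # record the failure instead of losing it in a log
            error = f"{type(exc).__name__}: {exc}"
    return JobResult(name, "failed", retries + 1, error, time.monotonic() - start)


def run_all(jobs, retries=1):
    """Run every job; one failing job does not stop the others."""
    return [run_job(name, func, retries) for name, func in jobs.items()]


def failed_jobs(results):
    return [result.name for result in results if result.status == "failed"]


def build_play_count():
    pass  # stands in for a table build that succeeds


def build_partner_report():
    raise RuntimeError("source table missing")  # stands in for a failing build


results = run_all({
    "play_count": build_play_count,
    "partner_report": build_partner_report,
})
for result in results:
    print(f"{result.name}: {result.status} after {result.attempts} attempt(s) {result.error}")
```

With a record per job, the question "which job failed, and when?" has a direct answer, and the list from `failed_jobs` could feed an alert. A scheduler such as Airflow provides this kind of status tracking, plus dependencies between jobs, out of the box.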
  20. Scheduling & Monitoring - Airflow

  21. Scheduling & Monitoring - Airflow

  22. Scheduling & Monitoring - Datadog

  23. Scheduling & Monitoring - Key Takeaway
    - Big work can’t be accomplished overnight, so we need to maintain what we create responsibly.
    - Data product automation and development best practices can themselves eventually cause productivity issues; the monitoring phase helps us handle those issues and get back on track.
  24. Exploration & Collaboration
    All roles in the data team should collaborate intensely, because creating data is useless if the other roles can’t access it, or if roles such as Data Analyst or Data Scientist don’t make use of it. Also, most of the time we need to explore our data: play with it, learn the schema and the aggregation metrics we could produce, and gain small insights from it.
  25. Exploration & Collaboration - Apache Zeppelin

  26. Exploration & Collaboration
    We have tests on our code, but we still need documentation about our data. This documentation answers questions like:
    - Where is table A stored?
    - What is the schema of table A?
    - When was table A last updated?
    - What process produces table A?
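The four questions on slide 26 map naturally onto a small metadata record. The sketch below is an assumption-laden illustration and not Apache Atlas's data model: `TableDoc`, `is_stale`, the table name, the storage path, and the job name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableDoc:
    name: str
    location: str           # where the table is stored
    schema: dict            # column name -> type
    last_updated: datetime  # when the table was last written
    produced_by: str        # the job or process that builds the table

    def describe(self):
        columns = ", ".join(f"{col} ({typ})" for col, typ in self.schema.items())
        return (
            f"{self.name}\n"
            f"  stored at: {self.location}\n"
            f"  schema: {columns}\n"
            f"  last updated: {self.last_updated.isoformat()}\n"
            f"  produced by: {self.produced_by}"
        )


def is_stale(doc, now, max_age_hours=24):
    """True when the table has not been updated within the allowed window."""
    return now - doc.last_updated > timedelta(hours=max_age_hours)


# Hypothetical entry; none of these values come from the talk.
doc = TableDoc(
    name="daily_play_count",
    location="warehouse.reports.daily_play_count",
    schema={"date": "DATE", "video_id": "STRING", "plays": "INTEGER"},
    last_updated=datetime(2019, 3, 14, 2, 0, tzinfo=timezone.utc),
    produced_by="daily_play_count_etl",
)
print(doc.describe())
```

Storing `last_updated` as data rather than prose also allows a simple freshness check, which ties the documentation back to monitoring.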
  27. Exploration & Collaboration - Apache Atlas

  28. Exploration & Collaboration - Apache Atlas

  29. Exploration & Collaboration - Apache Atlas

  30. Exploration & Collaboration - Key Takeaway
    - Always separate the things you should get done now from the things that will be the next innovation. Separate research from daily routine to reach the optimum productivity level.
    - Collaboration is a must, to complete our perspective on the problem we are currently tackling.
  31. Q & A

  32. Productivity in Data Environment
    Problems that we have:
    1. Reporting (Realtime/Batch)
    2. Services based on Data (Spam Filtering, Recommendation Engine)
    3. Scheduling
    4. Performance
    5. Data Management
    6. Data Exploration (Analytics Platform)
    7. Programming Culture
  33. Tools
    - Continuous Integration (CI)
    - Redis
    - PostgreSQL, MySQL, CassandraDB
    - Kubernetes
    - Google Cloud PubSub
    - Google Cloud DataFlow
    - Google Cloud Dataproc
    - Google BigQuery
    - Google Cloud Storage
    - Apache Airflow
    - Apache Atlas
    - Apache Spark
  34. Tools Concept
    - Storage (Database, In-Mem Storage, Data Warehouse)
    - Pipeline Data Ingestion
    - Analytics Tools
    - Dashboard Tools
    - Machine Learning Model Pipeline
  35. Case Study: Apache Spark + Dataproc Cluster

  36. Case Study: Pipeline (Pubsub + Dataflow)

  37. Case Study: Apache Airflow

  38. Case Study: Apache Zeppelin

  39. Case Study: Apache Atlas