ProTalk #2 | Productivity in Data Environment

Productivity in Data Environment
What, how, and why we need to be productive with data

Profile
Name: Kadek Byan Prihandana Jati
Company: KMKLabs
Position: Software Engineer - Data Engineer

Productivity

Why do we need productivity?
We need to be productive because:
- Everything changes so fast
- We need feedback from our systems as soon as possible
- We have complex problems
We need to be productive in a data environment because the value of data lies in giving feedback to its users (whether a system or people) within the time they need it.

Example
Things change so fast: imagine that in the next 5 minutes a popular soccer game starts, and a lot of people will access our site, vidio.com, to watch their favorite team.
<insert play-count on vidio site here>

Data Team Responsibilities
Users:
- Site Users
- Engineers
- Management
Product:
- Numbers (Play Count, Report for Partners, ML Models)
- Engineers (Video Player Error Rate)
- Management (Daily, Weekly Report)

Data Team Responsibilities
- Deliver (develop + deploy) the data product as soon as possible
- Integrate the data from the data warehouse into reporting
- Minimize report errors and make sure all data is reliable
- Monitor core services

Imagine without Productivity
- Daily reports become weekly reports
- Data that must be available next week is delayed
- Feedback for users becomes slower, and decisions (both business decisions and feature-fix decisions) are all delayed
- Doom

How to be productive
- Development Best Practices [Works Best in Detail]
- Data Product Automation (ETL) [Automation]
- Job Scheduling & Monitoring [Maintenance]
- Collaboration [Insights about what’s next]

Development Best Practices
- Test Driven Development
- Continuous Deployment
- Team Autonomy
- XP Culture

Development Best Practices - Impact on Productivity
- Engineers are confident about their code
- Engineers validate design and performance at the testing phase
- The system becomes more stable and predictable (testing + continuous deployment)
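As an illustration of the test-driven approach, here is a minimal sketch for a "Video Player Error Rate" metric. The event format (a dict with a "status" key) and the function name error_rate are assumptions made for this example, not the team's actual code. In TDD the two test functions would be written first, and error_rate would then be implemented until they pass.

```python
def error_rate(events):
    """Return the share of player events whose status is "error".

    `events` is a list of dicts with a "status" key (hypothetical format).
    An empty input yields 0.0 instead of raising ZeroDivisionError.
    """
    if not events:
        return 0.0
    errors = sum(1 for event in events if event["status"] == "error")
    return errors / len(events)


# Tests: in TDD these are written before the implementation above.
def test_empty_input_gives_zero():
    assert error_rate([]) == 0.0


def test_counts_only_error_events():
    events = [
        {"status": "ok"},
        {"status": "error"},
        {"status": "ok"},
        {"status": "ok"},
    ]
    assert error_rate(events) == 0.25
```

A test runner such as pytest would collect the test_ functions automatically; they can also be called directly.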

Development Best Practices - Key Takeaway
- To be productive is to get rid of details once you already know where to go
- To be productive is to do your best work in the details and hold yourself accountable
- Working with industry best practices and keeping “cowboy” culture out of production saves you and your team a lot of time and gets you to your goal faster
- XP culture tears down silos, fear of collaboration, and other “personal” problems

Data Product Automation
[Diagram labels]
Data Product:
- Periodic Report
- Machine Learning Model
- Dashboard Visualization
- Tracking System
Report Automation Program (ETL)

Data Product Automation
[Diagram labels]
- Machine Learning Model
- Real Data (Feedback from Users)
- Dev & Deployment Pipeline
- Report Automation Program (ETL)
- Dev & Deployment Pipeline
- Datawarehouse
- Generated Report

Data Product Automation - ETL Diagram
[Diagram labels]
- Spark Code
- Dataproc Clusters (three Cluster Workers)
- Expected Data
- BigQuery Storage
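The extract-transform-load flow behind this diagram can be sketched in plain Python. This is only an illustration of the three stages, not the Spark job from the talk: a list of text lines stands in for the raw source, and a dict stands in for the warehouse. The "video_id,event" line format and the play_count table name are assumptions.

```python
from collections import Counter


def extract(raw_lines):
    """Parse "video_id,event" lines into tuples, skipping malformed rows."""
    rows = []
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) == 2:
            rows.append((parts[0], parts[1]))
    return rows


def transform(rows):
    """Count "play" events per video."""
    return Counter(video_id for video_id, event in rows if event == "play")


def load(play_counts, warehouse):
    """Store the counts under a hypothetical "play_count" table name."""
    warehouse["play_count"] = dict(play_counts)
    return warehouse


raw = ["v1,play", "v2,play", "v1,play", "v1,pause", "broken-row"]
warehouse = load(transform(extract(raw)), {})
print(warehouse)
```

Keeping the stages as separate functions means each one can be tested on its own, and the same shape carries over when a distributed engine replaces the in-memory structures.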

Data Product Automation - Productivity Tools
- Apache Spark (Scala)
- Google Cloud Dataproc
- Google Cloud Storage
- Google Cloud BigQuery
- CI (Jenkins)

Data Product Automation - Why?
Why do we need this automation?
- Speed
- Time efficiency
- Fewer things to manage

Data Product Automation - Key Takeaways
- To be productive is to know what to manage and what to remove from our scope
- If we can delegate a system or a piece of work to a machine, we should give that work to the machine
- Learn what users need, aggregate that knowledge, and automate it. You will be more productive than if you do the same thing again and again, which uses up your creative capacity

Scheduling & Monitoring
We need to monitor every data product that has been deployed to the production system. A small error in one data product can disrupt all of our processes, which hurts productivity. Say we have 20 - 30 tables created by our code, and we schedule them as cron jobs. This leads to productivity problems such as:
- We don’t know which jobs will fail or when
- We must read the logs on the cron job system
- While we fix things, data still needs to be delivered as soon as possible
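To make the contrast with plain cron jobs concrete, here is a minimal sketch of what a scheduler with monitoring adds: jobs run in dependency order, every job gets a recorded status, and jobs downstream of a failure are skipped instead of running on bad data. This is not Airflow's API. The job names, the run_jobs helper, and the status strings are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_jobs(jobs, dependencies):
    """Run jobs in dependency order and return {job_name: status}.

    jobs: job name -> zero-argument callable.
    dependencies: job name -> set of upstream job names.
    A job is skipped when any of its upstream jobs did not succeed.
    """
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status.get(up) != "success" for up in upstream):
            status[name] = "skipped"
            continue
        try:
            jobs[name]()
            status[name] = "success"
        except Exception as exc:
            # Record the failure instead of leaving it buried in a log file.
            status[name] = f"failed: {exc}"
    return status


def build_raw_events():
    pass  # placeholder for a real table build


def build_play_count():
    raise RuntimeError("source table missing")


def build_daily_report():
    pass  # placeholder for a real report build


jobs = {
    "raw_events": build_raw_events,
    "play_count": build_play_count,
    "daily_report": build_daily_report,
}
dependencies = {
    "raw_events": set(),
    "play_count": {"raw_events"},
    "daily_report": {"play_count"},
}
status = run_jobs(jobs, dependencies)
for name, state in status.items():
    print(name, "->", state)
```

With a status per job in one place, it is immediately visible which table failed and which reports were held back as a result.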

Scheduling & Monitoring - Airflow

Scheduling & Monitoring - Datadog

Scheduling & Monitoring - Key Takeaway
- Big work can’t be accomplished overnight, so we need to maintain what we create responsibly
- Data product automation and development best practices can themselves eventually cause productivity issues. The monitoring phase helps with those issues so we can get back on track

Exploration & Collaboration
All roles in the data team should collaborate intensely, because creating data is useless if other roles can’t access it or if roles like Data Analyst or Data Scientist don’t make use of it. Also, most of the time we need to explore our data: play with it, learn the schema, find the aggregation metrics we could produce, and gain small insights from it.

Exploration & Collaboration - Apache Zeppelin

Exploration & Collaboration
We have tests on our code, but we still need documentation about our data. This documentation answers questions such as:
- Where is table A stored?
- What is the schema of table A?
- When was table A last updated?
- What process produces table A?
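The four questions above map naturally onto a small metadata record per table. The sketch below shows one possible shape for such a record. It is not how Apache Atlas models metadata, and the class name, field names, and all values (location, schema, job name) are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TableDoc:
    """Documentation record for one table (hypothetical structure)."""
    name: str               # which table is documented
    location: str           # where the table is stored
    schema: dict            # column name -> column type
    last_updated: datetime  # when the table was last updated
    produced_by: str        # the process that builds the table

    def describe(self):
        """Render the record as a short, human-readable summary."""
        columns = ", ".join(f"{col} {typ}" for col, typ in self.schema.items())
        return (
            f"{self.name} @ {self.location}\n"
            f"  schema: {columns}\n"
            f"  last updated: {self.last_updated.isoformat()}\n"
            f"  produced by: {self.produced_by}"
        )


# Example values are invented for illustration only.
doc = TableDoc(
    name="table_a",
    location="warehouse.reporting.table_a",
    schema={"video_id": "STRING", "play_count": "INTEGER"},
    last_updated=datetime(2018, 1, 1, 6, 0),
    produced_by="daily play-count ETL job",
)
print(doc.describe())
```

Keeping these records next to the code that builds each table makes it harder for the documentation to drift away from what the job actually does.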

Exploration & Collaboration - Apache Atlas

Exploration & Collaboration - Key Takeaway
- Always separate the things you must get done from the things that will be the next innovation. Separate research from daily routine to reach the optimum productivity level
- Collaboration is a must, to complete our perspective on the problem we are currently tackling

Productivity in Data Environment
Problems that we have:
1. Reporting (Realtime/Batch)
2. Services based on Data (Spam Filtering, Recommendation Engine)
3. Scheduling
4. Performance
5. Data Management
6. Data Exploration (Analytics Platform)
7. Programming Culture

Tools
- Continuous Integration (CI)
- Redis
- PostgreSQL, MySQL, Cassandra
- Kubernetes
- Google Cloud Pub/Sub
- Google Cloud Dataflow
- Google Cloud Dataproc
- Google BigQuery
- Google Cloud Storage
- Apache Airflow
- Apache Atlas
- Apache Spark

Tools Concept
- Storage (Database, In-Mem Storage, Data Warehouse)
- Pipeline Data Ingestion
- Analytics Tools
- Dashboard Tools
- Machine Learning Model Pipeline

Case Study: Apache Spark + Dataproc Cluster

Case Study: Pipeline (Pub/Sub + Dataflow)

Case Study: Apache Airflow

Case Study: Apache Zeppelin

Case Study: Apache Atlas
