Slide 1

Agile Data Analytics
Aditya Satrya
Data Engineering Manager at Mekari
linkedin.com/in/asatrya
Data Talk | Jabar Digital Service | Feb 2021

Slide 2

Challenges of Data Analytics

The Goalposts Keep Moving
● They don’t know what they want
● They need everything ASAP
● The questions never end

Data Errors
● Data errors can prevent your data pipeline from flowing correctly
● Bad data harms the hard-won credibility of the data analytics team

Data Pipeline Maintenance Never Ends
● Each change must be made carefully so that it doesn’t break operational analytics

Manual Process Fatigue
● Working long hours to compensate for the gap between performance and expectations
● Hotfixes & quick-and-dirty solutions

Slide 3

Agile

4 Values
● Individuals and interactions over processes and tools
● Working software over comprehensive documentation
● Customer collaboration over contract negotiation
● Responding to change over following a plan

12 Principles
● Early and continuous delivery of valuable software
● Respond to change
● Frequent delivery
● Regular reflection and adjustment
● Working software is the only measure
● Technical excellence
● Simplicity
● Collaborate with your customers
● Motivated individuals
● Face-to-face conversation
● Self-organizing teams

Three themes for this talk:
● Incremental approach
● Technical excellence
● People interaction

Slide 4

Incremental Approach

Slide 5

Incremental Approach

Slide 6

#1 | Prioritize Regularly
Focus on high-leverage activities

Slide 8

Notes on Using Scrum for a Data Analytics Team
● Product owner == analytics owner, who is responsible for driving business outcomes from the insights delivered
● Typical user stories:
○ deliver actionable insights
○ improve data quality or DataOps automation
○ enhance data governance
● Cross-functional data team

Slide 9

#2 | Be Focused
● Learn to say “no” (with local wisdom ;p)
● Preserve blocks of focused time
● Limit the amount of work in progress (WIP)

Slide 10

#3 | Metric-Driven
● Use metrics to drive progress
● Evaluate your effort early and often
● Find low-effort ways to validate work
● Reduce risk early

Slide 11

Technical Excellence

Slide 12

Technical Excellence: Optimize for iteration speed & quality

Slide 13

Technical Excellence: Optimize for iteration speed & quality
1. Orchestrate the data pipeline
2. Improve resiliency
3. Add data tests
4. Use Git & a branching strategy
5. Implement a CI/CD pipeline
6. Use multiple environments
7. Self-service data platform

Slide 14

#1 | Orchestrate the Data Pipeline
● Scheduling
● Dependencies
● Monitoring / SLA
● Alerting
● Retrying
● Backfilling
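Orchestration tools handle these concerns for you. As a loose illustration only (plain Python with hypothetical task and date arguments, not a real orchestrator), retrying and backfilling boil down to something like this:

```python
from datetime import date, timedelta

def run_with_retries(task, run_date, max_retries=3):
    """Retry a failing task a bounded number of times before giving up."""
    for attempt in range(1, max_retries + 1):
        try:
            return task(run_date)
        except Exception:
            if attempt == max_retries:
                raise  # alerting/monitoring would hook in here

def backfill(task, start, end):
    """Re-run a daily task for every date in [start, end] inclusive."""
    results, current = [], start
    while current <= end:
        results.append(run_with_retries(task, current))
        current += timedelta(days=1)
    return results

# Example: a task that just records which date it processed
runs = backfill(lambda d: d.isoformat(), date(2021, 2, 1), date(2021, 2, 3))
print(runs)  # ['2021-02-01', '2021-02-02', '2021-02-03']
```

A real orchestrator adds scheduling, dependency resolution between tasks, and SLA monitoring on top of this core loop.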

Slide 15

Orchestration Tools

Slide 16

#2 | Improve Pipeline Resiliency
● Ingest data as raw as possible
○ Prevents losing data because of process failures
○ Enables reprocessing data when business rules change
● Create idempotent & deterministic processes
○ Job == function([input_dataset]) → [output_dataset]
○ No external factors: current time, mutable database queries, random numbers, service lookups
○ Known & bounded input data
○ No side effects
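To make the “job as a pure function” idea concrete, here is a minimal sketch with hypothetical data: the job reads only its bounded input and produces the same output on every run, so reprocessing after a business-rule change is safe.

```python
def transform(input_dataset):
    """Deterministic job: the output depends only on the input dataset.
    No current time, no random numbers, no mutable database queries,
    no service lookups, no side effects."""
    unique_ids = {row["customer_id"] for row in input_dataset}
    return sorted(unique_ids)

raw_rows = [{"customer_id": 3}, {"customer_id": 1}, {"customer_id": 3}]

# Idempotent: running the job twice on the same input yields the same output,
# so a failed run can simply be re-executed without corrupting anything.
assert transform(raw_rows) == transform(raw_rows) == [1, 3]
```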

Slide 17

#3 | Add Data Tests
Use the write-audit-publish (WAP) pattern
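A minimal sketch of the write-audit-publish pattern (plain Python lists stand in for staging and production tables; a real pipeline would use warehouse tables):

```python
def write_audit_publish(records, audit_checks, staging, production):
    """Write to a staging table, audit it, and publish only if all checks pass."""
    staging.clear()
    staging.extend(records)            # 1. write to staging
    for check in audit_checks:         # 2. audit the staged data
        if not check(staging):
            raise ValueError("audit failed; production left untouched")
    production.clear()
    production.extend(staging)         # 3. publish to production

staging, production = [], []
records = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 25}]
amounts_positive = lambda rows: all(r["amount"] > 0 for r in rows)

write_audit_publish(records, [amounts_positive], staging, production)
print(len(production))  # 2 rows published after passing the audit
```

The key property: bad data is caught in staging, so consumers of the production table never see it.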

Slide 18

Data Tests

Business logic tests validate assumptions about the data. For example:
• Customer Validation – Each customer should exist in a dimension table
• Data Validation – At least 90 percent of data should match entries in a dimension table

Input tests check data prior to each stage in the analytics pipeline. For example:
• Count Verification – Check that row counts are in the right range, ...
• Conformity – US Zip5 codes are five digits, US phone numbers are 10 digits, ...
• History – The number of prospects always increases, ...
• Balance – Week over week, sales should not vary by more than 10%, ...
• Temporal Consistency – Transaction dates are in the past, end dates are later than start dates, ...
• Application Consistency – Body temperature is within a range around 98.6F/37C, ...
• Field Validation – All required fields are present, correctly entered, ...

Output tests check the results of an operation, like a Cartesian join. For example:
• Completeness – Number of customer prospects should increase with time
• Range Verification – Number of physicians in the US is less than 1.5 million

Slide 19

Data Validation Tools

Slide 20

#4 | Use Git & a Branching Strategy
● Reproducible
● Collaborate safely
● Versioned
● Code review

Slide 21

#5 | Implement a CI/CD Pipeline
● Integration
● Run automated tasks
● Prevent deploying changes that break production systems

Slide 22

#6 | Use Multiple Environments
● Deploy to staging (with staging data & pipeline)
● If staging is OK, deploy to production

Slide 23

#7 | Self-Service Data Platform
● Orchestrator
● Transformation + custom operators
● Data Scientist, Data Analyst, Data Engineer
● Tables (data warehouse, data marts)

Slide 24

People Interaction

Slide 25

Good leadership

Slide 26

Cross-team dependency: Anticipate the misalignment of priorities
● Shared OKRs
● Regularly ask for updates

Slide 27

Manage expectations
Be transparent about the process

Slide 28

Projects fail because of under-communicating, not over-communicating

Slide 29

Non-engineering bottlenecks: Approval/feedback from decision-makers

Slide 30

Don’t defer approvals until the end. Don’t delay feedback.

Slide 31

Thank you