
Agile Data Analytics

Aditya Satrya
February 19, 2021


Slides from my sharing-session talk at Jabar Digital Service


Transcript

  1. Agile Data Analytics
     Aditya Satrya — Data Engineering Manager at Mekari
     linkedin.com/in/asatrya
     Data Talk | Jabar Digital Service | Feb 2021
  2. Challenges of Data Analytics
     • The goalposts keep moving: they don’t know what they want, they need everything ASAP, and the questions never end
     • Data errors: errors can prevent your data pipeline from flowing correctly, and bad data harms the hard-won credibility of the data analytics team
     • Data pipeline maintenance never ends: each change must be made carefully so that it doesn’t break operational analytics
     • Manual process fatigue: working long hours to compensate for the gap between performance and expectations; hotfixes & quick-and-dirty solutions
  3. Agile
     4 Values:
     • Individuals and interactions over processes and tools
     • Working software over comprehensive documentation
     • Customer collaboration over contract negotiation
     • Responding to change over following a plan
     12 Principles:
     • Early and continuous delivery of valuable software
     • Respond to change
     • Frequent delivery
     • Regular reflection and adjustment
     • Working software is the only measure
     • Technical excellence
     • Simplicity
     • Collaborate with your customers
     • Motivated individuals
     • Face-to-face conversation
     • Self-organizing teams
     In short: incremental approach, technical excellence, people interaction
  4. (image-only slide)
  5. Notes on Using Scrum for a Data Analytics Team
     • Product owner == analytics owner, who is responsible for driving business outcomes from the insights delivered
     • Typical user stories:
       ◦ deliver actionable insights
       ◦ improve data quality or DataOps automation
       ◦ enhance data governance
     • Cross-functional data team
  6. #2 | Be focused
     • Learn to say “no” (with local wisdom ;p)
     • Preserve blocks of focused time
     • Limit the amount of WIP
  7. #3 | Metric-driven
     • Use metrics to drive progress
     • Evaluate your effort early and often
     • Find low-effort ways to validate work
     • Reduce risk early
  8. Technical Excellence: Optimize for Iteration Speed & Quality
     1. Orchestrate the data pipeline
     2. Improve resiliency
     3. Add data tests
     4. Use Git & a branching strategy
     5. Implement a CI/CD pipeline
     6. Use multiple environments
     7. Build a self-service data platform
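The first practice, orchestrating the data pipeline, means running tasks in explicit dependency order rather than by hand or by cron timing. A minimal sketch of the idea, using only the standard library (the task names and runner callables are hypothetical, not from the talk; a real team would use an orchestrator such as Airflow):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: task name -> set of upstream dependencies.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "transform": {"clean"},
    "load": {"transform"},
}

def run_pipeline(tasks, runners):
    """Execute tasks in dependency order and return the completion order."""
    order = TopologicalSorter(tasks).static_order()
    log = []
    for name in order:
        runners[name]()   # run this task's callable
        log.append(name)  # record that it completed
    return log

# Stub runners that do nothing; real ones would move and transform data.
runners = {name: (lambda: None) for name in pipeline}
print(run_pipeline(pipeline, runners))  # ['extract', 'clean', 'transform', 'load']
```

Declaring dependencies explicitly is what lets an orchestrator retry, backfill, and parallelize safely.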
  9. #2 | Improve Pipeline Resiliency
     • Ingest data as raw as possible
       ◦ Prevents losing data because of process failures
       ◦ Enables reprocessing data when business rules change
     • Create idempotent & deterministic processes
       ◦ Job == function([input_dataset]) → [output_dataset]
       ◦ No external factors: current time, mutable database queries, random numbers, service lookups
       ◦ Known & bounded input data
       ◦ No side effects
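The "Job == function([input_dataset]) → [output_dataset]" idea can be sketched as a pure function whose output depends only on its input, so rerunning it is always safe. The record fields and checksum helper below are illustrative assumptions, not from the talk:

```python
import hashlib
import json

def job(input_records):
    """Idempotent, deterministic transform: output depends only on input.
    No current time, no random numbers, no mutable lookups, no side effects."""
    return [
        {**r, "amount_cents": round(r["amount"] * 100)}
        for r in sorted(input_records, key=lambda r: r["id"])
    ]

def checksum(records):
    """Stable fingerprint: reprocessing must yield byte-identical output."""
    return hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest()

raw = [{"id": 2, "amount": 1.5}, {"id": 1, "amount": 2.0}]
assert checksum(job(raw)) == checksum(job(raw))  # rerunning produces the same result
```

Because the job is deterministic, you can safely reprocess the raw data whenever business rules change and verify the result with the checksum.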
  10. Data Tests
      Business logic tests validate assumptions about the data. For example:
      • Customer validation – each customer should exist in a dimension table
      • Data validation – at least 90 percent of data should match entries in a dimension table
      Input tests check data prior to each stage in the analytics pipeline. For example:
      • Count verification – check that row counts are in the right range, ...
      • Conformity – US Zip5 codes are five digits, US phone numbers are 10 digits, ...
      • History – the number of prospects always increases, ...
      • Balance – week over week, sales should not vary by more than 10%, ...
      • Temporal consistency – transaction dates are in the past, end dates are later than start dates, ...
      • Application consistency – body temperature is within a range around 98.6F/37C, ...
      • Field validation – all required fields are present, correctly entered, ...
      Output tests check the results of an operation, like a Cartesian join. For example:
      • Completeness – the number of customer prospects should increase with time
      • Range verification – the number of physicians in the US is less than 1.5 million
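A few of the input tests above (count verification, conformity, temporal consistency, field validation) can be sketched as one plain-Python check that runs before a batch enters the next stage. The row schema and thresholds are hypothetical, chosen only for illustration:

```python
from datetime import date

def run_input_tests(rows, min_rows=1, max_rows=10_000):
    """Check a batch before it enters the next pipeline stage; return errors."""
    errors = []
    if not (min_rows <= len(rows) <= max_rows):             # count verification
        errors.append(f"row count {len(rows)} out of range")
    for r in rows:
        if len(r["zip5"]) != 5 or not r["zip5"].isdigit():  # conformity (Zip5)
            errors.append(f"bad zip: {r['zip5']}")
        if r["txn_date"] > date.today():                    # temporal consistency
            errors.append(f"future date: {r['txn_date']}")
        if not all(r.get(f) for f in ("id", "zip5")):       # field validation
            errors.append(f"missing field in {r}")
    return errors

rows = [{"id": 1, "zip5": "60614", "txn_date": date(2021, 1, 5)},
        {"id": 2, "zip5": "6061",  "txn_date": date(2021, 1, 6)}]
print(run_input_tests(rows))  # ['bad zip: 6061']
```

Failing tests should stop the pipeline early, so bad data never reaches the reports that the team's credibility depends on.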
  11. #5 | Implement a CI/CD Pipeline
      • Integration: run automated tasks on every change
      • Prevent deploying changes that break production systems
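The gating logic of a CI/CD pipeline can be sketched as a small script that runs each automated check and blocks the deploy on the first failure. The check commands here are hypothetical stand-ins; a real pipeline would run linters, unit tests, and the data tests from the previous slide:

```python
import subprocess
import sys

def ci_gate(checks):
    """Run each check command; any nonzero exit blocks the deploy step."""
    for cmd in checks:
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            return "blocked"
    return "deploy"

# Hypothetical checks, each a command the CI runner would execute.
checks = [
    [sys.executable, "-c", "assert 1 + 1 == 2"],  # stands in for a passing test suite
    [sys.executable, "-c", "assert False"],       # a failing check
]
print(ci_gate(checks))  # blocked
```

The point is that deployment is a consequence of green checks, never a manual judgment call made under time pressure.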
  12. #6 | Use Multiple Environments
      • Deploy to staging first (with staging data & pipeline)
      • If staging is OK, deploy to production
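Running the same code against staging and production usually comes down to environment-keyed configuration. A minimal sketch, where the environment variable name and the settings themselves are illustrative assumptions:

```python
import os

# Hypothetical per-environment settings; names are illustrative.
CONFIGS = {
    "staging":    {"warehouse": "analytics_staging", "alert_on_failure": False},
    "production": {"warehouse": "analytics_prod",    "alert_on_failure": True},
}

def load_config(env=None):
    """Pick settings by environment so identical code runs in staging first."""
    env = env or os.environ.get("DATA_ENV", "staging")
    if env not in CONFIGS:
        raise ValueError(f"unknown environment: {env}")
    return CONFIGS[env]

print(load_config("staging")["warehouse"])  # analytics_staging
```

Defaulting to staging means a missing variable can never accidentally point a test run at production data.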
  13. #7 | Self-Service Data Platform
      • Orchestrator
      • Transformation + custom operators
      • Tables (data warehouse, data marts)
      • Users: Data Scientist, Data Analyst, Data Engineer