know what they want
• They need everything ASAP
• The questions never end

Data Errors
• Data errors can prevent your data pipeline from flowing correctly
• Bad data harms the hard-won credibility of the data-analytics team

Data Pipeline Maintenance Never Ends
• Each change must be made carefully so that it doesn't break operational analytics

Manual Process Fatigue
• Working long hours to compensate for the gap between performance and expectations
• Hotfixes and quick-and-dirty solutions
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan

12 Principles
• Early and Continuous Delivery of Valuable Software
• Respond to change
• Frequent Delivery
• Regular Reflection and Adjustment
• Working software is the primary measure of progress
• Technical excellence
• Simplicity
• Collaborate with your customers
• Motivated Individuals
• Face-to-Face Conversation
• Self-Organizing Teams

Incremental approach
Technical excellence
People interaction
owner == analytics owner, who is responsible for driving business outcomes from the insights delivered.
• Typical user stories:
  ◦ deliver actionable insights
  ◦ improve data quality or DataOps automation
  ◦ enhance data governance
• Cross-functional data team
data pipeline
2. Improve resiliency
3. Add data tests
4. Use Git & a branching strategy
5. Implement a CI/CD pipeline
6. Use multiple environments
7. Self-service data platform
data because of process failure
• Enable reprocessing of data when business rules change
• Create idempotent & deterministic processes
  ◦ Job == function([input_dataset]) → [output_dataset]
  ◦ No external factors: current time, mutable database queries, random numbers, service lookups
  ◦ Known & bounded input data
  ◦ No side effects
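The "Job == function([input_dataset]) → [output_dataset]" idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `daily_sales_job`, `write_partition`, the `orders` records, and the in-memory `store` are all hypothetical. The job receives its cut-off date as a parameter instead of reading the clock, and the write replaces a whole partition, so a rerun after a failure or a rule change leaves the same result.

```python
from datetime import date


def daily_sales_job(orders, as_of):
    """Pure job: the output depends only on the arguments.

    `as_of` is passed in instead of calling date.today(), so a rerun
    next week over the same input produces the same output.
    """
    totals = {}
    for order in orders:
        if order["order_date"] <= as_of:
            key = order["customer_id"]
            totals[key] = totals.get(key, 0) + order["amount"]
    # Sort so the output order is deterministic as well.
    return [
        {"as_of": as_of, "customer_id": cust, "total": total}
        for cust, total in sorted(totals.items())
    ]


def write_partition(store, as_of, rows):
    """Idempotent write: overwrite the whole partition, never append.

    `store` is a plain dict standing in for a partitioned table (assumption).
    """
    store[as_of] = list(rows)


# Hypothetical, bounded input dataset.
orders = [
    {"customer_id": "c1", "order_date": date(2024, 1, 1), "amount": 100},
    {"customer_id": "c2", "order_date": date(2024, 1, 2), "amount": 50},
    {"customer_id": "c1", "order_date": date(2024, 1, 2), "amount": 25},
    {"customer_id": "c2", "order_date": date(2024, 1, 3), "amount": 999},
]

store = {}
run_date = date(2024, 1, 2)
# Running the job twice (for example, a retry after a failure) does not
# duplicate or change the data.
for _ in range(2):
    write_partition(store, run_date, daily_sales_job(orders, run_date))
```

Keeping the write outside the job function is a design choice in this sketch: the transformation stays free of side effects, and the only state change is a partition overwrite that is safe to repeat.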
For example:
• Customer Validation – Each customer should exist in a dimension table
• Data Validation – At least 90 percent of data should match entries in a dimension table

Input tests check data prior to each stage in the analytics pipeline. For example:
• Count Verification – Check that row counts are in the right range, ...
• Conformity – US Zip5 codes are five digits, US phone numbers are 10 digits, ...
• History – The number of prospects always increases, ...
• Balance – Week over week, sales should not vary by more than 10%, ...
• Temporal Consistency – Transaction dates are in the past, end dates are later than start dates, ...
• Application Consistency – Body temperature is within a range around 98.6F/37C, ...
• Field Validation – All required fields are present, correctly entered, ...

Output tests check the results of an operation, like a Cartesian join. For example:
• Completeness – Number of customer prospects should increase with time
• Range Verification – Number of physicians in the US is less than 1.5 million
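A few of the tests above can be sketched as plain Python predicates. This is only an illustrative sketch: the function names, the row fields (`customer_id`, `zip`, `start_date`, `end_date`), and the sample data are assumptions, and a real pipeline would more likely use a dedicated data-testing tool.

```python
import re
from datetime import date


def check_row_count(rows, low, high):
    """Count Verification: row count falls inside an expected range."""
    return low <= len(rows) <= high


def check_zip5(rows):
    """Conformity: every US Zip5 code is exactly five digits."""
    return all(re.fullmatch(r"\d{5}", row["zip"]) for row in rows)


def check_temporal(rows):
    """Temporal Consistency: end dates are not earlier than start dates."""
    return all(row["start_date"] <= row["end_date"] for row in rows)


def check_dimension_match(rows, dimension_keys, min_ratio=0.9):
    """Data Validation: at least `min_ratio` of rows match the dimension table."""
    if not rows:
        return False
    matched = sum(1 for row in rows if row["customer_id"] in dimension_keys)
    return matched / len(rows) >= min_ratio


def check_week_over_week(previous, current, tolerance=0.10):
    """Balance: sales vary by no more than `tolerance` week over week."""
    return abs(current - previous) <= tolerance * previous


# Hypothetical sample batch and dimension table keys.
rows = [
    {"customer_id": "c1", "zip": "02139",
     "start_date": date(2024, 1, 1), "end_date": date(2024, 1, 5)},
    {"customer_id": "c2", "zip": "94105",
     "start_date": date(2024, 1, 2), "end_date": date(2024, 1, 2)},
    {"customer_id": "c3", "zip": "10001",
     "start_date": date(2024, 1, 3), "end_date": date(2024, 1, 9)},
    {"customer_id": "c4", "zip": "60601",
     "start_date": date(2024, 1, 4), "end_date": date(2024, 1, 6)},
]
customer_dimension = {"c1", "c2", "c3", "c4"}

results = {
    "row_count": check_row_count(rows, low=1, high=1000),
    "zip5": check_zip5(rows),
    "temporal": check_temporal(rows),
    "dimension_match": check_dimension_match(rows, customer_dimension),
    "week_over_week": check_week_over_week(previous=1000, current=1080),
}

# A stage would run only when every input test passes.
all_passed = all(results.values())
```

Each check returns a boolean, so the same functions can gate a stage before it runs (input tests) or validate what it produced (output tests).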