Slide 1

Slide 1 text

Data engineering – the key to data success Lessons from a data veteran 13-09-2023 – Utrecht – Big data expo

Slide 2

Slide 2 text

This presentation • What does Rainforest Alliance do? • The challenge of data engineering • Principles for development • Principles for testing • Questions

Slide 3

Slide 3 text

3 Rudolf van der Heide Senior data engineer and team leader at Rainforest Alliance More than 25 years of data experience Data engineer Data analyst Business analyst Accidental DBA Quality manager Data Quality Analyst Project manager Data architect Data modeler Tester Troubleshooter Security Officer

Slide 4

Slide 4 text

Introduction to Rainforest Alliance

Slide 5

Slide 5 text

Our vision is a world where people and nature thrive in harmony

Slide 6

Slide 6 text

What we do We try to help farmers to do better farming, have better incomes, take better care of the environments. 2 group of products: • Certification of tropical products (Coffee, Cocoa, Tea, hazelnut, banana) • Landscale & Community projects And what is needed to support it

Slide 7

Slide 7 text

Our data challenges Getting lots of different data from many different countries

Slide 8

Slide 8 text

A few definitions I use data engineering because that’s the term I use most, but think analytics engineers, or data pipeline developers, ETL developers, data scientists I will discuss data pipelines in general, but many principles are also important for reporting, visualisation and complex models. Business data users are your data stewards, business analysts, data management team, data power users

Slide 9

Slide 9 text

The challenge of data and analytics engineering Data engineering is not software engineering

Slide 10

Slide 10 text

A short story of software and data products

Slide 11

Slide 11 text

Practical issues that crash your data pipelines ∙ New tables ∙ New attributes ∙ Changed data formats ∙ Bugfixes ∙ New business rules ∙ Poor data quality ∙ New codes ∙ User creativity ∙ Incomplete data ∙ Reuse of existing codes and attributes Small changes often have more impact than the big ones.

Slide 12

Slide 12 text

Example 1: Code changes over time IT servicedesk data Source data Priority First Month Last Month 1 Sep-11 Feb-12 2 Sep-11 Feb-12 3 Sep-11 Feb-12 HIGH Feb-12 Jul-12 MIDDLE Feb-12 Jul-12 LOW Feb-12 Jul-12 HIGH Aug-12 now MIDDLE Aug-12 now NORMAL Aug-12 now reporting Priority HIGH MIDDLE NORMAL HIGH MIDDLE NORMAL HIGH MIDDLE NORMAL

Slide 13

Slide 13 text

Example 2: changed order of records in source Source: claim_ number product _code cost_ type amount 1234 A07 20 450 1234 A07 40 350 5678 C814 20 450 5678 C814 40 350 5678 C814 80 150 3456 K81 20 370 3456 K81 40 280 7231 A774 20 450 claim_ number product _code amount total amount _20 amount _40 amount _80 1234 A07 800 450 350 5678 C814 950 450 350 150 3456 K81 650 370 280 7231 A774 450 450 Reporting:

Slide 14

Slide 14 text

Example 2: changed order of records in source Source: claim_ number product _code cost_ type amount 1234 A07 40 350 1234 A07 20 450 3456 K81 40 280 3456 K81 20 370 5678 C814 80 150 5678 C814 40 350 5678 C814 20 450 7231 A774 20 450 claim_ number product _code amount total amount _20 amount _40 amount _80 1234 A07 350 350 1234 A07 450 450 3456 K81 280 380 3456 K81 370 370 5678 C814 150 150 5678 C814 350 360 5678 C814 450 450 7231 A774 450 450 Reporting:

Slide 15

Slide 15 text

example 3: new filetype Uploaded files per license, in report: last updated file of type GMR license_id filename documenttype uploadedon status 2file-18587 GMR 23/11/2022 07:41 1 2file-15602 KML 14/09/2022 09:28 1 3file-12931 GMR 14/07/2022 14:52 4 5file-16870 GMR 10/10/2022 06:38 4 7file-14095 GMR 10/08/2022 16:38 1 8file-10506 GMR 31/05/2022 17:03 1 9file-10037 GMR 20/05/2022 15:49 1 10file-18968 KML 05/12/2022 18:48 4 10file-16695 GMR 05/10/2022 17:29 4 10file-15173 GMR 05/09/2022 18:46 4 12file-3273 GMR 21/12/2021 12:15 1 13file-17448 GMR 25/10/2022 13:32 1 13file-14695 GMR 25/08/2022 04:04 4 13file-5118 GMR 27/01/2022 03:35 1 license_id filename documenttype uploadedon status 48file-25106 StructuredGMR 18/04/2023 13:32 1 58file-16280 GMR 25/09/2022 10:39 1 58file-15995 GMR 19/09/2022 11:33 4

Slide 16

Slide 16 text

What it means for data engineering We are at the end of a long data chain, inheriting all the issues Accept that your source data can change any time You need to build for resilience and flexibility You need to be able to quickly validate You are developing solutions for datasets, not for a number of specific cases You need to be able to quickly spot changes (errors) in your source data

Slide 17

Slide 17 text

Consequences for development

Slide 18

Slide 18 text

Build and test with production data 1. Only production data will cover a significant variety of data and data issues 2. Production data have the relevant day-to-day changes in the data your data pipelines need to cover 3. You need realistic volumes of data 4. Trying to distinguish between real data issues and flawed test data will cost you lots of time and frustration

Slide 19

Slide 19 text

19 … but you may face heavy resistance Principled: • It is against our standards! • Developers should never work with production data Practical: • Confidentiality • Volume Most of the objections are coming from lack of appreciation of the differences with software engineering

Slide 20

Slide 20 text

Don’t try to do everything 3 main quality criteria: Completeness Correctness Timeliness One or two of the there is reasonably achievable If you want to do all 3 perfectly the costs will go up exponentially And in most analytic use cases there is no need for it So challenge your business

Slide 21

Slide 21 text

Don’t try to do everything: examples 3 main quality criteria: Completeness Correctness Timeliness One or two of the there is reasonably achievable If you want to do all 3 perfectly the costs will go up exponentially And in most analytic use cases there is no need for it So challenge your business

Slide 22

Slide 22 text

Build with troubleshooting in mind You need to be able to easily validate the correctness (without the need of engineering knowledge) (sql example) Intermediate results are very useful for that If you can only validate your pipeline or data product by changing the code to analyse part of it, or storing intermediate results in a table for analysis, it would be a good idea to store it as a standard part of your process.

Slide 23

Slide 23 text

Don’t hardcode code translations Codes are not constant, they may change over time, differ between systems, have changing interpretations, new ones added So make them part of your data management process, not of your code

Slide 24

Slide 24 text

24 Basic code translation structure

Slide 25

Slide 25 text

Example 1: Code changes over time IT servicedesk data Source data Priority First Month Last Month 1 Sep-11 Feb-12 2 Sep-11 Feb-12 3 Sep-11 Feb-12 HIGH Feb-12 Jul-12 MIDDLE Feb-12 Jul-12 LOW Feb-12 Jul-12 HIGH Aug-12 now MIDDLE Aug-12 now NORMAL Aug-12 now reporting Priority HIGH MIDDLE NORMAL HIGH MIDDLE NORMAL HIGH MIDDLE NORMAL

Slide 26

Slide 26 text

26 Basic code translation structure Source _system master code code A Ticket priority 1 A Ticket priority 2 A Ticket priority 3 B Ticket priority HIGH B Ticket priority MIDDLE B Ticket priority LOW C Ticket priority HIGH C Ticket priority MIDDLE C Ticket priority NORMAL Reporting master code reporting code Ticket priority HIGH Ticket priority MIDDLE Ticket priority NORMAL Source _system master code code Reporting master code reporting code A Ticket priority 1 Ticket priority HIGH A Ticket priority 2 Ticket priority MIDDLE A Ticket priority 3 Ticket priority NORMAL B Ticket priority HIGH Ticket priority HIGH B Ticket priority MIDDLE Ticket priority MIDDLE B Ticket priority LOW Ticket priority NORMAL C Ticket priority HIGH Ticket priority HIGH C Ticket priority MIDDLE Ticket priority MIDDLE C Ticket priority NORMAL Ticket priority NORMAL

Slide 27

Slide 27 text

27 Use of code translation structures For: Code translations Custom classifications and groupings Selection criteria

Slide 28

Slide 28 text

Testing and monitoring

Slide 29

Slide 29 text

Your objective for testing and monitoring We want to prove that the full dataset is correct We want to be able to spot changes and issues caused by your sources Not only now, but also in production, after every load The good thing: A good monitoring process works perfectly well for regressions and integration testing You need production data

Slide 30

Slide 30 text

Approach 1 : reconciliation Ideally find ways to prove that the result is correct One way is using reconciliation reports, like finance uses. This is best designed by business data users. SALES per region Source Data warehouse Reporting Europe 1,700,000 1,700,000 1,700,000 Africa 940,000 942,000 942,000 Asia 365,000 363,000 363,000 US 9,993,000 9,993,000 9,993,000 Latin America 210,000 210,000 210,000

Slide 31

Slide 31 text

Approach: statistics Statistics can help you spot errors and changes The easiest and most successful systems I have seen for that, run standard queries every day and compare the result with previous runs

Slide 32

Slide 32 text

What to measure? • Number of records per table • Number of empty values per column (including empty strings or default dates) • Number of unique values per column • Number of missing foreign key references • Number of new business keys (new customers, new shops, new products) Meaningful Business statistics: Average sales per order

Slide 33

Slide 33 text

Example 2: changed order of records in source Source: claim_ number product _code cost_ type amount 1234A07 40 350 1234A07 20 450 3456K81 40 280 3456K81 20 370 5678C814 80 150 5678C814 40 350 5678C814 20 450 7231A774 20 450 claim_ number product _code amount total amount _20 amount _40 amount _80 1234A07 350 350 1234A07 450 450 3456K81 280 380 3456K81 370 370 5678C814 150 150 5678C814 350 360 5678C814 450 450 7231A774 450 450 Reporting: This was spotted by a data scientist during the mandatory data loading check by looking at average amount per product code

Slide 34

Slide 34 text

Create signals to highlight possible issues Basic assumption: Previous load was correct, compare statistics with previous loads. Keep in mind that exception may only become visible after some time. How to do it, depends on your business and data. Examples of signals: Record counts: >2% growth, >0,5%, extra alert if the growth is more than twice last month average Negative growth for more than 1 day More than 3 days 0 growth for all tables except reference tables

Slide 35

Slide 35 text

35 Examples of signals Number of unique values per column: Supermarket chain, shop number in the sales table. Usually no change, if there is a change the business should know. Code fields: Should be stable, if growth you need a sanity check Number of empty values: Either a fixed value (known errors), or a fixed percentage If there is a big change, investigation is needed. Sometimes it might be easier to look at non-empty values

Slide 36

Slide 36 text

Suddenly the record count is doubled

Slide 37

Slide 37 text

Fast growing table

Slide 38

Slide 38 text

Carthesian product

Slide 39

Slide 39 text

example 3: new filetype Uploaded files per license, in report: last updated file of type GMR license_id filename documenttype uploadedon status 2file-18587 GMR 23/11/2022 07:41 1 2file-15602 KML 14/09/2022 09:28 1 3file-12931 GMR 14/07/2022 14:52 4 5file-16870 GMR 10/10/2022 06:38 4 7file-14095 GMR 10/08/2022 16:38 1 8file-10506 GMR 31/05/2022 17:03 1 9file-10037 GMR 20/05/2022 15:49 1 10file-18968 KML 05/12/2022 18:48 4 10file-16695 GMR 05/10/2022 17:29 4 10file-15173 GMR 05/09/2022 18:46 4 12file-3273 GMR 21/12/2021 12:15 1 13file-17448 GMR 25/10/2022 13:32 1 13file-14695 GMR 25/08/2022 04:04 4 13file-5118 GMR 27/01/2022 03:35 1 license_id filename documenttype uploadedon status 48file-25106 StructuredGMR 18/04/2023 13:32 1 58file-16280 GMR 25/09/2022 10:39 1 58file-15995 GMR 19/09/2022 11:33 4 We could see it in the empty record counts

Slide 40

Slide 40 text

Questions?

Slide 41

Slide 41 text

rainforest-alliance.org

Slide 42

Slide 42 text

No content