
Rainforest Alliance

Marketing OGZ
September 15, 2023


Transcript

  1. Data engineering – the key to data success: lessons from
     a data veteran. 13-09-2023 – Utrecht – Big Data Expo
  2. This presentation • What does Rainforest Alliance do? • The

    challenge of data engineering • Principles for development • Principles for testing • Questions
  3. Rudolf van der Heide, senior data engineer and team leader at
     Rainforest Alliance. More than 25 years of data experience: data
     engineer, data analyst, business analyst, accidental DBA, quality
     manager, data quality analyst, project manager, data architect,
     data modeler, tester, troubleshooter, security officer.
  4. What we do: We help farmers to farm better, have better incomes,
     and take better care of the environment. Two groups of products:
     • Certification of tropical products (coffee, cocoa, tea, hazelnut,
     banana) • LandScale & community projects. And what is needed to
     support them.
  5. A few definitions: I say "data engineering" because that is the
     term I use most, but read it as analytics engineers, data pipeline
     developers, ETL developers or data scientists too. I will discuss
     data pipelines in general, but many principles are equally
     important for reporting, visualisation and complex models.
     "Business data users" are your data stewards, business analysts,
     data management team and data power users.
  6. Practical issues that crash your data pipelines ∙ New tables

    ∙ New attributes ∙ Changed data formats ∙ Bugfixes ∙ New business rules ∙ Poor data quality ∙ New codes ∙ User creativity ∙ Incomplete data ∙ Reuse of existing codes and attributes Small changes often have more impact than the big ones.
  7. Example 1: Code changes over time (IT servicedesk data)

     Source data:
     Priority  First Month  Last Month  Reporting Priority
     1         Sep-11       Feb-12      HIGH
     2         Sep-11       Feb-12      MIDDLE
     3         Sep-11       Feb-12      NORMAL
     HIGH      Feb-12       Jul-12      HIGH
     MIDDLE    Feb-12       Jul-12      MIDDLE
     LOW       Feb-12       Jul-12      NORMAL
     HIGH      Aug-12       now         HIGH
     MIDDLE    Aug-12       now         MIDDLE
     NORMAL    Aug-12       now         NORMAL
  8. Example 2: changed order of records in source

     Source:
     claim_number  product_code  cost_type  amount
     1234          A07           20         450
     1234          A07           40         350
     5678          C814          20         450
     5678          C814          40         350
     5678          C814          80         150
     3456          K81           20         370
     3456          K81           40         280
     7231          A774          20         450

     Reporting:
     claim_number  product_code  amount_total  amount_20  amount_40  amount_80
     1234          A07           800           450        350
     5678          C814          950           450        350        150
     3456          K81           650           370        280
     7231          A774          450           450
  9. Example 2: changed order of records in source

     Source (same data, different record order):
     claim_number  product_code  cost_type  amount
     1234          A07           40         350
     1234          A07           20         450
     3456          K81           40         280
     3456          K81           20         370
     5678          C814          80         150
     5678          C814          40         350
     5678          C814          20         450
     7231          A774          20         450

     Reporting (now broken: one row per source record instead of one per claim):
     claim_number  product_code  amount_total  amount_20  amount_40  amount_80
     1234          A07           350                      350
     1234          A07           450           450
     3456          K81           280                      280
     3456          K81           370           370
     5678          C814          150                                 150
     5678          C814          350                      350
     5678          C814          450           450
     7231          A774          450           450
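The failure above can be avoided by never letting the pivot depend on record order. A minimal sketch (hypothetical data and function names, not the presenter's actual pipeline): grouping by the business key produces one row per claim however the source happens to be sorted.

```python
# Hypothetical sketch: pivot claim lines per cost_type without relying on
# record order. Grouping by the business key (claim_number, product_code)
# yields one row per claim regardless of how the source is sorted.
from collections import defaultdict

rows = [  # (claim_number, product_code, cost_type, amount), reordered source
    (1234, "A07", 40, 350), (1234, "A07", 20, 450),
    (5678, "C814", 80, 150), (5678, "C814", 40, 350), (5678, "C814", 20, 450),
]

def pivot(rows):
    out = defaultdict(lambda: {"amount_total": 0})
    for claim, product, cost_type, amount in rows:
        rec = out[(claim, product)]
        rec["amount_total"] += amount          # running total per claim
        rec[f"amount_{cost_type}"] = amount    # one column per cost type
    return dict(out)

result = pivot(rows)
# One row per claim: (1234, "A07") -> amount_total 800; (5678, "C814") -> 950
```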
  10. Example 3: new filetype. Uploaded files per license; in the report:
      last updated file of type GMR.

      license_id  filename    documenttype  uploadedon        status
      2           file-18587  GMR           23/11/2022 07:41  1
      2           file-15602  KML           14/09/2022 09:28  1
      3           file-12931  GMR           14/07/2022 14:52  4
      5           file-16870  GMR           10/10/2022 06:38  4
      7           file-14095  GMR           10/08/2022 16:38  1
      8           file-10506  GMR           31/05/2022 17:03  1
      9           file-10037  GMR           20/05/2022 15:49  1
      10          file-18968  KML           05/12/2022 18:48  4
      10          file-16695  GMR           05/10/2022 17:29  4
      10          file-15173  GMR           05/09/2022 18:46  4
      12          file-3273   GMR           21/12/2021 12:15  1
      13          file-17448  GMR           25/10/2022 13:32  1
      13          file-14695  GMR           25/08/2022 04:04  4
      13          file-5118   GMR           27/01/2022 03:35  1

      license_id  filename    documenttype   uploadedon        status
      48          file-25106  StructuredGMR  18/04/2023 13:32  1
      58          file-16280  GMR            25/09/2022 10:39  1
      58          file-15995  GMR            19/09/2022 11:33  4
  11. What it means for data engineering: We are at the end of a long
      data chain, inheriting all the issues. Accept that your source
      data can change at any time. You need to build for resilience and
      flexibility. You need to be able to quickly validate. You are
      developing solutions for datasets, not for a number of specific
      cases. You need to be able to quickly spot changes (errors) in
      your source data.
  12. Build and test with production data 1. Only production data

    will cover a significant variety of data and data issues 2. Production data have the relevant day-to-day changes in the data your data pipelines need to cover 3. You need realistic volumes of data 4. Trying to distinguish between real data issues and flawed test data will cost you lots of time and frustration
  13. … but you may face heavy resistance. Principled objections:
      • It is against our standards! • Developers should never work
      with production data. Practical objections: • Confidentiality
      • Volume. Most of the objections come from a lack of appreciation
      of how data engineering differs from software engineering.
  14. Don’t try to do everything. 3 main quality criteria:
      completeness, correctness, timeliness. One or two of the three is
      reasonably achievable. If you want to do all 3 perfectly, the
      costs will go up exponentially, and in most analytic use cases
      there is no need for it. So challenge your business.
  15. Don’t try to do everything: examples. 3 main quality criteria:
      completeness, correctness, timeliness. One or two of the three is
      reasonably achievable. If you want to do all 3 perfectly, the
      costs will go up exponentially, and in most analytic use cases
      there is no need for it. So challenge your business.
  16. Build with troubleshooting in mind. You need to be able to easily
      validate the correctness, without the need for engineering
      knowledge (SQL example). Intermediate results are very useful for
      that. If you can only validate your pipeline or data product by
      changing the code to analyse part of it, or by storing
      intermediate results in a table for analysis, it would be a good
      idea to store them as a standard part of your process.
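One way to make intermediate results a standard part of the process is to materialise each step as its own table. A hypothetical sketch (table and step names are made up; SQLite stands in for the real database) so that validation becomes a plain SELECT anyone with SQL access can run:

```python
# Hypothetical sketch: persist each pipeline step as its own table so it
# can be validated with plain SQL, without changing pipeline code.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                [("Europe", 100), ("Europe", -5), ("Asia", 40)])

# Step 1: filter out invalid rows, but keep the rejects queryable.
con.execute("CREATE TABLE step1_rejected AS "
            "SELECT * FROM raw_sales WHERE amount < 0")
con.execute("CREATE TABLE step1_clean AS "
            "SELECT * FROM raw_sales WHERE amount >= 0")

# Step 2: aggregate -- again materialised, not hidden in one big query.
con.execute("CREATE TABLE step2_totals AS "
            "SELECT region, SUM(amount) AS total "
            "FROM step1_clean GROUP BY region")

# Validation is now a simple query against any intermediate table:
rejected = con.execute("SELECT COUNT(*) FROM step1_rejected").fetchone()[0]
# rejected == 1 (the negative-amount row)
```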
  17. Don’t hardcode code translations. Codes are not constant: they
      may change over time, differ between systems, have changing
      interpretations, and new ones get added. So make them part of
      your data management process, not of your code.
  18. Example 1: Code changes over time (IT servicedesk data)

      Source data:
      Priority  First Month  Last Month  Reporting Priority
      1         Sep-11       Feb-12      HIGH
      2         Sep-11       Feb-12      MIDDLE
      3         Sep-11       Feb-12      NORMAL
      HIGH      Feb-12       Jul-12      HIGH
      MIDDLE    Feb-12       Jul-12      MIDDLE
      LOW       Feb-12       Jul-12      NORMAL
      HIGH      Aug-12       now         HIGH
      MIDDLE    Aug-12       now         MIDDLE
      NORMAL    Aug-12       now         NORMAL
  19. Basic code translation structure

      Source-to-master translation:
      source_system  master code      code
      A              Ticket priority  1
      A              Ticket priority  2
      A              Ticket priority  3
      B              Ticket priority  HIGH
      B              Ticket priority  MIDDLE
      B              Ticket priority  LOW
      C              Ticket priority  HIGH
      C              Ticket priority  MIDDLE
      C              Ticket priority  NORMAL

      Master-to-reporting translation:
      master code      reporting code
      Ticket priority  HIGH
      Ticket priority  MIDDLE
      Ticket priority  NORMAL

      Combined (joined) view:
      source_system  master code      code    reporting code
      A              Ticket priority  1       HIGH
      A              Ticket priority  2       MIDDLE
      A              Ticket priority  3       NORMAL
      B              Ticket priority  HIGH    HIGH
      B              Ticket priority  MIDDLE  MIDDLE
      B              Ticket priority  LOW     NORMAL
      C              Ticket priority  HIGH    HIGH
      C              Ticket priority  MIDDLE  MIDDLE
      C              Ticket priority  NORMAL  NORMAL
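In code, this structure amounts to two lookup tables that are data, not logic. A minimal sketch (dictionaries stand in for the database tables, simplified to one master value per code; in practice both mappings would live in the data management process and be updated without code changes):

```python
# Hypothetical sketch of the two-step translation structure:
# (source_system, source_code) -> master code -> reporting code.
# Both mappings are data you can update, not if/else logic in the pipeline.
source_to_master = {
    ("A", "1"): "HIGH", ("A", "2"): "MIDDLE", ("A", "3"): "NORMAL",
    ("B", "HIGH"): "HIGH", ("B", "MIDDLE"): "MIDDLE", ("B", "LOW"): "NORMAL",
    ("C", "HIGH"): "HIGH", ("C", "MIDDLE"): "MIDDLE", ("C", "NORMAL"): "NORMAL",
}
master_to_reporting = {"HIGH": "HIGH", "MIDDLE": "MIDDLE", "NORMAL": "NORMAL"}

def translate(system, code):
    master = source_to_master.get((system, code))
    if master is None:
        return "UNMAPPED"  # surface new codes instead of crashing the load
    return master_to_reporting[master]
```

When a source introduces a new code, the pipeline keeps running and the "UNMAPPED" rows show up in monitoring, prompting an update to the translation table rather than a code change.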
  20. Use of code translation structures. For: code translations,
      custom classifications and groupings, selection criteria.
  21. Your objective for testing and monitoring: We want to prove that
      the full dataset is correct, and we want to be able to spot
      changes and issues caused by your sources. Not only now, but also
      in production, after every load. The good thing: a good
      monitoring process works perfectly well for regression and
      integration testing. You need production data.
  22. Approach 1: reconciliation. Ideally, find ways to prove that the
      result is correct. One way is using reconciliation reports, like
      finance uses. These are best designed by business data users.

      SALES per region  Source     Data warehouse  Reporting
      Europe            1,700,000  1,700,000       1,700,000
      Africa            940,000    942,000         942,000
      Asia              365,000    363,000         363,000
      US                9,993,000  9,993,000       9,993,000
      Latin America     210,000    210,000         210,000
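A reconciliation report like this can be checked automatically. A hypothetical sketch that compares the same total at each stage of the chain and flags regions that do not reconcile (stage names are assumptions; the figures follow three regions from the table above):

```python
# Hypothetical sketch of an automated reconciliation check: the same
# total (sales per region) is computed at each stage of the data chain
# and any region where the stages disagree is flagged.
source    = {"Europe": 1_700_000, "Africa": 940_000, "Asia": 365_000}
warehouse = {"Europe": 1_700_000, "Africa": 942_000, "Asia": 363_000}
reporting = {"Europe": 1_700_000, "Africa": 942_000, "Asia": 363_000}

def reconcile(stages):
    issues = []
    for region in stages["source"]:
        values = {name: data[region] for name, data in stages.items()}
        if len(set(values.values())) > 1:   # stages disagree
            issues.append((region, values))
    return issues

diffs = reconcile({"source": source, "warehouse": warehouse,
                   "reporting": reporting})
# Africa and Asia differ between source and warehouse; Europe reconciles.
```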
  23. Approach 2: statistics. Statistics can help you spot errors and
      changes. The easiest and most successful systems I have seen for
      that run standard queries every day and compare the results with
      previous runs.
  24. What to measure?
      • Number of records per table
      • Number of empty values per column (including empty strings or
        default dates)
      • Number of unique values per column
      • Number of missing foreign key references
      • Number of new business keys (new customers, new shops, new
        products)
      Meaningful business statistics: average sales per order
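A minimal sketch of such a daily statistics snapshot (column names and sample rows are illustrative; a real version would run against the database and store each snapshot so it can be diffed against previous loads):

```python
# Hypothetical sketch: compute per-table and per-column statistics of the
# kinds listed above, as one snapshot to compare with earlier runs.
def table_stats(rows, columns):
    stats = {"record_count": len(rows)}
    for i, col in enumerate(columns):
        values = [r[i] for r in rows]
        # Count NULLs and empty strings together, as on the slide.
        stats[f"{col}_empty"] = sum(1 for v in values if v in (None, ""))
        stats[f"{col}_unique"] = len(set(values))
    return stats

rows = [("1234", "A07", 450), ("5678", "", 350), ("5678", "C814", None)]
snapshot = table_stats(rows, ["claim_number", "product_code", "amount"])
# snapshot["record_count"] == 3; product_code and amount each have 1 empty
```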
  25. Example 2: changed order of records in source

      Source (same data, different record order):
      claim_number  product_code  cost_type  amount
      1234          A07           40         350
      1234          A07           20         450
      3456          K81           40         280
      3456          K81           20         370
      5678          C814          80         150
      5678          C814          40         350
      5678          C814          20         450
      7231          A774          20         450

      Reporting (broken: one row per source record instead of one per claim):
      claim_number  product_code  amount_total  amount_20  amount_40  amount_80
      1234          A07           350                      350
      1234          A07           450           450
      3456          K81           280                      280
      3456          K81           370           370
      5678          C814          150                                 150
      5678          C814          350                      350
      5678          C814          450           450
      7231          A774          450           450

      This was spotted by a data scientist during the mandatory
      data-loading check, by looking at the average amount per product
      code.
  26. Create signals to highlight possible issues. Basic assumption:
      the previous load was correct, so compare statistics with
      previous loads. Keep in mind that exceptions may only become
      visible after some time. How to do it depends on your business
      and data. Examples of signals on record counts: >2% growth;
      >0.5%; an extra alert if the growth is more than twice last
      month's average; negative growth for more than 1 day; more than
      3 days of 0 growth for all tables except reference tables.
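Two of the record-count signals above, sketched as code (thresholds taken from the slide; the history list and function name are hypothetical):

```python
# Hypothetical sketch: compare today's record count with the previous
# load and with the average of earlier daily growth rates (a stand-in
# for "last month's average").
def record_count_signals(history):
    """history: list of daily record counts, oldest first."""
    signals = []
    prev, today = history[-2], history[-1]
    growth = (today - prev) / prev
    if growth > 0.02:
        signals.append("growth > 2%")
    if growth < 0:
        signals.append("negative growth")
    daily = [(b - a) / a for a, b in zip(history, history[1:])]
    avg = sum(daily[:-1]) / len(daily[:-1])  # earlier days only
    if avg > 0 and growth > 2 * avg:
        signals.append("growth more than twice recent average")
    return signals

signals = record_count_signals([100, 101, 102, 110])
# -> ["growth > 2%", "growth more than twice recent average"]
```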
  27. Examples of signals. Number of unique values per column:
      supermarket chain, shop number in the sales table; usually no
      change, and if there is a change the business should know. Code
      fields: should be stable; if they grow, you need a sanity check.
      Number of empty values: either a fixed value (known errors) or a
      fixed percentage; if there is a big change, investigation is
      needed. Sometimes it might be easier to look at non-empty values.
  28. Example 3: new filetype. Uploaded files per license; in the
      report: last updated file of type GMR.

      license_id  filename    documenttype  uploadedon        status
      2           file-18587  GMR           23/11/2022 07:41  1
      2           file-15602  KML           14/09/2022 09:28  1
      3           file-12931  GMR           14/07/2022 14:52  4
      5           file-16870  GMR           10/10/2022 06:38  4
      7           file-14095  GMR           10/08/2022 16:38  1
      8           file-10506  GMR           31/05/2022 17:03  1
      9           file-10037  GMR           20/05/2022 15:49  1
      10          file-18968  KML           05/12/2022 18:48  4
      10          file-16695  GMR           05/10/2022 17:29  4
      10          file-15173  GMR           05/09/2022 18:46  4
      12          file-3273   GMR           21/12/2021 12:15  1
      13          file-17448  GMR           25/10/2022 13:32  1
      13          file-14695  GMR           25/08/2022 04:04  4
      13          file-5118   GMR           27/01/2022 03:35  1

      license_id  filename    documenttype   uploadedon        status
      48          file-25106  StructuredGMR  18/04/2023 13:32  1
      58          file-16280  GMR            25/09/2022 10:39  1
      58          file-15995  GMR            19/09/2022 11:33  4

      We could see it in the empty record counts.