
Rainforest Alliance

Marketing OGZ
September 15, 2023


Transcript

  1. Data engineering – the key to data success: lessons from
     a data veteran. 13-09-2023 – Utrecht – Big Data Expo
  2. This presentation • What does Rainforest Alliance do? • The

    challenge of data engineering • Principles for development • Principles for testing • Questions
  3. Rudolf van der Heide, senior data engineer and team leader at
     Rainforest Alliance. More than 25 years of data experience: data
     engineer, data analyst, business analyst, accidental DBA, quality
     manager, data quality analyst, project manager, data architect,
     data modeler, tester, troubleshooter, security officer.
  4. What we do: We help farmers to farm better, have better incomes,
     and take better care of the environment. Two groups of products:
     • Certification of tropical products (coffee, cocoa, tea, hazelnut,
     banana) • LandScale & community projects. And what is needed to
     support them.
  5. A few definitions: I say "data engineering" because that is the
     term I use most, but read it as analytics engineers, data pipeline
     developers, ETL developers or data scientists too. I will discuss
     data pipelines in general, but many principles are equally
     important for reporting, visualisation and complex models.
     "Business data users" are your data stewards, business analysts,
     data management team and data power users.
  6. Practical issues that crash your data pipelines ∙ New tables

    ∙ New attributes ∙ Changed data formats ∙ Bugfixes ∙ New business rules ∙ Poor data quality ∙ New codes ∙ User creativity ∙ Incomplete data ∙ Reuse of existing codes and attributes Small changes often have more impact than the big ones.
  7. Example 1: Code changes over time (IT servicedesk data)

     Source data:
     Priority  First Month  Last Month  Reporting Priority
     1         Sep-11       Feb-12      HIGH
     2         Sep-11       Feb-12      MIDDLE
     3         Sep-11       Feb-12      NORMAL
     HIGH      Feb-12       Jul-12      HIGH
     MIDDLE    Feb-12       Jul-12      MIDDLE
     LOW       Feb-12       Jul-12      NORMAL
     HIGH      Aug-12       now         HIGH
     MIDDLE    Aug-12       now         MIDDLE
     NORMAL    Aug-12       now         NORMAL
  8. Example 2: changed order of records in source

     Source:
     claim_number  product_code  cost_type  amount
     1234          A07           20         450
     1234          A07           40         350
     5678          C814          20         450
     5678          C814          40         350
     5678          C814          80         150
     3456          K81           20         370
     3456          K81           40         280
     7231          A774          20         450

     Reporting:
     claim_number  product_code  amount_total  amount_20  amount_40  amount_80
     1234          A07           800           450        350
     5678          C814          950           450        350        150
     3456          K81           650           370        280
     7231          A774          450           450
  9. Example 2: changed order of records in source

     Source (same data, different record order):
     claim_number  product_code  cost_type  amount
     1234          A07           40         350
     1234          A07           20         450
     3456          K81           40         280
     3456          K81           20         370
     5678          C814          80         150
     5678          C814          40         350
     5678          C814          20         450
     7231          A774          20         450

     Reporting (now broken: one row per source record instead of one per claim):
     claim_number  product_code  amount_total  amount_20  amount_40  amount_80
     1234          A07           350                      350
     1234          A07           450           450
     3456          K81           280                      280
     3456          K81           370           370
     5678          C814          150                                 150
     5678          C814          350                      350
     5678          C814          450           450
     7231          A774          450           450
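The failure above can be avoided by never letting the pivot depend on record order. A minimal sketch (hypothetical data and function names, not the presenter's actual pipeline): grouping by the business key produces one row per claim however the source happens to be sorted.

```python
# Hypothetical sketch: pivot claim lines per cost_type without relying on
# record order. Grouping by the business key (claim_number, product_code)
# yields one row per claim regardless of how the source is sorted.
from collections import defaultdict

rows = [  # (claim_number, product_code, cost_type, amount), reordered source
    (1234, "A07", 40, 350), (1234, "A07", 20, 450),
    (5678, "C814", 80, 150), (5678, "C814", 40, 350), (5678, "C814", 20, 450),
]

def pivot(rows):
    out = defaultdict(lambda: {"amount_total": 0})
    for claim, product, cost_type, amount in rows:
        rec = out[(claim, product)]
        rec["amount_total"] += amount          # running total per claim
        rec[f"amount_{cost_type}"] = amount    # one column per cost type
    return dict(out)

result = pivot(rows)
# One row per claim: (1234, "A07") -> amount_total 800; (5678, "C814") -> 950
```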
  10. Example 3: new filetype. Uploaded files per license; in the report:
      last updated file of type GMR.

      license_id  filename    documenttype  uploadedon        status
      2           file-18587  GMR           23/11/2022 07:41  1
      2           file-15602  KML           14/09/2022 09:28  1
      3           file-12931  GMR           14/07/2022 14:52  4
      5           file-16870  GMR           10/10/2022 06:38  4
      7           file-14095  GMR           10/08/2022 16:38  1
      8           file-10506  GMR           31/05/2022 17:03  1
      9           file-10037  GMR           20/05/2022 15:49  1
      10          file-18968  KML           05/12/2022 18:48  4
      10          file-16695  GMR           05/10/2022 17:29  4
      10          file-15173  GMR           05/09/2022 18:46  4
      12          file-3273   GMR           21/12/2021 12:15  1
      13          file-17448  GMR           25/10/2022 13:32  1
      13          file-14695  GMR           25/08/2022 04:04  4
      13          file-5118   GMR           27/01/2022 03:35  1

      license_id  filename    documenttype   uploadedon        status
      48          file-25106  StructuredGMR  18/04/2023 13:32  1
      58          file-16280  GMR            25/09/2022 10:39  1
      58          file-15995  GMR            19/09/2022 11:33  4
  11. What it means for data engineering: We are at the end of a long
      data chain, inheriting all the issues. Accept that your source
      data can change at any time. You need to build for resilience and
      flexibility. You need to be able to quickly validate. You are
      developing solutions for datasets, not for a number of specific
      cases. You need to be able to quickly spot changes (errors) in
      your source data.
  12. Build and test with production data 1. Only production data

    will cover a significant variety of data and data issues 2. Production data have the relevant day-to-day changes in the data your data pipelines need to cover 3. You need realistic volumes of data 4. Trying to distinguish between real data issues and flawed test data will cost you lots of time and frustration
  13. … but you may face heavy resistance. Principled objections:
      • It is against our standards! • Developers should never work
      with production data. Practical objections: • Confidentiality
      • Volume. Most of the objections come from a lack of appreciation
      of how data engineering differs from software engineering.
  14. Don’t try to do everything. 3 main quality criteria:
      completeness, correctness, timeliness. One or two of the three is
      reasonably achievable. If you want to do all 3 perfectly, the
      costs will go up exponentially, and in most analytic use cases
      there is no need for it. So challenge your business.
  15. Don’t try to do everything: examples. 3 main quality criteria:
      completeness, correctness, timeliness. One or two of the three is
      reasonably achievable. If you want to do all 3 perfectly, the
      costs will go up exponentially, and in most analytic use cases
      there is no need for it. So challenge your business.
  16. Build with troubleshooting in mind. You need to be able to easily
      validate the correctness, without the need for engineering
      knowledge (SQL example). Intermediate results are very useful for
      that. If you can only validate your pipeline or data product by
      changing the code to analyse part of it, or by storing
      intermediate results in a table for analysis, it would be a good
      idea to store them as a standard part of your process.
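One way to make intermediate results a standard part of the process is to materialise each step as its own table. A hypothetical sketch (table and step names are made up; SQLite stands in for the real database) so that validation becomes a plain SELECT anyone with SQL access can run:

```python
# Hypothetical sketch: persist each pipeline step as its own table so it
# can be validated with plain SQL, without changing pipeline code.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                [("Europe", 100), ("Europe", -5), ("Asia", 40)])

# Step 1: filter out invalid rows, but keep the rejects queryable.
con.execute("CREATE TABLE step1_rejected AS "
            "SELECT * FROM raw_sales WHERE amount < 0")
con.execute("CREATE TABLE step1_clean AS "
            "SELECT * FROM raw_sales WHERE amount >= 0")

# Step 2: aggregate -- again materialised, not hidden in one big query.
con.execute("CREATE TABLE step2_totals AS "
            "SELECT region, SUM(amount) AS total "
            "FROM step1_clean GROUP BY region")

# Validation is now a simple query against any intermediate table:
rejected = con.execute("SELECT COUNT(*) FROM step1_rejected").fetchone()[0]
# rejected == 1 (the negative-amount row)
```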
  17. Don’t hardcode code translations. Codes are not constant: they
      may change over time, differ between systems, have changing
      interpretations, and new ones get added. So make them part of
      your data management process, not of your code.
  18. Example 1: Code changes over time (IT servicedesk data)

      Source data:
      Priority  First Month  Last Month  Reporting Priority
      1         Sep-11       Feb-12      HIGH
      2         Sep-11       Feb-12      MIDDLE
      3         Sep-11       Feb-12      NORMAL
      HIGH      Feb-12       Jul-12      HIGH
      MIDDLE    Feb-12       Jul-12      MIDDLE
      LOW       Feb-12       Jul-12      NORMAL
      HIGH      Aug-12       now         HIGH
      MIDDLE    Aug-12       now         MIDDLE
      NORMAL    Aug-12       now         NORMAL
  19. Basic code translation structure

      Source-to-master translation:
      source_system  master code      code
      A              Ticket priority  1
      A              Ticket priority  2
      A              Ticket priority  3
      B              Ticket priority  HIGH
      B              Ticket priority  MIDDLE
      B              Ticket priority  LOW
      C              Ticket priority  HIGH
      C              Ticket priority  MIDDLE
      C              Ticket priority  NORMAL

      Master-to-reporting translation:
      master code      reporting code
      Ticket priority  HIGH
      Ticket priority  MIDDLE
      Ticket priority  NORMAL

      Combined (joined) view:
      source_system  master code      code    reporting code
      A              Ticket priority  1       HIGH
      A              Ticket priority  2       MIDDLE
      A              Ticket priority  3       NORMAL
      B              Ticket priority  HIGH    HIGH
      B              Ticket priority  MIDDLE  MIDDLE
      B              Ticket priority  LOW     NORMAL
      C              Ticket priority  HIGH    HIGH
      C              Ticket priority  MIDDLE  MIDDLE
      C              Ticket priority  NORMAL  NORMAL
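In code, this structure amounts to two lookup tables that are data, not logic. A minimal sketch (dictionaries stand in for the database tables, simplified to one master value per code; in practice both mappings would live in the data management process and be updated without code changes):

```python
# Hypothetical sketch of the two-step translation structure:
# (source_system, source_code) -> master code -> reporting code.
# Both mappings are data you can update, not if/else logic in the pipeline.
source_to_master = {
    ("A", "1"): "HIGH", ("A", "2"): "MIDDLE", ("A", "3"): "NORMAL",
    ("B", "HIGH"): "HIGH", ("B", "MIDDLE"): "MIDDLE", ("B", "LOW"): "NORMAL",
    ("C", "HIGH"): "HIGH", ("C", "MIDDLE"): "MIDDLE", ("C", "NORMAL"): "NORMAL",
}
master_to_reporting = {"HIGH": "HIGH", "MIDDLE": "MIDDLE", "NORMAL": "NORMAL"}

def translate(system, code):
    master = source_to_master.get((system, code))
    if master is None:
        return "UNMAPPED"  # surface new codes instead of crashing the load
    return master_to_reporting[master]
```

When a source introduces a new code, the pipeline keeps running and the "UNMAPPED" rows show up in monitoring, prompting an update to the translation table rather than a code change.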
  20. Use of code translation structures. For: code translations,
      custom classifications and groupings, selection criteria.
  21. Your objective for testing and monitoring: We want to prove that
      the full dataset is correct, and we want to be able to spot
      changes and issues caused by your sources. Not only now, but also
      in production, after every load. The good thing: a good
      monitoring process works perfectly well for regression and
      integration testing. You need production data.
  22. Approach 1: reconciliation. Ideally, find ways to prove that the
      result is correct. One way is using reconciliation reports, like
      finance uses. These are best designed by business data users.

      SALES per region  Source     Data warehouse  Reporting
      Europe            1,700,000  1,700,000       1,700,000
      Africa            940,000    942,000         942,000
      Asia              365,000    363,000         363,000
      US                9,993,000  9,993,000       9,993,000
      Latin America     210,000    210,000         210,000
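A reconciliation report like this can be checked automatically. A hypothetical sketch that compares the same total at each stage of the chain and flags regions that do not reconcile (stage names are assumptions; the figures follow three regions from the table above):

```python
# Hypothetical sketch of an automated reconciliation check: the same
# total (sales per region) is computed at each stage of the data chain
# and any region where the stages disagree is flagged.
source    = {"Europe": 1_700_000, "Africa": 940_000, "Asia": 365_000}
warehouse = {"Europe": 1_700_000, "Africa": 942_000, "Asia": 363_000}
reporting = {"Europe": 1_700_000, "Africa": 942_000, "Asia": 363_000}

def reconcile(stages):
    issues = []
    for region in stages["source"]:
        values = {name: data[region] for name, data in stages.items()}
        if len(set(values.values())) > 1:   # stages disagree
            issues.append((region, values))
    return issues

diffs = reconcile({"source": source, "warehouse": warehouse,
                   "reporting": reporting})
# Africa and Asia differ between source and warehouse; Europe reconciles.
```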
  23. Approach 2: statistics. Statistics can help you spot errors and
      changes. The easiest and most successful systems I have seen for
      that run standard queries every day and compare the results with
      previous runs.
  24. What to measure?
      • Number of records per table
      • Number of empty values per column (including empty strings or
        default dates)
      • Number of unique values per column
      • Number of missing foreign key references
      • Number of new business keys (new customers, new shops, new
        products)
      Meaningful business statistics: average sales per order
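A minimal sketch of such a daily statistics snapshot (column names and sample rows are illustrative; a real version would run against the database and store each snapshot so it can be diffed against previous loads):

```python
# Hypothetical sketch: compute per-table and per-column statistics of the
# kinds listed above, as one snapshot to compare with earlier runs.
def table_stats(rows, columns):
    stats = {"record_count": len(rows)}
    for i, col in enumerate(columns):
        values = [r[i] for r in rows]
        # Count NULLs and empty strings together, as on the slide.
        stats[f"{col}_empty"] = sum(1 for v in values if v in (None, ""))
        stats[f"{col}_unique"] = len(set(values))
    return stats

rows = [("1234", "A07", 450), ("5678", "", 350), ("5678", "C814", None)]
snapshot = table_stats(rows, ["claim_number", "product_code", "amount"])
# snapshot["record_count"] == 3; product_code and amount each have 1 empty
```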
  25. Example 2: changed order of records in source

      Source (same data, different record order):
      claim_number  product_code  cost_type  amount
      1234          A07           40         350
      1234          A07           20         450
      3456          K81           40         280
      3456          K81           20         370
      5678          C814          80         150
      5678          C814          40         350
      5678          C814          20         450
      7231          A774          20         450

      Reporting (broken: one row per source record instead of one per claim):
      claim_number  product_code  amount_total  amount_20  amount_40  amount_80
      1234          A07           350                      350
      1234          A07           450           450
      3456          K81           280                      280
      3456          K81           370           370
      5678          C814          150                                 150
      5678          C814          350                      350
      5678          C814          450           450
      7231          A774          450           450

      This was spotted by a data scientist during the mandatory
      data-loading check, by looking at the average amount per product
      code.
  26. Create signals to highlight possible issues. Basic assumption:
      the previous load was correct, so compare statistics with
      previous loads. Keep in mind that exceptions may only become
      visible after some time. How to do it depends on your business
      and data. Examples of signals on record counts: >2% growth;
      >0.5%; an extra alert if the growth is more than twice last
      month's average; negative growth for more than 1 day; more than
      3 days of 0 growth for all tables except reference tables.
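Two of the record-count signals above, sketched as code (thresholds taken from the slide; the history list and function name are hypothetical):

```python
# Hypothetical sketch: compare today's record count with the previous
# load and with the average of earlier daily growth rates (a stand-in
# for "last month's average").
def record_count_signals(history):
    """history: list of daily record counts, oldest first."""
    signals = []
    prev, today = history[-2], history[-1]
    growth = (today - prev) / prev
    if growth > 0.02:
        signals.append("growth > 2%")
    if growth < 0:
        signals.append("negative growth")
    daily = [(b - a) / a for a, b in zip(history, history[1:])]
    avg = sum(daily[:-1]) / len(daily[:-1])  # earlier days only
    if avg > 0 and growth > 2 * avg:
        signals.append("growth more than twice recent average")
    return signals

signals = record_count_signals([100, 101, 102, 110])
# -> ["growth > 2%", "growth more than twice recent average"]
```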
  27. Examples of signals. Number of unique values per column:
      supermarket chain, shop number in the sales table; usually no
      change, and if there is a change the business should know. Code
      fields: should be stable; if they grow, you need a sanity check.
      Number of empty values: either a fixed value (known errors) or a
      fixed percentage; if there is a big change, investigation is
      needed. Sometimes it might be easier to look at non-empty values.
  28. Example 3: new filetype. Uploaded files per license; in the
      report: last updated file of type GMR.

      license_id  filename    documenttype  uploadedon        status
      2           file-18587  GMR           23/11/2022 07:41  1
      2           file-15602  KML           14/09/2022 09:28  1
      3           file-12931  GMR           14/07/2022 14:52  4
      5           file-16870  GMR           10/10/2022 06:38  4
      7           file-14095  GMR           10/08/2022 16:38  1
      8           file-10506  GMR           31/05/2022 17:03  1
      9           file-10037  GMR           20/05/2022 15:49  1
      10          file-18968  KML           05/12/2022 18:48  4
      10          file-16695  GMR           05/10/2022 17:29  4
      10          file-15173  GMR           05/09/2022 18:46  4
      12          file-3273   GMR           21/12/2021 12:15  1
      13          file-17448  GMR           25/10/2022 13:32  1
      13          file-14695  GMR           25/08/2022 04:04  4
      13          file-5118   GMR           27/01/2022 03:35  1

      license_id  filename    documenttype   uploadedon        status
      48          file-25106  StructuredGMR  18/04/2023 13:32  1
      58          file-16280  GMR            25/09/2022 10:39  1
      58          file-15995  GMR            19/09/2022 11:33  4

      We could see it in the empty record counts.