Slide 1

Slide 1 text

Big Data Expo 2023
How to accelerate your analytics team with analytics engineering
Zhou Su, Analytics Engineer, zhou.su@xebia.com
Loek Botman, Analytics Engineer

Slide 2

Slide 2 text

Logistic Services Analytics
● Logistic Services: warehousing & delivery services to sellers on our platform
● Analytics: business decisions backed by data

Slide 3

Slide 3 text

Back to a pre-digital era

Slide 4

Slide 4 text

But it’s hard to gather information from a huge pile of books

Slide 5

Slide 5 text

Someone needed to clean up the mess ?

Slide 6

Slide 6 text

The librarian needed to clean up the mess

Slide 7

Slide 7 text

Data warehouses: our modern libraries
● Data engineer
● Data analyst
● ?

Slide 8

Slide 8 text

Analytics Engineers: our modern librarians
● Data engineer
● Data analyst
● Analytics Engineer

Slide 9

Slide 9 text

‘Transform’ is bigger than you think
● Data modeling
● Data testing
● Source freshness testing
● Data timeliness
● Data documentation & definitions
● Implementing right to be forgotten
● Code reusability
● Code linting
● Data accessibility
● Development vs production environments
● …

Slide 10

Slide 10 text

Why?
● Prevent duplicate definitions
● Increase productivity
● Increase discoverability
● Prevent data errors
● Prevent data downtime
● Not re-inventing the wheel

Slide 11

Slide 11 text

Data Engineer
● Supplies data to the data warehouse
● Tech-focused
● Strong programming component
Analytics Engineer
● Brings engineering principles to the field of analytics
● Business- & Tech-focused
● Data warehousing
● SQL fluency
Data Analyst
● Gets insights from the data
● Business-focused
● Communication & stakeholder management skills
● Data visualization

Slide 12

Slide 12 text

Our journey

Slide 13

Slide 13 text

Before and after Q4 2020
In Q4 2020 we started rethinking our analytics workflow
Before
01 Scattered ETL: AtScale cubes, BQ scripts, etc. spread over separate locations and tools
02 Hardly any testing: testing in scripts was tedious and mostly manual
03 Lack of documentation: both on our data products and on conventions, pipelines and stack
04 Development: even with great care, hard to control progress and roll back if necessary
After
01 Centralized ETL: all transformations in one project
02 Testing, testing, testing! Testing has become part of our definition of done
03 Documentation in Git/BQ: documentation is part of our definition of done and is exposed
04 CI/CD: easy to develop and automate tests before going to production

Slide 14

Slide 14 text

Build the foundation
● Tooling
● Modeling conventions & standards

Slide 15

Slide 15 text

Tech stack

Slide 16

Slide 16 text

Workflow container registry

Slide 17

Slide 17 text

We have a development environment next to our production environment
[Diagram: DEV project lsa-dev and PRO project lsa, each with a logistic_services dataset populated by dbt run]
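A minimal sketch of how a model name could resolve to a different table per environment. The project and dataset names come from the slide; the pairing of DEV with lsa-dev and PRO with lsa, the `Target` class, the `relation` helper and the `orders` model name are assumptions for illustration, not the team's actual dbt configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One deployment target: a BigQuery project plus a dataset."""
    project: str
    dataset: str


# Assumed mapping, read off the slide: DEV -> lsa-dev, PRO -> lsa.
TARGETS = {
    "dev": Target(project="lsa-dev", dataset="logistic_services"),
    "pro": Target(project="lsa", dataset="logistic_services"),
}


def relation(target_name: str, model: str) -> str:
    """Return the fully qualified table name a model would be built into."""
    target = TARGETS[target_name]
    return f"{target.project}.{target.dataset}.{model}"
```

The same model code then builds into lsa-dev during development and into lsa in production, so changes can be checked without touching production tables.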

Slide 18

Slide 18 text

Build foundational data models Create data models for key entities as basic building blocks

Slide 19

Slide 19 text

Data warehouse architecture
● Sources
● Bases: recast, renamed
● Marts: entities
● Common: enriched entity
● Exposures: dashboards, authorized views
● One big table: for analysis

Slide 20

Slide 20 text

Data warehouse architecture

Slide 21

Slide 21 text

Data warehouse architecture

Slide 22

Slide 22 text

Data warehouse architecture

Slide 23

Slide 23 text

Building data products
● Dashboards
● Analysis
● Data

Slide 24

Slide 24 text

Ensure quality
● CI/CD
● Tests
● Monitoring

Slide 25

Slide 25 text

GitLab CI

Slide 26

Slide 26 text

Data tests
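The slide content is not preserved here, so as a rough sketch only: the two most common dbt generic tests, `not_null` and `unique`, expressed in plain Python over rows held as dictionaries. The function names and the row format are hypothetical; in dbt these tests are declared in YAML and compiled to SQL.

```python
from collections import Counter
from typing import Any


def not_null_failures(rows: list[dict[str, Any]], column: str) -> list[dict[str, Any]]:
    """Return the rows in which the column is missing or None."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows: list[dict[str, Any]], column: str) -> list[Any]:
    """Return the non-null values that occur more than once in the column."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)
```

A test passes when its list of failures is empty, which is the same convention dbt uses: a test query that returns zero rows succeeds.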

Slide 27

Slide 27 text

Source freshness tests
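As an illustration of the idea behind a source freshness test: compare the most recent load timestamp of a source with a warning threshold and an error threshold. The function name and the thresholds are assumptions; dbt expresses the same idea with `warn_after` and `error_after` on a source's `loaded_at_field`.

```python
from datetime import datetime, timedelta


def freshness_status(
    latest_loaded_at: datetime,
    now: datetime,
    warn_after: timedelta,
    error_after: timedelta,
) -> str:
    """Classify a source as 'pass', 'warn' or 'error' based on its age."""
    age = now - latest_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"
```

Running such a check before the transformations start makes stale source data visible before it reaches a dashboard.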

Slide 28

Slide 28 text

Data quality dashboard
dbt-project-evaluator + custom macro

Slide 29

Slide 29 text

Increase analyst productivity
● Proper local development setup
● Macros to compare tables after refactors

Slide 30

Slide 30 text

Proper local development setup

Slide 31

Slide 31 text

Macros to compare tables after refactors
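The macro itself is not shown in the extracted slide. The sketch below only illustrates the underlying idea in Python: compare the rows of a table before and after a refactor, and report what matches and what differs. The `compare_tables` name and the tuple-per-row format are assumptions.

```python
from collections import Counter

Row = tuple  # one table row, with columns in a fixed order


def compare_tables(before: list[Row], after: list[Row]) -> dict:
    """Compare two tables row by row, treating duplicates as distinct rows."""
    before_counts = Counter(before)
    after_counts = Counter(after)
    return {
        # Rows present in both versions (multiset intersection).
        "matching": sum((before_counts & after_counts).values()),
        # Rows that disappeared after the refactor.
        "only_in_before": sorted((before_counts - after_counts).elements()),
        # Rows that the refactor introduced.
        "only_in_after": sorted((after_counts - before_counts).elements()),
    }
```

A refactor that is meant to leave the output unchanged should produce empty `only_in_before` and `only_in_after` lists.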

Slide 32

Slide 32 text

Looking back
● Higher productivity
● More collaboration
● Less stress
● More trust

Slide 33

Slide 33 text

Tips

Slide 34

Slide 34 text

Some practical tips
● Start with the technical stuff
● Continue with more focus on data modeling and conventions
● Think about centralization vs. decentralization

Slide 35

Slide 35 text

Thanks