Slide 1

Slide 1 text

FORCES AND THREATS or Why Architecture and Metadata are Important? Štefan Urbánek [email protected] @Stiivi DataNatives, November 2019

Slide 2

Slide 2 text

OUTLINE ▪︎ Forces and Threats ▪︎ Traditional Suffering ▪︎ Damping the Forces and Threats ▪︎ How to start? ▪︎ Conclusions

Slide 3

Slide 3 text

Data warehouse – a concept, not a technology. X Y Z

Slide 4

Slide 4 text

FORCES AND THREATS

Slide 5

Slide 5 text

Data Warehouse Forces and Threats: interactions that, when unopposed, will change the motion of technical, economical, social, organisational and process structures with high potential of causing suffering

Slide 6

Slide 6 text

CHANGEforce Data Warehouse Technology * you wish! ? ? ? ? ? * *

Slide 7

Slide 7 text

CHANGEforce Data Warehouse Technology ? ? ? ? ? 1995 2000 2005 2010 2015

Slide 8

Slide 8 text

GROWTHforce + + + + + + + + structure volume + +

Slide 9

Slide 9 text

COMPLEXITY force … potential relationship ownership definition production

Slide 10

Slide 10 text

COMPLEXITY force label/definition consumer, owner n2 potential relationships* *only between tables, not even mentioning columns … What is relevant?

Slide 11

Slide 11 text

THREATS* Mistakes in data might lead not only to wrong business decisions, but might also have legal, financial or existential implications. *serious and real

Slide 12

Slide 12 text

TRADITIONAL SUFFERING usually chronic

Slide 13

Slide 13 text

SUFFERING ▪︎ Bad consistency and no transparency No definitions, too many definitions, obscure definitions. Vague opinions in production. ▪︎ Slow time-to-market Time from a requirement or from observing a change to deployment in the production takes too much time. ▪︎ Low performance … despite having the best hardware, systems, algorithms. (some of it)

Slide 14

Slide 14 text

How do we know, the data we are looking at is the data we think we are looking at?

Slide 15

Slide 15 text

PERFORMANCE We solved “CPU starvation” problem!!! Why are our [internal] clients getting data 48-72 hours later? ! 20-30x " ⨝ ⨝ ∑ ⨝ ⨝ … ⨝ quite big quite a lot ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ ⨝ stand-alone ETL process/script

Slide 16

Slide 16 text

probably the same, who knows? IS ∑ IS ∑ uncontrolled growth

Slide 17

Slide 17 text

DAMPING THE FORCES AND THREATS

Slide 18

Slide 18 text

ARCHITECTURE + METADATA separation of concerns and reduction of complexity potential reduction and annotation of problem space and facilitation of reasoning ∑ A→B

Slide 19

Slide 19 text

ARCHITECTURE

Slide 20

Slide 20 text

Data Warehouse “Agreed-upon Analytical Truth” Metadata Sandbox/Playground Staging ~1:1 “Cleaned Augmented Operational Reality” Sources Quality Assurance Humans Cubes, Cuboids and Aggregates Machines External Data Platform Data Regulated Data ∑ ∑ proof-of-concepts, ad-hoc analysis business rules “typed tables” External API decision making automation analytics-augmented application 3rd NF, ̣, ❄, … data scientists decision makers financial datamart(s) quality indicators src tgt ownership data models transformations ∑ A→B ? …

Slide 21

Slide 21 text

Data Warehouse “Agreed-upon Analytical Truth” Metadata Sandbox/Playground Staging ~1:1 “Cleaned Augmented Operational Reality” Sources Quality Assurance Humans Cubes, Cuboids and Aggregates Machines External Data Platform Data Regulated Data ∑ ∑ proof-of-concepts, ad-hoc analysis business rules “typed tables” External API decision making automation analytics-augmented application 3rd NF, ̣, ❄, … data scientists decision makers financial datamart(s) quality indicators src tgt ownership data models transformations ∑ A→B ? …

Slide 22

Slide 22 text

METADATA

Slide 23

Slide 23 text

METADATA ▪︎ Data Warehouse Assets concepts, entities, attributes, definitions, business rules, quality indicators, concept ownerships, … ▪︎ Many Perspectives conceptual, logical, physical, multi-dimensional, security, … ▪︎ Formalised, stored, shared, used revenue ? ? ? visits customers

Slide 24

Slide 24 text

logical → physical multidimensional hierarchical → 3rd normal form logical 3rd normal form logical → physical query → precomputed + computed denormalisation → joins physical → logical … → …

Slide 25

Slide 25 text

Metadata Processing Metadata data models transformations ∑ A→B Compose Compile SQL physical schema dialect reality semantics realisation relational algebra Execute A,B,C parameters .cob .java .py

Slide 26

Slide 26 text

HOW?

Slide 27

Slide 27 text

STARTING WITH ARCHITECTURE 1. Pick one: If in doubt – any known to work. Any separation of concerns is better than none. 2. Make it formal and documented. Otherwise our effort will be dissolved and the content swampified. 3. Stick with it for a while and observe. 4. Adjust as necessary.

Slide 28

Slide 28 text

STARTING WITH METADATA 1. Pick a problem 2. Use a spreadsheet Software at hand, no installation needed; universal, readable and editable by non-engineers. 3. Suffer through the spreadsheet-exchange drill phase Mirror of our processes – seeing the genuine pain points will be useful later. 4. Use functional approach to metadata composition and application … from those spreadsheets. Example: relational algebra library in the language of our ecosystem. 99.(later) Move spreadsheets into a metadata repository

Slide 29

Slide 29 text

“HELLO METADATA” PROBLEMS ▪︎ Data quality indicators1 ▪︎ Structural (model ® schema) consistency check1 ▪︎ Automation of common patterns denormalisation, aggregation, pivot ▪︎ Automate “relationalization” of freely-structured data JSON → relational ▪︎ Browsability 1non-invasive, non-destructive

Slide 30

Slide 30 text

Doing Things To Data Doing More Things To Data … Doing Things To Data Doing More Things To Data … Pipelines without metadata Pipelines with metadata metadata data

Slide 31

Slide 31 text

DATA QUALITY INDICATORS Doing Things To Data Doing More Things To Data … metadata data quality measurements data quality indicators data metadata definition, computation, warning/error thresholds, ownership, affected business entity, …

Slide 32

Slide 32 text

COMMON PATTERNS Automatically Generated Artefacts Metadata Manually Crafted Artefacts IS ∑ denormalize aggregate pivot patterns ∑ controlled growth probably the same, who knows? IS ∑ IS ∑ uncontrolled growth

Slide 33

Slide 33 text

VISUALISATION AND EXPLORATION Browse-ability: How can we explore a metric? How can we drill down? User Interface Metadata Physical Data Region … name Sales Revenue Visits … … 3 2 1 id Cubes Geography … name Date 2 … id … 1 Dimensions Europe Germany Berlin regions Country City Levels 2 region_code country_name … 2 Country 3 country_iso 1 key Region … name id City 2 dim label 2 region_name city_name city_code … countries cities generated which column? concept-to-user propagation

Slide 34

Slide 34 text

GET /cube/sales/aggregate? cut=date:2010 & split=status:1 & drilldown=date|region & page=10 page_size=100 & SQL → Metadata Logical Model Physical Physical Data Store Query Context Input Output Cube all attributes base attributes ⨝ joins database metadata Store Mapper locale parameters create schema collect and sort dependencies map attributes mappings mappings of base attributes fact table naming convention hierarchies Star Schema ̣/❄ compile attributes base attributes dependant attributes columns make star (topological sort) query attributes SQL Query Context create context base columns column expressions for attributes SELECT, GROUP BY “star” join statement FROM conditions WHERE Cubes 1.1 – SQL Query Construction A,B,C? SQL

Slide 35

Slide 35 text

TRANSPARENT REPRESENTATIONS Physical Data Store(s) Pre-Aggregated 3 rd Normal Form source of truth derived and managed artefacts Metadata ∑ ∑ ∑ ∑ Multi-Dimensional Query Server ∑ Aggregator metadata repository past 12 months ? ⨝s are expensive Alternative artefacts: a multi-dimensional data store

Slide 36

Slide 36 text

CONCLUSIONS

Slide 37

Slide 37 text

SHIELD AGAINST FORCES AND THREATS ▪︎ Change ▪︎ Growth (structural) ▪︎ Complexity ▪︎ Threats financial, legal, existential

Slide 38

Slide 38 text

Force/Threat Architecture Metadata Change separation of concerns abstraction, generalisation Growth (structural) separation of concerns, modularity optimisation through better reasoning Complexity separation of concerns, destroy-ability reduction of problem-space, coping with heterogeneity Threats transparency, separation of quality data accounting, verifiable data quality, provable consistency, source of truth

Slide 39

Slide 39 text

There is path out of the suffering caused by the data warehouse forces and threats: The “shield” of architecture and metadata.

Slide 40

Slide 40 text

THANK YOU Štefan Urbánek [email protected] @Stiivi