Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Warehouse Forces and Threats

Data Warehouse Forces and Threats

Forces and threats in a data warehouse and why architecture are metadata are important.

Stefan Urbanek

November 25, 2019
Tweet

More Decks by Stefan Urbanek

Other Decks in Programming

Transcript

  1. FORCES AND THREATS
    or Why Architecture and Metadata are Important?
    Štefan Urbánek
    [email protected]
    @Stiivi DataNatives, November 2019

    View full-size slide

  2. OUTLINE
    ▪︎ Forces and Threats
    ▪︎ Traditional Suffering
    ▪︎ Damping the Forces and Threats
    ▪︎ How to start?
    ▪︎ Conclusions

    View full-size slide

  3. Data warehouse – a concept,
    not a technology.
    X Y Z

    View full-size slide

  4. FORCES AND THREATS

    View full-size slide

  5. Data Warehouse Forces and Threats:
    interactions that, when unopposed,
    will change the motion of
    technical, economical, social,
    organisational and process structures
    with high potential of causing suffering

    View full-size slide

  6. CHANGEforce
    Data Warehouse
    Technology
    * you wish!
    ?
    ?
    ?
    ?
    ?
    *
    *

    View full-size slide

  7. CHANGEforce
    Data Warehouse
    Technology
    ?
    ?
    ?
    ?
    ?
    1995
    2000
    2005
    2010
    2015

    View full-size slide

  8. GROWTHforce
    +
    +
    +
    +
    +
    +
    +
    +
    structure
    volume
    +
    +

    View full-size slide

  9. COMPLEXITY force

    potential relationship
    ownership definition
    production

    View full-size slide

  10. COMPLEXITY force
    label/definition
    consumer, owner
    n2 potential relationships*
    *only between tables,
    not even mentioning columns …
    What is relevant?

    View full-size slide

  11. THREATS*
    Mistakes in data
    might lead not only to wrong business decisions,
    but might also have
    legal, financial or existential implications.
    *serious and real

    View full-size slide

  12. TRADITIONAL SUFFERING
    usually chronic

    View full-size slide

  13. SUFFERING
    ▪︎ Bad consistency and no transparency
    No definitions, too many definitions, obscure definitions. Vague opinions in production.
    ▪︎ Slow time-to-market
    Time from a requirement or from observing a change to deployment in the production takes too
    much time.
    ▪︎ Low performance
    … despite having the best hardware, systems, algorithms.
    (some of it)

    View full-size slide

  14. How do we know, the data we are looking at
    is the data we think we are looking at?

    View full-size slide

  15. PERFORMANCE
    We solved “CPU starvation” problem!!!
    Why are our [internal] clients getting data 48-72 hours later?
    ! 20-30x "



    ⨝ ⨝ …

    quite big quite a lot
    ⨝ ⨝
    ⨝ ⨝






    stand-alone ETL process/script

    View full-size slide

  16. probably the same, who knows?
    IS

    IS

    uncontrolled growth

    View full-size slide

  17. DAMPING THE FORCES
    AND THREATS

    View full-size slide

  18. ARCHITECTURE + METADATA
    separation of concerns
    and reduction of complexity potential
    reduction and annotation
    of problem space
    and facilitation of reasoning

    A→B

    View full-size slide

  19. ARCHITECTURE

    View full-size slide

  20. Data Warehouse
    “Agreed-upon Analytical Truth”
    Metadata
    Sandbox/Playground
    Staging ~1:1 “Cleaned Augmented
    Operational Reality”
    Sources
    Quality Assurance
    Humans
    Cubes, Cuboids and Aggregates
    Machines
    External Data
    Platform Data
    Regulated Data


    proof-of-concepts, ad-hoc analysis
    business rules
    “typed tables”
    External API
    decision making
    automation
    analytics-augmented
    application
    3rd NF, ̣, ❄, …
    data scientists
    decision makers
    financial
    datamart(s)
    quality indicators
    src tgt
    ownership
    data models transformations

    A→B ?

    View full-size slide

  21. Data Warehouse
    “Agreed-upon Analytical Truth”
    Metadata
    Sandbox/Playground
    Staging ~1:1 “Cleaned Augmented
    Operational Reality”
    Sources
    Quality Assurance
    Humans
    Cubes, Cuboids and Aggregates
    Machines
    External Data
    Platform Data
    Regulated Data


    proof-of-concepts, ad-hoc analysis
    business rules
    “typed tables”
    External API
    decision making
    automation
    analytics-augmented
    application
    3rd NF, ̣, ❄, …
    data scientists
    decision makers
    financial
    datamart(s)
    quality indicators
    src tgt
    ownership
    data models transformations

    A→B ?

    View full-size slide

  22. METADATA
    ▪︎ Data Warehouse Assets
    concepts, entities, attributes, definitions,
    business rules, quality indicators,
    concept ownerships, …
    ▪︎ Many Perspectives
    conceptual, logical, physical, multi-dimensional, security, …
    ▪︎ Formalised, stored, shared, used
    revenue
    ?
    ?
    ?
    visits
    customers

    View full-size slide

  23. logical → physical
    multidimensional hierarchical → 3rd normal form logical
    3rd normal form logical → physical
    query → precomputed + computed
    denormalisation → joins
    physical → logical
    … → …

    View full-size slide

  24. Metadata Processing
    Metadata
    data models
    transformations

    A→B
    Compose Compile SQL
    physical schema
    dialect
    reality
    semantics
    realisation
    relational algebra
    Execute
    A,B,C
    parameters
    .cob
    .java
    .py

    View full-size slide

  25. STARTING WITH ARCHITECTURE
    1. Pick one:
    If in doubt – any known to work. Any separation of concerns is better than none.
    2. Make it formal and documented.
    Otherwise our effort will be dissolved and the content swampified.
    3. Stick with it for a while and observe.
    4. Adjust as necessary.

    View full-size slide

  26. STARTING WITH METADATA
    1. Pick a problem
    2. Use a spreadsheet
    Software at hand, no installation needed; universal, readable and editable by non-engineers.
    3. Suffer through the spreadsheet-exchange drill phase
    Mirror of our processes – seeing the genuine pain points will be useful later.
    4. Use functional approach to metadata composition and application
    … from those spreadsheets. Example: relational algebra library in the language of our ecosystem.
    99.(later) Move spreadsheets into a metadata repository

    View full-size slide

  27. “HELLO METADATA” PROBLEMS
    ▪︎ Data quality indicators1
    ▪︎ Structural (model ® schema) consistency check1
    ▪︎ Automation of common patterns
    denormalisation, aggregation, pivot
    ▪︎ Automate “relationalization” of freely-structured data
    JSON → relational
    ▪︎ Browsability
    1non-invasive, non-destructive

    View full-size slide

  28. Doing Things To Data
    Doing More Things
    To Data

    Doing Things To Data
    Doing More Things
    To Data

    Pipelines without metadata
    Pipelines with metadata
    metadata
    data

    View full-size slide

  29. DATA QUALITY INDICATORS
    Doing Things To Data
    Doing More Things
    To Data

    metadata
    data quality
    measurements
    data quality
    indicators
    data
    metadata
    definition, computation, warning/error thresholds,
    ownership, affected business entity, …

    View full-size slide

  30. COMMON PATTERNS
    Automatically Generated Artefacts
    Metadata
    Manually Crafted Artefacts
    IS

    denormalize
    aggregate
    pivot
    patterns

    controlled growth
    probably the same, who knows?
    IS

    IS

    uncontrolled growth

    View full-size slide

  31. VISUALISATION AND EXPLORATION
    Browse-ability: How can we explore a metric? How can we drill down?
    User Interface
    Metadata
    Physical Data
    Region

    name
    Sales
    Revenue
    Visits


    3
    2
    1
    id
    Cubes
    Geography

    name
    Date
    2

    id …
    1
    Dimensions
    Europe
    Germany
    Berlin
    regions
    Country
    City
    Levels
    2 region_code
    country_name

    2
    Country
    3
    country_iso
    1
    key
    Region

    name
    id
    City
    2
    dim label
    2
    region_name
    city_name
    city_code

    countries
    cities
    generated
    which column?
    concept-to-user propagation

    View full-size slide

  32. GET /cube/sales/aggregate? cut=date:2010
    & split=status:1
    &
    drilldown=date|region
    & page=10 page_size=100
    &
    SQL

    Metadata
    Logical Model Physical
    Physical Data Store
    Query Context
    Input
    Output
    Cube
    all attributes
    base attributes
    ⨝ joins
    database
    metadata
    Store
    Mapper
    locale
    parameters
    create schema
    collect and sort
    dependencies
    map attributes
    mappings
    mappings of
    base attributes
    fact table
    naming
    convention
    hierarchies
    Star Schema
    ̣/❄
    compile attributes
    base attributes
    dependant
    attributes
    columns
    make star
    (topological sort)
    query attributes
    SQL Query Context
    create context
    base columns
    column expressions for attributes
    SELECT, GROUP BY
    “star” join statement
    FROM
    conditions
    WHERE
    Cubes 1.1 – SQL Query Construction
    A,B,C?
    SQL

    View full-size slide

  33. TRANSPARENT REPRESENTATIONS
    Physical Data Store(s)
    Pre-Aggregated
    3
    rd
    Normal Form
    source of truth
    derived and managed artefacts
    Metadata




    Multi-Dimensional
    Query Server
    ∑ Aggregator
    metadata repository
    past 12 months
    ?
    ⨝s are expensive
    Alternative artefacts: a multi-dimensional data store

    View full-size slide

  34. SHIELD AGAINST FORCES AND THREATS
    ▪︎ Change
    ▪︎ Growth (structural)
    ▪︎ Complexity
    ▪︎ Threats
    financial, legal, existential

    View full-size slide

  35. Force/Threat Architecture Metadata
    Change separation of concerns abstraction, generalisation
    Growth (structural) separation of concerns, modularity
    optimisation through better
    reasoning
    Complexity separation of concerns, destroy-ability
    reduction of problem-space,
    coping with heterogeneity
    Threats transparency, separation of quality
    data accounting,
    verifiable data quality,
    provable consistency,
    source of truth

    View full-size slide

  36. There is path out of the suffering caused by the data warehouse
    forces and threats:
    The “shield” of architecture and metadata.

    View full-size slide

  37. THANK YOU
    Štefan Urbánek
    [email protected]
    @Stiivi

    View full-size slide