AzureBootcamp2023: Building a Lakehouse Platform on Azure by Hansjörg Wingeier & Mathias Herzog

In this talk, we introduce Hadron, a project to build a new platform on which die Mobiliar runs all of its BI, ML, analytics, and AI workloads. The platform is built on the Databricks Lakehouse architecture and hosted on Azure. We use multiple Databricks Workspaces and Unity Catalog. The project has to adhere to our company's predefined cloud architecture for application deployment on Azure. The data on the platform is organized using the Data Mesh / Data Product approach. We began the project in February 2022 and have made significant progress in building the infrastructure, which now allows large-scale data products to be deployed on the platform.

Our talk will cover the following topics:

Introduction: We’ll discuss our goals for the new platform and why we chose the Databricks Lakehouse approach.
The key components of the Lakehouse architecture: Unity Catalog, Databricks Workspaces, and how access control is managed.
Overview of the cloud architecture within our company and how it informs the infrastructure of the platform.
Infrastructure overview: how we organize Databricks Workspaces, Azure Data Lake Storages, and how the storage is networked with the workspaces to ensure secure access.
Automation: We’ll explain how we use GitLab Pipelines and Terraform, as well as Databricks APIs, to automate the deployment of resources and manage access controls.
Key decisions made during the architecture design and why they were important.
Challenges we encountered during the project and how we overcame them.
Current status of the infrastructure: what data is available and who is currently using the platform.
Next steps: where we plan to go from here.
🙂 HANSJÖRG WINGEIER ⚡️ IT Architect @ Die Mobiliar
🙂 MATHIAS HERZOG ⚡️ Cloud Consultant @ peakscale.ch

Azure Zurich User Group
May 11, 2023

Transcript

  1. Building a Lakehouse Platform on Azure with Databricks
    Hansjörg Wingeier, IT Architect, die Mobiliar
    Mathias Herzog, Senior Cloud Consultant, Peak Scale

  2. Name
    Hansjörg Wingeier
    Job
    IT Architect
    Company
    die Mobiliar
    Team
    Bricks @ Die Mobiliar
    Contact
    https://www.linkedin.com/in/hansjoerg-wingeier/
    Fun Stuff
    Krav Maga, Reading
    Tags
    Infra: Azure, Databricks; SW: Python, Java, SQL, ML, RL, Data Processing; IAC: Terraform

  3. Name
    Mathias Herzog
    Job
    Infrastructure Artist
    Company
    peakscale.cloud
    Team
    Bricks @ Die Mobiliar
    Contact
    linkedin.com/in/mathias-herzog-888a6788
    Fun Stuff
    Hiking, Biking, Skiing, doing fun stuff
    Tags
    Infra: Azure, K8s, Linux, Databricks; SW: Python, Go, JS; IAC: Terraform, Ansible

  4. Vision
    23.05.2023

  5. Vision - The Journey from EDWH to Dataproducts & DataMesh
    Governance
    Data Mesh
    Building a lakehouse platform on Azure with Databricks
    Delivering data driven value at scale
    sourcing
    managing
    accessing
    How to work with data?
    How to produce data?
    How to provide data?

  6. Agenda
    Context: Databricks Lakehouse, Mobiliar Cloud Architecture
    Hadron Infrastructure: Components & Deployment, Connecting Components, Automate Deployment
    Final Remarks: Conclusion, Outlook

  7. Part I
    Context

  8. Databricks
    Lakehouse

  9. Databricks Lakehouse Platform
    https://www.databricks.com/product/data-lakehouse
    Unity Catalog
    Metastore
    Users
    SPs
    Groups
    Storage Workspaces

  10. Unity Catalog Metastore
    Unity Catalog
    Metastore
    Users
    SPs
    Groups
    Storage
    ExternalLocation ExternalLocation DefaultStorage
    Catalog Catalog Catalog
    Schema Schema Schema Schema Schema
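
The hierarchy on this slide (a metastore holding catalogs, each holding schemas) can be sketched as a small data model. This is a minimal illustration, not the Hadron implementation: the classes and the catalog and schema names are hypothetical; only the name Mobi-Metastore is taken from the deck.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """A catalog groups schemas (hypothetical model, not a Databricks class)."""
    name: str
    schemas: list[str] = field(default_factory=list)


@dataclass
class Metastore:
    """Top-level container of the hierarchy: metastore -> catalog -> schema."""
    name: str
    catalogs: list[Catalog] = field(default_factory=list)

    def schema_names(self) -> list[str]:
        """Return every schema as a two-part 'catalog.schema' name."""
        return [f"{c.name}.{s}" for c in self.catalogs for s in c.schemas]


# Catalog and schema names below are invented for illustration.
metastore = Metastore(
    name="Mobi-Metastore",
    catalogs=[
        Catalog("app1_prd", ["bronze", "gold"]),
        Catalog("sample", ["demo"]),
    ],
)

for full_name in metastore.schema_names():
    print(full_name)
```

Tables would add a third level below each schema, giving the three-part `catalog.schema.table` names used in Unity Catalog.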

  11. (Azure) Databricks Workspace
    Operate
    Develop
    Spark infrastructure
    As needed Batches and Streams ML SQL / Dashboards Workflows
    Workspaces

  12. Mobiliar
    Cloud Architecture

  13. Development process
    The development process at Die Mobiliar follows DevOps principles, specifically those of Continuous Delivery. It defines the following steps, which are carried out for every new feature or bugfix:

  14. Development process
    Continuous Integration
    CI automatically builds, tests, and integrates code changes in a repository
    Continuous Deployment
    CD automatically deploys code changes directly to customers. Every commit is a potential release

  15. Mobiliar Cloud Architecture - CI/CD Pipeline
    Project structure
    PreProd Prod
    App 2 App 1
    Deploy infrastructure
    App1-preprod-runner App1-prod-runner
    App2-preprod-runner App2-prod-runner
    IIQ/IAM
    (on-premise)
    Azure AD
    Unity Catalog
    Mobi-Metastore
    Users
    SPs
    Groups
    Domain X
    App 1
    Project
    Domain Y
    App 2
    Project

  16. Part II
    Hadron Infrastructure

  17. Components & Deployment

  18. Main components of the Mobiliar Lakehouse
    "DBx4DEV (Databricks Developer Workplace)"
    Workplace of Data Engineers to develop dbxservices
    One single instance for all Data Engineers
    "DBxDSW (Databricks Data Scientist Workplace)"
    Workplace of Data Scientists to work with production data
    One instance per use case
    "DBxRT (Databricks Runtime)"
    Databricks Workspace to which workflows are deployed
    One instance for every application and stage
    Storage Accounts
    The Storage Accounts used by Unity Catalog to store the data
    One instance for every application and stage
    Databricks Unity Catalog
    Centralized governance and access to decentralized data
    Workspaces
    DBx4Dev
    DSW UC1 DSW UC2 DSW UC3 DSW UC4 DSW UC5 DSW UC6
    RT APP1 RT APP2 RT APP3
    Unity Catalog
    Metastore
    Users
    SPs
    Groups
    Storage
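
The instance rules on this slide (one DBx4Dev for everyone, one DSW per use case, one runtime and one storage account per application and stage) can be enumerated with a short sketch. The function name and the storage labels are assumptions made for illustration; the RT and DSW labels follow the deck's diagrams.

```python
from itertools import product


def plan_instances(apps, stages, use_cases):
    """Enumerate the instances implied by the per-component rules."""
    # Runtimes and storage accounts exist once per (application, stage) pair.
    per_app_stage = list(product(apps, stages))
    return {
        "DBx4Dev": ["DBx4Dev"],  # single shared developer workspace
        "DBxDSW": [f"DSW {uc}" for uc in use_cases],  # one per use case
        "DBxRT": [f"RT {app} {stage}" for app, stage in per_app_stage],
        # "Storage <app> <stage>" is a hypothetical label, not from the deck.
        "Storage": [f"Storage {app} {stage}" for app, stage in per_app_stage],
    }


plan = plan_instances(
    apps=["APP1", "APP2"],
    stages=["Pre", "Prd"],
    use_cases=["UC1", "UC2", "UC3"],
)
for component, instances in plan.items():
    print(component, instances)
```

With two applications and two stages this yields four runtimes and four storage accounts, which is why the number of resources grows quickly as teams are onboarded.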

  19. Deployment of main components
    PreProd Prod
    IIQ/IAM
    (on-premise)
    Azure AD
    Mobi-Metastore
    Users
    SPs
    Groups
    Unity Catalog
    App 2 App 1 HAD
    Workspaces
    Storages
    App 2 App 1 HAD
    Preprod Data Prod Data Sample
    Data
    DBx4Dev
    RT APP1 Pre RT APP1 Prd
    RT APP2 Pre RT APP2 Prd
    DSW UC1 DSW UC2 DSW UC3
    DSW UC4 DSW UC5
    How to ensure
    the isolation?

  20. Connecting Components

  21. Databricks Layer
    PreProd Prod
    IIQ/IAM
    (on-premise)
    Azure AD
    Mobi-Metastore
    Users
    SPs
    Groups
    Unity Catalog
    App 2 App 1 HAD
    Workspaces
    Storages
    App 2 App 1 HAD
    Preprod Data Prod Data Sample
    Data
    DBx4Dev
    RT APP1 Pre RT APP1 Prd
    RT APP2 Pre RT APP2 Prd
    DSW UC1 DSW UC2 DSW UC3
    DSW UC4 DSW UC5
    EL App1 Pre EL App2 Pre EL App1 Prd EL App2 Prd EL Sample
    Cat App1 Prd Cat App2 Prd
    Cat App1 Pre Cat App2 Pre Cat Sample
    grants grants grants grants
    grants
    uses uses uses uses uses
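
One way to read this diagram: each application and stage gets an external location (EL) pointing at its storage account, a catalog (Cat) that uses that location, and grants on the catalog. The sketch below only builds Databricks SQL statements as strings; it does not run them. The object naming scheme, storage URL, storage credential, and group name are hypothetical, and the deck does not state which principals receive which privileges or what exactly "grants" covers on the real platform.

```python
def catalog_statements(app, stage, storage_url, credential, principal):
    """Build SQL for one application and stage: location, catalog, grant."""
    location = f"el_{app}_{stage}"  # hypothetical naming scheme
    catalog = f"cat_{app}_{stage}"  # hypothetical naming scheme
    return [
        # The external location registers the storage path in Unity Catalog.
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {location} "
        f"URL '{storage_url}' WITH (STORAGE CREDENTIAL {credential})",
        # The catalog stores its managed data under that path.
        f"CREATE CATALOG IF NOT EXISTS {catalog} "
        f"MANAGED LOCATION '{storage_url}'",
        # Principals then need a grant before they can use the catalog.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
    ]


statements = catalog_statements(
    app="app1",
    stage="prd",
    storage_url="abfss://data@example.dfs.core.windows.net/",  # placeholder
    credential="cred_app1_prd",  # placeholder
    principal="grp_app1_readers",  # placeholder
)
for statement in statements:
    print(statement + ";")
```

Keeping one location and one catalog per application and stage means a grant on a PreProd catalog never exposes Prod data, which is one possible answer to the isolation question raised earlier in the deck.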

  22. Network Layer
    PreProd Prod
    IIQ/IAM
    (on-premise)
    Azure AD
    Unity Catalog
    Mobi-Metastore
    Users
    SPs
    Groups
    App 2 App 1 HAD
    Workspaces
    Storages
    App 2 App 1 HAD
    Preprod Data Prod Data Sample Data
    DBx4Dev
    RT APP1 Pre RT APP1 Prd
    RT APP2 Pre RT APP2 Prd
    DSW UC1 DSW UC2 DSW UC3
    DSW UC4 DSW UC5
    EL App1 Pre EL App2 Pre EL App1 Prd EL App2 Prd EL Sample
    Cat App1 Prd Cat App2 Prd
    Cat App1 Pre Cat App2 Pre Cat Sample

  23. Automate Deployment

  24. Databricks / Unity Catalog automation - grants
    PreProd Prod
    App 2 App 1 HAD
    Deploy infrastructure
    HAD-preprod-runner HAD-prod-runner
    App1-preprod-runner App1-prod-runner
    App2-preprod-runner App2-prod-runner
    Mobi-Metastore
    Users
    SPs
    Groups
    Unity Catalog
    RT APP1 Prd
    deploys
    Domain IT
    HAD
    Project
    Domain X
    App 1
    Project
    Domain X
    App 1
    Project
    Gitlab structure

  25. Deployment of Storage, Runtimes, & DSW
    Prod
    IIQ/IAM
    (on-premise)
    Azure AD
    Mobi-Metastore
    Users
    SPs
    Groups
    Unity Catalog
    App 2 App 1 HAD
    Workspaces
    Storages
    App 2 App 1 HAD
    RT APP1 Prd
    RT APP2 Prd
    Cat App2 Prd
    EL App2 Prd
    uses
    Cat App1 Prd
    EL App1 Prd
    uses
    • Domain 1
    • HAD
    Order-infrastructure-job
    • App 1
    CreateStorage-iac
    CreateRuntime-iac
    CreateDSW-iac
    Project structure
    DSW UC1
    DSWUC1User
    DSWUC1User
    grant access
    DSWUC1User

  26. Deployment of Storage, Runtimes, & DSW
    The order process pipeline connecting things

  27. Terraform or RestAPI
    Terraform, advantages
    • Declarative, IaC pattern
    • Domain Specific (DSL)
    • Widely accepted at Die Mobiliar
    • Manages state (housekeeping is easy)
    • A Terraform Stack provided by Die Mobiliar
    Terraform, drawbacks
    • Hard to manage n:m relations (complex logic)
    • Debugging is cumbersome
    • Hard to use imperatively (using loops, etc.)
    • Provider initialization per subscription and WS
    • Not all Databricks API functionality is available
    RestAPI, advantages
    • Full-blown programming language
    • Code assistance
    • Debugging is easy
    • Easy to test
    RestAPI, drawbacks
    • No state management
    • More work
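
The trade-off listed for the RestAPI route (no state management, but a full programming language for n:m relations) can be illustrated with a reconciliation step. The sketch assumes the current grants have already been fetched and computes the changes needed to reach the desired grants. The payload shape mirrors what the Unity Catalog permissions endpoint is understood to accept (a list of changes with principal, add, and remove); treat that shape, the group names, and the privilege names as assumptions to check against the Databricks API reference. No HTTP call is made.

```python
def permission_changes(desired, current):
    """Diff two principal -> privileges mappings into one change payload."""
    changes = []
    # Visit every principal seen on either side so stale grants are revoked.
    for principal in sorted(set(desired) | set(current)):
        want = desired.get(principal, set())
        have = current.get(principal, set())
        add = sorted(want - have)
        remove = sorted(have - want)
        if add or remove:
            changes.append({"principal": principal, "add": add, "remove": remove})
    return {"changes": changes}


# Hypothetical groups and privileges for a single catalog.
desired = {
    "grp_a": {"USE_CATALOG", "SELECT"},
    "grp_b": {"USE_CATALOG"},
}
current = {
    "grp_a": {"USE_CATALOG"},
    "grp_c": {"SELECT"},
}

payload = permission_changes(desired, current)
print(payload)
```

Because nothing records what was applied last time, the script has to read the current grants on every run and derive the difference itself. That is the housekeeping Terraform's state file would otherwise provide.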

  28. Part III
    Final Remarks

  29. Conclusion
    • The defined infrastructure around the Databricks Lakehouse components is working
    • Automation was possible
    • Everything is still in development; we faced significant feature changes
      (e.g., multiple Metastores in the same region used to be possible, now only one Metastore per region is supported)
    • Easy to let costs explode

  30. Our first production Data Products

  31. Outlook
    • Monitor and control costs / Become cost-aware / Capex vs Opex
    • Gain experience / How to develop efficiently
    • Onboard new teams