AzureBootcamp2023: Building a Lakehouse Platform on Azure by Hansjörg Wingeier & Mathias Herzog

In this talk, we introduce Hadron, a project to build a new platform on which die Mobiliar runs all BI, ML, analytics, and AI workloads. The platform is built on the Databricks Lakehouse architecture and hosted on Azure, using multiple Databricks Workspaces and Unity Catalog. The project has to adhere to our company's predefined cloud architecture for application deployment on Azure. Data on the platform is organized following the Data Mesh / Data Product approach. We began the project in February 2022 and have since built enough of the infrastructure that large-scale data products can be deployed on the platform.

Our talk will cover the following topics:

Introduction: We’ll discuss our goals for the new platform and why we chose the Databricks Lakehouse approach.
The key components of the Lakehouse architecture: Unity Catalog, Databricks Workspaces, and how access control is managed.
Overview of the cloud architecture within our company and how it informs the infrastructure of the platform.
Infrastructure overview: how we organize Databricks Workspaces, Azure Data Lake Storages, and how the storage is networked with the workspaces to ensure secure access.
Automation: We’ll explain how we use GitLab Pipelines and Terraform, as well as Databricks APIs, to automate the deployment of resources and manage access controls.
Key decisions made during the architecture design and why they were important.
Challenges we encountered during the project and how we overcame them.
Current status of the infrastructure: what data is available and who is currently using the platform.
Next steps: where we plan to go from here.
Hansjörg Wingeier, IT Architect @ Die Mobiliar
Mathias Herzog, Cloud Consultant @ peakscale.ch
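
The access-control topic listed above (Unity Catalog, catalogs, schemas, and grants) can be illustrated with a minimal Python sketch. It is not the speakers' implementation: the principal `dsw_uc1_users`, the catalog `app1_prd`, and the schema `sales` are hypothetical names chosen for illustration, and the sketch only renders Unity Catalog `GRANT` statements as strings without executing them against a workspace.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Grant:
    """One privilege on one Unity Catalog securable for one principal."""

    principal: str       # group, user, or service principal (hypothetical)
    privilege: str       # e.g. "USE CATALOG", "USE SCHEMA", "SELECT"
    securable_type: str  # "CATALOG" or "SCHEMA" in this sketch
    securable_name: str  # catalog name or catalog.schema

    def to_sql(self) -> str:
        # Render the grant as a SQL statement; nothing is executed here.
        return (
            f"GRANT {self.privilege} ON {self.securable_type} "
            f"{self.securable_name} TO `{self.principal}`"
        )


def read_only_grants(principal: str, catalog: str, schema: str) -> List[Grant]:
    """Grants a principal needs to read every table in one schema.

    Reading requires usage rights on each level of the
    catalog.schema hierarchy plus SELECT on the schema itself.
    """
    full_schema = f"{catalog}.{schema}"
    return [
        Grant(principal, "USE CATALOG", "CATALOG", catalog),
        Grant(principal, "USE SCHEMA", "SCHEMA", full_schema),
        Grant(principal, "SELECT", "SCHEMA", full_schema),
    ]


if __name__ == "__main__":
    # Assumed example: a data-science use-case group reads one prod schema.
    for grant in read_only_grants("dsw_uc1_users", "app1_prd", "sales"):
        print(grant.to_sql())
```

Keeping grants as data rather than hand-written SQL makes it easier to generate them per application and stage from a pipeline, which is one plausible way to approach the automation described in the talk.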

Transcript

  1. Building a Lakehouse Platform on Azure with Databricks

    Hansjörg Wingeier, IT Architect, die Mobiliar
    Mathias Herzog, Senior Cloud Consultant, Peak Scale
  2. Speaker profile: Hansjörg Wingeier

    Job: IT Architect
    Company: die Mobiliar
    Team: Bricks @ Die Mobiliar
    Contact: https://www.linkedin.com/in/hansjoerg-wingeier/
    Fun stuff: Krav Maga, reading
    Tags: Infra: Azure, Databricks; SW: Python, Java, SQL, ML, RL, Data Processing; IaC: Terraform
  3. Speaker profile: Mathias Herzog

    Job: Infrastructure Artist
    Company: peakscale.cloud
    Team: Bricks @ Die Mobiliar
    Contact: linkedin.com/in/mathias-herzog-888a6788
    Fun stuff: Hiking, biking, skiing, doing fun stuff
    Tags: Infra: Azure, K8s, Linux, Databricks; SW: Python, Go, JS; IaC: Terraform, Ansible
  4. Vision - The Journey from EDWH to Dataproducts & DataMesh

    23.05.2023
    Delivering data-driven value at scale
    Diagram labels: Governance; Data Mesh; sourcing, managing, accessing
    How to work with data? How to produce data? How to provide data?
  5. Agenda

    Context: Databricks Lakehouse, Mobiliar Cloud Architecture
    Hadron Infrastructure: Components & Deployment, Connecting Components, Automate Deployment
    Final Remarks: Conclusion, Outlook
  6. Databricks Lakehouse Platform

    https://www.databricks.com/product/data-lakehouse
    Diagram labels: Unity Catalog (Metastore; Users, SPs, Groups); Storage; Workspaces
  7. Unity Catalog Metastore

    Diagram labels: Unity Catalog (Metastore; Users, SPs, Groups); Storage (DefaultStorage, ExternalLocation, ExternalLocation); Catalogs, each containing Schemas
  8. (Azure) Databricks Workspace

    Diagram labels: Operate; Develop; Spark infrastructure as needed; Batches and Streams; ML; SQL / Dashboards; Workflows; Workspaces
  9. Development process

    The development process at Die Mobiliar is defined according to DevOps principles, specifically those of Continuous Delivery. It defines the following steps, to be carried out for each added feature or bugfix:
  10. Development process

    Continuous Integration (CI) automatically builds, tests, and integrates code changes in a repository.
    Continuous Deployment (CD) automatically deploys code changes directly to customers. Every commit is a potential release.
  11. Mobiliar Cloud Architecture - CI/CD Pipeline

    Diagram labels: Project structure (Domain X: App 1 Project; Domain Y: App 2 Project); Deploy infrastructure via App1-preprod-runner, App1-prod-runner, App2-preprod-runner, App2-prod-runner; four <Subscription> boxes for App 1 and App 2 across PreProd and Prod; IIQ/IAM (on-premise); Azure AD; Unity Catalog (Mobi-Metastore; Users, SPs, Groups)
  12. Main components of the Mobiliar Lakehouse

    "DBx4DEV (Databricks Developer Workplace)": workplace of Data Engineers to develop dbxservices. One single instance for all Data Engineers.
    "DBxDSW (Databricks Data Scientist Workplace)": workplace of Data Scientists to work with productive data. One instance per use case.
    "DBxRT (Databricks Runtime)": Databricks Workspace to which workflows are deployed. One instance for every application and stage.
    Storage Accounts: the storage accounts used by Unity Catalog to store the data. One instance for every application and stage.
    Databricks Unity Catalog: centralized governance and access to decentralized data.
    Diagram labels: Unity Catalog (Metastore; Users, SPs, Groups); Storage; Workspaces (DBx4Dev; DSW UC1 to DSW UC6; RT APP1 to RT APP3)
  13. Deployment of main components

    How to ensure the isolation?
    Diagram labels: PreProd, Prod; IIQ/IAM (on-premise); Azure AD; Unity Catalog (Mobi-Metastore; Users, SPs, Groups); App 1, App 2, and HAD, each with Workspaces and Storages; Storages: Preprod Data, Prod Data, Sample Data; Workspaces: DBx4Dev, RT APP1 Pre, RT APP1 Prd, RT APP2 Pre, RT APP2 Prd, DSW UC1 to DSW UC5
  14. Databricks Layer

    Diagram labels: PreProd, Prod; IIQ/IAM (on-premise); Azure AD; Unity Catalog (Mobi-Metastore; Users, SPs, Groups); App 1, App 2, and HAD, each with Workspaces and Storages; Storages: Preprod Data, Prod Data, Sample Data; Workspaces: DBx4Dev, RT APP1 Pre, RT APP1 Prd, RT APP2 Pre, RT APP2 Prd, DSW UC1 to DSW UC5; EL App1 Pre, EL App2 Pre, EL App1 Prd, EL App2 Prd, EL Sample; Cat App1 Pre, Cat App2 Pre, Cat App1 Prd, Cat App2 Prd, Cat Sample; five arrows labeled "grants" and five labeled "uses"
  15. Network Layer

    Diagram labels: PreProd, Prod; IIQ/IAM (on-premise); Azure AD; Unity Catalog (Mobi-Metastore; Users, SPs, Groups); App 1, App 2, and HAD, each with Workspaces and Storages; Storages: Preprod Data, Prod Data, Sample Data; Workspaces: DBx4Dev, RT APP1 Pre, RT APP1 Prd, RT APP2 Pre, RT APP2 Prd, DSW UC1 to DSW UC5; EL App1 Pre, EL App2 Pre, EL App1 Prd, EL App2 Prd, EL Sample; Cat App1 Pre, Cat App2 Pre, Cat App1 Prd, Cat App2 Prd, Cat Sample
  16. Databricks / Unity Catalog automation - grants

    Diagram labels: Gitlab structure (Domain IT: HAD Project; Domain X: App 1 Project; Domain X: App 1 Project); Deploy infrastructure via HAD-preprod-runner, HAD-prod-runner, App1-preprod-runner, App1-prod-runner, App2-preprod-runner, App2-prod-runner; PreProd, Prod; App 1, App 2, HAD; Unity Catalog (Mobi-Metastore; Users, SPs, Groups); RT APP1 Prd; arrow labeled "deploys"
  17. Deployment of Storage, Runtimes, & DSW

    Diagram labels: Prod; IIQ/IAM (on-premise); Azure AD; Unity Catalog (Mobi-Metastore; Users, SPs, Groups); App 1, App 2, and HAD, each with Workspaces and Storages; RT APP1 Prd, RT APP2 Prd; Cat App1 Prd uses EL App1 Prd; Cat App2 Prd uses EL App2 Prd; DSW UC1; DSWUC1User; arrow labeled "grant access"
    Project structure: Domain 1; HAD (Order-infrastructure-job); App 1 (CreateStorage-iac, CreateRuntime-iac, CreateDSW-iac)
  18. Deployment of Storage, Runtimes, & DSW

    The order process pipeline connecting things
  19. Terraform or REST API

    Terraform, pros: declarative IaC pattern; domain-specific language (DSL); widely accepted at Die Mobiliar; manages state (housekeeping is easy); a Terraform stack is provided by Die Mobiliar
    Terraform, cons: hard to manage n:m relations (complex logic); debugging is cumbersome; hard to use imperatively (using loops, etc.); provider initialization per subscription and workspace; not all Databricks API functionality is available
    REST API, pros: full-blown programming language; code assistance; debugging is easy; easy to test
    REST API, cons: no state management; more work
  20. Conclusion

    • The defined infrastructure around the Databricks Lakehouse components is working
    • Automation was possible
    • Everything is still in development, and we faced significant feature changes (e.g., multiple Metastores in the same region used to be possible; now only one Metastore per region is supported)
    • It is easy to let costs explode
  21. Outlook

    • Monitor and control costs / Become cost-aware / Capex vs Opex
    • Gain experience / How to develop efficiently
    • Onboard new teams
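
The "Terraform or REST API" comparison in the transcript notes that the REST route has no state management and means more work. One way to picture that extra work is the following minimal Python sketch, which computes the difference between desired and current grants so that only the delta needs to be sent. The `changes` list with `principal`, `add`, and `remove` keys follows my understanding of the Unity Catalog permissions endpoint and should be checked against the Databricks REST API documentation; the principal names and privilege sets are hypothetical, and no HTTP call is made.

```python
from __future__ import annotations


def permission_changes(
    desired: dict[str, set[str]],
    current: dict[str, set[str]],
) -> list[dict]:
    """Compute per-principal privilege changes from current to desired.

    Both arguments map a principal name to its set of privileges on one
    securable. Principals without differences are omitted from the result.
    """
    changes: list[dict] = []
    for principal in sorted(set(desired) | set(current)):
        want = desired.get(principal, set())
        have = current.get(principal, set())
        add = sorted(want - have)     # privileges that are missing
        remove = sorted(have - want)  # privileges that are no longer wanted
        if not add and not remove:
            continue
        change: dict = {"principal": principal}
        if add:
            change["add"] = add
        if remove:
            change["remove"] = remove
        changes.append(change)
    return changes


if __name__ == "__main__":
    # Hypothetical desired state, e.g. rendered by a pipeline job.
    desired = {
        "app1_prd_readers": {"USE_CATALOG", "SELECT"},
        "app1_prd_runner": {"ALL_PRIVILEGES"},
    }
    # Hypothetical current state, e.g. read back from the workspace.
    current = {
        "app1_prd_readers": {"USE_CATALOG", "MODIFY"},
        "old_group": {"SELECT"},
    }
    print({"changes": permission_changes(desired, current)})
```

With Terraform, this comparison is what the state file and plan step provide out of the box. With direct API calls, the pipeline has to read the current grants and compute the delta itself on every run.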