Slide 1

Slide 1 text

Building a Lakehouse Platform on Azure with Databricks
Hansjörg Wingeier, IT Architect, die Mobiliar
Mathias Herzog, Senior Cloud Consultant, Peak Scale

Slide 2

Slide 2 text

Name: Hansjörg Wingeier
Job: IT Architect
Company: die Mobiliar
Team: Bricks @ Die Mobiliar
Contact: https://www.linkedin.com/in/hansjoerg-wingeier/
Fun Stuff: Krav Maga, Reading
Tags: Infra: Azure, Databricks; SW: Python, Java, SQL, ML, RL, Data Processing; IAC: Terraform

Slide 3

Slide 3 text

Name: Mathias Herzog
Job: Infrastructure Artist
Company: peakscale.cloud
Team: Bricks @ Die Mobiliar
Contact: linkedin.com/in/mathias-herzog-888a6788
Fun Stuff: Hiking, Biking, Skiing, doing fun stuff
Tags: Infra: Azure, K8s, Linux, Databricks; SW: Python, Go, JS; IAC: Terraform, Ansible

Slide 4

Slide 4 text

Vision
23.05.2023

Slide 5

Slide 5 text

Vision - The Journey from EDWH to Data Products & Data Mesh
Delivering data-driven value at scale
Diagram labels: Governance, Data Mesh; sourcing, managing, accessing
• How to work with data?
• How to produce data?
• How to provide data?

Slide 6

Slide 6 text

Agenda
Part I Context
• Databricks Lakehouse
• Mobiliar Cloud Architecture
Part II Hadron Infrastructure
• Components & Deployment
• Connecting Components
• Automate Deployment
Part III Final Remarks
• Conclusion
• Outlook

Slide 7

Slide 7 text

Part I Context

Slide 8

Slide 8 text

Databricks Lakehouse

Slide 9

Slide 9 text

Databricks Lakehouse Platform
https://www.databricks.com/product/data-lakehouse
Diagram labels: Unity Catalog (Metastore, Users, SPs, Groups), Storage, Workspaces

Slide 10

Slide 10 text

Unity Catalog Metastore
Diagram labels:
• Unity Catalog: Metastore, Users, SPs, Groups
• Storage: DefaultStorage, ExternalLocation, ExternalLocation
• Metastore: three Catalogs, five Schemas
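The hierarchy on this slide (a metastore holding catalogs, catalogs holding schemas, and storage attached through external locations) can be sketched as the SQL statements one might issue against Unity Catalog. This is a minimal sketch, not the deployment used at Die Mobiliar: the catalog, schema, external location, credential, and storage URL names are hypothetical placeholders, and the statement syntax should be checked against the current Databricks SQL reference.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """A Unity Catalog catalog together with the schemas it contains."""
    name: str
    schemas: list[str] = field(default_factory=list)


def external_location_sql(name: str, url: str, credential: str) -> str:
    """Statement registering a storage path as an external location."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' WITH (STORAGE CREDENTIAL {credential})"
    )


def catalog_sql(catalog: Catalog) -> list[str]:
    """Statements creating a catalog first, then each of its schemas."""
    statements = [f"CREATE CATALOG IF NOT EXISTS {catalog.name}"]
    for schema in catalog.schemas:
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog.name}.{schema}")
    return statements


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    print(external_location_sql(
        "el_app1_prd",
        "abfss://data@app1prd.dfs.core.windows.net/",
        "mobi_credential",
    ))
    for statement in catalog_sql(Catalog("app1_prd", ["raw", "curated"])):
        print(statement)
```

Generating the statements as strings keeps the sketch runnable without a workspace; in practice they would be executed from a notebook or a SQL warehouse.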

Slide 11

Slide 11 text

(Azure) Databricks Workspace
Diagram labels (Workspaces):
• Operate
• Develop
• Spark infrastructure, as needed
• Batches and Streams
• ML
• SQL / Dashboards
• Workflows

Slide 12

Slide 12 text

Mobiliar Cloud Architecture

Slide 13

Slide 13 text

Development process
The development process at Die Mobiliar follows DevOps principles, specifically those of Continuous Delivery. It defines the following steps, which are carried out for every new feature or bug fix:

Slide 14

Slide 14 text

Development process
Continuous Integration (CI): automatically builds, tests, and integrates code changes in a repository.
Continuous Deployment (CD): automatically deploys code changes directly to customers. Every commit is a potential release.

Slide 15

Slide 15 text

Mobiliar Cloud Architecture - CI/CD Pipeline
Project structure: Domain X, App 1 Project; Domain Y, App 2 Project
Deploy infrastructure: App1-preprod-runner, App1-prod-runner, App2-preprod-runner, App2-prod-runner
Stages: PreProd, Prod (App 1, App 2)
IIQ/IAM (on-premise), Azure AD
Unity Catalog: Mobi-Metastore, Users, SPs, Groups

Slide 16

Slide 16 text

Part II Hadron Infrastructure

Slide 17

Slide 17 text

Components & Deployment

Slide 18

Slide 18 text

Main components of the Mobiliar Lakehouse
• DBx4DEV (Databricks Developer Workplace): workplace for Data Engineers to develop dbxservices. One single instance for all Data Engineers.
• DBxDSW (Databricks Data Scientist Workplace): workplace for Data Scientists to work with production data. One instance per use case.
• DBxRT (Databricks Runtime): Databricks workspace to which workflows are deployed. One instance for every application and stage.
• Storage Accounts: the storage accounts used by Unity Catalog to store the data. One instance for every application and stage.
• Databricks Unity Catalog: centralized governance of, and access to, decentralized data.
Diagram labels: Unity Catalog (Metastore, Users, SPs, Groups); Storage; Workspaces: DBx4Dev, DSW UC1 to DSW UC6, RT APP1 to RT APP3
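The instance rules on this slide (one shared developer workspace, one data-science workspace per use case, one runtime workspace and one storage account per application and stage) can be expressed as a small planning function. This is only a sketch of the counting logic: the function name is hypothetical, the workspace labels follow the diagram ("DSW UC1", "RT APP1 Pre"), and the storage naming is an assumption made for illustration.

```python
def plan_instances(apps, stages, use_cases):
    """Return the instances implied by the per-component rules on the slide."""
    app_stage_pairs = [(app, stage) for app in apps for stage in stages]
    return {
        # One single instance shared by all data engineers.
        "DBx4Dev": ["DBx4Dev"],
        # One data-science workspace per use case.
        "DBxDSW": [f"DSW {use_case}" for use_case in use_cases],
        # One runtime workspace per application and stage.
        "DBxRT": [f"RT {app} {stage}" for app, stage in app_stage_pairs],
        # One storage account per application and stage (naming is hypothetical).
        "Storage": [f"Storage {app} {stage}" for app, stage in app_stage_pairs],
    }


if __name__ == "__main__":
    plan = plan_instances(["APP1", "APP2"], ["Pre", "Prd"], ["UC1", "UC2"])
    for component, instances in plan.items():
        print(component, len(instances), instances)
```

The number of runtime workspaces and storage accounts grows with applications times stages, while the developer workspace stays at one, which is worth keeping in mind when onboarding new teams.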

Slide 19

Slide 19 text

Deployment of main components
Stages: PreProd, Prod
IIQ/IAM (on-premise), Azure AD
Unity Catalog: Mobi-Metastore, Users, SPs, Groups
Storages (HAD, App 1, App 2): Preprod Data, Prod Data, Sample Data
Workspaces (HAD, App 1, App 2): DBx4Dev; RT APP1 Pre, RT APP1 Prd, RT APP2 Pre, RT APP2 Prd; DSW UC1 to DSW UC5
How to ensure the isolation?

Slide 20

Slide 20 text

Connecting Components

Slide 21

Slide 21 text

Databricks Layer
Stages: PreProd, Prod
IIQ/IAM (on-premise), Azure AD
Unity Catalog: Mobi-Metastore, Users, SPs, Groups
Storages (HAD, App 1, App 2): Preprod Data, Prod Data, Sample Data
Workspaces (HAD, App 1, App 2): DBx4Dev; RT APP1 Pre, RT APP1 Prd, RT APP2 Pre, RT APP2 Prd; DSW UC1 to DSW UC5
External locations: EL App1 Pre, EL App2 Pre, EL App1 Prd, EL App2 Prd, EL Sample
Catalogs: Cat App1 Pre, Cat App2 Pre, Cat App1 Prd, Cat App2 Prd, Cat Sample
Relations in the diagram: five "grants" arrows, five "uses" arrows

Slide 22

Slide 22 text

Network Layer
Stages: PreProd, Prod
IIQ/IAM (on-premise), Azure AD
Unity Catalog: Mobi-Metastore, Users, SPs, Groups
Storages (HAD, App 1, App 2): Preprod Data, Prod Data, Sample Data
Workspaces (HAD, App 1, App 2): DBx4Dev; RT APP1 Pre, RT APP1 Prd, RT APP2 Pre, RT APP2 Prd; DSW UC1 to DSW UC5
External locations: EL App1 Pre, EL App2 Pre, EL App1 Prd, EL App2 Prd, EL Sample
Catalogs: Cat App1 Pre, Cat App2 Pre, Cat App1 Prd, Cat App2 Prd, Cat Sample

Slide 23

Slide 23 text

Automate Deployment

Slide 24

Slide 24 text

Databricks / Unity Catalog automation - grants
GitLab structure: Domain IT, HAD Project; Domain X, App 1 Project; Domain X, App 1 Project
Deploy infrastructure: HAD-preprod-runner, HAD-prod-runner, App1-preprod-runner, App1-prod-runner, App2-preprod-runner, App2-prod-runner
Stages: PreProd, Prod (HAD, App 1, App 2)
Unity Catalog: Mobi-Metastore, Users, SPs, Groups
Workspace: RT APP1 Prd (arrow label: deploys)

Slide 25

Slide 25 text

Deployment of Storage, Runtimes, & DSW
Stage: Prod
IIQ/IAM (on-premise), Azure AD
Unity Catalog: Mobi-Metastore, Users, SPs, Groups
Storages and Workspaces (HAD, App 1, App 2): RT APP1 Prd, RT APP2 Prd, DSW UC1
Catalogs: Cat App1 Prd uses EL App1 Prd; Cat App2 Prd uses EL App2 Prd
Project structure:
• Domain 1
• HAD: Order-infrastructure-job
• App 1: CreateStorage-iac, CreateRuntime-iac, CreateDSW-iac
DSW UC1: grant access to DSWUC1User
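The project structure suggests that an order placed through the Order-infrastructure-job results in one or more of the IaC jobs in the application project. A minimal sketch of that dispatch is shown below. The job names are taken from the slide; the component keys, the function name, and the fixed order (storage before runtime before DSW) are assumptions made for illustration.

```python
# Ordered component -> IaC job in the application project.
# Job names come from the slide; the keys and their order are assumptions.
ORDER_JOBS = {
    "storage": "CreateStorage-iac",
    "runtime": "CreateRuntime-iac",
    "dsw": "CreateDSW-iac",
}


def jobs_for_order(components):
    """Return the IaC jobs to trigger for an order, in a fixed order."""
    unknown = sorted(set(components) - set(ORDER_JOBS))
    if unknown:
        raise ValueError(f"unknown components: {unknown}")
    # Follow the order of ORDER_JOBS regardless of how the order lists them.
    return [job for key, job in ORDER_JOBS.items() if key in components]


if __name__ == "__main__":
    print(jobs_for_order(["dsw", "storage"]))
```

Rejecting unknown components up front keeps a malformed order from triggering a partial deployment.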

Slide 26

Slide 26 text

Deployment of Storage, Runtimes, & DSW
The order process pipeline connecting things

Slide 27

Slide 27 text

Terraform or REST API

Terraform, advantages:
• Declarative, IaC pattern
• Domain-specific language (DSL)
• Widely accepted at Die Mobiliar
• Manages state (housekeeping is easy)
• A Terraform stack is provided by Die Mobiliar

Terraform, drawbacks:
• Hard to manage n:m relations (complex logic)
• Debugging is cumbersome
• Hard to use imperatively (loops, etc.)
• Provider initialization per subscription and workspace
• Not all Databricks API functionality is available

REST API, advantages:
• Full-blown programming language
• Code assistance
• Debugging is easy
• Easy to test

REST API, drawbacks:
• No state management
• More work
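To make the REST API side of the comparison concrete, the sketch below builds (but does not send) a request that lists Unity Catalog catalogs for each of several workspaces, using only the Python standard library. It is an illustration under assumptions: the workspace hosts and token are hypothetical placeholders, and the endpoint path `/api/2.1/unity-catalog/catalogs` should be verified against the current Databricks REST API reference.

```python
import urllib.request


def list_catalogs_request(host: str, token: str) -> urllib.request.Request:
    """Build, without sending, a GET request for the catalogs endpoint."""
    url = f"{host.rstrip('/')}/api/2.1/unity-catalog/catalogs"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def requests_for_workspaces(hosts, token):
    """One request per workspace host."""
    # A plain loop over workspaces is easy in a general-purpose language,
    # but nothing here remembers what was created earlier: any state
    # tracking is left to the caller.
    return [list_catalogs_request(host, token) for host in hosts]


if __name__ == "__main__":
    # Hypothetical hosts and token, for illustration only.
    hosts = [
        "https://adb-1111111111111111.1.azuredatabricks.net",
        "https://adb-2222222222222222.2.azuredatabricks.net",
    ]
    for request in requests_for_workspaces(hosts, "<token>"):
        print(request.get_method(), request.full_url)
```

Because the requests are ordinary objects, they can be inspected in unit tests without network access, which matches the "easy to test" point; the missing state management is the price.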

Slide 28

Slide 28 text

Part III Final Remarks

Slide 29

Slide 29 text

Conclusion
• The defined infrastructure around the Databricks Lakehouse components is working
• Automation was possible
• Everything is still under development; we faced significant feature changes (e.g., multiple Metastores in the same region used to be possible, now only one Metastore per region is supported)
• It is easy to let costs explode

Slide 30

Slide 30 text

Our first Data Products in production

Slide 31

Slide 31 text

Outlook
• Monitor and control costs / become cost-aware / Capex vs Opex
• Gain experience / how to develop efficiently
• Onboard new teams
