
AzureBootcamp2023: Fully automated & cloud-native data platform by Tim Giger

For a large insurer in Switzerland, we implemented a fully automated, cloud-native data platform using Azure and Bicep. Among other things, Azure DevOps, Azure Data Lake Storage, Azure Synapse and Databricks were used. Business stakeholders can also set up a 3-tier analytics workspace based on Azure Synapse and Azure DevOps at the click of a button, then start exploring the data lake and implementing use cases. The data lake is loaded in real time via Kafka/Databricks and batch-oriented via Azure Data Factory, in total over 100 different data pipelines. These pipelines feed the core data warehouse, implemented with Azure Synapse.
🙂 TIM GIGER ⚡️ Senior Data & Analytics Consultant @ Swisscom

Transcript

  1. To do's before the Apéro:

     1. Context
     2. Platform Deployment
     3. Data Ingestion
     4. Data Access & Permissions
     5. Sandbox Deployment
  2. (Some) Requirements

     • Multicloud environment with complete network isolation & conditional access
     • Supports (near-)real-time data loads
     • At least a 3-tier platform (Development, Integration, Production)
     • 10+ source systems for data loads, Kafka as the main data provider
     • No Microsoft-managed keys, only CMKs & additional storage encryption
     • Need for «Sandbox environments» for prototyping and tech-savvy business users
     • Migration target for the current data warehouse (1,000+ tables)
     • Regulated company (makes backup & security a bit more interesting)
  3. Context: What is meant by a «Sandbox»?

     • Sandbox DEV / Sandbox INT / Sandbox PRD, plus an Azure DevOps project
     • Pre-configured & templated CI/CD pipelines
     • Also includes permissions based on AD, network integration & linked services (data lake)
     • Pre-configured user credential passthrough
  5. Architecture: Simplified Platform Overview

     Platform resource group (rg-platform): Data Lake Storage with a «Landing & Cleansed» container and a «Curated» container. Event Grid events trigger the ADF pipelines «Integrate», «Clean» and «Curate» between the zones; Databricks handles the integration. A DevOps agent, integration metadata, logs & technical metadata, and Synapse complete the platform. Sandboxes A, B, … live in their own resource groups (rg-sandbox-a, rg-sandbox-b, rg-sandbox-…).
  6. Development Approach & Process: Start with Bicep IaC scripts (see the sketch below for steps 2 and 7)

     1. Manual creation & configuration of all needed services
     2. Get the template (ARM) and decompile it to Bicep
     3. Run into 1'001 errors…
     4. Adjust and fix the scripts according to the requirements
     5. Try to exclude configurations with policies (e.g. Private DNS zones)
     6. Modularize the scripts into reasonable Bicep modules
     7. Deploy it for a first test with az cli
     8. Run again into 1'001 errors…
     9. Fix the errors
     10. Parametrize the Bicep scripts according to the environment
     11. Develop an Azure CD pipeline
     12. Review the deployment
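     A minimal az cli sketch of steps 2 and 7; the resource group, file and parameter names are placeholders, not from the talk:

     ```sh
     # Step 2: export the manually created resources as an ARM template, then decompile to Bicep
     az group export --name rg-platform > template.json
     az bicep decompile --file template.json        # produces template.bicep

     # Step 7: first test deployment of the fixed and modularized scripts
     az deployment group create \
       --resource-group rg-platform \
       --template-file main.bicep \
       --parameters environment=dev
     ```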
  7. Imperative vs. Declarative IaC

     • Chicken/egg problem (the service needs to be in place for certain configurations)
     • Also depends on the desired Bicep modularization
     • Complex configurations need "treatment" after deployment, e.g. Databricks DBFS root encryption
     • Complexity in Bicep vs. complex deployment dependencies
     • The imperative approach gives more control but is less «handy»
     (A pipeline sketch of this pattern follows below.)
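     One way this mixed pattern can look in an Azure DevOps pipeline; the service connection and the post-deployment script are illustrative assumptions, not the talk's actual pipeline:

     ```yaml
     # Declarative deploy first, then an imperative "treatment" step
     steps:
       - task: AzureCLI@2
         displayName: Declarative part (Bicep)
         inputs:
           azureSubscription: sc-platform-dev        # placeholder service connection
           scriptType: bash
           scriptLocation: inlineScript
           inlineScript: |
             az deployment group create \
               --resource-group rg-platform \
               --template-file main.bicep

       - task: AzureCLI@2
         displayName: Imperative part (post-deployment configuration)
         inputs:
           azureSubscription: sc-platform-dev
           scriptType: bash
           scriptLocation: inlineScript
           inlineScript: |
             # Settings that only exist once the service is up,
             # e.g. enabling CMK encryption on the Databricks DBFS root
             ./scripts/configure-dbfs-encryption.sh   # hypothetical script
     ```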
  8. Parametrization is Key

     Make use of the several parametrization options → have a concept in mind of where to use what, and group them into deployment tasks! (A Bicep sketch follows.)

     • Directly in Bicep (variables & parameters)
     • Azure DevOps variables
     • Token replacement (DevOps extension)
     • Bicep functions (e.g. subscription lookup)
     • Imperative scripts
     • Pipeline variables
     • JSON configuration files
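     A minimal Bicep sketch of two of these options; the names are illustrative:

     ```bicep
     // Parameter injected from an Azure DevOps pipeline variable
     @description('Target environment, e.g. dev / int / prd')
     param environment string

     // Bicep variable derived from the parameter
     var storageAccountName = 'landingdatalake${environment}'

     // Bicep function instead of a hard-coded value (subscription lookup)
     output deployedTo string = subscription().subscriptionId
     ```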
  9. Deployment: What does the deployment pipeline look like?

     Complete mode: "In complete mode, Resource Manager deletes resources that exist in the resource group but aren't specified in the template."
     https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-modes
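     Complete mode is selected with the --mode flag; resource group and template names below are placeholders:

     ```sh
     # Anything in rg-platform that is not in main.bicep gets deleted - use with care
     az deployment group create \
       --resource-group rg-platform \
       --template-file main.bicep \
       --mode Complete
     ```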
  10. Data Architecture & Flow

     A combination of the Avro, Parquet and Delta Lake file formats should be considered: the right format at the right place.

     Sources & ingestion: source systems in the on-premise and outsourced data centers reach the Azure cloud data platform either as Kafka topics (Kafka being the main provider) or via a self-hosted integration runtime (SH-IR) for batch loads.

     Data lake zones & path convention (the numbers, e.g. 69, 53, 47, 71, are source-system IDs):
     • Inbound: /INBOUND/{system}/Kafka/{dataset} or /INBOUND/{system}/IR/{dataset}
     • Cleansed: /CLEANSED/{system}/PII/{dataset} and /CLEANSED/{system}/NoPII/{dataset}
     • Curated: /CURATED/{system}/PII/{dataset} and /CURATED/{system}/NoPII/{dataset}
     • Core: Synapse Main, plus Sandboxes A, B, … and the integration metadata
  11. Parametrized Data Streaming: General Idea (a PySpark sketch follows)

     • Get topic metadata & schema configuration from the repo
     • Configure the stream according to the configuration, stored in a SQL DB
     • «Normalize» incoming data objects according to the metadata
     • Define cleaning and curation based on the metadata (e.g. PII split)
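     A minimal PySpark Structured Streaming sketch of this idea; the metadata table, its columns and the JDBC connection are assumptions, not the talk's actual implementation:

     ```python
     from pyspark.sql import SparkSession
     from pyspark.sql.functions import col, from_json

     spark = SparkSession.builder.getOrCreate()

     # Topic metadata & schema configuration, stored in a SQL DB (hypothetical table/columns)
     topics = (spark.read.format("jdbc")
               .option("url", "jdbc:sqlserver://meta-sql;databaseName=metadata")
               .option("dbtable", "dbo.topic_config")
               .load()
               .collect())

     for t in topics:
         # One parametrized stream per configured topic
         raw = (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", t["brokers"])
                .option("subscribe", t["topic"])
                .load())

         # «Normalize» the incoming objects using the schema from the metadata (a DDL string)
         normalized = (raw
                       .select(from_json(col("value").cast("string"), t["json_schema"]).alias("o"))
                       .select("o.*"))

         # Land the stream at the path derived from the metadata, e.g. /INBOUND/69/Kafka/Dataset1
         (normalized.writeStream
          .format("delta")
          .option("checkpointLocation", t["checkpoint_path"])
          .start(t["landing_path"]))
     ```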
  12. Exposure to Synapse (Serverless Pools)

     Daily trigger → Azure Data Factory «Curate» pipeline: get metadata → drop external tables → create external tables based on the metadata, over the data lake's curated layer, exposed in Synapse Main. (A T-SQL sketch follows.)
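     What the per-table step could look like in serverless T-SQL; schema, columns, data source and file format names are placeholders:

     ```sql
     -- Drop and recreate one external table from metadata (illustrative names)
     IF OBJECT_ID('curated.Dataset1') IS NOT NULL
         DROP EXTERNAL TABLE curated.Dataset1;

     CREATE EXTERNAL TABLE curated.Dataset1 (
         id        INT,
         loaded_at DATETIME2
     )
     WITH (
         LOCATION    = 'CURATED/69/NoPII/Dataset1/',
         DATA_SOURCE = curated_datalake,   -- assumed pre-created external data source
         FILE_FORMAT = parquet_format      -- assumed pre-created external file format
     );
     ```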
  13. Data Lake Access

     Cleansed layer:
     • Split PII information based on metadata
     • Harmonize the file format
     • (Pseudo-)normalize the data objects into tables
     • Assign ACLs based on Active Directory groups; for each dataset, 2 Active Directory roles: ReadAll and ReadNoPIIonly
     • Paths follow /CLEANSED/{System}/PII/{dataset} and /CLEANSED/{System}/NoPII/{dataset}, with the same pattern under /CURATED/…
     • Sandboxes access the lake via credential passthrough
     (An ACL sketch follows.)
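     One plausible way to set such an ACL with az cli; account, filesystem, path and group object ID are placeholders:

     ```sh
     # Grant the ReadNoPIIonly AD group read/execute on one NoPII dataset folder
     az storage fs access update-recursive \
       --account-name cleanseddatalakedev \
       --file-system cleansed \
       --path "69/NoPII/Dataset2" \
       --acl "group:<objectId-of-SG_AAD_READNOPII>:r-x" \
       --auth-mode login
     ```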
  14. Data Lake Access

     1. Define the data lake structure & mapping in JSON format (a mapping sketch follows)
     2. PR to the Azure Repo
     3. Run the CD pipeline to set the ACL mapping
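     A hypothetical shape for such a mapping file; the keys and group names are illustrative, the talk does not show the actual schema:

     ```json
     {
       "datasets": [
         {
           "path": "/CLEANSED/69/NoPII/Dataset2",
           "readAll": "SG_AAD_DS2_READALL",
           "readNoPIIonly": "SG_AAD_DS2_READNOPII"
         },
         {
           "path": "/CLEANSED/69/PII/Dataset2",
           "readAll": "SG_AAD_DS2_READALL"
         }
       ]
     }
     ```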
  15. User & Service Permissions

     «services.csv»:

     scope, assignee, assignee_type, role, environments
     Microsoft.Storage/storageAccounts/landingdatalake#{environment}#, adf-integrate-#{environment}#, managed_identity, Storage Blob Data Contributor, all
     Microsoft.Storage/storageAccounts/landingdatalake#{environment}#, adf-integrate-#{environment}#, managed_identity, Storage Blob Data Contributor, dev
     Microsoft.Storage/storageAccounts/curateddatalake#{environment}#, adf-cleanse-#{environment}#, managed_identity, Storage Blob Data Contributor, all
     Microsoft.EventGrid/topics/eventgrid-integrate-activites#{environment}#, adf-integrate-#{environment}#, managed_identity, EventGrid Data Sender, all
     …

     «users.csv»:

     scope, assignee, assignee_type, role, environments
     /subscriptions/#{subscriptionId}#/resourceGroups/RG-#{environment}#, SG_AAD_DATA_ENGINEER, aad_group, Contributor, dev
     …

     Use PowerShell and az cli to deploy/modify the permissions. (A sketch follows.)
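     A minimal PowerShell sketch of applying such a CSV; it assumes the #{…}# tokens are already replaced and the header names are trimmed:

     ```powershell
     # Apply each row of the permissions CSV as an Azure role assignment
     Import-Csv -Path "users.csv" | ForEach-Object {
         az role assignment create `
             --scope $_.scope `
             --assignee $_.assignee `
             --role $_.role
     }
     ```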
  16. Sandbox Automation: First View

     Goals:
     • Create a DevOps project with an Azure Repo and CI/CD pipelines for the Synapse deployment
     • Add defined users and groups to the DevOps project and set permissions
     • Create a 3-tier Synapse environment incl. network integration and Azure Repo linkage
     • Configure Synapse linked services and access rights
  17. Sandbox Automation: Final Approach

     • YAML config to set the configurations
     • Custom DevOps wrapper to automate DevOps
     • Bicep scripts to automate the Synapse deployment
     • Template repository as a «blueprint» repo to clone into the new project, incl. Synapse CI/CD pipelines
     • CD pipeline to deploy a sandbox
  18. Sandbox Configuration

     For a new sandbox we just need:
     • A name
     • A removal date (if applicable)
     • Tags
     • Users in the form of AD groups or usernames, for owners and for contributors

     Deployment takes around 20 minutes. (A config sketch follows.)
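     A hypothetical YAML config matching these parameters; keys and values are illustrative:

     ```yaml
     name: sandbox-claims-analytics
     removalDate: 2023-12-31          # if applicable
     tags:
       costCenter: "4711"
     owners:                          # AD groups or usernames
       - SG_AAD_SANDBOX_CLAIMS_OWNER
     contributors:
       - SG_AAD_SANDBOX_CLAIMS_CONTRIB
       - tim.giger@example.com
     ```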
  19. User's Perspective

     • Define roughly 5 parameters for the sandbox configuration
     • Wait until the configuration is deployed (deployment takes around 20')
     • Order the needed data lake access roles (Active Directory groups)
     • The respective data owner approves the data lake access role
     • The user orders the Active Directory group(s) and assigns the desired users
     • The user can then access the data lake through a 3-tier Synapse environment and model/develop/prototype new use cases
  20. Learnings

     Infrastructure & Deployment:
     • Bicep modularization: don't be too "dogmatic"
     • Start with the GUI, then go to code (IaC)
     • Use policies for certain deployments (e.g. private DNS zones)
     • Interoperability (network, AD, engineers) is key
     • Use a "Platform Deployment" environment to not break the DEV environment
     • Think about the network before you start coding
     • Check the VS Marketplace for useful DevOps extensions
     • Ensure life-cycle & maintenance of the IaC scripts (e.g. updating to new VM families)

     Data & Data Pipelines:
     • Set up governance to stay informed about changes in streamed data
     • Build the streaming pipelines for failure; they will fail at a certain point, surely
     • Think of how to reload the data, otherwise changes might be lost
     • You always need more metadata 🙂