
AzureBootcamp2023: Fully automated & cloud-native data platform by Tim Giger

For a large insurer in Switzerland, we implemented a fully automated, cloud-native data platform using Azure and Bicep. Among other things, Azure DevOps, Azure Data Lake Storage, Azure Synapse and Databricks were used. Business stakeholders can also set up a 3-tier analytics workspace based on Azure Synapse and Azure DevOps at the click of a button, then start exploring the data lake and implementing use cases. The data lake is loaded in real time via Kafka/Databricks and batch-oriented via Azure Data Factory, in total over 100 different data pipelines. These pipelines feed the core data warehouse, implemented with Azure Synapse.
🙂 TIM GIGER ⚡️ Senior Data & Analytics Consultant @ Swisscom

Transcript

  1. To do's before the Apéro:

     1. Context
     2. Platform Deployment
     3. Data Ingestion
     4. Data Access & Permissions
     5. Sandbox Deployment
  2. (Some) Requirements

     • Multicloud environment with complete network isolation & conditional access
     • Supports (near-)real-time data loads
     • At least a 3-tier platform (Development, Integration, Production)
     • 10+ source systems for data loads, Kafka as the main data provider
     • No Microsoft-managed keys, only CMKs & additional storage encryption
     • Need for «Sandbox environments» for prototyping and tech-savvy business users
     • Migration target for the current data warehouse (1,000+ tables)
     • Regulated company (makes backup & security a bit more interesting)
  3. Context: What is meant by a «Sandbox»?

     • Sandbox DEV / Sandbox INT / Sandbox PRD, plus an Azure DevOps project
     • Pre-configured & templated CI/CD pipelines
     • Also includes permissions based on AD, network integration & linked services (data lake)
     • Pre-configured user credential passthrough
  5. Architecture: Simplified Platform Overview

     Platform resource group (rg-platform): Data Lake Storage with a «Landing & Cleansed» container and a «Curated» container. Event Grid events trigger the ADF pipelines «Integrate», «Clean» and «Curate» between the zones; Databricks handles the integration. A DevOps agent, integration metadata, logs & technical metadata, and Synapse complete the platform. Sandboxes A, B, … live in their own resource groups (rg-sandbox-a, rg-sandbox-b, rg-sandbox-…).
  6. Development Approach & Process: Start with Bicep IaC scripts (see the sketch below for steps 2 and 7)

     1. Manual creation & configuration of all needed services
     2. Get the template (ARM) and decompile it to Bicep
     3. Run into 1'001 errors…
     4. Adjust and fix the scripts according to the requirements
     5. Try to exclude configurations with policies (e.g. Private DNS zones)
     6. Modularize the scripts into reasonable Bicep modules
     7. Deploy it for a first test with az cli
     8. Run again into 1'001 errors…
     9. Fix the errors
     10. Parametrize the Bicep scripts according to the environment
     11. Develop an Azure CD pipeline
     12. Review the deployment
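     A minimal az cli sketch of steps 2 and 7; the resource group, file and parameter names are placeholders, not from the talk:

     ```sh
     # Step 2: export the manually created resources as an ARM template, then decompile to Bicep
     az group export --name rg-platform > template.json
     az bicep decompile --file template.json        # produces template.bicep

     # Step 7: first test deployment of the fixed and modularized scripts
     az deployment group create \
       --resource-group rg-platform \
       --template-file main.bicep \
       --parameters environment=dev
     ```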
  7. Imperative vs. Declarative IaC

     • Chicken/egg problem (the service needs to be in place for certain configurations)
     • Also depends on the desired Bicep modularization
     • Complex configurations need "treatment" after deployment, e.g. Databricks DBFS root encryption
     • Complexity in Bicep vs. complex deployment dependencies
     • The imperative approach gives more control but is less «handy»
     (A pipeline sketch of this pattern follows below.)
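     One way this mixed pattern can look in an Azure DevOps pipeline; the service connection and the post-deployment script are illustrative assumptions, not the talk's actual pipeline:

     ```yaml
     # Declarative deploy first, then an imperative "treatment" step
     steps:
       - task: AzureCLI@2
         displayName: Declarative part (Bicep)
         inputs:
           azureSubscription: sc-platform-dev        # placeholder service connection
           scriptType: bash
           scriptLocation: inlineScript
           inlineScript: |
             az deployment group create \
               --resource-group rg-platform \
               --template-file main.bicep

       - task: AzureCLI@2
         displayName: Imperative part (post-deployment configuration)
         inputs:
           azureSubscription: sc-platform-dev
           scriptType: bash
           scriptLocation: inlineScript
           inlineScript: |
             # Settings that only exist once the service is up,
             # e.g. enabling CMK encryption on the Databricks DBFS root
             ./scripts/configure-dbfs-encryption.sh   # hypothetical script
     ```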
  8. Parametrization is Key

     Make use of the several parametrization options → have a concept in mind of where to use what, and group them into deployment tasks! (A Bicep sketch follows.)

     • Directly in Bicep (variables & parameters)
     • Azure DevOps variables
     • Token replacement (DevOps extension)
     • Bicep functions (e.g. subscription lookup)
     • Imperative scripts
     • Pipeline variables
     • JSON configuration files
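     A minimal Bicep sketch of two of these options; the names are illustrative:

     ```bicep
     // Parameter injected from an Azure DevOps pipeline variable
     @description('Target environment, e.g. dev / int / prd')
     param environment string

     // Bicep variable derived from the parameter
     var storageAccountName = 'landingdatalake${environment}'

     // Bicep function instead of a hard-coded value (subscription lookup)
     output deployedTo string = subscription().subscriptionId
     ```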
  9. Deployment: What does the deployment pipeline look like?

     Complete mode: "In complete mode, Resource Manager deletes resources that exist in the resource group but aren't specified in the template."
     https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/deployment-modes
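     Complete mode is selected with the --mode flag; resource group and template names below are placeholders:

     ```sh
     # Anything in rg-platform that is not in main.bicep gets deleted - use with care
     az deployment group create \
       --resource-group rg-platform \
       --template-file main.bicep \
       --mode Complete
     ```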
  10. Data Architecture & Flow

     A combination of the Avro, Parquet and Delta Lake file formats should be considered: the right format at the right place.

     Sources & ingestion: source systems in the on-premise and outsourced data centers reach the Azure cloud data platform either as Kafka topics (Kafka being the main provider) or via a self-hosted integration runtime (SH-IR) for batch loads.

     Data lake zones & path convention (the numbers, e.g. 69, 53, 47, 71, are source-system IDs):
     • Inbound: /INBOUND/{system}/Kafka/{dataset} or /INBOUND/{system}/IR/{dataset}
     • Cleansed: /CLEANSED/{system}/PII/{dataset} and /CLEANSED/{system}/NoPII/{dataset}
     • Curated: /CURATED/{system}/PII/{dataset} and /CURATED/{system}/NoPII/{dataset}
     • Core: Synapse Main, plus Sandboxes A, B, … and the integration metadata
  11. Parametrized Data Streaming: General Idea (a PySpark sketch follows)

     • Get topic metadata & schema configuration from the repo
     • Configure the stream according to the configuration, stored in a SQL DB
     • «Normalize» incoming data objects according to the metadata
     • Define cleaning and curation based on the metadata (e.g. PII split)
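     A minimal PySpark Structured Streaming sketch of this idea; the metadata table, its columns and the JDBC connection are assumptions, not the talk's actual implementation:

     ```python
     from pyspark.sql import SparkSession
     from pyspark.sql.functions import col, from_json

     spark = SparkSession.builder.getOrCreate()

     # Topic metadata & schema configuration, stored in a SQL DB (hypothetical table/columns)
     topics = (spark.read.format("jdbc")
               .option("url", "jdbc:sqlserver://meta-sql;databaseName=metadata")
               .option("dbtable", "dbo.topic_config")
               .load()
               .collect())

     for t in topics:
         # One parametrized stream per configured topic
         raw = (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", t["brokers"])
                .option("subscribe", t["topic"])
                .load())

         # «Normalize» the incoming objects using the schema from the metadata (a DDL string)
         normalized = (raw
                       .select(from_json(col("value").cast("string"), t["json_schema"]).alias("o"))
                       .select("o.*"))

         # Land the stream at the path derived from the metadata, e.g. /INBOUND/69/Kafka/Dataset1
         (normalized.writeStream
          .format("delta")
          .option("checkpointLocation", t["checkpoint_path"])
          .start(t["landing_path"]))
     ```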
  12. Exposure to Synapse (Serverless Pools)

     Daily trigger → Azure Data Factory «Curate» pipeline: get metadata → drop external tables → create external tables based on the metadata, over the data lake's curated layer, exposed in Synapse Main. (A T-SQL sketch follows.)
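     What the per-table step could look like in serverless T-SQL; schema, columns, data source and file format names are placeholders:

     ```sql
     -- Drop and recreate one external table from metadata (illustrative names)
     IF OBJECT_ID('curated.Dataset1') IS NOT NULL
         DROP EXTERNAL TABLE curated.Dataset1;

     CREATE EXTERNAL TABLE curated.Dataset1 (
         id        INT,
         loaded_at DATETIME2
     )
     WITH (
         LOCATION    = 'CURATED/69/NoPII/Dataset1/',
         DATA_SOURCE = curated_datalake,   -- assumed pre-created external data source
         FILE_FORMAT = parquet_format      -- assumed pre-created external file format
     );
     ```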
  13. Data Lake Access

     Cleansed layer:
     • Split PII information based on metadata
     • Harmonize the file format
     • (Pseudo-)normalize the data objects into tables
     • Assign ACLs based on Active Directory groups; for each dataset, 2 Active Directory roles: ReadAll and ReadNoPIIonly
     • Paths follow /CLEANSED/{System}/PII/{dataset} and /CLEANSED/{System}/NoPII/{dataset}, with the same pattern under /CURATED/…
     • Sandboxes access the lake via credential passthrough
     (An ACL sketch follows.)
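     One plausible way to set such an ACL with az cli; account, filesystem, path and group object ID are placeholders:

     ```sh
     # Grant the ReadNoPIIonly AD group read/execute on one NoPII dataset folder
     az storage fs access update-recursive \
       --account-name cleanseddatalakedev \
       --file-system cleansed \
       --path "69/NoPII/Dataset2" \
       --acl "group:<objectId-of-SG_AAD_READNOPII>:r-x" \
       --auth-mode login
     ```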
  14. Data Lake Access

     1. Define the data lake structure & mapping in JSON format (a mapping sketch follows)
     2. PR to the Azure Repo
     3. Run the CD pipeline to set the ACL mapping
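     A hypothetical shape for such a mapping file; the keys and group names are illustrative, the talk does not show the actual schema:

     ```json
     {
       "datasets": [
         {
           "path": "/CLEANSED/69/NoPII/Dataset2",
           "readAll": "SG_AAD_DS2_READALL",
           "readNoPIIonly": "SG_AAD_DS2_READNOPII"
         },
         {
           "path": "/CLEANSED/69/PII/Dataset2",
           "readAll": "SG_AAD_DS2_READALL"
         }
       ]
     }
     ```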
  15. User & Service Permissions

     «services.csv»:

     scope, assignee, assignee_type, role, environments
     Microsoft.Storage/storageAccounts/landingdatalake#{environment}#, adf-integrate-#{environment}#, managed_identity, Storage Blob Data Contributor, all
     Microsoft.Storage/storageAccounts/landingdatalake#{environment}#, adf-integrate-#{environment}#, managed_identity, Storage Blob Data Contributor, dev
     Microsoft.Storage/storageAccounts/curateddatalake#{environment}#, adf-cleanse-#{environment}#, managed_identity, Storage Blob Data Contributor, all
     Microsoft.EventGrid/topics/eventgrid-integrate-activites#{environment}#, adf-integrate-#{environment}#, managed_identity, EventGrid Data Sender, all
     …

     «users.csv»:

     scope, assignee, assignee_type, role, environments
     /subscriptions/#{subscriptionId}#/resourceGroups/RG-#{environment}#, SG_AAD_DATA_ENGINEER, aad_group, Contributor, dev
     …

     Use PowerShell and az cli to deploy/modify the permissions. (A sketch follows.)
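     A minimal PowerShell sketch of applying such a CSV; it assumes the #{…}# tokens are already replaced and the header names are trimmed:

     ```powershell
     # Apply each row of the permissions CSV as an Azure role assignment
     Import-Csv -Path "users.csv" | ForEach-Object {
         az role assignment create `
             --scope $_.scope `
             --assignee $_.assignee `
             --role $_.role
     }
     ```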
  16. Sandbox Automation: First View

     Goals:
     • Create a DevOps project with an Azure Repo and CI/CD pipelines for the Synapse deployment
     • Add defined users and groups to the DevOps project and set permissions
     • Create a 3-tier Synapse environment incl. network integration and Azure Repo linkage
     • Configure Synapse linked services and access rights
  17. Sandbox Automation: Final Approach

     • YAML config to set the configurations
     • Custom DevOps wrapper to automate DevOps
     • Bicep scripts to automate the Synapse deployment
     • Template repository as a «blueprint» repo to clone into the new project, incl. Synapse CI/CD pipelines
     • CD pipeline to deploy a sandbox
  18. Sandbox Configuration

     For a new sandbox we just need:
     • A name
     • A removal date (if applicable)
     • Tags
     • Users in the form of AD groups or usernames, for owners and for contributors

     Deployment takes around 20 minutes. (A config sketch follows.)
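     A hypothetical YAML config matching these parameters; keys and values are illustrative:

     ```yaml
     name: sandbox-claims-analytics
     removalDate: 2023-12-31          # if applicable
     tags:
       costCenter: "4711"
     owners:                          # AD groups or usernames
       - SG_AAD_SANDBOX_CLAIMS_OWNER
     contributors:
       - SG_AAD_SANDBOX_CLAIMS_CONTRIB
       - tim.giger@example.com
     ```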
  19. User's Perspective

     • Define roughly 5 parameters for the sandbox configuration
     • Wait until the configuration is deployed (deployment takes around 20')
     • Order the needed data lake access roles (Active Directory groups)
     • The respective data owner approves the data lake access role
     • The user orders the Active Directory group(s) and assigns the desired users
     • The user can then access the data lake through a 3-tier Synapse environment and model/develop/prototype new use cases
  20. Learnings

     Infrastructure & Deployment:
     • Bicep modularization: don't be too "dogmatic"
     • Start with the GUI, then go to code (IaC)
     • Use policies for certain deployments (e.g. private DNS zones)
     • Interoperability (network, AD, engineers) is key
     • Use a "Platform Deployment" environment to not break the DEV environment
     • Think about the network before you start coding
     • Check the VS Marketplace for useful DevOps extensions
     • Ensure life-cycle & maintenance of the IaC scripts (e.g. updating to new VM families)

     Data & Data Pipelines:
     • Set up governance to stay informed about changes in streamed data
     • Build the streaming pipelines for failure; they will fail at a certain point, surely
     • Think of how to reload the data, otherwise changes might be lost
     • You always need more metadata 🙂