Slide 1

Slide 1 text

Fully Automated & Cloud-Native Data Platform
11.05.23 | Tim Giger | Swisscom Data & Analytics

Slide 2

Slide 2 text

To Do’s before Apéro:
1. Context
2. Platform Deployment
3. Data Ingestion
4. Data Access & Permissions
5. Sandbox Deployment

Slide 3

Slide 3 text


Slide 4

Slide 4 text

(Some) Requirements
• Multicloud environment with complete network isolation & conditional access
• Supports (near-) real-time data loads
• At least 3-tier platform (Development, Integration, Production)
• 10+ source systems for data loads, Kafka as the main data provider
• No Microsoft-managed keys, only CMKs & additional storage encryption
• Need for “Sandbox Environments” for prototyping and tech-savvy business users
• Migration target for the current Data Warehouse (1000+ tables)
• Regulated company (makes backup & security a bit more interesting)

Slide 5

Slide 5 text

Context
What is meant by a «Sandbox»?
• Sandbox DEV, Sandbox INT, Sandbox PRD
• Azure DevOps Project
• Pre-configured & templated CI/CD pipelines
• Also includes permissions based on AD, network integration & linked services (data lake)
• Pre-configured & user credential passthrough

Slide 6

Slide 6 text


Slide 7

Slide 7 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 8

Slide 8 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 9

Slide 9 text

Architecture
Simplified Platform Overview
• Layers: Landing, Cleansed, Curated
• EventGrid: Integrate, Clean, Curate
• ADF: Integrate, Clean, Curate
• Databricks
• Data Lake Storage Container: Landing & Cleansed
• Data Lake Storage Container: Curated
• DevOps Agent
• Integration Metadata
• Logs & Technical Metadata
• Synapse
• Sandboxes: Sandbox A, Sandbox B, Sandbox …
• Resource groups: rg-platform, rg-sandox-a, rg-sandox-b, rg-sandox-…

Slide 10

Slide 10 text

Development Approach & Process
Start with Bicep IaC scripts
1. Manual creation & configuration of all needed services
2. Get the template (ARM) and decompile it to Bicep
3. Run into 1’001 errors…
4. Adjust and fix the scripts according to the requirements
5. Try to exclude configurations with policies (e.g. Private DNS Zones)
6. Modularize the scripts into reasonable Bicep modules
7. Deploy it for a first test with az cli
8. Run again into 1’001 errors…
9. Fix the errors
10. Parametrize the Bicep scripts according to the environment
11. Develop an Azure CD Pipeline
12. Review the deployment

Slide 11

Slide 11 text

Did it work?

Slide 12

Slide 12 text

Imperative vs. Declarative IaC
• Chicken / egg problem (a service needs to be in place for certain configurations)
• Depends also on the desired Bicep modularization
• Complex configurations need “treatment” after deployment, e.g. Databricks DBFS Root Encryption
• Complexity in Bicep vs. complex deployment dependencies
• The imperative approach gives more control but is less «handy»

Slide 13

Slide 13 text

Parametrization is Key
Make use of the several parametrization options → Have a concept in mind of where to use what, and group tasks into deployment tasks!
• Directly with Bicep (Variables & Parameters)
• Azure DevOps Variables
• Token Replacement (DevOps Extension)
• Bicep Functions (e.g. Subscription Lookup)
• Imperative Scripts
• Pipeline Variables
• JSON configuration files

Slide 14

Slide 14 text

Deployment
What does the deployment pipeline look like?
“Complete mode: In complete mode, Resource Manager deletes resources that exist in the resource group but aren't specified in the template.”
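The quoted complete-mode behaviour can be illustrated with a small model. This is only a sketch of the documented semantics, not Resource Manager itself; the function `plan_deployment` and the resource names are hypothetical.

```python
def plan_deployment(existing, template, mode="incremental"):
    """Model which resources a deployment touches in a single resource group.

    existing: names of resources currently in the resource group
    template: names of resources declared in the template
    Returns (to_create_or_update, to_delete) as sets of names.
    """
    if mode not in ("incremental", "complete"):
        raise ValueError(f"unknown deployment mode: {mode}")
    existing, template = set(existing), set(template)
    # Only complete mode removes resources that the template does not declare.
    to_delete = existing - template if mode == "complete" else set()
    return template, to_delete


if __name__ == "__main__":
    current = {"adf-integrate", "manually-created-vm"}   # hypothetical names
    declared = {"adf-integrate", "landingdatalake"}
    print(plan_deployment(current, declared, mode="incremental"))
    print(plan_deployment(current, declared, mode="complete"))
```

In this model, a manually created resource survives an incremental deployment but is scheduled for deletion in complete mode, which is why the mode needs a deliberate choice per pipeline.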

Slide 15

Slide 15 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 16

Slide 16 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 17

Slide 17 text

Data Architecture & Flow
A combination of the Avro, Parquet and Delta Lake file formats should be considered: the right format at the right place.

Data Sources (On-Premise, Outsourced Data Center): Source System A, Source System B, Source System …, providing Dataset 1, Dataset 2, Dataset 3, Dataset …
Ingestion: SH-IR; Kafka with Topic 1, Topic 2, Topic 3, Topic …
Azure Cloud Data Platform: Inbound, Cleansed, Curated, Core, Sandbox A, Sandbox B, Sandbox …, Integration Metadata, Synapse Main

Inbound
/INBOUND/69/Kafka/Dataset1
/INBOUND/69/Kafka/Dataset2
/INBOUND/69/Kafka/Dataset3
/INBOUND/53/IR/Dataset4
/INBOUND/47/IR/Dataset5
/INBOUND/71/…
/INBOUND/71/…

Cleansed
/CLEANSED/69/PII/Dataset1
/CLEANSED/69/PII/Dataset2
/CLEANSED/69/NoPII/Dataset2
/CLEANSED/69/NoPII/Dataset3
/CLEANSED/53/NoPII/Dataset4
/CLEANSED/47/NoPII/Dataset5
/CLEANSED/71/…
/CLEANSED/71/…
…

Curated
/CURATED/69/PII/Dataset1
/CURATED/69/PII/Dataset2
/CURATED/69/NoPII/Dataset2
/CURATED/69/NoPII/Dataset3
/CURATED/53/NoPII/Dataset4
/CURATED/47/NoPII/Dataset5
/CURATED/71/…
/CURATED/71/…
…

Slide 18

Slide 18 text

Parametrized Data Streaming
General Idea
• Get topic metadata & schema configuration from Repo
• Configure stream according to configuration, stored in SQL DB
• «Normalize» incoming data objects according to metadata
• Define cleaning and curation based on metadata (e.g. PII split)
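A metadata-driven PII split, as named in the last bullet, could look roughly like the sketch below. The metadata layout (`keys`, `fields`, the `pii` flag) and the function `split_pii` are assumptions for illustration; the slides do not show the actual schema. Keeping the key columns in both outputs, so the two parts can be joined again, is also an assumption.

```python
def split_pii(record, metadata):
    """Split one normalized record into a PII part and a NoPII part.

    metadata["fields"] lists the fields, with an optional boolean "pii" flag.
    metadata["keys"] lists key columns that are kept in both parts.
    """
    pii_fields = {f["name"] for f in metadata["fields"] if f.get("pii")}
    keys = set(metadata["keys"])
    # PII part: key columns plus every field flagged as PII.
    pii = {k: v for k, v in record.items() if k in keys or k in pii_fields}
    # NoPII part: everything that is not flagged as PII (keys included).
    no_pii = {k: v for k, v in record.items() if k not in pii_fields}
    return pii, no_pii


if __name__ == "__main__":
    topic_metadata = {  # hypothetical topic metadata
        "keys": ["customer_id"],
        "fields": [
            {"name": "customer_id"},
            {"name": "name", "pii": True},
            {"name": "plan"},
        ],
    }
    event = {"customer_id": 1, "name": "Jane Doe", "plan": "basic"}
    print(split_pii(event, topic_metadata))
```

The two parts would then land under the PII and NoPII folders of the same dataset, so that access can be granted separately.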

Slide 19

Slide 19 text

Exposure to Synapse (Serverless Pools)
Components: Data Lake Curated Layer, Azure Data Factory Curate, Synapse Main
Daily Trigger:
• Get Metadata
• Drop external Tables
• Create external Tables based on Metadata
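The drop-and-recreate step can be sketched as generating T-SQL from table metadata. This is a minimal illustration, not the pipeline from the talk: the metadata keys, the function `external_table_ddl`, and the data source and file format names are hypothetical, and the statements are only built as strings, not executed.

```python
def external_table_ddl(table):
    """Build DROP and CREATE statements for one external table from metadata."""
    qualified = f"[{table['schema']}].[{table['name']}]"
    columns = ",\n    ".join(f"[{c['name']}] {c['type']}" for c in table["columns"])
    # Drop only if the table already exists, so the first run does not fail.
    drop = (
        f"IF OBJECT_ID('{table['schema']}.{table['name']}') IS NOT NULL "
        f"DROP EXTERNAL TABLE {qualified};"
    )
    create = (
        f"CREATE EXTERNAL TABLE {qualified} (\n    {columns}\n)\n"
        f"WITH (LOCATION = '{table['location']}', "
        f"DATA_SOURCE = [{table['data_source']}], "
        f"FILE_FORMAT = [{table['file_format']}]);"
    )
    return [drop, create]


if __name__ == "__main__":
    metadata = {  # hypothetical metadata entry for one curated dataset
        "schema": "curated",
        "name": "Dataset3",
        "columns": [{"name": "id", "type": "INT"}, {"name": "city", "type": "VARCHAR(100)"}],
        "location": "/CURATED/69/NoPII/Dataset3/",
        "data_source": "curated_lake",
        "file_format": "parquet_format",
    }
    for statement in external_table_ddl(metadata):
        print(statement)
```

Recreating the tables on every run keeps the Synapse view in line with the metadata without having to diff schemas, at the price of a short window in which a table does not exist.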

Slide 20

Slide 20 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 21

Slide 21 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 22

Slide 22 text

Data Lake Access
Assign ACLs based on Active Directory Groups. For each dataset, 2 Active Directory roles:
• ReadAll
• ReadNoPIIonly

Cleansed
/CLEANSED/{System}/PII/Dataset1
/CLEANSED/{System}/PII/Dataset2
/CLEANSED/{System}/NoPII/Dataset2
/CLEANSED/{System}/NoPII/Dataset3
/CLEANSED/{System}/NoPII/Dataset4
/CLEANSED/{System}/NoPII/Dataset5
/CLEANSED/{System}/…
/CLEANSED/{System}/…
…

Curated
/CURATED/{System}/PII/Dataset1
/CURATED/{System}/PII/Dataset2
/CURATED/{System}/NoPII/Dataset2
/CURATED/{System}/NoPII/Dataset3
/CURATED/{System}/NoPII/Dataset4
/CURATED/{System}/NoPII/Dataset5
/CURATED/{System}/…
/CURATED/{System}/…
…

Sandboxes: Credential Passthrough
• Split PII information based on metadata
• Harmonize the file format
• (Pseudo-)Normalize the data objects into tables

Slide 23

Slide 23 text

Data Lake Access
1. Define Data Lake structure & mapping in JSON format
2. PR to Azure Repo
3. Run CD pipeline to set the ACL mapping
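How a JSON mapping might be expanded into ACL entries per folder is sketched below. The JSON layout, the placeholder group object IDs and the function `acl_assignments` are hypothetical; the sketch also assumes that ReadAll grants read access to both the PII and NoPII folders while ReadNoPIIonly covers only NoPII. Entries use the POSIX-style `group:<id>:r-x` notation; applying them to the Data Lake is left to the CD pipeline.

```python
import json

# Hypothetical mapping file content; group IDs are placeholders.
MAPPING_JSON = """
{
  "layers": ["CLEANSED", "CURATED"],
  "datasets": [
    {"system": "69", "name": "Dataset2",
     "groups": {"ReadAll": "<object-id-read-all>",
                "ReadNoPIIonly": "<object-id-read-nopii>"}}
  ]
}
"""


def acl_assignments(mapping):
    """Expand the mapping into (path, acl_entry) pairs with read/execute access."""
    assignments = []
    for layer in mapping["layers"]:
        for dataset in mapping["datasets"]:
            for zone in ("PII", "NoPII"):
                path = f"/{layer}/{dataset['system']}/{zone}/{dataset['name']}"
                # Assumption: only ReadAll may see PII; both roles may see NoPII.
                roles = ["ReadAll"] if zone == "PII" else ["ReadAll", "ReadNoPIIonly"]
                for role in roles:
                    assignments.append((path, f"group:{dataset['groups'][role]}:r-x"))
    return assignments


if __name__ == "__main__":
    for path, entry in acl_assignments(json.loads(MAPPING_JSON)):
        print(path, entry)
```

Keeping the mapping in a reviewed file means every access change goes through a pull request before the pipeline applies it.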

Slide 24

Slide 24 text

User & Service Permissions

«services.csv»
scope, assignee, assignee_type, role, environments
Microsoft.Storage/storageAccounts/landingdatalake#{environment}#, adf-integrate-#{environment}#, managed_identity, Storage Blob Data Contributor, all
Microsoft.Storage/storageAccounts/landingdatalake#{environment}#, adf-integrate-#{environment}#, managed_identity, Storage Blob Data Contributor, dev
Microsoft.Storage/storageAccounts/curateddatalake#{environment}#, adf-cleanse-#{environment}#, managed_identity, Storage Blob Data Contributor, all
Microsoft.EventGrid/topics/eventgrid-integrate-activites#{environment}#, adf-integrate-#{environment}#, managed_identity, EventGrid Data Sender, all
…

«users.csv»
scope, assignee, assignee_type, role, environments
/subscriptions/#{subscriptionId}#/resourceGroups/RG-#{environment}#, SG_AAD_DATA_ENGINEER, aad_group, Contributor, dev
…

Use PowerShell and az cli to deploy/modify the permissions.
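The `#{...}#` tokens and the `environments` column suggest a resolve-and-filter step before the role assignments are applied. A minimal sketch of that step follows, assuming each row names either `all` or a single environment; the functions `replace_tokens` and `assignments_for` and the token values are hypothetical, and the deployment itself (PowerShell and az cli in the talk) is not shown.

```python
import csv
import io
import re

# Two rows taken from the CSV files above.
PERMISSIONS_CSV = """scope, assignee, assignee_type, role, environments
Microsoft.Storage/storageAccounts/landingdatalake#{environment}#, adf-integrate-#{environment}#, managed_identity, Storage Blob Data Contributor, all
/subscriptions/#{subscriptionId}#/resourceGroups/RG-#{environment}#, SG_AAD_DATA_ENGINEER, aad_group, Contributor, dev
"""

TOKEN = re.compile(r"#\{(\w+)\}#")


def replace_tokens(text, values):
    """Replace every #{name}# token; an unknown token raises KeyError."""
    return TOKEN.sub(lambda match: values[match.group(1)], text)


def assignments_for(csv_text, values):
    """Return the role assignments that apply to values['environment']."""
    environment = values["environment"]
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = []
    for row in reader:
        # Keep rows marked "all" or marked with the target environment.
        if row["environments"].strip() not in ("all", environment):
            continue
        rows.append({
            column: replace_tokens(value.strip(), values)
            for column, value in row.items()
            if column != "environments"
        })
    return rows


if __name__ == "__main__":
    tokens = {"environment": "dev", "subscriptionId": "0000"}  # placeholder values
    for assignment in assignments_for(PERMISSIONS_CSV, tokens):
        print(assignment)
```

Failing loudly on an unknown token is a deliberate choice here: a scope with an unresolved placeholder should never reach the deployment step.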

Slide 25

Slide 25 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 26

Slide 26 text

Data Ingestion Access & Permissions Sandbox Deployment Platform Deployment

Slide 27

Slide 27 text

Sandbox Automation
First View
Goals
• Create DevOps project with Azure Repo and CI/CD pipelines for the Synapse deployment
• Add defined users and groups to the DevOps project and set permissions
• Create 3-tier Synapse environment incl. network integration and Azure Repo linkage
• Configure Synapse linked services and access rights

Slide 28

Slide 28 text

Sandbox Automation Second View

Slide 29

Slide 29 text

Sandbox Automation
Final Approach
• YAML Config: to set the configurations
• Custom DevOps Wrapper: to automate DevOps
• Bicep Scripts: to automate Synapse deployment
• Template Repository: as a «blueprint» repo to clone into the new project, incl. Synapse CI/CD pipelines
• CD Pipeline: to deploy a sandbox

Slide 30

Slide 30 text

Sandbox Configuration
For a new Sandbox we just need:
• A name
• A removal date (if applicable)
• Tags
• Users in the form of AD groups or usernames
  • For Owners
  • For Contributors
Deployment takes around 20 minutes.
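The handful of parameters listed above could be captured in a small typed structure and checked before the pipeline starts. The sketch below is an assumption about how that might look: the class `SandboxConfig`, its field names and the validation rules are hypothetical and not taken from the actual YAML configuration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class SandboxConfig:
    """Parameters for one sandbox, e.g. as loaded from a YAML config file."""
    name: str
    owners: List[str]                                   # AD groups or usernames
    contributors: List[str] = field(default_factory=list)
    tags: Dict[str, str] = field(default_factory=dict)
    removal_date: Optional[date] = None                 # only if applicable

    def validate(self, today: date) -> List[str]:
        """Return a list of problems; an empty list means the config is usable."""
        errors = []
        if not self.name:
            errors.append("name is required")
        if not self.owners:
            errors.append("at least one owner is required")
        if self.removal_date is not None and self.removal_date <= today:
            errors.append("removal date must be in the future")
        return errors


if __name__ == "__main__":
    config = SandboxConfig(
        name="sandbox-a",
        owners=["SG_AAD_DATA_ENGINEER"],
        tags={"purpose": "prototyping"},
    )
    print(config.validate(today=date.today()))
```

Validating up front is cheap compared with discovering a missing owner after a deployment of around 20 minutes.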

Slide 31

Slide 31 text

User’s Perspective
1. User orders Active Directory Group(s) and assigns the desired users
2. Define roughly 5 parameters for the Sandbox configuration
3. Wait until the configuration is deployed (deployment takes around 20 minutes)
4. Order the needed Data Lake Access Roles (Active Directory Groups)
5. The respective data owner approves the Data Lake Access Role
6. User can access the Data Lake through a 3-tier Synapse environment and model/develop/prototype new use-cases

Slide 32

Slide 32 text


Slide 33

Slide 33 text

Learnings

Infrastructure & Deployment
• Bicep modularization: don’t be too “dogmatic”
• Start with the GUI, then go to code (IaC)
• Use policies for certain deployments (e.g. private DNS zones)
• Interoperability (network, AD, engineers) is key
• Use a “Platform Deployment” environment so as not to break the DEV environment
• Think of the network before you start coding
• Check the VS Marketplace for useful DevOps extensions
• Ensure life-cycle & maintenance of the IaC scripts (e.g. update to new VM families)

Data & Data Pipelines
• Set up governance to stay informed about changes in streamed data
• Build the streaming pipelines for failure; they will certainly fail at some point
• Think of how to reload the data, otherwise changes might be lost
• You always need more metadata :-)

Slide 34

Slide 34 text

Contact
Tim Giger
Principal Data & Analytics Consultant

Slide 35

Slide 35 text
