Global Azure - designing and implementing a DWH in the cloud
A slide deck from my talk at Global Azure Virtual Event 2020, where I shared the process and design of a DWH scenario I worked on for a customer.
• Data warehouse layers
• Core concepts that differ from a traditional DWH
• Key issues to be aware of
• Demos
  • Unzipping with ADF
  • Setting Data Lake permissions (Gen 2)
• Export to CSV/Excel based on multi-table joins and filters
• Singleton read operations
• 100+ columns to export
• Source: 1 trillion rows
• Challenges for an on-prem solution:
  • Capacity, licensing, hardware
  • Flexibility
  • Pressure on the source system during extraction
The traditional DWH flow:
• Source systems (Invoicing, Store)
• Use ETL to move and transform data
• Store in relational format (EDWH or Hub-and-Spoke model)
• Analysis Services for the semantic layer
• Reporting and delivery: SSRS, Power BI, Excel, 3rd-party tools
• Uses on-prem SQL Server
• Can still be implemented in the cloud
• Key difficulties:
  • Schema drift and new data
  • Lambda architecture (speed layer) is hard to implement within a single technology stack
  • Scaling and extending the solution
  • Capacity and cost management
• Two compute layers:
  • Big data/ML
  • Relational (Synapse and/or SQL DB)
• Still need the relational layer to scale traditional SQL workloads and analysts' queries
• An incremental strategy is key in planning:
  • Append only: easy in both the data lake and relational stores
  • Upsert: almost impossible in the data lake, easy in relational
• Still need AAS/Power BI Premium to scale analytics workloads
• SQL DB can be used instead of Synapse Analytics for small-scale DWH implementations
Source files arrive as CSV -> zip:
• Delimiter (choose wisely)
• Encoding (UTF-8 works with Cyrillic)
• Compression algorithm
• Data lake zoning: DROP, RAW, BASE, CURATED
• Permissions: specific for each zone
• Processed files: archived
• Storage format: Parquet
• Compute and querying: Databricks
Critical milestones:
• Fixing encoding and data transfer
• Compression
• Processing and partitioning
• Data flow joins to dimensions
• Join performance
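The encoding fix above can be sketched with the standard library alone: unzip a member in memory and decode it with an explicit source encoding before anything downstream assumes UTF-8. The `cp1251` source encoding, the member name, and the `reencode_zipped_csv` helper are illustrative assumptions, not part of the original solution.

```python
import csv
import io
import zipfile

def reencode_zipped_csv(zip_bytes: bytes, member: str,
                        src_encoding: str = "cp1251",
                        delimiter: str = ";") -> list[list[str]]:
    """Hypothetical helper: open one zipped CSV member and decode it with an
    explicit source encoding, returning parsed rows as lists of strings."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding=src_encoding, newline="")
            return list(csv.reader(text, delimiter=delimiter))

# Build a tiny in-memory zip containing a cp1251-encoded CSV with Cyrillic text.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "id;name\n1;Фактура\n".encode("cp1251"))

rows = reencode_zipped_csv(buf.getvalue(), "data.csv")
print(rows)  # [['id', 'name'], ['1', 'Фактура']]
```

In the real pipeline this step ran in ADF/Databricks rather than plain Python, but the principle is the same: pick the delimiter and the source encoding explicitly instead of trusting defaults.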
• Header-detail of 2 sets of tables
• Multiple nomenclatures
  • Including one with >10 mln records
• Output:
  • Header
  • Joined 3 times to details
  • Joined 9 times to the 10-mln-record nomenclature
  • Joined ~10 times to the other small tables
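The shape of that output can be sketched with plain Python dicts standing in for the Spark/Data Flow joins: detail rows are joined back to their header, then every code column is resolved against a dimension lookup (the repeated joins from the slide). All table and column names here are hypothetical.

```python
# Hypothetical header-detail tables and dimension lookups.
headers = [{"inv_id": 1, "store_id": 10, "customer_id": 20}]
details = [{"inv_id": 1, "item_id": 100, "qty": 2}]
nomenclature = {100: "Widget"}   # stands in for the >10 mln record dimension
stores = {10: "Main"}
customers = {20: "ACME"}

def flatten(headers, details):
    """Join each detail row to its header, then resolve every code
    against its dimension lookup to produce the wide export row."""
    by_id = {h["inv_id"]: h for h in headers}
    out = []
    for d in details:
        h = by_id[d["inv_id"]]
        out.append({
            "inv_id": d["inv_id"],
            "store": stores[h["store_id"]],
            "customer": customers[h["customer_id"]],
            "item": nomenclature[d["item_id"]],
            "qty": d["qty"],
        })
    return out

flat = flatten(headers, details)
print(flat)  # [{'inv_id': 1, 'store': 'Main', 'customer': 'ACME', 'item': 'Widget', 'qty': 2}]
```

At scale, joining a 10-mln-row nomenclature nine times is exactly where explicit partitioning and join strategy (broadcast vs. shuffle) dominate performance.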
• Requires an insert-only source
  • Updates/deletes are otherwise lost
  • Data lake writes are append by default
• UPSERT
  • Not native for the data lake and/or Spark
  • Can be implemented with:
    • Databricks Delta
    • A union of full and delta, extracting the latest rows (if you have a timestamp)
    • Separating inserts/deletes/updates (you can use the ADF template "Parquet CRUD")
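The union-of-full-and-delta approach can be sketched in a few lines: combine both sets and keep, per business key, the row with the highest timestamp. The tuple layout `(key, timestamp, value)` is a simplifying assumption; in practice this would be a Spark window or `groupBy`/`max` over Parquet files.

```python
# Hypothetical rows: (key, timestamp, value). "full" is the existing
# snapshot, "delta" the new extract.
full = [(1, 100, "a"), (2, 100, "b")]
delta = [(2, 200, "b2"), (3, 200, "c")]

def upsert_latest(full, delta):
    """Union full and delta, keeping only the latest row per key
    (the timestamp-based upsert strategy from the slide)."""
    latest = {}
    for key, ts, val in full + delta:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, val)
    return sorted((k, ts, v) for k, (ts, v) in latest.items())

merged = upsert_latest(full, delta)
print(merged)  # [(1, 100, 'a'), (2, 200, 'b2'), (3, 200, 'c')]
```

Note that this variant cannot express deletes; for those you need tombstone rows, the separate insert/delete/update streams mentioned above, or Delta's native MERGE.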
• Additional costs for metadata
• Databricks billing is split in two: VMs and workspaces
• ADF is billed in two different ways:
  • DIU/hour for pipeline activities with compute
  • Core/hour for data flows
• Test the entire flow with SMALL data and small compute
• Test and verify which compression works for you
• Plan your incremental strategy in advance
• ADF:
  • DEBUG does not write data unless you debug a pipeline
  • Use a separate integration runtime for debugging, with a small TTL (you pay for the time the IR is turned on)
  • Set partitioning for large dataset operations explicitly. Know your data distribution!
  • Plan for compute acquisition
  • Use dataset referencing
  • Schema drift: choose between late binding and a strict schema
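"Test which compression works for you" can itself be done with small data first. A minimal sketch, comparing the standard library codecs on a sample payload; the semicolon-delimited sample is an assumption standing in for a slice of your real CSVs, and the right answer also depends on decompression speed and splittability, not size alone.

```python
import bz2
import gzip
import lzma

# Hypothetical sample payload; in practice, use a representative slice
# of your actual export files.
sample = ("id;name;amount\n"
          + "\n".join(f"{i};Фактура;{i * 10}" for i in range(1000))).encode("utf-8")

# Compressed size per algorithm, at default settings.
results = {
    "gzip": len(gzip.compress(sample)),
    "bz2": len(bz2.compress(sample)),
    "lzma": len(lzma.compress(sample)),
}
for algo, size in results.items():
    print(f"{algo}: {size} bytes (original {len(sample)})")
```

The same measurement loop works for Snappy or Zstandard (common Parquet codecs) if those libraries are available in your environment.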