
Global Azure - designing and implementing a DWH in the cloud

A slide deck from my talk at the Global Azure Virtual Event 2020, where I shared the process and design of a DWH scenario I worked on for a customer.

Ivan Donev

April 25, 2020

Transcript

  1. Ivan Donev – MVP Data Platform
     • Data platform solution architect @ www.inspirit.bg
     • Trainer @ www.sqlmasteracademy.com
     • @ikdonev
     • https://www.linkedin.com/in/ivandonev/
  2. Agenda
     • The use case for our solution
     • The data warehouse layers
     • Core concepts that differ from a traditional DWH
     • Key issues to be aware of
     • Demos
       - Unzipping with ADF
       - Setting Data Lake (Gen 2) permissions
  3. The project
     • Straightforward reporting requirement
     • Export data to CSV/Excel based on multi-table joins and filters
     • Singleton read operations
     • 100+ columns to export
     • Source – 1 trillion rows
     • Challenges for an on-prem solution:
       - Capacity, licensing, hardware
       - Flexibility
       - Pressure on the source system during extraction
  4. Traditional DWH layers
     • Source systems: Sales, Marketing, Product, LOB App, Invoicing, Store
     • Use ETL to move and transform data
     • Store in relational format: EDWH or Hub and Spoke model
     • Analysis Services for the semantic layer
     • Reporting and delivery: SSRS, Power BI, Excel, 3rd party tools
  5. Traditional DWH layers (same diagram as slide 4, annotated)
     • Use of on-prem SQL Server
     • Can still be implemented in the cloud
     • Key difficulties:
       - Schema drifting and new data
       - Lambda architecture (speed layer) is difficult to implement within a single technology stack
       - Scaling and extending the solution
       - Capacity and cost management
  6. Modern DWH in the cloud – key features
     • Flexible
     • Schema drifting
     • Data reuse
     • Capacity, provisioning and maintenance
     • Enables new scenarios at “low cost”
  7. The modern DWH reference architecture
     • ETL is now ELT
     • Two compute layers:
       - Big data/ML
       - Relational (Synapse and/or SQL DB)
     • Still need relational for scaling traditional SQL workloads/analysts’ queries
     • Incremental strategy is key in planning:
       - Append only – easy in DL and relational stores (see the sketch after this slide)
       - Upsert – almost impossible in DL, easy in Rel.
     • Still need AAS/PBI Premium for scaling analytics workloads
     • SQL DB can be used instead of Synapse Analytics for small-scale DWH implementations
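To make the append-only option concrete, here is a minimal PySpark sketch; it is not part of the original deck, and the account, container, path and column names are assumptions for illustration only.

```python
# Append-only incremental load sketch (hypothetical paths and columns; assumes a Databricks/Spark session).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the daily extract that was already copied into the RAW zone.
daily = (spark.read
         .option("header", "true")
         .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/2020-04-25/"))

# Append-only: nothing is matched against existing data, so source updates/deletes are not reflected.
(daily.withColumn("load_date", F.current_date())
      .write
      .mode("append")
      .partitionBy("load_date")
      .parquet("abfss://base@mydatalake.dfs.core.windows.net/sales/"))
```

This is why the slide calls append-only "easy": the write never has to look at what is already in the lake, which is also its limitation.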
  8. The modern DWH reference architecture
     • SourceDB to DL: SSIS with csv -> zip
       - Delimiter (choose wisely)
       - Encoding (UTF-8 works with Cyrillic)
       - Compression algorithm
     • DL zoning: DROP, RAW, BASE, CURATED
     • Permissions: specific for each zone
     • Processed files: archived
     • Storage format: PARQUET
     • Compute and querying: Databricks (see the sketch after this slide)
     • Critical milestones:
       - Fixing encoding and data transfer
       - Compression
       - Processing and partitioning
       - Data flow joins to dimensions
       - Join performance
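A minimal Databricks/PySpark sketch of the RAW -> BASE step described above: reading the already-unzipped CSV with an explicit delimiter and UTF-8 encoding, then writing Parquet. The container names, the `|` delimiter and the schema-inference choice are assumptions, not taken from the deck.

```python
# RAW (unzipped CSV) -> BASE (Parquet) sketch; container names, delimiter and schema inference are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path  = "abfss://raw@mydatalake.dfs.core.windows.net/invoices/"
base_path = "abfss://base@mydatalake.dfs.core.windows.net/invoices/"

df = (spark.read
      .option("header", "true")
      .option("sep", "|")           # delimiter: "choose wisely" - pick one that never occurs in the data
      .option("encoding", "UTF-8")  # UTF-8 handles Cyrillic text
      .option("inferSchema", "true")
      .csv(raw_path))

# Parquet is columnar and compressed (snappy by default), which is what the BASE zone stores here.
df.write.mode("overwrite").parquet(base_path)
```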
  9. The source schema and the output
     • Third normal form (3NF)
     • Header-detail of 2 sets of tables
     • Multiple nomenclatures
       - Including one with >10 mln records
     • Output (see the sketch after this slide):
       - Header
       - Join 3 times the details
       - Join 9 times the 10 mln record nomenclature
       - Join ~10 times the other small tables
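A hedged PySpark sketch of how a wide output like this can be assembled; every table, column and key name below is invented for illustration. Small nomenclatures are broadcast to avoid shuffles, while the large nomenclature that is joined many times is handled with ordinary shuffle joins (explicit partitioning for it is sketched after slide 14).

```python
# Sketch of assembling the wide export table; all table, column and key names are invented.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
base = "abfss://base@mydatalake.dfs.core.windows.net/"

header    = spark.read.parquet(base + "header/")
detail    = spark.read.parquet(base + "detail/")
nom_big   = spark.read.parquet(base + "nomenclature_big/")    # the >10 mln row nomenclature
nom_small = spark.read.parquet(base + "nomenclature_small/")  # small lookup table

wide = header.join(detail, "header_id")

# The big nomenclature backs several columns, so it is joined repeatedly under different aliases
# (around 9 times in the real case).
for col in ["product_code", "alt_product_code"]:
    lookup = nom_big.select(nom_big["code"].alias(col),
                            nom_big["name"].alias(col + "_name"))
    wide = wide.join(lookup, col, "left")

# Small lookup tables fit in executor memory, so broadcasting them avoids a shuffle.
wide = wide.join(broadcast(nom_small), "country_code", "left")

wide.write.mode("overwrite").parquet("abfss://enriched@mydatalake.dfs.core.windows.net/wide_table/")
```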
  10. The data lake structure (a possible folder layout is sketched after this slide)
     • Source data: >1 T records
     • DROP: compressed files
     • RAW: just copy files
     • BASE: cleansed data, dimension data, supporting tables, PARQUET compression
     • ENRICHED: single wide table
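For orientation, the zones above typically map to containers or top-level folders in ADLS Gen 2. One possible layout, with account and container names assumed rather than taken from the deck:

```python
# Possible ADLS Gen 2 layout for the zones above; account and container names are assumptions.
ACCOUNT = "mydatalake"

ZONES = {
    "drop":     f"abfss://drop@{ACCOUNT}.dfs.core.windows.net/",      # compressed files landed from the source
    "raw":      f"abfss://raw@{ACCOUNT}.dfs.core.windows.net/",       # plain copies of the dropped files
    "base":     f"abfss://base@{ACCOUNT}.dfs.core.windows.net/",      # cleansed data, dimensions, Parquet
    "enriched": f"abfss://enriched@{ACCOUNT}.dfs.core.windows.net/",  # the single wide table
}
```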
  11. Incremental strategies in the cloud with DL
     • Append only
       - Requires an insert-only source
       - Updates/deletes are lost
       - DL writes are append by default
     • UPSERT
       - Not native for DL and/or Spark
       - Can be implemented with:
         - Databricks Delta (see the sketch after this slide)
         - Union of full and delta, extracting the latest rows (if you have a timestamp)
         - Separating inserts/deletes/updates (the ADF template “Parquet CRUD” can be used)
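A hedged sketch of the Databricks Delta option mentioned above, assuming the BASE table is already stored in Delta format and that an `id` key column exists; the paths and column names are placeholders, not taken from the deck.

```python
# UPSERT via Databricks Delta MERGE; the table path, key column and the delta source are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base_path = "abfss://base@mydatalake.dfs.core.windows.net/invoices_delta/"

# The latest delta extract (already landed in RAW by ADF).
updates = (spark.read
           .option("header", "true")
           .csv("abfss://raw@mydatalake.dfs.core.windows.net/invoices/2020-04-25/"))

target = DeltaTable.forPath(spark, base_path)  # BASE table stored in Delta format

(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()     # existing rows get the newer values
       .whenNotMatchedInsertAll()  # brand-new rows are inserted
       .execute())
```

The union-of-full-and-delta variant from the slide can be expressed without Delta: union the two sets and keep only the newest row per key using a window ordered by the timestamp.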
  12. Demonstration
     • Data lake structure
     • Data lake permissions and AAD (see the ACL sketch after this slide)
     • Azure Data Factory DROP to RAW
     • ADF data flows:
       - Sources
       - Joins and optimization
       - Referencing
       - Setting partitions
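The permissions demo itself is not reproduced in the transcript, but as a hedged illustration of zone-specific ACLs on ADLS Gen 2 with AAD, here is a sketch using the azure-storage-file-datalake SDK; the account name, filesystem name and group object ID are placeholders.

```python
# Zone-level POSIX ACLs on ADLS Gen 2 via AAD; account, filesystem and group object ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential())

fs = service.get_file_system_client("datalake")

# Grant an AAD security group read/execute on the BASE zone only; other zones get their own ACLs.
analysts_group_oid = "00000000-0000-0000-0000-000000000000"  # placeholder AAD object ID
base_dir = fs.get_directory_client("base")
base_dir.set_access_control(
    acl=f"user::rwx,group::r-x,mask::r-x,other::---,group:{analysts_group_oid}:r-x")
```

These ACLs complement any Azure RBAC role assignments that already exist on the storage account.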
  13. Costs
     • Beware of the “hidden” costs:
       - Storage includes additional costs for metadata
       - Databricks is billed in two parts – VMs and workspaces
       - ADF is billed in two different ways:
         - DIU/hour for pipeline activities with compute
         - Core/hour for data flows
  14. Things to note
     • Data lake:
       - Permissions
       - Test the entire flow with SMALL data and small compute
       - Test and verify which compression works for you
       - Plan your incremental strategy in advance
     • ADF:
       - DEBUG does not write data unless you debug a pipeline
       - Use a separate integration runtime for debug with a small TTL (you are paying for the time the IR is turned on)
       - Set partitioning for large dataset operations explicitly – know your data distribution! (see the sketch after this slide)
       - Plan for compute acquisition
       - Use dataset referencing
       - Schema drifting – choose between late binding and a strict schema
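As a hedged illustration of setting partitioning explicitly before a large operation (the deck demonstrates this in ADF data flows; the sketch below shows the equivalent idea in PySpark, with invented column names and an arbitrary partition count):

```python
# Explicit partitioning before a large join and write; column names and partition counts are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base = "abfss://base@mydatalake.dfs.core.windows.net/"

detail  = spark.read.parquet(base + "detail/")
nom_big = spark.read.parquet(base + "nomenclature_big/")

# Repartition both sides on the join key so the shuffle is balanced; the right partition count
# depends on the data distribution, so profile the key before picking a number.
detail_p  = detail.repartition(400, "product_code")
nom_big_p = (nom_big.withColumnRenamed("code", "product_code")
             .repartition(400, "product_code"))

joined = detail_p.join(nom_big_p, "product_code", "left")

# Partition the output by a column the export actually filters on.
joined.write.mode("overwrite").partitionBy("load_date").parquet(base + "detail_enriched/")
```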