Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Modern Data Warehouse using Azure

Modern Data Warehouse using Azure

Slide deck related to the presentation on building modern data warehouse using Azure


Nilesh Gule

June 09, 2022

More Decks by Nilesh Gule

Other Decks in Technology


  1. @nileshgule Modern Data Warehouse using Azure

  2. $whoami { “name” : “Nilesh Gule”, “website” : “https://www.HandsOnArchitect.com", “github”

    : “https://GitHub.com/NileshGule" “twitter” : “@nileshgule”, “linkedin” : “https://www.linkedin.com/in/nileshgule”, “likes” : “Technical Evangelism, Cricket”, “co-organizer” : “Azure Singapore UG” }
  3. None
  4. None
  5. @nileshgule Evolution of Data Warehousing Dept Dept Dept Staging Raw

    Query Report Real time processing Dept Dept Dept
  6. @nileshgule Evolution of Data Warehousing Raw Query Report Real time

    processing Dept Dept Dept Raw Query Report
  7. Data Warehouse • Store data from multiple data sources •

    Used for historical and trend analysis reporting • Central repository for many subject areas • Contains single version of truth • Not to be used for OLTP applications Main features • Reduce stress on production systems • Optimized for read access • Keep historical records • Restructure / rename tables, fields, model data • Protect against source system upgrades • Use Master Data Management • Easy to create BI solutions on top of it (e.g. Azure Analysis Services Cubes) • Reduces security access to multiple production systems Use cases
  8. Credits: James Serra

  9. Credits: James Serra

  10. Credits: James Serra

  11. Credits: James Serra

  12. Azure Data Lake Storage • Petabyte scale storage • Hierarchical

    namespace • Hadoop compatible access with ABFS driver Main features • Use Service Principles • Use Security Groups over individual users • Enable Gen 2 firewall with Azure services access ADLS best practices
  13. Azure Data Factory • Cloud ETL service • Scale-out serverless

    data integration & data transformation • Code-free UI • Monitoring & Management Main features
  14. Azure Data Factory – copy data

  15. Azure Data Factory – Mapping Data Flow Credits: Harun Legoz

    for football match analysis example
  16. Azure Data Factory - Monitoring

  17. Azure Databricks • Collaborative Spark based Analytical service • Optimized

    Spark environment for data pipelines and ML
  18. Delta Lake • ACID transactions • Time travel (data versioning

    enables rollbacks, audit trail) • Streaming and batch unification • Schema enforcement • Upserts and deletes • Performance improvements https://delta.io/
  19. Azure Databricks - clusters

  20. Azure Databricks - Jobs

  21. Best in class price per performance Developer productivity Workload aware

    query execution Data flexibility Up to 94% less expensive than competitors Manage heterogenous workloads through workload priorities and isolation Ingest variety of data sources to derive the maximum benefit. Query all data. Use preferred tooling for SQL data warehouse development Industry-leading security Defense-in-depth security and 99.9% financially backed availability SLA Azure Synapse – SQL Analytics focus areas Credits: James Serra
  22. Data Lake with Data Warehouse use cases • Data scientists/power

    users • Batch processing • Data refinement / cleaning • ETL workloads • Store older / backup data • Sandbox for data exploration • One-time reports • Quick access to data • Don’t know questions Data Lake – staging & preparation • Business people • Low latency • Complex joins • Interactive ad-hoc queries • High number of users • Additional security • Large support for tools • Dashboards • Self-service BI • Known questions Data Warehouse – Serving, security & Compliance
  23. Data Lakehouse concerns / limitations • Reliability – keeping data

    lake and warehouse consistent • Data staleness – older data in warehouse • Limited support for advanced analytics – Top ML systems don’t work well on warehouses • TCO – extra cost for data copies in warehouse Problems • Speed – RDBMS faster, especially MPP • Security - No RLS, column-level, dynamic data masking • Complexity – metadata separate from data, file based • Missing features – referential integrity, TDE, workload management, many features require Spark lockin Concerns wrt RDBMS
  24. @nileshgule Evolution of Data Warehousing … Centralized Decentralized Data Mesh

    | Thoughtworks
  25. @nileshgule References Data Warehouse ❖ Why you need a data

    warehouse ❖ James Serra website Azure ❖ ADLS Gen 2 Storage Account ❖ Azure Data Factory ❖ Azure Databricks ❖ Azure Data Factory Mapping Data Flows ❖ Azure Synapse Analytics ❖ Azure Synapse Link for SQL Delta Lake & Data Mesh ❖ Databricks Deltalake ❖ Delta ❖ Data Mesh Architecture ❖ ThoughtWorks – Data Mesh
  26. Nilesh Gule ARCHITECT | MICROSOFT MVP “Code with Passion and

    Strive for Excellence” nileshgule @nileshgule Nilesh Gule NileshGule www.handsonarchitect.com https://bit.ly/youtube-nileshgule
  27. Q&A