Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Modern Data Warehouse with Azure

Modern Data Warehouse with Azure

Slidedeck related to the talk presented at the Manila Data Day event March 2020. The demo covers Azure services like Data Lake Storage (Gen 2), Azure Data Factory, Azure Databricks, Azure Synapse, Key Vault and Active directory to build a modern data warehouse.

9e33a1d43a88f23f6c545c1e0f07f4b5?s=128

Nilesh Gule

March 07, 2020
Tweet

More Decks by Nilesh Gule

Other Decks in Technology

Transcript

  1. Nilesh Gule @nileshgule | www.HandsOnArchitect.com Modern Data Warehouse Using Azure

  2. Sponsors

  3. $whoami { “name” : “Nilesh Gule”, “website” : “https://www.HandsOnArchitect.com", “github”

    : “https://github.com/NileshGule" “twitter” : “@nileshgule”, “linkedin” : “https://www.linkedin.com/in/nileshgule”, “email” : “nileshgule@gmail.com", “likes” : “Technical Evangelism, Cricket” }
  4. None
  5. None
  6. Credits: James Serra

  7. Azure Data Lake Storage

  8. Azure Data Lake Storage • Petabyte scale storage • Hierarchical

    namespace • Hadoop compatible access with ABFS driver Main features • Use Service Principles • Use Security Groups over individual users • Enable Gen 2 firewall with Azure services access ADLS best practices
  9. Azure Data Factory

  10. Azure Data Factory • Cloud ETL service • Scale-out serverless

    data integration & data transformation • Code-free UI • Monitoring & Management Main features
  11. Azure Data Factory – copy data

  12. Azure Data Factory – Mapping Data Flow Credits: Harun Legoz

    for football match analysis example
  13. Azure Data Factory - Monitoring

  14. Azure Databricks

  15. Azure Databricks • Collaborative Spark based Analytical service • Optimized

    Spark environment for data pipelines and ML
  16. Azure Databricks - clusters

  17. Azure Databricks – SQL DW Connectivity

  18. Azure Databricks - Jobs

  19. Azure SQL Data Warehouse

  20. Best in class price per performance Developer productivity Workload aware

    query execution Data flexibility Up to 94% less expensive than competitors Manage heterogenous workloads through workload priorities and isolation Ingest variety of data sources to derive the maximum benefit. Query all data. Use preferred tooling for SQL data warehouse development Industry-leading security Defense-in-depth security and 99.9% financially backed availability SLA Azure Synapse – SQL Analytics focus areas Credits: James Serra
  21. APACHE SPARK SQL ANALYTICS STUDIO DATA INTEGRATION Synapse Analytics (PREVIEW)

    Synapse Analytics (GA) (formerly SQL DW) Synapse Analytics (GA) Add new capabilities to the GA service NOV 2019 New GA features • Resultset caching • Materialized Views • Ordered columnstore • JSON support • Dynamic Data Masking • SSDT support • Read committed snapshot isolation Preview features • Workload Isolation • Simple ingestion with COPY • Share DW data with Azure Data Share • Private LINK support Private preview features • Streaming ingestion & analytics in DW • Native Prediction/Scoring • Fast query over Parquet files • FROM clause with joins NOV 2019 Preview features • Synapse Studio • Collaborative workspaces • Distributed T-SQL Query service • SQL Script editor • Unified security model • Notebooks • Apache Spark • Code-free data flows • Orchestration Pipelines • Data movement Credits: James Serra
  22. Parallelism • Uses many separate CPUs running in parallel to

    execute a single program • Shared Nothing: Each CPU has its own memory and disk (scale-out) • Segments communicate using high-speed network between nodes MPP - Massively Parallel Processing • Multiple CPUs used to complete individual processes simultaneously • All CPUs share the same memory, disks, and network controllers (scale-up) • All SQL Server implementations up until now have been SMP • Mostly, the solution is housed on a shared SAN SMP - Symmetric Multiprocessing Credits: James Serra
  23. SQL DW Logical Architecture (overview) “Compute” node Balanced storage SQL

    “Compute” node Balanced storage SQL “Compute” node Balanced storage SQL “Compute” node Balanced storage SQL DMS DMS DMS DMS Compute Node – the “worker bee” of SQL DW • Runs Azure SQL Server DB • Contains a “slice” of each database • CPU is saturated by storage Control Node – the “brains” of the SQL DW • Also runs Azure SQL Server DB • Holds a “shell” copy of each database • Metadata, statistics, etc • The “public face” of the appliance Data Movement Services (DMS) • Part of the “secret sauce” of SQL DW • Moves data around as needed • Enables parallel operations among the compute nodes (queries, loads, etc) “Control” node SQL DMS Credits: James Serra
  24. SQL DW Data Layout Options “Compute” node Balanced storage SQL

    Balanced storage Balanced storage Balanced storage “Compute” node SQL “Compute” node SQL “Compute” node SQL DMS DMS DMS DMS Time Dim Date Dim ID Calendar Year Calendar Qtr Calendar Mo Calendar Day Store Dim Store Dim ID Store Name Store Mgr Store Size Product Dim Prod Dim ID Prod Category Prod Sub Cat Prod Desc Customer Dim Cust Dim ID Cust Name Cust Addr Cust Phone Cust Email Sales Fact Date Dim ID Store Dim ID Prod Dim ID Cust Dim ID Qty Sold Dollars Sold T D P D S D C D T D P D S D C D T D P D S D C D T D P D S D C D Sales Fact Replicated Table copied to each compute node Distributed Table spread across compute nodes based on “hash” Star Schema Credits: James Serra
  25. FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E

    FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H FactSales_ A FactSales_ B FactSales_ C FactSales_ D FactSales_ E FactSales_ F FactSales_ G FactSales_ H DATA DISTRIBUTION CREATE TABLE FactSales ( ProductKey INT NOT NULL , OrderDateKey INT NOT NULL , DueDateKey INT NOT NULL , ShipDateKey INT NOT NULL , ResellerKey INT NOT NULL , EmployeeKey INT NOT NULL , PromotionKey INT NOT NULL , CurrencyKey INT NOT NULL , SalesTerritoryKey INT NOT NULL , SalesOrderNumber VARCHAR(20) NOT NULL, ) WITH ( DISTRIBUTION = HASH(ProductKey), CLUSTERED INDEX(OrderDateKey) , PARTITION (OrderDateKey RANGE RIGHT FOR VALUES ( 20010601, 20010901, ) ) ); Control Node … Compute Node 1 Compute Node 2 Compute Node X Send Create Table SQL to each compute node Create Table FactSales_A Create Table FactSales_B Create Table FactSales_C …… Create Table FactSales_H FactSalesA FactSalesB FactSalesC FactSalesD FactSalesE FactSalesF FactSalesG FactSalesH FactSalesA FactSalesB FactSalesC FactSalesD FactSalesE FactSalesF FactSalesG FactSalesH FactSalesA FactSale B FactSalesC FactSalesD FactSalesE FactSalesF FactSalesG FactSalesH Create table metadata on Control Node
  26. ALTER DATABASE ContosoDW MODIFY (service_objective = 'DW1000'); DWU DW100 DW200

    DW300 DW400 DW500 DW1000 DW1500 DW2000 DW2500 DW3000 DW5000 DW6000 DW7500 DW10000 DW15000 DW30000 Credits: James Serra
  27. Azure SQL Data Warehouse Engine Worker1 Azure Storage Blob(s) D12

    D11 D13 D14 D15 D16 D18 D17 D19 D20 D22 D21 D23 D24 D25 D26 D28 D27 D29 D30 D32 D31 D33 D34 D35 D36 D38 D37 D39 D40 D42 D41 D43 D44 D45 D46 D48 D47 D49 D50 D52 D51 D53 D54 D55 D56 D58 D57 D59 D60 D2 D1 D3 D4 D5 D6 D8 D7 D9 D10 Credits: James Serra
  28. Azure SQL Data Warehouse Engine Worker4 Azure Storage Blob(s) Worker1

    Worker5 Worker3 Worker2 Worker6 D52 D51 D53 D54 D55 D56 D58 D57 D59 D60 D12 D11 D13 D14 D15 D16 D18 D17 D19 D20 D22 D21 D23 D24 D25 D26 D28 D27 D29 D30 D32 D31 D33 D34 D35 D36 D38 D37 D39 D40 D42 D41 D43 D44 D45 D46 D48 D47 D49 D50 D2 D1 D3 D4 D5 D6 D8 D7 D9 D10 Credits: James Serra
  29. Azure Key Vault

  30. Azure Key Vault • Safeguards Cryptographic keys and secrets used

    by cloud apps • Stores • Encryption keys used for BYOK solution • Storage account Access Keys • Secrets / connection strings • Certificates
  31. Azure Active Directory

  32. Azure Active Directory • Universal Identity Platform • Number of

    built in roles • Storage Blob Contributor • Storage Blob Owner • Storage Blob Data Reader • Centralizes management of users and roles • Service Principles for service to service communication
  33. ADLS Gen 2 Storage Account Azure Data Factory Azure Databricks

    Azure Data Factory Mapping Data Flows Azure Synapse Analytics Azure Databricks Key Vault Azure Key Vault ADLS Gen 2 Databricks connectivity Databricks SQL DW Connectivity Difference between SQL Server and SQL Data Warehouse External Data Source External Tables
  34. Thank you very much Code with Passion and Strive for

    Excellence https://www.slideshare.net/nileshgule/presentations https://speakerdeck.com/nileshgule/
  35. Q&A

  36. Session Feedback