Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Part 3 - Modern Data Warehouse with Azure

9e33a1d43a88f23f6c545c1e0f07f4b5?s=47 Nilesh Gule
September 15, 2020

Part 3 - Modern Data Warehouse with Azure

Slide deck of the third part of building Modern Data Warehouse using Azure. This session covered Azure Synapse, formerly SQL Data Warehouse. We look at the Azure Synapse Architecture, external files, integration with Azure Data Factory.

The recording of the session is available on YouTube
https://www.youtube.com/watch?v=LZlu6_rFzm8&WT.mc_id=DP-MVP-5003170

9e33a1d43a88f23f6c545c1e0f07f4b5?s=128

Nilesh Gule

September 15, 2020
Tweet

Transcript

  1. Nilesh Gule @nileshgule | www.HandsOnArchitect.com Modern Data Warehouse Using Azure

  2. $whoami { “name” : “Nilesh Gule”, “website” : “https://www.HandsOnArchitect.com", “github”

    : “https://github.com/NileshGule" “twitter” : “@nileshgule”, “linkedin” : “https://www.linkedin.com/in/nileshgule”, “email” : “nileshgule@gmail.com", “likes” : “Technical Evangelism, Cricket”, “co-organizer” : “Azure Singapore UG” }
  3. None
  4. Credits: James Serra

  5. None
  6. Part 1 - Recap – ADLS & ADF • Petabyte

    scale storage • Hierarchical namespace • Hadoop compatible access with ABFS driver ADLS - Main features ADF - Main features • Cloud ETL service • Scale-out serverless data integration & data transformation • Code-free UI • Monitoring & Management
  7. Part 2 - Recap • Collaborative Spark based Analytical service

    • Different cluster types (automated / interactive / pool) • Autoscale based on workloads • Fine grained access controls Azure Databricks - Main features
  8. Azure Synapse Limitless analytics service for enterprise data warehousing and

    Big Data analytics
  9. Parallelism • Uses many separate CPUs running in parallel to

    execute a single program • Shared Nothing: Each CPU has its own memory and disk (scale-out) • Segments communicate using high-speed network between nodes MPP - Massively Parallel Processing • Multiple CPUs used to complete individual processes simultaneously • All CPUs share the same memory, disks, and network controllers (scale-up) • All SQL Server implementations up until now have been SMP • Mostly, the solution is housed on a shared SAN SMP - Symmetric Multiprocessing
  10. Synapse Architecture https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture • Control Node • Compute Node •

    Data Movement Service (DMS) Components • Hash • Round Robin • Replicate Distributions
  11. Synapse Data Distributions https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/massively-parallel-processing-mpp-architecture • Highest query perf for joins

    & aggregations on large tables • Rows per distribution varies Hash • Fastest query performance for small tables Replicated
  12. ALTER DATABASE ContosoDW MODIFY (service_objective = 'DW1000'); DWU DW100 DW200

    DW300 DW400 DW500 DW1000 DW1500 DW2000 DW2500 DW3000 DW5000 DW6000 DW7500 DW10000 DW15000 DW30000
  13. Azure SQL Data Warehouse Engine Worker1 Azure Storage Blob(s) D12

    D11 D13 D14 D15 D16 D18 D17 D19 D20 D22 D21 D23 D24 D25 D26 D28 D27 D29 D30 D32 D31 D33 D34 D35 D36 D38 D37 D39 D40 D42 D41 D43 D44 D45 D46 D48 D47 D49 D50 D52 D51 D53 D54 D55 D56 D58 D57 D59 D60 D2 D1 D3 D4 D5 D6 D8 D7 D9 D10
  14. Azure SQL Data Warehouse Engine Worker4 Azure Storage Blob(s) Worker1

    Worker5 Worker3 Worker2 Worker6 D52 D51 D53 D54 D55 D56 D58 D57 D59 D60 D12 D11 D13 D14 D15 D16 D18 D17 D19 D20 D22 D21 D23 D24 D25 D26 D28 D27 D29 D30 D32 D31 D33 D34 D35 D36 D38 D37 D39 D40 D42 D41 D43 D44 D45 D46 D48 D47 D49 D50 D2 D1 D3 D4 D5 D6 D8 D7 D9 D10
  15. Azure Databricks – SQL DW Connectivity

  16. External Data Sources • External Data Source • Hadoop, ADLS

    • External File Format • File types • Delimited Text, Hive RCFile, Hive ORC file, Parquet • Data Compression • Gzip, Snappy • Field Delimiters • Date Format • External Table
  17. What workloads are NOT suitable? • High frequency reads and

    writes. • Large numbers of singleton selects. • High volumes of single row inserts. Operational workloads (OLTP) • Row by row processing needs. • Incompatible formats (XML). Data Preparations SQL SQL
  18. What Workloads are Suitable? Store large volumes of data. Consolidate

    disparate data into a single location. Shape, model, transform and aggregate data. Batch/Micro-batch loads. Perform query analysis across large datasets. Ad-hoc reporting across large data volumes. All using simple SQL constructs. Analytics
  19. Summary • MPP Architecture • Can be paused • Optimized

    for analytics workloads • Supports multiple external file formats • Works with Polybase Azure Synapse - Main features
  20. SQL Server & SQL Data Warehouse Differences Azure Synapse Workload

    Management External Data Source External File Formats External Table SQL Data Warehouse Benchmark
  21. References – MS Learn https://docs.microsoft.com/en-us/learn/paths/implement-sql-data-warehouse

  22. Thank you very much Code with Passion and Strive for

    Excellence https://www.slideshare.net/nileshgule/presentations https://speakerdeck.com/nileshgule/
  23. Nilesh Gule ARCHITECT | MICROSOFT MVP “Code with Passion and

    Strive for Excellence” nileshgule @nileshgule Nilesh Gule NileshGule www.handsonarchitect.com
  24. Q&A