Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Getting Started With Azure SQL Data Warehouse

Getting Started With Azure SQL Data Warehouse

Azure SQL Data Warehouse is a massively parallel processing (MPP) cloud-based, scale-out, relational database, built on SQL Server. Behind the scenes, SQL Data Warehouse spreads your data across many shared storages and processing units. It gives a possibility to grow and shrink compute and storage independently, pause and resume to pay only for what you use.

A few months ago, we started exploring Azure Data Warehouse, and now we are successfully using it in production. I would like to share our first lessons learned.

Valdas Maksimavičius

July 10, 2017
Tweet

More Decks by Valdas Maksimavičius

Other Decks in Programming

Transcript

  1. About me Valdas Maksimavičius • MS Dynamics, .Net, Web, BI,

    Analytics, Azure • Analytics Architect at Cognizant • Between Poland and Lithuania • blog.thevaldas.com
  2. Getting Started With Azure SQL Data Warehouse 1. IaaS, PaaS,

    MPP (Massively Parallel Processing) 2. Data distribution (Round Robin, Hash) 3. Performance (indexes, statistics, monitoring) 4. Tools and services 5. Security and backups
  3. Azure SQL Data Warehouse is a SQL-based, fully- managed, petabyte-scale

    cloud data warehouse. It’s highly elastic, and it enables you to set up in minutes and scale capacity in seconds. Scale compute and storage independently.
  4. Azure SQL Data Warehouse is a SQL-based, fully- managed, petabyte-scale

    cloud data warehouse. It’s highly elastic, and it enables you to set up in minutes and scale capacity in seconds. Scale compute and storage independently.
  5. IaaS vs PaaS Infrastructure as a Service Storage Servers Networking

    O/S Middleware Virtualization Data Applications Runtime Managed by Microsoft You scale, make resilient & manage Platform as a Service Scale, Resilience and management by Microsoft You manage Storage Servers Networking O/S Middleware Virtualization Applications Runtime Data
  6. IaaS vs PaaS Infrastructure as a Service Storage Servers Networking

    O/S Middleware Virtualization Data Applications Runtime Managed by Microsoft You scale, make resilient & manage Platform as a Service Scale, Resilience and management by Microsoft You manage Storage Servers Networking O/S Middleware Virtualization Applications Runtime Data
  7. SMP + OLTP (i.e. applications with individual updates, inserts, and

    deletes) + Relatively cheap - Relatively slow MPP + Data warehousing (heavy reads and batch writes) + Scaling horizontally - Complex - Expensive
  8. Parallel Data Warehouse v1 Data Allegro product on Windows &

    SQL. First DW appliance by MSFT in partnership with Dell and HP Microsoft Acquired Data Allegro Company viewed as most efficient way to bring MPP to SQL Server world Analytics Platform System (APS) Introduction of Hadoop region within appliance and new naming to reflect broader Big Data capabilities SQL DW Service Introduction of Azure SQL DW Service based on APS’s MPP capabilities Fast Track Data Warehouse Launch DW Reference Architectures based on SMP DW best practices offered with leading H/W Partners Parallel Data Warehouse v2 Re-architected Product delivering new form factors and greatly improved price/perform ance. Microsoft & Data Warehouse 2008 2013 2010 2015 2014 2011
  9. CREATE TABLE [dbo].[Students] ( [StudentId] INT NOT NULL [Name] VARCHAR(50)

    NULL ) WITH ( DISTRIBUTION = HASH( [StudentId] ) )
  10. Resource classes (memory per distribution - MB) DWU Smallrc (default)

    Mediumrc Largerc xlargerc DW100 100 100 200 400 DW500 100 400 800 1 600 DW6000 100 3 200 6 400 12 800
  11. Resource classes (memory system-wide GB) DWU Smallrc (default) Mediumrc Largerc

    xlargerc DW100 6 6 12 23 DW500 6 23 47 94 DW6000 6 188 375 750
  12. CREATE TABLE [dbo].[Students] ( [StudentId] INT NOT NULL [Name] VARCHAR(50)

    NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX ) Clustered Columnstore Index (default)
  13. Clustered Columnstore Index (default) + Highest level of data compression

    + Best overall performance + Generally outperform clustered index or heap tables - Not support varchar(max), nvarchar(max) and varbinary(max) - Optimal compression once there is more than 100 million rows - Less efficient for transient data, small or trickle load operations - For single row retrieval consider clustered index
  14. Heap Table + Temporarily landing data + Much faster loading

    to heap table + Support varchar(max), nvarchar(max) and varbinary(max) - For small lookup tables, less than 100 million rows
  15. CREATE TABLE [dbo].[Students] ( [StudentId] INT NOT NULL [Name] VARCHAR(50)

    NULL ) WITH ( CLUSTERED INDEX ( [StudentId] ) ) Clustered Index
  16. Clustered Index + Outperform all for single row or very

    few row lookup - Highly selective filter on the clustered column index
  17. Non-clustered Index + Can be added on many columns -

    Each index which is added to a table will add both space and processing time to loads
  18. SQL Data Warehouse query optimizer is a cost based optimizer.

    That is, it compares the cost of various query plans and then chooses the plan with the lowest cost, which should also be the plan that will execute the fastest. The way that you tell SQL Data Warehouse about your data, is by collecting statistics about your data.
  19. SQL Data Warehouse query optimizer is a cost based optimizer.

    That is, it compares the cost of various query plans and then chooses the plan with the lowest cost, which should also be the plan that will execute the fastest. The way that you tell SQL Data Warehouse about your data, is by collecting statistics about your data. The process of creating and updating statistics is currently a manual process
  20. Monitor your workload using DMVs Sessions - sys.dm_pdw_exec_sessions Requests -

    sys.dm_pdw_exec_requests Query plan - sys.dm_pdw_request_steps SQL on distributed databases - sys.dm_pdw_sql_requests Data movement - sys.dm_pdw_dms_workers Waiting queries - sys.dm_pdw_waits Resources - sys.dm_pdw_nodes_os_performance_counters
  21. Azure Machine Learning - Writer Writes limited to only 1

    row at a time!!! Use another destination instead NO
  22. Azure Data Factory - Load data to and from -

    Run stored procedure - Custom activity (pause and resume Azure DW)
  23. Backups • Azure Storage Blob snapshots Snapshots start a minimum

    of every eight hours and are available for seven days • Geo-redundant backups The Azure Storage RA-GRS feature replicates the backup files to a paired data center
  24. Summary 1/2 Don't use SQL DW for workloads that have:

    • High frequency reads and writes • Large numbers of singleton selects • High volumes of single row inserts • Row by row processing needs • Incompatible formats (JSON, XML)
  25. Summary 2/2 Use SQL DW for workloads that: • Have

    heavy reads and batch writes • Have only a few users • Can be modelled in one database • Can benefit from pause/resume functionality
  26. Have you learned something lately? Have you done something you

    are proud of? Then we invite you to share it with us!