Getting Started With Azure SQL Data Warehouse

Valdas Maksimavičius

About me
Valdas Maksimavičius
• MS Dynamics, .NET, Web, BI, Analytics, Azure
• Analytics Architect at Cognizant
• Between Poland and Lithuania
• blog.thevaldas.com

Getting Started With Azure SQL Data Warehouse
1. IaaS, PaaS, MPP (Massively Parallel Processing)
2. Data distribution (Round Robin, Hash)
3. Performance (indexes, statistics, monitoring)
4. Tools and services
5. Security and backups

Azure SQL Data Warehouse is a SQL-based, fully managed, petabyte-scale cloud data warehouse. It’s highly elastic: you can set it up in minutes, scale capacity in seconds, and scale compute and storage independently.

IaaS vs PaaS

Infrastructure as a Service
Managed by Microsoft: Storage, Servers, Networking, Virtualization
You scale, make resilient & manage: O/S, Middleware, Runtime, Data, Applications

Platform as a Service
Scale, resilience and management by Microsoft: Storage, Servers, Networking, Virtualization, O/S, Middleware, Runtime
You manage: Applications, Data

SMP vs MPP Symmetric Multiprocessing vs Massively Parallel Processing

SMP (Symmetric Multiprocessing)

MPP (Massively Parallel Processing)

SMP
+ OLTP (i.e. applications with individual updates, inserts, and deletes)
+ Relatively cheap
- Relatively slow

MPP
+ Data warehousing (heavy reads and batch writes)
+ Scaling horizontally
- Complex
- Expensive

Microsoft & Data Warehouse
2008: Microsoft acquired DATAllegro. The company was viewed as the most efficient way to bring MPP to the SQL Server world.
2010: Fast Track Data Warehouse. Launch of DW reference architectures based on SMP DW best practices, offered with leading hardware partners.
2011: Parallel Data Warehouse v1. DATAllegro product on Windows & SQL. First DW appliance by Microsoft, in partnership with Dell and HP.
2013: Parallel Data Warehouse v2. Re-architected product delivering new form factors and greatly improved price/performance.
2014: Analytics Platform System (APS). Introduction of a Hadoop region within the appliance and new naming to reflect broader Big Data capabilities.
2015: SQL DW Service. Introduction of the Azure SQL DW Service based on APS’s MPP capabilities.

Azure SQL vs Azure DW

Azure Data Warehouse

Manage compute and workload

Result

ALTER DATABASE EnterpriseDW MODIFY (SERVICE_OBJECTIVE = 'DW1000')
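As a rough sketch of how the scaling statement above might be used in practice: the first statement changes the service objective, and the second reads back the current one. The database name EnterpriseDW comes from the slide; the target 'DW400' is an arbitrary illustrative value, and both statements are assumed to be run while connected to the master database of the logical server.

```sql
-- Scale the warehouse to a different DWU level ('DW400' is only an example value).
ALTER DATABASE EnterpriseDW MODIFY (SERVICE_OBJECTIVE = 'DW400');

-- Read back the current service objective of each database on the server.
SELECT db.name              AS database_name,
       ds.edition           AS edition,
       ds.service_objective AS service_objective
FROM sys.database_service_objectives AS ds
JOIN sys.databases AS db
    ON ds.database_id = db.database_id;
```

Scaling is not instantaneous for running work, so it is probably safest to issue it between batch loads rather than during them.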

Distributed tables

Round-robin distributed tables (default)

CREATE TABLE [dbo].[Students]
(
    [StudentId] INT NOT NULL,
    [Name] VARCHAR(50) NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN )

Hash distributed tables Hash Function

CREATE TABLE [dbo].[Students]
(
    [StudentId] INT NOT NULL,
    [Name] VARCHAR(50) NULL
)
WITH ( DISTRIBUTION = HASH( [StudentId] ) )
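A hash column only helps if it spreads rows evenly across distributions. One possible way to check this, sketched here against the dbo.Students table from the slides, is to look at the per-distribution row counts that DBCC PDW_SHOWSPACEUSED reports. Which amount of skew is acceptable is a judgment call and is not stated in the deck.

```sql
-- Show rows and space used per distribution for the table.
-- Large differences in row counts between distributions suggest data skew,
-- in which case a different hash column may be a better choice.
DBCC PDW_SHOWSPACEUSED('dbo.Students');
```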

Resource classes smallrc, mediumrc, largerc, and xlargerc

EXEC sp_addrolemember 'largerc', 'LoadUser'
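Resource classes are assigned through database role membership, so the statement above can be paired with a matching removal once the heavy work is done. The following is a minimal sketch assuming the LoadUser account from the slide; the surrounding load step is left as a placeholder comment.

```sql
-- Give LoadUser more memory per query for a large load.
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- ... run the load as LoadUser here ...

-- Return LoadUser to the default resource class (smallrc).
EXEC sp_droprolemember 'largerc', 'LoadUser';

-- List which users currently belong to the non-default resource classes.
SELECT r.name AS resource_class,
       m.name AS member_name
FROM sys.database_role_members AS rm
JOIN sys.database_principals AS r ON rm.role_principal_id   = r.principal_id
JOIN sys.database_principals AS m ON rm.member_principal_id = m.principal_id
WHERE r.name IN ('mediumrc', 'largerc', 'xlargerc');
```

Larger resource classes give each query more memory but reduce how many queries can run concurrently, so keeping users in smallrc by default is usually a reasonable starting point.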

Resource classes (memory per distribution, MB)
DWU      smallrc (default)   mediumrc   largerc   xlargerc
DW100    100                 100        200       400
DW500    100                 400        800       1,600
DW6000   100                 3,200      6,400     12,800

Resource classes (memory system-wide, GB)
DWU      smallrc (default)   mediumrc   largerc   xlargerc
DW100    6                   6          12        23
DW500    6                   23         47        94
DW6000   6                   188        375       750
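The two tables appear to be related by the number of distributions: SQL Data Warehouse spreads every table over 60 distributions, so the system-wide figure is roughly the per-distribution figure multiplied by 60 and converted from MB to GB. The factor of 60 is not stated on the slides; the arithmetic below is only a sanity check of that reading.

```sql
-- Per-distribution memory (MB) x 60 distributions, converted to GB.
SELECT 100   * 60 / 1024.0 AS smallrc_gb,          -- about 5.9, shown as 6
       400   * 60 / 1024.0 AS mediumrc_dw500_gb,   -- about 23.4, shown as 23
       12800 * 60 / 1024.0 AS xlargerc_dw6000_gb;  -- 750
```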

Indexes

Clustered Columnstore Index (default)

CREATE TABLE [dbo].[Students]
(
    [StudentId] INT NOT NULL,
    [Name] VARCHAR(50) NULL
)
WITH ( CLUSTERED COLUMNSTORE INDEX )

Clustered Columnstore Index (default)
+ Highest level of data compression
+ Best overall performance
+ Generally outperforms clustered index or heap tables
- Does not support varchar(max), nvarchar(max) and varbinary(max)
- Optimal compression only once there are more than 100 million rows
- Less efficient for transient data, small loads or trickle load operations
- For single-row retrieval, consider a clustered index instead

Heap Table

CREATE TABLE [dbo].[Students]
(
    [StudentId] INT NOT NULL,
    [Name] VARCHAR(50) NULL
)
WITH ( HEAP )

Heap Table
+ Good for temporarily landing data
+ Much faster loading than into an indexed table
+ Supports varchar(max), nvarchar(max) and varbinary(max)
- Suited mainly to small lookup tables, less than 100 million rows
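A common way to combine the two table types is to land data in a heap first and then copy it into a columnstore table with CREATE TABLE AS SELECT (CTAS). The sketch below assumes the Students columns from the earlier slides; the table names Students_Stage and Students_Final are hypothetical and only serve the illustration.

```sql
-- Staging table: a heap loads quickly and suits transient data.
CREATE TABLE [dbo].[Students_Stage]
(
    [StudentId] INT NOT NULL,
    [Name] VARCHAR(50) NULL
)
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN );

-- ... load raw data into dbo.Students_Stage here ...

-- Final table: hash-distributed clustered columnstore, built in one
-- parallel operation from the staged rows.
CREATE TABLE [dbo].[Students_Final]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH( [StudentId] )
)
AS
SELECT [StudentId], [Name]
FROM [dbo].[Students_Stage];
```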

Clustered Index

CREATE TABLE [dbo].[Students]
(
    [StudentId] INT NOT NULL,
    [Name] VARCHAR(50) NULL
)
WITH ( CLUSTERED INDEX ( [StudentId] ) )

Clustered Index
+ Outperforms the other options for lookups of a single row or very few rows
- Only benefits queries with a highly selective filter on the clustered index column

Non-clustered Index

CREATE INDEX studentIdIndex ON Students (studentId)

Non-clustered Index
+ Can be added on many columns
- Each index added to a table adds both space and processing time to loads

Statistics

The SQL Data Warehouse query optimizer is a cost-based optimizer. That is, it compares the cost of various query plans and then chooses the plan with the lowest cost, which should also be the plan that executes the fastest. You tell SQL Data Warehouse about your data by collecting statistics on it.

The process of creating and updating statistics is currently a manual process.
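Since statistics are not maintained automatically, they have to be created and refreshed with explicit statements. A minimal sketch for the Students table from the earlier slides follows; the statistics name stats_Students_StudentId is a hypothetical naming choice.

```sql
-- Create single-column statistics on a column used in joins or filters.
CREATE STATISTICS stats_Students_StudentId
ON [dbo].[Students] ( [StudentId] );

-- Refresh one statistics object after a load changes the data.
UPDATE STATISTICS [dbo].[Students] ( stats_Students_StudentId );

-- Or refresh every statistics object on the table.
UPDATE STATISTICS [dbo].[Students];
```

Updating statistics as the last step of each load is one simple way to keep the optimizer's picture of the data current.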

Monitoring

NOT AVAILABLE!

Monitor your workload using DMVs
Sessions - sys.dm_pdw_exec_sessions
Requests - sys.dm_pdw_exec_requests
Query plan - sys.dm_pdw_request_steps
SQL on distributed databases - sys.dm_pdw_sql_requests
Data movement - sys.dm_pdw_dms_workers
Waiting queries - sys.dm_pdw_waits
Resources - sys.dm_pdw_nodes_os_performance_counters
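The DMVs above are typically used together: find a slow request first, then drill into its steps. The queries below are a sketch of that workflow; the request id 'QID1234' is a placeholder and should be replaced with a value returned by the first query.

```sql
-- Longest-running requests that are still active.
SELECT TOP 10
       request_id,
       session_id,
       status,
       submit_time,
       total_elapsed_time,
       command
FROM sys.dm_pdw_exec_requests
WHERE status NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY total_elapsed_time DESC;

-- Distributed query plan steps for one request ('QID1234' is a placeholder).
SELECT step_index,
       operation_type,
       status,
       total_elapsed_time,
       row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'
ORDER BY step_index;
```

Steps whose operation type involves data movement are often the expensive ones, which ties back to the choice of distribution column.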

Tools and other services

Development tools

Don’t be too enthusiastic

Redgate tools

Works fine, but PowerBI skips schema names

Azure Machine Learning - Reader OK

Azure Machine Learning - Writer NO
Writes are limited to only 1 row at a time! Use another destination instead.

Azure Stream Analytics for real time loading

Azure Data Factory
- Load data to and from
- Run stored procedure
- Custom activity (pause and resume Azure DW)

Polybase

Security

Active Directory Integrated Authentication

Backups

Backups
• Azure Storage Blob snapshots: snapshots start a minimum of every eight hours and are available for seven days
• Geo-redundant backups: the Azure Storage RA-GRS feature replicates the backup files to a paired data center

Limitations

Summary 1/2
Don't use SQL DW for workloads that have:
• High frequency reads and writes
• Large numbers of singleton selects
• High volumes of single row inserts
• Row by row processing needs
• Incompatible formats (JSON, XML)

Summary 2/2
Use SQL DW for workloads that:
• Have heavy reads and batch writes
• Have only a few users
• Can be modelled in one database
• Can benefit from pause/resume functionality

Questions

Have you learned something lately? Have you done something you
are proud of? Then we invite you to share it with us!
