
Azure SQL Data Warehouse case study

Presented at PASS Summit 2016.

Bence Faludi

October 26, 2016

Transcript

  1. Wunderlist: Whether you’re planning a holiday, sharing a shopping list with a partner or managing multiple work projects, Wunderlist is here to help you tick off all your personal and professional to-dos. 20+ million users. Wunderlist is the easiest way to get stuff done.
  2. Previous data architecture on AWS: clients and client tracking push events through a queue; raw logs (400+ GB compressed/day) are standardized and filtered (5 GB compressed/day) into cloud storage. Redshift hot storage holds aggregated values and generated KPIs (2.5 TB compressed); Redshift cold storage holds the raw data (8 TB compressed, 30 GB incremental compressed/day). External sources (payments, app store information, etc.) and the micro services’ databases feed in as well; analysts & data scientists query through Chart.io.
  3. We needed to move from AWS to Azure because Microsoft has its own infrastructure: as part of the acquisition, these are investments our parent company has made and we must adopt them. This helps us share data & information more freely between products. Data compliance allows us to reach user groups we cannot otherwise reach. We’d like to influence the development of data-related products and help Microsoft make industry-leading services. We also wanted to replace some of our homebrew solutions with a more reliable product.
  4. Moving from AWS to Azure:
     Amazon S3 → Azure Blob Storage
     Amazon SNS/SQS → Apache Kafka based solution + dequeuing service
     Amazon Redshift → Azure Data Lake + Azure SQL Data Warehouse
     Ubuntu 14.04 night-shift as ETL → PolyBase + Ubuntu 14.04 night-shift as ETL
  5. Current data architecture on Azure: clients, client tracking and backend APIs push events through a queue; raw logs are standardized and filtered into cloud storage. Azure SQL DW stores both the raw and the aggregated values (8 + 2.5 TB). External sources (payments, app store information, etc.) feed in as well; analysts & data scientists work through Chart.io, PowerBI and AppInsight Analytics.
  6. Inside the box: Using PolyBase for quick data loading. Using the same distribution key everywhere to improve performance and minimize data movement. Partitions are fantastic for storing raw data; no need to create multiple tables. Using the largerc resource class for the daily ELT runs and smallrc for the visualization tools. ELT: do the transformations within the SQL DW; we do KPI calculations, aggregations and data enrichment there. Creating VIEWs for the most popular queries and questions. (A T-SQL sketch follows below.)
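    A minimal T-SQL sketch of the patterns above, not from the deck itself; the table, column and login names (user_events, user_id, etl_user) are hypothetical:

        -- Hash-distribute on the shared distribution key and partition the raw
        -- data by month, so joins co-locate and cross-node data movement is minimized.
        CREATE TABLE dbo.user_events
        (
            user_id    BIGINT      NOT NULL,
            event_type VARCHAR(50) NOT NULL,
            event_date DATE        NOT NULL
        )
        WITH
        (
            DISTRIBUTION = HASH(user_id),
            PARTITION (event_date RANGE RIGHT FOR VALUES ('2016-09-01', '2016-10-01', '2016-11-01'))
        );

        -- Resource classes are database roles: give the nightly ELT login more
        -- memory (largerc) and leave visualization-tool logins on the default smallrc.
        EXEC sp_addrolemember 'largerc', 'etl_user';

        -- ELT: transform inside the SQL DW, e.g. materialize a KPI with CTAS ...
        CREATE TABLE dbo.daily_user_activity
        WITH (DISTRIBUTION = HASH(user_id))
        AS
        SELECT event_date, user_id, COUNT(*) AS events
        FROM dbo.user_events
        GROUP BY event_date, user_id;

        -- ... and expose the most popular questions as views.
        CREATE VIEW dbo.v_daily_active_users AS
        SELECT event_date, COUNT(DISTINCT user_id) AS dau
        FROM dbo.user_events
        GROUP BY event_date;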
  7. Migration Best Practices: Migrate the biggest tables first; you have to test your heaviest queries and figure out the distribution & partition settings. Only migrate your raw data and calculate the aggregations within the database (ELT). Create statistics for every column & update statistics after every load. Use the same distribution column. Create as many compressed files as possible when exporting/importing your table’s data, to decrease the transferred size and to run parallel processes. The COPY command supports more date formats, but it is not comfortable for defining the schema of your data. (A PolyBase/statistics sketch follows below.)
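    A sketch of a PolyBase load from compressed files plus the statistics routine; the storage account, container, table and statistics names are hypothetical, and a database scoped credential may also be required for private containers:

        -- External data source over the blob container holding the exported,
        -- gzip-compressed files.
        CREATE EXTERNAL DATA SOURCE wasb_exports
        WITH (
            TYPE = HADOOP,
            LOCATION = 'wasbs://exports@mystorageaccount.blob.core.windows.net'
        );

        -- Many smaller compressed files let PolyBase read them in parallel.
        CREATE EXTERNAL FILE FORMAT gzip_pipe
        WITH (
            FORMAT_TYPE = DELIMITEDTEXT,
            FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
            DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
        );

        CREATE EXTERNAL TABLE dbo.user_events_ext
        (
            user_id    BIGINT,
            event_type VARCHAR(50),
            event_date DATE
        )
        WITH (LOCATION = '/user_events/', DATA_SOURCE = wasb_exports, FILE_FORMAT = gzip_pipe);

        -- Land only the raw data with CTAS; aggregate later inside the warehouse.
        CREATE TABLE dbo.user_events
        WITH (DISTRIBUTION = HASH(user_id))
        AS SELECT * FROM dbo.user_events_ext;

        -- SQL DW does not create or update statistics automatically: create them
        -- per column and refresh them after every load.
        CREATE STATISTICS st_user_id    ON dbo.user_events (user_id);
        CREATE STATISTICS st_event_date ON dbo.user_events (event_date);
        UPDATE STATISTICS dbo.user_events;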
  8. Findings and wish list for Azure SQL DW. Wish: JSON support for SQL DW and/or PolyBase. Wish: an improved monitoring dashboard. Performance is as good as Amazon Redshift’s. Stable and reliable. Easy and quick to scale up using T-SQL code (see the sketch below). Perfect for doing the transformations within the DWH. Carefully consider the distribution key, otherwise you will have to redistribute frequently. Connecting from Linux/OS X is challenging; we made Cheetah to solve this issue.
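    Scaling up really is a single T-SQL statement; a sketch, assuming a hypothetical warehouse name (mydwh) and target size:

        -- Run against the master database: resize the warehouse to 400 DWUs.
        ALTER DATABASE mydwh
        MODIFY (SERVICE_OBJECTIVE = 'DW400');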