Slide 1

Slide 1 text

Architecture patterns of Azure Cosmos DB & Azure Synapse Analytics Hiroyuki Nakazato Cloud Solution Architect - Data & Analytics, Microsoft Japan

Slide 2

Slide 2 text

Agenda 1. Technical overview 2. Architecture patterns 3. New feature - Custom partitioning (Preview) 4. Demo 5. Ideas for when to use which architecture pattern

Slide 3

Slide 3 text

Azure Synapse Analytics The first unified, cloud native platform for converged analytics Data Warehouse Data Lake Event Brokers Azure Synapse Link Cosmos DB Dataverse

Slide 4

Slide 4 text

Azure Synapse Link for Azure Cosmos DB Analytical Store Column store optimized for analytical queries Transactional Store Row store optimized for transactional operations Azure Cosmos DB Azure Synapse Analytics Container Cloud-Native HTAP Azure Synapse Link Serverless SQL pool Auto-Sync Machine learning Big data analytics BI Dashboards Operational Data Generate near real-time insights on your operational data Apache Spark pool Existing SQL API Containers - GA in March 2022 Newly created SQL API containers Newly created Mongo DB API containers Existing Mongo DB API Containers - On roadmap

Slide 5

Slide 5 text

Azure Cosmos DB Change Feed Mapping Data Flow GA in March 2022 Near real-time processing of data with Low-code / No-code Azure Synapse Analytics Azure Data Factory

Slide 6

Slide 6 text

Architecture pattern #1 - Cosmos DB Change Feed Cosmos DB Change Feed with Data Factory / Synapse Mapping Data Flow Mapping Data Flow Sources Cosmos DB Targets Synapse Analytics Data Factory Change Feed Write data to target 15+ data stores are supported with Mapping Data Flow Read data as source

Slide 7

Slide 7 text

Architecture pattern #2 - Synapse Link for Cosmos DB Near-real time analytics with Synapse Serverless SQL pool & Power BI Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Read data via query DirectQuery or Import

Slide 8

Slide 8 text

Architecture pattern #3 - Synapse Link for Cosmos DB Code based data processing with Synapse Spark pool Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Apache Spark pool Lake database (As metastore) Data Lake Storage Read data via query DirectQuery or Import Write processed data Read data as source

Slide 9

Slide 9 text

Architecture pattern #4 - Synapse Link for Cosmos DB Code free data processing with Synapse Mapping Data Flow Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Apache Spark pool Lake database (As metastore) Mapping Data Flow Data Lake Storage Read data via query DirectQuery or Import Write processed data Read data as source

Slide 10

Slide 10 text

Architecture pattern #5 - Synapse Link for Cosmos DB Cost-effective archival of historical data • Transactional Store and Analytical Store can have mutually independent Time-to-Live (TTL) • Analytical Store’s storage cost is cheaper than Transactional Store’s one: Enables to achieve cost-effective archival of historical data • Transaction Store’s TTL aka “TTTL” • Analytical Store’s TTL aka “ATTL” • If TTTL <= ATTL • Items in Transactional Store are deleted before items in Analytical Store • If ATTL is -1 • Analytical Store keeps all historical data • ATTL: On (no default) == ATTL -1 (See more info) https://docs.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction#cost-effective-archival-of-historical-data

Slide 11

Slide 11 text

Concepts of partitioning & partition pruning • In general, partitioning and partition pruning is important to improve the efficiency of reading data using Apache Spark • Partitioning - Separate directories to store data according to the value of fields that frequently specified in query filter conditions • Partition pruning - Avoid reading unnecessary partition data in query Data in ADLS Gen2 Spark SQL SELECT * FROM example_logs WHERE event_year=2022 AND event_month=02 example_logs / event_year=2022 / event_month=01 / xxx.parquet example_logs / event_year=2022 / event_month=02 / xxx.parquet example_logs / event_year=2022 / event_month=03 / xxx.parquet (Note) Analytical Store is not partitioned by default

Slide 12

Slide 12 text

Custom partitioning (Preview) When to use  High volume of update or delete operations  Slow data ingestion • Output data from Analytical Store partitioned by specified fields to ADLS Gen2 using Synapse Spark • The ADLS Gen2 is primary storage account linked to Synapse Workspace Benefits  Reduced data scanning from partition pruning  Query performance improvements Current limitations  Only available for Synapse Spark, Cosmos DB SQL API  Can only point to the primary storage account Additional costs  Synapse Spark (As compute)  ADLS Gen2 (As storage)

Slide 13

Slide 13 text

Related docs for custom partitioning (Preview) Content Link Product docs - Overview Custom partitioning in Azure Synapse Link for Azure Cosmos DB (Preview) | Microsoft Docs Product docs - How-to guide Configure custom partitioning to partition analytical store data (Preview) | Microsoft Docs Update published in Mar. 2022 Public preview: Azure Synapse Link for Azure Cosmos DB partitioning Spark 3.1 | Azure updates | Microsoft Azure Update published in Nov. 2021 Azure Synapse Link for Azure Cosmos DB: Custom partitioning support in public preview | Azure updates | Microsoft Azure

Slide 14

Slide 14 text

Demo 1. Cosmos DB Change Feed with Synapse Mapping Data Flow 2. Near-real time analytics with Synapse Serverless SQL pool & Power BI 3. Code free data processing with Synapse Mapping Data Flow 4. Custom partitioning (Preview)

Slide 15

Slide 15 text

Ideas for when to use which architecture pattern 1 Cosmos DB Change Feed with Data Factory / Synapse Mapping Data Flow • Prefer Low-code / No-code style • Don’t want to / can’t use Synapse Link for Cosmos DB • Need to process data with shorter latency than Synapse Link for Cosmos DB 2 Near-real time analytics with Synapse Serverless SQL pool & Power BI • Appropriate for near-real time analytics • BI with fully serverless, cost-effective way 3 Code based data processing with Synapse Spark pool • Good to use with #2 & Custom partitioning (Preview) • Large volume of data, which should be aggregated for analytics & BI 4 Code free data processing with Synapse Mapping Data Flow • Good to use with #2 & Custom partitioning (Preview) • Unlike #3, if Low-code / No-code style is preferable, use #4 5 Cost-effective archival of historical data • Need to retain read-only copy cost-effectively