Architecture patterns of Azure Cosmos DB & Azure Synapse Analytics

Architecture patterns of Azure Cosmos DB & Azure Synapse Analytics
Hiroyuki Nakazato Cloud Solution Architect - Data & Analytics, Microsoft Japan

Agenda 1. Technical overview 2. Architecture patterns 3. New feature
- Custom partitioning (Preview) 4. Demo 5. Ideas for when to use which architecture pattern

Azure Synapse Analytics The first unified, cloud native platform for
converged analytics Data Warehouse Data Lake Event Brokers Azure Synapse Link Cosmos DB Dataverse

Azure Synapse Link for Azure Cosmos DB Analytical Store Column
store optimized for analytical queries Transactional Store Row store optimized for transactional operations Azure Cosmos DB Azure Synapse Analytics Container Cloud-Native HTAP Azure Synapse Link Serverless SQL pool Auto-Sync Machine learning Big data analytics BI Dashboards Operational Data Generate near real-time insights on your operational data Apache Spark pool Existing SQL API Containers - GA in March 2022 Newly created SQL API containers Newly created Mongo DB API containers Existing Mongo DB API Containers - On roadmap

Azure Cosmos DB Change Feed Mapping Data Flow GA in
March 2022 Near real-time processing of data with Low-code / No-code Azure Synapse Analytics Azure Data Factory

Architecture pattern #1 - Cosmos DB Change Feed Cosmos DB
Change Feed with Data Factory / Synapse Mapping Data Flow Mapping Data Flow Sources Cosmos DB Targets Synapse Analytics Data Factory Change Feed Write data to target 15+ data stores are supported with Mapping Data Flow Read data as source

Architecture pattern #2 - Synapse Link for Cosmos DB Near-real
time analytics with Synapse Serverless SQL pool & Power BI Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Read data via query DirectQuery or Import

Architecture pattern #3 - Synapse Link for Cosmos DB Code
based data processing with Synapse Spark pool Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Apache Spark pool Lake database (As metastore) Data Lake Storage Read data via query DirectQuery or Import Write processed data Read data as source

Architecture pattern #4 - Synapse Link for Cosmos DB Code
free data processing with Synapse Mapping Data Flow Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Apache Spark pool Lake database (As metastore) Mapping Data Flow Data Lake Storage Read data via query DirectQuery or Import Write processed data Read data as source

Architecture pattern #5 - Synapse Link for Cosmos DB Cost-effective
archival of historical data • Transactional Store and Analytical Store can have mutually independent Time-to-Live (TTL) • Analytical Store’s storage cost is cheaper than Transactional Store’s one: Enables to achieve cost-effective archival of historical data • Transaction Store’s TTL aka “TTTL” • Analytical Store’s TTL aka “ATTL” • If TTTL <= ATTL • Items in Transactional Store are deleted before items in Analytical Store • If ATTL is -1 • Analytical Store keeps all historical data • ATTL: On (no default) == ATTL -1 (See more info) https://docs.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction#cost-effective-archival-of-historical-data

Concepts of partitioning & partition pruning • In general, partitioning
and partition pruning is important to improve the efficiency of reading data using Apache Spark • Partitioning - Separate directories to store data according to the value of fields that frequently specified in query filter conditions • Partition pruning - Avoid reading unnecessary partition data in query Data in ADLS Gen2 Spark SQL SELECT * FROM example_logs WHERE event_year=2022 AND event_month=02 example_logs / event_year=2022 / event_month=01 / xxx.parquet example_logs / event_year=2022 / event_month=02 / xxx.parquet example_logs / event_year=2022 / event_month=03 / xxx.parquet (Note) Analytical Store is not partitioned by default

Custom partitioning (Preview) When to use  High volume of
update or delete operations  Slow data ingestion • Output data from Analytical Store partitioned by specified fields to ADLS Gen2 using Synapse Spark • The ADLS Gen2 is primary storage account linked to Synapse Workspace Benefits  Reduced data scanning from partition pruning  Query performance improvements Current limitations  Only available for Synapse Spark, Cosmos DB SQL API  Can only point to the primary storage account Additional costs  Synapse Spark (As compute)  ADLS Gen2 (As storage)

Related docs for custom partitioning (Preview) Content Link Product docs
- Overview Custom partitioning in Azure Synapse Link for Azure Cosmos DB (Preview) | Microsoft Docs Product docs - How-to guide Configure custom partitioning to partition analytical store data (Preview) | Microsoft Docs Update published in Mar. 2022 Public preview: Azure Synapse Link for Azure Cosmos DB partitioning Spark 3.1 | Azure updates | Microsoft Azure Update published in Nov. 2021 Azure Synapse Link for Azure Cosmos DB: Custom partitioning support in public preview | Azure updates | Microsoft Azure

Demo 1. Cosmos DB Change Feed with Synapse Mapping Data
Flow 2. Near-real time analytics with Synapse Serverless SQL pool & Power BI 3. Code free data processing with Synapse Mapping Data Flow 4. Custom partitioning (Preview)

Ideas for when to use which architecture pattern 1 Cosmos
DB Change Feed with Data Factory / Synapse Mapping Data Flow • Prefer Low-code / No-code style • Don’t want to / can’t use Synapse Link for Cosmos DB • Need to process data with shorter latency than Synapse Link for Cosmos DB 2 Near-real time analytics with Synapse Serverless SQL pool & Power BI • Appropriate for near-real time analytics • BI with fully serverless, cost-effective way 3 Code based data processing with Synapse Spark pool • Good to use with #2 & Custom partitioning (Preview) • Large volume of data, which should be aggregated for analytics & BI 4 Code free data processing with Synapse Mapping Data Flow • Good to use with #2 & Custom partitioning (Preview) • Unlike #3, if Low-code / No-code style is preferable, use #4 5 Cost-effective archival of historical data • Need to retain read-only copy cost-effectively

Architecture patterns of Azure Cosmos DB & Azur...

Architecture patterns of Azure Cosmos DB & Azure Synapse Analytics

Hiroyuki Nakazato / 中里浩之

More Decks by Hiroyuki Nakazato / 中里浩之

Other Decks in Technology

Featured

Transcript