Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Architecture patterns of Azure Cosmos DB & Azure Synapse Analytics

Architecture patterns of Azure Cosmos DB & Azure Synapse Analytics

More Decks by Hiroyuki Nakazato / 中里 浩之

Other Decks in Technology

Transcript

  1. Architecture patterns of Azure Cosmos DB & Azure Synapse Analytics

    Hiroyuki Nakazato Cloud Solution Architect - Data & Analytics, Microsoft Japan
  2. Agenda 1. Technical overview 2. Architecture patterns 3. New feature

    - Custom partitioning (Preview) 4. Demo 5. Ideas for when to use which architecture pattern
  3. Azure Synapse Analytics The first unified, cloud native platform for

    converged analytics Data Warehouse Data Lake Event Brokers Azure Synapse Link Cosmos DB Dataverse
  4. Azure Synapse Link for Azure Cosmos DB Analytical Store Column

    store optimized for analytical queries Transactional Store Row store optimized for transactional operations Azure Cosmos DB Azure Synapse Analytics Container Cloud-Native HTAP Azure Synapse Link Serverless SQL pool Auto-Sync Machine learning Big data analytics BI Dashboards Operational Data Generate near real-time insights on your operational data Apache Spark pool Existing SQL API Containers - GA in March 2022 Newly created SQL API containers Newly created Mongo DB API containers Existing Mongo DB API Containers - On roadmap
  5. Azure Cosmos DB Change Feed Mapping Data Flow GA in

    March 2022 Near real-time processing of data with Low-code / No-code Azure Synapse Analytics Azure Data Factory
  6. Architecture pattern #1 - Cosmos DB Change Feed Cosmos DB

    Change Feed with Data Factory / Synapse Mapping Data Flow Mapping Data Flow Sources Cosmos DB Targets Synapse Analytics Data Factory Change Feed Write data to target 15+ data stores are supported with Mapping Data Flow Read data as source
  7. Architecture pattern #2 - Synapse Link for Cosmos DB Near-real

    time analytics with Synapse Serverless SQL pool & Power BI Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Read data via query DirectQuery or Import
  8. Architecture pattern #3 - Synapse Link for Cosmos DB Code

    based data processing with Synapse Spark pool Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Apache Spark pool Lake database (As metastore) Data Lake Storage Read data via query DirectQuery or Import Write processed data Read data as source
  9. Architecture pattern #4 - Synapse Link for Cosmos DB Code

    free data processing with Synapse Mapping Data Flow Serverless SQL pool Sources Transactional Store Synapse Analytics Analytical Store Cosmos DB (Synapse Link enabled) Auto sync SQL database (As metastore) Power BI Apache Spark pool Lake database (As metastore) Mapping Data Flow Data Lake Storage Read data via query DirectQuery or Import Write processed data Read data as source
  10. Architecture pattern #5 - Synapse Link for Cosmos DB Cost-effective

    archival of historical data • Transactional Store and Analytical Store can have mutually independent Time-to-Live (TTL) • Analytical Store’s storage cost is cheaper than Transactional Store’s one: Enables to achieve cost-effective archival of historical data • Transaction Store’s TTL aka “TTTL” • Analytical Store’s TTL aka “ATTL” • If TTTL <= ATTL • Items in Transactional Store are deleted before items in Analytical Store • If ATTL is -1 • Analytical Store keeps all historical data • ATTL: On (no default) == ATTL -1 (See more info) https://docs.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction#cost-effective-archival-of-historical-data
  11. Concepts of partitioning & partition pruning • In general, partitioning

    and partition pruning is important to improve the efficiency of reading data using Apache Spark • Partitioning - Separate directories to store data according to the value of fields that frequently specified in query filter conditions • Partition pruning - Avoid reading unnecessary partition data in query Data in ADLS Gen2 Spark SQL SELECT * FROM example_logs WHERE event_year=2022 AND event_month=02 example_logs / event_year=2022 / event_month=01 / xxx.parquet example_logs / event_year=2022 / event_month=02 / xxx.parquet example_logs / event_year=2022 / event_month=03 / xxx.parquet (Note) Analytical Store is not partitioned by default
  12. Custom partitioning (Preview) When to use  High volume of

    update or delete operations  Slow data ingestion • Output data from Analytical Store partitioned by specified fields to ADLS Gen2 using Synapse Spark • The ADLS Gen2 is primary storage account linked to Synapse Workspace Benefits  Reduced data scanning from partition pruning  Query performance improvements Current limitations  Only available for Synapse Spark, Cosmos DB SQL API  Can only point to the primary storage account Additional costs  Synapse Spark (As compute)  ADLS Gen2 (As storage)
  13. Related docs for custom partitioning (Preview) Content Link Product docs

    - Overview Custom partitioning in Azure Synapse Link for Azure Cosmos DB (Preview) | Microsoft Docs Product docs - How-to guide Configure custom partitioning to partition analytical store data (Preview) | Microsoft Docs Update published in Mar. 2022 Public preview: Azure Synapse Link for Azure Cosmos DB partitioning Spark 3.1 | Azure updates | Microsoft Azure Update published in Nov. 2021 Azure Synapse Link for Azure Cosmos DB: Custom partitioning support in public preview | Azure updates | Microsoft Azure
  14. Demo 1. Cosmos DB Change Feed with Synapse Mapping Data

    Flow 2. Near-real time analytics with Synapse Serverless SQL pool & Power BI 3. Code free data processing with Synapse Mapping Data Flow 4. Custom partitioning (Preview)
  15. Ideas for when to use which architecture pattern 1 Cosmos

    DB Change Feed with Data Factory / Synapse Mapping Data Flow • Prefer Low-code / No-code style • Don’t want to / can’t use Synapse Link for Cosmos DB • Need to process data with shorter latency than Synapse Link for Cosmos DB 2 Near-real time analytics with Synapse Serverless SQL pool & Power BI • Appropriate for near-real time analytics • BI with fully serverless, cost-effective way 3 Code based data processing with Synapse Spark pool • Good to use with #2 & Custom partitioning (Preview) • Large volume of data, which should be aggregated for analytics & BI 4 Code free data processing with Synapse Mapping Data Flow • Good to use with #2 & Custom partitioning (Preview) • Unlike #3, if Low-code / No-code style is preferable, use #4 5 Cost-effective archival of historical data • Need to retain read-only copy cost-effectively