Modeling data and best practices for the Azure Cosmos DB.

Mohammad Asif Waquar @asifwaquar

About me: Senior Software Engineer at ABN AMRO https://www.linkedin.com/in/mohammad-asif-6a6153111/

SQL PASS Chapter Team @arrnagaraj @Sachit_Keshari @SanjivVenkatram @sarbjitgill @aaroh_bits @Pioisms

Agenda
• Intro Cosmos DB
• Resource Model
• Data Modelling Strategy & Partitioning
• Demo SQL API

Azure Cosmos DB
A globally distributed, massively scalable, multi-model database service
• Turnkey global distribution
• Elastic scale out of storage & throughput
• Guaranteed low latency at the 99th percentile
• Five well-defined consistency models
• Comprehensive SLAs
Data models: Key-value, Column-family, Document, Graph
APIs: Table API, Cosmos DB’s API for MongoDB

Features
• Multi-model data paradigm: key-value, document, graph, column-family
• Low latency for 99% of operations: less than 10 ms for reads and less than 15 ms for (indexed) writes
• Designed for high throughput
• SLA-backed availability, data consistency and latency, with availability of up to 99.999%
• Configurable throughput
• Automatic replication (master-slave)
• Automatic data indexing
• Five configurable consistency levels: Strong, Bounded Staleness, Session, Consistent Prefix, Eventual

HOW’S THE THROUGHPUT ?

Resource Model

Resource Hierarchy
CONTAINERS: Logical resources “surfaced” to APIs as tables, collections or graphs, which are made up of one or more physical partitions or servers.
RESOURCE PARTITIONS
• Consistent, highly available, and resource-governed coordination primitives
• Consist of replica sets, with each replica hosting an instance of the database engine
[Diagram labels: Tenants; Containers (Collections, Tables, Graphs); Resource Partitions; Replica Set (Leader, Follower, Follower, Forwarder); To remote resource partition(s)]

[Diagram: Account → Database → Container → Item]
Account URI and Credentials: ********.azure.com pass…
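The hierarchy on this slide can be sketched as an address that names one resource at each level. This is a minimal illustration, not the SDK: the `ItemAddress` class and the names `portfolio` and `users` are hypothetical, and the `dbs/…/colls/…/docs/…` layout follows the resource link format used by the Cosmos DB REST API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemAddress:
    """Names one item by walking Account -> Database -> Container -> Item."""
    database: str
    container: str
    item: str

    def link(self) -> str:
        # Resource link relative to the account URI.
        return f"dbs/{self.database}/colls/{self.container}/docs/{self.item}"


# Hypothetical database, container and item names.
address = ItemAddress(database="portfolio", container="users", item="1")
print(address.link())
```

The account URI and key identify the account; everything below it is addressed by id, which is why each level only needs a name that is unique within its parent.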

Creating Account

Database Representations

Container Representations: Container = Collection, Graph, Table

Item Representations: Collection → Document, Graph → Vertices/Edges, Table → Row

Container-Level Resources: Conflict, Stored procedure, Trigger, UDF

Data Modelling Strategy & Partitioning

Ways to Model Your Data
• Normalize everything
• Embed as 1 piece

Data Modelling: Relational vs. Document

Relational Store

User Table
UserID  Name        Dob
1       John Smith  8/30/1964

Holdings Table
StockID  UserID  Qty  Symbol
1        1       100  MSFT
2        1       75   WMT

Document Store

Document
{
  "id": 1,
  "name": "John Smith",
  "dob": "1964-08-30",
  "holdings": [
    { "qty": 100, "symbol": "MSFT" },
    { "qty": 75, "symbol": "WMT" }
  ]
}

Relational Store         Document Store
Rows                     Documents
Columns                  Properties
Strongly-typed schemas   Schema-free
Highly normalized        Typically denormalized
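The mapping from the two relational tables to one document can be sketched in a few lines of Python. This is only an illustration of the denormalization step using the rows from the slide; the function name `to_documents` is hypothetical, and no Cosmos DB call is involved.

```python
from collections import defaultdict

# Rows as they appear in the relational tables on the slide.
users = [{"UserID": 1, "Name": "John Smith", "Dob": "1964-08-30"}]
holdings = [
    {"StockID": 1, "UserID": 1, "Qty": 100, "Symbol": "MSFT"},
    {"StockID": 2, "UserID": 1, "Qty": 75, "Symbol": "WMT"},
]


def to_documents(users, holdings):
    """Embed each user's holdings rows into a single user document."""
    by_user = defaultdict(list)
    for row in holdings:
        # The foreign key (UserID) disappears: the parent document owns the rows.
        by_user[row["UserID"]].append({"qty": row["Qty"], "symbol": row["Symbol"]})
    return [
        {
            "id": user["UserID"],
            "name": user["Name"],
            "dob": user["Dob"],
            "holdings": by_user[user["UserID"]],
        }
        for user in users
    ]


documents = to_documents(users, holdings)
print(documents[0])
```

One read of the resulting document returns the user together with all holdings, which is the join the relational model would have to perform at query time.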

Modelling challenges
• How to de-normalize ?
• How to normalize ?
• To embed or reference ?
• Can I apply joins ?
• Should I put data types in the same collection, or different ones ?

Modelling challenges: To embed or reference ?

Embed

Document
{
  "id": 1,
  "name": "John Smith",
  "dob": "1964-08-30",
  "holdings": [
    { "qty": 100, "symbol": "MSFT" },
    { "qty": 75, "symbol": "WMT" }
  ]
}

Document
{
  "postid": "1",
  "title": "My blog post",
  "body": "Post content…",
  "comments": [
    "comment #1",
    "comment #2",
    "comment #3",
    "comment #4",
    :
    "comment #1598873",
    :
  ]
}

Reference

Document
{
  "postid": "1",
  "title": "My blog post",
  "body": "Post content…"
}

Document
{
  "postid": "1",
  "comment": "comment #3"
}

When to embed ?
o Data that is queried together should live together.
o Child data is dependent on the parent.
o 1:1 relationships, e.g. every customer has one email, phone and NRIC number.
o Data that doesn’t change frequently, e.g. email and address.
o Embedding usually gives better read performance at the cost of write performance, so it is a good fit when the workload is not write-heavy.

When to reference ?
o 1 : many (unbounded relationship)
o many : many relationships
o Data changes at different rates
o What is referenced is heavily referenced by many others
o Typically provides better write performance
o But may require more network calls for reads
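The unbounded 1 : many case can be sketched with the blog post from the earlier slide: the embedded comments array is pulled out into one document per comment, each pointing back at its post. The function name `split_comments` and the `id` scheme for comment documents are assumptions made for this illustration.

```python
def split_comments(post):
    """Turn one post with embedded comments into a post document plus
    one document per comment that references the post by its postid."""
    post_doc = {key: value for key, value in post.items() if key != "comments"}
    comment_docs = [
        {
            "id": f"{post['postid']}-c{number}",  # hypothetical id scheme
            "postid": post["postid"],             # the reference back to the post
            "comment": text,
        }
        for number, text in enumerate(post.get("comments", []), start=1)
    ]
    return post_doc, comment_docs


embedded = {
    "postid": "1",
    "title": "My blog post",
    "body": "Post content…",
    "comments": ["comment #1", "comment #2", "comment #3"],
}

post_doc, comment_docs = split_comments(embedded)
print(post_doc)
print(comment_docs[2])
```

Adding a comment now writes one small document instead of rewriting an ever-growing post, at the price of a second read when the post and its comments are shown together.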

Partitioning
Why is choice of partition key so important?
o Enables your data in Cosmos DB to scale
o Large impact on performance of system
What can go wrong?
o Hot partitions
o Choice forces many cross-partition queries for workload

Partitioning
Logical partition: Stores all data associated with the same partition key value
Physical partition: Fixed amount of reserved SSD-backed storage + compute.
Cosmos DB distributes logical partitions among a smaller number of physical partitions.
From your perspective: define 1 partition key per container

Cosmos DB Container (e.g. Collection)
Partition Key: User Id
Logical Partitioning Abstraction
Behind the Scenes: Physical Partition Sets
hash(User Id): Pseudo-random distribution of data over range of possible hashed values
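A rough sketch of the idea behind hash(User Id): the partition key value is hashed to a point in a fixed range, and that range is cut into equal slices, one per physical partition. MD5, the 32-bit hash space and the function names are assumptions chosen for the illustration; they are not the hash function or range layout Cosmos DB uses internally.

```python
import hashlib

HASH_SPACE = 2 ** 32  # size of the illustrative hash range (an assumption)


def hash_key(partition_key: str) -> int:
    """Map a partition key value to a point in [0, HASH_SPACE).
    MD5 is a stand-in here, not the hash Cosmos DB uses."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def physical_partition(partition_key: str, partition_count: int) -> int:
    """Return the index of the equal-width hash range that owns the key."""
    return hash_key(partition_key) * partition_count // HASH_SPACE


users = ["John", "Dharma", "Shireesh", "Nilesh", "Sukhi", "Bob", "Milton"]
placement = {user: physical_partition(user, 3) for user in users}
print(placement)
```

Every item with the same User Id lands in the same range, so a logical partition never straddles two physical partitions, while different users spread out pseudo-randomly.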

[Diagram: hash(User Id) spreads users (John, Melvin, Karen, Dharma, Shireesh, Nilesh, Sukhi, Bob, Milton, …) over Range 1, Range 2, …, Range n, held by Physical Partition 1, Physical Partition 2, …, Physical Partition n (Physical Partition Sets)]
Frugal # of partitions based on actual storage and throughput needs (yielding scalability with low total cost of ownership)
What happens when partitions need to grow?

[Diagram: hash(User Id) over Range 1, Range 2, Range X. Partition X (Dharma, Shireesh, Nilesh, Sukhi, Bob, Milton, …) splits into Partition X1 (Dharma, Shireesh, …) on Range X1 and Partition X2 (Nilesh, Sukhi, …) on Range X2 (Physical Partition Sets)]
Partition ranges can be dynamically sub-divided
• To seamlessly grow the database as the application grows
• While sedulously maintaining high availability
Best of all: partition management is completely taken care of by the system. You don’t have to lift a finger… the database takes care of you.
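The split of Range X into Range X1 and Range X2 can be sketched as inserting one new boundary into a sorted list of range starts. The midpoint split, the tiny hash space of 300 and the helper names `owner` and `split` are assumptions for the illustration; the service decides by itself when and where to split.

```python
from bisect import bisect_right


def owner(boundaries, point):
    """boundaries holds the sorted start of every range;
    return the index of the range that owns the hashed point."""
    return bisect_right(boundaries, point) - 1


def split(boundaries, index, upper):
    """Split range `index` at its midpoint.
    `upper` is the exclusive end of the whole hash space."""
    start = boundaries[index]
    end = boundaries[index + 1] if index + 1 < len(boundaries) else upper
    middle = (start + end) // 2
    return boundaries[: index + 1] + [middle] + boundaries[index + 1:]


ranges = [0, 100, 200]         # Range 1, Range 2, Range X over [0, 300)
after = split(ranges, 2, 300)  # Range X becomes X1 = [200, 250) and X2 = [250, 300)

print(after)
print(owner(ranges, 260), owner(after, 260))
```

Only keys that hashed into Range X move; the owners of Range 1 and Range 2 are untouched, which is why the split does not disturb the rest of the container.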

Replication and Consistency

Replication and Consistency
How do you ensure consistent reads across replicas?
- Define a consistency level
Replication within a region
- Data moves extremely fast (typically, within 1 ms) between neighboring racks
Global replication
- It takes hundreds of milliseconds to move data across continents
Stronger consistency: higher latency, lower availability
Weaker consistency: lower latency, higher availability

Well-Defined Consistency Models

Consistency Level: Guarantees
Strong: Linearizability (once an operation is complete, it will be visible to all). No dirty reads.
Bounded Staleness: Consistent Prefix. Reads lag behind writes by at most k prefixes or t interval. Dirty reads possible, bounded by time and updates. Similar properties to strong consistency (except within the staleness window), while preserving 99.99% availability and low latency.
Session: Consistent Prefix. Within a session: predictable consistency, high read throughput + low latency. No dirty reads for writers (read your own writes). Dirty reads possible for other users.
Consistent Prefix: Reads will never see out-of-order writes (no gaps).
Eventual: Potential for out-of-order reads. Lowest cost for reads of all consistency levels.
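The Session row can be illustrated with a toy model: each replica has applied some prefix of a shared write log, and a session remembers how far the log had grown at its last write. The `Replica` and `Session` classes and the integer token are assumptions made for this sketch; they show the read-your-own-writes guarantee only and do not mirror how the service implements it.

```python
class Replica:
    """A replica that has applied the first `applied` writes of a shared log."""

    def __init__(self, log):
        self.log = log
        self.applied = 0

    def catch_up(self, upto=None):
        # Apply the whole log, or at least the first `upto` writes.
        self.applied = len(self.log) if upto is None else max(self.applied, upto)

    def read(self, key):
        value = None
        for written_key, written_value in self.log[: self.applied]:
            if written_key == key:
                value = written_value
        return value


class Session:
    """Tracks a token so that reads never go behind this session's own writes."""

    def __init__(self, log, replica):
        self.log, self.replica, self.token = log, replica, 0

    def write(self, key, value):
        self.log.append((key, value))
        self.token = len(self.log)

    def read(self, key):
        if self.replica.applied < self.token:
            self.replica.catch_up(self.token)
        return self.replica.read(key)


log = []
writer = Session(log, Replica(log))
other_replica = Replica(log)      # serves a different user, still lagging

writer.write("qty", 100)
fresh = writer.read("qty")        # the writer sees its own write
stale = other_replica.read("qty") # another user may still see the old state
other_replica.catch_up()
eventual = other_replica.read("qty")
print(fresh, stale, eventual)
```

Because a replica only ever applies a prefix of the log, the lagging reader sees older data but never writes out of order, which matches the Consistent Prefix property listed for the Session level.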

Let’s see it in action

Application uses

Important Links
Pricing Calculator: https://azure.microsoft.com/en-us/pricing/calculator/?service=cosmos-db#cosmos-db7aed2059-b457-48cc-a0e9-6744ce81096b
SQL API Query: https://docs.microsoft.com/en-us/azure/cosmos-db/sql-query-getting-started
Azure Cosmos Emulator: https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator#controlling-the-emulator
Data Migration Tool: http://www.microsoft.com/en-us/download/details.aspx?id=46436

Questions?

Thank you
