Cosmos DB Operations

Azure Cosmos DB is a globally distributed, multi-model database with SLAs for throughput, low-latency reads and writes, and consistency. Operating Cosmos DB is a breeze because the platform takes care of capacity, performance, and availability management. Model your data with partitioning for scale-out in mind and provision the right throughput, and there is little else you have to do.

Govind Kanshi

March 15, 2018

Transcript

  1. Cosmos DB Operations
    email : [email protected]
    Twitter - @azurecosmosdb

  2. Managing Database Operations
    • Performance management
    • Provisioned throughput guaranteed by Cosmos DB
    • Latency for point reads/writes of 1 KB guaranteed
    • Capacity management
    • Cosmos DB is elastic for both your throughput and storage needs
    • Availability management
    • Cosmos DB provides the ability to distribute data for low-latency reads and availability

  3. What to monitor
    • Throughput
    • Throttles
    • Metrics - distribution of throughput
    • Storage
    • Metrics - distribution of data
    • Latency
    • Metrics - server latency
    • Consistency/Availability – as per SLA

  4. Performance
    • Throughput management
    • RU (Request Unit) is the throughput budget, provisioned per second
    • Throughput is distributed equally across partition key ranges
    • The number of partition key ranges is transient and changes to accommodate increased data
    • Throughput increases automatically when data increases, to serve that data
    • Every CRUD operation uses RU
    • TTL operation does not use RU.
    • SQL query RU can change based on the number of entities that match the filter condition
    • Point lookup/write RU stays constant when you supply the partition key and exact id
    • Scale up/down
    • Scheduled
    • Web job Code – cli/sdk
    • Portal
    • Alerts
    • Monitor (soon to have collection id)
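
The equal split of throughput across partition key ranges can be illustrated with a little arithmetic. The sketch below is not SDK code; the function names and all numbers are made up for illustration. It shows how a single busy range can exceed its share even when total demand is far below the provisioned RU/s.

```python
def per_partition_budget(provisioned_ru: float, partition_key_ranges: int) -> float:
    """RU/s available to each partition key range under an equal split."""
    if partition_key_ranges < 1:
        raise ValueError("need at least one partition key range")
    return provisioned_ru / partition_key_ranges


def throttled_ranges(provisioned_ru, demand_by_range):
    """Return the indexes of ranges whose demand exceeds their equal share."""
    budget = per_partition_budget(provisioned_ru, len(demand_by_range))
    return [i for i, demand in enumerate(demand_by_range) if demand > budget]


# Illustrative numbers only: 10,000 RU/s over 5 ranges gives 2,000 RU/s each.
demand = [500, 600, 2500, 400, 300]  # total 4,300 RU/s, well under 10,000
hot = throttled_ranges(10_000, demand)  # only the third range is over budget
```

This is why the distribution of throughput, not just the total, is worth monitoring.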

  5. Performance – Throughput
    • Throughput management
    • Throttling - the SDK handles throttled requests by retrying automatically
    • This behavior can be overridden
    • Side effect of automatic retries: higher perceived latency

  6. Performance - SQL API – query execution metrics
    • https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-metrics#query-execution-metrics

  7. Performance
    • Metadata requests
    • Use canonicalized model to refer to “resources”
    • Do not query for them frequently; cache the reference
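
One reading of the advice above, sketched without any SDK: build resource links from ids in the `dbs/{database}/colls/{collection}` form instead of querying for the resource, and cache anything that does need a lookup. The function names, the `lookups` list, and the "shop"/"orders" ids are illustrative assumptions, not part of the deck.

```python
from functools import lru_cache


def collection_link(database_id: str, collection_id: str) -> str:
    """Build a name-based collection link without a metadata round trip."""
    return f"dbs/{database_id}/colls/{collection_id}"


def document_link(database_id: str, collection_id: str, document_id: str) -> str:
    """Build a name-based document link from the three ids."""
    return f"{collection_link(database_id, collection_id)}/docs/{document_id}"


lookups = []  # records every simulated metadata request


@lru_cache(maxsize=None)
def resolve_collection(database_id: str, collection_id: str) -> str:
    """Hypothetical metadata lookup; the cache makes it run once per collection."""
    lookups.append((database_id, collection_id))
    return collection_link(database_id, collection_id)


# A thousand uses of the same collection trigger a single lookup.
for _ in range(1000):
    link = resolve_collection("shop", "orders")
```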

  8. Performance
    • Client-side log/etl
    • For debugging retries or other issues, such as an unreachable host
    • Do not leave it switched on indefinitely
    • Check for high CPU on the client as the first measure.
    https://github.com/Azure/azure-cosmosdb-java#prerequisites

  9. Performance
    • Client-side latency
    • Latency is a function of
    • the operation or
    • automatic retries
    • Also possible because of
    • MaxDegreeOfParallelism
    • MaxBufferedItemCount
    • MaxItemCount
    • Colocate the client in the same datacenter as the Cosmos DB account
    • Use a static instance of the DocumentClient
    • Follow the performance tips - https://docs.microsoft.com/en-us/azure/cosmos-db/performance-tips , https://docs.microsoft.com/en-us/azure/cosmos-db/performance-tips-java
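
The "static instance" advice amounts to creating the client once per process and reusing it. A minimal sketch of that pattern follows; `FakeDocumentClient` is a hypothetical stand-in for a real SDK client, and the endpoint is a placeholder.

```python
import threading


class FakeDocumentClient:
    """Hypothetical stand-in for an SDK client; counts how often it is built."""

    instances = 0

    def __init__(self, endpoint: str):
        FakeDocumentClient.instances += 1
        self.endpoint = endpoint


_client = None
_lock = threading.Lock()


def get_client(endpoint: str = "https://example.invalid:443/") -> FakeDocumentClient:
    """Return the process-wide client, creating it on first use only."""
    global _client
    if _client is None:
        with _lock:
            if _client is None:  # re-check after acquiring the lock
                _client = FakeDocumentClient(endpoint)
    return _client


# Every caller gets the same object, so connections and caches are shared.
clients = [get_client() for _ in range(100)]
```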

  10. Performance
    • Index management
    • Automatic indexing
    • A Range index can do a Hash index’s job
    • Hash is useful for contains queries over arrays
    • Indexing policy
    • Do not use Lazy indexing
    • If you want to query on id, create another attribute with the same value and index it

  11. Performance - Index
    • Disable indexing if all you need is a key-value store in the SQL API
    • The id can be the partition key
    • Index only what is required
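
As a rough illustration of the two options above, here are two indexing policies written as Python dictionaries. The shape follows my understanding of the indexing policy JSON and should be checked against the indexing documentation; the `/customerId` and `/orderDate` paths are hypothetical property names.

```python
import json

# Key-value usage: turn indexing off entirely.
KV_STORE_POLICY = {"indexingMode": "none", "automatic": False}

# Selective indexing: exclude everything, then include only queried paths.
SELECTIVE_POLICY = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/customerId/?"}, {"path": "/orderDate/?"}],
    "excludedPaths": [{"path": "/*"}],
}


def indexed_paths(policy: dict) -> list:
    """List the paths a policy explicitly includes (empty if indexing is off)."""
    if policy.get("indexingMode") == "none":
        return []
    return [p["path"] for p in policy.get("includedPaths", [])]


# Serialized form, as it would be supplied when creating a collection.
policy_json = json.dumps(SELECTIVE_POLICY, indent=2)
```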

  12. Performance
    • You found a performance issue
    • A query/operation is taking longer
    • Options
    • Log time/RU yourself to Application Insights
    • Look at Log Analytics data, which arrives with a slight delay
    • Log Analytics
    • Which query takes more RU
    • Which query/operation takes more time
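
Logging time and RU yourself can be as simple as wrapping each call. In this sketch the `records` list stands in for a telemetry sink, and each operation is assumed to return its result together with its request charge; both conventions, and the charges shown, are made up for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class OperationRecord:
    name: str
    elapsed_ms: float
    request_charge: float


records = []  # stand-in for a telemetry sink such as Application Insights


def measured(name, operation):
    """Run operation, which must return (result, request_charge), and log it."""
    start = time.perf_counter()
    result, charge = operation()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    records.append(OperationRecord(name, elapsed_ms, charge))
    return result


def most_expensive(log):
    """Return the record with the highest RU charge."""
    return max(log, key=lambda r: r.request_charge)


# Simulated operations with made-up charges.
measured("point_read", lambda: ({"id": "1"}, 1.0))
measured("cross_partition_query", lambda: ([{"id": "1"}, {"id": "2"}], 48.7))
worst = most_expensive(records)
```

With records like these, "which query takes more RU" and "which operation takes more time" become simple sorts.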

  13. Performance Summary
    • In order of preference (latency and throughput)
    • GET
    • Single-partition query
    • Cross-partition query
    • Read feed (or) scan query
    • Bulk Insert (SP) > POST > PUT
    • TTL Delete > Bulk Delete (SP) > DELETE > PUT
    • Use change feed!
    • Stored procs are good for writing bulk/batch in a transactional manner (do not use them for reads/bulk reads). Client reads will always get you more bang for the buck.

  14. Storage management
    • No capacity management
    • The platform takes care of storage growth and the growth in request units it requires
    • Ensure no data skew
    • Use metrics to detect skew, but avoid it by design
    • Ensure the right partition key
    • Rebuild the container
    • Use TTL to expire stuff
    • Use Change feed + Azure Functions to move data
    • What if you need a different partition key for the same data (secondary indexing)
    • Use Change feed to populate another collection with a different partition key
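
The change feed pattern can be pictured as a function that takes each batch of changed documents and writes them out under a different key. The sketch below uses plain dictionaries in place of a real change feed and target collection; the property names are hypothetical.

```python
from collections import defaultdict


def repartition(changes, new_partition_key):
    """Group changed documents by a different partition key property.

    changes is an iterable of dicts, standing in for a change feed batch.
    Later versions of the same id overwrite earlier ones, so the copy
    always holds the most recent version seen.
    """
    target = defaultdict(dict)  # partition key value -> {id: document}
    for doc in changes:
        target[doc[new_partition_key]][doc["id"]] = doc
    return target


# Source collection is partitioned by customerId; the copy uses productId.
batch = [
    {"id": "o1", "customerId": "c1", "productId": "p9", "qty": 1},
    {"id": "o2", "customerId": "c2", "productId": "p9", "qty": 3},
    {"id": "o1", "customerId": "c1", "productId": "p9", "qty": 2},  # update
]
by_product = repartition(batch, "productId")
```

Queries by product can then be served from the copy as single-partition queries.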

  15. Storage – large documents – how to
    • Large documents
    • Consume high RUs due to IOs and indexing overhead
    • Can fill the partition key quota
    • Lead to rate-limiting
    • Patterns to manage large documents
    • Storing large attributes in separate linked document/collection
    • Storing large attributes in Azure Blob Storage
    • Compress these attributes
    • Custom indexing policy, disable on subset of properties
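
Compressing a large attribute before storing it could look like the sketch below, which uses only the Python standard library. The `_encoding` marker field is a hypothetical convention for telling readers how to decode the value; a compressed attribute can no longer be queried or indexed by content.

```python
import base64
import json
import zlib


def compress_attribute(doc: dict, field: str) -> dict:
    """Return a copy of doc with doc[field] zlib-compressed and base64-encoded."""
    raw = json.dumps(doc[field]).encode("utf-8")
    packed = base64.b64encode(zlib.compress(raw)).decode("ascii")
    out = dict(doc)
    out[field] = packed
    out[field + "_encoding"] = "zlib+base64"  # hypothetical marker field
    return out


def decompress_attribute(doc: dict, field: str) -> dict:
    """Inverse of compress_attribute."""
    raw = zlib.decompress(base64.b64decode(doc[field]))
    out = dict(doc)
    out[field] = json.loads(raw.decode("utf-8"))
    del out[field + "_encoding"]
    return out


# Repetitive text compresses well; real savings depend on the data.
original = {"id": "d1", "body": "lorem ipsum " * 500}
stored = compress_attribute(original, "body")
restored = decompress_attribute(stored, "body")
```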

  16. Storage – large partition keys > 10 GB
    • Common scenarios:
    • Multi-tenant applications where few tenants are very large
    • Router publishes telemetry at higher rate than sensors
    • Celebrity in a social networking app, viral gaming tournament
    • Patterns to manage large partition keys
    • Have a surrogate partition key like tenant ID + 0-100
    • Use a hybrid partitioning scheme: tenant ID alone for small tenants, tenant ID + 0-100 for large tenants
    • Move large tenants to their own collections
    • If the per-document size is large, use the patterns for large documents
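
The surrogate key and hybrid scheme can be sketched as below. The hashing choice, the `-` separator, and the default of 100 buckets (numbered 0 to 99) are assumptions made for this example; the slide only says "tenant ID + 0-100".

```python
import hashlib


def bucket_for(document_id: str, buckets: int) -> int:
    """Map a document id to a stable bucket number in [0, buckets)."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def partition_key(tenant_id: str, document_id: str, large_tenants, buckets: int = 100) -> str:
    """Small tenants keep tenant_id; large tenants get tenant_id plus a bucket."""
    if tenant_id in large_tenants:
        return f"{tenant_id}-{bucket_for(document_id, buckets)}"
    return tenant_id


def keys_to_query(tenant_id: str, large_tenants, buckets: int = 100):
    """All partition key values a tenant-wide query must fan out to."""
    if tenant_id in large_tenants:
        return [f"{tenant_id}-{b}" for b in range(buckets)]
    return [tenant_id]


LARGE = {"contoso"}  # hypothetical list of tenants known to be large
k1 = partition_key("contoso", "doc-17", LARGE)
k2 = partition_key("contoso", "doc-17", LARGE)
small = partition_key("fabrikam", "doc-17", LARGE)
```

The trade-off is visible in `keys_to_query`: writes for a large tenant spread over many keys, while tenant-wide reads must fan out to all of them.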

  17. Storage – hot partition keys
    • Subset of keys much more frequently accessed than others
    • Popular item in retail catalog, common driver defect in Windows DnA telemetry
    • Patterns to manage hot partition keys
    • Secondary cache collection with just the hot keys
    • Scale out across regions for isolating read and write RUs
    • Reduce RU consumption by converting critical-path queries to GETs
    • Materialized views for aggregates like COUNT into a document
    • Materialized view for latest state, leaderboard into a document
    • Why? Amortize cost at write time vs. read time
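
The idea of paying at write time rather than read time can be shown with an in-memory stand-in for a collection. `CatalogStore` and its operation counters are hypothetical; the counters only approximate relative cost and are not RU figures.

```python
class CatalogStore:
    """In-memory stand-in for a collection plus one aggregate document."""

    def __init__(self):
        self.items = {}
        self.summary = {"id": "summary", "count": 0}  # the materialized view
        self.write_ops = 0
        self.read_ops = 0

    def add_item(self, item: dict) -> None:
        """Write the item and update the aggregate in the same step."""
        is_new = item["id"] not in self.items
        self.items[item["id"]] = item
        self.write_ops += 1
        if is_new:
            self.summary["count"] += 1
            self.write_ops += 1  # extra write: cost paid at write time

    def count_by_scan(self) -> int:
        """Count by reading every item (cost grows with data size)."""
        self.read_ops += len(self.items)
        return len(self.items)

    def count_by_view(self) -> int:
        """Count with a single point read of the summary document."""
        self.read_ops += 1
        return self.summary["count"]


store = CatalogStore()
for i in range(50):
    store.add_item({"id": f"item-{i}"})
store.add_item({"id": "item-0", "price": 5})  # update, not a new item
view_count = store.count_by_view()
```

Each new item costs one extra write, but every later COUNT is a single read regardless of how many items exist.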

  18. Availability management
    • Always add geo DR
    • 99.99% within a region, 99.999% for reads
    • Data is available in read regions for low-latency read workloads (change feed or just reads)
    • Data abides by the consistency level provided
    • Auto-homing SDK
    • Leverage manual failover for DR testing or follow-the-sun
    • Leverage consistency settings to take advantage of throughput (if required): strong consistency reads cost 2x, so read at a lower consistency level

  19. Error codes
    • HTTP status codes - https://docs.microsoft.com/en-us/rest/api/documentdb/http-status-codes-for-documentdb
    • 200
    • 400
    • 401/403 ..
    • 404/413
    • 429 – SDK will handle it
    • Retry Policy – default
    • Override it if needed
    • 500 – file support
    • 503 – retryable
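
The handling listed above can be written down as a small lookup table. The grouping of the 4xx codes under one "client error" action is my own simplification, and the comments give the standard HTTP meanings; the linked status code page is the authority on Cosmos DB specifics.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    CLIENT_ERROR = "client-side error; fix the request, do not retry blindly"
    RETRY = "retry"
    FILE_SUPPORT = "file a support request"


# Handling suggested by the slide; meanings are the standard HTTP ones.
STATUS_ACTIONS = {
    200: Action.OK,
    400: Action.CLIENT_ERROR,   # bad request
    401: Action.CLIENT_ERROR,   # unauthorized
    403: Action.CLIENT_ERROR,   # forbidden
    404: Action.CLIENT_ERROR,   # not found
    413: Action.CLIENT_ERROR,   # request entity too large
    429: Action.RETRY,          # throttled; the SDK retries by default
    500: Action.FILE_SUPPORT,
    503: Action.RETRY,
}


def action_for(status_code: int) -> Action:
    """Look up the handling for a status code; unknown codes raise KeyError."""
    return STATUS_ACTIONS[status_code]
```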

  20. Other
    • Bulk load
    • Increase RU, shuffle the data, push in parallel
    • Use a tool that understands distributed databases
    • ODBC, for example, would not be a good way to connect
    • A tool/service is in the offing; reach out for the bulk load tool in Java/.NET
    • Backup
    • Two automatic snapshots, taken every 4 hours (for the "oops, I deleted" scenario)
    • Restore on demand via support call
    • Create a copy of the database
    • Change feed + Azure Functions
    • Paging
    • Continuation tokens in Cosmos DB never expire. [...] continuation token corresponding to pages 1, 2, 3 and 4, so you can execute the query to go back to that page. – WIP
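
The "shuffle the data, push in parallel" advice for bulk load can be sketched with the standard library alone. `bulk_load` and its parameters are hypothetical, and `insert` is whatever function actually writes a batch; here it is an in-memory list so the example runs without a database.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def bulk_load(documents, insert, workers=8, batch_size=100, seed=None):
    """Shuffle documents, split them into batches, and insert batches in parallel.

    insert is a caller-supplied function taking one list of documents.
    Shuffling spreads consecutive partition key values across batches so
    that no single partition absorbs a whole batch at once.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(insert, batches))
    return len(batches)


# Demo with an in-memory sink instead of a real database.
sink = []
source = [{"id": str(i), "pk": f"device-{i // 100}"} for i in range(1000)]
n_batches = bulk_load(source, sink.extend, seed=7)
```

Without the shuffle, the source order here would send 100 consecutive documents for the same `pk` in each batch.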

  21. Managing Database Operations
    • Performance
    • Throughput - Choosing right partition/operations/data model
    • T = Reads + Writes + updates + deletes + queries
    • Latency - Ensuring queries have right partitions
    • GET < Single partition < Multiple Partition
    • Storage management
    • Choose right partition keys – query fan out for all data vs data size
    • Colocate data with partition key
    • Availability management
    • Always Add Geo redundancy with one click
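
The formula T = Reads + Writes + updates + deletes + queries can be turned into a quick estimate by multiplying each operation rate by its RU cost. All rates and per-operation costs below are illustrative assumptions; real charges should be measured from your own workload.

```python
def required_throughput(ops_per_second: dict, ru_per_op: dict) -> float:
    """Sum rate x RU cost over every operation type."""
    return sum(rate * ru_per_op[op] for op, rate in ops_per_second.items())


# Illustrative numbers only; measure real charges from your own workload.
rates = {"reads": 500, "writes": 100, "updates": 50, "deletes": 10, "queries": 20}
costs = {"reads": 1.0, "writes": 5.0, "updates": 10.0, "deletes": 5.0, "queries": 25.0}
total_ru = required_throughput(rates, costs)  # RU/s to provision, before headroom
```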

  22. Other
    • Access to account level activities can be controlled
    • Log of activities in log analytics
    • Token-based, time-limited access to resources
    • Data at rest encrypted
    • Data in motion encrypted

  23. Links
    • Metrics - https://docs.microsoft.com/en-us/azure/cosmos-db/use-metrics
    • Diagnostic logging - https://docs.microsoft.com/en-us/azure/cosmos-db/logging
    • Set throughput - https://docs.microsoft.com/en-us/azure/cosmos-db/set-throughput
    • Access control - https://docs.microsoft.com/en-us/azure/cosmos-db/access-control
    • Failover - https://docs.microsoft.com/en-us/azure/cosmos-db/regional-failover
    • TTL - https://docs.microsoft.com/en-us/azure/cosmos-db/time-to-live
    • Indexing - https://docs.microsoft.com/en-us/azure/cosmos-db/indexing-policies
    • Change feed - https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed
    • SQL Query perf - https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-metrics
    • Partitioning - https://docs.microsoft.com/en-us/azure/cosmos-db/sql-api-partition-data
    • Modelling - https://docs.microsoft.com/en-us/azure/cosmos-db/modeling-data
    • Perf tips - https://docs.microsoft.com/en-us/azure/cosmos-db/performance-tips
    • Throughput - https://docs.microsoft.com/en-us/azure/cosmos-db/request-units
    • Azure CLI - https://docs.microsoft.com/en-us/azure/cosmos-db/cli-samples

  24. Learn more www.azurecosmosdb.com
    GLOBAL APPS NEED GLOBAL DATA FROM
    A SERVICE THAT’S OUT OF THIS WORLD
    WELCOME TO AZURE COSMOS DB
    Sign up to Azure for free https://aka.ms/azureaccount
    Try Azure Cosmos DB https://aka.ms/tryazurecosmosdb
    Join next week's session to learn how to build serverless apps, and register:
    https://aka.ms/CosmosDBlearn

  25. Thank you for joining us.
