Massive data processing - Needs and challenges
- Balance massive data writes against complex queries
- No performance degradation under massive write load
- Handle different types of queries
- Pay-as-you-go cost model
Limitations of traditional architectures
- Try to handle everything in one big datastore...
- Write-optimized datastores are weak at complex queries
- Query-optimized datastores are weak under massively concurrent writes
-> As a result, teams rely on over-specced datastores
Data transformation / Stream processing
- Used for processing stream data with low latency
- The best solution is the CosmosDBTrigger in Azure Functions
- Writes results to write-fast storage such as SQL Database and Redis Cache
- Also used when writing back to Cosmos DB (creating a materialized view)
Sample code - Push model

```csharp
public class Function1
{
    private readonly Container _container;

    public Function1(CosmosClient cosmosClient)
    {
        _container = cosmosClient.GetContainer("SampleDB", "MaterializedView");
    }

    [FunctionName("Function1")]
    public async Task Run(
        [CosmosDBTrigger(
            databaseName: "SampleDB",
            collectionName: "TodoItems",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> input,
        ILogger log)
    {
        var tasks = new Task[input.Count];
        for (int i = 0; i < input.Count; i++)
        {
            // Change the partition key and write it back
            // (in practice, do a more advanced transformation here)
            var partitionKey = new PartitionKey(input[i].GetPropertyValue<string>("anotherKey"));
            tasks[i] = _container.UpsertItemStreamAsync(
                new MemoryStream(input[i].ToByteArray()), partitionKey);
        }
        await Task.WhenAll(tasks);
    }
}
```
Batch processing
- Use when a large amount of data must be processed at one time
- Practical to implement with a TimerTrigger in Azure Functions
- Used for archiving to Blob Storage / Data Lake Storage Gen2
- Storage GPv2 and Data Lake Storage Gen2 are charged per write transaction, so writing every stream event individually increases costs
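The batching idea above can be sketched with a TimerTrigger that drains accumulated data and writes it as a single blob, paying for one write transaction instead of one per event. This is an illustrative sketch, not the deck's code: `ArchiveFunction`, the hourly schedule, the blob naming, and `ReadPendingItemsAsync` (an application-specific way to collect data since the last run) are all assumptions.

```csharp
// Assumes the Microsoft.Azure.WebJobs and Azure.Storage.Blobs packages.
public class ArchiveFunction
{
    private readonly BlobContainerClient _blobContainer;

    public ArchiveFunction(BlobContainerClient blobContainer)
    {
        _blobContainer = blobContainer;
    }

    [FunctionName("ArchiveFunction")]
    public async Task Run([TimerTrigger("0 0 * * * *")] TimerInfo timer) // hourly
    {
        // Collect everything that has accumulated since the last run...
        IReadOnlyList<string> items = await ReadPendingItemsAsync();

        // ...and write it as ONE blob: a single write transaction
        // instead of one per stream event.
        var blob = _blobContainer.GetBlobClient($"archive/{DateTime.UtcNow:yyyyMMddHH}.jsonl");
        await blob.UploadAsync(new BinaryData(string.Join("\n", items)), overwrite: true);
    }

    // Hypothetical, application-specific helper (e.g. the Change feed pull model).
    private Task<IReadOnlyList<string>> ReadPendingItemsAsync()
        => throw new NotImplementedException();
}
```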
Improving resiliency - Retry policy
- The CosmosDBTrigger proceeds to the next Change Feed batch even when an execution fails
- Without a retry policy, the data from the failed execution is lost and never reprocessed
- Use FixedDelayRetry or ExponentialBackoffRetry with an unlimited ( -1 ) maximum number of retries
- The Change Feed will not proceed until execution succeeds, so no data is lost
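A function-level retry policy is declared with an attribute on the function. A minimal sketch (the function name and the 10-second delay are illustrative; the retry attributes themselves are the Azure Functions ones):

```csharp
public class ResilientFunction
{
    // Retry forever with a fixed 10-second delay; the Change Feed pointer
    // only advances after a successful execution.
    [FunctionName("ResilientFunction")]
    [FixedDelayRetry(-1, "00:00:10")]
    // Alternative: [ExponentialBackoffRetry(-1, "00:00:04", "00:15:00")]
    public async Task Run(
        [CosmosDBTrigger(
            databaseName: "SampleDB",
            collectionName: "TodoItems",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> input,
        ILogger log)
    {
        // Any unhandled exception here triggers the retry policy
        // instead of silently moving on to the next batch.
        log.LogInformation($"Processing {input.Count} changes");
        await Task.CompletedTask; // actual processing goes here
    }
}
```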
Focus on idempotency and eventual consistency
- Code for idempotency whenever possible
  - For storage that supports overwrite or delete (Cosmos DB / SQL Database / etc.)
- When idempotency is hard to guarantee, focus on eventual consistency ("at least once")
  - For append-only storage (Blob Storage / Data Lake Storage Gen2)
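One common way to get idempotency on overwritable storage is to derive the document id deterministically from the source item and upsert, so that a retried execution overwrites the same document instead of creating a duplicate. A sketch under that assumption (`MaterializedItem` and the helper are hypothetical, not the deck's code):

```csharp
public record MaterializedItem(string id, string anotherKey, string payload);

public static Task WriteIdempotentAsync(Container container, Document source)
{
    var item = new MaterializedItem(
        id: source.Id,   // same source item -> same id on every retry
        anotherKey: source.GetPropertyValue<string>("anotherKey"),
        payload: source.ToString());

    // Upsert: inserts on the first execution, overwrites on retries.
    return container.UpsertItemAsync(item, new PartitionKey(item.anotherKey));
}
```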
Avoid inconsistent states - Graceful shutdown
- Azure Functions hosts are restarted when a new version is deployed or the platform is updated
- If the host restarts while a Function is executing, state may be left inconsistent
- Implement graceful shutdown to avoid inconsistent states
- Increase resiliency by combining it with a retry policy
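Azure Functions can pass a CancellationToken parameter that is signaled when the host is shutting down; checking it between work items lets the function stop cleanly at a safe boundary. A sketch (`ProcessAsync` is a hypothetical per-item handler; binding names are illustrative):

```csharp
public class GracefulFunction
{
    [FunctionName("GracefulFunction")]
    public async Task Run(
        [CosmosDBTrigger(
            databaseName: "SampleDB",
            collectionName: "TodoItems",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> input,
        ILogger log,
        CancellationToken cancellationToken) // signaled when the host is restarting
    {
        foreach (var document in input)
        {
            // Stop between items rather than mid-write; combined with a
            // retry policy, the unprocessed remainder is re-executed
            // after the host restarts.
            cancellationToken.ThrowIfCancellationRequested();
            await ProcessAsync(document, cancellationToken);
        }
    }

    // Hypothetical per-item handler.
    private Task ProcessAsync(Document document, CancellationToken ct)
        => Task.CompletedTask;
}
```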
References
- Azure Cosmos DB trigger for Functions 2.x and higher | Microsoft Docs
- Azure/azure-cosmos-dotnet-v3: .NET SDK for Azure Cosmos DB for the core SQL API
- Change feed pull model | Microsoft Docs
- Azure Functions error handling and retry guidance | Microsoft Docs
- Cancellation tokens - Develop C# class library functions using Azure Functions | Microsoft Docs