
The Asynchronous Processing World of .NET


This presentation examines different approaches to asynchronous processing in .NET, using a web scraping application as a case study. It explores the nuances of concurrency models such as BlockingCollection, System.Threading.Channels, TPL Dataflow, and task continuation trees, comparing them to a naïve async approach. Each model's implementation is illustrated with code examples, highlighting their strengths and weaknesses in terms of throughput, latency, resource utilization, and developer effort. The presentation emphasizes the importance of choosing the right concurrency model for a given task and leveraging established libraries for optimal performance and maintainability. It also encourages thorough testing and benchmarking to validate the chosen approach.
The accompanying GitHub repository (https://github.com/tamirdresher/AsyncProcessingInDotNet-Webscraper) provides the full source code for the web scraping application and includes benchmarking results for each concurrency model, allowing for a deeper understanding of their real-world performance characteristics.

Tamir Dresher

November 20, 2024

Transcript

  1. The Asynchronous Processing World of .NET Race Condition Who’s there?

    C# C# Tamir Dresher twitter: @tamir_dresher Bluesky: @tamirdresher.bsky.social
  2. Who am I? Head of Architecture; Software Engineering Lecturer @ Ruppin Academic Center; Ex-Microsoft MVP; My Books: (covers shown on slide)
  3. The .NET concurrency landscape

     Asynchronicity Abstraction: Task/ValueTask, IAsyncEnumerable, AsyncLocal, TaskScheduler
     Primitives: Thread, ThreadPool, Locking (e.g. Lock, Mutex), Signaling (e.g. ManualResetEvent)
     Communication & Cooperation: Concurrent Collections (Queue, Bag, etc.), BlockingCollection, Channels, Cancellation*
     Parallelism Abstraction: Parallel, PLINQ, Partitioner
     Application Concurrency: Flow/Networks/Pipelines (TPL Dataflow, Reactive Extensions, IObservable/IObserver), Actors (Akka, Orleans, Dapr)
  4. Our use case – a translating web scraper
     (Pipeline diagram: Start URL → URLs → Download Html, which fans out to Retrieve Links (feeding new URLs back), Retrieve Image Links → Download Images, and Translate → Replace To Local Links → Save Html)
  5. Experiment 1 - Naïve async approach
     (repeats the .NET concurrency landscape map from slide 3)
  6. Naïve async approach

     public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
     {
         Queue<(int depth, string url)> urlQueue = new();
         urlQueue.Enqueue((depth: 0, startUrl));
         while (urlQueue.TryDequeue(out var qItem))
         {
             var (depth, url) = qItem;
             if (!_visitedUrls.TryAdd(url, true))
                 continue;
             var uri = new Uri(url);
             string html = await DownloadHtmlAsync(uri);
             var links = RetrieveLinksFromHtml(html);
             foreach (var link in links)
             {
                 urlQueue.Enqueue((depth + 1, link));
             }
             var imageLinks = RetrieveImageLinks(html);
             await DownloadImagesAsync(uri, imageLinks);
             var translated = await TranslateHtmlAsync(html, translateTo);
             var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
             await SaveHtmlAsync(uri, localizedLinksHtml);
         }
     }
  9. Naïve async approach (with the max-depth check)

     public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
     {
         Queue<(int depth, string url)> urlQueue = new();
         urlQueue.Enqueue((depth: 0, startUrl));
         while (urlQueue.TryDequeue(out var qItem))
         {
             var (depth, url) = qItem;
             if (!_visitedUrls.TryAdd(url, true) || depth > maxDepth)
                 continue;
             var uri = new Uri(url);
             string html = await DownloadHtmlAsync(uri);
             var links = RetrieveLinksFromHtml(html);
             foreach (var link in links)
             {
                 urlQueue.Enqueue((depth + 1, link));
             }
             var imageLinks = RetrieveImageLinks(html);
             await DownloadImagesAsync(uri, imageLinks);
             var translated = await TranslateHtmlAsync(html, translateTo);
             var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
             await SaveHtmlAsync(uri, localizedLinksHtml);
         }
     }
  15. Naïve async approach - Summary

      Very easy to understand and reason about
      Issues:
        Low throughput – the number of pages we process in a time period
        High latency – the time it takes to finish processing a single page
        Low utilization
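The naïve loop awaits every step before starting the next, even when steps are independent. A minimal sketch of overlapping two independent awaits with Task.WhenAll (the helpers here are hypothetical stand-ins for the deck's TranslateHtmlAsync/DownloadImagesAsync, with Task.Delay standing in for I/O):

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-ins for the deck's TranslateHtmlAsync / DownloadImagesAsync.
async Task<string> TranslateAsync(string html)
{
    await Task.Delay(100);              // simulate translation I/O
    return html.ToUpperInvariant();     // fake "translation"
}

async Task DownloadImagesAsync(string html)
{
    await Task.Delay(100);              // simulate image download I/O
}

var html = "<p>hello</p>";

// Start both independent steps before awaiting either:
// total wall time is roughly max(100, 100) ms instead of ~200 ms.
var translateTask = TranslateAsync(html);
var imagesTask = DownloadImagesAsync(html);
await Task.WhenAll(translateTask, imagesTask);

var translated = await translateTask;   // already completed; just unwraps the result
Console.WriteLine(translated);
```

This only improves latency within one page; the concurrency models in the following experiments address cross-page throughput.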
  16. Our use case – translating web scraper
     (repeats the pipeline diagram from slide 4)
  17. Our use case – translating web scraper
     (the pipeline diagram again)
  18. Experiment 2 - BlockingCollection approach
     (repeats the .NET concurrency landscape map from slide 3)
  19. BlockingCollection<T>

      A thread-safe wrapper around IProducerConsumerCollection<T> with built-in blocking and bounding

     // Constructor signature:
     // new BlockingCollection<T>(IProducerConsumerCollection<T> collection, int boundedCapacity)
     var collection = new BlockingCollection<int>(new ConcurrentBag<int>(), boundedCapacity: 5);

     // Producer
     collection.Add(100);
     collection.CompleteAdding();

     // Consumer
     foreach (var num in collection.GetConsumingEnumerable())
     {
         // Do something
     }
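The snippet above can be fleshed out into a complete, runnable producer/consumer sketch (the capacity of 5 and the ConcurrentQueue backing store are arbitrary choices for illustration):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded to 5 items: Add blocks the producer when the collection is full,
// and GetConsumingEnumerable blocks the consumer while it is empty.
var collection = new BlockingCollection<int>(new ConcurrentQueue<int>(), boundedCapacity: 5);

var producer = Task.Run(() =>
{
    for (int i = 1; i <= 10; i++)
    {
        collection.Add(i);              // may block until the consumer catches up
    }
    collection.CompleteAdding();        // lets the consuming enumeration terminate
});

int sum = 0;
foreach (var num in collection.GetConsumingEnumerable())
{
    sum += num;                         // consumes 1..10 as they arrive
}

await producer;
Console.WriteLine(sum);                 // 55
```

Note that both the producer's Add and the consumer's loop park real threads while waiting, which is exactly the cost the summary slide below calls out.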
  20. BlockingCollection approach

     BlockingCollection<(string url, string basePath, int depth, int maxDepth, string translateTo)> _htmlJobs = new();
     BlockingCollection<(string imageUrl, string basePath, Uri uri)> _imageJobs = new();
     BlockingCollection<(string htmlContent, string basePath, Uri uri, string language)> _translationJobs = new();
  21. BlockingCollection approach

     // Extract images and queue them for download
     var imageLinks = RetrieveImageLinks(html);
     foreach (var imageUrl in imageLinks)
     {
         _imageJobs.Add((imageUrl, basePath, uri));
     }
  22. BlockingCollection approach

     private async Task ProcessImageJobs()
     {
         foreach (var (imageUrl, basePath, uri) in _imageJobs.GetConsumingEnumerable())
         {
             if (!_visitedUrls.TryAdd(imageUrl, true))
                 continue; // already processed this image
             await DownloadImageAsync(basePath, uri, imageUrl);
         }
     }
  23. BlockingCollection approach - workers

     public void StartWorkers(int htmlWorkerCount = 20, int imageWorkerCount = 20, int translationWorkerCount = 20)
     {
         // Start HTML scraping workers
         for (int i = 0; i < htmlWorkerCount; i++)
         {
             _workerTasks.Add(Task.Run(() => ProcessHtmlJobs()));
         }
         // Start image download workers
         for (int i = 0; i < imageWorkerCount; i++)
         {
             _workerTasks.Add(Task.Run(() => ProcessImageJobs()));
         }
         // Start translation workers
         for (int i = 0; i < translationWorkerCount; i++)
         {
             _workerTasks.Add(Task.Run(() => ProcessTranslationJobs()));
         }
     }
  24. BlockingCollection approach - Summary

      BlockingCollection is… well, blocking. It pauses producers when the collection is full and consumers when it's empty
      Threads waiting on a BlockingCollection consume system resources while blocked and reduce system performance
      No fine-grained control over buffering or backpressure handling
      No support for async/await
      Coordination between the various activities is the developer's responsibility
      Worker setup is the developer's responsibility (+ degree of parallelism)
  25. The coordination problem

      Problem: we have a cycle in our design, so there is no intuitive way to decide when feeding of the BlockingCollection is complete
      Solution: add a counter of remaining jobs and mark completion when it reaches zero
     (repeats the pipeline diagram from slide 4)
  26. The coordination problem

     private int _htmlJobsCount = 1;

     private async Task ProcessHtmlJobs()
     {
         foreach (var (url, ..) in _htmlJobs.GetConsumingEnumerable())
         {
             try
             {
                 ...
                 var links = RetrieveLinksFromHtml(html);
                 Interlocked.Add(ref _htmlJobsCount, links.Count);
                 foreach (var link in links) { /* add to queue */ }
                 ...
             }
             catch (Exception ex) { ... }
             finally
             {
                 var remainingJobsCount = Interlocked.Decrement(ref _htmlJobsCount);
                 Debug.Assert(remainingJobsCount >= 0);
                 if (remainingJobsCount == 0)
                 {
                     _htmlJobs.CompleteAdding();
                 }
             }
         }
     }
  29. Experiment 3 – System.Threading.Channels approach
     (repeats the .NET concurrency landscape map from slide 3)
  30. System.Threading.Channels

      Channels are high-performance, asynchronous primitives with support for bounded or unbounded producer-consumer scenarios
      Designed for non-blocking workflows using async/await
      Handles backpressure effectively for high-throughput scenarios

     var channel = Channel.CreateUnbounded<int>();
  31. System.Threading.Channels

     // Producer
     var channel = Channel.CreateUnbounded<int>();
     for (int i = 1; i <= 10; i++)
     {
         await channel.Writer.WriteAsync(i);
     }
     channel.Writer.Complete();

     // Bounded alternative
     var channel = Channel.CreateBounded<int>(
         new BoundedChannelOptions(3) { FullMode = BoundedChannelFullMode.DropOldest });

     // Consumer
     await foreach (var num in channel.Reader.ReadAllAsync())
     {
         // Do something
     }
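The DropOldest option shown above can be demonstrated end to end. In this mode WriteAsync never waits on a full channel; it evicts the oldest buffered item instead, so only the tail of the stream survives if the reader starts late:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;

// Capacity 3 with DropOldest: a full channel evicts its oldest buffered item
// rather than making WriteAsync wait.
var channel = Channel.CreateBounded<int>(
    new BoundedChannelOptions(3) { FullMode = BoundedChannelFullMode.DropOldest });

for (int i = 1; i <= 10; i++)
{
    await channel.Writer.WriteAsync(i); // never blocks in DropOldest mode
}
channel.Writer.Complete();

var kept = new List<int>();
await foreach (var num in channel.Reader.ReadAllAsync())
{
    kept.Add(num);
}

Console.WriteLine(string.Join(",", kept)); // 8,9,10 – only the newest 3 survived
```

The default FullMode, Wait, would instead suspend the writer (without blocking a thread) until the reader drains an item, which is the backpressure behavior the scraper relies on.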
  32. System.Threading.Channels approach

     Channel<HtmlScrapeJob> _htmlChannel = Channel.CreateUnbounded<HtmlScrapeJob>();
     Channel<ImageDownloadJob> _imageChannel = Channel.CreateUnbounded<ImageDownloadJob>();
     Channel<HtmlTranslationJob> _translationChannel = Channel.CreateUnbounded<HtmlTranslationJob>();

     public void StartWorkers(int htmlWorkerCount = 20, int imageWorkerCount = 20, int translationWorkerCount = 20)
     {
         // Start HTML scraping workers
         for (int i = 0; i < htmlWorkerCount; i++)
         {
             _workerTasks.Add(Task.Run(() => ProcessHtmlJobs()));
         }
         // Start image download workers
         for (int i = 0; i < imageWorkerCount; i++)
         {
             _workerTasks.Add(Task.Run(() => ProcessImageJobs()));
         }
         // Start translation workers
         for (int i = 0; i < translationWorkerCount; i++)
         {
             _workerTasks.Add(Task.Run(() => ProcessTranslationJobs()));
         }
     }
  33. System.Threading.Channels approach

     private async Task ProcessHtmlJobs()
     {
         await foreach (var htmlScrapeJob in _htmlChannel.Reader.ReadAllAsync())
         {
             try
             {
                 var uri = new Uri(htmlScrapeJob.url);
                 var html = await DownloadHtmlAsync(uri);
                 var links = RetrieveLinksFromHtml(html);
                 foreach (var link in links)
                 {
                     await _htmlChannel.Writer.WriteAsync(new HtmlScrapeJob(…));
                 }
                 var imageLinks = RetrieveImageLinks(html);
                 foreach (var imageUrl in imageLinks)
                 {
                     await _imageChannel.Writer.WriteAsync(new ImageDownloadJob(…));
                 }
                 await _translationChannel.Writer.WriteAsync(new HtmlTranslationJob(…));
             }
             catch (Exception ex) { … }
         }
     }
  38. Channels approach - Summary

      While System.Threading.Channels is powerful and modern, it feels very low-level; the developer still needs to deal with:
        Complex coordination
        Limited debugging and observability
        Prioritization
        Dynamic resizing of capacity or parallelism level
        Cancellation and error propagation
  39. Experiment 4 - TPL Dataflow approach
     (repeats the .NET concurrency landscape map from slide 3)
  40. TPL Dataflow

      A library for building data processing pipelines using asynchronous, thread-safe blocks in a modular way
      Predefined building blocks like BufferBlock, TransformBlock, and ActionBlock
      Built-in backpressure – automatically manages flow control to prevent overwhelming consumers
      The workflow creation is self-explanatory, making it easier to understand, debug, and extend
  41. TPL Dataflow

     var multiplier = new TransformBlock<int, int>(x => x * 2);
     var adder = new TransformManyBlock<int, int>(x => [x + 1, x + 2, x + 3]);
     var printer = new ActionBlock<int>(x => Console.WriteLine($"Processed: {x}"));

     // Link blocks (PropagateCompletion so Complete() flows downstream)
     var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
     multiplier.LinkTo(adder, linkOptions);
     adder.LinkTo(printer, linkOptions);

     // Post data
     multiplier.Post(5);
     multiplier.Post(10);

     // Mark completion
     multiplier.Complete();
     await printer.Completion;
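The multiplier/adder/printer pipeline above can be made fully self-contained and checkable; note that completion only flows downstream when links are created with PropagateCompletion = true, otherwise awaiting printer.Completion would hang (TPL Dataflow ships as the System.Threading.Tasks.Dataflow NuGet package). This sketch sums instead of printing so the result is verifiable:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

int sum = 0;
var multiplier = new TransformBlock<int, int>(x => x * 2);
var adder = new TransformManyBlock<int, int>(x => new[] { x + 1, x + 2, x + 3 });
var printer = new ActionBlock<int>(x => sum += x); // MaxDegreeOfParallelism defaults to 1

// PropagateCompletion makes multiplier.Complete() flow through the chain,
// so awaiting printer.Completion below can actually finish.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
multiplier.LinkTo(adder, linkOptions);
adder.LinkTo(printer, linkOptions);

multiplier.Post(5);   // 5  -> 10 -> 11, 12, 13
multiplier.Post(10);  // 10 -> 20 -> 21, 22, 23
multiplier.Complete();
await printer.Completion;

Console.WriteLine(sum); // 102
```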
  45. Our use case – translating web scraper
     (repeats the pipeline diagram from slide 4)
  46. Our use case – translating web scraper
     (the pipeline diagram from slide 4, with each stage mapped to a Dataflow block: BroadcastBlock, TransformBlock, TransformManyBlock, ActionBlock)
  47. TPL Dataflow approach

     async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
     {
         var context = new ScrapeContext(startUrl, …);
         var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

         // Set up dataflow blocks
         var fetchHtmlBlock = CreateFetchHtmlBlock(dataflowOptions, context);
         var htmlBroadcaster = new BroadcastBlock<HtmlProcessingData>(x => x);
         var retrieveHtmlLinksBlock = CreateRetrieveHtmlLinksBlock(dataflowOptions, context);
         var retrieveHtmlImageLinksBlock = CreateRetrieveHtmlImageLinksBlock(dataflowOptions);
         var downloadImageBlock = CreateDownloadImageBlock(dataflowOptions);
         var translateHtmlBlock = CreateTranslateHtmlBlock(dataflowOptions, context);
         var replaceToLocalLinksBlock = CreateReplaceToLocalLinksBlock(dataflowOptions);
         var saveHtmlBlock = CreateSaveHtmlBlock(dataflowOptions);
     }
  48. TPL Dataflow approach

     var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20, BoundedCapacity = 100 };

     TransformBlock<UrlProcessingData, HtmlProcessingData> CreateFetchHtmlBlock(ExecutionDataflowBlockOptions options, ScrapeContext context)
     {
         return new TransformBlock<UrlProcessingData, HtmlProcessingData>(async item =>
         {
             try
             {
                 if (item.depth <= context.MaxDepth && _processedUrls.TryAdd(item.Url, true))
                 {
                     var html = await DownloadHtmlAsync(item.Url);
                     return new HtmlProcessingData(html, item.Url, item.depth, context);
                 }
             }
             catch (Exception ex)
             {
                 _logger.LogError(ex, "Error fetching html {url}", item?.Url);
             }
             return new HtmlProcessingData("", "", -1, context);
         }, options);
     }
  51. TPL Dataflow approach

     // Link blocks in the pipeline
     DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
     fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions);
  52. TPL Dataflow approach async Task ScrapeAsync(string startUrl, string basePath, int

    maxDepth, string translateTo, bool stayInDomain) { var context = new ScrapeContext(startUrl, …); var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 }; // Set up dataflow blocks ... // Link blocks in the pipeline DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true }; fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); } Filter empty messages Tamir Dresher
  53. TPL Dataflow approach async Task ScrapeAsync(string startUrl, string basePath, int

    maxDepth, string translateTo, bool stayInDomain) { var context = new ScrapeContext(startUrl, …); var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 }; // Set up dataflow blocks ... // Link blocks in the pipeline DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true }; fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); retrieveHtmlImageLinksBlock.LinkTo(downloadImageBlock, linkOptions); retrieveHtmlLinksBlock.LinkTo(fetchHtmlBlock, linkOptions); htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); } Tamir Dresher
  54. TPL Dataflow approach async Task ScrapeAsync(string startUrl, string basePath, int

    maxDepth, string translateTo, bool stayInDomain) { var context = new ScrapeContext(startUrl, …); var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 }; // Set up dataflow blocks ... // Link blocks in the pipeline DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true }; fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); retrieveHtmlImageLinksBlock.LinkTo(downloadImageBlock, linkOptions); retrieveHtmlLinksBlock.LinkTo(fetchHtmlBlock, linkOptions); htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); translateHtmlBlock.LinkTo(replaceToLocalLinksBlock, linkOptions); } Tamir Dresher
  55. TPL Dataflow approach async Task ScrapeAsync(string startUrl, string basePath, int

    maxDepth, string translateTo, bool stayInDomain) { var context = new ScrapeContext(startUrl, …); var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 }; // Set up dataflow blocks ... // Link blocks in the pipeline DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true }; fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions); htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); retrieveHtmlImageLinksBlock.LinkTo(downloadImageBlock, linkOptions); retrieveHtmlLinksBlock.LinkTo(fetchHtmlBlock, linkOptions); htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); translateHtmlBlock.LinkTo(replaceToLocalLinksBlock, linkOptions); replaceToLocalLinksBlock.LinkTo(saveHtmlBlock, linkOptions); } Tamir Dresher
  57. TPL Dataflow approach async Task ScrapeAsync(string startUrl, string basePath, int

    maxDepth, string translateTo, bool stayInDomain) { var context = new ScrapeContext(startUrl, …); var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 }; // Set up dataflow blocks ... // Link blocks in the pipeline ... // Seed the pipeline fetchHtmlBlock.Post(new UrlProcessingData(context.StartUrl, context, depth:0, false)); } Tamir Dresher
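The options on the earlier slide set `BoundedCapacity = 100`, and that interacts with how the pipeline is seeded: with a bounded block, `Post` returns `false` and drops the message when the buffer is full, while `SendAsync` asynchronously waits for room. A small sketch of the difference, with hypothetical values:

```csharp
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

// Capacity of 1: the single in-flight item fills the buffer while it is processed.
var block = new ActionBlock<int>(async x => await Task.Delay(200),
    new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

bool first = block.Post(1);   // accepted
bool second = block.Post(2);  // rejected: the buffer is full, the message is lost
await block.SendAsync(3);     // waits until there is room instead of dropping
block.Complete();
await block.Completion;
```

For a one-shot seed as on this slide, `Post` is fine; producers that must not lose messages under backpressure should prefer `SendAsync`.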
  58. TPL Dataflow approach async Task ScrapeAsync(string startUrl, string basePath, int

    maxDepth, string translateTo, bool stayInDomain) { var context = new ScrapeContext(startUrl, …); var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 }; // Set up dataflow blocks ... // Link blocks in the pipeline ... // Seed the pipeline fetchHtmlBlock.Post(new UrlProcessingData(context.StartUrl, context, depth:0, false)); // completion handling await context.Completion; fetchHtmlBlock.Complete(); await Task.WhenAll(fetchHtmlBlock.Completion, downloadImageBlock.Completion, saveHtmlBlock.Completion ); } Tamir Dresher
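Note that the deck's graph contains a cycle (`retrieveHtmlLinksBlock` links back into `fetchHtmlBlock`), so completion cannot simply propagate; that is why the code waits on `context.Completion` to track outstanding work before calling `Complete()`. For a linear pipeline, the standard pattern is just `Complete()` plus `PropagateCompletion`, sketched here with hypothetical blocks:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

var results = new List<int>();
var multiply = new TransformBlock<int, int>(x => x * 2);
var collect = new ActionBlock<int>(x => results.Add(x));

multiply.LinkTo(collect, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var i in new[] { 1, 2, 3 }) multiply.Post(i);
multiply.Complete();        // signal: no more input
await collect.Completion;   // completes once everything has drained through
```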
  59. TPL Dataflow approach - Summary  An explicit way to define

    and reason about the network  Modular - a rich set of blocks for recurring concerns  Debuggability - InputCount, OutputCount, and Completion can be monitored for errors  Provides optimizations and edge-case handling  Issues  Learning curve  Good for DAGs, but once loops and forks are added, coordination is not available out-of-the-box Tamir Dresher
  60. Final Experiment – Task continuation tree approach Asynchrony Abstraction Primitives Thread

    ThreadPool Locking (i.e. Lock, Mutex) Communication & Cooperation Concurrent Collections (Queue, Bag, etc) BlockingCollection Channels Cancellation* Parallelism Abstraction Parallel PLINQ Task/ValueTask IAsyncEnumerable Application Concurrency Flow/Networks/Pipelines TPL Dataflow Reactive Extensions TaskScheduler Actors (Akka, Orleans, Dapr) Signaling (i.e. ManualResetEvent) AsyncLocal IObservable/IObserver Partitioner Tamir Dresher
  61. Download Html page1 Translate Save page2 page3 Image1_1 Image1_n Download

    Html Translate Save page4 Download Html Translate Save page6 Download Html Translate Save page5 Download Html Translate Save page7 Tamir Dresher
  62. Task continuation tree approach async Task ScrapeUrlAsync(string url, int currentDepth

    ...) { await Task.Run(async () => { var html = await DownloadHtmlAsync(url); var translationTask = TranslateHtmlAsync(html, translateToLanguage); var replaceLinksTask = translationTask .ContinueWith(async translated => { var localizedLinksHtml = ReplaceToLocalLinks(basePath, url, translated.Result); await SaveHtmlAsync(basePath, uri, localizedLinksHtml); }); var scrapingTasks = RetrieveLinksFromHtml(html).Select(link => ScrapeUrlAsync(link!, …)); var tasks = new List<Task> { Task.WhenAll(scrapingTasks), DownloadImagesAsync(basePath, uri, RetrieveImageLinks(html)), translationTask, replaceLinksTask }; await Task.WhenAll(tasks); }); } Tamir Dresher
  63. Task continuation tree approach async Task ScrapeUrlAsync(string url, int currentDepth

    ...) { await Task.Run(async () => { var html = await DownloadHtmlAsync(url); var translationTask = TranslateHtmlAsync(html, translateToLanguage); var replaceLinksTask = translationTask .ContinueWith(async translated => { var localizedLinksHtml = ReplaceToLocalLinks(basePath, url, translated.Result); await SaveHtmlAsync(basePath, uri, localizedLinksHtml); }); var scrapingTasks = RetrieveLinksFromHtml(html).Select(link => ScrapeUrlAsync(link!, …)); var tasks = new List<Task> { Task.WhenAll(scrapingTasks), DownloadImagesAsync(basePath, uri, RetrieveImageLinks(html)), translationTask, replaceLinksTask }; await Task.WhenAll(tasks); }); } No await Tamir Dresher
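The "No await" highlight points at a classic pitfall: `ContinueWith` with an async lambda returns a `Task<Task>`, and awaiting only the outer task does not wait for the inner async work (here, `SaveHtmlAsync`). The inner task must be observed, for example via `Unwrap()`. A minimal, deterministic demonstration with hypothetical values:

```csharp
using System.Threading.Tasks;

var gate = new TaskCompletionSource();
int value = 0;

// ContinueWith with an async lambda yields Task<Task>: the outer task completes
// as soon as the lambda *returns* its (still pending) inner task.
Task<Task> nested = Task.CompletedTask.ContinueWith(async _ =>
{
    await gate.Task;  // inner async work, blocked until the gate opens
    value = 42;       // the "save" step
});

await nested;                        // completes even though the inner work hasn't run
bool outerDoneEarly = value == 0;    // true: only the outer task has finished

gate.SetResult();
await nested.Unwrap();               // now actually waits for the inner work
```

In the slide's code, the outer `replaceLinksTask` ends up in the `Task.WhenAll` list, so the `SaveHtmlAsync` call inside it may still be running when the method returns; unwrapping (or rewriting the continuation with plain `await`) closes that gap.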
  67. Benchmark  BenchmarkDotNet – 3 iterations, 2 sites, 7 levels

    deep, dummy translator (100 ms delay) InfiniteWebSite https://books.toscrape.com/  3 links per page, 7 levels => 3280 pages  3 * 3280 = 9840 images  1078 pages  1956 images Tamir Dresher
  68. Summary  Concurrency is hard  Don't reinvent the wheel

     Always test and measure – you might find surprising results  https://github.com/tamirdresher/AsyncProcessingInDotNet-Webscraper Tamir Dresher @tamir_dresher @tamirdresher.bsky.social Questions?