Slide 1

Slide 1 text

The Asynchronous Processing World of .NET
Race Condition: "Who's there?" (C#)
Tamir Dresher
Twitter: @tamir_dresher | Bluesky: @tamirdresher.bsky.social

Slide 2

Slide 2 text

Lock PLINQ AsyncLocal

Slide 3

Slide 3 text

Who am I?
Head of Architecture @
Software Engineering Lecturer, Ruppin Academic Center &
Ex-Microsoft MVP
My Books:

Slide 4

Slide 4 text

Sync | Async | Concurrency | Parallelism
Synchronous · Asynchronous · Concurrent · Parallel

Slide 5

Slide 5 text

The .NET concurrency landscape
(Diagram.) Categories: Primitives; Communication & Cooperation; Asynchronicity Abstraction; Parallelism Abstraction; Application Concurrency - Flow/Networks/Pipelines.
Items: Thread, ThreadPool, Locking (i.e. Lock, Mutex), Signaling (i.e. ManualResetEvent), AsyncLocal, Cancellation*, Concurrent Collections (Queue, Bag, etc.), BlockingCollection, Channels, Task/ValueTask, IAsyncEnumerable, IObservable/IObserver, TaskScheduler, Parallel, PLINQ, Partitioner, TPL Dataflow, Reactive Extensions, Actors (Akka, Orleans, Dapr)

Slide 6

Slide 6 text

Our use case – a translating web scraper
(Diagram.) Stages: Start URL → URLs → Download Html → Retrieve Links (newly found URLs go back into the queue) and Retrieve Image Links → Download Images, plus Translate → Replace To Local Links → Save Html.

Slide 7

Slide 7 text

Experiment 1 - Naïve async approach
(The .NET concurrency landscape diagram from slide 5, repeated.)

Slide 8

Slide 8 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true)) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 9

Slide 9 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true)) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 10

Slide 10 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true)) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 11

Slide 11 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true) || depth > maxDepth) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 12

Slide 12 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true) || depth > maxDepth) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 13

Slide 13 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true) || depth > maxDepth) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 14

Slide 14 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true) || depth > maxDepth) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 15

Slide 15 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true) || depth > maxDepth) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 16

Slide 16 text

Naïve async approach

public override async Task Scrape(string startUrl, int maxDepth, string translateTo)
{
    Queue<(int depth, string url)> urlQueue = new();
    urlQueue.Enqueue((depth: 0, startUrl));

    while (urlQueue.TryDequeue(out var qItem))
    {
        var (depth, url) = qItem;
        if (!_visitedUrls.TryAdd(url, true) || depth > maxDepth) continue;

        var uri = new Uri(url);
        string html = await DownloadHtmlAsync(uri);

        var links = RetrieveLinksFromHtml(html);
        foreach (var link in links)
        {
            urlQueue.Enqueue((depth + 1, link));
        }

        var imageLinks = RetrieveImageLinks(html);
        await DownloadImagesAsync(uri, imageLinks);

        var translated = await TranslateHtmlAsync(html, translateTo);
        var localizedLinksHtml = ReplaceToLocalLinks(uri, translated);
        await SaveHtmlAsync(uri, localizedLinksHtml);
    }
}

Slide 17

Slide 17 text

Naïve async approach - Summary
- Very easy to understand and reason about
- Issues:
  - Low throughput – the number of pages we process in a given time period
  - High latency – the time it takes to finish processing a single page
  - Low utilization

Slide 18

Slide 18 text

Adding concurrency
Moving to a producer-consumer pattern

Slide 19

Slide 19 text

Our use case – translating web scraper
(Diagram.) Stages: Start URL → URLs → Download Html → Retrieve Links (newly found URLs go back into the queue) and Retrieve Image Links → Download Images, plus Translate → Replace To Local Links → Save Html.

Slide 20

Slide 20 text

Our use case – translating web scraper
(Diagram.) Stages: Start URL → URLs → Download Html → Retrieve Links (newly found URLs go back into the queue) and Retrieve Image Links → Download Images, plus Translate → Replace To Local Links → Save Html.

Slide 21

Slide 21 text

Experiment 2 - BlockingCollection approach
(The .NET concurrency landscape diagram from slide 5, repeated.)

Slide 22

Slide 22 text

BlockingCollection
- A thread-safe wrapper around IProducerConsumerCollection<T> with built-in blocking and bounding
- Constructor: BlockingCollection<T>(IProducerConsumerCollection<T> collection, int boundedCapacity)

// Producer
var collection = new BlockingCollection<int>(new ConcurrentBag<int>(), boundedCapacity: 5);
collection.Add(100);
collection.CompleteAdding();

// Consumer
foreach (var num in collection.GetConsumingEnumerable())
{
    // Do something
}
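A minimal end-to-end sketch of the producer/consumer pattern above, assuming a bounded collection of capacity 5 and a single consumer task; the class name and item values are illustrative and not part of the scraper:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class BlockingCollectionDemo
{
    static async Task Main()
    {
        // Bounded to 5 items: Add blocks the producer while the buffer is full.
        using var collection = new BlockingCollection<int>(boundedCapacity: 5);

        var consumer = Task.Run(() =>
        {
            // Blocks until items arrive; ends after CompleteAdding once the buffer drains.
            foreach (var item in collection.GetConsumingEnumerable())
            {
                Console.WriteLine($"Consumed {item}");
            }
        });

        for (int i = 0; i < 20; i++)
        {
            collection.Add(i); // may block when 5 items are already buffered
        }
        collection.CompleteAdding(); // signal "no more items"

        await consumer;
    }
}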

Slide 23

Slide 23 text

BlockingCollection approach

BlockingCollection<(string url, string basePath, int depth, int maxDepth, string translateTo)> _htmlJobs = new();
BlockingCollection<(string imageUrl, string basePath, Uri uri)> _imageJobs = new();
BlockingCollection<(string htmlContent, string basePath, Uri uri, string language)> _translationJobs = new();

Slide 24

Slide 24 text

BlockingCollection approach

BlockingCollection<(string url, string basePath, int depth, int maxDepth, string translateTo)> _htmlJobs = new();
BlockingCollection<(string imageUrl, string basePath, Uri uri)> _imageJobs = new();
BlockingCollection<(string htmlContent, string basePath, Uri uri, string language)> _translationJobs = new();

// Extract images and queue them for download
var imageLinks = RetrieveImageLinks(html);
foreach (var imageUrl in imageLinks)
{
    _imageJobs.Add((imageUrl, basePath, uri));
}

Slide 25

Slide 25 text

BlockingCollection approach

BlockingCollection<(string url, string basePath, int depth, int maxDepth, string translateTo)> _htmlJobs = new();
BlockingCollection<(string imageUrl, string basePath, Uri uri)> _imageJobs = new();
BlockingCollection<(string htmlContent, string basePath, Uri uri, string language)> _translationJobs = new();

// Extract images and queue them for download
var imageLinks = RetrieveImageLinks(html);
foreach (var imageUrl in imageLinks)
{
    _imageJobs.Add((imageUrl, basePath, uri));
}

private async Task ProcessImageJobs()
{
    foreach (var (imageUrl, basePath, uri) in _imageJobs.GetConsumingEnumerable())
    {
        if (!_visitedUrls.TryAdd(imageUrl, true)) continue; // already processed this image
        await DownloadImageAsync(basePath, uri, imageUrl);
    }
}

Slide 26

Slide 26 text

BlockingCollection approach - workers

public void StartWorkers(int htmlWorkerCount = 20, int imageWorkerCount = 20, int translationWorkerCount = 20)
{
    // Start HTML scraping workers
    for (int i = 0; i < htmlWorkerCount; i++)
    {
        _workerTasks.Add(Task.Run(() => ProcessHtmlJobs()));
    }
    // Start image download workers
    for (int i = 0; i < imageWorkerCount; i++)
    {
        _workerTasks.Add(Task.Run(() => ProcessImageJobs()));
    }
    // Start translation workers
    for (int i = 0; i < translationWorkerCount; i++)
    {
        _workerTasks.Add(Task.Run(() => ProcessTranslationJobs()));
    }
}

Slide 27

Slide 27 text

BlockingCollection approach - Summary
- BlockingCollection is… well, blocking. It pauses producers when the collection is full and consumers when it's empty
- Threads waiting on a BlockingCollection consume system resources while blocked and reduce overall system performance
- No fine-grained control over buffering or backpressure handling
- No support for async/await
- Coordination between the various activities is the developer's responsibility
- Worker setup (including the degree of parallelism) is the developer's responsibility

Slide 28

Slide 28 text

The coordination problem
- Problem: we have a cycle in our design, so there is no intuitive way to signal completion on the BlockingCollection that feeds the HTML jobs
- Solution: keep a counter of remaining jobs and mark the collection complete when it reaches zero
(Diagram: the web scraper pipeline again; Retrieve Links feeds newly discovered URLs back into Download Html, which creates the cycle.)

Slide 29

Slide 29 text

The coordination problem

private int _htmlJobsCount = 1;

private async Task ProcessHtmlJobs()
{
    foreach (var (url, ..) in _htmlJobs.GetConsumingEnumerable())
    {
        try
        {
            ...
            var links = RetrieveLinksFromHtml(html);
            Interlocked.Add(ref _htmlJobsCount, links.Count);
            foreach (var link in links) { /* add to queue */ }
            ...
        }
        catch (Exception ex) { ... }
        finally
        {
            var remainingJobsCount = Interlocked.Decrement(ref _htmlJobsCount);
            Debug.Assert(remainingJobsCount >= 0);
            if (remainingJobsCount == 0)
            {
                _htmlJobs.CompleteAdding();
            }
        }
    }
}

Slide 30

Slide 30 text

BlockingCollection approach - Summary
- BlockingCollection is… well, blocking. It pauses producers when the collection is full and consumers when it's empty
- Threads waiting on a BlockingCollection consume system resources while blocked and reduce overall system performance
- No fine-grained control over buffering or backpressure handling
- No support for async/await
- Coordination between the various activities is the developer's responsibility
- Worker setup (including the degree of parallelism) is the developer's responsibility

Slide 31

Slide 31 text

BlockingCollection approach - Summary
- BlockingCollection is… well, blocking. It pauses producers when the collection is full and consumers when it's empty
- Threads waiting on a BlockingCollection consume system resources while blocked and reduce overall system performance
- No fine-grained control over buffering or backpressure handling
- No support for async/await
- Coordination between the various activities is the developer's responsibility
- Worker setup (including the degree of parallelism) is the developer's responsibility

Slide 32

Slide 32 text

Experiment 3 – System.Threading.Channels approach
(The .NET concurrency landscape diagram from slide 5, repeated.)

Slide 33

Slide 33 text

System.Threading.Channels
- Channels are high-performance, asynchronous primitives with support for bounded or unbounded producer-consumer scenarios
- Designed for non-blocking workflows using async/await
- Handle backpressure effectively for high-throughput scenarios

var channel = Channel.CreateUnbounded<int>();

Slide 34

Slide 34 text

System.Threading.Channels
- Channels are high-performance, asynchronous primitives with support for bounded or unbounded producer-consumer scenarios
- Designed for non-blocking workflows using async/await
- Handle backpressure effectively for high-throughput scenarios

// Producer
var channel = Channel.CreateUnbounded<int>();
for (int i = 1; i <= 10; i++)
{
    await channel.Writer.WriteAsync(i);
}
channel.Writer.Complete();

// Bounded variant with an explicit full-mode policy
var channel = Channel.CreateBounded<int>(
    new BoundedChannelOptions(3) { FullMode = BoundedChannelFullMode.DropOldest });

// Consumer
await foreach (var num in channel.Reader.ReadAllAsync())
{
    // Do something
}
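A self-contained sketch that puts the producer and consumer snippets above together, assuming a bounded channel with FullMode = Wait so the producer is backpressured instead of dropping items; the class name and values are illustrative:

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class ChannelDemo
{
    static async Task Main()
    {
        // Bounded to 3 items; WriteAsync waits asynchronously when the channel is full.
        var channel = Channel.CreateBounded<int>(
            new BoundedChannelOptions(3) { FullMode = BoundedChannelFullMode.Wait });

        var consumer = Task.Run(async () =>
        {
            // ReadAllAsync completes after Writer.Complete() once the channel drains.
            await foreach (var num in channel.Reader.ReadAllAsync())
            {
                Console.WriteLine($"Consumed {num}");
            }
        });

        for (int i = 1; i <= 10; i++)
        {
            await channel.Writer.WriteAsync(i); // backpressure: may wait for free space
        }
        channel.Writer.Complete();

        await consumer;
    }
}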

Slide 35

Slide 35 text

System.Threading.Channels approach

Channel<HtmlScrapeJob> _htmlChannel = Channel.CreateUnbounded<HtmlScrapeJob>();
Channel<ImageDownloadJob> _imageChannel = Channel.CreateUnbounded<ImageDownloadJob>();
Channel<HtmlTranslationJob> _translationChannel = Channel.CreateUnbounded<HtmlTranslationJob>();

public void StartWorkers(int htmlWorkerCount = 20, int imageWorkerCount = 20, int translationWorkerCount = 20)
{
    // Start HTML scraping workers
    for (int i = 0; i < htmlWorkerCount; i++)
    {
        _workerTasks.Add(Task.Run(() => ProcessHtmlJobs()));
    }
    // Start image download workers
    for (int i = 0; i < imageWorkerCount; i++)
    {
        _workerTasks.Add(Task.Run(() => ProcessImageJobs()));
    }
    // Start translation workers
    for (int i = 0; i < translationWorkerCount; i++)
    {
        _workerTasks.Add(Task.Run(() => ProcessTranslationJobs()));
    }
}

Slide 36

Slide 36 text

System.Threading.Channels approach

private async Task ProcessHtmlJobs()
{
    await foreach (var htmlScrapeJob in _htmlChannel.Reader.ReadAllAsync())
    {
        try
        {
            var uri = new Uri(htmlScrapeJob.url);
            var html = await DownloadHtmlAsync(uri);

            var links = RetrieveLinksFromHtml(html);
            foreach (var link in links)
            {
                await _htmlChannel.Writer.WriteAsync(new HtmlScrapeJob(…));
            }

            var imageLinks = RetrieveImageLinks(html);
            foreach (var imageUrl in imageLinks)
            {
                await _imageChannel.Writer.WriteAsync(new ImageDownloadJob(…));
            }

            await _translationChannel.Writer.WriteAsync(new HtmlTranslationJob(…));
        }
        catch (Exception ex) { … }
    }
}

Slide 37

Slide 37 text

System.Threading.Channels approach

private async Task ProcessHtmlJobs()
{
    await foreach (var htmlScrapeJob in _htmlChannel.Reader.ReadAllAsync())
    {
        try
        {
            var uri = new Uri(htmlScrapeJob.url);
            var html = await DownloadHtmlAsync(uri);

            var links = RetrieveLinksFromHtml(html);
            foreach (var link in links)
            {
                await _htmlChannel.Writer.WriteAsync(new HtmlScrapeJob(…));
            }

            var imageLinks = RetrieveImageLinks(html);
            foreach (var imageUrl in imageLinks)
            {
                await _imageChannel.Writer.WriteAsync(new ImageDownloadJob(…));
            }

            await _translationChannel.Writer.WriteAsync(new HtmlTranslationJob(…));
        }
        catch (Exception ex) { … }
    }
}

Slide 38

Slide 38 text

System.Threading.Channels approach

private async Task ProcessHtmlJobs()
{
    await foreach (var htmlScrapeJob in _htmlChannel.Reader.ReadAllAsync())
    {
        try
        {
            var uri = new Uri(htmlScrapeJob.url);
            var html = await DownloadHtmlAsync(uri);

            var links = RetrieveLinksFromHtml(html);
            foreach (var link in links)
            {
                await _htmlChannel.Writer.WriteAsync(new HtmlScrapeJob(…));
            }

            var imageLinks = RetrieveImageLinks(html);
            foreach (var imageUrl in imageLinks)
            {
                await _imageChannel.Writer.WriteAsync(new ImageDownloadJob(…));
            }

            await _translationChannel.Writer.WriteAsync(new HtmlTranslationJob(…));
        }
        catch (Exception ex) { … }
    }
}

Slide 39

Slide 39 text

System.Threading.Channels approach

private async Task ProcessHtmlJobs()
{
    await foreach (var htmlScrapeJob in _htmlChannel.Reader.ReadAllAsync())
    {
        try
        {
            var uri = new Uri(htmlScrapeJob.url);
            var html = await DownloadHtmlAsync(uri);

            var links = RetrieveLinksFromHtml(html);
            foreach (var link in links)
            {
                await _htmlChannel.Writer.WriteAsync(new HtmlScrapeJob(…));
            }

            var imageLinks = RetrieveImageLinks(html);
            foreach (var imageUrl in imageLinks)
            {
                await _imageChannel.Writer.WriteAsync(new ImageDownloadJob(…));
            }

            await _translationChannel.Writer.WriteAsync(new HtmlTranslationJob(…));
        }
        catch (Exception ex) { … }
    }
}

Slide 40

Slide 40 text

System.Threading.Channels approach

private async Task ProcessHtmlJobs()
{
    await foreach (var htmlScrapeJob in _htmlChannel.Reader.ReadAllAsync())
    {
        try
        {
            var uri = new Uri(htmlScrapeJob.url);
            var html = await DownloadHtmlAsync(uri);

            var links = RetrieveLinksFromHtml(html);
            foreach (var link in links)
            {
                await _htmlChannel.Writer.WriteAsync(new HtmlScrapeJob(…));
            }

            var imageLinks = RetrieveImageLinks(html);
            foreach (var imageUrl in imageLinks)
            {
                await _imageChannel.Writer.WriteAsync(new ImageDownloadJob(…));
            }

            await _translationChannel.Writer.WriteAsync(new HtmlTranslationJob(…));
        }
        catch (Exception ex) { … }
    }
}

Slide 41

Slide 41 text

Channels approach - Summary
- While System.Threading.Channels is powerful and modern, it feels very low-level; the developer still needs to deal with:
  - Complex coordination
  - Limited debugging and observability
  - Prioritization
  - Dynamic resizing of capacity or parallelism level
  - Cancellation and error propagation

Slide 42

Slide 42 text

Experiment 4 - TPL Dataflow approach
(The .NET concurrency landscape diagram from slide 5, repeated.)

Slide 43

Slide 43 text

TPL Dataflow
- A library for building data processing pipelines using asynchronous, thread-safe blocks in a modular way
- Predefined building blocks like BufferBlock, TransformBlock, and ActionBlock (see the BufferBlock sketch below)
- Built-in backpressure - automatically manages flow control to prevent overwhelming consumers
- The workflow creation is self-explanatory, making it easier to understand, debug, and extend
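BufferBlock is named above but not shown in the sample on the next slides; a minimal sketch of using it as an async FIFO buffer between a producer and a consumer, with illustrative names and values (not taken from the deck):

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class BufferBlockDemo
{
    static async Task Main()
    {
        var buffer = new BufferBlock<int>(); // unbounded FIFO buffer

        var consumer = Task.Run(async () =>
        {
            // OutputAvailableAsync returns false after Complete() once the buffer is empty.
            while (await buffer.OutputAvailableAsync())
            {
                var item = await buffer.ReceiveAsync();
                Console.WriteLine($"Consumed {item}");
            }
        });

        for (int i = 0; i < 10; i++)
        {
            buffer.Post(i); // Post never blocks on an unbounded block
        }
        buffer.Complete();

        await consumer;
    }
}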

Slide 44

Slide 44 text

TPL Dataflow
- A library for building data processing pipelines using asynchronous, thread-safe blocks in a modular way
- Predefined building blocks like BufferBlock, TransformBlock, and ActionBlock
- Built-in backpressure - automatically manages flow control to prevent overwhelming consumers
- The workflow creation is self-explanatory, making it easier to understand, debug, and extend

var multiplier = new TransformBlock<int, int>(x => x * 2);
var adder = new TransformManyBlock<int, int>(x => [x + 1, x + 2, x + 3]);
var printer = new ActionBlock<int>(x => Console.WriteLine($"Processed: {x}"));

// Link blocks (propagate completion so awaiting printer.Completion finishes)
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
multiplier.LinkTo(adder, linkOptions);
adder.LinkTo(printer, linkOptions);

// Post data
multiplier.Post(5);
multiplier.Post(10);

// Mark completion
multiplier.Complete();
await printer.Completion;

Slide 45

Slide 45 text

TPL Dataflow
- A library for building data processing pipelines using asynchronous, thread-safe blocks in a modular way
- Predefined building blocks like BufferBlock, TransformBlock, and ActionBlock
- Built-in backpressure - automatically manages flow control to prevent overwhelming consumers
- The workflow creation is self-explanatory, making it easier to understand, debug, and extend

var multiplier = new TransformBlock<int, int>(x => x * 2);
var adder = new TransformManyBlock<int, int>(x => [x + 1, x + 2, x + 3]);
var printer = new ActionBlock<int>(x => Console.WriteLine($"Processed: {x}"));

// Link blocks (propagate completion so awaiting printer.Completion finishes)
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
multiplier.LinkTo(adder, linkOptions);
adder.LinkTo(printer, linkOptions);

// Post data
multiplier.Post(5);
multiplier.Post(10);

// Mark completion
multiplier.Complete();
await printer.Completion;

Slide 46

Slide 46 text

TPL Dataflow
- A library for building data processing pipelines using asynchronous, thread-safe blocks in a modular way
- Predefined building blocks like BufferBlock, TransformBlock, and ActionBlock
- Built-in backpressure - automatically manages flow control to prevent overwhelming consumers
- The workflow creation is self-explanatory, making it easier to understand, debug, and extend

var multiplier = new TransformBlock<int, int>(x => x * 2);
var adder = new TransformManyBlock<int, int>(x => [x + 1, x + 2, x + 3]);
var printer = new ActionBlock<int>(x => Console.WriteLine($"Processed: {x}"));

// Link blocks (propagate completion so awaiting printer.Completion finishes)
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
multiplier.LinkTo(adder, linkOptions);
adder.LinkTo(printer, linkOptions);

// Post data
multiplier.Post(5);
multiplier.Post(10);

// Mark completion
multiplier.Complete();
await printer.Completion;

Slide 47

Slide 47 text

TPL Dataflow
- A library for building data processing pipelines using asynchronous, thread-safe blocks in a modular way
- Predefined building blocks like BufferBlock, TransformBlock, and ActionBlock
- Built-in backpressure - automatically manages flow control to prevent overwhelming consumers
- The workflow creation is self-explanatory, making it easier to understand, debug, and extend

var multiplier = new TransformBlock<int, int>(x => x * 2);
var adder = new TransformManyBlock<int, int>(x => [x + 1, x + 2, x + 3]);
var printer = new ActionBlock<int>(x => Console.WriteLine($"Processed: {x}"));

// Link blocks (propagate completion so awaiting printer.Completion finishes)
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
multiplier.LinkTo(adder, linkOptions);
adder.LinkTo(printer, linkOptions);

// Post data
multiplier.Post(5);
multiplier.Post(10);

// Mark completion
multiplier.Complete();
await printer.Completion;

Slide 48

Slide 48 text

Our use case – translating web scraper
(Diagram.) Stages: Start URL → URLs → Download Html → Retrieve Links (newly found URLs go back into the queue) and Retrieve Image Links → Download Images, plus Translate → Replace To Local Links → Save Html.

Slide 49

Slide 49 text

Our use case – translating web scraper
(Diagram, now annotated with TPL Dataflow block types.) Download Html: TransformBlock, followed by a BroadcastBlock; Retrieve Links: TransformManyBlock; Retrieve Image Links: TransformManyBlock; Download Images: ActionBlock; Translate: TransformBlock; Replace To Local Links: TransformBlock; Save Html: ActionBlock.

Slide 50

Slide 50 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    var fetchHtmlBlock = CreateFetchHtmlBlock(dataflowOptions, context);
    var htmlBroadcaster = new BroadcastBlock<HtmlProcessingData>(x => x);
    var retrieveHtmlLinksBlock = CreateRetrieveHtmlLinksBlock(dataflowOptions, context);
    var retrieveHtmlImageLinksBlock = CreateRetrieveHtmlImageLinksBlock(dataflowOptions);
    var downloadImageBlock = CreateDownloadImageBlock(dataflowOptions);
    var translateHtmlBlock = CreateTranslateHtmlBlock(dataflowOptions, context);
    var replaceToLocalLinksBlock = CreateReplaceToLocalLinksBlock(dataflowOptions);
    var saveHtmlBlock = CreateSaveHtmlBlock(dataflowOptions);
}

Slide 51

Slide 51 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20, BoundedCapacity = 100 };

    // Set up dataflow blocks
    var fetchHtmlBlock = CreateFetchHtmlBlock(dataflowOptions, context);
    var htmlBroadcaster = new BroadcastBlock<HtmlProcessingData>(x => x);
    var retrieveHtmlLinksBlock = CreateRetrieveHtmlLinksBlock(dataflowOptions, context);
    var retrieveHtmlImageLinksBlock = CreateRetrieveHtmlImageLinksBlock(dataflowOptions);
    var downloadImageBlock = CreateDownloadImageBlock(dataflowOptions);
    var translateHtmlBlock = CreateTranslateHtmlBlock(dataflowOptions, context);
    var replaceToLocalLinksBlock = CreateReplaceToLocalLinksBlock(dataflowOptions);
    var saveHtmlBlock = CreateSaveHtmlBlock(dataflowOptions);
}

TransformBlock<UrlProcessingData, HtmlProcessingData> CreateFetchHtmlBlock(ExecutionDataflowBlockOptions options, ScrapeContext context)
{
    return new TransformBlock<UrlProcessingData, HtmlProcessingData>(async item =>
    {
        try
        {
            if (item.depth <= context.MaxDepth && _processedUrls.TryAdd(item.Url, true))
            {
                var html = await DownloadHtmlAsync(item.Url);
                return new HtmlProcessingData(html, item.Url, item.depth, context);
            }
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error fetching html {url}", item?.Url);
        }
        return new HtmlProcessingData("", "", -1, context);
    }, options);
}

Slide 52

Slide 52 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20, BoundedCapacity = 100 };

    // Set up dataflow blocks
    var fetchHtmlBlock = CreateFetchHtmlBlock(dataflowOptions, context);
    var htmlBroadcaster = new BroadcastBlock<HtmlProcessingData>(x => x);
    var retrieveHtmlLinksBlock = CreateRetrieveHtmlLinksBlock(dataflowOptions, context);
    var retrieveHtmlImageLinksBlock = CreateRetrieveHtmlImageLinksBlock(dataflowOptions);
    var downloadImageBlock = CreateDownloadImageBlock(dataflowOptions);
    var translateHtmlBlock = CreateTranslateHtmlBlock(dataflowOptions, context);
    var replaceToLocalLinksBlock = CreateReplaceToLocalLinksBlock(dataflowOptions);
    var saveHtmlBlock = CreateSaveHtmlBlock(dataflowOptions);
}

TransformBlock<UrlProcessingData, HtmlProcessingData> CreateFetchHtmlBlock(ExecutionDataflowBlockOptions options, ScrapeContext context)
{
    return new TransformBlock<UrlProcessingData, HtmlProcessingData>(async item =>
    {
        try
        {
            if (item.depth <= context.MaxDepth && _processedUrls.TryAdd(item.Url, true))
            {
                var html = await DownloadHtmlAsync(item.Url);
                return new HtmlProcessingData(html, item.Url, item.depth, context);
            }
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error fetching html {url}", item?.Url);
        }
        return new HtmlProcessingData("", "", -1, context);
    }, options);
}

Slide 53

Slide 53 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    var fetchHtmlBlock = CreateFetchHtmlBlock(dataflowOptions, context);
    var htmlBroadcaster = new BroadcastBlock<HtmlProcessingData>(x => x);
    var retrieveHtmlLinksBlock = CreateRetrieveHtmlLinksBlock(dataflowOptions, context);
    var retrieveHtmlImageLinksBlock = CreateRetrieveHtmlImageLinksBlock(dataflowOptions);
    var downloadImageBlock = CreateDownloadImageBlock(dataflowOptions);
    var translateHtmlBlock = CreateTranslateHtmlBlock(dataflowOptions, context);
    var replaceToLocalLinksBlock = CreateReplaceToLocalLinksBlock(dataflowOptions);
    var saveHtmlBlock = CreateSaveHtmlBlock(dataflowOptions);
}

Slide 54

Slide 54 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions);
}

Slide 55

Slide 55 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html)); // filter empty messages
    htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));          // filter empty messages
}

Slide 56

Slide 56 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
    retrieveHtmlImageLinksBlock.LinkTo(downloadImageBlock, linkOptions);
    retrieveHtmlLinksBlock.LinkTo(fetchHtmlBlock, linkOptions);
    htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
}

Slide 57

Slide 57 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
    retrieveHtmlImageLinksBlock.LinkTo(downloadImageBlock, linkOptions);
    retrieveHtmlLinksBlock.LinkTo(fetchHtmlBlock, linkOptions);
    htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
    translateHtmlBlock.LinkTo(replaceToLocalLinksBlock, linkOptions);
}

Slide 58

Slide 58 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
    retrieveHtmlImageLinksBlock.LinkTo(downloadImageBlock, linkOptions);
    retrieveHtmlLinksBlock.LinkTo(fetchHtmlBlock, linkOptions);
    htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
    translateHtmlBlock.LinkTo(replaceToLocalLinksBlock, linkOptions);
    replaceToLocalLinksBlock.LinkTo(saveHtmlBlock, linkOptions);
}

Slide 59

Slide 59 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    DataflowLinkOptions linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    fetchHtmlBlock.LinkTo(htmlBroadcaster, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlLinksBlock, linkOptions);
    htmlBroadcaster.LinkTo(retrieveHtmlImageLinksBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
    retrieveHtmlImageLinksBlock.LinkTo(downloadImageBlock, linkOptions);
    retrieveHtmlLinksBlock.LinkTo(fetchHtmlBlock, linkOptions);
    htmlBroadcaster.LinkTo(translateHtmlBlock, linkOptions, x => !string.IsNullOrEmpty(x.Html));
    translateHtmlBlock.LinkTo(replaceToLocalLinksBlock, linkOptions);
    replaceToLocalLinksBlock.LinkTo(saveHtmlBlock, linkOptions);
}

Slide 60

Slide 60 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    ...

    // Seed the pipeline
    fetchHtmlBlock.Post(new UrlProcessingData(context.StartUrl, context, depth: 0, false));
}

Slide 61

Slide 61 text

TPL Dataflow approach

async Task ScrapeAsync(string startUrl, string basePath, int maxDepth, string translateTo, bool stayInDomain)
{
    var context = new ScrapeContext(startUrl, …);
    var dataflowOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 };

    // Set up dataflow blocks
    ...

    // Link blocks in the pipeline
    ...

    // Seed the pipeline
    fetchHtmlBlock.Post(new UrlProcessingData(context.StartUrl, context, depth: 0, false));

    // Completion handling
    await context.Completion;
    fetchHtmlBlock.Complete();
    await Task.WhenAll(fetchHtmlBlock.Completion,
                       downloadImageBlock.Completion,
                       saveHtmlBlock.Completion);
}
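The deck doesn't show how context.Completion is implemented; one plausible sketch reuses the remaining-jobs counter idea from slide 29. Everything below, including the class and member names, is hypothetical and not the deck's actual ScrapeContext:

using System.Threading;
using System.Threading.Tasks;

public sealed class ScrapeCompletionTracker
{
    private int _pendingJobs = 1; // the seeded start URL counts as one pending job
    private readonly TaskCompletionSource _tcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public Task Completion => _tcs.Task;

    // Call for every newly discovered URL before posting it to fetchHtmlBlock.
    public void JobAdded() => Interlocked.Increment(ref _pendingJobs);

    // Call (e.g. in a finally block) when a fetch job finishes, successfully or not.
    public void JobCompleted()
    {
        if (Interlocked.Decrement(ref _pendingJobs) == 0)
        {
            _tcs.TrySetResult(); // no work left anywhere in the cyclic pipeline
        }
    }
}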

Slide 62

Slide 62 text

TPL Dataflow approach - Summary
- Explicit way to define and reason about the network
- Modular - a rich set of blocks for recurring concerns
- Debuggability - InputCount, OutputCount, and Completion can be monitored for errors
- Provides optimizations and edge-case handling
- Issues:
  - Learning curve
  - Good for DAGs, but when adding loops and forks, coordination is not out-of-the-box

Slide 63

Slide 63 text

Final Experiment – Task continuation tree approach
(The .NET concurrency landscape diagram from slide 5, repeated.)

Slide 64

Slide 64 text

(Diagram: a task continuation tree. page1 runs Download Html, Translate, Save, plus image downloads Image1_1 … Image1_n; child pages page2 and page3 are scraped the same way, and their children (page4 … page7) each repeat Download Html, Translate, Save.)

Slide 65

Slide 65 text

Task continuation tree approach

async Task ScrapeUrlAsync(string url, int currentDepth ...)
{
    await Task.Run(async () =>
    {
        var html = await DownloadHtmlAsync(url);

        var translationTask = TranslateHtmlAsync(html, translateToLanguage);
        var replaceLinksTask = translationTask.ContinueWith(async translated =>
        {
            var localizedLinksHtml = ReplaceToLocalLinks(basePath, url, translated.Result);
            await SaveHtmlAsync(basePath, uri, localizedLinksHtml);
        });

        var scrapingTasks = RetrieveLinksFromHtml(html).Select(link => ScrapeUrlAsync(link!, …));

        var tasks = new List<Task>
        {
            Task.WhenAll(scrapingTasks),
            DownloadImagesAsync(basePath, uri, RetrieveImageLinks(html)),
            translationTask,
            replaceLinksTask
        };
        await Task.WhenAll(tasks);
    });
}

Slide 66

Slide 66 text

Task continuation tree approach

async Task ScrapeUrlAsync(string url, int currentDepth ...)
{
    await Task.Run(async () =>
    {
        var html = await DownloadHtmlAsync(url);

        var translationTask = TranslateHtmlAsync(html, translateToLanguage);
        var replaceLinksTask = translationTask.ContinueWith(async translated =>
        {
            var localizedLinksHtml = ReplaceToLocalLinks(basePath, url, translated.Result);
            await SaveHtmlAsync(basePath, uri, localizedLinksHtml);
        });

        var scrapingTasks = RetrieveLinksFromHtml(html).Select(link => ScrapeUrlAsync(link!, …));

        var tasks = new List<Task>
        {
            Task.WhenAll(scrapingTasks),
            DownloadImagesAsync(basePath, uri, RetrieveImageLinks(html)),
            translationTask,
            replaceLinksTask
        };
        await Task.WhenAll(tasks);
    });
}

(Callout: "No await")
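A note on ContinueWith with an async lambda, relevant to the code above: it returns a Task<Task>, so awaiting only the outer task (which is what Task.WhenAll does with replaceLinksTask) does not wait for SaveHtmlAsync to finish; Unwrap, or a direct await, is needed for that. A standalone sketch with illustrative names:

using System;
using System.Threading.Tasks;

class ContinueWithPitfall
{
    static async Task Main()
    {
        Task<string> translationTask = Task.FromResult("translated html");

        // ContinueWith with an async lambda yields a Task<Task>: the outer task
        // completes as soon as the lambda returns its (still running) inner task.
        Task<Task> replaceLinksTask = translationTask.ContinueWith(async translated =>
        {
            await Task.Delay(100);                // stands in for SaveHtmlAsync
            Console.WriteLine(translated.Result); // safe here: translationTask has completed
        });

        // Unwrap (or await twice) to actually wait for the inner work to finish.
        await replaceLinksTask.Unwrap();

        // Awaiting directly is usually simpler:
        // var translated = await translationTask;
        // await SaveHtmlAsync(translated);
    }
}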

Slide 67

Slide 67 text

Task continuation tree approach

async Task ScrapeUrlAsync(string url, int currentDepth ...)
{
    await Task.Run(async () =>
    {
        var html = await DownloadHtmlAsync(url);

        var translationTask = TranslateHtmlAsync(html, translateToLanguage);
        var replaceLinksTask = translationTask.ContinueWith(async translated =>
        {
            var localizedLinksHtml = ReplaceToLocalLinks(basePath, url, translated.Result);
            await SaveHtmlAsync(basePath, uri, localizedLinksHtml);
        });

        var scrapingTasks = RetrieveLinksFromHtml(html).Select(link => ScrapeUrlAsync(link!, …));

        var tasks = new List<Task>
        {
            Task.WhenAll(scrapingTasks),
            DownloadImagesAsync(basePath, uri, RetrieveImageLinks(html)),
            translationTask,
            replaceLinksTask
        };
        await Task.WhenAll(tasks);
    });
}

Slide 68

Slide 68 text

Task continuation tree approach

async Task ScrapeUrlAsync(string url, int currentDepth ...)
{
    await Task.Run(async () =>
    {
        var html = await DownloadHtmlAsync(url);

        var translationTask = TranslateHtmlAsync(html, translateToLanguage);
        var replaceLinksTask = translationTask.ContinueWith(async translated =>
        {
            var localizedLinksHtml = ReplaceToLocalLinks(basePath, url, translated.Result);
            await SaveHtmlAsync(basePath, uri, localizedLinksHtml);
        });

        var scrapingTasks = RetrieveLinksFromHtml(html).Select(link => ScrapeUrlAsync(link!, …));

        var tasks = new List<Task>
        {
            Task.WhenAll(scrapingTasks),
            DownloadImagesAsync(basePath, uri, RetrieveImageLinks(html)),
            translationTask,
            replaceLinksTask
        };
        await Task.WhenAll(tasks);
    });
}

Slide 69

Slide 69 text

Task continuation tree approach

async Task ScrapeUrlAsync(string url, int currentDepth ...)
{
    await Task.Run(async () =>
    {
        var html = await DownloadHtmlAsync(url);

        var translationTask = TranslateHtmlAsync(html, translateToLanguage);
        var replaceLinksTask = translationTask.ContinueWith(async translated =>
        {
            var localizedLinksHtml = ReplaceToLocalLinks(basePath, url, translated.Result);
            await SaveHtmlAsync(basePath, uri, localizedLinksHtml);
        });

        var scrapingTasks = RetrieveLinksFromHtml(html).Select(link => ScrapeUrlAsync(link!, …));

        var tasks = new List<Task>
        {
            Task.WhenAll(scrapingTasks),
            DownloadImagesAsync(basePath, uri, RetrieveImageLinks(html)),
            translationTask,
            replaceLinksTask
        };
        await Task.WhenAll(tasks);
    });
}

Slide 70

Slide 70 text

Benchmark Results

Slide 71

Slide 71 text

Benchmark
- BenchmarkDotNet – 3 iterations, 2 sites, 7 levels of depth, dummy translator (100 ms delay)
- InfiniteWebSite: 3 links per page, 7 levels => 3280 pages, 3 * 3280 = 9840 images
- https://books.toscrape.com/: 1078 pages, 1956 images
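A quick arithmetic check of the InfiniteWebSite numbers, assuming 3 links and 3 images per page across depth levels 0 through 7 (which is what the slide's own 3 * 3280 implies):

\sum_{d=0}^{7} 3^{d} = \frac{3^{8} - 1}{3 - 1} = \frac{6560}{2} = 3280 \text{ pages}, \qquad 3 \times 3280 = 9840 \text{ images}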

Slide 72

Slide 72 text

InfiniteWebSite benchmark

Slide 73

Slide 73 text

books.toscrape.com benchmark

Slide 74

Slide 74 text

Summary
- Concurrency is hard
- Don't reinvent the wheel
- Always test and measure – you might find surprising results
- https://github.com/tamirdresher/AsyncProcessingInDotNet-Webscraper

Questions?
Tamir Dresher | @tamir_dresher | @tamirdresher.bsky.social