DIY: Homemade Thread Pool

A talk given at the CPPTO meetup (C++ Toronto) in February 2020 about building a custom thread pool step by step.

Denis Kormalev

February 25, 2020

Transcript

  1. Existing options
     • Boost.Asio: good performance, but requires Boost
     • Qt: big dependency, subpar performance
     • Folly: tons of features, but a huge dependency
     • Intel TBB: good performance, but the API is a bit clunky (oriented more toward computations than scheduling) and needs a lot of boilerplate in user code
     • threadpoolcpp (by inkooboo): awesome performance, but few tests and a complete lack of features
     • Tons of other smaller implementations

  2. What if we need futures as well?
     • std::future: part of the standard library, but almost no features
     • boost::future: has then() (no more sugar, though), but requires Boost
     • QFuture: almost unusable outside QtConcurrent
     • Folly: good API, lots of features, but requires Folly
     • Tons of other smaller implementations

  3. What if we need both a thread pool and futures?
     • Boost.Asio
     • Folly
     • A few other smaller implementations

  4. Futures
     • Pure C++17
     • Error handling in ADT style (biased toward the Either/expected way)
     • Continuations
       • T -> U, T -> Future<U>
       • Failure -> T, Failure -> Future<T>, Failure1 -> Failure2
       • …
     • Sugar
       • Future<T>, Future<U> -> Future<tuple<T, U>>
       • Container<Future<T>> -> Future<Container<T>>
       • …

  5. Task scheduling
     • Pure C++17
     • Concise API
     • Avoid extra user-space boilerplate
     • Use futures
     • Subpools
     • Priorities
     • Sugar
       • Single task, multiple data
       • …

  6. Talk is about
     • Thread pool and task scheduling core
     • Step-by-step enhancements
     • Benchmarks
     • Subtle optimizations leading to huge improvements
     • Extra helpers for easier usage

  7. Talk is NOT about
     • Futures implementation
     • "parallel_for" helpers
     • Trampolining helpers for overly deep Future continuation chains
     • noexcept and access specifiers
     • Extra bookkeeping, getters, and setters
     • Extra optimizations based on heuristics and assumptions
     • All of this can be checked out in the library source code, though
     • It is also not a talk about the "fastest ever" task scheduler, sorry folks

  8. First naïve try
     • A queue of tasks
     • A vector of workers
     • A maintenance thread that constantly tries to schedule

  9. No maintenance thread, please
     • Each task insertion runs schedule()
     • When a task finishes, the worker invokes schedule() as well (see the sketch below)

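     A minimal sketch of this core, assuming a single mutex-guarded queue. The
     names (TaskDispatcher, mainLock, schedule()) are illustrative, not
     asynqro's actual code; shutdown is deliberately left out here and picked
     up again on slide 15.

     #include <condition_variable>
     #include <functional>
     #include <mutex>
     #include <queue>
     #include <thread>
     #include <vector>

     class TaskDispatcher {
     public:
         explicit TaskDispatcher(unsigned count = std::thread::hardware_concurrency()) {
             for (unsigned i = 0; i < count; ++i)
                 workers.emplace_back([this] { run(); });
         }
         void addTask(std::function<void()> task) {
             {
                 std::lock_guard lock(mainLock);
                 tasks.push(std::move(task));
             }
             schedule(); // scheduling is driven by insertion, not by a maintenance thread
         }

     private:
         void schedule() { waiter.notify_one(); } // wake one sleeping worker
         void run() {
             for (;;) { // stopping this loop is the topic of slide 15
                 std::function<void()> task;
                 {
                     std::unique_lock lock(mainLock);
                     waiter.wait(lock, [this] { return !tasks.empty(); });
                     task = std::move(tasks.front());
                     tasks.pop();
                 }
                 task();
                 // the worker that finished a task immediately looks for the
                 // next one, which plays the role of the second schedule() call
             }
         }

         std::mutex mainLock;
         std::condition_variable waiter;
         std::queue<std::function<void()>> tasks;
         std::vector<std::thread> workers;
     };
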
 10. Subpools
     • Two main types of tasks: CPU-intensive and IO-related
     • We want to allow users to specify their own subpools with custom limits (sketch below)
     • We want to allow users to bind a subpool to one thread for resource-related operations

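     A hypothetical sketch of those custom limits; SubpoolRegistry, tryAcquire()
     and release() are invented names, not asynqro's API:

     #include <cstddef>
     #include <map>
     #include <string>

     struct Subpool {
         std::size_t capacity = 0; // max tasks from this subpool running at once
         std::size_t running = 0;
     };

     class SubpoolRegistry {
     public:
         void define(const std::string &name, std::size_t capacity) {
             subpools[name].capacity = capacity;
         }
         // Called by the dispatcher (under its lock) before handing a task to a worker.
         bool tryAcquire(const std::string &name) {
             Subpool &p = subpools[name];
             if (p.running >= p.capacity)
                 return false; // task stays queued until release() frees a slot
             ++p.running;
             return true;
         }
         // Called from taskFinished() so a queued task of this subpool can run.
         void release(const std::string &name) { --subpools[name].running; }

     private:
         std::map<std::string, Subpool> subpools;
     };
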
 11. No rest for the wicked
     • What may happen if a task in the queue is waiting for subpool capacity?
       • A worker invokes taskFinished()
       • The new task is sent to another worker
       • The first worker goes to sleep, the second one wakes up
     • Can we eliminate these extra sleeps? Yes!

 12. Thread-bound tasks
     • Is scheduling the same as for regular ones? Or…
     • After the first schedule for the subpool we always know the exact thread
     • No need to put them in the main queue
     • We can schedule them directly from insertTaskInfo()
     • Is it unfair?
       • Yes: they will possibly be scheduled before others
       • And no: thread-bound tasks can't be scheduled to another thread anyway
     • We need worker-specific queues for this (sketch below)

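     A sketch of the worker-specific queues, with invented names (Dispatcher,
     insertThreadBound(), poke()) and all locking elided for brevity:

     #include <array>
     #include <deque>
     #include <functional>
     #include <map>
     #include <utility>

     using Task = std::function<void()>;

     struct Worker {
         std::deque<Task> ownQueue; // worker-specific queue for thread-bound tasks
         void poke() {}             // wake this worker's thread (details elided)
     };

     class Dispatcher {
     public:
         // insertTaskInfo() analogue for thread-bound tasks: skip the main queue
         void insertThreadBound(int subpoolTag, Task task) {
             Worker *&w = bound[subpoolTag];
             if (!w)
                 w = &workers[0]; // the first schedule pins the subpool to a thread
             w->ownQueue.push_back(std::move(task)); // unfair: may run before older tasks,
             w->poke();                              // but it can't run anywhere else anyway
         }

     private:
         std::array<Worker, 4> workers; // fixed size just for the sketch
         std::map<int, Worker *> bound; // subpool tag -> pinned worker
     };
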
 13. Priorities
     • Not all tasks are created equal
     • A task priority is just another uint8_t
     • How can we schedule it, though?
     • We need to read our list from the left, but write to it at multiple points, not only by appending to the end (sketch below)

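     One hypothetical structure for those multiple write points: a single list
     read from the front, plus a map from each priority to the tail of its
     bucket, so an insert never scans the queue. TaskQueue and its members are
     invented names, not asynqro's exact structure:

     #include <cstdint>
     #include <functional>
     #include <iterator>
     #include <list>
     #include <map>
     #include <utility>

     class TaskQueue {
         struct Entry {
             std::function<void()> task;
             std::uint8_t priority; // smaller value = more urgent
         };
         std::list<Entry> queue; // read from the left (front)
         std::map<std::uint8_t, std::list<Entry>::iterator> tails; // write points

     public:
         void push(std::function<void()> task, std::uint8_t priority) {
             auto pos = queue.begin(); // no same-or-more-urgent bucket: insert at front
             auto next = tails.upper_bound(priority);
             if (next != tails.begin()) // such a bucket exists:
                 pos = std::next(std::prev(next)->second); // insert right after its tail
             tails[priority] = queue.insert(pos, Entry{std::move(task), priority});
         }
         std::function<void()> pop() { // precondition: !empty()
             Entry entry = std::move(queue.front());
             auto it = tails.find(entry.priority);
             if (it != tails.end() && it->second == queue.begin())
                 tails.erase(it); // that was the bucket's only element
             queue.pop_front();
             return std::move(entry.task);
         }
         bool empty() const { return queue.empty(); }
     };
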
 14. Are mutexes good enough?
     • A mutex is fine and easy to use, but too heavy
     • A simple spin lock based on std::atomic_flag to the rescue (sketch below)
     • Replacing lock in Worker
       • High concurrency: roughly the same
       • Low concurrency: 10-20% less overhead
     • Replacing mainLock in TaskDispatcher
       • High concurrency: 40-70% less overhead
       • Low concurrency: 10-20% less overhead

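     The classic std::atomic_flag spin lock looks roughly like this (a
     well-known pattern; asynqro's actual implementation may differ in details
     such as backoff):

     #include <atomic>

     class SpinLock {
         std::atomic_flag flag = ATOMIC_FLAG_INIT;

     public:
         void lock() noexcept {
             // busy-wait: the critical sections here are a few instructions
             // long, so spinning is cheaper than parking the thread in a mutex
             while (flag.test_and_set(std::memory_order_acquire))
                 ;
         }
         void unlock() noexcept { flag.clear(std::memory_order_release); }
     };

     // Drop-in replacement at call sites: std::lock_guard<SpinLock> guard(lock);
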
 15. Wait, what about destruction?
     • The run() loops are infinite
     • We need to stop them and destroy the workers somehow
     • There is no terminate() method on std::thread
       • except destroying a joinable thread, which just calls std::terminate()
     • We need to stop them manually (sketch below)

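     A self-contained sketch of such a manual stop; Workers and its members are
     illustrative names:

     #include <condition_variable>
     #include <mutex>
     #include <thread>
     #include <vector>

     struct Workers {
         std::mutex lock;
         std::condition_variable waiter;
         bool stopped = false;
         std::vector<std::thread> threads;

         void start(unsigned count) {
             for (unsigned i = 0; i < count; ++i)
                 threads.emplace_back([this] { run(); });
         }
         void run() {
             std::unique_lock guard(lock);
             while (!stopped)        // infinite until we say otherwise
                 waiter.wait(guard); // (a real worker would also check its queue here)
         }                           // returning ends the thread cleanly
         ~Workers() {
             {
                 std::lock_guard guard(lock);
                 stopped = true; // no terminate() on std::thread, so flip a flag
             }
             waiter.notify_all(); // wake every sleeping worker so it can exit
             for (auto &t : threads)
                 t.join(); // destroying a joinable thread would call std::terminate()
         }
     };
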
 16. Benchmarks
     • Timed repost (100'000 jobs per thread, ~0.1 ms payload)
       • Close to "real life": tasks with some payload that start more tasks at the end of their life
     • Empty repost (1'000'000 jobs per thread)
       • Similar to timed repost, but with no payload; brutal on synchronization points
     • Timed avalanche (100'000 jobs, ~0.1 ms payload)
       • One thread adds tons of tasks with some payload; also close to "real life" because of the payload
     • Empty avalanche (100'000 jobs)
       • One thread adds tons of empty tasks; more a concurrency check than "real life"

 17. Benchmark results

     Library                Empty avalanche  Empty repost x1  Empty repost x2  Empty repost x4  Empty repost x8
     Asynqro Intensive                  209            4'574            4'923            8'749           16'285
     Asynqro ThreadBound                226              205              374            1'046            2'694
     Boost.Asio                         319            1'493            1'890            1'875            2'167
     Intel TBB                           26              309              526              716            1'062
     Intel TBB (spawn)                   --              110              138              148              262
     QtConcurrent                     1'339            8'234           26'872           48'353           59'112
     threadpoolcpp                        5               33               33               35               56

     Library                Timed avalanche  Timed repost x1  Timed repost x2  Timed repost x4  Timed repost x8
     Asynqro Intensive                   99              445              477              953              204
     Asynqro ThreadBound                 13               34               41               44              106
     Boost.Asio                           9              179              195              216               41
     Intel TBB                          185              168              123              106            1'494
     Intel TBB (spawn)                   --              159              101               66           10'190
     QtConcurrent                       102              327              346              393              272
     threadpoolcpp                        8               10               11               12               23

 18. Optimizations based on benchmark data
     • std::unordered_set for available workers means mallocs everywhere
       • std::bitset to the rescue!
     • Waiting on a condition variable is too time-consuming
       • Let's idle for a bit before sleeping (both tweaks sketched below)

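     A sketch of both tweaks; MAX_WORKERS, IDLE_SPINS and waitForWork() are
     invented names and numbers, not asynqro's actual values:

     #include <bitset>
     #include <condition_variable>
     #include <cstddef>
     #include <mutex>
     #include <thread>

     constexpr std::size_t MAX_WORKERS = 64;
     std::bitset<MAX_WORKERS> availableWorkers; // one bit per idle worker, zero mallocs

     constexpr int IDLE_SPINS = 1024; // tune against benchmark data

     // hasWork must be safe to call without the lock (e.g. it reads an atomic).
     template <typename HasWork>
     void waitForWork(std::mutex &m, std::condition_variable &cv, HasWork hasWork) {
         for (int i = 0; i < IDLE_SPINS; ++i) {
             if (hasWork())
                 return; // caught new work while idling: no condition variable round-trip
             std::this_thread::yield();
         }
         std::unique_lock lock(m); // only now pay the full price of sleeping
         cv.wait(lock, hasWork);
     }
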
 19. Benchmark results after the optimizations

     Library                      Empty avalanche  Empty repost x1  Empty repost x2  Empty repost x4  Empty repost x8
     Asynqro Intensive Old                    209            4'574            4'923            8'749           16'285
     Asynqro Intensive No Idle                199            3'902            4'310            8'734           10'074
     Asynqro Intensive                         46              600              778            2'285           12'763
     Asynqro ThreadBound                       27              201              403            1'133            2'616
     Boost.Asio                               319            1'493            1'890            1'875            2'167
     Intel TBB                                 26              309              526              716            1'062
     QtConcurrent                           1'339            8'234           26'872           48'353           59'112

     Library                      Timed avalanche  Timed repost x1  Timed repost x2  Timed repost x4  Timed repost x8
     Asynqro Intensive Old                     99              445              477              953              204
     Asynqro Intensive No Idle                 84              393              413              914              122
     Asynqro Intensive                         55              237              231              190              110
     Asynqro ThreadBound                        8               27               40               37               78
     Boost.Asio                                 9              179              195              216               41
     Intel TBB                                185              168              123              106            1'494
     QtConcurrent                             102              327              346              393              272

 20. What can we return from a task?
     • Nothing
       • No meaningful data in the Future, so Future<bool> (the asynqro implementation doesn't allow Future<void>)
     • A value of type T
       • Future<T>
     • Future<T>
       • Future<T>, and it should be a continuation of the Future from the task (i.e. flattened; sketch below)

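     That mapping can be expressed at the type level; a sketch with an invented
     UnwrapResult trait (asynqro's real metaprogramming is more involved):

     template <typename T, typename FailureT> class Future;

     // Maps a task's raw return type R to the value type of the returned Future.
     template <typename R>
     struct UnwrapResult { using type = R; };               // T -> Future<T>

     template <>
     struct UnwrapResult<void> { using type = bool; };      // void -> Future<bool>

     template <typename T, typename F>
     struct UnwrapResult<Future<T, F>> { using type = T; }; // Future<T> -> Future<T>,
                                                            // as a continuation of the
                                                            // future the task returned
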
 21. What about other failure types?
     • An extra type parameter to run() specifying the failure type
       • Pros: easy to implement; a straightforward solution
       • Cons: does not scale to more features
     • A wrapper structure
       • Cons: not a straightforward solution
       • Pros: gives us inversion of control; easy to add more customization points
     • Let's go with the wrapper (sketch below)

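     A sketch of what the wrapper could look like; RunnerInfo and this run()
     signature are illustrative, not necessarily asynqro's exact API:

     #include <type_traits>

     template <typename T, typename FailureT> class Future { /* elided */ };

     // The wrapper carries the failure type today and leaves room for more
     // customization points (e.g. failure casting, slide 23) tomorrow.
     template <typename FailureT>
     struct RunnerInfo {
         using PlainFailure = FailureT;
     };

     // run() is parameterized by the wrapper instead of a bare failure type,
     // so adding features to RunnerInfo never changes this signature.
     template <typename Runner, typename Task>
     Future<std::invoke_result_t<Task>, typename Runner::PlainFailure> run(Task &&task);

     // Usage sketch:
     // auto f = run<RunnerInfo<MyFailure>>([] { return 42; }); // Future<int, MyFailure>
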
 22. The failure type of the Future returned from a task is still not that generic, though
     • A.k.a. the "more customization points" mentioned previously

 23. Casting failures
     • We have a TaskRunner with my_awesome_app::Failure as its PlainFailure
     • We have some functions returning Future<T, std::exception_ptr> that we want to run asynchronously
     • Solutions:
       • An implicit constructor for the my_awesome_app::Failure class (sketch below)
       • Cast the failure in every task using mapFailure()
       • Add a failure-casting ability to RunnerInfo

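     A sketch of the first solution; the layout of Failure here is invented,
     the implicit constructor is the point:

     #include <exception>
     #include <string>

     namespace my_awesome_app {
     struct Failure {
         std::string message;

         Failure() = default;
         // Deliberately implicit: lets a Future<T, std::exception_ptr> failure
         // be converted to our failure type wherever the runner needs it.
         Failure(const std::exception_ptr &e) {
             if (!e)
                 return;
             try {
                 std::rethrow_exception(e);
             } catch (const std::exception &ex) {
                 message = ex.what();
             } catch (...) {
                 message = "unknown error";
             }
         }
     };
     } // namespace my_awesome_app
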