DIY: Homemade Thread Pool

A talk given at the CPPTO meetup (C++ Toronto) in February 2020 about building a custom thread pool step by step.

Denis Kormalev

February 25, 2020

Transcript

  1. Existing options
     • Boost.Asio: good performance, but requires Boost
     • Qt: big dependency, subpar performance
     • Folly: tons of features, but a huge dependency
     • Intel TBB: good performance, but the API is a bit clunky (oriented more toward computations than scheduling) and needs a lot of boilerplate in user code
     • threadpoolcpp (by inkooboo): awesome performance, but few tests and a complete lack of features
     • Tons of other smaller implementations

  2. What if we need futures as well?
     • std::future: part of the standard library, but almost no features
     • boost::future: has then() (no more sugar, though), but requires Boost
     • QFuture: almost unusable outside QtConcurrent
     • Folly: good API, lots of features, but requires Folly
     • Tons of other smaller implementations

  3. What if we need both a thread pool and futures?
     • Boost.Asio
     • Folly
     • A few other smaller implementations

  4. Futures
     • Pure C++17
     • Error handling in ADT style (biased toward the Either/expected way)
     • Continuations
       • T -> U, T -> Future<U>
       • Failure -> T, Failure -> Future<T>, Failure1 -> Failure2
       • …
     • Sugar
       • Future<T>, Future<U> -> Future<tuple<T, U>>
       • Container<Future<T>> -> Future<Container<T>>
       • …

  5. Task scheduling
     • Pure C++17
     • Concise API
     • Avoid extra user-space boilerplate
     • Use futures
     • Subpools
     • Priorities
     • Sugar
       • Single task, multiple data
       • …

  6. Talk is about
     • Thread pool and task scheduling core
     • Step-by-step enhancements
     • Benchmarks
     • Subtle optimizations leading to huge improvements
     • Extra helpers for easier usage

  7. Talk is NOT about
     • Futures implementation
     • "parallel_for" helpers
     • Trampolining helpers for overly deep Future continuation chains
     • noexcept and access specifiers
     • Extra bookkeeping, getters, and setters
     • Extra optimizations based on heuristics and assumptions
     • All of this can be checked out in the library source code, though
     • It is also not a talk about the "fastest ever" task scheduler, sorry folks

  8. First naïve try
     • A queue of tasks
     • A vector of workers
     • A maintenance thread that constantly tries to schedule

  9. No maintenance thread, please
     • Each task insertion runs schedule()
     • When a task finishes, the worker invokes schedule() as well (see the sketch below)

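     A minimal sketch of this core, assuming a single mutex-guarded queue. The
     names (TaskDispatcher, mainLock, schedule()) are illustrative, not
     asynqro's actual code; shutdown is deliberately left out here and picked
     up again on slide 15.

     #include <condition_variable>
     #include <functional>
     #include <mutex>
     #include <queue>
     #include <thread>
     #include <vector>

     class TaskDispatcher {
     public:
         explicit TaskDispatcher(unsigned count = std::thread::hardware_concurrency()) {
             for (unsigned i = 0; i < count; ++i)
                 workers.emplace_back([this] { run(); });
         }
         void addTask(std::function<void()> task) {
             {
                 std::lock_guard lock(mainLock);
                 tasks.push(std::move(task));
             }
             schedule(); // scheduling is driven by insertion, not by a maintenance thread
         }

     private:
         void schedule() { waiter.notify_one(); } // wake one sleeping worker
         void run() {
             for (;;) { // stopping this loop is the topic of slide 15
                 std::function<void()> task;
                 {
                     std::unique_lock lock(mainLock);
                     waiter.wait(lock, [this] { return !tasks.empty(); });
                     task = std::move(tasks.front());
                     tasks.pop();
                 }
                 task();
                 // the worker that finished a task immediately looks for the
                 // next one, which plays the role of the second schedule() call
             }
         }

         std::mutex mainLock;
         std::condition_variable waiter;
         std::queue<std::function<void()>> tasks;
         std::vector<std::thread> workers;
     };
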
 10. Subpools
     • Two main types of tasks: CPU-intensive and IO-related
     • We want to allow users to specify their own subpools with custom limits (sketch below)
     • We want to allow users to bind a subpool to one thread for resource-related operations

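     A hypothetical sketch of those custom limits; SubpoolRegistry, tryAcquire()
     and release() are invented names, not asynqro's API:

     #include <cstddef>
     #include <map>
     #include <string>

     struct Subpool {
         std::size_t capacity = 0; // max tasks from this subpool running at once
         std::size_t running = 0;
     };

     class SubpoolRegistry {
     public:
         void define(const std::string &name, std::size_t capacity) {
             subpools[name].capacity = capacity;
         }
         // Called by the dispatcher (under its lock) before handing a task to a worker.
         bool tryAcquire(const std::string &name) {
             Subpool &p = subpools[name];
             if (p.running >= p.capacity)
                 return false; // task stays queued until release() frees a slot
             ++p.running;
             return true;
         }
         // Called from taskFinished() so a queued task of this subpool can run.
         void release(const std::string &name) { --subpools[name].running; }

     private:
         std::map<std::string, Subpool> subpools;
     };
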
 11. No rest for the wicked
     • What may happen if a task in the queue is waiting for subpool capacity?
       • A worker invokes taskFinished()
       • The new task is sent to another worker
       • The first worker goes to sleep, the second one wakes up
     • Can we eliminate these extra sleeps? Yes!

 12. Thread-bound tasks
     • Is scheduling the same as for regular ones? Or…
     • After the first schedule for the subpool we always know the exact thread
     • No need to put them in the main queue
     • We can schedule them directly from insertTaskInfo()
     • Is it unfair?
       • Yes: they will possibly be scheduled before others
       • And no: thread-bound tasks can't be scheduled to another thread anyway
     • We need worker-specific queues for this (sketch below)

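     A sketch of the worker-specific queues, with invented names (Dispatcher,
     insertThreadBound(), poke()) and all locking elided for brevity:

     #include <array>
     #include <deque>
     #include <functional>
     #include <map>
     #include <utility>

     using Task = std::function<void()>;

     struct Worker {
         std::deque<Task> ownQueue; // worker-specific queue for thread-bound tasks
         void poke() {}             // wake this worker's thread (details elided)
     };

     class Dispatcher {
     public:
         // insertTaskInfo() analogue for thread-bound tasks: skip the main queue
         void insertThreadBound(int subpoolTag, Task task) {
             Worker *&w = bound[subpoolTag];
             if (!w)
                 w = &workers[0]; // the first schedule pins the subpool to a thread
             w->ownQueue.push_back(std::move(task)); // unfair: may run before older tasks,
             w->poke();                              // but it can't run anywhere else anyway
         }

     private:
         std::array<Worker, 4> workers; // fixed size just for the sketch
         std::map<int, Worker *> bound; // subpool tag -> pinned worker
     };
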
 13. Priorities
     • Not all tasks are created equal
     • A task priority is just another uint8_t
     • How can we schedule it, though?
     • We need to read our list from the left, but write to it at multiple points, not only by appending to the end (sketch below)

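     One hypothetical structure for those multiple write points: a single list
     read from the front, plus a map from each priority to the tail of its
     bucket, so an insert never scans the queue. TaskQueue and its members are
     invented names, not asynqro's exact structure:

     #include <cstdint>
     #include <functional>
     #include <iterator>
     #include <list>
     #include <map>
     #include <utility>

     class TaskQueue {
         struct Entry {
             std::function<void()> task;
             std::uint8_t priority; // smaller value = more urgent
         };
         std::list<Entry> queue; // read from the left (front)
         std::map<std::uint8_t, std::list<Entry>::iterator> tails; // write points

     public:
         void push(std::function<void()> task, std::uint8_t priority) {
             auto pos = queue.begin(); // no same-or-more-urgent bucket: insert at front
             auto next = tails.upper_bound(priority);
             if (next != tails.begin()) // such a bucket exists:
                 pos = std::next(std::prev(next)->second); // insert right after its tail
             tails[priority] = queue.insert(pos, Entry{std::move(task), priority});
         }
         std::function<void()> pop() { // precondition: !empty()
             Entry entry = std::move(queue.front());
             auto it = tails.find(entry.priority);
             if (it != tails.end() && it->second == queue.begin())
                 tails.erase(it); // that was the bucket's only element
             queue.pop_front();
             return std::move(entry.task);
         }
         bool empty() const { return queue.empty(); }
     };
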
 14. Are mutexes good enough?
     • A mutex is fine and easy to use, but too heavy
     • A simple spin lock based on std::atomic_flag to the rescue (sketch below)
     • Replacing lock in Worker
       • High concurrency: roughly the same
       • Low concurrency: 10-20% less overhead
     • Replacing mainLock in TaskDispatcher
       • High concurrency: 40-70% less overhead
       • Low concurrency: 10-20% less overhead

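     The classic std::atomic_flag spin lock looks roughly like this (a
     well-known pattern; asynqro's actual implementation may differ in details
     such as backoff):

     #include <atomic>

     class SpinLock {
         std::atomic_flag flag = ATOMIC_FLAG_INIT;

     public:
         void lock() noexcept {
             // busy-wait: the critical sections here are a few instructions
             // long, so spinning is cheaper than parking the thread in a mutex
             while (flag.test_and_set(std::memory_order_acquire))
                 ;
         }
         void unlock() noexcept { flag.clear(std::memory_order_release); }
     };

     // Drop-in replacement at call sites: std::lock_guard<SpinLock> guard(lock);
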
 15. Wait, what about destruction?
     • The run() loops are infinite
     • We need to stop them and destroy the workers somehow
     • There is no terminate() method on std::thread
       • except destroying a joinable thread, which just calls std::terminate()
     • We need to stop them manually (sketch below)

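     A self-contained sketch of such a manual stop; Workers and its members are
     illustrative names:

     #include <condition_variable>
     #include <mutex>
     #include <thread>
     #include <vector>

     struct Workers {
         std::mutex lock;
         std::condition_variable waiter;
         bool stopped = false;
         std::vector<std::thread> threads;

         void start(unsigned count) {
             for (unsigned i = 0; i < count; ++i)
                 threads.emplace_back([this] { run(); });
         }
         void run() {
             std::unique_lock guard(lock);
             while (!stopped)        // infinite until we say otherwise
                 waiter.wait(guard); // (a real worker would also check its queue here)
         }                           // returning ends the thread cleanly
         ~Workers() {
             {
                 std::lock_guard guard(lock);
                 stopped = true; // no terminate() on std::thread, so flip a flag
             }
             waiter.notify_all(); // wake every sleeping worker so it can exit
             for (auto &t : threads)
                 t.join(); // destroying a joinable thread would call std::terminate()
         }
     };
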
 16. Benchmarks
     • Timed repost (100'000 jobs per thread, ~0.1 ms payload)
       • Close to "real life": tasks with some payload that start more tasks at the end of their life
     • Empty repost (1'000'000 jobs per thread)
       • Similar to timed repost, but with no payload; brutal on synchronization points
     • Timed avalanche (100'000 jobs, ~0.1 ms payload)
       • One thread adds tons of tasks with some payload; also close to "real life" because of the payload
     • Empty avalanche (100'000 jobs)
       • One thread adds tons of empty tasks; more a concurrency check than "real life"

 17. Benchmark results

     Library                Empty avalanche  Empty repost x1  Empty repost x2  Empty repost x4  Empty repost x8
     Asynqro Intensive                  209            4'574            4'923            8'749           16'285
     Asynqro ThreadBound                226              205              374            1'046            2'694
     Boost.Asio                         319            1'493            1'890            1'875            2'167
     Intel TBB                           26              309              526              716            1'062
     Intel TBB (spawn)                   --              110              138              148              262
     QtConcurrent                     1'339            8'234           26'872           48'353           59'112
     threadpoolcpp                        5               33               33               35               56

     Library                Timed avalanche  Timed repost x1  Timed repost x2  Timed repost x4  Timed repost x8
     Asynqro Intensive                   99              445              477              953              204
     Asynqro ThreadBound                 13               34               41               44              106
     Boost.Asio                           9              179              195              216               41
     Intel TBB                          185              168              123              106            1'494
     Intel TBB (spawn)                   --              159              101               66           10'190
     QtConcurrent                       102              327              346              393              272
     threadpoolcpp                        8               10               11               12               23

 18. Optimizations based on benchmark data
     • std::unordered_set for available workers means mallocs everywhere
       • std::bitset to the rescue!
     • Waiting on a condition variable is too time-consuming
       • Let's idle for a bit before sleeping (both tweaks sketched below)

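     A sketch of both tweaks; MAX_WORKERS, IDLE_SPINS and waitForWork() are
     invented names and numbers, not asynqro's actual values:

     #include <bitset>
     #include <condition_variable>
     #include <cstddef>
     #include <mutex>
     #include <thread>

     constexpr std::size_t MAX_WORKERS = 64;
     std::bitset<MAX_WORKERS> availableWorkers; // one bit per idle worker, zero mallocs

     constexpr int IDLE_SPINS = 1024; // tune against benchmark data

     // hasWork must be safe to call without the lock (e.g. it reads an atomic).
     template <typename HasWork>
     void waitForWork(std::mutex &m, std::condition_variable &cv, HasWork hasWork) {
         for (int i = 0; i < IDLE_SPINS; ++i) {
             if (hasWork())
                 return; // caught new work while idling: no condition variable round-trip
             std::this_thread::yield();
         }
         std::unique_lock lock(m); // only now pay the full price of sleeping
         cv.wait(lock, hasWork);
     }
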
 19. Benchmark results after the optimizations

     Library                      Empty avalanche  Empty repost x1  Empty repost x2  Empty repost x4  Empty repost x8
     Asynqro Intensive Old                    209            4'574            4'923            8'749           16'285
     Asynqro Intensive No Idle                199            3'902            4'310            8'734           10'074
     Asynqro Intensive                         46              600              778            2'285           12'763
     Asynqro ThreadBound                       27              201              403            1'133            2'616
     Boost.Asio                               319            1'493            1'890            1'875            2'167
     Intel TBB                                 26              309              526              716            1'062
     QtConcurrent                           1'339            8'234           26'872           48'353           59'112

     Library                      Timed avalanche  Timed repost x1  Timed repost x2  Timed repost x4  Timed repost x8
     Asynqro Intensive Old                     99              445              477              953              204
     Asynqro Intensive No Idle                 84              393              413              914              122
     Asynqro Intensive                         55              237              231              190              110
     Asynqro ThreadBound                        8               27               40               37               78
     Boost.Asio                                 9              179              195              216               41
     Intel TBB                                185              168              123              106            1'494
     QtConcurrent                             102              327              346              393              272

 20. What can we return from a task?
     • Nothing
       • No meaningful data in the Future, so Future<bool> (the asynqro implementation doesn't allow Future<void>)
     • A value of type T
       • Future<T>
     • Future<T>
       • Future<T>, and it should be a continuation of the Future from the task (i.e. flattened; sketch below)

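     That mapping can be expressed at the type level; a sketch with an invented
     UnwrapResult trait (asynqro's real metaprogramming is more involved):

     template <typename T, typename FailureT> class Future;

     // Maps a task's raw return type R to the value type of the returned Future.
     template <typename R>
     struct UnwrapResult { using type = R; };               // T -> Future<T>

     template <>
     struct UnwrapResult<void> { using type = bool; };      // void -> Future<bool>

     template <typename T, typename F>
     struct UnwrapResult<Future<T, F>> { using type = T; }; // Future<T> -> Future<T>,
                                                            // as a continuation of the
                                                            // future the task returned
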
 21. What about other failure types?
     • An extra type parameter to run() specifying the failure type
       • Pros: easy to implement; a straightforward solution
       • Cons: does not scale to more features
     • A wrapper structure
       • Cons: not a straightforward solution
       • Pros: gives us inversion of control; easy to add more customization points
     • Let's go with the wrapper (sketch below)

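     A sketch of what the wrapper could look like; RunnerInfo and this run()
     signature are illustrative, not necessarily asynqro's exact API:

     #include <type_traits>

     template <typename T, typename FailureT> class Future { /* elided */ };

     // The wrapper carries the failure type today and leaves room for more
     // customization points (e.g. failure casting, slide 23) tomorrow.
     template <typename FailureT>
     struct RunnerInfo {
         using PlainFailure = FailureT;
     };

     // run() is parameterized by the wrapper instead of a bare failure type,
     // so adding features to RunnerInfo never changes this signature.
     template <typename Runner, typename Task>
     Future<std::invoke_result_t<Task>, typename Runner::PlainFailure> run(Task &&task);

     // Usage sketch:
     // auto f = run<RunnerInfo<MyFailure>>([] { return 42; }); // Future<int, MyFailure>
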
 22. The failure type of the Future returned from a task is still not that generic, though
     • A.k.a. the "more customization points" mentioned previously

 23. Casting failures
     • We have a TaskRunner with my_awesome_app::Failure as its PlainFailure
     • We have some functions returning Future<T, std::exception_ptr> that we want to run asynchronously
     • Solutions:
       • An implicit constructor for the my_awesome_app::Failure class (sketch below)
       • Cast the failure in every task using mapFailure()
       • Add a failure-casting ability to RunnerInfo

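     A sketch of the first solution; the layout of Failure here is invented,
     the implicit constructor is the point:

     #include <exception>
     #include <string>

     namespace my_awesome_app {
     struct Failure {
         std::string message;

         Failure() = default;
         // Deliberately implicit: lets a Future<T, std::exception_ptr> failure
         // be converted to our failure type wherever the runner needs it.
         Failure(const std::exception_ptr &e) {
             if (!e)
                 return;
             try {
                 std::rethrow_exception(e);
             } catch (const std::exception &ex) {
                 message = ex.what();
             } catch (...) {
                 message = "unknown error";
             }
         }
     };
     } // namespace my_awesome_app
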