Coding for Multiple Cores

Hanxue Lee
November 28, 2014

by Bruce Dawson


Transcript

  1. Why multi-threading/multi-core?
     • Clock rates are stagnant
     • Future CPUs will be predominantly multi-thread/multi-core
       • Xbox 360 has 3 cores
       • PS3 will be multi-core
       • >70% of PC sales will be multi-core by end of 2006
       • Most Windows Vista systems will be multi-core
     • Two performance possibilities:
       • Single-threaded? Minimal performance growth
       • Multi-threaded? Exponential performance growth
  2. Design for Multithreading
     • Good design is critical
     • Bad multithreading can be worse than no multithreading
       • Deadlocks, synchronization bugs, poor performance, etc.
  3. Good Multithreading
     [Diagram labels: Main Thread, Game Thread, Rendering Thread; Physics, Animation/Skinning, Particle Systems, Networking, File I/O]
  4. Another Paradigm: Cascades
     [Diagram labels: Thread 1 to Thread 5; stages Input, Physics, AI, Rendering, Present; Frame 1 to Frame 4]
     • Advantages:
       • Synchronization points are few and well-defined
     • Disadvantages:
       • Increases latency (for constant frame rate)
       • Needs simple (one-way) data flow
  5. File Decompression
     • Most common CPU heavy thread on the Xbox 360
     • Easy to multithread
     • Allows use of aggressive compression to improve load times
     • Don’t throw a thread at a problem better solved by offline processing
       • Texture compression, file packing, etc.
  6. Rendering
     • Separate update and render threads
     • Rendering on multiple threads (D3DCREATE_MULTITHREADED) works poorly
       • Exception: Xbox 360 command buffers
     • Special case of cascades paradigm
       • Pass render state from update to render
       • With constant workload gives same latency, better frame rate
       • With increased workload gives same frame rate, worse latency
  7. Graphics Fluff
     • Extra graphics that don't affect play
       • Procedurally generated animating cloud textures
       • Cloth simulations
       • Dynamic ambient occlusion
       • Procedurally generated vegetation, etc.
       • Extra particles, better particle physics, etc.
     • Easy to synchronize
     • Potentially expensive, but if the core is otherwise idle...?
  8. Physics?
     • Could cascade from update to physics to rendering
       • Makes use of three threads
       • May be too much latency
     • Could run physics on many threads
       • Uses many threads while doing physics
       • May leave threads mostly idle elsewhere
  9. How Many Threads?
     • No more than one CPU intensive software thread per core
       • 3-6 on Xbox 360
       • 1-? on PC (1-4 for now, need to query)
     • Too many busy threads add complexity and lower performance
       • Context switches are not free
     • Can have many non-CPU intensive threads
       • I/O threads that block, or intermittent tasks
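The "need to query" point on this slide can be sketched in portable C++11. This is not from the deck: `HeavyWorkerCount` and `QueryHeavyWorkerCount` are hypothetical helper names, and `std::thread::hardware_concurrency()` stands in for a platform query. It reports logical processors, so SMT threads are counted as if they were cores.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: decide how many CPU-heavy worker threads to create.
// A reported count of 0 means "unknown", so fall back to a single worker.
unsigned HeavyWorkerCount(unsigned reportedLogicalProcessors, unsigned maxWorkers)
{
    const unsigned n = (reportedLogicalProcessors == 0) ? 1u : reportedLogicalProcessors;
    return std::min(n, maxWorkers);
}

// Queries the machine. hardware_concurrency() counts logical processors
// (SMT threads included) and may return 0 if it cannot tell.
unsigned QueryHeavyWorkerCount(unsigned maxWorkers)
{
    return HeavyWorkerCount(std::thread::hardware_concurrency(), maxWorkers);
}
```

Capping the result (for example `QueryHeavyWorkerCount(4)`) keeps the number of busy threads bounded on machines with many logical processors.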
  10. Simultaneous Multi-Threading
      • Be careful with Simultaneous Multi-Threading (SMT) threads
        • Not the same as double the number of cores
        • Can give a small perf boost
        • Can cause a perf drop
        • Can avoid scheduler latency
      • Ideally one heavy thread per core plus some additional intermittent threads
  11. Case Study: Kameo (Xbox 360)
      • Started single threaded
      • Rendering was taking half of the time, so it was put on a separate thread
      • Two render-description buffers created to communicate from update to render
        • Linear read/write access for best cache usage
        • Doesn't copy const data
      • File I/O and decompress on other threads
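The two-buffer hand-off from update to render might look roughly like the sketch below. It is an illustration in portable C++11, not Kameo's code: `RenderDescription`, `FrameHandoff` and `RunFrames` are invented names, and a mutex plus condition variable stand in for whatever synchronization the game actually used. The update thread fills one buffer while the render thread reads the other, and the two threads meet only once per frame.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct RenderDescription { int frame = -1; };  // hypothetical per-frame render data

class FrameHandoff {
public:
    // Update thread only: the buffer currently being filled.
    RenderDescription& WriteBuffer() { return buffers_[write_]; }

    // Update thread: wait until the renderer has released the previous
    // frame, then publish this one and switch to the other buffer.
    void Publish() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return pending_ == -1; });
        pending_ = write_;
        write_ = 1 - write_;
        changed_.notify_all();
    }

    // Render thread: block until a frame has been published.
    const RenderDescription& Acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return pending_ != -1; });
        return buffers_[pending_];
    }

    // Render thread: finished reading; the buffer may be reused.
    void Release() {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_ = -1;
        changed_.notify_all();
    }

private:
    std::array<RenderDescription, 2> buffers_;
    std::mutex mutex_;
    std::condition_variable changed_;
    int write_ = 0;     // index the update thread writes to
    int pending_ = -1;  // index published to the renderer, or -1 if none
};

// Runs frameCount frames and returns the frame numbers the renderer saw.
std::vector<int> RunFrames(int frameCount) {
    FrameHandoff handoff;
    std::vector<int> rendered;
    std::thread render([&] {
        for (int i = 0; i < frameCount; ++i) {
            rendered.push_back(handoff.Acquire().frame);
            handoff.Release();
        }
    });
    for (int f = 0; f < frameCount; ++f) {
        handoff.WriteBuffer().frame = f;  // "game update" fills its buffer
        handoff.Publish();
    }
    render.join();
    return rendered;
}
```

With this arrangement the update thread can run at most one frame ahead of the renderer, which is the latency cost the cascade slides describe.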
  12. Case Study: Kameo (Xbox 360)
      Core  Thread  Software threads
      0     0       Game update
      0     1       File I/O
      1     0       Rendering
      1     1       (none listed)
      2     0       XAudio
      2     1       File decompression
      • Total usage was ~2.2-2.5 cores
  13. Case Study: Project Gotham Racing
      Core  Thread  Software threads
      0     0       Update, physics, rendering, UI
      0     1       Audio update, networking
      1     0       Crowd update, texture decompression
      1     1       Texture decompression
      2     0       XAudio
      2     1       (none listed)
      • Total usage was ~2.0-3.0 cores
  14. Managing Your Threads
      • Creating threads
      • Synchronizing
      • Terminating
        • Don't use TerminateThread()
          • Bad idea on Windows: leaves the process in an indeterminate state, doesn't allow clean-up, etc.
          • Unavailable on Xbox 360
        • Instead return from your thread function, or call ExitThread
  15. Creating Threads Poorly
        const int stackSize = 0;
        HANDLE hThread = CreateThread(0, stackSize, ThreadFunctionBad, 0, 0, 0);
        // Do work on main thread here.
        for (;;)
        {
            // Wait for child thread to complete
            DWORD exitCode;
            GetExitCodeThread(hThread, &exitCode);
            if (exitCode != STILL_ACTIVE)
                break;
        }
        ...
        DWORD __stdcall ThreadFunctionBad(void* data)
        {
        #ifdef WIN32
            SetThreadAffinityMask(GetCurrentThread(), 8);
        #endif
            // Do child thread work here.
            return 0;
        }
      Callouts:
      • CreateThread doesn't initialize C runtime
      • Stack size of zero means inherit parent's stack size
      • Busy waiting is bad!
      • Don't forget to close this when done with it
      • Be careful with thread affinities on Windows
  16. Creating Threads Well
        const int stackSize = 65536;
        HANDLE hThread = (HANDLE)_beginthreadex(0, stackSize, ThreadFunction, 0, 0, 0);
        // Do work on main thread here.
        // Wait for child thread to complete
        WaitForSingleObject(hThread, INFINITE);
        CloseHandle(hThread);
        ...
        unsigned __stdcall ThreadFunction(void* data)
        {
        #ifdef XBOX
            // On Xbox 360 you must explicitly assign
            // software threads to hardware threads.
            XSetThreadProcessor(GetCurrentThread(), 2);
        #endif
            // Do child thread work here.
            return 0;
        }
      Callouts:
      • _beginthreadex initializes CRT
      • Specify stack size on Xbox 360
      • The correct way to wait for a thread to exit
      • Don't forget to close this when done with it
      • Thread affinities must be specified on Xbox 360
  17. Alternative: OpenMP
      • Available in VC++ 2005
      • Simple way to parallelize loops and some other constructs
      • Works best on long symmetric tasks (particles?)
      • Game tasks are short: 16.6 ms
      • Many game tasks are not symmetric
      • OpenMP is nice, but not ideal
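As a rough illustration of the "long symmetric task" case, a particle integration loop could be parallelized as below. The `Particle` struct and `IntegrateParticles` function are made up for this sketch. The pragma only takes effect when OpenMP is enabled in the compiler (for example `/openmp` in Visual C++ or `-fopenmp` in GCC); without it the loop simply runs serially and gives the same result.

```cpp
#include <cassert>
#include <vector>

struct Particle { float position; float velocity; };  // hypothetical particle type

// Every iteration touches only its own particle, so iterations are
// independent and can be split across threads safely.
void IntegrateParticles(std::vector<Particle>& particles, float dt)
{
    const int count = static_cast<int>(particles.size());
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        particles[i].position += particles[i].velocity * dt;
    }
}
```

The loop index is a signed `int` because older OpenMP versions, including the one shipped with VC++ 2005, require a signed loop variable.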
  18. Available Synchronization Objects
      • Events
      • Semaphores
      • Mutexes
      • Critical Sections
      • Don't use SuspendThread()
        • Some titles have used this for synchronization
        • Can easily lead to deadlocks
        • Interacts badly with Visual Studio debugger
  19. Exclusive Access: Mutex
        // Initialize
        HANDLE mutex = CreateMutex(0, FALSE, 0);

        // Use
        void ManipulateSharedData()
        {
            WaitForSingleObject(mutex, INFINITE);
            // Manipulate stuff...
            ReleaseMutex(mutex);
        }

        // Destroy
        CloseHandle(mutex);
  20. Exclusive Access: CRITICAL_SECTION
        // Initialize
        CRITICAL_SECTION cs;
        InitializeCriticalSection(&cs);

        // Use
        void ManipulateSharedData()
        {
            EnterCriticalSection(&cs);
            // Manipulate stuff...
            LeaveCriticalSection(&cs);
        }

        // Destroy
        DeleteCriticalSection(&cs);
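The same enter/leave pattern can be written in portable C++11 with a scoped lock, which releases the lock even on an early return or exception. This is a sketch rather than the deck's code: `std::mutex` stands in for `CRITICAL_SECTION`, and `SharedCounter` and `CountWithThreads` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

class SharedCounter {
public:
    void Add(int amount) {
        std::lock_guard<std::mutex> guard(mutex_);  // "enter"
        value_ += amount;                           // manipulate shared data
    }                                               // "leave" when guard is destroyed

    int Get() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

// Starts threadCount threads that each add 1 to the counter addsPerThread
// times, waits for all of them, and returns the final total.
int CountWithThreads(int threadCount, int addsPerThread)
{
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&counter, addsPerThread] {
            for (int i = 0; i < addsPerThread; ++i) counter.Add(1);
        });
    }
    for (std::thread& th : threads) th.join();  // blocking wait, no busy loop
    return counter.Get();
}
```

Locking on every single increment is deliberately naive here; the later advice to synchronize rarely and hold locks briefly still applies.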
  21. Lockless programming
      • Trendy technique to use clever programming to share resources without locking
      • Includes InterlockedXXX(), lockless message passing, Double Checked Locking, etc.
      • Very hard to get right:
        • Compiler can reorder instructions
        • CPU can reorder instructions
        • CPU can reorder reads and writes
      • Not as fast as avoiding synchronization entirely
  22. Lockless Messages: Buggy
        void SendMessage(void* input)
        {
            // Wait for the message to be 'empty'.
            while (g_msg.filled)
                ;
            memcpy(g_msg.data, input, MESSAGESIZE);
            g_msg.filled = true;
        }

        void GetMessage()
        {
            // Wait for the message to be 'filled'.
            while (!g_msg.filled)
                ;
            memcpy(localMsg.data, g_msg.data, MESSAGESIZE);
            g_msg.filled = false;
        }
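The transcript does not say what the bug is. Going by the previous slide, the likely problems are that nothing stops the compiler or CPU from reordering the `memcpy` relative to the `filled` flag, or from caching the flag in a register inside the spin loop. One possible repair, sketched here in C++11 (which postdates the talk), is to make the flag a `std::atomic<bool>` with acquire/release ordering. The function names are changed to `SendMessageFixed` and `GetMessageFixed` for this sketch, `RoundTrip` is a hypothetical test driver, and the scheme is only valid for one sender and one receiver.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>

const std::size_t MESSAGESIZE = 64;  // assumed size, not given in the deck

struct Message {
    char data[MESSAGESIZE];
    std::atomic<bool> filled{false};
};

Message g_msg;

void SendMessageFixed(const void* input)
{
    // Acquire: the receiver's reads of data are complete before we overwrite it.
    while (g_msg.filled.load(std::memory_order_acquire))
        std::this_thread::yield();
    std::memcpy(g_msg.data, input, MESSAGESIZE);
    // Release: the copy above is visible before the flag reads as true.
    g_msg.filled.store(true, std::memory_order_release);
}

void GetMessageFixed(void* output)
{
    while (!g_msg.filled.load(std::memory_order_acquire))
        std::this_thread::yield();
    std::memcpy(output, g_msg.data, MESSAGESIZE);
    g_msg.filled.store(false, std::memory_order_release);
}

// Sends `count` messages (count <= 127) from this thread to a receiver
// thread and reports whether every message arrived intact and in order.
bool RoundTrip(int count)
{
    bool ok = true;
    std::thread receiver([&] {
        char local[MESSAGESIZE];
        for (int i = 0; i < count; ++i) {
            GetMessageFixed(local);
            if (local[0] != static_cast<char>(i) ||
                local[MESSAGESIZE - 1] != static_cast<char>(i))
                ok = false;
        }
    });
    char out[MESSAGESIZE];
    for (int i = 0; i < count; ++i) {
        std::memset(out, i, MESSAGESIZE);
        SendMessageFixed(out);
    }
    receiver.join();
    return ok;
}
```

Both sides still spin while waiting, so this fixes the ordering problem but not the cost of busy waiting.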
  23. Synchronization tips/costs:
      • Synchronization is moderately expensive when there is no contention
        • Hundreds to thousands of cycles
      • Synchronization can be arbitrarily expensive when there is contention!
      • Goals:
        • Synchronize rarely
        • Hold locks briefly
        • Minimize shared data
  24. Beware hidden synchronization:
      • Allocations are (generally) a synch point
        • Consider per-thread heaps with no locking
          • HEAP_NO_SERIALIZE flag avoids lock on Win32 heaps
        • Consider custom single-purpose allocators
        • Consider avoiding memory allocations!
      • Avoid synch in in-house profilers
      • D3DCREATE_MULTITHREADED causes synchronization on almost every Direct3D call
  25. Threading File I/O & Decompression
      • First: use large reads and asynchronous I/O
      • Then: consider compression to accelerate loading
      • Don't do format conversions etc. that are better done at build time!
      • Have resource proxies to allow rendering to continue
  26. File I/O Implementation Details
      • vector<Resource*> g_resources;
      • Worst design: decompressor locks g_resources while decompressing
      • Better design: decompressor adds resources to vector after decompressing
        • Still requires renderer to synch on every resource access
      • Best design: two Resource* vectors
        • Renderer has private vector, no locking required
        • Decompressor uses shared vector, syncs when adding new Resource*
        • Renderer moves Resource* from shared to private vector once per frame
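The "best design" can be sketched as follows, assuming a single render thread. `ResourceRegistry` and its member names are invented for illustration, and `std::mutex` stands in for whichever lock the title uses. The renderer takes the lock once per frame, and only for the duration of a vector swap.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct Resource { int id; };  // hypothetical resource type

class ResourceRegistry {
public:
    // Decompressor thread: call only after decompression has finished,
    // so the lock is held just long enough to append one pointer.
    void AddLoaded(Resource* resource) {
        std::lock_guard<std::mutex> lock(sharedMutex_);
        shared_.push_back(resource);
    }

    // Render thread, once per frame: move everything from the shared
    // vector into the private one.
    void CollectNewResources() {
        std::vector<Resource*> incoming;
        {
            std::lock_guard<std::mutex> lock(sharedMutex_);
            incoming.swap(shared_);  // lock covers only the swap
        }
        renderOnly_.insert(renderOnly_.end(), incoming.begin(), incoming.end());
    }

    // Render thread only: no locking on resource access.
    const std::vector<Resource*>& Visible() const { return renderOnly_; }

private:
    std::mutex sharedMutex_;
    std::vector<Resource*> shared_;      // touched by both threads, under the lock
    std::vector<Resource*> renderOnly_;  // touched by the render thread only
};
```

A resource added mid-frame becomes visible to the renderer on the next call to `CollectNewResources()`, which is where resource proxies would cover the gap.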
  27. Profiling multi-threaded apps
      • Need thread-aware profilers
      • Profiling may hide many synchronization stalls
      • Home-grown spin locks make profiling harder
      • Consider instrumenting calls to synchronization functions
        • Don't use locks in instrumentation: use TLS variables to store results
      • Windows: Intel VTune, AMD CodeAnalyst, and the Visual Studio Team System Profiler
      • Xbox 360: PIX, XbPerfView, etc.
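A minimal sketch of lock-free instrumentation with thread-local storage is shown below. It uses the C++11 `thread_local` keyword as a portable stand-in for `__declspec(thread)`; `LockStats`, `RecordLockAcquire` and `MeasureThreads` are hypothetical names. Each thread counts into its own variable and reports the total into its own slot, so the instrumentation itself needs no lock.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct LockStats { unsigned long acquisitions = 0; };

// One independent copy per thread: incrementing it needs no synchronization.
thread_local LockStats t_lockStats;

// Would be called from a wrapper around the real lock-acquire function.
void RecordLockAcquire() { ++t_lockStats.acquisitions; }

// Starts one thread per entry; thread t records acquiresPerThread[t]
// acquisitions and then writes its thread-local total into results[t].
std::vector<unsigned long> MeasureThreads(const std::vector<int>& acquiresPerThread)
{
    std::vector<unsigned long> results(acquiresPerThread.size(), 0);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < acquiresPerThread.size(); ++t) {
        threads.emplace_back([&, t] {
            for (int i = 0; i < acquiresPerThread[t]; ++i) RecordLockAcquire();
            results[t] = t_lockStats.acquisitions;  // each thread owns slot t
        });
    }
    for (std::thread& th : threads) th.join();
    return results;
}
```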
  28. Naming Threads
        typedef struct tagTHREADNAME_INFO
        {
            DWORD dwType;     // must be 0x1000
            LPCSTR szName;    // pointer to name (in user addr space)
            DWORD dwThreadID; // thread ID (-1=caller thread)
            DWORD dwFlags;    // reserved for future use, must be zero
        } THREADNAME_INFO;

        void SetThreadName(DWORD dwThreadID, LPCSTR szThreadName)
        {
            THREADNAME_INFO info;
            info.dwType = 0x1000;
            info.szName = szThreadName;
            info.dwThreadID = dwThreadID;
            info.dwFlags = 0;
            __try
            {
                RaiseException(0x406D1388, 0, sizeof(info)/sizeof(DWORD), (DWORD*)&info);
            }
            __except(EXCEPTION_CONTINUE_EXECUTION)
            {
            }
        }

        SetThreadName(-1, "Main thread");
  29. Other Ideas
      • Debugging tips for MT
        • Visual Studio does support multi-threaded debugging
          • Use threads window
          • Use @hwthread in watch window on Xbox 360
        • KD and WinDBG support multi-threaded debugging
      • Thread Local Storage (TLS)
        • __declspec(thread) declares per-thread variables
          • But doesn't work in dynamically loaded DLLs
        • TlsAlloc is less efficient, less convenient, but works in dynamically loaded DLLs
  30. Windows tips
      • Avoid using D3DCREATE_MULTITHREADED
        • It’s easy, it works, it’s really really slow
      • Best to do all calls to Direct3D from a single thread
        • Could pass off locked resource pointers to a queue for a loading thread to work with
      • Test on multiple machines and configurations
        • Single-core, SMT (i.e. Hyper-Threading), Dual-core, Intel and AMD chips, Multi-socket multicore (4+ cores)
  31. Windows API features
      • WaitForMultipleObjects
        • Obviously better than a series of WaitForSingleObject calls
      • The OS is highly optimized around multithreading and event-based blocking
      • I/O Completion Ports
        • Very efficient way to have the OS assign a pool of worker threads to incoming I/O requests
        • Useful construct for implementing a game server
  32. SMT versus Multicore
      • OS returns number of logical processors in GetSystemInfo(), so a 2 could mean an SMT machine with only 1 actual core, or 2 cores
      • Detailed Win32 APIs exposing this distinction not available until Windows XP x64, Windows Server 2003 SP1, Windows Vista, etc.
        • GetLogicalProcessorInformation()
      • For now you have to use CPUID, as detailed by Intel and AMD, to parse this out…
  33. Timing with Multiple Cores
      • RDTSC is not always synced between cores!
        • As your thread moves from core to core, results of RDTSC counter deltas may be nonsense
      • CPU frequency itself can change at run-time through speed step technologies
        • See Power Management APIs for more information
      • Best thing to do is use Win32 API QueryPerformanceCounter / QueryPerformanceFrequency
      • See DirectX SDK article Game Timing and Multiple Cores
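For code that has to build outside Win32, a monotonic clock from the C++11 standard library can play the role the slide gives to QueryPerformanceCounter. This is a substitution made for this sketch, not something the deck recommends. `FrameTimer` is a hypothetical helper, and the only assumption is that `std::chrono::steady_clock` never runs backwards, which the standard guarantees.

```cpp
#include <cassert>
#include <chrono>

// Measures the time between successive Tick() calls using a monotonic
// clock instead of raw RDTSC readings.
class FrameTimer {
public:
    FrameTimer() : last_(std::chrono::steady_clock::now()) {}

    // Returns seconds elapsed since the previous Tick() (or construction).
    double Tick() {
        const std::chrono::steady_clock::time_point now = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = now - last_;
        last_ = now;
        return elapsed.count();
    }

private:
    std::chrono::steady_clock::time_point last_;
};
```

Calling `Tick()` once per frame gives a delta that stays non-negative even if the thread migrates between cores.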
  34. Thread Micromanagement
      • Use SetThreadAffinityMask with caution!
        • May be useful for assigning ‘heavy’ work threads
        • This mask is technically a hint, not a commitment
      • RDTSC-based instrumenting will require locking the game threads to a single core
      • Otherwise let the Windows scheduler do the right thing
      • CreateDevice/Reset might have a side-effect on the calling thread’s affinity with software vertex processing enabled
  35. Thread Micromanagement (cont)
      • Be careful about boosting thread priority
        • If the priority is too high, you could cause the system to hang and become unresponsive
        • If the priority is too low, the thread may starve
  36. DLLs and Multithreading
      • DllMain for every DLL is informed of thread creation/destruction
        • For some DLLs this is required to initialize TLS
        • For many this is a waste of time, so call DisableThreadLibraryCalls() from your DllMain during process creation (DLL_PROCESS_ATTACH)
      • The OS serializes access to the entry point
        • This means threads created during DllMain won’t start for a while, so don’t wait on them in the DLL startup
  37. Resources
      • Multithreading Applications in Win32, Jim Beveridge & Robert Weiner, Addison-Wesley, 1997
      • Multiprocessor Considerations for Kernel-Mode Drivers
        • http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc
      • Determining Logical Processors per Physical Processor
        • http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/43842.htm
      • GetLogicalProcessorInformation
        • http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getlogicalprocessorinformation.asp
      • Double checked locking
        • http://en.wikipedia.org/wiki/Double-checked_locking
  38. Resources
      • GDC 2006 Presentations
        • http://msdn.com/directx/presentations
      • DirectX Developer Center
        • http://msdn.com/directx
      • XNA Developer Center
        • http://msdn.com/xna
      • Xbox Developer Center (Registered Devs Only)
        • https://xds.xbox.com
      • XNA, DirectX, XACT Forums
        • http://msdn.com/directx/forums
      • Email addresses
        • [email protected] (DirectX Feedback)
        • [email protected] (Xbox Developers Only)
        • [email protected] (XNA Feedback)
  39. © 2006 Microsoft Corporation. All rights reserved. Microsoft, DirectX, Xbox 360, the Xbox logo, and XNA are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.