Copyright (C) 2010, David Beazley, http://www.dabeaz.com

In Search of the Perfect Global Interpreter Lock

David Beazley
http://www.dabeaz.com
@dabeaz

Presented at RuPy 2011
Poznan, Poland
October 15, 2011

Introduction
• As many programmers know, Python and Ruby feature a Global Interpreter Lock (GIL)
• More precisely: CPython and MRI
• It limits thread performance on multicore
• Theoretically restricts code to a single CPU

An Experiment
• Consider a trivial CPU-bound function

    def countdown(n):
        while n > 0:
            n -= 1

• Run it once with a lot of work

    COUNT = 100000000    # 100 million
    countdown(COUNT)

• Now, divide the work across two threads

    from threading import Thread

    t1 = Thread(target=countdown, args=(COUNT//2,))
    t2 = Thread(target=countdown, args=(COUNT//2,))
    t1.start(); t2.start()
    t1.join(); t2.join()

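The experiment above can be timed with a small harness. This is a minimal sketch, not the script used for the talk: the helper names `time_sequential` and `time_threaded` are made up for illustration, and the count is reduced from 100 million so the run stays short. Actual timings depend on the interpreter version, OS, and number of cores.

```python
import time
from threading import Thread


def countdown(n):
    # Trivial CPU-bound loop from the slide
    while n > 0:
        n -= 1


def time_sequential(count):
    """Run the whole countdown in the calling thread; return elapsed seconds."""
    start = time.perf_counter()
    countdown(count)
    return time.perf_counter() - start


def time_threaded(count, nthreads=2):
    """Split the same total work across nthreads; return elapsed seconds."""
    threads = [Thread(target=countdown, args=(count // nthreads,))
               for _ in range(nthreads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    COUNT = 5_000_000  # assumption: far smaller than the talk's 100 million
    print("sequential: %.2fs" % time_sequential(COUNT))
    print("threaded  : %.2fs" % time_threaded(COUNT))
```

Both functions do the same number of loop iterations, so any difference between the two numbers is overhead from thread switching rather than extra work.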
An Experiment
• Some Ruby

    def countdown(n)
      while n > 0
        n -= 1
      end
    end

• Sequential

    COUNT = 100000000    # 100 million
    countdown(COUNT)

• Subdivided across threads

    t1 = Thread.new { countdown(COUNT/2) }
    t2 = Thread.new { countdown(COUNT/2) }
    t1.join
    t2.join

Expectations
• Sequential and threaded versions perform the same amount of work (same # calculations)
• There is the GIL... so no parallelism
• Performance should be about the same

Results
• Python 2.7

    Sequential           : 6.9s
    Threaded (2 threads) : 63.0s (9.1x slower!)

• Ruby 1.9 on Windows Server 2008 (2 cores)

    Sequential           : 3.32s
    Threaded (2 threads) : 3.45s (~ same)

• Why does it get that much slower on Windows?

Experiment: Messaging
• A request/reply server for size-prefixed messages
[Figure: Server and Client exchanging messages]
• Each message: a size header + payload
• Similar to ZeroMQ

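A size-prefixed message can be sketched as follows. This is an illustration under assumptions, not the code from the talk: the 4-byte header matches the `recv(4)` shown later in the slides, but the big-endian unsigned format and the helper names `send_msg`, `recv_exact`, and `recv_msg` are hypothetical.

```python
import socket
import struct

# Assumed header format: 4-byte unsigned length in network byte order
HEADER = struct.Struct("!I")


def send_msg(sock, payload):
    """Send one message: size header followed by the payload bytes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, nbytes):
    """Read exactly nbytes; recv() alone may return fewer."""
    chunks = []
    while nbytes > 0:
        chunk = sock.recv(nbytes)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        nbytes -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Receive one message: read the header, then that many payload bytes."""
    (size,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, size)


if __name__ == "__main__":
    a, b = socket.socketpair()
    send_msg(a, b"hello")
    print(recv_msg(b))
    a.close()
    b.close()
```

Note that receiving one message takes at least two blocking reads (header, then payload), which matters once each blocking call means giving up the GIL.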
An Experiment: Messaging
• A simple test - message echo (pseudocode)

    def client(nummsg, msg):
        while nummsg > 0:
            send(msg)
            resp = recv()
            sleep(0.001)
            nummsg -= 1

    def server():
        while True:
            msg = recv()
            send(msg)

• To be less evil, it's throttled (<1000 msg/sec)
• Not a messaging stress test

An Experiment: Messaging
• A test: send/receive 1000 8K messages
• Scenario 1: Unloaded server
[Figure: Server and Client]
• Scenario 2: Server competing with one CPU-thread
[Figure: Server and Client, with a CPU-Thread alongside the Server]

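The two scenarios can be approximated in a single script. This is a rough sketch under several assumptions, not the benchmark from the talk: it uses fixed-size 8K messages over `socket.socketpair()` instead of a size header, the names `run_scenario` and `spin` are hypothetical, and the client runs as a thread in the same process, so it also competes for the GIL (which a separate client process would not). Treat the numbers as qualitative only.

```python
import socket
import threading
import time

MSG_SIZE = 8192  # 8K messages, as in the test described above


def server(sock):
    """Echo fixed-size messages until the client closes the connection."""
    with sock, sock.makefile("rb") as rfile:
        while True:
            msg = rfile.read(MSG_SIZE)  # blocks until MSG_SIZE bytes or EOF
            if len(msg) < MSG_SIZE:
                break
            sock.sendall(msg)


def client(sock, nummsg):
    """Send nummsg messages, waiting for each echo; throttled like the slide."""
    msg = b"x" * MSG_SIZE
    with sock, sock.makefile("rb") as rfile:
        for _ in range(nummsg):
            sock.sendall(msg)
            resp = rfile.read(MSG_SIZE)
            if resp != msg:
                raise RuntimeError("echo mismatch")
            time.sleep(0.001)


def spin(stop):
    """CPU-bound thread that competes for the GIL until told to stop."""
    n = 0
    while not stop.is_set():
        n += 1


def run_scenario(nummsg, with_cpu_thread):
    """Return elapsed seconds for nummsg echo round trips."""
    server_sock, client_sock = socket.socketpair()
    stop = threading.Event()
    threads = [threading.Thread(target=server, args=(server_sock,))]
    if with_cpu_thread:
        threads.append(threading.Thread(target=spin, args=(stop,)))
    for t in threads:
        t.start()
    start = time.perf_counter()
    client(client_sock, nummsg)  # closing the client socket ends the server
    elapsed = time.perf_counter() - start
    stop.set()
    for t in threads:
        t.join()
    return elapsed


if __name__ == "__main__":
    print("unloaded     : %.3fs" % run_scenario(50, False))
    print("1 CPU thread : %.3fs" % run_scenario(50, True))
```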
What's Happening?
• The GIL does far more than limit cores
• It can make performance much worse
• Better performance by turning off cores?
• 5000x performance hit on Linux?
• Why?

Why You Might Care
• Must you abandon Python/Ruby for concurrency?
• Having threads restricted to one CPU core might be okay if it were sane
• Analogy: A multitasking operating system (e.g., Linux) runs fine on a single CPU
• Plus, threads get used a lot behind the scenes (even in thread alternatives, e.g., async)

Why I Care
• It's an interesting little systems problem
• How do you make a better GIL?
• It's fun.

Some Background
• I have been discussing some of these issues in the Python community since 2009
  http://www.dabeaz.com/GIL
• I'm less familiar with Ruby, but I've looked at its GIL implementation and experimented
• Very interested in commonalities/differences

Thread Implementation
Python:
• System threads (e.g., pthreads)
• Managed by OS
• Concurrent execution of the Python interpreter (written in C)
Ruby:
• System threads (e.g., pthreads)
• Managed by OS
• Concurrent execution of the Ruby VM (written in C)

Alas, the GIL
• Parallel execution is forbidden
• There is a "global interpreter lock"
• The GIL ensures that only one thread runs in the interpreter at once
• Simplifies many low-level details (memory management, callouts to C extensions, etc.)

Thread Execution Model
• The GIL results in cooperative multitasking
[Figure: Threads 1, 2, and 3 alternate between "run" and "block"; each block releases the GIL, and the next thread to run acquires it]
• When a thread is running, it holds the GIL
• GIL released on blocking (e.g., I/O operations)

Threads for I/O
• For I/O it works great
• GIL is never held very long
• Most threads just sit around sleeping
• Life is good

Threads for Computation
• You may actually want to compute something!
  • Fibonacci numbers
  • Image/audio processing
  • Parsing
• The CPU will be busy
• And it won't give up the GIL on its own

CPU-Bound Switching
Python:
• Releases and reacquires the GIL every 100 "ticks"
• 1 Tick ~= 1 interpreter instruction
Ruby:
• Background thread generates a timer interrupt every 10ms
• GIL released and reacquired by current thread on interrupt

Python Thread Switching
[Figure: a CPU-bound thread runs 100 ticks at a time, with a GIL release/acquire between each run]
• Every 100 VM instructions, GIL is dropped, allowing other threads to run if they want
• Not time based--switching interval depends on kind of instructions executed

Ruby Thread Switching
[Figure: a timer thread fires every 10ms; on each timer event the CPU-bound thread releases and reacquires the GIL]
• Loosely mimics the time-slice of the OS
• Every 10ms, GIL is released/acquired

A Common Theme
• Both Python and Ruby have C code like this:

    void execute() {
        while (inst = next_instruction()) {
            // Run the VM instruction
            ...
            if (must_release_gil) {
                GIL_release();
                /* Other threads may run now */
                GIL_acquire();
            }
        }
    }

• Exact details vary, but concept is the same
• Each thread has periodic release/acquire in the VM to allow other threads to run

Question
• What can go wrong with this bit of code?

    if (must_release_gil) {
        GIL_release();
        /* Other threads may run now */
        GIL_acquire();
    }

• Short answer: Everything!

Thread Switching
• You might expect that Thread 2 will run
[Figure: Thread 1 is running and Thread 2 is READY. Thread 1 releases the GIL and is preempted by pthreads/OS; Thread 2 is scheduled, acquires the GIL, and runs while Thread 1 becomes READY]
• But you assume the GIL plays nice...

Thread Switching
• What might actually happen on multicore
[Figure: Thread 1 releases the GIL but keeps running and reacquires it. Thread 2 is scheduled by pthreads/OS, its GIL acquire fails (GIL locked), and it goes back to READY]
• Both threads attempt to run simultaneously
• ... but only one will succeed (depends on timing)

Fallacy

    if (must_release_gil) {
        GIL_release();
        /* Other threads may run now */
        GIL_acquire();
    }

• This code doesn't actually switch threads
• It might switch threads, but it depends
  • What operating system
  • # cores
  • Lock scheduling policy (if any)

Fallacy

    if (must_release_gil) {
        GIL_release();
        sleep(0);
        /* Other threads may run now */
        GIL_acquire();
    }

• This doesn't force switching (sleeping)
• It might switch threads, but it depends
  • What operating system
  • # cores
  • Lock scheduling policy (if any)

Fallacy

    if (must_release_gil) {
        GIL_release();
        sched_yield();
        /* Other threads may run now */
        GIL_acquire();
    }

• Neither does this (calling the scheduler)
• It might switch threads, but it depends
  • What operating system
  • # cores
  • Lock scheduling policy (if any)

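One way to see the fallacy is to count how often a release/acquire pair really hands the lock to another thread. The sketch below is a toy model, not interpreter code: an ordinary `threading.Lock` stands in for the GIL, and `run_simulation` with its parameters is a hypothetical helper. Because it runs inside CPython, the real GIL also influences the outcome, so the counts are only suggestive and vary by OS, core count, and Python version.

```python
import threading


def run_simulation(nthreads=2, ticks=20000, check_every=100):
    """Return (attempts, switches) for a toy periodic release/acquire loop."""
    gil = threading.Lock()  # stand-in for the interpreter lock
    # All fields below are only modified while holding `gil`
    state = {"last_owner": None, "attempts": 0, "switches": 0}

    def worker(ident):
        gil.acquire()
        state["last_owner"] = ident
        for tick in range(1, ticks + 1):
            if tick % check_every == 0:
                gil.release()
                # Other threads may run now... or may not
                gil.acquire()
                state["attempts"] += 1
                if state["last_owner"] != ident:
                    state["switches"] += 1  # another thread held the lock
                state["last_owner"] = ident
        gil.release()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["attempts"], state["switches"]


if __name__ == "__main__":
    attempts, switches = run_simulation()
    print("release/acquire pairs:", attempts)
    print("actual thread switches:", switches)
```

If every release really switched threads, the two numbers would match. Any gap between them is the releasing thread winning the lock straight back.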
A Conflict
• There are conflicting goals
• Python/Ruby - wants to run on a single CPU, but doesn't want to do thread scheduling itself (i.e., it lets the OS do it)
• OS - "Oooh. Multiple cores." Schedules as many runnable tasks as possible at any instant
• Result: Threads fight with each other

Multicore GIL Battle
• You can see it! (2 CPU-bound threads)
[Figure: CPU usage with 2 CPU-bound threads, annotated "Why >100%?"]
• Comment: In Python, it's very rapid
• GIL is released every few microseconds!

I/O Handling
• If there is a CPU-bound thread, I/O-bound threads have a hard time getting the GIL
[Figure: Thread 1 (CPU 1) keeps running through repeated preemptions. Thread 2 (CPU 2) sleeps until a network packet arrives, then fails to acquire the GIL again and again before finally succeeding. Might repeat 100s-1000s of times]

Messaging Pathology
• Messaging on Linux (8 Cores)

    Ruby 1.9 (no threads)   : 1.18s
    Ruby 1.9 (1 CPU thread) : 5839.4s

• Locks in Linux have no fairness
• Consequence: Really hard to steal the GIL
• And Ruby only retries every 10ms

Let's Talk Fairness
• Fair-locking means that locks have some notion of priorities, arrival order, queuing, etc.
[Figure: t0 is running with the lock while t1, t2, t3, t4, t5 wait in line. After t0 releases, t1 is running and the waiting line is t2, t3, t4, t5, t0]
• Releasing means you go to end of line

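Arrival-order fairness can be sketched with a ticket lock. This is one possible illustration, not how Python, Ruby, or any OS implements its locks: the `TicketLock` class, its `tickets_issued` helper, and the `service_order` demo are all hypothetical names. Each acquirer takes the next ticket and waits until that ticket is served, so a thread that releases and re-acquires goes to the end of the line.

```python
import threading
import time


class TicketLock:
    """FIFO lock: waiters are served strictly in arrival (ticket) order."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0   # next ticket to hand out
        self._now_serving = 0   # ticket currently allowed to hold the lock

    def acquire(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            while ticket != self._now_serving:
                self._cond.wait()

    def release(self):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()  # wake waiters; only one ticket matches

    def tickets_issued(self):
        with self._cond:
            return self._next_ticket


def service_order(nthreads=4):
    """Queue nthreads workers in a known order; return the order served."""
    lock = TicketLock()
    order = []

    def worker(ident):
        lock.acquire()
        order.append(ident)
        lock.release()

    lock.acquire()  # main thread holds ticket 0, so workers must queue
    threads = []
    for ident in range(nthreads):
        t = threading.Thread(target=worker, args=(ident,))
        t.start()
        # Wait until this worker has taken its ticket, fixing arrival order
        while lock.tickets_issued() < ident + 2:
            time.sleep(0.001)
        threads.append(t)
    lock.release()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(service_order())
```

The cost of this design is that every release wakes all waiters, and a waiter that is slow to wake up stalls everyone behind it.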
Messaging Revisited
• Go back to the messaging server

    def server():
        while True:
            msg = recv()
            send(msg)

Performance Explained
• What actually happens under the covers

    def server():
        while True:
            size = recv(4)      # GIL release
            msg = recv(size)    # GIL release
            send(size)          # GIL release
            send(msg)           # GIL release

• Why? Each operation might block
• Catch: Passes control back to CPU-bound thread

Performance Illustrated
[Figure: with a CPU-bound thread running, the I/O thread waits for a 10ms timer tick before each of recv, recv, send, send after data arrives]
• Each message has a 40ms response cycle (4 blocking operations x 10ms)
• 1000 messages x 40ms = 40s (42.0s measured)

A Solution? Don't use threads!
• Yes, yes, everyone hates threads
• However, that's only because they're useful!
• Threads are used for all sorts of things
• Even if they're hidden behind the scenes

A Better Solution: Make the GIL better
• The GIL is probably not going away (removing it is very difficult)
• However, does it have to thrash wildly?
• Question: Can you do anything?

GIL Efforts in Python 3
• Python 3.2 has a new GIL implementation
• It's imperfect--in fact, it has a lot of problems
• However, people are experimenting with it

Python 3 GIL
• GIL acquisition now based on timeouts
[Figure: Thread 1 is running and Thread 2 is in IOWAIT. When data arrives, Thread 2 calls wait(gil, TIMEOUT); after 5ms it sets drop_request, Thread 1 releases, Thread 2 runs, and Thread 1 becomes READY in its own wait(gil, TIMEOUT)]
• Involves waiting on a condition variable

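The timeout-based handoff can be mimicked in a few lines. This is a toy model built on `threading.Condition`, not CPython's actual C implementation: the class `TimeoutGIL` and the functions `cpu_worker` and `run` are hypothetical, and the sketch leaves out anything the real implementation does beyond the wait/timeout/drop-request cycle shown on the slide. The 5ms default matches the delay in the figure; in CPython 3.2 and later the real interval is exposed through `sys.getswitchinterval()` and `sys.setswitchinterval()`.

```python
import threading


class TimeoutGIL:
    """Toy lock: a waiter that times out asks the holder to drop the lock."""

    def __init__(self, interval=0.005):
        self.interval = interval      # seconds a waiter is willing to wait
        self.drop_request = False     # set by a waiter, polled by the holder
        self._locked = False
        self._cond = threading.Condition()

    def acquire(self):
        with self._cond:
            while self._locked:
                # Condition.wait() returns False when the timeout expires
                if not self._cond.wait(timeout=self.interval):
                    if self._locked:
                        self.drop_request = True
            self._locked = True
            self.drop_request = False

    def release(self):
        with self._cond:
            self._locked = False
            self._cond.notify()


def cpu_worker(gil, iterations, counts, name):
    """CPU-bound loop that only gives up the lock when asked to."""
    gil.acquire()
    for _ in range(iterations):
        counts[name] += 1             # one "tick" of work, done under the lock
        if gil.drop_request:
            gil.release()
            gil.acquire()
    gil.release()


def run(iterations=20000):
    gil = TimeoutGIL()
    counts = {"a": 0, "b": 0}
    threads = [threading.Thread(target=cpu_worker,
                                args=(gil, iterations, counts, name))
               for name in counts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


if __name__ == "__main__":
    print(run())
```

Nothing in this sketch guarantees that the waiter gets the lock after the holder releases it; the holder may simply re-acquire it first, and the waiter then sits out another full timeout.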
Problem: Convoying
• CPU-bound threads significantly degrade I/O
[Figure: each time data arrives, Thread 2 sits READY for 5ms before Thread 1 releases the GIL and lets it run]
• This is the same problem as in Ruby
• Just a shorter time delay (5ms)

Problem: Convoying
• You can directly observe the delays (messaging)

    Python/Ruby (No threads) : 1.29s (no delays)
    Python 3.2 (1 Thread)    : 20.1s (5ms delays)
    Ruby 1.9 (1 Thread)      : 42.0s (10ms delays)

• Still not great, but problem is understood

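The measured times line up with a simple back-of-the-envelope model: each message needs four blocking operations (two receives, two sends), and each one waits out a full switching delay. The function below just does that arithmetic; `predicted_seconds` is a hypothetical helper, and applying the same count of four operations to the Python server is an assumption.

```python
def predicted_seconds(nummsg, blocking_ops, delay_ms):
    """Total delay if every blocking operation waits one full interval."""
    return nummsg * blocking_ops * delay_ms / 1000.0


if __name__ == "__main__":
    # 1000 messages, 4 blocking operations per message
    print("Ruby 1.9   (10ms):", predicted_seconds(1000, 4, 10))  # measured: 42.0s
    print("Python 3.2  (5ms):", predicted_seconds(1000, 4, 5))   # measured: 20.1s
```

The model predicts 40s and 20s, close to the 42.0s and 20.1s measured, which suggests the switching delay accounts for nearly all of the elapsed time.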
Priorities
• Most promising: priority scheduling
• Earlier versions of Ruby had it
• It works (OS-X, 4 cores)

    Ruby 1.9 (1 Thread)                   : 42.0s
    Ruby 1.8.7 (1 Thread)                 : 40.2s
    Ruby 1.8.7 (1 Thread, lower priority) : 10.0s

• Comment: Ruby-1.9 allows thread priorities to be set in pthreads, but it doesn't seem to have much (if any) effect

New Problems
• Priorities bring new challenges
  • Starvation
  • Priority inversion
  • Implementation complexity
• Do you have to write a full OS scheduler?
• Hopefully not, but it's an open question

Final Words
• Implementing a GIL is a lot trickier than it looks
• Even work with priorities has problems
• Good example of how multicore is diabolical

Thanks for Listening!
• I hope you learned at least one new thing
• I'm always interested in feedback
• Follow me on Twitter (@dabeaz)