In Search of the Perfect Global Interpreter Lock

Conference presentation. RuPy 2011. Poznan, Poland. Conference video at https://www.youtube.com/watch?v=5jbG7UKT1l4

David Beazley

October 15, 2011

Transcript

  1. In Search of the Perfect Global Interpreter Lock
     David Beazley (http://www.dabeaz.com, @dabeaz)
     Presented at RuPy 2011, Poznan, Poland, October 15, 2011
     Copyright (C) 2010, David Beazley, http://www.dabeaz.com
  2. Introduction
     • As many programmers know, Python and Ruby feature a Global Interpreter
       Lock (GIL)
     • More precisely: CPython and MRI
     • It limits thread performance on multicore
     • Theoretically restricts code to a single CPU
  3. An Experiment
     • Consider a trivial CPU-bound function

           def countdown(n):
               while n > 0:
                   n -= 1

     • Run it once with a lot of work

           COUNT = 100000000    # 100 million
           countdown(COUNT)

     • Now, divide the work across two threads

           t1 = Thread(target=countdown, args=(COUNT//2,))
           t2 = Thread(target=countdown, args=(COUNT//2,))
           t1.start(); t2.start()
           t1.join(); t2.join()
  4. An Experiment
     • Some Ruby

           def countdown(n)
             while n > 0
               n -= 1
             end
           end

     • Sequential

           COUNT = 100000000    # 100 million
           countdown(COUNT)

     • Subdivided across threads

           t1 = Thread.new { countdown(COUNT/2) }
           t2 = Thread.new { countdown(COUNT/2) }
           t1.join
           t2.join
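     The Python version of the experiment can be run end to end with a sketch
     like the following (COUNT is reduced from the talk's 100 million so it
     finishes quickly; absolute timings will vary by machine and interpreter):

     ```python
     import time
     from threading import Thread

     COUNT = 10_000_000   # smaller than the talk's 100 million, for a quick run

     def countdown(n):
         while n > 0:
             n -= 1

     # Sequential: all the work in one thread
     start = time.perf_counter()
     countdown(COUNT)
     seq = time.perf_counter() - start

     # Threaded: the same total work split across two threads
     start = time.perf_counter()
     t1 = Thread(target=countdown, args=(COUNT // 2,))
     t2 = Thread(target=countdown, args=(COUNT // 2,))
     t1.start(); t2.start()
     t1.join(); t2.join()
     thr = time.perf_counter() - start

     print(f"sequential: {seq:.2f}s   threaded: {thr:.2f}s")
     ```

     On a GIL interpreter the threaded run does not go faster, and (as the next
     slides show) may go noticeably slower.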
  5. Expectations
     • Sequential and threaded versions perform the same amount of work
       (same number of calculations)
     • There is the GIL... so no parallelism
     • Performance should be about the same
  6. Results
     • Ruby 1.9 on OS-X (4 cores)
           Sequential           : 2.46s
           Threaded (2 threads) : 2.55s   (~ same)
  7. Results
     • Ruby 1.9 on OS-X (4 cores)
           Sequential           : 2.46s
           Threaded (2 threads) : 2.55s   (~ same)
     • Python 2.7
           Sequential           : 6.12s
           Threaded (2 threads) : 9.28s   (1.5x slower!)
  8. Results
     • Ruby 1.9 on OS-X (4 cores)
           Sequential           : 2.46s
           Threaded (2 threads) : 2.55s   (~ same)
     • Python 2.7
           Sequential           : 6.12s
           Threaded (2 threads) : 9.28s   (1.5x slower!)
     • Question: Why does it get slower in Python?
  9. Results
     • Ruby 1.9 on Windows Server 2008 (2 cores)
           Sequential           : 3.32s
           Threaded (2 threads) : 3.45s   (~ same)
  10. Results
      • Ruby 1.9 on Windows Server 2008 (2 cores)
            Sequential           : 3.32s
            Threaded (2 threads) : 3.45s   (~ same)
      • Python 2.7
            Sequential           : 6.9s
            Threaded (2 threads) : 63.0s   (9.1x slower!)
  11. Results
      • Ruby 1.9 on Windows Server 2008 (2 cores)
            Sequential           : 3.32s
            Threaded (2 threads) : 3.45s   (~ same)
      • Python 2.7
            Sequential           : 6.9s
            Threaded (2 threads) : 63.0s   (9.1x slower!)
      • Why does it get that much slower on Windows?
  12. Experiment: Messaging
      • A request/reply server for size-prefixed messages (a client sends,
        the server replies)
      • Each message: a size header + payload
      • Similar: ZeroMQ
  13. An Experiment: Messaging
      • A simple test - message echo (pseudocode)

            def client(nummsg, msg):
                while nummsg > 0:
                    send(msg)
                    resp = recv()
                    sleep(0.001)
                    nummsg -= 1

            def server():
                while True:
                    msg = recv()
                    send(msg)

  14. An Experiment: Messaging
      • The same echo test as above
      • To be less evil, it's throttled (<1000 msg/sec)
      • Not a messaging stress test
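      A runnable single-process version of this echo test can be sketched with
      a socketpair (the recv_exact helper and the exact framing here are
      illustrative, not the talk's actual benchmark code):

      ```python
      import socket
      import struct
      import threading
      import time

      def recv_exact(sock, n):
          """Read exactly n bytes (a single recv may return fewer)."""
          data = b""
          while len(data) < n:
              chunk = sock.recv(n - len(data))
              if not chunk:
                  raise ConnectionError("peer closed")
              data += chunk
          return data

      def server(sock):
          """Echo size-prefixed messages back to the client."""
          try:
              while True:
                  size = struct.unpack("!I", recv_exact(sock, 4))[0]
                  msg = recv_exact(sock, size)
                  sock.sendall(struct.pack("!I", size) + msg)
          except (ConnectionError, OSError):
              pass

      def client(sock, nummsg, msg):
          """Send nummsg messages, wait for each echo; return count echoed."""
          echoed = 0
          for _ in range(nummsg):
              sock.sendall(struct.pack("!I", len(msg)) + msg)
              size = struct.unpack("!I", recv_exact(sock, 4))[0]
              recv_exact(sock, size)
              echoed += 1
              time.sleep(0.001)   # throttle, as in the talk (<1000 msg/sec)
          return echoed

      a, b = socket.socketpair()
      threading.Thread(target=server, args=(b,), daemon=True).start()
      echoed = client(a, 10, b"x" * 8192)   # 10 messages of 8K for a quick demo
      a.close()
      print("echoed", echoed, "messages")
      ```

      Adding a CPU-bound thread alongside this client/server pair reproduces
      the slowdowns measured in the following slides.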
  15. An Experiment: Messaging
      • A test: send/receive 1000 8K messages
      • Scenario 1: an unloaded server
      • Scenario 2: a server competing with one CPU-bound thread
  16. Results
      • Messaging with no threads (OS-X, 4 cores)
            C          : 1.26s
            Python 2.7 : 1.29s
            Ruby 1.9   : 1.29s
  17. Results
      • Messaging with no threads (OS-X, 4 cores)
            C          : 1.26s
            Python 2.7 : 1.29s
            Ruby 1.9   : 1.29s
      • Messaging with one CPU-bound thread*
            C          : 1.16s   (~8% faster!?)
            Python 2.7 : 12.3s   (10x slower)
            Ruby 1.9   : 42.0s   (33x slower)
      • Hmmm. Curious.
      * On Ruby, the CPU-bound thread was also given lower priority
  18. Results
      • Messaging with no threads (Linux, 8 CPUs)
            C          : 1.13s
            Python 2.7 : 1.18s
            Ruby 1.9   : 1.18s
  19. Results
      • Messaging with no threads (Linux, 8 CPUs)
            C          : 1.13s
            Python 2.7 : 1.18s
            Ruby 1.9   : 1.18s
      • Messaging with one CPU-bound thread
            C          : 1.11s    (same)
            Python 2.7 : 1.60s    (1.4x slower) - better
            Ruby 1.9   : 5839.4s  (~5000x slower) - worse!
  20. Results
      • Messaging with no threads (Linux, 8 CPUs)
            C          : 1.13s
            Python 2.7 : 1.18s
            Ruby 1.9   : 1.18s
      • Messaging with one CPU-bound thread
            C          : 1.11s    (same)
            Python 2.7 : 1.60s    (1.4x slower) - better
            Ruby 1.9   : 5839.4s  (~5000x slower) - worse!
      • 5000x slower? Really? Why?
  21. The Mystery Deepens
      • Disable all but one CPU core
      • CPU-bound threads (OS-X)
            Python 2.7 (4 cores+hyperthreading) : 9.28s
            Python 2.7 (1 core)                 : 7.9s   (faster!)
      • Messaging with one CPU-bound thread
            Ruby 1.9 (4 cores+hyperthreading)   : 42.0s
            Ruby 1.9 (1 core)                   : 10.5s  (much faster!)
      • ?!?!?!?!?!?
  22. Better is Worse
      • Change software versions
      • Let's upgrade to Python 3 (Linux)
            Python 2.7 (Messaging) : 12.3s
            Python 3.2 (Messaging) : 20.1s  (1.6x slower)
      • Let's downgrade to Ruby 1.8 (Linux)
            Ruby 1.9 (Messaging)   : 42.0s
            Ruby 1.8.7 (Messaging) : 10.0s  (4x faster)
      • So much for progress (sigh)
  23. What's Happening?
      • The GIL does far more than limit cores
      • It can make performance much worse
      • Better performance by turning off cores?
      • A 5000x performance hit on Linux?
      • Why?
  24. Why You Might Care
      • Must you abandon Python/Ruby for concurrency?
      • Having threads restricted to one CPU core might be okay if it were sane
      • Analogy: a multitasking operating system (e.g., Linux) runs fine on a
        single CPU
      • Plus, threads get used a lot behind the scenes (even in thread
        alternatives, e.g., async)
  25. Why I Care
      • It's an interesting little systems problem
      • How do you make a better GIL?
      • It's fun.
  26. Some Background
      • I have been discussing some of these issues in the Python community
        since 2009: http://www.dabeaz.com/GIL
      • I'm less familiar with Ruby, but I've looked at its GIL implementation
        and experimented
      • Very interested in commonalities/differences
  27. A Tale of Two GILs
  28. Thread Implementation
      • Python: system threads (e.g., pthreads), managed by the OS, with
        concurrent execution of the Python interpreter (written in C)
      • Ruby: system threads (e.g., pthreads), managed by the OS, with
        concurrent execution of the Ruby VM (written in C)
  29. Alas, the GIL
      • Parallel execution is forbidden
      • There is a "global interpreter lock"
      • The GIL ensures that only one thread runs in the interpreter at once
      • Simplifies many low-level details (memory management, callouts to C
        extensions, etc.)
  30. GIL Implementation
      • Simple mutex lock:

            mutex_t gil;

            void gil_acquire() { mutex_lock(gil); }
            void gil_release() { mutex_unlock(gil); }

      • Condition variable:

            int gil_locked = 0;
            mutex_t gil_mutex;
            cond_t gil_cond;

            void gil_acquire() {
                mutex_lock(gil_mutex);
                while (gil_locked)
                    cond_wait(gil_cond, gil_mutex);
                gil_locked = 1;
                mutex_unlock(gil_mutex);
            }

            void gil_release() {
                mutex_lock(gil_mutex);
                gil_locked = 0;
                cond_notify(gil_cond);
                mutex_unlock(gil_mutex);
            }
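      The condition-variable variant maps directly onto Python's own threading
      primitives. A toy model (illustrative only, not CPython's actual C
      implementation):

      ```python
      import threading

      class ToyGIL:
          """Toy model of the condition-variable GIL: a flag guarded by a condvar."""
          def __init__(self):
              self._cond = threading.Condition()
              self._locked = False

          def acquire(self):
              with self._cond:
                  while self._locked:      # someone else holds the GIL
                      self._cond.wait()
                  self._locked = True

          def release(self):
              with self._cond:
                  self._locked = False
                  self._cond.notify()      # wake one waiter, if any

      gil = ToyGIL()
      gil.acquire()
      print("held:", gil._locked)   # held: True
      gil.release()
      print("held:", gil._locked)   # held: False
      ```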
  31. Thread Execution Model
      • The GIL results in cooperative multitasking
      • When a thread is running, it holds the GIL
      • The GIL is released on blocking operations (e.g., I/O); another thread
        then acquires it and runs until it blocks in turn
  32. Threads for I/O
      • For I/O it works great
      • The GIL is never held very long
      • Most threads just sit around sleeping
      • Life is good
  33. Threads for Computation
      • You may actually want to compute something!
        • Fibonacci numbers
        • Image/audio processing
        • Parsing
      • The CPU will be busy
      • And it won't give up the GIL on its own
  34. CPU-Bound Switching
      • Python: releases and reacquires the GIL every 100 "ticks"
        (1 tick ~= 1 interpreter instruction)
      • Ruby: a background thread generates a timer interrupt every 10ms;
        the GIL is released and reacquired by the current thread on interrupt
  35. Python Thread Switching
      • A CPU-bound thread runs 100 ticks, releases the GIL, reacquires it,
        and runs another 100 ticks
      • Every 100 VM instructions, the GIL is dropped, allowing other threads
        to run if they want
      • Not time based--the switching interval depends on the kind of
        instructions executed
  36. Ruby Thread Switching
      • A timer thread fires every 10ms; the running CPU-bound thread then
        releases and reacquires the GIL
      • Loosely mimics the time-slice of the OS
  37. A Common Theme
      • Both Python and Ruby have C code like this:

            void execute() {
                while (inst = next_instruction()) {
                    // Run the VM instruction
                    ...
                    if (must_release_gil) {
                        GIL_release();
                        /* Other threads may run now */
                        GIL_acquire();
                    }
                }
            }

      • Exact details vary, but the concept is the same
      • Each thread has a periodic release/acquire in the VM to allow other
        threads to run
  38. Question

            if (must_release_gil) {
                GIL_release();
                /* Other threads may run now */
                GIL_acquire();
            }

      • What can go wrong with this bit of code?
      • Short answer: Everything!
  39. Pathology
  40. Thread Switching
      • Suppose you have two threads
      • Thread 1 : Running
      • Thread 2 : Ready (waiting for the GIL)
  41. Thread Switching
      • Easy case : Thread 1 performs I/O (read/write)
      • Thread 1 : releases the GIL and blocks for I/O
      • Thread 2 : acquires the GIL, gets scheduled by pthreads/the OS, and
        starts running
  42. Thread Switching
      • Tricky case : Thread 1 runs until preempted by the OS, then releases
        the GIL
      • Which thread runs next?
  43. Thread Switching
      • You might expect that Thread 2 will run
      • But that assumes the GIL plays nice...
  44. Thread Switching
      • What might actually happen on multicore: Thread 1 releases the GIL
        and keeps running; Thread 2 gets scheduled on another core, tries to
        acquire the GIL, and fails (GIL locked)
      • Both threads attempt to run simultaneously
      • ... but only one will succeed (depends on timing)
  45. Fallacy

            if (must_release_gil) {
                GIL_release();
                /* Other threads may run now */
                GIL_acquire();
            }

      • This code doesn't actually switch threads
      • It might switch threads, but it depends on:
        • the operating system
        • the number of cores
        • the lock scheduling policy (if any)
  46. Fallacy

            if (must_release_gil) {
                GIL_release();
                sleep(0);
                /* Other threads may run now */
                GIL_acquire();
            }

      • Neither does this (sleeping)
      • It might switch threads, but it depends on the same factors
  47. Fallacy

            if (must_release_gil) {
                GIL_release();
                sched_yield();
                /* Other threads may run now */
                GIL_acquire();
            }

      • Neither does this (calling the scheduler)
      • It might switch threads, but it depends on the same factors
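      The fallacy is easy to demonstrate from Python itself: releasing a lock
      and immediately reacquiring it rarely hands it to a waiting thread. In
      this sketch the counts are timing-dependent, but on most systems the
      waiter wins only a tiny fraction of the release/acquire cycles:

      ```python
      import threading

      lock = threading.Lock()
      stolen = 0
      done = False

      def waiter():
          """Repeatedly try to grab the lock, like a thread waiting on the GIL."""
          global stolen
          while not done:
              if lock.acquire(timeout=0.01):
                  stolen += 1
                  lock.release()

      lock.acquire()
      t = threading.Thread(target=waiter)
      t.start()

      attempts = 50_000
      for _ in range(attempts):
          lock.release()   # "GIL_release()"
          lock.acquire()   # "GIL_acquire()" -- usually wins before the waiter wakes

      done = True
      lock.release()
      t.join()
      print(f"waiter got the lock {stolen} times in {attempts} cycles")
      ```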
  48. A Conflict
      • There are conflicting goals
      • Python/Ruby - wants to run on a single CPU, but doesn't want to do
        thread scheduling (i.e., let the OS do it)
      • OS - "Oooh. Multiple cores." Schedules as many runnable tasks as
        possible at any instant
      • Result: threads fight with each other
  49. Multicore GIL Battle
      • Python 2.7 on OS-X (4 cores)
            Sequential           : 6.12s
            Threaded (2 threads) : 9.28s  (1.5x slower!)
      • Each thread runs 100 ticks, releases the GIL, and both threads then
        race to reacquire it--most attempts fail and the thread goes back to
        READY
      • Millions of failed GIL acquisitions
  50. Multicore GIL Battle
      • You can see it! Run 2 CPU-bound threads and watch CPU utilization
        climb above 100%--multiple cores are busy fighting over the GIL
      • Comment: in Python, it's very rapid--the GIL is released every few
        microseconds!
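      On Unix systems you can also measure the fight directly by counting
      context switches around two CPU-bound threads (a sketch; the `resource`
      module is Unix-only, and the counts vary wildly by OS and core count):

      ```python
      import resource
      import threading

      def countdown(n):
          while n > 0:
              n -= 1

      before = resource.getrusage(resource.RUSAGE_SELF)

      t1 = threading.Thread(target=countdown, args=(5_000_000,))
      t2 = threading.Thread(target=countdown, args=(5_000_000,))
      t1.start(); t2.start()
      t1.join(); t2.join()

      after = resource.getrusage(resource.RUSAGE_SELF)
      print("voluntary switches:  ", after.ru_nvcsw - before.ru_nvcsw)
      print("involuntary switches:", after.ru_nivcsw - before.ru_nivcsw)
      ```

      On a multicore box the switch counts for this tiny workload are far
      higher than for the sequential equivalent.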
  51. I/O Handling
      • If there is a CPU-bound thread, I/O-bound threads have a hard time
        getting the GIL
      • When a network packet arrives, the I/O thread wakes, tries to acquire
        the GIL, fails (the CPU-bound thread gets preempted but reacquires
        first), and goes back to sleep
      • This might repeat 100s-1000s of times before the acquire succeeds
  52. Messaging Pathology
      • Messaging on Linux (8 cores)
            Ruby 1.9 (no threads)   : 1.18s
            Ruby 1.9 (1 CPU thread) : 5839.4s
      • Locks on Linux have no fairness
      • Consequence: it's really hard to steal the GIL
      • And Ruby only retries every 10ms
  53. Let's Talk Fairness
      • Fair locking means that locks have some notion of priorities, arrival
        order, queuing, etc.
      • Waiting threads form a line behind the current lock holder
      • Releasing means you go to the end of the line
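      A fair (FIFO) lock can be sketched in a few lines: waiters queue up in
      arrival order, and release hands the lock directly to the head of the
      queue. This is illustrative only--real fair locks live in the kernel or
      the threading library:

      ```python
      import threading
      import time
      from collections import deque

      class FairLock:
          """FIFO ("take a number") lock: release hands off to the longest waiter."""
          def __init__(self):
              self._mutex = threading.Lock()
              self._waiters = deque()
              self._held = False

          def acquire(self):
              self._mutex.acquire()
              if not self._held and not self._waiters:
                  self._held = True
                  self._mutex.release()
                  return
              ev = threading.Event()
              self._waiters.append(ev)     # go to the end of the line
              self._mutex.release()
              ev.wait()                    # woken exactly when it's our turn

          def release(self):
              with self._mutex:
                  if self._waiters:
                      self._waiters.popleft().set()   # direct handoff, FIFO
                  else:
                      self._held = False

      lock = FairLock()
      lock.acquire()                       # hold the lock while workers line up
      order = []

      def worker(i):
          lock.acquire()
          order.append(i)
          lock.release()

      threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
      for t in threads:
          t.start()
          time.sleep(0.05)                 # stagger arrivals so the queue forms
      lock.release()                       # the head of the line runs first
      for t in threads:
          t.join()
      print("service order:", order)
      ```

      The direct handoff is exactly what makes fair locks painful for a GIL:
      a releasing thread cannot immediately reacquire, as the next slides show.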
  54. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Question: Which one uses fair locking?
  55. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s   (Fair)
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Benefit : I/O threads get their turn (yay!)
  56. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s   (Fair)
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Benefit : I/O threads get their turn (yay!)
      • Python 2.7 (multiple cores)
            2 CPU-Bound Threads (OS-X)    : 9.28s
            2 CPU-Bound Threads (Windows) : 63.0s
      • Question: Which one uses fair locking?
  57. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s   (Fair)
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Benefit : I/O threads get their turn (yay!)
      • Python 2.7 (multiple cores)
            2 CPU-Bound Threads (OS-X)    : 9.28s
            2 CPU-Bound Threads (Windows) : 63.0s   (Fair)
      • Problem: too much context switching
  58. Fair-Locking - Bah!
      • In reality, you don't want fairness
      • Messaging revisited (OS-X, 4 cores)
            Ruby 1.9 (no threads)         : 1.29s
            Ruby 1.9 (1 CPU-bound thread) : 42.0s  (33x slower)
      • Why is it still 33x slower?
      • Answer: fair locking! (and convoying)
  59. Messaging Revisited
      • Go back to the messaging server

            def server():
                while True:
                    msg = recv()
                    send(msg)

  60. Messaging Revisited
      • The actual implementation (size-prefixed messages)

            def server():
                while True:
                    size = recv(4)
                    msg = recv(size)
                    send(size)
                    send(msg)

  61. Performance Explained
      • What actually happens under the covers

            def server():
                while True:
                    size = recv(4)      # GIL release
                    msg = recv(size)    # GIL release
                    send(size)          # GIL release
                    send(msg)           # GIL release

      • Why? Each operation might block
      • Catch: each release passes control back to the CPU-bound thread
  62. Performance Illustrated
      • With fair locking, each of the four I/O operations hands the GIL to
        the CPU-bound thread, which runs for a full 10ms timer period before
        handing it back
      • Each message therefore has a 40ms response cycle (4 operations x 10ms)
      • 1000 messages x 40ms = 40s (42.0s measured)
  63. Despair
  64. A Solution?
      • "Don't use threads!"
      • Yes, yes, everyone hates threads
      • However, that's only because they're useful!
      • Threads are used for all sorts of things
      • Even if they're hidden behind the scenes
  65. A Better Solution
      • Make the GIL better
      • The GIL is probably not going away (removing it is very difficult)
      • However, does it have to thrash wildly?
      • Question: Can you do anything?
  66. GIL Efforts in Python 3
      • Python 3.2 has a new GIL implementation
      • It's imperfect--in fact, it has a lot of problems
      • However, people are experimenting with it
  67. Python 3 GIL
      • GIL acquisition is now based on timeouts
      • A waiting thread does a timed wait on a condition variable
        (wait(gil, TIMEOUT), 5ms); if the GIL isn't released within the
        timeout, it sets a drop_request that forces the running thread to
        release
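      In CPython 3.2+ that timeout is exposed at the Python level as the
      "switch interval":

      ```python
      import sys

      # The new GIL's timeout is the "switch interval" (5ms by default)
      default = sys.getswitchinterval()
      print(f"default switch interval: {default * 1000:.1f} ms")

      # A smaller interval makes CPU-bound threads hand off the GIL more often
      # (better I/O latency, more switching overhead); larger does the opposite
      sys.setswitchinterval(0.001)
      print(f"new switch interval: {sys.getswitchinterval() * 1000:.1f} ms")
      sys.setswitchinterval(default)   # restore
      ```

      Tuning it trades I/O responsiveness against context-switch overhead; it
      does not remove the convoying problem described next.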
  68. Problem: Convoying
      • CPU-bound threads still significantly degrade I/O: each time data
        arrives, the I/O thread waits out the full timeout before the
        CPU-bound thread releases the GIL
      • This is the same problem as in Ruby
      • Just a shorter time delay (5ms instead of 10ms)
  69. Problem: Convoying
      • You can directly observe the delays (messaging)
            Python/Ruby (no threads) : 1.29s  (no delays)
            Python 3.2 (1 thread)    : 20.1s  (5ms delays)
            Ruby 1.9 (1 thread)      : 42.0s  (10ms delays)
      • Still not great, but the problem is understood
  70. Promise
  71. Priorities
      • Best promise : priority scheduling
      • Earlier versions of Ruby had it
      • It works (OS-X, 4 cores)
            Ruby 1.9 (1 thread)                   : 42.0s
            Ruby 1.8.7 (1 thread)                 : 40.2s
            Ruby 1.8.7 (1 thread, lower priority) : 10.0s
      • Comment: Ruby 1.9 allows thread priorities to be set via pthreads,
        but it doesn't seem to have much (if any) effect
  72. Priorities
      • Experimental Python 3.2 with a priority scheduler
      • Also features immediate preemption
      • Messages (OS-X, 4 cores)
            Python 3.2 (no threads)          : 1.29s
            Python 3.2 (1 thread)            : 20.2s
            Python 3.2+priorities (1 thread) : 1.21s  (faster?)
      • That's a lot more promising!
  73. New Problems
      • Priorities bring new challenges
        • Starvation
        • Priority inversion
        • Implementation complexity
      • Do you have to write a full OS scheduler?
      • Hopefully not, but it's an open question
  74. Final Words
      • Implementing a GIL is a lot trickier than it looks
      • Even the work with priorities has problems
      • A good example of how multicore is diabolical
  75. Thanks for Listening!
      • I hope you learned at least one new thing
      • I'm always interested in feedback
      • Follow me on Twitter (@dabeaz)