David Beazley
October 15, 2011

# In Search of the Perfect Global Interpreter Lock

Conference presentation. RuPy 2011. Poznan, Poland. Conference video at https://www.youtube.com/watch?v=5jbG7UKT1l4


## Transcript

1. Copyright (C) 2010, David Beazley, http://www.dabeaz.com
In Search of the Perfect Global Interpreter Lock
David Beazley
http://www.dabeaz.com
@dabeaz
Presented at RuPy 2011, Poznan, Poland, October 15, 2011

2. Introduction
• As many programmers know, Python and Ruby feature a Global Interpreter Lock (GIL)
• More precisely: CPython and MRI
• It limits thread performance on multicore
• Theoretically restricts code to a single CPU

3. An Experiment
• Consider a trivial CPU-bound function

    def countdown(n):
        while n > 0:
            n -= 1

• Run it once with a lot of work

    COUNT = 100000000  # 100 million
    countdown(COUNT)

• Now, divide the work across two threads

    t1 = Thread(target=countdown, args=(COUNT//2,))
    t2 = Thread(target=countdown, args=(COUNT//2,))
    t1.start(); t2.start()
    t1.join(); t2.join()

4. An Experiment
• Some Ruby

    def countdown(n)
      while n > 0
        n -= 1
      end
    end

• Sequential

    COUNT = 100000000  # 100 million
    countdown(COUNT)

• Subdivided across threads

    t1 = Thread.new { countdown(COUNT/2) }
    t2 = Thread.new { countdown(COUNT/2) }
    t1.join
    t2.join
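The Python side of the experiment can be made fully runnable with a small timing harness (the harness is an assumption of this transcript, and COUNT is reduced from the slides' 100 million so the script finishes quickly):

```python
import time
from threading import Thread

def countdown(n):
    # Trivial CPU-bound loop from the slides
    while n > 0:
        n -= 1

COUNT = 10000000  # slides use 100 million; reduced here for a quick run

# Sequential: all the work in one thread
start = time.time()
countdown(COUNT)
sequential = time.time() - start

# Threaded: same total work, split across two threads
t1 = Thread(target=countdown, args=(COUNT // 2,))
t2 = Thread(target=countdown, args=(COUNT // 2,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start

print("sequential: %.2fs  threaded: %.2fs" % (sequential, threaded))
```

On CPython the threaded run typically takes longer despite doing identical work, which is exactly the puzzle the rest of the talk chases.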

5. Expectations
• Sequential and threaded versions perform the same amount of work (same # calculations)
• There is the GIL... so no parallelism
• Performance should be about the same

6. Results
• Ruby 1.9 on OS-X (4 cores)
    Sequential : 2.46s
    Threaded   : 2.55s (~ same)

7. Results
• Python 2.7
    Sequential : 6.12s
    Threaded   : 9.28s (1.5x slower!)
• Ruby 1.9 on OS-X (4 cores)
    Sequential : 2.46s
    Threaded   : 2.55s (~ same)

8. Results
• Python 2.7
    Sequential : 6.12s
    Threaded   : 9.28s (1.5x slower!)
• Ruby 1.9 on OS-X (4 cores)
    Sequential : 2.46s
    Threaded   : 2.55s (~ same)
• Question: Why does it get slower in Python?

9. Results
• Ruby 1.9 on Windows Server 2008 (2 cores)
    Sequential : 3.32s
    Threaded   : 3.45s (~ same)

10. Results
• Python 2.7
    Sequential : 6.9s
    Threaded   : 63.0s (9.1x slower!)
• Ruby 1.9 on Windows Server 2008 (2 cores)
    Sequential : 3.32s
    Threaded   : 3.45s (~ same)

11. Results
• Python 2.7
    Sequential : 6.9s
    Threaded   : 63.0s (9.1x slower!)
• Ruby 1.9 on Windows Server 2008 (2 cores)
    Sequential : 3.32s
    Threaded   : 3.45s (~ same)
• Why does it get that much slower on Windows?

12. Experiment: Messaging
• A request/reply server for size-prefixed messages
    [diagram: Client ↔ Server]
• Each message: a size header + payload
• Similar: ZeroMQ

13. An Experiment: Messaging
• A simple test - message echo (pseudocode)

    def client(nummsg, msg):
        while nummsg > 0:
            send(msg)
            resp = recv()
            sleep(0.001)
            nummsg -= 1

    def server():
        while True:
            msg = recv()
            send(msg)

14. An Experiment: Messaging
• A simple test - message echo (pseudocode)

    def client(nummsg, msg):
        while nummsg > 0:
            send(msg)
            resp = recv()
            sleep(0.001)
            nummsg -= 1

    def server():
        while True:
            msg = recv()
            send(msg)

• To be less evil, it's throttled (<1000 msg/sec)
• Not a messaging stress test
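The pseudocode above maps directly onto real sockets. A self-contained sketch in Python (one process, server on a background thread; the 4-byte size-prefix framing follows the slides, while the helper names, port choice, and reduced message count are assumptions of this transcript):

```python
import socket
import struct
import threading
import time

def recv_exact(sock, n):
    # Read exactly n bytes (recv may return short reads)
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed")
        data += chunk
    return data

def server(listener, nummsg):
    conn, _ = listener.accept()
    with conn:
        for _ in range(nummsg):
            size = struct.unpack("!I", recv_exact(conn, 4))[0]
            msg = recv_exact(conn, size)
            conn.sendall(struct.pack("!I", size))
            conn.sendall(msg)

def client(addr, nummsg, msg):
    with socket.create_connection(addr) as sock:
        for _ in range(nummsg):
            sock.sendall(struct.pack("!I", len(msg)))
            sock.sendall(msg)
            size = struct.unpack("!I", recv_exact(sock, 4))[0]
            resp = recv_exact(sock, size)
            assert resp == msg
            time.sleep(0.001)   # throttle: < 1000 msg/sec

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
addr = listener.getsockname()

NUMMSG = 50                 # slides use 1000 x 8K messages
MSG = b"x" * 8192
t = threading.Thread(target=server, args=(listener, NUMMSG))
t.start()
start = time.time()
client(addr, NUMMSG, MSG)
elapsed = time.time() - start
t.join()
listener.close()
print("%d messages in %.2fs" % (NUMMSG, elapsed))
```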

15. An Experiment: Messaging
• A test: send/receive 1000 8K messages
• Scenario 1: Unloaded server
    [diagram: Client ↔ Server]
• Scenario 2: Server competing with one CPU-bound thread
    [diagram: Client ↔ Server, with a CPU-bound thread on the server side]

16. Results
• Messaging with no threads (OS-X, 4 cores)
    C          : 1.26s
    Python 2.7 : 1.29s
    Ruby 1.9   : 1.29s

17. Results
• Messaging with no threads (OS-X, 4 cores)
    C          : 1.26s
    Python 2.7 : 1.29s
    Ruby 1.9   : 1.29s
• Messaging with one CPU-bound thread*
    C          : 1.16s (~8% faster!?)
    Python 2.7 : 12.3s (10x slower)
    Ruby 1.9   : 42.0s (33x slower)
• Hmmm. Curious.

* On Ruby, the CPU-bound thread was also given lower priority

18. Results
• Messaging with no threads (Linux, 8 CPUs)
    C          : 1.13s
    Python 2.7 : 1.18s
    Ruby 1.9   : 1.18s

19. Results
• Messaging with no threads (Linux, 8 CPUs)
    C          : 1.13s
    Python 2.7 : 1.18s
    Ruby 1.9   : 1.18s
• Messaging with one CPU-bound thread
    C          : 1.11s (same)
    Python 2.7 : 1.60s (1.4x slower) - better
    Ruby 1.9   : 5839.4s (~5000x slower) - worse!

20. Results
• Messaging with no threads (Linux, 8 CPUs)
    C          : 1.13s
    Python 2.7 : 1.18s
    Ruby 1.9   : 1.18s
• Messaging with one CPU-bound thread
    C          : 1.11s (same)
    Python 2.7 : 1.60s (1.4x slower) - better
    Ruby 1.9   : 5839.4s (~5000x slower) - worse!
• 5000x slower? Really? Why?

21. The Mystery Deepens
• Disable all but one CPU core
• CPU-bound threads (OS-X)
    Python 2.7 (4 cores+hyperthreading) : 9.28s
    Python 2.7 (1 core)                 : 7.9s (faster!)
• Messaging with one CPU-bound thread
    Ruby 1.9 (4 cores+hyperthreading) : 42.0s
    Ruby 1.9 (1 core)                 : 10.5s (much faster!)
• ?!?!?!?!?!?

22. Better is Worse
• Change software versions
• Let's upgrade to Python 3 (Linux)
    Python 2.7 (Messaging) : 12.3s
    Python 3.2 (Messaging) : 20.1s (1.6x slower)
• Let's downgrade to Ruby 1.8 (Linux)
    Ruby 1.9 (Messaging)   : 42.0s
    Ruby 1.8.7 (Messaging) : 10.0s (4x faster)
• So much for progress (sigh)

23. What's Happening?
• The GIL does far more than limit cores
• It can make performance much worse
• Better performance by turning off cores?
• 5000x performance hit on Linux?
• Why?

24. Why You Might Care
• Must you abandon Python/Ruby for concurrency?
• Having threads restricted to one CPU core might be okay if it were sane
• Analogy: A multitasking operating system (e.g., Linux) runs fine on a single CPU
• Plus, threads get used a lot behind the scenes (even in thread alternatives, e.g., async)

25. Why I Care
• It's an interesting little systems problem
• How do you make a better GIL?
• It's fun.

26. Some Background
• I have been discussing some of these issues in the Python community since 2009
    http://www.dabeaz.com/GIL
• I'm less familiar with Ruby, but I've looked at its GIL implementation and experimented
• Very interested in commonalities/differences

27. A Tale of Two GILs

28.
• Python threads: managed by OS; concurrent execution of the Python interpreter (written in C)
• Ruby threads: managed by OS; concurrent execution of the Ruby VM (written in C)

29. Alas, the GIL
• Parallel execution is forbidden
• There is a "global interpreter lock"
• The GIL ensures that only one thread runs in the interpreter at once
• Simplifies many low-level details (memory management, callouts to C extensions, etc.)

30. GIL Implementation
• Simple mutex lock:

    mutex_t gil;

    void gil_acquire() {
        mutex_lock(gil);
    }

    void gil_release() {
        mutex_unlock(gil);
    }

• Condition variable:

    int gil_locked = 0;
    mutex_t gil_mutex;
    cond_t gil_cond;

    void gil_acquire() {
        mutex_lock(gil_mutex);
        while (gil_locked)
            cond_wait(gil_cond, gil_mutex);
        gil_locked = 1;
        mutex_unlock(gil_mutex);
    }

    void gil_release() {
        mutex_lock(gil_mutex);
        gil_locked = 0;
        cond_notify(gil_cond);
        mutex_unlock(gil_mutex);
    }
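The condition-variable version translates almost line for line into Python's own threading primitives. A toy model for illustration (a simulation, not the interpreter's real GIL):

```python
import threading

# Toy GIL modeled on the condition-variable pseudocode above
class ToyGIL:
    def __init__(self):
        self.locked = False
        self.mutex = threading.Lock()
        self.cond = threading.Condition(self.mutex)

    def acquire(self):
        with self.mutex:
            while self.locked:      # wait until the holder releases
                self.cond.wait()
            self.locked = True

    def release(self):
        with self.mutex:
            self.locked = False
            self.cond.notify()      # wake one waiter

gil = ToyGIL()
counter = 0

def worker(n):
    global counter
    for _ in range(n):
        gil.acquire()
        counter += 1                # "interpreter work" done under the GIL
        gil.release()

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)
```

Running this under CPython's real GIL is ironic but harmless; the toy lock simply serializes the four workers.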

31.
• The GIL results in cooperative multitasking
• When a thread is running, it holds the GIL
• GIL released on blocking (e.g., I/O operations)
    [timeline: run → release GIL → block → acquire GIL → run, repeating]

32.
• For I/O it works great
• GIL is never held very long
• Most threads just sit around sleeping
• Life is good

33.
• You may actually want to compute something!
  • Fibonacci numbers
  • Image/audio processing
  • Parsing
• The CPU will be busy
• And it won't give up the GIL on its own

34. CPU-Bound Switching
• Python: releases and reacquires the GIL every 100 "ticks" (1 tick ~= 1 interpreter instruction)
• Ruby: a timer generates an interrupt every 10ms; the GIL is released and reacquired by the currently running thread

35. CPU Bound
    [timeline: Run 100 ticks → release → acquire → Run 100 ticks → release → acquire → ...]
• Every 100 VM instructions, GIL is dropped, allowing other threads to run if they want
• Not time based--switching interval depends on kind of instructions executed

36. CPU Bound
    [timeline: run → Timer (10ms) → release → acquire → run → Timer (10ms) → release → acquire → ...]
• Loosely mimics the time-slice of the OS
• Every 10ms, GIL is released/acquired

37. A Common Theme
• Both Python and Ruby have C code like this:

    void execute() {
        while (inst = next_instruction()) {
            // Run the VM instruction
            ...
            if (must_release_gil) {
                GIL_release();
                /* Other threads may run now */
                GIL_acquire();
            }
        }
    }

• Exact details vary, but concept is the same
• Each thread has periodic release/acquire in the VM to allow other threads to run

38. Question
• What can go wrong with this bit of code?

    if (must_release_gil) {
        GIL_release();
        /* Other threads may run now */
        GIL_acquire();
    }

• Short answer: Everything!

39. Pathology

40.
• Suppose you have two threads
  • Thread 1 : Running
  • Thread 2 : Ready (Waiting for GIL)

41.
• Easy case : Thread 1 performs I/O (read/write)
• Thread 1 : Releases GIL and blocks for I/O
• Thread 2 : Gets scheduled, starts running
    [diagram: Thread 1 running → release GIL → BLOCKED on I/O; Thread 2 acquires GIL → scheduled → running]

42.
• Tricky case : Thread 1 runs until preempted
    [diagram: Thread 1 running → preempt → release GIL → ???]

43.
• You might expect that Thread 2 will run
    [diagram: Thread 1 running → preempt → release GIL; Thread 2 acquires GIL → scheduled → running]
• But you assume the GIL plays nice...

44.
• What might actually happen on multicore
    [diagram: Thread 1 running → preempt → release GIL → immediately reacquires; Thread 2 is scheduled but its acquire fails (GIL locked)]
• Both threads attempt to run simultaneously
• ... but only one will succeed (depends on timing)

45. Fallacy

    if (must_release_gil) {
        GIL_release();
        /* Other threads may run now */
        GIL_acquire();
    }

• This code doesn't actually switch threads
• It might switch threads, but it depends
  • What operating system
  • # cores
  • Lock scheduling policy (if any)

46. Fallacy

    if (must_release_gil) {
        GIL_release();
        sleep(0);
        /* Other threads may run now */
        GIL_acquire();
    }

• This doesn't force switching (sleeping)
• It might switch threads, but it depends
  • What operating system
  • # cores
  • Lock scheduling policy (if any)

47. Fallacy

    if (must_release_gil) {
        GIL_release();
        sched_yield();
        /* Other threads may run now */
        GIL_acquire();
    }

• Neither does this (calling the scheduler)
• It might switch threads, but it depends
  • What operating system
  • # cores
  • Lock scheduling policy (if any)

48. A Conflict
• There are conflicting goals
  • Python/Ruby - wants to run on a single CPU, but doesn't want to do thread scheduling (i.e., let the OS do it)
  • OS - "Oooh. Multiple cores." Schedules as many runnable tasks as possible at any instant
• Result: Threads fight with each other

49. Multicore GIL Battle
• Python 2.7 on OS-X (4 cores)
    Sequential : 6.12s
    Threaded   : 9.28s (1.5x slower!)
    [diagram: Thread 1 runs 100 ticks, releases, and immediately reacquires, over and over; Thread 2 is repeatedly scheduled, tries to acquire, and fails; eventually an acquire succeeds and Thread 2 runs]
• Millions of failed GIL acquisitions

50. Multicore GIL Battle
• You can see it! (2 CPU-bound threads)
    [screenshot: CPU monitor showing >100% total utilization - Why >100%?]
• Comment: In Python, it's very rapid
• GIL is released every few microseconds!
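The battle is observable from pure Python: have two threads release and immediately reacquire one lock in a tight loop, mimicking the per-100-ticks GIL release, and count how often the lock actually changes hands. (A rough illustration; the iteration count is arbitrary, and the exact numbers vary wildly by OS and core count, which is the point.)

```python
import threading

lock = threading.Lock()
holder = [None]     # which thread held the lock last
switches = [0]      # how many acquisitions were actual handoffs
ITERS = 20000

def hammer(name):
    for _ in range(ITERS):
        with lock:
            # Count a handoff only when the previous holder differs
            if holder[0] is not None and holder[0] != name:
                switches[0] += 1
            holder[0] = name

t1 = threading.Thread(target=hammer, args=("t1",))
t2 = threading.Thread(target=hammer, args=("t2",))
t1.start(); t2.start()
t1.join(); t2.join()
print("handoffs: %d of %d acquisitions" % (switches[0], 2 * ITERS))
```

Typically the handoff count is a tiny fraction of the acquisition count: releasing a lock does not mean the other thread gets it.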

51. I/O Handling
• If there is a CPU-bound thread, I/O bound threads have a hard time getting the GIL
    [diagram: Thread 1 (CPU 1) runs, is preempted, and runs again; on Thread 2 (CPU 2) a network packet arrives, "Acquire GIL" fails repeatedly and the thread sleeps between attempts - might repeat 100s-1000s of times - until an acquire finally succeeds]

52. Messaging Pathology
• Messaging on Linux (8 Cores)
    Ruby 1.9 (no threads)   : 1.18s
    Ruby 1.9 (1 CPU thread) : 5839.4s
• Locks in Linux have no fairness
• Consequence: Really hard to steal the GIL
• And Ruby only retries every 10ms

53. Let's Talk Fairness
• Fair-locking means that locks have some notion of priorities, arrival order, queuing, etc.
    [diagram: t0 running, t1..t5 waiting in line on the lock; after t0 releases, t1 runs and t0 joins the end of the line]
• Releasing means you go to end of line
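A ticket-style fair lock captures the "go to the end of the line" rule. A sketch in Python (the ticket/serving naming is an assumption; real fair locks live in the OS or the VM):

```python
import threading

# FIFO fair lock: threads take a numbered ticket on acquire and run
# only when their ticket is the one being served
class FairLock:
    def __init__(self):
        self.mutex = threading.Lock()
        self.cond = threading.Condition(self.mutex)
        self.next_ticket = 0
        self.serving = 0

    def acquire(self):
        with self.mutex:
            ticket = self.next_ticket
            self.next_ticket += 1
            while ticket != self.serving:
                self.cond.wait()

    def release(self):
        with self.mutex:
            self.serving += 1           # next arrival gets its turn
            self.cond.notify_all()      # only the matching ticket proceeds

lock = FairLock()
order = []

def worker(name, rounds):
    for _ in range(rounds):
        lock.acquire()
        order.append(name)              # critical section
        lock.release()

threads = [threading.Thread(target=worker, args=(i, 3)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(order)
```

Because a releasing thread must take a fresh ticket to reacquire, it cannot barge ahead of waiters - exactly the behavior the next slides show to be a mixed blessing.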

54. Effect of Fair-Locking
• Ruby 1.9 (multiple cores)
    Messages + 1 CPU Thread (OS-X)  : 42.0s
    Messages + 1 CPU Thread (Linux) : 5839.4s
• Question: Which one uses fair locking?

55. Effect of Fair-Locking
• Ruby 1.9 (multiple cores)
    Messages + 1 CPU Thread (OS-X)  : 42.0s (Fair)
    Messages + 1 CPU Thread (Linux) : 5839.4s
• Benefit : I/O threads get their turn (yay!)

56. Effect of Fair-Locking
• Ruby 1.9 (multiple cores)
    Messages + 1 CPU Thread (OS-X)  : 42.0s (Fair)
    Messages + 1 CPU Thread (Linux) : 5839.4s
• Benefit : I/O threads get their turn (yay!)
• Python 2.7 (multiple cores)
    2 CPU-Bound Threads (OS-X)    : 9.28s
    2 CPU-Bound Threads (Windows) : 63.0s
• Question: Which one uses fair-locking?

57. Effect of Fair-Locking
• Ruby 1.9 (multiple cores)
    Messages + 1 CPU Thread (OS-X)  : 42.0s (Fair)
    Messages + 1 CPU Thread (Linux) : 5839.4s
• Benefit : I/O threads get their turn (yay!)
• Python 2.7 (multiple cores)
    2 CPU-Bound Threads (OS-X)    : 9.28s
    2 CPU-Bound Threads (Windows) : 63.0s (Fair)
• Problem: Too much context switching

58. Fair-Locking - Bah!
• In reality, you don't want fairness
• Messaging Revisited (OS X, 4 Cores)
    Ruby 1.9 (No Threads)         : 1.29s
    Ruby 1.9 (1 CPU-Bound thread) : 42.0s (33x slower)
• Why is it still 33x slower?
• Answer: Fair locking! (and convoying)

59. Messaging Revisited
• Go back to the messaging server

    def server():
        while True:
            msg = recv()
            send(msg)

60. Messaging Revisited
• The actual implementation (size-prefixed messages)

    def server():
        while True:
            size = recv(4)
            msg = recv(size)
            send(size)
            send(msg)

61. Performance Explained
• What actually happens under the covers

    def server():
        while True:
            size = recv(4)    # GIL release
            msg = recv(size)  # GIL release
            send(size)        # GIL release
            send(msg)         # GIL release

• Why? Each operation might block
• Catch: Passes control back to CPU-bound thread

62. Performance Illustrated
    [timeline: data arrives; the I/O thread performs recv, recv, send, send, done; after each call the GIL passes back to the CPU-bound thread, which runs until the 10ms timer fires]
• Each message has 40ms response cycle
• 1000 messages x 40ms = 40s (42.0s measured)

63. Despair

64. A Solution?
• Yes, yes, everyone hates threads
• However, that's only because they're useful!
• Threads are used for all sorts of things
• Even if they're hidden behind the scenes

65. A Better Solution
• It's probably not going away (very difficult)
• However, does it have to thrash wildly?
• Question: Can you do anything?

Make the GIL better

66. GIL Efforts in Python 3
• Python 3.2 has a new GIL implementation
• It's imperfect--in fact, it has a lot of problems
• However, people are experimenting with it

67. Python 3 GIL
• GIL acquisition now based on timeouts
• Involves waiting on a condition variable
    [diagram: a waiting thread does wait(gil, TIMEOUT); after 5ms with no release it sets drop_request and the running thread gives up the GIL; a thread in IOWAIT rejoins via wait(gil, TIMEOUT) when data arrives]
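The timeout scheme can be modeled with Condition.wait(timeout). A toy version (the name drop_request mirrors the slide; everything else is an assumption, and the real CPython code is considerably more involved):

```python
import threading

TIMEOUT = 0.005   # 5 ms, as in Python 3.2

class NewGIL:
    def __init__(self):
        self.mutex = threading.Lock()
        self.cond = threading.Condition(self.mutex)
        self.locked = False
        self.drop_request = False

    def acquire(self):
        with self.mutex:
            while self.locked:
                if not self.cond.wait(TIMEOUT):
                    # Timed out: ask the holder to give up the GIL
                    self.drop_request = True
            self.locked = True
            self.drop_request = False

    def release(self):
        with self.mutex:
            self.locked = False
            self.cond.notify()

gil = NewGIL()

def cpu_thread(n):
    # Run "instructions" until someone requests the GIL, then release
    while n > 0:
        gil.acquire()
        while n > 0 and not gil.drop_request:
            n -= 1
        gil.release()

t = threading.Thread(target=cpu_thread, args=(200000,))
t.start()
# The main thread plays the I/O thread, needing occasional GIL access
for _ in range(3):
    gil.acquire()
    gil.release()
t.join()
print("done")
```

Each time the "I/O thread" wants in, it may wait up to 5ms before its drop_request takes effect - which is precisely the convoying delay the next slides discuss.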

68. Problem: Convoying
    [diagram: each time data arrives, the I/O thread must wait up to 5ms for the CPU-bound thread to release the GIL before its short run]
• This is the same problem as in Ruby
• Just a shorter time delay (5ms)

69. Problem: Convoying
• You can directly observe the delays (messaging)
    No threads            : 1.29s (no delays)
    Python 3.2 (1 Thread) : 20.1s (5ms delays)
    Ruby 1.9 (1 Thread)   : 42.0s (10ms delays)
• Still not great, but problem is understood

70. Promise

71. Priorities
• Best promise : Priority scheduling
• Earlier versions of Ruby had it
• It works (OS-X, 4 cores)
    Ruby 1.9 (1 Thread)                   : 42.0s
    Ruby 1.8.7 (1 Thread)                 : 40.2s
    Ruby 1.8.7 (1 Thread, lower priority) : 10.0s
• Comment: Ruby-1.9 allows thread priorities to be set in pthreads, but it doesn't seem to have much (if any) effect

72. Priorities
• Experimental Python-3.2 with priority scheduler
• Also features immediate preemption
• Messages (OS X, 4 Cores)
    Python 3.2 (No threads)          : 1.29s
    Python 3.2 (1 Thread)            : 20.2s
    Python 3.2+priorities (1 Thread) : 1.21s (faster?)
• That's a lot more promising!
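A priority scheduler for a GIL-like lock can be sketched with a heap of waiters: on release, the highest-priority (lowest-numbered) waiter wins. This is an illustration only - the slide's "immediate preemption" is omitted, and all class and field names are assumptions:

```python
import heapq
import threading
import time

class PriorityGIL:
    def __init__(self):
        self.mutex = threading.Lock()
        self.cond = threading.Condition(self.mutex)
        self.locked = False
        self.waiters = []       # heap of (priority, arrival) entries
        self.arrival = 0

    def acquire(self, priority):
        with self.mutex:
            entry = (priority, self.arrival)
            self.arrival += 1
            heapq.heappush(self.waiters, entry)
            # Run only when the lock is free AND we are the best waiter
            while self.locked or self.waiters[0] != entry:
                self.cond.wait()
            heapq.heappop(self.waiters)
            self.locked = True

    def release(self):
        with self.mutex:
            self.locked = False
            self.cond.notify_all()   # best-priority waiter wins the recheck

gil = PriorityGIL()
order = []

def worker(name, priority):
    gil.acquire(priority)
    order.append(name)
    gil.release()

# Hold the lock, queue three waiters, then release and watch the order
gil.acquire(priority=9)
threads = [threading.Thread(target=worker, args=(n, p))
           for n, p in [("low", 3), ("high", 0), ("mid", 1)]]
for t in threads: t.start()
while True:                          # wait until all three are queued
    with gil.mutex:
        if len(gil.waiters) == 3:
            break
    time.sleep(0.001)
gil.release()
for t in threads: t.join()
print(order)
```

The waiters are served strictly by priority regardless of arrival order, which is what lets an I/O thread jump ahead of a CPU-bound one.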

73. New Problems
• Priorities bring new challenges
  • Starvation
  • Priority inversion
  • Implementation complexity
• Do you have to write a full OS scheduler?
• Hopefully not, but it's an open question

74. Final Words
• Implementing a GIL is a lot trickier than it looks
• Even work with priorities has problems
• Good example of how multicore is diabolical

75. Thanks for Listening!
• I hope you learned at least one new thing
• I'm always interested in feedback