In Search of the Perfect Global Interpreter Lock

Conference presentation. RuPy 2011. Poznan, Poland. Conference video at https://www.youtube.com/watch?v=5jbG7UKT1l4

David Beazley

October 15, 2011

Transcript

  1. In Search of the Perfect Global Interpreter Lock
     David Beazley (http://www.dabeaz.com, @dabeaz)
     Presented at RuPy 2011, Poznan, Poland, October 15, 2011
     Copyright (C) 2010, David Beazley, http://www.dabeaz.com
  2. Introduction
     • As many programmers know, Python and Ruby feature a Global Interpreter
       Lock (GIL)
     • More precisely: CPython and MRI
     • It limits thread performance on multicore
     • Theoretically restricts code to a single CPU
  3. An Experiment
     • Consider a trivial CPU-bound function

           def countdown(n):
               while n > 0:
                   n -= 1

     • Run it once with a lot of work

           COUNT = 100000000    # 100 million
           countdown(COUNT)

     • Now, divide the work across two threads

           t1 = Thread(target=countdown, args=(COUNT//2,))
           t2 = Thread(target=countdown, args=(COUNT//2,))
           t1.start(); t2.start()
           t1.join(); t2.join()
  4. An Experiment
     • Some Ruby

           def countdown(n)
             while n > 0
               n -= 1
             end
           end

     • Sequential

           COUNT = 100000000    # 100 million
           countdown(COUNT)

     • Subdivided across threads

           t1 = Thread.new { countdown(COUNT/2) }
           t2 = Thread.new { countdown(COUNT/2) }
           t1.join
           t2.join
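     The Python version of the experiment can be run end to end with a sketch
     like the following (COUNT is reduced from the talk's 100 million so it
     finishes quickly; absolute timings will vary by machine and interpreter):

     ```python
     import time
     from threading import Thread

     COUNT = 10_000_000   # smaller than the talk's 100 million, for a quick run

     def countdown(n):
         while n > 0:
             n -= 1

     # Sequential: all the work in one thread
     start = time.perf_counter()
     countdown(COUNT)
     seq = time.perf_counter() - start

     # Threaded: the same total work split across two threads
     start = time.perf_counter()
     t1 = Thread(target=countdown, args=(COUNT // 2,))
     t2 = Thread(target=countdown, args=(COUNT // 2,))
     t1.start(); t2.start()
     t1.join(); t2.join()
     thr = time.perf_counter() - start

     print(f"sequential: {seq:.2f}s   threaded: {thr:.2f}s")
     ```

     On a GIL interpreter the threaded run does not go faster, and (as the next
     slides show) may go noticeably slower.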
  5. Expectations
     • Sequential and threaded versions perform the same amount of work
       (same number of calculations)
     • There is the GIL... so no parallelism
     • Performance should be about the same
  6. Results
     • Ruby 1.9 on OS-X (4 cores)
           Sequential           : 2.46s
           Threaded (2 threads) : 2.55s   (~ same)
  7. Results
     • Ruby 1.9 on OS-X (4 cores)
           Sequential           : 2.46s
           Threaded (2 threads) : 2.55s   (~ same)
     • Python 2.7
           Sequential           : 6.12s
           Threaded (2 threads) : 9.28s   (1.5x slower!)
  8. Results
     • Ruby 1.9 on OS-X (4 cores)
           Sequential           : 2.46s
           Threaded (2 threads) : 2.55s   (~ same)
     • Python 2.7
           Sequential           : 6.12s
           Threaded (2 threads) : 9.28s   (1.5x slower!)
     • Question: Why does it get slower in Python?
  9. Results
     • Ruby 1.9 on Windows Server 2008 (2 cores)
           Sequential           : 3.32s
           Threaded (2 threads) : 3.45s   (~ same)
  10. Results
      • Ruby 1.9 on Windows Server 2008 (2 cores)
            Sequential           : 3.32s
            Threaded (2 threads) : 3.45s   (~ same)
      • Python 2.7
            Sequential           : 6.9s
            Threaded (2 threads) : 63.0s   (9.1x slower!)
  11. Results
      • Ruby 1.9 on Windows Server 2008 (2 cores)
            Sequential           : 3.32s
            Threaded (2 threads) : 3.45s   (~ same)
      • Python 2.7
            Sequential           : 6.9s
            Threaded (2 threads) : 63.0s   (9.1x slower!)
      • Why does it get that much slower on Windows?
  12. Experiment: Messaging
      • A request/reply server for size-prefixed messages (a client sends,
        the server replies)
      • Each message: a size header + payload
      • Similar: ZeroMQ
  13. An Experiment: Messaging
      • A simple test - message echo (pseudocode)

            def client(nummsg, msg):
                while nummsg > 0:
                    send(msg)
                    resp = recv()
                    sleep(0.001)
                    nummsg -= 1

            def server():
                while True:
                    msg = recv()
                    send(msg)

  14. An Experiment: Messaging
      • The same echo test as above
      • To be less evil, it's throttled (<1000 msg/sec)
      • Not a messaging stress test
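      A runnable single-process version of this echo test can be sketched with
      a socketpair (the recv_exact helper and the exact framing here are
      illustrative, not the talk's actual benchmark code):

      ```python
      import socket
      import struct
      import threading
      import time

      def recv_exact(sock, n):
          """Read exactly n bytes (a single recv may return fewer)."""
          data = b""
          while len(data) < n:
              chunk = sock.recv(n - len(data))
              if not chunk:
                  raise ConnectionError("peer closed")
              data += chunk
          return data

      def server(sock):
          """Echo size-prefixed messages back to the client."""
          try:
              while True:
                  size = struct.unpack("!I", recv_exact(sock, 4))[0]
                  msg = recv_exact(sock, size)
                  sock.sendall(struct.pack("!I", size) + msg)
          except (ConnectionError, OSError):
              pass

      def client(sock, nummsg, msg):
          """Send nummsg messages, wait for each echo; return count echoed."""
          echoed = 0
          for _ in range(nummsg):
              sock.sendall(struct.pack("!I", len(msg)) + msg)
              size = struct.unpack("!I", recv_exact(sock, 4))[0]
              recv_exact(sock, size)
              echoed += 1
              time.sleep(0.001)   # throttle, as in the talk (<1000 msg/sec)
          return echoed

      a, b = socket.socketpair()
      threading.Thread(target=server, args=(b,), daemon=True).start()
      echoed = client(a, 10, b"x" * 8192)   # 10 messages of 8K for a quick demo
      a.close()
      print("echoed", echoed, "messages")
      ```

      Adding a CPU-bound thread alongside this client/server pair reproduces
      the slowdowns measured in the following slides.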
  15. An Experiment: Messaging
      • A test: send/receive 1000 8K messages
      • Scenario 1: an unloaded server
      • Scenario 2: a server competing with one CPU-bound thread
  16. Results
      • Messaging with no threads (OS-X, 4 cores)
            C          : 1.26s
            Python 2.7 : 1.29s
            Ruby 1.9   : 1.29s
  17. Results
      • Messaging with no threads (OS-X, 4 cores)
            C          : 1.26s
            Python 2.7 : 1.29s
            Ruby 1.9   : 1.29s
      • Messaging with one CPU-bound thread*
            C          : 1.16s   (~8% faster!?)
            Python 2.7 : 12.3s   (10x slower)
            Ruby 1.9   : 42.0s   (33x slower)
      • Hmmm. Curious.
      * On Ruby, the CPU-bound thread was also given lower priority
  18. Results
      • Messaging with no threads (Linux, 8 CPUs)
            C          : 1.13s
            Python 2.7 : 1.18s
            Ruby 1.9   : 1.18s
  19. Results
      • Messaging with no threads (Linux, 8 CPUs)
            C          : 1.13s
            Python 2.7 : 1.18s
            Ruby 1.9   : 1.18s
      • Messaging with one CPU-bound thread
            C          : 1.11s    (same)
            Python 2.7 : 1.60s    (1.4x slower) - better
            Ruby 1.9   : 5839.4s  (~5000x slower) - worse!
  20. Results
      • Messaging with no threads (Linux, 8 CPUs)
            C          : 1.13s
            Python 2.7 : 1.18s
            Ruby 1.9   : 1.18s
      • Messaging with one CPU-bound thread
            C          : 1.11s    (same)
            Python 2.7 : 1.60s    (1.4x slower) - better
            Ruby 1.9   : 5839.4s  (~5000x slower) - worse!
      • 5000x slower? Really? Why?
  21. The Mystery Deepens
      • Disable all but one CPU core
      • CPU-bound threads (OS-X)
            Python 2.7 (4 cores+hyperthreading) : 9.28s
            Python 2.7 (1 core)                 : 7.9s   (faster!)
      • Messaging with one CPU-bound thread
            Ruby 1.9 (4 cores+hyperthreading)   : 42.0s
            Ruby 1.9 (1 core)                   : 10.5s  (much faster!)
      • ?!?!?!?!?!?
  22. Better is Worse
      • Change software versions
      • Let's upgrade to Python 3 (Linux)
            Python 2.7 (Messaging) : 12.3s
            Python 3.2 (Messaging) : 20.1s  (1.6x slower)
      • Let's downgrade to Ruby 1.8 (Linux)
            Ruby 1.9 (Messaging)   : 42.0s
            Ruby 1.8.7 (Messaging) : 10.0s  (4x faster)
      • So much for progress (sigh)
  23. What's Happening?
      • The GIL does far more than limit cores
      • It can make performance much worse
      • Better performance by turning off cores?
      • A 5000x performance hit on Linux?
      • Why?
  24. Why You Might Care
      • Must you abandon Python/Ruby for concurrency?
      • Having threads restricted to one CPU core might be okay if it were sane
      • Analogy: a multitasking operating system (e.g., Linux) runs fine on a
        single CPU
      • Plus, threads get used a lot behind the scenes (even in thread
        alternatives, e.g., async)
  25. Why I Care
      • It's an interesting little systems problem
      • How do you make a better GIL?
      • It's fun.
  26. Some Background
      • I have been discussing some of these issues in the Python community
        since 2009: http://www.dabeaz.com/GIL
      • I'm less familiar with Ruby, but I've looked at its GIL implementation
        and experimented
      • Very interested in commonalities/differences
  27. A Tale of Two GILs
  28. Thread Implementation
      • Python: system threads (e.g., pthreads), managed by the OS, with
        concurrent execution of the Python interpreter (written in C)
      • Ruby: system threads (e.g., pthreads), managed by the OS, with
        concurrent execution of the Ruby VM (written in C)
  29. Alas, the GIL
      • Parallel execution is forbidden
      • There is a "global interpreter lock"
      • The GIL ensures that only one thread runs in the interpreter at once
      • Simplifies many low-level details (memory management, callouts to C
        extensions, etc.)
  30. GIL Implementation
      • Simple mutex lock:

            mutex_t gil;

            void gil_acquire() { mutex_lock(gil); }
            void gil_release() { mutex_unlock(gil); }

      • Condition variable:

            int gil_locked = 0;
            mutex_t gil_mutex;
            cond_t gil_cond;

            void gil_acquire() {
                mutex_lock(gil_mutex);
                while (gil_locked)
                    cond_wait(gil_cond, gil_mutex);
                gil_locked = 1;
                mutex_unlock(gil_mutex);
            }

            void gil_release() {
                mutex_lock(gil_mutex);
                gil_locked = 0;
                cond_notify(gil_cond);
                mutex_unlock(gil_mutex);
            }
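      The condition-variable variant maps directly onto Python's own threading
      primitives. A toy model (illustrative only, not CPython's actual C
      implementation):

      ```python
      import threading

      class ToyGIL:
          """Toy model of the condition-variable GIL: a flag guarded by a condvar."""
          def __init__(self):
              self._cond = threading.Condition()
              self._locked = False

          def acquire(self):
              with self._cond:
                  while self._locked:      # someone else holds the GIL
                      self._cond.wait()
                  self._locked = True

          def release(self):
              with self._cond:
                  self._locked = False
                  self._cond.notify()      # wake one waiter, if any

      gil = ToyGIL()
      gil.acquire()
      print("held:", gil._locked)   # held: True
      gil.release()
      print("held:", gil._locked)   # held: False
      ```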
  31. Thread Execution Model
      • The GIL results in cooperative multitasking
      • When a thread is running, it holds the GIL
      • The GIL is released on blocking operations (e.g., I/O); another thread
        then acquires it and runs until it blocks in turn
  32. Threads for I/O
      • For I/O it works great
      • The GIL is never held very long
      • Most threads just sit around sleeping
      • Life is good
  33. Threads for Computation
      • You may actually want to compute something!
        • Fibonacci numbers
        • Image/audio processing
        • Parsing
      • The CPU will be busy
      • And it won't give up the GIL on its own
  34. CPU-Bound Switching
      • Python: releases and reacquires the GIL every 100 "ticks"
        (1 tick ~= 1 interpreter instruction)
      • Ruby: a background thread generates a timer interrupt every 10ms;
        the GIL is released and reacquired by the current thread on interrupt
  35. Python Thread Switching
      • A CPU-bound thread runs 100 ticks, releases the GIL, reacquires it,
        and runs another 100 ticks
      • Every 100 VM instructions, the GIL is dropped, allowing other threads
        to run if they want
      • Not time based--the switching interval depends on the kind of
        instructions executed
  36. Ruby Thread Switching
      • A timer thread fires every 10ms; the running CPU-bound thread then
        releases and reacquires the GIL
      • Loosely mimics the time-slice of the OS
  37. A Common Theme
      • Both Python and Ruby have C code like this:

            void execute() {
                while (inst = next_instruction()) {
                    // Run the VM instruction
                    ...
                    if (must_release_gil) {
                        GIL_release();
                        /* Other threads may run now */
                        GIL_acquire();
                    }
                }
            }

      • Exact details vary, but the concept is the same
      • Each thread has a periodic release/acquire in the VM to allow other
        threads to run
  38. Question

            if (must_release_gil) {
                GIL_release();
                /* Other threads may run now */
                GIL_acquire();
            }

      • What can go wrong with this bit of code?
      • Short answer: Everything!
  39. Pathology
  40. Thread Switching
      • Suppose you have two threads
      • Thread 1 : Running
      • Thread 2 : Ready (waiting for the GIL)
  41. Thread Switching
      • Easy case : Thread 1 performs I/O (read/write)
      • Thread 1 : releases the GIL and blocks for I/O
      • Thread 2 : acquires the GIL, gets scheduled by pthreads/the OS, and
        starts running
  42. Thread Switching
      • Tricky case : Thread 1 runs until preempted by the OS, then releases
        the GIL
      • Which thread runs next?
  43. Thread Switching
      • You might expect that Thread 2 will run
      • But that assumes the GIL plays nice...
  44. Thread Switching
      • What might actually happen on multicore: Thread 1 releases the GIL
        and keeps running; Thread 2 gets scheduled on another core, tries to
        acquire the GIL, and fails (GIL locked)
      • Both threads attempt to run simultaneously
      • ... but only one will succeed (depends on timing)
  45. Fallacy

            if (must_release_gil) {
                GIL_release();
                /* Other threads may run now */
                GIL_acquire();
            }

      • This code doesn't actually switch threads
      • It might switch threads, but it depends on:
        • the operating system
        • the number of cores
        • the lock scheduling policy (if any)
  46. Fallacy

            if (must_release_gil) {
                GIL_release();
                sleep(0);
                /* Other threads may run now */
                GIL_acquire();
            }

      • Neither does this (sleeping)
      • It might switch threads, but it depends on the same factors
  47. Fallacy

            if (must_release_gil) {
                GIL_release();
                sched_yield();
                /* Other threads may run now */
                GIL_acquire();
            }

      • Neither does this (calling the scheduler)
      • It might switch threads, but it depends on the same factors
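      The fallacy is easy to demonstrate from Python itself: releasing a lock
      and immediately reacquiring it rarely hands it to a waiting thread. In
      this sketch the counts are timing-dependent, but on most systems the
      waiter wins only a tiny fraction of the release/acquire cycles:

      ```python
      import threading

      lock = threading.Lock()
      stolen = 0
      done = False

      def waiter():
          """Repeatedly try to grab the lock, like a thread waiting on the GIL."""
          global stolen
          while not done:
              if lock.acquire(timeout=0.01):
                  stolen += 1
                  lock.release()

      lock.acquire()
      t = threading.Thread(target=waiter)
      t.start()

      attempts = 50_000
      for _ in range(attempts):
          lock.release()   # "GIL_release()"
          lock.acquire()   # "GIL_acquire()" -- usually wins before the waiter wakes

      done = True
      lock.release()
      t.join()
      print(f"waiter got the lock {stolen} times in {attempts} cycles")
      ```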
  48. A Conflict
      • There are conflicting goals
      • Python/Ruby - wants to run on a single CPU, but doesn't want to do
        thread scheduling (i.e., let the OS do it)
      • OS - "Oooh. Multiple cores." Schedules as many runnable tasks as
        possible at any instant
      • Result: threads fight with each other
  49. Multicore GIL Battle
      • Python 2.7 on OS-X (4 cores)
            Sequential           : 6.12s
            Threaded (2 threads) : 9.28s  (1.5x slower!)
      • Each thread runs 100 ticks, releases the GIL, and both threads then
        race to reacquire it--most attempts fail and the thread goes back to
        READY
      • Millions of failed GIL acquisitions
  50. Multicore GIL Battle
      • You can see it! Run 2 CPU-bound threads and watch CPU utilization
        climb above 100%--multiple cores are busy fighting over the GIL
      • Comment: in Python, it's very rapid--the GIL is released every few
        microseconds!
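      On Unix systems you can also measure the fight directly by counting
      context switches around two CPU-bound threads (a sketch; the `resource`
      module is Unix-only, and the counts vary wildly by OS and core count):

      ```python
      import resource
      import threading

      def countdown(n):
          while n > 0:
              n -= 1

      before = resource.getrusage(resource.RUSAGE_SELF)

      t1 = threading.Thread(target=countdown, args=(5_000_000,))
      t2 = threading.Thread(target=countdown, args=(5_000_000,))
      t1.start(); t2.start()
      t1.join(); t2.join()

      after = resource.getrusage(resource.RUSAGE_SELF)
      print("voluntary switches:  ", after.ru_nvcsw - before.ru_nvcsw)
      print("involuntary switches:", after.ru_nivcsw - before.ru_nivcsw)
      ```

      On a multicore box the switch counts for this tiny workload are far
      higher than for the sequential equivalent.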
  51. I/O Handling
      • If there is a CPU-bound thread, I/O-bound threads have a hard time
        getting the GIL
      • When a network packet arrives, the I/O thread wakes, tries to acquire
        the GIL, fails (the CPU-bound thread gets preempted but reacquires
        first), and goes back to sleep
      • This might repeat 100s-1000s of times before the acquire succeeds
  52. Messaging Pathology
      • Messaging on Linux (8 cores)
            Ruby 1.9 (no threads)   : 1.18s
            Ruby 1.9 (1 CPU thread) : 5839.4s
      • Locks on Linux have no fairness
      • Consequence: it's really hard to steal the GIL
      • And Ruby only retries every 10ms
  53. Let's Talk Fairness
      • Fair locking means that locks have some notion of priorities, arrival
        order, queuing, etc.
      • Waiting threads form a line behind the current lock holder
      • Releasing means you go to the end of the line
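      A fair (FIFO) lock can be sketched in a few lines: waiters queue up in
      arrival order, and release hands the lock directly to the head of the
      queue. This is illustrative only--real fair locks live in the kernel or
      the threading library:

      ```python
      import threading
      import time
      from collections import deque

      class FairLock:
          """FIFO ("take a number") lock: release hands off to the longest waiter."""
          def __init__(self):
              self._mutex = threading.Lock()
              self._waiters = deque()
              self._held = False

          def acquire(self):
              self._mutex.acquire()
              if not self._held and not self._waiters:
                  self._held = True
                  self._mutex.release()
                  return
              ev = threading.Event()
              self._waiters.append(ev)     # go to the end of the line
              self._mutex.release()
              ev.wait()                    # woken exactly when it's our turn

          def release(self):
              with self._mutex:
                  if self._waiters:
                      self._waiters.popleft().set()   # direct handoff, FIFO
                  else:
                      self._held = False

      lock = FairLock()
      lock.acquire()                       # hold the lock while workers line up
      order = []

      def worker(i):
          lock.acquire()
          order.append(i)
          lock.release()

      threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
      for t in threads:
          t.start()
          time.sleep(0.05)                 # stagger arrivals so the queue forms
      lock.release()                       # the head of the line runs first
      for t in threads:
          t.join()
      print("service order:", order)
      ```

      The direct handoff is exactly what makes fair locks painful for a GIL:
      a releasing thread cannot immediately reacquire, as the next slides show.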
  54. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Question: Which one uses fair locking?
  55. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s   (Fair)
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Benefit : I/O threads get their turn (yay!)
  56. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s   (Fair)
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Benefit : I/O threads get their turn (yay!)
      • Python 2.7 (multiple cores)
            2 CPU-Bound Threads (OS-X)    : 9.28s
            2 CPU-Bound Threads (Windows) : 63.0s
      • Question: Which one uses fair locking?
  57. Effect of Fair-Locking
      • Ruby 1.9 (multiple cores)
            Messages + 1 CPU Thread (OS-X)  : 42.0s   (Fair)
            Messages + 1 CPU Thread (Linux) : 5839.4s
      • Benefit : I/O threads get their turn (yay!)
      • Python 2.7 (multiple cores)
            2 CPU-Bound Threads (OS-X)    : 9.28s
            2 CPU-Bound Threads (Windows) : 63.0s   (Fair)
      • Problem: too much context switching
  58. Fair-Locking - Bah!
      • In reality, you don't want fairness
      • Messaging revisited (OS-X, 4 cores)
            Ruby 1.9 (no threads)         : 1.29s
            Ruby 1.9 (1 CPU-bound thread) : 42.0s  (33x slower)
      • Why is it still 33x slower?
      • Answer: fair locking! (and convoying)
  59. Messaging Revisited
      • Go back to the messaging server

            def server():
                while True:
                    msg = recv()
                    send(msg)

  60. Messaging Revisited
      • The actual implementation (size-prefixed messages)

            def server():
                while True:
                    size = recv(4)
                    msg = recv(size)
                    send(size)
                    send(msg)

  61. Performance Explained
      • What actually happens under the covers

            def server():
                while True:
                    size = recv(4)      # GIL release
                    msg = recv(size)    # GIL release
                    send(size)          # GIL release
                    send(msg)           # GIL release

      • Why? Each operation might block
      • Catch: each release passes control back to the CPU-bound thread
  62. Performance Illustrated
      • With fair locking, each of the four I/O operations hands the GIL to
        the CPU-bound thread, which runs for a full 10ms timer period before
        handing it back
      • Each message therefore has a 40ms response cycle (4 operations x 10ms)
      • 1000 messages x 40ms = 40s (42.0s measured)
  63. Despair
  64. A Solution?
      • "Don't use threads!"
      • Yes, yes, everyone hates threads
      • However, that's only because they're useful!
      • Threads are used for all sorts of things
      • Even if they're hidden behind the scenes
  65. A Better Solution
      • Make the GIL better
      • The GIL is probably not going away (removing it is very difficult)
      • However, does it have to thrash wildly?
      • Question: Can you do anything?
  66. GIL Efforts in Python 3
      • Python 3.2 has a new GIL implementation
      • It's imperfect--in fact, it has a lot of problems
      • However, people are experimenting with it
  67. Python 3 GIL
      • GIL acquisition is now based on timeouts
      • A waiting thread does a timed wait on a condition variable
        (wait(gil, TIMEOUT), 5ms); if the GIL isn't released within the
        timeout, it sets a drop_request that forces the running thread to
        release
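      In CPython 3.2+ that timeout is exposed at the Python level as the
      "switch interval":

      ```python
      import sys

      # The new GIL's timeout is the "switch interval" (5ms by default)
      default = sys.getswitchinterval()
      print(f"default switch interval: {default * 1000:.1f} ms")

      # A smaller interval makes CPU-bound threads hand off the GIL more often
      # (better I/O latency, more switching overhead); larger does the opposite
      sys.setswitchinterval(0.001)
      print(f"new switch interval: {sys.getswitchinterval() * 1000:.1f} ms")
      sys.setswitchinterval(default)   # restore
      ```

      Tuning it trades I/O responsiveness against context-switch overhead; it
      does not remove the convoying problem described next.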
  68. Problem: Convoying
      • CPU-bound threads still significantly degrade I/O: each time data
        arrives, the I/O thread waits out the full timeout before the
        CPU-bound thread releases the GIL
      • This is the same problem as in Ruby
      • Just a shorter time delay (5ms instead of 10ms)
  69. Problem: Convoying
      • You can directly observe the delays (messaging)
            Python/Ruby (no threads) : 1.29s  (no delays)
            Python 3.2 (1 thread)    : 20.1s  (5ms delays)
            Ruby 1.9 (1 thread)      : 42.0s  (10ms delays)
      • Still not great, but the problem is understood
  70. Promise
  71. Priorities
      • Best promise : priority scheduling
      • Earlier versions of Ruby had it
      • It works (OS-X, 4 cores)
            Ruby 1.9 (1 thread)                   : 42.0s
            Ruby 1.8.7 (1 thread)                 : 40.2s
            Ruby 1.8.7 (1 thread, lower priority) : 10.0s
      • Comment: Ruby 1.9 allows thread priorities to be set via pthreads,
        but it doesn't seem to have much (if any) effect
  72. Priorities
      • Experimental Python 3.2 with a priority scheduler
      • Also features immediate preemption
      • Messages (OS-X, 4 cores)
            Python 3.2 (no threads)          : 1.29s
            Python 3.2 (1 thread)            : 20.2s
            Python 3.2+priorities (1 thread) : 1.21s  (faster?)
      • That's a lot more promising!
  73. New Problems
      • Priorities bring new challenges
        • Starvation
        • Priority inversion
        • Implementation complexity
      • Do you have to write a full OS scheduler?
      • Hopefully not, but it's an open question
  74. Final Words
      • Implementing a GIL is a lot trickier than it looks
      • Even the work with priorities has problems
      • A good example of how multicore is diabolical
  75. Thanks for Listening!
      • I hope you learned at least one new thing
      • I'm always interested in feedback
      • Follow me on Twitter (@dabeaz)