
PyParallel - Trent Nelson


The talk will cover that, but will also give some real-life performance examples of where PyParallel shines in comparison to the existing options (e.g. asyncio, Twisted, Tornado). This will typically be web-server based stuff, as, well, you know, that's actually something that works properly in PyParallel :-)

PyGotham 2014

August 17, 2014

Transcript

1. PyParallel – PyGotham 2014
Trent Nelson
Managing Director, New York
Continuum Analytics
@ContinuumIO, @trentnelson
[email protected]
http://speakerdeck.com/trent/

2. About Me
• Systems Software Engineer
• Core Python Committer
• Apache/Subversion Committer
• Founded Snakebite @ Michigan State University
  o AIX RS/6000
  o SGI IRIX/MIPS
  o Alpha/Tru64
  o Solaris/SPARC
  o HP-UX/IA64
  o FreeBSD, NetBSD, OpenBSD, DragonFlyBSD
• Background is UNIX
• Made peace with Windows when XP came out

3. What is PyParallel?
• Set of modifications to the CPython interpreter
• Allows multiple interpreter threads to run in parallel without incurring any additional performance penalties
• Solves the GIL problem without removing the GIL
  o Because the problem isn't the GIL.
  o (The problem is that I want to exploit my hardware as efficiently as possible with a reasonable amount of development effort.)
• Started as a proof of concept
• I'm now convinced it's essential for Python to stay competitive for the next 20+ years
• That time is going to pass anyway; we may as well have a plan in place

4. "Describe what developing for each console you've developed for is like."
• Like all the best quotations, this one comes from reddit:
  o http://www.reddit.com/r/gamedev/comments/xddlp/describe_what_developing_for_each_console_youve/

5. PS2: "You are handed a 10-inch thick stack of manuals written by Japanese hardware engineers. The first time you read the stack, nothing makes any sense at all. The second time you read the stack, the 3rd book makes a bit more sense because of what you learned in the 8th book. The machine has 10 different processors (IOP, SPU1&2, MDEC, R5900, VU0&1, GIF, VIF, GS) and 6 different memory spaces (IOP, SPU, CPU, GS, VU0&1) that all work in completely different ways. There are so many amazing things you can do, but everything requires backflips through invisible blades of segfault. Getting the first triangle to appear on the screen took some teams over a month because it involved routing commands through R5900->VIF->VU1->GIF->GS oddities with no feedback about what you were doing wrong until you got every step along the way to be correct. If you were willing to twist your game to fit the machine, you could get awesome results. There was a debugger for the main CPU (R5900). It worked pretty OK. For the rest of the processors, you just had to write code without bugs."

"everything requires backflips through invisible blades of segfault"
  - PyParallel: The Early Days. [*]
  [*]: still applicable, 17th August, 2014, 2:18pm

6. Motivation behind PyParallel
• What problem was I trying to solve?
• Wasn't happy with the status quo
  o Parallel options (for compute-bound, data parallelism problems):
    • GIL prevents simultaneous multithreading
    • ...so you have to rely on separate Python processes if you want to exploit more than one core
  o Concurrency options (for I/O-bound or I/O-driven, task parallelism problems):
    • One thread per client, blocking I/O
    • Single thread, event loop, multiplexing system call (select/poll/epoll/kqueue) (minimal sketch below)

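To make that last option concrete, here is a minimal single-threaded echo server built on the stdlib selectors module (which wraps select/poll/epoll/kqueue). It is only a sketch: the callback names, port and buffer size are illustrative, not from the talk.

    # One thread, one event loop, one multiplexing syscall under the hood.
    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def accept(server):
        conn, _ = server.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, handle)

    def handle(conn):
        data = conn.recv(4096)
        if data:
            conn.send(data)              # echo back (best-effort for a sketch)
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(("127.0.0.1", 8000))     # illustrative address
    server.listen(100)
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:                          # the single-threaded event loop
        for key, _ in sel.select():
            key.data(key.fileobj)        # dispatch to the registered callback
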
7. What if I'm I/O-bound and compute-bound?
• Contemporary enterprise problems:
  o Computationally-intensive (compute-bound) work against TBs/PBs of data (I/O-bound)
  o Serving tens of thousands of network clients (I/O-driven) with non-trivial computation required per request (compute-bound)
  o Serving fewer clients, but providing ultra-low latency or maximum throughput to those you do serve (HFT, remote array servers, etc.)
• Contemporary data center hardware:
  o 128 cores, 512GB RAM
  o Quad 10Gb Ethernet NICs
  o SSDs & Fusion-io-style storage -> 500k-800k+ IOPS from a single device
  o 2016: 128Gb Fibre Channel (4x32Gb) -> 25.6GB/s throughput

8. Real Problems, Powerful Hardware
• I want to solve my problems as optimally as my hardware will allow
• Optimal hardware use necessitates things like:
  o One active thread per core
    • Any more results in unnecessary context switches
  o No unnecessary duplication of shared/common data in memory
  o Ability to saturate the bandwidth of my I/O devices
• And I want to do it all in Python
• ...yet still be competitive against C/C++ where it matters

9. What do you want to see next?
• Segfaul^WLive Demo!
• Benchmarks!
• Moar slides!
  o I have 74 more in this deck.
  o And 154 in my other one!
• Q&A!
• Exclamation points!

10. Concurrency versus Parallelism
• Concurrency:
  o Making progress on multiple things at the same time
    • Task A doesn't need to complete before you can start work on task B
  o Typically used to describe I/O-bound or I/O-driven systems, especially network-oriented socket servers
• Parallelism:
  o Making progress on one thing in multiple places at the same time
    • Task A is split into 8 parts, each part runs on a separate core
  o Typically used in compute-bound contexts
    • Map/reduce, aggregation, "embarrassingly parallelizable" data, etc.

11. So for a given time frame T (1us, 1ms, 1s, etc.)...
• Concurrency: how many things did I do?
  o Things = units of work (e.g. servicing network clients)
  o Performance benchmark:
    • How fast was everyone served? (i.e. request latency)
    • And were they served fairly?
• Parallelism: how many things did I do them on?
  o Things = hardware units (e.g. CPU cores, GPU cores)
  o Performance benchmark:
    • How much did I get done?
    • How long did it take?

12. Concurrent Python
• I/O-driven client/server systems (socket-oriented)
• There are some pretty decent Python libraries out there geared toward concurrency
  o Twisted, Tornado, Tulip/asyncio (3.x), etc.
• Common themes:
  o Set all your sockets and file descriptors to non-blocking
  o Write your Python in an event-oriented fashion (see the asyncio sketch below)
    • def data_received(self, data): ...
    • Hollywood Principle: don't call us, we'll call you
  o Appearance of asynchronous I/O achieved via a single-threaded event loop with a multiplexing system call
• Biggest drawback:
  o Inherently limited to a single core
  o Thus, inadequate for problems that are both concurrent and compute-bound

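A minimal asyncio sketch of that event-oriented style (Python 3.4-era Tulip/asyncio); EchoProtocol and the port are illustrative names, not part of any of the libraries mentioned above:

    import asyncio

    class EchoProtocol(asyncio.Protocol):
        def connection_made(self, transport):
            self.transport = transport

        def data_received(self, data):   # "don't call us, we'll call you"
            self.transport.write(data)

    loop = asyncio.get_event_loop()
    server = loop.run_until_complete(
        loop.create_server(EchoProtocol, "127.0.0.1", 8000))
    loop.run_forever()                   # one thread, one event loop, one core
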
13. Coarse-grained versus fine-grained parallelism
• Coarse-grained (task parallelism)
  o Batch processing daily files
  o Data mining distinct segments/chunks/partitions
  o Process A runs on data set X, independent of process B running on data set Y
• Fine-grained (data parallelism)
  o Map/reduce, divide & conquer, aggregation, etc.
  o Common theme: sequential execution, fan out to parallel work against a shared data set, collapse back down to sequential

14. Coarse-grained versus fine-grained parallelism
• Coarse-grained (multiple processes):
  o Typically adequate: multiple processes that don't need to talk to each other (or if they do, don't need to talk often)
  o Depending on shared state, could still benefit from being implemented with threads instead of processes
    • Better cache usage, less duplication of identical memory structures, less overhead overall
• Fine-grained (multiple threads):
  o Typically optimal: multiple threads within the same address space
  o IPC overhead can severely impact net performance when you have to use processes instead of threads

15. Python landscape for fine-grained parallelism
• Python's GIL (global interpreter lock) prevents more than one Python interpreter thread from running at a given time
• If you want to use multiple threads within the same Python process, you have to come up with a way to avoid the GIL
  o (Fine-grained parallelism =~ multithreading)
• Today, this relies on:
  o Extension modules or libraries
  o Bypassing the CPython interpreter entirely and compiling to machine code

16. Python landscape for fine-grained parallelism
• Options today:
  o Extension modules or libraries:
    • Accelerate/NumbaPro (GPU, multicore)
    • OpenCV
    • Intel MKL libraries
  o Bypassing the CPython interpreter entirely by compiling Python to machine code:
    • Numba with threading
    • Cython with OpenMP
• Options tomorrow (Python 4.x):
  o PyParallel?
    • Demonstrates it is possible to have multiple CPython interpreter threads running in parallel without incurring a performance overhead
  o PyPy-STM?

17. Python Landscape for Coarse-grained Parallelism
• Rich ecosystem depending on your problem:
  o https://wiki.python.org/moin/ParallelProcessing
  o batchlib, Celery, Deap, disco, dispy, DistributedPython, exec_proxy, execnet, IPython Parallel, jug, mpi4py, PaPy, pyMPI, pypar, pypvm, Pyro, rthread, SCOOP, seppo, superspy
• Python stdlib options:
  o multiprocessing (since 2.6)
  o concurrent.futures (introduced in 3.2, backported to 2.7) (sketch below)
• Common throughout:
  o Separate Python processes to achieve parallel execution

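A minimal sketch of the stdlib route to coarse-grained parallelism via concurrent.futures, which fans the work out to separate Python processes (one worker per core by default); crunch() and the work items are illustrative stand-ins for real per-chunk processing:

    from concurrent.futures import ProcessPoolExecutor

    def crunch(n):
        # stand-in for real compute-bound work on one chunk/partition
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        chunks = [2000000] * 8              # illustrative work items
        with ProcessPoolExecutor() as pool:  # defaults to one worker per core
            results = list(pool.map(crunch, chunks))
        print(results)
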
18. Python & the GIL
• No talk on parallelism and concurrency in Python would be complete without mentioning the GIL (global interpreter lock)
• What is it?
  o A lock that ensures only one thread can execute CPython innards at any given time
  o Create 100 threading.Thread() instances...
  o ...and only one will run at any given time
• So why even support threads if they can't run in parallel?
• Because they can be useful for blocking, I/O-bound problems
  o Ironically, they facilitate concurrency in Python, not parallelism
• But they won't solve your compute-bound problem any faster (sketch below)
• Nor will you ever exploit more than one core

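A small sketch of that point: four compute-bound threads take roughly as long as one, because the GIL lets only one of them execute bytecode at a time. The spin() helper is purely illustrative.

    import threading
    import time

    def spin(n=10000000):
        while n:
            n -= 1

    start = time.time()
    threads = [threading.Thread(target=spin) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("4 compute-bound threads took", time.time() - start, "seconds")
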
19. import multiprocessing
• Added in Python 2.6 (2008)
• Similar interface to the threading module (sketch below)
• Uses separate Python processes behind the scenes

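The same illustrative spin() work as in the threading sketch above, but with multiprocessing.Process swapped in for threading.Thread: the four workers are now separate processes and can genuinely run on four cores.

    from multiprocessing import Process

    def spin(n=10000000):
        while n:
            n -= 1

    if __name__ == "__main__":   # guard required on Windows (spawn-based start)
        procs = [Process(target=spin) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
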
20. from multiprocessing import pros
• It works
• It's in the stdlib
• It's adequate for coarse-grained parallelism
• It'll use all my cores if I'm compute-bound

21. from multiprocessing import cons
• Often sub-optimal depending on the problem
• Inadequate for fine-grained parallelism
• Inadequate for I/O-driven problems (specifically socket servers)
• Overhead of extra processes
• No shared memory out of the box (I'd have to set it up myself)
• Kinda quirky on Windows
• The examples in the docs are trivialized and don't really map to real-world problems
  o https://docs.python.org/2/library/multiprocessing.html
  o i.e. x*x for x in [1, 2, 3, 4]

22. from multiprocessing import subtleties
• Recap: contemporary data center hardware: 128 cores, 512GB RAM
• I want to use multiprocessing to solve my compute-bound problem
• And I want to use my hardware optimally; idle cores are useless
• So how big should my multiprocessing pool be? How many processes?
• 128, right?

23. 128 cores = 128 processes?
• Works fine... until you need to do I/O
• And you're probably going to be doing blocking I/O
  o i.e. synchronous read/write calls
  o Non-blocking I/O is poorly suited to multiprocessing, as you'd need per-process event loops doing the syscall multiplexing dance
• The problem is, as soon as you block, that's one less process able to do useful work
• Can quickly become pathological (sketch below):
  o Start a pool of 64 processes (for 64 cores)
  o A few minutes later: only 20-25 active
• Is the solution to create a bigger pool?

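A hedged sketch of that pathology: the pool is sized to the core count, but each worker spends part of its time blocked in read(), during which its core sits idle. The work() function and file names are illustrative only.

    import multiprocessing

    def work(path):
        with open(path, "rb") as f:
            data = f.read()          # blocking I/O: no useful work happens here
        return sum(data)             # the compute-bound part

    if __name__ == "__main__":
        paths = ["chunk-%d.dat" % i for i in range(1024)]   # illustrative inputs
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        totals = pool.map(work, paths)
        pool.close()
        pool.join()
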
24. 128 cores = 132 processes? 194? 256?
• Simply increasing the number of processes isn't the solution
• Results in pathological behavior at the opposite end of the spectrum
  o Instead of idle cores, you have over-scheduled cores
  o Significant overhead incurred by context switching
    • Cache pollution, TLB contention
  o You can see this with basic tools like top: 20% user, 80% sys
• Neither approach is optimal today:
  o processes <= ncpu: idle cores
  o processes > ncpu: over-scheduled cores

25. What do we really need?
• We want to solve our problems optimally on our powerful hardware
• Avoid the sub-optimal:
  o Blocking I/O
  o Idleness (under-scheduled)
  o Context switching (over-scheduled)
  o Wasteful memory use
• Encourage the optimal:
  o One active thread per core
  o Efficient memory use

26. I want one active thread per core
• This is a subtly complex problem
• Intrinsically dependent upon the I/O facilities provided by the OS:
  o Readiness-oriented or completion-oriented?
  o Thread-agnostic I/O or thread-specific I/O?
• Plus one critical element:
  o Disassociating the work (computation) from the worker (thread)
  o Associating a desired concurrency level (i.e. use all my cores) with the work
• This allows the kernel to make intelligent thread dispatching decisions
  o Ensures only one active thread per core
  o No over-scheduling or unnecessary context switches
  (Slide annotation: "Blocking I/O")

27. The Desired Solution
• Getting the most out of my hardware...
• ...from a proportional amount of development time

28. Getting the most out of my hardware...
• The target should always be 100% core use or 100% I/O saturation, whichever comes first
• Why?
• Because I want to finish the job as fast as the hardware will allow
• Or serve the most clients with the least amount of hardware
• With sensible amounts of development effort
  o Python has always been fantastic for this
  o But not so great for getting the most out of my hardware

29. I/O Completion Ports
• IOCPs can be thought of as FIFO queues
• The I/O manager pushes completion packets asynchronously
• Threads pop completions off and process the results:

  do { s = GQCS(i); process(s); } while (1);   /* one such loop per thread; four shown on the slide */

  GQCS = GetQueuedCompletionStatus()
  (Diagram: NIC -> IRP -> I/O Manager -> Completion Packet -> IOCP -> waiting threads)

30. IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)

  (Diagram: four threads, each running do { s = GQCS(i); process(s); } while (1);, against an IOCP with concurrency=2)

31. IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the max concurrency (2) and that there are still outstanding packets in the completion queue

  (Diagram: four threads, each running do { s = GQCS(i); process(s); } while (1);, against an IOCP with concurrency=2)

32. IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the max concurrency (2) and that there are still outstanding packets in the completion queue
• ...and schedules another thread to run

  (Diagram: four threads, each running do { s = GQCS(i); process(s); } while (1);, against an IOCP with concurrency=2)

33. Windows and PyParallel
• The Windows concurrency and synchronization primitives and approach to asynchronous I/O are very well suited to what I wanted to do with PyParallel
• Vista introduced new thread pool APIs
• Tightly integrated into the IOCP/overlapped ecosystem
• Greatly reduced the amount of scaffolding code I needed to write to prototype the concept:

  void PxSocketClient_Callback();
  CreateThreadpoolIo(.., &PxSocketClient_Callback)
  ..
  StartThreadpoolIo(..)
  AcceptEx(..)/WSASend(..)/WSARecv(..)

• That's it. When the async I/O op completes, your callback gets invoked
• Windows manages everything: optimal thread pool size, NUMA-cognizant dispatching
• Didn't need to create a single thread, no mutexes, none of the normal headaches that come with multithreading

34. Post-PyParallel
• I now have the Python glue to optimally exploit my hardware
• But it's still Python...
• ...and Python can be kinda slow
• Especially when doing computationally-intensive work
  o Especially especially when doing numerically-oriented computation
• Enter... Numba! (sketch below)

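A sketch of the Numba idea (assuming a reasonably recent Numba): decorate a numeric hot loop and it is JIT-compiled to machine code, so the "Python is slow at numerics" penalty largely disappears for that function. The function name is illustrative.

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def sum_of_squares(a):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * a[i]
        return total

    print(sum_of_squares(np.arange(1000000.0)))
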
35. What do I want in Python 4.x?
• Native cross-platform PyParallel support
• @jit hooks introduced in the stdlib
• ...and an API for multiple downstream JIT compilers ("jitters") to hook into
  o CPython broadcasts the AST/bytecode being executed in ceval to the jitters
  o Multiple jitters running in separate threads
  o CPython: "Hey, can you optimize this chunk of Python? Let me know."
  o Next time it encounters that chunk, it can check for optimized versions
• Could provide a viable way of hooking in Numba, PyPy, Pythran, ShedSkin, etc., whilst still staying within the confines of CPython

36. I/O on Contemporary Windows Kernels (Vista+)
• Fantastic support for asynchronous I/O
• Threads have been first-class citizens since day 1 (not bolted on as an afterthought)
• Designed to be programmed in a completion-oriented, multi-threaded fashion
• Overlapped I/O + IOCP + threads + kernel synchronization primitives = an excellent combo for achieving high performance

37. I/O Completion Ports
• The best way to grok IOCP is to understand the problem it was designed to solve:
  o Facilitate writing high-performance network/file servers (http, database, file server)
  o Extract maximum performance from multi-processor/multi-core hardware
  o (Which necessitates optimal resource usage)

38. IOCP: Goals
• Extract maximum performance through parallelism
  o A thread running on every core servicing a client request
  o Upon finishing a client request, immediately process the next request if one is waiting
  o Never block
  o (And if you do block, handle it as optimally as possible)
• Optimal resource usage
  o One active thread per core

39. On not blocking...
• UNIX approach (Python sketch below):
  o Set the file descriptor to non-blocking
  o Try to read or write data
  o Get EAGAIN instead of blocking
  o Try again later
• Windows approach:
  o Create an overlapped I/O structure
  o Issue a read or write, passing the overlapped structure and completion port info
  o Call returns immediately
  o Read/write done asynchronously by the I/O manager
  o Optional completion packet queued to the completion port a) on error, b) on completion
  o A thread waiting on the completion port dequeues the completion packet and processes the request

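A sketch of the UNIX half of that comparison in Python: a non-blocking socket raises BlockingIOError (EAGAIN/EWOULDBLOCK) instead of blocking, and the caller is expected to try again later, usually after select/poll reports readiness. Host and port are illustrative.

    import socket

    sock = socket.create_connection(("example.com", 80))
    sock.setblocking(False)
    try:
        data = sock.recv(4096)       # nothing readable yet?
    except BlockingIOError:
        data = None                  # EAGAIN: try again later
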
40. On not blocking...
• UNIX approach (readiness-oriented, reactor pattern):
  o Is this ready to write yet?
  o No? How about now?
  o Still no?
  o Now?
  o Yes!? Really? OK, write it!
  o Hi! Me again. Anything to read?
  o No?
  o How about now?
• Windows approach (completion-oriented, proactor pattern):
  o Here, do this. Let me know when it's done.

41. On not blocking...
• Windows provides an asynchronous/overlapped way to do just about everything
• Basically, if it could block, there's a way to do it asynchronously on Windows
• WSASend() and WSARecv()
• AcceptEx() vs accept()
• ConnectEx() vs connect()
• DisconnectEx() vs close()
• GetAddrInfoEx() vs getaddrinfo() (Windows 8+)
• (And that's just for sockets; all device I/O can be done asynchronously)

42. Thread-agnostic I/O with IOCP
• The secret sauce behind asynchronous I/O on Windows
• IOCPs allow IRP completion (copying data from non-paged kernel memory back to the user's buffer) to be deferred to a thread-agnostic queue
• Any thread can wait on this queue (the completion port) via GetQueuedCompletionStatus()
• IRP completion is done just before that call returns
• Allows the I/O manager to rapidly queue IRP completions
• ...and waiting threads to instantly dequeue and process them

43. IOCP and Concurrency
• IOCPs can be thought of as FIFO queues
• The I/O manager pushes completion packets asynchronously
• Threads pop completions off and process the results:

  do { s = GQCS(i); process(s); } while (1);   /* one such loop per thread; four shown on the slide */

  GQCS = GetQueuedCompletionStatus()
  (Diagram: NIC -> IRP -> I/O Manager -> Completion Packet -> IOCP -> waiting threads)

44. IOCP and Concurrency
• Remember the IOCP design goals:
  o Maximize performance
  o Optimize resource usage
• Optimal number of active threads running per core: 1
• Optimal number of total threads running: 1 * ncpu
• Windows can't control how many threads you create and then have waiting against the completion port
• But it can control when, and how many, threads get awoken
  o ...via the IOCP's maximum concurrency value
  o (Specified when you create the IOCP)

45. IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)

  (Diagram: four threads, each running do { s = GQCS(i); process(s); } while (1);, against an IOCP with concurrency=2)

46. IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the max concurrency (2) and that there are still outstanding packets in the completion queue

  (Diagram: four threads, each running do { s = GQCS(i); process(s); } while (1);, against an IOCP with concurrency=2)

47. IOCP and Concurrency
• Set the I/O completion port's concurrency to the number of CPUs/cores (2)
• Create double the number of threads (4)
• An active thread does something that blocks (e.g. file I/O)
• Windows can detect that the active thread count (1) has dropped below the max concurrency (2) and that there are still outstanding packets in the completion queue
• ...and schedules another thread to run

  (Diagram: four threads, each running do { s = GQCS(i); process(s); } while (1);, against an IOCP with concurrency=2)

48. So how does it work?
• First, how it doesn't work:
  o No GIL removal
    • This was previously tried and rejected
    • Required fine-grained locking throughout the interpreter
    • Mutexes are expensive
    • Single-threaded execution became significantly slower
  o Not using PyPy's approach via Software Transactional Memory (STM)
    • Huge overhead
    • 64 threads trying to write to something: 1 wins and continues
    • 63 keep trying
    • 63 bottles of beer on the wall...
  o Doesn't support "free threading"
    • Existing code using threading.Thread won't magically run on all cores
    • You need to use the new async APIs

49. PyParallel's Approach
• Don't touch the GIL
  o It's great, serves a very useful purpose
• Instead, intercept all thread-sensitive calls:
  o Reference counting (Py_(INCREF|DECREF|CLEAR))
  o Memory management (PyMem_(Malloc|Free), PyObject_(INIT|NEW))
  o Free lists
  o Static C globals
  o Interned strings
• If we're the main thread, do what we normally do
• However, if we're a parallel thread, do a thread-safe alternative

50. Main Thread or Parallel Thread?
• "If we're a parallel thread, do X; if not, do Y"
  o X = thread-safe alternative
  o Y = what we normally do
• "If we're a parallel thread"
  o Thread-sensitive calls are ubiquitous
  o But we want a negligible performance impact
  o So the challenge is how quickly we can detect that we're a parallel thread
  o The quicker we can detect it, the less overhead incurred

51. The Py_PXCTX macro
• "Are we running in a parallel context?"

  #define Py_PXCTX (Py_MainThreadId != _Py_get_current_thread_id())

• What's so special about _Py_get_current_thread_id()?
  o On Windows, you could use GetCurrentThreadId()
  o On POSIX, pthread_self()
• Unnecessary overhead (this macro will be everywhere)
• Is there a quicker way?
• Can we determine whether we're running in a parallel context without needing a function call?

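For readers more at home in Python than C, here is a pure-Python analogue of the question Py_PXCTX answers; this is just the concept, not PyParallel's implementation, which avoids even the function call used below. The helper name is illustrative.

    import threading

    _MAIN_THREAD_ID = threading.get_ident()   # captured at import time, on the main thread

    def in_parallel_context():
        # "Are we running on some thread other than the main thread?"
        return threading.get_ident() != _MAIN_THREAD_ID
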
52. Windows Solution: Interrogate the TEB

  #ifdef WITH_INTRINSICS
  #  ifdef MS_WINDOWS
  #    include <intrin.h>
  #    if defined(MS_WIN64)
  #      pragma intrinsic(__readgsdword)
  #      define _Py_get_current_process_id() (__readgsdword(0x40))
  #      define _Py_get_current_thread_id()  (__readgsdword(0x48))
  #    elif defined(MS_WIN32)
  #      pragma intrinsic(__readfsdword)
  #      define _Py_get_current_process_id() __readfsdword(0x20)
  #      define _Py_get_current_thread_id()  __readfsdword(0x24)

53. Py_PXCTX Example

  -#define _Py_ForgetReference(op) _Py_INC_TPFREES(op)
  +#define _Py_ForgetReference(op)                  \
  +        do {                                     \
  +                if (Py_PXCTX)                    \
  +                        _Px_ForgetReference(op); \
  +                else                             \
  +                        _Py_INC_TPFREES(op);     \
  +        } while (0)
  +
  +#endif /* WITH_PARALLEL */

• On x64, Py_PXCTX expands to (Py_MainThreadId != __readgsdword(0x48))
• Overhead reduced to a couple more instructions and an extra branch (the cost of which can be eliminated by branch prediction)
• That's basically free compared to STM or fine-grained locking

54. PyParallel Advantages
• Initial profiling results: 0.01% overhead incurred by Py_PXCTX for normal single-threaded code
  o GIL removal: 40% overhead
  o PyPy's STM: "200-500% slower"
• Only touches a relatively small amount of code
  o No need for intrusive surgery like rewriting a thread-safe bucket memory allocator or garbage collector
• Keeps GIL semantics
  o Important for legacy code
  o 3rd-party libraries, C extension code
• Code executing in a parallel context has full visibility of "main thread objects" (in a read-only capacity, thus no need for locks)

55. PyParallel In Action
• Things to note with the chargen demo coming up:
  o One python_d.exe process
  o Constant memory use
  o CPU use proportional to concurrent client count (1 client = 25% CPU use)
  o Every 10,000 sends, a status message is printed
• Depicts dynamically switching from synchronous sends to async sends
• Illustrates awareness of active I/O hogs
• Environment:
  o MacBook Pro, 8-core i7 2.2GHz, 8GB RAM
  o 1-5 netcat instances on OS X
  o Windows 7 instance running in Parallels, 4 cores, 3GB

56. Thanks!
Follow us on Twitter for more PyParallel announcements!
@ContinuumIO
@trentnelson
http://continuum.io/