Rudy Gilmore - Parallel processing - PyDSLA meetup - Nov 2014

Data Science LA
November 05, 2014


Transcript

  1. Intro to Multiprocessing with Python
     Rudy Gilmore
     Data Scientist, TrueCar Analytics Team
     PyData Meetup, 11/3/14
  2. Code Parallelization
     • Modern processors are not becoming much faster, but cores are becoming more numerous
     • Many problems in analytics are easily parallelizable
     • Writing parallel code will often allow you to get done in 1/nth the time
     • Amdahl's Law: the serial fraction of a program limits the achievable speedup
     • Python has some barriers to parallelization, but there are simple workarounds
     There are many options for high-performance parallel computing:
     - Cluster computing?
     - Hadoop?
     - Distributed processing?
     - GPGPUs?
     Let's start simple: how to get multiple cores on one machine into the action
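The formula for Amdahl's Law did not survive in the transcript. As a rough sketch, assuming the standard textbook statement of the law (speedup = 1 / ((1 - p) + p / n) for a parallelizable fraction p on n workers), the limit can be computed as below. The function name `amdahl_speedup` is illustrative and is not taken from the slides.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Theoretical speedup under Amdahl's Law (textbook form, assumed here)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected by extra workers; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # With 90% of the work parallelizable, the speedup stays below 10x
    # no matter how many workers are added.
    for n in (1, 2, 4, 8, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

This is why "1/nth the time" is a best case: it only holds when essentially all of the work can run in parallel.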
  3. "Embarrassingly Parallel" (processes completely independent)
     Examples:
       1. independent for loop
       2. .map ops on dataset
       3. integration
       4. Monte-Carlo methods
       5. some ML problems
     "Inherently Serial" (difficult or impossible to run in parallel)
     Example: numerical PDE
     "Somewhat Parallelizable" (some communication needed)
     Example: sorting
     Parallel algorithms can be classified by the data transfer required between processes; this can be done via message passing or shared memory
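To make the "embarrassingly parallel" category concrete, here is a minimal Monte-Carlo sketch in Python 3 (the slides themselves use Python 2.7). It estimates pi from independent batches of random points, and no batch needs data from any other. The names `count_hits` and `estimate_pi`, the seeds, and the sample counts are all illustrative assumptions.

```python
import multiprocessing as mp
import random


def count_hits(task):
    """Count random points in the unit square that land inside the quarter circle."""
    seed, n_samples = task
    rng = random.Random(seed)  # per-task generator, so tasks share no state
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks=4, samples_per_task=100000):
    """Run independent batches in a pool of workers and combine the counts."""
    tasks = [(seed, samples_per_task) for seed in range(n_tasks)]
    with mp.Pool(processes=n_tasks) as pool:
        hits = pool.map(count_hits, tasks)
    return 4.0 * sum(hits) / (n_tasks * samples_per_task)


if __name__ == "__main__":
    print(estimate_pi())
```

The only communication is sending each task out and collecting one integer back, which is what makes this kind of problem scale so easily.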
  4. Python's Global Interpreter Lock (GIL)
     Only one thread may execute code in the Python interpreter at a time
     • Multiple threads will automatically switch off at a standard interval
     • The GIL is a feature of CPython; some other implementations, such as Jython and IronPython, do not have this limitation (PyPy does have a GIL)
  5. Python's thread and threading modules
     • Provide resources for splitting a program into multiple threads
     • However, for CPU-intensive tasks, there will not be any speedup from multithreading alone
     • GIL still in effect
     • So what good is multithreading anyways?
     • CPU-bound vs. I/O-bound: threading is useful in the latter but not the former
     [Figure: "What you want" vs. "What you're gonna get"]
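The I/O-bound case can be sketched as follows in Python 3. Here `time.sleep` is only a stand-in for a blocking call such as a download or SQL query, and the helper names (`fake_io`, `run_serial`, `run_threaded`) are hypothetical. A blocked thread releases the GIL, so the waits overlap.

```python
import threading
import time


def fake_io(delay, results, index):
    """Stand-in for a blocking I/O call; sleeping releases the GIL."""
    time.sleep(delay)
    results[index] = delay


def run_serial(delays):
    """Run the waits one after another and return (elapsed seconds, results)."""
    results = [None] * len(delays)
    t0 = time.perf_counter()
    for i, d in enumerate(delays):
        fake_io(d, results, i)
    return time.perf_counter() - t0, results


def run_threaded(delays):
    """Run each wait in its own thread and return (elapsed seconds, results)."""
    results = [None] * len(delays)
    threads = [threading.Thread(target=fake_io, args=(d, results, i))
               for i, d in enumerate(delays)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0, results


if __name__ == "__main__":
    delays = [0.5] * 4
    print("serial:  ", run_serial(delays)[0])
    print("threaded:", run_threaded(delays)[0])
```

Replacing the sleep with a CPU-bound loop would remove the benefit, because only one thread could run Python bytecode at a time.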
  6. multiprocessing module
     • part of the standard library as of Python 2.6
     • launches multiple processes
     • processes include separate interpreters, and therefore separate GILs
     • each process operates on a separate copy of memory from the time of launch
     • similar syntax to threading
     • beware: processes have significant overhead on some operating systems, notably Windows
     [Figure labels: GIL 1, GIL 2]
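The "separate copy of memory" point is easy to trip over, so here is a small Python 3 sketch of it. The function name `append_marker` and the list contents are illustrative. The child mutates its own copy of the list, and the parent's list is left as it was.

```python
import multiprocessing as mp


def append_marker(data):
    """Mutate the list in place; in a child process this changes only the child's copy."""
    data.append("child")


if __name__ == "__main__":
    data = ["parent"]
    p = mp.Process(target=append_marker, args=(data,))
    p.start()
    p.join()
    # The child's append is not visible here: the parent still has its original list.
    print(data)
```

Getting results back therefore needs an explicit channel, such as the Queue used in the examples on the following slides.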
  7. Some simple examples of threading and multiprocessing
     Running CPython v2.7.6
     First, let's set up a CPU-bound task:

         def isprime(n):
             # trial division up to sqrt(n); note that isprime(1) returns True
             for i in range(2, int(n**(0.5)) + 1):
                 if n % i == 0:
                     return False
             return True

         def prime(Nth, q=None):  # returns the Nth number that passes isprime()
             n_found = 0
             i = 0
             while n_found < Nth:
                 i += 1
                 n_found = n_found + int(isprime(i))
             if q:
                 q.put(i)  # send to Queue object if set
             return i
  8.
         import time
         import threading as th
         import multiprocessing as mp

         start = 20000

         if __name__ == '__main__':
             q = mp.Queue()  # created up front: the thread jobs below also receive q

             t1 = time.time()  # time serial segment
             print prime(start), prime(start+1), prime(start+2), prime(start+3)
             print 'Serial test took', time.time() - t1, 'seconds'

             t2 = time.time()  # time multithreaded segment
             jobs = [th.Thread(target=prime, args=(start, q)),
                     th.Thread(target=prime, args=(start+1, q)),
                     th.Thread(target=prime, args=(start+2, q)),
                     th.Thread(target=prime, args=(start+3, q))]
             for j in jobs:
                 j.start()
             for j in jobs:
                 j.join()
             print 'Multithreaded test took', time.time() - t2, 'seconds'

             t3 = time.time()  # time multiprocessing segment
             jobs = [mp.Process(target=prime, args=(start, q)),
                     mp.Process(target=prime, args=(start+1, q)),
                     mp.Process(target=prime, args=(start+2, q)),
                     mp.Process(target=prime, args=(start+3, q))]
             for j in jobs:
                 j.start()
             for j in jobs:
                 j.join()
             print 'Multiprocessing test took', time.time() - t3, 'seconds'
  9. Output:

         224729 224737 224743 224759
         Serial test took 3.68699979782 seconds
         Multithreaded test took 5.64900016785 seconds
         Multiprocessing test took 1.29299998283 seconds
  10. multiprocessing.Pool() provides a map-like interface with automatic parallelization among a pool of workers

          # converting into a pool process
          t4 = time.time()
          pool = mp.Pool(processes=4)
          result = pool.map(prime, range(start, start+4))
          print result
          print 'Pool test took', time.time() - t4, 'seconds'

      Output:

          Serial test took 3.68699979782 seconds
          Multithreaded test took 5.64900016785 seconds
          Multiprocessing test took 1.29299998283 seconds
          [224729, 224737, 224743, 224759]
          Pool test took 1.31299996376 seconds

      Notes:
      • Tasks should be roughly equal in size; adjust manually if possible
      • map() will block until the job is complete; use map_async() to return immediately with a result object that can be collected later
      • multiple args will need to be combined into a single list, then unwrapped with *
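The last two notes can be sketched together in Python 3. The `power` function and the `power_star` adapter are hypothetical names chosen for illustration. Each task is packed into one tuple, the adapter unwraps it with `*`, and `map_async()` hands back an `AsyncResult` whose `get()` does the waiting.

```python
import multiprocessing as mp


def power(base, exponent):
    """A two-argument function we would like to run in a pool."""
    return base ** exponent


def power_star(args):
    """Adapter for Pool.map, which passes each task as a single argument."""
    return power(*args)


if __name__ == "__main__":
    tasks = [(2, 3), (3, 2), (10, 0)]
    with mp.Pool(processes=2) as pool:
        # map() blocks until every result is ready.
        print(pool.map(power_star, tasks))
        # map_async() returns an AsyncResult right away; get() blocks instead.
        pending = pool.map_async(power_star, tasks)
        print(pending.get(timeout=30))
```

On Python 3.3 and later, `Pool.starmap()` performs the same unpacking without an adapter; the wrapper shown here also works on the Python 2.7 used in the slides.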
  11. Further reading:
      • multiprocessing supports inter-process communication using Queue() and Pipe()
      • support for sharing objects in memory using Value() and Array()
      • "Premature optimization is the root of all evil". Discuss.

      In conclusion:
      • Use threading if you have a potentially blocking I/O procedure, like a download or SQL query
      • Use multiprocessing.Process() and multiprocessing.Pool() to run CPU-intensive tasks in parallel

      References:
      http://sebastianraschka.com/Articles/2014_multiprocessing_intro.html#An-introduction-to-parallel-programming-using-Python%27s-multiprocessing-module
      http://www.quantstart.com/articles/Parallelising-Python-with-Threading-and-Multiprocessing
      http://www.dabeaz.com/python/GIL.pdf
      http://calcul.math.cnrs.fr/Documents/Ecoles/2010/cours_multiprocessing.pdf
      http://pymotw.com/2/multiprocessing/communication.html#process-pools