Multiprocessing and Gearman

Ankur Gupta
November 29, 2011

Talk at scipy.in

Transcript

  1. About me
     • Ankur Gupta
     • Computer programmer
     • Python and C++ programmer for 5+ years
     • Worked at two startups and HP
     • Enjoy reading and making software that people use
     • Homepage: http://uptosomething.in
     • Email: [email protected]
  2. Content
     1) Moore's law – Multi Core – GIL
     2) Data Explosion – HP Case Study
     3) Multiprocessing – Introduction and Capabilities
     4) Multiprocessing – Live Code Review
     5) Gearman – Introduction and Capabilities
     6) Gearman – How-to and Code Review
     7) Logging, Debugging, Monitoring
  3. Moore's law – Multi Core – GIL
     “Moore's law describes a long-term trend in the history of computing hardware: the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years.”
     Today Moore's law reaches us in the form of multi-core CPUs. Many developers still write code as if they had a single core, even for embarrassingly parallel problems (search Wikipedia for "embarrassingly parallel"). Python developers are at a particular disadvantage because of the Global Interpreter Lock (GIL) and because the Python standard library has no equivalent of the concurrent data structures in Intel Threading Building Blocks.
  4. Data Explosion: HP Case Study
     a) More than 40 terabytes of instrumentation and performance data received from a cluster of storage (SAN) boxes deployed globally.
     b) If a box goes down, the SLA means the monetary consequences of the downtime are dire.
     c) The best solution is to analyze the data in near real time, find problems before they happen, and dispatch an automated email with data showing what needs to be corrected in the box and how.
     d) A codebase for passive analysis of this data already existed, with code in Perl, sed, awk, shell script, Tcl and C.
     e) Python (Multiprocessing, Django) and Gearman saved the day:
        i) Python got version 1.0 working in very little time.
        ii) Multiprocessing let us use all cores on a given machine.
        iii) Gearman let us keep using the diverse existing codebase and also scale the application across more machines with minimal refactoring of version 1.0.
  5. Multiprocessing – Introduction and Capabilities
     “Multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.”
     Source: http://docs.python.org/library/multiprocessing.html
     a) Shields programmers from the chores of IPC by offering Pipes, Queues, Process Pools and Managers for shared data.
     b) The API is similar to the threading API.
     c) Multiprocessing can scale over a farm of machines using a remote Manager.
     d) Multiprocessing does not suffer from the GIL: each process runs its own instance of the Python interpreter, with its own GIL, and is responsible for its own memory management.
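As a rough illustration of points a) and b) above, the following sketch starts worker processes with `multiprocessing.Process` and passes work through `multiprocessing.Queue` objects. It is not the code from the talk: the names `square_worker` and `square_all`, the choice of two workers, and the `None` sentinel are assumptions made for this example.

```python
import multiprocessing


def square_worker(tasks, results):
    """Read numbers from `tasks` until the None sentinel arrives.

    For every number n, put the tuple (n, n * n) on `results`.
    """
    for n in iter(tasks.get, None):
        results.put((n, n * n))


def square_all(numbers, workers=2):
    """Square each number in the list `numbers` using `workers` processes."""
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    # Process takes target/args just like threading.Thread does.
    procs = [
        multiprocessing.Process(target=square_worker, args=(tasks, results))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    for n in numbers:
        tasks.put(n)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker so each one stops
    # Drain the results before joining, so no worker blocks on a full pipe.
    collected = dict(results.get() for _ in numbers)
    for proc in procs:
        proc.join()
    return collected


if __name__ == "__main__":
    print(square_all([1, 2, 3, 4]))
```

The `if __name__ == "__main__":` guard matters on platforms where child processes re-import the main module, Windows among them.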
  6. Multiprocessing – Live Code Review
     a) Factorial example (one process)
     b) Factorial example using multiprocessing to employ multiple cores (2 * number of cores processes)
     Code can be downloaded from http://uptosomething.in/scipy/code.tar.gz (link will be valid from 4th December onwards)
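The code from the live review is only available through the link above, so what follows is a separate minimal sketch of the same idea rather than the talk's code. It computes a factorial in one process, then again with a `multiprocessing.Pool` sized at 2 * number of cores. The helper names (`range_product`, `split_range`, `factorial_single`, `factorial_parallel`) and the chunking strategy are assumptions made for this illustration.

```python
import math
import multiprocessing
from functools import reduce
from operator import mul


def range_product(bounds):
    """Multiply all integers in the inclusive range (start, stop)."""
    start, stop = bounds
    return reduce(mul, range(start, stop + 1), 1)


def split_range(n, parts):
    """Split 1..n into at most `parts` contiguous inclusive (start, stop) chunks."""
    size = max(1, -(-n // parts))  # ceiling division
    return [(lo, min(lo + size - 1, n)) for lo in range(1, n + 1, size)]


def factorial_single(n):
    """Factorial computed in the current process only."""
    return range_product((1, n))


def factorial_parallel(n, processes=None):
    """Factorial computed as partial products in a pool of worker processes."""
    processes = processes or 2 * multiprocessing.cpu_count()
    with multiprocessing.Pool(processes) as pool:
        partials = pool.map(range_product, split_range(n, processes))
    # Combine the per-chunk products in the parent process.
    return reduce(mul, partials, 1)


if __name__ == "__main__":
    n = 5000
    expected = math.factorial(n)
    assert factorial_single(n) == expected
    assert factorial_parallel(n) == expected
    print("factorial(%d) matches, %d bits long" % (n, expected.bit_length()))
```

For small `n` the cost of starting processes and pickling results will likely outweigh any gain; the parallel version is only worth considering when each chunk does substantial work.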
  7. Gearman – Introduction and Capabilities
     “Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages.”
     Source: http://gearman.org
     What SQLite is to relational databases, Gearman is to distributed job queues.
  8. Gearman – How-to and Code Review
     a) Installation and basic usage of Gearman
     b) Factorial example using Gearman (demo uses multiple machines)
     Code can be downloaded from http://uptosomething.in/scipy/code.tar.gz (link will be valid from 4th December onwards)
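The demo code itself is not reproduced in this transcript. As a hedged sketch of what a Gearman factorial worker and client might look like, the example below assumes the third-party python-gearman 2.x library (`GearmanWorker`, `GearmanClient`) and a `gearmand` job server listening on `localhost:4730`. The task name `factorial`, the server address and the function names are assumptions for illustration, not details from the talk.

```python
import math
import sys

GEARMAN_SERVERS = ["localhost:4730"]  # assumption: local gearmand, default port
TASK_NAME = "factorial"               # hypothetical task name


def factorial_task(gearman_worker, gearman_job):
    """Worker callback: the job payload is a decimal integer as text.

    Returns the factorial as decimal text, since Gearman payloads are
    opaque byte strings.
    """
    return str(math.factorial(int(gearman_job.data)))


def run_worker():
    """Register the task and serve jobs until interrupted."""
    import gearman  # third-party; imported lazily so the callback stays testable
    worker = gearman.GearmanWorker(GEARMAN_SERVERS)
    worker.register_task(TASK_NAME, factorial_task)
    worker.work()  # blocks, polling the job server for work


def request_factorial(n):
    """Submit one job and wait for its result."""
    import gearman
    client = gearman.GearmanClient(GEARMAN_SERVERS)
    request = client.submit_job(TASK_NAME, str(n))
    return int(request.result)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): "worker" on each worker machine, a number on the client.
    if sys.argv[1] == "worker":
        run_worker()
    else:
        print(request_factorial(int(sys.argv[1])))
```

Running the worker on several machines that point at the same job server is what spreads the work out; the client does not need to know how many workers exist.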
  9. Logging, Debugging, Monitoring
     a) Use the logging module to log from multiple processes, e.g. SocketHandler for a distributed application, or syslogd for an application that runs multiple processes on one machine.
     b) For debugging, attach cProfile to each process and dump its stats output. It is possible to join the output of all processes. Yes, it is cumbersome and painful, especially when looking for resource deadlocks.
     c) Proactive monitoring of distributed Python software is important if it is to run 24/7. You need to know if and when it goes down. Explore Monit ( http://mmonit.com/monit/ ) or Nagios ( http://www.nagios.org/ ).
     Code samples for logging and monitoring can be downloaded from http://uptosomething.in/scipy/code.tar.gz (link will be valid from 4th December onwards)
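Points a) and b) above could be sketched roughly as follows, using only the standard library. A logger sends records through `logging.handlers.SocketHandler`, each process dumps its own cProfile stats file, and `pstats.Stats.add` joins them. The function names, the file-naming scheme and the `sum_of_squares` workload are assumptions for this example, and a separate log receiver process (not shown) is assumed to listen on the given host and port.

```python
import cProfile
import logging
import logging.handlers
import os
import pstats
import tempfile


def build_socket_logger(name, host="localhost",
                        port=logging.handlers.DEFAULT_TCP_LOGGING_PORT):
    """Return a logger that ships pickled records to a TCP log receiver."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        logger.addHandler(logging.handlers.SocketHandler(host, port))
    return logger


def run_profiled(func, stats_dir, label, *args):
    """Run func(*args) under cProfile and dump stats to a per-process file.

    The file name combines `label` and the process id so that several
    workers can write into the same directory without clashing.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    path = os.path.join(stats_dir, "%s-%d.prof" % (label, os.getpid()))
    profiler.dump_stats(path)
    return result, path


def merge_profiles(paths):
    """Join several dumped stats files into one pstats.Stats object."""
    merged = pstats.Stats(paths[0])
    for path in paths[1:]:
        merged.add(path)
    return merged


def sum_of_squares(n):
    """Hypothetical workload used only to have something to profile."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as stats_dir:
        _, first = run_profiled(sum_of_squares, stats_dir, "first", 100000)
        _, second = run_profiled(sum_of_squares, stats_dir, "second", 200000)
        merge_profiles([first, second]).sort_stats("cumulative").print_stats(5)
```

In a real multi-process run, each worker would call `run_profiled` around its own entry point and the parent would merge the files after the workers exit.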