Slide 1

Slide 1 text

Case Studies in Running Python in Parallel
Tim McNamara (@timClicks)
Kiwi Pycon 2012
Content released under the Creative Commons Attribution 3.0 New Zealand licence. Original code released under MIT/X11 licence.

Slide 2

Slide 2 text

Objectives
1) introduce people to distributed programming
2) peek inside Python's standard library
3) build community around distributed computing in NZ

Slide 3

Slide 3 text

Outline
1) introduce a problem
2) implement several solutions
3) consider trade offs between different

Slide 4

Slide 4 text

Caveat
I will be using the terms parallel, distributed and concurrent fairly interchangeably. This is wrong.

Slide 5

Slide 5 text

Why?
A tiny selection of reasons...

Slide 6

Slide 6 text

before

Slide 7

Slide 7 text

after

Slide 8

Slide 8 text

before [time axis: t]

Slide 9

Slide 9 text

after [time axis: t]
Key point: less time

Slide 10

Slide 10 text

before [diagram shows an error]

Slide 11

Slide 11 text

after [diagram shows an error]
Key point: isolation, leading to fault tolerance

Slide 12

Slide 12 text

Doing more than one thing on the same computer

Slide 13

Slide 13 text

Where to begin?

Slide 14

Slide 14 text

Let's work through a use case together

Slide 15

Slide 15 text

Facial recognition

Slide 16

Slide 16 text

Problem Statement:

Slide 17

Slide 17 text

We are worried about government surveillance; how many faces from the WWW is it possible to scan with $100?

Slide 19

Slide 19 text

Original Version

Slide 20

Slide 20 text

def people_in_pages(url):
    page = get(url)
    for img_url in extract_img_urls(page):
        # fetch the image itself, not the page again
        img = get(img_url)
        for face in detect_faces(img):
            yield face

Slide 21

Slide 21 text

Where could we do two things at the same time?

Slide 22

Slide 22 text

Which portions of the code are I/O bound? (The two get() calls, which wait on the network.)

Slide 23

Slide 23 text

Which portions of the code are CPU bound? (detect_faces(), which does the image processing.)

Slide 24

Slide 24 text

Each loop can be done independently

Slide 25

Slide 25 text

Working Implementation

Slide 26

Slide 26 text

import re
from tempfile import NamedTemporaryFile as NTF
import cv, requests

def people_in_pages(url):
    page = requests.get(url).content
    finds = re.findall('([https://\w\.]+(jpg|png))', page, re.IGNORECASE)
    cascade = cv.Load('/home/tim/Libs/haarcascade_frontalface_alt.xml')
    store = cv.CreateMemStorage()
    for url, extn in finds:
        try:
            image = requests.get(url).content
        except KeyboardInterrupt:
            raise
        except:
            # can't handle relative URLs :(
            continue
        # kludge to make things easy for OpenCV
        with NTF(suffix='.'+extn, delete=False) as f:
            f.write(image)
        image = cv.LoadImageM(f.name, iscolor=True)
        faces = cv.HaarDetectObjects(image, cascade, store)
        for face in faces:
            coordinates, neighbours = face
            yield coordinates
        f.unlink(f.name)

Slide 27

Slide 27 text

Download a page

    page = requests.get(url).content

Slide 28

Slide 28 text

Extract URLs with images in them using a good enough regular expression pattern

    finds = re.findall('([https://\w\.]+(jpg|png))', page, re.IGNORECASE)

Slide 29

Slide 29 text

Matches h, t, p, s, :, /, word characters and dots (redundancy added for readability)

    [https://\w\.]+

Slide 30

Slide 30 text

That finishes in a useful file extension.

    (jpg|png)

Slide 31

Slide 31 text

From the page contents.

    page

Slide 32

Slide 32 text

While ignoring case.

    re.IGNORECASE

Slide 33

Slide 33 text

"cascade" is OpenCV terminology.

    cascade = cv.Load('/home/tim/Libs/haarcascade_frontalface_alt.xml')

Slide 34

Slide 34 text

You can think of it as a model.

Slide 35

Slide 35 text

Allocate a memory buffer.

    store = cv.CreateMemStorage()

Slide 36

Slide 36 text

Each result of re.findall looks like ('http://img.com/hi.jpg', 'jpg')

    for url, extn in finds:
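To see what the pattern actually returns, here is a minimal sketch that runs it over a made-up HTML snippet. The sample markup and the helper name find_image_urls are assumptions for illustration only; the pattern is the one from the slide, written as a raw string so that newer Python versions do not warn about the backslashes.

```python
import re

# Pattern copied from the slide; only the r'' prefix is new.
PATTERN = r'([https://\w\.]+(jpg|png))'

def find_image_urls(page):
    """Return (url, extension) tuples, one per match, as re.findall does."""
    return re.findall(PATTERN, page, re.IGNORECASE)

# Hypothetical page content, not taken from the talk.
sample = '<img src="http://img.com/hi.jpg"> <img src="/static/logo.PNG">'

for url, extn in find_image_urls(sample):
    print(url, extn)
```

Note that the second match is a relative URL. Passing it to requests.get would fail, which is the case the "can't handle relative URLs" comment in the working implementation skips over.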

Slide 37

Slide 37 text

Download the image, ignoring any problems

    try:
        image = requests.get(url).content
    except:
        continue

Slide 38

Slide 38 text

Unless I have caused the problem.

    except KeyboardInterrupt:
        raise

Slide 39

Slide 39 text

Write image to a named temporary file

    with NTF(suffix='.'+extn, delete=False) as f:
        f.write(image)

Slide 40

Slide 40 text

(This approach was easier than deserialising the incoming string, decompressing the data and converting it to an array)

Slide 41

Slide 41 text

Set file extension correctly

    suffix='.'+extn

Slide 42

Slide 42 text

This will enable OpenCV to load it properly later.

Slide 43

Slide 43 text

Prevent automatic file deletion when file is closed.

    delete=False

Slide 44

Slide 44 text

Load an image using the file's name.

    image = cv.LoadImageM(f.name, iscolor=True)

Slide 45

Slide 45 text

Editor's Note: This is wrong

Slide 46

Slide 46 text

Do facial recognition

    faces = cv.HaarDetectObjects(image, cascade, store)

Slide 47

Slide 47 text

Yield (x, y, width, height)

    coordinates, neighbours = face
    yield coordinates

Slide 48

Slide 48 text

Delete the file

    f.unlink(f.name)
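The temporary-file steps above (write, close, load by name, delete) can be sketched on their own with only the standard library. This is an illustration under assumptions, not the talk's code: with_temp_image and the load callback are hypothetical names, load stands in for cv.LoadImageM, and os.unlink is used for the delete step.

```python
import os
from tempfile import NamedTemporaryFile

def with_temp_image(data, extn, load):
    """Write bytes to a named temp file, pass its path to `load`, then delete it.

    `load` is a stand-in for a loader that wants a file name, such as
    OpenCV's image loader in the slides.
    """
    # delete=False keeps the file on disk after the with-block closes it.
    with NamedTemporaryFile(suffix="." + extn, delete=False) as f:
        f.write(data)
    try:
        # The file is closed and flushed here, so the loader sees all bytes.
        return load(f.name)
    finally:
        # Clean up even if the loader raises.
        os.unlink(f.name)
```

Putting the delete in a finally clause is a design choice of this sketch: it avoids leaving files behind when loading fails.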

Slide 50

Slide 50 text

How does that massive function relate to the much smaller one we saw earlier?

Slide 52

Slide 52 text

requests.get(url).content corresponds to: def get(url)

Slide 53

Slide 53 text

re.findall(...) over the page corresponds to: def extract_img_urls(page)

Slide 55

Slide 55 text

The OpenCV steps (load the image, run cv.HaarDetectObjects) correspond to: def detect_faces(image)
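One way to picture the mapping is to split the large function back into the three helpers the short version calls. The sketch below is an assumption-laden illustration, not the author's refactoring: get uses urllib from the standard library rather than requests, detect_faces is passed in as a parameter because the OpenCV step is not reproduced here, and the fetch parameter is a hypothetical hook that makes the function easy to try without a network.

```python
import re
from urllib.request import urlopen

# Same pattern as the slides, compiled once.
IMG_URL = re.compile(r'([https://\w\.]+(jpg|png))', re.IGNORECASE)

def get(url):
    """Download a URL and return the raw bytes (the I/O-bound step)."""
    with urlopen(url) as response:
        return response.read()

def extract_img_urls(page):
    """Return the image URLs found in a page of HTML text."""
    return [match[0] for match in IMG_URL.findall(page)]

def people_in_pages(url, detect_faces, fetch=get):
    """Yield every face that `detect_faces` finds in the images on a page.

    `detect_faces` takes image bytes and returns an iterable of faces;
    it stands in for the OpenCV code (the CPU-bound step).
    """
    page = fetch(url).decode('utf-8', 'replace')
    for img_url in extract_img_urls(page):
        img = fetch(img_url)
        for face in detect_faces(img):
            yield face
```

With the helpers separated like this, each one can be handed to a pool of workers independently.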

Slide 56

Slide 56 text

Introducing multiprocessing.Pool

Slide 57

Slide 57 text

Introducing multiprocessing.Pool

from multiprocessing import Pool, cpu_count

P = Pool(cpu_count()-1)

def people_in_pages(url):
    page = get(url)
    img_urls = extract_img_urls(page)
    imgs = P.map(get, img_urls)
    faces = P.map(detect_faces, imgs)
    return faces
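A self-contained sketch of the same Pool.map idea follows. The detect_faces here is a made-up stand-in (it counts the byte "F") so the example runs without OpenCV, and count_faces is a hypothetical helper name. Unlike the slide, the pool is created inside a function and the script entry point is guarded, which is needed on platforms that start worker processes by re-importing the main module.

```python
from multiprocessing import Pool

def detect_faces(img):
    """Stand-in for the CPU-bound step: pretend each b'F' byte is a face."""
    return img.count(b"F")

def count_faces(imgs, processes=2):
    """Run detect_faces over all images using a pool of worker processes."""
    # Pool.map splits `imgs` across the workers and returns results
    # in the same order as the inputs.
    with Pool(processes) as pool:
        return pool.map(detect_faces, imgs)

if __name__ == "__main__":
    # Fake "images", one bytes object each.
    print(count_faces([b"F..F", b"....", b"FFF"]))
```

The worker function must be defined at module level so that it can be pickled and sent to the worker processes.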

Slide 58

Slide 58 text

By the way...
In Python, using threading for I/O bound problems will be just fine. That is, if you are interfacing with the outside world (audio, serial, network, disc, et cetera), then threads are easier.
Multiprocessing is a much better way to distribute work across CPUs.
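For the I/O-bound half of the problem, a thread pool is enough. The sketch below uses concurrent.futures from the standard library, which the talk does not mention; fake_get is a hypothetical stand-in that sleeps instead of touching the network, and fetch_all is an illustrative name.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_get(url):
    """Pretend to download `url`; the sleep simulates waiting on the network."""
    time.sleep(0.05)
    return "contents of " + url

def fetch_all(urls, max_workers=8):
    """Fetch every URL using threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_get, urls))

if __name__ == "__main__":
    print(fetch_all(["http://example.com/a.jpg", "http://example.com/b.jpg"]))
```

While one thread is waiting on its download the others can run, so the waits overlap instead of adding up.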

Slide 59

Slide 59 text

Applying this to webscale

Slide 60

Slide 60 text

from bottle import abort, request, route, run

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % people_in_pages(url)

@route('/')
def index():
    page = ''' Hunt '''
    return page

# host and port must be keyword arguments; run()'s first positional argument is the app
run(host='localhost', port=8000)

Slide 61

Slide 61 text

From previous slide

    people_in_pages(url)

Slide 63

Slide 63 text

Spreading across machines

Slide 64

Slide 64 text

Options in Python

Slide 65

Slide 65 text

Disco, Celery, Roll your own, iPython

Slide 66

Slide 66 text

Disco, Celery, *-RPC, *MQ, Roll your own, iPython

Slide 67

Slide 67 text

Disco, Celery, MPI, *-RPC, *MQ, Roll your own, PBS, LSF, WinHPC, SGE, SSH, iPython

Slide 68

Slide 68 text

XML-RPC

Slide 69

Slide 69 text

You may hate the verbosity of XML

Slide 70

Slide 70 text

but the XML-RPC spec is solid

Slide 71

Slide 71 text

and an implementation exists in the standard library.

Slide 72

Slide 72 text

XML-RPC

try:
    from SimpleXMLRPCServer import SimpleXMLRPCServer
except ImportError:
    from xmlrpc.server import SimpleXMLRPCServer

server = SimpleXMLRPCServer(("localhost", 22022))
server.register_introspection_functions()
server.register_function(people_in_pages, 'people')
server.serve_forever()

Slide 73

Slide 73 text

XML-RPC

server = SimpleXMLRPCServer(("localhost", 22022))
server.register_introspection_functions()
server.register_function(people_in_pages, 'people')
server.serve_forever()

Less scary without the boilerplate

Slide 74

Slide 74 text

For spec compliance

    server.register_introspection_functions()

Slide 75

Slide 75 text

One thing to note: our function with yield will raise an exception, because XML-RPC cannot marshal a generator.

    server.register_function(people_in_pages, 'people')
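One possible workaround is to register a thin wrapper that turns the generator into a list before it is returned. The sketch below is an assumption, not the talk's fix: people_in_pages is a stand-in that yields made-up boxes, people and call_people_once are illustrative names, the Python 3 module names are used, and the server binds to port 0 (any free port) rather than 22022 so the example can run anywhere.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def people_in_pages(url):
    """Stand-in generator: yields made-up (x, y, width, height) boxes."""
    yield (10, 20, 64, 64)
    yield (200, 40, 48, 48)

def people(url):
    # XML-RPC cannot marshal a generator object, so build a list first.
    return list(people_in_pages(url))

def call_people_once(url):
    """Start a server in a background thread, make one call, shut it down."""
    # Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
    server.register_function(people, "people")
    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()
    try:
        host, port = server.server_address
        with ServerProxy("http://%s:%d" % (host, port)) as rpc:
            return rpc.people(url)
    finally:
        server.shutdown()
        server.server_close()

if __name__ == "__main__":
    print(call_people_once("http://example.com/"))
```

The tuples arrive on the client side as lists, because XML-RPC has a single array type.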

Slide 76

Slide 76 text

You can leave the registered function unnamed; the name defaults to fn.__name__

    server.register_function(people_in_pages, 'people')
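The naming rule can be checked without opening a socket by using the dispatcher class that SimpleXMLRPCServer builds on. This is a small sketch with the Python 3 module name; people_in_pages is a stand-in and registered_names is a hypothetical helper.

```python
from xmlrpc.server import SimpleXMLRPCDispatcher

def people_in_pages(url):
    """Stand-in for the real function; returns a made-up bounding box."""
    return [(10, 20, 64, 64)]

def registered_names():
    """Register the function twice and list the names clients could call."""
    dispatcher = SimpleXMLRPCDispatcher()
    # No name given: the function's own __name__ is used.
    dispatcher.register_function(people_in_pages)
    # Explicit name: clients call this one as rpc.people(...).
    dispatcher.register_function(people_in_pages, "people")
    return dispatcher.system_listMethods()

if __name__ == "__main__":
    print(registered_names())
```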

Slide 78

Slide 78 text

How do we connect this to our website?

Slide 79

Slide 79 text

from bottle import abort, request, route, run

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % people_in_pages(url)

@route('/')
def index():
    page = ''' Hunt '''
    return page

# host and port must be passed by keyword; run()'s first positional argument is the app
run(host='localhost', port=8000)

Original

Slide 80

Slide 80 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

New

Slide 81

Slide 81 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

Connect to the XML-RPC server.

Slide 82

Slide 82 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

Use the XML-RPC server.

Slide 83

Slide 83 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

Remember, we renamed the function.

Slide 84

Slide 84 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

Slide 85

Slide 85 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

Has this helped?

Slide 86

Slide 86 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

No

Slide 87

Slide 87 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

We're still only using one core

Slide 88

Slide 88 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

Performance is now slower...

Slide 89

Slide 89 text

try:
    from xmlrpclib import ServerProxy
except ImportError:
    from xmlrpc.client import ServerProxy

from bottle import abort, request, route, run

rpc = ServerProxy('http://localhost:22022')

@route('/faces')
def faces():
    url = request.query['url']
    if not url:
        abort(400)
    return '%s' % rpc.people(url)

...

...everything is serialised as XML.
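
To get a feel for that overhead, the request body can be built by hand. This Python 3 sketch uses xmlrpc.client.dumps, which produces the same kind of payload a ServerProxy sends; the URL is an arbitrary example and the byte counts will vary with the arguments.

```python
import xmlrpc.client

url = "http://example.com/"

# Build the XML document a client would POST for rpc.people(url).
payload = xmlrpc.client.dumps((url,), methodname="people")

# Every call pays for the XML envelope on top of the actual argument.
overhead = len(payload) - len(url)

print(payload)
print("argument: %d chars, envelope: %d chars" % (len(url), overhead))
```

The response is wrapped the same way, and both sides also pay for parsing, which is one plausible reason the RPC version is slower than the direct function call.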

Slide 90

Slide 90 text

How can we use the standard library to create something more useful?

Slide 91

Slide 91 text

The SocketServer/socketserver module provides a ThreadingTCPServer...

Slide 92

Slide 92 text

...but that won't help with CPU-bound problems.
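
A sketch of why, assuming CPython 3: a threaded XML-RPC server handles each request in a new thread, but every thread lives in the same interpreter process and so shares one GIL. The class name ThreadingXMLRPCServer and the function where_am_i are made up for this illustration.

```python
import os
import threading
from socketserver import ThreadingMixIn
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class ThreadingXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    """XML-RPC server that handles each request in its own thread."""
    daemon_threads = True


def where_am_i():
    """Report which process and thread is handling the call."""
    return [os.getpid(), threading.current_thread().name]


server = ThreadingXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(where_am_i)
threading.Thread(target=server.serve_forever, daemon=True).start()

rpc = ServerProxy("http://%s:%d" % server.server_address)
pid, thread_name = rpc.where_am_i()

server.shutdown()
server.server_close()

# Same process as the caller, different thread: no extra core is used
# for CPU-bound Python code.
same_process = (pid == os.getpid())
```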

Slide 93

Slide 93 text

Let's adapt the SimpleXMLRPCServer code for our own use.

Slide 94

Slide 94 text

We'll fork an OS process for each call.
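
As a reminder of what forking per call means, here is a bare-bones sketch using os.fork directly (Unix only, Python 3). The helper run_in_child and the squaring function are hypothetical; the server classes shown later hide this plumbing.

```python
import os


def run_in_child(func, arg):
    """Run func(arg) in a forked child; return (child_pid, result as text)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: compute, send the result back, then exit at once
        # so the child never continues running the parent's code.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as pipe:
                pipe.write(str(func(arg)))
            status = 0
        finally:
            os._exit(status)
    # Parent process: read the child's answer and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        output = pipe.read()
    os.waitpid(pid, 0)
    return pid, output


child_pid, output = run_in_child(lambda n: n * n, 12)
```

Because the work happens in a separate process, a crash there does not take down the parent, and the operating system is free to schedule the child on another core.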

Slide 95

Slide 95 text

class SimpleXMLRPCServer(SocketServer.TCPServer,
                         SimpleXMLRPCDispatcher):
    """Simple XML-RPC server.

    Simple XML-RPC server that allows functions and a single instance
    to be installed to handle requests. The default implementation
    attempts to dispatch XML-RPC calls to the functions or instance
    installed in the server. Override the _dispatch method inhereted
    from SimpleXMLRPCDispatcher to change this behavior.
    """

    allow_reuse_address = True

    # Warning: this is for debugging purposes only! Never set this to True in
    # production code, as will be sending out sensitive information (exception
    # and stack trace details) when exceptions are raised inside
    # SimpleXMLRPCRequestHandler.do_POST
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False, encoding=None, bind_and_activate=True):
        self.logRequests = logRequests

        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr, requestHandler, bind_and_activate)

        # [Bug #1222790] If possible, set close-on-exec flag; if a
        # method spawns a subprocess, the subprocess shouldn't have
        # the listening socket open.
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Slide 96

Slide 96 text

class SimpleXMLRPCServer(SocketServer.TCPServer,
                         SimpleXMLRPCDispatcher):
    """Simple XML-RPC server.

    Simple XML-RPC server that allows functions and a single instance
    to be installed to handle requests. The default implementation
    attempts to dispatch XML-RPC calls to the functions or instance
    installed in the server. Override the _dispatch method inhereted
    from SimpleXMLRPCDispatcher to change this behavior.
    """

    allow_reuse_address = True

    # Warning: this is for debugging purposes only! Never set this to True in
    # production code, as will be sending out sensitive information (exception
    # and stack trace details) when exceptions are raised inside
    # SimpleXMLRPCRequestHandler.do_POST
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False, encoding=None, bind_and_activate=True):
        self.logRequests = logRequests

        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr, requestHandler, bind_and_activate)

        # [Bug #1222790] If possible, set close-on-exec flag; if a
        # method spawns a subprocess, the subprocess shouldn't have
        # the listening socket open.
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

http://hg.python.org/cpython/file/2.7/Lib/SimpleXMLRPCServer.py#l569

Slide 97

Slide 97 text

class SimpleXMLRPCServer(SocketServer.TCPServer,
                         SimpleXMLRPCDispatcher):
    """Simple XML-RPC server.

    Simple XML-RPC server that allows functions and a single instance
    to be installed to handle requests. The default implementation
    attempts to dispatch XML-RPC calls to the functions or instance
    installed in the server. Override the _dispatch method inhereted
    from SimpleXMLRPCDispatcher to change this behavior.
    """

    allow_reuse_address = True

    # Warning: this is for debugging purposes only! Never set this to True in
    # production code, as will be sending out sensitive information (exception
    # and stack trace details) when exceptions are raised inside
    # SimpleXMLRPCRequestHandler.do_POST
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False, encoding=None, bind_and_activate=True):
        self.logRequests = logRequests

        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr, requestHandler, bind_and_activate)

        # [Bug #1222790] If possible, set close-on-exec flag; if a
        # method spawns a subprocess, the subprocess shouldn't have
        # the listening socket open.
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Urgh

Slide 98

Slide 98 text

class SimpleXMLRPCServer(SocketServer.TCPServer,
                         SimpleXMLRPCDispatcher):
    """Simple XML-RPC server.

    Simple XML-RPC server that allows functions and a single instance
    to be installed to handle requests. The default implementation
    attempts to dispatch XML-RPC calls to the functions or instance
    installed in the server. Override the _dispatch method inhereted
    from SimpleXMLRPCDispatcher to change this behavior.
    """

    allow_reuse_address = True

    # Warning: this is for debugging purposes only! Never set this to True in
    # production code, as will be sending out sensitive information (exception
    # and stack trace details) when exceptions are raised inside
    # SimpleXMLRPCRequestHandler.do_POST
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False, encoding=None, bind_and_activate=True):
        self.logRequests = logRequests

        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr, requestHandler, bind_and_activate)

        # [Bug #1222790] If possible, set close-on-exec flag; if a
        # method spawns a subprocess, the subprocess shouldn't have
        # the listening socket open.
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Let's simplify

Slide 99

Slide 99 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Slide 100

Slide 100 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Important things to note

Slide 101

Slide 101 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Subclassing from SocketServer mixin...

Slide 102

Slide 102 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

...and dispatcher.

Slide 103

Slide 103 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Perhaps we can just use SocketServer.ForkingTCPServer?

Slide 104

Slide 104 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Watch out for hard-coded values!

Slide 105

Slide 105 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

This thing appears to be a workaround for a bug.

Slide 106

Slide 106 text

class SimpleXMLRPCServer(SocketServer.TCPServer, SimpleXMLRPCDispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, requestHandler=SimpleXMLRPCRequestHandler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        SimpleXMLRPCDispatcher.__init__(self, allow_none, encoding)
        SocketServer.TCPServer.__init__(self, addr,
                                        requestHandler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            flags = fcntl.fcntl(self.fileno(), fcntl.F_GETFD)
            flags |= fcntl.FD_CLOEXEC
            fcntl.fcntl(self.fileno(), fcntl.F_SETFD, flags)

Let's leave it alone.

Slide 107

Slide 107 text

Our implementation:

Slide 108

Slide 108 text

import fcntl  # needed for workaround, now hidden
from SocketServer import ForkingTCPServer as Server
from SimpleXMLRPCServer import SimpleXMLRPCDispatcher as Dispatcher
from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler as Handler

class ForkingXMLRPCServer(Server, Dispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, request_handler=Handler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        Dispatcher.__init__(self, allow_none, encoding)
        Server.__init__(self, addr, request_handler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            ...

Slide 109

Slide 109 text

import fcntl  # needed for workaround, now hidden
from SocketServer import ForkingTCPServer as Server
from SimpleXMLRPCServer import SimpleXMLRPCDispatcher as Dispatcher
from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler as Handler

class ForkingXMLRPCServer(Server, Dispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, request_handler=Handler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        Dispatcher.__init__(self, allow_none, encoding)
        Server.__init__(self, addr, request_handler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            ...

Here is the magic.
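
On Python 3 a similar effect can be had with less code. The sketch below is not the slide's implementation: it composes socketserver.ForkingMixIn with the stock SimpleXMLRPCServer instead of pairing ForkingTCPServer with the dispatcher, it is Unix only, and worker_pid is a made-up function that reports which process served the call.

```python
import os
import threading
from socketserver import ForkingMixIn
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class ForkingXMLRPCServer(ForkingMixIn, SimpleXMLRPCServer):
    """XML-RPC server that forks a child process for every request."""


def worker_pid():
    """Return the PID of the process that handled this call."""
    return os.getpid()


server = ForkingXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(worker_pid)

# The accept loop runs in a background thread of this (parent) process;
# each accepted request is then handled in a forked child.
threading.Thread(target=server.serve_forever, daemon=True).start()

rpc = ServerProxy("http://%s:%d" % server.server_address)
pids = [rpc.worker_pid() for _ in range(3)]

server.shutdown()
server.server_close()

print("parent:", os.getpid(), "workers:", pids)
```

Each reported PID differs from the parent's, which is the point: CPU-bound calls now run in their own processes and can be scheduled on separate cores. Functions must be registered before requests arrive, since children only see state that existed at fork time.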

Slide 110

Slide 110 text

What about gevent?

Slide 111

Slide 111 text

gevent is great for problems bound by I/O

Slide 112

Slide 112 text

from gevent import monkey
monkey.patch_all()

import fcntl  # needed for workaround, now hidden
from SocketServer import ThreadingTCPServer as Server
from SimpleXMLRPCServer import SimpleXMLRPCDispatcher as Dispatcher
from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler as Handler

class ForkingXMLRPCServer(Server, Dispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, request_handler=Handler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        Dispatcher.__init__(self, allow_none, encoding)
        Server.__init__(self, addr, request_handler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            ...

Slide 113

Slide 113 text

from gevent import monkey
monkey.patch_all()

import fcntl  # needed for workaround, now hidden
from SocketServer import ThreadingTCPServer as Server
from SimpleXMLRPCServer import SimpleXMLRPCDispatcher as Dispatcher
from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler as Handler

class ForkingXMLRPCServer(Server, Dispatcher):
    allow_reuse_address = True
    _send_traceback_header = False

    def __init__(self, addr, request_handler=Handler,
                 logRequests=True, allow_none=False,
                 encoding=None, bind_and_activate=True):
        self.logRequests = logRequests
        Dispatcher.__init__(self, allow_none, encoding)
        Server.__init__(self, addr, request_handler, bind_and_activate)
        if fcntl is not None and hasattr(fcntl, 'FD_CLOEXEC'):
            ...

Can you spot the differences?

Slide 114

Slide 114 text

[Same code as Slide 112.]

gevent monkeypatches socket...

Slide 115

Slide 115 text

[Same code as Slide 112.]

...meaning you can use “threads”.
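The threaded XML-RPC server idea can be sketched with the standard library alone. This is a minimal sketch, not the code from the talk: it uses Python 3 module names (`socketserver`, `xmlrpc.server`) rather than the Python 2 names on the slides, and `ThreadedXMLRPCServer`, `word_count` and `start_server` are hypothetical names chosen for illustration. With gevent installed, calling `gevent.monkey.patch_all()` before these imports should swap the OS threads for greenlets, but that step is left out here.

```python
import threading
from socketserver import ThreadingMixIn
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class ThreadedXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    """XML-RPC server that handles each request in its own thread."""
    daemon_threads = True  # open requests do not block interpreter exit


def word_count(text):
    """Hypothetical stand-in for the task the talk distributes."""
    return len(text.split())


def start_server(host="127.0.0.1", port=0):
    """Serve on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadedXMLRPCServer((host, port), logRequests=False)
    server.register_function(word_count)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_server()
    host, port = server.server_address
    proxy = ServerProxy("http://%s:%d" % (host, port))
    print(proxy.word_count("running python in parallel"))
    server.shutdown()
    server.server_close()
```

Mixing `ThreadingMixIn` into the stock server avoids re-implementing `__init__`, which is the part the slides had to copy from the standard library.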

Slide 116

Slide 116 text

So, what's Celery?

Slide 117

Slide 117 text

A task queue that is even easier to use.

Slide 118

Slide 118 text

As a bonus, it can support both gevent for I/O intensive tasks & multiprocessing for CPU intensive tasks (at the same time).

Slide 119

Slide 119 text

As a bonus, it can support both gevent for I/O intensive tasks & multiprocessing for CPU intensive tasks (at the same time).

You can use OS threads for I/O tasks, but you will hit a soft maximum of 200 concurrent workers, instead of over 1000 with gevent.
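The point about using OS threads for I/O-bound tasks can be illustrated without Celery. The following is a rough sketch using the Python 3 standard library's `concurrent.futures`, not Celery's worker pools; `fake_fetch` and `fetch_all` are hypothetical names, and `time.sleep` stands in for waiting on the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Simulated I/O-bound task: sleeping stands in for a network wait."""
    time.sleep(delay)
    return url, len(url)


def fetch_all(urls, max_workers=32):
    """Run the simulated fetches on a pool of OS threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(64)]
    started = time.perf_counter()
    results = fetch_all(urls)
    elapsed = time.perf_counter() - started
    print("%d pages in %.2fs" % (len(results), elapsed))
```

Because the threads spend their time waiting rather than computing, many of them can overlap; CPU-bound work would not benefit in the same way and is where separate processes come in.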

Slide 120

Slide 120 text

Celery

import celery

@celery.task
def people_in_pages(url):
    ...

Slide 121

Slide 121 text

IPython

Slide 122

Slide 122 text

IPython looks just like a shell, but it also has very stable support for distributed programming.

Slide 123

Slide 123 text

An IPython Cluster

[Diagram: Controller ... Engine 0 ...]

Slide 124

Slide 124 text

An IPython Cluster

[Diagram: Controller ... Engine 0 ...]

Controllers are your interface to one or more engines.

Slide 125

Slide 125 text

An IPython Cluster

[Diagram: Controller ... Engine 0 ...]

Engines could be local processes, remote processes via SSH or batch jobs.

Slide 126

Slide 126 text

An IPython Cluster

[Diagram: Controller ... Engine 0 ...]

Schedulers live here, but are hidden from you.

Slide 127

Slide 127 text

$ ipython profile create --parallel profile_name
$ ipcluster start -n 4 --profile=profile_name
$ ipython notebook --profile=profile_name

Slide 128

Slide 128 text

In [1]: from IPython.parallel import Client
In [2]: rc = Client(profile='profile_name')
In [3]: rc.ids
Out[3]: [0, 1, 2, 3]
In [4]: view = rc[:]
In [n]: view.apply_async(people_in_page, url)
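The submit-now, collect-later pattern behind `apply_async` can be tried without a running cluster. This is a sketch under stated assumptions: it uses the standard library's `multiprocessing.pool.ThreadPool` (Python 3) as a local stand-in for IPython engines, and the `people_in_page` body here is a hypothetical placeholder, not the function from the talk. Note that the standard library version takes the arguments as a tuple, whereas the IPython view on the slide takes them positionally.

```python
from multiprocessing.pool import ThreadPool


def people_in_page(url):
    """Hypothetical placeholder task: it only reports the URL's length."""
    return {"url": url, "length": len(url)}


def submit_all(urls, workers=4):
    """Submit one task per URL without blocking, then collect the results."""
    with ThreadPool(workers) as pool:
        pending = [pool.apply_async(people_in_page, (url,)) for url in urls]
        # .get() blocks until that particular task has finished.
        return [result.get(timeout=10) for result in pending]


if __name__ == "__main__":
    for row in submit_all(["http://example.com/a", "http://example.com/b"]):
        print(row)
```

Separating submission from collection lets all tasks start before the caller waits on any one of them.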

Slide 129

Slide 129 text

Some homework

● To do effective distributed programming, you need to think about (at least):
    ● data locality
    ● ease of configuration / maintenance
    ● decomposition
    ● load balancing
    ● message size
    ● efficient serialisation
    ● send references, rather than data, through message queues
    ● ease of debugging
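The advice about message size and sending references can be made concrete with a small comparison. This is an illustrative sketch only: `queue.Queue` stands in for a real message queue, the helper names are hypothetical, and it assumes the sender and receiver can both read the same storage (here, a local directory).

```python
import os
import pickle
import queue
import tempfile


def send_data(q, records):
    """Put the full payload on the queue; returns the message size in bytes."""
    message = pickle.dumps(records)
    q.put(message)
    return len(message)


def send_reference(q, records, directory):
    """Write the payload to shared storage and enqueue only its path."""
    fd, path = tempfile.mkstemp(suffix=".pkl", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(records, handle)
    message = pickle.dumps({"path": path})
    q.put(message)
    return len(message)


def receive(q):
    """Take one message and resolve it to the payload, following a reference."""
    message = pickle.loads(q.get())
    if isinstance(message, dict) and "path" in message:
        with open(message["path"], "rb") as handle:
            return pickle.load(handle)
    return message


if __name__ == "__main__":
    q = queue.Queue()
    records = list(range(10000))
    with tempfile.TemporaryDirectory() as shared:
        print("data message bytes:", send_data(q, records))
        print("reference message bytes:", send_reference(q, records, shared))
        assert receive(q) == records and receive(q) == records
```

The reference message stays small no matter how large the payload grows, at the cost of requiring storage that every worker can reach.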

Slide 130

Slide 130 text

Thanks!