
Fan-in and Fan-out: The crucial components of concurrency by Brett Slatkin

PyCon 2014
April 11, 2014

Transcript

  1. Fan-in and Fan-out
    The crucial components of concurrency
    Brett Slatkin
    Google Inc
    PyCon 2014

  2. Agenda
    •Goal
    •Definitions
    •The old way
    •The new way
    •It’s everywhere
    •Links

  3. Reference
    Slides & code
    github.com/bslatkin/pycon2014
    Me
    onebigfluke.com
    @haxor

  4. Why do we need Tulip?
    PEP 3156 – asyncio

  5. Definitions

  6. Fan-out
    When one thread of control spawns one or more new
    threads of control.

  7. Fan-in
    When one thread of control gathers results from one or
    more separate threads of control.
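
    The same two motions show up in ordinary standard-library code. As a minimal sketch (not from the deck), using concurrent.futures with an illustrative work() function:

        from concurrent.futures import ThreadPoolExecutor, as_completed

        def work(n):
            return n * n

        def fan_out_fan_in(items):
            with ThreadPoolExecutor(max_workers=3) as pool:
                futures = [pool.submit(work, n) for n in items]      # Fan-out: spawn workers
                return [f.result() for f in as_completed(futures)]   # Fan-in: gather results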

  8. Building a web crawler
    The old way

  9. Retrieve a URL
    Fetch

  10. def fetch(url):
        response = urlopen(url)
        assert response.status == 200
        data = response.read()
        assert data
        text = data.decode('utf-8')
        return text

  11. >>> fetch('http://example.com')
      '\n\n\n...'

  12. Find all URLs on a page
    Extract

  13. def extract(url):
        data = fetch(url)
        found_urls = set()
        for match in URL_EXPR.finditer(data):
            found = match.group('url')
            found_urls.add(found)
        return url, data, found_urls

  14. >>> extract('http://example.com')
      (
          'http://example.com',
          '\n...',
          set([
              'http://example.com/foo',
              'http://example.com/bar',
              ...
          ])
      )

  15. Breadth-first search of links
    Crawl

  16. def crawl(to_fetch=[]):
        results = []
        for depth in range(MAX_DEPTH + 1):
            batch = extract_multi(to_fetch)
            to_fetch = []
            for url, data, found in batch:
                results.append((depth, url, data))
                to_fetch.extend(found)
        return results

  17. def extract_multi(to_fetch):
        results = []
        for url in to_fetch:
            x = extract(url)
            results.append(x)
        return results

  18. >>> crawl(['http://example.com'])
      [
          ('http://example.com', '', set([...])),
          ('.../bar', '', set([...])),
          ('.../foo', '', set([...])),
          ...
      ]

  19. Many simultaneous fetches
    Crawl in parallel

  20. def crawl_parallel(url):
        fetchq = Queue()
        result = []
        f = lambda: fetcher(fetchq, result)
        for _ in range(3): Thread(target=f).start()
        fetchq.put((0, url))
        fetchq.join()
        return result

  21. def fetcher(fetchq, result):
        while True:
            depth, url = fetchq.get()
            try:
                if depth > MAX_DEPTH: continue
                _, data, found = extract(url)
                result.append((depth, url, data))  # GIL makes list.append thread-safe
                for url in found:
                    fetchq.put((depth + 1, url))
            finally:
                fetchq.task_done()

  22. >>> crawl_parallel('http://example.com')
      [
          ('http://example.com', '...', set([...])),
          ('.../bar', '...', set([...])),
          ('.../foo', '...', set([...])),
          ...
      ]
      # Same output, much faster

  23. Many simultaneous crawls
    Concurrent crawls

  24. Possible, but complex
    See example #12 here:
    github.com/bslatkin/pycon2014
    ~200 lines of code
    Makes no sense

  25. Building a web crawler
    The new way

  26. Retrieve a URL
    Fetch

  27. # Old way
      def fetch(url):
          response = urlopen(url)
          try:
              assert response.status == 200
              data = response.read()
              assert data
              text = data.decode('utf-8')
              return text
          finally:
              pass

  28. @asyncio.coroutine
      def fetch_async(url):
          response = yield from request('get', url)  # request() is an async HTTP client call (aiohttp)
          try:
              assert response.status == 200
              data = yield from response.read()
              assert data
              text = data.decode('utf-8')
              return text
          finally:
              response.close()

  29. Find all URLs on a page
    Extract

  30. # Old way
      def extract(url):
          data = fetch(url)
          found_urls = set()
          for match in URL_EXPR.finditer(data):
              found = match.group('url')
              found_urls.add(found)
          return url, data, found_urls

  31. @asyncio.coroutine
      def extract_async(url):
          data = yield from fetch_async(url)
          found_urls = set()
          for match in URL_EXPR.finditer(data):
              found = match.group('url')
              found_urls.add(found)
          return url, data, found_urls

  32. Breadth-first search of links
    Crawl

  33. # Old way
      def crawl(to_fetch=[]):
          results = []
          for depth in range(MAX_DEPTH + 1):
              batch = extract_multi(to_fetch)
              to_fetch = []
              for url, data, found in batch:
                  results.append((depth, url, data))
                  to_fetch.extend(found)
          return results

  34. @asyncio.coroutine
      def crawl_async(to_fetch=[]):
          results = []
          for depth in range(MAX_DEPTH + 1):
              batch = yield from ex_multi_async(to_fetch)
              to_fetch = []
              for url, data, found in batch:
                  results.append((depth, url, data))
                  to_fetch.extend(found)
          return results

  35. # Old way
      def extract_multi(to_fetch):
          results = []
          for url in to_fetch:
              x = extract(url)
              results.append(x)
          return results

  36. @asyncio.coroutine
      def ex_multi_async(to_fetch):
          results = []
          for url in to_fetch:
              x = yield from extract_async(url)
              results.append(x)
          return results

  37. Many simultaneous fetches
    Crawl in parallel

  38. @asyncio.coroutine
      def ex_multi_async(to_fetch):
          results = []
          for url in to_fetch:
              x = yield from extract_async(url)
              results.append(x)
          return results

  39. @asyncio.coroutine
      def ex_multi_async(to_fetch):
          futures, results = [], []
          for url in to_fetch:
              futures.append(extract_async(url))       # Fan-out: start every fetch
          for future in asyncio.as_completed(futures):
              results.append((yield from future))      # Fan-in: gather as each finishes
          return results

  40. @asyncio.coroutine  # No changes
      def crawl_async(to_fetch=[]):
          results = []
          for depth in range(MAX_DEPTH + 1):
              batch = yield from ex_multi_async(to_fetch)
              to_fetch = []
              for url, data, found in batch:
                  results.append((depth, url, data))
                  to_fetch.extend(found)
          return results

  41. >>> crawl_async(['http://example.com'])
      [
          ('http://example.com', '...', set([...])),
          ('.../bar', '...', set([...])),
          ('.../foo', '...', set([...])),
          ...
      ]
      # Same output, much faster, 4 line delta
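
    A coroutine like crawl_async only runs when an event loop drives it; the >>> transcript above elides that step. A minimal way to run it with Python 3.4's asyncio:

        import asyncio

        loop = asyncio.get_event_loop()
        result = loop.run_until_complete(crawl_async(['http://example.com']))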

  42. Many simultaneous crawls
    Concurrent crawls

  43. class MyServer(ServerHttpProtocol):
          @asyncio.coroutine
          def handle_request(self, message, payload):
              data = yield from payload.read()
              url = get_url_param(data)
              result = yield from crawl_async([url])
              response = Response(self.writer, 200)
              response.write(get_message(result))
              response.write_eof()
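
    The slide shows only the request handler. Assuming the aiohttp-era ServerHttpProtocol above, one way to bind it to a port is asyncio's standard create_server call:

        loop = asyncio.get_event_loop()
        server = loop.run_until_complete(
            loop.create_server(lambda: MyServer(), '127.0.0.1', 8080))
        loop.run_forever()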

  44. It’s everywhere

  45. SQL
      SELECT Customer.id, SUM(Order.cost)
      FROM Customer, Order
      WHERE Customer.id = Order.customer_id  -- Fan-out
      GROUP BY Customer.id                   -- Fan-in

  46. Map Reduce
      def map(text):
          for word in WORD_EXPR.findall(text):
              yield (word, 1)  # Fan-out

      def reduce(word, count_iter):
          total = 0
          for count in count_iter:
              total += count
          yield (word, total)  # Fan-in
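
    The deck shows map and reduce but not the driver that connects them; the shuffle step between the two is what gathers the fanned-out pairs back together. A rough sketch, assuming the map() and reduce() defined above are in scope:

        from collections import defaultdict

        def map_reduce(texts):
            groups = defaultdict(list)
            for text in texts:                  # Fan-out: one map() per input
                for word, count in map(text):
                    groups[word].append(count)  # Shuffle: group counts by key
            return [pair                        # Fan-in: one reduce() per key
                    for word, counts in groups.items()
                    for pair in reduce(word, counts)]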

  47. Measurement
    •Histograms
    •Reservoir samplers
    •Profilers
    •Estimators
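
    None of these are spelled out in the deck, but the shape is the same: many samples produced independently (fan-out), one aggregate built from them (fan-in). A rough sketch of the histogram case, with an illustrative sample_latency() standing in for a real measurement:

        import random
        from collections import Counter
        from concurrent.futures import ThreadPoolExecutor

        def sample_latency(url):
            return random.expovariate(1 / 100)  # fake latency in ms, for illustration

        def latency_histogram(urls, bucket_ms=50):
            with ThreadPoolExecutor(max_workers=3) as pool:
                samples = list(pool.map(sample_latency, urls))    # Fan-out: one probe per URL
            return Counter(int(s // bucket_ms) for s in samples)  # Fan-in: counts per bucket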

  48. Links
    •PEP 3156 – asyncio
    •Google App Engine’s NDB library (by Guido)
    •C# async / await – Promises & ES7 generators
    •Rob Pike: “Concurrency is not Parallelism”
    •Slides: github.com/bslatkin/pycon2014
    •Me: onebigfluke.com – @haxor
