Fan-in and Fan-out: The crucial components of concurrency by Brett Slatkin

PyCon 2014
April 11, 2014

Transcript

  1. Fan-in and Fan-out: The crucial components of concurrency
     Brett Slatkin, Google Inc, PyCon 2014
  2. Agenda
     • Goal
     • Definitions
     • The old way
     • The new way
     • It’s everywhere
     • Links
  3. Reference
     Slides & code: github.com/bslatkin/pycon2014
     Me: onebigfluke.com, @haxor

  4. Why do we need Tulip? PEP 3156 – asyncio

  5. Definitions

  6. Fan-out: when one thread of control spawns one or more new threads of control.
  7. Fan-in: when one thread of control gathers results from one or more separate threads of control.
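     To make both definitions concrete, here is a minimal standard-library sketch (my example, not from the talk): starting the threads is the fan-out, joining them and draining the queue is the fan-in.

         from queue import Queue
         from threading import Thread

         def worker(n, results):
             results.put(n * n)  # each spawned thread of control produces one result

         results = Queue()
         threads = [Thread(target=worker, args=(n, results)) for n in range(5)]
         for t in threads:
             t.start()   # Fan-out: spawn new threads of control
         for t in threads:
             t.join()    # Fan-in: wait for the separate threads of control
         squares = [results.get() for _ in threads]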
  8. Building a web crawler: the old way

  9. Fetch: retrieve a URL

  10. def fetch(url):
          response = urlopen(url)
          assert response.status == 200
          data = response.read()
          assert data
          text = data.decode('utf-8')
          return text
  11. >> fetch('http://example.com')
      '<!doctype html>\n<html>\n<head>\n<title>...'

  12. Extract: find all URLs on a page

  13. def extract(url):
          data = fetch(url)
          found_urls = set()
          for match in URL_EXPR.finditer(data):
              found = match.group('url')
              found_urls.add(found)
          return url, data, found_urls
  14. >> extract('http://example.com')
      ('http://example.com',
       '<!doctype html>\n<html>...',
       set(['http://example.com/foo',
            'http://example.com/bar',
            ...]))
  15. Crawl: breadth-first search of links

  16. def crawl(to_fetch=[]):
          results = []
          for depth in range(MAX_DEPTH + 1):
              batch = extract_multi(to_fetch)
              to_fetch = []
              for url, data, found in batch:
                  results.append((depth, url, data))
                  to_fetch.extend(found)
          return results
  17. def extract_multi(to_fetch):
          results = []
          for url in to_fetch:
              x = extract(url)
              results.append(x)
          return results
  18. >> crawl(['http://example.com'])
      [('http://example.com', '<data>', set([...])),
       ('.../bar', '<data>', set([...])),
       ('.../foo', '<data>', set([...])),
       ...]
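      The snippets above lean on two module-level names that never appear on a slide. A plausible minimal version, purely as an assumption (the regex and depth limit are mine, not from the talk):

          import re
          from urllib.request import urlopen

          MAX_DEPTH = 2  # how many rounds of link-following crawl() performs
          URL_EXPR = re.compile(r'href="(?P<url>https?://[^"]+)"')  # what extract() scans for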
  19. Crawl in parallel: many simultaneous fetches

  20. def crawl_parallel(url):
          fetchq = Queue()
          result = []
          f = lambda: fetcher(fetchq, result)
          for _ in range(3):
              Thread(target=f).start()
          fetchq.put((0, url))
          fetchq.join()
          return result
  21. def fetcher(fetchq, result):
          while True:
              depth, url = fetchq.get()
              try:
                  if depth > MAX_DEPTH:
                      continue
                  _, data, found = extract(url)
                  result.append((depth, url, data))  # GIL
                  for url in found:
                      fetchq.put((depth + 1, url))
              finally:
                  fetchq.task_done()
  22. >> crawl_parallel('http://example.com')
      [('http://example.com', '...', set([...])),
       ('.../bar', '...', set([...])),
       ('.../foo', '...', set([...])),
       ...]
      # Same output, much faster
  23. Concurrent crawls: many simultaneous crawls

  24. Possible, but complex
      See example #12 here: github.com/bslatkin/pycon2014
      ~200 lines of code
      Makes no sense
  25. Building a web crawler: the new way

  26. Fetch: retrieve a URL

  27. # Old way
      def fetch(url):
          response = urlopen(url)
          try:
              assert response.status == 200
              data = response.read()
              assert data
              text = data.decode('utf-8')
              return text
          finally:
              pass
  28. @asyncio.coroutine
      def fetch_async(url):
          response = yield from request('get', url)
          try:
              assert response.status == 200
              data = yield from response.read()
              assert data
              text = data.decode('utf-8')
              return text
          finally:
              response.close()
  29. Extract: find all URLs on a page

  30. # Old way
      def extract(url):
          data = fetch(url)
          found_urls = set()
          for match in URL_EXPR.finditer(data):
              found = match.group('url')
              found_urls.add(found)
          return url, data, found_urls
  31. @asyncio.coroutine
      def extract_async(url):
          data = yield from fetch_async(url)
          found_urls = set()
          for match in URL_EXPR.finditer(data):
              found = match.group('url')
              found_urls.add(found)
          return url, data, found_urls
  32. Crawl: breadth-first search of links

  33. # Old way
      def crawl(to_fetch=[]):
          results = []
          for depth in range(MAX_DEPTH + 1):
              batch = extract_multi(to_fetch)
              to_fetch = []
              for url, data, found in batch:
                  results.append((depth, url, data))
                  to_fetch.extend(found)
          return results
  34. @asyncio.coroutine
      def crawl_async(to_fetch=[]):
          results = []
          for depth in range(MAX_DEPTH + 1):
              batch = yield from ex_multi_async(to_fetch)
              to_fetch = []
              for url, data, found in batch:
                  results.append((depth, url, data))
                  to_fetch.extend(found)
          return results
  35. # Old way
      def extract_multi(to_fetch):
          results = []
          for url in to_fetch:
              x = extract(url)
              results.append(x)
          return results
  36. @asyncio.coroutine
      def ex_multi_async(to_fetch):
          results = []
          for url in to_fetch:
              x = yield from extract_async(url)
              results.append(x)
          return results
  37. Crawl in parallel: many simultaneous fetches

  38. @asyncio.coroutine
      def ex_multi_async(to_fetch):
          results = []
          for url in to_fetch:
              x = yield from extract_async(url)
              results.append(x)
          return results
  39. @asyncio.coroutine
      def ex_multi_async(to_fetch):
          futures, results = [], []
          for url in to_fetch:
              futures.append(extract_async(url))  # Fan-out: start every fetch
          for future in asyncio.as_completed(futures):
              results.append((yield from future))  # Fan-in: gather as each finishes
          return results
  40. @asyncio.coroutine  # No changes
      def crawl_async(to_fetch=[]):
          results = []
          for depth in range(MAX_DEPTH + 1):
              batch = yield from ex_multi_async(to_fetch)
              to_fetch = []
              for url, data, found in batch:
                  results.append((depth, url, data))
                  to_fetch.extend(found)
          return results
  41. >> crawl_async(['http://example.com'])
      [('http://example.com', '...', set([...])),
       ('.../bar', '...', set([...])),
       ('.../foo', '...', set([...])),
       ...]
      # Same output, much faster, 4 line delta
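      One detail the interactive-style prompt glosses over: a coroutine only runs when the event loop drives it. A minimal way to execute the crawl with the Python 3.4-era API (my wiring, not shown on the slides):

          import asyncio

          loop = asyncio.get_event_loop()
          # run_until_complete drives the coroutine and returns its result
          results = loop.run_until_complete(crawl_async(['http://example.com']))
          loop.close()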
  42. Concurrent crawls: many simultaneous crawls

  43. class MyServer(ServerHttpProtocol):
          @asyncio.coroutine
          def handle_request(self, message, payload):
              data = yield from payload.read()
              url = get_url_param(data)
              result = yield from crawl_async([url])
              response = Response(self.writer, 200)
              response.write(get_message(result))
              response.write_eof()
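      A rough sketch of how such a protocol could be served with plain asyncio (assuming MyServer can be constructed with no arguments, as in the aiohttp examples of that era; this wiring is mine, not from the talk):

          import asyncio

          loop = asyncio.get_event_loop()
          # Each incoming connection fans out into its own MyServer instance,
          # and each handler fans in the results of its own crawl.
          server = loop.run_until_complete(
              loop.create_server(MyServer, '127.0.0.1', 8080))
          loop.run_forever()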
  44. It’s everywhere

  45. SQL
      SELECT Customer.id, sum(Order.cost)
      FROM Customer, Order
      WHERE Customer.id = Order.id  -- Fan-out
      GROUP BY Customer.id          -- Fan-in
  46. Map Reduce
      def map(text):
          for word in WORD_EXPR.finditer(text):
              yield (word, 1)  # Fan-out

      def reduce(word, count_iter):
          total = 0
          for count in count_iter:
              total += count
          yield (word, total)  # Fan-in
  47. Measurement
      • Histograms
      • Reservoir samplers (sketched below)
      • Profilers
      • Estimators
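      As one concrete instance from that list: a reservoir sampler fans an arbitrarily long stream of measurements into a fixed-size uniform sample. This is the textbook Algorithm R, not code from the talk.

          import random

          def reservoir_sample(stream, k):
              sample = []
              for i, value in enumerate(stream):  # Fan-in: fold the whole stream
                  if i < k:
                      sample.append(value)
                  else:
                      j = random.randint(0, i)  # keep each item with probability k/(i+1)
                      if j < k:
                          sample[j] = value
              return sample

          # e.g. reservoir_sample(range(10**6), 100) keeps a 100-item uniform sample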

  48. Links
      • PEP 3156 – asyncio
      • Google App Engine’s NDB library (by Guido)
      • C# async / await – Promises & ES7 generators
      • Rob Pike: “Concurrency is not Parallelism”
      • Slides: github.com/bslatkin/pycon2014
      • Me: onebigfluke.com – @haxor