PyParallel - PyCon 2015 Language Summit

Trent Nelson

April 08, 2015

Transcript

  1. PyParallel: PyCon 2015 Language Summit. Trent Nelson.

     Hopefully superseded by the following, based on my 10-hour train ride tomorrow/yesterday: http://download.pyparallel.org/pycon2015-langsummit.pdf. But if not…
  2. PyParallel Progress

     • Not a lot of progress between PyCon 2013 and ~end of 2014.
     • Did some presentations that were well received:
       • https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores (a 154-slide brain dump)
       • https://speakerdeck.com/trent/parallelism-and-concurrency-with-python
     • Started sprinting on code again on the 18th of December last year (2014).
     • Focused on getting things stable under load for the TechEmpower Frameworks Benchmark:
       • https://www.techempower.com/blog/2014/05/01/framework-benchmarks-round-9/
  3. TechEmpower Frameworks Benchmark

     • http://server:8080/json
       • -> { ‘message’: ‘Hello, World!’ }
     • http://server:8080/plaintext
       • -> HTTP response with ‘Hello World!’ as the body.
       • (A standard-library sketch of these two endpoints follows below.)
     • Tested via wrk, e.g.:
       % ./wrk --latency --connections 16 --threads 16 --duration 30 http://server:8080/json
     • Is it simple and riddled with flaws when you start poking at it? Yes.
     • Is it still useful? Also yes.
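     For concreteness, here is a minimal sketch of what the two benchmark endpoints return. It uses only the Python standard library rather than PyParallel's own server API (the handler class name is invented for illustration); the deck only specifies the URLs, the port, and the response bodies.

         import json
         from http.server import BaseHTTPRequestHandler, HTTPServer

         class BenchmarkHandler(BaseHTTPRequestHandler):
             def do_GET(self):
                 if self.path == '/json':
                     body = json.dumps({'message': 'Hello, World!'}).encode()
                     content_type = 'application/json'
                 elif self.path == '/plaintext':
                     body = b'Hello World!'
                     content_type = 'text/plain'
                 else:
                     self.send_error(404)
                     return
                 self.send_response(200)
                 self.send_header('Content-Type', content_type)
                 self.send_header('Content-Length', str(len(body)))
                 self.end_headers()
                 self.wfile.write(body)

         if __name__ == '__main__':
             HTTPServer(('', 8080), BenchmarkHandler).serve_forever()

     This is a single-threaded reference for the endpoint behavior only; the whole point of PyParallel is serving the same responses across all cores.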
  4. PyParallel Performance

     • It’s really good at stateless HTTP now; in particular, it maintains low latency under high load (very good 99th-percentile latency).
     • It optimally uses the underlying hardware:
       • Exploits all CPU cores; scales linearly with additional cores.
       • Memory use is proportional to concurrent client count: 50,000 concurrent clients ~= 3GB.
       • Very low kernel overhead: e.g., profiling shows ~98% of time in user space when under load, with only ~2% kernel overhead.
     • (Although that’s just a side effect of optimally using Windows facilities for high-performance I/O, not necessarily anything clever I’ve done in PyParallel.)
  5. [Chart: request latency (milliseconds) against percentile (0-100%) for PyParallel, Tornado, and NodeJS, as measured by: wrk2 --latency -U -c 8 -t 8 -d 25 -R 50000 http://<host>:8080/json (8 threads, 1 connection each, attempting 50,000 req/s for 25 seconds). Labeled tail points: 1.544 ms @ 99.9993%, 3.707 ms @ 100.00%, 7.203 ms @ 99.9995%. Client: MacBook Pro, 1 x i7 (8 cores) @ 2.6GHz, 16GB RAM (OS X 10.10). Server: Mac Pro, 2 x Xeon (2 x 4 cores) @ 3.2GHz, 32GB RAM.]
  6. Wiki Demo (WIP)

     • Stateless HTTP is easy and not that hard to do fast…
     • Wanted a better demo for PyParallel that leveraged its benefits: Wikipedia search!
     • Downloaded Wikipedia (enwiki-20150205-pages-articles.xml -> 50GB).
     • Extracted the byte offsets of all the <title>xyz</title> entries (~15 million titles).
     • Created a digital search tree (datrie, a Cython module) mapping titles to byte offsets.
     • Created a NumPy array of 64-bit unsigned ints storing all the offsets.
     • Wrote a little HTTP server wrapper. (A sketch of the index-building step follows below.)
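     A rough sketch of that index-building step, under stated assumptions: titles.txt is a hypothetical pre-extracted file of "byte_offset<TAB>title" lines in document order, the trie's alphabet is simplified to printable ASCII, and an end-of-file sentinel offset is appended so every page has a successor (the deck doesn't spell out these details).

         import os
         import string

         import datrie
         import numpy as np

         XML = 'enwiki-20150205-pages-articles.xml'

         # Maps each title to the byte offset of its <title> tag. Real
         # titles need a much larger alphabet than printable ASCII.
         trie = datrie.Trie(string.printable)

         offsets = []
         with open('titles.txt', encoding='utf-8') as f:
             for line in f:
                 offset, title = line.rstrip('\n').split('\t', 1)
                 trie[title] = int(offset)
                 offsets.append(int(offset))

         # Offsets are already ascending (document order). Append the
         # file size as a sentinel so the last page also has a "next"
         # offset, then store as 64-bit unsigned ints for searchsorted().
         offsets.append(os.path.getsize(XML))
         offsets = np.array(offsets, dtype=np.uint64)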
  7. • http://laptop/offsets/Python
       • Gets the byte offset of every title starting with ‘Python’.
       • Gets the byte offset of the next page via offsets.searchsorted().
       • Adjusts them (start - 7 bytes, end - 11 bytes)…
       • …and returns JSON of [ [‘<title>’, starting_byte, ending_byte] ].
       • The web client then does an HTTP ranged request against /xml to grab the relevant XML fragment. (A sketch of the lookup and the ranged request follows below.)
     • Or, http://laptop/wiki/Python for an exact lookup: does all of the above, but also issues the range request itself if there’s an exact hit, returning the fragment in one round trip.
     • A non-trivial app that does something half useful; a good use case.
     • Exercises external C modules (NumPy, Cythonized datrie, etc.).
     • Not something you could easily do with existing solutions (multiprocessing). (Could you?)
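     Building on the index sketch above, the /offsets lookup might look like the following; the -7/-11 byte adjustments are the ones named on the slide, and trie.items(prefix) is datrie's prefix-scan API.

         import json

         def lookup_offsets(prefix):
             """JSON of [title, starting_byte, ending_byte] for every
             title starting with `prefix`."""
             results = []
             for title, start in trie.items(prefix):
                 # side='right' skips past `start` itself, and the
                 # sentinel guarantees the last page has a "next" offset.
                 nxt = offsets.searchsorted(start, side='right')
                 end = int(offsets[nxt])
                 results.append([title, start - 7, end - 11])
             return json.dumps(results)

     The client-side ranged request against /xml could then be as simple as:

         import urllib.request

         req = urllib.request.Request(
             'http://laptop/xml',
             headers={'Range': 'bytes=%d-%d' % (start, end)})
         fragment = urllib.request.urlopen(req).read()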
  8. Does it work?

     • Heh, nope.
     • Datrie (the Cython wrapper around the C libdatrie) is exhibiting odd behavior (crashing) on subsequent trie lookups in parallel contexts.
     • It’s showing signs that make me suspect static-memory quirks (similar to the Unicode interning issue); easy enough to work around.
     • Other than that, everything else works. (NumPy will happily work in parallel contexts against arrays allocated by the main thread.)
  9. Other stuff

     • Broke generators in parallel contexts.
     • Temporarily broke exception handling in parallel contexts.
     • Purposely disabled for parallel contexts:
       • Importing.
       • Trace functions.
  10. Stuff to write more about on the train tomorrow:

     • What else did I break?
       • The PyObject struct.
       • Generators.
     • What are the current code restrictions?
     • What could we do for core Python in the short term, such that this could possibly be something we adopt in the long term?
       • Extend the new memory allocator API to include reference counting, perhaps.