PyParallel - PyCon 2015 Language Summit


Trent Nelson

April 08, 2015

Transcript

  1. 1.

    PyParallel
    PyCon 2015 Language Summit
    Trent Nelson

    Hopefully superseded by the following, based on my 10-hour train ride tomorrow/yesterday:
    http://download.pyparallel.org/pycon2015-langsummit.pdf
    But if not…
  2. 2.

    PyParallel Progress

    • Not a lot between PyCon 2013 and ~end 2014
    • Did some presentations that were well received:
      • https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores (154-slide brain dump)
      • https://speakerdeck.com/trent/parallelism-and-concurrency-with-python
    • Started sprinting on code again 18th Dec last year (2014)
    • Focused on getting things stable under load for the TechEmpower Frameworks Benchmark
      • https://www.techempower.com/blog/2014/05/01/framework-benchmarks-round-9/
  3. 3.

    TechEmpower Frameworks Benchmark

    • http://server:8080/json
      • -> { ‘message’: ‘Hello, World!’ }
    • http://server:8080/plaintext
      • -> HTTP response with ‘Hello World!’ as the body
    • Tested via wrk, e.g.:
      % ./wrk --latency --connections 16 --threads 16 --duration 30 http://server:8080/json
    • Is it simple and riddled with flaws when you start poking at it? Yes.
    • Is it still useful? Also yes.
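For reference, the two benchmark endpoints can be mocked up with nothing but the standard library. This is a stdlib sketch of the expected responses only, not PyParallel's actual server API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class BenchmarkHandler(BaseHTTPRequestHandler):
    """Serve the two TechEmpower-style endpoints: /json and /plaintext."""

    def do_GET(self):
        if self.path == '/json':
            # JSON test: serialize {'message': 'Hello, World!'}
            body = json.dumps({'message': 'Hello, World!'}).encode('utf-8')
            ctype = 'application/json'
        elif self.path == '/plaintext':
            # Plaintext test: literal 'Hello World!' body
            body = b'Hello World!'
            ctype = 'text/plain'
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header('Content-Type', ctype)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To serve it, run `HTTPServer(('', 8080), BenchmarkHandler).serve_forever()` and point wrk at port 8080 as shown above (the single-threaded stdlib server is of course nothing like PyParallel performance-wise; it only illustrates the endpoint contract).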
  4. 4.

    PyParallel Performance

    • It’s really good at stateless HTTP now; in particular, maintaining low latency under high load (very good 99th-percentile latency)
    • It optimally uses the underlying hardware
      • Exploits all CPU cores, scales linearly with additional cores
    • Memory use proportional to concurrent client count
      • 50,000 concurrent clients ~= 3GB
    • Very low kernel overhead, e.g. profiling shows ~98% of time in user space when under load, only 2% kernel overhead
      • (Although that’s just a side effect of optimally using Windows facilities for high-performance I/O, not necessarily anything clever I’ve done in PyParallel.)
  5. 5.

    [Chart: percentile request latency (milliseconds) for PyParallel, Tornado, and NodeJS; labeled data points at 1.544 ms (99.9993%), 3.707 ms (100.00%), and 7.203 ms (99.9995%).]
    wrk2 --latency -U -c 8 -t 8 -d 25 -R 50000 http://<host>:8080/json
    (8 threads, 1 connection each, attempt 50,000 req/s, 25 seconds)
    Client: MacBook Pro, 1 x i7 (8 cores) @ 2.6GHz, 16GB RAM (OS X 10.10)
    Server: Mac Pro, 2 x Xeon (2 x 4 cores) @ 3.2GHz, 32GB RAM
  6. 6.
  7. 7.

    Wiki Demo (WIP)

    • Stateless HTTP is easy and not that hard to do fast…
    • Wanted a better demo for PyParallel that leveraged its benefits
    • Wikipedia search!
      • Downloaded Wikipedia (enwiki-20150205-pages-articles.xml -> 50GB)
      • Extracted byte offsets of all the <title>xyz</title> entries (~15 million titles)
      • Created a digital search tree (datrie, a Cython module) mapping titles to byte offsets
      • Created a NumPy array of 64-bit unsigned ints storing all the offsets
      • Wrote a little HTTP server wrapper
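The index-building step above might be sketched roughly as follows, with a plain dict standing in for the datrie trie (the real demo used datrie.Trie for the title -> offset mapping, and its exact offset conventions may differ):

```python
import re

import numpy as np


def build_title_index(xml_bytes):
    """Scan a Wikipedia XML dump for <title>...</title> entries and
    record the byte offset of each match.

    Returns a title -> offset mapping (dict stand-in for the trie) and
    a sorted uint64 array of all offsets for later searchsorted() use.
    """
    titles = {}
    offsets = []
    for m in re.finditer(rb'<title>(.*?)</title>', xml_bytes):
        titles[m.group(1).decode('utf-8')] = m.start()
        offsets.append(m.start())
    return titles, np.array(offsets, dtype=np.uint64)


# Tiny stand-in for the 50GB dump, just to exercise the scan:
sample = (b'<page><title>Python</title>...</page>'
          b'<page><title>Pythonidae</title>...</page>')
titles, offsets = build_title_index(sample)
```

Scanning a real 50GB dump would need `mmap` or chunked reads rather than an in-memory bytes object, but the offset-extraction idea is the same.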
  8. 9.

    • http://laptop/offsets/Python
      • Get the byte offset of every title starting with ‘Python’
      • Get the byte offset of the next page via offsets.searchsorted()
      • Adjust them (start – 7 bytes, end – 11 bytes)
      • …and return a JSON of [ [‘<title>’, starting_byte, ending_byte] ]
      • Web client then does an HTTP ranged request against /xml to grab the relevant XML fragment
    • Or, http://laptop/wiki/Python for exact lookup
      • Does all of the above, but issues the range request as well if there’s an exact hit, returning the fragment in one go
    • Non-trivial app that does something half useful; good use case
    • Exercises external C modules (NumPy, Cythonized datrie, etc.)
    • Not something you could easily do with existing solutions (multiprocessing). (Could you?)
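The /offsets lookup above might look something like this hypothetical helper. A dict stands in for the trie’s prefix search, and the –7/–11 byte adjustments are applied verbatim from the slide:

```python
import json

import numpy as np


def title_ranges(prefix, titles, offsets):
    """Return JSON of [[title, start_byte, end_byte], ...] for every
    title starting with `prefix`.

    `titles` maps title -> byte offset of its <title> tag, `offsets`
    is the sorted uint64 array of all title offsets. (The real demo
    used a datrie prefix query instead of scanning a dict.)
    """
    results = []
    for title, start in sorted(titles.items()):
        if not title.startswith(prefix):
            continue
        # searchsorted(side='right') locates the next title's offset,
        # which (roughly) marks where this page's XML ends.
        i = offsets.searchsorted(start, side='right')
        end = int(offsets[i]) if i < len(offsets) else None
        # Slide's adjustment: start - 7 bytes, end - 11 bytes.
        results.append([title,
                        int(start) - 7,
                        (end - 11) if end is not None else None])
    return json.dumps(results)
```

The client would then feed each [start, end] pair into an HTTP `Range: bytes=start-end` request against /xml to pull the page fragment.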
  9. 10.

    Does it work?

    • Heh, nope.
    • Datrie (the Cython wrapper around C libdatrie) is exhibiting odd behavior (crashing) on subsequent trie lookups in parallel contexts
      • Showing signs that make me suspect static-memory quirks (similar to the Unicode interning issue); easy enough to work around
    • Other than that, everything else works (NumPy will happily work in parallel contexts against arrays allocated by the main thread)
  10. 11.

    Other stuff

    • Broke generators in parallel contexts.
    • Temporarily broke exception handling in parallel contexts.
    • Purposely disabled for parallel contexts:
      • Importing.
      • Trace functions.
  11. 12.

    Stuff to write more about on the train tomorrow:

    • What else did I break?
      • PyObject struct
      • Generators
    • What are the current code restrictions?
    • What could we do for core Python in the short term, such that this could possibly be something we adopt in the long term?
      • Extend the new memory allocator API to include reference counting, perhaps