Slide 1

PyParallel
PyCon 2015 Language Summit
Trent Nelson

Hopefully superseded by the following, based on my 10-hour train ride tomorrow/yesterday: http://download.pyparallel.org/pycon2015-langsummit.pdf
But if not…

Slide 2

PyParallel Progress
• Not a lot between PyCon 2013 and ~end of 2014
• Did some presentations that were well received:
• https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores
• (154-slide brain dump)
• https://speakerdeck.com/trent/parallelism-and-concurrency-with-python
• Started sprinting on code again 18th Dec last year (2014)
• Focused on getting things stable under load for the TechEmpower Frameworks Benchmark
• https://www.techempower.com/blog/2014/05/01/framework-benchmarks-round-9/

Slide 3

TechEmpower Frameworks Benchmark
• http://server:8080/json
• -> { ‘message’: ‘Hello, World!’ }
• http://server:8080/plaintext
• -> HTTP response with ‘Hello World!’ as the body.
• Tested via wrk, e.g.:
  % ./wrk --latency --connections 16 --threads 16 --duration 30 http://server:8080/json
• Is it simple and riddled with flaws when you start poking at it? Yes.
• Is it still useful? Also yes.
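
The two endpoints above are trivially small; a minimal stdlib sketch of the behaviour being measured (this is not PyParallel's actual API — just the benchmark's contract):

```python
# Minimal sketch of the TechEmpower /json and /plaintext endpoints,
# using the stdlib http.server for illustration only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def respond(path):
    """Return (content_type, body) for a benchmark path, or None."""
    if path == '/json':
        return 'application/json', json.dumps({'message': 'Hello, World!'}).encode()
    if path == '/plaintext':
        return 'text/plain', b'Hello World!'
    return None


class BenchmarkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        hit = respond(self.path)
        if hit is None:
            self.send_error(404)
            return
        ctype, body = hit
        self.send_response(200)
        self.send_header('Content-Type', ctype)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve: HTTPServer(('', 8080), BenchmarkHandler).serve_forever()
```

A single-threaded handler like this is exactly what the benchmark stresses; the interesting part is how many cores the server behind it can use.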

Slide 4

PyParallel Performance
• It’s really good at stateless HTTP now; in particular, it maintains low latency under high load (very good 99th-percentile latency)
• It uses the underlying hardware optimally
• Exploits all CPU cores, scales linearly with additional cores
• Memory use proportional to concurrent client count
• 50,000 concurrent clients ~= 3GB
• Very low kernel overhead, e.g. profiling shows ~98% of time in user space when under load, only ~2% kernel overhead
• (Although that’s just a side effect of optimally using Windows facilities for high-performance I/O, not necessarily anything clever I’ve done in PyParallel.)
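
The 50,000-clients ~= 3GB figure implies a per-connection footprint of roughly 63 KiB; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the slide's memory figure:
# 50,000 concurrent clients ~= 3GB total.
clients = 50_000
total_bytes = 3 * 2**30            # 3GB
per_client_kib = total_bytes / clients / 1024
print(round(per_client_kib, 1))    # roughly 63 KiB per connection
```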

Slide 5

[Chart: request latency (milliseconds, x-axis) vs. percentile (y-axis) for PyParallel, Tornado and NodeJS. Benchmark: wrk2 --latency -U -c 8 -t 8 -d 25 -R 50000 http://:8080/json (8 threads, 1 connection each, attempting 50,000 req/s for 25 seconds). Labeled data points: 1.544ms @ 99.9993%, 3.707ms @ 100.00%, 7.203ms @ 99.9995%. Client: MacBook Pro, 1 x i7 (8 cores) @ 2.6GHz, 16GB RAM (OS X 10.10). Server: Mac Pro, 2 x Xeon (2 x 4 cores) @ 3.2GHz, 32GB RAM.]

Slide 6

No content

Slide 7

Wiki Demo (WIP)
• Stateless HTTP is easy and not that hard to do fast…
• Wanted a better demo for PyParallel that leveraged its benefits
• Wikipedia search!
• Downloaded Wikipedia (enwiki-20150205-pages-articles.xml -> 50GB)
• Extracted byte offsets of all the xyz entries (~15 million titles)
• Created a digital search tree (datrie, a Cython module) mapping titles to byte offsets
• Created a NumPy array of 64-bit unsigned ints storing all the offsets
• Wrote a little HTTP server wrapper
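
A toy sketch of the index-building step. The real demo uses the Cythonized datrie module over a 50GB dump; here a hypothetical three-page miniature of the dump and a plain dict stand in for the trie, just to show the title -> byte-offset extraction:

```python
# Toy index build: record the byte offset of each <page> element
# and map the page title to it. The `dump` bytes below are a
# hypothetical miniature of the enwiki XML layout, not real data.
import re

dump = (b'<mediawiki>\n'
        b'  <page><title>Python</title>...</page>\n'
        b'  <page><title>Python (genus)</title>...</page>\n'
        b'  <page><title>PyCon</title>...</page>\n'
        b'</mediawiki>\n')

title_to_offset = {}
for m in re.finditer(rb'<page><title>(.*?)</title>', dump):
    # m.start() is the byte offset where this page's <page> tag begins.
    title_to_offset[m.group(1).decode()] = m.start()

# Sorted offsets feed the NumPy uint64 array in the real demo.
sorted_offsets = sorted(title_to_offset.values())
print(title_to_offset['Python'], len(sorted_offsets))
```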

Slide 8

Wiki Demo
• The files are huge:
• After everything is loaded into memory:

Slide 9

• http://laptop/offsets/Python
• Get the byte offset of every title starting with ‘Python’
• Get the byte offset of the next page via offsets.searchsorted()
• Adjust them (start – 7 bytes, end – 11 bytes)
• …and return a JSON of [ [‘’, starting_byte, ending_byte] ]
• Web client then does an HTTP ranged request against /xml to grab the relevant XML fragment
• Or, http://laptop/wiki/Python for exact lookup
• Does all of the above, but also issues the range request if there’s an exact hit, returning the fragment in one go
• Non-trivial app that does something half useful; good use case
• Exercises external C modules (NumPy, Cythonized datrie, etc.)
• Not something you could easily do with existing solutions (multiprocessing) (Could you?)
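
The /offsets lookup above can be sketched as follows. The `offsets` array and `titles` mapping are hypothetical stand-in data (the real demo uses a datrie over ~15 million titles), and the –7/–11 adjustments are the dump-specific byte fudges from the slide, reproduced as-is:

```python
# Sketch of the /offsets/<prefix> lookup: for each matching title,
# searchsorted() on the sorted offset array finds where the *next*
# page begins, i.e. where this page's XML ends.
import numpy as np

# Hypothetical data: sorted byte offsets of every page in the dump,
# and a title -> start-offset mapping (a trie in the real demo).
offsets = np.array([100, 250, 400, 900], dtype=np.uint64)
titles = {'PyCon': 400, 'Python': 100, 'Python (genus)': 250}

def offsets_for_prefix(prefix):
    results = []
    for title, start in titles.items():
        if title.startswith(prefix):
            # First offset strictly greater than `start` = next page.
            nxt = int(offsets[offsets.searchsorted(start, side='right')])
            # Dump-specific adjustments from the slide.
            results.append([title, start - 7, nxt - 11])
    return results

print(offsets_for_prefix('Python'))
```

The JSON response is then just this list; the browser turns each [title, start, end] entry into an HTTP range request against /xml.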

Slide 10

Does it work?
• Heh, nope.
• datrie (the Cython wrapper around C libdatrie) is exhibiting odd behavior (crashing) upon subsequent trie lookups in parallel contexts
• Showing signs that make me suspect static-memory quirks (similar to the Unicode interning); easy enough to work around
• Other than that, everything else works (NumPy will happily work in parallel contexts against arrays allocated by the main thread)

Slide 11

Other stuff
• Broke generators in parallel contexts.
• Temporarily broke exception handling in parallel contexts.
• Purposely disabled for parallel contexts:
• Importing.
• Trace functions.

Slide 12

Stuff to write more about on the train tomorrow:
• What else did I break?
• PyObject struct
• Generators
• What are the current code restrictions?
• What could we do for core Python in the short term, such that this could possibly become something we adopt in the long term?
• Extend the new memory allocator API to include reference counting, perhaps