
Geoff Gerrietts - Diving into the Wreck: a postmortem look at real-world performance

As a young engineer interested in performance, much of the advice I saw on performance management focused on algorithms and rules of thumb. It’s good advice, but it doesn’t address the most common problems. This talk will cover a handful of the most common performance problems I’ve encountered in my career. We will talk about how to recognize them, what causes them, and how to resolve them.

https://us.pycon.org/2016/schedule/presentation/2032/

PyCon 2016

May 29, 2016

Transcript

  1. Geoff Gerrietts @ggerrietts Development Manager @AppNeta 16 years of Python

    Rehabilitated Poet Geoff Gerrietts @ggerrietts • Geoff Gerrietts • Development Manager at AppNeta • Pythonista for 16 years; developer for ~20 • Manager, devops, programmer, webmaster, technical writer • Before that I studied English • I like to make not-really-joking jokes about how writing poetry and programming are similar.
  2. Geoff Gerrietts @ggerrietts First having read the book of myths,

    and loaded the camera, and checked the edge of the knife-blade, I put on the body-armor of black rubber the absurd flippers the grave and awkward mask. — Adrienne Rich, “Diving into the Wreck” • Title from a poem • Adrienne Rich • Metaphor of salvage diving • Reflection, analysis • The profound weight of the waters of time • Shining beams of light into tattered evidence
  3. Geoff Gerrietts @ggerrietts Performant Incantations and Rules of Thumb •

    When we talk about performance, conversations often start with folk wisdom
  4. Geoff Gerrietts @ggerrietts Concatenate & Minify •

    Like, concatenate and minify your JavaScript and CSS files.
  5. Geoff Gerrietts @ggerrietts Serve Assets Separately •

    And don’t serve assets through your application, arrange to publish them separately. • Maybe use a CDN.
  6. Geoff Gerrietts @ggerrietts Hardware is Cheap, People are Expensive

    • Throw money at the problem • Not just true in the web world
  7. Geoff Gerrietts @ggerrietts “”.join([“abc”, “def”, “ghi”]) not “abc” + “def”

    + “ghi” Source: https://wiki.python.org/moin/PythonSpeed/PerformanceTips • Python specific performance hacks • Join lists rather than concatenate • Which may be dated advice?
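    A minimal sketch of checking this yourself with timeit; the gap mostly shows up when many pieces are involved, and recent CPython versions optimize in-place string concatenation, which is why the advice may be dated:

    import timeit

    pieces = ["abc"] * 1000

    def with_join():
        # builds the result in a single pass
        return "".join(pieces)

    def with_concat():
        # may copy the growing string on each +=
        out = ""
        for p in pieces:
            out += p
        return out

    print(timeit.timeit(with_join, number=10000))
    print(timeit.timeit(with_concat, number=10000))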
  8. Geoff Gerrietts @ggerrietts
     def spam(eggs, randrange=random.randrange):
         while True:
             yield eggs[randrange(len(eggs))]

    Source: https://wiki.python.org/moin/PythonSpeed/PerformanceTips • Bind local names to prevent lookups, especially in tight loops
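    A minimal sketch of measuring that trick, using only the standard library; the difference is small and only matters in tight loops:

    import random
    import timeit

    def pick_global(eggs, n=10000):
        # random.randrange is re-looked-up through the module on every iteration
        return [eggs[random.randrange(len(eggs))] for _ in range(n)]

    def pick_bound(eggs, n=10000, randrange=random.randrange):
        # randrange was bound once, at definition time, as a fast local
        return [eggs[randrange(len(eggs))] for _ in range(n)]

    eggs = list(range(100))
    print(timeit.timeit(lambda: pick_global(eggs), number=100))
    print(timeit.timeit(lambda: pick_bound(eggs), number=100))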
  9. Geoff Gerrietts @ggerrietts None of this is wrong. • All

    good advice, and we should all be aware of what we’re doing.
  10. Geoff Gerrietts @ggerrietts None of this is wrong. But it

    rarely addresses the real issue. • But it’s all targeted at solving very specific problems • And most problems aren’t those problems • Dave Grothe story
  11. Geoff Gerrietts @ggerrietts nginx gunicorn Flask nginx gunicorn Flask Postgres

    web.wreck.tlys.us aux.wreck.tlys.us db.wreck.tlys.us • This is the architecture that I used for all the sample code in this talk • It’s a fairly typical architecture, though there are lots of variations • (Walk through) • When we talk about performance, we talk about latency • At a high level, the latency we care about is how long it takes from our request, to seeing the response • If that takes 100ms (which is not bad), that 100ms accrues as the request traverses the architecture and becomes a response
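    A minimal sketch of the application tier in that stack; the module name, route, and gunicorn flags are illustrative, not the talk's actual demo code:

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/ping")
    def ping():
        # nginx proxies to gunicorn, gunicorn runs this Flask app,
        # and the app talks to Postgres on db.wreck.tlys.us
        return jsonify(ok=True)

    # e.g. on web.wreck.tlys.us, behind nginx:
    #   gunicorn --workers 4 --bind 127.0.0.1:8000 wreck:app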
  12. Geoff Gerrietts @ggerrietts • L1 cache reference: 1ns • Branch

    mispredict: 3ns • Mutex lock/unlock: 17ns • Main memory reference: 100ns • SSD random read: 16,000ns (16µs) • Read 1M bytes from SSD: 123,000ns (123µs) • Round-trip in data center: 500,000ns (500µs) • Read 1M bytes from disk: 1Mns (1ms) • Disk seek: 3Mns (3ms) • Round-trip, CA to Amsterdam: 150Mns (150ms) Geoff Gerrietts @ggerrietts Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html • A key tool in our kit is knowing where latency comes from. • This is our razor, the thing we use to guide our suspicions. • The “Jeff Dean” numbers really originate out of Berkeley. • They have an interactive graph where you can see the numbers change over time; shout-out to the speed of light
  13. Geoff Gerrietts @ggerrietts • L1 cache reference: 1ns • Branch

    mispredict: 3ns • Mutex lock/unlock: 17ns • Main memory reference: 100ns • SSD random read: 16,000ns (16µs) • Read 1M bytes from SSD: 123,000ns (123µs) • Round-trip in data center: 500,000ns (500µs) • Read 1M bytes from disk: 1Mns (1ms) • Disk seek: 3Mns (3ms) • Round-trip, CA to Amsterdam: 150Mns (150ms) Geoff Gerrietts @ggerrietts Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html • Python scribbles all over that yellow area • But that’s also where most of the optimization guidelines help • For some applications, those kind of optimizations help a lot. • Not for most applications though, particularly not in the web domain.
  14. Geoff Gerrietts @ggerrietts CPU < RAM <

    SSD < LAN < HDD < WAN • Roughly, CPU is faster than memory is faster than SSD is faster than LAN is faster than disk is faster than Internet. • Knowing the magnitude on each of those less-thans matters a lot (sometimes 4, sometimes 100) • But still: disk & network incur high latency
  15. Geoff Gerrietts @ggerrietts Monitoring is Everything. • One last thing.

    • Last year’s talk, won’t rehash • I’ll be showing a lot of TraceView, because TraceView • Lots of good choices • If you can’t measure it, you can’t know what’s wrong
  16. Geoff Gerrietts @ggerrietts SPOILER ALERT • Usually the database •

    Network + disk • Single bottleneck for all requests • If this were a Victorian mystery novel, the butler would be named Postgres
  17. So Let’s Dive • So we have reviewed the book

    of myths. • Checked our knife and camera • Let’s get wet.
  18. Geoff Gerrietts @ggerrietts • Or if you’re really doing this

    by hand, like this • These aren’t actionable. • You have to identify the problem you want to fix
  19. Geoff Gerrietts @ggerrietts • I like to use freq x

    duration, which also might be called cumulative time or total time • Captures really slow outliers • Also captures relatively fast stuff that’s being run a lot • Freq and duration separately can be useful
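    A minimal sketch of the idea, assuming you can export (endpoint, seconds) samples from whatever monitoring you use:

    from collections import defaultdict

    samples = [("/games", 0.12), ("/report", 4.20), ("/games", 0.15), ("/games", 0.11)]

    totals = defaultdict(float)   # cumulative (frequency x duration) per endpoint
    counts = defaultdict(int)
    for endpoint, seconds in samples:
        totals[endpoint] += seconds
        counts[endpoint] += 1

    for endpoint in sorted(totals, key=totals.get, reverse=True):
        print(endpoint, counts[endpoint], round(totals[endpoint], 2))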
  20. The Shipwreck-in-a-Bottle Collection • For our dives today, I have

    selected a handful of wrecks • These are not the exact shipwrecks whose gunwales I leaped from as they sank • But rather dioramas that I have built
  21. Geoff Gerrietts @ggerrietts • This is a bad response time.

    • Now, averages can be deceiving. • But they’re not right now
  22. Geoff Gerrietts @ggerrietts • So we know this endpoint is

    slow. • Which part is slow? • If you have a tool that will split things out by layer, you can look at that.
  23. Geoff Gerrietts @ggerrietts • Since the database is our main

    suspect, we can also use a slow query log
  24. Geoff Gerrietts @ggerrietts • The slow query log shows us

    that one of our queries is taking more than 14 seconds. • Both of these indicators point to a problem with queries
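    A minimal sketch of checking the threshold behind that log, assuming psycopg2 and made-up connection details; log_min_duration_statement in postgresql.conf controls which statements Postgres logs as slow:

    import psycopg2

    conn = psycopg2.connect(host="db.wreck.tlys.us", dbname="wreck", user="wreck")
    with conn.cursor() as cur:
        # e.g. '1s' logs any statement slower than one second; -1 disables the log
        cur.execute("SHOW log_min_duration_statement;")
        print(cur.fetchone()[0])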
  25. Geoff Gerrietts @ggerrietts • OK so here’s the code. •

    Kind of hard to see maybe, but can you see where it’s bad?
  26. Geoff Gerrietts @ggerrietts • Even a LIKE on an unindexed

    field • This query is exaggeratedly bad • But typical incarnations will do one or more of these things • Maybe 1 second instead of 14. • Can fix these with a better query
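    A sketch of the general shape of the fix; Player, its columns, and the session object stand in for the demo's real models rather than the actual code:

    # Slow: a leading-wildcard LIKE on an unindexed column forces a sequential scan
    players = session.query(Player).filter(Player.name.like("%smith%")).all()

    # Better: anchor the pattern so an index on Player.name can help, and fetch
    # only the columns you need (in Postgres, a pg_trgm index can also rescue
    # unanchored LIKEs)
    players = (
        session.query(Player.id, Player.name)
        .filter(Player.name.like("smith%"))
        .all()
    )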
  27. Geoff Gerrietts @ggerrietts • So 4.2s is better than our

    14s average in the previous example • But it’s not good, especially for stuff that’s this policy-light
  28. Geoff Gerrietts @ggerrietts • First thing that jumps out is

    holy crap that’s a lot of queries, two orders of magnitude
  29. Geoff Gerrietts @ggerrietts • If we look at a profile

    or a graph of a transaction • A lot of red
  30. Geoff Gerrietts @ggerrietts • OK so here’s the code for

    this one. • See the problem? • No? • Where are all those queries coming from?
  31. Geoff Gerrietts @ggerrietts • Right there. • SQLAlchemy does lazy

    traversal of relationships. • Django too. • So the Rolls and Players are being queried separately — once for each game! • Can maybe fix this by “eagerly joining” relationships • Or maybe reconsider the scalability of the design • Another variation on this problem: same query, many times in single transaction
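    A sketch of the eager-join fix in SQLAlchemy; Game, Game.rolls, and Game.players mirror the slide's description, not the demo's actual models:

    from sqlalchemy.orm import joinedload

    games = (
        session.query(Game)
        .options(joinedload(Game.rolls), joinedload(Game.players))
        .all()
    )
    # One round-trip with joins instead of 1 + 2N lazy queries for N games.
    # Django's equivalent knobs are select_related() and prefetch_related().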
  32. Geoff Gerrietts @ggerrietts • Sometimes looking at a heatmap helps

    show how averages are being skewed by outliers • Here everything’s pretty tightly grouped though • One hint: after the break, the cluster is a lot looser
  33. Geoff Gerrietts @ggerrietts • Top is a detailed view into

    a fast trace • Bottom is a detailed view into a slow trace • Could do with profiling instead, have done that • Nginx olive, wsgi cyan, sqlalchemy red
  34. Geoff Gerrietts @ggerrietts • Basically the same except a couple

    sqlalchemy queries taking longer • 18.11 for first, 53.53 for second — more than 2x time • So, probably the database. Surprised?
  35. Geoff Gerrietts @ggerrietts • CloudWatch CPU on the DB for the same

    period. • 20% CPU usage pre-spike • 100% at spike
  36. Geoff Gerrietts @ggerrietts • Now, there’s nothing to see in

    the code, but I thought I’d show it. • Pretty straightforward • Nothing looks too suspicious, nothing looks like it should change based on time
  37. Geoff Gerrietts @ggerrietts • So let’s go see what’s going

    on in the DB • Now, our webserver is 172.31.63.25... • So who’s 172.31.61.237? • Won’t necessarily be identifiable by IP -- might need to hunt by query instead • Sometimes you look at nginx logs instead
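    A minimal sketch of asking Postgres directly, assuming psycopg2 and made-up connection details; an unfamiliar client_addr stands out quickly:

    import psycopg2

    conn = psycopg2.connect(host="db.wreck.tlys.us", dbname="wreck", user="wreck")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT client_addr, state, query
            FROM pg_stat_activity
            WHERE state <> 'idle'
            ORDER BY client_addr
        """)
        for addr, state, query in cur.fetchall():
            print(addr, state, query[:60])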
  38. Geoff Gerrietts @ggerrietts • Here’s the code. It might look

    familiar. • Once it started up, it beat the snot out of the DB
  39. Geoff Gerrietts @ggerrietts • So, this is a wild-looking graph

    • More peaks and valleys than the Rockies
  40. Geoff Gerrietts @ggerrietts • This olive area is nginx. •

    If you believe this graph, requests are spending ~30s in nginx!
  41. Geoff Gerrietts @ggerrietts • Here’s a similar graph from a

    different test run. • Totally different characteristics! • Only similarity is the weird spike in nginx
  42. Geoff Gerrietts @ggerrietts • Here’s a clue: error rate. •

    This graph matches our first test run: request rate falls off, but errors go away.
  43. Geoff Gerrietts @ggerrietts • The second test run • Errors

    throughout • One notable gap where nginx spiked
  44. Geoff Gerrietts @ggerrietts • Okay enough drama • Top graph

    is from our first test. • Middle and bottom graphs are from our second test. • Both reveal a shifting memory profile • Memory tops out a couple of times
  45. Geoff Gerrietts @ggerrietts • If we look in the server

    log • OOM killer has been killing processes • Segfaults have been killing processes • OMG!
  46. Geoff Gerrietts @ggerrietts • Here’s the code. • What’s wrong

    with what it does? • It seems ok? • But what happens when you pass in a large count?
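    A hedged reconstruction of the kind of view the slide describes; the route matches the /grenade/<count> URLs shown later, but the body is illustrative, not the actual demo code:

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/grenade/<int:count>")
    def grenade(count):
        # count chunks of ~1MB each: /grenade/512 briefly holds ~512MB in this
        # one worker before the list becomes garbage
        chunks = [bytearray(2 ** 20) for _ in range(count)]
        return jsonify(allocated_mb=len(chunks))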
  47. Geoff Gerrietts @ggerrietts Memory fragmentation • Grossly oversimplifying, Python manages

    memory in various pools • I’ve drawn the same pool at three different points in its lifecycle. • At first it’s of modest size • Then grows to accommodate new objects • Then those objects are collected • But the pool is not shrunk.
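    A minimal sketch of the effect on a Linux host (it reads /proc/self/status; exact numbers depend on the allocator and Python version):

    import re

    def rss_kb():
        with open("/proc/self/status") as f:
            return int(re.search(r"VmRSS:\s+(\d+) kB", f.read()).group(1))

    print("start:", rss_kb())
    pile = [{"i": i} for i in range(2000000)]   # lots of small, interleaved objects
    print("allocated:", rss_kb())
    del pile                                    # the dicts are collected...
    print("released:", rss_kb())                # ...but RSS usually stays well above the start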
  48. Geoff Gerrietts @ggerrietts
     http://web.wreck.tlys.us/grenade/8 ⟶ 1.1% (46M)
     http://web.wreck.tlys.us/grenade/64 ⟶ 2.6% (105M)
     http://web.wreck.tlys.us/grenade/128 ⟶ 3.8% (152M)
     http://web.wreck.tlys.us/grenade/256 ⟶ 6.1% (247M)
     http://web.wreck.tlys.us/grenade/512 ⟶ 12.1% (488M)
     • Watching this on top is pretty interesting • Each request gobbles up a bunch of RAM • Then settles at the new level • 10 workers, 4GB main memory