Slide 1

Slide 1 text

Diving into the Wreck

Slide 2

Slide 2 text

Geoff Gerrietts @ggerrietts Development Manager @AppNeta 16 years of Python Rehabilitated Poet ● Geoff Gerrietts ● Development Manager at AppNeta ● Pythonista for 16 years; developer for ~20 ● Manager, devops, programmer, webmaster, technical writer ● Before that I studied English ● I like to make not-really-joking jokes about how writing poetry and programming are similar.

Slide 3

Slide 3 text

Geoff Gerrietts @ggerrietts First having read the book of myths, and loaded the camera, and checked the edge of the knife-blade, I put on the body-armor of black rubber the absurd flippers the grave and awkward mask. — Adrienne Rich, “Diving into the Wreck” ● Title from a poem ● Adrienne Rich ● Metaphor of salvage diving ● Reflection, analysis ● The profound weight of the waters of time ● Shining beams of light into tattered evidence

Slide 4

Slide 4 text

Thinking About Performance ● So let’s shine some light into application performance.

Slide 5

Slide 5 text

Geoff Gerrietts @ggerrietts Performant Incantations and Rules of Thumb ● When we talk about performance, conversations often start with folk wisdom

Slide 6

Slide 6 text

Geoff Gerrietts @ggerrietts Concatenate & Minify ● Like, concatenate and minify your JavaScript and CSS files.

Slide 7

Slide 7 text

Geoff Gerrietts @ggerrietts Serve Assets Separately ● And don’t serve assets through your application; arrange to publish them separately. ● Maybe use a CDN.

Slide 8

Slide 8 text

Geoff Gerrietts @ggerrietts Hardware is Cheap, People are Expensive ● Throw money at the problem ● Not just true in the web world

Slide 9

Slide 9 text

Geoff Gerrietts @ggerrietts "".join(["abc", "def", "ghi"]) not "abc" + "def" + "ghi" Source: https://wiki.python.org/moin/PythonSpeed/PerformanceTips ● Python-specific performance hacks ● Join lists rather than concatenate ● Which may be dated advice?
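A rough sketch for checking this folk wisdom yourself (not from the talk; timings depend on Python version and on how many pieces you join):

    import timeit

    pieces = ["abc"] * 1000

    def with_join():
        return "".join(pieces)

    def with_concat():
        s = ""
        for p in pieces:
            s += p  # may copy the whole string each pass; CPython sometimes optimizes this in place
        return s

    print("join:  ", timeit.timeit(with_join, number=10_000))
    print("concat:", timeit.timeit(with_concat, number=10_000))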

Slide 10

Slide 10 text

Geoff Gerrietts @ggerrietts

    def spam(eggs, randrange=random.randrange):
        while True:
            yield eggs[randrange(len(eggs))]

Source: https://wiki.python.org/moin/PythonSpeed/PerformanceTips ● Bind local names to prevent lookups, especially in tight loops
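A similar sketch for measuring the local-binding trick, using only the standard library; the difference is usually small and shows up mostly in tight loops:

    import random
    import timeit

    def global_lookup(n=100_000):
        total = 0
        for _ in range(n):
            total += random.randrange(6)  # module attribute lookup on every iteration
        return total

    def local_binding(n=100_000, randrange=random.randrange):
        total = 0
        for _ in range(n):
            total += randrange(6)  # plain local name, no attribute lookup
        return total

    print("global:", timeit.timeit(global_lookup, number=20))
    print("local: ", timeit.timeit(local_binding, number=20))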

Slide 11

Slide 11 text

Geoff Gerrietts @ggerrietts Source: http://bigocheatsheet.com/ ● No discussion of performance is complete without mentioning Big O

Slide 12

Slide 12 text

Geoff Gerrietts @ggerrietts None of this is wrong. ● All good advice, and we should all be aware of what we’re doing.

Slide 13

Slide 13 text

Geoff Gerrietts @ggerrietts None of this is wrong. But it rarely addresses the real issue. ● But it’s all targeted at solving very specific problems ● And most problems aren’t those problems ● Dave Grothe story

Slide 14

Slide 14 text

Thinking Critically About Performance ● So let’s think a little more analytically about performance.

Slide 15

Slide 15 text

Geoff Gerrietts @ggerrietts nginx gunicorn Flask nginx gunicorn Flask Postgres web.wreck.tlys.us aux.wreck.tlys.us db.wreck.tlys.us ● This is the architecture that I used for all the sample code in this talk ● It’s a fairly typical architecture, though there are lots of variations ● (Walk through) ● When we talk about performance, we talk about latency ● At a high level, the latency we care about is how long it takes from making a request to seeing the response ● If that takes 100ms (which is not bad), that 100ms accrues as the request traverses the architecture and becomes a response
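A minimal client-side sketch of measuring that end-to-end latency (assumes the requests library; the URL is just the demo host from these slides):

    import time

    import requests

    start = time.perf_counter()
    resp = requests.get("http://web.wreck.tlys.us/")
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Everything nginx, gunicorn, Flask, and Postgres did is folded into this one number.
    print(f"{resp.status_code} in {elapsed_ms:.1f}ms")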

Slide 16

Slide 16 text

Geoff Gerrietts @ggerrietts • L1 cache reference: 1ns • Branch mispredict: 3ns • Mutex lock/unlock: 17ns • Main memory reference: 100ns • SSD random read: 16,000ns (16µs) • Read 1M bytes from SSD: 123,000ns (123µs) • Round-trip in data center: 500,000ns (500µs) • Read 1M bytes from disk: 1,000,000ns (1ms) • Disk seek: 3,000,000ns (3ms) • Round-trip, CA to Amsterdam: 150,000,000ns (150ms) Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html ● A key tool in our kit is knowing where latency comes from. ● This is our razor, the thing we use to guide our suspicions. ● The “Jeff Dean” numbers really originate from Berkeley. ● They have an interactive graph where you can see the numbers change over time, shout out to the speed of light

Slide 17

Slide 17 text

Geoff Gerrietts @ggerrietts • L1 cache reference: 1ns • Branch mispredict: 3ns • Mutex lock/unlock: 17ns • Main memory reference: 100ns • SSD random read: 16,000ns (16µs) • Read 1M bytes from SSD: 123,000ns (123µs) • Round-trip in data center: 500,000ns (500µs) • Read 1M bytes from disk: 1,000,000ns (1ms) • Disk seek: 3,000,000ns (3ms) • Round-trip, CA to Amsterdam: 150,000,000ns (150ms) Source: http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html ● Python scribbles all over that yellow area ● But that’s also where most of the optimization guidelines help ● For some applications, those kinds of optimizations help a lot. ● Not for most applications though, particularly not in the web domain.

Slide 18

Slide 18 text

Geoff Gerrietts @ggerrietts CPU < RAM < SSD < LAN < HDD < WAN ● Roughly, CPU is faster than memory is faster than SSD is faster than LAN is faster than disk is faster than the Internet. ● Knowing the magnitude of each of those less-thans matters a lot (sometimes a factor of 4, sometimes a factor of 100) ● But still: disk & network incur high latency

Slide 19

Slide 19 text

Geoff Gerrietts @ggerrietts Monitoring is Everything. ● One last thing. ● Last year’s talk, won’t rehash ● I’ll be showing a lot of TraceView, because TraceView ● Lots of good choices ● If you can’t measure it, you can’t know what’s wrong

Slide 20

Slide 20 text

Geoff Gerrietts @ggerrietts SPOILER ALERT ● Usually the database ● Network + disk ● Single bottleneck for all requests ● If this were a Victorian mystery novel, the butler would be named Postgres

Slide 21

Slide 21 text

So Let’s Dive ● So we have reviewed the book of myths. ● Checked our knife and camera ● Let’s get wet.

Slide 22

Slide 22 text

Geoff Gerrietts @ggerrietts ● Application performance is complicated ● You might get a view like this

Slide 23

Slide 23 text

Geoff Gerrietts @ggerrietts ● Or a view like this

Slide 24

Slide 24 text

Geoff Gerrietts @ggerrietts ● Or if you’re really doing this by hand, like this ● These aren’t actionable. ● You have to identify the problem you want to fix

Slide 25

Slide 25 text

Geoff Gerrietts @ggerrietts ● I like to use freq x duration, which also might be called cumulative time or total time ● Captures really slow outliers ● Also captures relatively fast stuff that’s being run a lot ● Freq and duration separately can be useful
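A sketch of that ranking, assuming you can export (endpoint, duration in ms) samples from whatever monitoring you use; the sample data here is illustrative:

    from collections import defaultdict

    def rank_by_total_time(samples):
        """Rank endpoints by cumulative time, i.e. frequency x duration."""
        totals = defaultdict(float)
        counts = defaultdict(int)
        for endpoint, duration_ms in samples:
            totals[endpoint] += duration_ms
            counts[endpoint] += 1
        # Cumulative time surfaces both the slow outliers and the fast-but-frequent endpoints.
        return sorted(
            ((ep, counts[ep], totals[ep]) for ep in totals),
            key=lambda row: row[2],
            reverse=True,
        )

    samples = [("/truculent/smith", 14200.0), ("/quiet/153", 28.0), ("/quiet/153", 31.0)]
    for endpoint, count, total_ms in rank_by_total_time(samples):
        print(f"{endpoint}: {count} calls, {total_ms:.0f}ms total")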

Slide 26

Slide 26 text

The Shipwreck-in-a-Bottle Collection ● For our dives today, I have selected a handful of wrecks ● These are not the exact shipwrecks whose gunwales I leaped from as they sank ● But rather dioramas that I have built

Slide 27

Slide 27 text

Geoff Gerrietts @ggerrietts The S.S. Truculent Query (Slow Queries)

Slide 28

Slide 28 text

Geoff Gerrietts @ggerrietts http://192.168.12.34/truculent/smith This is the site in question

Slide 29

Slide 29 text

Geoff Gerrietts @ggerrietts ● This is a bad response time. ● Now, averages can be deceiving. ● But they’re not right now

Slide 30

Slide 30 text

Geoff Gerrietts @ggerrietts ● So we know this endpoint is slow. ● Which part is slow? ● If you have a tool that will split things out by layer, you can look at that.

Slide 31

Slide 31 text

Geoff Gerrietts @ggerrietts ● Here, a ton of time in SQLAlchemy

Slide 32

Slide 32 text

Geoff Gerrietts @ggerrietts ● Since the database is our main suspect, we can also use a slow query log

Slide 33

Slide 33 text

Geoff Gerrietts @ggerrietts ● The slow query log shows us that one of our queries is taking more than 14 seconds. ● Both of these indicators point to a problem with queries

Slide 34

Slide 34 text

Geoff Gerrietts @ggerrietts ● OK so here’s the code. ● Kind of hard to see maybe, but can you see where it’s bad?

Slide 35

Slide 35 text

Geoff Gerrietts @ggerrietts ● It’s kind of bad everywhere in the highlight.

Slide 36

Slide 36 text

Geoff Gerrietts @ggerrietts ● Using a subquery as its data source

Slide 37

Slide 37 text

Geoff Gerrietts @ggerrietts ● Unnecessary join

Slide 38

Slide 38 text

Geoff Gerrietts @ggerrietts ● Even a LIKE on an unindexed field ● This query is exaggeratedly bad ● But typical incarnations will do one or more of these things ● Maybe 1 second instead of 14. ● Can fix these with a better query
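A hedged sketch of the kind of cleanup involved; Player and the connection string are hypothetical stand-ins, not the talk's actual code:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Player(Base):  # hypothetical model
        __tablename__ = "players"
        id = Column(Integer, primary_key=True)
        name = Column(String, index=True)  # indexed in the fixed version

    engine = create_engine("postgresql://user:pass@db.wreck.tlys.us/wreck")  # assumed DSN

    with Session(engine) as session:
        # Anti-pattern: a leading-wildcard LIKE on an unindexed column forces a sequential scan,
        # and wrapping the table in a subquery or joining tables you never read makes it worse.
        slow = session.query(Player).filter(Player.name.like("%smith%")).all()

        # Better: an exact (or prefix) match against an indexed column, straight from the table.
        fast = session.query(Player).filter(Player.name == "smith").all()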

Slide 39

Slide 39 text

Geoff Gerrietts @ggerrietts The S.S. Highly Selective (Excessive Queries)

Slide 40

Slide 40 text

Geoff Gerrietts @ggerrietts http://web.wreck.tlys.us/1000queries/jgarcia This is the site in question

Slide 41

Slide 41 text

Geoff Gerrietts @ggerrietts ● So 4.2s is better than our 14s average in the previous example ● But it’s not good, especially for stuff that’s this policy-light

Slide 42

Slide 42 text

Geoff Gerrietts @ggerrietts ● Can check latency-by-layer

Slide 43

Slide 43 text

Geoff Gerrietts @ggerrietts ● First thing that jumps out is holy crap that’s a lot of queries, two orders of magnitude

Slide 44

Slide 44 text

Geoff Gerrietts @ggerrietts ● Query log useful too

Slide 45

Slide 45 text

Geoff Gerrietts @ggerrietts ● Split between a couple offenders

Slide 46

Slide 46 text

Geoff Gerrietts @ggerrietts ● If we look at a profile or a graph of a transaction ● A lot of red

Slide 47

Slide 47 text

Geoff Gerrietts @ggerrietts ● In fact, we can see 106 SQLAlchemy extents

Slide 48

Slide 48 text

Geoff Gerrietts @ggerrietts ● OK so here’s the code for this one. ● See the problem? ● No? ● Where are all those queries coming from?

Slide 49

Slide 49 text

Geoff Gerrietts @ggerrietts ● What about here? Red flags?

Slide 50

Slide 50 text

Geoff Gerrietts @ggerrietts ● Right there. ● SQLAlchemy does lazy traversal of relationships. ● Django too. ● So the Rolls and Players are being queried separately — once for each game! ● Can maybe fix this by “eagerly joining” relationships ● Or maybe reconsider the scalability of the design ● Another variation on this problem: same query, many times in single transaction
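A sketch of the eager-loading fix; Game, Roll, and Player are hypothetical stand-ins for the talk's models, and the connection string is assumed:

    from sqlalchemy import Column, ForeignKey, Integer, create_engine
    from sqlalchemy.orm import Session, declarative_base, joinedload, relationship

    Base = declarative_base()

    class Player(Base):  # hypothetical models
        __tablename__ = "players"
        id = Column(Integer, primary_key=True)

    class Game(Base):
        __tablename__ = "games"
        id = Column(Integer, primary_key=True)
        rolls = relationship("Roll")  # lazy by default: touching it costs one query per game

    class Roll(Base):
        __tablename__ = "rolls"
        id = Column(Integer, primary_key=True)
        game_id = Column(Integer, ForeignKey("games.id"))
        player_id = Column(Integer, ForeignKey("players.id"))
        player = relationship("Player")

    engine = create_engine("postgresql://user:pass@db.wreck.tlys.us/wreck")  # assumed DSN

    with Session(engine) as session:
        games = (
            session.query(Game)
            .options(joinedload(Game.rolls).joinedload(Roll.player))  # eager JOIN instead of N+1 queries
            .all()
        )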

Slide 51

Slide 51 text

Geoff Gerrietts @ggerrietts The S.S. Oryx-Antlerson (The Noisy Neighbor) ● Judy

Slide 52

Slide 52 text

Geoff Gerrietts @ggerrietts http://web.wreck.tlys.us/quiet/153/3432/2d6 This is the site in question

Slide 53

Slide 53 text

Geoff Gerrietts @ggerrietts ● Finally, an endpoint with decent performance ● Clocks in at <30ms

Slide 54

Slide 54 text

Geoff Gerrietts @ggerrietts ● But then at 3:35, it grows to 150%!

Slide 55

Slide 55 text

Geoff Gerrietts @ggerrietts ● Sometimes looking at a heatmap helps show how averages are being skewed by outliers ● Here everything’s pretty tightly grouped though ● One hint: after the break, the cluster is a lot looser

Slide 56

Slide 56 text

Geoff Gerrietts @ggerrietts ● Top is a detailed view into a fast trace ● Bottom is a detailed view into a slow trace ● Could do this with profiling instead; have done that ● Nginx is olive, WSGI is cyan, SQLAlchemy is red

Slide 57

Slide 57 text

Geoff Gerrietts @ggerrietts ● Basically the same except a couple sqlalchemy queries taking longer ● 18.11 for first, 53.53 for second — more than 2x time ● So, probably the database. Surprised?

Slide 58

Slide 58 text

Geoff Gerrietts @ggerrietts ● Cloudwatch CPU on DB for same period. ● 20% CPU usage pre-spike ● 100% at spike

Slide 59

Slide 59 text

Geoff Gerrietts @ggerrietts ● Now, there’s nothing to see in the code, but I thought I’d show it. ● Pretty straightforward ● Nothing looks too suspicious, nothing looks like it should change based on time

Slide 60

Slide 60 text

Geoff Gerrietts @ggerrietts ● So let’s go see what’s going on in the DB ● Now, our webserver is 172.31.63.25... ● So who’s 172.31.61.237? ● Won’t necessarily be identifiable by IP -- might need to hunt by query instead ● Sometimes you look at nginx logs instead
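One way to go hunting, sketched with psycopg2 and a role that can read pg_stat_activity (both assumed):

    import psycopg2

    conn = psycopg2.connect("host=db.wreck.tlys.us dbname=wreck user=postgres")  # assumed DSN
    with conn.cursor() as cur:
        cur.execute("""
            SELECT client_addr, state, now() - query_start AS running_for, query
            FROM pg_stat_activity
            WHERE state <> 'idle'
            ORDER BY query_start
        """)
        for client_addr, state, running_for, query in cur.fetchall():
            # Anything that isn't the web server's address is a candidate noisy neighbor.
            print(client_addr, state, running_for, query[:80])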

Slide 61

Slide 61 text

Geoff Gerrietts @ggerrietts ● The culprit ● Can see it spins up at 3:35

Slide 62

Slide 62 text

Geoff Gerrietts @ggerrietts ● Here’s the code. It might look familiar. ● Once it started up, it beat the snot out of the DB

Slide 63

Slide 63 text

Geoff Gerrietts @ggerrietts The S.S. Remembrance Lost (Memory Fragmentation)

Slide 64

Slide 64 text

Geoff Gerrietts @ggerrietts http://web.wreck.tlys.us/grenade/8 ● This is the site in question

Slide 65

Slide 65 text

Geoff Gerrietts @ggerrietts ● So, this is a wild-looking graph ● More peaks and valleys than the Rockies

Slide 66

Slide 66 text

Geoff Gerrietts @ggerrietts ● This olive area is nginx. ● If you believe this graph, requests are spending ~30s in nginx!

Slide 67

Slide 67 text

Geoff Gerrietts @ggerrietts ● Also, throughput starts strong then peters out ● Weird!

Slide 68

Slide 68 text

Geoff Gerrietts @ggerrietts ● Here’s a similar graph from a different test run. ● Totally different characteristics! ● Only similarity is the weird spike in nginx

Slide 69

Slide 69 text

Geoff Gerrietts @ggerrietts ● Here’s a clue: error rate. ● This graph matches our first test run: request rate falls off, but errors go away.

Slide 70

Slide 70 text

Geoff Gerrietts @ggerrietts ● The second test run ● Errors throughout ● One notable gap where nginx spiked

Slide 71

Slide 71 text

Geoff Gerrietts @ggerrietts ● So let’s check the error log.

Slide 72

Slide 72 text

Geoff Gerrietts @ggerrietts ● Uh oh.

Slide 73

Slide 73 text

Geoff Gerrietts @ggerrietts ● Uh oh.

Slide 74

Slide 74 text

Geoff Gerrietts @ggerrietts ● Uh oh.

Slide 75

Slide 75 text

Geoff Gerrietts @ggerrietts ● Okay, enough drama ● Top graph is from our first test. ● Middle and bottom graphs are from our second test. ● Both runs reveal a shifting memory profile ● Memory topping out a couple times

Slide 76

Slide 76 text

Geoff Gerrietts @ggerrietts ● If we look in the server log ● OOM killer has been killing processes ● Segfaults have been killing processes ● OMG!

Slide 77

Slide 77 text

Geoff Gerrietts @ggerrietts ● Here’s the code. ● What’s wrong with what it does? ● It seems ok? ● But what happens when you pass in a large count?

Slide 78

Slide 78 text

Geoff Gerrietts @ggerrietts Memory fragmentation ● Grossly oversimplifying, Python manages memory in various pools ● I’ve drawn the same pool at three different points in its lifecycle. ● At first it’s of modest size ● Then grows to accommodate new objects ● Then those objects are collected ● But the pool is not shrunk.
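A rough, Linux-specific sketch of watching this happen; exact numbers depend on interpreter version and allocation pattern, but the point is that surviving objects pin their pools:

    def rss_mb():
        # Resident set size from /proc/self/status (Linux only).
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) // 1024  # kB -> MB

    print("baseline:", rss_mb(), "MB")
    blob = [{"n": i} for i in range(1_000_000)]  # grow the pools with lots of small objects
    print("peak:    ", rss_mb(), "MB")
    survivors = blob[::1000]                     # keep a few objects scattered across the pools
    del blob                                     # the bulk is collected...
    print("after:   ", rss_mb(), "MB")           # ...but much of the memory can stay resident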

Slide 79

Slide 79 text

Geoff Gerrietts @ggerrietts http://web.wreck.tlys.us/grenade/8 ⟶ 1.1% (46M) http://web.wreck.tlys.us/grenade/64 ⟶ 2.6% (105M) http://web.wreck.tlys.us/grenade/128 ⟶ 3.8% (152M) http://web.wreck.tlys.us/grenade/256 ⟶ 6.1% (247M) http://web.wreck.tlys.us/grenade/512 ⟶12.1% (488M) ● Watching this on top is pretty interesting ● Each request gobbles up a bunch of RAM ● Then settles at the new level ● 10 workers, 4GB main memory

Slide 80

Slide 80 text

Surfacing

Slide 81

Slide 81 text

Geoff Gerrietts @ggerrietts DON’T: Aim to improve performance.

Slide 82

Slide 82 text

Geoff Gerrietts @ggerrietts DO: Identify a problem, and fix it.

Slide 83

Slide 83 text

Geoff Gerrietts @ggerrietts DON’T: Agonize over optimizations

Slide 84

Slide 84 text

Geoff Gerrietts @ggerrietts DO: Look at the network. Look at the DB.

Slide 85

Slide 85 text

Geoff Gerrietts @ggerrietts Thanks! Questions/comments: [email protected] @ggerrietts https://github.com/ggerrietts/the-wreck