Introspecting Running Python Processes

Slide 1

Slide 1 text

Introspecting Running Python Processes Adam Lowry PyCon 2012, 2012-03-09 Saturday, March 10, 12

Slide 2

Slide 2 text

Saturday, March 10, 12 My name is Adam Lowry; I’m a co-founder and developer at Urban Airship. We provide services for mobile application developers.

Slide 3

Slide 3 text

Saturday, March 10, 12 My colleague Michael Schurter (schmichael) unfortunately couldn’t be here with me, but he worked on this talk with me. You can trust him — look at that beard.

Slide 4

Slide 4 text

Saturday, March 10, 12 Here’s a picture of Michael in a Columbian drug runner’s submarine.

Slide 5

Slide 5 text

What is your application doing? Saturday, March 10, 12 With any application, but especially with high performance applications, you need to know what is going on. Tell me this: how do you know that your web application, say, is behaving as you expect? You have a monitor that hits it once a minute, you have logs and error email, but you don’t actually know. If everything ground to a halt without any change you made, could you identify the problem? It’s difficult, unless you instrument and expose internal data. All you can do is guess about the causes of “The Event”

Slide 6

Slide 6 text

What is your application doing? Saturday, March 10, 12

Slide 7

Slide 7 text

Logging • A stream of expected events • everything is persisted • process and search as needed Your application’s diary Saturday, March 10, 12

Slide 8

Slide 8 text

Logging, but • Operational burden—storage, aggregation, searching • Records event, but can’t inspect related state • Difﬁcult to predict what you need • Performance tradeoff with verbosity Saturday, March 10, 12 Slow request -- why was it slow? 500 -- did you store enough local vars to reproduce it Still vital for auditing, reference

Slide 9

Slide 9 text

Breaking in through the back door • gdb-heap • eventlet’s backdoor module • Werkzeug debugger Spelunking the uncharted territory of your application’s process Saturday, March 10, 12 gdb-heap - dig into memory itself backdoor: a greenthread exposes REPL over telnet - interpreter at the state it was spawned Werkzeug debugger - useful only for development, gives you a detailed view (& repl!) of a single request/ exception Not for production use.

Slide 10

Slide 10 text

What are we missing? A look at the JVM Saturday, March 10, 12

Slide 11

Slide 11 text

JMX Saturday, March 10, 12 Java Management Extensions Any library or application can expose a set of interfaces Reading is not language agnostic, but adapter libraries can expose via HTTP and JSON Communication can be two way Traditionally used for resource consumption/system monitoring, but any application can make use of it

Slide 12

Slide 12 text

jconsole Saturday, March 10, 12 jconsole is a GUI tool for visualizing & interacting JMX data Incredibly useful to point this at a production machine you’re concerned about This one’s doing pretty well

Slide 13

Slide 13 text

jstack 2012-03-08 00:17:54 Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.0-b11 mixed mode): "Attach Listener" daemon prio=10 tid=0x0000000058bd4000 nid=0x353f waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "558334662@qtp-103468119-1932" prio=10 tid=0x00002aaac8131800 nid=0x2e76 in Object.wait() [0x000000004ed3f000] java.lang.Thread.State: TIMED_WAITING (on object monitor) ! at java.lang.Object.wait(Native Method) ! at org.mortbay.thread.QueuedThreadPool $PoolThread.run(QueuedThreadPool.java:626) ! - locked <0x00000000f955c638> (a org.mortbay.thread.QueuedThreadPool$PoolThread) Saturday, March 10, 12 jstack is another useful tool—point it at a running JVM instance, and get a stack trace of all threads Not a very interesting trace here :) This is built into the runtime, accessible with a signal.

Slide 14

Slide 14 text

metrics http://metrics.codahale.com/ METRICS METRICS EVERYWHERE Saturday, April 9, 2011 http://codahale.com/codeconf-2011-04-09-metrics-metrics-everywhere.pdf Saturday, March 10, 12 Tools for retrieving information are good, but good data requires good instrumentation. metrics is a library for instrumenting your application; all data is exposed via JMX It’s not just the tools, though — metrics, and this presentation from last year, represent a mindset that knowledge is crucial to operations Highly recommend you watch/read this presentation

Slide 15

Slide 15 text

metrics Saturday, March 10, 12 Back to this view of JMX data -- these attributes are measurements from metrics Reports a particular set of calls and how long they’re taking with a variety of measurements you can get a much better idea of what’s really going on

Slide 16

Slide 16 text

metrics Saturday, March 10, 12 One more little bit from jconsole -- my colleague lucky wanted you to see the OneMinuteRate graph. Click on the previous view, it starts graphing from this point.

Slide 17

Slide 17 text

Approaches • New Relic • Graphite Saturday, March 10, 12 Know that we have an idea of what we want, how can we get it? Two systems people using today are New Relic and Graphite New Relic is a hosted web application monitoring solution — highly recommend you look into it; I know a lot of the New Relic people are around the conference, so I hope they’ll do some demos Graphite is a scalable graphing system for time series data. It has a Python library for sending data to it from inside your application

Slide 18

Slide 18 text

New Approaches Saturday, March 10, 12 Now I’d like to talk about a few ways Michael and I have been working on to instrument our Python applications.

Slide 19

Slide 19 text

socketconsole Stack trace dumps for Python processes Saturday, March 10, 12 The first is socketconsole, which is a terrible name. My apologies. I wanted jstack for Python. After talking with people in the #pdxpython IRC channel, some of them had tried a few different approaches at this. We have a mix of web servers and non-web services, and I wanted something that’d work the same on any of them. communication with the process is done over a unix socket -- one filename per process, so no fighting over names/ports/etc

Slide 20

Slide 20 text

socketconsole goals • For multi-process and multi-threaded apps • Easy or no conﬁguration • No ﬁghting over HTTP ports • No unix signals Saturday, March 10, 12 multi-proc/thread -- we use mod_wsgi for our most important application. One server has many processes, each process has many threads no unix signals -- don’t want to trust that every single app and library handles interruptions correctly.

Slide 21

Slide 21 text

import random import time import threading import socketconsole socketconsole.launch() def waiter(): some_value = random.randint(0, 1000) time.sleep(500) ts = [threading.Thread(target=waiter) for i in range(3)] for i, t in enumerate(ts): t.name = "thread-%d" % i t.daemon = True t.start() print "Started", t.name waiter() Saturday, March 10, 12 Example Start up the socketconsole thread (in a web app, a WSGI ﬁle that creates the application is a good place for this) This example starts up a few threads, each sleeping.

Slide 22

Slide 22 text

(socketconsole)saison:socketconsole adam$ bin/socketreader # Thread Name: thread-2, ThreadID: 4315418624 File "lib/python2.6/threading.py", line 504, in __bootstrap self.__bootstrap_inner() File "lib/python2.6/threading.py", line 532, in __bootstrap_inner self.run() File "lib/python2.6/threading.py", line 484, in run self.__target(*self.__args, **self.__kwargs) File "example/sockdemo.py", line 11, in waiter time.sleep(500) # Locals: some_value: 160 # Thread Name: thread-1, ThreadID: 4312272896 File "lib/python2.6/threading.py", line 504, in __bootstrap self.__bootstrap_inner() File "lib/python2.6/threading.py", line 532, in __bootstrap_inner self.run() File "lib/python2.6/threading.py", line 484, in run self.__target(*self.__args, **self.__kwargs) File "example/sockdemo.py", line 11, in waiter time.sleep(500) # Locals: some_value: 298 # Thread Name: MainThread, ThreadID: 140735302732128 File "example/sockdemo.py", line 21, in waiter() File "example/sockdemo.py", line 11, in waiter time.sleep(500) Saturday, March 10, 12 socketreader is a shortcut to communicating with each unix socket/proccess it ﬁnds, and outputting each thread paths edited for clarity

Slide 23

Slide 23 text

SQLAlchemy example >>> import sqlalchemy, socketconsole >>> socketconsole.launch() >>> eng = sqlalchemy.connect("postgresql:///test") >>> eng = sqlalchemy.create_engine("postgresql:///test") >>> q = sqlalchemy.text("select pg_sleep(15)") >>> eng.execute(q) Saturday, March 10, 12 If we do a little more work, we can see more use from it This script connects to a test postgres database and sleeps for a while (long enough for me to run socketreader :))

Slide 24

Slide 24 text

# Thread Name: MainThread, ThreadID: 140735302732128 File "", line 1, in File "sqlalchemy/engine/base.py", line 1788, in execute return connection.execute(statement, *multiparams, **params) File "sqlalchemy/engine/base.py", line 1191, in execute params) File "sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement return self.__execute_context(context) File "sqlalchemy/engine/base.py", line 1302, in __execute_context context.parameters[0], context=context) File "sqlalchemy/engine/base.py", line 1394, in _cursor_execute context) File "sqlalchemy/engine/default.py", line 299, in do_execute cursor.execute(statement, parameters) # Locals: cursor: self: statement: 'select pg_sleep(15)' parameters: {} context: Saturday, March 10, 12 Cool thing here is the locals -- one nice bit is that because SQLAlchemy has a local statement variable, we get the exact query that’s blocking the thread. Both of these examples involved basically frozen threads, but it’s good whenever you have a lot of activity

Slide 25

Slide 25 text

mmstats A simple way to expose metrics, statistics, and arbitrary data in an easy to consume format. Writing is framework agnostic Reading is language agnostic “The /proc ﬁlesystem for your application.” Saturday, March 10, 12 mmstats is a library written by Michael Schurter/schmichael, to provide a subset of what JMX might be used for.

Slide 26

Slide 26 text

What • mmap allows sharing memory between processes • Language independent data structure: • Series of ﬁelds (structs) • Fields have label, type, and values • Exposed in Python app as a model class Saturday, March 10, 12 mmstats exposes a memory mapped ﬁle for each process, and provides a reader that can pull it out.

Slide 27

Slide 27 text

Goals • Simple API • Minimal & predictable performance impact • Separate reading & writing Saturday, March 10, 12 separating reading & writing is most useful in a heterogeneous environment

Slide 28

Slide 28 text

Details • Single writer, multi-reader • No locks • No syscalls (write, read, send, recv, etc.)* • All in userspace for readers & writers* • Reading has no impact on writers • Fixed ﬁeld sizes Saturday, March 10, 12 single writer/no locks, means a mmap ﬁle per thread instrumentation should not cost too much application performance You are doing more work, but you need to be reasonable

Slide 29

Slide 29 text

import random import time import mmstats class ProgressStats(mmstats.BaseMmStats): count = mmstats.UIntField(label="counter.count") errors = mmstats.UIntField(label="counter.errors") last_sleep = mmstats.FloatField(label="counter.last_sleep") status = mmstats.StringField(label="counter.status") stats = ProgressStats(path=".", filename="progress-stats") for i in range(0, 1000): stats.count += 1 stats.status = "Up to %d, %d more to go" % (i, 1000 - i) stats.last_sleep = random.random() time.sleep(stats.last_sleep) Saturday, March 10, 12

Slide 30

Slide 30 text

saison:mms adam$ bin/slurpstats progress-stats ==> progress-stats counter.count 144 counter.status Up to 143, 857 more to go counter.last_sleep 0.294649809599 counter.errors 0 saison:mms adam$ bin/pollstats "counter.count,counter.errors" progress-stats counter.count | counter.errors 50 | 0 2 | 0 2 | 0 2 | 0 1 | 0 1 | 0 5 | 0 1 | 0 1 | 0 2 | 0 2 | 0 Saturday, March 10, 12 pollstats shows the difference between counters once a second Both of these are pretty useful for one-offs, as well. Next time you write a data munging script that might take a few hours to run, take a few minutes and expose some metrics. It’s much better than hoping you got your periodic print statements right.

Slide 31

Slide 31 text

mmash Saturday, March 10, 12 mmash -- web server; give it a settings ﬁle with locations of mmstats ﬁles exposes all found metrics There are also some built in aggregations for multiple processes

Slide 32

Slide 32 text

Visualization and Monitoring Saturday, March 10, 12 We use Icinga/Nagios for monitoring and pnp4nagios to graph a lot of that data This check pulls that aggregated sum from the previous slide

Slide 33

Slide 33 text

Visualization and Monitoring Saturday, March 10, 12 Then the ops team can use whatever tools they like; here’s a graph with all that data from multiple hosts. Being agnostic towards readers allows us to build a holistic dashboard of related graphs for a group of services, whether the source is /proc, a datastore, Java, Python, whatever

Slide 34

Slide 34 text

What’s next? • More fields • More measurement styles • More readers • Examples and plugins for common libraries • Unified web interface Saturday, March 10, 12 What’s next for mmstats? expansion along the current lines, exposing more kinds of data Unified web interface with a simple configuration so you could drop it on a server and point it at your mmstats and socketconsole locations

Slide 35

Slide 35 text

Parallel effort Scales: https://github.com/Greplin/scales/ pystuck: https://github.com/alonho/pystuck Your internal tools? Saturday, March 10, 12 Scales is a parallel effort by the Greplin team, and it’s come a long way. It supports a variety of metrics, and data is exported either by an in-process web server or by periodically pushing to Graphite Primary difference in philosophy is that it does require more locking and a thread/port per process Lots to learn from here, highly recommend checking it out pystuck -- just learned about this today!

Slide 36

Slide 36 text

Call to Action http://groups.google.com/group/python- introspection Sprinting on Monday and Tuesday https://github.com/robotadam/socketconsole https://github.com/schmichael/mmstats Saturday, March 10, 12 This is a practical problem, and many of us have solved portions of it in different ways. We want to learn from each other and build a set of tools and practices to make these solutions straightforward, and easy to integrate. This should be the default. We need this It’s within our reach Take a minute to join this mailing list and consider coming by to the sprints on Monday and Tuesday. I’ll be there, and schmichael will be sprinting remotely on Monday.

Slide 37

Slide 37 text

Thanks! Adam Lowry [email protected] [email protected] @robotadam Michael Schurter [email protected] [email protected] @schmichael http://urbanairship.com/jobs http://groups.google.com/group/python-introspection Saturday, March 10, 12

Slide 38

Slide 38 text

Bonus Saturday, March 10, 12