
Introspecting Running Python Processes

Understanding the internal state of a running system can be vital to maintaining a high-performance, stable system, but conventional approaches such as logging and error handling only expose so much. This talk touches on how to instrument Python programs in order to observe the state of the system, measure performance, and identify ongoing problems.

Adam Lowry

March 10, 2012

Transcript

  1. My name is Adam Lowry; I'm a co-founder and developer at Urban Airship. We provide services for mobile application developers.
  2. My colleague Michael Schurter (schmichael) unfortunately couldn't be here with me, but he worked on this talk with me. You can trust him; look at that beard.
  3. Here's a picture of Michael in a Colombian drug runner's submarine.
  4. What is your application doing?
     With any application, but especially a high-performance one, you need to know what is going on. How do you know that your web application, say, is behaving as you expect? You have a monitor that hits it once a minute, you have logs and error email, but you don't actually know. If everything ground to a halt without any change on your part, could you identify the problem? It's difficult unless you instrument and expose internal data; otherwise all you can do is guess about the causes of "The Event".
  5. Logging: your application's diary
     • A stream of expected events
     • Everything is persisted
     • Process and search as needed
  6. Logging, but
     • Operational burden: storage, aggregation, searching
     • Records the event, but can't inspect related state
     • Difficult to predict what you need
     • Performance tradeoff with verbosity
     A slow request: why was it slow? A 500: did you store enough local variables to reproduce it? Logging is still vital for auditing and reference.
  7. Breaking in through the back door
     Spelunking the uncharted territory of your application's process
     • gdb-heap: dig into memory itself
     • eventlet's backdoor module: a green thread exposes a REPL over telnet, with the interpreter in the state it was spawned
     • Werkzeug debugger: gives you a detailed view (and a REPL!) of a single request or exception, but it's useful only for development. Not for production use.
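     For the backdoor approach, a minimal sketch of the usual wiring (the port number here is arbitrary):

        import eventlet
        from eventlet import backdoor

        # Spawn a green thread serving a REPL over telnet on localhost:3000;
        # connect with `telnet localhost 3000` to poke at the live process.
        eventlet.spawn(backdoor.backdoor_server,
                       eventlet.listen(('localhost', 3000)))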
  8. JMX (Java Management Extensions)
     Any library or application can expose a set of interfaces. Reading is not language agnostic, but adapter libraries can expose data via HTTP and JSON. Communication can be two-way. Traditionally used for resource consumption and system monitoring, but any application can make use of it.
  9. jconsole
     jconsole is a GUI tool for visualizing and interacting with JMX data. Incredibly useful to point at a production machine you're concerned about. This one's doing pretty well.
  10. jstack

        2012-03-08 00:17:54
        Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.0-b11 mixed mode):

        "Attach Listener" daemon prio=10 tid=0x0000000058bd4000 nid=0x353f waiting on condition [0x0000000000000000]
           java.lang.Thread.State: RUNNABLE

        "558334662@qtp-103468119-1932" prio=10 tid=0x00002aaac8131800 nid=0x2e76 in Object.wait() [0x000000004ed3f000]
           java.lang.Thread.State: TIMED_WAITING (on object monitor)
             at java.lang.Object.wait(Native Method)
             at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
             - locked <0x00000000f955c638> (a org.mortbay.thread.QueuedThreadPool$PoolThread)

      jstack is another useful tool: point it at a running JVM instance and get a stack trace of all threads. Not a very interesting trace here :) This is built into the runtime, accessible with a signal.
  11. metrics
      http://metrics.codahale.com/
      "METRICS METRICS EVERYWHERE" (April 9, 2011): http://codahale.com/codeconf-2011-04-09-metrics-metrics-everywhere.pdf
      Tools for retrieving information are good, but good data requires good instrumentation. metrics is a library for instrumenting your application; all data is exposed via JMX. It's not just the tools, though: metrics, and this presentation from last year, represent a mindset that knowledge is crucial to operations. I highly recommend you watch or read that presentation.
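      The metrics library itself is Java, but a toy Python analogue of the idea (hypothetical names, not the real API) might look like this: record a call count and cumulative time per named operation, and expose the table however you like.

        import time
        from collections import defaultdict

        # name -> [call count, total seconds]; a crude stand-in for a timer metric.
        _timings = defaultdict(lambda: [0, 0.0])

        def timed(name):
            def decorator(fn):
                def wrapper(*args, **kwargs):
                    start = time.time()
                    try:
                        return fn(*args, **kwargs)
                    finally:
                        stats = _timings[name]
                        stats[0] += 1
                        stats[1] += time.time() - start
                return wrapper
            return decorator

        @timed("db.query")  # hypothetical metric name
        def run_query():
            pass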
  12. metrics
      Back to this view of JMX data: these attributes are measurements from metrics. It reports a particular set of calls and how long they're taking, with a variety of measurements. You can get a much better idea of what's really going on.
  13. metrics
      One more little bit from jconsole: my colleague lucky wanted you to see the OneMinuteRate graph. Click on it in the previous view and it starts graphing from that point.
  14. Approaches
      • New Relic
      • Graphite
      Now that we have an idea of what we want, how can we get it? Two systems people are using today are New Relic and Graphite. New Relic is a hosted web-application monitoring solution; I highly recommend you look into it. I know a lot of the New Relic people are around the conference, so I hope they'll do some demos. Graphite is a scalable graphing system for time-series data. It has a Python library for sending data to it from inside your application.
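      As a sketch of sending data to Graphite, its carbon daemon accepts a plaintext protocol of "<metric path> <value> <unix timestamp>\n", by default on port 2003 (the host and metric name below are placeholders):

        import socket
        import time

        # Send one data point to carbon's plaintext listener.
        sock = socket.create_connection(("graphite.example.com", 2003))
        sock.sendall("myapp.requests.count 42 %d\n" % int(time.time()))
        sock.close()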
  15. New Approaches
      Now I'd like to talk about a few ways Michael and I have been working on to instrument our Python applications.
  16. socketconsole: stack trace dumps for Python processes
      The first is socketconsole, which is a terrible name; my apologies. I wanted jstack for Python. After talking with people in the #pdxpython IRC channel, some of them had tried a few different approaches to this. We have a mix of web servers and non-web services, and I wanted something that would work the same on any of them. Communication with the process is done over a Unix socket, one filename per process, so there's no fighting over names, ports, etc.
  17. socketconsole goals
      • For multi-process and multi-threaded apps
      • Easy or no configuration
      • No fighting over HTTP ports
      • No Unix signals
      Multi-process/multi-threaded: we use mod_wsgi for our most important application; one server has many processes, and each process has many threads. No Unix signals: we don't want to trust that every single app and library handles interruptions correctly.
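      The core technique behind a jstack-style dump in Python, sketched here without the Unix-socket plumbing (this is the general idea, not socketconsole's exact code):

        import sys
        import threading
        import traceback

        def dump_all_threads():
            # sys._current_frames() maps each thread ID to its current stack
            # frame; pairing it with threading.enumerate() recovers the names.
            names = dict((t.ident, t.name) for t in threading.enumerate())
            out = []
            for thread_id, frame in sys._current_frames().items():
                out.append("# Thread Name: %s, ThreadID: %s\n"
                           % (names.get(thread_id, "?"), thread_id))
                out.extend(traceback.format_stack(frame))
            return "".join(out)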
  18.   import random
        import time
        import threading

        import socketconsole

        socketconsole.launch()

        def waiter():
            some_value = random.randint(0, 1000)
            time.sleep(500)

        ts = [threading.Thread(target=waiter) for i in range(3)]
        for i, t in enumerate(ts):
            t.name = "thread-%d" % i
            t.daemon = True
            t.start()
            print "Started", t.name

        waiter()

      Example: start up the socketconsole thread (in a web app, the WSGI file that creates the application is a good place for this). This example starts a few threads, each sleeping.
  19.   (socketconsole)saison:socketconsole adam$ bin/socketreader

        # Thread Name: thread-2, ThreadID: 4315418624
        File "lib/python2.6/threading.py", line 504, in __bootstrap
          self.__bootstrap_inner()
        File "lib/python2.6/threading.py", line 532, in __bootstrap_inner
          self.run()
        File "lib/python2.6/threading.py", line 484, in run
          self.__target(*self.__args, **self.__kwargs)
        File "example/sockdemo.py", line 11, in waiter
          time.sleep(500)
        # Locals:
        some_value: 160

        # Thread Name: thread-1, ThreadID: 4312272896
        File "lib/python2.6/threading.py", line 504, in __bootstrap
          self.__bootstrap_inner()
        File "lib/python2.6/threading.py", line 532, in __bootstrap_inner
          self.run()
        File "lib/python2.6/threading.py", line 484, in run
          self.__target(*self.__args, **self.__kwargs)
        File "example/sockdemo.py", line 11, in waiter
          time.sleep(500)
        # Locals:
        some_value: 298

        # Thread Name: MainThread, ThreadID: 140735302732128
        File "example/sockdemo.py", line 21, in <module>
          waiter()
        File "example/sockdemo.py", line 11, in waiter
          time.sleep(500)

      socketreader is a shortcut: it communicates with each Unix socket/process it finds and outputs every thread's stack. Paths edited for clarity.
  20. SQLAlchemy example

        >>> import sqlalchemy, socketconsole
        >>> socketconsole.launch()
        >>> eng = sqlalchemy.create_engine("postgresql:///test")
        >>> q = sqlalchemy.text("select pg_sleep(15)")
        >>> eng.execute(q)
        <sqlalchemy.engine.base.ResultProxy object at 0x101666610>

      With a little more work we can get more use out of it. This script connects to a test Postgres database and sleeps for a while (long enough for me to run socketreader :)).
  21.   # Thread Name: MainThread, ThreadID: 140735302732128
        File "<stdin>", line 1, in <module>
        File "sqlalchemy/engine/base.py", line 1788, in execute
          return connection.execute(statement, *multiparams, **params)
        File "sqlalchemy/engine/base.py", line 1191, in execute
          params)
        File "sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
          return self.__execute_context(context)
        File "sqlalchemy/engine/base.py", line 1302, in __execute_context
          context.parameters[0], context=context)
        File "sqlalchemy/engine/base.py", line 1394, in _cursor_execute
          context)
        File "sqlalchemy/engine/default.py", line 299, in do_execute
          cursor.execute(statement, parameters)
        # Locals:
        cursor: <sqlalchemy.pool._CursorFairy object at 0x101665710>
        self: <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 object at 0x101661ad0>
        statement: 'select pg_sleep(15)'
        parameters: {}
        context: <sqlalchemy.dialects.postgresql.psycopg2.PGExecutionContext_psycopg2 object at 0x101666550>

      The cool thing here is the locals: because SQLAlchemy has a local statement variable, we get the exact query that's blocking the thread. Both of these examples involved essentially frozen threads, but this is useful whenever you have a lot of activity.
  22. mmstats
      A simple way to expose metrics, statistics, and arbitrary data in an easy-to-consume format. Writing is framework agnostic; reading is language agnostic. "The /proc filesystem for your application."
      mmstats is a library written by Michael Schurter (schmichael) to provide a subset of what JMX might be used for.
  23. What
      • mmap allows sharing memory between processes
      • Language-independent data structure:
        • a series of fields (structs)
        • fields have a label, type, and values
      • Exposed in the Python app as a model class
      mmstats exposes a memory-mapped file for each process and provides a reader that can pull the data out.
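      A toy illustration of the underlying mechanism (not mmstats' actual file format): a writer updates a fixed-size field in place in a memory-mapped file, and a reader, normally a separate process, maps the same file read-only.

        import mmap
        import struct

        SIZE = struct.calcsize("I")  # one unsigned int field

        # Writer: create the file, map it, and update the field in place.
        with open("toy-stats.mmap", "wb") as f:
            f.write(struct.pack("I", 0))
        wf = open("toy-stats.mmap", "r+b")
        wm = mmap.mmap(wf.fileno(), SIZE)
        wm[0:SIZE] = struct.pack("I", 42)

        # Reader: map the same file read-only; no impact on the writer.
        rf = open("toy-stats.mmap", "rb")
        rm = mmap.mmap(rf.fileno(), SIZE, access=mmap.ACCESS_READ)
        print struct.unpack("I", rm[0:SIZE])[0]  # -> 42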
  24. Goals
      • Simple API
      • Minimal and predictable performance impact
      • Separate reading and writing
      Separating reading and writing is most useful in a heterogeneous environment.
  25. Details
      • Single writer, multi-reader
      • No locks
      • No syscalls (write, read, send, recv, etc.)*
      • All in userspace for readers and writers*
      • Reading has no impact on writers
      • Fixed field sizes
      Single writer with no locks means one mmap file per thread. Instrumentation should not cost too much application performance; you are doing more work, but you need to be reasonable about it.
  26.   import random
        import time

        import mmstats

        class ProgressStats(mmstats.BaseMmStats):
            count = mmstats.UIntField(label="counter.count")
            errors = mmstats.UIntField(label="counter.errors")
            last_sleep = mmstats.FloatField(label="counter.last_sleep")
            status = mmstats.StringField(label="counter.status")

        stats = ProgressStats(path=".", filename="progress-stats")
        for i in range(0, 1000):
            stats.count += 1
            stats.status = "Up to %d, %d more to go" % (i, 1000 - i)
            stats.last_sleep = random.random()
            time.sleep(stats.last_sleep)
  27.   saison:mms adam$ bin/slurpstats progress-stats
        ==> progress-stats
            counter.count 144
            counter.status Up to 143, 857 more to go
            counter.last_sleep 0.294649809599
            counter.errors 0

        saison:mms adam$ bin/pollstats "counter.count,counter.errors" progress-stats
        counter.count | counter.errors
                   50 | 0
                    2 | 0
                    2 | 0
                    2 | 0
                    1 | 0
                    1 | 0
                    5 | 0
                    1 | 0
                    1 | 0
                    2 | 0
                    2 | 0

      pollstats shows the difference between counters once a second. Both of these are pretty useful for one-offs as well. Next time you write a data-munging script that might take a few hours to run, take a few minutes and expose some metrics; it's much better than hoping you got your periodic print statements right.
  28. mmash
      mmash is a web server: give it a settings file with the locations of mmstats files and it exposes all found metrics. There are also some built-in aggregations for multiple processes.
  29. Visualization and Monitoring
      We use Icinga/Nagios for monitoring and pnp4nagios to graph a lot of that data. This check pulls the aggregated sum from the previous slide.
  30. Visualization and Monitoring
      Then the ops team can use whatever tools they like; here's a graph with all that data from multiple hosts. Being agnostic towards readers allows us to build a holistic dashboard of related graphs for a group of services, whether the source is /proc, a datastore, Java, Python, whatever.
  31. What's next?
      • More fields
      • More measurement styles
      • More readers
      • Examples and plugins for common libraries
      • Unified web interface
      What's next for mmstats? Expansion along the current lines, exposing more kinds of data, and a unified web interface with simple configuration, so you could drop it on a server and point it at your mmstats and socketconsole locations.
  32. Parallel effort
      • Scales: https://github.com/Greplin/scales/
      • pystuck: https://github.com/alonho/pystuck
      • Your internal tools?
      Scales is a parallel effort by the Greplin team, and it's come a long way. It supports a variety of metrics, and data is exported either by an in-process web server or by periodically pushing to Graphite. The primary difference in philosophy is that it requires more locking and a thread/port per process. There's lots to learn from here; I highly recommend checking it out. pystuck: I just learned about this today!
  33. Call to Action
      http://groups.google.com/group/python-introspection
      https://github.com/robotadam/socketconsole
      https://github.com/schmichael/mmstats
      Sprinting on Monday and Tuesday
      This is a practical problem, and many of us have solved portions of it in different ways. We want to learn from each other and build a set of tools and practices to make these solutions straightforward and easy to integrate. This should be the default. We need this, and it's within our reach. Take a minute to join the mailing list and consider coming by the sprints on Monday and Tuesday. I'll be there, and schmichael will be sprinting remotely on Monday.
  34. Thanks!
      Adam Lowry: [email protected], [email protected], @robotadam
      Michael Schurter: [email protected], [email protected], @schmichael
      http://urbanairship.com/jobs
      http://groups.google.com/group/python-introspection