Measuring and Logging Everything in Real Time

Measuring and Logging Everything in Real-Time @BastianHofmann As the title says, in the next hour we are going to talk about two very important aspects of any web application that are often overlooked in the beginning:

Logging The first is logging, so that you actually know what's happening in your application if you have to track down an error.

Measuring And the second is constantly measuring and monitoring how your application is behaving.

Many roads Of course, as with many things, there are many roads to accomplish that and many tools available that help you with it. I'm not going to show you all of them, just what we ended up using at ResearchGate. They work great for us, but depending on your use cases other tools may be better suited to your needs. The important thing to take away from this talk: take care of it.

A few words about me before that ...

I work at ResearchGate, the social network for scientists and researchers.

ResearchGate gives science back to the people who make it happen. We help researchers build reputation and accelerate scientific progress. On their terms. The goal is to give...

over 2.9 million users

Here are some impressions of the page.

http://gigaom.com/2013/06/05/heres-how-bill-gatess-researchgate-investment-might-change-the-world-for-the-better
http://venturevillage.eu/researchgate

We have this, and also work on some cool stuff.

In addition to this, I also speak frequently at conferences throughout the world.

I work and live in Berlin.

Questions? Ask By the way, if you have any questions throughout this talk, or if you don't understand something, just raise your hand and ask. It's probably my fault anyway, since I spoke too quickly or my accent was too bad.

Logging So 1st: Logging

For when something goes wrong

server error log, access log, debug logs, slow query log, ... Let's start with a simple setup: we have one server with Apache and some database on it. Of course, all of these services are writing logs somewhere.

$ ssh my_simple_server
Getting at these is very simple: just ssh into the server

$ tail -f error.log
$ cat error.log | grep Exception
and use your favorite Unix command-line tools to view the logs.

Error Logs Let's look at the application related logs a
bit more closely, for example the ... as it says, Apache and PHP log all errors that happen in your application there

ErrorLog /var/logs/apache/error.log
The location is specified somewhere in your Apache configuration, like this.

And the user then gets a nice error page like this one. Looking at this screenshot, the first thing you may notice is the HTTP response code displayed there.

HTTP Response Codes http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html Please read up on them and choose the correct ones for your application. This helps you greatly with filtering the important stuff from the unimportant later on: a 503 may be much more serious than a 404.

on the screenshot here you also see some additional codes
and ids displayed that help to identify the error in your system later on

Log additional info Of course you should not only display this information but also write it to your log files. You can easily do this in your custom error handler, and if you need to find the error that resulted in this error page, you can just grep for the error code.
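A minimal sketch of such an error handler, written in Python purely for illustration (the application in the talk is PHP). The function name `handle_error`, the logger name, and the 12-character ID are assumptions, not something shown on the slides.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")


def handle_error(exc, status_code=500):
    """Log the exception with a short ID and return the text for the error page."""
    # Short random ID that is both shown to the user and written to the log,
    # so the matching log entry can be found with a simple grep.
    error_id = uuid.uuid4().hex[:12]
    logger.error(
        "error_id=%s status=%d %s: %s",
        error_id, status_code, type(exc).__name__, exc,
    )
    page = f"Something went wrong (HTTP {status_code}). Error ID: {error_id}"
    return error_id, page
```

With this, a support request that quotes the ID from the error page can be resolved with something like `grep error_id=<id> error.log`.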

Not only error codes

Request Information

Access Logs Another important log is the access log, which
logs each incoming request

192.168.56.1 - - [09/Jul/2012:19:18:19 +0200] "GET /rg_trunk/webroot/c/af10c/images/template/rg_logo_default.png HTTP/1.1" 200 882 "http://devm/rg_trunk/webroot/directory/publications/"
This is what it usually looks like.

LogFormat "%h %l %u %t \"%r\" %>s %b" custom
CustomLog /var/logs/apache/access.log custom
http://httpd.apache.org/docs/2.2/mod/mod_log_config.html#logformat
And you configure it in Apache like this. There is already a lot of information from the request that you can put in there, like URL, response code, request time, etc.
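As a rough illustration of why a fixed log format is useful, here is a Python sketch that parses lines written with the `%h %l %u %t \"%r\" %>s %b` format shown above. The function and field names are assumptions chosen for this example.

```python
import re

# Matches lines written with: LogFormat "%h %l %u %t \"%r\" %>s %b"
ACCESS_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return the fields of one access log line as a dict, or None if it does not match."""
    match = ACCESS_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # %b logs "-" when no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields
```

Anything appended to the format after `%b` (referer, user agent, custom fields) is ignored by this pattern and would need its own groups.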

LogFormat "...\"%{referer}i\" \"%{user-agent}i\" %{session_id}o %{account_id}o..." custom
But back to Apache: these notes can then be referenced in your log format like this.

Debug Logs next log type that your application can write
are plain debug logs that are not related to errors

Fingers Crossed Logging everything, like SQL queries etc., can result in a huge amount of log files, and in 99% of requests you are not interested in them. So you should have a look at fingers-crossed handlers for logging: you put your logs into this handler, but it does not write them directly to a file. Only when a certain condition is met, like a log-level threshold, are all the logs buffered so far and all further logs written to the file.
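The fingers-crossed idea can be sketched in a few lines of Python. This is an illustration of the pattern, not Monolog's implementation; the class names, the `ERROR` trigger level and the buffer size are assumptions.

```python
import logging


class FingersCrossedHandler(logging.Handler):
    """Buffer records until one reaches trigger_level, then pass everything on."""

    def __init__(self, target, trigger_level=logging.ERROR, buffer_size=1000):
        super().__init__()
        self.target = target
        self.trigger_level = trigger_level
        self.buffer_size = buffer_size
        self.buffer = []
        self.triggered = False

    def emit(self, record):
        if self.triggered:
            # Threshold was reached earlier: write all further records directly.
            self.target.handle(record)
            return
        self.buffer.append(record)
        if len(self.buffer) > self.buffer_size:
            self.buffer.pop(0)  # keep only the most recent records
        if record.levelno >= self.trigger_level:
            self.triggered = True
            for buffered in self.buffer:
                self.target.handle(buffered)
            self.buffer = []


class ListHandler(logging.Handler):
    """Stand-in for a file handler that collects messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())
```

In a web application one such handler would typically be created per request, so that a quiet request writes nothing and a failing request writes its full debug trail.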

Log in a structured way One thing that can help you greatly with managing a huge amount of logs with lots of different additional information in them is logging in a structured way (not only for debug logs but also for error logs).

JSON http://www.ietf.org/rfc/rfc4627.txt in my opinion a good format is json,
since it is still human readable
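A small Python sketch of structured logging as one JSON object per line. The field names and the convention of passing extra fields through a `context` attribute are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: structured fields are passed with
        # logger.info("...", extra={"context": {"account_id": 42}}).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)
```

Attached to a file handler with `handler.setFormatter(JsonFormatter())`, every line of the log can be parsed by a machine and is still readable by a human.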

Logs from other services But of course your application is probably a bit more complex (SOA, anyone?), so you also have other services somewhere logging something.

[Diagram: a user request hits the web server, which calls several HTTP services; each component writes its own log]
The setup may look like this: a request comes to a web server and your PHP application on it calls other services. Each of them has its own logs, for example the HTTP service there. Now if an error happens in the HTTP service, we are probably going to display an error in our PHP app as well. But how can we then identify which exception on the HTTP service led to the displayed error on the web server?

Correlation / Tracing ID A nice way of doing this in a generalized, non-custom way is by using a common correlation or tracing ID.

[Diagram: the web server creates a unique trace_id for the user request and passes it to every HTTP service]
So when the user request first hits your system, you generate a unique trace_id for this request and then pass it to all underlying services through an HTTP header.

X-Trace-Id: bbr8ehb984tbab894

[Diagram: every service writes the trace_id to its own log]
Everyone then puts this tracing ID into every log they write. So if you have a tracing ID, you can easily find all logs for this request by just grepping for the trace_id.
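The flow can be sketched in Python as follows. The header name `X-Trace-Id` is taken from the slide; the function names and the use of a UUID as the ID are assumptions for illustration.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"


def get_or_create_trace_id(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace.

    Assumes a plain dict with the exact header name as key; a real framework
    would look headers up case-insensitively.
    """
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id, headers=None):
    """Return headers for a downstream HTTP call with the trace ID attached."""
    result = dict(headers or {})
    result[TRACE_HEADER] = trace_id
    return result


def log_line(trace_id, message):
    """Prefix a log message with the trace ID so it can be grepped later."""
    return f"trace_id={trace_id} {message}"
```

The web server calls `get_or_create_trace_id` once per request; every service behind it does the same and therefore reuses the ID it was given.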

but... usually your application does not look like this

but more like this: you have multiple servers lying around

which means ssh-ing into all of these servers (dsh...) and
then grepping for information over multiple logs (access log, error log, debug logs, ...) can become quite tedious and time consuming

Aggregate the logs in a central place so to make
it easier: work on a central log management and ...

Make them easily full-text searchable also ...

Make them aggregate-able and make different kind of logs aggregateable

Always Log to file but whatever you do and whatever
we are going to talk about...

Seriously...

Always Log to file Because this is your backup when your central log management solution fails (network errors, etc.). And I guarantee you, it will fail sometime, probably at the worst moment.

Directly to a database The first, naive approach to central log management is to log directly from your application to a database (you saw the MongoDBHandler from Monolog).

[Diagram: three webservers writing to one DB]
The setup would look like this: everything in one place, easily searchable, great, finished.

Disadvantages not quite, it has some disadvantages

Database is down? what happens if

Database is slow?

Database is full?

How to integrate access logs?

Influences application performance because of all the database queries you
are doing, and chances are if there are problems on your platform and multiple errors are occurring, directly writing to a database will make your problems worse

Frontend? Also, there is still no frontend to easily search and monitor exceptions. Of course there are SQL clients, phpMyAdmin, RockMongo, etc., but they are multi-purpose tools and not really made for displaying and filtering logs.

Better solutions? are there ...

graylog2 http://graylog2.org/ One tool that I quite like for this is called graylog2.

Full text search it comes out of the box with
a very nice interface to do...

Structured Messages in ...

Metrics & Alarms and offers metrics, alarms on your logs

that's what the interface looks like, you'll see this in
action in a minute

https://github.com/bashofmann/vm_graylog2_amqp_logstash_apache

[Diagram: three webservers send GELF messages via UDP to Graylog2, which stores them in elasticsearch]
The easiest approach is to just send the messages in a special format (GELF, supported by a Monolog handler) with a UDP request from your app servers to Graylog, which stores them in elasticsearch. UDP is quite nice since it is fire and forget, so it does not influence our application too much (if you reuse connections and don't create a new one for each log).

{
  "version": "1.0",
  "host": "www1",
  "short_message": "Short message",
  "full_message": "Backtrace here\n\nmore stuff",
  "timestamp": 1291899928.412,
  "level": 1,
  "facility": "payment-backend",
  "file": "/var/www/somefile.rb",
  "line": 356,
  "_user_id": 42,
  "_something_else": "foo"
}
A typical Graylog GELF message looks like this: some default fields, plus user-specific fields prefixed with an underscore.
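To make the fire-and-forget idea concrete, here is a hedged Python sketch that builds a message with the fields shown above and sends it as a single zlib-compressed UDP datagram. The function names, the target address and port, and the choice of zlib are assumptions; large messages would additionally need GELF's chunking, which is left out here.

```python
import json
import socket
import time
import zlib


def build_gelf(short_message, host, level=1, **extra):
    """Build a GELF-style dict; user-specific fields get an underscore prefix."""
    message = {
        "version": "1.0",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,
    }
    message.update({f"_{key}": value for key, value in extra.items()})
    return message


def encode_gelf(message):
    """Serialize the message to JSON and compress it with zlib."""
    return zlib.compress(json.dumps(message).encode("utf-8"))


def send_gelf(sock, message, address=("127.0.0.1", 12201)):
    """Fire and forget: one datagram, no reply expected.

    The address is an assumption; use wherever your Graylog input listens.
    """
    sock.sendto(encode_gelf(message), address)
```

Create the socket once with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and reuse it for every message, in line with the remark about not opening a new connection for each log.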

Disadvantages but ...

Graylog/elasticsearch is down? Still, if Graylog or elasticsearch is down, you are losing logs.

Graylog/ elasticsearch is full?

Packet loss udp has the disadvantage that because of network
errors your logs possibly don't arrive

[Diagram: three webservers send GELF messages to an AMQP queue, from which Graylog2 reads and stores them in elasticsearch]
A better approach: put a queue in between, which is also good for load balancing.

Don't influence your application by logging But even if you push something into a queue, you are still influencing your production system unnecessarily.

Always Log to file And remember, you should log to file anyway.

logstash http://logstash.net/ enter the next tool we are using, logstash
is a very powerful tool to handle log processing

input filter output basic workflow is that you have some
input where logstash gets log messages in, on this input you can execute multiple filters that modify the message and then you can output the filtered message somewhere
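The input, filter, output workflow can be pictured with a few lines of Python. This is only a conceptual sketch of the pipeline idea, not how logstash is implemented or configured; all names are made up for the example.

```python
def read_input(lines):
    """Input stage: turn raw lines into event dicts."""
    for line in lines:
        yield {"message": line.rstrip("\n")}


def drop_empty(event):
    """Filter stage: returning None drops the event."""
    return event if event["message"] else None


def add_level(event):
    """Filter stage: derive a field from the raw message."""
    event["level"] = "error" if "Exception" in event["message"] else "info"
    return event


def run_pipeline(lines, filters, output):
    """Pass every event through all filters in order, then hand it to the output."""
    for event in read_input(lines):
        for apply_filter in filters:
            event = apply_filter(event)
            if event is None:
                break  # a filter dropped the event
        else:
            output(event)
```

Here the output is any callable, for example a function that pushes the event to a queue.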

Very rich plugin system To do this it offers a very large and rich plugin system for all kinds of inputs, filters and outputs, and you can also write your own.

[Diagram: on each of three webservers a logstash instance reads the local log and sends it to an AMQP queue, which feeds Graylog2 and elasticsearch]
So our setup would look like this: logstash instances on each app server take the local log files, parse and filter them, and send them to the queue.

Measuring Now that you have all the log messages in a central, searchable, aggregateable place, let's go to measuring.

For knowing what happens so that you know what actually
happens in your application and in your system

Technical Metrics and with that i mean technical metrics like
load on your servers, available memory, requests, exceptions, response times (a bit more on that later)

Business Metrics but also and probably even more important, business
metrics

Define your KPIs Or so-called key performance indicators. Examples: signups, profile views, messages, uploaded images, average volume per sale, ... It really depends on your application.

Graphite http://graphite.wikidot.com/ again there are many tools available to collect
and display these metrics: one i want to highlight is graphite

[Diagram: three webservers send UDP messages to graphite]
The easiest way to use it is to just send UDP requests with the metric you want to count to graphite. But if you're collecting and measuring lots of stuff, this could put quite a high load on graphite.
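A sketch of what such a UDP request could look like from Python, assuming graphite's plaintext line format of metric path, value and Unix timestamp, and a UDP listener on the receiving side. The address, port and metric names are assumptions for this example.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one data point as a single line: path, value, Unix timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(sock, path, value, address=("127.0.0.1", 2003)):
    """Send a single data point as one UDP datagram.

    The address is an assumption; use wherever your graphite listens for UDP.
    """
    sock.sendto(format_metric(path, value).encode("ascii"), address)
```

Because every single event becomes its own datagram, the receiving side has to process one message per event, which is the load problem mentioned above.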

StatsD https://github.com/etsy/statsd/ etsy wrote a great small tool called statsd
to help you with that

[Diagram: three webservers send UDP messages to statsd, which forwards to graphite]
You can just hook it in between your servers and graphite as a small load-balancing daemon. It receives all messages, aggregates them and then flushes them to graphite at a certain interval, like every 10 seconds.
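The aggregate-then-flush behaviour can be illustrated with a tiny Python class. This is a conceptual sketch, not statsd's actual code; the class and method names are assumptions, and the timer that would call `flush` every interval is left out.

```python
from collections import Counter


class CounterAggregator:
    """Collect counter increments and emit per-interval totals on flush."""

    def __init__(self):
        self.counts = Counter()

    def receive(self, name, value=1):
        """Called for every incoming message; only updates memory."""
        self.counts[name] += value

    def flush(self):
        """Return the totals since the last flush and start a new interval."""
        totals = dict(self.counts)
        self.counts.clear()
        return totals
```

However many increments arrive within an interval, the backend only sees one value per metric name and flush.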

[Diagram: each of three webservers has its own statsd, which sends aggregated UDP messages to graphite]
Of course you can also run one statsd on every webserver, so that your statsd does not get overloaded.

[Diagram: the statsd instances on each webserver send aggregated UDP messages to another statsd in front of graphite]
And you can even arrange different statsd instances in a row.

Metric? what is a metric: three different types

Counters counters (how often did something happen)

Timers timer (how long did something take)

Gauges gauges (collect arbitrary values, like the amount of a sale)
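In the StatsD line protocol the three types are distinguished by a suffix: `|c` for counters, `|ms` for timers and `|g` for gauges. A small Python sketch follows; the helper names, metric names and target address are assumptions for this example.

```python
import socket


def counter(name, value=1):
    """How often did something happen."""
    return f"{name}:{value}|c"


def timer(name, milliseconds):
    """How long did something take."""
    return f"{name}:{milliseconds}|ms"


def gauge(name, value):
    """An arbitrary value at a point in time."""
    return f"{name}:{value}|g"


def send(sock, line, address=("127.0.0.1", 8125)):
    """Send one metric line as a UDP datagram; the address is an assumption."""
    sock.sendto(line.encode("ascii"), address)
```

For example, `send(sock, counter("signups"))` after a successful registration, or `send(sock, timer("search.query", 320))` around a slow call.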

graphite comes with a powerful interface where you can plot
and aggregate this data into graphs and perform different mathematical functions on it to get it exactly the way you want to display your data

https://metrics.librato.com/ and if you don't want to be bothered with
setting graphite up, you can also use hosted solutions like librato (statsd comes with different backends to push data to)

http://mashable.com/2010/10/13/etsy-office-pics/ And if you have all of this, you should make your graphs really visible. Here is a picture from the Etsy office.

Remember: Response Times last but not least:

Easiest approach: Get it out of the access log

192.168.56.1 - - [09/Jul/2012:19:18:19 +0200] "GET /rg_trunk/webroot/c/af10c/images/template/rg_logo_default.png HTTP/1.1" 200 882 "http://devm/rg_trunk/webroot/directory/publications/"
This is what it usually looks like.

Is this actually what we want to measure? Is this
really the response time the user sees?

External Resources, CSS, JavaScript, Images, Slow connections, Slow browsers, SSL handshakes, DNS resolving So for the users, the actual response time is the time until all of this has finished loading.

That is what we want to measure!

https://github.com/lognormal/boomerang

[Diagram: boomerang (JS in the browser) requests a tracking image, with timing information as query parameters, from the tracking server; logstash reads the tracking server's access log and sends timers via statsd to graphite]
Boomerang collects load times of the current page from various sources (e.g. the window.performance API) and with them requests a tracking image from a tracking server you specify (it should be on another domain, so that you don't send unnecessary cookies there). On this tracking server a logstash can parse the access logs and then send the collected timers via statsd to graphite.
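The step from access log to timers can be sketched in Python: take the request path of the tracking image and turn every numeric query parameter into a StatsD timer line (`name:value|ms`). The path `/beacon.gif`, the parameter names `t_done` and `t_resp`, and the `frontend` prefix are assumptions for illustration; in the setup described here, logstash does this job.

```python
from urllib.parse import parse_qs, urlsplit


def timers_from_beacon(request_path, prefix="frontend"):
    """Turn numeric query parameters of a tracking-image request into timer lines."""
    query = parse_qs(urlsplit(request_path).query)
    lines = []
    for name, values in sorted(query.items()):
        value = values[0]
        # Skip non-numeric parameters such as the page URL.
        if value.isdigit():
            lines.append(f"{prefix}.{name}:{value}|ms")
    return lines
```

Each returned line can then be sent to statsd as one UDP datagram.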

http://twitter.com/BastianHofmann
http://profiles.google.com/bashofmann
http://lanyrd.com/people/BastianHofmann
http://speakerdeck.com/u/bastianhofmann
https://github.com/bashofmann
https://www.researchgate.net/profile/Bastian_Hofmann/
Thanks, you can contact me on any of these platforms or via mail.
