Slide 1

Application performance management with open source tools
Monica Sarbu & Tudor Golubenco (@monicasarbu & @tudor_g)

Slide 2

Intro
• Software devs
• Worked at a startup doing a VoIP monitoring product
• Startup acquired by Acme Packet, then acquired by Oracle
• Working on @packetbeat

Slide 3

Scaling
• Infrastructure:
  • scale to 100s, 1,000s, 10,000s of servers
• Organization:
  • scale to 100s, 1,000s, 10,000s of employees

Slide 4

Conway’s law
• “Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”

Slide 5

First org chart

Slide 6

First org chart

Slide 7

Microservices

Slide 8

Microservices

Slide 9

Evolution
• Applications evolve over time
• Adapt to new requirements
• Mutations are kind of random
• You need to select the good mutations

Slide 10

Operational monitoring
• Critical
  • It’s how you filter out the bad mutations and keep the good ones
• Difficult
  • Highly heterogeneous infrastructures
  • Must show the global state of a distributed system

Slide 11

monitoring and troubleshooting distributed applications

Slide 12

Requirements
• Scalable and reliable
• Extracts data from different sources
• Low overhead
• Low configuration
• Simple, easy to understand

Slide 13

Start from the communication
• The communication between components gets you the big picture
• Protocols are standard
• Packet data is objective
• No latency overhead

Slide 14

Packetbeat
• First public version in May 2014
• Open source, written in Go

Slide 15

What is Packetbeat? ¯\_(ツ)_/¯

Slide 16

Packetbeat shipper
• Runs on your application servers
• Follows TCP streams and decodes upper-layer protocols like HTTP, MySQL, PgSQL, Redis, Thrift-RPC, etc.
• Correlates requests with responses
• Captures data and measurements from transactions and the environment
• Exports data in JSON format

Slide 17

{
  "client_ip": "127.0.0.1",
  "client_port": 46981,
  "ip": "127.0.0.1",
  "query": "select * from test",
  "method": "SELECT",
  "pgsql": {
    "error_code": "",
    "error_message": "",
    "error_severity": "",
    "iserror": false,
    "num_fields": 2,
    "num_rows": 2
  },
  "port": 5432,
  "responsetime": 12,
  "bytes_out": 95,
  "status": "OK",
  "timestamp": "2015-05-27T22:27:57.409Z",
  "type": "pgsql"
}

Slide 18

What do we do with the data? ¯\(°_o)/¯

Slide 19

The traditional way
• Decide what metrics you need (requests per second for each server, response time percentiles, etc.)
• Write code to extract these metrics and store them in a DB
• Store the transactions in a DB
• But:
  • Each metric adds complexity
  • Features like drilling down and top N are difficult

Slide 20

Packetbeat + ELK

Slide 21

Why ELK?
• Already proven to scale and perform for logs
• Clear and simple flow for the data
• No need to define the metrics beforehand
• Powerful features become simple:
  • Drilling down to the transactions related to a peak
  • Top N queries are trivial
  • Slicing by different dimensions is easy

Slide 22

visualizing the data

Slide 23

No content

Slide 24

No content

Slide 25

Percentile values over time
• Combines the date histogram and percentiles aggregations

Slide 26

Percentiles aggregation • The 95th percentile means that 95% of the values are smaller than it
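
A minimal sketch of what this looks like in the Elasticsearch query DSL; the `responsetime` field comes from the Packetbeat document shown earlier, and `"size": 0` suppresses the matching documents so only the aggregation result is returned:

```json
{
  "size": 0,
  "aggs": {
    "response_times": {
      "percentiles": {
        "field": "responsetime",
        "percents": [50, 95, 99]
      }
    }
  }
}
```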

Slide 27

Response

Slide 28

Percentiles aggregation
• Approximate values
• t-digest algorithm by Ted Dunning
• Accurate for small sets of values
• More accurate for extreme percentiles

Slide 29

Date histogram • Splits data into buckets of time • Example:
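
For instance, a sketch of a date histogram over the `timestamp` field from the Packetbeat documents; the one-minute interval is illustrative:

```json
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m"
      }
    }
  }
}
```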

Slide 30

No content

Slide 31

Date histogram nested with percentiles
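
A sketch of the nested request, combining the two aggregations; field names come from the Packetbeat document shown earlier, and the interval and percentile values are illustrative:

```json
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "timestamp", "interval": "1m" },
      "aggs": {
        "response_times": {
          "percentiles": { "field": "responsetime", "percents": [50, 95, 99] }
        }
      }
    }
  }
}
```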

Slide 32

Response

Slide 33

No content

Slide 34

Kibana config

Slide 35

Latency histogram

Slide 36

Histogram by response time
• Splits data into buckets by response time
• [0-10ms), [10ms-20ms), …
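
A sketch of the corresponding histogram aggregation, assuming `responsetime` is in milliseconds as in the earlier Packetbeat document (the 10ms interval matches the buckets above):

```json
{
  "size": 0,
  "aggs": {
    "latency": {
      "histogram": {
        "field": "responsetime",
        "interval": 10
      }
    }
  }
}
```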

Slide 37

Response

Slide 38

No content

Slide 39

Add a date histogram
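
One way to sketch this: wrap the response-time histogram in a date histogram, so each time bucket carries its own latency distribution; both interval values are illustrative:

```json
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "timestamp", "interval": "1m" },
      "aggs": {
        "latency": {
          "histogram": { "field": "responsetime", "interval": 10 }
        }
      }
    }
  }
}
```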

Slide 40

No content

Slide 41

Response time distribution

Slide 42

Kibana config

Slide 43

No content

Slide 44

Slowest RPC methods • Combines the terms and percentiles aggregations

Slide 45

Terms aggregation
• Buckets are built dynamically: one per unique value
• By default: top 10 by document count
• Approximate, because each shard can have a different top 10
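
A minimal terms aggregation sketch over the `method` field from the Packetbeat documents (one bucket per unique method, top 10 by count by default):

```json
{
  "size": 0,
  "aggs": {
    "methods": {
      "terms": {
        "field": "method"
      }
    }
  }
}
```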

Slide 46

No content

Slide 47

Order by 99th percentile
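
A sketch of ordering the terms buckets by a percentiles sub-aggregation; the `rt` sub-aggregation name is just an illustrative label, referenced from the `order` clause as `rt.99`:

```json
{
  "size": 0,
  "aggs": {
    "methods": {
      "terms": {
        "field": "method",
        "order": { "rt.99": "desc" }
      },
      "aggs": {
        "rt": {
          "percentiles": {
            "field": "responsetime",
            "percents": [99]
          }
        }
      }
    }
  }
}
```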

Slide 48

No content

Slide 49

Kibana config

Slide 50

Tips
• Live demo: http://demo.elastic.co/packetbeat/
• All examples here: https://github.com/tsg/bbuzz2015
• Use Sense (Chrome add-on)

Slide 51

No content

Slide 52

from __future__ import beats

Slide 53

Future plans
• Packet data is just the beginning
• Other sources of operational data:
  • OS readings: CPU, memory, IO stats
  • Code instrumentation, tracing
  • API gateways
  • Common servers’ internal stats (Nginx, Elasticsearch)

Slide 54

Joining Elastic

Slide 55

ship operational data to elasticsearch

Slide 56

The Beats
• Packetbeat - data from the wire
• Filebeat (Logstash-Forwarder) - data from log files
• Future:
  • Topbeat - CPU, memory, IO stats
  • Metricsbeat - arbitrary metrics from Nagios/Sensu-like scripts
  • RUMbeat - data from the browser

Slide 57

Stay in touch
• @packetbeat
• https://discuss.elastic.co/c/beats
• Sign up for the webinar: https://www.elastic.co/webinars/beats-platform-for-leveraging-operational-data