Monitoring OpenStack and Dynamic Infrastructure

Slide 1

Slide 1 text

Monitoring OpenStack and Dynamic Infrastructure
Matthew Williams - @technovangelist
Datadog
August 23, 2016

Slide 2

Slide 2 text

Datadog Overview
• SaaS-based Infrastructure Monitoring
• Focus on Modern Infrastructure (cloud, containers, micro-services)
• Processing about a trillion metrics per day
• Intelligent alerting

Slide 3

Slide 3 text

The Plan
• The problem
• What to monitor
• Monitoring options

Slide 4

Slide 4 text

The Problem
Monitoring a multi-layered, distributed system

Slide 5

Slide 5 text

The high-level problem
• OpenStack is composed of more than 16 services, so there are many moving parts
• OpenStack is infrastructure
• OpenStack users have their own users
• No such thing as “set it and forget it” for self-hosted services

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

But Monitoring is Hard
• OpenStack services + many hosts * containers * VMs == lots to monitor
• Not to mention user-run services
• Information is EVERYWHERE:
  • Logs; metrics; console dumps; events/notifications
• Aggregation is challenging

Slide 9

Slide 9 text

Metrics explosion
• 1 node
  • 70+ metrics from Nova (multi-tenant metrics)
  • Hundreds from RabbitMQ/supporting software
• 1 operating system (e.g., Linux)
  • 100 metrics per instance
• Custom applications
  • ~50 metrics
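To make the growth concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function name, the choice of 200 for "hundreds" of supporting-software metrics, and the assumption that every guest instance runs one operating system and one custom application are not from the slides.

```python
def estimate_metric_count(nodes, instances_per_node,
                          nova_per_node=70,         # "70+ metrics from Nova"
                          supporting_per_node=200,  # assumed stand-in for "hundreds"
                          os_per_instance=100,      # "100 metrics per instance"
                          app_per_instance=50):     # "~50 metrics" per custom app
    """Rough count of distinct metrics for a deployment (illustrative only)."""
    per_node = nova_per_node + supporting_per_node
    per_instance = os_per_instance + app_per_instance
    return nodes * (per_node + instances_per_node * per_instance)
```

Under these assumptions a single node with a single instance already yields 420 metrics, and the total scales with nodes times instances.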

Slide 10

Slide 10 text

Metrics explosion

Slide 11

Slide 11 text

Host-centric view

Slide 12

Slide 12 text

Traditional monitoring
• Host-centric
• Collect ALL THE DATA/Instrument ALL THE THINGS
• Historical data is good to have
• Passive collection

Slide 13

Slide 13 text

Self-describing infrastructure

Slide 14

Slide 14 text

Service-centric

Slide 15

Slide 15 text

Service-centric

Slide 16

Slide 16 text

Service-centric

Slide 17

Slide 17 text

Service-centric

Slide 18

Slide 18 text

Service-centric

Slide 19

Slide 19 text

Modern monitoring
• The service is the unit of measure
• Collect actionable data from everything
• Collected data is enhanced with cohort data/historical data
  • Outlier detection
  • Anomaly detection
• Combines alerting and visualization
• Passive and active collection techniques
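As a minimal sketch of what comparing a host against its cohort can look like, the Python below flags hosts whose value sits far from the group median, measured in units of the median absolute deviation (MAD). This is one common, simple technique chosen for illustration; it is not presented as the algorithm any particular monitoring product uses, and the function name and threshold are assumptions.

```python
from statistics import median


def find_outliers(values_by_host, threshold=3.0):
    """Return hosts whose value is more than `threshold` MADs from the median."""
    values = list(values_by_host.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that a single bad host
    # cannot distort the way it would distort a standard deviation.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All hosts (or a majority) report the same value; nothing to compare.
        return []
    return sorted(
        host for host, value in values_by_host.items()
        if abs(value - center) / mad > threshold
    )
```

Grouping by service rather than by host is what makes this useful: the cohort is "all hosts running the same service", so a host is judged against its peers instead of a fixed threshold.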

Slide 20

Slide 20 text

Collect actionable data
• Data must either:
  • Inform decisions
  • Answer questions
• Determine what to collect by asking relevant questions:
  • Admins: can my users start VMs?
  • Tenants: am I fully utilizing allocated resources?
  • Stakeholders: do we have adequate capacity until the end of quarter?

Slide 21

Slide 21 text

What to monitor

Slide 22

Slide 22 text

Start with a question
• Starting with questions keeps the goal clear
• Question:
  • Can my users spawn VMs?
• Data needs:
  • Data on VM creation/deletion
  • Keystone data to verify permissions
  • Host-specific hardware data to correlate failures
• Try to remember the business goals

Slide 23

Slide 23 text

The Great Divide
Operators
Service Admins
End Users

Slide 24

Slide 24 text

Everyone wants metrics
• Operators need metrics on infrastructure and OpenStack services
• Service administrators need metrics on the services they provide
• End users want metrics on the applications they run

Slide 25

Slide 25 text

Digging into Nova: Overview
http://docs.openstack.org/developer/nova/architecture.html

Slide 26

Slide 26 text

Digging into Nova: Overview
• Daemons:
  • Are they up?
• APIs:
  • Up?
  • Responding quickly? Without errors?
• Services:
  • Full functionality?
  • Doing their thing quickly?

Slide 27

Slide 27 text

Digging into Nova: Daemons
• Question:
  • Are the necessary daemons running (correctly)?
• Data needs:
  • ps output/process check
  • Keystone logs to verify permissions
  • Host-specific hardware data to correlate failures
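A process check of this kind can be sketched in a few lines of Python. The sketch below assumes the text comes from `ps aux` (eleven whitespace-separated columns, with the command last); the function name and the list of required daemons are illustrative, not something the slides prescribe.

```python
def missing_daemons(ps_output, required):
    """Return the required daemon names that do not appear in `ps aux` output."""
    running = set()
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # the 11th field is the full command line
        if len(fields) < 11:
            continue  # header fragments or blank lines
        command = fields[10]
        if command.startswith("grep"):
            continue  # ignore the grep used to filter the listing
        for name in required:
            if name in command:
                running.add(name)
    return sorted(set(required) - running)
```

An empty result means every required daemon has at least one process. It says nothing about whether the daemon is healthy, which is why the API and functionality checks are still needed.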

Slide 28

Slide 28 text

Digging into Nova: APIs
• Questions:
  • Are the APIs up?
  • Are they responding quickly?
• Data needs:
  • Service check for availability
  • Latency monitoring
  • Network data
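The availability and latency questions can be answered by the same probe. The following is a minimal sketch using only the Python standard library; the function name, the result keys, and the rule that any response below HTTP 500 counts as "up" are assumptions made for illustration. The `opener` parameter exists only so the function can be exercised without a live endpoint.

```python
import time
import urllib.error
import urllib.request


def check_api(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe an HTTP endpoint once; report availability, status, and latency."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx/3xx code.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        status = None
    latency = time.monotonic() - start
    # Assumption: a 4xx (e.g. 401 without a token) still proves the API is up.
    up = status is not None and status < 500
    return {"up": up, "status": status, "latency_s": latency}
```

Run on a schedule against something like `http://openstack:8774/`, the `up` flag feeds a service check and `latency_s` feeds a latency graph.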

Slide 29

Slide 29 text

Digging into Nova: Functionality
• Questions:
  • Can users start/stop VMs?
  • Are volumes attaching correctly?
• Data needs:
  • Canaries routinely check VM spawning
  • Keystone logs for failed authorizations
  • Hardware metrics
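A canary for VM spawning boils down to: create a server, poll until it is ACTIVE or ERROR, record how long that took, and clean up. The Python sketch below shows only that control flow. The three callables (`create_server`, `get_status`, `delete_server`) are hypothetical hooks you would implement against the Nova API or CLI; the function name, result keys, and default timings are also assumptions.

```python
import time


def run_canary(create_server, get_status, delete_server,
               timeout_s=120.0, poll_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Spawn a throwaway VM and report whether it reached ACTIVE in time."""
    start = clock()
    server_id = create_server()
    try:
        while True:
            status = get_status(server_id)
            elapsed = clock() - start
            if status == "ACTIVE":
                return {"ok": True, "status": status, "seconds": elapsed}
            if status == "ERROR" or elapsed >= timeout_s:
                return {"ok": False, "status": status, "seconds": elapsed}
            sleep(poll_s)
    finally:
        # Always remove the canary VM, even when the check fails.
        delete_server(server_id)
```

The `seconds` value is worth graphing on its own: spawn time creeping upward is often visible well before spawns start failing outright.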

Slide 30

Slide 30 text

Tools are available out of the box
• Horizon
  • Difficult to create cross-tenant and/or rollup reports
  • Multi-region compounds the cross-tenant issues
  • Not very user friendly for all teams involved (Ops, DevOps, Dev, Finance, Mgmt, etc.)
  • Doesn’t have time-based metrics to show usage over time
• Nova APIs
  • Rolling your own? Who really has time for that?
  • Need to run Graphite (or similar) to represent the data
  • Push metrics into StatsD or similar service using Python
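For a sense of what "push metrics into StatsD using Python" involves, here is a minimal sketch that sends a gauge over UDP using the plain StatsD line format (`name:value|g`). It deliberately avoids any client library; the function names, the metric name in the usage note, and the default host and port (8125 is the conventional StatsD port) are assumptions about your setup.

```python
import socket


def format_gauge(name, value):
    """Render one gauge in the StatsD line protocol."""
    return f"{name}:{value}|g"


def send_gauge(name, value, host="127.0.0.1", port=8125):
    """Fire-and-forget a gauge to a StatsD-compatible daemon over UDP."""
    payload = format_gauge(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A call such as `send_gauge("openstack.nova.running_vms", 12)` (a made-up metric name) would then need Graphite or a similar backend behind StatsD to store and graph the value, which is the "rolling your own" cost the slide is pointing at.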

Slide 31

Slide 31 text

Getting our hands dirty: tools
• curl
• top
• ps
• nova
• MySQL client
• RabbitMQ client

Slide 32

Slide 32 text

Are the daemons running?

    root  26721  0.0  0.0   11992    2220 pts/20  S+  21:59   0:00 grep --color=auto nova
    stack 28910  1.2  0.5  240576   92412 pts/11  S+  Aug01  45:16 /usr/bin/python /usr/local/bin/nova-scheduler --config-file /etc/nova/nova.conf
    stack 29915  1.2  0.7  270672  120208 pts/8   S+  Aug01  44:59 /usr/bin/python /usr/local/bin/nova-api
    stack 29937  1.2  0.4  229212   80636 pts/9   S+  Aug01  44:11 /usr/bin/python /usr/local/bin/nova-conductor --config-file /etc/nova/nova.conf
    stack 29945  0.0  0.7  283088  127160 pts/8   S+  Aug01   0:04 /usr/bin/python /usr/local/bin/nova-api
    stack 29946  1.1  0.8  311104  141984 pts/8   S+  Aug01  40:36 /usr/bin/python /usr/local/bin/nova-api

Slide 33

Slide 33 text

Are the daemons running?

Slide 34

Slide 34 text

Can my users spawn VMs?

    curl -X POST \
      -H "x-auth-token: cbce5e67627a486f82c9f812b05a65a8" \
      -H "Content-Type: application/json" \
      -d '{
        "server": {
          "name": "new-server-test",
          "imageRef": "f6ab055b-b70b-463e-899d-8c231c6235a4",
          "flavorRef": "d2",
          "metadata": { "My Server Name": "host0" }
        }
      }' \
      "http://openstack:8774/v2.1/7747c70227f74168b3adc9b05bb5175b/servers"

CLI:

    nova boot --image cirros-0.3.1-x86_64-uec --flavor m1.tiny \
      MyFirstInstance
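If the check is going to run from a script rather than by hand, the same request can be built with the Python standard library. This is a sketch under assumptions: the function name and parameters are illustrative, the optional `metadata` field is omitted for brevity, and the token is taken as given rather than fetched from Keystone.

```python
import json
import urllib.request


def build_boot_request(base_url, tenant_id, token, name, image_ref, flavor_ref):
    """Build (but do not send) the POST that asks Nova to create a server."""
    body = {
        "server": {
            "name": name,
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
        }
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v2.1/{tenant_id}/servers",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Auth-Token": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Keeping construction separate from sending makes the request easy to inspect and to time.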

Slide 35

Slide 35 text

Can my users spawn VMs?

    curl -X GET \
      -H "x-auth-token: cbce5e67627a486f82c9f812b05a65a8" \
      -H "Content-Type: application/json" \
      "http://openstack:8774/v2.1/7747c70227f74168b3adc9b05bb5175b/servers/bd06af9e-2ae2-4b56-927f-cb0925e0bd8c"

Slide 36

Slide 36 text

Can my users spawn VMs?

    {
      "server": {
        "status": "ACTIVE",
        "updated": "2016-08-05T15:02:56Z",
        "hostId": "1499086836b2f9e5f3edfe97b1aafd84bcf95e318304ddb442d88894",
        "OS-EXT-SRV-ATTR:host": "death-by-devstack",
        "OS-SRV-USG:terminated_at": null,
        "key_name": null,
        "OS-EXT-SRV-ATTR:hypervisor_hostname": "death-by-devstack",
        "name": "new-server-test",
        "created": "2016-08-05T15:02:51Z",
        "tenant_id": "7747c70227f74168b3adc9b05bb5175b",
        "os-extended-volumes:volumes_attached": [],
        "config_drive": "True"
      }
      [...]
    }
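The response carries enough to answer the question and to produce a rough timing metric. The Python sketch below reads the `status`, `created`, and `updated` fields shown on the slide. Treating the gap between `created` and `updated` as a spawn time is an approximation adopted here for illustration, since `updated` moves on any later change to the server; the function name and result keys are also assumptions.

```python
import json
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # e.g. 2016-08-05T15:02:51Z


def spawn_summary(response_text):
    """Summarize a Nova 'show server' response body (full, valid JSON)."""
    server = json.loads(response_text)["server"]
    created = datetime.strptime(server["created"], TIMESTAMP_FORMAT)
    updated = datetime.strptime(server["updated"], TIMESTAMP_FORMAT)
    return {
        "name": server["name"],
        "active": server["status"] == "ACTIVE",
        # Approximation: time between creation and the most recent update.
        "seconds_to_last_update": (updated - created).total_seconds(),
    }
```

For the timestamps on the slide (15:02:51 created, 15:02:56 updated) this reports five seconds.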

Slide 37

Slide 37 text

Nova server metrics via API, CLI

    curl -H "X-Auth-Token: 3939c299ba0743eb94b6f4ff6ff97f6d" http://controller:8774/v2.1//servers//diagnostics

    nova diagnostics

    select * from compute_nodes where deleted=0;

Slide 38

Slide 38 text

Rube Goldberg complexity

Slide 39

Slide 39 text

Important concepts to remember
• You are no longer monitoring an infrastructure stack
• It’s a set of applications that provide your infrastructure
• You need to start monitoring not just server stats (CPU, memory, disk) but also how the applications work together
• Servers may look fine even if the services are not responding properly
• You probably have more than one network serving the running instances
https://en.wikipedia.org/wiki/Wikipedia

Slide 40

Slide 40 text

Monitoring Options

Slide 41

Slide 41 text

Many monitoring options

Slide 42

Slide 42 text

Choosing the right tool

Slide 43

Slide 43 text

Visualization is key

Slide 44

Slide 44 text

Events

Slide 45

Slide 45 text

Event correlation w/ metrics http://bit.ly/openstack-2013hjgfgggf

Slide 46

Slide 46 text

Cross-system event correlation

Slide 47

Slide 47 text

Data-driven insights

Slide 48

Slide 48 text

Dimensions

Slide 49

Slide 49 text

Rich, actionable alerts

Slide 50

Slide 50 text

Rich, actionable alerts

Slide 51

Slide 51 text

Service Discovery

Slide 52

Slide 52 text

Support for multiple sources

Slide 53

Slide 53 text

Monitoring OpenStack and Dynamic Infrastructure Matthew Williams – [email protected] - @technovangelist Datadog August 23, 2016