Monitoring OpenStack and Dynamic Infrastructure

Monitoring OpenStack and Dynamic Infrastructure Matthew Williams – [email protected] -
@technovangelist Datadog August 23, 2016

Datadog Overview
• SaaS-based infrastructure monitoring
• Focus on modern infrastructure (cloud, containers, microservices)
• Processing about a trillion metrics per day
• Intelligent alerting

The Plan
• The problem
• What to monitor
• Monitoring options

The Problem
Monitoring a multi-layered, distributed system

The high-level problem
• OpenStack is composed of more than 16 services, with many moving parts
• OpenStack is infrastructure
• OpenStack users have their own users
• No such thing as “set it and forget it” for self-hosted services

But Monitoring is Hard
• OpenStack services + many hosts * containers * VMs == lots to monitor
• Not to mention user-run services
• Information is EVERYWHERE:
  • Logs; metrics; console dumps; events/notifications
• Aggregation is challenging

Metrics explosion
• 1 node
  • 70+ metrics from Nova (multi-tenant metrics)
  • Hundreds from RabbitMQ/supporting software
• 1 operating system (e.g., Linux)
  • 100 metrics per instance
• Custom applications
  • ~50 metrics

Host-centric view

Traditional monitoring
• Host-centric
• Collect ALL THE DATA/Instrument ALL THE THINGS
• Historical data is good to have
• Passive collection

Self-describing infrastructure

Service-centric

Modern monitoring
• The service is the unit of measure
• Collect actionable data from everything
• Collected data is enhanced with cohort data/historical data
• Outlier detection
• Anomaly detection
• Combines alerting and visualization
• Passive and active collection techniques
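
To make the outlier detection bullet concrete, here is a minimal sketch of one common approach: compare each host in a cohort against the cohort median, scaled by the median absolute deviation (MAD). The talk does not say which algorithm is used, so the function name, the threshold of 3, and the sample data are all assumptions for illustration.

```python
from statistics import median


def mad_outliers(values_by_host, threshold=3.0):
    """Return hosts whose value is far from the cohort median.

    Hypothetical helper: a host is flagged when its distance from the
    median exceeds `threshold` times the median absolute deviation.
    """
    values = list(values_by_host.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Every host reports (nearly) the same value: nothing stands out.
        return []
    return sorted(
        host
        for host, value in values_by_host.items()
        if abs(value - med) / mad > threshold
    )


# Made-up API latencies (ms) for five hosts that provide the same service.
latency_ms = {"a": 100, "b": 102, "c": 98, "d": 101, "e": 400}
print(mad_outliers(latency_ms))
```

The point of the cohort comparison is that no fixed threshold for "too slow" has to be chosen per host; the peers define what normal looks like.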

Collect actionable data
• Data must either:
  • Inform decisions
  • Answer questions
• Determine what to collect by asking relevant questions:
  • Admins: can my users start VMs?
  • Tenants: am I fully utilizing allocated resources?
  • Stakeholders: do we have adequate capacity until the end of the quarter?

What to monitor

Start with a question
• Starting with questions keeps the goal clear
• Question:
  • Can my users spawn VMs?
• Data needs:
  • Data on VM creation/deletion
  • Keystone data to verify permissions
  • Host-specific hardware data to correlate failures
• Try to remember the business goals

The Great Divide
Operators / Service Admins / End Users

Everyone wants metrics
• Operators need metrics on infrastructure and OpenStack services
• Service administrators need metrics on the services they provide
• End users want metrics on the applications they run

Digging into Nova: Overview
http://docs.openstack.org/developer/nova/architecture.html

Digging into Nova: Overview
• Daemons:
  • Are they up?
• APIs:
  • Up?
  • Responding quickly? Without errors?
• Services:
  • Full functionality?
  • Doing their thing quickly?

Digging into Nova: Daemons
• Question:
  • Are the necessary daemons running (correctly)?
• Data needs:
  • ps output/process check
  • Keystone logs to verify permissions
  • Host-specific hardware data to correlate failures

Digging into Nova: APIs
• Questions:
  • Are the APIs up?
  • Are they responding quickly?
• Data needs:
  • Service check for availability
  • Latency monitoring
  • Network data
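
The service check and latency monitoring items above can be sketched as a single probe. This is a minimal example using only the Python standard library; the function name, the rule that any response below HTTP 500 counts as "up", and the injectable `opener`/`clock` parameters (there so the probe can be exercised without a live endpoint) are assumptions, not something the talk prescribes.

```python
import time
import urllib.error
import urllib.request


def check_api(url, timeout=5.0, opener=urllib.request.urlopen,
              clock=time.monotonic):
    """Probe an HTTP endpoint and report availability and latency.

    Hypothetical helper. An HTTP error status (for example 401 without a
    token) still proves the API process answered, so only 5xx responses
    and connection failures are treated as "down" here.
    """
    start = clock()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code.
        status = exc.code
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return {"up": False, "status": None, "latency_s": clock() - start}
    return {"up": status < 500, "status": status,
            "latency_s": clock() - start}
```

Usage against the host from the curl examples might look like `check_api("http://openstack:8774/")`; the result dictionary can then be turned into a service check and a latency gauge.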

Digging into Nova: Functionality
• Questions:
  • Can users start/stop VMs?
  • Are volumes attaching correctly?
• Data needs:
  • Canaries routinely check VM spawning
  • Keystone logs for failed authorizations
  • Hardware metrics
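
A canary that routinely checks VM spawning could be structured roughly as follows. This is a sketch under stated assumptions: `create_server`, `get_status`, and `delete_server` are hypothetical callables that you would wire to the Nova API or CLI yourself, and the timeout and polling interval are arbitrary. `BUILD`, `ACTIVE`, and `ERROR` are standard Nova server statuses.

```python
import time


def run_canary(create_server, get_status, delete_server,
               timeout_s=120.0, poll_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Boot a throwaway VM, wait for it to settle, then clean up.

    Hypothetical helper: the three callables are supplied by the caller.
    Returns whether the VM reached ACTIVE and how long the check took.
    """
    started = clock()
    server_id = create_server()
    status = None
    try:
        while clock() - started < timeout_s:
            status = get_status(server_id)
            if status in ("ACTIVE", "ERROR"):
                break
            sleep(poll_s)
    finally:
        # Always remove the canary VM, even if polling raised.
        delete_server(server_id)
    return {"ok": status == "ACTIVE", "status": status,
            "elapsed_s": clock() - started}
```

Run on a schedule, the `ok` flag answers "can users start VMs?" directly, and `elapsed_s` gives a trend line for how quickly they can.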

Tools are available out of the box
• Horizon
  • Difficult to create cross-tenant and/or rollup reports
  • Multi-region compounds the cross-tenant issues
  • Not very user-friendly for all teams involved (Ops, DevOps, Dev, Finance, Mgmt, etc.)
  • Doesn’t have time-based metrics to show usage over time
• Nova APIs
  • Rolling your own? Who really has time for that?
  • Need to run Graphite (or similar) to represent the data
  • Push metrics into statsd or a similar service using Python
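
For a sense of what pushing metrics into statsd from Python involves, here is a minimal sketch that uses the plain statsd line protocol (`name:value|g` for a gauge, `name:value|c` for a counter) over UDP. The metric names, host, and helper names are assumptions for illustration; 8125 is the conventional statsd port.

```python
import socket


def format_gauge(name, value):
    """Format a statsd gauge line, e.g. 'nova.running_vms:12|g'."""
    return "{}:{}|g".format(name, value)


def format_counter(name, value=1):
    """Format a statsd counter increment, e.g. 'nova.boot.failed:1|c'."""
    return "{}:{}|c".format(name, value)


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one statsd line over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names; print instead of sending in this sketch.
print(format_gauge("openstack.nova.running_vms", 12))
print(format_counter("openstack.nova.boot.failed"))
```

Even this small example hints at the "rolling your own" cost: collection, naming, transport, storage, and graphing all become your responsibility.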

Getting our hands dirty: tools
• curl
• top
• ps
• nova
• MySQL client
• RabbitMQ client

Are the daemons running?
root  26721 0.0 0.0  11992   2220 pts/20 S+ 21:59  0:00 grep --color=auto nova
stack 28910 1.2 0.5 240576  92412 pts/11 S+ Aug01 45:16 /usr/bin/python /usr/local/bin/nova-scheduler --config-file /etc/nova/nova.conf
stack 29915 1.2 0.7 270672 120208 pts/8  S+ Aug01 44:59 /usr/bin/python /usr/local/bin/nova-api
stack 29937 1.2 0.4 229212  80636 pts/9  S+ Aug01 44:11 /usr/bin/python /usr/local/bin/nova-conductor --config-file /etc/nova/nova.conf
stack 29945 0.0 0.7 283088 127160 pts/8  S+ Aug01  0:04 /usr/bin/python /usr/local/bin/nova-api
stack 29946 1.1 0.8 311104 141984 pts/8  S+ Aug01 40:36 /usr/bin/python /usr/local/bin/nova-api
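
Reading `ps` output by eye does not scale, so a process check usually automates it. The sketch below parses text in the same column layout as the listing above (assumed here to come from something like `ps aux`) and counts the Nova daemons it finds. The list of expected daemons and the function names are assumptions; adjust them to the services a given node should run.

```python
# Hypothetical list: which daemons a node should run depends on its role.
EXPECTED_DAEMONS = ("nova-api", "nova-scheduler", "nova-conductor")


def running_daemons(ps_output, expected=EXPECTED_DAEMONS):
    """Count expected daemons in `ps aux`-style output.

    The command line is the 11th whitespace-separated column; a daemon
    is matched when one of its arguments has the daemon's file name.
    """
    counts = {name: 0 for name in expected}
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        for arg in fields[10].split():
            basename = arg.rsplit("/", 1)[-1]
            if basename in counts:
                counts[basename] += 1
                break
    return counts


def missing_daemons(ps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that have no running process."""
    counts = running_daemons(ps_output, expected)
    return sorted(name for name, count in counts.items() if count == 0)
```

Because matching is done on the file name of each argument, the `grep --color=auto nova` line in the listing is not miscounted as a daemon.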

Can my users spawn VMs?
API:
curl -X POST \
  -H "x-auth-token: cbce5e67627a486f82c9f812b05a65a8" \
  -H "Content-Type: application/json" \
  -d '{ "server": { "name": "new-server-test", "imageRef": "f6ab055b-b70b-463e-899d-8c231c6235a4", "flavorRef": "d2", "metadata": { "My Server Name": "host0" } } }' \
  "http://openstack:8774/v2.1/7747c70227f74168b3adc9b05bb5175b/servers"
CLI:
nova boot --image cirros-0.3.1-x86_64-uec --flavor m1.tiny \
  MyFirstInstance
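
The same boot request can be built from Python, which is the first step toward scripting the check. This is a minimal standard-library sketch; `build_boot_request` is a hypothetical helper, and it only constructs the request so it can be inspected without a running cloud. The URL layout, header, and body fields mirror the curl example above.

```python
import json
import urllib.request


def build_boot_request(base_url, tenant_id, token, name, image_ref,
                       flavor_ref):
    """Build (but do not send) a POST /servers request for Nova.

    Hypothetical helper mirroring the curl example: same URL layout,
    same auth header, same JSON body fields.
    """
    body = {"server": {"name": name,
                       "imageRef": image_ref,
                       "flavorRef": flavor_ref}}
    return urllib.request.Request(
        url="{}/v2.1/{}/servers".format(base_url.rstrip("/"), tenant_id),
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Auth-Token": token,
                 "Content-Type": "application/json"},
        method="POST",
    )


request = build_boot_request(
    "http://openstack:8774",
    "7747c70227f74168b3adc9b05bb5175b",
    "cbce5e67627a486f82c9f812b05a65a8",
    "new-server-test",
    "f6ab055b-b70b-463e-899d-8c231c6235a4",
    "d2",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would send it; the response body then contains the new server's ID to poll.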

Can my users spawn VMs?
curl -X GET \
  -H "x-auth-token: cbce5e67627a486f82c9f812b05a65a8" \
  -H "Content-Type: application/json" \
  "http://openstack:8774/v2.1/7747c70227f74168b3adc9b05bb5175b/servers/bd06af9e-2ae2-4b56-927f-cb0925e0bd8c"

Can my users spawn VMs?
{
  "server": {
    "status": "ACTIVE",
    "updated": "2016-08-05T15:02:56Z",
    "hostId": "1499086836b2f9e5f3edfe97b1aafd84bcf95e318304ddb442d88894",
    "OS-EXT-SRV-ATTR:host": "death-by-devstack",
    "OS-SRV-USG:terminated_at": null,
    "key_name": null,
    "OS-EXT-SRV-ATTR:hypervisor_hostname": "death-by-devstack",
    "name": "new-server-test",
    "created": "2016-08-05T15:02:51Z",
    "tenant_id": "7747c70227f74168b3adc9b05bb5175b",
    "os-extended-volumes:volumes_attached": [],
    "config_drive": "True"
  }
  [...]
}

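A server-details response like the one above already answers the question and hints at how fast the spawn was. The sketch below extracts the status and the gap between `created` and `updated`. Treating that gap as the spawn time is an assumption: `updated` is simply the last time the record changed. The function name and the trimmed sample body are illustrative.

```python
import json
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def spawn_summary(body):
    """Summarize a Nova server-details response body (JSON text).

    Hypothetical helper. `seconds_created_to_updated` approximates spawn
    time only if nothing else touched the server after it went ACTIVE.
    """
    server = json.loads(body)["server"]
    created = datetime.strptime(server["created"], TIMESTAMP_FORMAT)
    updated = datetime.strptime(server["updated"], TIMESTAMP_FORMAT)
    return {
        "name": server["name"],
        "active": server["status"] == "ACTIVE",
        "host": server.get("OS-EXT-SRV-ATTR:host"),
        "seconds_created_to_updated": (updated - created).total_seconds(),
    }


# Fields copied from the response on the slide, trimmed for brevity.
sample = json.dumps({"server": {
    "status": "ACTIVE",
    "updated": "2016-08-05T15:02:56Z",
    "created": "2016-08-05T15:02:51Z",
    "name": "new-server-test",
    "OS-EXT-SRV-ATTR:host": "death-by-devstack",
}})
print(spawn_summary(sample))
```

For the response on the slide, the two timestamps are five seconds apart.
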
Nova server metrics via API, CLI
API:
curl -H "X-Auth-Token: 3939c299ba0743eb94b6f4ff6ff97f6d" \
  http://controller:8774/v2.1/<tenant-id>/servers/<server-id>/diagnostics
CLI:
nova diagnostics <server-id>
Database:
select * from compute_nodes where deleted=0;

Rube Goldberg complexity

Important concepts to remember
• You are no longer monitoring an infrastructure stack
  • It’s a set of applications that provide your infrastructure
• You need to monitor not just server stats (CPU, memory, disk) but also how the applications work together
• Servers may look fine even if the services are not responding properly
• You probably have more than one network serving the running instances
https://en.wikipedia.org/wiki/Wikipedia

Monitoring Options

Many monitoring options

Choosing the right tool

Visualization is key

Events

Event correlation w/ metrics http://bit.ly/openstack-2013hjgfgggf

Cross-system event correlation

Data-driven insights

Dimensions

Rich, actionable alerts

Service Discovery

Support for multiple sources
