LTC: Cloud monitoring without limit

Clouds are outgrowing the capacity of existing monitoring systems. This talk covers the Assimilation Monitoring Project, which scales to hundreds of thousands of systems without breathing hard.

Alan Robertson

June 19, 2013

Transcript

  1. Cloud Monitoring Without Limit
    using
    The Assimilation Monitoring
    Project
    #AssimMon @OSSAlanR
    http://assimmon.org/
    Alan Robertson
    Project Founder

  2. Project background

    Sub-project of the Linux-HA project

    Personal-time open source project

    Soon to become my full-time endeavor

    Currently around 25K lines of code

    A work-in-progress

  3. Project Scope
    Discovery-Driven Exception Monitoring
    of Systems and Services – cloud and
    non-cloud

    EXTREME monitoring scalability
    100K systems without breathing hard

    Integrated Continuous Stealth Discovery™:
    systems, switches, services, and
    dependencies – without setting off network
    alarms

  4. Problems Addressed

    Scale up monitoring indefinitely

    Minimize & simplify configuration

    Keep monitoring up-to-date

    Know that everything is monitored

    Distinguish switch vs system failures

    Discovery without setting off alarms

    Highlight root causes of cascading failures

    Find “forgotten” servers and services

    Discover uses and installs of licensed software

  5. Architectural Overview
    Collective Monitoring Authority (CMA)

    One CMA per installation
    Nanoprobes

    One nanoprobe per OS image
    Data Storage

    Central Neo4j graph database
    General Rule: “No News Is Good News”

  6. Massive Scalability – or
    “I see dead servers in O(1) time”

    Adding systems does not increase the monitoring work on any
    system

    Each server monitors 2 (or 4) neighbors

    Each server monitors its own services

    Ring repair and alerting are O(n) – but a very small amount of work

    Ring repair for a million nodes is less than 10K packets per day
    (approximately 1 packet per 9 seconds)
    Today's Implementation
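
    To make the O(1) claim concrete, here is a minimal Python sketch – an
    illustration, not the project's code – of how a heartbeat ring gives each
    node a fixed set of neighbors; adding systems grows the ring but never the
    per-node work.

    # Illustration only -- not the Assimilation Project's implementation.
    # Each node heartbeats a constant number of ring neighbors, so per-node
    # monitoring work stays O(1) however large the ring grows.

    def ring_neighbors(nodes, redundancy=1):
        """redundancy=1 -> 2 neighbors per node; redundancy=2 -> 4."""
        count = len(nodes)
        neighbors = {node: set() for node in nodes}
        for i, node in enumerate(nodes):
            for step in range(1, redundancy + 1):
                neighbors[node].add(nodes[(i + step) % count])
                neighbors[node].add(nodes[(i - step) % count])
        return neighbors

    if __name__ == "__main__":
        ring = ring_neighbors(["server%02d" % i for i in range(8)])
        for node, peers in sorted(ring.items()):
            print(node, "heartbeats", sorted(peers))

    Splicing the ring back together when a node joins or leaves only involves
    the nodes adjacent to the change, which is why repair traffic stays tiny
    even at a million nodes.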

  7. Massive Scalability – or
    “I see dead servers in O(1) time”
    Planned Topology-Aware Architecture
    Multiple levels of rings:

    Support diagnosing switch issues

    Minimize network traffic

    Ideal for multi-site arrangements
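
    A rough sketch of the idea, assuming switch membership is already known
    from discovery (the grouping below is illustrative, not the project's
    planned code): nodes on the same switch form a local ring, and one
    representative per switch joins an upper ring, so most heartbeats never
    leave their own switch.

    # Hypothetical illustration of multi-level rings; assumes each node's
    # access switch is already known (e.g. from LLDP/CDP discovery).
    from collections import defaultdict

    def build_ring_levels(node_to_switch):
        """Return ({switch: [nodes...]}, [one representative per switch])."""
        per_switch = defaultdict(list)
        for node, switch in sorted(node_to_switch.items()):
            per_switch[switch].append(node)
        upper_ring = [members[0] for members in per_switch.values()]
        return dict(per_switch), upper_ring

    if __name__ == "__main__":
        lower, upper = build_ring_levels(
            {"web1": "sw-a", "web2": "sw-a", "db1": "sw-b", "db2": "sw-b"})
        print("per-switch rings:", lower)  # heartbeats stay on the local switch
        print("upper ring:", upper)        # spans switches; implicates switch failures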

  8. How does this apply to clouds?

    Fits nicely into a huge cloud infrastructure
    – Should integrate into OpenStack, et al
    – Can also control VMs – already knows
    how to start, stop and migrate VMs

    Can also monitor customer VMs
    – the bottom level of rings probably disappears –
    unless LLDP or CDP is provided
    – If you add this to your base image, with
    one configuration file per customer, there is
    no need to configure anything for basic
    monitoring.

  9. Why Discovery?

    Simplifies monitoring configuration

    Finds things that aren't monitored

    Dependencies simplify root cause
    analysis

    Simplifies understanding the environment

    “Forgotten” systems are implicated in about
    1/3 of all break-ins

    Simplifies license management for
    proprietary software (saves $$)

  10. Continuous Integrated Stealth Discovery
    Continuous - Ongoing, incremental
    Integrated - Monitoring does discovery;
    stored in same database
    Stealth - No network privileges needed -
    no port scans or pings
    Discovery - Systems, switches, clients, services
    and dependencies
    ➔ Up-to-date picture of the pieces & how they work
    together – without landing in “network security jail” :-D

  11. Service Monitoring
    Based on Linux-HA LRM ideas

    LRM == Local Resource Manager

    Well-proven architecture: “no news is
    good news”

    Implements Open Cluster Framework
    standard (and others)

    Each system monitors own services
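
    As a hedged example of what "each system monitors its own services" can
    look like with OCF resource agents (the agent name and parameter below are
    examples only), a monitor action is just a script invocation whose exit
    code reports health:

    # Sketch of driving an OCF resource agent's "monitor" action in the spirit
    # of the Linux-HA LRM; the agent and parameters here are illustrative.
    import os
    import subprocess

    OCF_SUCCESS = 0        # resource is healthy and running
    OCF_NOT_RUNNING = 7    # resource is cleanly stopped

    def ocf_monitor(provider, agent, params, ocf_root="/usr/lib/ocf"):
        """Run the agent's monitor action with OCF_RESKEY_* parameters set."""
        env = dict(os.environ, OCF_ROOT=ocf_root)
        for name, value in params.items():
            env["OCF_RESKEY_" + name] = value
        script = os.path.join(ocf_root, "resource.d", provider, agent)
        return subprocess.call([script, "monitor"], env=env)

    if __name__ == "__main__":
        # "No news is good news": only speak up when something is wrong.
        rc = ocf_monitor("heartbeat", "IPaddr2", {"ip": "10.10.10.5"})
        if rc != OCF_SUCCESS:
            print("resource unhealthy, rc =", rc)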

  12. Monitoring Pros and Cons
    Pros
    Simple & Scalable
    Uniform work distribution
    No single point of failure
    Distinguishes switch vs host failure
    Easy on LAN, WAN
    Cons
    Active agents
    Potential slowness at power-on

  13. Basic CMA Functions (python)
    Nanoprobe management

    Configure & direct

    Hear alerts & discovery

    Update rings: join/leave
    Update database
    Issue alerts
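
    A toy sketch of that dispatch loop (the message names and the in-memory
    store below are invented for illustration; the real CMA writes to Neo4j):

    # Toy illustration of the CMA's job as listed above; message types and the
    # in-memory "graph" are stand-ins, not the project's real API.

    class TinyGraph:
        """Stand-in for the central Neo4j store."""
        def __init__(self):
            self.nodes = {}     # name -> {"up": bool, "facts": {}}

        def add_node(self, name):
            self.nodes.setdefault(name, {"up": True, "facts": {}})

        def mark_down(self, name):
            self.add_node(name)
            self.nodes[name]["up"] = False

        def store_json(self, name, data):
            self.add_node(name)
            self.nodes[name]["facts"].update(data)

    def handle_packet(graph, packet):
        kind = packet["type"]
        if kind == "STARTUP":                  # a nanoprobe announced itself
            graph.add_node(packet["from"])     # ...then configure it, join rings
        elif kind == "HBDEAD":                 # a neighbor stopped heartbeating
            graph.mark_down(packet["about"])
            print("ALERT: host down:", packet["about"])
        elif kind == "DISCOVERY":              # JSON discovery data to store
            graph.store_json(packet["from"], packet["data"])

    if __name__ == "__main__":
        g = TinyGraph()
        handle_packet(g, {"type": "STARTUP", "from": "servidor"})
        handle_packet(g, {"type": "DISCOVERY", "from": "servidor",
                          "data": {"sshd": {"port": 22}}})
        handle_packet(g, {"type": "HBDEAD", "from": "servidor", "about": "web1"})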

  14. Nanoprobe Functions ('C')
    Announce self to CMA

    Reserved multicast address (can be
    unicast address or name if no multicast)
    Do what CMA says

    receive configuration information
    – CMA addresses, ports, defaults

    send/expect heartbeats

    perform discovery actions

    perform monitoring actions
    No persistent state across reboots
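
    The real nanoprobe is written in C; the Python fragment below only
    illustrates the announce-then-obey flow described above, and the multicast
    group and port are placeholders rather than the project's reserved values.

    # Illustrative only: the production nanoprobe is C code.
    # GROUP and PORT are placeholders, not the reserved rendezvous address.
    import json
    import socket

    GROUP, PORT = "224.0.2.100", 49152

    def announce_and_wait(timeout=5.0):
        """Multicast a startup announcement, then wait for the CMA's
        configuration reply (addresses, ports, defaults)."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        hello = json.dumps({"type": "STARTUP", "host": socket.gethostname()})
        sock.sendto(hello.encode(), (GROUP, PORT))
        try:
            config, cma_addr = sock.recvfrom(65535)
        except socket.timeout:
            return None
        return json.loads(config), cma_addr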

  15. Why a graph database? (Neo4j)

    Dependency & Discovery information: graph

    Speed of graph traversals depends on size
    of subgraph, not total graph size

    Root cause queries => graph traversals –
    notoriously slow in relational databases

    Visualization of relationships

    Schema-less design: good for constantly
    changing heterogeneous environment
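
    As an illustration of the kind of traversal involved (the labels,
    relationships and Cypher below are invented for the example, not the
    project's actual schema), a root-cause style query with the official Neo4j
    Python driver might look like:

    # Hypothetical schema for illustration; not the Assimilation graph model.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "secret"))

    # "Which clients depend on services running on the host that just died?"
    # Cost follows the size of this subgraph, not the size of the whole graph.
    QUERY = """
    MATCH (down:Host {name: $host})<-[:RUNS_ON]-(svc:Service)<-[:CONNECTS_TO]-(client:Host)
    RETURN svc.name AS service, client.name AS impacted
    """

    with driver.session() as session:
        for record in session.run(QUERY, host="servidor"):
            print(record["impacted"], "depends on", record["service"])
    driver.close()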

  16. How does discovery work?
    Nanoprobe scripts perform discovery

    Each discovers one kind of information

    Can take arguments (in environment)

    Output JSON
    CMA stores Discovery Information

    JSON stored in Neo4j database

    CMA discovery plugins => graph nodes and
    relationships
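
    A toy discovery agent in that spirit (not one of the shipped scripts)
    gathers one kind of information and prints JSON on stdout for the CMA to
    store:

    #!/usr/bin/env python3
    # Toy discovery agent, not a shipped one: collect one kind of information
    # and emit JSON on stdout for the CMA to put into the graph database.
    import json
    import os
    import socket
    import sys

    def discover_os():
        uname = os.uname()
        return {
            "discovertype": "os",          # label for the kind of data returned
            "host": socket.gethostname(),
            "data": {
                "sysname": uname.sysname,
                "release": uname.release,
                "machine": uname.machine,
            },
        }

    if __name__ == "__main__":
        json.dump(discover_os(), sys.stdout, indent=2)
        sys.stdout.write("\n")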

  17. sshd Service JSON Snippet
    (from netstat and /proc)
    "sshd": {
    "exe": "/usr/sbin/sshd",
    "cmdline": [ "/usr/sbin/sshd", "-D" ],
    "uid": "root",
    "gid": "root",
    "cwd": "/",
    "listenaddrs": {
    "0.0.0.0:22": {
    "proto": "tcp",
    "addr": "0.0.0.0",
    "port": 22
    }, and so on...

  18. ssh Client JSON Snippet
    (from netstat and /proc)
    "ssh": {
    "exe": "/usr/sbin/ssh",
    "cmdline": [ "ssh", "servidor" ],
    "uid": "alanr",
    "gid": "alanr",
    "cwd": "/home/alanr/monitor/src",
    "clientaddrs": {
    "10.10.10.5:22": {
    "proto": "tcp",
    "addr": "10.10.10.5",
    "port": 22
    }, and so on...

  19. ssh -> sshd dependency graph

  20. Switch Discovery Data from LLDP (or CDP)
    CMA transforms LLDP (CDP) data to JSON
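
    The nanoprobe listens for LLDP/CDP frames itself; as a stand-in, similar
    JSON can be produced on a Linux host by shelling out to lldpd's lldpctl
    (assuming lldpd is installed and its JSON output format is available):

    # Stand-in for the nanoprobe's own LLDP capture: ask lldpd's lldpctl for
    # the switch and port this host is plugged into, as JSON.
    import json
    import subprocess

    def lldp_neighbors():
        out = subprocess.check_output(["lldpctl", "-f", "json"])
        return json.loads(out)

    if __name__ == "__main__":
        print(json.dumps(lldp_neighbors(), indent=2))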

  21. Current State

    First release was April 2013

    Great unit test infrastructure

    Nanoprobe code – works well

    Service monitoring works

    Lacking digital signatures, encryption,
    compression

    Reliable UDP comm code all working

    CMA code works, much more to go

    Several discovery methods written

    Licensed under the GPL

  22. Future Plans

    Production grade by end of year

    Commercial licenses with support

    Real digital signatures, compression, encryption

    Other security enhancements

    Dynamic (aka cloud) specialization

    Much more discovery

    Alerting

    GUI

    Reporting

    Create/audit an ITIL CMDB

    Add Statistical Monitoring

    Best Practice Audits

    Hundreds more ideas
    – See: https://trello.com/b/OpaED3AT

  23. Get Involved!
    Powerful Ideas and Infrastructure
    Fun, ground-breaking project
    Needs every kind of skill

    Awesome User Interfaces (UI/UX)

    Test Code (simulate 10⁶ servers!)

    Packaging, Continuous Integration

    Python, C, script coding

    Evangelism, community building

    Documentation

    Integration with OpenStack

    Feedback: Testing, Ideas, Plans

    Many others!

  24. Resistance Is Futile!
    #AssimMon @OSSAlanR
    Project Web Site
    http://assimmon.org
    Blog
    techthoughts.typepad.com
    Mailing list
    lists.community.tummy.com/cgi-bin/mailman/admin/assimilation
