Monitoring Without Limit using The Assimilation Project
#AssimProj @OSSAlanR
http://assimproj.org/
Alan Robertson
Assimilation Systems Limited
http://assimilationsystems.com

LCA 2014
Project Scope
Zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring
• Continuous extensible discovery
  – systems, switches, services, dependencies
  – zero network footprint
• Extensible exception monitoring
  – more than 100K systems
• All data goes into central graph database

Questions
• How many of you have monitoring?
  – Open or closed source?
  – How many of you are happy with it?
• How many of you have discovery?
  – Open or closed source?
  – Is it continuous?
  – How many of you are happy with it?

Assimilation Project History
• Inspired by 2 million core computer (cyclops64)
• Concerns for extreme scale
• Topology aware monitoring
• Topology discovery w/out security issues
=> Discovery of everything!

An 8-dimensional overview
• Problems Addressed
• Unique Capabilities
• Distribution of Work
• Architectural Components
• Discovery Graph Schema
• Extensible Discovery API
• Current Status
• Project Needs

First Dimension: Problems Addressed
Risk Management at extreme scale
1. Maintaining detailed discovery database
2. Discovering systems you've forgotten about
3. Discovering what (licensed) software you're running – and where
4. Monitoring services, systems and switches
5. Finding services you aren't monitoring

Second Dimension: Unique Powerful Features
1. Continuous Discovery
2. Zero network discovery footprint
3. Centralized graph database
4. We know everything that changes
5. Discover and update dependency information

(even more) Features...
6. Discovery and monitoring tightly integrated – discovery drives monitoring
7. Discovery and monitoring easily extensible
8. Naturally scalable to > 100K systems
9. Server failures distinguishable from switch failures
10. Minimal network load
11. Multi-tenant support

Third Dimension: Uniformly, fully distributed work
Two philosophical underpinnings
1. Monitoring and Discovery are fully distributed
2. Reliable “no news is good news”
Only responses to changes are centralized

Massive Scalability – or “I see dead servers in O(1) time”
• Adding systems does not increase the monitoring work on any system
• Each server monitors 2 (or 4) neighbors
• Each server monitors its own services
• Ring repair and alerting is O(n) – but a very small amount of work
• Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds)
[Figure: Current Implementation]

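The constant per-node monitoring cost can be sketched as follows. This is a minimal illustration, assuming the nodes form a simple ordered ring; the function names `ring_neighbors` and `remove_node` are hypothetical and are not taken from the project's code.

```python
def ring_neighbors(nodes, node, per_side=1):
    """Return the peers that `node` exchanges heartbeats with in a ring.

    per_side=1 gives 2 neighbors, per_side=2 gives 4. The number of peers
    is independent of the ring size (the index lookup below is only a
    convenience of this sketch).
    """
    size = len(nodes)
    i = nodes.index(node)
    peers = []
    for step in range(1, per_side + 1):
        for j in ((i - step) % size, (i + step) % size):
            if nodes[j] != node and nodes[j] not in peers:
                peers.append(nodes[j])
    return peers


def remove_node(nodes, dead):
    """Repair the ring after `dead` fails: its former neighbors become adjacent."""
    return [name for name in nodes if name != dead]


ring = [f"host{i:03d}" for i in range(1000)]
print(ring_neighbors(ring, "host000"))                    # ['host999', 'host001']
print(len(ring_neighbors(ring, "host500", per_side=2)))   # 4

# Arithmetic behind the slide's estimate: 10K packets spread over one day.
seconds_per_packet = 86_400 / 10_000
print(seconds_per_packet)                                 # 8.64, i.e. roughly 1 packet per 9 seconds
```

Growing `ring` from a thousand to a million entries leaves the length of each node's peer list unchanged, which is the sense in which adding systems does not add monitoring work to any one system.
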
Fourth Dimension: Architectural Components
Three Architectural Components
Collective Management Authority
• One CMA per installation
Nanoprobes
• One nanoprobe per system
Data Storage
• Central Neo4j graph database

Nanoprobe Functions ('C')
Announce self to CMA
• Reserved multicast address (can be unicast address or name if no multicast)
Do what CMA says
• receive configuration information – CMA addresses, ports, defaults
• send/expect heartbeats
• perform discovery actions
• perform monitoring actions
No persistent state across reboots

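The "send/expect heartbeats" duty combined with "no news is good news" reporting might look roughly like this. It is a sketch only: the real nanoprobe is written in C, and the class name, event names and `deadline` parameter here are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Track expected heartbeats and report only state changes."""

    def __init__(self, peers, deadline=10.0, now=0.0):
        self.deadline = deadline                       # seconds of silence tolerated
        self.last_seen = {peer: now for peer in peers}
        self.dead = set()

    def heartbeat(self, peer, now):
        """Record a heartbeat; return an event only if the peer was thought dead."""
        self.last_seen[peer] = now
        if peer in self.dead:
            self.dead.discard(peer)
            return ("back-alive", peer)
        return None  # no news is good news: nothing is sent upstream

    def check(self, now):
        """Return events for peers that have newly missed their deadline."""
        events = []
        for peer, seen in self.last_seen.items():
            if peer not in self.dead and now - seen > self.deadline:
                self.dead.add(peer)
                events.append(("dead", peer))
        return events


mon = HeartbeatMonitor(["left", "right"], deadline=10.0)
mon.heartbeat("left", now=8.0)            # routine heartbeat, no report
print(mon.check(now=12.0))                # [('dead', 'right')]
print(mon.check(now=13.0))                # [] (already reported)
print(mon.heartbeat("right", now=14.0))   # ('back-alive', 'right')
```

Only the events returned by `check` and the occasional `back-alive` would need to travel to the CMA; routine heartbeats stay between neighbors.
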
Service Monitoring based on Linux-HA/Pacemaker LRM
• LRM == Local Resource Manager
• Well-proven architecture:
  – “no news is good news” AKA management by exception
• Implements Open Cluster Framework standard (and others)
• Each system monitors own services
• Can also start, stop, migrate services

Monitoring Pros and Cons
Pros
• Simple & Scalable
• Uniform work distribution
• No single point of failure
• Distinguishes switch vs host failure
• Easy on LAN, WAN
• Multi-tenant approach
Cons
• Active agents
• Potential slowness at power-on

Why a graph database? (Neo4j)
• Humans describe systems as graphs
• Dependency & Discovery information: graph
• Speed of graph traversals depends on size of subgraph, not total graph size
• Root cause queries are graph traversals – notoriously slow in relational databases
• Visualization is Natural
• Schema-less design: good for constantly changing heterogeneous environment
• Graph Model === Object Model

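The point about traversal cost can be illustrated without a database. The sketch below uses a plain Python dictionary as a stand-in for the graph; the node names and the `DEPENDS_ON` edges are invented for the example and it does not use the Neo4j API.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on every item in its list.
DEPENDS_ON = {
    "webapp": ["appserver"],
    "appserver": ["database", "cache"],
    "database": ["host-db1"],
    "cache": ["host-cache1"],
    "host-db1": ["switch-a"],
    "host-cache1": ["switch-a"],
    "mail": ["host-mail1"],
    "host-mail1": ["switch-b"],
}


def dependencies_of(start, edges):
    """Breadth-first walk over everything `start` depends on, directly or not.

    Only nodes reachable from `start` are visited, so the cost follows the
    size of that subgraph rather than the size of the whole graph.
    """
    visited = {start}
    found = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in edges.get(node, []):
            if dep not in visited:
                visited.add(dep)
                found.append(dep)
                queue.append(dep)
    return found


# Root-cause candidates for a failing "webapp"; the mail subgraph is never touched.
print(dependencies_of("webapp", DEPENDS_ON))
```

A relational schema would typically need one self-join per level of this walk, which is why such queries tend to be slow there.
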
Sixth Dimension: Discovery API
Scripts perform discovery – output JSON
Three Sample Discovery Snippets
• OS information
• Service discovery
• Client discovery

A multi-dimensional demo
• Demonstrate basic capabilities
  – Discovery
  – Automatic monitoring configuration
  – Monitoring – failures / successes
• No configuration was supplied – everything comes from discovery

How does discovery work?
Nanoprobe scripts perform discovery
• Each discovers one kind of information
• Can take arguments from environment
• Output JSON
CMA stores Discovery Information
• JSON stored in Neo4j database
• CMA discovery plugins => graph nodes and relationships

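Both halves of this flow can be sketched in a few lines. Everything specific here is an assumption for illustration: the JSON field names, the `DISCOVERY_HOST` environment variable, and the `os_plugin` function are hypothetical and do not reproduce the project's actual schema or plugin interface.

```python
import json
import os
import platform


def discover_os(environ=os.environ):
    """Discovery-script side: collect one kind of information, emit JSON."""
    info = {
        # Illustrative field names, not the project's real format.
        "type": "os",
        # Arguments arrive through the environment (hypothetical variable name).
        "host": environ.get("DISCOVERY_HOST", platform.node()),
        "data": {
            "kernel": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
        },
    }
    return json.dumps(info, sort_keys=True)


def os_plugin(json_text):
    """CMA side: turn discovery JSON into graph nodes and relationships."""
    doc = json.loads(json_text)
    host = ("Host", doc["host"])
    os_node = ("OS", f'{doc["data"]["kernel"]} {doc["data"]["release"]}')
    nodes = [host, os_node]
    relationships = [(host, "RUNS", os_node)]
    return nodes, relationships


if __name__ == "__main__":
    output = discover_os()
    print(output)
    print(os_plugin(output))
```

Keeping each script to one kind of information means a new discovery type needs only a new script and a matching plugin, with no change to the transport in between.
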
Seventh Dimension: Current Status
• First release April 2013
• Great unit tests
• Nanoprobe code works well
• Several discovery methods written
• CMA restructuring complete
• Discovery => Automatic Monitoring (WOOT!)
• UI development underway
• Licensed under GPL: commercial options available

Eighth Dimension: Get Involved!
We need every talent!
• Early adopters
• Testers, Continuous Integration
• Designers
• Developers (C, Python, Shell, PowerShell, JavaScript)
• Porters (esp Windows)
• Promoters, publicists
• Packagers
• And so on...

Resistance Is Futile!
• Mailing List: bit.ly/AssimML
• #AssimProj @OSSAlanR
• Project Web Site: assimproj.org
• Blog: techthoughts.typepad.com
• assimilationsystems.com