using The Assimilation Project #AssimProj @OSSAlanR http://assimproj.org/ http://bit.ly/AssimOSMC2013 Alan Robertson <[email protected]> Assimilation Systems Limited http://assimilationsystems.com
O S M C Project Scope: zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring • Continuous, extensible discovery – systems, switches, services, dependencies – zero network footprint • Extensible exception monitoring – more than 100K systems • All data goes into a central graph database
Assimilation Project History • Inspired by a 2-million-core computer (Cyclops64) • Concerns for extreme scale • Topology-aware monitoring • Topology discovery without security issues ⇒ discovery of everything!
An 8-dimensional overview • Problems Addressed • Unique Capabilities • Distribution of Work • Architectural Components • Discovery Graph Schema • Extensible Discovery API • Current Status • Project Needs
First Dimension: Problems Addressed. Risk management at extreme scale: 1. Maintaining a detailed discovery database 2. Discovering systems you've forgotten about 3. Discovering what (licensed) software you're running, and where 4. Monitoring services, systems and switches 5. Finding services you aren't monitoring
Why Discovery? (DevOps) • Documentation: incomplete, incorrect • Dependencies: unknown • Planning: needs accurate data • Best practices: verification needs data • ITIL CMDB (Configuration Management Database) Our discovery: continuous, low-profile
Second Dimension: Unique Powerful Features 1. Continuous Discovery 2. Zero network footprint 3. Centralized graph database 4. We know everything that changes 5. Discover and update dependency information
(even more) Features... 6. Discovery and monitoring tightly integrated 7. Discovery and monitoring easily extensible 8. Naturally scalable to > 100K systems 9. Server failures distinguishable from switch failures 10. Minimal network load 11. Multi-tenant support
Third Dimension: Uniformly, fully distributed work. Two philosophical underpinnings: 1. Monitoring and discovery are fully distributed 2. Reliable “no news is good news”: only responses to changes are centralized
Massive Scalability – or “I see dead servers in O(1) time” • Adding systems does not increase the monitoring work on any system • Each server monitors 2 (or 4) neighbors • Each server monitors its own services • Ring repair and alerting is O(n), but a very small amount of work • Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet every 9 seconds) [Diagram: Current Implementation]
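The neighbor-ring idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: all nodes sit in one flat ring held in a list, and the helper names `neighbors`, `repair_ring` and `seconds_per_packet` are hypothetical. It shows that each node's monitoring load stays fixed as the ring grows, and checks the slide's arithmetic (10K packets per day works out to one packet every 8.64 seconds).

```python
# Minimal sketch of ring-based neighbor monitoring (hypothetical helpers).
# Assumption: all nodes sit in one flat ring stored as a Python list.

SECONDS_PER_DAY = 86_400


def neighbors(ring, node, width=1):
    """Return the peers `node` exchanges heartbeats with.

    width=1 yields 2 neighbors, width=2 yields 4: the "2 (or 4)" on the slide.
    The count never depends on len(ring).
    """
    n = len(ring)
    i = ring.index(node)
    peers = []
    for k in range(1, width + 1):
        for j in ((i - k) % n, (i + k) % n):
            peer = ring[j]
            if peer != node and peer not in peers:
                peers.append(peer)
    return peers


def repair_ring(ring, dead):
    """Drop a dead node so its former neighbors become adjacent.

    Only the nodes next to `dead` gain a new peer; rebuilding the list is
    O(n) here purely because the sketch uses a plain list.
    """
    return [node for node in ring if node != dead]


def seconds_per_packet(packets_per_day):
    """Average interval between packets for a given daily packet count."""
    return SECONDS_PER_DAY / packets_per_day


if __name__ == "__main__":
    small = [f"host{i}" for i in range(10)]
    large = [f"host{i}" for i in range(100_000)]
    print(len(neighbors(small, "host5")), len(neighbors(large, "host5")))  # 2 2
    print(neighbors(small, "host0"))  # ['host9', 'host1']
    print(neighbors(repair_ring(small, "host1"), "host0"))  # ['host9', 'host2']
    print(seconds_per_packet(10_000))  # 8.64
```

The per-node cost is set by `width`, never by ring size, which is the point of the "O(1)" claim; the total repair work still grows with the number of nodes, as the slide says.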
Fourth Dimension: Architectural Components. Three architectural components: Collective Management Authority (CMA) • One CMA per installation Nanoprobes • One nanoprobe per system Data Storage • Central Neo4j graph database
Nanoprobe Functions (in C): Announce self to CMA • Reserved multicast address (can be a unicast address or name if no multicast) Do what CMA says • Receive configuration information – CMA addresses, ports, defaults • Send/expect heartbeats • Perform discovery actions • Perform monitoring actions No persistent state across reboots
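The "send/expect heartbeats" duty combined with “no news is good news” can be illustrated with a small tracker that stays silent while peers keep talking and emits an event only when something changes. This is a hedged sketch, not nanoprobe code: the `HeartbeatTracker` class, the fixed `deadtime` and the event tuples are all hypothetical, and the optional `now` arguments exist only to make the behavior easy to demonstrate.

```python
import time


class HeartbeatTracker:
    """Hypothetical sketch of the "send/expect heartbeats" duty.

    Assumptions: a fixed deadtime in seconds and an in-memory map of
    peer -> last time heard. Nothing is written to disk, in line with
    "no persistent state across reboots".
    """

    def __init__(self, peers, deadtime=10.0, now=None):
        start = time.monotonic() if now is None else now
        self.deadtime = deadtime
        self.last_heard = {peer: start for peer in peers}
        self.reported_dead = set()

    def heard_from(self, peer, now=None):
        """Record a heartbeat received from `peer`."""
        self.last_heard[peer] = time.monotonic() if now is None else now

    def check(self, now=None):
        """Return only the changes worth telling the CMA about.

        Peers that keep heartbeating produce nothing ("no news is good
        news"); an event is emitted only when a peer first exceeds the
        deadtime or returns after being reported dead.
        """
        now = time.monotonic() if now is None else now
        events = []
        for peer, last in self.last_heard.items():
            late = now - last > self.deadtime
            if late and peer not in self.reported_dead:
                self.reported_dead.add(peer)
                events.append(("dead", peer))
            elif not late and peer in self.reported_dead:
                self.reported_dead.discard(peer)
                events.append(("alive", peer))
        return events


if __name__ == "__main__":
    tracker = HeartbeatTracker(["left", "right"], deadtime=10.0, now=0.0)
    tracker.heard_from("left", now=8.0)
    print(tracker.check(now=12.0))  # [('dead', 'right')]
    print(tracker.check(now=13.0))  # []
    tracker.heard_from("right", now=14.0)
    print(tracker.check(now=15.0))  # [('alive', 'right')]
```

Because a missing peer is reported once rather than on every check, the central side only ever sees state changes.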
Service Monitoring based on Linux-HA/Pacemaker LRM • LRM == Local Resource Manager • Well-proven architecture: – “no news is good news”, AKA management by exception • Implements the Open Cluster Framework (OCF) standard (and others) • Each system monitors its own services • Can also start, stop, migrate services
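Management by exception for a local service can be sketched around the OCF resource-agent exit codes for the `monitor` action (0 = running, 7 = cleanly stopped, 1 = generic error). The codes come from the OCF resource-agent API; the `ServiceWatcher` class and its reporting behavior are hypothetical and only illustrate the idea of sending a message when the state changes.

```python
# Subset of the OCF resource-agent exit codes for the "monitor" action.
OCF_SUCCESS = 0        # resource is running
OCF_ERR_GENERIC = 1    # generic error
OCF_NOT_RUNNING = 7    # resource is cleanly stopped


def classify(rc):
    """Map a monitor exit code to a coarse service state."""
    if rc == OCF_SUCCESS:
        return "running"
    if rc == OCF_NOT_RUNNING:
        return "stopped"
    return "failed"  # simplification: every other code counts as failed here


class ServiceWatcher:
    """Hypothetical local watcher for one service: report transitions only."""

    def __init__(self, name):
        self.name = name
        self.state = None  # unknown until the first monitor result

    def observe(self, rc):
        """Return (name, old, new) when the state changes, else None."""
        new = classify(rc)
        if new == self.state:
            return None  # management by exception: nothing to send
        old, self.state = self.state, new
        return (self.name, old, new)


if __name__ == "__main__":
    watcher = ServiceWatcher("httpd")
    for rc in (0, 0, 0, OCF_ERR_GENERIC, 1, 0):
        event = watcher.observe(rc)
        if event:
            print(event)
    # ('httpd', None, 'running')
    # ('httpd', 'running', 'failed')
    # ('httpd', 'failed', 'running')
```

Six monitor runs produce three messages; a healthy service that stays healthy produces none after the first.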
Monitoring Pros and Cons. Pros: • Simple & scalable • Uniform work distribution • No single point of failure • Distinguishes switch vs. host failure • Easy on LAN, WAN • Multi-tenant approach. Cons: • Active agents • Potential slowness at power-on
Why a graph database? (Neo4j) • Humans describe systems as graphs • Dependency & discovery information: graph • Speed of graph traversals depends on size of subgraph, not total graph size • Root-cause queries are graph traversals, notoriously slow in relational databases • Visualization is natural • Schema-less design: good for constantly changing heterogeneous environments • Graph model === object model
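The claim that traversal cost depends on the subgraph rather than the whole graph can be shown with a plain breadth-first walk. This sketch uses an in-memory dict of "depends on" edges instead of Neo4j, and the `dependencies` function and node names are hypothetical; in Neo4j the equivalent would be a variable-length path query.

```python
from collections import deque


def dependencies(graph, start):
    """Breadth-first walk of "depends on" edges from `start`.

    Returns every transitive dependency. Only nodes reachable from `start`
    are touched, so the work grows with that subgraph, not with the total
    number of nodes in `graph`.
    """
    seen = {start}
    found = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                found.append(dep)
                queue.append(dep)
    return found


if __name__ == "__main__":
    graph = {
        "webapp": ["db", "auth"],
        "auth": ["ldap"],
        "db": ["switch3"],
        "batchjob": ["db"],
    }
    # Unrelated systems enlarge the graph but add no work to this query.
    graph.update({f"idle{i}": [] for i in range(100_000)})
    # Candidate root causes if "webapp" fails:
    print(dependencies(graph, "webapp"))  # ['db', 'auth', 'switch3', 'ldap']
```

The 100,000 idle nodes are never visited; only the five nodes reachable from `webapp` are.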
Sixth Dimension: Discovery API. Scripts perform discovery and output JSON. Three discovery snippets: • OS information • Service discovery • Client discovery
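A discovery script in this style only has to print one JSON document on standard output. The Python below is a hedged sketch of an OS-information script: the envelope keys `discovertype`, `host` and `data`, and the `ASSIM_PRETTY` environment variable, are assumptions for illustration rather than the project's documented format, and the project's own discovery scripts may well be written in shell.

```python
import json
import os
import platform
import socket


def discover_os():
    """Collect basic OS facts into a JSON-ready dict.

    The envelope keys ("discovertype", "host", "data") are an assumed
    format for illustration, not the project's documented schema.
    """
    uname = platform.uname()
    return {
        "discovertype": "os",
        "host": socket.gethostname(),
        "data": {
            "system": uname.system,
            "release": uname.release,
            "version": uname.version,
            "machine": uname.machine,
        },
    }


if __name__ == "__main__":
    # Hypothetical argument passed via the environment: pretty-print if set.
    indent = 2 if os.environ.get("ASSIM_PRETTY") else None
    print(json.dumps(discover_os(), indent=indent, sort_keys=True))
```

Keeping each script to one kind of information and one JSON document makes new discovery types easy to add without touching the nanoprobe itself.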
How does discovery work? Nanoprobe scripts perform discovery • Each discovers one kind of information • Can take arguments from the environment • Output JSON. CMA stores discovery information • JSON stored in Neo4j database • CMA discovery plugins turn the JSON into graph nodes and relationships
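On the CMA side, a discovery plugin maps one JSON document to graph nodes and relationships. The sketch below is hypothetical throughout: the `PLUGINS` registry, the `services` discovery type, the `System`/`Service` labels and the `hosts` relationship are illustrative names, and it returns tuples where a real CMA would write into Neo4j.

```python
import json

# Hypothetical plugin registry: discovery type -> function that maps one
# parsed JSON document to graph nodes and relationships.
PLUGINS = {}


def plugin(discovertype):
    """Decorator registering a handler for one discovery type."""
    def register(func):
        PLUGINS[discovertype] = func
        return func
    return register


@plugin("services")
def services_to_graph(doc):
    """One System node, one Service node per entry, and a "hosts" edge each."""
    host = doc["host"]
    nodes = [("System", host)]
    relationships = []
    for name, attrs in sorted(doc["data"].items()):
        service_id = f"{host}:{name}"
        nodes.append(("Service", service_id))
        relationships.append((host, "hosts", service_id, attrs))
    return nodes, relationships


def process(discovery_json):
    """Dispatch a raw JSON discovery message to the plugin for its type.

    A real CMA would store the result in Neo4j; this sketch just returns it.
    """
    doc = json.loads(discovery_json)
    return PLUGINS[doc["discovertype"]](doc)


if __name__ == "__main__":
    message = json.dumps({
        "discovertype": "services",
        "host": "web1",
        "data": {"sshd": {"port": 22}, "nginx": {"port": 80}},
    })
    nodes, rels = process(message)
    print(nodes)  # [('System', 'web1'), ('Service', 'web1:nginx'), ('Service', 'web1:sshd')]
    print(rels[0])  # ('web1', 'hosts', 'web1:nginx', {'port': 80})
```

Dispatching on the discovery type keeps each plugin small and lets new discovery scripts arrive with their own plugin.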
Seventh Dimension: Current Status • First release April 2013 • Great unit tests • Nanoprobe code works well • Several discovery methods written • CMA restructuring finishing up • UI development underway • Licensed under GPL: commercial options available
Eighth Dimension: Get Involved! We need every talent! • Early adopters • Testers, continuous integration • Designers • Developers (C, Python, Shell, PowerShell, JavaScript) • Porters (especially Windows) • Promoters, publicists • Packagers • And so on...
Get Involved! Powerful ideas and infrastructure. Fun, ground-breaking project. Looking for early adopters, testers!! Needs for every kind of skill: • Awesome user interfaces (UI/UX) • Evangelism, community building • Test code (simulate 10⁶ servers!) • Python, C, script coding • Documentation • Feedback: testing, ideas, plans • Many others!
Resistance Is Futile! Mailing list: bit.ly/AssimML • #AssimProj @OSSAlanR • Project web site: assimproj.org • Blog: techthoughts.typepad.com • assimilationsystems.com
Discovery: discovering • systems you've forgotten • what you're not monitoring • whatever you'd like • without setting off network security alarms
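Once discovery and monitoring data live in one database, finding “what you're not monitoring” reduces to a set difference. A minimal sketch, assuming both inventories have already been pulled out of the graph into plain dicts; the `unmonitored` function and all host and service names are hypothetical.

```python
def unmonitored(discovered, monitored):
    """Map each host to the discovered services nobody is monitoring.

    `discovered` and `monitored` are {host: iterable of service names}.
    A host that appears only in `discovered` (a forgotten system) has all
    of its services reported.
    """
    gaps = {}
    for host, services in discovered.items():
        missing = set(services) - set(monitored.get(host, ()))
        if missing:
            gaps[host] = sorted(missing)
    return gaps


if __name__ == "__main__":
    discovered = {
        "web1": ["nginx", "sshd"],
        "db1": ["postgres", "sshd"],
        "old7": ["ftpd"],
    }
    monitored = {"web1": ["nginx", "sshd"], "db1": ["postgres"]}
    print(unmonitored(discovered, monitored))  # {'db1': ['sshd'], 'old7': ['ftpd']}
```

The forgotten host `old7` and the unwatched `sshd` on `db1` both fall out of the same comparison.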
How does this apply to clouds? • Fits nicely into a cloud infrastructure – Should integrate into OpenStack, et al. – Can control VMs • Can monitor customer VMs – Add nanoprobe to base image – Bottom level of rings disappears without LLDP or CDP
Future Plans • Production grade by end of year • Purchased support • Real digital signatures, compression, encryption • Other security enhancements • Much more discovery • GUI • Alerting • Reporting • Add statistical monitoring • Best practice audits • Dynamic (aka cloud) specialization • Hundreds more ideas – See: https://trello.com/b/OpaED3AT