#AssimProj #AssimMon @OSSAlanR
http://assimproj.org/
This presentation: http://bit.ly/1bh1iO8
Alan Robertson
Assimilation Systems Limited
on the project/product
  – Is it useful? Why or why not?
  – Would it sell to management?
• Feedback on my approach to presenting it
• Other presentation feedback
  – Clarity, style, etc.
Monitoring
• Extensible discovery of systems, switches, services, and dependencies – without setting off network alarms
• Extensible monitoring of > 100K systems
• All data goes into central graph database
have monitoring?
  – Open or closed source?
  – How many of you are happy with it?
• How many of you have discovery?
  – Open or closed source?
  – How many of you are happy with it?
to reduce the risk of extended outages
• We discover systems to reduce the risk of intrusions
• We discover services to reduce the risk of extended outages
• We discover switch connections, dependencies, etc. to decrease the risk of system maintenance, growth and management
• Reducing risk is good for everyone
unknown or forgotten systems
  – We discover them
• Most documentation is incomplete or incorrect
• Dependencies often unknown
• Licensed software & lawsuits are expensive
• Auditability: you know if you got it all
• Improves planning, understanding, mgmt
• Creates opportunities to check best practices
• Discovery database is an ITIL CMDB
(in theory)
• Growth unlikely to ever be an issue
  – No need for proxies, multiple servers
• Dependencies help diagnose problems
• Extremely low network traffic
• Ideal for cross-WAN monitoring
• Highlight cascading failure root causes
• Not confused by switch failures
• Most switches get monitored “for free”
in O(1) time”
• Adding systems does not increase the monitoring work on any system
• Each server monitors 2 (or 4) neighbors
• Each server monitors its own services
• Ring repair and alerting is O(n) – but a very small amount of work
• Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds)

Today's Implementation
LRM == Local Resource Manager
• Well-proven architecture: “no news is good news”
• Implements Open Cluster Framework standard (and others)
• Each system monitors own services
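The “no news is good news” idea can be sketched as a local watcher that stays silent while a service is healthy and reports only when its state changes. This is a minimal illustration, not the project's LRM code: `ServiceWatcher`, the `checks` mapping and the `report` callback are hypothetical names chosen for this sketch.

```python
class ServiceWatcher:
    """Polls local services and reports only state changes.

    Hypothetical sketch of "no news is good news"; not the actual LRM API.
    """

    def __init__(self, checks, report):
        self.checks = checks    # service name -> callable returning True if healthy
        self.report = report    # called as report(name, healthy) on a change
        self.last = {}          # last known state per service

    def poll(self):
        for name, check in self.checks.items():
            try:
                healthy = bool(check())
            except Exception:
                healthy = False   # a crashing check counts as a failure
            # Unknown services are assumed healthy, so silence means "OK".
            previous = self.last.get(name, True)
            self.last[name] = healthy
            if healthy != previous:
                self.report(name, healthy)


if __name__ == "__main__":
    state = {"web": True}
    watcher = ServiceWatcher(
        {"web": lambda: state["web"]},
        lambda name, healthy: print(name, "UP" if healthy else "DOWN"),
    )
    watcher.poll()           # healthy: nothing is sent
    state["web"] = False
    watcher.poll()           # prints "web DOWN"
    watcher.poll()           # still down: nothing new to say
```

Because healthy polls produce no messages, the traffic a system generates depends on how often its services change state, not on how often they are checked.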
Pros
• Uniform work distribution
• No single point of failure
• Distinguishes switch vs host failure
• Easy on LAN, WAN
Cons
• Active agents
• Potential slowness at power-on
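The uniform work distribution comes from the ring arrangement described earlier: each server watches only its immediate neighbors, so its load does not grow with the number of systems. The sketch below illustrates that property and checks the quoted packet-rate arithmetic. It assumes the ring is an ordered list of names; `ring_neighbors`, `repair_ring` and `packets_per_day` are hypothetical helpers, not functions from the project.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ring_neighbors(members, index, span=1):
    """Peers that members[index] exchanges heartbeats with.

    span=1 gives 2 neighbors (one on each side); span=2 gives 4.
    The result size depends on span, not on len(members).
    """
    n = len(members)
    me = members[index]
    peers = []
    for offset in range(1, span + 1):
        for pos in ((index - offset) % n, (index + offset) % n):
            peer = members[pos]
            # Tiny rings wrap around, so skip ourselves and duplicates.
            if peer != me and peer not in peers:
                peers.append(peer)
    return peers


def repair_ring(members, failed):
    """Drop a failed member; its former neighbors become adjacent."""
    return [m for m in members if m != failed]


def packets_per_day(seconds_per_packet):
    """Convert a packet interval into a daily packet count."""
    return SECONDS_PER_DAY / seconds_per_packet


if __name__ == "__main__":
    ring = ["host%d" % i for i in range(6)]
    print(ring_neighbors(ring, 0))           # ['host5', 'host1']
    print(ring_neighbors(ring, 0, span=2))   # ['host5', 'host1', 'host4', 'host2']
    ring = repair_ring(ring, "host1")
    print(ring_neighbors(ring, 0))           # ['host5', 'host2']
    print(packets_per_day(9))                # 9600.0, under the 10K quoted earlier
```

Note that a real ring would be repaired by the central authority and pushed out to the affected neighbors; the list filter here only shows which pairs become adjacent.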
nicely into a cloud infrastructure
  – Should integrate into OpenStack, et al.
  – Can also control VMs: already knows how to start, stop and migrate VMs
• Can also monitor VMs
  – Bottom level of rings disappears without LLDP or CDP
  – If you add this to your base image, with one configuration file per customer, there is no need to configure anything else for basic monitoring
Integrated - Monitoring does discovery; stored in same database
Stealth - No network privileges needed - no port scans or pings
Discovery - Systems, switches, clients, services and dependencies
➔ Up-to-date picture of pieces & how they work w/o “network security
Discovery information: graph
• Speed of graph traversals depends on size of subgraph, not total graph size
• Root cause queries are graph traversals – notoriously slow in relational databases
• Visualization of relationships
• Schema-less design: good for constantly changing heterogeneous environment
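To illustrate why a root cause query is a graph traversal, here is a small in-memory sketch. It does not use Neo4j or the project's schema; the `depends_on` mapping, the node names and the `root_causes` function are assumptions made for the example. The walk follows only failed dependencies, so its cost depends on the affected subgraph rather than on the total number of nodes.

```python
from collections import deque


def root_causes(depends_on, failed, start):
    """Walk dependencies from `start` and return the failed nodes whose
    own dependencies are all healthy (the likely root causes)."""
    seen = {start}
    queue = deque([start])
    roots = []
    while queue:
        node = queue.popleft()
        failed_deps = [d for d in depends_on.get(node, []) if d in failed]
        if node in failed and not failed_deps:
            roots.append(node)
        for dep in failed_deps:
            if dep not in seen:      # guards against dependency cycles
                seen.add(dep)
                queue.append(dep)
    return roots


if __name__ == "__main__":
    # Hypothetical dependency graph: service -> things it depends on.
    depends_on = {
        "web": ["app", "cache"],
        "app": ["db"],
        "db": ["switch1"],
        "cache": [],
    }
    failed = {"web", "app", "db", "switch1"}
    print(root_causes(depends_on, failed, "web"))   # ['switch1']
```

In a relational database each hop of this walk would typically be another self-join, which is where the slowness mentioned above comes from.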
Reserved multicast address (can be unicast address or name if no multicast)
Do what CMA says
• receive configuration information
  – CMA addresses, ports, defaults
• send/expect heartbeats
• perform discovery actions
• perform monitoring actions
No persistent state across reboots
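The “expect heartbeats” duty can be sketched as a table of peers and the time each was last heard from. This is an illustrative model only: `HeartbeatMonitor`, its methods and the 10-second deadline are hypothetical, and the real nanoprobe is a separate implementation. State lives only in memory, matching the “no persistent state across reboots” point, and time is passed in explicitly so the logic is easy to test.

```python
class HeartbeatMonitor:
    """Tracks which peers we were told to expect heartbeats from.

    Hypothetical sketch; all state is in memory and is lost on restart.
    """

    def __init__(self, deadline_seconds):
        self.deadline = deadline_seconds
        self.last_seen = {}              # peer -> time of last heartbeat

    def expect(self, peer, now):
        """Start expecting heartbeats from a peer (as directed centrally)."""
        self.last_seen[peer] = now

    def stop_expecting(self, peer):
        self.last_seen.pop(peer, None)

    def heard_from(self, peer, now):
        """Record a heartbeat; packets from unexpected peers are ignored."""
        if peer in self.last_seen:
            self.last_seen[peer] = now

    def overdue(self, now):
        """Peers whose last heartbeat is older than the deadline."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.deadline
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(deadline_seconds=10)
    monitor.expect("hostA", now=0)
    monitor.expect("hostB", now=0)
    monitor.heard_from("hostA", now=8)
    print(monitor.overdue(now=11))       # ['hostB']
```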
• Each discovers one kind of information
• Can take arguments (in environment)
• Output JSON
CMA stores Discovery Information
• JSON stored in Neo4j database
• CMA discovery plugins => graph nodes and relationships
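A discovery script in this style might look like the sketch below: it discovers one kind of information (here, basic operating system facts), reads an optional argument from the environment, and prints JSON. The environment variable `DISCOVERY_DETAIL` and the JSON field names are assumptions made for illustration; they are not taken from the project's actual scripts.

```python
import json
import os
import platform
import socket


def discover_os(environ=os.environ):
    """Collect one kind of information: basic operating system facts."""
    # Arguments arrive through the environment; this name is hypothetical.
    detail = environ.get("DISCOVERY_DETAIL", "basic")
    data = {
        "sysname": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }
    if detail == "full":
        data["version"] = platform.version()
    # Field names below are illustrative, not the project's JSON schema.
    return {
        "discovertype": "os",
        "host": socket.gethostname(),
        "data": data,
    }


if __name__ == "__main__":
    # The script's only output is one JSON document on stdout.
    print(json.dumps(discover_os(), sort_keys=True))
```

Keeping each script to one kind of information and plain JSON output means a new discovery method can be added without changing the agent that runs it.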
• Great unit test infrastructure
• Nanoprobe code – works well
• Service monitoring works
• Lacking real digital signatures, encryption, compression
• Reliable UDP comm code all working
• CMA code works, much more to go
• Several discovery methods written
• Licensed under the GPL
year
• Commercial licenses with support
• “Real” digital signatures, compression, encryption
• Other security enhancements
• Much more discovery
• GUI
• Alerting
• Reporting
• Add Statistical Monitoring
• Best Practice Audits
• Dynamic (aka cloud) specialization
• Hundreds more ideas – See: https://trello.com/b/OpaED3AT
project
Needs for every kind of skill
• Awesome User Interfaces (UI/UX)
• Test Code (simulate 10^6 servers!)
• Packaging, Continuous Integration
• Python, C, script coding
• Evangelism, community building
• Documentation
• Feedback: Testing, Ideas, Plans
• Integration with OpenStack
• Many others!