IT Discovery and Monitoring Without Limit using The Assimilation Project
#AssimProj #AssimMon @OSSAlanR
http://assimproj.org/
This presentation: http://bit.ly/1bh1iO8
Alan Robertson
Assimilation Systems Limited

6/25/12 2/30
Biography
● Founded Linux-HA project - led 1998-2007 - now called Pacemaker
● Founded Assimilation Project in 2010
● Founded Assimilation Systems Limited in 2013
● Alumnus of Bell Labs, SuSE, IBM

Project background
● Available as GPL (or commercial)
● Founded in late 2010
● Now my full time endeavor – Assimilation Systems Limited
● Currently around 25K lines of code
● First release: April 2013

T.A.N.S.T.A.A.F.L.
What I need from you...
● Feedback on the project/product
  – Is it useful – why or why not?
  – Would it sell to management?
● Feedback on my approach to presenting it
● Other presentation feedback
  – Clarity, Style, etc...

Project Scope
Zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring
● Extensible discovery of systems, switches, services, and dependencies – without setting off network alarms
● Extensible monitoring of > 100K systems
● All data goes into central graph database

Questions for Audience
● How many of you have monitoring?
  – Open or closed source?
  – How many of you are happy with it?
● How many of you have discovery?
  – Open or closed source?
  – How many of you are happy with it?

Risk Management
● We monitor systems and services to reduce the risk of extended outages
● We discover systems to reduce the risk of intrusions
● We discover services to reduce the risk of extended outages
● We discover switch connections, dependencies, etc. to decrease the risk of system maintenance, growth and management
● Reducing risk is good for everyone

Why Discovery?
● 30% of intrusions come from unknown or forgotten systems
  – We discover them
● Most documentation is incomplete, incorrect
● Dependencies often unknown
● Licensed software & lawsuits are expensive
● Auditability: you know if you got it all
● Improves planning, understanding, mgmt
● Creates opportunities to check best practices
● Discovery database is an ITIL CMDB

Why Our Monitoring?
● Much simpler to configure (in theory)
● Growth unlikely to ever be an issue
  – No need for proxies, multiple servers
● Dependencies help diagnose problems
● Extremely low network traffic
● Ideal for cross-WAN monitoring
● Highlight cascading failure root causes
● Not confused by switch failures
● Most switches get monitored “for free”

Architectural Overview
Collective Monitoring Authority (CMA)
● One CMA per installation
Nanoprobes
● One nanoprobe per OS image
Data Storage
● Central Neo4j graph database
General Rule: “No News Is Good News”

Massive Scalability – or “I see dead servers in O(1) time”
● Adding systems does not increase the monitoring work on any system
● Each server monitors 2 (or 4) neighbors
● Each server monitors its own services
● Ring repair and alerting is O(n) – but a very small amount of work
● Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds)
Today's Implementation

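The ring idea on this slide can be illustrated with a small sketch. This is not the project's code: the function names `ring_neighbors` and `repair_ring` and the list-based ring are assumptions made for illustration. It shows that the set of peers a node watches depends only on its position in the ring, not on how many nodes exist, and it checks the packet arithmetic quoted above.

```python
def ring_neighbors(nodes, name, per_side=1):
    """Return the peers `name` exchanges heartbeats with in a ring ordering.

    per_side=1 gives 2 neighbors, per_side=2 gives 4 (hypothetical parameter).
    """
    n = len(nodes)
    i = nodes.index(name)
    peers = []
    for step in range(1, per_side + 1):
        for j in ((i - step) % n, (i + step) % n):
            peer = nodes[j]
            if peer != name and peer not in peers:
                peers.append(peer)
    return peers


def repair_ring(nodes, dead):
    """Drop a dead node; its former neighbors become each other's neighbors."""
    return [node for node in nodes if node != dead]


ring = ["a", "b", "c", "d", "e"]
print(ring_neighbors(ring, "c"))   # ['b', 'd']
ring = repair_ring(ring, "c")
print(ring_neighbors(ring, "b"))   # ['a', 'd']

# One packet per 9 seconds over a full day, as quoted on the slide.
print(24 * 60 * 60 // 9)           # 9600, which is under 10K
```

Each node's workload stays at 2 (or 4) peers however long the list grows; only the repair step touches the ring as a whole.
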
Massive Scalability – or “I see dead servers in O(1) time”
Planned Topology-Aware Architecture
Multiple levels of rings:
● Support diagnosing switch issues
● Minimize network traffic
● Ideal for multi-site arrangements

Who will watch the watchers?
● CMA in HA cluster
● Services watched by scripts
● Scripts watched by nanoprobe
● Nanoprobes watch each other
● CMA runs nanoprobes

Service Monitoring
Based on Linux-HA LRM ideas
● LRM == Local Resource Manager
● Well-proven architecture: “no news is good news”
● Implements Open Cluster Framework standard (and others)
● Each system monitors own services

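A minimal sketch of “no news is good news” for local service monitoring follows. The exit codes 0 and 7 are the Open Cluster Framework resource agent values for “running” and “not running”; everything else here (the `ServiceMonitor` class, the shape of the report, and the choice to report the very first result as a baseline) is an assumption for illustration, not the LRM's or the project's actual design.

```python
OCF_SUCCESS = 0       # OCF resource agent exit code: resource is running
OCF_NOT_RUNNING = 7   # OCF resource agent exit code: resource is not running


class ServiceMonitor:
    """Remembers the last result per service and reports only changes."""

    def __init__(self):
        self.last = {}

    def check(self, service, exit_code):
        """Return a report dict when the status changed, else None."""
        previous = self.last.get(service)
        self.last[service] = exit_code
        if previous == exit_code:
            return None   # unchanged: stay silent
        return {"service": service, "previous": previous, "current": exit_code}


monitor = ServiceMonitor()
results = [OCF_SUCCESS, OCF_SUCCESS, OCF_NOT_RUNNING, OCF_NOT_RUNNING, OCF_SUCCESS]
reports = [r for r in (monitor.check("httpd", code) for code in results) if r]
print(len(reports))   # 3: the baseline, the failure, and the recovery
```

Five checks produce three messages; a service that stays healthy produces none after the first.
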
Monitoring Pros and Cons
Pros
● Simple & Scalable
● Uniform work distribution
● No single point of failure
● Distinguishes switch vs host failure
● Easy on LAN, WAN
Cons
● Active agents
● Potential slowness at power-on

How does this apply to clouds?
● Fits nicely into a cloud infrastructure
  – Should integrate into OpenStack, et al.
  – Can also control VMs – already knows how to start, stop and migrate VMs
● Can also monitor VMs
  – Bottom level of rings disappears without LLDP or CDP
  – If you add this to your base image, with one configuration file per customer, then no need to configure anything else for basic monitoring.

Continuous Integrated Stealth Discovery
Continuous - Ongoing, incremental
Integrated - Monitoring does discovery; stored in same database
Stealth - No network privileges needed - no port scans or pings
Discovery - Systems, switches, clients, services and dependencies
➔ Up-to-date picture of pieces & how they work w/o “network security

Why a graph database? (Neo4j)
● Dependency & Discovery information: graph
● Speed of graph traversals depends on size of subgraph, not total graph size
● Root cause queries are graph traversals – notoriously slow in relational databases
● Visualization of relationships
● Schema-less design: good for constantly changing heterogeneous environment

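To make the traversal point concrete, here is a small in-memory sketch. It does not use Neo4j or its query language; the dictionary-based graph, the node names, and the `root_causes` function are all hypothetical. It walks dependency edges from one failed service and reports failed nodes whose own dependencies are healthy. The walk visits only the nodes reachable from the starting point, however large the rest of the graph is.

```python
from collections import deque


def root_causes(depends_on, failed, start):
    """Follow dependency edges from `start`; return failed nodes whose own
    dependencies are all healthy (candidate root causes)."""
    seen = {start}
    queue = deque([start])
    causes = []
    while queue:
        node = queue.popleft()
        deps = depends_on.get(node, [])
        if node in failed and not any(d in failed for d in deps):
            causes.append(node)
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return causes


depends_on = {
    "webapp": ["appserver", "dns"],
    "appserver": ["database"],
    "database": ["switch-3"],
    "unrelated-batch-job": ["tape-library"],   # never visited from "webapp"
}
failed = {"webapp", "appserver", "database", "switch-3"}
print(root_causes(depends_on, failed, "webapp"))   # ['switch-3']
```
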
Nanoprobe Functions ('C')
Announce self to CMA
● Reserved multicast address (can be unicast address or name if no multicast)
Do what CMA says
● receive configuration information
  – CMA addresses, ports, defaults
● send/expect heartbeats
● perform discovery actions
● perform monitoring actions
No persistent state across reboots

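The “send/expect heartbeats” item can be sketched as follows. The real nanoprobe is written in C; this Python class, its method names, and the `deadtime` parameter are assumptions for illustration only. Timestamps are passed in explicitly so the logic can be followed without a clock, and state lives only in memory, in line with “no persistent state across reboots”.

```python
class HeartbeatListener:
    """Tracks when each expected peer was last heard from."""

    def __init__(self, deadtime):
        self.deadtime = deadtime   # seconds of silence before a peer is presumed dead
        self.last_heard = {}

    def expect(self, peer, now):
        """Start expecting heartbeats from `peer` (as instructed by the CMA)."""
        self.last_heard[peer] = now

    def heard_from(self, peer, now):
        """Record a heartbeat; packets from peers we do not expect are ignored."""
        if peer in self.last_heard:
            self.last_heard[peer] = now

    def overdue(self, now):
        """Peers that have been silent for longer than `deadtime`."""
        return sorted(p for p, t in self.last_heard.items()
                      if now - t > self.deadtime)


listener = HeartbeatListener(deadtime=10)
listener.expect("node-a", now=0)
listener.expect("node-b", now=0)
listener.heard_from("node-a", now=8)
print(listener.overdue(now=12))   # ['node-b']
```

Only the overdue list would need to be reported upstream; while peers keep answering, nothing is sent.
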
How does discovery work?
Nanoprobe scripts perform discovery
● Each discovers one kind of information
● Can take arguments (in environment)
● Output JSON
CMA stores Discovery Information
● JSON stored in Neo4j database
● CMA discovery plugins => graph nodes and relationships

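The two halves of this slide can be sketched together. None of the names below come from the project: the environment variable `DISCOVERY_LABEL`, the JSON field names, the node labels, and the `RUNS` relationship are all hypothetical, and the nodes are plain tuples rather than Neo4j objects. The first function plays the role of a discovery script that reads an argument from the environment and emits JSON; the second plays the role of a CMA plugin that turns that JSON into graph nodes and relationships.

```python
import json
import os
import platform


def discover_os():
    """Discovery 'script': one kind of information, argument from the environment."""
    label = os.environ.get("DISCOVERY_LABEL", "os")   # hypothetical variable name
    return json.dumps({
        "type": label,
        "host": platform.node(),
        "data": {"system": platform.system(), "release": platform.release()},
    })


def os_plugin(document):
    """CMA-side 'plugin': turn one JSON document into nodes and relationships."""
    record = json.loads(document)
    host = ("Host", record["host"])
    os_node = ("OS", record["data"]["system"] + " " + record["data"]["release"])
    nodes = [host, os_node]
    relationships = [(host, "RUNS", os_node)]
    return nodes, relationships


nodes, relationships = os_plugin(discover_os())
print(nodes)
print(relationships)
```

Keeping the script side limited to emitting JSON means new kinds of discovery only need a new script plus a matching plugin.
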
Current State
● First release was April 2013
● Great unit test infrastructure
● Nanoprobe code – works well
● Service monitoring works
● Lacking real digital signatures, encryption, compression
● Reliable UDP comm code all working
● CMA code works, much more to go
● Several discovery methods written
● Licensed under the GPL

Future Plans
● Production grade by end of year
● Commercial licenses with support
● “Real” digital signatures, compression, encryption
● Other security enhancements
● Much more discovery
● GUI
● Alerting
● Reporting
● Add Statistical Monitoring
● Best Practice Audits
● Dynamic (aka cloud) specialization
● Hundreds more ideas
  – See: https://trello.com/b/OpaED3AT

Get Involved!
Powerful Ideas and Infrastructure
Fun, ground-breaking project
Needs for every kind of skill
● Awesome User Interfaces (UI/UX)
● Test Code (simulate 10^6 servers!)
● Packaging, Continuous Integration
● Python, C, script coding
● Evangelism, community building
● Documentation
● Feedback: Testing, Ideas, Plans
● Integration with OpenStack
● Many others!

Resistance Is Futile!
#AssimProj @OSSAlanR #AssimMon
Project Web Site http://assimproj.org
Blog techthoughts.typepad.com
lists.community.tummy.com/cgi-bin/mailman/admin/assimilation