Why Am I Here? • Understand Your Environment and Needs – What you currently do for discovery – What you currently do for monitoring – How those are working for you – Are multi-tenant capabilities appealing? • Give an overview of the project • Engage with your communities • Understand if we ought to stay in touch
Upcoming Events National Center for Atmospheric Research (today!) Denver Open Source User’s Group Facebook presentation GraphConnect San Francisco Open Source Monitoring Conference - Nürnberg NSA / Homeland Security Assimilation Technical Talk Large Installation System Administration Conference - DC Colorado Springs Open Source User’s Group linux.conf.au – Linux Conference in Australia - Perth Details on http://assimilationsystems.com/
Project Scope Zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring • Continuous extensible discovery – systems, switches, services, dependencies – zero network footprint • Extensible exception monitoring – more than 100K systems • All data goes into central graph database
Architectural Overview Collective Management Authority • One CMA per installation Nanoprobes • One nanoprobe per OS image Data Storage • Central Neo4j graph database General Rule: “No News Is Good News”
Massive Scalability – or “I see dead servers in O(1) time” • Adding systems does not increase the monitoring work on any system • Each server monitors 2 (or 4) neighbors • Each server monitors its own services • Ring repair and alerting is O(n) – but a very small amount of work • Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds) Current Implementation
Service Monitoring Based on Linux-HA LRM • LRM == Local Resource Manager • Well-proven architecture: – “no news is good news” AKA management by exception • Implements Open Cluster Framework standard (and others) • Each system monitors own services • Can also start, stop, migrate services
Monitoring Pros and Cons Pros Simple & Scalable Uniform work distribution No single point of failure Distinguishes switch vs host failure Easy on LAN, WAN Multi-tenant approach Cons Active agents Potential slowness at power-on
How does this apply to clouds? • Fits nicely into a cloud infrastructure – Should integrate into OpenStack, et al – Can control VMs • Can monitor customer VMs – Add nanoprobe to base image – bottom level of rings disappear without LLDP or CDP
Nanoprobe Functions ('C') Announce self to CMA • Reserved multicast address (can be unicast address or name if no multicast) Do what CMA says • receive configuration information – CMA addresses, ports, defaults • send/expect heartbeats • perform discovery actions • perform monitoring actions No persistent state across reboots
Why a graph database? (Neo4j) • Dependency & Discovery information: graph • Speed of graph traversals depends on size of subgraph, not total graph size • Root cause queries graph traversals – notoriously slow in relational databases • Visualization of relationships • Schema-less design: good for constantly changing heterogeneous environment • Graph Model === Object Model
How does discovery work? Nanoprobe scripts perform discovery • Each discovers one kind of information • Can take arguments (in environment) • Output JSON CMA stores Discovery Information • JSON stored in Neo4j database • CMA discovery plugins => graph nodes and relationships
Current State • First release was April 2013 • Great unit test infrastructure • Nanoprobe code – works well • Service monitoring works • Lacks digital signatures, encryption, compression • Reliable UDP comm code working • Several discovery methods written • CMA and database code restructuring near-complete • UI development underway • Licensed under the GPL, commercial license available
Future Plans • Production grade by end of year • Purchased support • “Real digital signatures, compression, encryption • Other security enhancements • Much more discovery • GUI • Alerting • Reporting • Add Statistical Monitoring • Best Practice Audits • Dynamic (aka cloud) specialization • Hundreds more ideas – See: https://trello.com/b/OpaED3AT
Get Involved! Powerful Ideas and Infrastucture Fun, ground-breaking project Looking for early adopters, testers!! Needs for every kind of skill • Awesome User Interfaces (UI/UX) • Evangelism, community building • Test Code (simulate 106 servers!) • Python, C, script coding • Documentation • Feedback: Testing, Ideas, Plans • Many others!
Resistance Is Futile! #AssimProj @OSSAlanR #AssimMon Project Web Site http://assimproj.org Blog techthoughts.typepad.com lists.community.tummy.com/cgi-bin/mailman/admin/assimilation