
2013 July Boulder DevOps - Assimilation Project introduction


An overview of the Assimilation Project - providing IT discovery and monitoring

Alan Robertson

July 15, 2013



  1. IT Discovery and Monitoring
    Without Limit
    The Assimilation Project
    #AssimProj #AssimMon @OSSAlanR
    This presentation:
    Alan Robertson
    Assimilation Systems Limited


  2.

    Founded Linux-HA project - led 1998-2007 -
    now called Pacemaker

    Founded Assimilation Project in 2010

    Founded Assimilation Systems Limited

    Alumnus of Bell Labs, SuSE, IBM


  3. Project background

    Available as GPL (or commercial)

    Founded in late 2010

    Now my full time endeavor
    – Assimilation Systems Limited

    Currently around 25K lines of code

    First release: April 2013


  4. What I need from you...

    Feedback on the project/product
    – Is it useful – why or why not?
    – Would it sell to management?

    Feedback on my approach to presenting it

    Other presentation feedback
    – Clarity, Style, etc...


  5. Project Scope
    Zero-network-footprint continuous Discovery
    integrated with extreme-scale Monitoring

    Extensible discovery of systems, switches,
    services, and dependencies – without
    setting off network alarms

    Extensible monitoring of > 100K systems

    All data goes into central graph database


  6. Questions for Audience

    How many of you have monitoring?
    – Open or closed source?
    – How many of you are happy with it?

    How many of you have discovery?
    – Open or closed source?
    – How many of you are happy with it?


  7. Risk Management

    We monitor systems and services to
    reduce the risk of extended outages

    We discover systems to reduce the risk
    of intrusions

    We discover services to reduce the risk
    of extended outages

    We discover switch connections,
    dependencies, etc. to decrease the risk
    involved in system maintenance and growth
    Reducing risk is good for everyone


  8. Why Discovery?

    30% of intrusions come from unknown
    or forgotten systems
    – We discover them

    Most documentation is incomplete or incorrect

    Dependencies often unknown

    Licensed software & lawsuits are expensive

    Auditability: you know whether you got it all

    Improves planning, understanding, management

    Creates opportunities to check best practices

    Discovery database is an ITIL CMDB


  9. Why Our Monitoring?

    Much simpler to configure (in theory)

    Growth unlikely to ever be an issue
    – No need for proxies or multiple servers

    Dependencies help diagnose problems

    Extremely low network traffic

    Ideal for cross-WAN monitoring

    Highlight cascading failure root causes

    Not confused by switch failures

    Most switches get monitored “for free”


  10. Problems Addressed

    Discovery without setting off alarms

    Find “forgotten” servers and services

    Discover uses and installs of licensed software

    Scale up monitoring indefinitely

    Know that everything is monitored

    Minimize & simplify monitoring configuration

    Keep monitoring up-to-date

    Distinguish switch vs system failures

    Highlight root causes of cascading failures


  11. Architectural Overview
    Collective Monitoring Authority (CMA)

    One CMA per installation

    One nanoprobe per OS image
    Data Storage

    Central Neo4j graph database
    General Rule: “No News Is Good News”


  12. Massive Scalability – or
    “I see dead servers in O(1) time”

    Adding systems does not increase the
    monitoring work on any server
    Each server monitors 2 (or 4) neighbors

    Each server monitors its own services

    Ring repair and alerting is O(n) – but a very small amount of work

    Ring repair for a million nodes is less than 10K packets per day
    (approximately 1 packet per 9 seconds)
    Today's Implementation
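
    To make the O(1) claim concrete, here is a minimal sketch in Python
    (illustrative only, not the project's actual code) of ring-neighbor
    assignment: each server heartbeats only its ring neighbors, so adding
    servers never adds monitoring work to the servers already present.

    # Illustrative sketch, not Assimilation code.
    def ring_neighbors(servers, i):
        """Return the two neighbors servers[i] exchanges heartbeats with."""
        n = len(servers)
        return servers[(i - 1) % n], servers[(i + 1) % n]

    servers = ["srv%04d" % k for k in range(1000)]
    print(ring_neighbors(servers, 0))   # ('srv0999', 'srv0001')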


  13. Massive Scalability – or
    “I see dead servers in O(1) time”
    Planned Topology-Aware Architecture
    Multiple levels of rings:

    Support diagnosing switch issues

    Minimize network traffic

    Ideal for multi-site arrangements


  14. Who will watch the watchers?

    CMA runs in an HA cluster

    Services watched by scripts

    Scripts watched by nanoprobes

    Nanoprobes watch each other


  15. Service Monitoring
    Based on Linux-HA LRM ideas

    LRM == Local Resource Manager

    Well-proven architecture: “no news is
    good news”

    Implements Open Cluster Framework
    standard (and others)

    Each system monitors own services
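
    To make the LRM-style approach concrete: an OCF resource agent is an
    executable that takes an action argument ("start", "stop", "monitor")
    and reports status through its exit code. A minimal sketch in Python
    (the Dummy agent and paths come from the standard resource-agents
    package; this is not the project's own monitoring code):

    import os
    import subprocess

    # Invoke the standard OCF Dummy agent's "monitor" action, the way a
    # local resource manager would.
    env = dict(os.environ,
               OCF_ROOT="/usr/lib/ocf",
               OCF_RESKEY_state="/var/run/Dummy.state")
    rc = subprocess.call(["/usr/lib/ocf/resource.d/heartbeat/Dummy", "monitor"],
                         env=env)
    # OCF exit codes: 0 == running (OCF_SUCCESS), 7 == not running.
    print("healthy" if rc == 0 else "not running or failed (rc=%d)" % rc)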


  16. Monitoring Pros and Cons

    Pros:
    – Simple & Scalable
    – Uniform work
    – No single point of failure
    – Distinguishes switch vs host failure
    – Easy on LAN, WAN

    Cons:
    – Active agents
    – Potential slowness


  17. How does this apply to clouds?

    Fits nicely into a cloud infrastructure
    – Should integrate into OpenStack, et al
    – Can also control VMs – already knows
    how to start, stop and migrate VMs

    Can also monitor VMs
    – bottom level of rings disappears without
    LLDP or CDP
    – If you add this to your base image, with
    one configuration file per customer, then
    no need to configure anything else for
    basic monitoring.


  18. Basic CMA Functions (Python)
    Nanoprobe management
    – Configure & direct
    – Hear alerts & discovery
    – Update rings: join/leave

    Update database

    Issue alerts


  19. Continuous Integrated Stealth Discovery
    Continuous - Ongoing, incremental
    Integrated - Monitoring does discovery;
    stored in same database
    Stealth - No network privileges needed -
    no port scans or pings
    Discovery - Systems, switches, clients, services
    and dependencies
    ➔ Up-to-date picture of the pieces & how
    they work – without setting off network
    security alarms

  20. Why a graph database? (Neo4j)

    Dependency & Discovery information: graph

    Speed of graph traversals depends on size
    of subgraph, not total graph size

    Root cause queries => graph traversals –
    notoriously slow in relational databases

    Visualization of relationships

    Schema-less design: good for constantly
    changing heterogeneous environment
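
    As an example, a root-cause question becomes a short graph traversal.
    A sketch using the official neo4j Python driver – the Server/Service
    labels, the DEPENDS_ON relationship, and the credentials here are
    hypothetical, not the project's actual schema:

    from neo4j import GraphDatabase   # pip install neo4j

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    # Which services (transitively) depend on a dead server? The traversal
    # touches only that server's subgraph, not the whole database.
    query = """
        MATCH (s:Server {status: 'dead'})<-[:DEPENDS_ON*1..5]-(svc:Service)
        RETURN DISTINCT svc.name AS name
    """
    with driver.session() as session:
        for record in session.run(query):
            print(record["name"])
    driver.close()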


  21. Nanoprobe Functions ('C')
    Announce self to CMA

    Reserved multicast address (can be
    unicast address or name if no multicast)
    Do what CMA says

    receive configuration information
    – CMA addresses, ports, defaults

    send/expect heartbeats

    perform discovery actions

    perform monitoring actions
    No persistent state across reboots
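
    A rough sketch of the announce step in Python (the multicast address
    and port below are placeholders, not necessarily the project's
    reserved values):

    import socket

    # Nanoprobe-style startup announcement over UDP multicast, so the CMA
    # can learn about new systems without per-host configuration.
    MCAST_ADDR, MCAST_PORT = "224.0.2.5", 1984   # placeholder values
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.sendto(b"STARTUP " + socket.gethostname().encode(),
                (MCAST_ADDR, MCAST_PORT))
    sock.close()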


  22. How does discovery work?
    Nanoprobe scripts perform discovery

    Each discovers one kind of information

    Can take arguments (in environment)

    Output JSON
    CMA stores Discovery Information

    JSON stored in Neo4j database

    CMA discovery plugins => graph nodes and
    relationships
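
    A toy discovery script in the same spirit – hypothetical, not one of
    the project's shipped scripts: it reports one kind of information,
    takes an argument from the environment (the variable name is made up),
    and writes JSON to stdout for the CMA to store.

    import json
    import os
    import platform

    verbose = os.environ.get("DISCOVERY_VERBOSE", "0")   # hypothetical argument
    result = {
        "discovertype": "os_example",
        "host": platform.node(),
        "data": {"system": platform.system(), "release": platform.release()},
    }
    if verbose == "1":
        result["data"]["version"] = platform.version()
    print(json.dumps(result, indent=2))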


  23. sshd Service JSON Snippet
    (from netstat and /proc)
    "sshd": {
    "exe": "/usr/sbin/sshd",
    "cmdline": [ "/usr/sbin/sshd", "-D" ],
    "uid": "root",
    "gid": "root",
    "cwd": "/",
    "listenaddrs": {
    "": {
    "proto": "tcp",
    "addr": "",
    "port": 22
    }, and so on...


  24. ssh Client JSON Snippet
    (from netstat and /proc)
    "ssh": {
    "exe": "/usr/sbin/ssh",
    "cmdline": [ "ssh", "servidor" ],
    "uid": "alanr",
    "gid": "alanr",
    "cwd": "/home/alanr/monitor/src",
    "clientaddrs": {
    "": {
    "proto": "tcp",
    "addr": "",
    "port": 22
    }, and so on...


  25. ssh -> sshd dependency graph


  26. Switch Discovery Data
    from LLDP (or CDP)

    CMA transforms LLDP (CDP) data to JSON


  27. Current State

    First release was April 2013

    Great unit test infrastructure

    Nanoprobe code – works well

    Service monitoring works

    Lacking real digital signatures, encryption,
    compression

    Reliable UDP comm code all working

    CMA code works, much more to go

    Several discovery methods written

    Licensed under the GPL


  28. Future Plans

    Production grade by end of year

    Commercial licenses with support

    Real digital signatures, compression, encryption

    Other security enhancements

    Much more discovery

    Add Statistical Monitoring

    Best Practice Audits

    Dynamic (aka cloud) specialization

    Hundreds more ideas
    – See: https://trello.com/b/OpaED3AT


  29. Get Involved!
    Powerful Ideas and Infrastructure
    Fun, ground-breaking project
    Needs every kind of skill

    Awesome User Interfaces (UI/UX)

    Test Code (simulate 10⁶ servers!)

    Packaging, Continuous Integration

    Python, C, script coding

    Evangelism, community building


    Feedback: Testing, Ideas, Plans

    Integration with OpenStack

    Many others!


  30. Resistance Is Futile!
    #AssimProj @OSSAlanR
    Project Web Site
