LCA2014 Main Conference Assimilation Presentation

LCA 2014
IT Discovery and Monitoring Without Limit using The Assimilation Project
#AssimProj @OSSAlanR http://assimproj.org/
Alan Robertson <[email protected]>
Assimilation Systems Limited http://assimilationsystems.com

linux.conf.au 08 January 2014 © 2013 Assimilation Systems Limited
Project Scope
Zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring
• Continuous extensible discovery
  – systems, switches, services, dependencies
  – zero network footprint
• Extensible exception monitoring
  – more than 100K systems
• All data goes into central graph database

Questions
• How many of you have monitoring?
  – Open or closed source?
  – How many of you are happy with it?
• How many of you have discovery?
  – Open or closed source?
  – Is it continuous?
  – How many of you are happy with it?

Assimilation Project History
• Inspired by 2 million core computer (cyclops64)
• Concerns for extreme scale
• Topology aware monitoring
• Topology discovery w/out security issues
=> Discovery of everything!


An 8-dimensional overview
• Problems Addressed
• Unique Capabilities
• Distribution of Work
• Architectural Components
• Discovery Graph Schema
• Extensible Discovery API
• Current Status
• Project Needs

First Dimension: Problems Addressed
Risk Management at extreme scale
1. Maintaining detailed discovery database
2. Discovering systems you've forgotten about
3. Discovering what (licensed) software you're running – and where
4. Monitoring services, systems and switches
5. Finding services you aren't monitoring

Risk Management/Mitigation
• Intrusions
• Licensed Software
• Audit Risk
• Outages
• System management

Why Discovery? (DevOps)
• Documentation: incomplete, incorrect
• Dependencies: unknown
• Planning: Needs accurate data
• Best Practices: Verification needs data
• ITIL CMDB (Configuration Mgmt DataBase)
Our Discovery: continuous, low-profile

Second Dimension: Unique Powerful Features
1. Continuous Discovery
2. Zero network discovery footprint
3. Centralized graph database
4. We know everything that changes
5. Discover and update dependency information

(even more) Features...
6. Discovery and monitoring tightly integrated – discovery drives monitoring
7. Discovery and monitoring easily extensible
8. Naturally scalable to > 100K systems
9. Server failures distinguishable from switch failures
10. Minimal network load
11. Multi-tenant support

This all sounds unreasonable...
• Huge scalability without complexity?
• Discovery without sending packets? Really?

Third Dimension: Uniformly, fully distributed work
Two philosophical underpinnings
1. Monitoring and Discovery are fully distributed
2. Reliable “no news is good news”
Only responses to changes are centralized

Simple Scalability
• I can explain how we distribute work so your grandmother would understand

Massive Scalability – or “I see dead servers in O(1) time”
• Adding systems does not increase the monitoring work on any system
• Each server monitors 2 (or 4) neighbors
• Each server monitors its own services
• Ring repair and alerting is O(n) – but a very small amount of work
• Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds)
Current Implementation
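The constant per-node cost claimed above can be illustrated with a minimal Python sketch. It assumes the nodes form a simple ordered ring and that each node watches its immediate neighbors; the slides do not describe how the project actually builds or repairs its rings, so `ring_neighbors` and `seconds_per_packet` are hypothetical names used only for illustration.

```python
def ring_neighbors(nodes, index, span=1):
    """Return the peers that nodes[index] exchanges heartbeats with.

    Assumption: nodes sit in one ordered ring; span=1 gives 2 neighbors,
    span=2 gives 4. The result size does not depend on len(nodes).
    """
    n = len(nodes)
    peers = []
    for offset in range(1, span + 1):
        for i in ((index - offset) % n, (index + offset) % n):
            peer = nodes[i]
            # Skip self and duplicates, which only occur in very small rings.
            if peer != nodes[index] and peer not in peers:
                peers.append(peer)
    return peers


def seconds_per_packet(packets_per_day):
    """Average spacing between packets for a given daily packet count."""
    return 86400 / packets_per_day


if __name__ == "__main__":
    small = ["a", "b", "c", "d", "e"]
    large = list(range(1000))
    print(ring_neighbors(small, 0))           # neighbors in a 5-node ring
    print(len(ring_neighbors(large, 5)))      # same count in a 1000-node ring
    print(seconds_per_packet(10000))          # 10K packets/day as a spacing
```

At 10K packets per day the spacing works out to 8.64 seconds, which matches the slide's "approximately 1 packet per 9 seconds".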

Minimizing Network Footprint (planned)
• Support diagnosing switch issues
• Minimize network traffic
• Ideal for multi-site arrangements

Fourth Dimension: Architectural Components
Three Architectural Components
Collective Management Authority
  • One CMA per installation
Nanoprobes
  • One nanoprobe per system
Data Storage
  • Central Neo4j graph database

Basic CMA Functions (python)
Nanoprobe management
  • Configure & direct
  • Hear alerts & discovery
  • Update rings: join/leave
Update database
Issue alerts

Nanoprobe Functions ('C')
Announce self to CMA
  • Reserved multicast address (can be unicast address or name if no multicast)
Do what CMA says
  • receive configuration information
    – CMA addresses, ports, defaults
  • send/expect heartbeats
  • perform discovery actions
  • perform monitoring actions
No persistent state across reboots

Service Monitoring based on Linux-HA/Pacemaker LRM
• LRM == Local Resource Manager
• Well-proven architecture:
  – “no news is good news” AKA management by exception
• Implements Open Cluster Framework standard (and others)
• Each system monitors own services
• Can also start, stop, migrate services
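"Management by exception" means a local monitor stays silent while a service keeps passing its checks and speaks up only when the status changes. The Python sketch below illustrates that idea only; it is not the LRM or nanoprobe API, and `ExceptionReporter` and its `report` callback are hypothetical names.

```python
class ExceptionReporter:
    """Forward a service's status only when it changes (illustrative only)."""

    def __init__(self, report):
        self._report = report   # callable(service_name, state), hypothetical
        self._last = {}         # service_name -> last observed ok/failed

    def observe(self, service, ok):
        previous = self._last.get(service)
        self._last[service] = ok
        if previous is None:
            # First check: a healthy service is "no news", so stay silent.
            if not ok:
                self._report(service, "failed")
        elif previous != ok:
            self._report(service, "recovered" if ok else "failed")


if __name__ == "__main__":
    reporter = ExceptionReporter(lambda name, state: print(name, state))
    for result in [True, True, False, False, True]:
        reporter.observe("sshd", result)
```

Five checks produce only two messages here: one when the service fails and one when it recovers.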

Monitoring Pros and Cons
Pros
  • Simple & Scalable
  • Uniform work distribution
  • No single point of failure
  • Distinguishes switch vs host failure
  • Easy on LAN, WAN
  • Multi-tenant approach
Cons
  • Active agents
  • Potential slowness at power-on

Why a graph database? (Neo4j)
• Humans describe systems as graphs
• Dependency & Discovery information: graph
• Speed of graph traversals depends on size of subgraph, not total graph size
• Root cause queries => graph traversals
  – notoriously slow in relational databases
• Visualization is Natural
• Schema-less design: good for constantly changing heterogeneous environment
• Graph Model === Object Model
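The point that traversal cost tracks the subgraph rather than the whole graph can be shown with a plain breadth-first search. This is a Python sketch over an in-memory dictionary, not a Neo4j query; the node names and the `affected_by` helper are made up for illustration.

```python
from collections import deque


def affected_by(depends_on_me, failed):
    """List everything that directly or indirectly depends on `failed`.

    depends_on_me maps a node to the nodes that depend on it. Only nodes
    reachable from `failed` are visited, so unrelated parts of the graph
    cost nothing.
    """
    seen = {failed}
    queue = deque([failed])
    order = []
    while queue:
        node = queue.popleft()
        for dependent in depends_on_me.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


if __name__ == "__main__":
    graph = {
        "switch1": ["hostA", "hostB"],
        "hostA": ["sshd@hostA"],
        "switch2": ["hostC"],   # never visited when switch1 fails
    }
    print(affected_by(graph, "switch1"))
```

Reversing the edge direction gives the root-cause form of the same traversal: start at the failing service and walk toward what it depends on.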

Fifth Dimension: Discovery API
Scripts perform discovery – output JSON
Three Sample Discovery Snippets
• OS information
• Service discovery
• Client discovery

A multi-dimensional demo
• Demonstrate basic capabilities
  – Discovery
  – Automatic monitoring configuration
  – Monitoring – failures / successes
• No configuration was supplied – everything comes from discovery

How does discovery work?
Nanoprobe scripts perform discovery
  • Each discovers one kind of information
  • Can take arguments from environment
  • Output JSON
CMA stores Discovery Information
  • JSON stored in Neo4j database
  • CMA discovery plugins => graph nodes and relationships

OS discovery JSON Snippet
{
  "nodename": "alanr-1225B",
  "operating-system": "GNU/Linux",
  "machine": "x86_64",
  "processor": "x86_64",
  "hardware-platform": "x86_64",
  "kernel-name": "Linux",
  "kernel-release": "3.8.0-31-generic",
  "kernel-version": "#46-Ubuntu SMP ...",
  "Distributor ID": "Ubuntu",
  "Description": "Ubuntu 13.04",
  "Release": "13.04",
  "Codename": "raring"
}
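A discovery script of the kind described here gathers one kind of information, optionally reads arguments from the environment, and prints JSON. The Python sketch below produces a subset of the keys shown in the OS snippet using the standard `platform` module. It is an illustration rather than the project's own script, and the environment variable `EXAMPLE_DISCOVERY_KEYS` is a hypothetical name.

```python
import json
import os
import platform


def discover_os(environ=os.environ):
    """Collect basic OS facts as a JSON-serializable dict."""
    uname = platform.uname()
    info = {
        "nodename": uname.node,
        "machine": uname.machine,
        "kernel-name": uname.system,
        "kernel-release": uname.release,
        "kernel-version": uname.version,
    }
    # Hypothetical argument passing: a comma-separated list of keys to keep.
    wanted = environ.get("EXAMPLE_DISCOVERY_KEYS")
    if wanted:
        keys = [key.strip() for key in wanted.split(",")]
        info = {key: value for key, value in info.items() if key in keys}
    return info


if __name__ == "__main__":
    # A discovery script's only output is the JSON document itself.
    print(json.dumps(discover_os(), indent=2))
```

Keeping the output to a single JSON document on standard output means whatever launches the script can forward the result without parsing it.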

sshd Service JSON Snippet (from netstat and /proc)
"sshd": {
  "exe": "/usr/sbin/sshd",
  "cmdline": [ "/usr/sbin/sshd", "-D" ],
  "uid": "root",
  "gid": "root",
  "cwd": "/",
  "listenaddrs": {
    "0.0.0.0:22": { "proto": "tcp", "addr": "0.0.0.0", "port": 22 },
    and so on...

ssh Client JSON Snippet (from netstat and /proc)
"ssh": {
  "exe": "/usr/sbin/ssh",
  "cmdline": [ "ssh", "servidor" ],
  "uid": "alanr",
  "gid": "alanr",
  "cwd": "/home/alanr/monitor/src",
  "clientaddrs": {
    "10.10.10.5:22": { "proto": "tcp", "addr": "10.10.10.5", "port": 22 },
    and so on...

Sixth Dimension: Graph Schema
Two Schema subgraphs
• Client / server dependency
• Switch interconnect

ssh -> sshd dependency graph
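One plausible way to derive an ssh -> sshd edge from the two JSON snippets is to match each client's `clientaddrs` entry against the `listenaddrs` of discovered services. The Python sketch below is an assumption about how such matching could work, not the CMA's actual plugin code. The helper names are hypothetical, and so is the mapping of host "servidor" to 10.10.10.5.

```python
def listens_on(listenaddrs, host_ips, addr, port):
    """True if a service with these listenaddrs accepts addr:port."""
    for entry in listenaddrs.values():
        if entry["port"] != port:
            continue
        # 0.0.0.0 means "any local address", so check the host's own IPs.
        if entry["addr"] == addr or (entry["addr"] == "0.0.0.0" and addr in host_ips):
            return True
    return False


def find_dependencies(client_host, client_procs, server_hosts):
    """Return ((client_host, client), (server_host, service)) edges.

    server_hosts maps a host name to (set_of_ip_addresses, services_json).
    """
    edges = []
    for cname, cproc in client_procs.items():
        for target in cproc.get("clientaddrs", {}).values():
            for shost, (ips, services) in server_hosts.items():
                for sname, sproc in services.items():
                    if listens_on(sproc.get("listenaddrs", {}), ips,
                                  target["addr"], target["port"]):
                        edges.append(((client_host, cname), (shost, sname)))
    return edges


if __name__ == "__main__":
    clients = {"ssh": {"clientaddrs": {
        "10.10.10.5:22": {"proto": "tcp", "addr": "10.10.10.5", "port": 22}}}}
    servers = {"servidor": ({"10.10.10.5"}, {"sshd": {"listenaddrs": {
        "0.0.0.0:22": {"proto": "tcp", "addr": "0.0.0.0", "port": 22}}}})}
    print(find_dependencies("alanr-1225B", clients, servers))
```

Each returned pair corresponds to one dependency relationship between a client process node and a service node in the graph.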

Switch Discovery Data from LLDP (or CDP)
CRM transforms LLDP (CDP) Data to JSON

Seventh Dimension: Current Status
• First release April 2013
• Great unit tests
• Nanoprobe code works well
• Several discovery methods written
• CMA restructuring complete
• Discovery => Automatic Monitoring (WOOT!)
• UI development underway
• Licensed under GPL: commercial options available

Eighth Dimension: Get Involved!
We need every talent!
• Early adopters
• Testers, Continuous Integration
• Designers
• Developers (C, Python, Shell, PowerShell, JavaScript)
• Porters (esp Windows)
• Promoters, publicists
• Packagers
• And so on...

Resistance Is Futile!
Mailing List: bit.ly/AssimML
#AssimProj @OSSAlanR
Project Web Site: assimproj.org
Blog: techthoughts.typepad.com
assimilationsystems.com
