2013 July Boulder DevOps - Assimilation Project introduction

Slide 1

Slide 1 text

IT Discovery and Monitoring Without Limit using The Assimilation Project
#AssimProj #AssimMon @OSSAlanR
http://assimproj.org/
This presentation: http://bit.ly/1bh1iO8
Alan Robertson
Assimilation Systems Limited

Slide 2

Slide 2 text

6/25/12 2/30
Biography
● Founded Linux-HA project - led 1998-2007 - now called Pacemaker
● Founded Assimilation Project in 2010
● Founded Assimilation Systems Limited in 2013
● Alumnus of Bell Labs, SuSE, IBM

Slide 3

Slide 3 text

Project background
● Available as GPL (or commercial)
● Founded in late 2010
● Now my full-time endeavor – Assimilation Systems Limited
● Currently around 25K lines of code
● First release: April 2013

Slide 4

Slide 4 text

T.A.N.S.T.A.A.F.L.
What I need from you...
● Feedback on the project/product
  – Is it useful – why or why not?
  – Would it sell to management?
● Feedback on my approach to presenting it
● Other presentation feedback
  – Clarity, Style, etc...

Slide 5

Slide 5 text

Project Scope
Zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring
● Extensible discovery of systems, switches, services, and dependencies – without setting off network alarms
● Extensible monitoring of > 100K systems
● All data goes into central graph database

Slide 6

Slide 6 text

Questions for Audience
● How many of you have monitoring?
  – Open or closed source?
  – How many of you are happy with it?
● How many of you have discovery?
  – Open or closed source?
  – How many of you are happy with it?

Slide 7

Slide 7 text

Risk Management
● We monitor systems and services to reduce the risk of extended outages
● We discover systems to reduce the risk of intrusions
● We discover services to reduce the risk of extended outages
● We discover switch connections, dependencies, etc. to decrease the risk of system maintenance, growth and management
● Reducing risk is good for everyone

Slide 8

Slide 8 text

Why Discovery?
● 30% of intrusions come from unknown or forgotten systems
  – We discover them
● Most documentation is incomplete, incorrect
● Dependencies often unknown
● Licensed software & lawsuits are expensive
● Auditability: you know if you got it all
● Improves planning, understanding, mgmt
● Creates opportunities to check best practices
● Discovery database is an ITIL CMDB

Slide 9

Slide 9 text

Why Our Monitoring?
● Much simpler to configure (in theory)
● Growth unlikely to ever be an issue
  – No need for proxies, multiple servers
● Dependencies help diagnose problems
● Extremely low network traffic
● Ideal for cross-WAN monitoring
● Highlight cascading failure root causes
● Not confused by switch failures
● Most switches get monitored “for free”

Slide 10

Slide 10 text

Problems Addressed
● Discovery without setting off alarms
● Find “forgotten” servers and services
● Discover uses and installs of licensed software
● Scale up monitoring indefinitely
● Know that everything is monitored
● Minimize & simplify monitoring configuration
● Keep monitoring up-to-date
● Distinguish switch vs system failures
● Highlight root causes of cascading failures

Slide 11

Slide 11 text

Architectural Overview
Collective Monitoring Authority (CMA)
● One CMA per installation
Nanoprobes
● One nanoprobe per OS image
Data Storage
● Central Neo4j graph database
General Rule: “No News Is Good News”

Slide 12

Slide 12 text

Massive Scalability – or “I see dead servers in O(1) time”
● Adding systems does not increase the monitoring work on any system
● Each server monitors 2 (or 4) neighbors
● Each server monitors its own services
● Ring repair and alerting is O(n) – but a very small amount of work
● Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds)
Today's Implementation
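
The ring idea on this slide can be illustrated with a minimal Python sketch. It is not the project's code: the function names, the list-based ring, and the `per_side` parameter are assumptions made for illustration. It shows that the number of peers a node watches stays constant as the ring grows, and that removing a dead node only changes who its former neighbors watch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ring_neighbors(ring, node, per_side=1):
    """Return the peers `node` exchanges heartbeats with.

    per_side=1 gives 2 neighbors, per_side=2 gives 4 (hypothetical parameter).
    """
    n = len(ring)
    i = ring.index(node)
    neighbors = []
    for step in range(1, per_side + 1):
        for j in ((i - step) % n, (i + step) % n):
            peer = ring[j]
            # Small rings wrap around, so skip ourselves and duplicates.
            if peer != node and peer not in neighbors:
                neighbors.append(peer)
    return neighbors


def repair_ring(ring, dead):
    """Drop a dead node; its former neighbors become adjacent."""
    return [node for node in ring if node != dead]


ring = ["a", "b", "c", "d", "e"]
print(ring_neighbors(ring, "c"))                    # peers watched by "c"
print(ring_neighbors(repair_ring(ring, "c"), "b"))  # "b" after "c" dies

# The slide's arithmetic: one packet every 9 seconds, counted over a day.
print(SECONDS_PER_DAY // 9)
```

The per-node work depends only on `per_side`, never on `len(ring)`, which is the sense in which adding systems does not add monitoring work to any one system.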

Slide 13

Slide 13 text

Massive Scalability – or “I see dead servers in O(1) time”
Planned Topology-Aware Architecture
Multiple levels of rings:
● Support diagnosing switch issues
● Minimize network traffic
● Ideal for multi-site arrangements

Slide 14

Slide 14 text

Who will watch the watchers?
● CMA in HA cluster
● Services watched by scripts
● Scripts watched by nanoprobe
● Nanoprobes watch each other
● CMA runs nanoprobes

Slide 15

Slide 15 text

Service Monitoring
Based on Linux-HA LRM ideas
● LRM == Local Resource Manager
● Well-proven architecture: “no news is good news”
● Implements Open Cluster Framework standard (and others)
● Each system monitors its own services

Slide 16

Slide 16 text

Monitoring Pros and Cons
Pros
● Simple & Scalable
● Uniform work distribution
● No single point of failure
● Distinguishes switch vs host failure
● Easy on LAN, WAN
Cons
● Active agents
● Potential slowness at power-on

Slide 17

Slide 17 text

How does this apply to clouds?
● Fits nicely into a cloud infrastructure
  – Should integrate into OpenStack, et al.
  – Can also control VMs – already knows how to start, stop and migrate VMs
● Can also monitor VMs
  – bottom level of rings disappears without LLDP or CDP
  – If you add this to your base image, with one configuration file per customer, then there is no need to configure anything else for basic monitoring.

Slide 18

Slide 18 text

Basic CMA Functions (Python)
Nanoprobe management
● Configure & direct
● Hear alerts & discovery
● Update rings: join/leave
Update database
Issue alerts

Slide 19

Slide 19 text

Continuous Integrated Stealth Discovery
Continuous - Ongoing, incremental
Integrated - Monitoring does discovery; stored in same database
Stealth - No network privileges needed - no port scans or pings
Discovery - Systems, switches, clients, services and dependencies
➔ Up-to-date picture of pieces & how they work w/o “network security

Slide 20

Slide 20 text

Why a graph database? (Neo4j)
● Dependency & Discovery information: graph
● Speed of graph traversals depends on size of subgraph, not total graph size
● Root cause queries => graph traversals – notoriously slow in relational databases
● Visualization of relationships
● Schema-less design: good for constantly changing heterogeneous environment
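
To make the "root cause queries are graph traversals" point concrete, here is a small sketch in plain Python rather than Neo4j. The `depends_on` dictionary, the node names, and both function names are hypothetical. The walk only visits nodes reachable from the symptom, so unrelated parts of the graph cost nothing.

```python
from collections import deque


def reachable_dependencies(depends_on, start):
    """Breadth-first walk over everything `start` depends on, directly or not."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def root_causes(depends_on, failed, symptom):
    """Failed nodes at or upstream of `symptom` with no failed dependency."""
    candidates = (reachable_dependencies(depends_on, symptom) | {symptom}) & failed
    return {
        node for node in candidates
        if not (set(depends_on.get(node, ())) & failed)
    }


# Hypothetical dependency data: each key depends on the listed nodes.
depends_on = {
    "web": ["app"],
    "app": ["db"],
    "db": ["switch1"],
    "mail": ["switch2"],  # unrelated subgraph, never visited for "web"
}
failed = {"web", "app", "db"}
print(root_causes(depends_on, failed, "web"))
```

In this example `web` and `app` are failing only because `db` is, so `db` is the one node reported.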

Slide 21

Slide 21 text

Nanoprobe Functions ('C')
Announce self to CMA
● Reserved multicast address (can be unicast address or name if no multicast)
Do what CMA says
● receive configuration information – CMA addresses, ports, defaults
● send/expect heartbeats
● perform discovery actions
● perform monitoring actions
No persistent state across reboots
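
The "send/expect heartbeats" item can be sketched as follows. The real nanoprobe is written in C; this Python class, its method names, and the deadline value are assumptions used only to illustrate the "no news is good news" rule: nothing is reported while heartbeats arrive on time.

```python
class HeartbeatListener:
    """Tracks peers we were told to expect heartbeats from (hypothetical API)."""

    def __init__(self, deadline_seconds):
        self.deadline = deadline_seconds
        self.last_seen = {}

    def expect(self, peer, now):
        """Start expecting heartbeats from `peer` as of time `now`."""
        self.last_seen[peer] = now

    def heard_from(self, peer, now):
        """Record a heartbeat; peers we were not told about are ignored."""
        if peer in self.last_seen:
            self.last_seen[peer] = now

    def late_peers(self, now):
        """Peers whose last heartbeat is older than the deadline."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.deadline
        )


listener = HeartbeatListener(deadline_seconds=3)
listener.expect("10.10.10.4", now=0)
listener.expect("10.10.10.6", now=0)
listener.heard_from("10.10.10.4", now=2)
print(listener.late_peers(now=4))  # only the silent peer is reported
```

Times are passed in explicitly so the logic can be exercised without a clock; a real agent would read a monotonic clock instead.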

Slide 22

Slide 22 text

How does discovery work?
Nanoprobe scripts perform discovery
● Each discovers one kind of information
● Can take arguments (in environment)
● Output JSON
CMA stores Discovery Information
● JSON stored in Neo4j database
● CMA discovery plugins => graph nodes and relationships
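
A discovery script of the kind described on this slide might look roughly like the sketch below: it discovers one kind of information, takes its argument from the environment, and writes JSON. The environment variable `DISCOVERY_LABEL` and the output keys `kind` and `data` are invented for illustration and are not the project's actual format.

```python
import json
import os
import platform


def discover(environ):
    """Collect one kind of information (basic OS facts) as a JSON-ready dict."""
    return {
        # Hypothetical argument passed in through the environment.
        "kind": environ.get("DISCOVERY_LABEL", "os"),
        "data": {
            "system": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
        },
    }


if __name__ == "__main__":
    # A script like this prints its JSON and exits; the caller collects it.
    print(json.dumps(discover(os.environ), indent=2))
```

Keeping each script to a single kind of information means new discovery types can be added without touching existing ones.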

Slide 23

Slide 23 text

sshd Service JSON Snippet (from netstat and /proc)
"sshd": {
  "exe": "/usr/sbin/sshd",
  "cmdline": [ "/usr/sbin/sshd", "-D" ],
  "uid": "root",
  "gid": "root",
  "cwd": "/",
  "listenaddrs": {
    "0.0.0.0:22": { "proto": "tcp", "addr": "0.0.0.0", "port": 22 },
    and so on...

Slide 24

Slide 24 text

ssh Client JSON Snippet (from netstat and /proc)
"ssh": {
  "exe": "/usr/sbin/ssh",
  "cmdline": [ "ssh", "servidor" ],
  "uid": "alanr",
  "gid": "alanr",
  "cwd": "/home/alanr/monitor/src",
  "clientaddrs": {
    "10.10.10.5:22": { "proto": "tcp", "addr": "10.10.10.5", "port": 22 },
    and so on...

Slide 25

Slide 25 text

ssh -> sshd dependency graph
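
One way such a dependency could be derived from the two JSON snippets is to match each client's `clientaddrs` entry against the `listenaddrs` of discovered services. The sketch below is an assumption about the approach, not the project's plugin code: the `ips` field, the host name `workstation`, its address, and the mapping of `servidor` to 10.10.10.5 are all hypothetical.

```python
def find_dependencies(hosts):
    """Return ((client_host, client), (server_host, service)) edges."""
    listeners = {}
    for host, info in hosts.items():
        for service, details in info.get("services", {}).items():
            for listen in details["listenaddrs"].values():
                # A wildcard listener is reachable on every address of its host.
                if listen["addr"] == "0.0.0.0":
                    addrs = info["ips"]
                else:
                    addrs = [listen["addr"]]
                for addr in addrs:
                    key = (listen["proto"], addr, listen["port"])
                    listeners[key] = (host, service)

    edges = []
    for host, info in hosts.items():
        for client, details in info.get("clients", {}).items():
            for conn in details["clientaddrs"].values():
                target = listeners.get((conn["proto"], conn["addr"], conn["port"]))
                if target is not None:
                    edges.append(((host, client), target))
    return edges


hosts = {
    "servidor": {
        "ips": ["10.10.10.5"],
        "services": {"sshd": {"listenaddrs": {
            "0.0.0.0:22": {"proto": "tcp", "addr": "0.0.0.0", "port": 22}}}},
    },
    "workstation": {
        "ips": ["10.10.10.2"],
        "clients": {"ssh": {"clientaddrs": {
            "10.10.10.5:22": {"proto": "tcp", "addr": "10.10.10.5", "port": 22}}}},
    },
}
print(find_dependencies(hosts))
```

Each edge found this way corresponds to one relationship in the dependency graph, here from the `ssh` client to the `sshd` service.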

Slide 26

Slide 26 text

Switch Discovery
Data from LLDP (or CDP)
CRM transforms LLDP (CDP) Data to JSON

Slide 27

Slide 27 text

Current State
● First release was April 2013
● Great unit test infrastructure
● Nanoprobe code – works well
● Service monitoring works
● Lacking real digital signatures, encryption, compression
● Reliable UDP comm code all working
● CMA code works, much more to go
● Several discovery methods written
● Licensed under the GPL

Slide 28

Slide 28 text

Future Plans
● Production grade by end of year
● Commercial licenses with support
● “Real” digital signatures, compression, encryption
● Other security enhancements
● Much more discovery
● GUI
● Alerting
● Reporting
● Add Statistical Monitoring
● Best Practice Audits
● Dynamic (aka cloud) specialization
● Hundreds more ideas
  – See: https://trello.com/b/OpaED3AT

Slide 29

Slide 29 text

Get Involved!
Powerful Ideas and Infrastructure
Fun, ground-breaking project
Needs for every kind of skill
● Awesome User Interfaces (UI/UX)
● Test Code (simulate 10^6 servers!)
● Packaging, Continuous Integration
● Python, C, script coding
● Evangelism, community building
● Documentation
● Feedback: Testing, Ideas, Plans
● Integration with OpenStack
● Many others!

Slide 30

Slide 30 text

Resistance Is Futile!
#AssimProj @OSSAlanR #AssimMon
Project Web Site http://assimproj.org
Blog techthoughts.typepad.com
lists.community.tummy.com/cgi-bin/mailman/admin/assimilation