2013 August BLUG - Discovery and Monitoring Without Limit

A presentation of the Assimilation Project to the Boulder Linux User's Group

Alan Robertson

August 08, 2013

Transcript

  1. Discovery and Monitoring Without Limit using The Assimilation Project

    #AssimProj @OSSAlanR http://assimproj.org/
    Alan Robertson
    Assimilation Systems Limited http://assimilationsystems.com
  2. Biography

    • Founded Linux-HA project - led 1998-2007 - now called Pacemaker
    • Founded Assimilation Project in 2010
    • Founded Assimilation Systems Limited in 2013
    • Alumnus of Bell Labs, SuSE, IBM
    8 August 2013 © 2013 Assimilation Systems Limited
  3. Project background

    • Available as GPL (or commercial)
      – your fork/exec scripts not required to be GPL
    • Founded in late 2010
    • Now my full time endeavor
      – Assimilation Systems Limited
    • Currently around 25K lines of code
    • First release: April 2013
  4. T.A.N.S.T.A.A.F.L.

    What I need from you...
    • Feedback on the project/product
      – Is it useful – why or why not?
      – Would it sell to management?
    • Feedback on my approach to presenting it
    • Other presentation feedback
      – Clarity, Style, etc...
  5. Project Scope

    Zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring
    • Continuous extensible discovery
      – systems, switches, services, dependencies
      – zero network footprint
    • Extensible exception monitoring
      – more than 100K systems
    • All data goes into central graph database
  6. Questions

    • How many of you have monitoring?
      – Open or closed source?
      – How many of you are happy with it?
    • How many of you have discovery?
      – Open or closed source?
      – Is it continuous?
      – How many of you are happy with it?
  7. Why Assimilation Software?

    • Management Perspective
    • DevOps Perspective
  8. Risk Management/Mitigation

    • Intrusions
    • Licensed Software
    • Audit Risk
    • Outages
    • System management
  9. Why Discovery? (DevOps)

    • Documentation: incomplete, incorrect
    • Dependencies: unknown
    • Planning: Needs accurate data
    • Best Practices: Verification needs data
    • ITIL CMDB (Configuration Mgmt DataBase)
  10. Why Our Monitoring?

    • Simpler to configure (in theory)
    • Growth is non-issue
    • Extremely low network traffic
    • Ideal for cross-WAN monitoring
    • Highlight cascading failure root causes
    • Not confused by switch failures
    • Most switches get monitored “for free”
  11. This all sounds unreasonable...

    • Huge scalability without complexity?
    • Discovery without sending packets? Really?
  12. Architectural Overview

    Collective Management Authority
      • One CMA per installation
    Nanoprobes
      • One nanoprobe per OS image
    Data Storage
      • Central Neo4j graph database
    General Rule: “No News Is Good News”
  13. Simple Scalability

    • I can explain how we scale so your grandmother would understand
  14. Massive Scalability – or “I see dead servers in O(1) time”

    • Adding systems does not increase the monitoring work on any system
    • Each server monitors 2 (or 4) neighbors
    • Each server monitors its own services
    • Ring repair and alerting is O(n) – but a very small amount of work
    • Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds)
    Current Implementation
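The ring idea on this slide can be illustrated with a small sketch. It is not the project's code: the helper name `ring_neighbors`, the `span` parameter, and the list-of-host-names ring are assumptions made for illustration. It shows that the number of peers a host watches does not depend on ring size, and it checks the slide's packet-rate arithmetic.

```python
def ring_neighbors(ring, index, span=1):
    """Return the peers that ring[index] exchanges heartbeats with.

    span=1 gives 2 neighbors and span=2 gives 4, as on the slide.
    Hypothetical helper, not part of the Assimilation code base.
    """
    n = len(ring)
    neighbors = []
    for offset in range(1, span + 1):
        for step in (-offset, offset):
            peer = ring[(index + step) % n]
            # Skip self and duplicates, which only occur in very small rings.
            if peer != ring[index] and peer not in neighbors:
                neighbors.append(peer)
    return neighbors


small = [f"host{i}" for i in range(5)]
large = [f"host{i}" for i in range(100_000)]

# Per-host monitoring work is the same for 5 hosts and for 100,000 hosts.
print(len(ring_neighbors(small, 0)), len(ring_neighbors(large, 0)))

# Packet-rate check: one packet every 9 seconds, expressed per day.
SECONDS_PER_DAY = 24 * 60 * 60
packets_per_day = SECONDS_PER_DAY / 9
print(packets_per_day)  # 9600.0, which is under the 10K/day figure
```

Adding a host changes the neighbor lists of only the hosts next to the insertion point, which is why the per-host cost stays flat as the ring grows.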
  15. Decreasing Network Footprint (planned)

    • Support diagnosing switch issues
    • Minimize network traffic
    • Ideal for multi-site arrangements
  16. Service Monitoring

    Based on Linux-HA LRM
    • LRM == Local Resource Manager
    • Well-proven architecture:
      – “no news is good news” AKA management by exception
    • Implements Open Cluster Framework standard (and others)
    • Each system monitors own services
    • Can also start, stop, migrate services
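Management by exception can be sketched as a local checker that stays silent while a service's status is unchanged and sends a message only on a transition. This is a minimal illustration, not the LRM's interface: the `ExceptionReporter` class and its `send` callback are hypothetical names.

```python
class ExceptionReporter:
    """Report a service's status only when it changes ("no news is good news")."""

    def __init__(self, send):
        self._send = send   # callable that delivers a report upstream (assumed)
        self._last = {}     # service name -> last status observed

    def observe(self, service, ok):
        # Assume a service is healthy until a check says otherwise.
        previous = self._last.get(service, True)
        self._last[service] = ok
        if ok != previous:
            self._send((service, "recovered" if ok else "failed"))


reports = []
reporter = ExceptionReporter(reports.append)
for status in [True, True, False, False, True]:
    reporter.observe("sshd", status)
print(reports)  # five local checks produce only two upstream messages
```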
  17. Monitoring Pros and Cons

    Pros
      • Simple & Scalable
      • Uniform work distribution
      • No single point of failure
      • Distinguishes switch vs host failure
      • Easy on LAN, WAN
    Cons
      • Active agents
      • Potential slowness at power-on
  18. How does this apply to clouds?

    • Fits nicely into a cloud infrastructure
      – Should integrate into OpenStack, et al
      – Can control VMs
    • Can monitor customer VMs
      – Add nanoprobe to base image
      – bottom level of rings disappears without LLDP or CDP
  19. Nanoprobe Functions ('C')

    Announce self to CMA
      • Reserved multicast address (can be unicast address or name if no multicast)
    Do what CMA says
      • receive configuration information
        – CMA addresses, ports, defaults
      • send/expect heartbeats
      • perform discovery actions
      • perform monitoring actions
    No persistent state across reboots
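The "send/expect heartbeats" item can be pictured as a small in-memory table of peers and the time each was last heard from. The sketch below is written in Python for brevity, although the slide says the nanoprobe itself is in C. The `HeartbeatWatcher` class, its method names, and the explicit `now` arguments are all assumptions for illustration. Keeping the table only in memory is consistent with "no persistent state across reboots".

```python
class HeartbeatWatcher:
    """Track peers we were told to expect heartbeats from (illustrative only)."""

    def __init__(self, deadline):
        self.deadline = deadline   # seconds of silence tolerated per peer
        self.last_heard = {}       # peer -> time of last heartbeat

    def expect(self, peer, now):
        # Called when the CMA directs us to start listening for this peer.
        self.last_heard[peer] = now

    def heard(self, peer, now):
        # Heartbeats from peers we were not told about are ignored.
        if peer in self.last_heard:
            self.last_heard[peer] = now

    def overdue(self, now):
        # Only peers that have gone quiet are worth reporting upstream.
        return sorted(p for p, t in self.last_heard.items()
                      if now - t > self.deadline)


watcher = HeartbeatWatcher(deadline=3.0)
watcher.expect("hostA", now=0.0)
watcher.expect("hostB", now=0.0)
watcher.heard("hostA", now=2.0)
print(watcher.overdue(now=4.0))  # hostB has been silent for 4 s, hostA for 2 s
```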
  20. Basic CMA Functions (python)

    Nanoprobe management
      • Configure & direct
      • Hear alerts & discovery
      • Update rings: join/leave
    Update database
    Issue alerts
  21. Why a graph database? (Neo4j)

    • Dependency & Discovery information: graph
    • Speed of graph traversals depends on size of subgraph, not total graph size
    • Root cause queries => graph traversals
      – notoriously slow in relational databases
    • Visualization of relationships
    • Schema-less design: good for constantly changing heterogeneous environment
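A root-cause query as a graph traversal can be sketched without a database. The example below uses a plain dictionary in place of Neo4j, and every name in it (`DEPENDS_ON`, the host and switch names, `dependencies`, `root_causes`) is made up for illustration. The traversal visits only the nodes reachable from the starting service, which is the point about subgraph size.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on the listed items.
DEPENDS_ON = {
    "webapp": ["appserver"],
    "appserver": ["database", "switch-a"],
    "database": ["switch-a"],
    "switch-a": [],
    "mailhost": ["switch-b"],
    "switch-b": [],
}


def dependencies(graph, start):
    """Breadth-first walk of everything `start` depends on, directly or not."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def root_causes(graph, start, down):
    """Failed dependencies of `start` with no failed dependency of their own."""
    failed = [d for d in dependencies(graph, start) if d in down]
    return [d for d in failed
            if not any(x in down for x in dependencies(graph, d))]


down = {"appserver", "database", "switch-a"}
print(dependencies(DEPENDS_ON, "webapp"))
print(root_causes(DEPENDS_ON, "webapp", down))
```

The `mailhost` and `switch-b` entries are never touched by the `webapp` query, however many unrelated nodes the graph holds.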
  22. How does discovery work?

    Nanoprobe scripts perform discovery
      • Each discovers one kind of information
      • Can take arguments (in environment)
      • Output JSON
    CMA stores Discovery Information
      • JSON stored in Neo4j database
      • CMA discovery plugins => graph nodes and relationships
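A discovery script of the kind described here might look roughly like the sketch below: it reads an argument from the environment, gathers one kind of information, and prints JSON. The variable name `ASSIM_DISCOVERY_TAG` and the output keys `kind`, `host`, and `data` are invented for this example and are not taken from the project.

```python
import json
import os
import platform
import socket


def discover(environ=os.environ):
    """Collect one kind of information and return it as a JSON string."""
    # ASSIM_DISCOVERY_TAG is a made-up variable name used for illustration.
    tag = environ.get("ASSIM_DISCOVERY_TAG", "os")
    result = {
        "kind": tag,
        "host": socket.gethostname(),
        "data": {"system": platform.system()},
    }
    return json.dumps(result, sort_keys=True)


if __name__ == "__main__":
    print(discover())
```

Passing arguments through the environment keeps the script's command line fixed, so the same invocation works for every discovery type.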
  23. sshd Service JSON Snippet (from netstat and /proc)

    "sshd": {
      "exe": "/usr/sbin/sshd",
      "cmdline": [ "/usr/sbin/sshd", "-D" ],
      "uid": "root",
      "gid": "root",
      "cwd": "/",
      "listenaddrs": {
        "0.0.0.0:22": { "proto": "tcp", "addr": "0.0.0.0", "port": 22 },
        and so on...
  24. ssh Client JSON Snippet (from netstat and /proc)

    "ssh": {
      "exe": "/usr/sbin/ssh",
      "cmdline": [ "ssh", "servidor" ],
      "uid": "alanr",
      "gid": "alanr",
      "cwd": "/home/alanr/monitor/src",
      "clientaddrs": {
        "10.10.10.5:22": { "proto": "tcp", "addr": "10.10.10.5", "port": 22 },
        and so on...
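To make the step from discovery JSON to "graph nodes and relationships" concrete, here is a rough sketch of what a plugin could do with the two snippets above. The JSON is trimmed to fields shown on the slides and closed off so it parses. The function `to_graph`, the tuple layout, the relationship names `LISTENS_ON` and `CONNECTS_TO`, and the host names are all assumptions, not the project's schema.

```python
import json

# Trimmed copies of the slide snippets, closed so that they parse.
SSHD = json.loads("""
{"sshd": {"exe": "/usr/sbin/sshd",
          "listenaddrs": {"0.0.0.0:22":
              {"proto": "tcp", "addr": "0.0.0.0", "port": 22}}}}
""")
SSH = json.loads("""
{"ssh": {"exe": "/usr/sbin/ssh",
         "clientaddrs": {"10.10.10.5:22":
             {"proto": "tcp", "addr": "10.10.10.5", "port": 22}}}}
""")

# Hypothetical mapping from JSON section to relationship name.
SECTIONS = (("listenaddrs", "LISTENS_ON"), ("clientaddrs", "CONNECTS_TO"))


def to_graph(host, discovered):
    """Turn one host's discovery JSON into lists of nodes and relationships."""
    nodes, rels = [], []
    for name, info in discovered.items():
        proc = ("process", host, name)
        nodes.append(proc)
        for section, rel in SECTIONS:
            for addr in info.get(section, {}).values():
                endpoint = ("endpoint", addr["proto"], addr["addr"], addr["port"])
                if endpoint not in nodes:
                    nodes.append(endpoint)
                rels.append((proc, rel, endpoint))
    return nodes, rels


for host, doc in (("server1", SSHD), ("client1", SSH)):
    nodes, rels = to_graph(host, doc)
    for source, rel, target in rels:
        print(source, rel, target)
```

Linking the client's endpoint to the server's listener would be a further step: the server listens on the wildcard address `0.0.0.0`, so matching has to go by port plus the server's own addresses, which this sketch does not attempt.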
  25. Switch Discovery

    Data from LLDP (or CDP)
    CRM transforms LLDP (CDP) Data to JSON
  26. Current State

    • First release was April 2013
    • Great unit test infrastructure
    • Nanoprobe code – works well
    • Service monitoring works
    • Lacking real digital signatures, encryption, compression
    • Reliable UDP comm code all working
    • CMA code works, much more to go
    • Several discovery methods written
    • Licensed under the GPL
  27. Future Plans

    • Production grade by end of year
    • Support, commercial licenses
    • “Real” digital signatures, compression, encryption
    • Other security enhancements
    • Much more discovery
    • GUI
    • Alerting
    • Reporting
    • Add Statistical Monitoring
    • Best Practice Audits
    • Dynamic (aka cloud) specialization
    • Hundreds more ideas
      – See: https://trello.com/b/OpaED3AT
  28. Get Involved!

    Powerful Ideas and Infrastructure
    Fun, ground-breaking project
    Looking for early adopters, testers!!
    Need for every kind of skill
    • Awesome User Interfaces (UI/UX)
    • Evangelism, community building
    • Test Code (simulate 10^6 servers!)
    • Python, C, script coding
    • Documentation
    • Feedback: Testing, Ideas, Plans
    • Many others!
  29. Resistance Is Futile!

    #AssimProj @OSSAlanR #AssimMon
    Project Web Site http://assimproj.org
    Blog techthoughts.typepad.com
    lists.community.tummy.com/cgi-bin/mailman/admin/assimilation