
2013 July Boulder DevOps - Assimilation Project introduction

An overview of the Assimilation Project - providing IT discovery and monitoring

Alan Robertson

July 15, 2013
Transcript

  1. IT Discovery and Monitoring Without Limit using The Assimilation Project

    #AssimProj #AssimMon @OSSAlanR
    http://assimproj.org/
    This presentation: http://bit.ly/1bh1iO8
    Alan Robertson <[email protected]>
    Assimilation Systems Limited
  2. Biography
     • Founded Linux-HA project - led 1998-2007 - now called Pacemaker
     • Founded Assimilation Project in 2010
     • Founded Assimilation Systems Limited in 2013
     • Alumnus of Bell Labs, SuSE, IBM
  3. Project background
     • Available as GPL (or commercial)
     • Founded in late 2010
     • Now my full-time endeavor - Assimilation Systems Limited
     • Currently around 25K lines of code
     • First release: April 2013
  4. T.A.N.S.T.A.A.F.L.
     What I need from you...
     • Feedback on the project/product
       – Is it useful – why or why not?
       – Would it sell to management?
     • Feedback on my approach to presenting it
     • Other presentation feedback – clarity, style, etc.
  5. Project Scope
     Zero-network-footprint continuous Discovery integrated with extreme-scale Monitoring
     • Extensible discovery of systems, switches, services, and dependencies – without setting off network alarms
     • Extensible monitoring of > 100K systems
     • All data goes into a central graph database
  6. Questions for Audience
     • How many of you have monitoring?
       – Open or closed source?
       – How many of you are happy with it?
     • How many of you have discovery?
       – Open or closed source?
       – How many of you are happy with it?
  7. Risk Management
     • We monitor systems and services to reduce the risk of extended outages
     • We discover systems to reduce the risk of intrusions
     • We discover services to reduce the risk of extended outages
     • We discover switch connections, dependencies, etc. to reduce the risks in system maintenance, growth and management
     • Reducing risk is good for everyone
  8. Why Discovery?
     • 30% of intrusions come from unknown or forgotten systems – we discover them
     • Most documentation is incomplete or incorrect
     • Dependencies are often unknown
     • Licensed software & lawsuits are expensive
     • Auditability: you know if you got it all
     • Improves planning, understanding, management
     • Creates opportunities to check best practices
     • The discovery database is an ITIL CMDB
  9. Why Our Monitoring?
     • Much simpler to configure (in theory)
     • Growth unlikely to ever be an issue – no need for proxies, multiple servers
     • Dependencies help diagnose problems
     • Extremely low network traffic
     • Ideal for cross-WAN monitoring
     • Highlight cascading failure root causes
     • Not confused by switch failures
     • Most switches get monitored “for free”
  10. Problems Addressed
     • Discovery without setting off alarms
     • Find “forgotten” servers and services
     • Discover uses and installs of licensed software
     • Scale up monitoring indefinitely
     • Know that everything is monitored
     • Minimize & simplify monitoring configuration
     • Keep monitoring up to date
     • Distinguish switch vs. system failures
     • Highlight root causes of cascading failures
  11. Architectural Overview
     Collective Monitoring Authority (CMA)
     • One CMA per installation
     Nanoprobes
     • One nanoprobe per OS image
     Data storage
     • Central Neo4j graph database
     General rule: “No news is good news”
  12. Massive Scalability – or “I see dead servers in O(1) time”
     Today's implementation:
     • Adding systems does not increase the monitoring work on any system
     • Each server monitors 2 (or 4) neighbors (sketched below)
     • Each server monitors its own services
     • Ring repair and alerting is O(n) – but a very small amount of work
     • Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet every 9 seconds)
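
     The neighbor-ring idea is easy to see in code. Below is a minimal
     illustrative sketch, not the project's actual implementation; the
     function name and list-based membership are invented for clarity:

     # Sketch of the neighbor ring: each server heartbeats only its ring
     # neighbors, so per-node monitoring work stays O(1) as servers are added.
     def ring_neighbors(nodes, me, redundancy=1):
         """Return the neighbors 'me' watches: 2 with redundancy=1, 4 with 2."""
         n = len(nodes)
         i = nodes.index(me)
         neighbors = []
         for step in range(1, redundancy + 1):
             neighbors.append(nodes[(i - step) % n])  # predecessor(s)
             neighbors.append(nodes[(i + step) % n])  # successor(s)
         return neighbors

     servers = ["sv%02d" % k for k in range(6)]
     print(ring_neighbors(servers, "sv02"))                 # ['sv01', 'sv03']
     print(ring_neighbors(servers, "sv02", redundancy=2))   # adds 'sv00', 'sv04'
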
  13. Massive Scalability – or “I see dead servers in O(1) time”
     Planned topology-aware architecture – multiple levels of rings:
     • Support diagnosing switch issues
     • Minimize network traffic
     • Ideal for multi-site arrangements
  14. Who will watch the watchers?
     • The CMA runs in an HA cluster
     • Services are watched by scripts
     • Scripts are watched by nanoprobes
     • Nanoprobes watch each other
     • The CMA runs the nanoprobes
  15. Service Monitoring
     Based on Linux-HA LRM ideas (a sketch follows below):
     • LRM == Local Resource Manager
     • Well-proven architecture: “no news is good news”
     • Implements the Open Cluster Framework (OCF) standard (and others)
     • Each system monitors its own services
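
     To make the OCF connection concrete, here is a minimal sketch of driving
     a resource agent's "monitor" action from Python. The paths, environment
     conventions and exit codes are standard OCF; the helper function itself
     is hypothetical, not Assimilation code:

     import os
     import subprocess

     OCF_SUCCESS = 0       # agent reports the service is running
     OCF_NOT_RUNNING = 7   # agent reports the service is cleanly stopped

     def ocf_monitor(provider, agent, params=None):
         """Run an OCF resource agent's 'monitor' action; return its exit code."""
         env = dict(os.environ, OCF_ROOT="/usr/lib/ocf")
         for name, value in (params or {}).items():
             env["OCF_RESKEY_" + name] = value   # OCF passes parameters in env vars
         script = "/usr/lib/ocf/resource.d/%s/%s" % (provider, agent)
         return subprocess.call([script, "monitor"], env=env)

     # e.g. check an IPaddr2 resource (a standard agent from the heartbeat provider):
     if ocf_monitor("heartbeat", "IPaddr2", {"ip": "10.10.10.5"}) == OCF_SUCCESS:
         print("resource is healthy")
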
  16. Monitoring Pros and Cons
     Pros:
     • Simple & scalable
     • Uniform work distribution
     • No single point of failure
     • Distinguishes switch vs. host failure
     • Easy on LAN, WAN
     Cons:
     • Active agents
     • Potential slowness at power-on
  17. How does this apply to clouds?
     • Fits nicely into a cloud infrastructure
       – Should integrate into OpenStack, et al.
       – Can also control VMs – already knows how to start, stop and migrate VMs
     • Can also monitor VMs – the bottom level of rings disappears without LLDP or CDP
       – If you add this to your base image, with one configuration file per customer, then there is no need to configure anything else for basic monitoring
  18. Basic CMA Functions (Python)
     • Nanoprobe management (sketched below): configure & direct; hear alerts & discovery; update rings on join/leave
     • Update the database
     • Issue alerts
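
     A hypothetical, much-simplified rendering of such a dispatch loop; the
     message names (STARTUP, DISCOVERY, DEADNODE) and the helper objects are
     invented for illustration and do not reflect the real CMA protocol:

     def handle_packet(packet, rings, database, alerter):
         """Dispatch one nanoprobe packet to the matching CMA function."""
         kind = packet["type"]
         if kind == "STARTUP":                    # a nanoprobe announces itself
             rings.join(packet["sender"])         # splice it into a heartbeat ring
         elif kind == "DISCOVERY":                # a JSON discovery report arrived
             database.store(packet["sender"], packet["data"])
         elif kind == "DEADNODE":                 # a neighbor reports missed heartbeats
             rings.leave(packet["subject"])       # repair the ring around the dead node
             alerter.raise_alert(packet["subject"])
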
  19. Continuous Integrated Stealth Discovery
     • Continuous – ongoing, incremental
     • Integrated – monitoring does discovery; stored in the same database
     • Stealth – no network privileges needed: no port scans or pings
     • Discovery – systems, switches, clients, services and dependencies
     ➔ An up-to-date picture of the pieces & how they work, without setting off “network security” alarms
  20. Why a graph database? (Neo4j)
     • Dependency & discovery information is a graph
     • Speed of graph traversals depends on the size of the subgraph, not total graph size
     • Root cause queries => graph traversals – notoriously slow in relational databases (example below)
     • Visualization of relationships
     • Schema-less design: good for a constantly changing heterogeneous environment
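
     As an example of the kind of dependency traversal a graph database makes
     cheap, here is a query via the official Neo4j Python driver. The node
     label, relationship type and credentials are invented for illustration;
     the project's actual schema differs:

     from neo4j import GraphDatabase

     driver = GraphDatabase.driver("bolt://localhost:7687",
                                   auth=("neo4j", "password"))

     # "What does this service ultimately depend on?" -- a variable-length
     # traversal whose cost tracks the subgraph touched, not total graph size.
     QUERY = """
     MATCH (s:Service {name: $name})-[:DEPENDSON*1..]->(d)
     RETURN DISTINCT d.name AS dependency
     """

     with driver.session() as session:
         for record in session.run(QUERY, name="sshd"):
             print(record["dependency"])
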
  21. Nanoprobe Functions ('C')
     Announce self to the CMA (toy example below)
     • Reserved multicast address (can be a unicast address or name if no multicast)
     Do what the CMA says
     • Receive configuration information – CMA addresses, ports, defaults
     • Send/expect heartbeats
     • Perform discovery actions
     • Perform monitoring actions
     No persistent state across reboots
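
     A toy Python version of that startup announcement (the real nanoprobe is
     written in C): one UDP datagram to a multicast group the CMA listens on.
     The group, port and message format below are placeholders, not the
     project's reserved values:

     import json
     import socket

     ANNOUNCE_GROUP = "224.0.2.5"   # placeholder multicast group
     ANNOUNCE_PORT = 1984           # placeholder UDP port

     sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
     sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)  # stay on the LAN
     message = json.dumps({"type": "STARTUP", "hostname": socket.gethostname()})
     sock.sendto(message.encode("utf-8"), (ANNOUNCE_GROUP, ANNOUNCE_PORT))
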
  22. How does discovery work?
     Nanoprobe scripts perform discovery (a toy example follows below):
     • Each discovers one kind of information
     • Can take arguments (in the environment)
     • Output JSON
     The CMA stores the discovery information:
     • JSON is stored in the Neo4j database
     • CMA discovery plugins => graph nodes and relationships
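
     A toy discovery script in the spirit the slide describes: parameters
     arrive in environment variables, one kind of information is collected,
     and JSON goes to stdout. The variable and field names are invented, not
     the project's actual discovery format:

     #!/usr/bin/env python
     import json
     import os
     import platform
     import sys

     # Discovery scripts take their arguments from the environment.
     verbose = os.environ.get("DISCOVER_VERBOSE", "0") == "1"  # hypothetical knob

     info = {
         "discovertype": "os",        # one script == one kind of information
         "data": {
             "nodename": platform.node(),
             "system": platform.system(),
             "release": platform.release(),
             "machine": platform.machine(),
         },
     }
     json.dump(info, sys.stdout, indent=2 if verbose else None)
     sys.stdout.write("\n")
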
  23. sshd Service JSON Snippet (from netstat and /proc)
     "sshd": {
       "exe": "/usr/sbin/sshd",
       "cmdline": [ "/usr/sbin/sshd", "-D" ],
       "uid": "root",
       "gid": "root",
       "cwd": "/",
       "listenaddrs": {
         "0.0.0.0:22": {
           "proto": "tcp",
           "addr": "0.0.0.0",
           "port": 22
         },
     and so on...
  24. ssh Client JSON Snippet (from netstat and /proc)
     "ssh": {
       "exe": "/usr/sbin/ssh",
       "cmdline": [ "ssh", "servidor" ],
       "uid": "alanr",
       "gid": "alanr",
       "cwd": "/home/alanr/monitor/src",
       "clientaddrs": {
         "10.10.10.5:22": {
           "proto": "tcp",
           "addr": "10.10.10.5",
           "port": 22
         },
     and so on...
  25. Current State
     • First release was April 2013
     • Great unit test infrastructure
     • Nanoprobe code – works well
     • Service monitoring works
     • Lacking real digital signatures, encryption, compression
     • Reliable UDP comm code all working
     • CMA code works; much more to go
     • Several discovery methods written
     • Licensed under the GPL
  26. Future Plans
     • Production grade by end of year
     • Commercial licenses with support
     • Real digital signatures, compression, encryption
     • Other security enhancements
     • Much more discovery
     • GUI
     • Alerting
     • Reporting
     • Add statistical monitoring
     • Best-practice audits
     • Dynamic (aka cloud) specialization
     • Hundreds more ideas – see: https://trello.com/b/OpaED3AT
  27. Get Involved!
     Powerful ideas and infrastructure. Fun, ground-breaking project. Needs every kind of skill:
     • Awesome user interfaces (UI/UX)
     • Test code (simulate 10⁶ servers!)
     • Packaging, continuous integration
     • Python, C, script coding
     • Evangelism, community building
     • Documentation
     • Feedback: testing, ideas, plans
     • Integration with OpenStack
     • Many others!
  28. Resistance Is Futile!
     #AssimProj #AssimMon @OSSAlanR
     Project web site: http://assimproj.org
     Blog: techthoughts.typepad.com
     Mailing list: lists.community.tummy.com/cgi-bin/mailman/admin/assimilation