LTC: Cloud monitoring without limit

Clouds are outgrowing the capacity of existing monitoring systems. This talk covers the Assimilation Monitoring Project, which scales to hundreds of thousands of systems without breathing hard.

Alan Robertson

June 19, 2013


  1. Cloud Monitoring Without Limit using The Assimilation Monitoring Project #AssimMon

    @OSSAlanR http://assimmon.org/ Alan Robertson <[email protected]> Project Founder
  2. 6/25/12 2/24 Project background

    • Sub-project of the Linux-HA project
    • Personal-time open source project
    • Soon to become my full time endeavor
    • Currently around 25K lines of code
    • A work-in-progress
  3. Project Scope

    Discovery-Driven Exception Monitoring of Systems and Services – cloud and non-cloud
    • EXTREME monitoring scalability: 100K systems without breathing hard
    • Integrated Continuous Stealth Discovery™: systems, switches, services, and dependencies – without setting off network alarms
  4. Problems Addressed

    • Scale up monitoring indefinitely
    • Minimize & simplify configuration
    • Keep monitoring up-to-date
    • Know that everything is monitored
    • Distinguish switch vs system failures
    • Discovery without setting off alarms
    • Highlight root causes of cascading failures
    • Find “forgotten” servers and services
    • Discover uses and installs of licensed software
  5. Architectural Overview

    Collective Monitoring Authority (CMA)
    • One CMA per installation
    Nanoprobes
    • One nanoprobe per OS image
    Data Storage
    • Central Neo4j graph database
    General Rule: “No News Is Good News”
  6. Massive Scalability – or “I see dead servers in O(1) time”

    • Adding systems does not increase the monitoring work on any system
    • Each server monitors 2 (or 4) neighbors
    • Each server monitors its own services
    • Ring repair and alerting is O(n) – but a very small amount of work
    • Ring repair for a million nodes is less than 10K packets per day (approximately 1 packet per 9 seconds)
    Today's Implementation
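The ring idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names `ring_neighbors` and `repair_ring`, the `reach` parameter, and the host names are all assumptions made for the example. It shows why each node's monitoring work stays constant as the ring grows, and checks the slide's packets-per-day arithmetic.

```python
def ring_neighbors(nodes, index, reach=1):
    """Return the peers that nodes[index] would exchange heartbeats with.

    reach=1 gives 2 neighbors (one on each side); reach=2 gives 4.
    The work per node depends only on reach, never on len(nodes).
    """
    n = len(nodes)
    peers = []
    for step in range(1, reach + 1):
        for pos in ((index - step) % n, (index + step) % n):
            # Skip ourselves and duplicates (both happen in very small rings).
            if pos != index and nodes[pos] not in peers:
                peers.append(nodes[pos])
    return peers


def repair_ring(nodes, dead):
    """Drop a dead node; only its former neighbors end up with new peers."""
    return [node for node in nodes if node != dead]


ring = ["host%d" % i for i in range(8)]   # hypothetical host names
print(ring_neighbors(ring, 0))            # ['host7', 'host1']
print(ring_neighbors(ring, 0, reach=2))   # ['host7', 'host1', 'host6', 'host2']

ring = repair_ring(ring, "host1")
print(ring_neighbors(ring, 0))            # ['host7', 'host2']

# Slide arithmetic: 10K packets per day is one packet every 8.64 seconds.
print(86400 / 10000)                      # 8.64
```

When `host1` is removed, only `host0` and `host2` see a change in their neighbor lists, which is why repair traffic stays small even for very large rings.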
  7. Massive Scalability – or “I see dead servers in O(1) time”

    Planned Topology-Aware Architecture
    Multiple levels of rings:
    • Support diagnosing switch issues
    • Minimize network traffic
    • Ideal for multi-site arrangements
  8. How does this apply to clouds?

    • Fits nicely into a huge cloud infrastructure
      – Should integrate into OpenStack, et al
      – Can also control VMs – already knows how to start, stop and migrate VMs
    • Can also monitor customer VMs
      – Bottom level of rings probably disappears – unless LLDP or CDP is provided
      – If you add this to your base image, with one configuration file per customer, then there is no need to configure anything for basic monitoring
  9. Why Discovery?

    • Simplifies monitoring configuration
    • Finds things that aren't monitored
    • Dependencies simplify root cause analysis
    • Simplifies understanding the environment
    • “Forgotten” systems are implicated in about 1/3 of all break-ins
    • Simplifies license management for proprietary software (saves $$)
  10. Continuous Integrated Stealth Discovery

    Continuous - Ongoing, incremental
    Integrated - Monitoring does discovery; stored in same database
    Stealth - No network privileges needed - no port scans or pings
    Discovery - Systems, switches, clients, services and dependencies
    ➔ Up-to-date picture of pieces & how they work w/o “network security jail” :-D
  11. Service Monitoring

    Based on Linux-HA LRM ideas
    • LRM == Local Resource Manager
    • Well-proven architecture: “no news is good news”
    • Implements Open Cluster Framework standard (and others)
    • Each system monitors own services
  12. Monitoring Pros and Cons

    Pros
    • Simple & Scalable
    • Uniform work distribution
    • No single point of failure
    • Distinguishes switch vs host failure
    • Easy on LAN, WAN
    Cons
    • Active agents
    • Potential slowness at power-on
  13. Basic CMA Functions (python)

    Nanoprobe management
    • Configure & direct
    • Hear alerts & discovery
    • Update rings: join/leave
    Update database
    Issue alerts
  14. Nanoprobe Functions ('C')

    Announce self to CMA
    • Reserved multicast address (can be unicast address or name if no multicast)
    Do what CMA says
    • receive configuration information – CMA addresses, ports, defaults
    • send/expect heartbeats
    • perform discovery actions
    • perform monitoring actions
    No persistent state across reboots
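The "send/expect heartbeats" behaviour, combined with the "no news is good news" rule, might look roughly like the sketch below. The actual nanoprobe is written in C; this Python class, its name `HeartbeatWatcher`, its methods, and the 10-second deadline are all assumptions chosen for illustration. Timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatWatcher:
    """Tracks heartbeats expected from ring neighbors.

    Nothing is reported while peers keep checking in; only peers that
    miss their deadline are returned by overdue().
    """

    def __init__(self, deadline_secs):
        self.deadline_secs = deadline_secs
        self.last_heard = {}

    def expect(self, peer, now):
        """Start expecting heartbeats from peer (as directed by the CMA)."""
        self.last_heard[peer] = now

    def heard(self, peer, now):
        """Record a heartbeat; peers we were not told to expect are ignored."""
        if peer in self.last_heard:
            self.last_heard[peer] = now

    def overdue(self, now):
        """Return the peers whose last heartbeat is older than the deadline."""
        return sorted(
            peer for peer, seen in self.last_heard.items()
            if now - seen > self.deadline_secs
        )


watcher = HeartbeatWatcher(deadline_secs=10.0)
watcher.expect("hostA", now=0.0)
watcher.expect("hostB", now=0.0)
watcher.heard("hostA", now=8.0)

print(watcher.overdue(now=9.0))    # []  (no news is good news)
print(watcher.overdue(now=12.0))   # ['hostB']
```

In this sketch only the second call would lead to a message being sent upstream, which is the property that keeps the central load independent of the number of healthy systems.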
  15. Why a graph database? (Neo4j)

    • Dependency & Discovery information: graph
    • Speed of graph traversals depends on size of subgraph, not total graph size
    • Root cause queries → graph traversals – notoriously slow in relational databases
    • Visualization of relationships
    • Schema-less design: good for constantly changing heterogeneous environment
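To make the "root cause queries are graph traversals" point concrete, here is a small sketch that uses a plain Python dictionary in place of Neo4j. The dependency data, the node names, and the function `root_causes` are invented for the example and are not the project's schema or query code. The traversal only visits the failed part of the dependency subgraph reachable from the starting node, so its cost depends on that subgraph rather than on the total graph size.

```python
from collections import deque

# Hypothetical dependency graph: each key depends on the listed nodes.
DEPENDS_ON = {
    "webapp": ["appserver"],
    "appserver": ["database", "host-b"],
    "database": ["host-c"],
    "host-b": ["switch-1"],
    "host-c": ["switch-1"],
    "switch-1": [],
    "unrelated-service": ["host-z"],   # never visited by the query below
}


def root_causes(start, failed, depends_on):
    """Walk failed dependencies from start; return failures with no failed dependency."""
    seen = {start}
    queue = deque([start])
    roots = []
    while queue:
        node = queue.popleft()
        failed_deps = [d for d in depends_on.get(node, []) if d in failed]
        if node in failed and not failed_deps:
            roots.append(node)          # nothing beneath it is broken
        for dep in failed_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return roots


cascade = {"webapp", "appserver", "database", "host-b", "host-c", "switch-1"}
print(root_causes("webapp", cascade, DEPENDS_ON))   # ['switch-1']

partial = {"webapp", "appserver", "database"}
print(root_causes("webapp", partial, DEPENDS_ON))   # ['database']
```

In the first case six alarms collapse to a single switch failure, which is the kind of cascading-failure summary the slides describe.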
  16. How does discovery work?

    Nanoprobe scripts perform discovery
    • Each discovers one kind of information
    • Can take arguments (in environment)
    • Output JSON
    CMA stores Discovery Information
    • JSON stored in Neo4j database
    • CMA discovery plugins => graph nodes and relationships
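The two halves of this slide can be sketched as a pair of small Python functions: one that discovers a single kind of information and emits JSON, and one that plays the role of a CMA plugin turning that JSON into nodes and relationships. Everything specific here is an assumption for illustration: the `DISCOVERY_HOST` environment variable, the `kind`/`host`/`data` field names, the `RUNS` relationship, and the tuple representation of graph nodes are not taken from the project.

```python
import json
import os
import platform


def discover_os(environ):
    """Discovery side: collect one kind of information, return it as JSON text.

    DISCOVERY_HOST is a hypothetical argument passed in the environment.
    """
    record = {
        "kind": "os",
        "host": environ.get("DISCOVERY_HOST", platform.node()),
        "data": {
            "system": platform.system(),
            "release": platform.release(),
        },
    }
    return json.dumps(record)


def to_graph(json_text):
    """CMA side: turn one discovery record into graph nodes and relationships."""
    record = json.loads(json_text)
    host_node = ("Host", record["host"])
    os_node = ("OS", record["data"]["system"] + " " + record["data"]["release"])
    nodes = [host_node, os_node]
    relationships = [(host_node, "RUNS", os_node)]
    return nodes, relationships


if __name__ == "__main__":
    text = discover_os(os.environ)
    print(text)
    nodes, relationships = to_graph(text)
    print(nodes)
    print(relationships)
```

Keeping the discovery side limited to "gather and print JSON" means new kinds of discovery can be added as small scripts, while interpretation of the data stays in the central plugins.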
  17. sshd Service JSON Snippet (from netstat and /proc)

    "sshd": {
      "exe": "/usr/sbin/sshd",
      "cmdline": [ "/usr/sbin/sshd", "-D" ],
      "uid": "root",
      "gid": "root",
      "cwd": "/",
      "listenaddrs": {
        "": { "proto": "tcp", "addr": "", "port": 22 },
        and so on...
  18. ssh Client JSON Snippet (from netstat and /proc)

    "ssh": {
      "exe": "/usr/sbin/ssh",
      "cmdline": [ "ssh", "servidor" ],
      "uid": "alanr",
      "gid": "alanr",
      "cwd": "/home/alanr/monitor/src",
      "clientaddrs": {
        "": { "proto": "tcp", "addr": "", "port": 22 },
        and so on...
  19. Current State

    • First release was April 2013
    • Great unit test infrastructure
    • Nanoprobe code – works well
    • Service monitoring works
    • Lacking digital signatures, encryption, compression
    • Reliable UDP comm code all working
    • CMA code works, much more to go
    • Several discovery methods written
    • Licensed under the GPL
  20. Future Plans

    • Production grade by end of year
    • Commercial licenses with support
    • “Real” digital signatures, compression, encryption
    • Other security enhancements
    • Dynamic (aka cloud) specialization
    • Much more discovery
    • Alerting
    • GUI
    • Reporting
    • Create/audit an ITIL CMDB
    • Add Statistical Monitoring
    • Best Practice Audits
    • Hundreds more ideas – See: https://trello.com/b/OpaED3AT
  21. Get Involved!

    Powerful Ideas and Infrastructure
    Fun, ground-breaking project
    Needs for every kind of skill
    • Awesome User Interfaces (UI/UX)
    • Test Code (simulate 10^6 servers!)
    • Packaging, Continuous Integration
    • Python, C, script coding
    • Evangelism, community building
    • Documentation
    • Integration with OpenStack
    • Feedback: Testing, Ideas, Plans
    • Many others!
  22. Resistance Is Futile!

    #AssimMon @OSSAlanR
    Project Web Site http://assimmon.org
    Blog techthoughts.typepad.com
    lists.community.tummy.com/cgi-bin/mailman/admin/assimilation