
2013 July Boulder DevOps - Assimilation Project introduction


An overview of the Assimilation Project - providing IT discovery and monitoring

Alan Robertson

July 15, 2013



  1. IT Discovery and Monitoring
    Without Limit
    The Assimilation Project
    #AssimProj #AssimMon @OSSAlanR
    This presentation:
    Alan Robertson
    Assimilation Systems Limited


  2.

    Founded Linux-HA project - led 1998-2007 -
    now called Pacemaker

    Founded Assimilation Project in 2010

    Founded Assimilation Systems Limited

    Alumnus of Bell Labs, SuSE, IBM


  3. Project background

    Available as GPL (or commercial)

    Founded in late 2010

    Now my full time endeavor
    – Assimilation Systems Limited

    Currently around 25K lines of code

    First release: April 2013


  4. What I need from you...

    Feedback on the project/product
    – Is it useful – why or why not?
    – Would it sell to management?

    Feedback on my approach to presenting it

    Other presentation feedback
    – Clarity, Style, etc...


  5. Project Scope
    Zero-network-footprint continuous Discovery
    integrated with extreme-scale Monitoring

    Extensible discovery of systems, switches,
    services, and dependencies – without
    setting off network alarms

    Extensible monitoring of > 100K systems

    All data goes into central graph database


  6. Questions for Audience

    How many of you have monitoring?
    – Open or closed source?
    – How many of you are happy with it?

    How many of you have discovery?
    – Open or closed source?
    – How many of you are happy with it?


  7. Risk Management

    We monitor systems and services to
    reduce the risk of extended outages

    We discover systems to reduce the risk
    of intrusions

    We discover services to reduce the risk
    of extended outages

    We discover switch connections,
    dependencies, etc. to decrease the risk
    involved in system maintenance and growth
    Reducing risk is good for everyone


  8. Why Discovery?

    30% of intrusions come from unknown
    or forgotten systems
    – We discover them

    Most documentation is incomplete or incorrect

    Dependencies often unknown

    Licensed software & lawsuits are expensive

    Auditability: you know whether you got it all

    Improves planning, understanding, management

    Creates opportunities to check best practices

    Discovery database is an ITIL CMDB


  9. Why Our Monitoring?

    Much simpler to configure (in theory)

    Growth unlikely to ever be an issue
    – No need for proxies or multiple servers

    Dependencies help diagnose problems

    Extremely low network traffic

    Ideal for cross-WAN monitoring

    Highlight cascading failure root causes

    Not confused by switch failures

    Most switches get monitored “for free”


  10. Problems Addressed

    Discovery without setting off alarms

    Find “forgotten” servers and services

    Discover uses and installs of licensed software

    Scale up monitoring indefinitely

    Know that everything is monitored

    Minimize & simplify monitoring configuration

    Keep monitoring up-to-date

    Distinguish switch vs system failures

    Highlight root causes of cascading failures


  11. Architectural Overview
    Collective Monitoring Authority (CMA)

    One CMA per installation

    One nanoprobe per OS image
    Data Storage

    Central Neo4j graph database
    General Rule: “No News Is Good News”


  12. Massive Scalability – or
    “I see dead servers in O(1) time”

    Adding systems does not increase the
    monitoring work on any server
    Each server monitors 2 (or 4) neighbors

    Each server monitors its own services

    Ring repair and alerting is O(n) – but a very small amount of work

    Ring repair for a million nodes is less than 10K packets per day
    (approximately 1 packet per 9 seconds)
    Today's Implementation
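
    To make the O(1) claim concrete, here is a minimal sketch in Python
    (illustrative only, not the project's actual code) of ring-neighbor
    assignment: each server heartbeats only its ring neighbors, so adding
    servers never adds monitoring work to the servers already present.

    # Illustrative sketch, not Assimilation code.
    def ring_neighbors(servers, i):
        """Return the two neighbors servers[i] exchanges heartbeats with."""
        n = len(servers)
        return servers[(i - 1) % n], servers[(i + 1) % n]

    servers = ["srv%04d" % k for k in range(1000)]
    print(ring_neighbors(servers, 0))   # ('srv0999', 'srv0001')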


  13. Massive Scalability – or
    “I see dead servers in O(1) time”
    Planned Topology-Aware Architecture
    Multiple levels of rings:

    Support diagnosing switch issues

    Minimize network traffic

    Ideal for multi-site arrangements


  14. Who will watch the watchers?

    CMA runs in an HA cluster

    Services watched by scripts

    Scripts watched by nanoprobes

    Nanoprobes watch each other


  15. Service Monitoring
    Based on Linux-HA LRM ideas

    LRM == Local Resource Manager

    Well-proven architecture: “no news is
    good news”

    Implements Open Cluster Framework
    standard (and others)

    Each system monitors own services
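
    To make the LRM-style approach concrete: an OCF resource agent is an
    executable that takes an action argument ("start", "stop", "monitor")
    and reports status through its exit code. A minimal sketch in Python
    (the Dummy agent and paths come from the standard resource-agents
    package; this is not the project's own monitoring code):

    import os
    import subprocess

    # Invoke the standard OCF Dummy agent's "monitor" action, the way a
    # local resource manager would.
    env = dict(os.environ,
               OCF_ROOT="/usr/lib/ocf",
               OCF_RESKEY_state="/var/run/Dummy.state")
    rc = subprocess.call(["/usr/lib/ocf/resource.d/heartbeat/Dummy", "monitor"],
                         env=env)
    # OCF exit codes: 0 == running (OCF_SUCCESS), 7 == not running.
    print("healthy" if rc == 0 else "not running or failed (rc=%d)" % rc)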


  16. Monitoring Pros and Cons

    Pros:
    – Simple & Scalable
    – Uniform work
    – No single point of failure
    – Distinguishes switch vs host failure
    – Easy on LAN, WAN

    Cons:
    – Active agents
    – Potential slowness


  17. How does this apply to clouds?

    Fits nicely into a cloud infrastructure
    – Should integrate into OpenStack, et al
    – Can also control VMs – already knows
    how to start, stop and migrate VMs

    Can also monitor VMs
    – bottom level of rings disappears without
    LLDP or CDP
    – If you add this to your base image, with
    one configuration file per customer, then
    no need to configure anything else for
    basic monitoring.


  18. Basic CMA Functions (Python)
    Nanoprobe management
    – Configure & direct
    – Hear alerts & discovery
    – Update rings: join/leave

    Update database

    Issue alerts


  19. Continuous Integrated Stealth Discovery
    Continuous - Ongoing, incremental
    Integrated - Monitoring does discovery;
    stored in same database
    Stealth - No network privileges needed -
    no port scans or pings
    Discovery - Systems, switches, clients, services
    and dependencies
    ➔ Up-to-date picture of the pieces & how
    they work – without setting off network
    security alarms

  20. Why a graph database? (Neo4j)

    Dependency & Discovery information: graph

    Speed of graph traversals depends on size
    of subgraph, not total graph size

    Root cause queries => graph traversals –
    notoriously slow in relational databases

    Visualization of relationships

    Schema-less design: good for constantly
    changing heterogeneous environment
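
    As an example, a root-cause question becomes a short graph traversal.
    A sketch using the official neo4j Python driver – the Server/Service
    labels, the DEPENDS_ON relationship, and the credentials here are
    hypothetical, not the project's actual schema:

    from neo4j import GraphDatabase   # pip install neo4j

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    # Which services (transitively) depend on a dead server? The traversal
    # touches only that server's subgraph, not the whole database.
    query = """
        MATCH (s:Server {status: 'dead'})<-[:DEPENDS_ON*1..5]-(svc:Service)
        RETURN DISTINCT svc.name AS name
    """
    with driver.session() as session:
        for record in session.run(query):
            print(record["name"])
    driver.close()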


  21. Nanoprobe Functions ('C')
    Announce self to CMA

    Reserved multicast address (can be
    unicast address or name if no multicast)
    Do what CMA says

    receive configuration information
    – CMA addresses, ports, defaults

    send/expect heartbeats

    perform discovery actions

    perform monitoring actions
    No persistent state across reboots
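
    A rough sketch of the announce step in Python (the multicast address
    and port below are placeholders, not necessarily the project's
    reserved values):

    import socket

    # Nanoprobe-style startup announcement over UDP multicast, so the CMA
    # can learn about new systems without per-host configuration.
    MCAST_ADDR, MCAST_PORT = "224.0.2.5", 1984   # placeholder values
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.sendto(b"STARTUP " + socket.gethostname().encode(),
                (MCAST_ADDR, MCAST_PORT))
    sock.close()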


  22. How does discovery work?
    Nanoprobe scripts perform discovery

    Each discovers one kind of information

    Can take arguments (in environment)

    Output JSON
    CMA stores Discovery Information

    JSON stored in Neo4j database

    CMA discovery plugins => graph nodes and
    relationships
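
    A toy discovery script in the same spirit – hypothetical, not one of
    the project's shipped scripts: it reports one kind of information,
    takes an argument from the environment (the variable name is made up),
    and writes JSON to stdout for the CMA to store.

    import json
    import os
    import platform

    verbose = os.environ.get("DISCOVERY_VERBOSE", "0")   # hypothetical argument
    result = {
        "discovertype": "os_example",
        "host": platform.node(),
        "data": {"system": platform.system(), "release": platform.release()},
    }
    if verbose == "1":
        result["data"]["version"] = platform.version()
    print(json.dumps(result, indent=2))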


  23. sshd Service JSON Snippet
    (from netstat and /proc)
    "sshd": {
    "exe": "/usr/sbin/sshd",
    "cmdline": [ "/usr/sbin/sshd", "-D" ],
    "uid": "root",
    "gid": "root",
    "cwd": "/",
    "listenaddrs": {
    "": {
    "proto": "tcp",
    "addr": "",
    "port": 22
    }, and so on...


  24. ssh Client JSON Snippet
    (from netstat and /proc)
    "ssh": {
    "exe": "/usr/sbin/ssh",
    "cmdline": [ "ssh", "servidor" ],
    "uid": "alanr",
    "gid": "alanr",
    "cwd": "/home/alanr/monitor/src",
    "clientaddrs": {
    "": {
    "proto": "tcp",
    "addr": "",
    "port": 22
    }, and so on...


  25. ssh -> sshd dependency graph


  26. Switch Discovery Data
    from LLDP (or CDP)

    CMA transforms LLDP (CDP) data to JSON


  27. Current State

    First release was April 2013

    Great unit test infrastructure

    Nanoprobe code – works well

    Service monitoring works

    Lacking real digital signatures, encryption,
    compression

    Reliable UDP comm code all working

    CMA code works, much more to go

    Several discovery methods written

    Licensed under the GPL


  28. Future Plans

    Production grade by end of year

    Commercial licenses with support

    Real digital signatures, compression, encryption

    Other security enhancements

    Much more discovery

    Add Statistical Monitoring

    Best Practice Audits

    Dynamic (aka cloud) specialization

    Hundreds more ideas
    – See: https://trello.com/b/OpaED3AT


  29. Get Involved!
    Powerful Ideas and Infrastructure
    Fun, ground-breaking project
    Needs every kind of skill

    Awesome User Interfaces (UI/UX)

    Test Code (simulate 10⁶ servers!)

    Packaging, Continuous Integration

    Python, C, script coding

    Evangelism, community building


    Feedback: Testing, Ideas, Plans

    Integration with OpenStack

    Many others!


  30. Resistance Is Futile!
    #AssimProj @OSSAlanR
    Project Web Site
