Server Administration in Python with Fabric, Cuisine and Watchdog

sebastien

February 01, 2011

Transcript

  1. ffunction inc. How to use Python for Server Administration Thanks

    to Fabric Cuisine* & monitoring* *custom tools
  2. ffunction inc. WEB SERVER The era of dedicated servers DATABASE

    SERVER EMAIL SERVER Hosted in your server room or in colocation
  3. ffunction inc. WEB SERVER The era of dedicated servers DATABASE

    SERVER EMAIL SERVER Hosted in your server room or in colocation Sysadmins typically SSH and configure the servers live
  4. ffunction inc. WEB SERVER The era of dedicated servers DATABASE

    SERVER EMAIL SERVER Hosted in your server room or in colocation The servers are conservatively managed, updates are risky
  5. ffunction inc. SLICE 1 The era of slices/VPS SLICE 10

    Linode.com SLICE 11 SLICE 9 SLICE 1 SLICE 1 SLICE 1 SLICE 1 SLICE 6 Amazon EC2 We now have multiple small virtual servers (slices/VPS)
  6. ffunction inc. SLICE 1 The era of slices/VPS SLICE 10

    Linode.com SLICE 11 SLICE 9 SLICE 1 SLICE 1 SLICE 1 SLICE 1 SLICE 6 Amazon EC2 Often located in different data-centers
  7. ffunction inc. SLICE 1 The era of slices/VPS SLICE 10

    Linode.com SLICE 11 SLICE 9 SLICE 1 SLICE 1 SLICE 1 SLICE 1 SLICE 6 Amazon EC2 ...and sometimes with different providers
  8. ffunction inc. SLICE 1 The era of slices/VPS SLICE 10

    Linode.com SLICE 11 SLICE 9 SLICE 1 SLICE 1 SLICE 1 SLICE 1 SLICE 6 Amazon EC2 DEDICATED SERVER 1 DEDICATED SERVER 2 IWeb.com We even sometimes still have physical, dedicated servers
  9. ffunction inc. The challenge ORDER SERVER SETUP SERVER Create users,

    groups Customize config files Install base packages
  10. ffunction inc. The challenge ORDER SERVER SETUP SERVER DEPLOY APPLICATION

    Install app-specific packages, deploy application, start services
  11. ffunction inc. The challenge ORDER SERVER SETUP SERVER DEPLOY APPLICATION

    MAKE THIS PROCESS AS FAST (AND SIMPLE) AS POSSIBLE
  12. ffunction inc. The challenge Quickly integrate your new server in

    the existing architecture
  13. ffunction inc. Today's menu FABRIC CUISINE Interact with your remote

    machines as if they were local Takes care of users, groups, packages and configuration of your new machine
  14. ffunction inc. Today's menu FABRIC CUISINE monitoring Interact with your

    remote machines as if they were local Takes care of users, groups, packages and configuration of your new machine Ensures that your servers and services are up and running
  15. ffunction inc. Today's menu FABRIC CUISINE monitoring Interact with your

    remote machines as if they were local Takes care of users, groups, packages and configuration of your new machine Ensures that your servers and services are up and running Made by
  16. ffunction inc. Fabric is a Python library and command-line tool

    for streamlining the use of SSH for application deployment or systems administration tasks.
  17. ffunction inc. Fabric is a Python library and command-line tool

    for streamlining the use of SSH for application deployment or systems administration tasks. Wait... what does that mean?
  18. ffunction inc. Streamlining SSH By hand: version = os.popen("ssh myserver 'cat /proc/version'").read()

    Using Fabric: from fabric.api import * env.hosts = ["myserver"] version = run("cat /proc/version")
  19. ffunction inc. Streamlining SSH By hand: version = os.popen("ssh myserver 'cat /proc/version'").read()

    Using Fabric: from fabric.api import * env.hosts = ["myserver"] version = run("cat /proc/version") You can specify multiple hosts and run the same commands across them
  20. ffunction inc. Streamlining SSH By hand: version = os.popen("ssh myserver 'cat /proc/version'").read()

    Using Fabric: from fabric.api import * env.hosts = ["myserver"] version = run("cat /proc/version") Connections will be lazily created and pooled
  21. ffunction inc. Streamlining SSH By hand: version = os.popen("ssh myserver 'cat /proc/version'").read()

    Using Fabric: from fabric.api import * env.hosts = ["myserver"] version = run("cat /proc/version") Failures ($STATUS) will be handled just like in Make
  22. ffunction inc. Example: Installing packages sudo("aptitude install nginx") if run("dpkg

    -s %s | grep 'Status:' ; true" % package).find("installed") == -1: sudo("aptitude install '%s'" % package)
  23. ffunction inc. Example: Installing packages sudo("aptitude install nginx") if run("dpkg

    -s %s | grep 'Status:' ; true" % package).find("installed") == -1: sudo("aptitude install '%s'" % package) It's easy to take action depending on the result
  24. ffunction inc. Example: Installing packages sudo("aptitude install nginx") if run("dpkg

    -s %s | grep 'Status:' ; true" % package).find("installed") == -1: sudo("aptitude install '%s'" % package) Note that we add true so that the run() always succeeds* *there are other ways...
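The check on the slides above comes down to reading the `Status:` line that `dpkg -s` prints. Here is a minimal sketch of just that parsing step, runnable without a server. The helper name `is_installed` and the sample outputs are illustrative assumptions, not part of Fabric. It compares the whole status rather than searching for the substring "installed", since states such as `not-installed` or `half-installed` also contain that substring.

```python
def is_installed(dpkg_output):
    """Return True if `dpkg -s <package>` output reports a fully installed package."""
    for line in dpkg_output.splitlines():
        if line.startswith("Status:"):
            # A fully installed package reports "install ok installed".
            return line.split(":", 1)[1].split() == ["install", "ok", "installed"]
    # No Status line at all: dpkg does not know the package.
    return False


# Hypothetical sample outputs, shortened for the example.
SAMPLE_INSTALLED = "Package: nginx\nStatus: install ok installed\nPriority: optional\n"
SAMPLE_REMOVED = "Package: nginx\nStatus: deinstall ok config-files\n"
```

With Fabric, the text passed to such a helper would be the value returned by `run(...)`, as in the slide's example.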
  25. ffunction inc. Example: retrieving system status disk_usage = run("df -kP")

    mem_usage = run("cat /proc/meminfo") cpu_usage = run("cat /proc/stat") print disk_usage, mem_usage, cpu_usage
  26. ffunction inc. Example: retrieving system status disk_usage = run("df -kP")

    mem_usage = run("cat /proc/meminfo") cpu_usage = run("cat /proc/stat") print disk_usage, mem_usage, cpu_usage Very useful for getting live information from many different servers
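The raw text returned by these commands usually needs a little parsing before it is useful. As a rough sketch, the output of `cat /proc/meminfo` could be turned into numbers like this. The function names and the sample text are assumptions made for the example, and the ratio deliberately ignores buffers and cache.

```python
def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of integers (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_used_ratio(info):
    """Fraction of memory in use, based on MemTotal and MemFree only."""
    return 1.0 - float(info["MemFree"]) / info["MemTotal"]


# Hypothetical, shortened sample of /proc/meminfo.
SAMPLE = "MemTotal:        2048000 kB\nMemFree:          512000 kB\nBuffers:           10000 kB\n"
```

Running the same parser over the output collected from several hosts gives one comparable number per server.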
  27. ffunction inc. Fabfile.py from fabric.api import * from mysetup import

    * env.hosts = ["server1.myapp.com"] def setup(): install_packages("...") update_configuration() create_users() start_daemons() $ fab setup
  28. ffunction inc. Fabfile.py from fabric.api import * from mysetup import

    * env.hosts = ["server1.myapp.com"] def setup(): install_packages("...") update_configuration() create_users() start_daemons() $ fab setup Just like Make, you write rules that do something
  29. ffunction inc. Fabfile.py from fabric.api import * from mysetup import

    * env.hosts = ["server1.myapp.com"] def setup(): install_packages("...") update_configuration() create_users() start_daemons() $ fab setup ...and you can specify on which servers the rules will run
  30. ffunction inc. Multiple hosts @hosts("db1.myapp") def backup_db(): run(...) env.hosts =

    [ "db1.myapp.com", "db2.myapp.com", "db3.myapp.com" ]
  31. ffunction inc. Roles $ fab -R web setup env.roledefs =

    { 'web': ['www1', 'www2', 'www3'], 'dns': ['ns1', 'ns2'] }
  32. ffunction inc. Roles $ fab -R web setup env.roledefs =

    { 'web': ['www1', 'www2', 'www3'], 'dns': ['ns1', 'ns2'] } Will run the setup rule only on hosts that are members of the web role.
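To make the role mechanism concrete, here is a small stand-alone sketch of how a role name can be resolved to a host list from a `roledefs`-style dictionary. It is not Fabric's code. The function name `hosts_for_roles` is an assumption for illustration, and it only mirrors the idea behind `fab -R web setup`.

```python
def hosts_for_roles(roledefs, roles):
    """Return the hosts belonging to the given roles, in order, without duplicates."""
    hosts = []
    for role in roles:
        if role not in roledefs:
            raise KeyError("unknown role: %s" % role)
        for host in roledefs[role]:
            if host not in hosts:
                hosts.append(host)
    return hosts


# Same structure as the env.roledefs value shown on the slide.
ROLEDEFS = {"web": ["www1", "www2", "www3"], "dns": ["ns1", "ns2"]}
```

Here `hosts_for_roles(ROLEDEFS, ["web"])` yields the three web hosts, which is the set of machines the `setup` rule would be applied to.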
  33. ffunction inc. Some facts about Fabric Fabric 1.0 just released!

    On March 4th, 2011 3 years of development First commit 1161 days ago (on March 10th, 2011) Related Projects Opscode's Chef, Puppet
  34. ffunction inc. What's good about Fabric? Low-level Basically an ssh()

    command that returns the result Simple primitives run(), sudo(), get(), put(), local(), prompt(), reboot() No magic No DSL, no abstraction, just a remote command API
  35. ffunction inc. What could be improved ? Ease common admin

    tasks User, group creation. Files, directory operations. Abstract primitives Like install package, so that it works with different OS Templates To make creating/updating configuration files easy
  36. ffunction inc. What is Opscode's Chef? Recipes Scripts/packages to install

    and configure services and applications API A DSL-like Ruby API to interact with the OS (create users, groups, install packages, etc) Architecture Client-server or “solo” mode to push and deploy your new configurations http://wiki.opscode.com/display/chef/Home
  37. ffunction inc. What I liked about Chef Flexible You can

    use the API or shell commands Structured Helped me have a clear decomposition of the services installed per machine Community Lots of recipes already available from http://cookbooks.opscode.com/
  38. ffunction inc. What I didn't like Too many files and

    directories Code is spread out, hard to get the big picture Abstraction overload API not very well documented, frequent fall backs to plain shell scripts within the recipe No “smart” recipe Recipes are applied all the time, even when it's not necessary
  39. ffunction inc. The question that kept coming... Django recipe: 5

    files, 2 directories sudo aptitude install apache2 python django- python What it does, in essence
  40. ffunction inc. The question that kept coming... Django recipe: 5

    files, 2 directories sudo aptitude install apache2 python django- python What it does, in essence Is this really necessary for what I want to do?
  41. ffunction inc. What I loved about Fabric Bare metal ssh()

    function, simple and elegant set of primitives No magic No abstraction, no model, no compilation Two-way communication Easy to change the rule's behaviour according to the output (ex: do not install something that's already installed)
  42. ffunction inc. What I needed Fabric File I/O

    User/Group Management
  43. ffunction inc. What I needed Fabric File I/O

    Package Management User/Group Management
  44. ffunction inc. What I needed Fabric File I/O

    Package Management User/Group Management Text processing & Templates
  45. ffunction inc. How I wanted it Simple “flat” API [object]_[operation]

    where operation is something in “create”, “read”, “update”, “write”, “remove”, “ensure”, etc... Driven by need Only implement a feature if I have a real need for it No magic Everything is implemented using sh-compatible commands No unnecessary structure Everything fits in one file, no imposed file layout
  46. ffunction inc. Cuisine: Example fabfile.py from cuisine import * env.hosts

    = ["server1.myapp.com"] def setup(): package_ensure("python", "apache2", "python-django") user_ensure("admin", uid=2000) upstart_ensure("django") $ fab setup
  47. ffunction inc. Cuisine: Example fabfile.py from cuisine import * env.hosts

    = ["server1.myapp.com"] def setup(): package_ensure("python", "apache2", "python-django") user_ensure("admin", uid=2000) upstart_ensure("django") $ fab setup Fabric's core functions are already imported
  48. ffunction inc. Cuisine: Example fabfile.py from cuisine import * env.hosts

    = ["server1.myapp.com"] def setup(): package_ensure("python", "apache2", "python-django") user_ensure("admin", uid=2000) upstart_ensure("django") $ fab setup Cuisine's API calls
  49. ffunction inc. Cuisine : File I/O • file_exists does the remote

    file exist? • file_read reads remote file • file_write writes data to remote file • file_append appends data to remote file • file_attribs chmod & chown • file_remove
  50. ffunction inc. Cuisine : File I/O • file_exists does the remote

    file exist? • file_read reads remote file • file_write writes data to remote file • file_append appends data to remote file • file_attribs chmod & chown • file_remove Supports owner/group and mode change
  51. ffunction inc. Cuisine : File I/O (directories) • dir_exists does the

    remote directory exist? • dir_ensure ensures that a directory exists • dir_attribs chmod & chown • dir_remove
  52. ffunction inc. Cuisine : File I/O + • file_update(location, updater=lambda

    _:_) package_ensure("mongodb-snapshot") def update_configuration( text ): res = [] for line in text.split("\n"): if line.strip().startswith("dbpath="): res.append("dbpath=/data/mongodb") elif line.strip().startswith("logpath="): res.append("logpath=/data/logs/mongodb.log") else: res.append(line) return "\n".join(res) file_update("/etc/mongodb.conf", update_configuration)
  53. ffunction inc. Cuisine : File I/O + • file_update(location, updater=lambda

    _:_) package_ensure("mongodb-snapshot") def update_configuration( text ): res = [] for line in text.split("\n"): if line.strip().startswith("dbpath="): res.append("dbpath=/data/mongodb") elif line.strip().startswith("logpath="): res.append("logpath=/data/logs/mongodb.log") else: res.append(line) return "\n".join(res) file_update("/etc/mongodb.conf", update_configuration) This replaces the values for configuration entries dbpath and logpath
  54. ffunction inc. Cuisine : File I/O + • file_update(location, updater=lambda

    _:_) package_ensure("mongodb-snapshot") def update_configuration( text ): res = [] for line in text.split("\n"): if line.strip().startswith("dbpath="): res.append("dbpath=/data/mongodb") elif line.strip().startswith("logpath="): res.append("logpath=/data/logs/mongodb.log") else: res.append(line) return "\n".join(res) file_update("/etc/mongodb.conf", update_configuration) The remote file will only be changed if the content is different
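The "only changed if the content is different" behaviour can be sketched locally, without Cuisine or a remote host. The code below is an illustration of the idea, not Cuisine's implementation: `update_file`, `set_dbpath` and `demo` are hypothetical names, and the file is a temporary local file rather than a remote one.

```python
import os
import tempfile


def update_file(path, updater):
    """Apply updater(text) to a local file; rewrite it only if the text changed.

    Returns True when the file was rewritten, False when it was left untouched.
    """
    with open(path) as f:
        old = f.read()
    new = updater(old)
    if new == old:
        return False
    with open(path, "w") as f:
        f.write(new)
    return True


def set_dbpath(text):
    """Point every dbpath= line at /data/mongodb, leaving other lines as they are."""
    lines = []
    for line in text.split("\n"):
        if line.strip().startswith("dbpath="):
            lines.append("dbpath=/data/mongodb")
        else:
            lines.append(line)
    return "\n".join(lines)


def demo():
    """Run the updater twice on a temporary file and report what happened."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as f:
            f.write("dbpath=/var/lib/mongodb\nport=27017\n")
        first = update_file(path, set_dbpath)
        second = update_file(path, set_dbpath)
        with open(path) as f:
            return first, second, f.read()
    finally:
        os.remove(path)
```

The second call finds nothing to change and leaves the file alone, which is what makes it safe to run the same rule repeatedly.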
  55. ffunction inc. Cuisine: User Management • user_exists does the user

    exist? • user_create creates the user • user_ensure creates the user if it doesn't exist
  56. ffunction inc. Cuisine: Group Management • group_exists does the group

    exist? • group_create creates the group • group_ensure creates the group if it doesn't exist • group_user_exists does the user belong to the group? • group_user_add adds the user to the group • group_user_ensure
  57. ffunction inc. Cuisine: Package Management • package_exists is the package

    available ? • package_installed is it installed ? • package_install install the package • package_ensure ... only if it's not installed • package_upgrade upgrades the/all package(s)
  58. ffunction inc. Cuisine: Text transformation text_ensure_line(text, lines) file_update( "/home/user/.profile", lambda

    _:text_ensure_line(_, "PYTHONPATH=/opt/lib/python:${PYTHONPATH};" "export PYTHONPATH" ))
  59. ffunction inc. Cuisine: Text transformation text_ensure_line(text, lines) file_update( "/home/user/.profile", lambda

    _:text_ensure_line(_, "PYTHONPATH=/opt/lib/python:${PYTHONPATH};" "export PYTHONPATH" )) Ensures that the PYTHONPATH variable is set and exported. If not, these lines will be appended.
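The "append only if missing" behaviour described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Cuisine's actual `text_ensure_line`: the name `ensure_lines` and the exact matching rule (whole-line comparison after stripping whitespace) are choices made for the example.

```python
def ensure_lines(text, *lines):
    """Append each of `lines` to `text` unless an identical line is already present."""
    existing = [line.strip() for line in text.split("\n")]
    result = text
    for line in lines:
        if line.strip() not in existing:
            # Make sure the appended line starts on its own line.
            if result and not result.endswith("\n"):
                result += "\n"
            result += line + "\n"
            existing.append(line.strip())
    return result
```

Applying the function a second time to its own output returns the text unchanged, so it combines naturally with an updater that only rewrites a file when the content differs.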
  60. ffunction inc. Cuisine: Text transformation text_replace_line(text, old, new, find=.., process=...)

    configuration = local_read("server.conf") for key, value in variables.items(): configuration, replaced = text_replace_line( configuration, key + "=", key + "=" + repr(value), process=lambda text:text.split("=")[0].strip() )
  61. ffunction inc. Cuisine: Text transformation text_replace_line(text, old, new, find=.., process=...)

    configuration = local_read("server.conf") for key, value in variables.items(): configuration, replaced = text_replace_line( configuration, key + "=", key + "=" + repr(value), process=lambda text:text.split("=")[0].strip() ) Replaces lines that look like VARIABLE=VALUE with the actual values from the variables dictionary.
  62. ffunction inc. Cuisine: Text transformation text_replace_line(text, old, new, find=.., process=...)

    configuration = local_read("server.conf") for key, value in variables.items(): configuration, replaced = text_replace_line( configuration, key + "=", key + "=" + repr(value), process=lambda text:text.split("=")[0].strip() ) The process lambda transforms input lines before comparing them. Here the lines are stripped of spaces and of their value.
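The role of the `process` callback is easier to see in a stand-alone sketch. The function below is an assumption-laden illustration rather than Cuisine's `text_replace_line`: the name `replace_line`, the helper `key_of`, and the choice to return a replacement count as the second value are all hypothetical.

```python
def replace_line(text, old, new, process=lambda line: line):
    """Replace every line whose processed form equals the processed `old`.

    Returns (new_text, number_of_lines_replaced).
    """
    target = process(old)
    result, replaced = [], 0
    for line in text.split("\n"):
        if process(line) == target:
            result.append(new)
            replaced += 1
        else:
            result.append(line)
    return "\n".join(result), replaced


def key_of(line):
    """Reduce 'KEY = value' to 'KEY' so that lines are compared by key only."""
    return line.split("=")[0].strip()


# Hypothetical configuration text with inconsistent spacing around '='.
CONFIG = "PORT = 80\nDEBUG=True\n"
```

Because both the search string and each line go through `key_of`, `"PORT="` matches `"PORT = 80"` regardless of spacing or of the current value.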
  63. ffunction inc. Cuisine: Text transformation text_strip_margin(text) file_write(".profile", text_strip_margin( """ |export

    PATH="$HOME/bin":$PATH |set -o vi """ )) Everything after the | separator will be output as content. This makes it easy to embed text templates within functions.
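A margin-stripping helper of this kind takes only a few lines. The sketch below shows one possible reading of the behaviour described on the slide (lines without the separator are dropped, and everything after the first `|` is kept); the name `strip_margin` and these details are assumptions, not Cuisine's code.

```python
def strip_margin(text, margin="|"):
    """Keep, for each line containing `margin`, only what follows its first occurrence."""
    kept = []
    for line in text.split("\n"):
        before, sep, after = line.partition(margin)
        if sep:
            kept.append(after)
    return "\n".join(kept)


PROFILE = strip_margin(
    """
    |export PATH="$HOME/bin":$PATH
    |set -o vi
    """
)
```

The indentation in front of each `|` belongs to the Python source only, so the template can be indented to match the surrounding function without affecting the generated file.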
  64. ffunction inc. Cuisine: Text transformation text_template(text, variables) text_template(text_strip_margin( """ |cd

    ${DAEMON_PATH} |exec ${DAEMON_EXEC_PATH} """ ), dict( DAEMON_PATH="/opt/mongodb", DAEMON_EXEC_PATH="/opt/mongodb/mongod" ))
  65. ffunction inc. Cuisine: Text transformation text_template(text, variables) text_template(text_strip_margin( """ |cd

    ${DAEMON_PATH} |exec ${DAEMON_EXEC_PATH} """ ), dict( DAEMON_PATH="/opt/mongodb", DAEMON_EXEC_PATH="/opt/mongodb/mongod" )) This is a simple wrapper around Python's (safe) string.Template substitution
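For reference, this is what the standard library's `string.Template` does on its own with the same `${NAME}` placeholders. The `${EXTRA_ARGS}` placeholder is added here only to show the difference between the safe and strict variants; whether Cuisine uses `safe_substitute` internally is an assumption based on the word "safe" on the slide.

```python
from string import Template

SCRIPT = Template("cd ${DAEMON_PATH}\nexec ${DAEMON_EXEC_PATH} ${EXTRA_ARGS}\n")

# safe_substitute() fills in the names it knows and leaves the others untouched.
rendered = SCRIPT.safe_substitute(
    DAEMON_PATH="/opt/mongodb",
    DAEMON_EXEC_PATH="/opt/mongodb/mongod",
)


def strict_render():
    """substitute() raises KeyError when a placeholder has no value."""
    return SCRIPT.substitute(DAEMON_PATH="/opt/mongodb")
```

The safe variant is convenient for shell scripts, where a leftover `${VAR}` may be meant for the shell rather than for the template.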
  66. ffunction inc. Cuisine: Goodies • ssh_keygen generates DSA keys •

    ssh_authorize authorizes your key on the remote server • mode_sudo run() always uses sudo • upstart_ensure ensures the given daemon is running & more!
  67. ffunction inc. Cuisine Tips: Structuring your rules BOOTSTRAP You just

    received your new VPS, and you want to set it up so that you have a base system that you can access without typing a password
  68. ffunction inc. Cuisine Tips: Structuring your rules BOOTSTRAP SETUP You

    install your users, groups, preferred packages and configuration. You also install your applications.
  69. ffunction inc. Cuisine Tips: Structuring your rules BOOTSTRAP SETUP UPDATE

    You want to deploy the new version of the application you just built
  70. ffunction inc. Cuisine Tips: Structuring your rules BOOTSTRAP SETUP UPDATE

    def bootstrap(): # Secure SSH, create admin user # Authorize SSH public keys # Remove unwanted packages
  71. ffunction inc. Cuisine Tips: Structuring your rules BOOTSTRAP SETUP UPDATE

    def setup(): # Create directories (ex: /opt/data, /opt/services, etc) # Create user/groups (ex: apps, services, etc) # Install base tools (ex: screen, fail2ban, zsh, etc) # Edit configuration (ex: profile, inputrc, etc) # Install and run your application
  72. ffunction inc. Cuisine Tips: Structuring your rules BOOTSTRAP SETUP UPDATE

    def update(): # Download your application update # Freeze/stop the running application # Install the update # Reload/restart your application # Test that everything is OK
  73. ffunction inc. Why use Cuisine ? • Simple API for

    remote-server manipulation Files, users, groups, packages • Shell commands for specific tasks only Avoid problems with your shell commands by only using run() for very specific tasks • Cuisine tasks are not stupid *_ensure() commands won't do anything if it's not necessary
  74. ffunction inc. Limitations • Limited to sh-shells Operations will not

    work under csh • Only written/tested for Ubuntu Linux Contributors could easily port commands
  75. ffunction inc. (Some of the) existing solutions Monit, God, Supervisord,

    Upstart Focus on starting/restarting daemons and services Munin, Cacti Focus on visualization of RRDTool data Collectd Focus on collecting and publishing data
  76. ffunction inc. The ideal tool Wide spectrum Data collection, service

    monitoring, actions Easy setup and deployment No complex installation or configuration Flexible server architecture Can monitor local or remote processes Customizable and extensible From restarting daemons to monitoring whole servers
  77. ffunction inc. Hello, monitoring! RULE SERVICE A service is a

    collection of RULES
  78. ffunction inc. Hello, monitoring! RULE SERVICE HTTP Request CPU, Disk,

    Mem % Process status I/O Bandwidth Each rule retrieves data and processes it. Rules can SUCCEED or FAIL
  79. ffunction inc. Hello, monitoring! RULE ACTION SERVICE HTTP Request CPU,

    Disk, Mem % Process status I/O Bandwidth Logging XMPP, Email notifications Start/stop process ….
  80. ffunction inc. Hello, monitoring! RULE ACTION SERVICE HTTP Request CPU,

    Disk, Mem % Process status I/O Bandwidth Logging XMPP, Email notifications Start/stop process …. Actions are bound to rules and triggered on rule SUCCESS or FAILURE
  81. ffunction inc. Execution Model MONITOR RULE (frequency in ms) SERVICE

    DEFINITION Services are registered in the monitor
  82. ffunction inc. Execution Model MONITOR RULE (frequency in ms) SERVICE

    DEFINITION Rules defined in the service are executed every N ms (frequency)
  83. ffunction inc. Execution Model MONITOR RULE (frequency in ms) ACTION

    ACTION ACTION SERVICE DEFINITION SUCCESS FAILURE
  84. ffunction inc. Execution Model MONITOR RULE (frequency in ms) ACTION

    ACTION ACTION SERVICE DEFINITION If the rule SUCCEEDS, success actions will be sequentially executed SUCCESS FAILURE
  85. ffunction inc. Execution Model MONITOR RULE (frequency in ms) ACTION

    ACTION ACTION SERVICE DEFINITION If the rule FAILS, failure actions will be sequentially executed SUCCESS FAILURE
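The execution model on the slides above (a service holds rules, each rule succeeds or fails, and the matching actions run in order) can be sketched in plain Python. This is a simplified illustration, not the monitoring tool's code: the `Rule` and `Service` classes and their methods are hypothetical, and the sketch runs each rule once instead of scheduling it at a frequency.

```python
class Rule:
    """A check that returns True on success, plus actions bound to each outcome."""

    def __init__(self, check, success=(), fail=()):
        self.check = check
        self.success = list(success)
        self.fail = list(fail)

    def run(self):
        try:
            ok = bool(self.check())
        except Exception:
            ok = False  # a check that raises counts as a failure
        # Actions bound to the outcome run one after the other.
        for action in (self.success if ok else self.fail):
            action()
        return ok


class Service:
    """A named collection of rules."""

    def __init__(self, name, rules):
        self.name = name
        self.rules = list(rules)

    def run_once(self):
        return [rule.run() for rule in self.rules]


log = []
service = Service("demo", [
    Rule(lambda: True, success=[lambda: log.append("up")]),
    Rule(lambda: 1 / 0, fail=[lambda: log.append("down"), lambda: log.append("alert")]),
])
results = service.run_once()
```

A real monitor would wrap `run_once` in a scheduler that re-runs each rule at its own interval; that part is left out to keep the sketch short.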
  86. ffunction inc. Monitoring a remote machine #!/usr/bin/env python from monitoring

    import * Monitor( Service( name = "google-search-latency", monitor = ( HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Print("Google search query took more than 50ms") ] ) ) ) ).run()
  87. ffunction inc. Monitoring a remote machine #!/usr/bin/env python from monitoring

    import * Monitor( Service( name = "google-search-latency", monitor = ( HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Print("Google search query took more than 50ms") ] ) ) ) ).run() A monitor is like the “main” for monitoring. It actively monitors services.
  88. ffunction inc. Monitoring a remote machine #!/usr/bin/env python from monitoring

    import * Monitor( Service( name = "google-search-latency", monitor = ( HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Print("Google search query took more than 50ms") ] ) ) ) ).run() Don't forget to call run() on it
  89. ffunction inc. Monitoring a remote machine #!/usr/bin/env python from monitoring

    import * Monitor( Service( name = "google-search-latency", monitor = ( HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Print("Google search query took more than 50ms") ] ) ) ) ).run() The service monitors the rules
  90. ffunction inc. Monitoring a remote machine #!/usr/bin/env python from monitoring

    import * Monitor( Service( name = "google-search-latency", monitor = ( HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Print("Google search query took more than 50ms") ] ) ) ) ).run() The HTTP rule tests a URL, and we display a message in case of failure
  91. ffunction inc. Monitoring a remote machine #!/usr/bin/env python from monitoring

    import * Monitor( Service( name = "google-search-latency", monitor = ( HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Print("Google search query took more than 50ms") ] ) ) ) ).run() If there is a 4XX response or the request times out, the rule will fail and display an error message
  92. ffunction inc. Monitoring a remote machine $ python example-service-monitoring.py 2011-02-27T22:33:18

    monitoring --- #0 (runners=1,threads=2,duration=0.57s) 2011-02-27T22:33:18 monitoring [!] Failure on HTTP(GET="www.google.ca:80/search? q=monitoring",timeout=0.08) : Socket error: timed out Google search query took more than 50ms 2011-02-27T22:33:19 monitoring --- #1 (runners=1,threads=2,duration=0.73s) 2011-02-27T22:33:20 monitoring --- #2 (runners=1,threads=2,duration=0.54s) 2011-02-27T22:33:21 monitoring --- #3 (runners=1,threads=2,duration=0.69s) 2011-02-27T22:33:22 monitoring --- #4 (runners=1,threads=2,duration=0.77s) 2011-02-27T22:33:23 monitoring --- #5 (runners=1,threads=2,duration=0.70s)
  93. ffunction inc. Sending Email Notification send_email = Email( "[email protected]", "[monitoring]Google

    Search Latency Error", "Latency was over 80ms", "smtp.gmail.com", "myusername", "mypassword" ) […] HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ send_email ] )
  94. ffunction inc. Sending Email Notification send_email = Email( "[email protected]", "[monitoring]Google

    Search Latency Error", "Latency was over 80ms", "smtp.gmail.com", "myusername", "mypassword" ) […] HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ send_email ] ) The Email action will send an email to [email protected] when triggered
  95. ffunction inc. Sending Email Notification send_email = Email( "[email protected]", "[monitoring]Google

    Search Latency Error", "Latency was over 80ms", "smtp.gmail.com", "myusername", "mypassword" ) […] HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ send_email ] ) This is how we bind the action to the rule failure
  96. ffunction inc. Sending Email+Jabber Notification send_xmpp = XMPP( "[email protected]", "monitoring:

    Google search latency over 80ms", "[email protected]", "myspassword" ) […] HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ send_email, send_xmpp ] )
  97. ffunction inc. Monitoring incident: when something fails repeatedly during a

    given period of time You don't want to be notified all the time, only when it really matters.
  98. ffunction inc. Detecting incidents HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Incident(

    errors = 5, during = Time.s(10), actions = [send_email,send_xmpp] ) ] ) An incident is a “smart” action: it will only do something when the condition is met
  99. ffunction inc. Detecting incidents HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Incident(

    errors = 5, during = Time.s(10), actions = [send_email,send_xmpp] ) ] ) When at least 5 errors...
  100. ffunction inc. Detecting incidents HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Incident(

    errors = 5, during = Time.s(10), actions = [send_email,send_xmpp] ) ] ) ...happen over a 10-second period
  101. ffunction inc. Detecting incidents HTTP( GET="http://www.google.ca/search?q=monitoring", freq=Time.s(1), timeout=Time.ms(80), fail=[ Incident(

    errors = 5, during = Time.s(10), actions = [send_email,send_xmpp] ) ] ) The Incident action will trigger the given actions
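The "N errors within a time window" condition can be sketched with a small sliding window of timestamps. The class below reuses the slide's parameter names (`errors`, `during`, `actions`) but is otherwise a hypothetical illustration, not the tool's implementation. Timestamps are passed in explicitly so that the behaviour is easy to follow, and resetting the window after firing is a design choice made for this example.

```python
from collections import deque


class Incident:
    """Fire `actions` once `errors` failures have occurred within `during` seconds."""

    def __init__(self, errors, during, actions):
        self.errors = errors
        self.during = during
        self.actions = list(actions)
        self.timestamps = deque()

    def record_failure(self, now):
        """Register a failure at time `now` (seconds); return True if actions fired."""
        self.timestamps.append(now)
        # Forget failures that are older than the window.
        while self.timestamps and now - self.timestamps[0] > self.during:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.errors:
            for action in self.actions:
                action()
            self.timestamps.clear()  # start counting again after firing
            return True
        return False


sent = []
incident = Incident(errors=5, during=10, actions=[lambda: sent.append("email")])
# Four early failures, a long quiet gap, then five failures in quick succession.
fired = [incident.record_failure(t) for t in (0, 1, 2, 3, 20, 21, 22, 23, 24)]
```

The four early failures never reach the threshold and expire during the quiet gap, so only the later burst triggers a notification.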
  102. ffunction inc. Example: Ensuring a service is running from monitoring

    import * Monitor( Service( name="myservice-ensure-up", monitor=( HTTP( GET="http://localhost:8000/", freq=Time.ms(500), fail=[ Incident( errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run()
  103. ffunction inc. Example: Ensuring a service is running from monitoring

    import * Monitor( Service( name="myservice-ensure-up", monitor=( HTTP( GET="http://localhost:8000/", freq=Time.ms(500), fail=[ Incident( errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run() We test every 500ms whether we can GET http://localhost:8000
  104. ffunction inc. Example: Ensuring a service is running from monitoring

    import * Monitor( Service( name="myservice-ensure-up", monitor=( HTTP( GET="http://localhost:8000/", freq=Time.ms(500), fail=[ Incident( errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run() If we can't reach it 5 times within 5 seconds
  105. ffunction inc. Example: Ensuring a service is running from monitoring

    import * Monitor( Service( name="myservice-ensure-up", monitor=( HTTP( GET="http://localhost:8000/", freq=Time.ms(500), fail=[ Incident( errors=5, during=Time.s(5), actions=[ Restart("myservice-start.py") ])] )))).run() We kill and restart myservice-start.py
  106. ffunction inc. Example: Monitoring system health from monitoring import *

    Monitor ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda v:v["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="monitoring-system-failures.log")] ), ) ) ).run()
  107. ffunction inc. Monitoring system health from monitoring import * Monitor

    ( Service( name = "system-health", monitor = ( SystemInfo(freq=Time.s(1), success = ( LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]), LogResult("myserver.system.disk", extract=lambda r,_:reduce(max,r["diskUsage"].values())), LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]), ) ), Delta( Bandwidth("eth0", freq=Time.s(1)), extract = lambda v:v["total"]["bytes"]/1000.0/1000.0, success = [LogResult("myserver.system.eth0.sent")] ), SystemHealth( cpu=0.90, disk=0.90, mem=0.90, freq=Time.s(60), fail=[Log(path="monitoring-system-failures.log")] ), ) ) ).run()
  108. ffunction inc. Monitoring system health

    SystemInfo(freq=Time.s(1),
        success = (
            LogResult("myserver.system.mem", extract=lambda r,_:r["memoryUsage"]),
            LogResult("myserver.system.disk", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
            LogResult("myserver.system.cpu", extract=lambda r,_:r["cpuUsage"]),
        )
    ),

    SystemInfo will retrieve system information and return it as a dictionary
  109. ffunction inc. Monitoring system health

    LogResult("myserver.system.mem=", extract=lambda r,_:r["memoryUsage"]),
    LogResult("myserver.system.disk=", extract=lambda r,_:reduce(max,r["diskUsage"].values())),
    LogResult("myserver.system.cpu=", extract=lambda r,_:r["cpuUsage"]),

    We log each result by extracting the given value from the result dictionary (memoryUsage, diskUsage, cpuUsage)
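
To see what the three extractors return, they can be run against a sample dictionary. The sketch below is an assumption-laden illustration: the `sample` values and the mount-point keys under `diskUsage` are made up, and only the key names `memoryUsage`, `diskUsage` and `cpuUsage` come from the slide. Note that on Python 3 `reduce` must be imported from `functools`.

```python
from functools import reduce  # a builtin in Python 2; lives in functools in Python 3

# Hypothetical sample of the dictionary a system probe might return.
sample = {
    "memoryUsage": 0.42,
    "cpuUsage": 0.13,
    "diskUsage": {"/": 0.61, "/home": 0.87},
}

# Same shape as the slide's extractors; the second argument is ignored there too.
extract_mem = lambda r, _: r["memoryUsage"]
extract_disk = lambda r, _: reduce(max, r["diskUsage"].values())
extract_cpu = lambda r, _: r["cpuUsage"]

mem = extract_mem(sample, None)
disk = extract_disk(sample, None)  # the highest usage among all entries
cpu = extract_cpu(sample, None)
```

The disk extractor reduces the per-entry values with `max`, so a single nearly full entry is enough to raise the logged figure.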
  110. ffunction inc. Monitoring system health

    Bandwidth("eth0", freq=Time.s(1)),

    Bandwidth collects network interface live traffic information
  111. ffunction inc. Monitoring system health

    Delta(
        Bandwidth("eth0", freq=Time.s(1)),
        extract = lambda _:_["total"]["bytes"]/1000.0/1000.0,
        success = [LogResult("myserver.system.eth0.sent")]
    ),

    But we don't want the total amount, we just want the difference. Delta does just that.
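
The idea behind turning a running total into a difference can be sketched in a few lines of plain Python. `DeltaTracker` and `update` are hypothetical names used for illustration only, and the byte counters are made up; whether the library extracts the value before or after differencing is not stated in the slides.

```python
class DeltaTracker:
    """Turns a cumulative counter into per-sample differences (hypothetical helper)."""

    def __init__(self, extract):
        self.extract = extract
        self._previous = None

    def update(self, sample):
        """Return the change since the last sample, or None for the first one."""
        value = self.extract(sample)
        delta = None if self._previous is None else value - self._previous
        self._previous = value
        return delta


# Same extractor as the slide: total bytes converted to megabytes.
to_megabytes = lambda v: v["total"]["bytes"] / 1000.0 / 1000.0
tracker = DeltaTracker(extract=to_megabytes)

# Made-up cumulative byte counters, one sample per second.
samples = [{"total": {"bytes": b}} for b in (1000000, 3000000, 3500000)]
deltas = [tracker.update(s) for s in samples]
```

The first sample has nothing to compare against, so the sketch returns `None` for it rather than reporting the whole counter as traffic.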
  112. ffunction inc. Monitoring system health

    success = [LogResult("myserver.system.eth0.sent=")]

    We print the result as before
  113. ffunction inc. Monitoring system health

    SystemHealth(
        cpu=0.90, disk=0.90, mem=0.90,
        freq=Time.s(60),
        fail=[Log(path="monitoring-system-failures.log")]
    ),

    SystemHealth will fail whenever the usage is above the given thresholds
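
A threshold check of this kind boils down to comparing each usage figure with its limit. The function below is a minimal sketch under stated assumptions: `exceeded_thresholds` is a hypothetical name, usage is assumed to be a fraction between 0 and 1 as the 0.90 limits suggest, and the sample readings are invented.

```python
def exceeded_thresholds(usage, thresholds):
    """Return the sorted names of metrics whose usage is above its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if usage.get(name, 0.0) > limit
    )


# Limits taken from the slide; the usage readings are made up.
thresholds = {"cpu": 0.90, "disk": 0.90, "mem": 0.90}
healthy = exceeded_thresholds({"cpu": 0.35, "disk": 0.50, "mem": 0.70}, thresholds)
overloaded = exceeded_thresholds({"cpu": 0.97, "disk": 0.50, "mem": 0.93}, thresholds)
```

An empty list means the check succeeds; any returned name would correspond to a failure, which the slide's configuration handles with its `fail` actions.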
  114. ffunction inc. Monitoring system health

    fail=[Log(path="monitoring-system-failures.log")]

    We'll log failures in a log file
  115. ffunction inc. monitoring: Decentralized architecture

    [Diagram: APP SERVER (W), STATIC FILE SERVER, DB SERVER]

    Ensures the App is running (pid & HTTP test)
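
The pid half of such a test is commonly done on POSIX systems by sending signal 0 to the process. The sketch below assumes a POSIX host; `pid_is_running` is a hypothetical helper and not the monitoring library's own check. It should not be run on Windows, where `os.kill` with this argument terminates the target process instead of probing it.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs error checking without sending a signal
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# The current interpreter is certainly alive, so this is a safe self-check.
alive = pid_is_running(os.getpid())
```

A pid check only shows that the process exists, which is presumably why the slide pairs it with an HTTP test of the service itself.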
  116. ffunction inc. monitoring: Decentralized architecture

    [Diagram: APP SERVER (W), STATIC FILE SERVER (W), DB SERVER]

    Ensures the static file server is running and has low latency
  117. ffunction inc. monitoring: Decentralized architecture

    [Diagram: APP SERVER (W), STATIC FILE SERVER (W), DB SERVER (W)]

    Ensures the DB is running and that queries are not too slow.
  118. ffunction inc. monitoring: Centralized Architecture

    [Diagram: APP SERVER, STATIC FILE SERVER, DB SERVER, PLATFORM SERVER (W)]

    Does high-level (HTTP, SQL) queries on the servers and executes actions remotely when problems are detected
  119. ffunction inc. monitoring: Deploying on Ubuntu

    # upstart - monitoring Configuration File
    # =====================================
    # updated: 2011-02-28
    description "monitoring - service monitoring daemon"
    author "Sebastien Pierre <[email protected]>"
    start on (net-device-up and local-filesystems)
    stop on runlevel [016]
    respawn
    script
        # NOTE: Change this to wherever the monitoring is installed
        monitoring_HOME=/opt/services/monitoring
        cd $monitoring_HOME
        # NOTE: Change this to wherever your custom monitoring script is installed
        python monitoring.py
    end script
    console output
    # EOF
  120. ffunction inc. monitoring: Deploying on Ubuntu

    Save this file as /etc/init/monitoring.conf
  121. ffunction inc. monitoring: Overview

    Monitoring DSL: Declarative programming to define monitoring strategy
    Wide spectrum: From data collection to incident detection
    Flexible: Does not impose a specific architecture
  122. ffunction inc. monitoring: Use cases

    Ensure service availability: Test and stop/restart when problems occur
    Collect system statistics: Log or send data through the network
    Alert on system or service health: Take actions when system stats are above a threshold
  123. ffunction inc. monitoring: What's coming?

    ZeroMQ channels: Data streaming and inter-monitoring comm.
    Documentation: Only the basics, need more love!
    Contributors? Codebase is small and clear, start hacking!