Proposal to revamp logging infrastructure

Martin Smith
August 13, 2013


  1. Logging revamp for OSG Linux infrastructure

    smithmb+devnull@ufl.edu

  2. Basic information about syslog

    - IETF documented the status quo in RFC 3164
    - Later obsoleted by RFC 5424
    - Originally used for sendmail
    - Major point is to keep logs locally, and optionally send a copy off to the server
    - Implementations: syslog-ng, rsyslog (4.x default in RHEL6)
  3. Facilities

    Facility Number  Keyword   Facility Description
    0                kern      kernel messages
    1                user      user-level messages
    2                mail      mail system
    3                daemon    system daemons
    4                auth      security/authorization messages
    5                syslog    messages generated internally by syslogd
    6                lpr       line printer subsystem
    7                news      network news subsystem
    8                uucp      UUCP subsystem
    9                -         clock daemon
    10               authpriv  security/authorization messages
    11               ftp       FTP daemon
    12               -         NTP subsystem
    13               -         log audit
    14               -         log alert
    15               cron      clock daemon
    16               local0    local use 0 (local0)
    17               local1    local use 1 (local1)
    18               local2    local use 2 (local2)
    19               local3    local use 3 (local3)
    20               local4    local use 4 (local4)
    21               local5    local use 5 (local5)
    22               local6    local use 6 (local6)
    23               local7    local use 7 (local7)
  4. Severities

    Each entry lists: Code, Severity, Keyword: Description. General description.

    - 0, Emergency, emerg (panic): System is unusable. A "panic" condition usually affecting multiple apps/servers/sites. At this level it would usually notify all tech staff on call.
    - 1, Alert, alert: Action must be taken immediately. Should be corrected immediately, therefore notify staff who can fix the problem. An example would be the loss of a primary ISP connection.
    - 2, Critical, crit: Critical conditions. Should be corrected immediately, but indicates failure in a primary system; an example is the loss of a backup ISP connection.
    - 3, Error, err (error): Error conditions. Non-urgent failures; these should be relayed to developers or admins, and each item must be resolved within a given time.
    - 4, Warning, warning (warn): Warning conditions. Not an error, but an indication that an error will occur if action is not taken, e.g. file system 85% full. Each item must be resolved within a given time.
    - 5, Notice, notice: Normal but significant condition. Events that are unusual but not error conditions. Might be summarized in an email to developers or admins to spot potential problems. No immediate action required.
    - 6, Informational, info: Informational messages. Normal operational messages. May be harvested for reporting, measuring throughput, etc. No action required.
    - 7, Debug, debug: Debug-level messages. Info useful to developers for debugging the application, not useful during operations.
  5. Anatomy of a syslog message

    RFC 5424 gives an ABNF for the message format, BUT messages in practice have:

    - PRIority: <NNN>
    - HEADER: timestamp and source IP/host
    - MSG: a TAG of at most 32 chars, terminated by ':', '[' or ' ', followed by the content
    - Total length of 1024 bytes or less (the RFC revision sets 480 octets as the size every receiver must accept)
    - Overall layout: "<PRIORITY> TIMESTAMP HOSTNAME MTAG MCONTENT"

    ABNF from RFC 5424:

    SYSLOG-MSG = HEADER SP STRUCTURED-DATA [SP MSG]
    HEADER     = PRI VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP PROCID SP MSGID
    PRI        = "<" PRIVAL ">"
    PRIVAL     = 1*3DIGIT ; range 0 .. 191
    VERSION    = NONZERO-DIGIT 0*2DIGIT
    HOSTNAME   = NILVALUE / 1*255PRINTUSASCII
    APP-NAME   = NILVALUE / 1*48PRINTUSASCII
    PROCID     = NILVALUE / 1*128PRINTUSASCII
    MSGID      = NILVALUE / 1*32PRINTUSASCII

    In the older RFC 3164 format, the TIMESTAMP field is the local time, in the format "Mmm dd hh:mm:ss" (without the quote marks). There is no YYYY! The year must go in the message itself unless the sender uses the RFC 5424 timestamp, which does include it.

    The format "TAG[pid]:" (without the quote marks) is common. The left square bracket terminates the TAG field in this case and is then the first character in the CONTENT field. If the process id is immaterial, it may be left off.
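
The legacy layout described above can be split into its fields with a regular expression. This is a minimal sketch, not a full RFC 3164 parser: the pattern `LEGACY_SYSLOG`, the function `parse_legacy_syslog`, and the sample line (host `web01`, user `alice`) are all hypothetical, and the TAG character set is an assumption that is somewhat looser than the RFC's alphanumeric rule so that tags such as `postfix/smtpd` still match.

```python
import re

# Assumed pattern for "<PRI>TIMESTAMP HOSTNAME TAG[pid]: CONTENT".
LEGACY_SYSLOG = re.compile(
    r"^<(?P<pri>\d{1,3})>"                                  # <NNN>
    r"(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "  # Mmm dd hh:mm:ss
    r"(?P<hostname>\S+) "
    r"(?P<tag>[A-Za-z0-9_./-]{1,32})"                       # TAG, at most 32 chars
    r"(?:\[(?P<pid>\d+)\])?"                                # optional [pid]
    r": ?(?P<content>.*)$"
)


def parse_legacy_syslog(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LEGACY_SYSLOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pri"] = int(fields["pri"])
    return fields


# Hypothetical sample message, for illustration only.
sample = "<165>Aug 13 09:15:02 web01 sshd[4242]: Accepted publickey for alice"
parsed = parse_legacy_syslog(sample)
print(parsed)
```

Lines that do not follow the layout return `None` rather than raising, so a caller can route them to a catch-all file instead of dropping them.
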
  6. Anatomy of a syslog message (pt 2)

    The Priority value is calculated by:

    1. multiplying the Facility number by 8, and then
    2. adding the numerical value of the Severity

    For example, a kernel message (Facility=0) with a Severity of Emergency (Severity=0) would have a Priority value of 0. Also, a "local use 4" message (Facility=20) with a Severity of Notice (Severity=5) would have a Priority value of 165.

    In the PRI part of a syslog message, these values would be placed between the angle brackets as <0> and <165> respectively. The only time a value of "0" will follow the "<" is for the Priority value of "0". Otherwise, leading "0"s MUST NOT be used.
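
The Priority arithmetic above is small enough to check in code. The following is a minimal sketch; the function names `encode_priority`, `decode_priority`, and `format_pri` are hypothetical and not part of any syslog library.

```python
def encode_priority(facility, severity):
    """Priority = Facility * 8 + Severity."""
    if not 0 <= facility <= 23:
        raise ValueError("facility must be in 0..23")
    if not 0 <= severity <= 7:
        raise ValueError("severity must be in 0..7")
    return facility * 8 + severity


def decode_priority(prival):
    """Split a Priority value back into (facility, severity)."""
    if not 0 <= prival <= 191:
        raise ValueError("priority must be in 0..191")
    return divmod(prival, 8)


def format_pri(facility, severity):
    """Render the PRI part; %d never emits leading zeros."""
    return "<%d>" % encode_priority(facility, severity)


print(format_pri(0, 0))      # kernel, Emergency
print(format_pri(20, 5))     # local use 4, Notice
print(decode_priority(165))
```

Because a severity always fits in three bits, `divmod(prival, 8)` recovers both numbers without any lookup table.
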
  7. Diversion: What logs do we have?

    - tsm/ship (What are these?)
    - imapd, mail (sendmail)
    - shib (idp-*, shibd, transaction, native)
    - torque, listserv (listserv.log), catalina.out
    - www (access, error, suexec)
    - kern, messages, syslog, auth, local, yum, transaction, anaconda, up2date
    - mod_jk (cm), slapd, net-snmpd (xen)
    - handful of others, but legacy
  8. Phase 1: Convert everything to remote syslog architecture

    - Local files with remote syslog
    - Local files rotate 'last N days' + permanent archive copy can be kept on the remote system
    - Remote files could be a rough facsimile of current /nerdc/log setup
    - No more log bale (do we really need this???)
    - Sane permissions on /nerdc/log subdirectories for everything possible. Or not, and just allow searching.
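
In the proposal the forwarding itself would be done by rsyslog, but the "local file plus remote copy" pattern can be illustrated from the application side with Python's standard `logging` handlers. This is only a sketch under stated assumptions: the function `build_logger`, its parameters, the choice of facility local4, and the plain UDP transport are all hypothetical and not taken from the slides.

```python
import logging
import logging.handlers


def build_logger(name, local_path, remote_host, remote_port=514, keep_days=7):
    """Return a logger that writes a local file and sends a copy via UDP syslog."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(name)s: %(message)s")

    # Local copy: rotate at midnight and keep only the last `keep_days` files.
    local = logging.handlers.TimedRotatingFileHandler(
        local_path, when="midnight", backupCount=keep_days
    )
    local.setFormatter(formatter)
    logger.addHandler(local)

    # Remote copy: UDP syslog to the central host (assumed facility: local4).
    remote = logging.handlers.SysLogHandler(
        address=(remote_host, remote_port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL4,
    )
    remote.setFormatter(formatter)
    logger.addHandler(remote)
    return logger
```

A caller would create the logger once, passing the local file path and the address of the central log host, and then use `logger.info(...)` as usual. The local handler covers the "last N days" requirement, while the permanent archive lives wherever the remote host stores what it receives.
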
  9. Phase 2: Parse deep!

    - Logstash is an abstraction layer over logs coming out of rsyslog
    - It can parse and annotate logs, add context
    - It can also be used to load log data into fancier backends like ElasticSearch
    - ElasticSearch for the last N days of logs
    - Kibana as a nice UI for ElasticSearch
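
To make "parse and annotate, add context" concrete, the sketch below mimics in plain Python the kind of enrichment Logstash would perform before a document is loaded into ElasticSearch. It is not Logstash configuration. The function `annotate`, the `site` field, and the sample event are hypothetical; `@timestamp` follows the usual Logstash field name. The year is supplied by the caller because the legacy timestamp has none, and `%b` assumes an English month-name locale.

```python
import json
from datetime import datetime

# Keywords from the facility table; unnamed facilities fall back to their number.
FACILITY_KEYWORDS = {
    0: "kern", 1: "user", 2: "mail", 3: "daemon", 4: "auth", 5: "syslog",
    6: "lpr", 7: "news", 8: "uucp", 10: "authpriv", 11: "ftp", 15: "cron",
}
FACILITY_KEYWORDS.update({16 + i: "local%d" % i for i in range(8)})
SEVERITY_KEYWORDS = ["emerg", "alert", "crit", "err",
                     "warning", "notice", "info", "debug"]


def annotate(event, year, site):
    """Return a copy of `event` with facility, severity, full timestamp, site."""
    facility, severity = divmod(event["pri"], 8)
    stamp = datetime.strptime(
        "%d %s" % (year, event["timestamp"]), "%Y %b %d %H:%M:%S"
    )
    doc = dict(event)
    doc["facility"] = FACILITY_KEYWORDS.get(facility, str(facility))
    doc["severity"] = SEVERITY_KEYWORDS[severity]
    doc["@timestamp"] = stamp.isoformat()
    doc["site"] = site  # hypothetical context field
    return doc


# Hypothetical already-parsed event, for illustration only.
event = {"pri": 165, "timestamp": "Aug 13 09:15:02",
         "hostname": "web01", "tag": "sshd", "content": "session opened"}
doc = annotate(event, year=2013, site="example-site")
print(json.dumps(doc, sort_keys=True))
```

Each resulting JSON object is the sort of document a search backend can index by field, which is what makes per-host or per-severity queries in a UI such as Kibana practical.
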
  10. Kibana + Elastic Search

  11. Kibana + Elastic Search