Proposal to revamp logging infrastructure

Martin Smith
August 13, 2013

Transcript

  1. Logging revamp for OSG
    Linux infrastructure

  2. Basic information about syslog
    - The IETF documented the status quo (BSD syslog) in RFC 3164
    - Later obsoleted by RFC 5424
    - Originally developed for sendmail
    - The main point is to keep logs locally and optionally
    send a copy off to a remote log server
    - Implementations: syslog-ng, rsyslog (4.x is the default
    in RHEL6)

  3. Facilities
    Number  Keyword   Description
    0       kern      kernel messages
    1       user      user-level messages
    2       mail      mail system
    3       daemon    system daemons
    4       auth      security/authorization messages
    5       syslog    messages generated internally by syslogd
    6       lpr       line printer subsystem
    7       news      network news subsystem
    8       uucp      UUCP subsystem
    9       cron      clock daemon
    10      authpriv  security/authorization messages (private)
    11      ftp       FTP daemon
    12      -         NTP subsystem
    13      -         log audit
    14      -         log alert
    15      -         clock daemon (cron on some non-Linux systems)
    16      local0    local use 0
    17      local1    local use 1
    18      local2    local use 2
    19      local3    local use 3
    20      local4    local use 4
    21      local5    local use 5
    22      local6    local use 6
    23      local7    local use 7
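    The table as a two-way lookup, in a minimal Python sketch. It assumes Linux numbering
    (cron = 9, matching glibc's LOG_CRON and Python's SysLogHandler.LOG_CRON) and omits
    codes 12 to 15, which have no standard keyword. FACILITIES, facility_code and
    facility_keyword are illustrative names, not part of any library.

```python
# Facility keyword -> numeric code, per the table above (Linux numbering).
# Codes 12-15 have no standard keyword on Linux, so they are not listed.
FACILITIES = {
    "kern": 0, "user": 1, "mail": 2, "daemon": 3,
    "auth": 4, "syslog": 5, "lpr": 6, "news": 7,
    "uucp": 8, "cron": 9, "authpriv": 10, "ftp": 11,
    # local0..local7 occupy 16..23
    **{f"local{i}": 16 + i for i in range(8)},
}

# Reverse map for turning a decoded number back into a keyword.
FACILITY_KEYWORDS = {code: kw for kw, code in FACILITIES.items()}


def facility_code(keyword: str) -> int:
    """Return the numeric facility for a keyword such as 'local4'."""
    try:
        return FACILITIES[keyword.lower()]
    except KeyError:
        raise ValueError(f"unknown syslog facility: {keyword!r}") from None


def facility_keyword(code: int) -> str:
    """Return the keyword for a facility number, or '-' if it has none."""
    if not 0 <= code <= 23:
        raise ValueError(f"facility out of range 0..23: {code}")
    return FACILITY_KEYWORDS.get(code, "-")


print(facility_code("local4"))  # 20
print(facility_keyword(9))      # cron
print(facility_keyword(13))     # -
```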

  4. Severities
    Code  Severity       Keyword          Description
    0     Emergency      emerg (panic)    System is unusable. A "panic" condition, usually
                                          affecting multiple apps/servers/sites. At this level
                                          all tech staff on call would usually be notified.
    1     Alert          alert            Action must be taken immediately. Should be corrected
                                          immediately, so notify staff who can fix the problem.
                                          Example: loss of a primary ISP connection.
    2     Critical       crit             Critical conditions. Should be corrected immediately,
                                          but indicates failure in a secondary system. Example:
                                          loss of a backup ISP connection.
    3     Error          err (error)      Error conditions. Non-urgent failures; relay them to
                                          developers or admins. Each item must be resolved
                                          within a given time.
    4     Warning        warning (warn)   Warning conditions. Not an error, but an indication
                                          that an error will occur if no action is taken, e.g.
                                          file system 85% full. Each item must be resolved
                                          within a given time.
    5     Notice         notice           Normal but significant condition. Events that are
                                          unusual but not error conditions; might be summarized
                                          in an email to developers or admins to spot potential
                                          problems. No immediate action required.
    6     Informational  info             Informational messages. Normal operational messages;
                                          may be harvested for reporting, measuring throughput,
                                          etc. No action required.
    7     Debug          debug            Debug-level messages. Info useful to developers for
                                          debugging the application, not useful during
                                          operations.
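    A lower code means a more severe message, which matters when filtering: a traditional
    syslog.conf selector such as mail.warning matches warning and everything more severe
    (codes 0 to 4). A small Python sketch of that rule; Severity, ALIASES, parse_severity
    and at_least are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity codes from the table; a LOWER number means MORE severe."""
    EMERG = 0
    ALERT = 1
    CRIT = 2
    ERR = 3
    WARNING = 4
    NOTICE = 5
    INFO = 6
    DEBUG = 7


# Alternate keywords from the table, mapped to the canonical ones.
ALIASES = {"panic": "emerg", "error": "err", "warn": "warning"}


def parse_severity(keyword: str) -> Severity:
    """Turn a keyword such as 'warn' or 'crit' into a Severity."""
    name = ALIASES.get(keyword.lower(), keyword.lower())
    return Severity[name.upper()]


def at_least(severity: Severity, threshold: Severity) -> bool:
    """True if `severity` is as severe as `threshold` or more severe."""
    return severity <= threshold


threshold = parse_severity("warn")
print([s.name.lower() for s in Severity if at_least(s, threshold)])
# ['emerg', 'alert', 'crit', 'err', 'warning']
```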

  5. Anatomy of a syslog message
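    RFC 3164 (BSD syslog) messages have a PRI part, a HEADER (timestamp and source hostname or IP)
    and a MSG part (a TAG of at most 32 characters, terminated by any non-alphanumeric character,
    usually ':', '[' or a space, followed by the CONTENT), 1024 bytes or less in total:
    "<PRI>TIMESTAMP HOSTNAME TAG CONTENT". RFC 5424 defines the format in ABNF and only requires
    receivers to accept messages of up to 480 octets (2048 SHOULD be supported):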
    SYSLOG-MSG = HEADER SP STRUCTURED-DATA [SP MSG]
    HEADER = PRI VERSION SP TIMESTAMP SP HOSTNAME
    SP APP-NAME SP PROCID SP MSGID
    PRI = "<" PRIVAL ">"
    PRIVAL = 1*3DIGIT ; range 0 .. 191
    VERSION = NONZERO-DIGIT 0*2DIGIT
    HOSTNAME = NILVALUE / 1*255PRINTUSASCII
    APP-NAME = NILVALUE / 1*48PRINTUSASCII
    PROCID = NILVALUE / 1*128PRINTUSASCII
    MSGID = NILVALUE / 1*32PRINTUSASCII
    In RFC 3164 the TIMESTAMP field is local time in the format "Mmm dd hh:mm:ss" (without the
    quote marks); days below 10 are padded with a space, e.g. "Aug  3". Note there is no year
    (YYYY)! Unless the RFC 5424 format is used (its RFC 3339 timestamps include the year), the
    year has to go in the message itself.
    The format "TAG[pid]:" (without the quote marks) is common. In this case the left square
    bracket terminates the TAG field and is the first character of the CONTENT field. If the
    process id is immaterial, it may be left off.
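    To make the two layouts concrete, a Python sketch that renders one event both ways: the
    RFC 3164 line (space-padded day, no year) and an RFC 5424 line with "-" (NILVALUE) for
    empty fields and no structured data. The function names are hypothetical; the example
    uses a UTC time for simplicity (RFC 3164 expects local time), only length limits are
    checked, and the 1024-byte RFC 3164 limit is not enforced.

```python
from datetime import datetime, timezone

# Fixed English month names: RFC 3164 wants "Mmm" regardless of locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def pri(facility: int, severity: int) -> str:
    """PRI part: '<' + (facility * 8 + severity) + '>'."""
    if not (0 <= facility <= 23 and 0 <= severity <= 7):
        raise ValueError("facility must be 0..23 and severity 0..7")
    return f"<{facility * 8 + severity}>"


def format_rfc3164(facility, severity, ts, hostname, tag, content, pid=None):
    """<PRI>Mmm dd hh:mm:ss HOSTNAME TAG[pid]: CONTENT (no year in the timestamp)."""
    # Days below 10 are padded with a space, e.g. "Aug  3".
    stamp = f"{MONTHS[ts.month - 1]} {ts.day:2d} {ts:%H:%M:%S}"
    tag_part = f"{tag}[{pid}]:" if pid is not None else f"{tag}:"
    return f"{pri(facility, severity)}{stamp} {hostname} {tag_part} {content}"


def _field(value, max_len):
    """Header field: NILVALUE '-' when empty, else length-checked (chars not validated)."""
    if value is None or value == "":
        return "-"
    text = str(value)
    if len(text) > max_len:
        raise ValueError(f"field longer than {max_len} characters: {text!r}")
    return text


def format_rfc5424(facility, severity, ts, hostname, app_name, procid, msgid,
                   msg=None):
    """<PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA [MSG]."""
    if ts.tzinfo is None:
        raise ValueError("RFC 5424 timestamps need a UTC offset")
    header = " ".join([
        f"{pri(facility, severity)}1",  # PRI immediately followed by VERSION 1
        ts.isoformat(),                 # RFC 3339 timestamp, includes the year
        _field(hostname, 255), _field(app_name, 48),
        _field(procid, 128), _field(msgid, 32),
    ])
    line = f"{header} -"                # STRUCTURED-DATA left as NILVALUE
    return line if msg is None else f"{line} {msg}"


ts = datetime(2013, 8, 13, 10, 15, 2, tzinfo=timezone.utc)
print(format_rfc3164(20, 5, ts, "myhost", "sshd", "Accepted publickey", pid=1234))
# <165>Aug 13 10:15:02 myhost sshd[1234]: Accepted publickey
print(format_rfc5424(20, 5, ts, "myhost", "sshd", 1234, None, "Accepted publickey"))
# <165>1 2013-08-13T10:15:02+00:00 myhost sshd 1234 - - Accepted publickey
```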

  6. Anatomy of a syslog message (pt 2)
    The Priority value is calculated by:
    1. multiplying the Facility number by 8, and then
    2. adding the numerical value of the Severity.
    For example, a kernel message (Facility=0) with a Severity of Emergency (Severity=0) would have
    a Priority value of 0. A "local use 4" message (Facility=20) with a Severity of Notice
    (Severity=5) would have a Priority value of 20 * 8 + 5 = 165. In the PRI part of a syslog
    message, these values would be placed between the angle brackets as <0> and <165> respectively.
    The only time a value of "0" will follow the "<" is for the Priority value of "0". Otherwise,
    leading "0"s MUST NOT be used.
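    The arithmetic above and its inverse, as a Python sketch: because the severity is always
    below 8, divmod(pri, 8) recovers (facility, severity). The parser also enforces the
    no-leading-zeros rule and the 0..191 range from the RFC 5424 ABNF (191 = 23 * 8 + 7).
    encode_pri and decode_pri are illustrative names.

```python
import re

# "<" then 1-3 digits then ">" at the very start of a message.
_PRI_RE = re.compile(r"<(\d{1,3})>")


def encode_pri(facility: int, severity: int) -> int:
    """Priority = Facility * 8 + Severity."""
    if not 0 <= facility <= 23:
        raise ValueError(f"facility out of range: {facility}")
    if not 0 <= severity <= 7:
        raise ValueError(f"severity out of range: {severity}")
    return facility * 8 + severity


def decode_pri(message: str) -> tuple[int, int]:
    """Read the <PRI> prefix of a message and return (facility, severity)."""
    m = _PRI_RE.match(message)
    if m is None:
        raise ValueError("message does not start with <PRI>")
    digits = m.group(1)
    # A leading zero is only allowed for the value 0 itself ("<0>").
    if len(digits) > 1 and digits.startswith("0"):
        raise ValueError(f"leading zero in PRI: <{digits}>")
    value = int(digits)
    if value > 191:
        raise ValueError(f"PRI out of range 0..191: {value}")
    # Quotient is the facility, remainder is the severity.
    return divmod(value, 8)


print(encode_pri(20, 5))                                       # 165
print(decode_pri("<165>Aug 13 10:15:02 myhost sshd[1234]: hi"))  # (20, 5)
```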

  7. Diversion: What logs do we have?
    - tsm/ship - What are these?
    - imapd, mail (sendmail)
    - shib (idp-*, shibd, transaction, native)
    - torque, listserv (listserv.log), catalina.out
    - www (access, error, suexec)
    - kern, messages, syslog, auth, local, yum,
    transaction, anaconda, up2date
    - mod_jk (cm), slapd, net-snmpd (xen)
    - handful of others, but legacy

  8. Phase 1: Convert everything to a
    remote syslog architecture
    - Local files plus remote syslog
    - Local files rotate (keep the last N days); a permanent
    archive copy can be kept on the remote system
    - Remote files could be a rough facsimile of the
    current /nerdc/log setup
    - No more log bale (do we really need it???)
    - Sane permissions on /nerdc/log subdirectories
    for everything possible. Or not, and just allow
    searching.
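    As an application-side analogue of the same idea (not the rsyslog configuration itself),
    Python's standard logging module can keep a local file rotated at midnight with only the
    last N days retained, while sending a copy to a remote syslog host over UDP. This is a
    sketch: the host, port, facility (local4), retention and build_logger name are all
    placeholders. SysLogHandler sends only "<PRI>" plus the formatted message, with no RFC 3164
    timestamp or hostname, so in the proposed setup apps would more likely log to the local
    rsyslog and let it forward.

```python
import logging
import logging.handlers


def build_logger(name, local_path, remote_host, remote_port=514, keep_days=14):
    """Log locally (rotated, last `keep_days` days) and copy to a remote syslog host."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured; avoid duplicate output
        return logger
    logger.setLevel(logging.INFO)

    # Local copy: start a new file at midnight, keep only `keep_days` old ones.
    local = logging.handlers.TimedRotatingFileHandler(
        local_path, when="midnight", backupCount=keep_days)
    local.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s[%(process)d]: %(levelname)s %(message)s"))

    # Remote copy: "<PRI>" + formatted message, sent over UDP (the default).
    remote = logging.handlers.SysLogHandler(
        address=(remote_host, remote_port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL4)
    remote.setFormatter(logging.Formatter("%(name)s[%(process)d]: %(message)s"))

    logger.addHandler(local)
    logger.addHandler(remote)
    return logger


# 127.0.0.1 stands in for the real log host.
log = build_logger("demo", "demo.log", "127.0.0.1")
log.info("hello from phase 1")
```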

  9. Phase 2: Parse deep!
    - Logstash is an abstraction layer over
    logs coming out of rsyslog
    - It can parse and annotate logs, add context
    - It can also be used to load log data into
    fancier backends like ElasticSearch
    - ElasticSearch for the last N days of logs.
    - Kibana as a nice UI for ElasticSearch.
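    Logstash does this with its own config language and grok patterns. Purely to show the
    idea (parse a line into fields, annotate it with context, emit one JSON document per event
    into daily indices, drop indices older than N days), here is a Python stand-in. The regex,
    the "service" annotation and all function names are assumptions, not Logstash's actual
    patterns or API; only the "_grokparsefailure" tag and the logstash-YYYY.MM.DD index naming
    mirror Logstash defaults.

```python
import json
import re
from datetime import date, timedelta

# Hypothetical stand-in for a grok pattern on a BSD-style syslog line:
# <PRI>Mmm dd hh:mm:ss HOST PROGRAM[PID]: MESSAGE
SYSLOG_LINE = re.compile(
    r"<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)"
)


def parse_and_annotate(line, received, extra):
    """Split a syslog line into named fields and add context from `extra`."""
    m = SYSLOG_LINE.match(line)
    if m is None:
        # Keep lines that do not parse, tagged the way Logstash's grok filter does.
        return {"message": line, "tags": ["_grokparsefailure"], **extra}
    doc = m.groupdict()
    doc["facility"], doc["severity"] = divmod(int(doc.pop("pri")), 8)
    doc["year"] = received.year  # RFC 3164 timestamps carry no year
    doc.update(extra)            # e.g. which service the host belongs to
    return doc


def index_name(day):
    """One index per day, named like Logstash's default 'logstash-YYYY.MM.DD'."""
    return f"logstash-{day:%Y.%m.%d}"


def indices_to_drop(existing, today, keep_days):
    """Daily indices outside the last `keep_days` days (today included)."""
    keep = {index_name(today - timedelta(days=i)) for i in range(keep_days)}
    return sorted(n for n in existing if n.startswith("logstash-") and n not in keep)


line = "<165>Aug 13 10:15:02 web1 sshd[1234]: Accepted publickey for alice"
doc = parse_and_annotate(line, date(2013, 8, 13), {"service": "www"})
print(index_name(date(2013, 8, 13)), json.dumps(doc))
```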

  10. Kibana + Elastic Search

  11. Kibana + Elastic Search