Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons from High Velocity Logging

Lessons from High Velocity Logging

Talk for YAPC::NA 2015 discussing the lessons learned from shipping, parsing, storing, and using logs!

Brad Lhotsky

June 09, 2015
Tweet

More Decks by Brad Lhotsky

Other Decks in Technology

Transcript

  1. Lessons from High
    Velocity Logging
    Presented by Brad Lhotsky

    View Slide

  2. Brad Lhotsky
    • Infrastructure Monitoring & Security

    at Booking.com
    • Recovering
    • Perl Programmer
    • Linux Systems Admin
    • Network Security Specialist
    • PostgreSQL Administrator
    • ElasticSearch Janitor
    • DNS Voyeur
    • Starving Author
    • OSSEC Core Team Member
    https://github.com/reyjrar
    https://twitter.com/reyjrar

    View Slide

  3. Amazing People
    Wouter Clarie, Willem Gerber, Lucian Grosu,
    Stefan Boronea, Wijnand Modderman-Lensta,
    Mickey Nasriachi, Mattia Barbon, Alexey
    Shatlovsky, Vadim Sukhanov, Daniel Ostermeier,
    Samit Badle, Tim Pickett, Nick Perez

    View Slide

  4. Expectations
    ‣Lessons from Shipping Logs
    ‣Lessons from Parsing Logs
    ‣Lessons from Storing Logs
    ‣Lessons from Using Logs

    View Slide

  5. Assumptions
    ‣Business Intelligence covered
    ‣Running Linux servers
    ‣Developed by Security Engineers
    Your Mileage May Vary

    View Slide

  6. SHIPPING LOGS
    at WEB SCALE!

    View Slide

  7. Syslog Transport
    ‣Syslog is well-known transport
    ‣Syslog-ng installed on all servers
    ‣Don’t do that, use rsyslog
    ‣Remember syslog() syscall blocks

    View Slide

  8. Syslog-ng
    source s_apache { file("/logs/access.log" follow_freq(1)); };
    filter f_junk { not(match("favicon.ico")); };
    destination d_remote { tcp("1.2.3.4" port(514)); };
    log { source(s_apache); filter(f_junk); destination(d_remote); };
    So Pretty! Any one see any bugs?
    ‣“favicon.ico” are silently discarded
    ‣Reads “/logs/access.log” once per second
    ‣Only reads log_fetch_limit() lines each read
    log_fetch_limit() default is 10

    View Slide

  9. Configure rsyslog to:
    ‣Queue messages locally
    if the remote server dies
    ‣Sends all data over
    TCP ...
    ‣Encrypted ...
    ‣With peer validation

    View Slide

  10. rsyslog
    # Configure GNU/TLS
    $DefaultNetstreamDriver gtls
    $DefaultNetstreamDriverCAFile /etc/pki/rsyslog/ca.pem
    $DefaultNetstreamDriverCertFile /etc/pki/rsyslog/client-cert.pem
    $DefaultNetstreamDriverKeyFile /etc/pki/rsyslog/client-key.pem
    $ActionQueueType LinkedList
    $ActionQueueFileName centralwork
    $ActionResumeRetryCount -1
    $ActionQueueSaveOnShutdown on
    $ActionSendStreamDriverAuthMode x509/name
    $ActionSendStreamDriverPermittedPeer log.example.com
    $ActionSendStreamDriverMode 1
    *.* @@1.2.3.4:514

    View Slide

  11. rsyslog outputs
    ‣Forwarding, Files, Sockets
    ‣Programs, SNMP, Email
    ‣PostgreSQL, MySQL, Oracle,
    DBI
    ‣ElasticSearch, MongoDB,
    Kafka, HDFS
    Now that's webscale!

    View Slide

  12. View Slide

  13. Syslog Isn’t Web Scale
    1. Apache Kafka
    2. nanomsg
    3. ZeroMQ, RabbitMQ, ActiveMQ, “${HOT}MQ”
    4. Redis (just don’t tell anyone it’s not a queue)
    Syslog isn’t a Pub/Sub Queue

    View Slide

  14. Syslog Isn’t Web Scale
    ‣ Apache Flume
    ‣ Facebook’s Scribe
    Syslog isn’t a Logging Pipeline

    View Slide

  15. Syslog Isn’t Web Scale
    1. Heka
    2. Fluentd
    3. LogStash
    Syslog isn’t a Logging Platform

    View Slide

  16. Syslog Isn’t Web Scale
    Syslog is boring technology.
    We need more boring technology.
    Simplify our end points.
    Understand our Infrastructure.

    View Slide

  17. Perl Isn’t Web Scale
    We did this in Perl with POE.
    It outperformed LogStash.
    60,000 Messages/Second.
    Limited by ElasticSearch indexing speed.

    View Slide

  18. PARSING LOGS

    View Slide

  19. What Syslog Gets You
    ‣ When did the log happen?
    ‣ Which server sent the log? (use FQDN!)
    ‣ What level was the log?
    ‣ What program generated the log?
    ‣ What was the content of the log?

    View Slide

  20. Parse::Syslog::Line
    ‣ Order from chaos, logs to documents
    ‣ Fastest syslog parser on the CPAN
    ‣ Handle many variations on syslog formats
    ‣ Determine how much I can hate vendors
    Goals

    View Slide

  21. Parse::Syslog::Line
    Turns this:
    <11>Jan 1 00:00:00 router.example.com dhcpd[3428]: DHCPINFORM from
    172.16.2.137 via vlan3
    Into this:
    ---
    content: DHCPINFORM from 172.16.2.137 via vlan3
    date_str: 2015-01-01 00:00:00
    datetime_str: 2015-01-01 00:00:00
    domain: example.com
    epoch: 1420066800
    facility: user
    facility_int: 8
    host: router
    message: 'dhcpd[3428]: DHCPINFORM from 172.16.2.137 via vlan3'
    preamble: 11
    priority: err
    priority_int: 3
    program_name: dhcpd
    program_pid: 3428

    View Slide

  22. Parse::Syslog::Line
    ‣ Vendors never comply with standards
    ‣ Always Const::Fast, never Readonly.pm
    ‣ Good regexes are fast
    ‣ Untested regexes are dangerously slow
    ‣ Sometimes, perldoc perlre lies
    Lessons
    https://xkcd.com/1171/

    View Slide

  23. Regular Expressions?
    #!/usr/bin/env perl
    #
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    my $re = qr/\d+/;
    my %RE = ( d => qr/\d+/ );
    my @tests = ;
    printf "Running regex benchmark with Perl %s\n", $^V;
    cmpthese(500_000, {
    'with o' => sub { /$RE{d}/o for @tests },
    'without o' => sub { /$RE{d}/ for @tests },
    'variable' => sub { /$re/ for @tests },
    'static' => sub { /\d+/ for @tests, },
    });

    View Slide

  24. Regular Expressions?
    Running regex benchmark with Perl v5.22.0
    Rate variable without o with o static
    variable 301205/s -- -4% -46% -54%
    without o 314465/s 4% -- -43% -52%
    with o 555556/s 84% 77% -- -14%
    static 649351/s 116% 106% 17% --
    ‣ The %RE regex table is bad for performance
    ‣ The ‘o’ modifier can introduce bugs, but it can
    help optimize regular expressions in stored in
    variables

    View Slide

  25. Regular Expressions?
    for(@things) {
    if ( /$RE{IPv4}/o ) {
    print “Found IPv4!\n”;
    }
    }
    OK
    for(@things) {
    if ( /$RE{$type}/o ) {
    print “Found $type!\n”;
    }
    }
    NOT OK

    View Slide

  26. Logs as Documents
    <134>Jun 2 07:49:17 bob-03.lhr4.prod.booking.com www.booking.com:
    107.150.42.90 - - [02/Jun/2015:07:49:17 +0200] \"GET /plus/
    search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!
    50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!
    50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!
    50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`
    %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/
    +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a
    HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/
    search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!
    50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!
    50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!
    50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`
    %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/
    +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a\"
    \"baidu; baiduspider;\" - 556D43DD00922CB5

    View Slide

  27. Adding Context
    ‣ Using FQDN, get data center?
    ‣ Well known log formats can be tokenized
    ‣ Find embedded JSON and parse it
    Do’s
    Shouldn’ts
    ‣ Bad (read: untested) Regex
    ‣ HTTP API Calls
    ‣ DNS Lookups

    View Slide

  28. Logs as Documents
    ---
    '@fields':
    crit: 404
    dc: LHR4
    domain: lhr4.prod.booking.com
    dst: www.booking.com
    facility: local0
    file: /plus/search.php
    filetype: php
    hostname: bob-03
    method: GET
    out_bytes: 172775
    program: apache
    protocol: HTTP/1.1
    referer: "http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!
    50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`
    %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a"
    resource: "/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!
    50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit
    +0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a"
    response_ms: 368
    service: web
    severity: info
    src_ip: 107.150.42.90
    status: error
    upstream_ms: 368
    useragent: baidu; baiduspider;
    zone: prod
    '@message': "www.booking.com: 107.150.42.90 - - [02/Jun/2015:07:49:17 +0200] \"GET /plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!
    50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/
    +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/
    +by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!
    50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/
    +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/
    +by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5"
    '@source_host': bob-03.lhr4.prod.booking.com
    '@timestamp': 2015-06-02T07:49:17+0200

    View Slide

  29. Extract More Meta
    ‣ Parse URI’s for their parts
    ‣ Extract relevant fields from URI parameters
    ‣ Simple attack detection using common words
    ‣ GeoIP from a local DB

    View Slide

  30. ---
    '@fields':
    attack_generic_score: 16
    attack_score: 340
    attack_sqli_score: 18
    attack_triggers: '*/ | */ | */ | */ | */ | */ | */ | */ | /* | /* | /
    * | /* | /* | /* | /* | /* | concat | concat | rand | select | select |
    select'
    referer_domain: www.booking.com
    referer_file: /plus/search.php
    src_ip_country: US
    src_ip_location: '39.0997,-94.5786'
    Now That’s Meta?

    View Slide

  31. Lessons Learned
    ‣ Test the performance of the system
    ‣ Work with the data you have
    ‣ Tokenize the data into a set of well defined fields

    View Slide

  32. STORING LOGS

    View Slide

  33. Lessons from Breaking
    ElasticSearch
    ‣ 30 GB Heap
    ‣ Leave 25% Memory for FS Cache
    ‣ Google "Call Me Maybe ElasticSearch”
    ‣ Separate master and data nodes
    ‣ Name space indexes based on retention

    View Slide

  34. ElasticSearch Shards
    ‣ Transaction Log
    ‣ Lucene Segments
    ‣ Optimize Condenses
    Segments

    View Slide

  35. Optimized Shards
    ‣ Fewer segments
    ‣ Less Mapping
    ‣ Less Reducing
    ‣ Moar Web Scale

    View Slide

  36. elasticsearch.yml
    # Disable the use of Memory-mapped files
    index.store.type: niofs
    # Indexing Settings for Writes
    index.refresh_interval: 30s
    index.merge.scheduler.max_thread_count: 1
    index.translog.flush_threshold_size: 1gb
    # Field Data Cache
    indices.fielddata.cache.size: 40%

    View Slide

  37. Index Layouts

    View Slide

  38. Index Optimizations
    ‣ Split Indexes in functional namespaces
    ‣ Use Dynamic Mappings (Schemas)
    ‣ Never use auto_expand_replicas
    ‣ Fewer replicas yield faster writes

    View Slide

  39. Document Optimizations
    ‣ Each doc uses ~8 bytes of memory
    ‣ Analyzing fields is expensive
    ‣ Agree on a “field set” for your documents
    ‣ CEE, CEF, XDAS, GELF, RELP, SDEE, etc
    ‣ Invent your own

    View Slide

  40. Monitor and Maintain
    ‣ Delete old indexes
    ‣ Degrade replicas
    ‣ Close unused indexes
    ‣ Disable bloom filters
    ‣ Monitor key metrics

    View Slide

  41. $ es-daily-index-maintenance.pl --base logstash \
    --close --close-days 90 \
    --delete --delete-days 180 \
    --replicas --replicas-min 0 \
    --replicas-max 2 \
    --replicas-age 7
    App::ElasticSearch::Utilities
    Maintain Indexes
    https://github.com/reyjrar/es-utils

    View Slide

  42. App::ElasticSearch::Utilities
    Maintain Aliases
    # es-alias-manager.pl --config /etc/es-aliases.yml
    ---
    logstash:
    pattern: \*-logstash-{{DATE}}
    daily: logstash-{{DATE}}
    relative:
    alias: logstash-{{PERIOD}}
    periods:
    weekly:
    from:
    days: 7
    to:
    days: 0
    https://github.com/reyjrar/es-utils

    View Slide

  43. Index Aliases

    View Slide

  44. $ es-graphite-dynamic.pl
    App::ElasticSearch::Utilities
    Monitor
    https://github.com/reyjrar/es-utils

    View Slide

  45. USING LOGS

    View Slide

  46. General Web Overview

    View Slide

  47. Web Attacks Scanners

    View Slide

  48. Sudoers Wall of Shame

    View Slide

  49. Graphite Abusers

    View Slide

  50. Slow Pages

    View Slide

  51. App::ElasticSearch::Utilities
    Search Stuff!
    = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03
    @timestamp src_ip src_ip_country file
    2015-06-03T11:39:27+0200 217.36.201.217 GB /B1D671CF-
    E532-4481-99AA-19F420D90332/netdefender/hui/ndhui.css
    2015-06-03T11:39:26+0200 92.56.217.84 ES /hotel/es/
    null.es.html
    # Search Parameters:
    # {"query_string":{"query":"dst:www.booking.com AND crit:404"}}
    # Displaying 3 of (CENSORED) in 0 seconds.
    # Indexes (2 of 4) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
    https://github.com/reyjrar/es-utils
    $ es-search.pl --base access dst:www.booking.com and crit:404 \
    --show src_ip,src_ip_country,file --size 2

    View Slide

  52. Aggregate Stuff
    https://github.com/reyjrar/es-utils
    $ es-search.pl --base access --days 1 dst:www.booking.com \
    --top src_ip --size 3
    = Querying Indexes: ams4-access-2015.06.03,lhr4-access-2015.06.03
    count src_ip
    (CENSORED) 66.249.92.71
    (CENSORED) 66.249.92.59
    (CENSORED) 66.249.92.65
    # Search Parameters:
    # {"query_string":{"query":"dst:www.booking.com"}}
    # Displaying 3 of (CENSORED) in 5 seconds.
    # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
    #
    # Totals across batch
    #
    count src_ip
    (CENSORED) 66.249.92.71
    (CENSORED) 66.249.92.59
    (CENSORED) 66.249.92.65

    View Slide

  53. Find Pages Viewed by Most Countries
    https://github.com/reyjrar/es-utils
    $ es-search.pl --base access --days 1 dst:www.booking.com \
    --top file --by cardinality:src_ip_country --size 3
    = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03
    cardinality:src_ip_country count file
    239 (CENSORED) /
    236 (CENSORED) /rt_data/city_bookings
    234 (CENSORED) /wishlist/get
    # Search Parameters:
    # {"query_string":{"query":"dst:www.booking.com"}}
    # Displaying 3 of (CENSORED) in 21 seconds.
    # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03

    View Slide

  54. Pipeline Queries
    https://github.com/reyjrar/es-utils
    $ es-search.pl --base access --days 1 dst:www.booking.com \
    --top src_ip --by sum:attack_score --size 3 \
    --data-file top_attackers.dat
    $ es-search.pl --base access --days 1 dst:www.booking.com \
    src_ip:top_attackers.dat[-1] --size 3\
    --show attack_score,src_ip,crit,dst,method,resource \
    --sort attack_score:desc

    View Slide

  55. Pipeline Queries
    https://github.com/reyjrar/es-utils
    = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03
    @timestamp attack_score src_ip crit dst method resource
    2015-06-03T04:20:59+0200 340 107.150.42.90 404 www.booking.com GET /plus/
    search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!
    50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/
    +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!
    50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a
    2015-06-03T00:50:43+0200 340 107.150.42.90 404 www.booking.com GET /plus/
    search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!
    50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/
    +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!
    50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a
    2015-06-03T05:18:19+0200 340 107.150.42.90 404 www.booking.com GET /plus/
    search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!
    50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/
    +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!
    50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a
    # Search Parameters:
    # {"terms":{"src_ip":["107.150.42.90","37.59.7.157","74.84.138.120"]}}
    # {"query_string":{"query":"dst:www.booking.com"}}
    # Displaying 3 of 793 in 0 seconds.
    # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03

    View Slide

  56. Thank You!
    Presented by:
    Brad Lhotsky
    https://twitter.com/reyjrar
    https://github.com/reyjrar
    Made possible by the support of
    and viewers like you!

    View Slide