Lessons from High Velocity Logging

Lessons from High Velocity Logging

Talk for YAPC::NA 2015 discussing the lessons learned from shipping, parsing, storing, and using logs!

8d96f5c273062cb617255e630fe0705c?s=128

Brad Lhotsky

June 09, 2015
Tweet

Transcript

  1. Lessons from High Velocity Logging Presented by Brad Lhotsky

  2. Brad Lhotsky • Infrastructure Monitoring & Security
 at Booking.com •

    Recovering • Perl Programmer • Linux Systems Admin • Network Security Specialist • PostgreSQL Administrator • ElasticSearch Janitor • DNS Voyeur • Starving Author • OSSEC Core Team Member https://github.com/reyjrar https://twitter.com/reyjrar
  3. Amazing People Wouter Clarie, Willem Gerber, Lucian Grosu, Stefan Boronea,

    Wijnand Modderman-Lensta, Mickey Nasriachi, Mattia Barbon, Alexey Shatlovsky, Vadim Sukhanov, Daniel Ostermeier, Samit Badle, Tim Pickett, Nick Perez
  4. Expectations ‣Lessons from Shipping Logs ‣Lessons from Parsing Logs ‣Lessons

    from Storing Logs ‣Lessons from Using Logs
  5. Assumptions ‣Business Intelligence covered ‣Running Linux servers ‣Developed by Security

    Engineers Your Mileage May Vary
  6. SHIPPING LOGS at WEB SCALE!

  7. Syslog Transport ‣Syslog is well-known transport ‣Syslog-ng installed on all

    servers ‣Don’t do that, use rsyslog ‣Remember syslog() syscall blocks
  8. Syslog-ng source s_apache { file("/logs/access.log" follow_freq(1)); }; filter f_junk {

    not(match("favicon.ico")); }; destination d_remote { tcp("1.2.3.4" port(514)); }; log { source(s_apache); filter(f_junk); destination(d_remote); }; So Pretty! Any one see any bugs? ‣“favicon.ico” are silently discarded ‣Reads “/logs/access.log” once per second ‣Only reads log_fetch_limit() lines each read log_fetch_limit() default is 10
  9. Configure rsyslog to: ‣Queue messages locally if the remote server

    dies ‣Sends all data over TCP ... ‣Encrypted ... ‣With peer validation
  10. rsyslog # Configure GNU/TLS $DefaultNetstreamDriver gtls $DefaultNetstreamDriverCAFile /etc/pki/rsyslog/ca.pem $DefaultNetstreamDriverCertFile /etc/pki/rsyslog/client-cert.pem

    $DefaultNetstreamDriverKeyFile /etc/pki/rsyslog/client-key.pem $ActionQueueType LinkedList $ActionQueueFileName centralwork $ActionResumeRetryCount -1 $ActionQueueSaveOnShutdown on $ActionSendStreamDriverAuthMode x509/name $ActionSendStreamDriverPermittedPeer log.example.com $ActionSendStreamDriverMode 1 *.* @@1.2.3.4:514
  11. rsyslog outputs ‣Forwarding, Files, Sockets ‣Programs, SNMP, Email ‣PostgreSQL, MySQL,

    Oracle, DBI ‣ElasticSearch, MongoDB, Kafka, HDFS Now that's webscale!
  12. None
  13. Syslog Isn’t Web Scale 1. Apache Kafka 2. nanomsg 3.

    ZeroMQ, RabbitMQ, ActiveMQ, “${HOT}MQ” 4. Redis (just don’t tell anyone it’s not a queue) Syslog isn’t a Pub/Sub Queue
  14. Syslog Isn’t Web Scale ‣ Apache Flume ‣ Facebook’s Scribe

    Syslog isn’t a Logging Pipeline
  15. Syslog Isn’t Web Scale 1. Heka 2. Fluentd 3. LogStash

    Syslog isn’t a Logging Platform
  16. Syslog Isn’t Web Scale Syslog is boring technology. We need

    more boring technology. Simplify our end points. Understand our Infrastructure.
  17. Perl Isn’t Web Scale We did this in Perl with

    POE. It outperformed LogStash. 60,000 Messages/Second. Limited by ElasticSearch indexing speed.
  18. PARSING LOGS

  19. What Syslog Gets You ‣ When did the log happen?

    ‣ Which server sent the log? (use FQDN!) ‣ What level was the log? ‣ What program generated the log? ‣ What was the content of the log?
  20. Parse::Syslog::Line ‣ Order from chaos, logs to documents ‣ Fastest

    syslog parser on the CPAN ‣ Handle many variations on syslog formats ‣ Determine how much I can hate vendors Goals
  21. Parse::Syslog::Line Turns this: <11>Jan 1 00:00:00 router.example.com dhcpd[3428]: DHCPINFORM from

    172.16.2.137 via vlan3 Into this: --- content: DHCPINFORM from 172.16.2.137 via vlan3 date_str: 2015-01-01 00:00:00 datetime_str: 2015-01-01 00:00:00 domain: example.com epoch: 1420066800 facility: user facility_int: 8 host: router message: 'dhcpd[3428]: DHCPINFORM from 172.16.2.137 via vlan3' preamble: 11 priority: err priority_int: 3 program_name: dhcpd program_pid: 3428
  22. Parse::Syslog::Line ‣ Vendors never comply with standards ‣ Always Const::Fast,

    never Readonly.pm ‣ Good regexes are fast ‣ Untested regexes are dangerously slow ‣ Sometimes, perldoc perlre lies Lessons https://xkcd.com/1171/
  23. Regular Expressions? #!/usr/bin/env perl # use strict; use warnings; use

    Benchmark qw(cmpthese); my $re = qr/\d+/; my %RE = ( d => qr/\d+/ ); my @tests = <DATA>; printf "Running regex benchmark with Perl %s\n", $^V; cmpthese(500_000, { 'with o' => sub { /$RE{d}/o for @tests }, 'without o' => sub { /$RE{d}/ for @tests }, 'variable' => sub { /$re/ for @tests }, 'static' => sub { /\d+/ for @tests, }, });
  24. Regular Expressions? Running regex benchmark with Perl v5.22.0 Rate variable

    without o with o static variable 301205/s -- -4% -46% -54% without o 314465/s 4% -- -43% -52% with o 555556/s 84% 77% -- -14% static 649351/s 116% 106% 17% -- ‣ The %RE regex table is bad for performance ‣ The ‘o’ modifier can introduce bugs, but it can help optimize regular expressions in stored in variables
  25. Regular Expressions? for(@things) { if ( /$RE{IPv4}/o ) { print

    “Found IPv4!\n”; } } OK for(@things) { if ( /$RE{$type}/o ) { print “Found $type!\n”; } } NOT OK
  26. Logs as Documents <134>Jun 2 07:49:17 bob-03.lhr4.prod.booking.com www.booking.com: 107.150.42.90 -

    - [02/Jun/2015:07:49:17 +0200] \"GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5
  27. Adding Context ‣ Using FQDN, get data center? ‣ Well

    known log formats can be tokenized ‣ Find embedded JSON and parse it Do’s Shouldn’ts ‣ Bad (read: untested) Regex ‣ HTTP API Calls ‣ DNS Lookups
  28. Logs as Documents --- '@fields': crit: 404 dc: LHR4 domain:

    lhr4.prod.booking.com dst: www.booking.com facility: local0 file: /plus/search.php filetype: php hostname: bob-03 method: GET out_bytes: 172775 program: apache protocol: HTTP/1.1 referer: "http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" resource: "/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit +0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" response_ms: 368 service: web severity: info src_ip: 107.150.42.90 status: error upstream_ms: 368 useragent: baidu; baiduspider; zone: prod '@message': "www.booking.com: 107.150.42.90 - - [02/Jun/2015:07:49:17 +0200] \"GET /plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5" '@source_host': bob-03.lhr4.prod.booking.com '@timestamp': 2015-06-02T07:49:17+0200
  29. Extract More Meta ‣ Parse URI’s for their parts ‣

    Extract relevant fields from URI parameters ‣ Simple attack detection using common words ‣ GeoIP from a local DB
  30. --- '@fields': attack_generic_score: 16 attack_score: 340 attack_sqli_score: 18 attack_triggers: '*/

    | */ | */ | */ | */ | */ | */ | */ | /* | /* | / * | /* | /* | /* | /* | /* | concat | concat | rand | select | select | select' referer_domain: www.booking.com referer_file: /plus/search.php src_ip_country: US src_ip_location: '39.0997,-94.5786' Now That’s Meta?
  31. Lessons Learned ‣ Test the performance of the system ‣

    Work with the data you have ‣ Tokenize the data into a set of well defined fields
  32. STORING LOGS

  33. Lessons from Breaking ElasticSearch ‣ 30 GB Heap ‣ Leave

    25% Memory for FS Cache ‣ Google "Call Me Maybe ElasticSearch” ‣ Separate master and data nodes ‣ Name space indexes based on retention
  34. ElasticSearch Shards ‣ Transaction Log ‣ Lucene Segments ‣ Optimize

    Condenses Segments
  35. Optimized Shards ‣ Fewer segments ‣ Less Mapping ‣ Less

    Reducing ‣ Moar Web Scale
  36. elasticsearch.yml # Disable the use of Memory-mapped files index.store.type: niofs

    # Indexing Settings for Writes index.refresh_interval: 30s index.merge.scheduler.max_thread_count: 1 index.translog.flush_threshold_size: 1gb # Field Data Cache indices.fielddata.cache.size: 40%
  37. Index Layouts

  38. Index Optimizations ‣ Split Indexes in functional namespaces ‣ Use

    Dynamic Mappings (Schemas) ‣ Never use auto_expand_replicas ‣ Fewer replicas yield faster writes
  39. Document Optimizations ‣ Each doc uses ~8 bytes of memory

    ‣ Analyzing fields is expensive ‣ Agree on a “field set” for your documents ‣ CEE, CEF, XDAS, GELF, RELP, SDEE, etc ‣ Invent your own
  40. Monitor and Maintain ‣ Delete old indexes ‣ Degrade replicas

    ‣ Close unused indexes ‣ Disable bloom filters ‣ Monitor key metrics
  41. $ es-daily-index-maintenance.pl --base logstash \ --close --close-days 90 \ --delete

    --delete-days 180 \ --replicas --replicas-min 0 \ --replicas-max 2 \ --replicas-age 7 App::ElasticSearch::Utilities Maintain Indexes https://github.com/reyjrar/es-utils
  42. App::ElasticSearch::Utilities Maintain Aliases # es-alias-manager.pl --config /etc/es-aliases.yml --- logstash: pattern:

    \*-logstash-{{DATE}} daily: logstash-{{DATE}} relative: alias: logstash-{{PERIOD}} periods: weekly: from: days: 7 to: days: 0 https://github.com/reyjrar/es-utils
  43. Index Aliases

  44. $ es-graphite-dynamic.pl App::ElasticSearch::Utilities Monitor https://github.com/reyjrar/es-utils

  45. USING LOGS

  46. General Web Overview

  47. Web Attacks Scanners

  48. Sudoers Wall of Shame

  49. Graphite Abusers

  50. Slow Pages

  51. App::ElasticSearch::Utilities Search Stuff! = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp src_ip src_ip_country

    file 2015-06-03T11:39:27+0200 217.36.201.217 GB /B1D671CF- E532-4481-99AA-19F420D90332/netdefender/hui/ndhui.css 2015-06-03T11:39:26+0200 92.56.217.84 ES /hotel/es/ null.es.html # Search Parameters: # {"query_string":{"query":"dst:www.booking.com AND crit:404"}} # Displaying 3 of (CENSORED) in 0 seconds. # Indexes (2 of 4) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 https://github.com/reyjrar/es-utils $ es-search.pl --base access dst:www.booking.com and crit:404 \ --show src_ip,src_ip_country,file --size 2
  52. Aggregate Stuff https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --size 3 = Querying Indexes: ams4-access-2015.06.03,lhr4-access-2015.06.03 count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65 # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 5 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 # # Totals across batch # count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65
  53. Find Pages Viewed by Most Countries https://github.com/reyjrar/es-utils $ es-search.pl --base

    access --days 1 dst:www.booking.com \ --top file --by cardinality:src_ip_country --size 3 = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 cardinality:src_ip_country count file 239 (CENSORED) / 236 (CENSORED) /rt_data/city_bookings 234 (CENSORED) /wishlist/get # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 21 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
  54. Pipeline Queries https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --by sum:attack_score --size 3 \ --data-file top_attackers.dat $ es-search.pl --base access --days 1 dst:www.booking.com \ src_ip:top_attackers.dat[-1] --size 3\ --show attack_score,src_ip,crit,dst,method,resource \ --sort attack_score:desc
  55. Pipeline Queries https://github.com/reyjrar/es-utils = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp attack_score src_ip

    crit dst method resource 2015-06-03T04:20:59+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T00:50:43+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T05:18:19+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a # Search Parameters: # {"terms":{"src_ip":["107.150.42.90","37.59.7.157","74.84.138.120"]}} # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of 793 in 0 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
  56. Thank You! Presented by: Brad Lhotsky https://twitter.com/reyjrar https://github.com/reyjrar Made possible

    by the support of and viewers like you!