Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons from High Velocity Logging

Lessons from High Velocity Logging

Talk for YAPC::NA 2015 discussing the lessons learned from shipping, parsing, storing, and using logs!

Brad Lhotsky

June 09, 2015
Tweet

More Decks by Brad Lhotsky

Other Decks in Technology

Transcript

  1. Brad Lhotsky • Infrastructure Monitoring & Security
 at Booking.com •

    Recovering • Perl Programmer • Linux Systems Admin • Network Security Specialist • PostgreSQL Administrator • ElasticSearch Janitor • DNS Voyeur • Starving Author • OSSEC Core Team Member https://github.com/reyjrar https://twitter.com/reyjrar
  2. Amazing People Wouter Clarie, Willem Gerber, Lucian Grosu, Stefan Boronea,

    Wijnand Modderman-Lensta, Mickey Nasriachi, Mattia Barbon, Alexey Shatlovsky, Vadim Sukhanov, Daniel Ostermeier, Samit Badle, Tim Pickett, Nick Perez
  3. Syslog Transport ‣Syslog is well-known transport ‣Syslog-ng installed on all

    servers ‣Don’t do that, use rsyslog ‣Remember syslog() syscall blocks
  4. Syslog-ng source s_apache { file("/logs/access.log" follow_freq(1)); }; filter f_junk {

    not(match("favicon.ico")); }; destination d_remote { tcp("1.2.3.4" port(514)); }; log { source(s_apache); filter(f_junk); destination(d_remote); }; So Pretty! Any one see any bugs? ‣“favicon.ico” are silently discarded ‣Reads “/logs/access.log” once per second ‣Only reads log_fetch_limit() lines each read log_fetch_limit() default is 10
  5. Configure rsyslog to: ‣Queue messages locally if the remote server

    dies ‣Sends all data over TCP ... ‣Encrypted ... ‣With peer validation
  6. rsyslog # Configure GNU/TLS $DefaultNetstreamDriver gtls $DefaultNetstreamDriverCAFile /etc/pki/rsyslog/ca.pem $DefaultNetstreamDriverCertFile /etc/pki/rsyslog/client-cert.pem

    $DefaultNetstreamDriverKeyFile /etc/pki/rsyslog/client-key.pem $ActionQueueType LinkedList $ActionQueueFileName centralwork $ActionResumeRetryCount -1 $ActionQueueSaveOnShutdown on $ActionSendStreamDriverAuthMode x509/name $ActionSendStreamDriverPermittedPeer log.example.com $ActionSendStreamDriverMode 1 *.* @@1.2.3.4:514
  7. rsyslog outputs ‣Forwarding, Files, Sockets ‣Programs, SNMP, Email ‣PostgreSQL, MySQL,

    Oracle, DBI ‣ElasticSearch, MongoDB, Kafka, HDFS Now that's webscale!
  8. Syslog Isn’t Web Scale 1. Apache Kafka 2. nanomsg 3.

    ZeroMQ, RabbitMQ, ActiveMQ, “${HOT}MQ” 4. Redis (just don’t tell anyone it’s not a queue) Syslog isn’t a Pub/Sub Queue
  9. Syslog Isn’t Web Scale Syslog is boring technology. We need

    more boring technology. Simplify our end points. Understand our Infrastructure.
  10. Perl Isn’t Web Scale We did this in Perl with

    POE. It outperformed LogStash. 60,000 Messages/Second. Limited by ElasticSearch indexing speed.
  11. What Syslog Gets You ‣ When did the log happen?

    ‣ Which server sent the log? (use FQDN!) ‣ What level was the log? ‣ What program generated the log? ‣ What was the content of the log?
  12. Parse::Syslog::Line ‣ Order from chaos, logs to documents ‣ Fastest

    syslog parser on the CPAN ‣ Handle many variations on syslog formats ‣ Determine how much I can hate vendors Goals
  13. Parse::Syslog::Line Turns this: <11>Jan 1 00:00:00 router.example.com dhcpd[3428]: DHCPINFORM from

    172.16.2.137 via vlan3 Into this: --- content: DHCPINFORM from 172.16.2.137 via vlan3 date_str: 2015-01-01 00:00:00 datetime_str: 2015-01-01 00:00:00 domain: example.com epoch: 1420066800 facility: user facility_int: 8 host: router message: 'dhcpd[3428]: DHCPINFORM from 172.16.2.137 via vlan3' preamble: 11 priority: err priority_int: 3 program_name: dhcpd program_pid: 3428
  14. Parse::Syslog::Line ‣ Vendors never comply with standards ‣ Always Const::Fast,

    never Readonly.pm ‣ Good regexes are fast ‣ Untested regexes are dangerously slow ‣ Sometimes, perldoc perlre lies Lessons https://xkcd.com/1171/
  15. Regular Expressions? #!/usr/bin/env perl # use strict; use warnings; use

    Benchmark qw(cmpthese); my $re = qr/\d+/; my %RE = ( d => qr/\d+/ ); my @tests = <DATA>; printf "Running regex benchmark with Perl %s\n", $^V; cmpthese(500_000, { 'with o' => sub { /$RE{d}/o for @tests }, 'without o' => sub { /$RE{d}/ for @tests }, 'variable' => sub { /$re/ for @tests }, 'static' => sub { /\d+/ for @tests, }, });
  16. Regular Expressions? Running regex benchmark with Perl v5.22.0 Rate variable

    without o with o static variable 301205/s -- -4% -46% -54% without o 314465/s 4% -- -43% -52% with o 555556/s 84% 77% -- -14% static 649351/s 116% 106% 17% -- ‣ The %RE regex table is bad for performance ‣ The ‘o’ modifier can introduce bugs, but it can help optimize regular expressions in stored in variables
  17. Regular Expressions? for(@things) { if ( /$RE{IPv4}/o ) { print

    “Found IPv4!\n”; } } OK for(@things) { if ( /$RE{$type}/o ) { print “Found $type!\n”; } } NOT OK
  18. Logs as Documents <134>Jun 2 07:49:17 bob-03.lhr4.prod.booking.com www.booking.com: 107.150.42.90 -

    - [02/Jun/2015:07:49:17 +0200] \"GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5
  19. Adding Context ‣ Using FQDN, get data center? ‣ Well

    known log formats can be tokenized ‣ Find embedded JSON and parse it Do’s Shouldn’ts ‣ Bad (read: untested) Regex ‣ HTTP API Calls ‣ DNS Lookups
  20. Logs as Documents --- '@fields': crit: 404 dc: LHR4 domain:

    lhr4.prod.booking.com dst: www.booking.com facility: local0 file: /plus/search.php filetype: php hostname: bob-03 method: GET out_bytes: 172775 program: apache protocol: HTTP/1.1 referer: "http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" resource: "/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit +0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" response_ms: 368 service: web severity: info src_ip: 107.150.42.90 status: error upstream_ms: 368 useragent: baidu; baiduspider; zone: prod '@message': "www.booking.com: 107.150.42.90 - - [02/Jun/2015:07:49:17 +0200] \"GET /plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5" '@source_host': bob-03.lhr4.prod.booking.com '@timestamp': 2015-06-02T07:49:17+0200
  21. Extract More Meta ‣ Parse URI’s for their parts ‣

    Extract relevant fields from URI parameters ‣ Simple attack detection using common words ‣ GeoIP from a local DB
  22. --- '@fields': attack_generic_score: 16 attack_score: 340 attack_sqli_score: 18 attack_triggers: '*/

    | */ | */ | */ | */ | */ | */ | */ | /* | /* | / * | /* | /* | /* | /* | /* | concat | concat | rand | select | select | select' referer_domain: www.booking.com referer_file: /plus/search.php src_ip_country: US src_ip_location: '39.0997,-94.5786' Now That’s Meta?
  23. Lessons Learned ‣ Test the performance of the system ‣

    Work with the data you have ‣ Tokenize the data into a set of well defined fields
  24. Lessons from Breaking ElasticSearch ‣ 30 GB Heap ‣ Leave

    25% Memory for FS Cache ‣ Google "Call Me Maybe ElasticSearch” ‣ Separate master and data nodes ‣ Name space indexes based on retention
  25. elasticsearch.yml # Disable the use of Memory-mapped files index.store.type: niofs

    # Indexing Settings for Writes index.refresh_interval: 30s index.merge.scheduler.max_thread_count: 1 index.translog.flush_threshold_size: 1gb # Field Data Cache indices.fielddata.cache.size: 40%
  26. Index Optimizations ‣ Split Indexes in functional namespaces ‣ Use

    Dynamic Mappings (Schemas) ‣ Never use auto_expand_replicas ‣ Fewer replicas yield faster writes
  27. Document Optimizations ‣ Each doc uses ~8 bytes of memory

    ‣ Analyzing fields is expensive ‣ Agree on a “field set” for your documents ‣ CEE, CEF, XDAS, GELF, RELP, SDEE, etc ‣ Invent your own
  28. Monitor and Maintain ‣ Delete old indexes ‣ Degrade replicas

    ‣ Close unused indexes ‣ Disable bloom filters ‣ Monitor key metrics
  29. $ es-daily-index-maintenance.pl --base logstash \ --close --close-days 90 \ --delete

    --delete-days 180 \ --replicas --replicas-min 0 \ --replicas-max 2 \ --replicas-age 7 App::ElasticSearch::Utilities Maintain Indexes https://github.com/reyjrar/es-utils
  30. App::ElasticSearch::Utilities Maintain Aliases # es-alias-manager.pl --config /etc/es-aliases.yml --- logstash: pattern:

    \*-logstash-{{DATE}} daily: logstash-{{DATE}} relative: alias: logstash-{{PERIOD}} periods: weekly: from: days: 7 to: days: 0 https://github.com/reyjrar/es-utils
  31. App::ElasticSearch::Utilities Search Stuff! = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp src_ip src_ip_country

    file 2015-06-03T11:39:27+0200 217.36.201.217 GB /B1D671CF- E532-4481-99AA-19F420D90332/netdefender/hui/ndhui.css 2015-06-03T11:39:26+0200 92.56.217.84 ES /hotel/es/ null.es.html # Search Parameters: # {"query_string":{"query":"dst:www.booking.com AND crit:404"}} # Displaying 3 of (CENSORED) in 0 seconds. # Indexes (2 of 4) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 https://github.com/reyjrar/es-utils $ es-search.pl --base access dst:www.booking.com and crit:404 \ --show src_ip,src_ip_country,file --size 2
  32. Aggregate Stuff https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --size 3 = Querying Indexes: ams4-access-2015.06.03,lhr4-access-2015.06.03 count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65 # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 5 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 # # Totals across batch # count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65
  33. Find Pages Viewed by Most Countries https://github.com/reyjrar/es-utils $ es-search.pl --base

    access --days 1 dst:www.booking.com \ --top file --by cardinality:src_ip_country --size 3 = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 cardinality:src_ip_country count file 239 (CENSORED) / 236 (CENSORED) /rt_data/city_bookings 234 (CENSORED) /wishlist/get # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 21 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
  34. Pipeline Queries https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --by sum:attack_score --size 3 \ --data-file top_attackers.dat $ es-search.pl --base access --days 1 dst:www.booking.com \ src_ip:top_attackers.dat[-1] --size 3\ --show attack_score,src_ip,crit,dst,method,resource \ --sort attack_score:desc
  35. Pipeline Queries https://github.com/reyjrar/es-utils = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp attack_score src_ip

    crit dst method resource 2015-06-03T04:20:59+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T00:50:43+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T05:18:19+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a # Search Parameters: # {"terms":{"src_ip":["107.150.42.90","37.59.7.157","74.84.138.120"]}} # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of 793 in 0 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03