Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons from High Veolcity Logging (2016 Edition)

Lessons from High Veolcity Logging (2016 Edition)

Do you need to send, parse, store, and search large volumes of logging data? Well, step right up! We'll briefly stumble over the landscape and then embark on lessons learned about Perl, ElasticSearch, life, and ultimately ourselves! We'll talk about sending logging data using retro protocols like syslog. We'll discuss the up, down, left, and right sides of log parsing! We'll investigate my love-hate relationship with scaling and stabilizing ElasticSearch. Finally, we'll dive into interacting with your data and what unexpected things it can teach you!

Brad Lhotsky

November 15, 2016
Tweet

More Decks by Brad Lhotsky

Other Decks in Technology

Transcript

  1. Brad Lhotsky • Security and Systems Administrator
 at Craigslist
 •

    Previously: Infrastructure Monitoring & Security
 at Booking.com • Recovering • Perl Programmer • Linux Systems Admin • Network Security Specialist • PostgreSQL Administrator • ElasticSearch Janitor • DNS Voyeur • Starving Author • OSSEC Core Team Member https://github.com/reyjrar https://twitter.com/reyjrar
  2. Amazing People Johnny Lopes, Cully Bennefield, Wouter Clarie, Willem Gerber,

    Lucian Grosu, Stefan Boronea, Wijnand Modderman-Lensta, Mickey Nasriachi, Mattia Barbon, Alexey Shatlovsky, Vadim Sukhanov, Daniel Ostermeier, Samit Badle, Tim Pickett, Nick Perez
  3. Syslog Transport ‣Syslog is well-known transport ‣Syslog-ng installed on all

    servers ‣Don’t do that, use rsyslog ‣Remember syslog() syscall blocks
  4. Syslog-ng source s_apache { file("/logs/access.log" follow_freq(1)); }; filter f_junk {

    not(match("favicon.ico")); }; destination d_remote { tcp("1.2.3.4" port(514)); }; log { source(s_apache); filter(f_junk); destination(d_remote); }; So Pretty! Any one see any bugs? ‣“favicon.ico” are silently discarded ‣Reads “/logs/access.log” once per second ‣Only reads log_fetch_limit() lines each read log_fetch_limit() default is 10
  5. Configure rsyslog to: ‣Queue messages locally if the remote server

    dies ‣Sends all data over TCP ... ‣Encrypted ... ‣With peer validation
  6. rsyslog # Configure GNU/TLS $DefaultNetstreamDriver gtls $DefaultNetstreamDriverCAFile /etc/pki/rsyslog/ca.pem $DefaultNetstreamDriverCertFile /etc/pki/rsyslog/client-cert.pem

    $DefaultNetstreamDriverKeyFile /etc/pki/rsyslog/client-key.pem $ActionQueueType LinkedList $ActionQueueFileName centralwork $ActionResumeRetryCount -1 $ActionQueueSaveOnShutdown on $ActionSendStreamDriverAuthMode x509/name $ActionSendStreamDriverPermittedPeer log.example.com $ActionSendStreamDriverMode 1 *.* @@1.2.3.4:514
  7. rsyslog outputs ‣Forwarding, Files, Sockets ‣Programs, SNMP, Email ‣PostgreSQL, MySQL,

    Oracle, DBI ‣ElasticSearch, MongoDB, Kafka, HDFS Now that's webscale!
  8. Syslog Isn’t Web Scale 1. Apache Kafka 2. nanomsg 3.

    ZeroMQ, RabbitMQ, ActiveMQ, “${HOT}MQ” 4. Redis (just don’t tell anyone it’s not a queue) Syslog isn’t a Pub/Sub Queue
  9. Syslog Isn’t Web Scale Syslog is boring technology. We need

    more boring technology. Simplify our end points. Understand our Infrastructure.
  10. Perl Isn’t Web Scale We did this in Perl with

    POE. It outperformed LogStash. 60,000 Messages/Second. Limited by ElasticSearch indexing speed.
  11. Syslog Contents ‣ When did the log happen? ‣ Which

    server sent the log? (use FQDN!) ‣ What level was the log? ‣ What program generated the log? ‣ What was the content of the log?
  12. Parse::Syslog::Line ‣ Order from chaos, logs to documents ‣ Fastest

    syslog parser on the CPAN ‣ Handle many variations on syslog formats ‣ Determine how much I can hate vendors Goals
  13. Parse::Syslog::Line Turns this: <11>Jan 1 00:00:00 router.example.com dhcpd[3428]: DHCPINFORM from

    172.16.2.137 via vlan3 Into this: --- content: DHCPINFORM from 172.16.2.137 via vlan3 date_str: 2015-01-01 00:00:00 datetime_str: 2015-01-01 00:00:00 domain: example.com epoch: 1420066800 facility: user facility_int: 8 host: router message: 'dhcpd[3428]: DHCPINFORM from 172.16.2.137 via vlan3' preamble: 11 priority: err priority_int: 3 program_name: dhcpd program_pid: 3428
  14. Parse::Syslog::Line ‣ Vendors never comply with standards ‣ Always Const::Fast,

    never Readonly.pm ‣ Good regexes are fast ‣ Untested regexes are dangerously slow ‣ Sometimes, perldoc perlre lies Lessons https://xkcd.com/1171/
  15. Regular Expressions? #!/usr/bin/env perl # use strict; use warnings; use

    Benchmark qw(cmpthese); my %RE = ( d => qr/\d+/ ); my $re = $RE{d}; my @tests = <DATA>; printf "Running regex benchmark with Perl %s\n", $^V; cmpthese(500_000, { 'with o' => sub { /$RE{d}/o for @tests }, 'without o' => sub { /$RE{d}/ for @tests }, 'variable' => sub { /$re/ for @tests }, 'static' => sub { /\d+/ for @tests, }, });
  16. Regular Expressions? Running regex benchmark with Perl v5.24.0 Rate without

    o variable with o static without o 387597/s -- -7% -50% -54% variable 416667/s 7% -- -46% -51% with o 769231/s 98% 85% -- -9% static 847458/s 119% 103% 10% -- ‣ The %RE regex table is bad for performance ‣ perldoc perlre: o - pretend to optimize your code, but actually introduce bugs
  17. Regular Expressions? for(@things) { if ( /$RE{IPv4}/o ) { print

    “Found IPv4!\n”; } } OK for(@things) { if ( /$RE{$type}/o ) { print “Found $type!\n”; } } NOT OK
  18. Logs as Documents <134>Jun 2 07:49:17 bob-03.lhr4.prod.booking.com www.booking.com: 107.150.42.90 -

    - [02/Jun/2015:07:49:17 +0200] \"GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5
  19. Adding Context ‣ Using FQDN, get data center? ‣ Well

    known log formats can be tokenized ‣ Find embedded JSON and parse it Do’s Shouldn’ts ‣ Bad (read: untested) Regex ‣ HTTP API Calls ‣ DNS Lookups
  20. Logs as Documents --- '@fields': crit: 404 dc: LHR4 domain:

    lhr4.prod.booking.com dst: www.booking.com facility: local0 file: /plus/search.php filetype: php hostname: bob-03 method: GET out_bytes: 172775 program: apache protocol: HTTP/1.1 referer: "http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" resource: "/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit +0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" response_ms: 368 service: web severity: info src_ip: 107.150.42.90 status: error upstream_ms: 368 useragent: baidu; baiduspider; zone: prod '@message': "www.booking.com: 107.150.42.90 - - [02/Jun/2015:07:49:17 +0200] \"GET /plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5" '@source_host': bob-03.lhr4.prod.booking.com '@timestamp': 2015-06-02T07:49:17+0200
  21. Extract More Meta ‣ Parse URI’s for their parts ‣

    Extract relevant fields from URI parameters ‣ Simple attack detection using common words ‣ GeoIP from a local DB
  22. --- '@fields': attack_generic_score: 16 attack_score: 340 attack_sqli_score: 18 attack_triggers: '*/

    | */ | */ | */ | */ | */ | */ | */ | /* | /* | / * | /* | /* | /* | /* | /* | concat | concat | rand | select | select | select' referer_domain: www.booking.com referer_file: /plus/search.php src_ip_country: US src_ip_location: '39.0997,-94.5786' Now That’s Meta?
  23. Lessons Learned ‣ Test the performance of the system ‣

    Work with the data you have ‣ Tokenize the data into a set of well defined fields
  24. Lessons from Breaking ElasticSearch ‣ 30 GB Heap Maximum ‣

    Leave 33%+ Memory for FS Cache ‣ Google "Call Me Maybe ElasticSearch” ‣ Separate master and data nodes ‣ Name space indexes based on retention
  25. Optimized Shards ‣ Fewer segments ‣ Less Mapping ‣ Less

    Reducing ‣ Moar Web Scale ‣ Expensive operation
  26. elasticsearch.yml # Disable the use of Memory-mapped files index.store.type: niofs

    # Indexing Settings for Writes index.refresh_interval: 30s index.merge.scheduler.max_thread_count: 1 index.translog.flush_threshold_size: 1gb # Field Data Cache indices.fielddata.cache.size: 40%
  27. Index Optimizations ‣ Split Indexes in functional namespaces ‣ Use

    Dynamic Mappings (Schemas) ‣ Never use auto_expand_replicas ‣ Fewer replicas yield faster writes
  28. Document Optimizations ‣ Each doc uses ~8 bytes of memory

    ‣ Analyzing fields is expensive ‣ Agree on a “field set” for your documents ‣ CEE, CEF, XDAS, GELF, RELP, SDEE, etc ‣ Invent your own
  29. Schemaless Data ‣ Can't optimize storage ‣ Can't optimize search

    ‣ But it's easy! ‣ Bad for performance ‣ Read ‣ Search ‣ Index / Create
  30. Dates in ES { "date_template": { "match": "*_date", "mapping": {

    "type": "date", "format": "date_time||epoch_second", "ignore_malformed": true } } }, Matches things like: access_date, login_date
  31. IP Addresses in ES { "ip_template": { "match": "*_ip", "mapping":

    { "type": "ip", "ignore_malformed": true, "fields": { "raw": { "type" : "string", "index": "not_analyzed" } } } } } Matches src_ip and creates src_ip.raw
  32. Dates in ES { "int_template": { "match": "*_int", "mapping": {

    "type": "long", "ignore_malformed": true } } },
  33. Geo-data in ES { "geo_template" : { "mapping" : {

    "type" : "geo_point" }, "match" : "*_geoip" } }, Rejects docs where *_geoip don't cast properly.
  34. Strings { "string_fields" : { "mapping": { "type": "string", },

    "match": "*", "match_mapping_type": "string" } }
  35. Strings in ES 2.x { "string_fields" : { "mapping": {

    "type": "string", "index": "not_analyzed", "ignore_above": 256 }, "match": "*", "match_mapping_type": "string" } }
  36. Strings in ES 5.x { "string_fields" : { "mapping": {

    "type": "keyword", "index": true, "ignore_above": 256 }, "match": "*", "match_mapping_type": "string" } } For k/v fields, ie: program:"apache"
  37. Strings in ES 5.x { "message" : { "mapping": {

    "type": "text", "index": true } } } For bodies of text, like the whole log message
  38. Strings in ES ‣Analyzers ‣ Store and Search ‣ Affect

    indexing of tokens ‣Tokenizers ‣ Boundaries between tokens ‣Token Filters ‣ Remove tokens ‣Character Filters ‣ Removing data from tokens https://xkcd.com/171/
  39. Analyzer Example "cl_words_analyzer": { "tokenizer": "whitespace", "char_filter": "kill_numbers", "filter": [

    "lowercase", "asciifolding", "insensitive_stop", "min_length", "kstem" ] }
  40. Putting it Together message": { "type": "text", "index": true, "analyzer":

    "cl_words_analyzer", "norms": false, "fielddata": false, "fields": { "raw": { "type": "keyword", "index": "no", "doc_values": false, "norms": { "enabled": false }, "ignore_above": 1024 } } }
  41. Object Storage "rule": { "type": "object", "enabled": true, "dynamic": false,

    "properties": { "name": { "type": "keyword", "ignore_above": 256 }, "id": { "type": "keyword", "ignore_above": 256 }, "tags": { "type": "keyword", "ignore_above": 256 }, "cve": { "type": "keyword", "ignore_above": 256 }, "info": { "type": "keyword", "ignore_above": 1024 }, "level": { "type": "long", "ignore_malformed": true } } },
  42. Nested Object Storage "rules": { "type": "nested", "enabled": true, "dynamic":

    false, "properties": { "name": { "type": "keyword", "index": true }, "filter": { "type": "keyword", "index": true }, "condition": { "type": "keyword", "index": true }, "match": { "type": "keyword", "index": true }, "args": { "type": "object", "enabled": false } } }
  43. Defensive Mapping index.mapping.total_fields.limit: 1000 index.mapping.depth.limit: 20 index.mapping.nested_fields.limit:50 ‣ You need

    to know your data ‣ Even when you do, safe guards are necessary ‣ Luckily, other people learned from this pain
  44. Monitor and Maintain ‣ Delete old indexes ‣ Degrade replicas

    ‣ Close unused indexes ‣ Disable bloom filters ‣ Monitor key metrics
  45. $ es-daily-index-maintenance.pl --base logstash \ --close --close-days 90 \ --delete

    --delete-days 180 \ --replicas --replicas-min 0 \ --replicas-max 2 \ --replicas-age 7 App::ElasticSearch::Utilities Maintain Indexes https://github.com/reyjrar/es-utils
  46. App::ElasticSearch::Utilities Maintain Aliases # es-alias-manager.pl --config /etc/es-aliases.yml --- logstash: pattern:

    \*-logstash-{{DATE}} daily: logstash-{{DATE}} relative: alias: logstash-{{PERIOD}} periods: weekly: from: days: 7 to: days: 0 https://github.com/reyjrar/es-utils
  47. App::ElasticSearch::Utilities Search Stuff! = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp src_ip src_ip_country

    file 2015-06-03T11:39:27+0200 217.36.201.217 GB /B1D671CF- E532-4481-99AA-19F420D90332/netdefender/hui/ndhui.css 2015-06-03T11:39:26+0200 92.56.217.84 ES /hotel/es/ null.es.html # Search Parameters: # {"query_string":{"query":"dst:www.booking.com AND crit:404"}} # Displaying 3 of (CENSORED) in 0 seconds. # Indexes (2 of 4) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 https://github.com/reyjrar/es-utils $ es-search.pl --base access dst:www.booking.com and crit:404 \ --show src_ip,src_ip_country,file --size 2
  48. Aggregate Stuff https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --size 3 = Querying Indexes: ams4-access-2015.06.03,lhr4-access-2015.06.03 count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65 # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 5 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 # # Totals across batch # count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65
  49. Find Pages Viewed by Most Countries https://github.com/reyjrar/es-utils $ es-search.pl --base

    access --days 1 dst:www.booking.com \ --top file --by cardinality:src_ip_country --size 3 = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 cardinality:src_ip_country count file 239 (CENSORED) / 236 (CENSORED) /rt_data/city_bookings 234 (CENSORED) /wishlist/get # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 21 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
  50. Pipeline Queries https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --by sum:attack_score --size 3 \ --data-file top_attackers.dat $ es-search.pl --base access --days 1 dst:www.booking.com \ src_ip:top_attackers.dat[-1] --size 3\ --show attack_score,src_ip,crit,dst,method,resource \ --sort attack_score:desc
  51. Pipeline Queries https://github.com/reyjrar/es-utils = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp attack_score src_ip

    crit dst method resource 2015-06-03T04:20:59+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T00:50:43+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T05:18:19+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a # Search Parameters: # {"terms":{"src_ip":["107.150.42.90","37.59.7.157","74.84.138.120"]}} # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of 793 in 0 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03