Lessons from High Veolcity Logging (2016 Edition)

Lessons from High Veolcity Logging (2016 Edition)

Do you need to send, parse, store, and search large volumes of logging data? Well, step right up! We'll briefly stumble over the landscape and then embark on lessons learned about Perl, ElasticSearch, life, and ultimately ourselves! We'll talk about sending logging data using retro protocols like syslog. We'll discuss the up, down, left, and right sides of log parsing! We'll investigate my love-hate relationship with scaling and stabilizing ElasticSearch. Finally, we'll dive into interacting with your data and what unexpected things it can teach you!

8d96f5c273062cb617255e630fe0705c?s=128

Brad Lhotsky

November 15, 2016
Tweet

Transcript

  1. Lessons from High Velocity Logging Presented by Brad Lhotsky

  2. Brad Lhotsky • Security and Systems Administrator
 at Craigslist
 •

    Previously: Infrastructure Monitoring & Security
 at Booking.com • Recovering • Perl Programmer • Linux Systems Admin • Network Security Specialist • PostgreSQL Administrator • ElasticSearch Janitor • DNS Voyeur • Starving Author • OSSEC Core Team Member https://github.com/reyjrar https://twitter.com/reyjrar
  3. Amazing People Johnny Lopes, Cully Bennefield, Wouter Clarie, Willem Gerber,

    Lucian Grosu, Stefan Boronea, Wijnand Modderman-Lensta, Mickey Nasriachi, Mattia Barbon, Alexey Shatlovsky, Vadim Sukhanov, Daniel Ostermeier, Samit Badle, Tim Pickett, Nick Perez
  4. Expectations ‣Lessons from Shipping Logs ‣Lessons from Parsing Logs ‣Lessons

    from Storing Logs ‣Lessons from Using Logs
  5. Assumptions ‣Business Intelligence covered ‣Running Linux servers ‣Developed by Security

    Engineers Your Mileage May Vary
  6. Shipping Logs at WEB SCALE!

  7. Syslog Transport ‣Syslog is well-known transport ‣Syslog-ng installed on all

    servers ‣Don’t do that, use rsyslog ‣Remember syslog() syscall blocks
  8. Syslog-ng source s_apache { file("/logs/access.log" follow_freq(1)); }; filter f_junk {

    not(match("favicon.ico")); }; destination d_remote { tcp("1.2.3.4" port(514)); }; log { source(s_apache); filter(f_junk); destination(d_remote); }; So Pretty! Any one see any bugs? ‣“favicon.ico” are silently discarded ‣Reads “/logs/access.log” once per second ‣Only reads log_fetch_limit() lines each read log_fetch_limit() default is 10
  9. Configure rsyslog to: ‣Queue messages locally if the remote server

    dies ‣Sends all data over TCP ... ‣Encrypted ... ‣With peer validation
  10. rsyslog # Configure GNU/TLS $DefaultNetstreamDriver gtls $DefaultNetstreamDriverCAFile /etc/pki/rsyslog/ca.pem $DefaultNetstreamDriverCertFile /etc/pki/rsyslog/client-cert.pem

    $DefaultNetstreamDriverKeyFile /etc/pki/rsyslog/client-key.pem $ActionQueueType LinkedList $ActionQueueFileName centralwork $ActionResumeRetryCount -1 $ActionQueueSaveOnShutdown on $ActionSendStreamDriverAuthMode x509/name $ActionSendStreamDriverPermittedPeer log.example.com $ActionSendStreamDriverMode 1 *.* @@1.2.3.4:514
  11. rsyslog outputs ‣Forwarding, Files, Sockets ‣Programs, SNMP, Email ‣PostgreSQL, MySQL,

    Oracle, DBI ‣ElasticSearch, MongoDB, Kafka, HDFS Now that's webscale!
  12. None
  13. Syslog Isn’t Web Scale 1. Apache Kafka 2. nanomsg 3.

    ZeroMQ, RabbitMQ, ActiveMQ, “${HOT}MQ” 4. Redis (just don’t tell anyone it’s not a queue) Syslog isn’t a Pub/Sub Queue
  14. Syslog Isn’t Web Scale ‣ Apache Flume ‣ Facebook’s Scribe

    Syslog isn’t a Logging Pipeline
  15. Syslog Isn’t Web Scale 1. Fluentd 2. LogStash 3. Hindsight

    Syslog isn’t a Logging Platform
  16. Syslog Isn’t Web Scale Syslog is boring technology. We need

    more boring technology. Simplify our end points. Understand our Infrastructure.
  17. Perl Isn’t Web Scale We did this in Perl with

    POE. It outperformed LogStash. 60,000 Messages/Second. Limited by ElasticSearch indexing speed.
  18. Parsing Logs

  19. Syslog Contents ‣ When did the log happen? ‣ Which

    server sent the log? (use FQDN!) ‣ What level was the log? ‣ What program generated the log? ‣ What was the content of the log?
  20. Parse::Syslog::Line ‣ Order from chaos, logs to documents ‣ Fastest

    syslog parser on the CPAN ‣ Handle many variations on syslog formats ‣ Determine how much I can hate vendors Goals
  21. Parse::Syslog::Line Turns this: <11>Jan 1 00:00:00 router.example.com dhcpd[3428]: DHCPINFORM from

    172.16.2.137 via vlan3 Into this: --- content: DHCPINFORM from 172.16.2.137 via vlan3 date_str: 2015-01-01 00:00:00 datetime_str: 2015-01-01 00:00:00 domain: example.com epoch: 1420066800 facility: user facility_int: 8 host: router message: 'dhcpd[3428]: DHCPINFORM from 172.16.2.137 via vlan3' preamble: 11 priority: err priority_int: 3 program_name: dhcpd program_pid: 3428
  22. Parse::Syslog::Line ‣ Vendors never comply with standards ‣ Always Const::Fast,

    never Readonly.pm ‣ Good regexes are fast ‣ Untested regexes are dangerously slow ‣ Sometimes, perldoc perlre lies Lessons https://xkcd.com/1171/
  23. Regular Expressions? #!/usr/bin/env perl # use strict; use warnings; use

    Benchmark qw(cmpthese); my %RE = ( d => qr/\d+/ ); my $re = $RE{d}; my @tests = <DATA>; printf "Running regex benchmark with Perl %s\n", $^V; cmpthese(500_000, { 'with o' => sub { /$RE{d}/o for @tests }, 'without o' => sub { /$RE{d}/ for @tests }, 'variable' => sub { /$re/ for @tests }, 'static' => sub { /\d+/ for @tests, }, });
  24. Regular Expressions? Running regex benchmark with Perl v5.24.0 Rate without

    o variable with o static without o 387597/s -- -7% -50% -54% variable 416667/s 7% -- -46% -51% with o 769231/s 98% 85% -- -9% static 847458/s 119% 103% 10% -- ‣ The %RE regex table is bad for performance ‣ perldoc perlre: o - pretend to optimize your code, but actually introduce bugs
  25. Regular Expressions? for(@things) { if ( /$RE{IPv4}/o ) { print

    “Found IPv4!\n”; } } OK for(@things) { if ( /$RE{$type}/o ) { print “Found $type!\n”; } } NOT OK
  26. Logs as Documents <134>Jun 2 07:49:17 bob-03.lhr4.prod.booking.com www.booking.com: 107.150.42.90 -

    - [02/Jun/2015:07:49:17 +0200] \"GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*! 50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*! 50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/ +information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5
  27. Adding Context ‣ Using FQDN, get data center? ‣ Well

    known log formats can be tokenized ‣ Find embedded JSON and parse it Do’s Shouldn’ts ‣ Bad (read: untested) Regex ‣ HTTP API Calls ‣ DNS Lookups
  28. Logs as Documents --- '@fields': crit: 404 dc: LHR4 domain:

    lhr4.prod.booking.com dst: www.booking.com facility: local0 file: /plus/search.php filetype: php hostname: bob-03 method: GET out_bytes: 172775 program: apache protocol: HTTP/1.1 referer: "http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+` %23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" resource: "/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*! 50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/+CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit +0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a" response_ms: 368 service: web severity: info src_ip: 107.150.42.90 status: error upstream_ms: 368 useragent: baidu; baiduspider; zone: prod '@message': "www.booking.com: 107.150.42.90 - - [02/Jun/2015:07:49:17 +0200] \"GET /plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a HTTP/1.1\" 404 172775 0.368/0.368 \"http://www.booking.com/plus/search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*! 50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*!50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*!50000fRom*/+information_schema.tables+/*!50000gROUP*/ +by+a)b)%23@`%5C'`+]=a\" \"baidu; baiduspider;\" - 556D43DD00922CB5" '@source_host': bob-03.lhr4.prod.booking.com '@timestamp': 2015-06-02T07:49:17+0200
  29. Extract More Meta ‣ Parse URI’s for their parts ‣

    Extract relevant fields from URI parameters ‣ Simple attack detection using common words ‣ GeoIP from a local DB
  30. --- '@fields': attack_generic_score: 16 attack_score: 340 attack_sqli_score: 18 attack_triggers: '*/

    | */ | */ | */ | */ | */ | */ | */ | /* | /* | / * | /* | /* | /* | /* | /* | concat | concat | rand | select | select | select' referer_domain: www.booking.com referer_file: /plus/search.php src_ip_country: US src_ip_location: '39.0997,-94.5786' Now That’s Meta?
  31. Lessons Learned ‣ Test the performance of the system ‣

    Work with the data you have ‣ Tokenize the data into a set of well defined fields
  32. Storing Logs

  33. Lessons from Breaking ElasticSearch ‣ 30 GB Heap Maximum ‣

    Leave 33%+ Memory for FS Cache ‣ Google "Call Me Maybe ElasticSearch” ‣ Separate master and data nodes ‣ Name space indexes based on retention
  34. ElasticSearch Shards ‣ Transaction Log ‣ Lucene Segments ‣ Optimize

    Condenses Segments
  35. Optimized Shards ‣ Fewer segments ‣ Less Mapping ‣ Less

    Reducing ‣ Moar Web Scale ‣ Expensive operation
  36. elasticsearch.yml # Disable the use of Memory-mapped files index.store.type: niofs

    # Indexing Settings for Writes index.refresh_interval: 30s index.merge.scheduler.max_thread_count: 1 index.translog.flush_threshold_size: 1gb # Field Data Cache indices.fielddata.cache.size: 40%
  37. Index Layouts

  38. Index Optimizations ‣ Split Indexes in functional namespaces ‣ Use

    Dynamic Mappings (Schemas) ‣ Never use auto_expand_replicas ‣ Fewer replicas yield faster writes
  39. Document Optimizations ‣ Each doc uses ~8 bytes of memory

    ‣ Analyzing fields is expensive ‣ Agree on a “field set” for your documents ‣ CEE, CEF, XDAS, GELF, RELP, SDEE, etc ‣ Invent your own
  40. Schemaless Data

  41. Schemaless Data ‣ Can't optimize storage ‣ Can't optimize search

    ‣ But it's easy! ‣ Bad for performance ‣ Read ‣ Search ‣ Index / Create
  42. Dates in ES { "date_template": { "match": "*_date", "mapping": {

    "type": "date", "format": "date_time||epoch_second", "ignore_malformed": true } } }, Matches things like: access_date, login_date
  43. IP Addresses in ES { "ip_template": { "match": "*_ip", "mapping":

    { "type": "ip", "ignore_malformed": true, "fields": { "raw": { "type" : "string", "index": "not_analyzed" } } } } } Matches src_ip and creates src_ip.raw
  44. Dates in ES { "int_template": { "match": "*_int", "mapping": {

    "type": "long", "ignore_malformed": true } } },
  45. Geo-data in ES { "geo_template" : { "mapping" : {

    "type" : "geo_point" }, "match" : "*_geoip" } }, Rejects docs where *_geoip don't cast properly.
  46. Strings { "string_fields" : { "mapping": { "type": "string", },

    "match": "*", "match_mapping_type": "string" } }
  47. Strings in ES 2.x { "string_fields" : { "mapping": {

    "type": "string", "index": "not_analyzed", "ignore_above": 256 }, "match": "*", "match_mapping_type": "string" } }
  48. Strings in ES 5.x { "string_fields" : { "mapping": {

    "type": "keyword", "index": true, "ignore_above": 256 }, "match": "*", "match_mapping_type": "string" } } For k/v fields, ie: program:"apache"
  49. Strings in ES 5.x { "message" : { "mapping": {

    "type": "text", "index": true } } } For bodies of text, like the whole log message
  50. Strings in ES ‣Analyzers ‣ Store and Search ‣ Affect

    indexing of tokens ‣Tokenizers ‣ Boundaries between tokens ‣Token Filters ‣ Remove tokens ‣Character Filters ‣ Removing data from tokens https://xkcd.com/171/
  51. Filter Example "insensitive_stop": { "type": "stop", "stopwords": "_english_", "ignore_case": true

    }, "min_length": { "type": "length", "min": 2 }
  52. Filter Example "kill_numbers" : { "type": "pattern_replace", "pattern": "(\\d+)", "replace":

    "" }
  53. Analyzer Example "cl_words_analyzer": { "tokenizer": "whitespace", "char_filter": "kill_numbers", "filter": [

    "lowercase", "asciifolding", "insensitive_stop", "min_length", "kstem" ] }
  54. Putting it Together message": { "type": "text", "index": true, "analyzer":

    "cl_words_analyzer", "norms": false, "fielddata": false, "fields": { "raw": { "type": "keyword", "index": "no", "doc_values": false, "norms": { "enabled": false }, "ignore_above": 1024 } } }
  55. Object Storage "rule": { "type": "object", "enabled": true, "dynamic": false,

    "properties": { "name": { "type": "keyword", "ignore_above": 256 }, "id": { "type": "keyword", "ignore_above": 256 }, "tags": { "type": "keyword", "ignore_above": 256 }, "cve": { "type": "keyword", "ignore_above": 256 }, "info": { "type": "keyword", "ignore_above": 1024 }, "level": { "type": "long", "ignore_malformed": true } } },
  56. Nested Object Storage "rules": { "type": "nested", "enabled": true, "dynamic":

    false, "properties": { "name": { "type": "keyword", "index": true }, "filter": { "type": "keyword", "index": true }, "condition": { "type": "keyword", "index": true }, "match": { "type": "keyword", "index": true }, "args": { "type": "object", "enabled": false } } }
  57. None
  58. https://www.elastic.co/guide/en/elasticsearch/reference/5.0/ nested.html

  59. Defensive Mapping index.mapping.total_fields.limit: 1000 index.mapping.depth.limit: 20 index.mapping.nested_fields.limit:50 ‣ You need

    to know your data ‣ Even when you do, safe guards are necessary ‣ Luckily, other people learned from this pain
  60. Monitor and Maintain ‣ Delete old indexes ‣ Degrade replicas

    ‣ Close unused indexes ‣ Disable bloom filters ‣ Monitor key metrics
  61. $ es-daily-index-maintenance.pl --base logstash \ --close --close-days 90 \ --delete

    --delete-days 180 \ --replicas --replicas-min 0 \ --replicas-max 2 \ --replicas-age 7 App::ElasticSearch::Utilities Maintain Indexes https://github.com/reyjrar/es-utils
  62. App::ElasticSearch::Utilities Maintain Aliases # es-alias-manager.pl --config /etc/es-aliases.yml --- logstash: pattern:

    \*-logstash-{{DATE}} daily: logstash-{{DATE}} relative: alias: logstash-{{PERIOD}} periods: weekly: from: days: 7 to: days: 0 https://github.com/reyjrar/es-utils
  63. Index Aliases

  64. $ es-graphite-dynamic.pl App::ElasticSearch::Utilities Monitor https://github.com/reyjrar/es-utils

  65. Using Logs

  66. General Web Overview

  67. Web Attacks Scanners

  68. Sudoers Wall of Shame

  69. Slow Pages

  70. App::ElasticSearch::Utilities Search Stuff! = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp src_ip src_ip_country

    file 2015-06-03T11:39:27+0200 217.36.201.217 GB /B1D671CF- E532-4481-99AA-19F420D90332/netdefender/hui/ndhui.css 2015-06-03T11:39:26+0200 92.56.217.84 ES /hotel/es/ null.es.html # Search Parameters: # {"query_string":{"query":"dst:www.booking.com AND crit:404"}} # Displaying 3 of (CENSORED) in 0 seconds. # Indexes (2 of 4) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 https://github.com/reyjrar/es-utils $ es-search.pl --base access dst:www.booking.com and crit:404 \ --show src_ip,src_ip_country,file --size 2
  71. Aggregate Stuff https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --size 3 = Querying Indexes: ams4-access-2015.06.03,lhr4-access-2015.06.03 count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65 # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 5 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03 # # Totals across batch # count src_ip (CENSORED) 66.249.92.71 (CENSORED) 66.249.92.59 (CENSORED) 66.249.92.65
  72. Find Pages Viewed by Most Countries https://github.com/reyjrar/es-utils $ es-search.pl --base

    access --days 1 dst:www.booking.com \ --top file --by cardinality:src_ip_country --size 3 = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 cardinality:src_ip_country count file 239 (CENSORED) / 236 (CENSORED) /rt_data/city_bookings 234 (CENSORED) /wishlist/get # Search Parameters: # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of (CENSORED) in 21 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
  73. Pipeline Queries https://github.com/reyjrar/es-utils $ es-search.pl --base access --days 1 dst:www.booking.com

    \ --top src_ip --by sum:attack_score --size 3 \ --data-file top_attackers.dat $ es-search.pl --base access --days 1 dst:www.booking.com \ src_ip:top_attackers.dat[-1] --size 3\ --show attack_score,src_ip,crit,dst,method,resource \ --sort attack_score:desc
  74. Pipeline Queries https://github.com/reyjrar/es-utils = Querying Indexes: lhr4-access-2015.06.03,ams4-access-2015.06.03 @timestamp attack_score src_ip

    crit dst method resource 2015-06-03T04:20:59+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T00:50:43+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a 2015-06-03T05:18:19+0200 340 107.150.42.90 404 www.booking.com GET /plus/ search.php?keyword=as&typeArr[111%3D@`%5C'`)+/*!50000And*/+(/*!50000SeLECT*/+1+/*!50000frOM*/+(/*! 50000SeLECT*/+/*!50000Count(*)*/,concat(floor(rand(0)*2),(substring((/*!50000SeLECT*/ +CONCAT(0x40,userid,0x7c,substring(pwd,4,16))+from+`%23@__admin`+limit+0,1),1,62)))a+/*! 50000fRom*/+information_schema.tables+/*!50000gROUP*/+by+a)b)%23@`%5C'`+]=a # Search Parameters: # {"terms":{"src_ip":["107.150.42.90","37.59.7.157","74.84.138.120"]}} # {"query_string":{"query":"dst:www.booking.com"}} # Displaying 3 of 793 in 0 seconds. # Indexes (2 of 2) searched: ams4-access-2015.06.03,lhr4-access-2015.06.03
  75. Thank You! Presented by: Brad Lhotsky https://twitter.com/reyjrar https://github.com/reyjrar https://speakerdeck.com/reyjrar Made

    possible by the support of and viewers like you! craigslist