
Finding Bad Guys Using Math and Statistics

Elastic Co
October 14, 2016

Transcript

  1. Identifying Statistical Anomalies Using Data-Driven Analysis. Jared McQueen, Cyber Security Innovations, October 13, 2016.
  2. Finding Bad Guys Using Math: Identifying Statistical Anomalies Using Data-Driven Analysis. Jared McQueen, Cyber Security Innovations, October 13, 2016.
  3. About me: @goodguyguybrush, https://github.com/jaredmcqueen/, https://gist.github.com/jaredmcqueen/. WARNING: lots of code to follow! ~15 years of cyber security consulting, working mostly with Federal customers (DoD, Intelligence Community, civilian agencies); software development; systems engineering; immersive 3D data visualizations.
  4. //BLUF: An organization's security posture is greatly improved by implementing data-driven analysis. Finding statistical anomalies through data science methods is more easily achieved with event enrichment. Agenda: SOC Models > Data-Driven Analysis > Event Enrichment > Use Cases.
  5. Alert-driven model: analysis of individual alerts. Available Security Events > Logging Tier > SIEM Collection Tier. Compute and storage are expensive; the SIEM is expensive; security vendors define severity; analysts rarely see the larger picture; use cases "see" the fewest events.
  6. Data-driven model: analysis of data sets. Available Security Events > "Big Data" platforms: machine learning, insider threat (behavioral heuristics), Hunt Team.
  7. Event lifecycle, traditional: Raw Event > Collection > Analysis. Garbage in = garbage out; collecting low-value events leads to lower-quality use cases.
  8. Event lifecycle, enrichment layer: Raw Event > Collection > ENRICHMENT > Analysis. One man's trash is another man's treasure! Garbage + enrichment = high-value events. Extract and tease out the hidden gems within your security events.
  9. Data-driven architecture, the Elastic Stack: Event Sources > Logstash (enrichment) > Elasticsearch (or ingest node) > Kibana. Enrichment prior to indexing is ideal but not necessary: see the docs for 5.0's scripting language (Painless), the reindex API (for your existing data), and 5.0's ingest nodes.

  10. Enrich All THE THINGS! Lessons learned: only add data useful to downstream use cases; realize the storage impact on your cluster; be aware of the performance implications at the Logstash tier; smartly (carefully) apply index mappings within Elasticsearch.
  11. Enrichment: GeoIP using Logstash. The geoip filter uses Maxmind's GeoLiteCity database; no installation required. Logstash filter:

        geoip { source => "dst_ip" }

      Resulting document:

        "geoip" => {
            "ip" => "184.73.175.108",
            "country_code2" => "US",
            "country_code3" => "USA",
            "country_name" => "United States",
            "continent_code" => "NA",
            "region_name" => "VA",
            "city_name" => "Ashburn",
            "latitude" => 39.0437,
            "longitude" => -77.4875,
            "dma_code" => 511,
            "area_code" => 703,
            "timezone" => "America/New_York",
            "real_region_name" => "Virginia",
            "location" => [ [0] -77.4875, [1] 39.0437 ]
        }

      *The geoip defaults produce too much bloat.
  12. Enrichment: GeoIP using Logstash. Logstash filter (optimized):

        geoip {
          source => "dst_ip"
          fields => ["country_name", "location"]
        }

      Resulting document:

        "geoip" => {
            "country_name" => "United States",
            "location" => [ [0] -77.4875, [1] 39.0437 ]
        }

      geoip.location is a GeoJSON field that maps to the geo_point datatype in Elasticsearch; see the docs for info.
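      For geo_point to apply, the index mapping has to declare it; the stock Logstash index template should already do this for logstash-* indices, so this only matters for custom index names. A minimal sketch in 5.x-era syntax (the index name "enriched-events" and type name "event" here are hypothetical):

        PUT enriched-events
        {
          "mappings": {
            "event": {
              "properties": {
                "geoip": {
                  "properties": {
                    "location": { "type": "geo_point" }
                  }
                }
              }
            }
          }
        }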
  13. Enrichment: TLD extraction. Example: https://artifacts.elastic.co/downloads/x-pack-5.0.0-beta1.zip. For the host artifacts.elastic.co: TRD (third-level domain) = artifacts; SLD (second-level domain) = elastic; TLD (top-level domain) = co.
  14. Enrichment: TLD extraction history. gTLDs (generic): .com .org .net .edu .gov .mil. ccTLDs (country code): .us .uk .ru .cn .jp .it. Until 2012 there were relatively few TLDs; IANA later opened up registration for new public and privately sponsored TLDs.
  15. 20 "tld" => { "tld" => "co", "sld" => "elastic",

    "trd" => "artifacts", "domain" => "elastic.co", "subdomain" => "artifacts.elastic.co" } tld { source => "message" } Logstash filter resulting document defaults are ok, but we can trim out the subdomain
 the URI.host (tld.subdomain) is probably a duplicate enrichment - TLD extraction using Logstash ./logstash-plugin install logstash-filter-tld logstash-filter-tld : publicsuffix_ruby : Mozilla’s Public Suffix List
  16. 21 "tld" => { "tld" => "co", "sld" => "elastic",

    "trd" => "artifacts", "domain" => "elastic.co" } tld { source => "message" remove_field => "[tld][subdomain]" } Logstash filter (optimized) resulting document enrichment - TLD extraction using Logstash now that we have TLD strings, what can we do with them?
  17. 22 "tld" => { "tld" => "co", "sld" => "elastic",

    "trd" => "artifacts", "domain" => "elastic.co", "sld_length" => 7, "trd_length" => 9, } ruby { code => " my_string = event['message'] event[‘message_length'] = my_string.length " } Logstash filter resulting document enrichment - string lengths what else can we do with the TLD string fields?
  18. Enrichment: string entropy theory. Shannon entropy (Claude Shannon, 1948):

        H(X) = - SUM_{i=1..N} p(x_i) * log2 p(x_i)

      "The sum of the product of the probability of each value, times the base-2 log of that probability." Entropy measures the information content of a message: more information content > more bits > higher entropy. It reduces a string to a single float; values typically range from 1.0 to 6.0+.
  19. Enrichment: string entropy examples.

        abcdef => 2.584962500721156
        aaabbbcccdddeeefff => 2.584962500721156
        abcdefghijklmnopqrstuvwxyz => 4.700439718141092

        abcdefghijklmnopqrstuvwxzy
        ABCDEFGHIJKLMNOPQRSTUVWXYZ
        1234567890
        !@#$%^&*() => 6.165420190467044

      String length does not yield higher entropy; character diversity does. Most use cases do not require high numerical resolution; see the docs regarding scaled_float with a scaling_factor.
  20. Enrichment: string entropy real-world examples.

      DNS tunneling:

        hn8AAAAJCAAAAWdlY2RzYS1zaGEyLW5pc3RwMjU2LWNlcnQtdjAxQG9wZW5zc2g.badguy.com => 4.82
        hn8AAAALCHNzaC5jb20sc3NoLWVkMjU1MTktY2VydC12MDFAb3BlbnNzaC5jb20.badguy.com => 4.86
        hn8AAAANCG5zc2guY29tLHNzaC1kc3MtY2VydC12MDBAb3BlbnNzaC5jb20sZWN.badguy.com => 4.88

      Email subject lines (notice the misspellings):

        Congratulations! You have won $32,671! => 4.41
        Congradulatoins! Y0u haev won $12,322! => 4.43
        Cungratulations! YOu have w0n $22,190! => 4.45

      Proxy / firewall request URLs:

        www.badguy.com/RwMzg0LGVjZGgtc2hhMi1uaXN0cDUyMSxkaWZmaWUtyNTZAbGlic3NoLm9y => 5.09
        www.badguy.com/1zaGExLGRpZmZpZS1oZWxsbWFuLWdyb3VwMS1zaGExuY29tLGVjZHNhLXNo => 5.00
        www.badguy.com/9wZW5zc2guY29tLHNzaC1yc2EtY2Vc3NoLXJzYS1jZXJ0LXYwMUBvcGVuct => 5.00
  21. Enrichment: string entropy using Logstash. Ideal fields for calculating entropy: TLD strings (third-level, second-level), request and referrer URLs, email subject lines. Logstash filter:

        ruby {
          code => "
            s = event['message']
            event['entropy'] = s.each_char.group_by(&:to_s).\
              values.map{|x| x.length / s.length.to_f}.\
              reduce(0){|e, x| e - x * Math.log2(x)}
          "
        }
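      The filter's calculation can be checked standalone outside of Logstash; a minimal Ruby sketch of the same Shannon entropy formula, verified against the example values from the earlier slide:

      ```ruby
      # Shannon entropy: -sum of p * log2(p) over per-character probabilities.
      def shannon_entropy(s)
        s.each_char
         .group_by(&:to_s)                               # bucket identical characters
         .values
         .map { |chars| chars.length / s.length.to_f }   # probability of each character
         .reduce(0) { |sum, p| sum - p * Math.log2(p) }
      end

      puts shannon_entropy("abcdef")                     # ~2.585 (log2 of 6 distinct chars)
      puts shannon_entropy("aaabbbcccdddeeefff")         # same distribution, same entropy
      puts shannon_entropy("abcdefghijklmnopqrstuvwxyz") # ~4.700
      ```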
  22. Enrichment recap (before). Example document prior to enrichment:

        {
            "dst_ip" => "184.73.175.108",
            "request" => "https://artifacts.elastic.co/downloads/x-pack-5.0.0-beta1.zip"
        }
  23. Enrichment recap (after):

        {
            "request" => "https://artifacts.elastic.co/downloads/x-pack-5.0.0-beta1.zip",
            "dst_ip" => "184.73.175.108",
            "request_entropy" => 4.39,
            "request_length" => 61,
            "tld" => {
                "tld" => "co",
                "sld" => "elastic",
                "sld_length" => 7,
                "sld_entropy" => 2.80,
                "trd" => "artifacts",
                "trd_length" => 9,
                "trd_entropy" => 2.72,
                "domain" => "elastic.co"
            },
            "geoip" => {
                "country_name" => "United States",
                "location" => [ [0] -77.4875, [1] 39.0437 ]
            }
        }

      Index mapping: all strings > not_analyzed; see the docs for storing numbers.
  24. Use cases: persistent threats. OPM breach: an estimated 21.5 million records (SF-86s, clearance information, financials) and the fingerprints of 5.6 million people. NASA breach: 250 GB of data released on Pastebin; the attackers allegedly partially commandeered a $222.7 million NASA drone.
  25. Use cases: overview. Methods (how we'll find the bad stuff): bubble up anomalous documents using enriched fields; use visualizations to identify potential anomalies; use standard deviation to find statistically significant events; reduce false positives (noise). Use cases (what we're looking for): drive-by malware / browser exploitation; DNS tunneling; command-and-control (C2) communication; data exfiltration.
  26. Use case: shady TLDs. See Blue Coat's "The Web's Shadiest Neighborhoods". Domain registration is cheap!
  27. Use case: suspicious third-level domains. Example of real-world DNS tunneling (dnscat2): the third-level domain's entropy is relatively high, and the third-level domain is long.
  28. Use case: third-level domain lengths. Use a histogram to identify outliers (x-axis: tld.trd_length; y-axis: count): length = 231 versus length = 36.
  29. Use case: unique counts of third-level domains. eej.me has 2500+ unique third-level domains (x-axis: tld.domain; y-axis: unique count of tld.trd).
  30. Use case: using averages to reduce false positives.

      Legit traffic, an AWS CDN (third-level length = 41):

        aff3a2a735c06eb78d2effa6f30f72fe3.profile.atl50.cloudfront.net
        effa6f30f72fe3aff3a2a735c06eb78d2.profile.atl50.cloudfront.net
        c06eb78d2efaff3a2a735fa6f30f72fe3.profile.atl50.cloudfront.net

      Malicious traffic (third-level length = 73):

        f4944a7884a1070f8be3a40e85249c2.1649181aec753e3f1ce7860fa10196c806d3e0.1.eej.me
        665db61090dcdaea4db55f4d6bef193.873cf3205e2413a1240338ade3d0db72610b32.1.eej.me
        e40000883e1230438d9fca44d9f93db.08582b9d.b62ba2f71a574000b5d0665db6273.1.eej.me

      cloudfront.net's average third-level length = ~17 (statistically insignificant); eej.me's average third-level length = ~71 (statistically significant).
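      The per-domain averaging can be sketched in plain Ruby. The sample hostnames are taken from the slide; the real averages (~17 versus ~71) come from all observed traffic, so the numbers this toy input produces will differ:

      ```ruby
      # Average third-level-domain length per registered domain. A domain whose
      # hostnames are consistently long (DNS tunneling) stands out even when a
      # CDN occasionally produces a single long hostname.
      hosts = [
        "aff3a2a735c06eb78d2effa6f30f72fe3.profile.atl50.cloudfront.net",
        "f4944a7884a1070f8be3a40e85249c2.1649181aec753e3f1ce7860fa10196c806d3e0.1.eej.me",
        "665db61090dcdaea4db55f4d6bef193.873cf3205e2413a1240338ade3d0db72610b32.1.eej.me"
      ]

      averages = hosts
        .group_by { |h| h.split(".").last(2).join(".") }  # naive registered domain
        .map do |domain, hs|
          trd_lengths = hs.map { |h| h.split(".")[0..-3].join(".").length }
          [domain, trd_lengths.reduce(:+) / trd_lengths.size.to_f]
        end.to_h

      p averages   # eej.me's average dwarfs cloudfront.net's
      ```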
  31. Use case: time-series data. DNS tunneling using counts: .es('dns_action.raw: forwarded').label('forwarded'). Helpful, but still too noisy; what about the small spikes?
  32. Use case: moving standard deviation. .mvstd(10) .mvstd(20) .mvstd(30) .mvstd(40). Standard deviation is a measure used to quantify the amount of variation in a set of data values.
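      The statistic behind Timelion's .mvstd(n) can be sketched as a sliding-window standard deviation (plain Ruby for illustration, not the Timelion implementation):

      ```ruby
      # Moving standard deviation over a sliding window: a small spike that
      # vanishes in a raw count series shows up as a large deviation from
      # its local window.
      def moving_std(series, window)
        series.each_cons(window).map do |w|
          mean = w.reduce(:+) / w.size.to_f
          Math.sqrt(w.map { |x| (x - mean)**2 }.reduce(:+) / w.size)
        end
      end

      counts = [10, 11, 9, 10, 60, 10, 11, 10]   # one anomalous spike
      p moving_std(counts, 4).map { |v| v.round(2) }
      ```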
  33. Use case: data exfiltration and C2 communication. Example of C2 / data-exfil web traffic:

        www.badguy.com/RwMzg0LGVjZGgtc2hhMi1uaXN0cDUyMSxkaWZmaWUtyNTZAbGlic3NoLm9y => 5.09
        www.badguy.com/1zaGExLGRpZmZpZS1oZWxsbWFuLWdyb3VwMS1zaGExuY29tLGVjZHNhLXNo => 5.00
        www.badguy.com/9wZW5zc2guY29tLHNzaC1yc2EtY2Vc3NoLXJzYS1jZXJ0LXYwMUBvcGVuct => 5.00

      The traffic has high entropy; the traffic has high length; the country may be foreign; the sum of bytes_out may be high; ports and protocols may not be standard. Detect by aggregating on average entropy or average length.
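      The "aggregate on average entropy or length" detection could be expressed as an Elasticsearch aggregation; a sketch assuming the enriched field names from the recap slide (the index pattern "enriched-*" is hypothetical):

        POST enriched-*/_search
        {
          "size": 0,
          "aggs": {
            "by_domain": {
              "terms": { "field": "tld.domain" },
              "aggs": {
                "avg_entropy": { "avg": { "field": "request_entropy" } },
                "avg_length":  { "avg": { "field": "request_length" } },
                "bytes_out":   { "sum": { "field": "bytes_out" } }
              }
            }
          }
        }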
  34. Use case: popularity and bottom-talkers. popular > uncommon > unpopular (x-axis: domains; y-axis: unique count of source_ip).
  35. Key takeaways. Increase SOC resources for data-driven analysis. Use enrichment to add value to data sources: • GEOIP (mainly for the country names) • TLD extraction on domain fields • string length on significant fields • string entropy on significant fields. Use data-driven analysis: • visualize using histograms, charts, graphs • use average values (length, entropy) • use moving standard deviation • focus on the bottom-talkers (20%).
  36. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/

    Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.