
Finding Bad Guys Using Math and Statistics

Elastic Co
October 14, 2016

Transcript

  1. Identifying Statistical Anomalies Using Data-Driven Analysis. Jared McQueen, Cyber Security Innovations, October 13, 2016.
  2. Finding Bad Guys Using Math: Identifying Statistical Anomalies Using Data-Driven Analysis. Jared McQueen, Cyber Security Innovations, October 13, 2016.
  3. About me: @goodguyguybrush, https://github.com/jaredmcqueen/, https://gist.github.com/jaredmcqueen/. WARNING: lots of code to follow! ~15 years of cyber security consulting, working mostly with Federal customers (DoD, Intelligence Community, civilian agencies); software development; systems engineering; immersive 3D data visualizations.
  4. //BLUF: An organization's security posture is greatly improved by implementing data-driven analysis. Finding statistical anomalies through data science methods is more easily achieved with event enrichment. Agenda: SOC Models > Data-Driven Analysis > Event Enrichment > Use Cases.
  5. Alert-driven model: analysis of individual alerts. Available Security Events > Logging Tier > SIEM Collection Tier. Compute and storage are expensive; the SIEM is expensive; security vendors define severity; analysts rarely see the larger picture; use cases "see" the fewest events.
  6. Data-driven model: analysis of data sets. Available Security Events > "Big Data" platforms: machine learning, insider threat (behavioral heuristics), Hunt Team.
  7. Event lifecycle, traditional: Raw Event > Collection > Analysis. Garbage in = garbage out; collecting low-value events leads to lower-quality use cases.
  8. Event lifecycle, enrichment layer: Raw Event > Collection > ENRICHMENT > Analysis. One man's trash is another man's treasure! Garbage + enrichment = high-value events. Extract and tease out the hidden gems within your security events.
  9. Data-driven architecture, the Elastic Stack: Event Sources > Logstash (enrichment) > Elasticsearch (or ingest node) > Kibana. Enrichment prior to indexing is ideal but not necessary: see the docs for 5.0's scripting language (Painless), the reindex API (for your existing data), and 5.0's ingest nodes.

  10. Enrich All THE THINGS! Lessons learned: only add data useful to downstream use cases; realize the storage impact on your cluster; be aware of the performance implications at the Logstash tier; smartly (carefully) apply index mappings within Elasticsearch.
  11. Enrichment: GeoIP using Logstash. The geoip filter uses Maxmind's GeoLiteCity database; no installation required. Logstash filter:

        geoip { source => "dst_ip" }

      Resulting document:

        "geoip" => {
            "ip" => "184.73.175.108",
            "country_code2" => "US",
            "country_code3" => "USA",
            "country_name" => "United States",
            "continent_code" => "NA",
            "region_name" => "VA",
            "city_name" => "Ashburn",
            "latitude" => 39.0437,
            "longitude" => -77.4875,
            "dma_code" => 511,
            "area_code" => 703,
            "timezone" => "America/New_York",
            "real_region_name" => "Virginia",
            "location" => [ [0] -77.4875, [1] 39.0437 ]
        }

      *The geoip defaults produce too much bloat.
  12. Enrichment: GeoIP using Logstash. Logstash filter (optimized):

        geoip {
          source => "dst_ip"
          fields => ["country_name", "location"]
        }

      Resulting document:

        "geoip" => {
            "country_name" => "United States",
            "location" => [ [0] -77.4875, [1] 39.0437 ]
        }

      geoip.location is a GeoJSON field that maps to the geo_point datatype in Elasticsearch; see the docs for info.
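      For geo_point to apply, the index mapping has to declare it; the stock Logstash index template should already do this for logstash-* indices, so this only matters for custom index names. A minimal sketch in 5.x-era syntax (the index name "enriched-events" and type name "event" here are hypothetical):

        PUT enriched-events
        {
          "mappings": {
            "event": {
              "properties": {
                "geoip": {
                  "properties": {
                    "location": { "type": "geo_point" }
                  }
                }
              }
            }
          }
        }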
  13. Enrichment: TLD extraction. Example: https://artifacts.elastic.co/downloads/x-pack-5.0.0-beta1.zip. For the host artifacts.elastic.co: TRD (third-level domain) = artifacts; SLD (second-level domain) = elastic; TLD (top-level domain) = co.
  14. Enrichment: TLD extraction history. gTLDs (generic): .com .org .net .edu .gov .mil. ccTLDs (country code): .us .uk .ru .cn .jp .it. Until 2012 there were relatively few TLDs; IANA later opened up registration for new public and privately sponsored TLDs.
  15. 20 "tld" => { "tld" => "co", "sld" => "elastic",

    "trd" => "artifacts", "domain" => "elastic.co", "subdomain" => "artifacts.elastic.co" } tld { source => "message" } Logstash filter resulting document defaults are ok, but we can trim out the subdomain
 the URI.host (tld.subdomain) is probably a duplicate enrichment - TLD extraction using Logstash ./logstash-plugin install logstash-filter-tld logstash-filter-tld : publicsuffix_ruby : Mozilla’s Public Suffix List
  16. 21 "tld" => { "tld" => "co", "sld" => "elastic",

    "trd" => "artifacts", "domain" => "elastic.co" } tld { source => "message" remove_field => "[tld][subdomain]" } Logstash filter (optimized) resulting document enrichment - TLD extraction using Logstash now that we have TLD strings, what can we do with them?
  17. 22 "tld" => { "tld" => "co", "sld" => "elastic",

    "trd" => "artifacts", "domain" => "elastic.co", "sld_length" => 7, "trd_length" => 9, } ruby { code => " my_string = event['message'] event[‘message_length'] = my_string.length " } Logstash filter resulting document enrichment - string lengths what else can we do with the TLD string fields?
  18. Enrichment: string entropy theory. Shannon entropy (Claude Shannon, 1948):

        H(X) = - SUM_{i=1..N} p(x_i) * log2 p(x_i)

      "The sum of the product of the probability of each value, times the base-2 log of that probability." Entropy measures the information content of a message: more information content > more bits > higher entropy. It reduces a string to a single float; values typically range from 1.0 to 6.0+.
  19. Enrichment: string entropy examples.

        abcdef => 2.584962500721156
        aaabbbcccdddeeefff => 2.584962500721156
        abcdefghijklmnopqrstuvwxyz => 4.700439718141092

        abcdefghijklmnopqrstuvwxzy
        ABCDEFGHIJKLMNOPQRSTUVWXYZ
        1234567890
        !@#$%^&*() => 6.165420190467044

      String length does not yield higher entropy; character diversity does. Most use cases do not require high numerical resolution; see the docs regarding scaled_float with a scaling_factor.
  20. Enrichment: string entropy real-world examples.

      DNS tunneling:

        hn8AAAAJCAAAAWdlY2RzYS1zaGEyLW5pc3RwMjU2LWNlcnQtdjAxQG9wZW5zc2g.badguy.com => 4.82
        hn8AAAALCHNzaC5jb20sc3NoLWVkMjU1MTktY2VydC12MDFAb3BlbnNzaC5jb20.badguy.com => 4.86
        hn8AAAANCG5zc2guY29tLHNzaC1kc3MtY2VydC12MDBAb3BlbnNzaC5jb20sZWN.badguy.com => 4.88

      Email subject lines (notice the misspellings):

        Congratulations! You have won $32,671! => 4.41
        Congradulatoins! Y0u haev won $12,322! => 4.43
        Cungratulations! YOu have w0n $22,190! => 4.45

      Proxy / firewall request URLs:

        www.badguy.com/RwMzg0LGVjZGgtc2hhMi1uaXN0cDUyMSxkaWZmaWUtyNTZAbGlic3NoLm9y => 5.09
        www.badguy.com/1zaGExLGRpZmZpZS1oZWxsbWFuLWdyb3VwMS1zaGExuY29tLGVjZHNhLXNo => 5.00
        www.badguy.com/9wZW5zc2guY29tLHNzaC1yc2EtY2Vc3NoLXJzYS1jZXJ0LXYwMUBvcGVuct => 5.00
  21. Enrichment: string entropy using Logstash. Ideal fields for calculating entropy: TLD strings (third-level, second-level), request and referrer URLs, email subject lines. Logstash filter:

        ruby {
          code => "
            s = event['message']
            event['entropy'] = s.each_char.group_by(&:to_s).\
              values.map{|x| x.length / s.length.to_f}.\
              reduce(0){|e, x| e - x * Math.log2(x)}
          "
        }
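      The filter's calculation can be checked standalone outside of Logstash; a minimal Ruby sketch of the same Shannon entropy formula, verified against the example values from the earlier slide:

      ```ruby
      # Shannon entropy: -sum of p * log2(p) over per-character probabilities.
      def shannon_entropy(s)
        s.each_char
         .group_by(&:to_s)                               # bucket identical characters
         .values
         .map { |chars| chars.length / s.length.to_f }   # probability of each character
         .reduce(0) { |sum, p| sum - p * Math.log2(p) }
      end

      puts shannon_entropy("abcdef")                     # ~2.585 (log2 of 6 distinct chars)
      puts shannon_entropy("aaabbbcccdddeeefff")         # same distribution, same entropy
      puts shannon_entropy("abcdefghijklmnopqrstuvwxyz") # ~4.700
      ```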
  22. Enrichment recap (before). Example document prior to enrichment:

        {
            "dst_ip" => "184.73.175.108",
            "request" => "https://artifacts.elastic.co/downloads/x-pack-5.0.0-beta1.zip"
        }
  23. Enrichment recap (after):

        {
            "request" => "https://artifacts.elastic.co/downloads/x-pack-5.0.0-beta1.zip",
            "dst_ip" => "184.73.175.108",
            "request_entropy" => 4.39,
            "request_length" => 61,
            "tld" => {
                "tld" => "co",
                "sld" => "elastic",
                "sld_length" => 7,
                "sld_entropy" => 2.80,
                "trd" => "artifacts",
                "trd_length" => 9,
                "trd_entropy" => 2.72,
                "domain" => "elastic.co"
            },
            "geoip" => {
                "country_name" => "United States",
                "location" => [ [0] -77.4875, [1] 39.0437 ]
            }
        }

      Index mapping: all strings > not_analyzed; see the docs for storing numbers.
  24. Use cases: persistent threats. OPM breach: an estimated 21.5 million records (SF-86s, clearance information, financials) and the fingerprints of 5.6 million people. NASA breach: 250 GB of data released on Pastebin; the attackers allegedly partially commandeered a $222.7 million NASA drone.
  25. Use cases: overview. Methods (how we'll find the bad stuff): bubble up anomalous documents using enriched fields; use visualizations to identify potential anomalies; use standard deviation to find statistically significant events; reduce false positives (noise). Use cases (what we're looking for): drive-by malware / browser exploitation; DNS tunneling; command-and-control (C2) communication; data exfiltration.
  26. Use case: shady TLDs. See Blue Coat's "The Web's Shadiest Neighborhoods". Domain registration is cheap!
  27. Use case: suspicious third-level domains. Example of real-world DNS tunneling (dnscat2): the third-level domain's entropy is relatively high, and the third-level domain is long.
  28. Use case: third-level domain lengths. Use a histogram to identify outliers (x-axis: tld.trd_length; y-axis: count): length = 231 versus length = 36.
  29. Use case: unique counts of third-level domains. eej.me has 2500+ unique third-level domains (x-axis: tld.domain; y-axis: unique count of tld.trd).
  30. Use case: using averages to reduce false positives.

      Legit traffic, an AWS CDN (third-level length = 41):

        aff3a2a735c06eb78d2effa6f30f72fe3.profile.atl50.cloudfront.net
        effa6f30f72fe3aff3a2a735c06eb78d2.profile.atl50.cloudfront.net
        c06eb78d2efaff3a2a735fa6f30f72fe3.profile.atl50.cloudfront.net

      Malicious traffic (third-level length = 73):

        f4944a7884a1070f8be3a40e85249c2.1649181aec753e3f1ce7860fa10196c806d3e0.1.eej.me
        665db61090dcdaea4db55f4d6bef193.873cf3205e2413a1240338ade3d0db72610b32.1.eej.me
        e40000883e1230438d9fca44d9f93db.08582b9d.b62ba2f71a574000b5d0665db6273.1.eej.me

      cloudfront.net's average third-level length = ~17 (statistically insignificant); eej.me's average third-level length = ~71 (statistically significant).
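      The per-domain averaging can be sketched in plain Ruby. The sample hostnames are taken from the slide; the real averages (~17 versus ~71) come from all observed traffic, so the numbers this toy input produces will differ:

      ```ruby
      # Average third-level-domain length per registered domain. A domain whose
      # hostnames are consistently long (DNS tunneling) stands out even when a
      # CDN occasionally produces a single long hostname.
      hosts = [
        "aff3a2a735c06eb78d2effa6f30f72fe3.profile.atl50.cloudfront.net",
        "f4944a7884a1070f8be3a40e85249c2.1649181aec753e3f1ce7860fa10196c806d3e0.1.eej.me",
        "665db61090dcdaea4db55f4d6bef193.873cf3205e2413a1240338ade3d0db72610b32.1.eej.me"
      ]

      averages = hosts
        .group_by { |h| h.split(".").last(2).join(".") }  # naive registered domain
        .map do |domain, hs|
          trd_lengths = hs.map { |h| h.split(".")[0..-3].join(".").length }
          [domain, trd_lengths.reduce(:+) / trd_lengths.size.to_f]
        end.to_h

      p averages   # eej.me's average dwarfs cloudfront.net's
      ```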
  31. Use case: time-series data. DNS tunneling using counts: .es('dns_action.raw: forwarded').label('forwarded'). Helpful, but still too noisy; what about the small spikes?
  32. Use case: moving standard deviation. .mvstd(10) .mvstd(20) .mvstd(30) .mvstd(40). Standard deviation is a measure used to quantify the amount of variation in a set of data values.
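      The statistic behind Timelion's .mvstd(n) can be sketched as a sliding-window standard deviation (plain Ruby for illustration, not the Timelion implementation):

      ```ruby
      # Moving standard deviation over a sliding window: a small spike that
      # vanishes in a raw count series shows up as a large deviation from
      # its local window.
      def moving_std(series, window)
        series.each_cons(window).map do |w|
          mean = w.reduce(:+) / w.size.to_f
          Math.sqrt(w.map { |x| (x - mean)**2 }.reduce(:+) / w.size)
        end
      end

      counts = [10, 11, 9, 10, 60, 10, 11, 10]   # one anomalous spike
      p moving_std(counts, 4).map { |v| v.round(2) }
      ```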
  33. Use case: data exfiltration and C2 communication. Example of C2 / data-exfil web traffic:

        www.badguy.com/RwMzg0LGVjZGgtc2hhMi1uaXN0cDUyMSxkaWZmaWUtyNTZAbGlic3NoLm9y => 5.09
        www.badguy.com/1zaGExLGRpZmZpZS1oZWxsbWFuLWdyb3VwMS1zaGExuY29tLGVjZHNhLXNo => 5.00
        www.badguy.com/9wZW5zc2guY29tLHNzaC1yc2EtY2Vc3NoLXJzYS1jZXJ0LXYwMUBvcGVuct => 5.00

      The traffic has high entropy; the traffic has high length; the country may be foreign; the sum of bytes_out may be high; ports and protocols may not be standard. Detect by aggregating on average entropy or average length.
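      The "aggregate on average entropy or length" detection could be expressed as an Elasticsearch aggregation; a sketch assuming the enriched field names from the recap slide (the index pattern "enriched-*" is hypothetical):

        POST enriched-*/_search
        {
          "size": 0,
          "aggs": {
            "by_domain": {
              "terms": { "field": "tld.domain" },
              "aggs": {
                "avg_entropy": { "avg": { "field": "request_entropy" } },
                "avg_length":  { "avg": { "field": "request_length" } },
                "bytes_out":   { "sum": { "field": "bytes_out" } }
              }
            }
          }
        }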
  34. Use case: popularity and bottom-talkers. popular > uncommon > unpopular (x-axis: domains; y-axis: unique count of source_ip).
  35. Key takeaways. Increase SOC resources for data-driven analysis. Use enrichment to add value to data sources: • GEOIP (mainly for the country names) • TLD extraction on domain fields • string length on significant fields • string entropy on significant fields. Use data-driven analysis: • visualize using histograms, charts, graphs • use average values (length, entropy) • use moving standard deviation • focus on the bottom-talkers (20%).
  36. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/

    Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third party marks and brands are the property of their respective holders.