
Clustering and Classification for Cyber Crime

Kyrus
October 01, 2012


A high-level discussion on what clustering and classification are, similarities and differences between them, and applications to cyber crime investigations.


Transcript

  1. Thank You • ICCyber Organizing Committee • Marcos Vinícius and

    the other officers of the Brazilian Federal Police • Sandro Suffert of Apura • You for attending! 2
  2. 3 Outline • Introduction • Similarity • Fuzzy Hashing •

    Features • Distance Measures • Feature Selection • Clustering • Classification • Questions
  3. 4 Introduction • U.S. Air Force Office of Special Investigations

    • U.S. Department of Justice • Now work for Kyrus • Applied Computer Forensics Research • Tools I've written – Memory forensics – md5deep/hashdeep – fuzzy hashing (ssdeep) – Foremost
  4. Introduction • Analyzing an infinite number of programs/documents – Only

    five minutes per sample • Which of them are similar to each other? • Which of them fit into existing categories? – Variant of Zeus – Written by $person? – Relevant to this investigation 5
  5. Assumptions • There are too many programs to manually review

    – You’re not going to look at them all • Computers are good at computing – Humans are not • Humans are good at categorizing – Computers are not 6
  6. Clustering • Group similar items together • For example: –

    Variants of Zeus are all similar to each other – Documents written by the same person – Documents related to the same topic 9
  7. Similarity • Depends on: – The kind of things being

    compared – How they’re being compared 11
  8. Example • Both live in Washington DC • Both like

    a good hamburger • Both like dogs • Conclusion: Similar • President Obama is 7 cm taller • Jesse does not have gray hair • Work in different career fields • Conclusion: Not similar 13
  9. Current Tools • Cryptographic Hashing – Exact match – e.g.

    SHA-256 • Fuzzy Hashing – Similar blocks of ones and zeros • Manual analysis 14
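
    To make the "exact match" limitation concrete, here is a minimal Python sketch (not from the deck; the byte strings are made up) showing that changing a single byte produces a completely unrelated SHA-256 digest, which is why near-duplicates need fuzzy hashing:

        import hashlib

        original = b"MZ\x90\x00" + b"A" * 1000          # a made-up program image
        variant  = b"MZ\x90\x00" + b"A" * 999 + b"B"    # the same image with one byte changed

        print(hashlib.sha256(original).hexdigest())
        print(hashlib.sha256(variant).hexdigest())
        # The two digests share nothing, even though the inputs differ
        # in only one byte out of 1004.
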
  10. Fuzzy Hashing 15 [Figure: two block-hash signatures, A 7 b F d r t 8 5 d N 4 o P and

    A 7 b F X r t 8 5 d N 4 o P, which differ in a single position]
  11. Fuzzy Hashing • Compare signatures: – A 7 b F

    d r t 8 5 d N 4 o P – A 7 b F X r t 8 5 d N 4 o P • Compute the edit distance between the signatures – Edit distance is number of changes necessary to turn one string into the other • Small edit distance means more similar • In this example, the edit distance is one 16
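
    A small Python sketch of that comparison, assuming the signatures are written as plain strings; this is the textbook Levenshtein edit distance, not ssdeep's exact scoring:

        def edit_distance(a, b):
            """Classic dynamic-programming Levenshtein distance."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                  # deletion
                                   cur[j - 1] + 1,               # insertion
                                   prev[j - 1] + (ca != cb)))    # substitution
                prev = cur
            return prev[-1]

        print(edit_distance("A7bFdrt85dN4oP", "A7bFXrt85dN4oP"))  # 1
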
  12. Fuzzy Hashing 17 [Figure: two block-hash signatures, A 7 b F d r t 8 5 d N 4 o P and

    5 k A j b F 9 2 b 5 @ N q o P Y k, which differ in most positions]
  13. General Approach • Extract simple features from each input •

    Compare the features mathematically • The result of the comparison is a similarity score • With fuzzy hashing, the features are the hashes of the blocks of raw bytes 18
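
    A deliberately simplified sketch of that idea in Python (real ssdeep picks block boundaries with a rolling hash so inserted bytes do not shift every later block; here the blocks are simply fixed-size):

        import hashlib

        ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/"

        def block_signature(data: bytes, block_size: int = 64) -> str:
            """Hash each fixed-size block down to one character of the signature."""
            sig = []
            for i in range(0, len(data), block_size):
                digest = hashlib.sha256(data[i:i + block_size]).digest()
                sig.append(ALPHABET[digest[0] % len(ALPHABET)])
            return "".join(sig)

        # Two inputs that differ in one block produce signatures that differ
        # in at most one character, so their edit distance stays small.
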
  14. Similar Programs • Similarity depends on: – The kind of

    things being compared – How they’re being compared 19
  15. Similar Programs • Do the same thing • Have the

    same look and feel • Connect to the same servers • Written by the same person • Used in the same intrusion • Run on the same platform 20
  16. Features • Signed code? • Which APIs are called •

    How often APIs are called • Order in which APIs are called • Entropy • DLLs used • Percentage of code coverage • Magic strings • N-grams of instructions • Control-flow graph • IP addresses accessed • … 22 Image courtesy of Flickr user doctor_keats and used under a Creative Commons license.
  17. Distance Measures • Clustering – Group of inputs which are

    close to each other • Closeness depends on distance measure • What it sounds like – How far apart are the input programs? – As measured by our features • Alternatively, how similar are they? 23
  18. Distance Measures 29 [Scatter plot of programs with axes: code coverage vs. entropy] We can measure the

    Euclidean distance between these points
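
    With only two features per program, this is just the straight-line distance between the points; a sketch with made-up (code coverage, entropy) values:

        import math

        a = (0.82, 6.1)   # program A: (code coverage, entropy), made-up values
        b = (0.79, 6.4)   # program B

        print(math.dist(a, b))   # Euclidean distance; smaller means closer
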
  19. Cosine Similarity 30 [Same plot, with the angle θ drawn between the two feature vectors] Or we could

    measure the angle between the vectors. The smaller the angle, the more similar the programs.
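
    The same two points, compared by angle instead of straight-line distance; a minimal sketch:

        import math

        def cosine_similarity(a, b):
            """cos(theta) between two feature vectors; 1.0 means identical direction."""
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        print(cosine_similarity((0.82, 6.1), (0.79, 6.4)))   # close to 1.0 -> very similar
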
  20. Feature Selection • The Curse of Dimensionality – So many

    dimensions (features) that comparisons become too time consuming or too complex • No problem • Select the “important” features – (Insert mathy stuff here) • Example: – Presence of crypto constants – Depends on context 31
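
    As one stand-in for the "mathy stuff" (purely illustrative, not what the talk used), the simplest selection rule is to drop features whose values barely vary across the corpus, since they cannot separate anything:

        def low_variance_features(samples, threshold=1e-3):
            """Return indices of features whose variance across all samples is tiny."""
            n = len(samples)
            dropped = []
            for idx in range(len(samples[0])):
                column = [s[idx] for s in samples]
                mean = sum(column) / n
                variance = sum((v - mean) ** 2 for v in column) / n
                if variance < threshold:
                    dropped.append(idx)
            return dropped
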
  21. Comparisons • Can find programs similar to any query •

    Similar to a kind of fuzzy hashing – “Signature” is the set of selected features 32
  22. Computing Clusters • Extract features from all inputs • Compute

    distance metric for all pairs of inputs For all inputs a and b: if distance(a,b) < threshold add_cluster(a,b) • Exclusive vs. Non-Exclusive clustering – Assume A~B and B~C – Exclusive: {A,B,C} – Non-Exclusive: {A,B} {B,C} 33
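
    A minimal runnable version of the loop above, done in the exclusive style (A~B and B~C ends up as one cluster {A,B,C}); distance() is assumed to be whichever measure was chosen earlier:

        def cluster(inputs, distance, threshold):
            """Exclusive single-link clustering: merge any two clusters that
            contain a pair of inputs closer than the threshold."""
            clusters = [[x] for x in inputs]
            merged = True
            while merged:
                merged = False
                for i in range(len(clusters)):
                    for j in range(i + 1, len(clusters)):
                        if any(distance(a, b) < threshold
                               for a in clusters[i] for b in clusters[j]):
                            clusters[i].extend(clusters.pop(j))
                            merged = True
                            break
                    if merged:
                        break
            return clusters
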
  23. Not Just for Programs • eDiscovery is all over this

    • Commercially available now • Uses phrases of text as features – File format independent • Distance metrics are statistical or linguistic or … 34
  24. Classification • Also known as: – Predictive Coding – Assisted

    Machine Learning • Put inputs into a category – Zeus variant or Not Zeus variant – Written by Microsoft, $person, Other – Relevant to this case or not 36
  25. Classification • Artificial intelligence is just math • There are

    many algorithms: – Naïve Bayesian classifier – K-Nearest Neighbor – Locality Sensitive Hashing – Decision Trees – Neural Networks – Hidden Markov Models 37
  26. Classification • User must create a set of training data

    • Must identify some documents for each possible outcome • Train the algorithm on this data • "Knowledge" is stored by the algorithm • Which can then be applied to new inputs 38 Image courtesy Flickr user dhillan and used under a Creative Commons license, http://www.flickr.com/photos/dhillan/3848315549/
  27. Naïve Bayes Classification • Let's go through an example –

    How your spam detector works • Determine which is greater – Probability message is spam? Or ham? – P(spam) or P(ham) – P(spam given features) or P(ham given features) 39
  28. Training Set • Get a set of emails • Human

    labels which are spam, which are ham 40 From: [email protected] To: [email protected] Subject: V1agra!! T0p quailikty V1aagra delieverd direct to you! http://sales.v1agara.biz/ From: [email protected] To: [email protected] Subject: Wear a jacket It will be cold while you are in Brazil. Please wear a jacket so that I do not worry about you. Love, Mom SPAM HAM
  29. Naïve Bayesian Classifier • Based on Bayes Theorem • Probabilities

    are based on what’s in the training set • P(spam | features) = P(spam) × P(features | spam) / P(features) • P(ham | features) = P(ham) × P(features | ham) / P(features) • In other words, count things in the training set and do math on them 41
  30. Naïve Bayesian Classifier • P(spam | features) = P(spam) × P(features | spam) / P(features)

    • P(ham | features) = P(ham) × P(features | ham) / P(features) • Which probability is greater? 42
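
    A toy word-count version of the whole pipeline in Python, assuming the features are simply the words in the message (the training messages below are made up to echo the earlier slide):

        import math
        from collections import Counter

        training = [
            ("spam", "top quality v1agra delivered direct to you"),
            ("spam", "cheap meds sale sale sale"),
            ("ham",  "it will be cold in brazil please wear a jacket"),
            ("ham",  "see you at the conference next week"),
        ]

        counts = {"spam": Counter(), "ham": Counter()}
        totals = {"spam": 0, "ham": 0}
        for label, text in training:
            counts[label].update(text.split())
            totals[label] += 1

        def log_posterior(label, message):
            """log P(label) + sum of log P(word | label), with add-one smoothing."""
            vocab = set(counts["spam"]) | set(counts["ham"])
            score = math.log(totals[label] / len(training))
            denom = sum(counts[label].values()) + len(vocab)
            for word in message.split():
                score += math.log((counts[label][word] + 1) / denom)
            return score

        def classify(message):
            return max(("spam", "ham"), key=lambda lbl: log_posterior(lbl, message))

        print(classify("v1agra sale direct to you"))   # expected: spam
        print(classify("wear a jacket in brazil"))     # expected: ham
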
  31. Decision Tree • Build a flowchart of questions on the

    features • Each question should divide the data equally • Blackjack example: 43 [Flowchart with decision nodes “Is your total < 11?”, “Have pair?”, and “Dealer have < 11?” leading to the actions Split hands, Hit, and Stay]
  32. Decision Tree • Quick to classify, but slow to construct

    • What questions are best at which point in the tree? • [Insert mathy stuff here] • You could make a career out of efficient decision tree generation – And people do 44
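
    A minimal sketch using scikit-learn's decision tree (one possible library; the talk itself points to the packages on the next slide), with made-up feature vectors of [entropy, code coverage, signed?]:

        from sklearn.tree import DecisionTreeClassifier

        X = [[7.9, 0.12, 0],   # packed, low coverage, unsigned
             [7.8, 0.15, 0],
             [5.1, 0.82, 1],   # plain, high coverage, signed
             [5.3, 0.77, 1]]
        y = ["zeus", "zeus", "benign", "benign"]

        clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
        print(clf.predict([[7.7, 0.10, 0]]))   # expected: ['zeus']
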
  33. Classification Packages • All of these are Free and Open

    Source: – Weka – Apache Mahout – Malheur – LibSVM • Which is the best? 45
  34. Classification Systems • Academia – “Solved problem” • For you?

    – Some assembly required – Your Agency puts it together 46
  35. Measuring Classifier Performance • There are several measures • Look

    at false positives and false negatives – Run the classifier on the training set • When building, reserve some known values for a test set – Not used in training the classifier • There is a problem of over-fitting – Classifier "knows" the training data too well 47
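
    A sketch of the reserve-a-test-set idea: shuffle the labelled samples, hold some back, train only on the rest, then count the two error types on the held-out portion (classify() stands in for whatever classifier was trained):

        import random

        def split_train_test(samples, test_fraction=0.2, seed=42):
            """Reserve a random test set that is never shown to the classifier."""
            shuffled = samples[:]
            random.Random(seed).shuffle(shuffled)
            cut = int(len(shuffled) * test_fraction)
            return shuffled[cut:], shuffled[:cut]       # (training set, test set)

        def count_errors(test_set, classify, positive="zeus"):
            fp = fn = 0
            for features, true_label in test_set:
                predicted = classify(features)
                fp += (predicted == positive and true_label != positive)   # false positive
                fn += (predicted != positive and true_label == positive)   # false negative
            return fp, fn
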
  36. Classification • Training set must be balanced • Roughly equal

    numbers of inputs for each category 48 Image courtesy Flickr user andyg and used under a Creative Commons license, http://www.flickr.com/photos/andyg/2642257588/
  37. Application to Child Pornography • Difficult to extract relevant features

    • Lots of features in images – File format, size – GPS location – Camera serial number • But difficult to extract features related to the content 49
  38. Conclusion • Analyzing an infinite number of programs – Only

    five minutes per sample – Computer time is cheap • Which of them are similar to each other? – Build clusters of programs and documents • Which of them fit into existing categories? – Variant of Zeus – Written by $person? – Relevant to this investigation – Build classifiers for these categories 50
  39. 51 Outline • Introduction • Similarity • Features • Distance

    Measures • Feature Selection • Clustering • Classification • Questions