Slide 1

Slide 1 text

Clustering and Classification for Cyber Crime Jesse Kornblum

Slide 2

Slide 2 text

Thank You • ICCyber Organizing Committee • Marcos Vinícius and the other officers of the Brazilian Federal Police • Sandro Suffert of Apura • You for attending! 2

Slide 3

Slide 3 text

3 Outline • Introduction • Similarity • Fuzzy Hashing • Features • Distance Measures • Feature Selection • Clustering • Classification • Questions

Slide 4

Slide 4 text

4 Introduction • U.S. Air Force Office of Special Investigations • U.S. Department of Justice • Now work for Kyrus • Applied Computer Forensics Research • Tools I've written – Memory forensics – md5deep/hashdeep – fuzzy hashing (ssdeep) – Foremost

Slide 5

Slide 5 text

Introduction • Analyzing an infinite number of programs/documents – Only five minutes per sample • Which of them are similar to each other? • Which of them fit into existing categories? – Variant of Zeus – Written by $person? – Relevant to this investigation 5

Slide 6

Slide 6 text

Assumptions • There are too many programs to manually review – You’re not going to look at them all • Computers are good at computing – Humans are not • Humans are good at categorizing – Computers are not 6

Slide 7

Slide 7 text

Artificial Intelligence 7

Slide 8

Slide 8 text

Artificial Intelligence 8 = Mathy Stuff

Slide 9

Slide 9 text

Clustering • Group similar items together • For example: – Variants of Zeus are all similar to each other – Documents written by the same person – Documents related to the same topic 9

Slide 10

Slide 10 text

Similarity • What does it mean for two things to be similar? 10

Slide 11

Slide 11 text

Similarity • Depends on: – The kind of things being compared – How they’re being compared 11

Slide 12

Slide 12 text

Example 12

Slide 13

Slide 13 text

Example • Both live in Washington DC • Both like a good hamburger • Both like dogs • Conclusion: Similar • President Obama is 7 cm taller • Jesse does not have gray hair • Work in different career fields • Conclusion: Not similar 13

Slide 14

Slide 14 text

Current Tools • Cryptographic Hashing – Exact match – e.g. SHA-256 • Fuzzy Hashing – Similar blocks of ones and zeros • Manual analysis 14

Slide 15

Slide 15 text

Fuzzy Hashing 15 – A 7 b F d r t 8 5 d N 4 o P – A 7 b F X r t 8 5 d N 4 o P

Slide 16

Slide 16 text

Fuzzy Hashing • Compare signatures: – A 7 b F d r t 8 5 d N 4 o P – A 7 b F X r t 8 5 d N 4 o P • Compute the edit distance between the signatures – Edit distance is number of changes necessary to turn one string into the other • Small edit distance means more similar • In this example, the edit distance is one 16
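A minimal Python sketch of the edit-distance comparison described above, using the two signatures from the slide; the function name is illustrative:

```python
# Levenshtein edit distance between two fuzzy-hash signatures.
def edit_distance(a: str, b: str) -> int:
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b; curr is the row being built.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(edit_distance("A7bFdrt85dN4oP", "A7bFXrt85dN4oP"))  # -> 1 (one substitution)
```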

Slide 17

Slide 17 text

Fuzzy Hashing 17 – A 7 b F d r t 8 5 d N 4 o P – 5 k A j b F 9 2 b 5 @ N q o P Y k

Slide 18

Slide 18 text

General Approach • Extract simple features from each input • Compare the features mathematically • The result of the comparison is a similarity score • With fuzzy hashing, the features are the hashes of the blocks of raw bytes 18
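A simplified Python sketch of this general approach, using fixed-size blocks purely for brevity (tools such as ssdeep pick variable block boundaries with a rolling hash); the block size, file paths, and the Jaccard comparison are illustrative choices, not the actual ssdeep algorithm:

```python
import hashlib

# Features are the hashes of blocks of raw bytes from each input.
def block_features(path: str, block_size: int = 4096) -> set:
    features = set()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            features.add(hashlib.sha256(block).hexdigest())
    return features

# Compare the two feature sets; the overlap is the similarity score.
def similarity(path_a: str, path_b: str) -> float:
    a, b = block_features(path_a), block_features(path_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)   # Jaccard similarity of the feature sets
```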

Slide 19

Slide 19 text

Similar Programs • Similarity depends on: – The kind of things being compared – How they’re being compared 19

Slide 20

Slide 20 text

Similar Programs • Do the same thing • Have the same look and feel • Connect to the same servers • Written by the same person • Used in the same intrusion • Run on the same platform 20

Slide 21

Slide 21 text

Features • What features can we extract from a program? 21

Slide 22

Slide 22 text

Features • Signed code? • Which APIs are called • How often APIs are called • Order in which APIs are called • Entropy • DLLs used • Percentage of code coverage • Magic strings • N-grams of instructions • Control-flow graph • IP addresses accessed • … 22 Image courtesy of Flickr user doctor_keats and used under a Creative Commons license.
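One of the listed features, entropy, is simple to compute directly. A minimal Python sketch; the file name is a placeholder:

```python
import math
from collections import Counter

# Shannon entropy of a byte string, in bits per byte.  Values near 8
# often indicate packed or encrypted code.
def shannon_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

with open("sample.exe", "rb") as f:   # "sample.exe" is a placeholder path
    print(shannon_entropy(f.read()))
```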

Slide 23

Slide 23 text

Distance Measures • Clustering – Group of inputs which are close to each other • Closeness depends on distance measure • What it sounds like – How far apart are the input programs? – As measured by our features • Alternatively, how similar are they? 23

Slide 24

Slide 24 text

Distance Measures 24 [Image: Royal Tulip, Justice Department]

Slide 25

Slide 25 text

Distance Measures 25 Euclidean Distance

Slide 26

Slide 26 text

Distance Measures 26 Manhattan Distance

Slide 27

Slide 27 text

Distance Measures 27 Represent the features of each program as a vector [axes: code coverage, entropy]

Slide 28

Slide 28 text

Distance Measures 28 [Plot: programs as points on the code coverage and entropy axes]

Slide 29

Slide 29 text

Distance Measures 29 We can measure the Euclidean distance between these points [axes: code coverage, entropy]
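A minimal Python sketch of both distance measures over such feature vectors; the (entropy, code coverage) values are made up for illustration:

```python
import math

# Each program is a feature vector, e.g. (entropy, code coverage).
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

p1 = (6.2, 0.80)   # hypothetical (entropy, code coverage) for program 1
p2 = (7.5, 0.45)   # hypothetical values for program 2
print(euclidean(p1, p2), manhattan(p1, p2))
```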

Slide 30

Slide 30 text

Cosine Similarity 30 • Or we could measure the angle θ between the vectors • The smaller the angle, the more similar the programs [axes: code coverage, entropy]
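A corresponding Python sketch of cosine similarity, using the same hypothetical feature vectors as above:

```python
import math

# Cosine similarity between two feature vectors.  The smaller the angle
# between them, the closer this value is to 1.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity((6.2, 0.80), (7.5, 0.45)))
```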

Slide 31

Slide 31 text

Feature Selection • The Curse of Dimensionality – So many dimensions (features) that comparisons become too time consuming or too complex • No problem • Select the “important” features – (Insert mathy stuff here) • Example: – Presence of crypto constants – Depends on context 31
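As one example of what the "mathy stuff" can look like, a simple variance filter drops features that barely change across the corpus. This is only one of many selection methods; the data layout and threshold are assumptions for the sketch:

```python
from statistics import pvariance

# Keep only the feature columns whose values actually vary across the corpus.
# `vectors` is a list of feature vectors, one per program.
def select_features(vectors, threshold=1e-3):
    n_features = len(vectors[0])
    keep = [i for i in range(n_features)
            if pvariance(v[i] for v in vectors) > threshold]
    return [[v[i] for i in keep] for v in vectors], keep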

Slide 32

Slide 32 text

Comparisons • Can find programs similar to any query • Similar to a kind of fuzzy hashing – “Signature” is the set of selected features 32
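A Python sketch of querying a corpus with such a "signature", assuming each program has already been reduced to a set of selected features; the names and the Jaccard measure are illustrative:

```python
# Rank a corpus of programs by similarity to a query program's feature set.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def most_similar(query: set, corpus: dict, top_n: int = 5):
    scores = [(name, jaccard(query, feats)) for name, feats in corpus.items()]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:top_n]

corpus = {"zeus_a.exe": {"api:send", "api:CreateRemoteThread", "dll:wininet"},
          "notepad.exe": {"api:CreateFileW", "dll:comdlg32"}}
print(most_similar({"api:send", "dll:wininet"}, corpus))
```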

Slide 33

Slide 33 text

Computing Clusters • Extract features from all inputs • Compute distance metric for all pairs of inputs For all inputs a and b: if distance(a,b) < threshold add_cluster(a,b) • Exclusive vs. Non-Exclusive clustering – Assume A~B and B~C – Exclusive: {A,B,C} – Non-Exclusive: {A,B} {B,C} 33
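A runnable Python version of the pseudocode above, producing exclusive clusters (if A~B and B~C, all three end up together) via union-find; the distance function and threshold are supplied by the analyst:

```python
from itertools import combinations

def cluster(inputs, distance, threshold):
    parent = {x: x for x in inputs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    # For all inputs a and b: if distance(a, b) < threshold, merge their clusters.
    for a, b in combinations(inputs, 2):
        if distance(a, b) < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in inputs:
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())
```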

Slide 34

Slide 34 text

Not Just for Programs • eDiscovery is all over this • Commercially available now • Uses phrases of text as features – File format independent • Distance metrics are statistical or linguistic or … 34

Slide 35

Slide 35 text

Clustering vs. Classification 35 – Clustering: What Makes Things Similar? – Classification: Which Things are Similar?

Slide 36

Slide 36 text

Classification • Also known as: – Predictive Coding – Assisted Machine Learning • Put inputs into a category – Zeus variant or Not Zeus variant – Written by Microsoft, $person, Other – Relevant to this case or not 36

Slide 37

Slide 37 text

Classification • Artificial intelligence is just math • There are many algorithms: – Naïve Bayesian classifier – K-Nearest Neighbor – Locality Sensitive Hashing – Decision Trees – Neural Networks – Hidden Markov Models 37

Slide 38

Slide 38 text

Classification • User must create a set of training data • Must identify some documents for each possible outcome • Train the algorithm on this data • "Knowledge" is stored by the algorithm • Which can then be applied to new inputs 38 Image courtesy Flickr user dhillan and used under a Creative Commons license, http://www.flickr.com/photos/dhillan/3848315549/

Slide 39

Slide 39 text

Naïve Bayes Classification • Let's go through an example – How your spam detector works • Determine which is greater – Probability message is spam? Or ham? – P(spam) or P(ham) – P(spam given features) or P(ham given features) 39

Slide 40

Slide 40 text

Training Set • Get a set of emails • Human labels which are spam, which are ham 40 From: rxbsgw56@qquix.biz To: jessek@kyr.us Subject: V1agra!! T0p quailikty V1aagra delieverd direct to you! http://sales.v1agara.biz/ From: mom@aol.com To: jessek@kyr.us Subject: Wear a jacket It will be cold while you are in Brazil. Please wear a jacket so that I do not worry about you. Love, Mom SPAM HAM

Slide 41

Slide 41 text

Naïve Bayesian Classifier • Based on Bayes' Theorem • Probabilities are based on what's in the training set • P(spam | features) = P(spam) × P(features | spam) / P(features) • P(ham | features) = P(ham) × P(features | ham) / P(features) • In other words, count things in the training set and do math on them 41

Slide 42

Slide 42 text

Naïve Bayesian Classifier • P(spam | features) = P(spam) × P(features | spam) / P(features) • P(ham | features) = P(ham) × P(features | ham) / P(features) • Which probability is greater? 42
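A minimal word-counting Python sketch of the classifier described here. Add-one smoothing keeps unseen words from zeroing out a probability, and P(features) is dropped because it is the same on both sides of the comparison; the tiny training set is illustrative:

```python
import math
from collections import Counter

def train(messages):                      # messages: list of (text, label)
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(totals[label] / sum(totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

training = [("top quality v1agra delivered direct", "spam"),
            ("please wear a jacket in brazil", "ham")]
model = train(training)
print(classify("v1agra direct to you", *model))   # -> 'spam'
```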

Slide 43

Slide 43 text

Decision Tree • Build a flowchart of questions on the features • Each question should divide the data equally • Blackjack example: 43 [Flowchart: "Is your total < 11?", "Have pair?", "Dealer have < 11?" → Split hands / Hit / Stay]

Slide 44

Slide 44 text

Decision Tree • Quick to classify, but slow to construct • What questions are best at which point in the tree? • [Insert mathy stuff here] • You could make a career out of efficient decision tree generation – And people do 44
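A short sketch using scikit-learn's DecisionTreeClassifier (an assumed dependency; any decision-tree library would do). The feature vectors and labels are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per program: [entropy, number of imported DLLs, signed?]
X = [[7.9, 3, 0],    # high entropy, few DLLs, unsigned
     [6.1, 40, 1],   # normal entropy, many DLLs, signed
     [7.8, 5, 0],
     [5.9, 35, 1]]
y = ["malware", "benign", "malware", "benign"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[7.7, 4, 0]]))   # -> ['malware']
```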

Slide 45

Slide 45 text

Classification Packages • All of these are Free and Open Source: – Weka – Apache Mahout – Malheur – LibSVM • Which is the best? 45

Slide 46

Slide 46 text

Classification Systems • Academia – “Solved problem” • For you? – Some assembly required – Your Agency puts it together 46

Slide 47

Slide 47 text

Measuring Classifier Performance • There are several measures • Look at false positives and false negatives – Run the classifier on the training set • When building, reserve some known values for a test set – Not used in training the classifier • There is a problem of over-fitting – Classifier "knows" the training data too well 47
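A Python sketch of one way to do this: reserve part of the labeled data as a test set and count false positives and false negatives on it. The train and classify callables, and the "malware" label, stand in for whatever classifier and categories are actually in use:

```python
import random

def evaluate(labeled, train, classify, test_fraction=0.2, positive="malware"):
    data = labeled[:]
    random.shuffle(data)
    split = int(len(data) * (1 - test_fraction))
    train_split, test_split = data[:split], data[split:]

    model = train(train_split)            # the test split is never seen here
    fp = fn = 0
    for features, true_label in test_split:
        predicted = classify(features, model)
        if predicted == positive and true_label != positive:
            fp += 1                        # false positive
        elif predicted != positive and true_label == positive:
            fn += 1                        # false negative
    return fp, fn
```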

Slide 48

Slide 48 text

Classification • Training set must be balanced • Roughly equal numbers of inputs for each category 48 Image courtesy Flickr user andyg and used under a Creative Commons license, http://www.flickr.com/photos/andyg/2642257588/

Slide 49

Slide 49 text

Application to Child Pornography • Difficult to extract relevant features • Lots of features in images – File format, size – GPS location – Camera serial number • But difficult to extract features related to the content 49

Slide 50

Slide 50 text

Conclusion • Analyzing an infinite number of programs – Only five minutes per sample – Computer time is cheap • Which of them are similar to each other? – Build clusters of programs and documents • Which of them fit into existing categories? – Variant of Zeus – Written by $person? – Relevant to this investigation – Build classifiers for these categories 50

Slide 51

Slide 51 text

51 Outline • Introduction • Similarity • Features • Distance Measures • Feature Selection • Clustering • Classification • Questions

Slide 52

Slide 52 text

Questions? Jesse Kornblum jessek@kyr.us 52