Research towards a systematic signature discovery process

Updated Signature Discovery Initiative presentation

Nathan Baker

August 17, 2014

Transcript

  1. Introduction The Signature Discovery process Signature Discovery Initiative research projects

    Conclusions and future work Research towards a systematic signature discovery process. Nathan Baker (Computational and Statistical Analytics Division, Pacific Northwest National Laboratory), August 4, 2014. Nathan Baker Signature Discovery (PNNL-SA-101246)
  2. Acknowledgements

    Funding: PNNL LDRD
    Selected team members: Dan Best, Nat Beagley (JHU APL), George Bonheyo, Paul Bruillard, Russ Burtner, Court Corley, Alex Endert (GA Tech), Luke Gosink, Ryan Hafen, Cindy Henderson, Alejandro Heredia-Langner, Nathan Hodas, Emilie Hogan, Kris Jarman, John Johnson, Cliff Joslyn, Kannan Krishnaswami, Jason McDermott, Bill Nickless, Chris Oehmen, Mark Oxley (AFIT), Elena Peterson, Rich Quadrel, Karin Rodland, Landon Sego, Andrew Stevens, Jana Strasburg, Mark Tardiff, Sandy Thompson, Marvin Warner, Bobbie-Jo Webb-Robertson, Paul Whitney, Adam Wynne
  3. Outline

    Introduction
    The Signature Discovery process: Overview; Process steps; Signature Discovery Initiative
    Signature Discovery Initiative research projects: Signature Quality Metrics; MLSTONES; Hierarchical Signature Detection
    Conclusions and future work
  4. Signatures provide a framework for heterogeneous feature integration

    Signature: a distinguishing collection of features that detects or characterizes a phenomenon of interest.
    Forensic, diagnostic, and prognostic signatures.
    Modern signatures often involve features from many different data types.
    Example: bioforensic attribution.
  5. Themes in signature discovery

    Challenges:
    Modern signatures often involve features from many different data types.
    Signatures are most useful when they are easily interpreted by decision-makers.
    Key questions:
    How do we best select features from multiple measurement sources to construct our signature?
    How do we detect signatures in "real world" environments?
    How do we assess the quality of a signature and compare different signatures?
    How do we recognize change and adapt signatures to dynamic phenomena?
    Is this process generalizable across domains?
  6. Signature systems as mappings

    Elements of a signature system [Baker et al., 2013]:
    Dimensional sets. Dimensions can have multiple types: Boolean, scalar, vector, et cetera.
    Mappings: $\mu$ (observation), $\vartheta$ (feature extraction), $\delta$ (classification).
    $\text{Events} \xrightarrow{\mu} \text{Measurements} \xrightarrow{\vartheta} \text{Features} \xrightarrow{\delta} (\text{Labels}, \text{Uncertainties})$
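The chain of mappings on this slide can be read as function composition. The following is a minimal sketch of that reading only; the event layout, the two features, and the threshold rule are hypothetical placeholders, not part of the SDI framework.

```python
from typing import List, Tuple


def observe(event: dict) -> List[float]:
    """mu: hypothetical observation map that reads raw values from an event."""
    return list(event["readings"])


def extract(measurements: List[float]) -> Tuple[float, float]:
    """theta: hypothetical feature extraction (mean and peak of the readings)."""
    return (sum(measurements) / len(measurements), max(measurements))


def classify(features: Tuple[float, float]) -> Tuple[str, float]:
    """delta: hypothetical threshold classifier returning (label, uncertainty)."""
    _mean, peak = features
    label = "target" if peak > 5.0 else "background"
    # Toy uncertainty: larger when the peak sits close to the threshold.
    uncertainty = 1.0 / (1.0 + abs(peak - 5.0))
    return label, uncertainty


def signature_system(event: dict) -> Tuple[str, float]:
    """Compose the three maps: Events -> Measurements -> Features -> (Label, Uncertainty)."""
    return classify(extract(observe(event)))


label, uncertainty = signature_system({"readings": [1.0, 2.0, 9.0]})
```

Keeping the three maps as separate functions mirrors the slide's separation of observation, feature extraction, and classification, so any one stage can be replaced without touching the others.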
  7. Outline: The Signature Discovery process

    Subsections: Overview; Process steps; Signature Discovery Initiative
  8. Developing a model of the signature discovery process

    [Figure: process flowchart from START to FINISH. Phases: Problem specification; Inventory hypotheses and observables; Specify measurements; Data assessment and exploration; Signature development. Steps: Specify target phenomena of interest; Specify purpose; Generate hypotheses; Identify potential observables; Select observables for measurement; Specify measurement principle; Specify measurement procedure; Measure or collect; Feature extraction processes; Features suitable?; Signature construction; Signature detection; Signature quality assessment; Signature suitable?]
    Is the process of signature discovery generalizable across domains?
    Developed over the past four years.
    Based on multiple sources: basic and applied research across several domains; end-users; ongoing research projects; 1000+ literature articles.
    Highly iterative process.
  9. Specifying the problem

    [Figure: signature discovery process flowchart]
    What are we looking for?
    Specify target phenomenon of interest: defines scope of signature (event space); currently precludes anomaly detection.
    Specify purpose. Goal: detect or characterize. Time frame: forensic, diagnostic, or prognostic.
  10. Inventorying observables

    [Figure: signature discovery process flowchart]
    How do we observe the events associated with our target?
    This phase starts with testable hypotheses: hypotheses about the target, and observables to test those hypotheses.
    Steps rely heavily on the social dimension: multidisciplinary teams; brainstorming and related methods.
    Selection of observables drives the measurement process.
  11. Specifying measurements

    [Figure: signature discovery process flowchart]
    How do we collect data for our observables?
    Data collection requires a sampling strategy and a measurement principle and platform.
    Useful to characterize noise and drift.
    Statistical design of experiments is important.
  12. Assessing and exploring the data

    [Figure: signature discovery process flowchart]
    Do salient features exist in the collected data?
    Typically results in significant reduction of data dimension and volume.
    Useful tools include data sub-setting, clustering, and exploratory data analysis.
    Resulting features form the basis for the signature.
    Highly iterative process.
  13. Developing, deploying, and assessing the signature

    [Figure: signature discovery process flowchart]
    How do we build and test the signature?
    Choose and train the classifier.
    Apply the classifier to data.
    Characterize signature behavior in the presence of interference, drift, and other dynamics.
    Evaluate signature performance on multiple dimensions.
    Lack of a suitable signature drives iterative development.
  14. The Signature Discovery Initiative

    [Figure: signature discovery process flowchart]
    SDI has developed a formal process for signature discovery that transforms signature development on diverse data to become:
    More efficient: reducing trial-and-error and decreasing the time to discover a signature.
    More economical: delivering methods and tools that allow users to reuse, rather than reinvent, signature discovery resources.
    More rigorous: providing robust and well-defined processes for signature discovery.
  15. Outline: Signature Discovery Initiative research projects

    Subsections: Signature Quality Metrics; MLSTONES; Hierarchical Signature Detection
  16. Instrumenting the signature discovery process: the SDI project portfolio
  17. Constructing and validating signatures I

    Signature Quality Metrics: going beyond ROC curves; framework to combine signature fidelity, risk, cost, and utility.
    Fishing for Features: greedy strategy for feature discovery; applied to survey-based disease signatures; applied to microbe-based fuel production.
  18. Constructing and validating signatures II

    Expert- and Data-driven Approaches to Signature Construction: enhancing expert input with data-driven discovery; applications to biomarker discovery.
    Multi-source signatures for bioforensics and nuclear programs: fusing features from diverse sources; providing transparent interpretability; wide-ranging applications.
  19. Detecting signatures in dynamic environments I

    Bioinformatics-Inspired Signature Detection: generalized alphabets for sequence signatures; applications to cyber-security.
    Compressive Sensing for Threat Detection: sparse representation of signatures and data; reduces data requirements for signature detection.
  20. Detecting signatures in dynamic environments II

    Graph Analytics for Dynamic Event Signatures: reduction of large data environments; detection of temporal signatures.
    Hierarchical Signature Detection: multilevel signature detection; supports bandwidth-limited data environments.
    Sensor Degradation and Signature Detection: understand the effect that changes in operating conditions have on signatures; design systems for robust performance.
  21. Integrating signature discovery capabilities I

    Analytic Framework Methods and Architecture: a software platform to apply signature discovery methodologies across multiple domains; combines SDI tools and algorithms into a single computational framework.
    Semantic Workflows for Signature Discovery: guides construction of signature discovery workflows; evaluates semantic similarity in signature workflows.
    Signature Discovery Workbench: facilitates user interaction with SDI tools; supports visual analytic approaches to feature selection and signature construction.
  22. Signature Quality Metrics (SQM)

    [Figure: SQM model and software. Elements: fidelity; other attributes (risk, cost); utility analysis; graphics & metrics to compare signature systems. Annotations: primarily data driven; problem specific, requires SME input.]
    Landon Sego, Aimee Holmes, Luke Gosink, Bobbie-Jo Webb-Robertson, Helen Kreuzer, Richard Anderson, Alan Brothers, Courtney Corley, Mark Tardiff
  23. Demonstrated SQM application areas

    Bioforensics: identification of culture media and "suspect" institutions [Sego et al., 2013].
    Radiation Portal Monitoring: algorithm comparison for gamma-ray PVT scanners at U.S. border crossings [Nobles et al., 2013].
    Proteomics: identifying composite measures of quality for proteomic data sets [Amidan et al., 2013].
  24. A bioforensic signature system

    The Bayes net estimates the probability that a particular culture medium was used to grow the spores in the sample.
    15 possible candidate signature systems.
  25. Identifying the attributes of signature quality

    Fidelity: for each forensic sample, a Bayes net produces a vector of probabilities indexed by culture medium. Fidelity is measured by the log score: the natural log of the probability assigned to the true culture medium.
    Monetary costs were identified for each assay, based on average prices posted by commercial laboratories.
    Sample size: the most precious resource is the biological sample.
    Assays: presence of heme; presence of agar; C/N isotope ratios; metals (Cu, Zn).
    Assay costs listed: $200, $100, $250, $170.
    Sample sizes listed: 0.1 mg, 0.01 mg, 1.0 mg, 0.3 mg.
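The log score described on this slide is simple to compute once the probability vector is available. A minimal sketch follows, assuming each prediction is a dictionary from culture medium name to probability; the function names and that data layout are illustrative assumptions, not taken from the SQM software.

```python
import math
from typing import Dict, List


def log_score(probabilities: Dict[str, float], true_medium: str) -> float:
    """Natural log of the probability assigned to the true culture medium.

    The score is 0 for a perfect prediction and increasingly negative
    as less probability is placed on the truth.
    """
    return math.log(probabilities[true_medium])


def mean_log_score(predictions: List[Dict[str, float]], truths: List[str]) -> float:
    """Average log score over a set of forensic samples (hypothetical helper)."""
    scores = [log_score(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)
```

A prediction that assigns zero probability to the true medium would raise an error here, which is a reasonable reminder that the log score is unbounded below.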
  26. Pareto frontier of bioforensic signatures
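A Pareto frontier keeps only the candidates that no other candidate beats on every attribute. The sketch below assumes two attributes, fidelity (higher is better) and cost (lower is better); the candidate names and numbers are made up for illustration and are not the values behind the slide's figure.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if a is at least as good as b on both attributes and strictly better on one.

    Each tuple is (fidelity, cost): higher fidelity is better, lower cost is better.
    """
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return at_least_as_good and strictly_better


def pareto_frontier(candidates: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of candidates that are not dominated by any other candidate."""
    return sorted(
        name
        for name, value in candidates.items()
        if not any(dominates(other, value)
                   for other_name, other in candidates.items()
                   if other_name != name)
    )


# Hypothetical signature systems: (mean log score, assay cost in dollars).
systems = {
    "system_A": (-0.2, 300.0),
    "system_B": (-0.5, 100.0),
    "system_C": (-0.6, 200.0),
    "system_D": (-0.2, 400.0),
}
frontier = pareto_frontier(systems)
```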
  27. Comparing signatures with expected utility I

    $u = 0.81\,u_1(\text{Prob. correct culture medium}) + 0.14\,u_2(\text{Sample size}) + 0.04\,u_3(\text{Assay cost})$
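The utility on this slide is a weighted sum of three single-attribute utilities. The sketch below uses the weights shown on the slide; the single-attribute utilities themselves are not given in the transcript, so the code assumes they have already been scaled to the range 0 to 1.

```python
# Weights as printed on the slide (they sum to 0.99, presumably from rounding).
WEIGHTS = (0.81, 0.14, 0.04)


def overall_utility(u_fidelity: float, u_sample_size: float, u_cost: float) -> float:
    """Weighted additive utility of one signature system.

    Each argument is an assumed single-attribute utility in [0, 1].
    """
    attributes = (u_fidelity, u_sample_size, u_cost)
    return sum(w * u for w, u in zip(WEIGHTS, attributes))
```

With these weights, fidelity dominates the comparison: a system with perfect fidelity and the worst sample size and cost still scores 0.81.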
  28. Comparing signatures with expected utility II

    These systems included the IRMS assay.
  29. MLSTONES: Machine Learning String Tools for Operational and Network Security

    Information is encoded and passed down through genes to proteins.
    Proteins consist of ∼20 amino acids that can be mapped to 20 characters.
    Protein sequences can be represented as text strings.
    Proteins with similar sequences often have similar properties; matches do not have to be "exact".
    Elena Peterson, Chris Oehmen
  30. Tools and models for sequence analysis

    BLAST (Basic Local Alignment Search Tool) is the de facto standard for sequence alignment.
    ScalaBLAST is a highly parallelized version of the BLAST algorithms that runs much faster and improves throughput [Oehmen & Baxter, 2013].
    [Figure: a query sequence and a library or "subject" sequence yield an alignment and similarity score for the sequence pair. Annotations: exact matches, gap, insertion, mismatch.]
    SIMILARITY SCORES: Score = 37.7 bits (119), Expect = 4e-04
    ALIGNMENT REGION:
    Query: 6 AAANGEFPIA-CLLLQAACDFAEFPADIAD----HAKDFENG 42
             AAANGEFPIA C   QAACDFAEFPADIAD     AKDFENG
    Sbjct: 5 AAANGEFPIAAC---QAACDFAEFPADIADAAACQAKDFENG 43
    >entity_2
    HCAAAAANGEFPIAACQAACDFAEFPADIADAAACQAKDFENGAEAKADFEAFEAAAKCD
    FEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFE
    AKAACDFEAFEAKAACDFEAFEAKAACDFEAFENGEAKAACDFEAAACDFEAHIAACQDF
    PIAACQIHIHAKKAADFPIHIHIHIHIHIHAAIAIAAADFPIAIAADHIHIHIHIHIHIH
    H
    >entity_1
    HMMMCAAANGEFPIACLLLQAACDFAEFPADIADHAKDFENGAEAKADFEAFEAAAKCDF
    EAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEA
    KAACDFEAFEAKAACDFEAFEAKAACDFEAFENGEAKAACDFEAAACDFEAHIHDFPIHI
    HIHAKKAADFPIHIHIHIHIHIHAAIAIAAADFPIAIAADHIHIHIHIHIHIHH
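To make the idea of an inexact local alignment score concrete, here is a small Smith-Waterman scorer. This is not BLAST or ScalaBLAST, which use heuristic seeding and statistical scoring; it is the classical dynamic-programming method, and the match, mismatch, and gap values are arbitrary illustrative choices rather than a real substitution matrix.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between strings a and b (linear gap penalty)."""
    cols = len(b) + 1
    prev = [0] * cols  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Compare the two aligned regions from the slide (gaps removed).
query = "AAANGEFPIACLLLQAACDFAEFPADIADHAKDFENG"
subject = "AAANGEFPIAACQAACDFAEFPADIADAAACQAKDFENG"
score = smith_waterman_score(query, subject)
```

The table is quadratic in the sequence lengths, which is one reason heuristic tools are preferred for large libraries.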
  31. Application to binary analysis

    Binaries can be aligned with MLSTONES and clustered based on similarity [Peterson et al., 2013].
  32. Sequence motifs as signatures

    files1/thing1:HMMM---ANGEFPIACLLLQAACDMNVCDADIADHKPMKWTVRTMKADFEAFEAAAK
    files1/thing2:HMMM---ANGEFPIACLLLQAACDPOIYTADIADHKPMKWTVRTMKADFEAFEAAAK
    files1/thing3:HMMM---ANGEFPIACLLLQAACDKKLMNADIADH----------KADFEAFEAAAK
    files1/thing4:HMMM---ANGEFPIACLLLQAACDSDFGHADIADH----------KADFEAFEAAAK
    files1/thing5:HMMM---ANGEFPRACLLLQAACDMMNPQADIADH----------KADFEAFEAAAK
    files1/thing6:HMMM---ANGEFPRACLLLQAACDERVPQADIADH----------KADFEAFEAAAK
    files1/thing7:HMMM---ANGEFPRACLLLQAACDKMLRTADIADH----------KADFEAFEAAAK
    files1/thing8:HMMM---ANGEFPRACLLLQAACDDDGKLADIADH----------KADFEAFEAAAK
    files1/thing9:HMMMLMNANGEFPRACLLLQAACDGHKIDADIADH----------KADFEAFEAAAK
                  ****   ******:**********     ******          ************
    Representative motifs can be identified with multiple alignments using software such as MAFFT.
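Given an existing multiple alignment, conserved columns and the motifs they form can be read off directly. The sketch below assumes equal-length aligned rows with `-` as the gap character and marks only fully identical columns; it does not reproduce the `:` marker for similar residues, and it does not perform the alignment itself (that is the job of a tool such as MAFFT). The toy rows are shortened illustrations, not the slide's data.

```python
from typing import List


def conservation(rows: List[str]) -> str:
    """Return a marker line with '*' under columns where every row has the same residue."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*rows)
    )


def conserved_motifs(rows: List[str], min_len: int = 3) -> List[str]:
    """Extract runs of at least min_len consecutive conserved columns."""
    mask = conservation(rows) + " "  # sentinel closes a run at the end
    motifs, start = [], None
    for i, marker in enumerate(mask):
        if marker == "*" and start is None:
            start = i
        elif marker != "*" and start is not None:
            if i - start >= min_len:
                motifs.append(rows[0][start:i])
            start = None
    return motifs


aligned = ["HMMM-ANG", "HMMM-ANG", "HMMMLANG"]
mask = conservation(aligned)
motifs = conserved_motifs(aligned)
```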
  33. Sequence families as signatures

    Sequence families can be represented with hidden Markov models using software such as HMMER.
  34. General applicability to sequence-type targets

    Pattern detection in biological systems provides a rich base of knowledge for detecting correlations, and a useful vocabulary for describing relationships and inheritance in those correlations.
    Applying biological principles of inheritance makes it possible to annotate or infer function, and to provide a framework for forensic analysis and inference: create families and motifs based on "developer" or other forensic markers.
  35. Hierarchical Signature Detection

    Objective: rapidly detect signatures from distributed data streams in bandwidth-limited environments.
    Approach: generate a hierarchical framework for distributed analysis that aggregates data and focuses attention at each tier; utilizes lower-level output to perform intermediate tasks; allows algorithms to be swapped out based on current needs.
    Benefits. Fast: massively parallel, alleviates bandwidth limitations, provides early detection. Flexible: common framework for different domains, algorithms, and data types.
    Paul Bruillard, Luke Gosink, Kenneth Jarman
  36. Process for Hierarchical Detection

    1. Express the problem in a graphical framework.
    2. Analyze the graphical network locally: detect events that occur on edges and nodes.
    3. Expand analysis (spatially and temporally) to sub-graphs: detect phenomena of interest that occur within subgraphs.
    4. Continue to expand and analyze graph neighborhoods.
    $p(y \mid x) \propto p(y) \prod_{i=1}^{m} p(x_i \mid y)$
    [Figure: classifier hierarchy in which simple classifiers feed intermediate classifiers, which feed complex classifiers. Labels: Filter & Aggregate; Learn & Adapt.]
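The proportionality on this slide is the naive Bayes rule: multiply the prior by the per-feature likelihoods, then normalize over labels. A minimal sketch for discrete features follows; the label names and probability tables are invented for illustration, and all probabilities are assumed to be strictly positive so that the logarithms are defined.

```python
import math
from typing import Dict, List, Sequence


def naive_bayes_posterior(priors: Dict[str, float],
                          likelihoods: Dict[str, List[Dict[int, float]]],
                          x: Sequence[int]) -> Dict[str, float]:
    """Posterior p(y | x) under the naive Bayes independence assumption.

    priors[y] is p(y); likelihoods[y][i][v] is p(x_i = v | y).
    """
    log_scores = {
        y: math.log(priors[y])
           + sum(math.log(likelihoods[y][i][value]) for i, value in enumerate(x))
        for y in priors
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {y: math.exp(s - top) for y, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {y: value / total for y, value in unnormalized.items()}


# Hypothetical two-feature example with binary features.
priors = {"interest": 0.5, "other": 0.5}
likelihoods = {
    "interest": [{1: 0.8, 0: 0.2}, {1: 0.6, 0: 0.4}],
    "other": [{1: 0.1, 0: 0.9}, {1: 0.5, 0: 0.5}],
}
posterior = naive_bayes_posterior(priors, likelihoods, (1, 1))
```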
  37. Process for Hierarchical Detection

    1. Express the problem in a graphical framework.
    2. Analyze the graphical network locally: detect events that occur on edges and nodes.
    3. Expand analysis (spatially and temporally) to sub-graphs: detect phenomena of interest that occur within subgraphs.
    4. Continue to expand and analyze graph neighborhoods.
    $p(y \mid x) \propto \prod_{c \in C} \Psi_c(x_c \mid y_c)$
    [Figure: classifier hierarchy in which simple classifiers feed intermediate classifiers, which feed complex classifiers. Labels: Filter & Aggregate; Learn & Adapt.]
  39. PILGram Overview

    PILGram: Proactive Intelligent Learning with Grammar.
    A general-purpose genetic algorithm for maximizing the accuracy of a user-specified classifier.
    Accelerates the feature extraction process for any given classifier.
    Capable of automatically integrating discrete data of multiple types.
    PILGram process overview: identify applicable atomic features; apply a genetic process to determine the optimal feature from atomic features; iterate the process to produce optimal non-redundant features.
    Inspired by: H. Leather et al., Automatic Feature Generation for Machine Learning Based Optimizing Compilation, 2009 International Symposium on Code Generation and Optimization.
  40. Building features with PILGram

    Let $M$ be all words of length $N$ over an alphabet $\mathcal{A}$ of size $A$.
    Features $f : M \to \mathbb{R}^n$ form a finite-dimensional algebra.
    The algebra is generated by indicator functions: $\chi_{a,i}(m) = 1$ if $m_i = a$, and $0$ otherwise.
    Features are expressible as $f = \sum_i b_i \prod_{a,i} \chi_{a,i}$ with $b_i \in \mathbb{R}$.
    Trivially extends to multi-dimensional features.
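The indicator functions on this slide are easy to write down, and sums and products of them already give useful features. The sketch below is an illustration of that idea only, with hypothetical helper names; it is not PILGram code.

```python
from itertools import product
from typing import Callable, List


def chi(a: str, i: int) -> Callable[[str], int]:
    """Indicator function: 1 if the word has letter a at position i, else 0."""
    return lambda m: 1 if m[i] == a else 0


def words(alphabet: str, n: int) -> List[str]:
    """All words of length n over the alphabet (the set M on the slide)."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]


def count_letter(a: str, n: int) -> Callable[[str], int]:
    """A sum of indicators: how many positions of an n-letter word hold letter a."""
    return lambda m: sum(chi(a, i)(m) for i in range(n))


def is_word(target: str) -> Callable[[str], int]:
    """A product of indicators: 1 exactly when the word equals target."""
    def feature(m: str) -> int:
        result = 1
        for i, a in enumerate(target):
            result *= chi(a, i)(m)
        return result
    return feature
```

For example, `count_letter("o", 3)("foo")` evaluates to 2 and `is_word("foo")` is nonzero on exactly one of the eight words over the alphabet `fo`.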
  41. PILGram Details I

    Any feature can be approximated to a precision of $p$ digits using the Backus-Naur Form (BNF) grammar below and expression trees of depth bounded by $2p + N + n|M|$.
    expr ::= op ( expr , expr ) | *( rat , fun ) | *( fun , fun ) | fun
    op ::= + | - | *
    rat ::= int | int / nat
    int ::= cat( nat int ) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    nat ::= cat( nat int ) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    fun ::= χi,c (M)
    Example: $\mathcal{A} = \{f, o\}$ and $M$ consists of words of length 3. Then the feature $f(m) = 1$ if $m = \text{foo}$, and $0$ otherwise, is given by the expression tree (in prefix order): $\ast\ \ast\ \chi_{f,0}\ \chi_{o,1}\ \chi_{o,2}$
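The example tree can be checked with a small evaluator. The sketch below represents expressions as nested tuples, which is an assumed encoding chosen for illustration; it covers the `+`, `-`, `*` operators, numeric constants, and indicator leaves, and reads the slide's prefix expression as `*(*(χ_{f,0}, χ_{o,1}), χ_{o,2})`.

```python
from itertools import product


def evaluate(expr, m: str):
    """Evaluate an expression tree on the word m.

    expr is a number, an indicator leaf ("chi", letter, position),
    or a binary node (op, left, right) with op in "+", "-", "*".
    """
    if isinstance(expr, (int, float)):
        return expr
    tag = expr[0]
    if tag == "chi":
        _, letter, position = expr
        return 1 if m[position] == letter else 0
    left, right = evaluate(expr[1], m), evaluate(expr[2], m)
    if tag == "+":
        return left + right
    if tag == "-":
        return left - right
    if tag == "*":
        return left * right
    raise ValueError(f"unknown node type: {tag!r}")


# The slide's example: a product of three indicators that fires only on "foo".
IS_FOO = ("*", ("*", ("chi", "f", 0), ("chi", "o", 1)), ("chi", "o", 2))

all_words = ["".join(w) for w in product("fo", repeat=3)]
values = {w: evaluate(IS_FOO, w) for w in all_words}
```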
  42. PILGram Details II

    PILGram optimizes a fitness function (typically classifier accuracy) over the space of finite-precision features through a genetic algorithm.
    Expression trees below are written in prefix order.
    Mutate. Generation 1: $+\ \ast\ \chi_{f,0}\ \chi_{f,1}\ 10$. Generation 2: $\ast\ \ast\ \chi_{f,0}\ \chi_{f,1}\ 10$.
    Crossover. Generation 1: $\ast\ +\ \chi_{f,0}\ \chi_{o,1}\ \chi_{o,2}$ and $+\ \ast\ \chi_{f,0}\ \chi_{o,1}\ \chi_{o,2}$. Generation 2: $\ast\ \ast\ \chi_{f,0}\ \chi_{o,1}\ \chi_{o,2}$ and $+\ +\ \chi_{f,0}\ \chi_{o,1}\ \chi_{o,2}$.
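The two genetic operators in the slide's examples can be reproduced on tuple-encoded trees. This sketch is deterministic and deliberately narrow: it mutates only the root operator and crosses over only the left subtrees, which is enough to reproduce the examples. A real genetic algorithm would pick nodes at random and keep offspring according to fitness; the function names and tuple encoding are assumptions for illustration.

```python
def mutate_root(expr, new_op: str):
    """Replace the operator at the root of a binary expression tree."""
    _old_op, left, right = expr
    return (new_op, left, right)


def crossover_left(parent1, parent2):
    """Swap the left subtrees of two binary expression trees."""
    op1, left1, right1 = parent1
    op2, left2, right2 = parent2
    return (op1, left2, right1), (op2, left1, right2)


F0, F1 = ("chi", "f", 0), ("chi", "f", 1)
O1, O2 = ("chi", "o", 1), ("chi", "o", 2)

# Mutation example: + * chi_f0 chi_f1 10  ->  * * chi_f0 chi_f1 10
mutated = mutate_root(("+", ("*", F0, F1), 10), "*")

# Crossover example: the two parents exchange their left subtrees.
child1, child2 = crossover_left(("*", ("+", F0, O1), O2),
                                ("+", ("*", F0, O1), O2))
```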
  43. HSD Example: "Jimmy the Fish"

    Identify a hidden crime network and the network boss based on intercepted communications and topic models.
  44. HSD Components

    PILGram: automatically identifies message characteristics.
    Naïve Bayes: tags messages of interest.
    Conditional random fields: incorporate network topology for detection.
  45. Outline: Conclusions and future work
  46. Expanding the scope of Signature Discovery Initiative applications

    Currently in the fourth year of a six-year investment.
    New investments in applications: three large-scale Challenge Projects; ten small-scale application projects.
    Goals: demonstrate impact and ability to construct, measure, detect, and assess quality for new signatures; evaluate cross-domain generalizability of the developed approaches; quantify the improvement over existing signatures and signature discovery processes.
    Signatures of Environmental Perturbation: Microbial Community and Organic Matter Resilience. Discover one or more signatures that indicate impaired resilience of the soil carbon biogeochemical system. The signature should inform a fundamental and predictive understanding of climate change effects on soils.
    Signatures of Illicit Nuclear Trafficking for Strategic Goods. Use the SDI process to discover, exploit, and validate signatures within networks of strategic goods in international commerce. Signatures will identify entities, relationships, and products having important implications for nuclear non-proliferation efforts.
  47. Learn more

    Learn more about the Signature Discovery Initiative at http://signatures.pnnl.gov or contact Nathan Baker (nathan.baker@pnnl.gov).
  48. References I

    [Amidan et al., 2013] Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Smith DJ, Sego LH, Payne SH, Tardiff MF. Signatures for Mass Spectrometry Data Quality. Molecular and Cellular Proteomics, in revision.
    [Baker et al., 2013] Baker NA, Barr JL, Bonheyo GT, Joslyn CA, Krishnaswami K, Oxley ME, Quadrel R, Sego LH, Tardiff MF, Wynne AS. Research towards a systematic signature discovery process. IEEE Intelligence and Security Informatics Signature Discovery Workshop, 2013. http://dx.doi.org/10.1109/ISI.2013.6578848
    [Hogan et al., 2013a] Hogan EA, Johnson JR III, Halappanavar M, Lo C. Graph Analytics for Signature Discovery. IEEE International Conference on Intelligence and Security Informatics (ISI), 2013, 315-320. http://dx.doi.org/10.1109/ISI.2013.6578850
    [Hogan et al., 2013b] Hogan EA, Johnson JR III, Halappanavar M. Graph Coarsening for Path Finding in Cybersecurity Graphs. Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop (CSIIRW'13), 2013, 7. http://dx.doi.org/10.1145/2459976.2459984
  49. References II

    [Oehmen & Baxter, 2013] Oehmen CS, Baxter DJ. ScalaBLAST 2.0: rapid and robust BLAST calculations on multiprocessor systems. Bioinformatics, 29, 797-798, 2013. http://dx.doi.org/10.1093/bioinformatics/btt013
    [Peterson et al., 2013] Peterson ES, Curtis DS, Phillips AR, Teuton JR, Oehmen CS. A Generalized Bio-inspired Method for Discovering Sequence-based Signatures. IEEE International Conference on Intelligence and Security Informatics (ISI), 2013. http://dx.doi.org/10.1109/ISI.2013.6578853
    [Nobles et al., 2013] Nobles MA, Sego LH, Cooley SK, Gosink LJ, Anderson RM, Hays SE, Tardiff MF. A Decision Theoretic Approach to Evaluate Radiation Detection Algorithms. Proceedings of the 2013 IEEE International Conference on Technologies for Homeland Security, to appear.
  50. References III

    [Sego et al., 2013] Sego LH, Holmes AE, Gosink LJ, Webb-Robertson BJM, Kreuzer HW, Anderson RM, Brothers AJ, Corley CD, Tardiff MF. Assessing the Quality of Bioforensic Signatures. IEEE Intelligence and Security Informatics Signature Discovery Workshop, 2013. http://dx.doi.org/10.1109/ISI.2013.6578856