Research towards a systematic signature discovery process

Nathan Baker

March 07, 2014

Transcript

  1. Research towards a systematic signature discovery process

    Nathan Baker, Computational and Statistical Analytics Division, Pacific Northwest National Laboratory
    March 3, 2014
    Sections: Introduction; The Signature Discovery process; Signature Discovery Initiative research projects; Conclusions and future work
    Footer: Nathan Baker, Signature Discovery (PNNL-SA-101246)
  2. Acknowledgements

    Funding: PNNL LDRD
    Team members (highlighted in this presentation): Richard Anderson, Nathan Baker, Jonathan Barr, Dan Best, George Bonheyo, Alan Brothers, Paul Bruillard, Court Corley, Luke Gosink, Alejandro Heredia-Langner, Emilie Hogan, Aimee Holmes, Kris Jarman, John Johnson, Cliff Joslyn, Kannan Krishnaswami, Helen Kreuzer, Bill Nickless, Chris Oehmen, Mark Oxley (AFIT), Elena Peterson, Rich Quadrel, Landon Sego, Mark Tardiff, Sandy Thompson, Marvin Warner, Bobbie-Jo Webb-Robertson, Paul Whitney, Adam Wynne
  3. Outline

    Introduction
    The Signature Discovery process: Overview; Process steps; Signature Discovery Initiative
    Signature Discovery Initiative research projects: Signature Quality Metrics; MLSTONES
    Conclusions and future work
  4. Signatures provide a framework for heterogeneous feature integration

    Signature: a distinguishing collection of features that detects or characterizes a phenomenon of interest
      Forensic, diagnostic, and prognostic signatures
      Modern signatures often involve features from many different data types
    Example: maritime persistent surveillance and situational awareness. Is this Kim Smith?

      Feature          Measurement      Prob. of ID
      Step length      24 ± 2 in
      Step width       4.0 ± 0.5 in
      Knee angle       142 ± 5 deg      60%
      Walking speed    5.2 ± 0.2 fps
      Cycle time       2.6 ± 0.1 s      80%
      Acoustic power   70 dB            95%
  5. Themes in signature discovery

    Challenges:
      Modern signatures often involve features from many different data types
      Signatures are most useful when they are easily interpreted by decision-makers
    Key questions:
      How do we best select features from multiple measurement sources to construct our signature?
      How do we detect signatures in “real world” environments?
      How do we assess the quality of a signature and compare different signatures?
      How do we recognize change and adapt signatures to dynamic phenomena?
      Is this process generalizable across domains?
  6. Signature systems as mappings

    Elements of a signature system [Baker et al., 2013]:
      Dimensional sets: dimensions can have multiple types (Boolean, scalar, vector, et cetera)
      Mappings: µ (observation), ϑ (feature extraction), δ (classification)

    $\text{Events} \xrightarrow{\mu} \text{Measurements} \xrightarrow{\vartheta} \text{Features} \xrightarrow{\delta} (\text{Labels}, \text{Uncertainties})$
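The chain of mappings on this slide can be read as plain function composition. The sketch below is only an illustration under stated assumptions: the function names (`observe`, `extract`, `classify`, `signature_system`), the gait-speed event, the 5.2 ± 0.2 fps threshold rule, and the uncertainty value are all hypothetical stand-ins, not the definitions used by Baker et al.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, float]      # hypothetical: true state of one event
Measurement = List[float]     # hypothetical: raw sensor samples
Features = Dict[str, float]   # hypothetical: named feature values
Decision = Tuple[str, float]  # (label, uncertainty)


def observe(event: Event) -> Measurement:
    """mu: turn an event into raw measurements (three noiseless samples)."""
    speed = event["walking_speed_fps"]
    return [speed - 0.1, speed, speed + 0.1]


def extract(measurement: Measurement) -> Features:
    """theta: reduce raw measurements to features (here, the sample mean)."""
    return {"mean_speed_fps": sum(measurement) / len(measurement)}


def classify(features: Features) -> Decision:
    """delta: map features to a label and an uncertainty (toy threshold rule)."""
    distance = abs(features["mean_speed_fps"] - 5.2)
    label = "match" if distance <= 0.2 else "no match"
    uncertainty = min(1.0, distance)  # made-up value, not a calibrated uncertainty
    return label, uncertainty


def signature_system(
    mu: Callable[[Event], Measurement],
    theta: Callable[[Measurement], Features],
    delta: Callable[[Features], Decision],
) -> Callable[[Event], Decision]:
    """Compose the three mappings: events -> (label, uncertainty)."""
    return lambda event: delta(theta(mu(event)))


system = signature_system(observe, extract, classify)
label, uncertainty = system({"walking_speed_fps": 5.25})
```

Keeping the three mappings as separate functions means any one of them (a new sensor, a new feature extractor, a new classifier) can be swapped without touching the others.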
  7. Outline: The Signature Discovery process

    Overview; Process steps; Signature Discovery Initiative
  8. Developing a model of the signature discovery process

    Is the process of signature discovery generalizable across domains?
    Developed over the past four years
    Based on multiple sources: basic and applied research across several domains; end-users; ongoing research projects; 1000+ literature articles
    Highly iterative process

    [Process diagram, START to FINISH]
    Phases: Problem specification; Inventory hypotheses and observables; Specify measurements; Data assessment and exploration; Signature development
    Steps: Specify target phenomena of interest; Specify purpose; Generate hypotheses; Identify potential observables; Select observables for measurement; Specify measurement principle; Specify measurement procedure; Measure or collect; Feature extraction processes; Features suitable?; Signature construction; Signature detection; Signature quality assessment; Signature suitable?
  9. Specifying the problem

    What are we looking for?
    Specify target phenomenon of interest:
      Defines scope of signature (event space)
      Currently precludes anomaly detection
    Specify purpose:
      Goal: detect or characterize
      Time frame: forensic, diagnostic, or prognostic
  10. Inventorying observables

    How do we observe the events associated with our target?
    This phase starts with testable hypotheses:
      Hypotheses about the target
      Observables to test hypotheses
    Steps rely heavily on the social dimension:
      Multidisciplinary teams
      Brainstorming and related methods
    Selection of observables drives the measurement process
  11. Specifying measurements

    How do we collect data for our observables?
    Data collection requires:
      Sampling strategy
      Measurement principle and platform
    Useful to characterize noise and drift
    Statistical design of experiments is important
  12. Assessing and exploring the data

    Do salient features exist in the collected data?
    Typically results in significant reduction of data dimension and volume
    Useful tools include: data sub-setting; clustering; exploratory data analysis
    Resulting features form the basis for the signature
    Highly iterative process
  13. Developing, deploying, and assessing the signature

    How do we build and test the signature?
    Choose and train the classifier
    Apply the classifier to data
    Characterize signature behavior in the presence of interference, drift, and other dynamics
    Evaluate signature performance on multiple dimensions
    Lack of a suitable signature drives iterative development
  14. The Signature Discovery Initiative

    SDI has developed a formal process for signature discovery that transforms signature development from diverse data to become:
      More efficient: reducing trial-and-error and decreasing the time to discover a signature
      More economical: delivering methods and tools that allow users to reuse, rather than reinvent, signature discovery resources
      More rigorous: providing robust and well-defined processes for signature discovery
  15. Outline: Signature Discovery Initiative research projects

    Signature Quality Metrics; MLSTONES
  16. Instrumenting the signature discovery process: the SDI project portfolio
  17. Constructing and validating signatures I

    Signature Quality Metrics:
      Going beyond ROC curves
      Framework to combine signature fidelity, risk, cost, and utility
    Fishing for Features:
      Greedy strategy for feature discovery
      Applied to survey-based disease signatures
      Applied to microbe-based fuel production
  18. Constructing and validating signatures II

    Expert- and Data-driven Approaches to Signature Construction:
      Enhancing expert input with data-driven discovery
      Applications to biomarker discovery
    Multi-source signatures for bioforensics and nuclear programs:
      Fusing features from diverse sources
      Providing transparent interpretability
      Wide-ranging applications
  19. Detecting signatures in dynamic environments I

    Bioinformatics-Inspired Signature Detection:
      Generalized alphabets for sequence signatures
      Applications to cyber-security
    Compressive Sensing for Threat Detection:
      Sparse representation of signatures and data
      Reduces data requirements for signature detection
  20. Detecting signatures in dynamic environments II

    Graph Analytics for Dynamic Event Signatures:
      Reduction of large data environments
      Detection of temporal signatures
    Hierarchical Signature Detection:
      Multilevel signature detection
      Supports bandwidth-limited data environments
    Sensor Degradation and Signature Detection:
      Understand the effect that changes in operating conditions have on signatures
      Design systems for robust performance
  21. Integrating signature discovery capabilities I

    Analytic Framework Methods and Architecture:
      A software platform to apply signature discovery methodologies across multiple domains
      Combines SDI tools and algorithms into a single computational framework
    Semantic Workflows for Signature Discovery:
      Guides construction of signature discovery workflows
      Evaluates semantic similarity in signature workflows
    Signature Discovery Workbench:
      Facilitates user interaction with SDI tools
      Supports visual analytic approaches to feature selection and signature construction
  22. Signature Quality Metrics (SQM)

    [Diagram] Labels: Fidelity; Cost; Risk; Other attributes; Utility analysis; SQM model and software
    [Diagram] Annotations: Primarily data driven; Problem specific, requires SME input; Graphics & metrics to compare signature systems
    Landon Sego, Aimee Holmes, Luke Gosink, Bobbie-Jo Webb-Robertson, Helen Kreuzer, Richard Anderson, Alan Brothers, Courtney Corley, Mark Tardiff
  23. Demonstrated SQM application areas

    Bioforensics: identification of culture media and “suspect” institutions [Sego et al., 2013]
    Radiation Portal Monitoring: algorithm comparison for gamma-ray PVT scanners at U.S. border crossings [Nobles et al., 2013]
    Proteomics: identifying composite measures of quality for proteomic data sets [Amidan et al., 2013]
  24. A bioforensic signature system

    The Bayes net estimates the probability that a particular culture medium was used to grow the spores in the sample
    15 possible candidate signature systems
  25. Identifying the attributes of signature quality

    Fidelity:
      For each forensic sample, a Bayes net produces a vector of probabilities indexed by culture medium
      Fidelity is measured by the log score: the natural log of the probability assigned to the true culture medium
    Assay cost: monetary costs were identified for each assay, based on average prices posted by commercial laboratories
    Sample size: the most precious resource is the biological sample

      Assay                Cost    Sample size
      Presence of heme     $200    0.1 mg
      Presence of agar     $100    0.01 mg
      C/N isotope ratios   $250    1.0 mg
      Metals (Cu, Zn)      $170    0.3 mg
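The log score described on this slide is a one-line computation once the Bayes net's output is available. The sketch below assumes that output is a dictionary from culture medium to probability; the medium names and probabilities are invented for illustration, and `log_score` and `mean_log_score` are hypothetical helper names, not part of the SQM software.

```python
import math
from typing import Dict, List, Tuple


def log_score(probabilities: Dict[str, float], true_medium: str) -> float:
    """Natural log of the probability assigned to the true culture medium."""
    p = probabilities[true_medium]
    if p <= 0.0:
        return float("-inf")  # the true medium was ruled out entirely
    return math.log(p)


def mean_log_score(samples: List[Tuple[Dict[str, float], str]]) -> float:
    """Average log score over (probability vector, true medium) pairs."""
    return sum(log_score(probs, truth) for probs, truth in samples) / len(samples)


# Hypothetical output for one forensic sample (not data from the talk).
posterior = {"medium_A": 0.70, "medium_B": 0.20, "medium_C": 0.10}
score_if_A_is_true = log_score(posterior, "medium_A")
score_if_C_is_true = log_score(posterior, "medium_C")
```

The score is 0 for a perfectly confident correct answer and grows more negative as less probability is placed on the true medium, so a higher (less negative) average indicates higher fidelity.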
  26. Pareto frontier of bioforensic signatures
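The figure for this slide did not survive in the transcript, but the idea of a Pareto frontier can be sketched briefly: a candidate signature system stays on the frontier if no other candidate is at least as good on every attribute and strictly better on at least one. The attribute set below (fidelity, cost, sample size) follows the preceding slide; the candidate names and all numbers are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    fidelity: float   # e.g. mean log score; higher is better
    cost: float       # dollars; lower is better
    sample_mg: float  # sample consumed; lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every attribute and better on at least one."""
    no_worse = (a.fidelity >= b.fidelity and a.cost <= b.cost
                and a.sample_mg <= b.sample_mg)
    better = (a.fidelity > b.fidelity or a.cost < b.cost
              or a.sample_mg < b.sample_mg)
    return no_worse and better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical candidate systems (values invented for illustration).
candidates = [
    Candidate("S1", fidelity=-0.9, cost=100.0, sample_mg=0.01),
    Candidate("S2", fidelity=-0.6, cost=300.0, sample_mg=0.11),
    Candidate("S3", fidelity=-0.7, cost=350.0, sample_mg=0.40),  # dominated by S2
    Candidate("S4", fidelity=-0.3, cost=720.0, sample_mg=1.41),
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
```

The pairwise check is quadratic in the number of candidates, which is not a concern for a set as small as the 15 candidate systems mentioned earlier.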
  27. Comparing signatures with expected utility I

    $u = 0.81\,u_1(\text{Prob. correct culture medium}) + 0.14\,u_2(\text{Sample size}) + 0.04\,u_3(\text{Assay cost})$
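The weighted sum on this slide can be evaluated directly once the single-attribute utilities u1, u2, and u3 are chosen. The slide gives only the weights, so the sketch below assumes simple linear utilities with invented worst and best values (1.5 mg and $800 as the worst cases); those ranges and the helper names are hypothetical. Averaging the result over many samples to obtain an expected utility is left out.

```python
# Weights as shown on the slide.
W_FIDELITY, W_SAMPLE, W_COST = 0.81, 0.14, 0.04


def linear_utility(value: float, worst: float, best: float) -> float:
    """Hypothetical single-attribute utility: 0 at worst, 1 at best, clipped to [0, 1]."""
    u = (value - worst) / (best - worst)
    return max(0.0, min(1.0, u))


def overall_utility(prob_correct: float, sample_mg: float, cost_usd: float) -> float:
    """Additive multi-attribute utility using the slide's weights."""
    u1 = linear_utility(prob_correct, worst=0.0, best=1.0)
    u2 = linear_utility(sample_mg, worst=1.5, best=0.0)   # assumed range
    u3 = linear_utility(cost_usd, worst=800.0, best=0.0)  # assumed range
    return W_FIDELITY * u1 + W_SAMPLE * u2 + W_COST * u3


# Two hypothetical systems: cheap but less accurate vs. costly but more accurate.
cheap = overall_utility(prob_correct=0.60, sample_mg=0.01, cost_usd=100.0)
thorough = overall_utility(prob_correct=0.90, sample_mg=1.41, cost_usd=720.0)
```

With fidelity weighted at 0.81, differences in the probability of a correct identification dominate the comparison unless the other attributes differ greatly.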
  28. Comparing signatures with expected utility II

    These systems included the IRMS assay
  29. MLSTONES: Machine Learning String Tools for Operational and Network Security

    Information is encoded and passed down through genes to proteins
    Proteins are built from ∼20 amino acids, each of which can be mapped to one of 20 characters
    Protein sequences can be represented as text strings
    Proteins with similar sequences often have similar properties; matches do not have to be “exact”
    Elena Peterson, Chris Oehmen
  30. Tools and models for sequence analysis

    BLAST (Basic Local Alignment Search Tool) is the de facto standard for sequence alignment
    ScalaBLAST is a highly parallelized implementation of the BLAST algorithms that runs much faster and improves throughput [Oehmen & Baxter, 2013]

    Query sequence:
      >entity_1
      HMMMCAAANGEFPIACLLLQAACDFAEFPADIADHAKDFENGAEAKADFEAFEAAAKCDF
      EAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEA
      KAACDFEAFEAKAACDFEAFEAKAACDFEAFENGEAKAACDFEAAACDFEAHIHDFPIHI
      HIHAKKAADFPIHIHIHIHIHIHAAIAIAAADFPIAIAADHIHIHIHIHIHIHH
    Library or ‘subject’ sequence:
      >entity_2
      HCAAAAANGEFPIAACQAACDFAEFPADIADAAACQAKDFENGAEAKADFEAFEAAAKCD
      FEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFEAKAACDFEAFE
      AKAACDFEAFEAKAACDFEAFEAKAACDFEAFENGEAKAACDFEAAACDFEAHIAACQDF
      PIAACQIHIHAKKAADFPIHIHIHIHIHIHAAIAIAAADFPIAIAADHIHIHIHIHIHIH
      H
    Alignment and similarity score for the sequence pair (figure annotations: exact matches, gap insertion, mismatch):
      SIMILARITY SCORES: Score = 37.7 bits (119), Expect = 4e-04
      ALIGNMENT REGION:
      Query: 6  AAANGEFPIA-CLLLQAACDFAEFPADIAD----HAKDFENG 42
                AAANGEFPIA C   QAACDFAEFPADIAD     AKDFENG
      Sbjct: 5  AAANGEFPIAAC---QAACDFAEFPADIADAAACQAKDFENG 43
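To make the idea of a gapped local alignment score concrete, the sketch below implements the textbook Smith-Waterman recurrence with a linear gap penalty. This is a stand-in, not BLAST or ScalaBLAST: BLAST is a seeded heuristic that uses substitution matrices and reports bit scores and E-values, so the integer returned here is not comparable to the "37.7 bits" on the slide. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions.

```python
def local_alignment_score(query: str, subject: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between two strings."""
    cols = len(subject) + 1
    prev = [0] * cols  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if query[i - 1] == subject[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + step, # align the two characters
                prev[j] + gap,      # gap in the subject
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The aligned regions from the slide, with gap characters removed.
query_region = "AAANGEFPIACLLLQAACDFAEFPADIADHAKDFENG"
subject_region = "AAANGEFPIAACQAACDFAEFPADIADAAACQAKDFENG"
score = local_alignment_score(query_region, subject_region)
```

Only two rows of the dynamic-programming table are kept because the sketch returns the score alone; recovering the alignment itself would require storing the full table and tracing back.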
  31. Application to binary analysis

    Binaries can be aligned with MLSTONES and clustered based on similarity [Peterson et al., 2013]
  32. Sequence motifs as signatures

      files1/thing1:HMMM---ANGEFPIACLLLQAACDMNVCDADIADHKPMKWTVRTMKADFEAFEAAAK
      files1/thing2:HMMM---ANGEFPIACLLLQAACDPOIYTADIADHKPMKWTVRTMKADFEAFEAAAK
      files1/thing3:HMMM---ANGEFPIACLLLQAACDKKLMNADIADH----------KADFEAFEAAAK
      files1/thing4:HMMM---ANGEFPIACLLLQAACDSDFGHADIADH----------KADFEAFEAAAK
      files1/thing5:HMMM---ANGEFPRACLLLQAACDMMNPQADIADH----------KADFEAFEAAAK
      files1/thing6:HMMM---ANGEFPRACLLLQAACDERVPQADIADH----------KADFEAFEAAAK
      files1/thing7:HMMM---ANGEFPRACLLLQAACDKMLRTADIADH----------KADFEAFEAAAK
      files1/thing8:HMMM---ANGEFPRACLLLQAACDDDGKLADIADH----------KADFEAFEAAAK
      files1/thing9:HMMMLMNANGEFPRACLLLQAACDGHKIDADIADH----------KADFEAFEAAAK
                    ****   ******:**********     ******          ************

    Representative motifs can be identified with multiple alignments using software such as MAFFT
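The row of asterisks under the alignment marks columns where every sequence carries the same character; runs of such columns are candidate motifs. The sketch below recomputes that row from the nine aligned sequences on the slide. It is a simplification under stated assumptions: it marks only fully conserved columns with `*` and leaves everything else blank, so it does not reproduce the `:` that the alignment tool prints for the I/R column, and `conservation_line` is a hypothetical helper, not part of MAFFT.

```python
from typing import List

# Aligned sequences copied from the slide (file-name prefixes removed).
ALIGNED = [
    "HMMM---ANGEFPIACLLLQAACDMNVCDADIADHKPMKWTVRTMKADFEAFEAAAK",
    "HMMM---ANGEFPIACLLLQAACDPOIYTADIADHKPMKWTVRTMKADFEAFEAAAK",
    "HMMM---ANGEFPIACLLLQAACDKKLMNADIADH----------KADFEAFEAAAK",
    "HMMM---ANGEFPIACLLLQAACDSDFGHADIADH----------KADFEAFEAAAK",
    "HMMM---ANGEFPRACLLLQAACDMMNPQADIADH----------KADFEAFEAAAK",
    "HMMM---ANGEFPRACLLLQAACDERVPQADIADH----------KADFEAFEAAAK",
    "HMMM---ANGEFPRACLLLQAACDKMLRTADIADH----------KADFEAFEAAAK",
    "HMMM---ANGEFPRACLLLQAACDDDGKLADIADH----------KADFEAFEAAAK",
    "HMMMLMNANGEFPRACLLLQAACDGHKIDADIADH----------KADFEAFEAAAK",
]


def conservation_line(aligned: List[str]) -> str:
    """Return '*' for columns identical in every row (gaps never count), else ' '."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


def conserved_blocks(aligned: List[str], min_length: int = 4) -> List[str]:
    """Extract runs of fully conserved columns at least min_length long."""
    line = conservation_line(aligned)
    blocks, start = [], None
    for i, mark in enumerate(line + " "):  # sentinel closes a trailing run
        if mark == "*" and start is None:
            start = i
        elif mark != "*" and start is not None:
            if i - start >= min_length:
                blocks.append(aligned[0][start:i])
            start = None
    return blocks


line = conservation_line(ALIGNED)
motifs = conserved_blocks(ALIGNED)
```

The extracted blocks are the literal substrings shared by all nine inputs, which is the simplest form of a sequence motif; a profile or regular expression would be needed to capture columns that vary within a small set of characters.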
  33. Sequence families as signatures

    Sequence families can be represented with hidden Markov models using software such as HMMER
  34. General applicability to sequence-type targets

    Pattern detection in biological systems provides:
      A rich base of knowledge for detecting correlations
      A useful vocabulary for describing relationships and inheritance in those correlations
    Applying biological principles of inheritance makes it possible to:
      Annotate or infer function
      Provide a framework for forensic inference: create families and motifs based on “developer” or other forensic markers
  35. Outline: Conclusions and future work
  36. Summary and future work

    Making the signature discovery process:
      More rigorous
      More reproducible
      More efficient
    Continued investment in signature discovery:
      Demonstrating impact through application
      Testing the “generalizability” hypothesis
  37. Learn more about the Signature Discovery Initiative at http://signatures.pnnl.gov or contact Nathan Baker ([email protected]).
  38. References I

    [Amidan et al., 2013] Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Smith DJ, Sego LH, Payne SH, Tardiff MF. Signatures for Mass Spectrometry Data Quality. Molecular and Cellular Proteomics, in revision.
    [Baker et al., 2013] Baker NA, Barr JL, Bonheyo GT, Joslyn CA, Krishnaswami K, Oxley ME, Quadrel R, Sego LH, Tardiff MF, Wynne AS. Research towards a systematic signature discovery process. IEEE Intelligence and Security Informatics Signature Discovery Workshop, 2013. http://dx.doi.org/10.1109/ISI.2013.6578848
    [Hogan et al., 2013a] Hogan EA, Johnson JR III, Halappanavar M, Lo C. Graph Analytics for Signature Discovery. IEEE International Conference on Intelligence and Security Informatics (ISI), 2013, 315-320. http://dx.doi.org/10.1109/ISI.2013.6578850
    [Hogan et al., 2013b] Hogan EA, Johnson JR III, Halappanavar M. Graph Coarsening for Path Finding in Cybersecurity Graphs. Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop (CSIIRW’13), 2013, 7. http://dx.doi.org/10.1145/2459976.2459984
  39. References II

    [Oehmen & Baxter, 2013] Oehmen CS, Baxter DJ. ScalaBLAST 2.0: rapid and robust BLAST calculations on multiprocessor systems. Bioinformatics, 29, 797-798, 2013. http://dx.doi.org/10.1093/bioinformatics/btt013
    [Peterson et al., 2013] Peterson ES, Curtis DS, Phillips AR, Teuton JR, Oehmen CS. A Generalized Bio-inspired Method for Discovering Sequence-based Signatures. IEEE International Conference on Intelligence and Security Informatics (ISI), 2013. http://dx.doi.org/10.1109/ISI.2013.6578853
    [Nobles et al., 2013] Nobles MA, Sego LH, Cooley SK, Gosink LJ, Anderson RM, Hays SE, Tardiff MF. A Decision Theoretic Approach to Evaluate Radiation Detection Algorithms. Proceedings of the 2013 IEEE International Conference on Technologies for Homeland Security, to appear.
  40. References III

    [Sego et al., 2013] Sego LH, Holmes AE, Gosink LJ, Webb-Robertson BJM, Kreuzer HW, Anderson RM, Brothers AJ, Corley CD, Tardiff MF. Assessing the Quality of Bioforensic Signatures. IEEE Intelligence and Security Informatics Signature Discovery Workshop, 2013. http://dx.doi.org/10.1109/ISI.2013.6578856