Nathan Baker
September 29, 2014

Signature Discovery: Data Analytics and Data Fusion

Presentation at SCIX conference, 29-Sep-2014.

Transcript

  1. Signature Discovery: Data Analytics and Data Fusion

    Nathan Baker, Applied Statistics & Computational Modeling Group, Pacific Northwest National Laboratory. September 29, 2014.
    Sections: Signature discovery; Signature Discovery Initiative research projects; Conclusions and ongoing work.
    Footer: Nathan Baker, Signature Discovery (PNNL-SA-105568).
  2. Acknowledgements

    Funding: PNNL Laboratory-Directed Research & Development (LDRD).
    Subset of the 100+ team members: George Bonheyo, Paul Bruillard, Courtney Corley, Alejandro Heredia-Langner, Emilie Hogan, Kris Jarman, Helen Kreuzer, Kannan Krishnaswami, Lee Ann McCue, Jason McDermott, Mark Oxley (AFIT), Rich Quadrel, Karin Rodland, Landon Sego, Jana Strasburg, Mark Tardiff, Sandy Thompson, Karen Wahl, Marvin Warner, Bobbie-Jo Webb-Robertson.
  3. Outline

    Signature discovery: Overview; Signature discovery process; Signature Discovery Initiative.
    Signature Discovery Initiative research projects: Signature discovery methodology; Applying signature discovery to bioforensics; Signature quality assessment.
    Conclusions and ongoing work.
  4. Signatures provide a framework for heterogeneous feature integration

    Signature: a distinguishing collection of features that detects or characterizes a phenomenon of interest.
    Types: forensic signatures, diagnostic signatures, prognostic signatures.
    Example: bioforensic attribution.
  5. Themes in signature discovery

    Modern signatures often involve features from many different data types.
    Signatures are most useful when they are easily interpreted by decision-makers.
  6. Signature systems as mappings

    Elements of a signature system [Baker et al., 2013]:
    Dimensional sets: dimensions can have multiple types (Boolean, scalar, vector, et cetera).
    Mappings: µ (observation), ϑ (feature extraction), δ (classification).
    Events --µ--> Measurements --ϑ--> Features --δ--> (Labels, Uncertainties).
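The chain of mappings on this slide can be read as ordinary function composition. The following is a minimal sketch, assuming toy stand-ins for each mapping: the event names, the readings, the `mean`/`peak` features, and the threshold classifier are all hypothetical and are not taken from the talk.

```python
from statistics import mean


def observe(event):
    """mu: map an event to measurements (hypothetical lookup, not a real instrument)."""
    readings = {"quiet": [0.1, 0.2, 0.1], "active": [0.9, 1.1, 1.0]}
    return readings[event]


def extract_features(measurement):
    """theta: reduce raw measurements to a small set of named features."""
    return {"mean": mean(measurement), "peak": max(measurement)}


def classify(features, threshold=0.5):
    """delta: map features to a (label, uncertainty) pair.

    The uncertainty is a crude illustrative score: 1.0 at the threshold,
    shrinking to 0.0 as the mean moves away from it.
    """
    label = "target" if features["mean"] > threshold else "background"
    distance = abs(features["mean"] - threshold) / threshold
    uncertainty = max(0.0, 1.0 - distance)
    return label, uncertainty


def signature_system(event):
    """Compose the three mappings: Events -> Measurements -> Features -> (Label, Uncertainty)."""
    return classify(extract_features(observe(event)))


if __name__ == "__main__":
    for event in ("quiet", "active"):
        print(event, signature_system(event))
```

Keeping the three stages as separate functions mirrors the slide's separation of observation, feature extraction, and classification, so any one stage can be swapped without touching the others.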
  7. Developing a model of the signature discovery process

    [Flowchart, START to FINISH. Phases: Problem specification; Inventory hypotheses and observables; Specify measurements; Data assessment and exploration; Signature development. Steps: Specify target phenomena of interest; Specify purpose; Generate hypotheses; Identify potential observables; Select observables for measurement; Specify measurement principle; Specify measurement procedure; Measure or collect; Feature extraction processes; Signature construction; Signature detection; Signature quality assessment. Decision points: Features suitable? Signature suitable?]
    Is the process of signature discovery generalizable across domains?
    Developed over the past four years.
    Based on multiple sources: basic and applied research across several domains; end-users; ongoing research projects; 1000+ literature articles.
    Highly iterative process.
  8. Specifying the problem

    [Process flowchart as on slide 7.]
    What are we looking for?
    Specify target phenomenon of interest: defines scope of signature (event space); currently precludes anomaly detection.
    Specify purpose: goal (detect, characterize); time frame (forensic, diagnostic, prognostic).
  9. Inventorying observables

    [Process flowchart as on slide 7.]
    How do we observe the events associated with our target?
    This phase starts with testable hypotheses: hypotheses about the target; observables to test hypotheses.
    Steps rely heavily on social dimension: multidisciplinary teams; brainstorming and related methods.
    Selection of observables drives measurement process.
  10. Specifying measurements

    [Process flowchart as on slide 7.]
    How do we collect data for our observables?
    Data collection requires: sampling strategy; measurement principle and platform.
    Useful to characterize noise and drift.
    Statistical design of experiments is important.
  11. Assessing and exploring the data

    [Process flowchart as on slide 7.]
    Do salient features exist in the collected data?
    Typically results in significant reduction of data dimension and volume.
    Useful tools include: data sub-setting; clustering; exploratory data analysis.
    Resulting features form the basis for the signature.
    Highly iterative process.
  12. Developing, deploying, and assessing the signature

    [Process flowchart as on slide 7.]
    How do we build and test the signature?
    Choose and train the classifier.
    Apply the classifier to data.
    Characterize signature behavior in presence of interference, drift, and other dynamics.
    Evaluate signature performance on multiple dimensions.
    Lack of suitable signature drives iterative development.
  13. The Signature Discovery Initiative

    [Process flowchart as on slide 7.]
    SDI has developed a formal process that transforms signature discovery using diverse data types to become:
    More efficient: reducing trial-and-error and decreasing the time to discover a signature.
    More economical: delivering methods and tools that allow users to reuse, rather than reinvent, signature discovery resources.
    More rigorous: providing robust and well-defined processes for signature discovery.
  14. Outline

    Signature discovery: Overview; Signature discovery process; Signature Discovery Initiative.
    Signature Discovery Initiative research projects: Signature discovery methodology; Applying signature discovery to bioforensics; Signature quality assessment.
    Conclusions and ongoing work.
  15. Instrumenting the signature discovery process: the SDI project portfolio
  16. Constructing and validating signatures I

    Signature Quality Metrics: going beyond ROC curves; framework to combine signature fidelity, risk, cost, and utility.
    Fishing for Features: greedy strategy to feature discovery; applied to survey-based disease signatures; applied to microbe-based fuel production.
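The slide names a greedy strategy for feature discovery but does not describe the algorithm. As an illustration only, here is a generic greedy forward selection, which is one common greedy approach and not necessarily the one used in the Fishing for Features project. The feature names, their values, and the per-feature penalty in `toy_score` are hypothetical.

```python
def greedy_select(features, score, max_features):
    """Greedy forward selection.

    Starting from an empty set, repeatedly add the single feature that most
    improves score(subset); stop when no candidate improves the score or
    when max_features have been chosen.
    """
    selected = []
    best = score(selected)
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        gains = {f: score(selected + [f]) for f in candidates}
        top = max(gains, key=gains.get)
        if gains[top] <= best:
            break  # no remaining feature helps
        selected.append(top)
        best = gains[top]
    return selected, best


# Hypothetical per-feature value; each added feature also costs 0.1.
TOY_VALUES = {"a": 0.5, "b": 0.3, "c": 0.05}


def toy_score(subset):
    """Illustrative objective: total value minus a fixed cost per feature."""
    return sum(TOY_VALUES[f] for f in subset) - 0.1 * len(subset)


if __name__ == "__main__":
    print(greedy_select(list(TOY_VALUES), toy_score, max_features=3))
```

In practice `score` would be a cross-validated performance measure for a classifier trained on the candidate subset; the greedy loop itself stays the same.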
  17. Constructing and validating signatures II

    Expert- and Data-driven Approaches to Signature Construction: enhancing expert input with data-driven discovery; applications to biomarker discovery.
    Multi-source Signatures for Bioforensics and Nuclear Programs: fusing features from diverse sources; providing transparent interpretability; wide-ranging applications.
  18. Detecting Signatures in Dynamic Environments I

    Bioinformatics-Inspired Signature Detection: generalized alphabets for sequence signatures; applications to cyber-security.
    Compressive Sensing for Threat Detection: sparse representation of signatures and data; reduces data requirements for signature detection.
  19. Detecting Signatures in Dynamic Environments II

    Graph Analytics for Dynamic Event Signatures: reduction of large data environments; detection of temporal signatures.
    Hierarchical Signature Detection: multilevel signature detection; supports bandwidth-limited data environments.
  20. Detecting Signatures in Dynamic Environments III

    Sensor Degradation and Signature Detection: understand the effect that changes in operating conditions have on signatures; design systems for robust performance.
  21. Integrating signature discovery capabilities I

    Analytic Framework Methods and Architecture: a software platform to apply signature discovery methodologies across multiple domains; combines SDI tools and algorithms into a single computational framework.
    Semantic Workflows for Signature Discovery: guides construction of signature discovery workflows; evaluates semantic similarity in signature workflows.
    Signature Discovery Workbench: facilitates user interaction with SDI tools; supports visual analytic approaches to feature selection and signature construction.
  22. Signatures for bioforensic attribution

    Bobbie-Jo Webb-Robertson, Courtney Corley, Helen Kreuzer, Lee Ann McCue, Karen Wahl.
  23. Building a multi-source signature

    Goal: Develop signatures to prioritize bioforensic investigation leads.
    Challenge: How can structured data (experimental data) and unstructured data (text, literature, reporting) be integrated into a signature for attribution?
  24. Multi-source experimental signatures for production environments

    Prior work [Jarman et al., 2008] demonstrated that using disparate analytical measurements (D_S, D_M, D_E, D_I) of Bacillus spores could yield a predictive model of production environment (R).
  25. Multi-source experimental signatures for production environments

    Example production environments:
    BA: Blood Agar
    G{A, B}: Glucose Medium {Agar, Broth}
    LB{A, B}: Luria-Bertani {Agar, Broth}
    N{A, B}: Nutrient {Agar, Broth}
    NSM{A, B}: Nutrient Sporulating Medium {Agar, Broth}
    TS{A, B}: Tryptic Soy {Agar, Broth}
    . . .
  26. Multi-source experimental signatures for production environments
  27. Expanding experimental signatures for institution prioritization

    How do we identify institutions that have experience with the kind of culturing practice pointed to by the experimental evidence?
  28. Expanding experimental signatures for institution prioritization

    Institutions are linked to the literature; can culturing recipes be predicted from scientific documents?
  29. Expanding experimental signatures for institution prioritization

    Use text mining methods on journal articles in the public domain.
  30. Expanding experimental signatures for institution prioritization

    Probabilities can be calculated based on experimental evidence and observed text entity frequencies.
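The slide does not give the formula used to combine the two evidence streams. One simple way to do it, shown here purely as an assumption, is to weight each institution's relative frequency of culture-medium mentions by the experimentally derived probability of each medium, then normalize across institutions. The medium probabilities, institution names, and mention counts below are invented for illustration.

```python
def institution_scores(medium_probability, mention_counts):
    """Rank institutions by agreement between experiment and literature.

    medium_probability: {medium: P(medium | experimental evidence)}
    mention_counts: {institution: {medium: number of text mentions}}

    Each institution's raw score is the sum over media of
    P(medium | evidence) * (fraction of its mentions naming that medium).
    Raw scores are then normalized to sum to 1 across institutions.
    """
    raw = {}
    for institution, counts in mention_counts.items():
        total = sum(counts.values())
        if total == 0:
            raw[institution] = 0.0
            continue
        raw[institution] = sum(
            medium_probability.get(medium, 0.0) * count / total
            for medium, count in counts.items()
        )
    norm = sum(raw.values())
    if norm == 0:
        return {institution: 0.0 for institution in raw}
    return {institution: score / norm for institution, score in raw.items()}


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    posterior = {"BA": 0.7, "LB": 0.2, "TS": 0.1}
    mentions = {
        "inst_A": {"BA": 8, "LB": 2},
        "inst_B": {"LB": 5, "TS": 5},
    }
    scores = institution_scores(posterior, mentions)
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 3))
```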
  31. Testing the combined signature

    Information: 144 total documents (52 documents hand-curated, 92 additional documents); 165 institutions.
    Evaluation: cross-validation (bootstrapping) on the 52 documents; area under the Receiver Operating Characteristic curve (AUC).
  32. Testing the combined signature

    Results are limited to the culture media covered by the hand curation.
    Bayesian AUC = 0.7 ± 0.2; random AUC = 0.5 ± 0.1.
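For readers unfamiliar with how an AUC with a spread such as "0.7 ± 0.2" can be obtained, the sketch below computes AUC from its rank definition (the probability that a positive example outscores a negative one) and estimates a mean and standard deviation by bootstrap resampling. It is a generic illustration, not the evaluation code behind the reported numbers, and the scores and labels are hypothetical.

```python
import random
from statistics import mean, stdev


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(scores, labels, n_resamples=200, seed=0):
    """Mean and standard deviation of AUC over bootstrap resamples.

    Resamples that contain only one class are skipped, because AUC is
    undefined for them.
    """
    auc(scores, labels)  # validates that both classes are present
    rng = random.Random(seed)
    indices = range(len(scores))
    values = []
    while len(values) < n_resamples:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):
            values.append(auc([scores[i] for i in sample], ys))
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Hypothetical classifier scores and 0/1 ground-truth labels.
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
    labels = [1, 1, 0, 1, 0, 0]
    print("AUC:", auc(scores, labels))
    print("bootstrap mean, sd:", bootstrap_auc(scores, labels))
```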
  33. Advantages and application of the combined signature

    Integrates available information: additional experimental and/or soft data streams can be added.
    Can modify the final probability and prioritization (e.g., foreign vs. domestic, individual researchers).
    Automated approach: any number of documents (institutions, people) can be evaluated.
    Yields an easy-to-interpret confidence metric.
  34. Signature Quality Metrics (SQM)

    [Diagram labels: Fidelity; Other attributes; Risk; Cost; Utility analysis; Primarily data driven; Problem specific, requires SME input; Graphics & metrics to compare signature systems; SQM model and software.]
    Landon Sego, Aimee Holmes, Luke Gosink, Bobbie-Jo Webb-Robertson, Helen Kreuzer, Daniel Watkins, Richard Anderson, Alan Brothers, Courtney Corley, Mark Tardiff.
  35. Demonstrated SQM application areas

    Bioforensics: identification of culture media and “suspect” institutions [Sego et al., 2013, Watkins et al., 2013].
    Radiation Portal Monitoring: algorithm comparison for gamma-ray PVT scanners at U.S. border crossings [Nobles et al., 2013].
    Proteomics: identifying composite measures of quality for proteomic data sets [Amidan et al., 2014].
  36. A bioforensic signature system

    The Bayes net estimates the probability that a particular culture medium was used to grow the spores in the sample.
    15 possible candidate signature systems.
  37. Identifying the attributes of signature quality

    Fidelity: for each forensic sample, a Bayes net produces a vector of probabilities indexed by culture medium. Fidelity is measured by the log score: the natural log of the probability assigned to the true culture medium.
    Cost: monetary costs were identified for each assay, based on average prices posted by commercial laboratories.
    Sample size: the most precious resource is the biological sample.
    Assays, in the order listed on the slide: presence of heme; presence of agar; C/N isotope ratios; metals (Cu, Zn).
    Costs, in the order listed on the slide: $200; $100; $250; $170.
    Sample sizes, in the order listed on the slide: 0.1 mg; 0.01 mg; 1.0 mg; 0.3 mg.
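The log score defined on this slide is simple to compute. A minimal sketch follows; the probability vectors are hypothetical, and averaging over samples is an assumption about how per-sample scores might be summarized rather than something the slide states.

```python
import math


def log_score(probabilities, true_medium):
    """Natural log of the probability assigned to the true culture medium.

    probabilities: {medium: probability} as produced for one forensic sample.
    Returns 0.0 for a perfectly confident correct prediction and -inf when
    the true medium was given zero probability.
    """
    p = probabilities.get(true_medium, 0.0)
    return math.log(p) if p > 0 else float("-inf")


def mean_log_score(predictions):
    """Average log score over (probabilities, true_medium) pairs (assumed summary)."""
    return sum(log_score(p, t) for p, t in predictions) / len(predictions)


if __name__ == "__main__":
    # Hypothetical outputs for two samples.
    samples = [
        ({"BA": 0.7, "LB": 0.2, "TS": 0.1}, "BA"),
        ({"BA": 0.3, "LB": 0.6, "TS": 0.1}, "BA"),
    ]
    for probs, truth in samples:
        print(truth, round(log_score(probs, truth), 3))
    print("mean:", round(mean_log_score(samples), 3))
```

Scores closer to zero indicate higher fidelity, since the log of a probability is never positive.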
  38. Pareto frontier of bioforensic signatures
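A Pareto frontier is the set of candidates that no other candidate beats on every attribute at once. The sketch below illustrates the idea for signature systems scored on fidelity (higher is better), assay cost, and sample size (both lower is better). The four systems and their attribute values are hypothetical and do not reproduce the figure on this slide.

```python
def dominates(a, b):
    """True if system a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (
        a["fidelity"] >= b["fidelity"]
        and a["cost"] <= b["cost"]
        and a["sample_mg"] <= b["sample_mg"]
    )
    strictly_better = (
        a["fidelity"] > b["fidelity"]
        or a["cost"] < b["cost"]
        or a["sample_mg"] < b["sample_mg"]
    )
    return no_worse and strictly_better


def pareto_frontier(systems):
    """Return the systems that are not dominated by any other system."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


# Hypothetical candidate systems (fidelity is a mean log score, so <= 0).
SYSTEMS = [
    {"name": "A", "fidelity": -0.5, "cost": 300, "sample_mg": 1.1},
    {"name": "B", "fidelity": -0.9, "cost": 100, "sample_mg": 0.01},
    {"name": "C", "fidelity": -1.0, "cost": 300, "sample_mg": 1.1},  # dominated by A
    {"name": "D", "fidelity": -0.9, "cost": 200, "sample_mg": 0.1},  # dominated by B
]

if __name__ == "__main__":
    print([s["name"] for s in pareto_frontier(SYSTEMS)])
```

Systems on the frontier represent genuine trade-offs; choosing among them requires a preference model such as a utility function.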
  39. Comparing signatures with expected utility I

    u = 0.80 u_1(Prob. correct culture medium) + 0.15 u_2(Sample size) + 0.05 u_3(Assay cost)
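The weighted sum on this slide can be evaluated directly once the single-attribute utilities are chosen. In the sketch below the weights 0.80, 0.15, and 0.05 come from the slide; the linear form of u_1, u_2, and u_3, the worst/best ranges, and the example system are assumptions made for illustration.

```python
# Weights taken from the slide.
WEIGHTS = {"p_correct": 0.80, "sample_mg": 0.15, "cost": 0.05}

# Hypothetical (worst, best) value for each attribute; lower sample size and
# lower cost are treated as better.
RANGES = {
    "p_correct": (0.0, 1.0),
    "sample_mg": (2.0, 0.0),
    "cost": (1000.0, 0.0),
}


def linear_utility(value, worst, best):
    """Assumed single-attribute utility: 0 at worst, 1 at best, linear between, clamped."""
    u = (value - worst) / (best - worst)
    return min(1.0, max(0.0, u))


def total_utility(system, weights=WEIGHTS, ranges=RANGES):
    """Weighted sum of single-attribute utilities, u = sum_k w_k * u_k(x_k)."""
    return sum(
        w * linear_utility(system[k], *ranges[k]) for k, w in weights.items()
    )


if __name__ == "__main__":
    # Hypothetical signature system.
    system = {"p_correct": 0.9, "sample_mg": 1.0, "cost": 500.0}
    print(round(total_utility(system), 3))
```

Because fidelity carries 80% of the weight, differences in sample size or assay cost only change the ranking between systems whose probability of a correct call is similar.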
  40. Comparing signatures with expected utility II

    These systems included the IRMS assay.
  41. Outline

    Signature discovery: Overview; Signature discovery process; Signature Discovery Initiative.
    Signature Discovery Initiative research projects: Signature discovery methodology; Applying signature discovery to bioforensics; Signature quality assessment.
    Conclusions and ongoing work.
  42. Expanding the scope of Signature Discovery Initiative applications

    New investments in applications: large-scale Challenge Projects; small-scale application projects.
    Goals:
    Demonstrate impact and ability to construct, measure, detect, and assess quality for new signatures.
    Evaluate cross-domain generalizability of the developed approaches.
    Quantify the improvement over existing signatures and signature discovery processes.
    Develop partnerships and collaborations.
    Signatures of Environmental Perturbation: Microbial Community and Organic Matter Resilience. Discover one or more signatures that indicate impaired resilience of the soil carbon biogeochemical system. The signature should inform a fundamental and predictive understanding of climate change effects on soils.
    Signatures of Illicit Nuclear Trafficking for Strategic Goods. Use the SDI process to discover, exploit, and validate signatures within networks of strategic goods in international commerce. Signatures will identify entities, relationships, and products having important implications to nuclear non-proliferation efforts.
  43. Learn more about the Signature Discovery Initiative at http://signatures.pnnl.gov or contact Nathan Baker ([email protected]).
  44. References I

    [Amidan et al., 2014] Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Smith DJ, Sego LH, Payne SH, Tardiff MF. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research, 13, 2215-22, 2014. http://dx.doi.org/10.1021/pr401143e
    [Baker et al., 2013] Baker NA, Barr JL, Bonheyo GT, Joslyn CA, Krishnaswami K, Oxley ME, Quadrel R, Sego LH, Tardiff MF, Wynne AS. Research towards a systematic signature discovery process. IEEE Intelligence and Security Informatics Signature Discovery Workshop, 2013. http://dx.doi.org/10.1109/ISI.2013.6578848
    [Jarman et al., 2008] Jarman K, Kreuzer-Martin H, Wunschel D, Valentine N, Cliff J, Petersen C, Colburn H, Wahl K. Bayesian-Integrated Microbial Forensics. Applied and Environmental Microbiology, 74, 3573-82, 2008. http://dx.doi.org/10.1128/aem.02526-07
  45. References II

    [Nobles et al., 2013] Nobles MA, Sego LH, Cooley SK, Gosink LJ, Anderson RM, Hays SE, Tardiff MF. A Decision Theoretic Approach to Evaluate Radiation Detection Algorithms. Proceedings of the 2013 IEEE International Conference on Technologies for Homeland Security, 2013. http://dx.doi.org/10.1109/THS.2013.6699086
    [Sego et al., 2013] Sego LH, Holmes AE, Gosink LJ, Webb-Robertson BJM, Kreuzer HW, Anderson RM, Brothers AJ, Corley CD, Tardiff MF. Assessing the Quality of Bioforensic Signatures. IEEE Intelligence and Security Informatics Signature Discovery Workshop, 2013. http://dx.doi.org/10.1109/ISI.2013.6578856
    [Watkins et al., 2013] Watkins DM, Sego LH, Holmes AE, Webb-Robertson BM, White AM, Wunschel DS, Kreuzer HW, Corley CD, Tardiff MF. Assessing Performance and Tradeoffs of Bioforensic Signature Analysis. Proceedings of the 2013 IEEE International Conference on Technologies for Homeland Security, 304-309, 2013. http://dx.doi.org/10.1109/ths.2013.6699019