KBase Fizzy

Gregory Ditzler
February 19, 2013

Invited talk from a 2013 DOE contractors meeting.

Transcript

  1. Overview
    •  What are the functional features (or some other variables) that provide the most differentiating information between multiple phenotypes in my data set?
      –  What are the OTUs that best differentiate between healthy and unhealthy patients?
      –  Such knowledge is useful not only for classification, but also for interpreting a data sample
    •  Feature selection can aid research on large sample databases where examining the data by hand is infeasible
    •  We propose using a feature selection tool, based on the Chi-squared test, for selecting the top 15 features in a sample
      –  Deployment is on a KBase server for public use
      –  Feature selection is implemented using the scikit-learn Python module
  2. Phenotype Classification: Which Features Best Discriminate Between Classes?
    [Figure: workflow diagram. Functional annotation (assign functions using functional databases such as Pfam, SEED, etc.) produces a "functional profile"; feature selection; count frequencies of selected functions; train a classifier on training metagenomes (Group 1, Group 2); classify testing metagenomes; measure performance and determine the feature subsets that have the best accuracy and AUC.]
  3. Phenotype Classification: Which Features Best Discriminate Between Classes?
    [Figure: the same workflow diagram as the previous slide (functional annotation, "functional profile", feature selection, frequency counts, classifier training and testing, performance measurement), with parts of the workflow labeled "User Supplied!" and "KBase!".]
  4. Overlap of Gut Microbiomes Case Study
    Found interesting trends along the age parameter after feature selection.
    Lan, Kriete, and Rosen. Accepted to BMC Microbiome Journal.
  5. Correlating Respiration Rates to Metabolic Pathways
    Total experiment: 7 fungi, 8 bacteria
    [Figure: "Ascomycetes on Xylan", CO2 (ppm/hr) versus hours, with "Faster" and "Slower" annotations.]
    We annotated MetaCyc pathways in the 15 genomes.
    Experimental data collected by Chris Blackwood's lab at Kent State University.
  6. Envisioned Workflow
    •  Input: Users will upload samples and have them annotated for taxonomy and function
    •  Implement a feature selection routine to identify relevant features in the learning problem
    •  Users will be allowed to select from a variety of information-theoretic methods for feature selection
    •  Output: Users will be returned a list of relevant features
  7. Overview
    [Figure: data-flow diagram with the labels: User, IDs and abundance source, Fizzy Module (Data Retrieval, Feature Selection), KBase Matrix Service, KBase Metagenome Service, Request data, Retrieve data, Features.]
    •  The user makes a request to Fizzy with KBase IDs and an abundance source
    •  Fizzy calls the KBase Matrix service to access the abundance data in BIOM format
    •  Scikit-learn's Chi-squared feature selection is called with the data from the KBase Matrix service
    •  The user is returned a list of feature names
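The data-retrieval step above hands feature selection an abundance table in BIOM format. A minimal sketch of turning such a table into a samples-by-features matrix follows, assuming the BIOM 1.0 JSON layout (rows are features, columns are samples, sparse data stored as [row, column, value] triples). The function name `biom_to_matrix` and all sample and feature IDs are hypothetical; this is not Fizzy's actual code.

```python
import json


def biom_to_matrix(biom):
    """Return (sample_ids, feature_names, X), with X laid out samples x features.

    Assumes a BIOM 1.0 style dict: "rows" describe features, "columns"
    describe samples, and "shape" is [n_features, n_samples].
    """
    feature_names = [row["id"] for row in biom["rows"]]
    sample_ids = [col["id"] for col in biom["columns"]]
    n_features, n_samples = biom["shape"]

    # Start from an all-zero table; unlisted sparse entries stay zero.
    X = [[0.0] * n_features for _ in range(n_samples)]

    if biom["matrix_type"] == "sparse":
        # Sparse data: list of [row_index, column_index, value] triples.
        for r, c, value in biom["data"]:
            X[c][r] = value
    else:
        # Dense data: one list per feature (row), one entry per sample.
        for r, row in enumerate(biom["data"]):
            for c, value in enumerate(row):
                X[c][r] = value
    return sample_ids, feature_names, X


# Hypothetical two-feature, three-sample table for illustration only.
EXAMPLE_BIOM = """
{
  "matrix_type": "sparse",
  "shape": [2, 3],
  "rows": [{"id": "func_1", "metadata": null},
           {"id": "func_2", "metadata": null}],
  "columns": [{"id": "sample_a", "metadata": null},
              {"id": "sample_b", "metadata": null},
              {"id": "sample_c", "metadata": null}],
  "data": [[0, 0, 5], [0, 2, 1], [1, 1, 7]]
}
"""

sample_ids, feature_names, X = biom_to_matrix(json.loads(EXAMPLE_BIOM))
```

The transpose matters: BIOM stores features as rows, while most feature selection routines expect one row per sample.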
  8. How Feature Selection Works
    •  The Chi-squared statistic measures the dependence between random variables
      –  Features and class labels are random variables
      –  Features that are independent of the labels are non-informative for prediction and should be discarded
    •  The feature selection method works by computing the Chi-squared statistic for each class and feature combination
      –  Features are assumed to be independent of one another
      –  The top 15 scoring features are selected
    •  The current implementation does not consider redundancy between features
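The scoring described above can be sketched in plain Python. This is an illustration under stated assumptions, not Fizzy's implementation: features are non-negative counts, the expected count of a feature in a class is the class frequency times the feature's total count, and the helper names `chi2_scores` and `select_top_k` and the OTU names are hypothetical.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative count data.

    X: list of samples, each a list of non-negative feature counts.
    y: list of class labels, one per sample.
    """
    n_samples = len(X)
    n_features = len(X[0])

    # Total count of each feature over all samples.
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]

    scores = [0.0] * n_features
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            # Expected count if feature j were independent of the label.
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores


def select_top_k(X, y, names, k=15):
    """Return the names of the k highest-scoring features."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    return [names[j] for j in ranked[:k]]


# Toy data: otu_a and otu_b track the label, otu_c is constant.
X = [[10, 1, 5], [12, 0, 5], [1, 9, 5], [0, 11, 5]]
y = ["healthy", "healthy", "unhealthy", "unhealthy"]
names = ["otu_a", "otu_b", "otu_c"]
top = select_top_k(X, y, names, k=2)
```

Each feature is scored on its own, so two highly correlated features can both land in the top k. In scikit-learn, the corresponding building blocks are `SelectKBest` with the `chi2` score function.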
  9. Fizzy: The Feature Selector
    •  Program Inputs
      –  Metagenome IDs (list <string>)
        •  Metagenome IDs that are currently available on KBase. The "label" information is assumed to be located in the metadata under: metadata::sample::data::biome
        •  Future work will extend this to user-uploaded data sets
      –  Source (string)
        •  M5NR, RefSeq, SwissProt, GenBank, IMG, SEED, TrEMBL, PATRIC, KEGG, M5RNA, RDP, Greengenes, LSU, and SSU
      –  Authentication may be required if access to "private" IDs is being requested
    •  Program Output
      –  List of feature names collected via feature selection (list <string>). Any errors encountered may be reported in the returned feature set.
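The label lookup described above can be pictured as walking a nested record along the path metadata::sample::data::biome. The sketch below assumes the metadata arrives as nested dictionaries keyed by the path components; that structure, the helper name `lookup_label`, and the example record are assumptions for illustration, not the KBase API.

```python
def lookup_label(record, path="metadata::sample::data::biome"):
    """Follow a '::'-separated path through nested dicts.

    Returns None if any component of the path is missing.
    """
    node = record
    for key in path.split("::"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical metagenome record; real KBase records may differ.
record = {
    "id": "example_metagenome_id",
    "metadata": {"sample": {"data": {"biome": "human-associated habitat"}}},
}
label = lookup_label(record)
missing = lookup_label({"metadata": {}})
```

Returning None for a missing path lets the caller decide whether to skip the sample or report an error.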
  10. Help with Fizzy (Python)
      gditzler$ ./kb_fizzy.py --help
      Usage: kb_fizzy.py [options]

      Options:
        -h, --help            show this help message and exit
        -i STRING, --IDs=STRING
                              IDs from KBase
        -s STRING, --source=STRING
                              Source of data
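A command-line interface with the options shown above could be declared as follows. This is a sketch using Python's standard `optparse` module, not the actual `kb_fizzy.py` source. How multiple IDs are passed is not stated in the slides; the comma-separated form used here, the helper names, and the example IDs are assumptions.

```python
from optparse import OptionParser


def build_parser():
    """Declare the two options listed in the help text."""
    parser = OptionParser(prog="kb_fizzy.py")
    parser.add_option("-i", "--IDs", dest="ids", metavar="STRING",
                      help="IDs from KBase")
    parser.add_option("-s", "--source", dest="source", metavar="STRING",
                      help="Source of data")
    return parser


def parse_cli(argv):
    """Return (list_of_ids, source) from an argument list."""
    options, _ = build_parser().parse_args(argv)
    # Assumption: several IDs arrive as one comma-separated string.
    ids = [s for s in (options.ids or "").split(",") if s]
    return ids, options.source


# Hypothetical invocation: kb_fizzy.py -i id_1,id_2 -s SEED
ids, source = parse_cli(["-i", "id_1,id_2", "-s", "SEED"])
```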
  11. Future Extensions to Fizzy
    •  Information-theoretic methods for variable selection
      –  A Chi-squared test is one of many approaches to feature selection. We plan on integrating information-theoretic feature selection methods, such as mRMR, into the KBase Fizzy module.
      –  Efficient C libraries exist for such implementations
    •  Upload your own abundance tables
      –  We are planning on extending Fizzy to use KBase services that allow users to upload their own data. In that case, Fizzy will receive the data from the KBase upload module rather than from metagenome IDs.
    •  Extended user control
      –  Control over: the feature selection method, the number of features to select, and parameters associated with feature selection
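To illustrate how an mRMR-style method differs from per-feature Chi-squared scoring, here is a small pure-Python sketch of greedy selection using one common form of the criterion: relevance to the label minus mean redundancy with the features already chosen. It assumes discrete feature values; the function names and toy data are hypothetical, and a production version would more likely rely on one of the efficient C libraries mentioned above.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in count_ab.items()
    )


def mrmr(columns, labels, k):
    """Greedily pick k feature names from `columns` (name -> values)."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in columns.items()}
    selected, remaining = [], list(columns)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(columns[name], columns[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: f_a_copy duplicates f_a; f_b carries complementary information.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
columns = {
    "f_a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "f_a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "f_b":      [0, 0, 1, 1, 0, 0, 1, 1],
}
chosen = mrmr(columns, labels, k=2)
```

All three features are equally relevant on their own, but once one copy of f_a is chosen the redundancy term removes the other copy's advantage, so f_b is picked instead of the duplicate.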