
My very first faculty job talk in 2008, USC

James Taylor
January 01, 2008

The very first talk from my faculty job search. Learned a lot, didn't get the offer. Still wondering if I should have left out the flying spaghetti monster.

Transcript

  1. Evolutionary signs of function • Comparisons between multiple genomes can

    reveal functional constraint • After speciation, different changes occur in each lineage; events occur randomly, but selection determines whether they are tolerated • Constraint due to function may prevent certain changes, resulting in a different pattern of change in functional regions
  2. [Figure: transcription initiation complex, gene promoter, and cis-regulatory module. Adapted from Wasserman and Sandelin, Nature Reviews Genetics, 2004.]
  3. [Figure: sequence logo of a binding site, positions -10 to 11, information content on a 0 to 2 bit scale. Structure: Schultz et al., Science, 1991. Logo software: Crooks et al., Genome Research, 2004.]
  4. Figure 3. Evolution of Zeste Binding Sites in the z and Ubx Promoters

    (A) Two experimentally characterized [37] Zeste binding sites in the z promoter for which we cannot rej evolving under the HBZ model using the T statistic. (B) Four experimentally characterized [34–36] Zeste binding sites the Ubx promoter for which we can rej evolving under the HBZ model using the T statistic. In the species missing orthologous binding sites for (i) on the opposite strand in approximately the same locations, consistent with compensatory evolution.
    Moses et al., PLOS Computational Biology, 2006
    Slide annotations: Matching sites on opposite strand • Insertion in D. mel maintains binding site
  5. Transcription regulation data from the ENCODE pilot project • Promoter

    activity in 16 cell lines for ~600 putative promoters • ChIP-chip for 18 DNA binding proteins (sequence specific and general transcription machinery) • DNaseI hypersensitivity and nucleosome depletion
  6. Comparative genomic data from the ENCODE pilot project • Sequence

    from orthologous regions in 28 species • Concept: annotate all of the regions of the human genome that are “under evolutionary constraint” • Used all three popular methods (phastCons, GERP, binCons) • “Moderate consensus set” of all regions predicted by at least two methods • Covers ~4.9% of the ENCODE regions
  7. For example, phastCons

    [Figure: model diagram with states c and n (plus begin); each state emits a string of alignment columns (≈ alignment). Parameter annotations: ≈ fraction of genome in conserved regions; ≈ average length of conserved elements. Siepel et al. Genome Research, 2006]
  8. Intersection of features with conserved regions

    [Figure: clipped page from the ENCODE pilot paper, Figure 11, "Overlap of constrained sequences and various experimental annotations". Panel a is a schematic of the overlap tests for individual bases and for entire regions; panel b shows the observed fraction of overlap, separately for bases and regions, for CDSs, 5′ UTRs, 3′ UTRs, Un.TxFrags, pseudogenes, RxFrags, DHSs, FAIRE, RFBRs-SeqSp, RFBRs, and ARs.]
    ENCODE “RFBRs” are too large (high overlap with constrained elements at region level, low at base level)
  9. ENCODE “RFBRs” are biased toward TSSs

    Excerpt from the ENCODE pilot paper: "regulatory information is not dispersed independently across the genome, but rather is clustered into distinct regions. We refer to regions that contain multiple regulatory elements as ‘regulatory clusters’."
    [Figure: ENCODE pilot paper, Figure 6, "Distribution of RFBRs relative to GENCODE TSSs". Axes: fraction of TSSs near RFBRs versus fraction of RFBRs near TSSs. Factors shown: E2F1, Pol II, TAF1, MYC, CTCF, SIRT1, SPI1, H3K27me3, STAT1, SMARCC1, SMARCC2, H3K4me2, H3K4me3, H3K4me1; sequence-specific factors in red, general factors in blue.]
  10. An alternative set of ENCODE regulatory elements • Putative transcriptional

    regulatory regions (pTRRs) • ChIP-chip data for sequence-specific factors and identified using experimental platforms with high site resolution • Supported by secondary evidence (DNase hypersensitivity, nucleosome depletion, certain chromatin modifications)
  11. Evaluation • Calculate the ability of score to distinguish regulatory

    regions from neutral regions at various thresholds • Sensitivity: fraction of regulatory regions scoring over threshold • Specificity: fraction of ancestral repeats excluded at threshold
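
The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy scores are hypothetical, and "excluded at threshold" is read here as "scoring at or below the threshold".

```python
def sensitivity(regulatory_scores, threshold):
    """Fraction of regulatory regions scoring over the threshold."""
    return sum(s > threshold for s in regulatory_scores) / len(regulatory_scores)


def specificity(neutral_scores, threshold):
    """Fraction of neutral regions (ancestral repeats) at or below the threshold."""
    return sum(s <= threshold for s in neutral_scores) / len(neutral_scores)


def roc_points(regulatory_scores, neutral_scores):
    """(1 - Sp, Sn) pairs, one per observed score used as a threshold."""
    thresholds = sorted(set(regulatory_scores) | set(neutral_scores))
    return [
        (1 - specificity(neutral_scores, t), sensitivity(regulatory_scores, t))
        for t in thresholds
    ]


def sensitivity_at_specificity(regulatory_scores, neutral_scores, target=0.75):
    """Sensitivity at the lowest threshold whose specificity reaches the target."""
    for t in sorted(set(neutral_scores)):
        if specificity(neutral_scores, t) >= target:
            return sensitivity(regulatory_scores, t)
    raise ValueError("target specificity cannot be reached")


# Toy scores, invented for illustration only.
regulatory = [0.9, 0.8, 0.25, 0.7]
neutral = [0.1, 0.3, 0.5, 0.2]
sn = sensitivity(regulatory, 0.35)
sp = specificity(neutral, 0.35)
```

Sweeping the threshold with `roc_points` gives the kind of curve plotted on the later ROC slides, and `sensitivity_at_specificity` corresponds to the "sensitivity when specificity is 75%" summary.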
  12. First problem: are we being too stringent? • phastCons is

    trained to partition conserved / non-conserved regions, but how informative is its range? • An alternative measure: Alignability • Simply, the fraction of an element that can be aligned between two species • Composite alignability: average of alignabilities of multiple species weighted by branch lengths
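
A minimal sketch of the alignability measures described on this slide. The exact weighting scheme is not given in the slide, so a plain branch-length-weighted mean is assumed here; the species, alignabilities, and branch lengths are made up.

```python
def alignability(aligned_bases, element_length):
    """Fraction of an element that can be aligned to one other species."""
    return aligned_bases / element_length


def composite_alignability(alignabilities, branch_lengths):
    """Average of per-species alignabilities, weighted by branch length.

    Both arguments are dicts keyed by species name. Using the branch
    length directly as the weight is an assumption of this sketch.
    """
    total = sum(branch_lengths[sp] for sp in alignabilities)
    weighted = sum(alignabilities[sp] * branch_lengths[sp] for sp in alignabilities)
    return weighted / total


# Hypothetical values for a single 200-base element.
per_species = {
    "mouse": alignability(100, 200),  # half of the element aligns
    "dog": alignability(200, 200),    # the whole element aligns
}
branches = {"mouse": 0.5, "dog": 0.25}
composite = composite_alignability(per_species, branches)
```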
  13. Background Correction • Need to correct for this regional variation

    to get comparable measures • Our approach: Standardize elements’ scores relative to the ENCODE region that contains them (0.5 to 1.8 megabases) • Takes into account other regional features, not just neutral rate
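
One way to read "standardize relative to the containing region" is a z-score against the distribution of scores in that region. The sketch below assumes that reading; the slide does not spell out the exact procedure, and the region names and numbers are invented.

```python
from statistics import mean, pstdev


def standardize(score, region_scores):
    """Z-score of one element's score against its region's background scores."""
    mu = mean(region_scores)
    sd = pstdev(region_scores)
    if sd == 0:
        return 0.0  # no variation in the region, so nothing to scale by
    return (score - mu) / sd


# Hypothetical background scores for two regions with different baselines.
background = {
    "regionA": [1.0, 2.0, 3.0],
    "regionB": [3.0, 4.0, 5.0],
}

# The same raw score means different things in different regions.
z_a = standardize(3.0, background["regionA"])
z_b = standardize(3.0, background["regionB"])
```

After correction, an element is judged against its local background rather than against a genome-wide scale, which is what makes scores from fast-evolving and slow-evolving regions comparable.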
  14. ROC Curves: “Specific” Stanford promoters

    [Figure: ROC curves (Sn versus 1-Sp, both axes 0.0 to 1.0) for specific promoters, with DNase hypersensitive sites and pTRRs also visible. Curves: composite alignability, background corrected composite alignability, phastCons, background corrected phastCons.]
  15. ROC Curves: pTRRs

    [Figure: ROC curves (Sn versus 1-Sp, both axes 0.0 to 1.0) for pTRRs, with DNase hypersensitive sites also visible. Curves: composite alignability, background corrected composite alignability, phastCons, background corrected phastCons.]
  16. Sensitivity when specificity is 75%

    Conclusion: pTRRs show evidence of evolutionary constraint that distinguishes them from neutral regions, even though this cannot be detected by a method that relies on strong conservation (e.g. phastCons).

    Table 2: Sensitivity of different scores when specificity is fixed. Columns give the correction type; a third correction column ("neutra…") is cut off in the slide.

    feature                score          none      background
    DHS                    alignability   0.4322    0.3075
    DHS                    phastcons      0.5346    0.5316
    pTRR                   alignability   0.6033    0.7552
    pTRR                   phastcons      0.3989    0.4053
    specific promoters     alignability   0.3681    0.5205
    specific promoters     phastcons      0.6687    0.6871
    ubiquitous promoters   alignability   0.5948    0.8017
    ubiquitous promoters   phastcons      0.7328    0.7845
  17. Finding cis-regulatory elements using comparative genomics: Some lessons from ENCODE

    data David C. King,1,2,7 James Taylor,1,3,7 Ying Zhang,1,2 Yong Cheng,1,2 Heather A. Lawson,1,4 Joel Martin,1,2 ENCODE groups for Transcriptional Regulation and Multispecies Sequence Analysis, Francesca Chiaromonte,1,5 Webb Miller,1,3,6 and Ross C. Hardison1,2,8 1Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA; 2Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA; 3Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA; 4Department of Anthropology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA; 5Department of Statistics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA; 6Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA Identification of functional genomic regions using interspecies comparison will be most effective when the full span of relationships between genomic function and evolutionary constraint are utilized. We find that sets of putative transcriptional regulatory sequences, defined by ENCODE experimental data, have a wide span of evolutionary histories, ranging from stringent constraint shown by deep phylogenetic comparisons to recent selection on lineage-specific elements. This diversity of evolutionary histories can be captured, at least in part, by the suite of available comparative genomics tools, especially after correction for regional differences in the neutral substitution rate. Putative transcriptional regulatory regions show alignability in different clades, and the genes associated with them are enriched for distinct functions. Some of the putative regulatory regions show evidence for recent selection, including a primate-specific, distal promoter that may play a novel role in regulation. 
[Supplemental material is available online at www.genome.org.] Deciphering the language and evolution of gene regulatory mechanisms is one of the challenging goals of genomics and systems biology. Even the most basic concepts about the relationship between function and evolution in noncoding DNA are still being refined (Miller et al. 2004; Dermitzakis et al. 2005). Conservation of noncoding sequences among divergent species, inferred from genomic sequence alignments, has been used widely as a predictor of cis-regulatory modules (CRMs) (Gumucio et al. 1996; Frazer et al. 2003). Notable success has been achieved
opmental enhancers in gain-of-function assays (Aparicio et al. 1995; Nobrega et al. 2003, 2004; Woolfe et al. 2005; Bejerano et al. 2006). In contrast, some apparently constrained noncoding DNA sequences have little or no obvious function. Some gene deserts contain large numbers of noncoding sequences apparently constrained in mammals, but deletion of two gene deserts from mice generated only mild phenotypes (Nobrega et al. 2004). This led the investigators to “question the functionality, if any, of many of the large number of noncoding sequences shared
  18. A machine learning approach • Don’t assume a database of

    known binding motifs • Don’t assume strict conservation of the important sequence signals • Instead, use alignments of validated examples to learn sequence and evolutionary patterns that characterize a class of elements
  19. Conceptual framework • Two training sets consisting of alignments, for

    example confirmed regulatory regions and neutral regions • Treat each alignment as a string over the alphabet of all possible alignment columns • Learn a classifier that distinguishes based on short discriminating substrings (“words”) • Problem: as the number of species grows, alphabet becomes large and parameter space enormous
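
The last bullet can be made concrete with a little arithmetic. The sketch below assumes each species contributes one of five symbols per column (A, C, G, T, or a gap); handling of missing data would add more. The function names are hypothetical.

```python
def column_alphabet_size(n_species, symbols="ACGT-"):
    """Number of distinct alignment columns for n_species aligned species."""
    return len(symbols) ** n_species


def word_count(n_species, word_length, symbols="ACGT-"):
    """Number of distinct words (substrings of columns) of a given length."""
    return column_alphabet_size(n_species, symbols) ** word_length


three_way = column_alphabet_size(3)
seven_way = column_alphabet_size(7)
seven_way_pairs = word_count(7, 2)
```

With seven species there are already 78,125 possible columns, and about 6.1 billion possible two-column words, far more than any realistic training set can populate.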
  20. Objective Find a mapping from alignment columns into a smaller

    alphabet that maintains the “right” information for some classification problem CTCCCAGCTGCCCAGTGCCGCCTCTTTTT CTCCTAGCTG-CCAGCATCTCCCGTTTTT CTCCCAGCTGCCCTGCGCCTCCTCTTTTT ↓ 13111021321110232112113133333
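
A small sketch of what "mapping alignment columns into a smaller alphabet" means mechanically. The mapping used here is a deliberately trivial stand-in (the symbol depends only on the first row's base), chosen because it happens to reproduce the digit string shown on the slide; the mappings ESPERR learns are not given in the slide, and all names below are hypothetical.

```python
def columns(rows):
    """Split equal-length alignment rows into columns, one string per column."""
    return ["".join(col) for col in zip(*rows)]


def encode(rows, mapping):
    """Replace every alignment column by a symbol from the reduced alphabet."""
    return "".join(mapping(col) for col in columns(rows))


# Toy reduced alphabet of four symbols, keyed on the first row only.
BASE_TO_SYMBOL = {"A": "0", "C": "1", "G": "2", "T": "3"}


def first_row_mapping(col):
    """Illustrative mapping: ignore every species except the first."""
    return BASE_TO_SYMBOL[col[0]]


rows = [
    "CTCCCAGCTGCCCAGTGCCGCCTCTTTTT",
    "CTCCTAGCTG-CCAGCATCTCCCGTTTTT",
    "CTCCCAGCTGCCCTGCGCCTCCTCTTTTT",
]
encoded = encode(rows, first_row_mapping)
```

Any function from columns to symbols can be passed as `mapping`, so a learned grouping of columns would plug in at the same place.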
  21. ESPERR: Learning strong and weak signals in genomic sequence alignments

    to identify functional elements James Taylor,1 Svitlana Tyekucheva, David C. King, Ross C. Hardison, Webb Miller, and Francesca Chiaromonte1 Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA Genomic sequence signals—such as base composition, presence of particular motifs, or evolutionary constraint—have been used effectively to identify functional elements. However, approaches based only on specific signals known to correlate with function can be quite limiting. When training data are available, application of computational learning algorithms to multispecies alignments has the potential to capture broader and more informative sequence and evolutionary patterns that better characterize a class of elements. However, effective exploitation of patterns in multispecies alignments is impeded by the vast number of possible alignment columns and by a limited understanding of which particular strings of columns may characterize a given class. We have developed a computational method, called ESPERR (evolutionary and sequence pattern extraction through reduced representations), which uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR produces a greatly improved Regulatory Potential score, which can discriminate regulatory regions from neutral sites with excellent accuracy (∼94%). This score captures strong signals (GC content and conservation), as well as subtler signals (with small contributions from many different alignment patterns) that characterize the regulatory elements in our training set. ESPERR is also effective for predicting other classes of functional elements, as we show for DNaseI hypersensitive sites and highly conserved regions with developmental enhancer activity. 
Our software, training data, and genome-wide predictions are available from our Web site (http://www.bx.psu.edu/projects/esperr). [Supplemental material is available online at www.genome.org.] Identification of functional elements within genome sequences often relies on specific characteristic signals, typically based on known biological examples. For instance, prediction of protein-coding exons and genes relies on knowledge of the genetic code and splicing signals. These predictions can be improved by incorporating evolutionary information from orthologous regions of other species through sequence alignments. In particular, insertions and deletions are rarely tolerated in coding regions,
most ubiquitous promoters, and (3) evolutionary patterns, particularly a high level of interspecies conservation, which should characterize functional regions under purifying selection. While each of these signals is associated with some cis-regulatory modules, all of them have limitations (Tompa et al. 2005). Motif-based approaches can have high specificity, particularly when using a stringent consensus sequence, but when the patterns are degenerate (often the case with transcription fac-
  22. First step: Ancestral probability distributions

    [Figure: example alignment columns. Stage 1, first step: represent alignment columns as ancestral probability distributions.]
  23. Second step: Clustering

    [Figure: Stage 1, second step: create initial grouping (encoding) based on evolutionary similarity and frequency distribution. Colored circles represent groups of columns from clustering.]
  24. Third step: Refinement by iterative search

    Stage 2: search for best encoding based on classification rate:
    1) Initialize from clustering
    2) Generate candidate encodings with a random set of collapses and expansions
    3) Encode training data with each candidate and evaluate with cross validation
    4) Accept candidate with best performance
    5) Iterate until stable
    Use final encoding on alignments for training and classification. Encoding symbols can be visualized with “logos”.
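
The five-step search can be sketched as a simple hill climb. This is not the ESPERR implementation: the encoding is modeled as a dict from column to group id, "collapse" merges two groups, "expansion" moves one column into a new group, and a toy scoring function stands in for cross-validated classification. All names and data are hypothetical.

```python
import random


def collapse(encoding, rng):
    """Candidate move: merge two randomly chosen groups into one."""
    groups = sorted(set(encoding.values()))
    if len(groups) < 2:
        return dict(encoding)
    keep, drop = rng.sample(groups, 2)
    return {col: (keep if g == drop else g) for col, g in encoding.items()}


def expand(encoding, rng):
    """Candidate move: put one randomly chosen column in a new group."""
    col = rng.choice(sorted(encoding))
    out = dict(encoding)
    out[col] = max(encoding.values()) + 1
    return out


def refine(encoding, score, n_candidates=20, max_rounds=100, seed=0):
    """Repeat: generate candidates, keep the best, stop when nothing improves."""
    rng = random.Random(seed)
    best, best_score = dict(encoding), score(encoding)
    for _ in range(max_rounds):
        candidates = [
            rng.choice((collapse, expand))(best, rng) for _ in range(n_candidates)
        ]
        top = max(candidates, key=score)
        if score(top) <= best_score:
            break  # stable: no candidate beats the current encoding
        best, best_score = top, score(top)
    return best, best_score


# Toy stand-in for cross validation: reward encodings that group columns
# together exactly when they share a (made-up) label.
LABELS = {"AAA": "conserved", "CCC": "conserved", "AAC": "variable", "A-C": "variable"}


def toy_score(encoding):
    cols = sorted(encoding)
    return sum(
        (encoding[a] == encoding[b]) == (LABELS[a] == LABELS[b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
    )


initial = {col: i for i, col in enumerate(sorted(LABELS))}  # one group per column
final, final_score = refine(initial, toy_score)
```

In a real setting `score` would encode the training alignments with the candidate and return a cross-validated classification rate, which is the expensive part of each round.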
  25. ESPERR Regulatory Potential scores • Regulatory Potential (RP) Scores discriminate

    “known regulatory” from “neutral” regions • Seven species alignments: human, chimpanzee, macaque, mouse, rat, dog, cow • Training data sets of ~31,000 bases each (no more than three missing species allowed in a column) • 17 symbol final alphabet • Cross validation success rate (leave-one-out) of ~94%
  26. ESPERR RP provides good separation

    [Figure: panel A, cumulative distribution of scores for the regulatory training set, exons, bulk, and the AR training set; panel B, sensitivity curve with both axes from 0.0 to 1.0.]
  27. RP Scores available from UCSC browser

    [Figure: UCSC Genome Browser view of chr11:5255000-5270000. Tracks: Compilation of Landmarks from Locus Experts (HBE1_PRA, HBE1_NRA, HS1, HS2_pos, HS2_neg, HS3, HS3.1, HS3.2, HS4, HS5); phastCons; Vertebrate Multiz Alignment & Conservation (chimp, rhesus, mouse, rat, dog, cow); Repeating Elements by RepeatMasker; Human/Mouse/Rat RP Scores, Kolbe et al model; ESPERR Regulatory Potential (7 species).]
  28. Experimental validation of predicted mammalian erythroid cis-regulatory modules Hao Wang,1,2

    Ying Zhang,1,3 Yong Cheng,1,2 Yuepin Zhou,1,2 David C. King,1,4 James Taylor,1,5 Francesca Chiaromonte,1,6 Jyotsna Kasturi,1,5 Hanna Petrykowska,1,2 Brian Gibb,1,2 Christine Dorman,1,2 Webb Miller,1,5,7 Louis C. Dore,8 John Welch,8 Mitchell J. Weiss,8 and Ross C. Hardison1,2,9 1Center for Comparative Genomics and Bioinformatics of the Huck Institutes of Life Sciences, 2Department of Biochemistry and Molecular Biology, 3Intercollege Graduate Degree Program in Genetics, 4Intercollege Graduate Degree Program in Integrative Biosciences, 5Department of Computer Science and Engineering, 6Department of Statistics, and 7Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA; 8Department of Pediatrics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA Multiple alignments of genome sequences are helpful guides to functional analysis, but predicting cis-regulatory modules (CRMs) accurately from such alignments remains an elusive goal. We predict CRMs for mammalian genes expressed in red blood cells by combining two properties gleaned from aligned, noncoding genome sequences: a positive regulatory potential (RP) score, which detects similarity to patterns in alignments distinctive for regulatory regions, and conservation of a binding site motif for the essential erythroid transcription factor GATA-1. Within eight target loci, we tested 75 noncoding segments by reporter gene assays in transiently transfected human K562 cells and/or after site-directed integration into murine erythroleukemia cells. Segments with a high RP score and a conserved exact match to the binding site consensus are validated at a good rate (50%–100%, with rates increasing at higher RP), whereas segments with lower RP scores or nonconsensus binding motifs tend to be inactive. Active DNA segments were shown to be occupied by GATA-1 protein by chromatin immunoprecipitation, whereas sites predicted to be inactive were not occupied. 
We verify four previously known erythroid CRMs and identify 28 novel ones. Thus, high RP in combination with another feature of a CRM, such as a conserved transcription factor binding site, is a good predictor of functional CRMs. Genome-wide predictions based on RP and a large set of well-defined transcription factor binding sites are available through servers at http://www.bx.psu.edu/. [Supplemental material is available online at www.genome.org. The expression profile data obtained during MEL cell differentiation have been submitted to GEO under accession no. GSE2217.] Comprehensive discovery of functional DNA sequences in genomes requires both computational and experimental ap-
sitional weight matrices in single DNA sequences far exceed the sites verified as being occupied by transcription factors (e.g.,
  29. Experimental validation • Predictions were performed in the regions around

    eight genes co-expressed with beta-globin • Putative cis-regulatory modules (preCRMs) predicted using both: • Regulatory Potential score • Presence of binding motif for GATA-1 • Validation using G1E-ER cells that have inducible GATA-1 expression
  30. Higher RP scores yield better validation rates RP distribution across

    loci of interest Validation rate of test loci falling in the RP score bin
  31. Biological data explosion • Genome sequences and alignments • Large

    scale genotyping and resequencing • Gene expression and other high throughput functional assays • Short reads
  32. Making sense of this explosion of data requires developing sophisticated

    computational methods ...and making these methods accessible
  33. Wasting Time • For developers, building user interfaces is both

    time consuming and highly repetitive, yet doing it well is hard • Without accessible interfaces, experimentalists end up using ill suited / inefficient tools, hiring inexperienced students, ... • Even with accessible interfaces, users waste time moving data between data sources and tools, converting between data formats, ...
  36. Wasting Potential • New technologies allow individual labs to generate

    massive amounts of experimental data • However, effectively analyzing this data still requires specific technical / computational skills • The easier it is for experimentalists to work with sophisticated computational tools, the greater the potential for biological discovery
  39. What is Galaxy? • An open-source framework for integrating various

    computational tools and databases into a cohesive workspace • A web-based service, integrating many popular tools and resources for comparative genomics • A completely self-contained application for building your own Galaxy style sites
  40. Why integrate tools with Galaxy? • Galaxy makes it substantially

    easier to give your tools user interfaces • The resulting user interfaces are of high quality, and continually improved • Your tools gain value by being integrated with data sources and other tools
  41. Genomic interval analysis • Set-like operations on intervals, base-level and

    interval level • Merge, intersect, subtract... • Interval clustering • Relational-like operations • Join, group • Data structures and high-level Python interfaces to all operations available as part of “bx-python”
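
To make the set-like operations concrete, here is a plain-Python sketch of merge, intersect, and subtract on half-open `(start, end)` intervals for a single chromosome. It deliberately does not use or imitate the bx-python API; the function names are chosen for this example only.

```python
def merge(intervals):
    """Union of intervals: overlapping or touching intervals are combined."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Base-level intersection: the bases covered by both a and b."""
    a, b = merge(a), merge(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a, b):
    """Base-level subtraction: the bases covered by a but not by b."""
    b = merge(b)
    out = []
    for start, end in merge(a):
        cur = start
        for b_start, b_end in b:
            if b_end <= cur or b_start >= end:
                continue  # no overlap with what remains of this interval
            if b_start > cur:
                out.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out


features = [(10, 20), (15, 30), (40, 50)]
mask = [(25, 45)]
merged = merge(features)
shared = intersect(features, mask)
remaining = subtract(features, mask)
```

An interval-level variant would return whole intervals from `a` that overlap anything in `b`, rather than trimming them to the overlapping bases.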
  42. Genomic alignment analysis • Extracting features of interest from pairwise

    and multiple genome wide alignments • Dealing with gene / transcript structure • Filtering alignments in many ways • Tools for indexing alignments, fast random access, and all operations available in “bx-python”
  43. Phylogenomic tools • Built on top of HyPhy (http://hyphy.org) •

    Phylogenetic tree reconstruction • Selection detection • Hypothesis testing • Relative rate tests • Detecting recombination
  44. Statistical genetics • Built with RGenetics (http://rgenetics.org) • Experiment design

    (including power and sample size calculations) • Quality control and filtering • Exploration of and adjustment for population substructure • Linkage disequilibrium visualization, data reduction based on LD (tag SNP identification) • Inference for pedigree and unrelated subject data
  45. Flexible execution environment • Dependencies between jobs handled by “JobManager”

    within Galaxy. • Either in-process with the web application, or a separate process managing a queue to which multiple front-ends submit
  46. Flexible execution environment • Once jobs are ready, submitted to

    a “JobRunner” • Runners are pluggable • Can have multiple runners, and dispatch jobs to different runners depending on capabilities • Current implementations: • Local runner executing a limited number of local processes • PBS runner dispatches to a cluster of worker nodes • Pluggable queueing policies
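
The manager/runner split described on these two slides can be illustrated with a toy, synchronous sketch. The class names, the `can_run` capability check, and the job dictionaries are all hypothetical and are not Galaxy's actual classes or interfaces.

```python
class LocalRunner:
    """Hypothetical runner for jobs that do not need a cluster."""

    def __init__(self):
        self.ran = []

    def can_run(self, job):
        return not job.get("needs_cluster", False)

    def run(self, job):
        self.ran.append(job["id"])  # a real runner would start a process here


class ClusterRunner:
    """Hypothetical runner standing in for a batch system such as PBS."""

    def __init__(self):
        self.ran = []

    def can_run(self, job):
        return True

    def run(self, job):
        self.ran.append(job["id"])  # a real runner would submit to the queue


class JobManager:
    """Holds jobs until their dependencies are done, then picks a runner."""

    def __init__(self, runners):
        self.runners = runners
        self.pending = []
        self.done = set()

    def submit(self, job):
        self.pending.append(job)

    def step(self):
        """Dispatch every job whose dependencies have finished."""
        ready = [j for j in self.pending if set(j.get("depends_on", ())) <= self.done]
        for job in ready:
            runner = next(r for r in self.runners if r.can_run(job))
            runner.run(job)
            self.done.add(job["id"])
            self.pending.remove(job)
        return len(ready)

    def run_all(self):
        while self.pending:
            if self.step() == 0:
                raise RuntimeError("remaining jobs have unsatisfiable dependencies")


local, cluster = LocalRunner(), ClusterRunner()
manager = JobManager([local, cluster])
manager.submit({"id": "c", "depends_on": ["b"]})
manager.submit({"id": "b", "depends_on": ["a"], "needs_cluster": True})
manager.submit({"id": "a"})
manager.run_all()
```

Because runners only need `can_run` and `run`, adding another execution backend in this sketch means adding one class and listing it when the manager is built.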
  47. Datatypes • Datatypes supported by a Galaxy instance can be

    configured at runtime • Declarative definition of “metadata” • Easy way to define custom metadata • Automatically generated editing interfaces (similar to tool interfaces) • Actions on datatypes (displaying at external sites, format conversion) all pluggable • Nothing “genomics” specific hardcoded!
  48. Immutability principle • All datasets in Galaxy are immutable •

    Running tools always generates new datasets • Datasets associated with user accounts are stored indefinitely
  49. Sharing histories • A history in Galaxy is a complete

    record of a complex analysis • Histories in Galaxy can easily be shared • A shared history is always a copy (the original analysis is always retained) • All of the details of any analysis can thus be inspected, rerun, ...
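
The immutability and sharing principles from the last two slides can be illustrated with a small sketch. It models a dataset as a frozen dataclass and a history as a list of datasets; these types and function names are hypothetical and do not reflect Galaxy's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """An immutable dataset: fields cannot be reassigned after creation."""
    name: str
    lines: tuple


def run_tool(tool, dataset, new_name):
    """Running a tool never edits its input; it returns a new dataset."""
    return Dataset(new_name, tuple(tool(dataset.lines)))


def share(history):
    """Sharing hands out a copy, so the owner's history is always retained."""
    return list(history)


raw = Dataset("intervals", ("chr1\t10\t20", "chr1\t40\t50"))
upper = run_tool(lambda lines: [line.upper() for line in lines], raw, "intervals (upper)")

history = [raw, upper]
shared = share(history)
shared.append(Dataset("extra step", ()))  # the recipient keeps working

try:
    raw.name = "renamed"
    mutated = True
except AttributeError:  # dataclasses.FrozenInstanceError is an AttributeError
    mutated = False
```

Since the datasets themselves cannot change, a shallow copy of the history is enough for the shared version to stay independent of the original.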
  50. Acknowledgements • ESPERR: Francesca Chiaromonte, Webb Miller, Ross Hardison, David

    King, Svitlana Tyekucheva, Hao Wang, Ying Zhang • Galaxy: Anton Nekrutenko, Greg von Kuster, Dan Blankenberg, Nate Coraor, Guru Ananda, Ross Lazarus • ENCODE Project, Nathan Trinklein, Elliott Margulies • Biomart, GMOD, UCSC Genome Bioinformatics • National Science Foundation