Penn State BMB Lunch

A presentation during weekly faculty research lunch.

Anton Nekrutenko

May 05, 2015

Transcript

  1. Dan Blankenberg Nate Coraor Jennifer Jackson Dave Bouvier John Chilton

    Martin Čech Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Huck and the Pennsylvania Department of Public Health People Boris Rebolledo Jaramillo Nick Stoler Research Engineering
  2. Themes • Big data computation in life sciences - Reproducibility

    - Analysis standardization - Scalability - Community building • Mutational dynamics in constrained systems - Variant calling in non-diploid mixtures - High throughput mutation detection assays - Experimental evolution
  3. More than 70 known public Galaxy servers: 15+ general

    servers. Domain specific servers including: Ballaxy for structure based computational biology, Cistrome for regulatory sequence analysis, Genomic Hyperbrowser: statistical integration of genomic data, GigaGalaxy: integrating workflows published in GigaScience, Pathogen Portal: comparative analysis of host response to pathogens, ... Dozens of large scale private Galaxy instances
  4. Users’ troubles: - Data logistics - HPC - Poor knowledge

    of existing tools - Inability to develop new tools - Lack of transparency and reproducibility
  5. Developers’ grief: - Limited tool exposure - Parameter picking troubles

    - Data format nightmare - High profile publications
  6. HPC providers’ challenges: - Lack of HPC utilization skills -

    Software is not optimized - HPC is heterogeneous
  7. A free (for everyone) web service integrating a wealth of

    tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
  8. Galaxy’s ideological goals: How best can data intensive methods be

    accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
  9. Galaxy’s practical goals: How to arm researchers with access to

    powerful compute and latest tools How to build a community of tool developers How to run Galaxy on any HPC
  10. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically
  11. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis
  12. Ways to use Galaxy The public web service at http://usegalaxy.org

    Install locally with many compute environments Deploy on a cloud using Cloudman Atmosphere
  13. Galaxy’s user interface is designed to be simple and intuitive

    for users without informatics expertise Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
  14. Users typically use many histories when working with many samples;

    New multiple history view makes working with 100s of histories easy
  15. A not-so-new feature: mapping over multiple datasets However, this breaks

    down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  16. Operations over collections For “list” collections, existing tools can automatically

    be mapped across the entire collection Existing tools that support multiple inputs and one output act as reducers Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
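The map and reduce behaviour described on this slide can be sketched in plain Python. This is only an illustration and not Galaxy code: `map_over_collection`, `reduce_collection`, `count_reads` and the toy read lists are hypothetical names and data introduced here.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_over_collection(tool: Callable[[T], R], collection: Dict[str, T]) -> Dict[str, R]:
    """Run a single-input tool once per element, keeping element identifiers."""
    return {name: tool(dataset) for name, dataset in collection.items()}


def reduce_collection(tool: Callable[[List[T]], R], collection: Dict[str, T]) -> R:
    """Feed every element to a multi-input, single-output tool (a reducer)."""
    return tool(list(collection.values()))


def count_reads(reads: List[str]) -> int:
    """Toy single-input tool: count the reads in one dataset."""
    return len(reads)


# Hypothetical "list" collection: element identifier -> dataset
samples = {"sample1": ["ACGT", "TTGA"], "sample2": ["GGCA"]}

per_sample = map_over_collection(count_reads, samples)  # one output per input
total = reduce_collection(sum, per_sample)              # one merged output
print(per_sample, total)
```

The mapped output keeps the same element identifiers as the input, which is what lets a later step pair results back up with their samples.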
  17. Map/reduce in workflows More Powerful Workflows Arbitrary # of Inputs

    (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
  18. Dataset Collections Extremely flexible for grouping collections of complex datasets,

    can be nested to arbitrary depth, structure is preserved through mapping More complex reductions, other collection operations in progress Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
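As a rough sketch of "structure is preserved through mapping", the snippet below applies a function to every leaf of a nested collection and returns a result with the same shape. It is not Galaxy's implementation; `map_nested` and the toy "list of pairs" collection are hypothetical.

```python
from typing import Any, Callable


def map_nested(tool: Callable[[Any], Any], collection: Any) -> Any:
    """Apply `tool` to every leaf dataset while keeping the nesting intact."""
    if isinstance(collection, dict):
        return {key: map_nested(tool, value) for key, value in collection.items()}
    return tool(collection)


# Hypothetical nested collection: replicates, each holding a forward/reverse pair
replicates = {
    "rep1": {"forward": "ACGTAC", "reverse": "GTACGT"},
    "rep2": {"forward": "TTGACA", "reverse": "TGTCAA"},
}

lengths = map_nested(len, replicates)
print(lengths)
```

Because the recursion only descends through dictionaries, the same function works for any nesting depth.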
  19. As analysis needs become increasingly complex, typical users have moved

    from running individual tools to primarily running workflows
  20. For research use, users need to be able to construct

    and modify workflows, not just run existing best practice pipelines The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows
  21. However, for reproducibility, we want to be able to ensure

    that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results
  22. 1 2 3 ∞ http://usegalaxy.org http://usegalaxy.org/community ... Galaxies on private

    clouds Galaxies on public clouds ... private Galaxy installations Private Tool Sheds Galaxy Tool Shed
  23. Repositories are owned by the contributor, can contain tools, workflows,

    etc. Backed by version control, a complete version history is retained for everything that passes through the toolshed Galaxy instance admins can install tools directly from the toolshed using only a web UI Support for recipes for installing the underlying software that tools depend on (also versioned)
  24. New command line tools to address concerns from tool developers

    Tool Development Planemo Command-line tools to aid development. ◦ Test tools quickly without worrying about configuration files. ◦ Check tools for common bugs and best practices. ◦ Optimized publishing to the ToolShed. ◦ Testbed for new dependency management - Homebrew and Homebrew-science
  25. Move to git[hub] centric development workflow Within three weeks, four

    major community contributions to core tools
  26. Tool citations, credit and incentivization Embed DOIs in Tool Configuration,

    Galaxy resolves and provides a list of citations, with links, which can be exported for reference managers
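A minimal sketch of the idea, assuming the tool configuration is XML with a `citations` block holding `citation type="doi"` entries. The tool id and both DOIs below are placeholders, and `doi_links` is a hypothetical helper, not part of Galaxy.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool configuration; the DOIs are placeholders, not real papers.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
    <citations>
        <citation type="doi">10.1234/example.one</citation>
        <citation type="doi">10.1234/example.two</citation>
    </citations>
</tool>
"""


def doi_links(tool_xml: str) -> list:
    """Collect DOI citations from a tool configuration and turn them into links."""
    root = ET.fromstring(tool_xml.strip())
    return [
        "https://doi.org/" + citation.text.strip()
        for citation in root.findall("./citations/citation")
        if citation.get("type") == "doi" and citation.text
    ]


links = doi_links(TOOL_XML)
print(links)
```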
  27. POSTER PRESENTATION Open Access CLIA-certified next-generation sequencing analysis in the

    cloud Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1 From Beyond the Genome 2012 Boston, MA, USA. 27-29 September 2012 The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2). This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis. Author details 1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA. Published: 1 October 2012 Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54 CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)
  28. Share a snapshot of this instance Current support for archiving

    instances with CloudMan Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main
  29. New approaches for dependency management Alternative approach for installing dependencies:

    Homebrew/Linuxbrew How can we run community contributed tools safely and efficiently? Support for defining dependencies as Docker containers
  30. What is Docker? Traditional Virtual Machine vs. Docker

    Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient
  31. ToolShed and Docker Tools can assert their dependencies are provided

    by a Docker container Potentially tool execution is more secure due to isolation Easier for tool developers to package dependencies Much easier for end-users to get dependencies
  32. For researchers without informatics expertise, the web UI and existing

    tools are often sufficient For informaticians, Galaxy provides an extensive API and wrappers (e.g. Bioblend) But, many users can do some programming, would like the benefits of Galaxy with the flexibility to do some scripting
  33. Docker enables interactive environments Framework allows spinning up secure* isolated

    environments, that can interact with the Galaxy history through Galaxy’s API Initial implementation supporting iPython Notebook
  34. Next steps Support for Jupyter (both Python and Julia) and

    RStudio environments Interactive programming environments as first class citizens: full provenance tracking, establish inputs and outputs, be used in workflows, etc. Databases as first class citizens, e.g. GEMINI query interface as a reusable tool
  35. PSC, Pittsburgh SDSC, San Diego Galaxy Cluster • 256 cores

    • 2 TB memory Rodeo • 128 cores • 1 TB memory Corral/Stockyard • 20 PB disk Stampede • 462,462 cores • 205 TB memory Blacklight • 4,096 cores • 32 TB memory • Dedicated resources Trestles • 10,368 cores • 20.7 TB memory • Shared resources TACC Austin
  36. A user-friendly cloud environment designed to give researchers access to

    interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream Two widely used biology platforms will be supported - Galaxy and iPlant Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.
  37. Transparent Migrations using Galaxy’s Hierarchical Object Store Galaxy Server Processes

    Corral Corral Staging Penn State Read Data In Corral? In Staging? In PSU? Yes Yes Yes No No No Object Not Found Write Data Nate Coraor
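The read path in this diagram (try each store in order, fail with "Object Not Found" if none has the object) can be sketched as follows. This is an illustration only, not Galaxy's object store code: `HierarchicalStore` is a hypothetical class, the stores are plain dictionaries, and writing to the first store in the list is an assumption made for the sketch.

```python
from typing import Dict, List, Tuple


class HierarchicalStore:
    """Toy hierarchical store: backends are consulted in priority order."""

    def __init__(self, backends: List[Tuple[str, Dict[str, bytes]]]):
        self.backends = backends  # (name, storage) pairs, highest priority first

    def locate(self, object_id: str) -> str:
        """Return the name of the first backend that holds the object."""
        for name, storage in self.backends:
            if object_id in storage:
                return name
        raise KeyError("Object Not Found: " + object_id)

    def read(self, object_id: str) -> bytes:
        """Read from whichever backend holds the object."""
        for _, storage in self.backends:
            if object_id in storage:
                return storage[object_id]
        raise KeyError("Object Not Found: " + object_id)

    def write(self, object_id: str, data: bytes) -> None:
        """Assumption for this sketch: new data always goes to the first backend."""
        self.backends[0][1][object_id] = data


corral: Dict[str, bytes] = {}
staging = {"dataset_2": b"staged"}
psu = {"dataset_1": b"legacy"}
store = HierarchicalStore([("Corral", corral), ("Corral Staging", staging), ("Penn State", psu)])

store.write("dataset_3", b"new")
print(store.locate("dataset_1"), store.locate("dataset_2"), store.locate("dataset_3"))
```

Callers never need to know where an object lives, so data can be moved between backends without changing the code that reads it.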
  38. Blacklight (PSC) Messaging Server Pulsar Galaxy Server Processes Stampede (TACC)

    Galaxy Server VMs (TACC) Pulsar Job control (AMQP) Data transfer (HTTPS) Data transfer (HTTPS) John Chilton Pulsar: Galaxy job runner that can run almost anywhere. No shared filesystem, stages all necessary Galaxy components
  39. Summary Galaxy is an (obsessively) open framework for making data

    analysis accessible and reproducible Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability By supporting and leveraging tool developers the Galaxy community can collectively keep up with rapid changes in available tools
  40. Galaxy is a community! Join us on irc, mailing lists,

    Galaxy Biostar Contribute code on bitbucket, github, or the ToolShed Join us for a Hackathon or our annual conference Fifth annual Galaxy Community Conference Hackathon, training day, and two days of talks
  41. Boris Rebolledo-Jaramillo (Nekrutenko’s Lab) Challenges in the identification of mtDNA

    heteroplasmy: a case of wasted data, and silver linings IBIOS-Bioinformatics and Genomics
  42. Acknowledgements Penn State Clinical and Translational Sciences Institute • Jessica

    Beiler, MPH • Lily Borhan • Clinical nurses Michael DeGiorgio Prabhani Kuruppummulage Don TEAM at PSU
  43. Guja and Garcia-Diaz (2012) Biochimica et Biophysica Acta 1819: 939–947

    [Figure: map of human mtDNA, 16,569 bp, 37 genes: rRNA (2), tRNA (22), mRNA (13), plus non-coding regions; labelled features HSP1, HSP2, LSP, OH, OL, TERM; OXPHOS complexes I, III, IV, V] + thousands of nuclear encoded genes
  44. ‣ Diabetes ‣ Male infertility ‣ Parkinson’s ‣ Alzheimer’s ‣

    Cancer Source: Fauci AS, Kasper DL, Braunwald E, Hauser SL, Longo DL, Jameson JL, Loscalzo J; Harrison’s Principles of Internal Medicine, 17th Edition: http://www.accessmedicine.com Copyright (C) The McGraw-Hill Companies, Inc. All rights reserved
  45. MT transmitted by mother only Fertilized oocyte PGC’s Primary oocyte

    Secondary oocyte Mature oocyte Birth Development mtDNA genotypic variance mtDNA/cell (log) Somatic cells Germ cells Folliculogenesis Poulton, et al (2010) PLoS Genet. 6(8) bottleneck
  46. Dramatic frequency shifts wild type mutant nucleus Primordial germ cell

    Primary oocytes Mature oocytes High Intermediate Low % Mutant genome in offspring Taylor and Turnbull (2005) Nature Reviews Genetics 6, 389-402
  47. Challenge: Accurate detection of heteroplasmy Li et al (2011) Am

    J Hum Genet. 87:237-249 131 individuals 37 hets ≥10% Het are not common He et al (2010) Nature. 464(7288): 610–614. 10 individuals 40 hets ≥1.6% Het are common
  48. RESEARCH Open Access Dynamics of mitochondrial heteroplasmy in three families

    investigated via a repeatable re-sequencing study Hiroki Goto1†, Benjamin Dickins2†, Enis Afgan3, Ian M Paul4, James Taylor3*, Kateryna D Makova1* and Anton Nekrutenko2* Abstract Background: Originally believed to be a rare phenomenon, heteroplasmy - the presence of more than one mitochondrial DNA (mtDNA) variant within a cell, tissue, or individual - is emerging as an important component of eukaryotic genetic diversity. Heteroplasmies can be used as genetic markers in applications ranging from forensics to cancer diagnostics. Yet the frequency of heteroplasmic alleles may vary from generation to generation due to the bottleneck occurring during oogenesis. Therefore, to understand the alterations in allele frequencies at heteroplasmic sites, it is of critical importance to investigate the dynamics of maternal mtDNA transmission. Results: Here we sequenced, at high coverage, mtDNA from blood and buccal tissues of nine individuals from three families with a total of six maternal transmission events. Using simulations and re-sequencing of clonal DNA, Goto et al. Genome Biology 2011, 12:R59 http://genomebiology.com/2011/12/6/R59 3 families = 9 individuals blood and cheek tissues detection threshold ≥ 2%
  49. Goto H, Dickins B et al. (2011) Genome Biology 12:R59

    Family 1 Family 2 Family 3 A C T G
  50. Goto H, Dickins B et al. (2011) Genome Biology 12:R59

    Family 1 Family 2 Family 3 14053 8992 7028 5063 A C T G
  51. MAF % N of sites Family 41 Suspicious samples have

    large number of sites, and tight distribution of MAF, i.e. each site has approx. the same MAF
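The heuristic on this slide (many sites, all with roughly the same MAF) could be turned into a simple check like the one below. The cut-offs `max_sites` and `max_spread` and both example samples are invented for illustration; the slide does not give numeric thresholds.

```python
from statistics import pstdev
from typing import List


def is_suspicious(mafs: List[float], max_sites: int = 10, max_spread: float = 0.01) -> bool:
    """Flag a sample with many heteroplasmic sites whose MAFs are tightly clustered.

    Both thresholds are hypothetical defaults chosen for this sketch.
    """
    if len(mafs) <= max_sites:
        return False
    return pstdev(mafs) < max_spread


# Invented example data: MAF of each heteroplasmic site in a sample
clean_sample = [0.02, 0.31, 0.11]
suspect_sample = [0.050, 0.052, 0.049, 0.051] * 5

print(is_suspicious(clean_sample), is_suspicious(suspect_sample))
```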
  52. ACACTAGGAT ACGCTAAGAT SITE MAJOR MINOR 3 A G 7 G

    A Major allele sequence Minor allele sequence Heteroplasmic sites
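The example on this slide can be reproduced with a few lines of Python. The sketch assumes 1-based site positions and uses a hypothetical helper, `allele_sequences`, which writes the major and minor allele of each heteroplasmic site into two copies of the sample sequence.

```python
from typing import Dict, Tuple


def allele_sequences(sequence: str, sites: Dict[int, Tuple[str, str]]) -> Tuple[str, str]:
    """Build major- and minor-allele sequences from 1-based heteroplasmic sites."""
    major = list(sequence)
    minor = list(sequence)
    for position, (major_allele, minor_allele) in sites.items():
        major[position - 1] = major_allele
        minor[position - 1] = minor_allele
    return "".join(major), "".join(minor)


# Values taken from the slide: site -> (major allele, minor allele)
sites = {3: ("A", "G"), 7: ("G", "A")}
major_seq, minor_seq = allele_sequences("ACACTAGGAT", sites)
print(major_seq)  # ACACTAGGAT
print(minor_seq)  # ACGCTAAGAT
```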
  53. Evidence for contamination. Samples cluster within an unrelated family. We

    can hypothesize that individuals M52 and M52C1 got contaminated with either M57, M58, M58C1 or M58C2 Clean samples. Heteroplasmic sequences cluster with the expected family
  54. [Figure: phylogenetic tree of blood (_BL) and cheek (_CH) samples from families M45, M46, M51, M52, M57, M58, M67, M468 and M483]
  55. &B DESIGN CONFIRM CONTAMINATION A. Spike-ins – carry over? • 

    phiX174 •  pUC18 B. Sample preparation layout – adjacent to contaminant? phiX174 pUC18
  56. Evidence of contamination • phylogenetic tree • order in plate

    • unexpected spike-in &B DESIGN CONFIRM CONTAMINATION A. Spike-ins – carry over? •  phiX174 •  pUC18 B. Sample preparation layout – adjacent to contaminant? phiX174 pUC18
  57. BioTechniques 56:134 (2014) Reports Controlling for contamination in re-sequencing

    studies with a reproducible web-based phylogenetic approach Benjamin Dickins1,2,*,†, Boris Rebolledo-Jaramillo1,3,*, Marcia Shu-Wei Su2, Ian M. Paul4, Daniel Blankenberg1, Nicholas Stoler3, Kateryna D. Makova2, and Anton Nekrutenko1 1Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, 2Department of Biology, Penn State University, University Park, PA, 3Interdisciplinary Graduate Program in BioSciences, Penn State University, University Park, PA, and 4Department of Pediatrics, Penn State College of Medicine, Hershey, PA * B.D. and B.R.-J. contributed equally to this work. †Present address: School of Science and Technology, Nottingham Trent University, UK BioTechniques 56:134-141(March 2014) doi 10.2144/000114146 Keywords: re-sequencing, contamination, next-generation sequencing, Galaxy, reproducibility
  58. cheek and blood 39 families x 2 individuals x 2

    tissues = 156 samples cheek and blood
  59. read length (bp) misaligned reads 250 bp paired-end (MiSeq) reads

    avoid NUMTs NUMTs: NUclear MiTochondrial sequences no misalignments
  60. Pre-process NGS data: MAP READS (WITH BWA) TO reference: hg19+rCRS+pUC18+phiX174

    REQUIRE READ PAIRS TO 1. map to mtDNA 2. map in proper orientation within a pair 3. be non-chimeric. Identify variable sites: IDENTIFY VARIABLE SITES USING GALAXY TOOLS BY REQUIRING 1. MAF ≥ 1% (in forward and reverse strands) 2. Coverage ≥1000x 3. No strand bias. Contamination pipeline: Expected spike-in alignments? Sensible number of sites per sample and unbiased MAF distribution? Phylogenetic analysis of contamination (Dickins et al, 2014). Did the site result from contamination? Confirm heteroplasmies: Statistically significant with an LRT test? Validates with Sanger or ddPCR? Each question is a YES/NO decision that leads either to DISCARD or towards the FINAL SET OF POINT HETEROPLASMIES.
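Two of the site-level requirements on this slide (MAF ≥ 1% on both strands, coverage ≥ 1000x) can be sketched as a small filter. This is not the Galaxy workflow itself: `SiteCounts`, `strand_maf` and `passes_filters` are hypothetical names, the counts are invented, and the strand-bias test is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Per-strand read counts for the major and minor allele at one site."""
    position: int
    major_fwd: int
    minor_fwd: int
    major_rev: int
    minor_rev: int


def strand_maf(minor: int, major: int) -> float:
    """Minor allele frequency on one strand (0.0 when the strand has no reads)."""
    total = minor + major
    return minor / total if total else 0.0


def passes_filters(site: SiteCounts, min_maf: float = 0.01, min_coverage: int = 1000) -> bool:
    """Require minimum coverage and a minimum MAF on both strands."""
    coverage = site.major_fwd + site.minor_fwd + site.major_rev + site.minor_rev
    if coverage < min_coverage:
        return False
    return (strand_maf(site.minor_fwd, site.major_fwd) >= min_maf
            and strand_maf(site.minor_rev, site.major_rev) >= min_maf)


# Invented counts for illustration
good = SiteCounts(position=100, major_fwd=4850, minor_fwd=150, major_rev=4840, minor_rev=160)
one_strand_only = SiteCounts(position=200, major_fwd=4700, minor_fwd=300, major_rev=5000, minor_rev=0)
shallow = SiteCounts(position=300, major_fwd=200, minor_fwd=20, major_rev=200, minor_rev=20)

print([passes_filters(s) for s in (good, one_strand_only, shallow)])
```

Requiring the minor allele on both strands is a cheap guard against strand-specific artifacts, although it does not replace a formal strand-bias test.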
  61. Computational methods Mapping Quality ≥20 Base Quality ≥30 alignment artifacts

    20,000x/sample X Reads filter Sites filter Statistical support
  62. Computational methods 20,000x/sample M.A.F. ≥ 1% depth ≥1,000x Reads filter

    Sites filter Statistical support strand bias X X low complexity X 172 sites M.A.F.: minor allele frequency
  63. Computational methods 20,000x/sample Reads filter Sites filter Statistical support 172

    sites Likelihood (by R. Nielsen) Poisson (Li and Stoneking, 2012) √ √
  64. QUARTET site-specific group of MAFs 172 sites among 98 quartets

    mother cheek mother blood child cheek child blood
  65. 172 sites among 98 quartets mother child Family Site Major

    Minor cheek blood cheek blood M494 9196 G A 0.032 0.03 0 0
  66. Successful validation of sites droplet digital PCR (ddPCR) Illumina MAF

    (%) ddPCR MAF (%) R2=0.79 R2=0.95 Sanger Illumina MAF (%) Sanger MAF (%) R2=0.41 R2=0.75
  67. 31/39 3 sites/family ; 1.5 sites/tissue Positions 185,189, 214, 215,

    16093 and 16183 found in multiple families. ts/tv = 48
  68. Evidence of purifying selection (number of sites per region)

    Region | bp | Observed | Random | Neutral
    D-loop | 1,122 | 34 | 6.6* | 7.9*
    tRNA | 1,508 | 6 | 8.9 | 10.6
    rRNA | 2,513 | 13 | 14.9 | 17.7
    Protein Syn | 2,834 | 20 | 16.8 | 20
    Protein NonSyn | 8,533 | 25 | 50.5* | 60.2*
    Intergenic | 88 | 0 | 0.5 | 0.6
    Total | 16,569** | 98 | n/a | 197
    *Significantly different from observed (p<0.05, test comparing two proportions) **Direct sum is inconsistent with the length of the mtDNA due to overlapping annotations The numbers of synonymous (Syn) and non-synonymous (NonSyn) sites were calculated with the Nei-Gojobori method.
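The "Random" column appears to be what one gets by spreading the 98 observed sites over the regions in proportion to their length. That reading is an inference from the numbers on the slide, not something the slide states; the sketch below checks it with a hypothetical helper, `random_expectation`.

```python
GENOME_LENGTH = 16569   # human mtDNA length in bp, from the slide
OBSERVED_TOTAL = 98     # total observed sites, from the slide

# Region lengths in bp, from the slide
REGION_BP = {
    "D-loop": 1122,
    "tRNA": 1508,
    "rRNA": 2513,
    "Protein Syn": 2834,
    "Protein NonSyn": 8533,
    "Intergenic": 88,
}


def random_expectation(bp: int, total_sites: int = OBSERVED_TOTAL,
                       genome_length: int = GENOME_LENGTH) -> float:
    """Expected number of sites if they fell uniformly along the genome."""
    return total_sites * bp / genome_length


expected = {region: round(random_expectation(bp), 1) for region, bp in REGION_BP.items()}
print(expected)
```

Under this assumption the D-loop holds about five times more sites than expected (34 vs 6.6), while non-synonymous protein-coding sites hold about half (25 vs 50.5).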
  69. Transmission categories (allele frequencies in mother and child)

    Class | Total | Family | Site | Major | Minor | mother cheek | mother blood | child cheek | child blood
    All | 22 | M188 | 5107 | C | T | 0.152 | 0.165 | 0.228 | 0.225
    Child | 16 | M500 | 4191 | A | T | 0 | 0 | 0.047 | 0.056
    Mother | 22 | M494 | 9196 | G | A | 0.032 | 0.03 | 0 | 0
    Gain | 13 | M137 | 8953 | A | G | 0 | 0.013 | 0 | 0
    Loss | 25 | M249 | 16093 | C | T | 0.101 | 0.015 | 0.03 | 0
    Total | 98
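One way to read the five classes is by counting in how many of the four tissues the minor allele reaches the 1% detection threshold. The rule below is a guess that happens to reproduce the five example rows on this slide; it is not the authors' stated definition, and `classify_quartet` is a hypothetical name.

```python
def classify_quartet(mother_cheek: float, mother_blood: float,
                     child_cheek: float, child_blood: float,
                     threshold: float = 0.01) -> str:
    """Assign a transmission class from the four MAFs of a quartet (assumed rule)."""
    mother = [mother_cheek >= threshold, mother_blood >= threshold]
    child = [child_cheek >= threshold, child_blood >= threshold]
    detected = sum(mother) + sum(child)
    if detected == 4:
        return "All"
    if detected == 3:
        return "Loss"    # missing from exactly one tissue
    if detected == 1:
        return "Gain"    # present in exactly one tissue
    if all(mother) and not any(child):
        return "Mother"
    if all(child) and not any(mother):
        return "Child"
    return "Unclassified"


# Example rows from the slide: (mother cheek, mother blood, child cheek, child blood)
examples = {
    "All": (0.152, 0.165, 0.228, 0.225),
    "Child": (0, 0, 0.047, 0.056),
    "Mother": (0.032, 0.03, 0, 0),
    "Gain": (0, 0.013, 0, 0),
    "Loss": (0.101, 0.015, 0.03, 0),
}
predicted = {label: classify_quartet(*mafs) for label, mafs in examples.items()}
print(predicted)
```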
  70. CHILD R2=0.92 MOTHER R2=0.49 CHEEK R2=0.13 BLOOD R2=0.29 Stronger shift

    between generations tissues generations cheek blood cheek blood mother child mother child
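  71. Effective bottleneck size (N) ≈ 32-35

    $N = \frac{p(1-p)}{\sigma^2_{\mathrm{gen}}}$, with $\sigma^2_{\mathrm{gen}}$ the genetic variance. Millar, et al (2008). PLoS Genet. 4(10):e1000209 [Figure: estimates using raw variance and accounting for mitotic segregation, plotted against maternal M.A.F.] $\sigma^2_{\mathrm{gen}} = \sigma^2_{\mathrm{raw}} - 4\sigma^2_{\mathrm{measure}}$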
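The two relations on this slide, N = p(1-p)/σ² for the genetic variance and genetic variance = raw variance - 4 × measurement variance, can be worked through numerically. The sketch takes p to be the maternal minor allele frequency, as the slide's axis label suggests, and all input numbers are made up for illustration, not taken from the study.

```python
def genetic_variance(raw_variance: float, measurement_variance: float) -> float:
    """Correct the raw variance for measurement error: raw - 4 * measure."""
    return raw_variance - 4 * measurement_variance


def bottleneck_size(p: float, variance: float) -> float:
    """Effective bottleneck size N = p(1 - p) / variance."""
    return p * (1 - p) / variance


# Hypothetical inputs, chosen only to show the arithmetic
maternal_maf = 0.10
raw_var = 0.0030
measure_var = 0.0001

gen_var = genetic_variance(raw_var, measure_var)  # 0.0030 - 0.0004 = 0.0026
n_eff = bottleneck_size(maternal_maf, gen_var)    # 0.09 / 0.0026, about 34.6
print(round(gen_var, 4), round(n_eff, 1))
```

A larger variance in allele frequency between mother and child implies a smaller bottleneck, which is why subtracting measurement noise raises the estimate of N.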
  72. Site | Region | WT allele | Mut allele | Effect | Family | mother cheek | mother blood | child cheek | child blood | Disease(s) | Disease allele freq.

    (cheek and blood columns give the mutant allele frequency)
    195 | D-loop | T | C | - | M494 | 0.084 | 0.012 | 0.002 | 0.001 | Bipolar disorder | 1.0 (a)
    1,391 | 12S | T | C | - | M513 | 0.04 | 0.027 | 0.001 | 0 | Hypertrophic cardiomyopathy | 1.0 (b)
    1,555 | 12S | A | G | - | M520 | 0.002 | 0.001 | 0.014 | 0.014 | Deafness | >0.52 (c)
    2,352 | 16S | C | T | - | SC16 | 0.429 | 0.437 | 0.247 | 0.245 | Left ventricular noncompaction | 1.0 (d)
    3,242 | Leu | G | A | - | M242 | 0.001 | 0.001 | 0.008 | 0.016 | Renal tubular dysfunction | >0.49 (e)
    3,243 | Leu | A | G | - | M512 | 0.335 | 0.144 | 0.686 | 0.611 | MELAS, MIDD; MERRF; CPEO | >0.5 (f)
    12,634 | ND5 | A | G | I to V | M203 | 0.001 | 0.025 | 0.001 | 0.001 | Thyroid cancer (cell line) | 1.0 (g)
    13,708 | ND5 | G | A | A to T | SC8 | 0.001 | 0 | 0.022 | 0.016 | LHON | >0.92 (h)
    One in five families carries a disease-associated mtDNA mutation
    a. Rollins B et al. (2009) PLoS One 4(3):e4913 b. Prasad GN et al. (2006) Int J Cardiol 109:432–433. c. Del Castillo FJ et al. (2003) J Med Genet 40:632–6. d. Tang S, Huang T (2010) Biotechniques 48:287–296. e. Wortmann SB et al. (2012) Eur J Med Genet 55:552–6. f. Ma Y et al. (2009) Mitochondrion 9:139–43. g. Abu-Amero KK, et al. (2005) Oncogene 24:1455–60. h. Du W-D et al. (2011) Dis Markers 30:181–90.
  73. Conclusions ‣ Severe germline bottleneck size (≈ 35) ‣ Positive

    association between the number of heteroplasmies in a child and maternal age at fertilization ‣ Mutation rate 1.3×10⁻⁸ mutations/site/year ‣ One in five individuals carries a disease-associated heteroplasmy
  74. Maternal age effect and severe germ-line bottleneck in the inheritance

    of human mitochondrial DNA Boris Rebolledo-Jaramilloa,1, Marcia Shu-Wei Sub,1, Nicholas Stolera, Jennifer A. McElhoec, Benjamin Dickinsd, Daniel Blankenberga, Thorfinn S. Korneliussene,f, Francesca Chiaromonteg, Rasmus Nielsene, Mitchell M. Hollandc, Ian M. Paulh, Anton Nekrutenkoa,2, and Kateryna D. Makovab,2 Departments of aBiochemistry and Molecular Biology, bBiology, and gStatistics, cForensic Science Program, Pennsylvania State University, University Park, PA 16802; dSchool of Science and Technology, Nottingham Trent University, Nottingham NG1 4BU, United Kingdom; eDepartment of Integrative Biology, University of California, Berkeley, CA 94720; fCentre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, DK-1350 Copenhagen, Denmark; and hDepartment of Pediatrics, College of Medicine, Pennsylvania State University, Hershey, PA 17033 Edited by Michael Lynch, Indiana University, Bloomington, IN, and approved September 8, 2014 (received for review May 20, 2014) The manifestation of mitochondrial DNA (mtDNA) diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted owing to a lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother–child pairs of European ancestry (a total of 156 samples, each sequenced at ∼20,000× per site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-associated heteroplasmy, with minor allele frequency ≥1%. reduction in the number of mtDNA segregating units during oogenesis (6–8). The size of the bottleneck for mice has been evaluated to be 185 (9), yet for humans this size is difficult to obtain experimentally. Published estimates of the human bottleneck size are too broad [1–200 (10, 11)] to be useful in predicting the transmission of disease variants. Genetic drift theory predicts that a small bottleneck size will result in drastic shifts in heteroplasmy levels from a mother to her child, potentially reaching nondisease levels or levels with higher disease severity. After fertilization, mtDNA variants are distributed among cells owing to mitotic segregation - the random partitioning of mitochondria during cell divisions (12). We also lack an accurate estimate of the germ-line mtDNA mutation rate in humans, with pedigree and PNAS 111(43):15474 (2014)
  75. ?

  76. PacBio acquisition (PAR 15-088) • Major and minor user groups

    - Major = 3+ PIs with NIH funding (75% AUT) - Minor = NSF, DoE, DoD … • Structure - Justification of need (9 pages) - Technical expertise (3 pages) - Research projects (30 pages) - Summary Tables (6 pages) - Admin (6 pages) - Inst. commitment (3 pages) - Overall benefit (3 pages)