Penn State BMB Lunch

May 5 | Lunch Nekrutenko Lab @galaxyproject / #usegalaxy http://www.galaxyproject.org

People
Research: Boris Rebolledo Jaramillo, Nick Stoler
Engineering: Dan Blankenberg, Nate Coraor, Jennifer Jackson, Dave Bouvier, John Chilton, Martin Čech
Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Huck and the Pennsylvania Department of Public Health

Talking points • Research directions • PacBio acquisition proposal

Themes
• Big data computation in life sciences
- Reproducibility
- Analysis standardization
- Scalability
- Community building
• Mutational dynamics in constrained systems
- Variant calling in non-diploid mixtures
- High throughput mutation detection assays
- Experimental evolution

What is the ultimate product of the open-source software development?

Contributions

Publications

More than 70 known public Galaxy servers
15+ general servers
Domain specific servers including: Ballaxy for structure based computational biology, Cistrome for regulatory sequence analysis, Genomic Hyperbrowser: statistical integration of genomic data, GigaGalaxy: integrating workflows published in GigaScience, Pathogen Portal: comparative analysis of host response to pathogens, ...
Dozens of large scale private Galaxy instances

Big data in life sciences

Who are we dealing with? Users Developers HPC

Users' troubles: - Data logistics - HPC - Poor knowledge of existing tools - Inability to develop new tools - Lack of transparency and reproducibility

Developers’ grief: - Limited tool exposure - Parameter picking troubles - Data format nightmare - High profile publications

HPC providers’ challenges: - Lack of HPC utilization skills -
Software is not optimized - HPC is heterogeneous

user HPC dev

user HPC dev Galaxy

Galaxy: accessible analysis system

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage
Open source software that makes integrating your own tools and data and customizing for your own site simple
An open extensible platform for sharing tools, datatypes, workflows, ...

Galaxy’s ideological goals: How best can data intensive methods be
accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?

Galaxy’s practical goals: How to arm researchers with access to
powerful compute and latest tools How to build a community of tool developers How to run Galaxy on any HPC

Describe analysis tool behavior abstractly
Analysis environment automatically and transparently tracks details
Workflow system for complex analysis, constructed explicitly or automatically
Pervasive sharing, and publication of documents with integrated analysis

Visualization and visual analytics

Ways to use Galaxy The public web service at http://usegalaxy.org
Install locally with many compute environments Deploy on a cloud using Cloudman Atmosphere

Galaxy in a world of increasingly complex analyses

We are in the age of multiple datasets

Galaxy’s user interface is designed to be simple and intuitive
for users without informatics expertise Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?

Users typically use many histories when working with many samples;
New multiple history view makes working with 100s of histories easy

A not-so-new feature: mapping over multiple datasets However, this breaks
down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)

Dataset collections: complex combinations of datasets that can be treated as a single unit

Dataset Collections Organize user data Individual Datasets Collection Collection Contents

Operations over collections
For “list” collections, existing tools can automatically be mapped across the entire collection
Existing tools that support multiple inputs and one output act as reducers
Many existing tools just work; but “structured” collections like “paired” need explicit support in tools

Map/reduce in workflows: More Powerful Workflows
Arbitrary # of Inputs (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
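The map and reduce behavior described for "list" collections can be sketched in plain Python. This is only an illustration of the idea, not Galaxy's implementation: `map_over_collection`, `reduce_collection`, and the two toy tools are hypothetical names, and strings stand in for datasets.

```python
from typing import Callable, Dict, List

# A "list" collection is modeled as an ordered mapping of
# element identifier -> dataset (plain strings stand in for files).
Collection = Dict[str, str]


def map_over_collection(tool: Callable[[str], str], collection: Collection) -> Collection:
    """Run a one-input tool once per element, keeping element identifiers."""
    return {name: tool(dataset) for name, dataset in collection.items()}


def reduce_collection(tool: Callable[[List[str]], str], collection: Collection) -> str:
    """Run a multi-input, single-output tool once over all elements."""
    return tool(list(collection.values()))


def toy_trim(reads: str) -> str:
    """Hypothetical one-input tool: strip trailing N bases."""
    return reads.rstrip("N")


def toy_concatenate(datasets: List[str]) -> str:
    """Hypothetical reducer: merge all inputs into one output."""
    return "".join(datasets)


samples = {"sample1": "ACGTNN", "sample2": "GGCCN"}
trimmed = map_over_collection(toy_trim, samples)     # one job per input
merged = reduce_collection(toy_concatenate, trimmed)  # merged output
```

The mapped step keeps one output per input under the same identifier, which is what lets a later step treat the result as a collection again.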

Enhanced Tuxedo Suite Workflow: RNA-Seq workflow based on the Tuxedo suite.

Dataset Collections
Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping
More complex reductions, other collection operations in progress
Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
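A small sketch of what "structure is preserved through mapping" could look like for nested collections. It assumes collections are modeled as nested dictionaries with datasets at the leaves; `map_nested` is a hypothetical helper, not a Galaxy function.

```python
from typing import Callable, Dict, Union

# A nested collection: either a dataset (string) or a mapping of
# identifier -> nested collection, to arbitrary depth.
Nested = Union[str, Dict[str, "Nested"]]


def map_nested(tool: Callable[[str], str], collection: Nested) -> Nested:
    """Apply a one-input tool to every leaf, keeping the nesting intact."""
    if isinstance(collection, dict):
        return {name: map_nested(tool, child) for name, child in collection.items()}
    return tool(collection)


# Replicates, each holding a forward/reverse pair (a list of pairs).
replicates = {
    "rep1": {"forward": "acgt", "reverse": "ttga"},
    "rep2": {"forward": "ggca", "reverse": "catg"},
}
upper = map_nested(str.upper, replicates)
```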

An analysis is really a workflow

As analysis needs become increasingly complex, typical users have moved from running individual tools to primarily running workflows

For research use, users need to be able to construct
and modify workflows, not just run existing best practice pipelines The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows

(Goecks et al. Cancer Medicine, 2015)

However, for reproducibility, we want to be able to ensure that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results

1 2 3 ∞ http://usegalaxy.org http://usegalaxy.org/community ... Galaxies on private
clouds Galaxies on public clouds ... private Galaxy installations Private Tool Sheds Galaxy Tool Shed

Fostering the tool developer community

The Galaxy Toolshed: Sharing tools, workflows, and their dependencies

Repositories are owned by the contributor, can contain tools, workflows, etc.
Backed by version control, a complete version history is retained for everything that passes through the toolshed
Galaxy instance admins can install tools directly from the toolshed using only a web UI
Support for recipes for installing the underlying software that tools depend on (also versioned)

ToolShed Challenges: Good for deployment and archiving, difficult for development

New command line tools to address concerns from tool developers
Tool Development: Planemo
Command-line tools to aid development.
◦ Test tools quickly without worrying about configuration files.
◦ Check tools for common bugs and best practices.
◦ Optimized publishing to the ToolShed.
◦ Testbed for new dependency management - Homebrew and Homebrew-science

Move to a git[hub]-centric development workflow
Within three weeks, four major community contributions to core tools

Tool citations, credit and incentivization
Embed DOIs in the tool configuration; Galaxy resolves them and provides a list of citations, with links, which can be exported for reference managers
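As a rough illustration of DOIs embedded in a tool configuration, the sketch below parses a made-up tool XML with a `<citations>` block and turns each DOI into a resolvable link. The tool id and name are invented for the example, the DOI is the BioTechniques paper cited later in this deck, and `doi_links` is a hypothetical helper rather than Galaxy's own resolver.

```python
import xml.etree.ElementTree as ET

# Illustrative tool configuration; only the <citations> block matters here.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
    <citations>
        <citation type="doi">10.2144/000114146</citation>
    </citations>
</tool>
"""


def doi_links(tool_xml: str) -> list:
    """Collect DOI citations from a tool config and build doi.org links."""
    root = ET.fromstring(tool_xml)
    links = []
    for citation in root.findall("./citations/citation"):
        if citation.get("type") == "doi" and citation.text:
            links.append("https://doi.org/" + citation.text.strip())
    return links


links = doi_links(TOOL_XML)
```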

ToolShed Challenges: Complex dependency definitions, packaging dependencies is a rabbit hole

Virtualize everything: control the host environment

POSTER PRESENTATION, Open Access
CLIA-certified next-generation sequencing analysis in the cloud
Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1
From Beyond the Genome 2012, Boston, MA, USA. 27-29 September 2012
The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with [...] patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2).
This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis.
Author details: 1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA.
Published: 1 October 2012. Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54
CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)

Share a snapshot of this instance
Current support for archiving instances with CloudMan
Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main

New approaches for dependency management
Alternative approach for installing dependencies: Homebrew/Linuxbrew
How can we run community contributed tools safely and efficiently?
Support for defining dependencies as Docker containers

What is Docker?
Traditional Virtual Machine vs. Docker
Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient

ToolShed and Docker Tools can assert their dependencies are provided
by a Docker container Potentially tool execution is more secure due to isolation Easier for tool developers to package dependencies Much easier for end-users to get dependencies
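To make the idea of a tool whose dependencies live in a Docker container more concrete, here is a minimal sketch that wraps a tool command line in `docker run`. It only builds the command and does not execute it. The image name, job directory and `docker_command` helper are hypothetical, and this is not how Galaxy itself assembles container commands.

```python
import shlex
from typing import List


def docker_command(image: str, tool_command: List[str], job_dir: str) -> List[str]:
    """Wrap a tool command so it runs inside a container.

    The job directory is mounted at the same path inside the container
    and used as the working directory, so relative file names resolve.
    """
    return [
        "docker", "run", "--rm",       # remove the container when done
        "-v", f"{job_dir}:{job_dir}",  # expose job files to the container
        "-w", job_dir,                 # run from the job directory
        image,
    ] + tool_command


# Hypothetical image and job directory, for illustration only.
argv = docker_command("example/fastqc:0.11", ["fastqc", "reads.fastq"], "/tmp/job42")
command_line = " ".join(shlex.quote(part) for part in argv)
```

Only the mounted job directory is visible to the tool, which is one way the isolation mentioned above limits what a community-contributed tool can touch.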

What if you need a new, ad hoc analysis within Galaxy?

Interactive programming environments

For researchers without informatics expertise, the web UI and existing tools are often sufficient
For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend)
But many users can do some programming and would like the benefits of Galaxy with the flexibility to do some scripting

Docker enables interactive environments
Framework allows spinning up secure* isolated environments that can interact with the Galaxy history through Galaxy’s API
Initial implementation supporting IPython Notebook

Next steps Support for Jupyter (both Python and Julia) and
RStudio environments Interactive programming environments as first class citizens: full provenance tracking, establish inputs and outputs, be used in workflows, etc. Databases as first class citizens, e.g. GEMINI query interface as a reusable tool

Using Galaxy main to drive scalability improvements…

PSC, Pittsburgh
Blacklight • 4,096 cores • 32 TB memory • Dedicated resources
SDSC, San Diego
Trestles • 10,368 cores • 20.7 TB memory • Shared resources
TACC, Austin
Galaxy Cluster • 256 cores • 2 TB memory
Rodeo • 128 cores • 1 TB memory
Corral/Stockyard • 20 PB disk
Stampede • 462,462 cores • 205 TB memory

funded by the National Science Foundation Award #ACI-1445604

A user-friendly cloud environment designed to give researchers access to interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream
Two widely used biology platforms will be supported - Galaxy and iPlant
Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.

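Transparent Migrations using Galaxy’s Hierarchical Object Store (Nate Coraor)
Backends: Corral, Corral Staging, Penn State
[Diagram: Galaxy Server Processes read and write data through the object store. Read Data checks In Corral?, In Staging?, In PSU? in turn; a Yes at any step finds the object, and No at all three gives Object Not Found.]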
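The read path of the hierarchical object store (check each backend in order, fail with "Object Not Found" only when every backend says no) can be sketched as below. Dictionaries stand in for storage backends, the class and method names are hypothetical, and sending writes to the first backend is an assumption rather than something the slide states.

```python
from typing import Dict, List


class ObjectNotFound(Exception):
    """Raised when no backend holds the requested object."""


class HierarchicalObjectStore:
    def __init__(self, backends: List[Dict[str, bytes]]):
        # Backends are ordered from most to least preferred.
        self.backends = backends

    def read(self, object_id: str) -> bytes:
        """Return the object from the first backend that has it."""
        for backend in self.backends:
            if object_id in backend:
                return backend[object_id]
        raise ObjectNotFound(object_id)

    def write(self, object_id: str, data: bytes) -> None:
        # Assumption: new data always goes to the preferred backend.
        self.backends[0][object_id] = data


# Toy stand-ins for Corral, Corral Staging and Penn State storage.
corral, staging, psu = {}, {}, {"dataset_1": b"old data"}
store = HierarchicalObjectStore([corral, staging, psu])
store.write("dataset_2", b"new data")
```

Because old objects stay readable from the last backend while new ones land on the first, data can be moved between sites gradually without callers noticing.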

Pulsar: Galaxy job runner that can run almost anywhere. No shared filesystem, stages all necessary Galaxy components (John Chilton)
[Diagram: Galaxy Server Processes on Galaxy Server VMs (TACC) send job control (AMQP) through a Messaging Server to Pulsar on Blacklight (PSC) and Pulsar on Stampede (TACC); data transfer over HTTPS]

Summary
Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible
Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes
New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability
By supporting and leveraging tool developers the Galaxy community can collectively keep up with rapid changes in available tools

Galaxy is a community!
Join us on IRC, mailing lists, Galaxy Biostar
Contribute code on Bitbucket, GitHub, or the ToolShed
Join us for a Hackathon or our annual conference
Fifth annual Galaxy Community Conference: Hackathon, training day, and two days of talks

Boris Rebolledo-Jaramillo (Nekrutenko’s Lab)
Challenges in the identification of mtDNA heteroplasmy: a case of wasted data, and silver linings
IBIOS-Bioinformatics and Genomics

Acknowledgements
Penn State Clinical and Translational Sciences Institute • Jessica Beiler, MPH • Lily Borhan • Clinical nurses
Michael DeGiorgio
Prabhani Kuruppummulage Don
TEAM at PSU

Human mtDNA: 16,569 bp, 37 genes: rRNA (2), tRNA (22), mRNA (13), plus non-coding regions; + thousands of nuclear encoded genes
[Figure: mtDNA map showing genes for OXPHOS complexes I, III, IV and V, tRNA genes labeled by one-letter amino acid code, and the HSP1, HSP2, LSP, OH, OL and TERM sites]
Guja and Garcia-Diaz (2012) Biochimica et Biophysica Acta 1819: 939–947

‣ Diabetes ‣ Male infertility ‣ Parkinson’s ‣ Alzheimer’s ‣
Cancer Source: Fauci AS, Kasper DL, Braunwald E, Hauser SL, Longo DL, Jameson JL, Loscalzo J; Harrison’s Principles of Internal Medicine, 17th Edition: http://www.accessmedicine.com Copyright (C) The McGraw-Hill Companies, Inc. All rights reserved

homoplasmy heteroplasmy wild type mutant nucleus

mtDNA transmitted by mother only
[Figure: mtDNA/cell (log) and mtDNA genotypic variance across development, from fertilized oocyte through PGCs, primary oocyte, secondary oocyte and mature oocyte to birth; somatic cells vs. germ cells; folliculogenesis; bottleneck]
Poulton, et al (2010) PLoS Genet. 6(8)

Dramatic frequency shifts wild type mutant nucleus Primordial germ cell
Primary oocytes Mature oocytes High Intermediate Low % Mutant genome in oﬀspring Taylor and Turnbull (2005) Nature Reviews Genetics 6, 389-402

Research topic: Dynamics of heteroplasmy transmission • Prevalence • Bottleneck

Challenge: Accurate detection of heteroplasmy
Li et al (2011) Am J Hum Genet. 87:237-249: 131 individuals, 37 hets, ≥10%. Het are not common
He et al (2010) Nature. 464(7288): 610–614: 10 individuals, 40 hets, ≥1.6%. Het are common

RESEARCH, Open Access
Dynamics of mitochondrial heteroplasmy in three families investigated via a repeatable re-sequencing study
Hiroki Goto1†, Benjamin Dickins2†, Enis Afgan3, Ian M Paul4, James Taylor3*, Kateryna D Makova1* and Anton Nekrutenko2*
Abstract
Background: Originally believed to be a rare phenomenon, heteroplasmy - the presence of more than one mitochondrial DNA (mtDNA) variant within a cell, tissue, or individual - is emerging as an important component of eukaryotic genetic diversity. Heteroplasmies can be used as genetic markers in applications ranging from forensics to cancer diagnostics. Yet the frequency of heteroplasmic alleles may vary from generation to generation due to the bottleneck occurring during oogenesis. Therefore, to understand the alterations in allele frequencies at heteroplasmic sites, it is of critical importance to investigate the dynamics of maternal mtDNA transmission.
Results: Here we sequenced, at high coverage, mtDNA from blood and buccal tissues of nine individuals from three families with a total of six maternal transmission events. Using simulations and re-sequencing of clonal DNA, [...]
Goto et al. Genome Biology 2011, 12:R59 http://genomebiology.com/2011/12/6/R59
3 families = 9 individuals; blood and cheek tissues; detection threshold ≥ 2%

Goto H, Dickins B et al. (2011) Genome Biology 12:R59
[Figure: Family 1, Family 2, Family 3; sites 14053, 8992, 7028, 5063; alleles A, C, T, G]

Expectations:
✓ Few heteroplasmies per individual
✓ Evidence of transmission
✓ Variable MAF

[Plot: N of sites vs. MAF %, Family 41]
Suspicious samples have a large number of sites and a tight distribution of MAF, i.e. each site has approximately the same MAF
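The "suspicious sample" pattern (many sites, all at roughly the same MAF) could be screened for with a check like the one below. Both thresholds and the example frequencies are invented for illustration; the deck does not give cut-offs.

```python
from statistics import pstdev
from typing import List


def is_suspicious(mafs: List[float], max_sites: int = 10, min_spread: float = 0.01) -> bool:
    """Flag a sample with many sites whose MAFs are tightly clustered.

    max_sites and min_spread are hypothetical thresholds.
    """
    return len(mafs) > max_sites and pstdev(mafs) < min_spread


# Few sites with variable MAF: what a clean sample is expected to show.
clean = [0.02, 0.15, 0.40]

# Many sites, all near 5%: consistent with a ~5% admixture of foreign mtDNA.
contaminated = [0.050, 0.052, 0.049, 0.051, 0.050, 0.048,
                0.052, 0.051, 0.049, 0.050, 0.051, 0.050]
```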

Heteroplasmic sites
SITE  MAJOR  MINOR
3     A      G
7     G      A
Major allele sequence: ACACTAGGAT
Minor allele sequence: ACGCTAAGAT
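The major and minor allele sequences on this slide can be rebuilt from the site table with a few lines of Python. The sketch assumes 1-based site positions and a consensus that already carries the major alleles; `allele_sequences` is a hypothetical helper name.

```python
from typing import List, Tuple


def allele_sequences(consensus: str, sites: List[Tuple[int, str, str]]) -> Tuple[str, str]:
    """Build major and minor allele sequences from heteroplasmic sites.

    Each site is (1-based position, major allele, minor allele).
    """
    major = list(consensus)
    minor = list(consensus)
    for position, major_allele, minor_allele in sites:
        major[position - 1] = major_allele
        minor[position - 1] = minor_allele
    return "".join(major), "".join(minor)


consensus = "ACACTAGGAT"
sites = [(3, "A", "G"), (7, "G", "A")]
major_seq, minor_seq = allele_sequences(consensus, sites)
```

Sequences built this way per sample are what get placed on a tree, so a contaminated sample's minor sequence clusters with the family it came from.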

Evidence for contamination. Samples cluster within an unrelated family. We
can hypothesize that individuals M52 and M52C1 got contaminated with either M57, M58, M58C1 or M58C2 Clean samples. Heteroplasmic sequences cluster with the expected family

M45_BL M58C1_BL M67C1_BL M468C3_BL M45C2_CH M468_CH M483C4_CH M51_BL M45C3_BL M468_BL
M45_CH M51C2_CH M483_CH M468C3_CH M57_BL M52_BL M483_BL M468C4_BL M58C1_CH M468C1_CH M483C3_CH M45C1_BL M58C2_BL M67C2_BL M57_CH M45C3_CH M67C1_CH M468C4_CH M51C1_BL M45C4_BL M468C1_BL M483C2_BL M58C2_CH M483C1_CH M58_BL M52C1_BL M483C1_BL M51_CH M45C4_CH M468C2_CH M45C2_BL M67_BL M468C2_BL M45C1_CH M52C1_CH M483C2_CH M51C2_BL M46_BL M58_CH M51C1_CH M46_CH M67_CH

Evidence of contamination • phylogenetic tree • order in plate • unexpected spike-in
DESIGN TO CONFIRM CONTAMINATION
A. Spike-ins: carry over? • phiX174 • pUC18
B. Sample preparation layout: adjacent to contaminant?

Implementation

BioTechniques 56:134 (2014), Reports
Controlling for contamination in re-sequencing studies with a reproducible web-based phylogenetic approach
Benjamin Dickins1,2,*,†, Boris Rebolledo-Jaramillo1,3,*, Marcia Shu-Wei Su2, Ian M. Paul4, Daniel Blankenberg1, Nicholas Stoler3, Kateryna D. Makova2, and Anton Nekrutenko1
1Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, 2Department of Biology, Penn State University, University Park, PA, 3Interdisciplinary Graduate Program in BioSciences, Penn State University, University Park, PA, and 4Department of Pediatrics, Penn State College of Medicine, Hershey, PA
* B.D. and B.R.-J. contributed equally to this work. †Present address: School of Science and Technology, Nottingham Trent University, UK
BioTechniques 56:134-141 (March 2014) doi 10.2144/000114146
Keywords: re-sequencing, contamination, next-generation sequencing, Galaxy, reproducibility

mother-child pair = family PA High resolution Population scale

39 families x 2 individuals x 2 tissues (cheek and blood) = 156 samples

Long-range PCR to enrich mtDNA (9 kb + 9 kb amplicons), cheek and blood

250 bp paired-end (MiSeq) reads avoid NUMTs: no misalignments
NUMTs: NUclear MiTochondrial sequences
[Plot: misaligned reads vs. read length (bp)]

MAP READS (WITH BWA) TO reference: hg19+rCRS+pUC18+phiX174
REQUIRE READ PAIRS TO
1. map to mtDNA
2. map in proper orientation within a pair
3. be non-chimeric
IDENTIFY VARIABLE SITES USING GALAXY TOOLS BY REQUIRING
1. MAF ≥ 1% (in forward and reverse strands)
2. Coverage ≥1000x
3. No strand bias
CONFIRM HETEROPLASMIES (each YES/NO decision leads either to DISCARD or onward)
- Did the site result from contamination?
- Statistically significant with an LRT test?
- Validates with Sanger or ddPCR?
CONTAMINATION PIPELINE (Dickins et al, 2014)
- Expected spike-in alignments?
- Sensible number of sites per sample and unbiased MAF distribution?
- Phylogenetic analysis of contamination
FINAL SET OF POINT HETEROPLASMIES
Legend: Pre-process NGS data; Identify variable sites; Confirm heteroplasmies; Contamination pipeline

Computational methods (20,000x/sample)
Reads filter: Mapping Quality ≥20, Base Quality ≥30, alignment artifacts removed
Sites filter
Statistical support
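The read-level requirements listed in this deck (pairs that map to mtDNA in proper orientation, are non-chimeric, have mapping quality ≥20, and contribute only bases with quality ≥30) can be sketched as two small filters. The `ReadPair` record, the `rCRS` contig name and both function names are hypothetical; a real pipeline would read these fields from BAM records.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReadPair:
    contig: str           # contig both mates aligned to
    proper_pair: bool     # mates in the expected orientation
    chimeric: bool        # any split/supplementary alignment
    mapping_quality: int  # lower MAPQ of the two mates


def keep_pair(pair: ReadPair, mt_contig: str = "rCRS", min_mapq: int = 20) -> bool:
    """Keep pairs that map to mtDNA, properly paired and non-chimeric."""
    return (
        pair.contig == mt_contig
        and pair.proper_pair
        and not pair.chimeric
        and pair.mapping_quality >= min_mapq
    )


def confident_bases(bases: str, qualities: List[int], min_baseq: int = 30) -> List[Tuple[int, str]]:
    """Return (offset, base) for bases whose quality passes the cut-off."""
    return [(i, b) for i, (b, q) in enumerate(zip(bases, qualities)) if q >= min_baseq]


pairs = [
    ReadPair("rCRS", True, False, 60),  # kept
    ReadPair("chr1", True, False, 60),  # nuclear hit, e.g. a NUMT
    ReadPair("rCRS", True, True, 60),   # chimeric
    ReadPair("rCRS", True, False, 10),  # low mapping quality
]
kept = [p for p in pairs if keep_pair(p)]
good = confident_bases("ACGT", [35, 12, 30, 29])
```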

Computational methods (20,000x/sample)
Reads filter
Sites filter: M.A.F. ≥ 1%, depth ≥1,000x, strand bias removed, low complexity removed; 172 sites
Statistical support
M.A.F.: minor allele frequency
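A minimal sketch of the site-level filter follows: MAF ≥ 1% on both strands and total depth ≥ 1,000x come from the slides, while the strand-bias rule (strand MAFs within a factor of two of each other) is a stand-in assumption, since the deck does not say how strand bias was measured. `StrandCounts` and `passes_site_filter` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StrandCounts:
    minor_forward: int  # minor-allele reads on the forward strand
    depth_forward: int  # all reads on the forward strand
    minor_reverse: int
    depth_reverse: int


def passes_site_filter(
    c: StrandCounts,
    min_maf: float = 0.01,
    min_depth: int = 1000,
    max_strand_ratio: float = 2.0,  # assumed strand-bias rule
) -> bool:
    if c.depth_forward == 0 or c.depth_reverse == 0:
        return False
    if c.depth_forward + c.depth_reverse < min_depth:
        return False
    maf_f = c.minor_forward / c.depth_forward
    maf_r = c.minor_reverse / c.depth_reverse
    # The minor allele must be seen at >= 1% on both strands.
    if maf_f < min_maf or maf_r < min_maf:
        return False
    # Both MAFs are positive here, so the ratio is well defined.
    return max(maf_f, maf_r) / min(maf_f, maf_r) <= max_strand_ratio


good_site = StrandCounts(150, 10000, 140, 9500)
shallow_site = StrandCounts(10, 400, 9, 380)
one_strand_site = StrandCounts(300, 10000, 5, 10000)
biased_site = StrandCounts(500, 10000, 110, 10000)
```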

Computational methods (20,000x/sample)
Reads filter
Sites filter
Statistical support: Likelihood (by R. Nielsen) √, Poisson (Li and Stoneking, 2012) √; 172 sites

QUARTET: site-specific group of MAFs (mother cheek, mother blood, child cheek, child blood)
172 sites among 98 quartets

172 sites among 98 quartets
Family  Site  Major  Minor  Mother cheek  Mother blood  Child cheek  Child blood
M494    9196  G      A      0.032         0.03          0            0

Successful validation of sites droplet digital PCR (ddPCR) Illumina MAF
(%) ddPCR MAF (%) R2=0.79 R2=0.95 Sanger Illumina MAF (%) Sanger MAF (%) R2=0.41 R2=0.75

31/39 3 sites/family ; 1.5 sites/tissue Positions 185,189, 214, 215,
16093 and 16183 found in multiple families. ts/tv = 48

Evidence of purifying selection (number of sites)
Region          bp        Observed  Random  Neutral
D-loop          1,122     34        6.6*    7.9*
tRNA            1,508     6         8.9     10.6
rRNA            2,513     13        14.9    17.7
Protein Syn     2,834     20        16.8    20
Protein NonSyn  8,533     25        50.5*   60.2*
Intergenic      88        0         0.5     0.6
Total           16,569**  98        n/a     197
*Significantly different from observed (p<0.05, test comparing two proportions)
**Direct sum is inconsistent with the length of the mtDNA due to overlapping annotations
The numbers of synonymous (Syn) and non-synonymous (NonSyn) sites were calculated with the Nei-Gojobori method.
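The "Random" column is numerically consistent with spreading the 98 observed sites uniformly over the 16,569 bp genome, so that each region expects 98 × bp / 16,569 sites. This is an inference from the numbers rather than something the slide states, and the sketch below only reproduces that arithmetic; how the "Neutral" column was derived is not attempted here.

```python
MTDNA_LENGTH = 16569  # bp, human mtDNA

# region -> (length in bp, observed heteroplasmic sites), from the table
regions = {
    "D-loop": (1122, 34),
    "tRNA": (1508, 6),
    "rRNA": (2513, 13),
    "Protein Syn": (2834, 20),
    "Protein NonSyn": (8533, 25),
    "Intergenic": (88, 0),
}

total_observed = sum(observed for _, observed in regions.values())

# Expected count if sites fell uniformly at random along the genome.
expected_random = {
    name: round(total_observed * bp / MTDNA_LENGTH, 1)
    for name, (bp, _) in regions.items()
}
```

Comparing observed with expected shows the pattern the slide title points at: an excess in the D-loop (34 vs. about 6.6) and a deficit at non-synonymous sites (25 vs. about 50.5).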

Transmission categories (allele frequencies)
Class   Total  Family  Site   Major  Minor  Mother cheek  Mother blood  Child cheek  Child blood
All     22     M188    5107   C      T      0.152         0.165         0.228        0.225
Child   16     M500    4191   A      T      0             0             0.047        0.056
Mother  22     M494    9196   G      A      0.032         0.03          0            0
Gain    13     M137    8953   A      G      0             0.013         0            0
Loss    25     M249    16093  C      T      0.101         0.015         0.03         0
Total   98

Drift acting on transmission ∆MAF ∆MAF = mother -child

CHILD R2=0.92 MOTHER R2=0.49 CHEEK R2=0.13 BLOOD R2=0.29 Stronger shift
between generations tissues generations cheek blood cheek blood mother child mother child

Effective bottleneck size (N) ≈ 32-35
$N = \frac{p(1-p)}{\sigma^2_{\text{genetic}}}$, with $p$ the maternal M.A.F.
$\sigma^2_{\text{gen}} = \sigma^2_{\text{raw}} - 4\sigma^2_{\text{measure}}$
Estimates: using raw variance; accounting for mitotic segregation
Millar, et al (2008). PLoS Genet. 4(10):e1000209
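The two formulas on this slide translate directly into code. The input values below are made up to show the arithmetic and are not the study's data; the function names are hypothetical.

```python
def genetic_variance(var_raw: float, var_measure: float) -> float:
    """sigma^2_gen = sigma^2_raw - 4 * sigma^2_measure, as on the slide."""
    return var_raw - 4.0 * var_measure


def bottleneck_size(p: float, var_genetic: float) -> float:
    """N = p(1 - p) / sigma^2_genetic, with p the maternal MAF."""
    if var_genetic <= 0:
        raise ValueError("genetic variance must be positive")
    return p * (1.0 - p) / var_genetic


# Hypothetical inputs chosen only to illustrate the calculation.
var_gen = genetic_variance(var_raw=0.006, var_measure=0.00025)
n_eff = bottleneck_size(p=0.2, var_genetic=var_gen)
```

With these illustrative inputs the corrected variance is 0.005 and N comes out at about 32, the same order as the estimate quoted on the slide.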

Age of mother correlates with number of heteroplasmies

Age at fertilization correlates with number of heteroplasmies in child

Mutant allele frequency in mother and child at disease-associated sites
Site    Region  WT  Mut  Effect  Family  Mother cheek  Mother blood  Child cheek  Child blood  Disease(s)                      Disease allele freq.
195     D-loop  T   C    -       M494    0.084         0.012         0.002        0.001        Bipolar disorder                1.0 (a)
1,391   12S     T   C    -       M513    0.04          0.027         0.001        0            Hypertrophic cardiomyopathy     1.0 (b)
1,555   12S     A   G    -       M520    0.002         0.001         0.014        0.014        Deafness                        >0.52 (c)
2,352   16S     C   T    -       SC16    0.429         0.437         0.247        0.245        Left ventricular noncompaction  1.0 (d)
3,242   Leu     G   A    -       M242    0.001         0.001         0.008        0.016        Renal tubular dysfunction       >0.49 (e)
3,243   Leu     A   G    -       M512    0.335         0.144         0.686        0.611        MELAS, MIDD; MERRF; CPEO        >0.5 (f)
12,634  ND5     A   G    I to V  M203    0.001         0.025         0.001        0.001        Thyroid cancer (cell line)      1.0 (g)
13,708  ND5     G   A    A to T  SC8     0.001         0             0.022        0.016        LHON                            >0.92 (h)
One in five families carries a disease-associated mtDNA mutation
a. Rollins B et al. (2009) PLoS One 4(3):e4913
b. Prasad GN et al. (2006) Int J Cardiol 109:432–433.
c. Del Castillo FJ et al. (2003) J Med Genet 40:632–6.
d. Tang S, Huang T (2010) Biotechniques 48:287–296.
e. Wortmann SB et al. (2012) Eur J Med Genet 55:552–6.
f. Ma Y et al. (2009) Mitochondrion 9:139–43.
g. Abu-Amero KK, et al. (2005) Oncogene 24:1455–60.
h. Du W-D et al. (2011) Dis Markers 30:181–90.

Conclusions
‣ Severe germline bottleneck size (≈ 35)
‣ Positive association between the number of heteroplasmies in a child and maternal age at fertilization
‣ Mutation rate 1.3x10^-8 mutations/site/year
‣ One in five families carries a disease-associated heteroplasmy

Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA
Boris Rebolledo-Jaramillo(a,1), Marcia Shu-Wei Su(b,1), Nicholas Stoler(a), Jennifer A. McElhoe(c), Benjamin Dickins(d), Daniel Blankenberg(a), Thorfinn S. Korneliussen(e,f), Francesca Chiaromonte(g), Rasmus Nielsen(e), Mitchell M. Holland(c), Ian M. Paul(h), Anton Nekrutenko(a,2), and Kateryna D. Makova(b,2)
Departments of (a) Biochemistry and Molecular Biology, (b) Biology, and (g) Statistics, (c) Forensic Science Program, Pennsylvania State University, University Park, PA 16802; (d) School of Science and Technology, Nottingham Trent University, Nottingham NG1 4BU, United Kingdom; (e) Department of Integrative Biology, University of California, Berkeley, CA 94720; (f) Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, DK-1350 Copenhagen, Denmark; and (h) Department of Pediatrics, College of Medicine, Pennsylvania State University, Hershey, PA 17033
Edited by Michael Lynch, Indiana University, Bloomington, IN, and approved September 8, 2014 (received for review May 20, 2014)
The manifestation of mitochondrial DNA (mtDNA) diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted owing to a lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother–child pairs of European ancestry (a total of 156 samples, each sequenced at ∼20,000× per site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-associated heteroplasmy, with minor allele frequency ≥1%. [...]
[...] reduction in the number of mtDNA segregating units during oogenesis (6–8). The size of the bottleneck for mice has been evaluated to be 185 (9), yet for humans this size is difficult to obtain experimentally. Published estimates of the human bottleneck size are too broad [1–200 (10, 11)] to be useful in predicting the transmission of disease variants. Genetic drift theory predicts that a small bottleneck size will result in drastic shifts in heteroplasmy levels from a mother to her child, potentially reaching nondisease levels or levels with higher disease severity. After fertilization, mtDNA variants are distributed among cells owing to mitotic segregation: the random partitioning of mitochondria during cell divisions (12). We also lack an accurate estimate of the germ-line mtDNA mutation rate in humans, with pedigree and [...]
PNAS 111(43):15474 (2014)

Klavens Lab Syphonostat

PacBio acquisition (PAR 15-088)
• Major and minor user groups
- Major = 3+ PIs with NIH funding (75% AUT)
- Minor = NSF, DoE, DoD …
• Structure
- Justification of need (9 pages)
- Technical expertise (3 pages)
- Research projects (30 pages)
- Summary Tables (6 pages)
- Admin (6 pages)
- Inst. commitment (3 pages)
- Overall benefit (3 pages)
