
Reproducibility 2017

Reproducible research and bioinformatics

Leonardo Collado-Torres

June 24, 2017

Transcript

  1. Reproducible Research and
    Bioinformatics
    Leonardo Collado-Torres
    @fellgernon
    http://lcolladotor.github.io/


  2. Reproducible research
    What is reproducible research?
    • Have you heard about it?
    • How would you define it?
    • Is it the same as replicability?
    • Is it important?


  3. Reproducible research
    Research
    • Science moves forward when discoveries are
    replicated and reproduced
    Implementing Reproducible Research by Stodden, Leisch, Peng


  4. https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970


  5. https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970


  6. Open Science Collaboration, Science, 2015
    35/97 (36%) of replications had P < 0.05 in the same direction as the original study


  7. https://simplystatistics.org/2017/03/02/rr-glossy/


  8. https://github.com/jtleek/replication_paper/blob/gh-pages/in_the_media.md


  9. Patil, Peng & Leek, Perspectives on Psychological Science, 2016


  10. Reproducible research
    Key conclusions:
    • Researchers need a new definition for replication that
    acknowledges variation in both the original study and in
    the replication study. Specifically, a study replicates if the
    data collected from the replication are drawn from the
    same distribution as the data from the original experiment.
    • Multiple independent replications of the same
    study will be needed to definitively evaluate replication.
    Patil, Peng & Leek, Perspectives on Psychological Science, 2016

  11. Reproducible research
    Replication
    • Replication, the practice of independently
    implementing scientific experiments to validate
    specific findings, is the cornerstone of discovering
    scientific truth.
    Implementing Reproducible Research by Stodden, Leisch, Peng


  12. Kim et al., bioRxiv, 2017


  13. Kim et al., bioRxiv, 2017


  14. Reproducible research
    Reproducibility: from back in 2006
    • However, because of the time, expense, and
    opportunism of many current epidemiologic studies,
    it is often impossible to fully replicate their findings.
    An attainable minimum standard is
    "reproducibility," which calls for data sets and
    software to be made available for verifying
    published findings and conducting alternative
    analyses.
    Peng et al, Reproducible epidemiologic research, Am J Epidemiol., 2006


  15. Reproducible research
    Reproducibility
    • Reproducibility can be thought of as a different
    standard of validity from replication because it
    forgoes independent data collection and uses the
    methods and data collected by the original
    investigator.
    Implementing Reproducible Research by Stodden, Leisch, Peng


  16. Reproducible research
    A bit more practical
    • The sharing of analytic data and computer codes
    used to map those data into computational results
    is central to any comprehensive definition of
    reproducibility.
    Implementing Reproducible Research by Stodden, Leisch, Peng


  17. Reproducible research
    Why is it important?
    • Except for the simplest of analyses, the computer
    code used to analyze a dataset is the only record
    that permits others to fully understand what a
    researcher has done.
    Implementing Reproducible Research by Stodden, Leisch, Peng


  18. Reproducible research
    Drawing the line


  19. Reproducible research
    Together
    • Reproducibility is the ability to take the code and
    data from a previous publication, rerun the code
    and get the same results. Replicability is the ability
    to rerun an experiment and get “consistent” results
    with the original study using new data. Results that
    are not reproducible are hard to verify and results
    that do not replicate in new studies are harder to
    trust.
    https://simplystatistics.org/2017/03/02/rr-glossy/
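
The reproducibility half of this definition can be checked mechanically: rerun the same code on the same data and compare the results. The following is a minimal base R sketch, assuming a toy analysis with simulated data; `run_analysis` and the seed value are hypothetical.

```r
# Hypothetical analysis: estimate a mean and its bootstrap standard error
# from simulated data. Fixing the random seed makes every rerun identical.
run_analysis <- function(seed) {
    set.seed(seed)
    measurements <- rnorm(100, mean = 5, sd = 2)
    boot_means <- replicate(200, mean(sample(measurements, replace = TRUE)))
    c(estimate = mean(measurements), boot_se = sd(boot_means))
}

first_run <- run_analysis(seed = 20170624)
second_run <- run_analysis(seed = 20170624)

# Same code, same data, same seed: the results match exactly
print(identical(first_run, second_run))
```

Recording the output of `sessionInfo()` alongside the results also documents the R and package versions that produced them.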


  20. Reproducible research
    Visually
    Patil, Peng & Leek, bioRxiv, 2016
    http://biorxiv.org/content/early/2016/07/29/066803


  21. http://science.sciencemag.org/content/354/6317/1240


  22. http://rpubs.com/lcollado/4080


  23. Reproducible research
    Reproducible documents
    • Have you ever had your code in one file, your
    description of the results in another file?
    • Ever made copy-paste mistakes?
    • What if you were asked to change some models
    or revise the document?
    • Was it easy to maintain?


  24. Reproducible research
    Reproducible documents
    • What would be a reproducible document for you?


  25. Reproducible research
    Reproducible documents in R
    • R Markdown is the easiest way to write them
    • It's based on Markdown: simple, human-readable
    syntax
    • You maintain a single file! It has the
      • code,
      • figures,
      • description of results.
    • It then creates a file in the format you want to
    share with others.
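
To show what such a single file can look like, here is a small base R sketch that writes a minimal R Markdown document to a temporary directory. The file name, title and contents are hypothetical; `cars` is a data set that ships with R.

```r
# Build a minimal R Markdown document: YAML header, prose with inline
# R code, and one code chunk that draws a figure. Contents are hypothetical.
fence <- strrep("`", 3)  # chunk delimiter, built here to keep this example readable
rmd_lines <- c(
    "---",
    "title: \"Example report\"",
    "output: html_document",
    "---",
    "",
    "The mean stopping distance in the `cars` data set is",
    "`r round(mean(cars$dist), 1)` feet.",
    "",
    paste0(fence, "{r scatterplot, echo = TRUE}"),
    "plot(dist ~ speed, data = cars)",
    fence
)

# Code, figure and description of results all live in this one file
rmd_file <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_file)
print(rmd_file)
```

With the rmarkdown package and pandoc installed, `rmarkdown::render(rmd_file)` would turn this file into an HTML report in which the number and the plot are recomputed from the data on every render.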


  26. Reproducible research
    R Markdown
    http://rmarkdown.rstudio.com/


  27. Reproducible research
    R Markdown
    http://rmarkdown.rstudio.com/


  28. https://github.com/leekgroup/polyester_code/blob/master/polyester_manuscript.Rmd
    Complex example


  29. http://htmlpreview.github.io/?https://github.com/alyssafrazee/polyester_code/blob/master/polyester_manuscript.html


  30. Reproducible research
    Reproducible research can still be wrong!
    • Unfortunately, the mere reproducibility of computational
    results is insufficient to address the replication crisis
    because even a reproducible analysis can suffer from
    many problems—confounding from omitted variables,
    poor study design, missing data—that threaten the validity
    and useful interpretation of the results. Although
    improving the reproducibility of research may increase the
    rate at which flawed analyses are uncovered […] it does
    not change the fact that problematic research is
    conducted in the first place.
    Leek and Peng, PNAS, 2015


  31. https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970


  32. Bioinformatics
    Dictionary definition
    https://www.merriam-webster.com/dictionary/bioinformatics


  33. Bioinformatics
    http://hyperphysics.phy-astr.gsu.edu/hbase/Organic/dogma.html


  34. Bioinformatics
    http://www.batcallid.com/canBCIDfeatures.html


  35. Bioinformatics
    https://www.evogeneao.com/learn/tree-of-life

  36. Bioinformatics
    [Figure: PDB sequences over time; x-axis: Year (1970 to 2010), y-axis: Yearly (0 to 9000)]
    http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

  37. Bioinformatics
    [Figure: PDB sequences over time; x-axis: Year (1970 to 2010), y-axis: Total (0 to 1e+05)]
    http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100


  38. Bioinformatics
    https://www.ncbi.nlm.nih.gov/genbank/statistics/


  39. Bioinformatics
    Luscombe et al., Methods Inf Med., 2001


  40. Bioinformatics
    http://hyperphysics.phy-astr.gsu.edu/hbase/Organic/dogma.html


  41. Genome
    Transcripts


  42. Bioinformatics
    http://www.sequence-alignment.com/
    https://commons.wikimedia.org/wiki/File:Sequence_alignment_dendrotoxins.jpg


  43. Bioinformatics
    https://blast.ncbi.nlm.nih.gov/Blast.cgi


  44. Bioinformatics
    Wilks et al., bioRxiv, 2017


  45. Bioinformatics


  46. Bioinformatics
    Wilks et al., bioRxiv, 2017


  47. Bioinformatics


  48. Bioinformatics
    Wilks et al., bioRxiv, 2017


  49. Bioinformatics
    Aims
    • Organize data, allow new entries, data curation
    • Develop tools and resources that aid in the analysis
    of data
    • Use these tools to analyze the data and interpret
    the results in a biologically meaningful manner
    Luscombe et al., Methods Inf Med., 2001


  50. Bioinformatics
    Welch et al., PLoS Comp Bio, 2014
    Recommended curriculum


  51. http://www.nature.com/news/don-t-let-useful-data-go-to-waste-1.21555


  52. AUCAGUCGAUCACCGAU
    transcription
    RNA
    translation
    protein
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    DNA M M M
    slide adapted from Alyssa Frazee
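
As a rough sketch of the transcription step in this diagram, the base R code below rewrites the DNA string from the slide in the RNA alphabet and splits it into codons. It is a simplification under stated assumptions: the DNA is treated as the coding strand, RNA processing such as splicing is ignored, and the helper names `transcribe` and `codons` are hypothetical.

```r
# DNA sequence copied from the slide, treated here as the coding strand
dna <- "ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT"

# Transcription (simplified): same sequence with uracil (U) replacing thymine (T)
transcribe <- function(dna) chartr("T", "U", toupper(dna))

# Split an RNA string into consecutive three-letter codons,
# dropping any incomplete codon at the end
codons <- function(rna) {
    n_codons <- nchar(rna) %/% 3
    starts <- seq(1, by = 3, length.out = n_codons)
    substring(rna, starts, starts + 2)
}

rna <- transcribe(dna)
print(rna)
print(head(codons(rna)))
```

Translating the codons into amino acids would additionally need a codon table, which is left out to keep the sketch short.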


  53. AUCAGUCGAUCACCGAU
    transcription
    RNA
    translation
    protein
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    DNA M M M
    Bisulfite
    RNA
    ChIP
    Genome
    slide adapted from Alyssa Frazee


  54. AUCAGUCGAUCACCGAU
    transcription
    RNA
    translation
    protein
    ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    DNA M M M
    RNA
    slide adapted from Alyssa Frazee


  55. Genome
    Transcripts
    Reads


  56. @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:[email protected]@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @[email protected]/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>[email protected]
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
    +
    HIGHIHFHEGE4111:.;[email protected][email protected]?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    [email protected]=42:[email protected]>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>[email protected]@[email protected]@DFCCAA8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    [email protected];[email protected]@8>5554,/':[email protected]@[email protected]:[email protected]?=GG=;3 gb
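
The reads above are in FASTQ format: each record spans four lines (an identifier starting with `@`, the sequence, a `+` separator, and one quality character per base). A minimal base R sketch of reading such records follows, assuming the common Phred+33 quality encoding; the two records and the helper names are hypothetical and are not taken from the slide.

```r
# Hypothetical two-record FASTQ file held in memory as a character vector
fastq_lines <- c(
    "@read1", "ACGTN", "+", "II5+#",
    "@read2", "GGCAT", "+", "IIIII"
)

# Every record spans four lines: identifier, sequence, separator, qualities
parse_fastq <- function(lines) {
    stopifnot(length(lines) %% 4 == 0)
    first <- seq(1, length(lines), by = 4)
    data.frame(
        id = sub("^@", "", lines[first]),
        sequence = lines[first + 1],
        quality = lines[first + 3],
        stringsAsFactors = FALSE
    )
}

# Assuming Phred+33 encoding: quality score = ASCII code of the character - 33
phred_scores <- function(quality) utf8ToInt(quality) - 33L

reads <- parse_fastq(fastq_lines)
print(reads)
print(phred_scores(reads$quality[1]))  # 40 40 20 10 2
```

For real data sets with millions of reads, dedicated Bioconductor packages are a better fit than this line-by-line approach.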


  57. GTEx TCGA
    slide adapted from Shannon Ellis


  58. SRA


  59. Slide adapted from Ben Langmead


  60. Genome


  61. http://rail.bio/
    Slide adapted from Ben Langmead


  62. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/


  63. Obstacle: our research moves (spot) markets
    Spike in market price due to preprocessing job flows
    slide adapted from Jeff Leek


  64. Obstacle: our research moves (spot) markets
    Weekday market volatility
    Weekend EC2 inactivity
    slide adapted from Jeff Leek


  65. https://jhubiostatistics.shinyapps.io/recount/


  66. [Figure: reads and coverage along a gene with exons 1 to 4 and junctions (jx) 1 to 6; Isoform 1, Isoform 2 and a potential isoform 3; Expressed region 1: potential exon 5]


  67. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/


  68. expression data for ~70,000 human samples
    samples: GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    expression estimates: gene, exon, junctions, ERs
    slide adapted from Shannon Ellis


  69. expression data for ~70,000 human samples
    samples: GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    expression estimates: gene, exon, junctions, ERs
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis


  70. expression data for ~70,000 human samples
    samples: GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    expression estimates: gene, exon, junctions, ERs
    sample phenotypes: ?
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis


  71. Even when information is provided, it’s not always clear…
    sra_meta$Sex
    Category   Frequency
    F          95
    female     2036
    Female     51
    M          77
    male       1240
    Male       141
    Total      3640
    Other values: “1 Male, 2 Female”, “2 Male, 1 Female”, “3
    Female”, “DK”, “male and female”, “Male
    (note: ….)”, “missing”, “mixed”, “mixture”,
    “N/A”, “Not available”, “not applicable”,
    “not collected”, “not determined”, “pooled
    male and female”, “U”, “unknown”,
    “Unknown”
    slide adapted from Shannon Ellis
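
To illustrate why these labels need cleaning before any analysis, here is a small base R sketch that collapses the six spellings from the slide into two categories and leaves ambiguous entries as missing. The counts are copied from the slide; the function `harmonize_sex` and its matching rules are illustrative assumptions, not the approach used for recount.

```r
# Reported labels and their frequencies, copied from the slide
sex_counts <- c(F = 95, female = 2036, Female = 51, M = 77, male = 1240, Male = 141)

# Map free-text labels to "female", "male" or NA (hypothetical rules)
harmonize_sex <- function(label) {
    cleaned <- tolower(trimws(label))
    ifelse(cleaned %in% c("f", "female"), "female",
        ifelse(cleaned %in% c("m", "male"), "male", NA_character_))
}

# Add up the frequencies within each harmonized category
totals <- tapply(sex_counts, harmonize_sex(names(sex_counts)), sum)
print(totals)  # female: 2182, male: 1458

# Ambiguous entries stay missing instead of being guessed
print(harmonize_sex(c("pooled male and female", "unknown", "N/A")))
```

Even after this cleanup, every sample with an ambiguous or absent label is still unusable, which is the gap that phenotype prediction aims to fill.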


  72. Goal: to accurately predict critical phenotype information for all samples in recount
    Data: gene, exon, exon-exon junction and expressed region RNA-Seq data
    • SRA (Sequence Read Archive), N=49,848
    • GTEx (Genotype Tissue Expression Project), N=9,662
    • TCGA (The Cancer Genome Atlas), N=11,284
    Workflow:
    • divide samples into a training set and a test set
    • build and optimize the phenotype predictor on the training set
    • test the accuracy of the predictor on the test set
    • predict phenotypes across SRA samples
    • predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis
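
The divide, build and test steps of this workflow can be sketched in base R as follows. This is a toy stand-in, not the predictor described on the slide: the data are simulated, a single hypothetical expressed region is used, and the "predictor" is just a threshold halfway between the two class means.

```r
set.seed(1)
n <- 400

# Simulated data: expression of one hypothetical region that is high in
# male samples and low in female samples
sex <- rep(c("female", "male"), each = n / 2)
expression <- rnorm(n, mean = ifelse(sex == "male", 10, 0), sd = 1)

# Divide samples: half for training, half held out for testing
train_idx <- sample(n, size = n / 2)
train <- data.frame(sex = sex[train_idx], expression = expression[train_idx])
test <- data.frame(sex = sex[-train_idx], expression = expression[-train_idx])

# Build the predictor on the training set only
class_means <- tapply(train$expression, train$sex, mean)
threshold <- mean(class_means)
predict_sex <- function(x) ifelse(x > threshold, "male", "female")

# Test accuracy on samples the predictor has never seen
accuracy <- mean(predict_sex(test$expression) == test$sex)
print(accuracy)
```

Keeping the test set out of the fitting step is what makes the reported accuracy an honest estimate of how the predictor would behave on new samples.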


  73. Sex prediction is accurate across data sets
    Number of Regions        20      20      20      20
    Number of Samples (N)    4,769   4,769   11,245  3,640
    Accuracy                 99.8%   99.6%   99.4%   88.5%
    slide adapted from Shannon Ellis


  74. Prediction is more accurate in healthy tissue
    Number of Regions        589     589     589     589     589
    Number of Samples (N)    4,769   4,769   613     6,579   8,951
    Accuracy                 97.3%   96.5%   91.0%   70.2%   50.6%
    slide adapted from Shannon Ellis


  75. expression data for ~70,000 human samples
    samples: GTEx (N=9,662), TCGA (N=11,284), SRA (N=49,848)
    expression estimates: gene, exon, junctions, ERs
    sample phenotypes:
    sex   tissue
    M     Blood
    F     Heart
    F     Liver
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis


  76. bioconductor.org/packages/derfinder
    bioconductor.org/packages/recount
    > biocLite("derfinder")
    > biocLite("recount")
    http://rail.bio
    $ ./install-rail-rna-V


  77. Collaborators
    The Leek Group
    Jeff Leek
    Shannon Ellis
    Hopkins
    Ben Langmead
    Chris Wilks
    Kai Kammers
    Kasper Hansen
    Margaret Taub
    OHSU
    Abhinav Nellore
    LIBD
    Andrew Jaffe
    Emily Burke
    Stephen Semick
    Carrie Wright
    Amanda Price
    Nina Rajpurohit
    Funding
    NIH R01 GM105705
    NIH 1R21MH109956
    CONACyT 351535
    AWS in Education
    Seven Bridges
    IDIES SciServer


  78. Bioinformatics
    Staying Current in Bioinformatics & Genomics
    • “focused on applied methodology and study design
    rather than any particular phenotype, model system,
    disease, or specific method”
    • “a software implementation that’s well documented,
    actively supported, and performs well in fair
    benchmarks”
    http://www.gettinggeneticsdone.com/


  79. Bioinformatics
    Staying Current in Bioinformatics & Genomics
    • Twitter
    • Blogs
    • Some websites
    • Pre-prints
    • Journal articles
    http://www.gettinggeneticsdone.com/


  80. Reproducible Research and
    Bioinformatics
    Leonardo Collado-Torres
    @fellgernon
    http://lcolladotor.github.io/
