Adventures in Computational Biology

Introductory research talk given in the Department of Biology at Johns Hopkins, to high schoolers and high school teachers participating in the Molecular Biology and Genomics Research program.

Alyssa Frazee

July 17, 2014

Transcript

  1. Adventures in
    Computational Biology
    Alyssa Frazee
    Johns Hopkins biostatistics
    using statistics, math, biology, and computer
    programming to untangle the mysteries of gene
    expression

  2. gene expression: definition
    transcription
    DNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    RNA AUCAGUCGAUCACCGAU
    translation
    protein

  3. gene expression: definition
    transcription
    DNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    RNA AUCAGUCGAUCACCGAU

  4. gene expression: definition
    transcription
    DNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    RNA AUCAGUCGAUCACCGAU
    we can measure a gene’s expression level as
    the amount of RNA present in the cell that
    was transcribed from the gene

  5. transcription can get complex: splicing
    DNA ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
    RNA AUCAGUCGAUCACCGAU
    AUCAGUCGAUC
    CGAUCACCGAU

  6. so what?
    differences in gene expression and splicing
    have been implicated in:
    cell differentiation
    (Trapnell et al 2010)

  7. so what?
    differences in gene expression and splicing
    have been implicated in:
    organism development
    (Graveley 2010)
    image: Christine Gerhart, bit.ly/16h6P0Y. license.

  8. so what?
    differences in gene expression and splicing
    have been implicated in:
    cancer
    (Govindan 2012)
    image: Wikimedia Commons, bit.ly/1cvKEc6

  9. measuring expression: RNA-seq data
    Genome
    Transcripts
    RNA-seq reads
    (50-100 bp long)
    great for whole
    genomes / populations!

  10. next gen sequencing is awesome
    [Figure: Illumina/Solexa reversible terminators. Incorporate all four nucleotides, each labelled with a different dye; wash, four-colour imaging; cleave dye and terminating groups, wash; repeat cycles. Metzker 2010]

  11. next gen sequencing is awesome
    [Figure: excerpt of Figure 2 in Metzker 2010 (four-colour and one-colour cyclic reversible termination, CRT); the caption is truncated on the slide]
    Metzker 2010

  12. @22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
    CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
    +
    GGFFGBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
    @22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
    GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
    +
    @=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
    @22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
    TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
    +
    DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
    @22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
    AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCaaaaaGaCnCgCC
    +
    HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
    @22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
    AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
    +
    B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
    @22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
    GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
    +
    IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
    @22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
    TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
    +
    HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;
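The reads on this slide are in FASTQ format: each record is four lines (an `@` header, the base sequence, a `+` separator, and one quality character per base). As a minimal sketch, assuming the common Phred+33 quality encoding (the slide does not state the encoding) and using a hypothetical helper name `parse_fastq`, records could be read like this:

```python
from typing import Iterator, List, Tuple

def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_name, sequence, quality_scores) for each 4-line FASTQ record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (s.strip() for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # Assumed Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores

# A shortened, made-up record in the same layout as the reads on the slide.
example = [
    "@read1",
    "ACGTN",
    "+",
    "II5#!",
]
records = list(parse_fastq(example))
name, seq, scores = records[0]
print(name, seq, scores)
```

Low scores (such as the run of `#` characters at the end of one read on the slide) mark bases the sequencer was unsure about.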

  13. © Allie Brosh, Hyperbole and a Half.
    http://bit.ly/1mImIJx
    (I try to avoid this)

  14. analyzing the data, step (1): alignment
    match reads back to their original genomic location
    genome
    slide credit: Jeff
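To make the idea of alignment concrete, here is a deliberately naive sketch that finds every exact occurrence of a read in a reference string. It is an illustration only: real aligners such as TopHat and Bowtie use indexed data structures and tolerate mismatches and splicing. The function name `naive_align` and the toy sequences are made up for this example.

```python
from typing import List

def naive_align(read: str, genome: str) -> List[int]:
    """Return 0-based start positions where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping hits
    return positions

genome = "ACTGACCTAGATCAGTCGATCGATCGTATACG"
reads = ["GATCG", "CCTAG", "TTTTT"]
alignments = {read: naive_align(read, genome) for read in reads}
print(alignments)
```

A read that matches in more than one place (like `GATCG` here) is ambiguous, which is one reason short reads are hard to place.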

  15. analyzing the data, step (2): assembly
    reconstruct transcripts from read alignments
    Genome
    Fragments
    Transcripts
    slide credit: Jeff

  16. analyzing the data, step (2): assembly
    ASSEMBLY IS A REALLY DIFFICULT PROBLEM
    Genome
    equally likely assemblies

  17. analyzing the data, step (2): assembly
    ASSEMBLY IS A REALLY DIFFICULT PROBLEM
    …but researchers are motivated to solve it
    because data-driven assembly allows for
    discovery
    Genome

  20. analyzing the data, step (3): estimate expression
    How many reads originated from each transcript?
    Genome
    Transcripts
    slide credit: Jeff
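A rough sketch of the counting step, under the simplifying assumption that each read has already been assigned to exactly one transcript. In practice a read can be compatible with several overlapping transcripts, which is why dedicated abundance-estimation tools exist. The transcript IDs are the ones visible in the read names on slide 12; the assignments themselves are invented.

```python
from collections import Counter

# Hypothetical read-to-transcript assignments (one transcript per read).
assigned = [
    "ENST00000440999", "ENST00000440999", "ENST00000464740",
    "ENST00000440999", "ENST00000464740",
]
read_counts = Counter(assigned)
for transcript, n in read_counts.most_common():
    print(transcript, n)
```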

  21. analyzing the data, step (4): differential
    expression testing
    Is the mean abundance for transcript X the same in population
    A and population B?
    transcripts
    flag as differentially expressed
    population A population B

  22. our proposed pipeline: Ballgown
    image: bit.ly/HBbLaO

  23. our proposed pipeline: Ballgown
    align with TopHat

  24. our proposed pipeline: Ballgown
    align with TopHat
    (which depends on Bowtie)

  25. our proposed pipeline: Ballgown
    align with TopHat
    assemble with Cufflinks

  26. our proposed pipeline: Ballgown
    align with TopHat
    assemble with Cufflinks
    analyze with
    Ballgown

  27. align, assemble, estimate abundances [use tool of your choice]
    Ballgown: organize output
    •  visualize assembly structure
    •  postprocess assembly if necessary
    •  test for differential expression

  28. visualize
    [Figure: gene XLOC_000454, sample 1; expression, by transcript; x-axis: genomic position]

  29. check
    [Figure: Assembled and Annotated Transcripts (annotated, assembled); x-axis: genomic position]

  30. fix?
    [Figure: Assembled and Annotated Transcripts (annotated, assembled); x-axis: genomic position]

  31. test for differential expression

  32. test for differential expression
    recall! we want to know if the mean abundance for transcript X
    is the same in population A and population B.
    transcripts
    flag as differentially expressed
    population A population B

  33. test for differential expression
    LINEAR REGRESSION
    IN FIFTEEN MINUTES!

  34. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]

  35. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]

  36. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]
    this seems like a good line!
    y = mx + b

  37. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]
    but wait! is this line better?

  38. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]
    or perhaps this one?

  39. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]
    what does “better” even mean?

  40. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]

  41. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]
    the y-axis distance from a fitted
    line to a data point is called a
    “residual”
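As a small illustration of the definition, the sketch below computes residuals for one candidate line. The five height/weight pairs and the line's slope and intercept are made-up values, not the data plotted on the slide.

```python
# Made-up (height in inches, weight in lbs) pairs for illustration.
heights = [60.0, 64.0, 68.0, 72.0, 76.0]
weights = [112.0, 120.0, 125.0, 131.0, 137.0]

def residuals(xs, ys, slope, intercept):
    """Observed y minus the y predicted by the line, for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Candidate line: weight = 20 + 1.5 * height
res = residuals(heights, weights, slope=1.5, intercept=20.0)
print(res)
```

A positive residual means the point sits above the line; a negative one means it sits below.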

  42. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]
    “best” line has minimum
    “sum of squared residuals”
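For a single predictor, the line that minimizes the sum of squared residuals has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. A minimal sketch on made-up data (the function names are illustrative):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sum_squared_residuals(xs, ys, slope, intercept):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up (height, weight) data, not the 500 adults on the slide.
heights = [60.0, 64.0, 68.0, 72.0, 76.0]
weights = [112.0, 120.0, 125.0, 131.0, 137.0]

slope, intercept = fit_line(heights, weights)
best = sum_squared_residuals(heights, weights, slope, intercept)
other = sum_squared_residuals(heights, weights, slope + 0.2, intercept)
print(slope, intercept, best, other)
```

Any other line, such as the one with a slightly steeper slope here, gives a larger sum of squared residuals on the same data.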

  43. [Scatter plot: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs)]
    The Best Line for this data

  44. [Scatter plot: Heights and Weights of 500 Adults, points labelled male / female. x-axis: Height (inches); y-axis: Weight (lbs)]

  45. [Scatter plot: Heights and Weights of 500 Adults, points labelled male / female. x-axis: Height (inches); y-axis: Weight (lbs)]
    regression can handle it!

  46. [Scatter plot: Heights and Weights of 500 Adults, points labelled male / female. x-axis: Height (inches); y-axis: Weight (lbs)]
    regression can handle it!

  47. what do we learn from regression?
    $y = a + b_1 x$
    weight = 20.9 + 1.53*height
    slope?
    intercept?
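Reading the fitted equation on this slide: the intercept (20.9) is the predicted weight at a height of zero, which lies far outside the data and so is not meaningful by itself, and the slope (1.53) is the predicted change in weight, in pounds, per additional inch of height. A short sketch using the slide's coefficients (the function name is illustrative):

```python
def predict_weight(height_in: float) -> float:
    """Predicted weight (lbs) from the slide's fitted line."""
    return 20.9 + 1.53 * height_in

w65 = predict_weight(65)
w66 = predict_weight(66)
print(round(w65, 2), round(w66 - w65, 2))
```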

  48. what do we learn from regression?
    $y = a + b_1 x_1 + b_2 x_2$
    weight = 55.2 + 0.935*height + 9.76(if male)
    slope(s)?
    intercept?
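In this model, sex enters as an indicator variable: x2 is 1 for males and 0 for females, so the two groups get parallel lines with the same slope and intercepts that differ by b2. A sketch with the slide's coefficients (the function and argument names are illustrative):

```python
def predict_weight(height_in: float, is_male: bool) -> float:
    """Predicted weight (lbs) from the slide's two-covariate model."""
    male = 1 if is_male else 0  # indicator (dummy) variable
    return 55.2 + 0.935 * height_in + 9.76 * male

female_70 = predict_weight(70, is_male=False)
male_70 = predict_weight(70, is_male=True)
print(round(female_70, 2), round(male_70, 2), round(male_70 - female_70, 2))
```

At any given height, the predicted difference between a male and a female is the coefficient on the indicator, 9.76 lbs.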

  49. nice statistical tests exist to help us
    decide which model is better
    [Two scatter plots: Heights and Weights of 500 Adults. x-axis: Height (inches); y-axis: Weight (lbs); one plot has points labelled male / female]

  50. back to molecular biology!
    for each transcript, fit 2 linear regression models:
    model A: includes “disease” as covariate
    model B: does not include disease
    Y is expression (# of RNA-seq reads)
    X is “disease status”
    if model A fits better than model B, then disease status
    has something to do with expression of that transcript
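One common way to decide whether model A fits better than model B is an F-test for nested linear models, which compares their residual sums of squares. The slide does not say which test is used, so treat this as a sketch under that assumption. The expression values and group labels below are invented, and turning the F statistic into a p-value (using an F distribution with 1 and n - 2 degrees of freedom) is left out to keep the example dependency-free.

```python
def f_statistic(expression, disease):
    """F statistic comparing model B (intercept only) with
    model A (intercept + binary disease covariate)."""
    n = len(expression)
    grand_mean = sum(expression) / n
    # Model B predicts the grand mean for every sample.
    rss_b = sum((y - grand_mean) ** 2 for y in expression)
    # Model A with one binary covariate predicts each group's own mean.
    rss_a = 0.0
    for status in (0, 1):
        group = [y for y, d in zip(expression, disease) if d == status]
        group_mean = sum(group) / len(group)
        rss_a += sum((y - group_mean) ** 2 for y in group)
    # One extra parameter in model A; n - 2 residual degrees of freedom.
    return (rss_b - rss_a) / (rss_a / (n - 2))

# Invented expression values for one transcript: 4 controls, 4 cases.
expression = [10.0, 12.0, 11.0, 13.0, 20.0, 22.0, 21.0, 23.0]
disease = [0, 0, 0, 0, 1, 1, 1, 1]
f_stat = f_statistic(expression, disease)
print(f_stat)
```

A large F statistic means that adding disease status removes much more residual variation than chance alone would explain, so the transcript would be flagged as differentially expressed.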

  51. a real-world example
    [Figure, two panels: log2(transcript expression + 1) vs. RIN, points labelled by population (YRI, CEU, FIN, GBR, TSI). Panel titles: "transcript 2208, chr1: 17732159−17739760" and "transcript 295456, chr8: 52729948−52814118"]

  52. a real-world example
    Applicable to other data!
    I’ve done projects on:
    •  cancer
    •  psychiatric disease
    •  stem cells

  54. other “flavors” of linear models for
    gene expression:
    •  edgeR and DESeq (“generalized” linear models)
    •  limma (empirical Bayes approach)

  55. on our way

  56. thank you!
    contact (please feel free):
    email: [email protected]
    twitter: @acfrazee
    website: alyssafrazee.com
    My collaborators: Jeff Leek (advisor), Geo Pertea, Steven
    Salzberg, Ben Langmead, Andrew Jaffe, and several others in
    the Center for Computational Biology and biostatistics
    department

  57. references (by PubMed ID)
    •  Ballgown paper: http://biorxiv.org/content/early/2014/03/30/003665
    •  Cufflinks: 20436464 (Trapnell et al 2010)
    •  edgeR: 19910308 (Robinson et al 2010)
    •  DESeq: 20979621 (Anders and Huber 2010)
    •  limma: “Linear Models for Microarray Data” by Gordon K Smyth, in
    Bioinformatics and Computational Biology Solutions using R and
    Bioconductor, Springer 2005; 24485249 (Law et al 2014)
    •  Drosophila life cycle: 21179090 (Graveley et al 2011)
    •  Isoforms & cancer: 22980976 (Govindan et al 2012)
    •  cell differentiation: 20436464 (Cufflinks; Trapnell et al 2010)
    •  Next generation sequencing paper/figures: 19997069 (Metzker 2010)
    •  image sources: http://bit.ly/16h6P0Y, http://bit.ly/1cvKEc6,
    http://bit.ly/19TaSH9, http://bit.ly/12pNREw, http://bit.ly/HBbLaO
