DBG-Assemble A Genome

Assemble a genome

Buttonwood

August 28, 2013

Transcript

  1. Origin of de Bruijn graphs

    Graph theory: Hamiltonian path vs. Eulerian path. In 1946, the Dutch mathematician Nicolaas de Bruijn studied the ‘superstring problem’: find a shortest circular ‘superstring’ that contains all possible ‘substrings’ of length k (k-mers) over a given alphabet. For genome assembly the alphabet changes from {0,1} to {A,T,C,G}.
  2. Impact of changing k-mer size

    Big? Small? Avoid even k? Avoid even k, because with even k some k-mers are their own reverse complements.
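The point about even k can be checked by brute force. The sketch below is only an illustration (the helper names are made up here, not taken from the deck): it enumerates every k-mer over {A,C,G,T} and counts those equal to their own reverse complement.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_self_reverse_complements(k):
    """Count k-mers that are identical to their own reverse complement."""
    total = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            total += 1
    return total


for k in (3, 4, 5, 6):
    print(k, count_self_reverse_complements(k))
```

For odd k the middle base would have to be its own complement, which is impossible, so the count is zero. For even k the first half of the k-mer determines the second half, giving 4^(k/2) such k-mers.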
  3. Sequencing errors -> Tips

    [Figure: graph with nodes A, B, C, D]
    • Clip short tips with length < 2 k-mers.
    • Or clip the tip that has fewer reads passing through it.
  4. Sequencing errors -> Cross-links

    seq1: ATGGAAGTCGCG
      k-mers: ATGGAAG TGGAAGT GGAAGTC GAAGTCG AAGTCGC AGTCGCG
    seq2: GAGGAAGACCTT
      k-mers: GAGGAAG AGGAAGA GGAAGAC GAAGACC AAGACCT AGACCTT
    seq1-read1: ATGGAAGTCG
    seq2-read1: GAGGAAGTCC (its k-mer AGGAAGT forms a low-coverage connection to seq1)
    • Remove low-coverage nodes.
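One way to find the low-coverage nodes behind a cross-link is to count how often each k-mer occurs across the reads and flag the rare ones. The sketch below reuses the slide's sequences with k = 7; the coverage (three copies of each correct sequence, one erroneous read) and the `min_count` threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_coverage_kmers(reads, k, min_count):
    """Return the k-mers seen fewer than min_count times."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n < min_count}


# Hypothetical coverage: each correct sequence is seen three times,
# the read carrying the sequencing error only once.
reads = ["ATGGAAGTCGCG"] * 3 + ["GAGGAAGACCTT"] * 3 + ["GAGGAAGTCC"]

suspect = low_coverage_kmers(reads, k=7, min_count=2)
print(sorted(suspect))  # ['AGGAAGT', 'GAAGTCC']
```

Dropping the flagged k-mers removes the AGGAAGT node that connects the seq2 path to the seq1 path. A real assembler would pick the threshold from the observed k-mer coverage distribution rather than a fixed value.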
  5. Sequencing errors -> Bubbles

    seq1: ATGGAAGTCGCG...
      k-mers: ATGGAAG TGGAAGT GGAAGTC GAAGTCG AAGTCGC AGTCGCG
    Reads of seq1 (seq1-read1, seq1-read2): ATGGAAGTCG..., ATGGAAGACG...
      k-mers on the error path: TGGAAGA GGAAGAC GAAGACG AAGACGC AGACGCG
    • Remove low-coverage paths.
    • Same structure as SNP bubbles.
  6. Tiny or long repeats

    Sequence: ATTTAAATTAGCGATATTAGCATCTCTT
    k = 5: .... AATTA ATTAG TTAGC TAGCG AGCGA GCGAT CGATA GATAT ATATT TATTA TAGCA AGCAT GCATC CATCT ATCTC TCTCT ...
      (TTAGC is shared by both copies of the repeat; figure labels: a, b, c, d, e)
    k = 7: .... AATTAGC ATTAGCG TTAGCGA TAGCGAT AGCGATA GCGATAT CGATATT GATATTA ATATTAG TATTAGC ATTAGCA TTAGCAT TAGCATC ....
    What do you see? A bigger k-mer (longer overlap) spans the repeat.
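The effect of k on the repeat can be reproduced directly from the slide's sequence. The sketch below is a simplified model, not an assembler: nodes are k-mers, edges join k-mers that are consecutive in the sequence, and a node counts as branching when it has more than one distinct successor or predecessor. The function name is hypothetical.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return k-mers with more than one distinct successor or predecessor."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    # Link each k-mer to the one that follows it in the sequence.
    for left, right in zip(kmers, kmers[1:]):
        successors[left].add(right)
        predecessors[right].add(left)
    return {
        kmer
        for kmer in set(kmers)
        if len(successors.get(kmer, ())) > 1 or len(predecessors.get(kmer, ())) > 1
    }


SEQUENCE = "ATTTAAATTAGCGATATTAGCATCTCTT"

print(sorted(branching_nodes(SEQUENCE, 5)))  # ['ATTAG', 'TTAGC']
print(sorted(branching_nodes(SEQUENCE, 7)))  # []
```

With k = 5 the repeated stretch ATTAGC collapses into the shared nodes ATTAG and TTAGC, so the path forks. With k = 7 every k-mer is unique and the graph is a single unbranched path.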
  7. SNPs -> Bubbles

    hap1: ATGGAAGTCGCG...
      k-mers: ATGGAAG TGGAAGT GGAAGTC GAAGTCG AAGTCGC AGTCGCG
    hap2: ATGGAAGACGCG...
      k-mers: TGGAAGA GGAAGAC GAAGACG AAGACGC AGACGCG
    hap1-read1: ATGGAAGTCG...
    hap2-read1: ATGGAAGACG...
    • Equal-coverage paths form the bubble.
    What about adjacent SNPs?
      hap1: ATGGTAGTCGCG...
      hap2: ATGGAAGACGCG...
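The two branches of a SNP bubble are simply the k-mers that overlap the variant position in each haplotype. A minimal sketch with the slide's two haplotype fragments and k = 7 (the function name is illustrative):

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


hap1 = "ATGGAAGTCGCG"
hap2 = "ATGGAAGACGCG"
k = 7

shared = kmer_set(hap1, k) & kmer_set(hap2, k)
only_hap1 = kmer_set(hap1, k) - kmer_set(hap2, k)
only_hap2 = kmer_set(hap2, k) - kmer_set(hap1, k)

print(sorted(shared))     # ['ATGGAAG']
print(sorted(only_hap1))  # ['AAGTCGC', 'AGTCGCG', 'GAAGTCG', 'GGAAGTC', 'TGGAAGT']
print(sorted(only_hap2))  # ['AAGACGC', 'AGACGCG', 'GAAGACG', 'GGAAGAC', 'TGGAAGA']
```

Because the fragments on the slide are truncated, only the entry node ATGGAAG is shared here. In the full sequences the two branches would rejoin once the k-mers no longer cover the SNP, which closes the bubble.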
  8. Indels -> Bubbles

    hap1: ATGGAAGTCGCGTCGA...
      k-mers: ATGGAAG TGGAAGT ... CGCGTCG GCGTCGA
    hap2: ATGGAAG-----GCGTCGA...
      k-mers: TGGAAGG ... GGCCTCG
    hap1-read1: ATGGAAGTCG...
    hap2-read1: ATGGAAG---- GCGTC...
    • Equal-coverage paths form the bubble.
    • Long road map.
    What about adjacent SNPs with indels?
      hap1: ATGGTAGTCGCAAGCC... ...
      hap2: ATGGAAGACGC---GCG...
  9. Assignment 1

    Genome: ATTAGGATCATGATCCTCTGTGGATAAGATCTTTTTATTTAAAGATCTCTTTATTAGATCTCTT …
    Let k = 4, then k = 5, then k = 7 (for k = 4 the first k-mers are ATTA, TTAG, TAGG, AGGA, …).
    Build the DBG of the genome and the DBG of the reads (L = 15). Try it!
  10. Assignment 2

    Simulate Hap1 of about 1M and count the k-mer frequencies; then sequence it at 40X; then add repeats and an error rate; then add SNPs (two haplotypes).
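A possible starting point for the simulation is sketched below. Everything specific in it is an assumption made to keep the example small and fast: a 10 kb random genome instead of about 1M, 100-base reads, a 1% substitution error rate, k = 21, and a fixed random seed. Repeats and the second haplotype are left out.

```python
import random
from collections import Counter


def simulate_reads(genome, read_len, coverage, error_rate, rng):
    """Sample reads uniformly from the genome and add substitution errors."""
    n_reads = coverage * len(genome) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


def kmer_frequency_histogram(reads, k):
    """Map each k-mer multiplicity to the number of distinct k-mers having it."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(10_000))
reads = simulate_reads(genome, read_len=100, coverage=40, error_rate=0.01, rng=rng)
histogram = kmer_frequency_histogram(reads, k=21)

for multiplicity in sorted(histogram)[:5]:
    print(multiplicity, histogram[multiplicity])
```

In the histogram, k-mers created by sequencing errors tend to pile up at very low multiplicities, while genuine genomic k-mers cluster around a higher coverage peak.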