BioPerl_Rules_-_Drools.pdf

John SJ Anderson
June 03, 2013

Transcript

  1. John Anderson
     • Once upon a time, was a biologist
     • Impressive Credential 2
     • Internet Tough Guy (@genehack)
     • Well Known Blowhard
     • http://github.com/genehack
     Tuesday, June 11, 13
  2. Jay Hannah
     • Dropout: B.S. Mechanical Engineering, ISU 1994
     • Dropout: B.S. Psychology, ISU 1995
     • Dropout: B.A. Philosophy, ISU 1995
     • Part-time, stipend-funded hobbyist 2006-2010
     • University of Nebraska, Omaha
     • University of Nebraska Medical Center
     • Dropout: B.S. Bioinformatics, UNO 2010
     • http://www.bioperl.org/wiki/Jay_Hannah
     • Self-taught DB / web developer since 1995.
     • http://github.com/jhannah
  3. BioPerl!
     use feature 'say';
     use Bio::Seq;
     my $seq = Bio::Seq->new(
         -seq => 'GTGCATCTGACTCCTG',
         -id  => 'JAY1',
     );
     say $seq->seq; # GTGCATCTGACTCCTG
  5. BioPerl!
     • Boring classroom example
     • Nobody actually does this
     • You need way more metadata
     • 1999 wants its dashed params back
  6. The Human Genome
     • 3 billion DNA base pairs (A, C, G, or T)
     • Fully extended, the DNA from a single cell would have a total length of almost 6 feet.
     • All the DNA in your cells could reach the moon ...6000 times!
     • 24 distinct chromosomes
     • Estimated 20,000-25,000 genes
     http://en.wikipedia.org/wiki/Human_genome
     http://www.rothamsted.ac.uk/notebook/courses/guide/dnast.htm
  7. >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
     ATGACTTTTGATTATGACTTGTTTGTAATTGGTGCTGGTTCTGGTGGTTTGGCTGCTTCTAAACGAGCTGC
     TAGCTATGGCGCAAAAGTAGCGATCGCCGAAAATGATTTAGTGGGTGGAACCTGTGTCATTCGGGGTTGTG
     TACCCAAAAAACTCATGGTTTATGGTTCTCACTTTCCCGCTTTATTCGAGGATGCAGCAGGCTATGGTTGG
     CAAGTCGGTAAGGCAGAATTAAATTGGGAACATTTCATTACATCTATAGATAAGGAAGTCCGGCGACTATC
     CCAACTGCACATCAGCTTTCTAGAAAAAGCCGGGGTAGAACTGATCTCTGGTCGTGCTACTTTGGTAGATA
     ATCACACAGTAGAAGTAGGCGAGCGTAAATTTACCGCCGATAAAATTTTAATTGCCGTTGGTGGTCGTCCC
     ATCAAACCAGAGTTGCCAGGGATGGAATATGGCATCACCTCCAACGAAATTTTTCACCTAAAAACCCAACC
     AAAACACATCGCTATCATTGGTTCTGGTTACATCGGTACAGAATTTGCCGGAATCATGCGTGGTTTGGGTT
     CACAAGTCACCCAAATTACCAGAGGTGACAAAATTCTCAAAGGTTTTGATGAAGACATCCGCACCGAAATT
     CAAGAAGGGATGACAAATCACGGTATTCGGATTATTCCTAAAAACGTAGTTACAGCTATTCAACAAGTACC
     AGAAGGTTTGAAAATAAGTTTATCTGGTGAAGACCAAGAACCAATCATTGCCGATGTATTTTTAGTAGCTA
     CAGGACGGGTTCCCAACGTAGATGGTTTAGGTCTGGAAAATGCTGGTGTTGATGTTGTTGACAGTTCTATA
     GAGGGGCCAGGATACAGCACCATGAATGCCATTGCAGTGAACGAATACAGCCAAACCAGCCAACCCAATAT
     ATGACTTTTGATTATGACTTGTTTGTAATTGGTGCTGGTTCTGGTGGTTTGGCTGCTTCTAAACGAGCTGC
     TAGCTATGGCGCAAAAGTAGCGATCGCCGAAAATGATTTAGTGGGTGGAACCTGTGTCATTCGGGGTTGTG
     TACCCAAAAAACTCATGGTTTATGGTTCTCACTTTCCCGCTTTATTCGAGGATGCAGCAGGCTATGGTTGG
     CAAGTCGGTAAGGCAGAATTAAATTGGGAACATTTCATTACATCTATAGATAAGGAAGTCCGGCGACTATC
     CCAACTGCACATCAGCTTTCTAGAAAAAGCCGGGGTAGAACTGATCTCTGGTCGTGCTACTTTGGTAGATA
     ATCACACAGTAGAAGTAGGCGAGCGTAAATTTACCGCCGATAAAATTTTAATTGCCGTTGGTGGTCGTCCC
     ATCAAACCAGAGTTGCCAGGGATGGAATATGGCATCACCTCCAACGAAATTTTTCACCTAAAAACCCAACC
     AAAACACATCGCTATCATTGGTTCTGGTTACATCGGTACAGAATTTGCCGGAATCATGCGTGGTTTGGGTT
     CACAAGTCACCCAAATTACCAGAGGTGACAAAATTCTCAAAGGTTTTGATGAAGACATCCGCACCGAAATT
     CAAGAAGGGATGACAAATCACGGTATTCGGATTATTCCTAAAAACGTAGTTACAGCTATTCAACAAGTACC
     AGAAGGTTTGAAAATAAGTTTATCTGGTGAAGACCAAGAACCAATCATTGCCGATGTATTTTTAGTAGCTA
     CAGGACGGGTTCCCAACGTAGATGGTTTAGGTCTGGAAAATGCTGGTGTTGATGTTGTTGACAGTTCTATA
  8. The Human Genome
     • Only 2.5% of DNA is different between humans and mice. Only 1% is different from chimpanzees. [1]
     • “We share half our genes [DNA] with the banana.” [2]
     • DNA is the blueprint of ALL life. You grew from a single cell to an adult human. What made you you? Why aren’t you me? Or a chimp? Or a banana tree, a whale shark, plankton, a clover or a giant redwood?
     • Answer: proteins.
     1. Mural, R.J., et al., Science, v. 296, May 31, 2002, p. 1661.
     2. May, R., Quoted in Coglan & Boyce, New Scientist 167 (July
  9. DNA -> ASCII
     DNA consists of four nucleotide bases:
     (A)denine (C)ytosine (G)uanine (T)hymine
     C <=> G
     A <=> T
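The pairing rule on this slide (C with G, A with T) is enough to derive the opposite strand of any DNA string. A minimal pure-Perl sketch, assuming plain A/C/G/T input; the `reverse_complement` name is made up for illustration and is not part of BioPerl:

```perl
use strict;
use warnings;
use feature 'say';

# Return the reverse complement of a DNA string (A<->T, C<->G).
# Hypothetical helper; only handles the four plain bases in either case.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its partner
    return $rc;
}

say reverse_complement('GTGCATCTGACTCCTG');
```

Real sequence data also carries ambiguity codes such as N, which this sketch leaves untouched.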
  11. Transcription
      DNA => RNA
      A => U
      T => A
      G => C
      C => G
      In RNA, (T)hymine is replaced by (U)racil.
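The four-line mapping on this slide is a character-for-character substitution, so it can be written as a single `tr///`. A minimal sketch, assuming the input is the template strand in uppercase A/C/G/T; `transcribe_template` is a hypothetical name, not a BioPerl method:

```perl
use strict;
use warnings;
use feature 'say';

# Apply the slide's mapping (A=>U, T=>A, G=>C, C=>G) to a DNA template strand.
# Hypothetical helper for illustration only.
sub transcribe_template {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/ATGC/UACG/;   # copy, then map each base
    return $rna;
}

say transcribe_template('GTGCATCTGACTCCTG');
```

This reproduces only the letter mapping; it says nothing about strand direction or where transcription starts and stops.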
  12. Amino acids
      Ala  A  Alanine
      Arg  R  Arginine
      Asn  N  Asparagine
      Asp  D  Aspartic acid
      Cys  C  Cysteine
      Gln  Q  Glutamine
      Glu  E  Glutamic acid
      Gly  G  Glycine
      His  H  Histidine
      Ile  I  Isoleucine
      Leu  L  Leucine
      Lys  K  Lysine
      Met  M  Methionine
      Phe  F  Phenylalanine
      Pro  P  Proline
      Ser  S  Serine
      Thr  T  Threonine
      Trp  W  Tryptophan
      Tyr  Y  Tyrosine
      Val  V  Valine
  13. “Central Dogma”
      use feature 'say';
      use Bio::Seq; # BioPerl!
      my $seq = Bio::Seq->new(
          -seq => 'GTGCATCTGACTCCTGAGGAGAAG',
          -id  => 'JAY1',
      );
      say $seq->translate->seq; # VHLTPEEK
  16. “Central Dogma”
      • Biology / chemistry are complicated.
      • But these text manipulation examples are just simple map-based transforms
      • And the example isn't even terribly relevant: is this even a coding sequence?
      • LOL dashed params
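To make the "simple map-based transforms" point concrete, here is a rough pure-Perl sketch of translation as a hash lookup. It assumes the standard genetic code, reads the DNA directly as the coding strand in frame 1, and ignores any trailing partial codon; the `translate` sub is a stand-in for illustration, not BioPerl's implementation:

```perl
use strict;
use warnings;
use feature 'say';

# Build the standard codon table: codons enumerated with bases in T, C, A, G
# order, paired with the one-letter amino acid codes ('*' marks a stop).
my @bases = qw(T C A G);
my @amino = split //,
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_to_aa;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_to_aa{"$first$second$third"} = shift @amino;
        }
    }
}

# Translate a DNA string codon by codon; unknown codons become 'X'.
sub translate {
    my ($dna) = @_;
    my @codons = uc($dna) =~ /(...)/g;   # complete codons only
    return join '', map { $codon_to_aa{$_} // 'X' } @codons;
}

say translate('GTGCATCTGACTCCTGAGGAGAAG');   # VHLTPEEK, matching slide 13
```

The lookup is the easy part; deciding which strand, frame, and genetic code apply to a given sequence is where the biology comes in.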
  17. A GenBank file
      LOCUS       AE015929  2499279 bp  DNA  circular  BCT  04-JAN-2006
      DEFINITION  Staphylococcus epidermidis ATCC 12228, complete genome.
      ACCESSION   AE015929 AE016744 AE016745 AE016746 AE016747 AE016748
      VERSION     AE015929.1  GI:27316888
        AUTHORS   Zhang,Y.Q., Ren,S.X., Li,H.L., Wang,Y.X., Fu,G.
      FEATURES             Location/Qualifiers
           CDS             15518..15847
                           /product="conserved hypothetical protein"
                           /protein_id="AAO03607.1"
                           /db_xref="GI:27314470"
                           /translation="MTTDLHTLVLIILCGVVTLLIRVIPFVMISRVNLPAIVIKWLSF
                           IPITLFTALIIDGVIQQHDHAFGYTLNLPYIIAIVPTVMLAIFTRSLTVTILGGIFVI
                           ACLRLIF"
      ORIGIN
              1 aagaaattgt gacgcttatt tgaagttatc cacttataca cataatttct cgcaaaaatt
             61 gtggataaca catgcgctat acacacagtt attcaaaatt taacaacata ttcacagcca
            121 tttgacatca cttggagtta aaaagtataa ttatgtggat aagtcgttca aattatgatt
            181 ttacaaggat ttatttatta aatttatata cataaatggt gtgcataaat catagttatg
            241 tttaagttat ccactgattg tgattaactt gtggataatt attaacatgc tgtgattatt
      ...
  24. Bio::SeqIO:: It’s so easy! BAD
      $seqio_obj = Bio::SeqIO->new(
          -file   => '>sequence.fasta',
          -format => 'fasta'
      );
      $seqio_obj->write_seq($seq_obj);
      guess how this works
      BioPerl: No Nice Things For You!
  25. Bio::SeqIO:: It’s so easy! BAD
      All that code did is generate this:
      >#12345 example 1
      aaaatgggggggggggccccgtt
      Not exactly rocket science.
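For a sense of how little is involved in producing that output, here is a hedged pure-Perl sketch that builds the same kind of FASTA record by hand. The `format_fasta` name and the 60-column wrap width are assumptions for illustration; this is not how Bio::SeqIO is implemented:

```perl
use strict;
use warnings;

# Build a FASTA record as a string: a ">id description" header line, then the
# sequence wrapped at $width characters per line (60 is an arbitrary default).
sub format_fasta {
    my ($id, $description, $sequence, $width) = @_;
    $width //= 60;
    my $header = ">$id";
    $header .= " $description" if defined($description) && length($description);
    my @lines = $sequence =~ /(.{1,$width})/g;   # chunks of up to $width chars
    return join("\n", $header, @lines) . "\n";
}

print format_fasta('#12345', 'example 1', 'aaaatgggggggggggccccgtt');
```

Writing is the simple direction; the case for a shared library is stronger on the parsing side, where dozens of formats each have their own quirks.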
  27. Bio::SeqIO:: Look at all these formats!!
      FTHelper embl interpro raw MultiFile embldriver kegg scf abi entrezgene
      largefasta seqxml ace excel lasergene strider agave exp locuslink swiss
      alf fasta mbsout swissdriver asciitree fastq metafasta tab bsml
      flybase_chadoxml table bsml_sax game nexml tigr chadoxml gbdriver phd
      tigrxml chaos gbxml pir tinyseq chaosxml gcg pln ztr ctf genbank qual
      Guess what? All the code for dealing with all these formats is completely non-standardized and most of it was written by a graduate student who has fallen off the face of the planet.
  28. Let's look at some code...
      So, let's look at that write_seq() method...
      https://metacpan.org/source/CJFIELDS/BioPerl-1.6.901/Bio/SeqIO.pm#L519
      Well... let's look at the constructor...
      https://metacpan.org/source/CJFIELDS/BioPerl-1.6.901/Bio/SeqIO.pm#L350
  29. Let's look at some code...
      So, I guess maybe we need to look at
      https://metacpan.org/release/BioPerl
  31. Remember GenBank? “Features”
      LOCUS       AE015929  2499279 bp  DNA  circular  BCT  04-JAN-2006
      DEFINITION  Staphylococcus epidermidis ATCC 12228, complete genome.
      ACCESSION   AE015929 AE016744 AE016745 AE016746 AE016747 AE016748
      VERSION     AE015929.1  GI:27316888
        AUTHORS   Zhang,Y.Q., Ren,S.X., Li,H.L., Wang,Y.X., Fu,G.
      FEATURES             Location/Qualifiers
           CDS             15518..15847
                           /product="conserved hypothetical protein"
                           /protein_id="AAO03607.1"
                           /db_xref="GI:27314470"
                           /translation="MTTDLHTLVLIILCGVVTLLIRVIPFVMISRVNLPAIVIKWLSF
                           IPITLFTALIIDGVIQQHDHAFGYTLNLPYIIAIVPTVMLAIFTRSLTVTILGGIFVI
                           ACLRLIF"
      ORIGIN
              1 aagaaattgt gacgcttatt tgaagttatc cacttataca cataatttct cgcaaaaatt
             61 gtggataaca catgcgctat acacacagtt attcaaaatt taacaacata ttcacagcca
            121 tttgacatca cttggagtta aaaagtataa ttatgtggat aagtcgttca aattatgatt
            181 ttacaaggat ttatttatta aatttatata cataaatggt gtgcataaat catagttatg
            241 tttaagttat ccactgattg tgattaactt gtggataatt attaacatgc tgtgattatt
      ...
  33. Look at all this awesome!!
      my $seqio_object = Bio::SeqIO->new(-file => $gb_file);
      my $seq_object = $seqio_object->next_seq;
      for my $feat_object ($seq_object->get_SeqFeatures) {
          say "primary tag: ", $feat_object->primary_tag;
          for my $tag ($feat_object->get_all_tags) {
              say " tag: $tag";
              for my $value ($feat_object->get_tag_values($tag)) {
                  say " value: $value";
              }
          }
      }
  34. Across all those formats!
      FTHelper embl interpro raw MultiFile embldriver kegg scf abi entrezgene
      largefasta seqxml ace excel lasergene strider agave exp locuslink swiss
      alf fasta mbsout swissdriver asciitree fastq metafasta tab bsml
      flybase_chadoxml table bsml_sax game nexml tigr chadoxml gbdriver phd
      tigrxml chaos gbxml pir tinyseq chaosxml gcg pln ztr ctf genbank qual
  38. Biology is messy!
      FEATURES             Location/Qualifiers
           source          1..177
                           /organism="Mus musculus"
                           /mol_type="genomic DNA"
                           /db_xref="taxon:10090"
           tRNA            join(103..111,121..157)
                           /gene="Phe-tRNA"
      Worse yet!!
      EXACT! (5..100)
      BEFORE! (<5..100)
      AFTER! (>5..100)
      WITHIN! ((5.10)..100)
      BETWEEN! (99^100)
      UNCERTAIN! (99.?100)
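To show why these location strings are awkward to handle by hand, here is a small pure-Perl sketch that copes with only the easy cases: a single exact range or a `join(...)` of exact ranges. The sub names are hypothetical, and the fuzzy forms listed on the slide are deliberately rejected rather than guessed at:

```perl
use strict;
use warnings;
use feature 'say';

# Parse a location made only of exact "start..end" ranges, optionally wrapped
# in join(...). Anything else (<, >, ^, ?, nested parentheses) is rejected.
sub parse_exact_location {
    my ($location) = @_;
    my $range = qr/\d+\.\.\d+/;
    die "unsupported location: $location\n"
        unless $location =~ /^(?:$range|join\($range(?:,$range)*\))$/;
    my @segments;
    while ($location =~ /(\d+)\.\.(\d+)/g) {
        push @segments, [ $1, $2 ];
    }
    return @segments;
}

# Total bases covered by the segments (coordinates are 1-based, inclusive).
sub spliced_length {
    my $total = 0;
    $total += $_->[1] - $_->[0] + 1 for @_;
    return $total;
}

my @segments = parse_exact_location('join(103..111,121..157)');
say join ', ', map { join '-', @$_ } @segments;
say spliced_length(@segments);
```

Every location type this sketch refuses is one more special case that a real parser has to get right, which is the slide's point.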
  39. BioPerl to the rescue!
      my $seqio_object = Bio::SeqIO->new(-file => $gb_file);
      my $seq_object = $seqio_object->next_seq;
      for my $feat_object ($seq_object->get_SeqFeatures) {
          if ($feat_object->primary_tag eq "CDS") {
              say $feat_object->spliced_seq->seq;
              # e.g. 'ATTATTTTCGCTCGCTTCTCGCGCTTTTGCGT...'
              if ($feat_object->has_tag('gene')) {
                  for my $val ($feat_object->get_tag_values('gene')) {
                      say "gene: $val";
                      # e.g. 'NDP', from a line like '/gene="NDP"'
                  }
              }
          }
      }
  41. NCBI BLAST
      Bio::SearchIO
      Score = 55.5 bits (132), Expect = 2e-04, Method: Compositional matrix adjust.
      Identities = 51/206 (24%), Positives = 99/206 (48%), Gaps = 15/206 (7%)
      Frame = +1
      Query 164 KQLAKDLRQWQTNVDVANDLALKLLRDYSADDTRKVHMITENINASWRSIHKRVSEREAA 223
      K+L + Q QT V N ++++ S D + ++N W+ + K++S+R+
      Sbjct 97 KELQDGIGQRQTVVRTLNATGEEIIQQSSKTDASILQEKLGSLNLRWQEVCKQLSDRKKR 276
      Query 224 LEETHRLLQQFPLDLEKFLAWLTEAETTANVLQDATRKERLLEDSKGVKELMKQWQDLQG 283
      LEE +L +F DL +F+ WL EA+ A++ + ++++L E + VK L+++ QG
      Sbjct 277 LEEQKNILSEFQRDLNEFVLWLEEADNIASIPLEPGKEQQLKEKLEQVKLLVEELPLRQG 456
      Query 284 EIEAHTDVYHNLDENSQKILRSLEGS-DDAVLLQRRLDNMNFKWSELRKKSLNIRSHLEA 342
      + L+E +L S S ++ L+ +L N +W ++ + + +EA
      Sbjct 457 -------ILKQLNETGGPVLVSAPISPEEQDKLENKLKQTNLQWIKVSRALPEKQGEIEA 615
      Query 343 SSDQWKRLHLSLQE-------LLVWL 361
      +L L++ LL+WL
      Sbjct 616 QIKDLGQLEKKLEDLEEQLNHLLLWL 693
  42. Look at all this awesome!!
      Bio::SearchIO
      Score = 55.5 bits (132), Expect = 2e-04, Method: Compositional matrix adjust.
      Identities = 51/206 (24%), Positives = 99/206 (48%), Gaps = 15/206 (7%)
      Frame = +1
      Query 164 KQLAKDLRQWQTNVDVANDLALKLLRDYSADDTRKVHMITENINASWRSIHKRVSEREAA 223
      K+L + Q QT V N ++++ S D + ++N W+ + K++S+R+
      Sbjct 97 KELQDGIGQRQTVVRTLNATGEEIIQQSSKTDASILQEKLGSLNLRWQEVCKQLSDRKKR 276
      Query 224 LEETHRLLQQFPLDLEKFLAWLTEAETTANVLQDATRKERLLEDSKGVKELMKQWQDLQG 283
      LEE +L +F DL +F+ WL EA+ A++ + ++++L E + VK L+++ QG
      Sbjct 277 LEEQKNILSEFQRDLNEFVLWLEEADNIASIPLEPGKEQQLKEKLEQVKLLVEELPLRQG 456
      Query 284 EIEAHTDVYHNLDENSQKILRSLEGS-DDAVLLQRRLDNMNFKWSELRKKSLNIRSHLEA 342
      + L+E +L S S ++ L+ +L N +W ++ + + +EA
      Sbjct 457 -------ILKQLNETGGPVLVSAPISPEEQDKLENKLKQTNLQWIKVSRALPEKQGEIEA 615
      Query 343 SSDQWKRLHLSLQE-------LLVWL 361
      +L L++ LL+WL
      Sbjct 616 QIKDLGQLEKKLEDLEEQLNHLLLWL 693
      Bio::Search::HSP::HSPI
      $hsp->algorithm;
      $hsp->pvalue();
      $hsp->evalue();
      $hsp->frac_identical( ['query'|'hit'|'total'] );
      $hsp->frac_conserved( ['query'|'hit'|'total'] );
      $hsp->gaps( ['query'|'hit'|'total'] );
      $hsp->query_string;
      $hsp->hit_string;
      $hsp->homology_string;
      $hsp->length( ['query'|'hit'|'total'] );
      $hsp->rank;
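As a rough idea of what numbers like "Identities" and "Gaps" summarize, the sketch below counts them directly from a pair of aligned strings, using the last Query/Sbjct block of the BLAST output above. The `alignment_stats` helper is an assumption for illustration and is not how the Bio::Search::HSP::HSPI methods are implemented:

```perl
use strict;
use warnings;
use feature 'say';

# Count identical columns and gap columns in a pairwise alignment given as two
# equal-length strings, with '-' marking a gap. Hypothetical helper.
sub alignment_stats {
    my ($query, $hit) = @_;
    die "aligned strings differ in length\n"
        unless length($query) == length($hit);
    my %stats = (length => length($query), identical => 0, gaps => 0);
    for my $i (0 .. length($query) - 1) {
        my $q = substr $query, $i, 1;
        my $h = substr $hit,   $i, 1;
        if    ($q eq '-' or $h eq '-') { $stats{gaps}++ }
        elsif ($q eq $h)               { $stats{identical}++ }
    }
    return \%stats;
}

# Last alignment block from the BLAST report on this slide.
my $stats = alignment_stats('SSDQWKRLHLSLQE-------LLVWL',
                            'QIKDLGQLEKKLEDLEEQLNHLLLWL');
printf "identities = %d/%d, gaps = %d/%d\n",
    @{$stats}{qw(identical length gaps length)};
```

The "Positives" figure additionally needs a substitution matrix to decide which mismatches count as conservative, so it is left out here.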
  45. Seriously.
      All snark aside, BioPerl is a needed piece of code. It just needs some love.
      Okay, a lot of love.
      If you have the skills to contribute, please join in and help with the cleanup.