Slide 1

Slide 1 text

BioPerl Rules / Drools
YAPC::NA 2013
John Anderson, PhD
Jay Hannah, JAPH
Tuesday, June 11, 2013

Slide 2

Slide 2 text

John Anderson
• Once upon a time, was a biologist
• Impressive Credential 2
• Internet Tough Guy (@genehack)
• Well Known Blowhard
• http://github.com/genehack

Slide 3

Slide 3 text

Jay Hannah
• Dropout: B.S. Mechanical Engineering, ISU 1994
• Dropout: B.S. Psychology, ISU 1995
• Dropout: B.A. Philosophy, ISU 1995
• Part-time, stipend-funded hobbyist 2006-2010
• University of Nebraska, Omaha
• University of Nebraska Medical Center
• Dropout: B.S. Bioinformatics, UNO 2010
• http://www.bioperl.org/wiki/Jay_Hannah
• Self-taught DB / web developer since 1995.
• http://github.com/jhannah

Slide 4

Slide 4 text

Jay? Bio? What?

Slide 7

Slide 7 text

DNA: Hollywood!

Slide 8

Slide 8 text

DNA: Holy crap! Chemistry!

Slide 9

Slide 9 text

DNA: Woot! ASCII! GTGCATCTGACTCCTG

Slide 10

Slide 10 text

BioPerl!

    use feature 'say';
    use Bio::Seq;

    my $seq = Bio::Seq->new(
        -seq => 'GTGCATCTGACTCCTG',
        -id  => 'JAY1',
    );
    say $seq->seq;    # GTGCATCTGACTCCTG

Slide 15

Slide 15 text

BioPerl!
• Boring classroom example
• Nobody actually does this
• You need way more metadata
• 1999 wants its dashed params back

Slide 16

Slide 16 text

Genomes are big

Slide 17

Slide 17 text

The Human Genome
• 3 billion DNA base pairs (A, C, G, or T)
• Fully extended, the DNA from a single cell would have a total length of almost 6 feet.
• All the DNA in your cells could reach the moon ...6000 times!
• 24 distinct chromosomes
• Estimated 20,000-25,000 genes
http://en.wikipedia.org/wiki/Human_genome
http://www.rothamsted.ac.uk/notebook/courses/guide/dnast.htm

Slide 18

Slide 18 text

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
ATGACTTTTGATTATGACTTGTTTGTAATTGGTGCTGGTTCTGGTGGTTTGGCTGCTTCTAAACGAGCTGC
TAGCTATGGCGCAAAAGTAGCGATCGCCGAAAATGATTTAGTGGGTGGAACCTGTGTCATTCGGGGTTGTG
TACCCAAAAAACTCATGGTTTATGGTTCTCACTTTCCCGCTTTATTCGAGGATGCAGCAGGCTATGGTTGG
CAAGTCGGTAAGGCAGAATTAAATTGGGAACATTTCATTACATCTATAGATAAGGAAGTCCGGCGACTATC
CCAACTGCACATCAGCTTTCTAGAAAAAGCCGGGGTAGAACTGATCTCTGGTCGTGCTACTTTGGTAGATA
ATCACACAGTAGAAGTAGGCGAGCGTAAATTTACCGCCGATAAAATTTTAATTGCCGTTGGTGGTCGTCCC
ATCAAACCAGAGTTGCCAGGGATGGAATATGGCATCACCTCCAACGAAATTTTTCACCTAAAAACCCAACC
AAAACACATCGCTATCATTGGTTCTGGTTACATCGGTACAGAATTTGCCGGAATCATGCGTGGTTTGGGTT
CACAAGTCACCCAAATTACCAGAGGTGACAAAATTCTCAAAGGTTTTGATGAAGACATCCGCACCGAAATT
CAAGAAGGGATGACAAATCACGGTATTCGGATTATTCCTAAAAACGTAGTTACAGCTATTCAACAAGTACC
AGAAGGTTTGAAAATAAGTTTATCTGGTGAAGACCAAGAACCAATCATTGCCGATGTATTTTTAGTAGCTA
CAGGACGGGTTCCCAACGTAGATGGTTTAGGTCTGGAAAATGCTGGTGTTGATGTTGTTGACAGTTCTATA
GAGGGGCCAGGATACAGCACCATGAATGCCATTGCAGTGAACGAATACAGCCAAACCAGCCAACCCAATAT
ATGACTTTTGATTATGACTTGTTTGTAATTGGTGCTGGTTCTGGTGGTTTGGCTGCTTCTAAACGAGCTGC
TAGCTATGGCGCAAAAGTAGCGATCGCCGAAAATGATTTAGTGGGTGGAACCTGTGTCATTCGGGGTTGTG
TACCCAAAAAACTCATGGTTTATGGTTCTCACTTTCCCGCTTTATTCGAGGATGCAGCAGGCTATGGTTGG
CAAGTCGGTAAGGCAGAATTAAATTGGGAACATTTCATTACATCTATAGATAAGGAAGTCCGGCGACTATC
CCAACTGCACATCAGCTTTCTAGAAAAAGCCGGGGTAGAACTGATCTCTGGTCGTGCTACTTTGGTAGATA
ATCACACAGTAGAAGTAGGCGAGCGTAAATTTACCGCCGATAAAATTTTAATTGCCGTTGGTGGTCGTCCC
ATCAAACCAGAGTTGCCAGGGATGGAATATGGCATCACCTCCAACGAAATTTTTCACCTAAAAACCCAACC
AAAACACATCGCTATCATTGGTTCTGGTTACATCGGTACAGAATTTGCCGGAATCATGCGTGGTTTGGGTT
CACAAGTCACCCAAATTACCAGAGGTGACAAAATTCTCAAAGGTTTTGATGAAGACATCCGCACCGAAATT
CAAGAAGGGATGACAAATCACGGTATTCGGATTATTCCTAAAAACGTAGTTACAGCTATTCAACAAGTACC
AGAAGGTTTGAAAATAAGTTTATCTGGTGAAGACCAAGAACCAATCATTGCCGATGTATTTTTAGTAGCTA
CAGGACGGGTTCCCAACGTAGATGGTTTAGGTCTGGAAAATGCTGGTGTTGATGTTGTTGACAGTTCTATA

Slide 19

Slide 19 text

The Human Genome
• Only 2.5% of DNA is different between humans and mice. Only 1% is different from the chimpanzee. [1]
• “We share half our genes [DNA] with the banana.” [2]
• DNA is the blueprint of ALL life. You grew from a single cell to an adult human. What made you you? Why aren’t you me? Or a chimp? Or a banana tree, a whale shark, plankton, a clover or a giant redwood?
• Answer: proteins.
1. Mural, R.J., et al., Science, v. 296, May 31, 2002, p. 1661.
2. May, R., Quoted in Coglan & Boyce, New Scientist 167 (July

Slide 20

Slide 20 text

“Central Dogma of Molecular Biology”

Slide 21

Slide 21 text

nsf.gov | http://tmblr.co/Z4VJiwMC6ol8

Slide 27

Slide 27 text

http://www.youtube.com/user/DNALearningCenter

Slide 28

Slide 28 text

DNA -> ASCII
DNA is built from four nucleotide bases:
(A)denine
(C)ytosine
(G)uanine
(T)hymine
The bases pair up: C <=> G and A <=> T
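
Because the pairing rule is a fixed one-to-one mapping, the opposite strand of any DNA string can be computed with plain Perl. The sketch below is only an illustration: the sub name reverse_complement is made up for this example, and BioPerl users would more likely call the revcom method on a sequence object.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string (A<->T, C<->G).
# reverse_complement is a hypothetical helper name, not a BioPerl API.
sub reverse_complement {
    my ($dna) = @_;
    (my $complement = uc $dna) =~ tr/ACGT/TGCA/;
    return scalar reverse $complement;
}

print reverse_complement('GTGCATCTGACTCCTG'), "\n";    # CAGGAGTCAGATGCAC
```

The string is reversed as well as complemented because the two strands of the double helix run in opposite directions.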

Slide 30

Slide 30 text

Transcription
DNA => RNA
A => U
T => A
G => C
C => G
In RNA, (T)hymine is replaced by (U)racil.
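
The table above is again a character-for-character substitution, so a rough sketch needs nothing more than tr///. The sub name transcribe_template is an assumption for this example, and the sketch deliberately ignores strand direction: it maps each template base to its RNA partner in the order given.

```perl
use strict;
use warnings;

# Map each base of a DNA template strand to its RNA partner,
# following the slide's table: A=>U, T=>A, G=>C, C=>G.
# transcribe_template is a hypothetical helper name.
sub transcribe_template {
    my ($template) = @_;
    (my $rna = uc $template) =~ tr/ATGC/UACG/;
    return $rna;
}

print transcribe_template('TACGGT'), "\n";    # AUGCCA
```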

Slide 31

Slide 31 text

Amino acids

Ala  A  Alanine
Arg  R  Arginine
Asn  N  Asparagine
Asp  D  Aspartic acid
Cys  C  Cysteine
Gln  Q  Glutamine
Glu  E  Glutamic acid
Gly  G  Glycine
His  H  Histidine
Ile  I  Isoleucine
Leu  L  Leucine
Lys  K  Lysine
Met  M  Methionine
Phe  F  Phenylalanine
Pro  P  Proline
Ser  S  Serine
Thr  T  Threonine
Trp  W  Tryptophan
Tyr  Y  Tyrosine
Val  V  Valine

Slide 33

Slide 33 text

http://en.wikipedia.org/wiki/Aspartic_acid “D”

Slide 34

Slide 34 text

“Central Dogma”
http://en.wikipedia.org/wiki/Image:Genetic_code.svg

Slide 35

Slide 35 text

“Central Dogma”

    use feature 'say';
    use Bio::Seq;    # BioPerl!

    my $seq = Bio::Seq->new(
        -seq => 'GTGCATCTGACTCCTGAGGAGAAG',
        -id  => 'JAY1',
    );
    say $seq->translate->seq;    # VHLTPEEK

Slide 40

Slide 40 text

“Central Dogma”
• Biology / chemistry are complicated.
• But these text manipulation examples are just simple map-based transforms
• And the example isn't even terribly relevant: is this even a coding sequence?
• LOL dashed params
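
To make the "simple map-based transforms" point concrete, here is a minimal sketch of translation without BioPerl. It assumes the standard genetic code and reading frame 1 only; the names %codon_table and translate_dna are made up for this example and are not how Bio::Seq implements translate.

```perl
use strict;
use warnings;

# Build the standard genetic code as a codon => amino acid hash.
# Codons are enumerated in T, C, A, G order to match the letter string.
my @bases = qw(T C A G);
my @amino = split //,
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = shift @amino;
        }
    }
}

# Translate DNA in frame 1. Codons missing from the table become 'X';
# a trailing partial codon is ignored. translate_dna is a hypothetical name.
sub translate_dna {
    my ($dna) = @_;
    my $protein = '';
    for my $codon (uc($dna) =~ /(.{3})/g) {
        $protein .= $codon_table{$codon} // 'X';
    }
    return $protein;
}

print translate_dna('GTGCATCTGACTCCTGAGGAGAAG'), "\n";    # VHLTPEEK
```

The whole transform is one hash lookup per three letters, which is the slide's point: the biology is hard, the string handling is not.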

Slide 41

Slide 41 text

A GenBank file

LOCUS       AE015929             2499279 bp    DNA     circular BCT 04-JAN-2006
DEFINITION  Staphylococcus epidermidis ATCC 12228, complete genome.
ACCESSION   AE015929 AE016744 AE016745 AE016746 AE016747 AE016748
VERSION     AE015929.1  GI:27316888
  AUTHORS   Zhang,Y.Q., Ren,S.X., Li,H.L., Wang,Y.X., Fu,G.
FEATURES             Location/Qualifiers
     CDS             15518..15847
                     /product="conserved hypothetical protein"
                     /protein_id="AAO03607.1"
                     /db_xref="GI:27314470"
                     /translation="MTTDLHTLVLIILCGVVTLLIRVIPFVMISRVNLPAIVIKWLSF
                     IPITLFTALIIDGVIQQHDHAFGYTLNLPYIIAIVPTVMLAIFTRSLTVTILGGIFVI
                     ACLRLIF"
ORIGIN
        1 aagaaattgt gacgcttatt tgaagttatc cacttataca cataatttct cgcaaaaatt
       61 gtggataaca catgcgctat acacacagtt attcaaaatt taacaacata ttcacagcca
      121 tttgacatca cttggagtta aaaagtataa ttatgtggat aagtcgttca aattatgatt
      181 ttacaaggat ttatttatta aatttatata cataaatggt gtgcataaat catagttatg
      241 tttaagttat ccactgattg tgattaactt gtggataatt attaacatgc tgtgattatt
...

Slide 48

Slide 48 text

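Bio::SeqIO:: It’s so easy! BAD

    use Bio::SeqIO;

    # The leading '>' in -file opens the file for writing, as in a two-argument open().
    $seqio_obj = Bio::SeqIO->new(
        -file   => '>sequence.fasta',
        -format => 'fasta',
    );
    $seqio_obj->write_seq($seq_obj);    # $seq_obj is a Bio::Seq object built earlier

guess how this works
BioPerl: No Nice Things For You!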

Slide 49

Slide 49 text

Bio::SeqIO:: It’s so easy! BAD

All that code did is generate this:

    >#12345 example 1
    aaaatgggggggggggccccgtt

Not exactly rocket science.
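
For comparison, a hand-rolled FASTA writer is only a few lines. This is a sketch under stated assumptions, not a replacement for Bio::SeqIO: format_fasta and its arguments are hypothetical names, and the 60-column wrap is just a common convention.

```perl
use strict;
use warnings;

# Format one sequence as a FASTA record: a '>' header line followed by
# the sequence wrapped at a fixed width (default 60 columns).
# format_fasta is a hypothetical helper, not part of BioPerl.
sub format_fasta {
    my ($id, $description, $sequence, $width) = @_;
    $width ||= 60;
    my $header = ">$id";
    $header .= " $description" if defined $description && length $description;
    my @lines = $sequence =~ /(.{1,$width})/g;
    return join("\n", $header, @lines) . "\n";
}

print format_fasta('#12345', 'example 1', 'aaaatgggggggggggccccgtt');
```

Run as written, this prints the same two-line record shown on the slide. What the library adds is everything around that: reading many formats, carrying metadata, and streaming large files.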

Slide 51

Slide 51 text

Bio::SeqIO:: Look at all these formats!!

FTHelper     embl               interpro     raw
MultiFile    embldriver         kegg         scf
abi          entrezgene         largefasta   seqxml
ace          excel              lasergene    strider
agave        exp                locuslink    swiss
alf          fasta              mbsout       swissdriver
asciitree    fastq              metafasta    tab
bsml         flybase_chadoxml                table
bsml_sax     game               nexml        tigr
chadoxml     gbdriver           phd          tigrxml
chaos        gbxml              pir          tinyseq
chaosxml     gcg                pln          ztr
ctf          genbank            qual

Guess what? All the code for dealing with all these formats is completely non-standardized and most of it was written by a graduate student who has fallen off the face of the planet.

Slide 52

Slide 52 text

Let's look at some code...

So, let's look at that write_seq() method...
https://metacpan.org/source/CJFIELDS/BioPerl-1.6.901/Bio/SeqIO.pm#L519

Well... let's look at the constructor...
https://metacpan.org/source/CJFIELDS/BioPerl-1.6.901/Bio/SeqIO.pm#L350

Slide 53

Slide 53 text

Let's look at some code...

So, I guess maybe we need to look at https://metacpan.org/release/BioPerl

Slide 54

Slide 54 text

Genomes are complicated!

Slide 56

Slide 56 text

Remember GenBank? “Features”

FEATURES             Location/Qualifiers
     CDS             15518..15847
                     /product="conserved hypothetical protein"
                     /protein_id="AAO03607.1"
                     /db_xref="GI:27314470"
                     /translation="MTTDLHTLVLIILCGVVTLLIRVIPFVMISRVNLPAIVIKWLSF
                     IPITLFTALIIDGVIQQHDHAFGYTLNLPYIIAIVPTVMLAIFTRSLTVTILGGIFVI
                     ACLRLIF"

Slide 58

Slide 58 text

BioPerl understands Feature annotations!

Slide 59

Slide 59 text

Look at all this awesome!!

    use feature 'say';
    use Bio::SeqIO;

    my $seqio_object = Bio::SeqIO->new(-file => $gb_file);
    my $seq_object   = $seqio_object->next_seq;

    # Walk every feature, then every tag on it, then every value of that tag.
    for my $feat_object ($seq_object->get_SeqFeatures) {
        say "primary tag: ", $feat_object->primary_tag;
        for my $tag ($feat_object->get_all_tags) {
            say "  tag: $tag";
            for my $value ($feat_object->get_tag_values($tag)) {
                say "    value: $value";
            }
        }
    }

Slide 60

Slide 60 text

Across all those formats!

Slide 62

Slide 62 text

Complex / Fuzzy Locations!

Slide 67

Slide 67 text

Biology is messy!

FEATURES             Location/Qualifiers
     source          1..177
                     /organism="Mus musculus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10090"
     tRNA            join(103..111,121..157)
                     /gene="Phe-tRNA"

Worse yet!!

EXACT!      (5..100)
BEFORE!     (<5..100)
AFTER!      (>5..100)
WITHIN!     ((5.10)..100)
BETWEEN!    (99^100)
UNCERTAIN!  (99.?100)
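
To show why a join(...) location is more than a start and an end, here is a rough sketch that handles only exact ranges. The names parse_location and spliced_subseq are hypothetical, the input sequence is a toy string, and the fuzzy forms listed above are out of scope; BioPerl's location classes exist precisely because the real cases are harder than this.

```perl
use strict;
use warnings;

# Parse a simple location such as '103..111' or
# 'join(103..111,121..157)' into [start, end] pairs (1-based, inclusive).
# This sketch assumes exact coordinates; fuzzy locations are not handled.
sub parse_location {
    my ($location) = @_;
    my @ranges;
    while ($location =~ /(\d+)\.\.(\d+)/g) {
        push @ranges, [ $1, $2 ];
    }
    return @ranges;
}

# Concatenate the pieces of $sequence named by the location.
sub spliced_subseq {
    my ($sequence, $location) = @_;
    return join '',
        map { substr($sequence, $_->[0] - 1, $_->[1] - $_->[0] + 1) }
        parse_location($location);
}

my @ranges = parse_location('join(103..111,121..157)');
printf "%d..%d\n", @$_ for @ranges;    # prints 103..111 then 121..157
print spliced_subseq('AACCGGTTAACC', 'join(1..4,9..12)'), "\n";    # AACCAACC
```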

Slide 68

Slide 68 text

BioPerl to the rescue!

    my $seqio_object = Bio::SeqIO->new(-file => $gb_file);
    my $seq_object   = $seqio_object->next_seq;

    for my $feat_object ($seq_object->get_SeqFeatures) {
        if ($feat_object->primary_tag eq "CDS") {
            say $feat_object->spliced_seq->seq;
            # e.g. 'ATTATTTTCGCTCGCTTCTCGCGCTTTTGCGT...'
            if ($feat_object->has_tag('gene')) {
                for my $val ($feat_object->get_tag_values('gene')) {
                    say "gene: $val";
                    # e.g. 'NDP', from a line like '/gene="NDP"'
                }
            }
        }
    }

Slide 74

Slide 74 text

Rescue this:
"Interface modules" - https://metacpan.org/source/CJFIELDS/BioPerl-1.6.901/Bio/PrimarySeqI.pm
You got your Java in my Perl...

Slide 75

Slide 75 text

Search!

Slide 76

Slide 76 text

NCBI BLAST
Bio::SearchIO

 Score = 55.5 bits (132), Expect = 2e-04, Method: Compositional matrix adjust.
 Identities = 51/206 (24%), Positives = 99/206 (48%), Gaps = 15/206 (7%)
 Frame = +1

Query  164  KQLAKDLRQWQTNVDVANDLALKLLRDYSADDTRKVHMITENINASWRSIHKRVSEREAA  223
            K+L   + Q QT V   N    ++++  S  D   +     ++N  W+ + K++S+R+
Sbjct  97   KELQDGIGQRQTVVRTLNATGEEIIQQSSKTDASILQEKLGSLNLRWQEVCKQLSDRKKR  276

Query  224  LEETHRLLQQFPLDLEKFLAWLTEAETTANVLQDATRKERLLEDSKGVKELMKQWQDLQG  283
            LEE   +L +F  DL +F+ WL EA+  A++  +  ++++L E  + VK L+++    QG
Sbjct  277  LEEQKNILSEFQRDLNEFVLWLEEADNIASIPLEPGKEQQLKEKLEQVKLLVEELPLRQG  456

Query  284  EIEAHTDVYHNLDENSQKILRSLEGS-DDAVLLQRRLDNMNFKWSELRKKSLNIRSHLEA  342
                   +   L+E    +L S   S ++   L+ +L   N +W ++ +     +  +EA
Sbjct  457  -------ILKQLNETGGPVLVSAPISPEEQDKLENKLKQTNLQWIKVSRALPEKQGEIEA  615

Query  343  SSDQWKRLHLSLQE-------LLVWL  361
                  +L   L++       LL+WL
Sbjct  616  QIKDLGQLEKKLEDLEEQLNHLLLWL  693

Slide 77

Slide 77 text

Look at all this awesome!!
Bio::SearchIO
Bio::Search::HSP::HSPI

    $hsp->algorithm;
    $hsp->pvalue();
    $hsp->evalue();
    $hsp->frac_identical( ['query'|'hit'|'total'] );
    $hsp->frac_conserved( ['query'|'hit'|'total'] );
    $hsp->gaps( ['query'|'hit'|'total'] );
    $hsp->query_string;
    $hsp->hit_string;
    $hsp->homology_string;
    $hsp->length( ['query'|'hit'|'total'] );
    $hsp->rank;
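
To give a feel for what those accessors spare you, here is a rough sketch that scrapes the counts out of a BLAST summary line by hand. The name parse_hsp_stats is made up for this example and the regex only covers the one line format shown on the BLAST slide; Bio::SearchIO does the real parsing.

```perl
use strict;
use warnings;

# Pull the identity, positive, and gap counts out of an HSP summary line.
# Returns a hash of fractions keyed by lowercased label, or an empty list
# if nothing matches. parse_hsp_stats is a hypothetical helper name.
sub parse_hsp_stats {
    my ($line) = @_;
    my %stats;
    while ($line =~ m{(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)}g) {
        $stats{ lc $1 } = $2 / $3;
    }
    return %stats;
}

my %stats = parse_hsp_stats(
    'Identities = 51/206 (24%), Positives = 99/206 (48%), Gaps = 15/206 (7%)'
);
printf "%s: %.3f\n", $_, $stats{$_} for sort keys %stats;
```

A scraper like this breaks as soon as the report layout changes, which is a fair argument for keeping the parsing in one shared library.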

Slide 82

Slide 82 text

Hmmm.
Bio::Search::HSP::HSPI?
Sounds cool, let's check it out:
https://metacpan.org/release/BioPerl

Slide 83

Slide 83 text

Open Source!

Slide 92

Slide 92 text

Seriously.
All snark aside, BioPerl is a needed piece of code.
It just needs some love.
Okay, a lot of love.
If you have the skills to contribute, please join in and help with the cleanup.