Sequence Matrix: Gene concatenation made easy

Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)
1: NeatCo Asia, Singapore. 2: Department of Biological Sciences, National University of Singapore, Singapore.

Our goals
✤ Many powerful tools exist for concatenating sequences.
✤ Adding new sequences to an existing dataset is tedious and time-consuming.
✤ Our initial goal: a simple, user-friendly program for concatenating sequences.
✤ We also added a few tools to help you look for lab contamination in your dataset.

Sequence Matrix
✤ Written in Java.
✤ Graphical user interface libraries.
✤ Works on different operating systems.
✤ Easy to install: download and run the batch file.

Importing sequences
✤ You can use the sequence names as entered in the input file.
✤ Or you can ask Sequence Matrix to try to identify the species names.

Importing sequences
✤ Sequences mode:
  ✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
  ✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
✤ Species name:
  ✤ Daubentonia madagascariensis
  ✤ Macaca sylvanus
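
The difference between the two naming modes can be sketched in a few lines of Python. This is only an illustration, not how Sequence Matrix itself identifies species names: the function `guess_species_name` and the regular expression are hypothetical, and they simply assume that the first "Genus species" pair after the last `|` in a GenBank-style header is the species name.

```python
import re
from typing import Optional

# Hypothetical pattern: a capitalised genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def guess_species_name(header: str) -> Optional[str]:
    """Return the first 'Genus species' pair in a FASTA description, if any."""
    # In GenBank-style headers the identifiers come before the last '|'
    # and the free-text description follows it.
    description = header.rsplit("|", 1)[-1]
    match = BINOMIAL.search(description)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"
```

A header with no recognisable binomial returns `None`, in which case falling back to the full sequence name is the safer choice.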

Importing sequences
✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.
✤ Sequence Matrix can automatically replace such gaps with question marks.
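
As a rough illustration of the recoding step, the sketch below replaces only the gaps at the two ends of an aligned sequence and leaves internal gaps alone. The function name `recode_terminal_gaps` is hypothetical, and the sketch assumes `-` is the gap symbol and `?` the missing-data symbol; it is not taken from the Sequence Matrix source.

```python
def recode_terminal_gaps(sequence: str, gap: str = "-", missing: str = "?") -> str:
    """Replace leading and trailing gap characters with the missing-data symbol."""
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    # Internal gaps (inside `core`) are real alignment gaps and stay untouched.
    return missing * leading + core + missing * trailing
```

For example, `recode_terminal_gaps("--AC-GT---")` gives `"??AC-GT???"`: the internal gap is kept, while the unsequenced ends become missing data.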

Importing sequences: Naming
✤ Sequences from one dataset are matched up to another dataset by sequence name.
✤ Errors in sequence naming need to be fixed.
✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
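
Matching by name is what makes concatenation work, and it is also why naming errors matter. The sketch below is a minimal, hypothetical version of the idea (the function `concatenate` is not Sequence Matrix code): each gene is a mapping from sequence name to aligned sequence, and a taxon missing from a gene is padded with `?` for that gene's width.

```python
from typing import Dict


def concatenate(genes: Dict[str, Dict[str, str]], missing: str = "?") -> Dict[str, str]:
    """Join per-gene alignments into one matrix, matching rows by sequence name."""
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    matrix = {name: "" for name in taxa}
    for gene, seqs in genes.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {gene!r} are not aligned to one length")
        width = lengths.pop()
        for name in taxa:
            # A taxon absent from this gene gets a block of missing data.
            matrix[name] += seqs.get(name, missing * width)
    return matrix
```

Note that a misspelt name (say, "Homo sapien" in one file) would silently become a separate row padded with question marks, which is the kind of error that has to be fixed by hand.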

Export: Taxonsets
✤ By default, we generate taxonsets on the basis of:
  ✤ Combined length.
  ✤ Number of character sets.
  ✤ Information for a particular gene.
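
The three criteria could be sketched as follows. Everything here is an assumption made for illustration: the function names, the taxonset names, the thresholds, and the choice to count only characters other than `-` and `?` as information. The defaults Sequence Matrix actually uses are not stated on the slide.

```python
from typing import Dict, List


def informative_length(sequence: str) -> int:
    """Count characters that are neither gaps nor missing data."""
    return sum(1 for base in sequence if base not in "-?")


def build_taxonsets(
    matrix: Dict[str, Dict[str, str]], min_length: int, min_charsets: int
) -> Dict[str, List[str]]:
    """matrix maps taxon -> {gene -> aligned sequence}; returns named taxon lists."""
    taxonsets: Dict[str, List[str]] = {
        f"combined_length_{min_length}": [],
        f"charsets_{min_charsets}": [],
    }
    for taxon in sorted(matrix):
        per_gene = {g: informative_length(s) for g, s in matrix[taxon].items()}
        present = [g for g, n in per_gene.items() if n > 0]
        if sum(per_gene.values()) >= min_length:
            taxonsets[f"combined_length_{min_length}"].append(taxon)
        if len(present) >= min_charsets:
            taxonsets[f"charsets_{min_charsets}"].append(taxon)
        for gene in present:
            # One taxonset per gene: the taxa that have any data for it.
            taxonsets.setdefault(f"has_{gene}", []).append(taxon)
    return taxonsets
```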

Gene trees
✤ Two ways to do them:
  ✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.
  ✤ Export the entire dataset with one file per column.

Export features
✤ You can also export the Sequence Matrix table as an Excel-readable text file.
✤ Supervisory mode.
✤ Keep track of a project as it grows.

Character sets
✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.
✤ These can be “split” into individual columns, or imported as a single column representing the entire file.

Excision
✤ Individual sequences can be excised from the dataset.
✤ Excised sequences will not be exported.
✤ Sequence Matrix will warn you about that.

Contamination
✤ You thought you were sequencing Gorilla gorilla
✤ but you were really sequencing Homo sapiens.
✤ We have two tools you can use:
  ✤ If Homo sapiens is in your dataset.
  ✤ If Homo sapiens is not in your dataset (experimental!).

H. sapiens in dataset
✤ Looks for pairs of sequences whose pairwise distance is very low.
✤ Expected difference depends on gene:
  ✤ 28S doesn’t change very much, but
  ✤ COI changes very quickly.
✤ Some interpretation is required.

H. sapiens not present
✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.
✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.
✤ Colour sequences by their individual pairwise distances to the reference taxon.

H. sapiens not present
✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.
✤ Look for colour variation which is unusual or out of place.
✤ We would expect sequences from different species to be correlated together.
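
One way to read the procedure on these two slides is sketched below. It is an interpretation, not the program's implementation: `distance_table` is a hypothetical function, the distance is an uncorrected p-distance, and the numeric columns stand in for the colours shown in the interface. Taxa are sorted by their distance to the reference over all charsets except the gene being studied, and the distance on that gene is listed alongside.

```python
from typing import Dict, List, Optional, Tuple


def p_distance(a: str, b: str) -> Optional[float]:
    """Proportion of differing sites among positions where both have data."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?" or y in "-?":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else None


def distance_table(
    matrix: Dict[str, Dict[str, str]], reference: str, gene: str
) -> List[Tuple[str, Optional[float], Optional[float]]]:
    """Rows of (taxon, distance on the other genes, distance on `gene`)."""
    ref = matrix[reference]
    rows = []
    for taxon, genes in matrix.items():
        if taxon == reference:
            continue
        # Background distance: every shared charset except the ignored one.
        shared = [g for g in sorted(ref) if g != gene and g in genes]
        background = p_distance(
            "".join(ref[g] for g in shared), "".join(genes[g] for g in shared)
        )
        focal = p_distance(ref[gene], genes[gene]) if gene in ref and gene in genes else None
        rows.append((taxon, background, focal))
    # Sort by background distance; taxa with no comparable data go last.
    rows.sort(key=lambda row: (row[1] is None, row[1] or 0.0))
    return rows
```

In a clean dataset the third column should rise roughly in step with the second. A row whose distance on the studied gene is far lower or higher than its neighbours is the "out of place" colour the slide describes.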

Pairwise distance mode
✤ You need to vary:
  ✤ The gene you are studying.
  ✤ The reference taxon being compared against.
✤ Possibly helpful as an alert mechanism.

Summary
✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.
✤ Taxonsets allow you to analyse subsets of your data in downstream programs.
✤ Excising sequences gives you greater control over which sequences to analyse.
✤ You can look for contamination in two ways:
  ✤ Looking for very low pairwise distances across your entire dataset.
  ✤ Looking for unusual pairwise distances in Pairwise Distance Mode.

Acknowledgements
✤ Rudolf Meier
✤ Zhang Guanyang
✤ Farhan Ali
✤ David Lohman
✤ Everybody at the NUS DBS Evolutionary Biology lab.

Question time!
