Slide 1

Slide 1 text

Sequence Matrix: Gene concatenation made easy
Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)
1: NeatCo Asia, Singapore.
2: Department of Biological Sciences, National University of Singapore, Singapore.

Slide 2

Slide 2 text

Our goals
✤ Many powerful tools exist for concatenating sequences.
✤ However, adding new sequences to an existing dataset is tedious and time-consuming.
✤ Our initial goal: a simple, user-friendly program for concatenating sequences.
✤ We also added a few tools to help you look for lab contamination in your dataset.

Slide 3

Slide 3 text

Sequence Matrix
✤ Written in Java.
✤ Uses Java's graphical user interface libraries.
✤ Works on different operating systems.
✤ Easy to install: download and run the batch file.

Slide 4

Slide 4 text

Importing sequences
✤ You can use the sequence names as entered in the input file.
✤ Or you can ask Sequence Matrix to try to identify the species names.

Slide 5

Slide 5 text

Importing sequences
✤ Sequences mode:
  ✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
  ✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
✤ Species name:
  ✤ Daubentonia madagascariensis
  ✤ Macaca sylvanus
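The species-name guess shown above can be sketched roughly as follows. This is only an illustration, not Sequence Matrix's actual code: the function name `guess_species_name` and the regular expression are assumptions, and the sketch simply takes the first "Genus species" pair found after the pipe-delimited identifier fields.

```python
import re
from typing import Optional

# Hypothetical pattern: a capitalised genus followed by a lowercase epithet.
SPECIES_RE = re.compile(r"\b([A-Z][a-z]+ [a-z]+)\b")

def guess_species_name(header: str) -> Optional[str]:
    """Guess a binomial species name from a GenBank-style FASTA header."""
    # Keep only the free-text description after the last '|' field.
    description = header.split("|")[-1]
    match = SPECIES_RE.search(description)
    return match.group(1) if match else None

header = ("gi|237510679|gb|AY556753.2|Daubentonia madagascariensis "
          "voucher WE94001 5.8S ribosomal RNA gene, partial sequence")
print(guess_species_name(header))
```

A guess like this can fail on subspecies, hybrids, or headers without a binomial name, which is why the result should be checked by eye.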

Slide 6

Slide 6 text

Importing sequences
✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.
✤ Sequence Matrix can automatically replace such gaps with question marks.
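A minimal sketch of this recoding, assuming gaps are written as "-" and missing data as "?". The function name `recode_terminal_gaps` is made up for illustration and is not taken from Sequence Matrix.

```python
def recode_terminal_gaps(sequence: str, gap: str = "-", missing: str = "?") -> str:
    """Replace leading and trailing gap characters with the missing-data symbol.

    Internal gaps are left alone, and the sequence length is preserved.
    """
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    return missing * leading + core + missing * trailing

print(recode_terminal_gaps("--AC-GT---"))
```

Only the gaps at the two ends are treated as missing information; a gap inside the sequence is still treated as a real alignment gap.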

Slide 7

Slide 7 text

Importing sequences: Naming
✤ Sequences from one dataset are matched up to another dataset by sequence name.
✤ Errors in sequence naming need to be fixed.
✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
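To show why names matter, here is a rough sketch of concatenation by sequence name. It is an illustration under stated assumptions, not the program's own code: the function `concatenate` and the nested-dictionary layout are hypothetical, and each gene is assumed to be an alignment in which all sequences have the same length.

```python
from typing import Dict

def concatenate(genes: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """genes maps gene name -> {sequence name -> aligned sequence}."""
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    combined = {name: "" for name in taxa}
    for gene, seqs in genes.items():
        # Assumes every sequence in one gene alignment has the same length.
        length = len(next(iter(seqs.values())))
        for name in taxa:
            # A taxon without this gene is padded with missing data.
            combined[name] += seqs.get(name, "?" * length)
    return combined

genes = {
    "coi": {"Homo sapiens": "ACGT", "Gorilla gorilla": "ACGA"},
    "cytb": {"Homo sapiens": "TTG", "Gorila gorilla": "TTA"},  # note the typo
}
for name, sequence in concatenate(genes).items():
    print(name, sequence)
```

Because matching is by exact name, the misspelled "Gorila gorilla" ends up as a separate row padded with question marks instead of joining "Gorilla gorilla".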

Slide 8

Slide 8 text

Export: Taxonsets
✤ By default, we generate taxonsets on the basis of:
  ✤ Combined length.
  ✤ Number of character sets.
  ✤ Information for a particular gene.
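The three kinds of taxonset can be sketched as simple filters over a taxon-by-gene table. The function names, the table layout, and the choice to count only non-gap, non-missing characters towards combined length are all assumptions made for this example.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, str]]  # taxon -> {gene -> sequence}

def combined_length(genes: Dict[str, str]) -> int:
    # Assumption: gaps and missing data do not count towards the length.
    return sum(len(s) - s.count("-") - s.count("?") for s in genes.values())

def taxa_with_min_length(matrix: Matrix, minimum: int) -> List[str]:
    return sorted(t for t, g in matrix.items() if combined_length(g) >= minimum)

def taxa_with_min_charsets(matrix: Matrix, minimum: int) -> List[str]:
    return sorted(t for t, g in matrix.items() if len(g) >= minimum)

def taxa_with_gene(matrix: Matrix, gene: str) -> List[str]:
    return sorted(t for t, g in matrix.items() if gene in g)

matrix = {
    "Homo sapiens": {"coi": "ACGT", "cytb": "TTG"},
    "Gorilla gorilla": {"coi": "AC--"},
}
print(taxa_with_gene(matrix, "cytb"))
```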

Slide 9

Slide 9 text

Gene trees
✤ Two ways to do them:
  ✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.
  ✤ Export the entire dataset with one file per column.

Slide 10

Slide 10 text

Export features
✤ You can also export the Sequence Matrix table as an Excel-readable text file.
✤ Supervisory mode.
✤ Keep track of a project as it grows.

Slide 11

Slide 11 text

Character sets
✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.
✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
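As an illustration of "splitting", the sketch below reads simple Nexus CHARSET commands of the form `CHARSET name = start-end;` and cuts a sequence into one piece per character set. It is a simplified assumption: the helper names are hypothetical, and more complex CHARSET definitions (multiple ranges, codon-position strides) and TNT xgroup commands are not handled.

```python
import re
from typing import Dict, Tuple

# Matches only the simplest form: CHARSET name = start-end;
CHARSET_RE = re.compile(r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)

def parse_charsets(sets_block: str) -> Dict[str, Tuple[int, int]]:
    """Return {name: (start, end)} using 1-based, inclusive positions."""
    return {name: (int(start), int(end))
            for name, start, end in CHARSET_RE.findall(sets_block)}

def split_by_charsets(sequence: str, charsets: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    # Convert 1-based inclusive positions to Python slices.
    return {name: sequence[start - 1:end] for name, (start, end) in charsets.items()}

block = "CHARSET coi = 1-4;\nCHARSET cytb = 5-7;"
print(split_by_charsets("ACGTTTG", parse_charsets(block)))
```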

Slide 12

Slide 12 text

Excision
✤ Individual sequences can be excised from the dataset.
✤ Excised sequences will not be exported.
✤ Sequence Matrix will warn you when this happens.

Slide 13

Slide 13 text

Contamination
✤ You thought you were sequencing Gorilla gorilla, but you were really sequencing Homo sapiens.
✤ We have two tools you can use:
  ✤ If Homo sapiens is in your dataset.
  ✤ If Homo sapiens is not in your dataset (experimental!).

Slide 14

Slide 14 text

H. sapiens in dataset
✤ Looks for pairs of sequences whose pairwise distance is very low.
✤ Expected difference depends on gene:
  ✤ 28S doesn’t change very much, but
  ✤ COI changes very quickly.
✤ Some interpretation is required.
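A rough sketch of this kind of search is shown here. The slides do not say which distance measure is used, so the uncorrected p-distance (proportion of differing sites, skipping gaps and missing data) is an assumption, as are the function names and the threshold value.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

def p_distance(a: str, b: str) -> Optional[float]:
    """Proportion of differing sites among sites where both sequences have data."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?" or y in "-?":
            continue  # skip gaps and missing data
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else None

def suspicious_pairs(sequences: Dict[str, str], threshold: float) -> List[Tuple[str, str, float]]:
    """List pairs of names whose distance is at or below the threshold."""
    hits = []
    for (name1, seq1), (name2, seq2) in combinations(sorted(sequences.items()), 2):
        distance = p_distance(seq1, seq2)
        if distance is not None and distance <= threshold:
            hits.append((name1, name2, distance))
    return hits

coi = {
    "Gorilla gorilla": "ACGTACGT",
    "Homo sapiens": "ACGTACGT",
    "Macaca sylvanus": "ACGAATGA",
}
print(suspicious_pairs(coi, threshold=0.01))
```

The threshold has to be chosen per gene: a value that flags contamination in COI could flag perfectly good pairs in a slowly changing gene such as 28S.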

Slide 15

Slide 15 text

H. sapiens not present
✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.
✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.
✤ Colour sequences by their individual pairwise distances to the reference taxon.
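One way to read these steps is sketched below. It is an interpretation, not the program's actual algorithm: the function `rank_against_reference`, the table layout, and the use of an uncorrected p-distance are all assumptions. Each taxon gets a "background" distance to the reference taxon computed without the ignored charset, which is used for sorting, and a separate distance on the ignored charset itself, which is the value one would colour by.

```python
from typing import Dict, List, Optional, Tuple

Matrix = Dict[str, Dict[str, str]]  # taxon -> {charset -> sequence}

def p_distance(a: str, b: str) -> Optional[float]:
    """Proportion of differing sites, skipping gaps and missing data."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in "-?" and y not in "-?"]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)

def rank_against_reference(matrix: Matrix, reference: str,
                           ignored: str) -> List[Tuple[str, Optional[float], Optional[float]]]:
    ref = matrix[reference]
    rows = []
    for taxon, genes in matrix.items():
        if taxon == reference:
            continue
        shared = [g for g in genes if g in ref and g != ignored]
        # Distance over every shared charset except the ignored one.
        background = p_distance("".join(genes[g] for g in shared),
                                "".join(ref[g] for g in shared))
        # Distance on the ignored charset alone.
        focal = (p_distance(genes[ignored], ref[ignored])
                 if ignored in genes and ignored in ref else None)
        rows.append((taxon, background, focal))
    # Taxa with no comparable data are placed last.
    rows.sort(key=lambda r: (r[1] is None, r[1] if r[1] is not None else 0.0))
    return rows

matrix = {
    "Homo sapiens": {"28S": "AAAA", "coi": "CCCC"},
    "Gorilla gorilla": {"28S": "AAAT", "coi": "CCCC"},
    "Macaca sylvanus": {"28S": "AATT", "coi": "CCGG"},
}
for row in rank_against_reference(matrix, "Homo sapiens", "coi"):
    print(row)
```

In this toy data, "Gorilla gorilla" differs from the reference on 28S but is identical to it on coi, which is the sort of out-of-place value worth a closer look.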

Slide 16

Slide 16 text

H. sapiens not present
✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.
✤ Look for colour variation which is unusual or out of place.
✤ We would expect sequences from different species to be correlated together.

Slide 17

Slide 17 text

Pairwise distance mode
✤ You need to vary:
  ✤ The gene you are studying.
  ✤ The reference taxon being compared against.
✤ Possibly helpful as an alert mechanism.

Slide 18

Slide 18 text

Summary
✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.
✤ Taxonsets allow you to analyse subsets of your data in downstream programs.
✤ Excising sequences gives you greater control over which sequences to analyse.
✤ You can look for contamination in two ways:
  ✤ Looking for very low pairwise distances across your entire dataset.
  ✤ Looking for unusual pairwise distances in Pairwise Distance Mode.

Slide 19

Slide 19 text

Acknowledgements
✤ Rudolf Meier
✤ Zhang Guanyang
✤ Farhan Ali
✤ David Lohman
✤ Everybody at the NUS DBS Evolutionary Biology lab.

Slide 20

Slide 20 text

Question time!