Sequence Matrix: Gene concatenation made easy

Sequence Matrix[1] is a freely available, cross-platform application that lets you concatenate gene datasets easily. A spreadsheet-like interface displays what you've assembled so far; TNT, Nexus, FASTA and MEGA files can be dragged into the application to add them. Your entire dataset can be exported as TNT or Nexus files.

This presentation describes Sequence Matrix, why you might want to use it, and how it can help you prepare large, multigene/multitaxon datasets for easy analysis.

[1] http://code.google.com/p/sequencematrix/

Gaurav Vaidya

June 26, 2009

Transcript

  1. Sequence Matrix: Gene concatenation made easy
    Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)
    (1) NeatCo Asia, Singapore. (2) Department of Biological Sciences, National University of Singapore, Singapore.
  2. Our goals
    ✤ Many powerful tools exist for concatenating sequences.
    ✤ Adding new sequences to an existing dataset is tedious and time-consuming.
    ✤ Our initial goal: a simple, user-friendly program for concatenating sequences.
    ✤ We also added a few tools to help you look for lab contamination in your dataset.
  3. Sequence Matrix
    ✤ Written in Java.
    ✤ Graphical user interface libraries.
    ✤ Works on different operating systems.
    ✤ Easy to install: download and run the batch file.
  4. Importing sequences
    ✤ You can use the sequence names as entered in the input file.
    ✤ Or you can ask Sequence Matrix to try to identify the species names.
  5. Importing sequences
    ✤ Sequences mode:
        ✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
        ✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
    ✤ Species name
        ✤ Daubentonia madagascariensis
        ✤ Macaca sylvanus
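The difference between the two naming modes can be sketched in Python. This is only an illustration, not the algorithm Sequence Matrix itself uses (the slides do not describe it). It assumes a GenBank-style FASTA header whose free-text description follows the last `|` and starts with a capitalised genus and a lowercase species epithet; the function name `guess_species_name` is hypothetical.

```python
import re
from typing import Optional


def guess_species_name(header: str) -> Optional[str]:
    """Guess a binomial species name from a GenBank-style FASTA header.

    Assumption (not from the slides): the description follows the last '|'
    and begins with 'Genus species'. Returns None when that pattern fails.
    """
    description = header.lstrip(">").rsplit("|", 1)[-1].strip()
    match = re.match(r"([A-Z][a-z]+)\s+([a-z]+)", description)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


header = (
    "gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 "
    "5.8S ribosomal RNA gene, partial sequence"
)
print(guess_species_name(header))
```

A heuristic like this will misfire on headers that do not start with a binomial, which is presumably why the slides describe the feature as something the program "tries" to do.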
  6. Importing sequences
    ✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.
    ✤ Sequence Matrix can automatically replace such gaps with question marks.
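The recoding step is simple to state in code. The following is a minimal sketch, assuming `-` marks a gap and `?` marks missing data; the function name `recode_terminal_gaps` is hypothetical and is not taken from Sequence Matrix.

```python
def recode_terminal_gaps(sequence: str, gap: str = "-", missing: str = "?") -> str:
    """Replace leading and trailing gap characters with the missing-data symbol.

    Internal gaps are left untouched, because they are alignment gaps rather
    than unsequenced regions. 'gap' is assumed to be a single character.
    """
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    return missing * leading + core + missing * trailing


print(recode_terminal_gaps("--AC-GT---"))  # ??AC-GT???
```

A sequence consisting only of gaps becomes all question marks, which matches the idea that nothing was sequenced for that taxon.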
  7. Importing sequences: Naming
    ✤ Sequences from one dataset are matched up to another dataset by sequence name.
    ✤ Errors in sequence naming need to be fixed.
    ✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
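Matching by name is what makes concatenation possible, and also why naming errors matter: a misspelt name becomes a separate row. A minimal Python sketch of the idea follows. It assumes each gene is already aligned and held as a dictionary from sequence name to sequence; the function `concatenate` and this data layout are illustrative assumptions, not Sequence Matrix's internals.

```python
from typing import Dict


def concatenate(genes: Dict[str, Dict[str, str]], missing: str = "?") -> Dict[str, str]:
    """Concatenate per-gene alignments into one matrix, matching rows by name.

    'genes' maps gene name -> {sequence name -> aligned sequence}. A taxon
    with no sequence for a gene is padded with the missing-data symbol.
    """
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    matrix = {name: "" for name in taxa}
    for gene_name, seqs in genes.items():
        lengths = {len(seq) for seq in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene_name}: sequences are not aligned to one length")
        width = lengths.pop()
        for name in taxa:
            matrix[name] += seqs.get(name, missing * width)
    return matrix


genes = {
    "coi": {"Homo sapiens": "ACGT", "Gorilla gorilla": "ACGA"},
    "cytb": {"Homo sapiens": "TT"},
}
for name, row in concatenate(genes).items():
    print(name, row)
```

Here `Gorilla gorilla` has no `cytb` sequence, so its row is padded to `ACGA??`.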
  8. Export: Taxonsets
    ✤ By default, we generate taxonsets on the basis of:
        ✤ Combined length.
        ✤ Number of character sets.
        ✤ Information for a particular gene.
  9. Gene trees
    ✤ Two ways to do them:
        ✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.
        ✤ Export the entire dataset with one file per column.
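The first approach depends on knowing which taxa actually have data for a gene. A small Python sketch of that test is given here, assuming `?` and `-` are the only characters that carry no information; the function name `taxa_with_data` is hypothetical, and the slides do not say exactly how Sequence Matrix decides this.

```python
from typing import Dict, List


def taxa_with_data(gene: Dict[str, str], empty_chars: str = "?-") -> List[str]:
    """Return the taxa that have at least one informative character for a gene.

    'gene' maps sequence name -> aligned sequence for a single column of the
    matrix. Rows made up entirely of gap or missing symbols are left out.
    """
    return sorted(
        name
        for name, seq in gene.items()
        if any(ch not in empty_chars for ch in seq)
    )


coi = {"Homo sapiens": "ACGT", "Gorilla gorilla": "????", "Macaca sylvanus": "--?-"}
print(taxa_with_data(coi))  # ['Homo sapiens']
```

Taxa outside the returned list are the ones to exclude when building a tree for that gene alone.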
  10. Export features
    ✤ You can also export the Sequence Matrix table as an Excel-readable text file.
    ✤ Supervisory mode.
    ✤ Keep track of a project as it grows.
  11. Character sets
    ✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.
    ✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
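What "splitting" by character set means can be shown with a short Python sketch. It handles only the simplest Nexus form, `CHARSET name = start-end;`, with 1-based inclusive positions; step notation, multiple ranges per set, and TNT xgroup commands are deliberately left out. The names `parse_charsets` and `split_by_charsets` are hypothetical.

```python
import re
from typing import Dict, Tuple

# Matches only the simple contiguous form: CHARSET name = start-end;
CHARSET_RE = re.compile(r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)


def parse_charsets(sets_block: str) -> Dict[str, Tuple[int, int]]:
    """Extract {name: (start, end)} from simple Nexus CHARSET commands."""
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)))
        for m in CHARSET_RE.finditer(sets_block)
    }


def split_by_charsets(sequence: str, charsets: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Cut one concatenated sequence into one piece per character set."""
    return {name: sequence[start - 1:end] for name, (start, end) in charsets.items()}


block = "BEGIN SETS;\n  CHARSET coi = 1-4;\n  CHARSET cytb = 5-6;\nEND;"
charsets = parse_charsets(block)
print(split_by_charsets("ACGTTT", charsets))  # {'coi': 'ACGT', 'cytb': 'TT'}
```

Importing the file as a single column corresponds to skipping the split and keeping the whole sequence under the file's name.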
  12. Excision
    ✤ Individual sequences can be excised from the dataset.
    ✤ Excised sequences will not be exported.
    ✤ Sequence Matrix will warn you about that.
  13. Contamination
    ✤ You thought you were sequencing Gorilla gorilla
        ✤ but you were really sequencing Homo sapiens.
    ✤ We have two tools you can use:
        ✤ If Homo sapiens is in your dataset.
        ✤ If Homo sapiens is not in your dataset (experimental!).
  14. H. sapiens in dataset
    ✤ Looks for pairs of sequences whose pairwise distance is very low.
    ✤ Expected difference depends on gene:
        ✤ 28S doesn’t change very much, but
        ✤ COI changes very quickly.
    ✤ Some interpretation is required.
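The search for very similar pairs can be sketched in Python. The slides do not say which distance measure Sequence Matrix uses, so this example assumes the uncorrected p-distance (the proportion of differing sites among sites where both sequences have data). The function names and the threshold are illustrative; as the slide notes, a sensible threshold depends on the gene.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def p_distance(a: str, b: str, ignore: str = "?-N") -> Optional[float]:
    """Uncorrected p-distance over sites where both sequences have data.

    Returns None when the two sequences share no comparable sites.
    """
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in ignore or y in ignore:
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else None


def suspicious_pairs(seqs: Dict[str, str], threshold: float) -> List[Tuple[str, str, float]]:
    """List pairs of sequences whose p-distance is at or below the threshold."""
    hits = []
    for (name1, seq1), (name2, seq2) in combinations(sorted(seqs.items()), 2):
        distance = p_distance(seq1, seq2)
        if distance is not None and distance <= threshold:
            hits.append((name1, name2, distance))
    return hits


coi = {
    "Gorilla gorilla": "ACGTACGT",
    "Homo sapiens": "ACGTACGT",
    "Macaca sylvanus": "ACTTAGGA",
}
print(suspicious_pairs(coi, threshold=0.01))
```

In this toy data the gorilla and human sequences are identical on a fast-evolving gene, which is the pattern that would prompt a closer look.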
  15. H. sapiens not present
    ✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.
    ✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.
    ✤ Colour sequences by their individual pairwise distances to the reference taxon.
  16. H. sapiens not present
    ✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.
    ✤ Look for colour variation which is unusual or out of place.
    ✤ We would expect sequences from different species to be correlated together.
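The reasoning behind Pairwise Distance Mode can be approximated in Python with numbers instead of colours. This is a rough sketch under several assumptions, not a description of the program's implementation: distances are uncorrected p-distances, the "rest of the dataset" is every gene except the one in question, and the names `p_distance` and `distance_table` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def p_distance(a: str, b: str, ignore: str = "?-N") -> Optional[float]:
    """Uncorrected p-distance over sites where both sequences have data."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in ignore or y in ignore:
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else None


def distance_table(
    genes: Dict[str, Dict[str, str]], focal_gene: str, reference: str
) -> List[Tuple[str, Optional[float], Optional[float]]]:
    """For each taxon, return (name, distance on other genes, distance on focal gene).

    Rows are sorted by distance to the reference on the other genes, so a
    focal-gene distance that breaks the trend stands out.
    """
    taxa = sorted({t for seqs in genes.values() for t in seqs} - {reference})
    focal = genes[focal_gene]
    rows = []
    for taxon in taxa:
        rest_taxon = rest_ref = ""
        for gene, seqs in genes.items():
            if gene != focal_gene and taxon in seqs and reference in seqs:
                rest_taxon += seqs[taxon]
                rest_ref += seqs[reference]
        focal_d = None
        if taxon in focal and reference in focal:
            focal_d = p_distance(focal[taxon], focal[reference])
        rows.append((taxon, p_distance(rest_taxon, rest_ref), focal_d))
    rows.sort(key=lambda r: (r[1] is None, r[1] if r[1] is not None else 0.0))
    return rows


genes = {
    "coi": {"ref": "AAAA", "x": "AAAT", "y": "AAAA"},
    "cytb": {"ref": "CCCC", "x": "CCCC", "y": "CCTT"},
}
for row in distance_table(genes, focal_gene="coi", reference="ref"):
    print(row)
```

In the toy data, taxon `y` is far from the reference on `cytb` yet identical to it on `coi`; that kind of mismatch is what an out-of-place colour would signal.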
  17. Pairwise distance mode
    ✤ You need to vary:
        ✤ The gene you are studying.
        ✤ The reference taxon being compared against.
    ✤ Possibly helpful as an alert mechanism.
  18. Summary
    ✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.
    ✤ Taxonsets allow you to analyse subsets of your data in downstream programs.
    ✤ Excising sequences gives you greater control over which sequences to analyse.
    ✤ You can look for contamination in two ways:
        ✤ Looking for very low pairwise distances across your entire dataset.
        ✤ Looking for unusual pairwise distances in Pairwise Distance Mode.
  19. Acknowledgements
    ✤ Rudolf Meier
    ✤ Zhang Guanyang
    ✤ Farhan Ali
    ✤ David Lohman
    ✤ Everybody at the NUS DBS Evolutionary Biology lab.