
Sequence Matrix: Gene concatenation made easy

Sequence Matrix[1] is a freely available, cross-platform application that lets you concatenate gene datasets easily. A spreadsheet-like interface displays what you've assembled so far; TNT, Nexus, FASTA and MEGA files can be dragged into the application to add them. Your entire dataset can be exported as TNT or Nexus files.

This presentation describes Sequence Matrix, why you might want to use it, and how it can help you prepare large, multigene/multitaxon datasets for easy analysis.

[1] http://code.google.com/p/sequencematrix/

Gaurav Vaidya

June 26, 2009

Transcript

  1. Sequence Matrix
    Gene concatenation made easy
    Gaurav Vaidya¹, David Lohman², Rudolf Meier²
    ¹ NeatCo Asia, Singapore.
    ² Department of Biological Sciences, National University of Singapore, Singapore.


  2. Our goals
    ✤ Many powerful tools exist for concatenating sequences.
    ✤ Adding new sequences to an existing dataset is tedious and time consuming.
    ✤ Our initial goal: simple, user-friendly program for concatenating sequences.
    ✤ We also added a few tools to help you look for lab contamination in your dataset.


  3. Sequence Matrix
    ✤ Written in Java.
    ✤ Uses Java's graphical user interface libraries.
    ✤ Works on different operating systems.
    ✤ Easy to install: download and run the batch file.


  4. Importing sequences
    ✤ You can use the sequence names as entered in the input file.
    ✤ Or you can ask Sequence Matrix to try to identify the species names.


  5. Importing sequences
    ✤ Sequences mode:
      ✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
      ✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
    ✤ Species name:
      ✤ Daubentonia madagascariensis
      ✤ Macaca sylvanus
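
One way a species-name guess could work for GenBank-style titles like the two above is sketched here. This is a hypothetical heuristic, not Sequence Matrix's actual algorithm: it assumes the description starts after the last `|` and that the first capitalised word followed by a lower-case word is the binomial. The name `guess_species_name` is made up for illustration.

```python
import re

# Hypothetical heuristic, not Sequence Matrix's own code: take the text after
# the last '|' and treat the first "Genus species" pair as the species name.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")

def guess_species_name(title):
    """Return a guessed 'Genus species' string, or None if nothing matches."""
    description = title.rsplit("|", 1)[-1]
    match = BINOMIAL.search(description)
    return f"{match.group(1)} {match.group(2)}" if match else None

titles = [
    "gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 "
    "5.8S ribosomal RNA gene, partial sequence",
    "gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 "
    "5.8S ribosomal RNA gene, partial sequence",
]
for title in titles:
    print(guess_species_name(title))
```

Titles that do not follow this pattern would return `None` here and would need to be named by hand.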


  6. Importing sequences
    ✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.
    ✤ Sequence Matrix can automatically replace such gaps with question marks.
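
The recoding described above can be sketched in a few lines. This is an illustration only, assuming `-` is the gap character and `?` the missing-data symbol; the function name `recode_terminal_gaps` is hypothetical. Internal gaps are left alone, since they may be real indels.

```python
def recode_terminal_gaps(sequence, gap="-", missing="?"):
    """Replace leading and trailing gap characters with the missing-data symbol."""
    core = sequence.strip(gap)
    if not core:  # the whole sequence consists of gaps
        return missing * len(sequence)
    leading = len(sequence) - len(sequence.lstrip(gap))
    trailing = len(sequence) - len(sequence.rstrip(gap))
    return missing * leading + core + missing * trailing

print(recode_terminal_gaps("---ACG-TTA--"))  # ???ACG-TTA??
```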


  7. Importing sequences: Naming
    ✤ Sequences from one dataset are matched up to another dataset by sequence name.
    ✤ Errors in sequence naming need to be fixed.
    ✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
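
To see why naming matters, here is a minimal sketch of name-based concatenation. It is not Sequence Matrix's code: the `concatenate` function and the toy data are assumptions for illustration, and taxa lacking a gene are simply padded with `?`.

```python
def concatenate(genes):
    """genes: dict of gene name -> dict of sequence name -> aligned sequence.

    Returns (matrix, charsets): matrix maps each sequence name to its
    concatenated sequence; charsets maps each gene to its 1-based
    (start, end) columns in the combined matrix.
    """
    taxa = sorted({name for seqs in genes.values() for name in seqs})
    matrix = {name: "" for name in taxa}
    charsets = {}
    position = 1
    for gene, seqs in genes.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {gene!r} are not the same length")
        length = lengths.pop()
        for name in taxa:
            # Taxa with no sequence for this gene are filled with '?'.
            matrix[name] += seqs.get(name, "?" * length)
        charsets[gene] = (position, position + length - 1)
        position += length
    return matrix, charsets

genes = {
    "coi": {"Macaca sylvanus": "ACGT", "Daubentonia madagascariensis": "ACGA"},
    "cytb": {"Macaca sylvanus": "TTG", "Macaca sylvanis": "TTC"},  # note the typo
}
matrix, charsets = concatenate(genes)
for name, sequence in matrix.items():
    print(f"{name:30} {sequence}")
print(charsets)
```

The misspelt "Macaca sylvanis" ends up as a separate row containing only cytb data, which is the kind of naming error that has to be fixed before analysis.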


  8. Export: Taxonsets
    ✤ By default, we generate taxonsets on the basis of:
      ✤ Combined length.
      ✤ Number of character sets.
      ✤ Information for a particular gene.
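
The three criteria above could be computed roughly as follows. This is a sketch under assumptions: the thresholds `min_length` and `min_charsets`, the set names, and the function `build_taxonsets` are all hypothetical, and a character counts as information only if it is neither `?` nor `-`.

```python
def build_taxonsets(genes, min_length=5, min_charsets=2):
    """genes: dict of gene name -> dict of taxon name -> sequence."""
    def informative(seq):
        # Count characters that are neither missing ('?') nor gaps ('-').
        return sum(1 for ch in seq if ch not in "?-")

    taxa = sorted({t for seqs in genes.values() for t in seqs})
    length = {t: sum(informative(seqs.get(t, "")) for seqs in genes.values())
              for t in taxa}
    count = {t: sum(1 for seqs in genes.values() if informative(seqs.get(t, "")) > 0)
             for t in taxa}

    taxonsets = {
        f"combined_length_min_{min_length}": [t for t in taxa if length[t] >= min_length],
        f"charsets_min_{min_charsets}": [t for t in taxa if count[t] >= min_charsets],
    }
    for gene, seqs in genes.items():
        taxonsets[f"has_{gene}"] = [t for t in taxa if informative(seqs.get(t, "")) > 0]
    return taxonsets

genes = {
    "coi": {"A": "ACGT", "B": "AC??"},
    "cytb": {"A": "TTG", "C": "TTC"},
}
for name, members in build_taxonsets(genes).items():
    print(name, members)
```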


  9. Gene trees
    ✤ Two ways to do them:
      ✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.
      ✤ Export the entire dataset with one file per column.


  10. Export features
    ✤ You can also export the Sequence Matrix table as an Excel-readable text file.
    ✤ Supervisory mode.
    ✤ Keep track of a project as it grows.


  11. Character sets
    ✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.
    ✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
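
Splitting a file by its character sets can be pictured like this. The sketch handles only simple contiguous Nexus ranges of the form `CHARSET name = start-end;`; it does not cover more complex ranges or TNT xgroup commands, and `split_by_charsets` is a hypothetical helper rather than anything from Sequence Matrix.

```python
import re

# Matches simple contiguous ranges only, e.g. "CHARSET coi = 1-4;".
CHARSET = re.compile(r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)

def split_by_charsets(matrix, sets_block):
    """Split each sequence into one column per character set."""
    columns = {}
    for name, start, end in CHARSET.findall(sets_block):
        start, end = int(start), int(end)
        # Nexus positions are 1-based and inclusive.
        columns[name] = {taxon: seq[start - 1:end] for taxon, seq in matrix.items()}
    return columns

sets_block = """
BEGIN SETS;
    CHARSET coi = 1-4;
    CHARSET cytb = 5-7;
END;
"""
matrix = {"Macaca_sylvanus": "ACGTTTG", "Daubentonia_madagascariensis": "ACGA???"}
for gene, column in split_by_charsets(matrix, sets_block).items():
    print(gene, column)
```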


  12. Excision
    ✤ Individual sequences can be excised from the dataset.
    ✤ Excised sequences will not be exported.
    ✤ Sequence Matrix will warn you about that.


  13. Contamination
    ✤ You thought you were sequencing Gorilla gorilla, but you were really sequencing Homo sapiens.
    ✤ We have two tools you can use, depending on whether:
      ✤ Homo sapiens is in your dataset.
      ✤ Homo sapiens is not in your dataset (experimental!).


  14. H. sapiens in dataset
    ✤ Looks for pairs of sequences whose pairwise distance is very low.
    ✤ Expected difference depends on gene:
      ✤ 28S doesn’t change very much, but
      ✤ COI changes very quickly.
    ✤ Some interpretation is required.
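
The idea of flagging very similar pairs can be sketched as below. This assumes an uncorrected pairwise distance (the proportion of differing sites among columns where both sequences have data); the distance measure, the threshold, and the function names are illustrative assumptions, not a description of Sequence Matrix's internals.

```python
from itertools import combinations

def p_distance(a, b):
    """Uncorrected distance over columns where both sequences have a base."""
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if x not in "?-N" and y not in "?-N"]
    if not compared:
        return None  # no overlapping data
    return sum(x != y for x, y in compared) / len(compared)

def suspicious_pairs(sequences, threshold):
    """Return (name1, name2, distance) for pairs closer than the threshold."""
    hits = []
    for (n1, s1), (n2, s2) in combinations(sorted(sequences.items()), 2):
        d = p_distance(s1, s2)
        if d is not None and d < threshold:
            hits.append((n1, n2, d))
    return hits

coi = {
    "Gorilla gorilla": "ACGTACGTAC",
    "Homo sapiens":    "ACGTACGTAC",
    "Macaca sylvanus": "ACGAACTTAT",
}
print(suspicious_pairs(coi, threshold=0.05))
```

Because the expected difference depends on the gene, the threshold would have to be chosen per gene: a value that is alarming for COI may be perfectly normal for 28S.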


  15. H. sapiens not present
    ✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.
    ✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.
    ✤ Colour sequences by their individual pairwise distances to the reference taxon.


  16. H. sapiens not present
    ✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.
    ✤ Look for colour variation which is unusual or out of place.
    ✤ We would expect sequences from different species to be correlated together.
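
The ranking behind this mode can be approximated in a short sketch: sort taxa by their distance to the reference taxon over every gene except the one in question, then compare that order with the distances on the gene in question. Everything here (the uncorrected distance, the function names, the toy data) is an assumption for illustration, and numbers stand in for the colours.

```python
def p_distance(pairs):
    """Uncorrected distance over (sequence, reference) pairs, skipping '?' and '-'."""
    compared = [(x, y) for a, b in pairs for x, y in zip(a, b)
                if x not in "?-" and y not in "?-"]
    return sum(x != y for x, y in compared) / len(compared) if compared else None

def distance_table(genes, reference, gene_in_question):
    """Rows of (taxon, distance on the other genes, distance on the gene in question)."""
    def pairs(taxon, names):
        return [(genes[g][taxon], genes[g][reference]) for g in names
                if taxon in genes[g] and reference in genes[g]]

    others = [g for g in genes if g != gene_in_question]
    taxa = sorted({t for seqs in genes.values() for t in seqs} - {reference})
    rows = [(t, p_distance(pairs(t, others)), p_distance(pairs(t, [gene_in_question])))
            for t in taxa]
    # Sort by distance over the other genes; taxa without comparable data go last.
    rows.sort(key=lambda row: (row[1] is None, row[1] or 0.0))
    return rows

genes = {
    "28S": {"Homo sapiens": "AAAAAAAAAA", "Pan troglodytes": "AAAAAAAAAT",
            "Gorilla gorilla": "AAAAAAAATT", "Macaca sylvanus": "AAAAAATTTT"},
    "coi": {"Homo sapiens": "CCCCCCCCCC", "Pan troglodytes": "CCCCCCCCGG",
            "Gorilla gorilla": "CCCCCCCCCC", "Macaca sylvanus": "CCCCGGGGGG"},
}
for row in distance_table(genes, reference="Homo sapiens", gene_in_question="coi"):
    print(row)
```

In this toy data the Gorilla gorilla row breaks the trend: it is further from the reference than Pan troglodytes on 28S, yet identical to the reference on coi. That kind of out-of-place value is what the colouring is meant to make visible.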


  17. Pairwise distance mode
    ✤ You need to vary:
      ✤ The gene you are studying.
      ✤ The reference taxon being compared against.
    ✤ Possibly helpful as an alert mechanism.


  18. Summary
    ✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.
    ✤ Taxonsets allow you to analyse subsets of your data in downstream programs.
    ✤ Excising sequences gives you greater control over which sequences to analyse.
    ✤ You can look for contamination in two ways:
      ✤ Looking for very low pairwise distances across your entire dataset.
      ✤ Looking for unusual pairwise distances in Pairwise Distance Mode.


  19. Acknowledgements
    ✤ Rudolf Meier
    ✤ Zhang Guanyang
    ✤ Farhan Ali
    ✤ David Lohman
    ✤ Everybody at the NUS DBS Evolutionary Biology lab.


  20. Question time!
