Collections-specific Searches in Genetic Databases

Mike Trizna
October 05, 2017

These are the slides from my TDWG talk on October 5, 2017. (doi: 10.3897/tdwgproceedings.1.20320)

I did a brief overview of how specimen data is stored in GenBank records, how it can be searched, and how the NCBI LinkOut system can provide a "patch" on incorrect or incomplete specimen data.

I then introduced a tool I built called "genetic_collections", which can be found on GitHub here: https://github.com/MikeTrizna/genetic_collections.

Transcript

  1. Can you tell me how many GenBank records are derived from vouchers in your institution’s collection?
  2. The "official" way to find this answer • By searching on indexed records connected to NCBI’s list of institution codes
  3. The "official" answer • But… this requires specimen_voucher (or biomaterial or culture_collection) to be in properly constructed DwC triplet format
  4. The "real" answer (Not indexed by GenBank, because it is separated by a space and not a colon)
  5. How to ensure these extra records show up in the proper search • Have sequence authors send NCBI updates to records
  6. A "practical" solution (while hopefully working on correct solution): • Use NCBI LinkOuts to organize records that you know come from your institution
  7. What is a LinkOut? LinkOut is a service that allows you to link directly from PubMed and other NCBI databases to a wide range of information and services beyond the NCBI systems.
  8. LinkOut Pros • Once registered as a LinkOut provider, you can put a LinkOut on any record, regardless of ownership
  9. genetic_collections command line tools • Written as a Python library, but installation also packages stand-alone command line tools, so no Python knowledge is required!
  10. genetic_collections workflow • Steps: Institution Search, Database Search, Database Fetch • Inputs: Name queries; Other search parameters like Taxonomy, Date Ranges, etc.; List of IDs • Outputs: Name or code matches with counts for context; Count of Results; Tabular data formats (Excel, TSV, CSV)
  11. Institution Searching If you’re searching by an Institution Name or Institution Code, it helps to know what names or codes are being used, and how many times they’re used. • NCBI BioCollections has 6812 institutions – But only 2103 have at least 1 GenBank record • BOLD has 1940 unique “institution storing” values – Only 900 with 50 or more public records
  12. Acknowledgements • Special thanks to: – Diane Pitassy, collections manager for the NMNH Fish Collection – Tom Hollowell from the NMNH Informatics Department – The SI DNA Barcoding Network and Global Genome Initiative teams
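
Slides 3 and 4 turn on the difference between a colon-delimited voucher value and a space-separated one. The following Python snippet is a minimal sketch of that distinction. The function names, the regular expression, and the example values such as `USNM 12345` are illustrative assumptions; none of this is taken from genetic_collections or from NCBI's own indexing code.

```python
import re

# Hypothetical pattern for a structured voucher value:
#   institution:specimen_id   or   institution:collection:specimen_id
# This is a simplification for illustration, not NCBI's actual validation rule.
TRIPLET_RE = re.compile(
    r"^(?P<institution>[^:\s]+):(?:(?P<collection>[^:]+):)?(?P<specimen_id>[^:]+)$"
)


def parse_voucher(value):
    """Return (institution, collection, specimen_id) for a colon-delimited
    value, or None when the value is not in that structured form."""
    match = TRIPLET_RE.match(value.strip())
    if match is None:
        return None
    return (
        match.group("institution"),
        match.group("collection"),
        match.group("specimen_id"),
    )


def suggest_structured(value, known_codes):
    """For a space-separated value such as 'CODE 12345', propose 'CODE:12345'
    when CODE is in the caller-supplied set of institution codes."""
    parts = value.strip().split(None, 1)
    if len(parts) == 2 and parts[0] in known_codes:
        return f"{parts[0]}:{parts[1]}"
    return None


if __name__ == "__main__":
    codes = {"USNM"}  # assumed example set of institution codes
    for raw in ["USNM:Fish:12345", "USNM:12345", "USNM 12345"]:
        print(raw, "->", parse_voucher(raw), "| suggestion:", suggest_structured(raw, codes))
```

A value that `parse_voucher` rejects but `suggest_structured` can repair is the kind of record slide 4 describes: it plausibly belongs to the institution, yet a search that relies on the structured form would not find it.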