BBX-01-Retrieve sequence data from Entrez DataBase NCBI

Entrez is an integrated information search and retrieval system for the biological databases at NCBI. This presentation has two objectives:
• To introduce Entrez as a biological data retrieval system.
• To learn how to use the Entrez search engine to retrieve nucleotide and protein sequence data.

Praharshit Sharma

July 31, 2018

Transcript

  1. These are Demo Slides on Entrez

    For full Practicals, please buy: http://imojo.in/3b06qm
  2. Theory

    Entrez is an integrated search engine that lets users search and retrieve different kinds of data from the National Center for Biotechnology Information (NCBI). It can be accessed from the site www.ncbi.nlm.nih.gov/Entrez/. Entrez is NCBI's major text search and retrieval system. It integrates the PubMed database with 39 other databases covering scientific literature, nucleotide and protein sequences, protein domain data, population study datasets, expression data, pathways and systems of interacting molecules, complete genomes and taxonomic information into a tightly interlinked system. These component databases can be searched with a single query.
  3. The major functions of NCBI are:

    • Create public databases for storing, retrieving, and analyzing knowledge about molecular biology, biochemistry, and genetics.
    • Conduct research in computational biology, for analyzing the structure and function of biological molecules.
    • Develop software tools for analyzing genomic data.
    • Disseminate biomedical information.
    • Gather biotechnology information worldwide.
  4. Entrez as Search Engine

    Entrez thus acts as the search engine for the NCBI databases. Searches can be made more precise by combining terms in the search statement with the Boolean operators AND, OR and NOT. Limits let users filter a search according to their own criteria. An Advanced Search interface supports more detailed queries.
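
As a rough sketch of how Boolean operators combine search terms, the snippet below assembles a search statement and encodes it into a URL for the `esearch` endpoint of NCBI's E-utilities. The helper names `build_query` and `esearch_url`, the example terms, and the choice of the `nucleotide` database are illustrative assumptions that do not come from the slides. No request is sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(required, excluded=()):
    """Join required terms with AND, then append NOT for each excluded term."""
    query = " AND ".join(required)
    for term in excluded:
        query += f" NOT {term}"
    return query


def esearch_url(db, query):
    """Encode the database name and search statement as URL parameters."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': query})}"


# Hypothetical example: insulin records from human, excluding partial sequences.
query = build_query(["insulin", "human[Organism]"], excluded=["partial"])
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

The same search statement can be typed directly into the Entrez search box. The URL form only becomes useful when a search needs to be scripted.
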
  5. Entrez Boolean Search Statements

    Users can perform a global search by selecting the default option “All Databases”, which lists the different databases together with the number of matching records in each. The databases are arranged in three main sections: the top section covers the literature databases, the middle section the molecular databases, and the bottom section the accessory literature databases (journals, NLM Catalog and MeSH).
  6. Entrez Associated DataBases-1

    Books: Bookshelf provides free access to search, retrieve and read books and journals in the life sciences. It can be accessed from the site http://www.ncbi.nlm.nih.gov/books
    CDD: The Conserved Domain Database is a collection of annotations of functional units in proteins. It contains manually annotated domain models that use 3D structure information to define sequence/structure/function relationships. It can be accessed from the site www.ncbi.nlm.nih.gov/sites/entrez
    Gene: The Gene database holds information on genes from various species, including nomenclature, associated pathways, RefSeqs, phenotypes and links to genomes. It can be accessed from the site http://www.ncbi.nlm.nih.gov/gene/
  7. Entrez Associated DataBases-2

    CoreNucleotide: It is a collection of sequences drawn from several sources, including GenBank, RefSeq, TPA and PDB. It can be accessed from the site http://www.ncbi.nlm.nih.gov/sites/entrez?db=nuccore
    EST: The Expressed Sequence Tag database is a collection of EST records from GenBank. ESTs are short sequence reads derived from cDNA, which serve as a resource to evaluate gene expression, find potential variation and annotate genes. It can be accessed from the site http://www.ncbi.nlm.nih.gov/nucest
    Genome: The Genome database is a collection of genome information, including sequences, maps, chromosomes and annotations. It can be accessed from the site http://www.ncbi.nlm.nih.gov/genome
  8. Entrez Associated DataBases-3

    dbGaP: The database of Genotypes and Phenotypes is an archive of results from studies of the interaction of genotype and phenotype. It can be accessed from the site http://www.ncbi.nlm.nih.gov/gap
    GEO Datasets: The Gene Expression Omnibus (GEO) offers information on gene expression datasets, their original series and Platform records. It also provides additional information such as experimental details, cluster tools and differential expression queries. It can be accessed from the site www.ncbi.nlm.nih.gov/gds
    GEO Profiles: It lets users browse individual gene expression profiles, selected by gene annotation or by pre-computed profile characteristics. It can be accessed from the site http://www.ncbi.nlm.nih.gov/geoprofiles
  9. Entrez Associated DataBases-4

    GSS: The GSS nucleotide database provides the Genome Survey Sequence records from GenBank. It can be accessed from the site www.ncbi.nlm.nih.gov/nucgss
    HomoloGene: It is a collection of homologs among the annotated genes of completely sequenced eukaryotic organisms. It can be accessed from the site www.ncbi.nlm.nih.gov/homologene
    MeSH: MeSH (Medical Subject Headings) is the controlled vocabulary of the NLM (National Library of Medicine). It is used for indexing and browsing articles and acts as a biomedical thesaurus for PubMed and MEDLINE. It can be accessed from the site www.ncbi.nlm.nih.gov/mesh
  10. Entrez Associated DataBases-5

    NCBI Web Site: It searches the pages of the NCBI website. It can be accessed from the site http://www.ncbi.nlm.nih.gov/
    NLM Catalog: The NLM Catalog gives access to the holdings of the United States National Library of Medicine (NLM), the largest medical library, including books, journals, technical information, audiovisuals, software and other resources. It can be accessed from the site http://www.ncbi.nlm.nih.gov/nlmcatalog
    OMIM: It is a comprehensive resource on human genes and genetic disorders. It contains information about human genes and genetic phenotypes and is updated daily. It can be accessed from the site www.ncbi.nlm.nih.gov/omim
  11. Entrez Associated DataBases-6

    OMIA: Online Mendelian Inheritance in Animals is a resource on genes, inherited disorders and traits in more than 135 animal species, authored by Professor Frank Nicholas. It covers animal species other than human and mouse, which have their own species-specific resources. It can be accessed from the site http://www.ncbi.nlm.nih.gov/omia
    PopSet: A population study dataset is a set of DNA sequences collected to study the evolutionary relatedness of a population. It can be accessed from the site http://www.ncbi.nlm.nih.gov/popset
    Probe: It is a collection of nucleic acid reagents. It also contains information on reagent distributors, probe effectiveness and computed sequence similarities. It can be accessed from the site http://www.ncbi.nlm.nih.gov/probe
  12. Entrez Associated DataBases-7

    Protein Sequence Database: It is a collection of sequences from GenBank, RefSeq, TPA, SwissProt, PIR, PRF and PDB. It can be accessed from the site www.ncbi.nlm.nih.gov/protein
    PubChem BioAssay: It contains bioactivity screening results for chemical substances in PubChem. It can be accessed from the site www.ncbi.nlm.nih.gov/pcassay
    PubChem Compound: It contains unique chemical structures, with biological information, derived from PubChem Substance records. It can be accessed from the site www.ncbi.nlm.nih.gov/pccompound
  13. Entrez Associated DataBases-8

    PubChem Substance: It is a collection of substance records submitted by depositors, with descriptions of the samples and links to the biological screening results available in PubChem BioAssay. It can be accessed from the site www.ncbi.nlm.nih.gov/pcsubstance
    PubMed: PubMed is a freely accessible search system for health information, developed and maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). It contains articles from MEDLINE and other biomedical articles. It can be accessed from the site www.ncbi.nlm.nih.gov/pubmed
    PubMed Central: PubMed Central is a freely accessible digital archive of full-text articles from biomedical and life science journals, linked to the PubMed database. It can be accessed from the site www.ncbi.nlm.nih.gov/pmc/
  14. Entrez Associated DataBases-9

    SNP: The SNP database contains information on single nucleotide polymorphisms and on short insertion and deletion polymorphisms. It can be accessed from the site www.ncbi.nlm.nih.gov/snp
    Structure: The Structure database contains the three-dimensional structures of proteins and polynucleotides. It can be accessed from the site www.ncbi.nlm.nih.gov/structure
    Taxonomy: Taxonomy contains information on all organisms that are represented in the genetic databases by a nucleotide or protein sequence. It can be accessed from the site www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
    UniGene: It identifies transcripts from the same locus, analyses expression by tissue, age and health status, and reports related proteins (ProtEST) and clone resources. It can be accessed from the site www.ncbi.nlm.nih.gov/unigene
  15. Entrez Associated DataBases-10

    UniSTS: It contains information about Sequence Tagged Sites (STS), defined by PCR primer pairs, with their genomic positions, genes and sequence information from STS-based maps and other experiments. It can be accessed from the site www.ncbi.nlm.nih.gov/unists
    BioSample: It is a collection of information on the different biological source materials used in experimental assays. It can be accessed from the site www.ncbi.nlm.nih.gov/biosample
    The results of a query can be displayed in different data formats, such as GenBank and FASTA.
  16. GENBANK

    GenBank is the NIH genetic sequence database, a collection of annotated DNA sequences. The components of a GenBank record are explained below.
    Locus name helps to group entries with similar sequences. The first three characters denote the organism, the fourth and fifth characters give other group designations, such as gene product, and the last character is one of a series of sequential integers.
    Sequence Length gives the number of nucleotide base pairs (or amino acid residues) in the sequence record.
    Molecule Type shows the type of molecule that was sequenced.
    GenBank Division shows the GenBank division to which a record belongs, indicated by a three-letter abbreviation.
  17. Genbank Divisions-1

    1. PRI - primate sequences
    2. ROD - rodent sequences
    3. MAM - other mammalian sequences
    4. VRT - other vertebrate sequences
    5. INV - invertebrate sequences
    6. PLN - plant, fungal, and algal sequences
    7. BCT - bacterial sequences
    8. VRL - viral sequences
    9. PHG - bacteriophage sequences
  18. Genbank Divisions-2

    10. SYN - synthetic sequences
    11. UNA - unannotated sequences
    12. EST - EST sequences (expressed sequence tags)
    13. PAT - patent sequences
    14. STS - STS sequences (sequence tagged sites)
    15. GSS - GSS sequences (genome survey sequences)
    16. HTG - HTG sequences (high-throughput genomic sequences)
    17. HTC - unfinished high-throughput cDNA sequencing
    18. ENV - environmental sampling sequences
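
The record fields and division codes above can be sketched as a small parser for the first line of a GenBank flat file. This is a minimal illustration that assumes a whitespace-separated nucleotide LOCUS line with eight tokens. The `topology` field is not described in the slides, and the sample line is invented for the example, not taken from a real record.

```python
# Three-letter division codes as listed on the slides.
DIVISIONS = {
    "PRI": "primate", "ROD": "rodent", "MAM": "other mammalian",
    "VRT": "other vertebrate", "INV": "invertebrate",
    "PLN": "plant, fungal, and algal", "BCT": "bacterial", "VRL": "viral",
    "PHG": "bacteriophage", "SYN": "synthetic", "UNA": "unannotated",
    "EST": "expressed sequence tags", "PAT": "patent",
    "STS": "sequence tagged sites", "GSS": "genome survey sequences",
    "HTG": "high-throughput genomic", "HTC": "high-throughput cDNA",
    "ENV": "environmental sampling",
}


def parse_locus(line):
    """Split a nucleotide LOCUS line into named fields (assumed layout)."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS":
        raise ValueError("not a recognised nucleotide LOCUS line")
    _, name, length, unit, mol_type, topology, division, date = parts
    return {
        "locus_name": name,
        "length": int(length),
        "unit": unit,                      # "bp" for nucleotide records
        "molecule_type": mol_type,
        "topology": topology,              # assumption: not covered on the slides
        "division": division,
        "division_name": DIVISIONS.get(division, "unknown"),
        "modification_date": date,
    }


# Invented sample line, for illustration only.
sample = "LOCUS       EXAMPLE01    1200 bp    mRNA    linear   PRI 31-JUL-2018"
record = parse_locus(sample)
print(record["locus_name"], record["length"], record["division_name"])
```

Real flat files are column-oriented and protein records differ in layout, so a production parser should rely on an established library and not on whitespace splitting.
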
  19. Genbank Divisions-3

    • Modification Date shows the date of the last modification.
    • Definition is a brief description of the sequence that includes information such as source organism, gene name/protein name, or some description of the sequence's function.
    • Accession number is the unique identifier for a sequence record. RefSeq records use prefixed accession numbers:
      NT_123456 constructed genomic contigs
      NM_123456 mRNAs
      NP_123456 proteins
      NC_123456 chromosomes
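
To illustrate the accession prefixes, the short sketch below looks up the record type from the first three characters of an accession and separates the trailing version number. The function names are hypothetical. Only the four prefixes named on the slide are covered, so other RefSeq prefixes fall through to "unknown".

```python
# Prefix table restricted to the four RefSeq prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NT_": "constructed genomic contig",
    "NM_": "mRNA",
    "NP_": "protein",
    "NC_": "chromosome",
}


def describe_accession(accession):
    """Return the record type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


def split_version(identifier):
    """Split 'accession.version' into (accession, version or None)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


print(describe_accession("NM_004801.4"))
print(split_version("NM_004801.4"))
```
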
  20. Genbank Divisions-4

    Version shows a nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
    GI ("GenInfo Identifier") is a sequence identification number for the nucleotide sequence.
    Keywords are words or phrases that describe the sequence.
    Source holds free-format information, including an abbreviated form of the organism name, sometimes followed by a molecule type.
    Organism gives the formal scientific name of the source organism and its lineage.
    Reference lists publications by the authors of the sequence that discuss the data reported in the record.
    Authors lists the authors in the order in which they appear in the cited article. Entrez Search Field: Author [AUTH]
    Title gives the title of the published work or the tentative title of an unpublished work. Entrez Search Field: Text Word [WORD]
  21. Genbank Divisions-5

    https://www.ncbi.nlm.nih.gov/nuccore/NM_004801.4
    Journal: MEDLINE abbreviation of the journal name. Entrez Search Field: Journal Name [JOUR]
    Pubmed: PubMed Identifier (PMID).
    Features shows information about genes and gene products, as well as regions of biological significance reported in the sequence.
    Source is a mandatory feature in each record that summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID number. It can also include other information such as map location, strain, clone, tissue type, etc., if provided by the submitter.
    Taxon is a stable unique identification number for the taxon of the source organism.
    CDS (coding sequence) marks the region of nucleotides that corresponds to the sequence of amino acids in a protein.
  22. FASTA

    It is a file format for representing nucleotide or protein sequences as text, in which nucleotides or amino acids are written as single-letter codes and each sequence carries a basic tag or identifier. A FASTA record starts with a greater-than symbol (>), which marks the beginning of a new sequence record; this first line is called the definition line (“def line”). On it, an accession number or version number is followed by a description of the entry. The sequence itself, in either uppercase or lowercase letters, starts on the next line. Sequence lines are wrapped at a fixed width, typically 60 characters per line.
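
The format described above can be read with a few lines of code. The following is a minimal sketch rather than a full parser. The function names and the sample record are invented for illustration, and the 60-character width is the value mentioned on the slide.

```python
def parse_fasta(text):
    """Return a list of (def_line, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new def line closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_defline(header):
    """Separate the identifier from the free-text description."""
    identifier, _, description = header.partition(" ")
    return identifier, description


def wrap(sequence, width=60):
    """Break a sequence into lines of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Invented sample record, not a real database entry.
sample = ">EXAMPLE01.1 hypothetical demo sequence\nATGGCC\nttaa\n"
records = parse_fasta(sample)
print(records)
print(split_defline(records[0][0]))
```
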
  23. DNA sequencing-1

    The sequences stored in the databases were obtained by different experimental methods. The most commonly used methods for DNA sequencing are the Sanger method and the Maxam-Gilbert method. Similarly, Edman degradation and mass spectrometry are used for protein sequencing.
    Sanger Method (dideoxy chain termination method): Four test tubes are labelled A, T, G and C. The DNA is added to each tube in denatured form (single strands). A primer is then added, which anneals to one strand of the template. DNA polymerase extends the 3' end of the primer, incorporating at random either the normal deoxynucleotides or the dideoxynucleotide (ddNTP) specific to that tube. When a ddNTP is attached to the growing chain, the chain terminates, because the ddNTP lacks the 3'-OH group needed to form the phosphodiester bond with the next nucleotide. Short strands of DNA of different lengths are thus formed. Electrophoresis is carried out, and the order of the sequence is obtained by analysing the bands in the gel, which are separated by molecular weight. The primer or one of the nucleotides can also be radioactively or fluorescently labelled, so that the final product is easily detected in the gel and the sequence can be inferred.
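
The logic of reading a Sanger gel can be mimicked with a toy simulation. The sketch below assumes an idealised experiment: the primer length is ignored, the template is written 3' to 5', and every possible termination position produces a fragment. All names and the example template are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_fragments(template):
    """Return, per tube, the lengths of the chain-terminated fragments.

    The new strand is the complement of the template. A fragment of
    length n ends up in the tube whose ddNTP matches base n.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    tubes = {base: [] for base in "ATGC"}
    for position, base in enumerate(new_strand, start=1):
        tubes[base].append(position)
    return tubes


def read_gel(tubes):
    """Order all bands by fragment length and read off the terminating base."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


tubes = sanger_fragments("TACGGAT")   # hypothetical template, 3' to 5'
print(tubes)
print(read_gel(tubes))
```

Shorter fragments migrate further in the gel, so sorting by length corresponds to reading the bands from the bottom of the gel upwards.
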
  24. DNA sequencing-2

    Maxam-Gilbert (chemical degradation method): This method requires a denatured DNA fragment whose 5' end is radioactively labelled. The fragment is purified and then subjected to chemical treatment, which produces a series of labelled fragments. Electrophoresis arranges the fragments according to their molecular weight. To view the fragments, the gel is exposed to X-ray film for autoradiography. A series of dark bands appears, each corresponding to a radiolabelled DNA fragment, from which the sequence can be inferred.
    Edman Degradation reaction: The reaction determines the order of amino acids in a protein starting from the N-terminus, by cleaving one amino acid at a time from the N-terminal end without disrupting the other bonds in the protein. After each cleavage, chromatography or electrophoresis is used to identify the amino acid.
    Mass Spectrometry: It is used to determine the mass of particles and the composition of molecules, and to find the chemical structures of molecules such as peptides and other chemical compounds. Based on the mass-to-charge ratio, one can identify the amino acids in a protein.
  25. ASSIGNMENT after Practicals-1

    1) Gene Expression database
    • EST
    • CDD
    • Genome
    • Pubmed
    2) The database which is used to study the evolutionary relatedness of a population.
    • Gene
    • SNP
    • GSS
    • Popset
  26. ASSIGNMENT after Practicals-2

    3) MeSH database stands for
    • Medical Subject Heading
    • Medical Structure Heading
    • Maximal Structure Heading
    • None of the above
    4) The middle section of an Entrez page, which is the largest, contains
    • Molecular databases
    • Literature databases
    • All the above
    • None of the above
    5) OMIA stands for Online Mendelian Inheritance in Animals. True or False?
    • True
    • False