Accessing data from the command line

Istvan Albert
September 21, 2020

Reproducibility. Data repositories. Downloading with E-Direct


Transcript

  1. What is Scientific Reproducibility?

    There is a surprising amount of uncertainty and confusion about what the word reproducibility means. Bioinformatics is a mixture of many sciences, and many biologists who use bioinformatics methods don't understand them sufficiently. Often scientists are unable to articulate the details at the appropriate level.
  2. A definition of reproducibility

    We need to know what someone did. Compare the two statements:

    1. We have quality controlled the data with the tool called trimmomatic keeping only data with an average quality of 30.
    2. We ran trimmomatic read1.fq SLIDINGWINDOW:4:30 for quality control.

    Think about the relative merits of each statement. If you had to redo the analysis, which statement would you prefer?
  3. What should be reproducible?

    The biological discovery. It is not about getting the exact same files: we need to obtain results that support the same conclusion. What if a discovery can only be made with one particular approach?
  4. A Curious Case of Reproducibility

    There is an ongoing feud in the RNA-Seq world. Two software tools appear to be too similar. They reproduce each other too well. The author of the first tool alleges that the second tool is a reimplemented copy of their algorithm. See the lecture page for links to: accusation -> reply to accusation -> rebuttal of reply. It is an unfortunate and sad story, yet fascinating.
  5. Ingredients of reproducibility

    Ensuring that we start with the same data.
    Ensuring that we know what has been done.

    The above can be surprisingly challenging.
  6. Your analysis is reproducible if, and only if, YOU YOURSELF are able to reproduce it at any time, and with ease.
  7. Reproducibility starts RIGHT NOW

    Reproducibility is not something you do later, at the end of the analysis. You should keep redoing it all the time. Erase and start over. Is it hard? Fix the hard part. A test: if you had to delete all your data and were allowed to keep just a single description of what you did, how hard would it be to restore the results? That is "reproducibility".
  8. How do we reproduce data?

    We are interested in accessing the data and results for the study titled "Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak", published in 2014 in the research journal Science. A single cryptic line at the very end states:

    "Sequence data are available at NCBI (NCBI BioGroup: PRJNA257197)."
  9. Accession numbers

    A "well" defined identifier that links to an entry in a database. The same information may have multiple access points: 10141003, AF086833, AF086833.2. Different databases will use different standards. Your eye can learn to recognize accession numbers. Numbers that start with PRJNA will be BioProjects.
  10. NCBI specific numbers

    Unique ID: a unique (often numerical, 10141003) identifier as it is entered into the database. Their use is now discouraged but some tools still use them.

    Accession number (locus): an accession number such as AF086833 applies to the complete database record and remains stable even if updates/revisions are made to the record. The data for a locus may change.

    Version number: an accession number with a version number attached to it, such as AF086833.2. These are unique and the data under them never changes.
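The three kinds of identifiers can be told apart by shape alone. The sketch below is one way to do that in the shell; `classify_id` is a hypothetical helper, and its patterns are simplifying assumptions that cover only the examples on these slides (numeric IDs, BioProject numbers, and GenBank nucleotide accessions of one or two letters followed by five or six digits), not NCBI's full set of formats.

```shell
# Hypothetical helper: guess the kind of NCBI identifier from its shape.
# The patterns are simplified assumptions, not NCBI's official rules.
classify_id() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]+$'; then
        echo "unique-id"
    elif printf '%s\n' "$1" | grep -Eq '^PRJ[A-Z]{2}[0-9]+$'; then
        echo "bioproject"
    elif printf '%s\n' "$1" | grep -Eq '^[A-Z]{1,2}[0-9]{5,6}\.[0-9]+$'; then
        echo "accession.version"
    elif printf '%s\n' "$1" | grep -Eq '^[A-Z]{1,2}[0-9]{5,6}$'; then
        echo "accession"
    else
        echo "unknown"
    fi
}

# Classify the identifiers mentioned on the slides.
for id in 10141003 AF086833 AF086833.2 PRJNA257197; do
    printf '%s\t%s\n' "$id" "$(classify_id "$id")"
done
```

Only the versioned form pins down data that never changes, so it is the safest one to record in a script.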
  11. Typical work strategy

    1. Familiarize yourself with the data structure. Sometimes it is best done online to visualize it.
    2. Write command line code that reproducibly obtains the data for accession numbers.
  12. Using Entrez

    Find the page for PRJNA257197 at the NCBI website. Click around on some of the data links. It states:

        249 genomic features
        2240 protein sequences

    We can get these from the command line with Entrez Direct. Remember to activate your environment if you get "command not found" errors:

        source activate bioinfo
  13. Entrez Direct

    A command line interface to NCBI. It is a set of tools:

        esearch  to search databases
        efetch   to download sequences
        elink    to connect across databases
        einfo    to obtain information on databases
        xtract   to process XML data

    See all databases you can search:

        einfo -dbs
  14. Getting data with Entrez Direct

    Once you know an accession number you can fetch it in different formats:

        efetch -db=nuccore -format=gb -id=AF086833 | head -6

    Produces:

        LOCUS       AF086833    18959 bp    cRNA    linear
        DEFINITION  Ebola virus - Mayinga, Zaire, 1976, complete genome.
        ACCESSION   AF086833
        VERSION     AF086833.2
        KEYWORDS    .
        SOURCE      Ebola virus - Mayinga, Zaire, 1976 (EBOV-May)
  15. Same data in other format

    We will cover the details of the format in a future lecture.

        efetch -db=nuccore -format=fasta -id=AF086833 | head -6

    Produces:

        >AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
        CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGAT
        TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCACACCTGGT
        CAGAGCCACATCACAAAGATAGAGAACAACCTAGGTCTCCGAAGGGAGCAAGGGCATCAGTGTG
        TGAAAATCCCTTGTCAACACCTAGGTCTTATCACATCACAAGTTCCACCTCAGACTCTGCAGGG
        AACAACCTTAATAGAAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGAT
  16. Searching linked information

    Our "magic" project number is PRJNA257197:

        esearch -db nuccore -query PRJNA257197

    The search alone reports the counts:

        <ENTREZ_DIRECT>
          <Db>nuccore</Db>
          <WebEnv>NCID_1_118283973_130.14.18.34_9001_1504627953_20143492
          <QueryKey>1</QueryKey>
          <Count>249</Count>
          <Step>1</Step>
        </ENTREZ_DIRECT>

    This search could be asked to return 249 results.
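The count can also be pulled out of the search result so that a script can check it. This is a minimal sketch that uses plain `sed`, only so that it runs without Entrez Direct installed; `get_count` is a hypothetical helper name, and the sample text is copied from the output above with the WebEnv line left out. Entrez Direct's own `xtract` tool is designed for processing this kind of XML and would be the more natural choice in practice.

```shell
# Hypothetical helper: read esearch XML on standard input
# and print the number found inside the <Count> element.
get_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}

# Sample copied from the slide (WebEnv line omitted).
sample='<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <QueryKey>1</QueryKey>
  <Count>249</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>'

printf '%s\n' "$sample" | get_count

# With network access the same helper could be fed by the real search:
#   esearch -db nuccore -query PRJNA257197 | get_count
```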
  17. The search is database specific

    Different databases show different counts:

        esearch -db protein -query PRJNA257197

    The search alone only reports whether it has results, and how many:

        <ENTREZ_DIRECT>
          <Db>protein</Db>
          <WebEnv>NCID_1_305224152_130.14.22.215_9001_1504631675_1
          <QueryKey>1</QueryKey>
          <Count>2240</Count>
          <Step>1</Step>
        </ENTREZ_DIRECT>

    The search could be asked to return 2240 results.
  18. Obtaining the data

    To get the data, pipe the esearch into an efetch:

        esearch -db nuccore -query PRJNA257197 | efetch -format fasta > nucleotide.fa

    What did we fetch:

        cat nucleotide.fa | head -4

    Prints nucleotide sequences:

        >KR105345.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SL
        ATAATTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACC
        GTTTCAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAG
        CAGTTGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGC
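A quick sanity check after a download is to count the records in the file and compare that number with the count the search reported. The sketch below assumes only that every FASTA record starts with a `>` line; `count_records` is a hypothetical helper, and the two-record demo file is made up so that the example runs offline.

```shell
# Hypothetical helper: count FASTA records in a file.
# Every record begins with a header line that starts with '>'.
count_records() {
    grep -c '^>' "$1"
}

# Made-up stand-in file so the sketch runs without a download.
demo=$(mktemp)
printf '%s\n' '>seq1 example record' 'ACGT' '>seq2 example record' 'GGCC' > "$demo"

count_records "$demo"
rm -f "$demo"

# After the real download one would run:
#   count_records nucleotide.fa
```

For nucleotide.fa the result should match the count from the search, as long as the database has not changed in between.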
  19. Gets database specific data

    To get the data, pipe the search into a fetch:

        esearch -db protein -query PRJNA257197 | efetch -format fasta > proteins.fa

    What did we fetch:

        cat proteins.fa | head -3

    Prints protein sequences:

        >AKC37233.1 polymerase [Zaire ebolavirus]
        MATQHTQYPDARLSSPIVLDQCDLVTRACGLYSSYSLNPQLRNCKLPKHIYRLKYDVTVTKFLS
        LPIDFIVPILLKALSGNGFCPVEPRCQQFLDEIIKYTMQDALFLKYYLKNVGAQEDCVDDHFQE
  20. Search different databases

    Different databases allow different information to be retrieved, all belonging to the same publication.

        esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv

    Now we have a file with lots of columns:

        cat runinfo.csv | cut -d , -f 1,2,16 | head -3

    prints:

        Run,ReleaseDate,LibraryLayout
        SRR1972917,2015-04-14 13:59:24,PAIRED
        SRR1972918,2015-04-14 13:58:26,PAIRED
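The run accessions in the first column are usually what later steps need. Here is a minimal sketch for listing them; `list_runs` is a hypothetical helper, and skipping blank lines and header lines is a defensive assumption in case the file contains more than one header. The sample lines are the three shown above.

```shell
# Hypothetical helper: print the Run column of a runinfo CSV read
# from standard input, skipping header lines and empty lines.
list_runs() {
    awk -F , '$1 != "" && $1 != "Run" { print $1 }'
}

# The three lines shown on the slide.
printf '%s\n' \
    'Run,ReleaseDate,LibraryLayout' \
    'SRR1972917,2015-04-14 13:59:24,PAIRED' \
    'SRR1972918,2015-04-14 13:58:26,PAIRED' | list_runs

# On the real file:
#   cat runinfo.csv | list_runs > runids.txt
```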
  21. How to know what information is there?

    It is a bit of trial and error. The documentation can be exceedingly obtuse.

        esearch -db sra -query PRJNA257197 | efetch -format summary > summary.xml

    For example, the runinfo is a CSV file but the summary is in XML. The summary.xml file is surprisingly large and will download much more slowly.
  22. Why is the NCBI site so hard to use?

    It is not just NCBI. The vast majority of repositories are like that. The scope and needs of life scientists exceed the ability of these organizations to act on those needs. Most of them are insular and are not required to communicate well. Being a bioinformatician requires the skill to recognize and work around the endless limitations and problems that might occur. You have to learn to deal with it.
  23. Programmatic access to other data sources

    Each data source may have its own way of accessing the data programmatically:

        NCBI: Entrez Direct
        Ensembl: REST interface
        UCSC: mysql database queries

    In this course we will use Entrez Direct. Note: you will need to manually install mysql if you want to try out access to UCSC.
  24. Why do we need programmatic access?

    To be explicit about what we did. Compare the two:

    "We have obtained the sequence data for project PRJNA257197."

    You still need to figure out how to do it. Versus:

        # Get nucleotide data.
        esearch -db nuccore -query PRJNA257197 | efetch -format fasta > nucleotide.fa

    See how easily you can get the same data?
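One way to keep such a record is a single script that lists every download for the project. The sketch below is an assumption about how that could look, not the course's own script: `fetch_command` is a hypothetical helper that only prints each pipeline, so the list can be reviewed first and then executed by piping the output to `bash`. The databases, formats and file names are the ones used on the earlier slides.

```shell
# Hypothetical project script: one place that records how each file is made.
PROJECT=PRJNA257197

# Print (not run) the pipeline that fetches one database in one format.
# Arguments: database, efetch format, output file.
fetch_command() {
    echo "esearch -db $1 -query $PROJECT | efetch -format $2 > $3"
}

fetch_command nuccore fasta nucleotide.fa
fetch_command protein fasta proteins.fa
fetch_command sra runinfo runinfo.csv
```

Saved under a name of your choice, say `getdata.sh`, running `sh getdata.sh` shows the commands and `sh getdata.sh | bash` would execute them, provided Entrez Direct is installed and the network is reachable.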
  25. Other example: Ensembl REST interface

    Search for "Ensembl REST interface" for examples:

        curl -s 'https://rest.ensembl.org/lookup/id/ENSG00000157764?' -H 'Content-type:application/json'

    will print:

        {"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","version":12,"species":"homo_sapiens","description":"B-Raf proto-oncogene, serine/threonine kinase [S...
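A single field can be picked out of such a response in the shell. The sketch below uses `sed` on a shortened copy of the output above; `json_field` is a hypothetical helper that handles only simple string values on one line, which is an assumption that holds for this sample but not for JSON in general. A real JSON parser is the more robust choice for anything beyond a quick look.

```shell
# Hypothetical helper: print the string value of one field from
# compact, single-line JSON read on standard input.
# Works only for simple "key":"value" pairs without escaped quotes.
json_field() {
    sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}

# Shortened copy of the response shown on the slide.
sample='{"source":"ensembl_havana","object_type":"Gene","version":12,"species":"homo_sapiens"}'

printf '%s\n' "$sample" | json_field species
printf '%s\n' "$sample" | json_field object_type
```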
  26. Other example: UCSC mysql server

    UCSC can also produce data through a mysql server.

        mysql -h genome-mysql.cse.ucsc.edu -u genome -D hg38 -N -A -e 'select chrom,strand,txStart,cdsStart from knownGene'

    will produce:

        chr1  -  17368  17368
        chr1  +  29553  29553
        chr1  +  30266  30266
        ...

    Allows for very sophisticated joins and queries across the tables. You need to understand the data
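Output like this is easy to summarize with standard tools once it is on the command line. As a small sketch, the hypothetical helper `strand_counts` tallies how many rows fall on each strand; it assumes tab-separated columns with the strand in the second column, and the three sample rows are the ones shown above.

```shell
# Hypothetical helper: count rows per strand in tab-separated input
# where the second column holds the strand (+ or -).
strand_counts() {
    awk -F '\t' '{ n[$2]++ } END { printf "+\t%d\n-\t%d\n", n["+"], n["-"] }'
}

# The three rows shown on the slide, tab-separated.
printf 'chr1\t-\t17368\t17368\nchr1\t+\t29553\t29553\nchr1\t+\t30266\t30266\n' | strand_counts

# With the mysql client installed, the query output could be piped
# straight into the helper in the same way.
```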