Accessing data from the command line

Slide 1

Slide 1 text

Reproducibility. How to automate access to published data?

Slide 2

Slide 2 text

What is Scientific Reproducibility?

There is a surprising amount of uncertainty and confusion about what the word reproducibility means. Bioinformatics is a mixture of many sciences, and many biologists who use bioinformatics methods don't understand them sufficiently. Often scientists are unable to articulate the details at the appropriate level.

Slide 3

Slide 3 text

A definition of reproducibility

We need to know what someone did. Compare the two statements:

1. We have quality controlled the data with the tool called trimmomatic keeping only data with an average quality of 30.
2. We ran trimmomatic read1.fq SLIDINGWINDOW:4:30 for quality control.

Think about the relative merits of each statement. If you had to redo the analysis, which statement would you prefer?

Slide 4

Slide 4 text

What should be reproducible?

The biological discovery. It is not about getting the exact same files - we need to obtain results that support the same conclusion. What if a discovery can only be made with one particular approach?

Slide 5

Slide 5 text

A Curious Case of Reproducibility

There is an ongoing feud in the RNA-Seq world. Two software tools appear to be too similar. They reproduce each other too well. The author of the first tool alleges that the second tool is a reimplemented copy of their algorithm. See the lecture page for links to: accusation -> reply to accusation -> rebuttal of reply. It is an unfortunate and sad story - yet fascinating.

Slide 6

Slide 6 text

Ingredients of reproducibility

1. Ensuring that we start with the same data.
2. Ensuring that we know what has been done.

The above can be surprisingly challenging.

Slide 7

Slide 7 text

How to tell if your analysis is reproducible?

Slide 8

Slide 8 text

Your analysis is reproducible if, and only if, YOU YOURSELF are able to reproduce it at any time, and with ease.

Slide 9

Slide 9 text

Reproducibility starts RIGHT NOW

Reproducibility is not something you do later, at the end of the analysis. You should keep redoing it all the time. Erase and start over. Is it hard? Fix the hard part.

A test: If you had to delete all your data and were allowed to keep just a single description of what you did, how hard would it be to restore the results? That is "reproducibility".

Slide 10

Slide 10 text

How do we reproduce data?

We are interested in accessing the data and results for the study titled "Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak", published in 2014 in the journal Science. A single cryptic line at the very end states:

    Sequence data are available at NCBI (NCBI BioGroup: PRJNA257197).

Slide 11

Slide 11 text

Accession numbers

A "well" defined identifier that links to an entry in a database. The same information may have multiple access points: 10141003, AF086833, AF086833.2. Different databases will use different standards. Your eye can learn to recognize accession numbers. Identifiers starting with PRJNA will be BioProjects.

Slide 12

Slide 12 text

NCBI specific numbers

1. Unique ID: a unique (often numerical, for example 10141003) identifier as it is entered into the database. Their use is now discouraged but some tools still use them.
2. Accession number (locus): an accession number such as AF086833 applies to the complete database record and remains stable even if updates/revisions are made to the record. The data for a locus may change.
3. Version number: an accession number with a version number attached to it, such as AF086833.2. These are unique and the data under them never changes.
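
The relationship between an accession number and its versioned form can be illustrated with plain shell parameter expansion. This is only a sketch: the function name split_accession is made up for this example and is not part of any NCBI tool, and it assumes the version is whatever follows the last dot.

```shell
# Hypothetical helper (not part of Entrez Direct): split a versioned
# accession such as AF086833.2 into its stable accession and its version.
split_accession() {
    id=$1
    case "$id" in
        *.*) echo "${id%.*} ${id##*.}" ;;   # accession, then version
        *)   echo "$id -" ;;                # no version attached
    esac
}

split_accession AF086833.2   # prints: AF086833 2
split_accession AF086833     # prints: AF086833 -
```

Recording the versioned form in your notes is the safer habit, since only that form points at data that never changes.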

Slide 13

Slide 13 text

Typical work strategy

1. Familiarize yourself with the data structure. Sometimes it is best done online to visualize it.
2. Write command line code that reproducibly obtains the data for accession numbers.

Slide 14

Slide 14 text

Using Entrez

Find the page for PRJNA257197 at the NCBI website. Click around on some of the data links. It states:

    249 genomic features
    2240 protein sequences

We can get these from the command line with Entrez Direct. Remember to activate your environment if you get command not found errors:

    source activate bioinfo

Slide 15

Slide 15 text

Entrez Direct

A command line interface to NCBI. It is a set of tools:

    esearch   to search databases
    efetch    to download sequences
    elink     to connect across databases
    einfo     to obtain information on databases
    xtract    to process XML data

See all databases you can search:

    einfo -dbs

Slide 16

Slide 16 text

Getting data with Entrez Direct

Once you know an accession number you can fetch it in different formats:

    efetch -db=nuccore -format=gb -id=AF086833 | head -6

Produces:

    LOCUS       AF086833               18959 bp    cRNA    linear
    DEFINITION  Ebola virus - Mayinga, Zaire, 1976, complete genome.
    ACCESSION   AF086833
    VERSION     AF086833.2
    KEYWORDS    .
    SOURCE      Ebola virus - Mayinga, Zaire, 1976 (EBOV-May)

Slide 17

Slide 17 text

Same data in another format

We will cover the details of the format in a future lecture.

    efetch -db=nuccore -format=fasta -id=AF086833 | head -6

Produces:

    >AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
    CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGAT
    TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCACACCTGGT
    CAGAGCCACATCACAAAGATAGAGAACAACCTAGGTCTCCGAAGGGAGCAAGGGCATCAGTGTG
    TGAAAATCCCTTGTCAACACCTAGGTCTTATCACATCACAAGTTCCACCTCAGACTCTGCAGGG
    AACAACCTTAATAGAAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGAT

Slide 18

Slide 18 text

Searching linked information

Our "magic" project number is PRJNA257197.

    esearch -db nuccore -query PRJNA257197

The search alone reports the counts:

    <ENTREZ_DIRECT>
      <Db>nuccore</Db>
      <WebEnv>NCID_1_118283973_130.14.18.34_9001_1504627953_20143492
      <QueryKey>1</QueryKey>
      <Count>249</Count>
      <Step>1</Step>
    </ENTREZ_DIRECT>

This search could be asked to return 249 results.
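
The count can be pulled out of that XML with standard tools. The sketch below works on a saved copy of the output (the file name esearch.xml is an assumption, and the truncated WebEnv line is left out). It uses sed rather than a real XML parser, which is adequate here only because each element sits on its own line.

```shell
# Saved copy of the esearch output from the slide (WebEnv line omitted).
cat > esearch.xml <<'EOF'
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <QueryKey>1</QueryKey>
  <Count>249</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
EOF

# Print only the digits between <Count> and </Count>.
count=$(sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' esearch.xml)
echo "records found: $count"   # prints: records found: 249
```

With a live connection the same sed expression can be applied to the output of esearch through a pipe. Entrez Direct's own xtract tool is meant for this kind of XML processing as well.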

Slide 19

Slide 19 text

The search is database specific

Different databases show different counts:

    esearch -db protein -query PRJNA257197

Again, the search itself only reports how many results it has:

    <ENTREZ_DIRECT>
      <Db>protein</Db>
      <WebEnv>NCID_1_305224152_130.14.22.215_9001_1504631675_1
      <QueryKey>1</QueryKey>
      <Count>2240</Count>
      <Step>1</Step>
    </ENTREZ_DIRECT>

The search could be asked to return 2240 results.

Slide 20

Slide 20 text

Obtaining the data

To get the data, pipe the esearch into an efetch:

    esearch -db nuccore -query PRJNA257197 | efetch -format fasta > nucleotide.fa

What did we fetch:

    cat nucleotide.fa | head -4

Prints nucleotide sequences:

    >KR105345.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SL
    ATAATTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACC
    GTTTCAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAG
    CAGTTGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGC
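
A quick sanity check after a download is to count the FASTA headers and compare the number against the Count reported by esearch. The sketch below uses a tiny stand-in file, toy.fa, filled with placeholder records so that it runs offline. On the real data the file would be nucleotide.fa and the expected number should be 249.

```shell
# Toy stand-in for a downloaded FASTA file: placeholder records, not real data.
cat > toy.fa <<'EOF'
>seq1 placeholder record
ACGTACGT
ACGT
>seq2 placeholder record
TTGACA
EOF

expected=2                       # for the real project: the esearch Count
found=$(grep -c '^>' toy.fa)     # every FASTA record starts with '>'

if [ "$found" -eq "$expected" ]; then
    echo "OK: $found records"
else
    echo "MISMATCH: expected $expected, found $found"
fi
```

A mismatch usually means the download was interrupted, which is worth knowing before any analysis starts.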

Slide 21

Slide 21 text

Gets database specific data

To get the data, pipe the search into a fetch:

    esearch -db protein -query PRJNA257197 | efetch -format fasta > proteins.fa

What did we fetch:

    cat proteins.fa | head -3

Prints protein sequences:

    >AKC37233.1 polymerase [Zaire ebolavirus]
    MATQHTQYPDARLSSPIVLDQCDLVTRACGLYSSYSLNPQLRNCKLPKHIYRLKYDVTVTKFLS
    LPIDFIVPILLKALSGNGFCPVEPRCQQFLDEIIKYTMQDALFLKYYLKNVGAQEDCVDDHFQE

Slide 22

Slide 22 text

Search different databases

Different databases allow different information to be retrieved, all of it belonging to the same publication.

    esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv

Now we have a file with lots of columns:

    cat runinfo.csv | cut -d , -f 1,2,16 | head -3

prints:

    Run,ReleaseDate,LibraryLayout
    SRR1972917,2015-04-14 13:59:24,PAIRED
    SRR1972918,2015-04-14 13:58:26,PAIRED
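
Column positions such as 16 are easy to get wrong and may shift if the file layout changes. One alternative, sketched below, is to select a column by its header name with awk. The sample file (runinfo_sample.csv, a name chosen for this example) holds only the three columns printed above, and the naive comma split assumes no field contains a quoted comma.

```shell
# Sample rows copied from the output above (three columns only).
cat > runinfo_sample.csv <<'EOF'
Run,ReleaseDate,LibraryLayout
SRR1972917,2015-04-14 13:59:24,PAIRED
SRR1972918,2015-04-14 13:58:26,PAIRED
EOF

# Hypothetical helper: find the column whose header equals the given name,
# then print that column for every data row.
get_column() {
    awk -F, -v name="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    ' "$2"
}

get_column Run runinfo_sample.csv
get_column LibraryLayout runinfo_sample.csv
```

The first call lists the run identifiers, one per line, which is the form most download loops expect.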

Slide 23

Slide 23 text

How to know what information is there?

It is a bit of trial and error. The documentation can be exceedingly obtuse.

    esearch -db sra -query PRJNA257197 | efetch -format summary > summary.xml

For example, the runinfo format is a CSV file but the summary format is XML. The summary.xml file is surprisingly large and downloads a lot more slowly.

Slide 24

Slide 24 text

Why is the NCBI site so hard to use?

It is not just NCBI. The vast majority of repositories are like that. The scope and needs of life scientists exceed the ability of these organizations to act on that need. Most of them are insular and are not required to communicate well. Being a bioinformatician requires the skill to recognize and work around the endless limitations and problems that might occur. You have to learn to deal with it.

Slide 25

Slide 25 text

Programmatic access to other data sources

Each data source may have its own way of accessing the data programmatically:

    NCBI - Entrez Direct
    Ensembl - REST interface
    UCSC - mysql database queries

In this course we will use Entrez Direct. Note: you will need to manually install mysql if you want to try out access to UCSC.

Slide 26

Slide 26 text

Why do we need programmatic access?

To be explicit about what we did. Compare the two:

    We have obtained the sequence data for project PRJNA257197.

You still need to figure out how to do it. Versus:

    # Get nucleotide data.
    esearch -db nuccore -query PRJNA257197 | efetch -format fasta > nucleotide.fa

See how easily you can get the same data?

Slide 27

Slide 27 text

Other example: Ensembl REST interface

Search for Ensembl REST interface for examples:

    curl -s 'https://rest.ensembl.org/lookup/id/ENSG00000157764?' -H 'Content-type:application/json'

will print:

    {"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","version":12,"species":"homo_sapiens","description":"B-Raf proto-oncogene, serine/threonine kinase [S...

Slide 28

Slide 28 text

Other example: UCSC mysql server

UCSC can also produce data through a mysql server.

    mysql -h genome-mysql.cse.ucsc.edu -u genome -D hg38 -N -A -e 'select chrom,strand,txStart,cdsStart from knownGene'

will produce:

    chr1  -  17368  17368
    chr1  +  29553  29553
    chr1  +  30266  30266
    ...

Allows for very sophisticated joins and queries across the tables. You need to understand the data.
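
Because the query output is plain tab-separated text, it can be summarised with ordinary command line tools. The sketch below counts rows per strand using the three rows shown above, saved to a hypothetical file genes.tsv so that it runs without a database connection. The helper name count_strands is made up for this example.

```shell
# The three rows from the slide, tab-separated: chrom, strand, txStart, cdsStart.
printf 'chr1\t-\t17368\t17368\n'  > genes.tsv
printf 'chr1\t+\t29553\t29553\n' >> genes.tsv
printf 'chr1\t+\t30266\t30266\n' >> genes.tsv

# Count rows per strand (column 2) and print "strand count" pairs.
count_strands() {
    awk -F'\t' '{ n[$2]++ } END { for (s in n) print s, n[s] }' "$1" | sort
}

count_strands genes.tsv
```

For these three rows the result is two entries on the + strand and one on the - strand. On a live connection the mysql command could be piped into the same awk expression instead of going through a file.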