Accessing data from the command line

Reproducibility. How to automate access to published data?

What is Scien ﬁc Reproducibility? There is a surprising amount
of uncertainty and confusion of what the word reproducibility means. Bioinformatics is a mixture of many sciences - many biologists making use bioinformatics methods don't understand them suf ciently. Often scientists are unable to articulate the details at the appropriate levels.

A deﬁni on of reproducibility We need to know what
someone did. Compare the two statements: 1. We have quality controlled the data with the tool called t r i m m o m a t i c keeping only data with an average quality of 30. 2. We ran t r i m m o m a t i c r e a d 1 . f q S L I D I N G W I N D O W : 4 : 3 0 for quality control. Think about the relative merits of each statement. If you had to redo the analysis which statement would you prefer?

What should be reproducible? The biological discovery. It is not
about getting the exact same les - we need to obtain results that support the same conclusion. What if a discovery can only be made with one particular approach?

A Curious Case of Reproducibility There is an ongoing feud
in the RNA-Seq world. Two software tools appear to be too similar. They reproduce each other too well. The author of the rst tool alleges that the second tool is a reimplemented copy of their algorithm. See the lecture page for links to: accusa on -> reply to accusa on -> rebu湥ਾal of reply. It is a unfortunate and sad story - yet fascinating.

Ingredients of reproducibility Ensuring that we start with the same
data Ensuring that we know what has been done. The above can be surprisingly challenging

How to tell if your analysis is reproducible?

Your analysis is reproducible if, and only if, YOU YOURSELF
are able to reproduce it at any me, and with ease.

Reproducibility starts RIGHT NOW Reproducibility is not something you do
later, at the end of the analysis. You should keep re‐doing it all the time. Erase and start over. Is it hard? Fix the hard part. A test: If you had to delete all your data and were allowed to keep just a single description of what you did how hard would it be to restore the results? That is "reproducibility".

How do we reproduce data? We are interested in accessing
the data and results for the study titled Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak Published in 2014 in the Science research journal. A single cryptic line at the very end states: Sequence data are available at NCBI (NCBI BioGroup: PRJNA257197). “ “

Accession numbers A "well" de ned identi er that links
to an entry in a database. Same information may have multiple access points: 1 0 1 4 1 0 0 3, A F 0 8 6 8 3 3, A F 0 8 6 8 3 3 . 2 Different databases will use different standards. Your eye can learn to recognize accession numbers. P R N J A will be BioProjects.

NCBI speciﬁc numbers Unique ID : a unique (often numerical,
1 0 1 4 1 0 0 3) identi er as it is entered into the database. Their use is now discouraged but some tools still use them. Accession Numbers (Loci): an accession number: A F 0 8 6 8 3 3 applies to the complete database record and remains stable even if updates/revisions are made to the record. The data for a loci may change. Version number: an accession number A F 0 8 6 8 3 3 . 2 with a version number attached to it. These are unique and data under it never changes.

Typical work strategy 1. Familiarize yourself with the data structure.
Sometimes it is best done online to visualize it. 2. Write command line code that reproducibly obtains the data for accession numbers.

Using Entrez Find the page for P R J N
A 2 5 7 1 9 7 at the NCBI web-site. Click around on some of the data links. It states: 2 4 9 genomic features 2 2 4 0 protein sequences We can get these from the command line with Entrez Direct. Remember to activate your environment if you get command not found errors: s o u r c e a c t i v a t e b i o i n f o

Entrez Direct A command line interface to NCBI. It is
a set of tools: e s e a r c h to search databases e f e t c h to download sequences e l i n k to connect across databases e i n f o to obtain information on databases x t r a c t to process XML data See all databases you can search: e i n f o - d b s

Geŗng data with Entrez Direct Once you know an accession
number you can fetch it in different formats: e f e t c h - d b = n u c c o r e - f o r m a t = g b - i d = A F 0 8 6 8 3 3 | h e a d - 6 Produces: L O C U S A F 0 8 6 8 3 3 1 8 9 5 9 b p c R N A l i n e a r D E F I N I T I O N E b o l a v i r u s - M a y i n g a , Z a i r e , 1 9 7 6 , c o m p l e t e g e n o m e . A C C E S S I O N A F 0 8 6 8 3 3 V E R S I O N A F 0 8 6 8 3 3 . 2 K E Y W O R D S . S O U R C E E b o l a v i r u s - M a y i n g a , Z a i r e , 1 9 7 6 ( E B O V - M a y )

Same data in other format We will cover the details
of the format in the future lecture. e f e t c h - d b = n u c c o r e - f o r m a t = f a s t a - i d = A F 0 8 6 8 3 3 | h e a d - 6 Produces > A F 0 8 6 8 3 3 . 2 E b o l a v i r u s - M a y i n g a , Z a i r e , 1 9 7 6 , c o m p l e t e g e n o m e C G G A C A C A C A A A A A G A A A G A A G A A T T T T T A G G A T C T T T T G T G T G C G A A T A A C T A T G A G G A A G A T T T T T C C T C T C A T T G A A A T T T A T A T C G G A A T T T A A A T T G A A A T T G T T A C T G T A A T C A C A C C T G G T C A G A G C C A C A T C A C A A A G A T A G A G A A C A A C C T A G G T C T C C G A A G G G A G C A A G G G C A T C A G T G T G T G A A A A T C C C T T G T C A A C A C C T A G G T C T T A T C A C A T C A C A A G T T C C A C C T C A G A C T C T G C A G G G A A C A A C C T T A A T A G A A A C A T T A T T G T T A A A G G A C A G C A T T A G T T C A C A G T C A A A C A A G C A A G A T

Searching linked informa on Our "magic" project number is P
R J N A 2 5 7 1 9 7 e s e a r c h - d b n u c c o r e - q u e r y P R J N A 2 5 7 1 9 7 The search alone reports the counts: < E N T R E Z _ D I R E C T > < D b > n u c c o r e < ∕ D b > < W e b E n v > N C I D _ 1 _ 1 1 8 2 8 3 9 7 3 _ 1 3 0 . 1 4 . 1 8 . 3 4 _ 9 0 0 1 _ 1 5 0 4 6 2 7 9 5 3 _ 2 0 1 4 3 4 9 2 < Q u e r y K e y > 1 < ∕ Q u e r y K e y > < C o u n t > 2 4 9 < ∕ C o u n t > < S t e p > 1 < ∕ S t e p > < ∕ E N T R E Z _ D I R E C T > This search could be asked to return 2 4 9 results.

The search is database speciﬁc Different databases show different counts:
e s e a r c h - d b p r o t e i n - q u e r y P R J N A 2 5 7 1 9 7 The search only reports if it has results or not: < E N T R E Z _ D I R E C T > < D b > p r o t e i n < ∕ D b > < W e b E n v > N C I D _ 1 _ 3 0 5 2 2 4 1 5 2 _ 1 3 0 . 1 4 . 2 2 . 2 1 5 _ 9 0 0 1 _ 1 5 0 4 6 3 1 6 7 5 _ 1 < Q u e r y K e y > 1 < ∕ Q u e r y K e y > < C o u n t > 2 2 4 0 < ∕ C o u n t > < S t e p > 1 < ∕ S t e p > < ∕ E N T R E Z _ D I R E C T > The search could be asked to return 2 2 4 0 results.

Obtaining the data To get the data, pipe the esearch
into a efetch: e s e a r c h - d b n u c c o r e - q u e r y P R J N A 2 5 7 1 9 7 | e f e t c h - f o r m a t f a s t a > n u c l e o t i d e . f a What did we fetch: c a t n u c l e o t i d e . f a | h e a d - 4 Prints nucleotide sequences: > K R 1 0 5 3 4 5 . 1 Z a i r e e b o l a v i r u s i s o l a t e E b o l a v i r u s ∕ H . s a p i e n s - w t ∕ S L A T A A T T T T C C T C T C A T T G A A A T T T A T A T C G G A A T T T A A A T T G A A A T T G T T A C T G T A A T C A T A C C G T T T C A G A G C C A T A T C A C C A A G A T A G A G A A C A A C C T A G G T C T C C G G A G G G G G C A A G G G C A T C A G C A G T T G A A A A T C C C T T G T C A A C A T C T A G G C C T T A T C A C A T C A C A A G T T C C G C C T T A A A C T C T G C

Gets database speciﬁc data To get the data, pipe the
search into a fetch: e s e a r c h - d b p r o t e i n - q u e r y P R J N A 2 5 7 1 9 7 | e f e t c h - f o r m a t f a s t a > p r o t e i n s . f a What did we fetch: c a t p r o t e i n s . f a | h e a d - 3 Prints protein sequences: > A K C 3 7 2 3 3 . 1 p o l y m e r a s e [ Z a i r e e b o l a v i r u s ] M A T Q H T Q Y P D A R L S S P I V L D Q C D L V T R A C G L Y S S Y S L N P Q L R N C K L P K H I Y R L K Y D V T V T K F L S L P I D F I V P I L L K A L S G N G F C P V E P R C Q Q F L D E I I K Y T M Q D A L F L K Y Y L K N V G A Q E D C V D D H F Q E

Search diﬀerent databases Different databases allow different information to be
retrieved. All belonging to the same publication. e s e a r c h - d b s r a - q u e r y P R J N A 2 5 7 1 9 7 | e f e t c h - f o r m a t r u n i n f o > r u n i n f o . c s v Now we have a le with lots of columns: c a t r u n i n f o . c s v | c u t - d , - f 1 , 2 , 1 6 | h e a d - 3 prints: R u n , R e l e a s e D a t e , L i b r a r y L a y o u t S R R 1 9 7 2 9 1 7 , 2 0 1 5 - 0 4 - 1 4 1 3 : 5 9 : 2 4 , P A I R E D S R R 1 9 7 2 9 1 8 , 2 0 1 5 - 0 4 - 1 4 1 3 : 5 8 : 2 6 , P A I R E D

How to know what informa on is there? It is
a bit of trial and error. The documentation can be exceedingly obtuse. e s e a r c h - d b s r a - q u e r y P R J N A 2 5 7 1 9 7 | e f e t c h - f o r m a t s u m m a r y > s u m m a r y . x m l For example the r u n i n f o is a csv le but the s u m m a r y is in XML. The s u m m a r y . x m l is surprisingly large and will download a lot slower.

Why is the NCBI site so hard to use? It
is not just NCBI. The vast majority of repositories are like that. The scope and needs of life scientists exceed the ability of these organizations to act on that need. Most of them are insular are not required to communicate well. Being a bioinformatician requires the skill to recognize and work around the endless of limitations and problems that might occur. You have to learn to deal with it.

Programma c access to other data sources Each data source
may have its own way of accessing the data Programmatically NCBI - Entrez Direct ENSEMBLE - REST interface UCSC - mysql database queries In this course we will use Entrez Direct. Note: you will need to manually install m y s q l if you want to try out access to UCSC

Why do we need programma c access? To be explicit
of what we did. Compare the two: You still need to gure out how to do it. Versus: # G e t n u c l e o t i d e d a t a . e s e a r c h - d b n u c c o r e - q u e r y P R J N A 2 5 7 1 9 7 | e f e t c h - f o r m a t f a s t a > n u c l e o t i d e . f a See how easily you can get the same data? We have obtained the sequence data for project PRJNA257197. “ “

Other example: Ensembl REST interface Search for Ensembl REST interface
for examples: c u r l - s ' h t t p s : ∕ ∕ r e s t . e n s e m b l . o r g ∕ l o o k u p ∕ i d ∕ E N S G 0 0 0 0 0 1 5 7 7 6 4 ? ' - H ' C o n t e n t - t y p e : a p p l i c a t i o n ∕ j s o n ' will print: { " s o u r c e " : " e n s e m b l _ h a v a n a " , " o b j e c t _ t y p e " : " G e n e " , " l o g i c _ n a m e " : " e n s e m b l _ h a v a n a _ g e n e " , " v e r s i o n " : 1 2 , " s p e c i e s " : " h o m o _ s a p i e n s " , " d e s c r i p t i o n " : " B - R a f p r o t o - o n c o g e n e , s e r i n e ∕ t h r e o n i n e k i n a s e [ S . . .

Other example: UCSC mysql server UCSC can also produce data
through a m y s q l server. m y s q l - h g e n o m e - m y s q l . c s e . u c s c . e d u - u g e n o m e - D h g 3 8 - N - A - e ' s e l e c t c h r o m , s t r a n d , t x S t a r t , c d s S t a r t f r o m k n o w n G e n e ' will produce: c h r 1 - 1 7 3 6 8 1 7 3 6 8 c h r 1 + 2 9 5 5 3 2 9 5 5 3 c h r 1 + 3 0 2 6 6 3 0 2 6 6 . . . Allows for very sophisticated joins and queries across the tables. You need to understand the data

Accessing data from the command line

Accessing data from the command line

Istvan Albert

More Decks by Istvan Albert

Other Decks in Education

Featured

Transcript

Reproducibility. How to automate access to published data?

What is Scien ﬁc Reproducibility? There is a surprising amount

A deﬁni on of reproducibility We need to know what

What should be reproducible? The biological discovery. It is not

A Curious Case of Reproducibility There is an ongoing feud

Ingredients of reproducibility Ensuring that we start with the same

How to tell if your analysis is reproducible?

Your analysis is reproducible if, and only if, YOU YOURSELF

Reproducibility starts RIGHT NOW Reproducibility is not something you do

How do we reproduce data? We are interested in accessing

Accession numbers A "well" de ned identi er that links

NCBI speciﬁc numbers Unique ID : a unique (often numerical,

Typical work strategy 1. Familiarize yourself with the data structure.

Using Entrez Find the page for P R J N

Entrez Direct A command line interface to NCBI. It is

Geŗng data with Entrez Direct Once you know an accession

Same data in other format We will cover the details

Searching linked informa on Our "magic" project number is P

The search is database speciﬁc Different databases show different counts:

Obtaining the data To get the data, pipe the esearch

Gets database speciﬁc data To get the data, pipe the

Search diﬀerent databases Different databases allow different information to be

How to know what informa on is there? It is

Why is the NCBI site so hard to use? It

Programma c access to other data sources Each data source

Why do we need programma c access? To be explicit

Other example: Ensembl REST interface Search for Ensembl REST interface

Other example: UCSC mysql server UCSC can also produce data