of uncertainty and confusion about what the word reproducibility means. Bioinformatics is a mixture of many sciences - many biologists who make use of bioinformatics methods don't understand them sufficiently. Often scientists are unable to articulate the details at the appropriate level.
someone did. Compare the two statements:
1. We have quality controlled the data with the tool called trimmomatic keeping only data with an average quality of 30.
2. We ran trimmomatic read1.fq SLIDINGWINDOW:4:30 for quality control.
Think about the relative merits of each statement. If you had to redo the analysis, which statement would you prefer?
about getting the exact same files - we need to obtain results that support the same conclusion. What if a discovery can only be made with one particular approach?
in the RNA-Seq world. Two software tools appear to be too similar. They reproduce each other too well. The author of the first tool alleges that the second tool is a reimplemented copy of their algorithm. See the lecture page for links to: accusation -> reply to accusation -> rebuttal of reply. It is an unfortunate and sad story - yet fascinating.
later, at the end of the analysis. You should keep redoing it all the time. Erase and start over. Is it hard? Fix the hard part. A test: if you had to delete all your data and were allowed to keep just a single description of what you did, how hard would it be to restore the results? That is "reproducibility".
the data and results for the study titled "Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak", published in 2014 in the research journal Science. A single cryptic line at the very end states:
"Sequence data are available at NCBI (NCBI BioGroup: PRJNA257197)."
to an entry in a database. The same information may have multiple access points: 10141003, AF086833, AF086833.2
Different databases will use different standards. Your eye can learn to recognize accession numbers. Identifiers starting with PRJNA will be BioProjects.
10141003) identifier as it is entered into the database. Their use is now discouraged but some tools still use them.
Accession Numbers (Loci): an accession number such as AF086833 applies to the complete database record and remains stable even if updates/revisions are made to the record. The data for a locus may change.
Version number: an accession number with a version number attached to it, such as AF086833.2. These are unique and the data under them never changes.
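The relationship between an accession number and its versioned form can be illustrated with plain shell string handling. This is only a minimal sketch: the variable names are made up for illustration, and it assumes the identifier has the accession.version shape used in these slides.

```shell
# A versioned accession number, as used in the slides.
versioned="AF086833.2"

# Everything before the last dot is the stable accession number.
accession="${versioned%.*}"

# Everything after the last dot is the version.
version="${versioned##*.}"

echo "accession=$accession version=$version"
```

Keeping the version in your notes matters: the bare accession may point to updated data later, while the versioned form always refers to the same data.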
A257197 at the NCBI web-site. Click around on some of the data links. It states:
249 genomic features
2240 protein sequences
We can get these from the command line with Entrez Direct. Remember to activate your environment if you get command not found errors:
source activate bioinfo
a set of tools:
esearch to search databases
efetch to download sequences
elink to connect across databases
einfo to obtain information on databases
xtract to process XML data
See all databases you can search:
einfo -dbs
number you can fetch it in different formats:
efetch -db=nuccore -format=gb -id=AF086833 | head -6
Produces:
LOCUS       AF086833 18959 bp cRNA linear
DEFINITION  Ebola virus - Mayinga, Zaire, 1976, complete genome.
ACCESSION   AF086833
VERSION     AF086833.2
KEYWORDS    .
SOURCE      Ebola virus - Mayinga, Zaire, 1976 (EBOV-May)
of the format in a future lecture.
efetch -db=nuccore -format=fasta -id=AF086833 | head -6
Produces:
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGAT
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCACACCTGGT
CAGAGCCACATCACAAAGATAGAGAACAACCTAGGTCTCCGAAGGGAGCAAGGGCATCAGTGTG
TGAAAATCCCTTGTCAACACCTAGGTCTTATCACATCACAAGTTCCACCTCAGACTCTGCAGGG
AACAACCTTAATAGAAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGAT
RJNA257197
esearch -db nuccore -query PRJNA257197
The search alone reports the counts:
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>NCID_1_118283973_130.14.18.34_9001_1504627953_20143492
  <QueryKey>1</QueryKey>
  <Count>249</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
This search could be asked to return 249 results.
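The count can be pulled out of such a report with standard text tools. The following is a minimal sketch, not the recommended Entrez Direct workflow: it assumes the report is held in a shell variable (here a shortened copy of the output above, with the WebEnv line left out) and that the Count element sits on a single line. The xtract tool listed earlier is meant for processing this XML and is likely the more robust choice.

```shell
# Shortened copy of the esearch report shown above (WebEnv line omitted).
report='<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <QueryKey>1</QueryKey>
  <Count>249</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>'

# Keep only the digits between <Count> and </Count>.
count=$(printf '%s\n' "$report" | sed -n 's:.*<Count>\([0-9]*\)</Count>.*:\1:p')

echo "records found: $count"
```

Recording this number alongside the command gives you something to check against the next time the search is rerun.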
esearch -db protein -query PRJNA257197
The search only reports if it has results or not:
<ENTREZ_DIRECT>
  <Db>protein</Db>
  <WebEnv>NCID_1_305224152_130.14.22.215_9001_1504631675_1
  <QueryKey>1</QueryKey>
  <Count>2240</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
The search could be asked to return 2240 results.
into an efetch:
esearch -db nuccore -query PRJNA257197 | efetch -format fasta > nucleotide.fa
What did we fetch:
cat nucleotide.fa | head -4
Prints nucleotide sequences:
>KR105345.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SL
ATAATTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACC
GTTTCAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAG
CAGTTGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGC
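A quick sanity check after a download is to count the records in the file and compare that with the count the search reported. Below is a minimal sketch using a tiny stand-in file (sample.fa, with made-up names and sequences), relying on the fact that each FASTA record starts with one header line beginning with ">".

```shell
# Create a tiny stand-in FASTA file; the names and sequences are made up.
printf '>seq1 example\nATAATTTTCC\n>seq2 example\nGTTTCAGAGC\n' > sample.fa

# Every record has exactly one header line starting with ">".
records=$(grep -c '^>' sample.fa)

echo "sample.fa contains $records records"
```

The same grep can be pointed at nucleotide.fa to see how many sequences were actually fetched.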
search into a fetch:
esearch -db protein -query PRJNA257197 | efetch -format fasta > proteins.fa
What did we fetch:
cat proteins.fa | head -3
Prints protein sequences:
>AKC37233.1 polymerase [Zaire ebolavirus]
MATQHTQYPDARLSSPIVLDQCDLVTRACGLYSSYSLNPQLRNCKLPKHIYRLKYDVTVTKFLS
LPIDFIVPILLKALSGNGFCPVEPRCQQFLDEIIKYTMQDALFLKYYLKNVGAQEDCVDDHFQE
retrieved. All belonging to the same publication.
esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
Now we have a file with lots of columns:
cat runinfo.csv | cut -d , -f 1,2,16 | head -3
prints:
Run,ReleaseDate,LibraryLayout
SRR1972917,2015-04-14 13:59:24,PAIRED
SRR1972918,2015-04-14 13:58:26,PAIRED
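The Run column is usually what later steps need. Here is a minimal sketch that lists the run identifiers, using a three-column stand-in file (sample_runinfo.csv) built from the rows printed above. The file name is made up, and the real runinfo.csv has many more columns, but Run is the first column in both.

```shell
# Stand-in for runinfo.csv with only the three columns printed above.
cat > sample_runinfo.csv <<'EOF'
Run,ReleaseDate,LibraryLayout
SRR1972917,2015-04-14 13:59:24,PAIRED
SRR1972918,2015-04-14 13:58:26,PAIRED
EOF

# Take the first column and skip the header row.
runs=$(cut -d , -f 1 sample_runinfo.csv | tail -n +2)

echo "$runs"
```

Saving that list to a file keeps a record of exactly which sequencing runs an analysis was based on.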
a bit of trial and error. The documentation can be exceedingly obtuse.
esearch -db sra -query PRJNA257197 | efetch -format summary > summary.xml
For example, the runinfo is a csv file but the summary is in XML. The summary.xml is surprisingly large and will download much more slowly.
is not just NCBI. The vast majority of repositories are like that. The scope and needs of life scientists exceed the ability of these organizations to act on that need. Most of them are insular and are not required to communicate well. Being a bioinformatician requires the skill to recognize and work around the endless limitations and problems that might occur. You have to learn to deal with it.
may have its own way of accessing the data programmatically:
NCBI - Entrez Direct
ENSEMBL - REST interface
UCSC - mysql database queries
In this course we will use Entrez Direct.
Note: you will need to manually install mysql if you want to try out access to UCSC.
of what we did. Compare the two:
"We have obtained the sequence data for project PRJNA257197."
You still need to figure out how to do it. Versus:
# Get nucleotide data.
esearch -db nuccore -query PRJNA257197 | efetch -format fasta > nucleotide.fa
See how easily you can get the same data?
for examples:
curl -s 'https://rest.ensembl.org/lookup/id/ENSG00000157764?' -H 'Content-type:application/json'
will print:
{"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","version":12,"species":"homo_sapiens","description":"B-Raf proto-oncogene, serine/threonine kinase [S...
through a mysql server.
mysql -h genome-mysql.cse.ucsc.edu -u genome -D hg38 -N -A -e 'select chrom,strand,txStart,cdsStart from knownGene'
will produce:
chr1 - 17368 17368
chr1 + 29553 29553
chr1 + 30266 30266
...
Allows for very sophisticated joins and queries across the tables. You need to understand the data