unfortunate name and unfortunate parameters and unfortunate documentation with a quirky and nerdy behavior all the above leading to semi obscurity living under a rock, unknown to most other than that it is a very useful tool
a t e _ b l a s t d b . p l - - p a s s i v e - - d e c o m p r e s s 1 6 S _ r i b o s o m a l _ R N A What is stored in this database? b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l | h e a d Prints the content of the database: > N R _ 1 1 8 8 8 9 . 1 A m y c o l a t o p s i s a z u r e a s t r a i n N R R L 1 1 4 1 2 1 6 S r i b o s o m a G G T C T N A T A C C G G A T A T A A C A A C T C A T G G C A T G G T T G G T A G T G G A A A G C T C C G G C G G T A C G G G A C A G C T T G T T G G T G G G G T A A T G G C C T A C C A A G G C G A C G A C G G G T A G C C G G C C T G A G A G G G T G A C C . . .
number: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y N R _ 0 2 6 0 8 8 | h e a d - 3 Prints: > N R _ 0 2 6 0 8 8 . 1 A l l o i o c o c c u s o t i t i s s t r a i n N C F B 2 8 9 0 1 6 S r i b o s o m a l G A G T T T G A T C C T G G C T C A G G A C G A A C G C T G G C G G C A T G C C T A A T A C A T G C A A G T C G A A C G C C A A G A A G A G G C G G A G T G G C G A A C G G G T G A G T A A C A C G T G A G G A A C T T G C C C A T A A G A G G G G G A T A A C
b l a s t d b c m d - h e l p Output formats: - o u t f m t < S t r i n g > O u t p u t f o r m a t , w h e r e t h e a v a i l a b l e f o r m a t s p e c i f i e r s a r e : % f m e a n s s e q u e n c e i n F A S T A f o r m a t % s m e a n s s e q u e n c e d a t a ( w i t h o u t d e f l i n e ) % a m e a n s a c c e s s i o n % i m e a n s s e q u e n c e i d % t m e a n s s e q u e n c e t i t l e % l m e a n s s e q u e n c e l e n g t h % h m e a n s s e q u e n c e h a s h v a l u e . . .
one entry: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y N R _ 0 2 6 0 8 8 - o u t f m t " % a % l " | h e a d - 3 N R _ 0 2 6 0 8 8 . 1 1 5 2 5 You can even select sequence ranges: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y N R _ 0 2 6 0 8 8 - o u t f m t " % a % l % s " - r a n g e 1 - 2 0 | h e a d - 3 prints accession number, sequence lenght and start: N R _ 0 2 6 0 8 8 . 1 1 5 2 5 G A G T T T G A T C C T G G C T C A G G
in the database: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l - o u t f m t " % a % l " | h e a d - 5 Prints: N R _ 1 1 8 8 8 9 . 1 1 3 0 0 N R _ 1 1 8 8 9 9 . 1 1 3 6 7 N R _ 0 7 4 3 3 4 . 1 1 4 9 2 N R _ 1 1 8 8 7 3 . 1 1 4 9 2 N R _ 1 1 9 2 3 7 . 1 1 4 9 2
the database b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l - o u t f m t " % a % l " | s o r t - k 2 r n , 2 | h e a d - 1 Prints: N G _ 0 4 6 3 8 4 . 1 3 6 0 0 What are the most common 16S sequence starts? b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l - o u t f m t " % s " - r a n g e 1 - 1 0 | s o r t | u n i q - c | s o r t - r n | h e a d
the ebola project e s e a r c h - d b p r o t e i n - q u e r y P R J N A 2 5 7 1 9 7 | e f e t c h - f o r m a t f a s t a > p r o t e i n s . f a Run a quick statistics on the le: s e q k i t s t a t p r o t e i n s . f a It is a good idea to investigate every step: f i l e f o r m a t t y p e n u m _ s e q s s u m _ l e n m i n _ l e n a v g _ p r o t e i n s . f a F A S T A P r o t e i n 2 , 2 4 0 1 , 3 6 7 , 0 2 5 2 1 7 6 1
the sequences: - d b t y p e p r o t When the FASTA header is in the NCBI format: - p a r s e _ s e q i d s The BLAST datbase build command is then: m a k e b l a s t d b - i n p r o t e i n s . f a - d b t y p e p r o t - o u t e b o l a - p a r s e _ s e q i d s
length: b l a s t d b c m d - d b e b o l a - e n t r y ' a l l ' - o u t f m t " % a % l " | h e a d - 5 will print: A K C 3 7 2 3 3 . 1 2 2 1 2 A K C 3 7 2 3 2 . 1 2 5 1 A K C 3 7 2 3 1 . 1 2 8 8 A K C 3 7 2 3 0 . 1 2 9 7 A K C 3 7 2 2 9 . 1 3 6 4
ebola virus change between the 1972 and the 2014 outbreaks? The sequence accession from the 1972 outbreak is A A D 1 4 5 8 9 (aka gene named L) Get the sequence (I like to get it in both formats because genbank has more information in it): e f e t c h - d b p r o t e i n - i d A A D 1 4 5 8 9 - f o r m a t g e n b a n k > L . g b e f e t c h - d b p r o t e i n - i d A A D 1 4 5 8 9 - f o r m a t f a s t a > L . f a
b l a s t p - d b e b o l a - q u e r y L . f a Reformat the results: b l a s t p - d b e b o l a - q u e r y L . f a - o u t f m t " 6 q a c c s a c c p i d e n t " | h e a Prints: A A D 1 4 5 8 9 . 1 A K C 3 6 0 0 1 9 8 . 7 3 4 A A D 1 4 5 8 9 . 1 A K C 3 6 4 3 3 9 8 . 7 3 4 A A D 1 4 5 8 9 . 1 A K C 3 6 7 0 2 9 8 . 7 3 4 A A D 1 4 5 8 9 . 1 A K C 3 7 0 3 5 9 8 . 7 3 4
hit (by percent identity): b l a s t p - d b e b o l a - q u e r y L . f a - o u t f m t " 6 q a c c s a c c p i d e n t " | s o r t - k 3 , 3 r n | h e a d - 4 We note how none of them are 100% match: A A D 1 4 5 8 9 . 1 A I E 1 1 8 0 5 9 8 . 7 7 9 A A D 1 4 5 8 9 . 1 A I E 1 1 8 1 4 9 8 . 7 7 9 A A D 1 4 5 8 9 . 1 A I E 1 1 8 2 3 9 8 . 7 7 9 A A D 1 4 5 8 9 . 1 A I E 1 1 8 3 2 9 8 . 7 7 9 Extract the top hit from the BLAST database: b l a s t d b c m d - d b e b o l a - e n t r y A I E 1 1 8 0 5 > A I E 1 1 8 0 5 . f a
l a s t p - q u e r y L . f a - s u b j e c t A I E 1 1 8 0 5 . f a Or use our scripts: g l o b a l - a l i g n . s h L . f a A I E 1 1 8 0 5 . f a Evaluate the alignments: . . . A A D 1 4 5 8 9 . 1 1 6 5 1 N R K Y L A R D S S T G S S T N N S D G H I E R S Q E Q T T R D P H D G T E R N L V L | | | . | . | : | | | | | | | | | | | | | | : | | | | | | | | | | | | | | | | : | | | A I E 1 1 8 0 5 . 1 1 6 5 1 N R K D L T R N S S T G S S T N N S D G H I K R S Q E Q T T R D P H D G T E R S L V L . . .
strategies. This is an ebola genome: e f e t c h - d b n u c l e o t i d e - i d K M 2 3 3 1 1 8 - - f o r m a t f a s t a > K M 2 3 3 1 1 8 . f a Make it into a blast database: m a k e b l a s t d b - i n K M 2 3 3 1 1 8 . f a - d b t y p e n u c l - o u t K M 2 3 3 1 1 8 - p a r s e _ s e
bases of the genome: b l a s t d b c m d - d b K M 2 3 3 1 1 8 - e n t r y K M 2 3 3 1 1 8 - r a n g e 1 - 4 0 > q u e r y . f a Align with blast: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a It works! Q u e r y 1 A A T C A T A C C T G G T T T G T T T C A G A G C C A T A T C A C C A A G A T A 4 0 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | S b j c t 1 A A T C A T A C C T G G T T T G T T T C A G A G C C A T A T C A C C A A G A T A 4 0
query shorter: b l a s t d b c m d - d b K M 2 3 3 1 1 8 - e n t r y K M 2 3 3 1 1 8 - r a n g e 1 - 1 5 > q u e r y . f a Now the same alignment: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a Prints: * * * * * N o h i t s f o u n d * * * * *
e g a b l a s t change it to b l a s t n: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a - t a s k b l a s t n Now it works again: Q u e r y 1 A A T C A T A C C T G G T T T 1 5 | | | | | | | | | | | | | | | S b j c t 1 A A T C A T A C C T G G T T T 1 5
d b c m d - d b K M 2 3 3 1 1 8 - e n t r y K M 2 3 3 1 1 8 - r a n g e 1 - 7 > q u e r y . f a Now b l a s t n does not work either: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a - t a s k b l a s t n But a different task will: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a - t a s k b l a s t n - s h o r t In general we have to be cautios when BLAST generates no hits. Sometimes the query properties make BLAST lter out the results.
q i d s opᨀ㨅on Using the "parse_seqids" ag when invoking makeblastdb allows the retrieval of sequences based upon sequence identi ers. But in that case, each sequence must have a unique identi er, and that identi er must have a speci c format See also section 5.14 Limiᨀ㨅ng a Search with a List of Idenᨀ㨅fiers in the NCBI BLAST+ handbook