Upgrade to Pro — share decks privately, control downloads, hide ads and more …

BLAST databases

Istvan Albert
October 30, 2019

BLAST databases

Learn to create BLAST databases.

Istvan Albert

October 30, 2019
Tweet

More Decks by Istvan Albert

Other Decks in Science

Transcript

  1. Running BLAST 1. Requires a BLAST database 2. Requires a

    BLAST program makeblastdb creates blast databases. blastdbcmd queries blast databases. update_blastdb.pl downloads databases. blastn, blastp, blastx, tblastn, tblastx are blast programs
  2. blastdbcmd ‐ the unsung hero' a useful tool with an

    unfortunate name and unfortunate parameters and unfortunate documentation with a quirky and nerdy behavior all the above leading to semi obscurity living under a rock, unknown to most other than that it is a very useful tool
  3. blastdbcmd in "acᨀ㨅on" Get the 16S data u p d

    a t e _ b l a s t d b . p l - - p a s s i v e - - d e c o m p r e s s 1 6 S _ r i b o s o m a l _ R N A What is stored in this database? b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l | h e a d Prints the content of the database: > N R _ 1 1 8 8 8 9 . 1 A m y c o l a t o p s i s a z u r e a s t r a i n N R R L 1 1 4 1 2 1 6 S r i b o s o m a G G T C T N A T A C C G G A T A T A A C A A C T C A T G G C A T G G T T G G T A G T G G A A A G C T C C G G C G G T A C G G G A C A G C T T G T T G G T G G G G T A A T G G C C T A C C A A G G C G A C G A C G G G T A G C C G G C C T G A G A G G G T G A C C . . .
  4. Get data by accession number If you knew the accession

    number: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y N R _ 0 2 6 0 8 8 | h e a d - 3 Prints: > N R _ 0 2 6 0 8 8 . 1 A l l o i o c o c c u s o t i t i s s t r a i n N C F B 2 8 9 0 1 6 S r i b o s o m a l G A G T T T G A T C C T G G C T C A G G A C G A A C G C T G G C G G C A T G C C T A A T A C A T G C A A G T C G A A C G C C A A G A A G A G G C G G A G T G G C G A A C G G G T G A G T A A C A C G T G A G G A A C T T G C C C A T A A G A G G G G G A T A A C
  5. It can format outputs Detailed but hard to read help

    b l a s t d b c m d - h e l p Output formats: - o u t f m t < S t r i n g > O u t p u t f o r m a t , w h e r e t h e a v a i l a b l e f o r m a t s p e c i f i e r s a r e : % f m e a n s s e q u e n c e i n F A S T A f o r m a t % s m e a n s s e q u e n c e d a t a ( w i t h o u t d e f l i n e ) % a m e a n s a c c e s s i o n % i m e a n s s e q u e n c e i d % t m e a n s s e q u e n c e t i t l e % l m e a n s s e q u e n c e l e n g t h % h m e a n s s e q u e n c e h a s h v a l u e . . .
  6. A variety of outputs Get the name and lenght for

    one entry: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y N R _ 0 2 6 0 8 8 - o u t f m t " % a % l " | h e a d - 3 N R _ 0 2 6 0 8 8 . 1 1 5 2 5 You can even select sequence ranges: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y N R _ 0 2 6 0 8 8 - o u t f m t " % a % l % s " - r a n g e 1 - 2 0 | h e a d - 3 prints accession number, sequence lenght and start: N R _ 0 2 6 0 8 8 . 1 1 5 2 5 G A G T T T G A T C C T G G C T C A G G
  7. Get all entries in the database Iterate over all entries

    in the database: b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l - o u t f m t " % a % l " | h e a d - 5 Prints: N R _ 1 1 8 8 8 9 . 1 1 3 0 0 N R _ 1 1 8 8 9 9 . 1 1 3 6 7 N R _ 0 7 4 3 3 4 . 1 1 4 9 2 N R _ 1 1 8 8 7 3 . 1 1 4 9 2 N R _ 1 1 9 2 3 7 . 1 1 4 9 2
  8. You can reformat/sort the output The longest 16S genes in

    the database b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l - o u t f m t " % a % l " | s o r t - k 2 r n , 2 | h e a d - 1 Prints: N G _ 0 4 6 3 8 4 . 1 3 6 0 0 What are the most common 16S sequence starts? b l a s t d b c m d - d b 1 6 S _ r i b o s o m a l _ R N A - e n t r y a l l - o u t f m t " % s " - r a n g e 1 - 1 0 | s o r t | u n i q - c | s o r t - r n | h e a d
  9. Let's perform a protein alignment Get the proteins submitted for

    the ebola project e s e a r c h - d b p r o t e i n - q u e r y P R J N A 2 5 7 1 9 7 | e f e t c h - f o r m a t f a s t a > p r o t e i n s . f a Run a quick statistics on the le: s e q k i t s t a t p r o t e i n s . f a It is a good idea to investigate every step: f i l e f o r m a t t y p e n u m _ s e q s s u m _ l e n m i n _ l e n a v g _ p r o t e i n s . f a F A S T A P r o t e i n 2 , 2 4 0 1 , 3 6 7 , 0 2 5 2 1 7 6 1
  10. Create the blast datatbase Important settings. Set the type of

    the sequences: - d b t y p e p r o t When the FASTA header is in the NCBI format: - p a r s e _ s e q i d s The BLAST datbase build command is then: m a k e b l a s t d b - i n p r o t e i n s . f a - d b t y p e p r o t - o u t e b o l a - p a r s e _ s e q i d s
  11. Invesᨀ㨅gate your BLAST database Print the accession number and sequence

    length: b l a s t d b c m d - d b e b o l a - e n t r y ' a l l ' - o u t f m t " % a % l " | h e a d - 5 will print: A K C 3 7 2 3 3 . 1 2 2 1 2 A K C 3 7 2 3 2 . 1 2 5 1 A K C 3 7 2 3 1 . 1 2 8 8 A K C 3 7 2 3 0 . 1 2 9 7 A K C 3 7 2 2 9 . 1 3 6 4
  12. Using BLAST Quesᨀ㨅on: How did the polymerase gene of the

    ebola virus change between the 1972 and the 2014 outbreaks? The sequence accession from the 1972 outbreak is A A D 1 4 5 8 9 (aka gene named L) Get the sequence (I like to get it in both formats because genbank has more information in it): e f e t c h - d b p r o t e i n - i d A A D 1 4 5 8 9 - f o r m a t g e n b a n k > L . g b e f e t c h - d b p r o t e i n - i d A A D 1 4 5 8 9 - f o r m a t f a s t a > L . f a
  13. Run the protein BLAST This generates too much of output:

    b l a s t p - d b e b o l a - q u e r y L . f a Reformat the results: b l a s t p - d b e b o l a - q u e r y L . f a - o u t f m t " 6 q a c c s a c c p i d e n t " | h e a Prints: A A D 1 4 5 8 9 . 1 A K C 3 6 0 0 1 9 8 . 7 3 4 A A D 1 4 5 8 9 . 1 A K C 3 6 4 3 3 9 8 . 7 3 4 A A D 1 4 5 8 9 . 1 A K C 3 6 7 0 2 9 8 . 7 3 4 A A D 1 4 5 8 9 . 1 A K C 3 7 0 3 5 9 8 . 7 3 4
  14. Sort the blast formaᨀꌅed output Which is the most similar

    hit (by percent identity): b l a s t p - d b e b o l a - q u e r y L . f a - o u t f m t " 6 q a c c s a c c p i d e n t " | s o r t - k 3 , 3 r n | h e a d - 4 We note how none of them are 100% match: A A D 1 4 5 8 9 . 1 A I E 1 1 8 0 5 9 8 . 7 7 9 A A D 1 4 5 8 9 . 1 A I E 1 1 8 1 4 9 8 . 7 7 9 A A D 1 4 5 8 9 . 1 A I E 1 1 8 2 3 9 8 . 7 7 9 A A D 1 4 5 8 9 . 1 A I E 1 1 8 3 2 9 8 . 7 7 9 Extract the top hit from the BLAST database: b l a s t d b c m d - d b e b o l a - e n t r y A I E 1 1 8 0 5 > A I E 1 1 8 0 5 . f a
  15. Invesᨀ㨅gate the top hits Blast also supports pairwise alignment: b

    l a s t p - q u e r y L . f a - s u b j e c t A I E 1 1 8 0 5 . f a Or use our scripts: g l o b a l - a l i g n . s h L . f a A I E 1 1 8 0 5 . f a Evaluate the alignments: . . . A A D 1 4 5 8 9 . 1 1 6 5 1 N R K Y L A R D S S T G S S T N N S D G H I E R S Q E Q T T R D P H D G T E R N L V L | | | . | . | : | | | | | | | | | | | | | | : | | | | | | | | | | | | | | | | : | | | A I E 1 1 8 0 5 . 1 1 6 5 1 N R K D L T R N S S T G S S T N N S D G H I K R S Q E Q T T R D P H D G T E R S L V L . . .
  16. What are blast tasks? Blast is tuned to different search

    strategies. This is an ebola genome: e f e t c h - d b n u c l e o t i d e - i d K M 2 3 3 1 1 8 - - f o r m a t f a s t a > K M 2 3 3 1 1 8 . f a Make it into a blast database: m a k e b l a s t d b - i n K M 2 3 3 1 1 8 . f a - d b t y p e n u c l - o u t K M 2 3 3 1 1 8 - p a r s e _ s e
  17. Extract a sequence and align it Get the rst 40

    bases of the genome: b l a s t d b c m d - d b K M 2 3 3 1 1 8 - e n t r y K M 2 3 3 1 1 8 - r a n g e 1 - 4 0 > q u e r y . f a Align with blast: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a It works! Q u e r y 1 A A T C A T A C C T G G T T T G T T T C A G A G C C A T A T C A C C A A G A T A 4 0 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | S b j c t 1 A A T C A T A C C T G G T T T G T T T C A G A G C C A T A T C A C C A A G A T A 4 0
  18. BLAST task needs to suit the job Now make the

    query shorter: b l a s t d b c m d - d b K M 2 3 3 1 1 8 - e n t r y K M 2 3 3 1 1 8 - r a n g e 1 - 1 5 > q u e r y . f a Now the same alignment: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a Prints: * * * * * N o h i t s f o u n d * * * * *
  19. Select a different BLAST task The default task is m

    e g a b l a s t change it to b l a s t n: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a - t a s k b l a s t n Now it works again: Q u e r y 1 A A T C A T A C C T G G T T T 1 5 | | | | | | | | | | | | | | | S b j c t 1 A A T C A T A C C T G G T T T 1 5
  20. Make the query even shorter b l a s t

    d b c m d - d b K M 2 3 3 1 1 8 - e n t r y K M 2 3 3 1 1 8 - r a n g e 1 - 7 > q u e r y . f a Now b l a s t n does not work either: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a - t a s k b l a s t n But a different task will: b l a s t n - d b K M 2 3 3 1 1 8 - q u e r y q u e r y . f a - t a s k b l a s t n - s h o r t In general we have to be cautios when BLAST generates no hits. Sometimes the query properties make BLAST lter out the results.
  21. makeblastdb - p a r s e _ s e

    q i d s opᨀ㨅on Using the "parse_seqids" ag when invoking makeblastdb allows the retrieval of sequences based upon sequence identi ers. But in that case, each sequence must have a unique identi er, and that identi er must have a speci c format See also section 5.14 Limiᨀ㨅ng a Search with a List of Idenᨀ㨅fiers in the NCBI BLAST+ handbook