Lecture 5: Making sense of a list of genes

F0c759df8f624be8ebcc5d8b9bb81089?s=47 Istvan Albert
September 09, 2020

Lecture 5: Making sense of a list of genes

F0c759df8f624be8ebcc5d8b9bb81089?s=128

Istvan Albert

September 09, 2020
Tweet

Transcript

  1. What do I do with a list of genes? Making

    sense of data BMMB 852: Applied Bioinforma cs
  2. So you have a list of genes Now what?

  3. Results produced by algorithms are rarely immediately meaningful

  4. Interpre ng the results is a dis nct and separate

    process.
  5. Being good at data interpreta on is an extremely valuable

    skill
  6. Many bioinforma cians focuson exessively on algorithms, tools etc. Making

    sense of data some mes becomes an a erthought! Don't be one of them.
  7. What your results will look like A surprising number of

    " nal" data analysis results will be on of: 1. A list of names 2. A list of names with a single value 3. A list of names with a matrix of values The names may be gene, transcript or other feature names. The question becomes to how do you interpret the list?
  8. A list of genes See the online lecture information for

    g e n e - l i s t . t x t, g e n e - v a l u e s . t x t and g e n e - m a t r i x . t x t c a t g e n e - l i s t . t x t | h e a d - 5 prints: A C E A D R B 2 A D R B 3 A G R P A K R 1 C 2 The list is limited to a small subset of all names!
  9. A list of genes with values Displays a name with

    a single measurement: c a t g e n e - v a l u e s . t x t | h e a d - 5 prints: g e n e v a l u e T m e m 1 3 2 a 1 . 0 4 E - 1 2 M y l 3 6 . 6 7 E - 1 4 M y l 4 3 . 2 7 E - 0 9 H s p b 7 1 . 2 7 E - 0 7 One "value" for each row.
  10. A gene matrix Displays a name with multiple measurements c

    a t g e n e - m a t r i x . t x t | h e a d - 5 prints g e n e C 1 C 2 C 3 M 1 M 2 M 3 T m e m 1 3 2 a 7 3 4 9 . 5 1 0 6 0 4 . 4 1 1 4 0 0 . 6 6 9 4 . 7 7 0 9 . 3 7 6 0 . 2 M y l 3 1 2 0 7 . 1 1 3 4 5 . 0 1 2 4 7 . 6 2 2 2 2 . 9 3 0 4 1 . 3 2 8 1 9 . 0 M y l 4 2 4 6 8 . 6 2 5 8 8 . 5 2 8 4 0 . 4 3 9 6 3 . 2 5 0 4 4 . 9 4 8 2 4 . 7 H s p b 7 5 6 2 . 5 6 1 0 . 3 6 4 7 . 0 9 4 7 . 0 1 3 0 0 . 0 1 1 4 4 . 7 Multiple "values" for each row.
  11. But what does it all mean?

  12. 2002, Donald Rumsfeld (Secretary of Defence): On the Iraq war:

    There are known knowns. There are things we know that we know. There are known unknowns. That is to say, there are things that we now know we don't know. But there are also unknown unknowns. There are things we do not know we don't know Much ridiculed back then - yet quite deep and thoughtful - especially regarding scienti c inquiry
  13. None
  14. Is this a complete enumera on? 1. Known knowns 2.

    Known unknowns 3. Unknown unknowns Is there a fourth category?
  15. Unkown knows! The things we don't know that we know.

  16. "There are known knowns" First you ought to understand: 1.

    How is the current knowledge represented. 2. How is the current knowledge searched. Every method will search your data against the known knowns.
  17. Recap: The Gene Ontology What is the GO? De nition

    of GO Terms layed out as a tree: G O : 0 1 ∕ \ ∕ \ G O : 0 2 G O : 0 3 ∕ \ ∕ \ G O : 0 4 G O : 0 5 And an association of gene products with terms. G e n e A G O : 0 5
  18. Mul ple types of edges G O : 0 1

    ∕ \ ∕ \ G O : 0 2 G O : 0 3 ∕ \ ∕ \ G O : 0 4 G O : 0 5 i s _ a - a subtype p a r t _ o f - complete containment h a s _ p a r - partial containment r e g u l a t e s - partial effect
  19. Direct and indirect links G O : 0 1 ∕

    \ ∕ \ G O : 0 2 G O : 0 3 ∕ \ ∕ \ G O : 0 4 G O : 0 5 Everything labeled as G O : 0 5 will also "be" a G O : 0 3 and G O : 0 1 A gene known to be G O : 0 3 could also be G O : 0 4 or G O : 0 5, perhaps both, or perhaps neither.
  20. Leaf nodes ‐ nodes with no children These are the

    most detailed representation. G O : 0 0 0 8 1 5 0 - > b i o l o g i c a l p r o c e s s G O : 0 0 3 2 5 0 2 - > d e v e l o p m e n t p r o c e s s G O : 0 0 4 8 8 5 6 - > a n a t o m i c a l s t r u c t u r e d e v e l o p m e n t G O : 0 0 4 8 5 1 3 - > a n i m a l o r g a n d e v e l o p m e n G O : 0 0 0 7 5 0 7 - > h e a r t d e v e l o p m e n t G O : 0 0 0 7 5 1 2 - > a d u l t h e a r t d e v e l o p m e n t
  21. GO analysis from command line Install with: # I n

    s t a l l s o f t w a r e c o n d a i n s t a l l g o a t o o l s # G e t t h e g o . o b o f i l e w g e t h t t p : ∕ ∕ g e n e o n t o l o g y . o r g ∕ o n t o l o g y ∕ g o - b a s i c . o b o # V i s u a l i z e a t e r m p l o t _ g o _ t e r m . p y - - t e r m G O : 0 0 0 7 5 0 7 g o . o b o
  22. None
  23. See what this gives you p l o t _

    g o _ t e r m . p y - - t e r m G O : 0 0 0 0 0 0 1
  24. Tacit limita ons Makes use of only the i s

    _ a relationships between concepts. Does not use make the p a r t _ o f or other relationships. It is very typical of bioinformatics tools to not be explicit of their limitations.
  25. Inves gate the GO data to understand it The gene

    ontology de nition le (detailed commands in the book) w g e t h t t p : ∕ ∕ p u r l . o b o l i b r a r y . o r g ∕ o b o ∕ g o - b a s i c . o b o how many lines: c a t g o . o b o | w c - l # 6 3 2 1 4 0 Page through it with m o r e g o - b a s i c . o b o
  26. GO Terms The content of the core g o -

    b a s i c . o b o le is constructed of records in the form: [ T e r m ] i d : G O : 0 0 0 0 0 0 2 n a m e : m i t o c h o n d r i a l g e n o m e m a i n t e n a n c e n a m e s p a c e : b i o l o g i c a l _ p r o c e s s d e f : " T h e m a i n t e n a n c e o f t h e s t r u c t u r e a n d i n t e g r i t y o f t h e m i t o c h o n d r i a l g e n o m e ; i n c l u d e s r e p l i c a t i o n a n d s e g r e g a t i o n o f t h e m i t o c h o n d r i a l c h r o m o s o m e . " [ G O C : a i , G O C : v w ] i s _ a : G O : 0 0 0 7 0 0 5 ! m i t o c h o n d r i o n o r g a n i z a t i o n The i s _ a line indicates the parent of the term. If your "concept" is not in this le tools will not nd
  27. Manipulate GO files from command line Once you identify the

    patterns in the records, then you can search for various content: c a t g o . o b o | g r e p " n a m e s p a c e : b i o l o g i c a l _ p r o c e s s " | w c - l # 3 0 5 8 3 c a t g o . o b o | g r e p " n a m e s p a c e : m o l e c u l a r _ f u n c t i o n " | w c - l # 1 2 1 2 3 c a t g o . o b o | g r e p " n a m e s p a c e : c e l l u l a r _ c o m p o n e n t " | w c - l # 4 3 0 0 Every functional enrichment tool uses this le as its basis.
  28. Search the file for func ons You can also get

    previous and following lines by passing the - B (before) - A (after) options to g r e p. c a t g o . o b o | g r e p " l a c t a s e a c t i v i t y " - B 2 - A 5 | h e a d - 8 Prints: [ T e r m ] i d : G O : 0 0 0 0 0 1 6 n a m e : l a c t a s e a c t i v i t y n a m e s p a c e : m o l e c u l a r _ f u n c t i o n d e f : " C a t a l y s i s o f t h e r e a c t i o n : l a c t o s e + H 2 O = D - g l u c o s e + D - g s y n o n y m : " l a c t a s e - p h l o r i z i n h y d r o l a s e a c t i v i t y " B R O A D [ E C : 3 . 2 . 1 . s y n o n y m : " l a c t o s e g a l a c t o h y d r o l a s e a c t i v i t y " E X A C T [ E C : 3 . 2 . 1 . 1 0 8 x r e f : E C : 3 . 2 . 1 . 1 0 8
  29. The associa on file for Homo Sapiens From the GO

    download page, copy the link then: w g e t h t t p : ∕ ∕ g e n e o n t o l o g y . o r g ∕ g e n e - a s s o c i a t i o n s ∕ g o a _ h u m a n . g a f . g z # U n z i p t h e c o m p r e s s e d f i l e . g u n z i p g o a _ h u m a n . g a f . g z How big is the resulting le: c a t g o a _ h u m a n . g a f | w c - l # 4 2 5 9 0 1 There you have it. 4 2 5 , 9 0 1 known functions for the human genes.
  30. What is in the associa on file? There is a

    readme with the le (on the web) and you can download that the same way. You can also page through the le c a t g o a _ h u m a n . g a f | m o r e Comments are spec ed with ! the rest are tab separated and column oriented data. Remove the lines starting with ! to simplify it. c a t g o a _ h u m a n . g a f | g r e p - v ' ! ' > a s s o c . t x t
  31. What proper es do the data have? The GAF format

    states that column 3 has to be a c a t a s s o c . t x t | c u t - f 3 | h e a d Prints D N A J C 2 5 - G N G 1 0 D N A J C 2 5 - G N G 1 0 D N A J C 2 5 - G N G 1 0 H D G F R P 3 a symbol that means something to a biologist wherever possible (a gene symbol, for example) “ “
  32. How many gene symbols? c a t a s s

    o c . t x t | c u t - f 3 | s o r t | u n i q - c | w c - l # 1 9 4 2 1 Most genes appear to have at least one entry. 4 2 5 , 9 0 1 over 1 9 , 4 2 1 genes means on average about 2 2 annotation per gene. But the annotations are not evenly distributed.
  33. Annota on distribu on Redirect output into g e n

    e _ c o u n t s . t x t: c a t a s s o c . t x t | c u t - f 3 | s o r t | u n i q - c | s o r t - k 1 , 1 n r > g e n e _ c a t g e n e _ c o u n t s . t x t | h e a d The "top" genes have annotations way above the 2 2 7 2 4 T P 5 3 6 6 9 G R B 2 6 3 7 E G F R 6 3 7 U B C 5 8 0 R P S 2 7 A 5 7 0 U B B 5 6 5 U B A 5 2 5 1 1 C T N N B 1 4 2 2 S R C
  34. Command line analy cs with datamash A handy tool called

    d a t a m a s h lets you do data analytics at command line. # A c t i v a t e y o u r e n v i r o m e n t s o u r c e a c t i v a t e b i o i n f o # G e t h e l p o n d a t a m a s h d a t a m a s h - - h e l p Unfortunately the u n i q - c command pads numbers with a variable number of spaces. We need to s q u e e z e those into a single space. t r - s can do that. c a t g e n e _ c o u n t s . t x t | t r - s ' '
  35. Working with datamash Average number annotations: c a t g

    e n e _ c o u n t s . t x t | t r - s ' ' | d a t a m a s h - t ' ' m e a n 2 # 2 1 . 9 2 8 1 7 0 5 3 7 0 4 8 You can list multiple operations at a time: c a t g e n e _ c o u n t s . t x t | t r - s ' ' | d a t a m a s h - t ' ' m e a n 2 m i n 2 # 2 1 . 9 2 8 1 7 0 5 3 7 0 4 8 1 7 2 4
  36. Func onal analysis Typically described with "overloaded" words, roughly the

    following categories: Overrepresentation analysis (ORA) Gene set enrichment analysis (GSEA) Pathway analysis The de nitions are a bit nebulous and overlapping.
  37. Overrepresenta on Are known biological functions over-represented (enriched) in an

    list relative to an expectation. For example: Out of my list of 20 genes I see 10 annotated as heart development (GO:0007507). Does that mean this function is enriched?
  38. A typical overrepresenta on study # H o w m

    a n y g e n e s a r e a n n o t a t e d i n t o t a l c a t g o a _ h u m a n . g a f | c u t - f 3 | s o r t | u n i q - c | w c - l 1 9 6 7 5 # H o w m a n y g e n e s a n n o t a t e d i n t h e s u b t r e e o f G O : 0 0 0 7 5 0 7 c a t g o a _ h u m a n . g a f | g r e p G O : 0 0 0 7 5 0 7 | w c - l 2 0 7 c a t g o a _ h u m a n . g a f | g r e p G O : 0 0 0 7 5 0 8 | w c - l 0 c a t g o a _ h u m a n . g a f | g r e p G O : 0 0 0 7 5 1 2 | w c - l 1 6 Thus we expect to see 223 out of 19675 (1%) but we see 10 out of 20 (50%). Is heart development overrepresented?
  39. List some poten al pi alls

  40. Many correc ons are needed 1. Multiple testing 2. Functional

    categories are not independent (GO is a tree!) 3. The unknown unknowns Every publication and tool claims to have addresses one or more weaknesses. But there is no "gold" standard to evaluate against! Why?
  41. We don't know when we are correct The unknown known!

  42. There is a whole co age industry of func onal

    tools. Never clear which one works "be er" than others People pick the one that seems to work for them ‐ take that SCIENCE!
  43. Gene set enrichment analysis Typically 1. Requires a gene list

    with values 2. All genes/transcripts should be present 3. The values should allow a ranking of the rows (high/low) The question it answers: Are there lists of functions common to genes listed in the top/bottom
  44. The main difference Overrepresentation: the subset is already known -

    -> get functions Gene set enrichment: a ranking is known -> get functions that are enriched in the top/bottom
  45. Analysis tools The book shows several options (many more exists):

    g:Profiler Panther DAVID ermineJ Tools come with different tradeoffs and my better suited for different problem sets. It is not clear beforehand which tool works for a given problem.
  46. Will different tools produce different results? Biostar Quote of the

    Day: Why does each GO enrichment method give different results? I'm new to GO terms. In the beginning it was fun, as long as I stuck to one algorithm. But then I found that there are many out there, each with its own advantages and caveats (the quality of graphic representation, for instance). [...] As a biologist, what should I trust? Deciding on this or that algorithm may change the whole story! “ “
  47. Data selec on for the lecture If you are pursuing

    a research project already you may have your own data available. Use that when you follow the examples in the book. You can also make your own data. Example: take the top 20 most annotated genes from GO and see what is common about them. You may also download the g e n e - l i s t . t x t from the lecture website. The same applies to the homework.