Lecture 5: Making sense of a list of genes

What do I do with a list of genes? Making
sense of data BMMB 852: Applied Bioinforma cs

So you have a list of genes Now what?

Results produced by algorithms are rarely immediately meaningful

Interpre ng the results is a dis nct and separate
process.

Being good at data interpreta on is an extremely valuable
skill

Many bioinforma cians focuson exessively on algorithms, tools etc. Making
sense of data some mes becomes an a erthought! Don't be one of them.

What your results will look like A surprising number of
" nal" data analysis results will be on of: 1. A list of names 2. A list of names with a single value 3. A list of names with a matrix of values The names may be gene, transcript or other feature names. The question becomes to how do you interpret the list?

A list of genes See the online lecture information for
g e n e - l i s t . t x t, g e n e - v a l u e s . t x t and g e n e - m a t r i x . t x t c a t g e n e - l i s t . t x t | h e a d - 5 prints: A C E A D R B 2 A D R B 3 A G R P A K R 1 C 2 The list is limited to a small subset of all names!

A list of genes with values Displays a name with
a single measurement: c a t g e n e - v a l u e s . t x t | h e a d - 5 prints: g e n e v a l u e T m e m 1 3 2 a 1 . 0 4 E - 1 2 M y l 3 6 . 6 7 E - 1 4 M y l 4 3 . 2 7 E - 0 9 H s p b 7 1 . 2 7 E - 0 7 One "value" for each row.

A gene matrix Displays a name with multiple measurements c
a t g e n e - m a t r i x . t x t | h e a d - 5 prints g e n e C 1 C 2 C 3 M 1 M 2 M 3 T m e m 1 3 2 a 7 3 4 9 . 5 1 0 6 0 4 . 4 1 1 4 0 0 . 6 6 9 4 . 7 7 0 9 . 3 7 6 0 . 2 M y l 3 1 2 0 7 . 1 1 3 4 5 . 0 1 2 4 7 . 6 2 2 2 2 . 9 3 0 4 1 . 3 2 8 1 9 . 0 M y l 4 2 4 6 8 . 6 2 5 8 8 . 5 2 8 4 0 . 4 3 9 6 3 . 2 5 0 4 4 . 9 4 8 2 4 . 7 H s p b 7 5 6 2 . 5 6 1 0 . 3 6 4 7 . 0 9 4 7 . 0 1 3 0 0 . 0 1 1 4 4 . 7 Multiple "values" for each row.

But what does it all mean?

2002, Donald Rumsfeld (Secretary of Defence): On the Iraq war:
There are known knowns. There are things we know that we know. There are known unknowns. That is to say, there are things that we now know we don't know. But there are also unknown unknowns. There are things we do not know we don't know Much ridiculed back then - yet quite deep and thoughtful - especially regarding scienti c inquiry

Is this a complete enumera on? 1. Known knowns 2.
Known unknowns 3. Unknown unknowns Is there a fourth category?

Unkown knows! The things we don't know that we know.

"There are known knowns" First you ought to understand: 1.
How is the current knowledge represented. 2. How is the current knowledge searched. Every method will search your data against the known knowns.

Recap: The Gene Ontology What is the GO? De nition
of GO Terms layed out as a tree: G O : 0 1 ∕ \ ∕ \ G O : 0 2 G O : 0 3 ∕ \ ∕ \ G O : 0 4 G O : 0 5 And an association of gene products with terms. G e n e A G O : 0 5

Mul ple types of edges G O : 0 1
∕ \ ∕ \ G O : 0 2 G O : 0 3 ∕ \ ∕ \ G O : 0 4 G O : 0 5 i s _ a - a subtype p a r t _ o f - complete containment h a s _ p a r - partial containment r e g u l a t e s - partial effect

Direct and indirect links G O : 0 1 ∕
\ ∕ \ G O : 0 2 G O : 0 3 ∕ \ ∕ \ G O : 0 4 G O : 0 5 Everything labeled as G O : 0 5 will also "be" a G O : 0 3 and G O : 0 1 A gene known to be G O : 0 3 could also be G O : 0 4 or G O : 0 5, perhaps both, or perhaps neither.

Leaf nodes ‐ nodes with no children These are the
most detailed representation. G O : 0 0 0 8 1 5 0 - > b i o l o g i c a l p r o c e s s G O : 0 0 3 2 5 0 2 - > d e v e l o p m e n t p r o c e s s G O : 0 0 4 8 8 5 6 - > a n a t o m i c a l s t r u c t u r e d e v e l o p m e n t G O : 0 0 4 8 5 1 3 - > a n i m a l o r g a n d e v e l o p m e n G O : 0 0 0 7 5 0 7 - > h e a r t d e v e l o p m e n t G O : 0 0 0 7 5 1 2 - > a d u l t h e a r t d e v e l o p m e n t

GO analysis from command line Install with: # I n
s t a l l s o f t w a r e c o n d a i n s t a l l g o a t o o l s # G e t t h e g o . o b o f i l e w g e t h t t p : ∕ ∕ g e n e o n t o l o g y . o r g ∕ o n t o l o g y ∕ g o - b a s i c . o b o # V i s u a l i z e a t e r m p l o t _ g o _ t e r m . p y - - t e r m G O : 0 0 0 7 5 0 7 g o . o b o

See what this gives you p l o t _
g o _ t e r m . p y - - t e r m G O : 0 0 0 0 0 0 1

Tacit limita ons Makes use of only the i s
_ a relationships between concepts. Does not use make the p a r t _ o f or other relationships. It is very typical of bioinformatics tools to not be explicit of their limitations.

Inves gate the GO data to understand it The gene
ontology de nition le (detailed commands in the book) w g e t h t t p : ∕ ∕ p u r l . o b o l i b r a r y . o r g ∕ o b o ∕ g o - b a s i c . o b o how many lines: c a t g o . o b o | w c - l # 6 3 2 1 4 0 Page through it with m o r e g o - b a s i c . o b o

GO Terms The content of the core g o -
b a s i c . o b o le is constructed of records in the form: [ T e r m ] i d : G O : 0 0 0 0 0 0 2 n a m e : m i t o c h o n d r i a l g e n o m e m a i n t e n a n c e n a m e s p a c e : b i o l o g i c a l _ p r o c e s s d e f : " T h e m a i n t e n a n c e o f t h e s t r u c t u r e a n d i n t e g r i t y o f t h e m i t o c h o n d r i a l g e n o m e ; i n c l u d e s r e p l i c a t i o n a n d s e g r e g a t i o n o f t h e m i t o c h o n d r i a l c h r o m o s o m e . " [ G O C : a i , G O C : v w ] i s _ a : G O : 0 0 0 7 0 0 5 ! m i t o c h o n d r i o n o r g a n i z a t i o n The i s _ a line indicates the parent of the term. If your "concept" is not in this le tools will not nd

Manipulate GO ﬁles from command line Once you identify the
patterns in the records, then you can search for various content: c a t g o . o b o | g r e p " n a m e s p a c e : b i o l o g i c a l _ p r o c e s s " | w c - l # 3 0 5 8 3 c a t g o . o b o | g r e p " n a m e s p a c e : m o l e c u l a r _ f u n c t i o n " | w c - l # 1 2 1 2 3 c a t g o . o b o | g r e p " n a m e s p a c e : c e l l u l a r _ c o m p o n e n t " | w c - l # 4 3 0 0 Every functional enrichment tool uses this le as its basis.

Search the ﬁle for func ons You can also get
previous and following lines by passing the - B (before) - A (after) options to g r e p. c a t g o . o b o | g r e p " l a c t a s e a c t i v i t y " - B 2 - A 5 | h e a d - 8 Prints: [ T e r m ] i d : G O : 0 0 0 0 0 1 6 n a m e : l a c t a s e a c t i v i t y n a m e s p a c e : m o l e c u l a r _ f u n c t i o n d e f : " C a t a l y s i s o f t h e r e a c t i o n : l a c t o s e + H 2 O = D - g l u c o s e + D - g s y n o n y m : " l a c t a s e - p h l o r i z i n h y d r o l a s e a c t i v i t y " B R O A D [ E C : 3 . 2 . 1 . s y n o n y m : " l a c t o s e g a l a c t o h y d r o l a s e a c t i v i t y " E X A C T [ E C : 3 . 2 . 1 . 1 0 8 x r e f : E C : 3 . 2 . 1 . 1 0 8

The associa on ﬁle for Homo Sapiens From the GO
download page, copy the link then: w g e t h t t p : ∕ ∕ g e n e o n t o l o g y . o r g ∕ g e n e - a s s o c i a t i o n s ∕ g o a _ h u m a n . g a f . g z # U n z i p t h e c o m p r e s s e d f i l e . g u n z i p g o a _ h u m a n . g a f . g z How big is the resulting le: c a t g o a _ h u m a n . g a f | w c - l # 4 2 5 9 0 1 There you have it. 4 2 5 , 9 0 1 known functions for the human genes.

What is in the associa on ﬁle? There is a
readme with the le (on the web) and you can download that the same way. You can also page through the le c a t g o a _ h u m a n . g a f | m o r e Comments are spec ed with ! the rest are tab separated and column oriented data. Remove the lines starting with ! to simplify it. c a t g o a _ h u m a n . g a f | g r e p - v ' ! ' > a s s o c . t x t

What proper es do the data have? The GAF format
states that column 3 has to be a c a t a s s o c . t x t | c u t - f 3 | h e a d Prints D N A J C 2 5 - G N G 1 0 D N A J C 2 5 - G N G 1 0 D N A J C 2 5 - G N G 1 0 H D G F R P 3 a symbol that means something to a biologist wherever possible (a gene symbol, for example) “ “

How many gene symbols? c a t a s s
o c . t x t | c u t - f 3 | s o r t | u n i q - c | w c - l # 1 9 4 2 1 Most genes appear to have at least one entry. 4 2 5 , 9 0 1 over 1 9 , 4 2 1 genes means on average about 2 2 annotation per gene. But the annotations are not evenly distributed.

Annota on distribu on Redirect output into g e n
e _ c o u n t s . t x t: c a t a s s o c . t x t | c u t - f 3 | s o r t | u n i q - c | s o r t - k 1 , 1 n r > g e n e _ c a t g e n e _ c o u n t s . t x t | h e a d The "top" genes have annotations way above the 2 2 7 2 4 T P 5 3 6 6 9 G R B 2 6 3 7 E G F R 6 3 7 U B C 5 8 0 R P S 2 7 A 5 7 0 U B B 5 6 5 U B A 5 2 5 1 1 C T N N B 1 4 2 2 S R C

Command line analy cs with datamash A handy tool called
d a t a m a s h lets you do data analytics at command line. # A c t i v a t e y o u r e n v i r o m e n t s o u r c e a c t i v a t e b i o i n f o # G e t h e l p o n d a t a m a s h d a t a m a s h - - h e l p Unfortunately the u n i q - c command pads numbers with a variable number of spaces. We need to s q u e e z e those into a single space. t r - s can do that. c a t g e n e _ c o u n t s . t x t | t r - s ' '

Working with datamash Average number annotations: c a t g
e n e _ c o u n t s . t x t | t r - s ' ' | d a t a m a s h - t ' ' m e a n 2 # 2 1 . 9 2 8 1 7 0 5 3 7 0 4 8 You can list multiple operations at a time: c a t g e n e _ c o u n t s . t x t | t r - s ' ' | d a t a m a s h - t ' ' m e a n 2 m i n 2 # 2 1 . 9 2 8 1 7 0 5 3 7 0 4 8 1 7 2 4

Func onal analysis Typically described with "overloaded" words, roughly the
following categories: Overrepresentation analysis (ORA) Gene set enrichment analysis (GSEA) Pathway analysis The de nitions are a bit nebulous and overlapping.

Overrepresenta on Are known biological functions over-represented (enriched) in an
list relative to an expectation. For example: Out of my list of 20 genes I see 10 annotated as heart development (GO:0007507). Does that mean this function is enriched?

A typical overrepresenta on study # H o w m
a n y g e n e s a r e a n n o t a t e d i n t o t a l c a t g o a _ h u m a n . g a f | c u t - f 3 | s o r t | u n i q - c | w c - l 1 9 6 7 5 # H o w m a n y g e n e s a n n o t a t e d i n t h e s u b t r e e o f G O : 0 0 0 7 5 0 7 c a t g o a _ h u m a n . g a f | g r e p G O : 0 0 0 7 5 0 7 | w c - l 2 0 7 c a t g o a _ h u m a n . g a f | g r e p G O : 0 0 0 7 5 0 8 | w c - l 0 c a t g o a _ h u m a n . g a f | g r e p G O : 0 0 0 7 5 1 2 | w c - l 1 6 Thus we expect to see 223 out of 19675 (1%) but we see 10 out of 20 (50%). Is heart development overrepresented?

List some poten al pi alls

Many correc ons are needed 1. Multiple testing 2. Functional
categories are not independent (GO is a tree!) 3. The unknown unknowns Every publication and tool claims to have addresses one or more weaknesses. But there is no "gold" standard to evaluate against! Why?

We don't know when we are correct The unknown known!

There is a whole co age industry of func onal
tools. Never clear which one works "be er" than others People pick the one that seems to work for them ‐ take that SCIENCE!

Gene set enrichment analysis Typically 1. Requires a gene list
with values 2. All genes/transcripts should be present 3. The values should allow a ranking of the rows (high/low) The question it answers: Are there lists of functions common to genes listed in the top/bottom

The main diﬀerence Overrepresentation: the subset is already known -
-> get functions Gene set enrichment: a ranking is known -> get functions that are enriched in the top/bottom

Analysis tools The book shows several options (many more exists):
g:Proﬁler Panther DAVID ermineJ Tools come with different tradeoffs and my better suited for different problem sets. It is not clear beforehand which tool works for a given problem.

Will diﬀerent tools produce diﬀerent results? Biostar Quote of the
Day: Why does each GO enrichment method give different results? I'm new to GO terms. In the beginning it was fun, as long as I stuck to one algorithm. But then I found that there are many out there, each with its own advantages and caveats (the quality of graphic representation, for instance). [...] As a biologist, what should I trust? Deciding on this or that algorithm may change the whole story! “ “

Data selec on for the lecture If you are pursuing
a research project already you may have your own data available. Use that when you follow the examples in the book. You can also make your own data. Example: take the top 20 most annotated genes from GO and see what is common about them. You may also download the g e n e - l i s t . t x t from the lecture website. The same applies to the homework.

Lecture 5: Making sense of a list of genes

Lecture 5: Making sense of a list of genes

More Decks by Istvan Albert

Other Decks in Education

Featured

Transcript