Lecture 30: RNA-Seq data analysis

RNA-Seq steps
Traditional way:
1. Align
2. Quantify
3. Compare

A new style of analysis
Around 2014, a new style of analysis emerged:
1. Align + Quantify (in a single step)
2. Compare

Fast Abundance Estimation
A new school of thought that applies to transcriptomes.
We may not need alignments at all.
Turns RNA-Seq into a classification problem.
Two tools: Kallisto, Salmon
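To make the "classification problem" idea concrete, here is a minimal toy sketch: each read is assigned to the transcript whose k-mers it shares, with no base-by-base alignment. This is not the actual algorithm of Kallisto or Salmon; the function names, the sequences, and the choice of k=5 are all made up for illustration.

```python
from collections import Counter, defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def classify_read(read, index, k):
    """Return the transcripts compatible with every indexed k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer not found in any transcript; ignore it
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

def count_reads(reads, transcripts, k=5):
    """Count the reads that are compatible with exactly one transcript."""
    index = build_kmer_index(transcripts, k)
    counts = Counter()
    for read in reads:
        compatible = classify_read(read, index, k)
        if len(compatible) == 1:
            counts[next(iter(compatible))] += 1
    return counts

# Hypothetical toy data, invented for this example only.
transcripts = {
    "tx_A": "ACGTACGTTTGACCA",
    "tx_B": "GGCTTAACCGGATTC",
}
reads = ["ACGTACGT", "TTAACCGG", "CGTTTGAC", "NNNNNNNN"]

counts = count_reads(reads, transcripts, k=5)
print(dict(counts))
```

Real tools also have to deal with reads that match several transcripts, sequencing errors, and transcript length. This sketch simply drops ambiguous or unmatched reads.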

A curious case of too much reproducibility
See the link in Lecture 6 for the Kallisto vs Salmon controversy of 2017.
The approaches are extremely fast and convenient.
They will probably replace traditional methods (for cases when a transcriptome is available).

How to make sense of data
Two major classes of problems:
1. Pairwise comparisons (reasonably well defined methods)
   Compare two conditions: C1 vs C2
2. Non-pairwise comparisons (needs a matching design and statistical modeling)
   Compare more than two conditions: C1 and C2 and C3

Pairwise comparisons
You end up with a count table with conditions C1, C2 and replicates R1, R2...

              C1_R1  C1_R2  C2_R1  C2_R2
Transcript A    100    200     10    120
Transcript B    200    320     88     39
Transcript C    150    123     63      8

P-value meaning: when compared across replicates and conditions, what is the chance of observing a variation of the size observed right now?
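As a rough first look at such a count table, one can average the replicates of each condition and compute a log2 fold change. The sketch below uses the numbers from the table above. The pseudocount of 1 is an assumption added to avoid taking the log of zero, and the function names are made up for illustration. There is no normalization and no p-value here; those are what the dedicated statistical tools provide.

```python
import math

# Count table copied from the slide: columns are condition_replicate.
columns = ["C1_R1", "C1_R2", "C2_R1", "C2_R2"]
counts = {
    "Transcript A": [100, 200, 10, 120],
    "Transcript B": [200, 320, 88, 39],
    "Transcript C": [150, 123, 63, 8],
}

def condition_mean(row, condition):
    """Average the replicates whose column name starts with the condition."""
    values = [v for c, v in zip(columns, row) if c.startswith(condition + "_")]
    return sum(values) / len(values)

def log2_fold_change(row, pseudocount=1.0):
    """log2(C2 mean / C1 mean); the pseudocount (an assumption) avoids log(0)."""
    c1 = condition_mean(row, "C1") + pseudocount
    c2 = condition_mean(row, "C2") + pseudocount
    return math.log2(c2 / c1)

fold_changes = {name: log2_fold_change(row) for name, row in counts.items()}

for name, lfc in fold_changes.items():
    c1 = condition_mean(counts[name], "C1")
    c2 = condition_mean(counts[name], "C2")
    print(f"{name}\tC1={c1:.1f}\tC2={c2:.1f}\tlog2FC={lfc:.2f}")
```

A negative log2 fold change means the average count is lower in C2 than in C1. Note how much the replicates disagree (10 vs 120 for Transcript A in C2): that spread is exactly why a fold change alone is not enough.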

How can different methods produce different p-values?
Each p-value is defined the same way:
"The p-value is the chance of observing a variation of the size observed right now."
Which one is right? What is the typically unstated, tacit assumption?

The truth about p-values
The missing statement that you need to say before any definition is:
"If our model is correct, then the p-value is the chance of observing a variation of the size observed right now."
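The dependence on the model can be shown with a small sketch: the same observation gets two different p-values under two different models. The observed count, the expected mean, and the dispersion value below are hypothetical numbers chosen for illustration. The Poisson and negative binomial models stand in for "two methods"; this is not the exact procedure used by any particular tool.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of observing exactly k counts with mean mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def negbin_pmf(k, mu, size):
    """Negative binomial probability with mean mu and dispersion parameter size.

    The variance is mu + mu**2 / size, so a smaller size means more spread.
    """
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)

def upper_tail(pmf, k, **params):
    """P(X >= k) under the given model: one minus the probability mass below k."""
    return 1.0 - sum(pmf(i, **params) for i in range(k))

observed = 30   # hypothetical count seen in one condition
expected = 15   # hypothetical mean count if nothing changed

p_poisson = upper_tail(poisson_pmf, observed, mu=expected)
p_negbin = upper_tail(negbin_pmf, observed, mu=expected, size=5.0)

print(f"Poisson model:           p = {p_poisson:.4g}")
print(f"Negative binomial model: p = {p_negbin:.4g}")
```

The Poisson model allows less variation around the mean, so it treats the same count as more surprising and reports the smaller p-value. Neither number is "the" p-value; each is only meaningful if its model holds.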

Ok. Is the statistical model ever correct?

Ahem. No. A model is NEVER fully correct.

The corrected statement
The missing statement that you need to say before any definition is:
"If our statistical model were correct, then the p-value would be the chance of observing a variation of the size observed right now."
Since our model is not quite correct, we still hope the difference is not that substantial, so the p-value will still apply to some extent. Good luck, mate.

P-values are guidance. Additional evidence.

Scientists love to argue about methods
My method is better than your method.
You can perform your pairwise comparison in many ways:
1. DESeq (version 1)
2. DESeq2
3. edgeR
The handbook has many detailed explanations of each and a script to do all three in parallel.

I ran my comparison script, now what?
You end up with a list of transcripts, genes, features.
Most publications are about interpreting these lists of genes.
Go back to Lecture 5: How do I interpret a list of genes?

And now for something different
The Handbook is going in a new direction

Bioinformatics Recipes
We are putting scripts on the web: https://psu.bioinformatics.recipes
Tip: You can finish the last homework by looking at the results of the RNA-Seq Recipe.
Help us make it better. Turn your project into a recipe.

Current state
We expect to change a lot.

What is a recipe?
A "megaton" script with a web interface.
Runnable by someone else.
Shareable between users.
Modifiable, customizable.
It is still a script, but a web-enabled one.
Borrow each other's scripts, learn and create full examples.

The RNA-Seq Recipe

Bioinformatics Recipes
This is the first public announcement!
Developed while teaching this course.
Now joining other tools that came out of this course: Galaxy, Biostar, Biostar Handbook and now the new baby: Bioinformatics Recipes.
We have very high hopes and expectations. I think this time next year the bioinformatics world will be running on recipes.

Istvan Albert
