Slide 1

Slide 1 text

Bioinformatics Seminars Bioinformatics Seminars Leonardo Collado Torres [email protected] November 18th, 2011 1 / 61

Slide 2

Slide 2 text

Bio data
Why Biostatistics?
Understanding Biostats
Methods development
Why R?
Basic EDA
Setting up your toolbox

Slide 3

Slide 3 text

Bio data - Biology
If you are studying the Undergraduate Program on Genomic Sciences, you are definitely interested in Biology. Well, I hope so!
So, what is Biology? It’s the study of living beings, right?
So, how do we study it?
You are already part of academia as an undergrad student. Your voice counts and we’d love to listen to your ideas.
If you think that you are not in academia, well, at least laugh at http://sotak.info/sci.jpg

Slide 4

Slide 4 text

Bio data - Using the Scientific Method
According to Wikipedia: Scientific method refers to a body of techniques for investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge. To be termed scientific, a method of inquiry must be based on gathering empirical and measurable evidence subject to specific principles of reasoning.
The Oxford English Dictionary says that scientific method is: a method of procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses.

Slide 5

Slide 5 text

Bio data - Parenthesis: is formulating hypotheses out of date?
In a way, yes, it’s old. You don’t have to formulate a hypothesis before exploring the data.
For a more complete philosophical discussion, see the paper by Glass and Hall (2008) in Cell.
An interesting book is The Grand Design by Stephen Hawking and Leonard Mlodinow.

Slide 6

Slide 6 text

Bio data - Why study genomics?
Cure cancer? You will surely be interested in the development of new medicines thanks to our understanding of Immunology.
Do you like animals, microbes, and/or plants, and think that studying their genomes will allow us to understand them better?
Because it’s new? Sounds fancy?
My case: I liked math and computing, and biology was going to provide me with interesting and complicated problems to solve using math and computers.
Anyhow, you’ll notice that Genomics is just a big name and you can pretty much major in any hard science of your choice: biochem, molecular bio, biocomputing, ... biostatistics?

Slide 7

Slide 7 text

Bio data - Why has Biology gone high-throughput?
Say that we are studying something that interests you and we observe the object at time 1 and at time 2.
Can you tell me what happened between time 1 and 2?

Slide 8

Slide 8 text

Bio data - What happened between the red and blue point?

    # The two observations: blue at time 1, red at time 2
    plot(c(1, 2), c(1, 2), pch = 16,
         col = c("blue", "red"), xlab = "Time",
         ylab = "Some measured units",
         ylim = c(0.8, 2))
    # Three possible intermediate values
    points(1.5, 1.5, col = "orange", pch = 16)
    points(1.2, 1.8, col = "purple", pch = 16)
    points(1.7, 0.9, col = "forest green", pch = 16)
    # Dotted straight line joining the two observations
    abline(0, 1, lty = 3)

Slide 9

Slide 9 text

Bio data - What happened between the red and blue point?
[Figure: the plot produced by the code on the previous slide; x-axis "Time" (1.0 to 2.0), y-axis "Some measured units" (0.8 to 2.0), showing five points.]

Slide 10

Slide 10 text

Bio data - In short
Well, we can’t tell much about what happened in between.
One solution? More data! More observations!
Why didn’t we do it before? Technological limitations (some technologies existed but were hella expensive).
That’s why Biology (and many other areas, including your social network) has gone high-throughput!
Note that this does not depend on whether what you are studying is discrete or continuous.

Slide 11

Slide 11 text

Bio data - It’s all noisy data!
Noise is everywhere:
Your PCRs
Brain imaging
Genomics, proteomics, ...
Why?
Biology: not all cells have the same number of molecules.
Bias: technological, who is measuring, etc.
Artifacts

Slide 12

Slide 12 text

Why Biostatistics? - Why do we need Biostatistics?
It’s fun to play with data :)
Datasets are huge! Unless you can tell me that you can read the human genome in a couple of seconds. I’ll even give you 5 minutes!
You need the tools (biostatistical methods) to actually explore the data in a sensible way.
Of course, you can always team up with a Biostatistician and have them analyze your data. However, LCG is an interdisciplinary program, so you should at least be able to speak the same language.

Slide 13

Slide 13 text

Why Biostatistics? - Quick exercise
Generate 10 thousand random values from a Normal distribution with mean 0 and standard deviation 4.
Plot the 10 thousand values.
What is the mean of the 10 thousand values?

Slide 14

Slide 14 text

Why Biostatistics? - What is the mean?

    # Find functions whose name contains "norm"
    apropos("norm")
    #  [1] "dlnorm"         "dnorm"
    #  [3] "norm"           "normalizePath"
    #  [5] "plnorm"         "pnorm"
    #  [7] "qlnorm"         "qnorm"
    #  [9] "qqnorm"         "qqnorm.default"
    # [11] "rlnorm"         "rnorm"
    # Check the arguments of rnorm
    args(rnorm)
    # function (n, mean = 0, sd = 1)
    # NULL

Slide 15

Slide 15 text

Why Biostatistics? - What is the mean?

    # 10,000 draws from N(mean = 0, sd = 4), rounded to 3 digits
    x <- round(rnorm(10000, mean = 0, sd = 4), digits = 3)
    # No seed is set, so the exact value differs between runs
    mean(x)
    # [1] -0.0502423
    plot(x, cex = 0.05)
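The sample mean is close to 0 but not exactly 0. A rough sketch of why: for n independent draws, the standard error of the mean is sd/sqrt(n), here 4/sqrt(10000) = 0.04, so a mean a few hundredths away from 0 is expected. The seed and the variable names `se_theory` and `se_estimate` are illustrative assumptions, not part of the original slides.

```r
set.seed(2011)  # arbitrary seed, assumed here only for reproducibility
n <- 10000
sigma <- 4
x <- rnorm(n, mean = 0, sd = sigma)

# Standard error of the mean: known sd versus sd estimated from the sample
se_theory <- sigma / sqrt(n)
se_estimate <- sd(x) / sqrt(n)

c(mean = mean(x), se_theory = se_theory, se_estimate = se_estimate)
```

Comparing `mean(x)` with `se_theory` gives a quick sense of how far from 0 the sample mean can reasonably land.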

Slide 16

Slide 16 text

Why Biostatistics? - What is the mean?
[Figure: scatter plot of the 10,000 values of x against Index (0 to 10000); the x values range from about -15 to 15.]

Slide 17

Slide 17 text

Why Biostatistics? - Quick example
That was a simple example and the mean is not that hard to get, but that’s the starting point.
Imagine analyzing 200 million short sequences on the 3.2 Gb human genome.

Slide 18

Slide 18 text

Understanding Biostats - Biostats?
Huh? I’m not a math-y person and just thinking about integrals gives me headaches...
So, I’ll collaborate with a Biostatistician and problem solved :)
Hm, OK. At least understand the language. After all, a Biostatistician also understands (or should understand) the biological language.
OK. I know what a P-value is, so I’m good, right?

Slide 19

Slide 19 text

Understanding Biostats - Parenthesis: 1-2-3 P-value
It’s the probability of observing a value at least as extreme as the one I actually observed, computed under the distribution I assume for my data (the null distribution).
So, what’s the P-value of observing a value greater than 2 in a standard Normal distribution (mean 0 and standard deviation 1)?
Now, what’s the P-value of observing a value more extreme than 2 in absolute value, that is, below -2 or above 2, in a standard Normal distribution?
Note that a Normal distribution is symmetric :)

Slide 20

Slide 20 text

Understanding Biostats - Normal

    x <- seq(-4, 4, by = 0.01)
    plot(x, dnorm(x), type = "l", col = "blue",
         main = "Density Function for a Standard Normal",
         xlab = "x", ylab = "Density")
    abline(v = 2, col = "red")
    abline(v = -2, col = "orange")
    # One-sided: P(Z > 2)
    pnorm(2, lower.tail = FALSE)
    # [1] 0.02275013
    # Two-sided: P(Z > 2) + P(Z < -2)
    pnorm(2, lower.tail = FALSE) + pnorm(-2)
    # [1] 0.04550026

Slide 21

Slide 21 text

Understanding Biostats - Normal

    # By symmetry, the two-sided P-value is twice the one-sided one
    pnorm(2, lower.tail = FALSE) * 2
    # [1] 0.04550026
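As a sanity check, the two-sided P-value can also be approximated by simulation: draw many standard Normal values and count how often they fall beyond 2 in absolute value. This is only a sketch; the seed, the number of draws, and the names `p_exact` and `p_sim` are assumptions made for illustration.

```r
set.seed(1)        # arbitrary seed (assumption)
z <- rnorm(1e6)    # one million standard Normal draws

# Exact two-sided tail probability from the Normal CDF
p_exact <- 2 * pnorm(2, lower.tail = FALSE)

# Monte Carlo estimate: fraction of draws with |z| > 2
p_sim <- mean(abs(z) > 2)

c(exact = p_exact, simulated = p_sim)
```

The two numbers should agree to roughly three decimal places; the simulated one varies slightly with the seed.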

Slide 22

Slide 22 text

Understanding Biostats - Normal
[Figure: "Density Function for a Standard Normal"; x-axis x (-4 to 4), y-axis Density (0.0 to 0.4).]

Slide 23

Slide 23 text

Understanding Biostats - P-value is not enough
Well, I’m here to tell you that you need to know more than what a P-value is.
First of all, there are 3 schools in Biostatistics: frequentist, likelihood-based and Bayesian.

Slide 24

Slide 24 text

Understanding Biostats - Frequentist point of view
In short, the frequency with which we observe an event tells us its probability.
In statistical inference (when we observe some data and want to estimate the parameters of the distribution), we imagine that we repeated the experiment many, many times, so that we can extrapolate from our single experiment and estimate the parameters of the underlying distribution.
The practice of using unobserved data sounds fishy!
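The idea that long-run frequency approaches probability can be illustrated with simulated Bernoulli trials. In this sketch the true probability `p_true`, the seed, and the number of trials are hypothetical choices, not values from the slides.

```r
set.seed(42)    # arbitrary seed (assumption)
p_true <- 0.3   # hypothetical probability of the event

# 100,000 independent trials: 1 if the event happened, 0 otherwise
trials <- rbinom(1e5, size = 1, prob = p_true)

# Observed frequency of the event after each additional trial
running_freq <- cumsum(trials) / seq_along(trials)

# Frequency after 10, 100, ..., 100,000 trials
running_freq[c(10, 100, 1000, 10000, 100000)]
```

The early frequencies can be far from `p_true`, while the later ones settle close to it.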

Slide 25

Slide 25 text

Understanding Biostats - Likelihood
Given an assumed distribution, our data tell us which values of the distribution’s parameters are most likely.
It doesn’t make bold statements with unobserved data but, then again, it only gives us basic information about the parameters.
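A minimal sketch of the likelihood idea, assuming simulated Normal data with a known standard deviation: the log-likelihood is written as a function of the unknown mean and maximised numerically with `optimize()`. The data, the seed, and the function name `log_lik` are illustrative assumptions.

```r
set.seed(7)                           # arbitrary seed (assumption)
obs <- rnorm(200, mean = 3, sd = 2)   # simulated data; sd = 2 treated as known

# Log-likelihood of the data as a function of the unknown mean mu
log_lik <- function(mu) sum(dnorm(obs, mean = mu, sd = 2, log = TRUE))

# Numerically find the mu that maximises the log-likelihood
fit <- optimize(log_lik, interval = c(-10, 10), maximum = TRUE, tol = 1e-8)

# For a Normal mean, the maximum likelihood estimate is the sample mean
c(numerical = fit$maximum, sample_mean = mean(obs))
```

For this simple model the numerical maximum coincides with the sample mean; for more complicated models only the numerical route may be available.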

Slide 26

Slide 26 text

Understanding Biostats - Bayesian
Bayesian Statistics is based on the concept of conditional probability:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
The idea is that you believe that your parameter has a given distribution (a priori) and you’ll update your beliefs with what you observed from your data, giving you the posterior distribution.
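The conditional probability identities can be checked with a small numeric example. All probabilities below are hypothetical (A could be "has the condition" and B "tests positive"); they are not from the slides.

```r
# Hypothetical inputs
p_A <- 0.01              # P(A)
p_B_given_A <- 0.95      # P(B | A)
p_B_given_notA <- 0.05   # P(B | not A)

# Law of total probability: P(B)
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Joint probability P(A and B), then P(A | B) from the definition
p_AB <- p_B_given_A * p_A
p_A_given_B_def <- p_AB / p_B

# The same quantity via Bayes' theorem
p_A_given_B <- p_B_given_A * p_A / p_B

c(prior = p_A, posterior = p_A_given_B)
```

Observing B moves the probability of A from the prior value to the posterior value, which is the updating step described on the slide.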

Slide 27

Slide 27 text

Understanding Biostats - Bayesian
How do you choose the a priori distribution?
It’s subjective, although in many situations you can argue that it will not affect the posterior distribution much.
This is what keeps some people from adopting Bayesian stats.
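One way to see when the prior matters is a Beta-Binomial sketch: with a Beta(a, b) prior on a proportion and binomial data, the posterior is Beta(a + successes, b + failures). The two priors and the data sizes below are hypothetical, chosen only to contrast a small and a large sample with the same observed proportion.

```r
# Posterior mean of a proportion under a Beta(a, b) prior
posterior_mean <- function(successes, n, a, b) {
  (a + successes) / (a + b + n)
}

# Flat prior Beta(1, 1) versus a skeptical prior Beta(2, 20) (both assumed)
# Small data set: 3 successes out of 10
gap_small <- abs(posterior_mean(3, 10, 1, 1) - posterior_mean(3, 10, 2, 20))

# Large data set: 3000 successes out of 10000 (same observed proportion)
gap_large <- abs(posterior_mean(3000, 10000, 1, 1) -
                 posterior_mean(3000, 10000, 2, 20))

c(gap_small = gap_small, gap_large = gap_large)
```

With little data the two priors lead to clearly different posterior means; with a lot of data the difference nearly vanishes.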

Slide 28

Slide 28 text

Understanding Biostats - Confidence intervals
You’ve likely heard the term before.
For example, what is the 95% confidence interval for the mean of 1000 random values from a standard Normal?

    x <- rnorm(1000)
    # Lower limit: mean minus z times the standard error
    mean(x) - qnorm(0.975) * sd(x)/sqrt(1000)
    # [1] 9.491511e-05
    # Upper limit: mean plus z times the standard error
    mean(x) + qnorm(0.975) * sd(x)/sqrt(1000)
    # [1] 0.1212455

Can you tell me what the probability is that the mean of the underlying distribution is inside the 95% confidence interval?
If you can, Ale and Carlos will give you full grades for the semester!

Slide 29

Slide 29 text

Understanding Biostats - Caught you!
Don’t cheat!

Slide 30

Slide 30 text

Understanding Biostats - More on CIs
The 95% (frequentist) confidence interval we constructed does not tell us the probability that the true parameter is contained in this particular interval.
From the frequentist point of view, it is basically telling us that if we were to repeat the experiment many, many times and build an interval each time, about 95% of those intervals would contain the true mean (or whatever parameter we are interested in).
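The repeated-experiment reading of a confidence interval can be checked by simulation: repeat the experiment many times, build the interval each time, and count how often it contains the true mean (0 here). The seed and the number of repetitions `reps` are assumptions made for this sketch.

```r
set.seed(123)   # arbitrary seed (assumption)
n <- 1000       # sample size, as in the confidence interval example
reps <- 5000    # number of repeated experiments (assumption)

# For each repetition, TRUE if the 95% interval contains the true mean 0
covers <- replicate(reps, {
  x <- rnorm(n)
  half_width <- qnorm(0.975) * sd(x) / sqrt(n)
  (mean(x) - half_width <= 0) && (0 <= mean(x) + half_width)
})

# Fraction of intervals that contain the true mean
coverage <- mean(covers)
coverage
```

The coverage should come out close to 0.95, while any single interval either contains the true mean or does not.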

Slide 31

Slide 31 text

Understanding Biostats - CIs...
So, the probability that the true mean is inside a given 95% CI is either 0 or 1. Which one is it? We cannot know.
A Bayesian credible interval does tell us the probability that the parameter is inside the interval. But remember that this is based on our a priori beliefs.
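A minimal sketch of a credible interval, assuming a proportion with a flat Beta(1, 1) prior and hypothetical data (12 successes in 40 trials): the posterior is then Beta(13, 29), and an equal-tailed 95% interval comes from its quantiles. The prior, the data, and the variable names are all illustrative assumptions.

```r
# Hypothetical data and prior
successes <- 12
n <- 40
a_post <- 1 + successes        # posterior Beta shape 1
b_post <- 1 + n - successes    # posterior Beta shape 2

# Equal-tailed 95% credible interval from the posterior quantiles
cred <- qbeta(c(0.025, 0.975), a_post, b_post)

# Posterior probability that the parameter lies inside the interval
prob_inside <- pbeta(cred[2], a_post, b_post) - pbeta(cred[1], a_post, b_post)

list(interval = cred, prob_inside = prob_inside)
```

Here the 0.95 really is a probability statement about the parameter, conditional on the chosen prior.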

Slide 32

Slide 32 text

Understanding Biostats - Confused? Don’t be :)
As you can see, the three schools of Statistics have their pros and cons and you’ll eventually have to choose which one you like.
There are also mixed approaches, like using Bayesian statistics to construct methods but using them in a Frequentist setting.

Slide 33

Slide 33 text

Understanding Biostats - Math you’ll need
If you want to understand the statistical language, you’ll need to review your math first.
Calculus (differentiation, integration, change of variables), linear algebra, and real analysis are helpful.
You don’t have to get too deep if you don’t want to, but at least take a course in statistics and another one in probability :)

Slide 34

Slide 34 text

Methods development - What method do I use for my data set?
Thanks to the Central Limit Theorem, many (large) data sets are analyzable using the Normal distribution and associated methods.
What is the CLT? Well, for $n$ independent and identically distributed random variables $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ is approximately distributed as $N(\mu, \sigma^2/n)$ when $n$ is large.
However, in this high-throughput biological world new methods are needed. The goal is to keep things simple!
If you like the idea, consider studying for a master’s or PhD in Biostats!

Slide 35

Slide 35 text

Methods development: CLT example

Each $X_i$ will be a uniform random variable on (0, 1). As you can see in http://en.wikipedia.org/wiki/Uniform_distribution_(continuous), the mean of each $X_i$ is $\mu = 0.5$ and the variance is $\sigma^2 = 1/12$. Next, we will draw a sample of $n = 10$ random values and do this 1,000 times. Then, we’ll calculate the mean of each of the 1,000 samples. Finally, we’ll plot the distribution of $\bar{X}$ and compare it to the theoretical Normal distribution with mean $\mu$ and variance $\sigma^2/n$.

Slide 36

Slide 36 text

Methods development: CLT example

> # 1,000 samples (rows) of 10 uniform(0, 1) values each
> x <- matrix(runif(1000 * 10), nrow = 1000)
> dim(x)
[1] 1000   10
> # Mean of each sample
> y <- apply(x, 1, mean)
> # Theoretical Normal density with mean 0.5 and sd sqrt(1/12)/sqrt(10)
> z <- seq(0, 1, by = 0.001)
> w <- dnorm(z, mean = 0.5, sd = sqrt(1/12)/sqrt(10))
> hist(y, freq = FALSE, col = "light blue",
+     breaks = 10, ylim = c(0, 4.5))
> # Empirical density in red, theoretical density in orange
> lines(density(y), col = "red")
> lines(z, w, col = "orange")

Slide 37

Slide 37 text

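Methods development: CLT example

[Figure: "Histogram of y". x-axis: y, from about 0.3 to 0.8; y-axis: Density, from 0 to 4. The empirical density (red) and the theoretical Normal density (orange) are drawn over the histogram.]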
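Besides looking at the plot, the simulated means can be compared numerically with the values the CLT predicts. The sketch below repeats the simulation with rowMeans (equivalent to apply(x, 1, mean)) and a fixed seed; the seed and the variable names theory_mean and theory_sd are assumptions added here, not part of the original slides.

```r
# Sketch: numeric check of the CLT example.
set.seed(1)  # assumed seed, only so the numbers can be reproduced
n <- 10      # values per sample
n_rep <- 1000

x <- matrix(runif(n_rep * n), nrow = n_rep)
y <- rowMeans(x)  # same result as apply(x, 1, mean)

theory_mean <- 0.5                   # mean of a uniform(0, 1)
theory_sd <- sqrt(1 / 12) / sqrt(n)  # sigma / sqrt(n), roughly 0.091

print(c(sample_mean = mean(y), theory_mean = theory_mean))
print(c(sample_sd = sd(y), theory_sd = theory_sd))
```

The sample mean and standard deviation of y should land close to 0.5 and 0.091, respectively.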

Slide 38

Slide 38 text

Why R?: Main advantages

It’s free.
Available on Windows, Mac, Linux/Unix.
Easy to customize and extend, especially with packages.
Fairly straightforward to reproduce results. After all, in academia and industry we seek reproducibility.

Slide 39

Slide 39 text

Why R?: So . . . why didn’t I hear about R before?

In high school you probably didn’t need anything more than Excel. Note that Excel is not great with large data sets and has limited analysis tools.
The learning curve for R can be steep.
However, I’m confident that more and more people are using R! It’s already a requirement for many jobs in the bioinformatics field.

Slide 40

Slide 40 text

Why R?: Bioconductor

It’s the biggest repository (bank) of biology-related R packages. Packages are required to be well documented, including a PDF file (the vignette) that explains what the functions do and how they fit together.
In Bioconductor you’ll find nice infrastructure tools to manage your high-throughput data, new methods to analyze it, and interesting ways to visualize it!
Again, it’s free to use.
Other repositories: CRAN, R-Forge.

Slide 41

Slide 41 text

Why R?: Taking reproducibility seriously

Sweave (files with the .Rnw extension) uses LaTeX, and the final product is a nice PDF file with the text, the code, and the output from R.
Stangle creates R files (.R extension: a simple R code file) so others don’t have to type the code or extract it from the PDF file.
If you run the same code on any computer using the same version of R (and packages), you will produce the same results :) If the code uses random numbers, set a seed with set.seed() first.
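A tiny end-to-end illustration of the two tools may help. The sketch below writes a hypothetical Rnw file (example.Rnw, with a chunk named mean-example; both names are made up here) into a temporary directory and runs Sweave and Stangle from the utils package on it. Compiling the resulting .tex file to PDF needs a LaTeX installation and is not done here.

```r
# Sketch: a minimal Rnw file processed with Sweave and Stangle.
# File and chunk names are hypothetical.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the numbers 1 to 10 is:",
  "<<mean-example>>=",
  "mean(1:10)",
  "@",
  "\\end{document}"
)

# Work in a temporary directory so no files are left in your project
workdir <- tempfile("sweave-demo")
dir.create(workdir)
old_wd <- setwd(workdir)

writeLines(rnw, "example.Rnw")
Sweave("example.Rnw")   # writes example.tex (text + code + R output)
Stangle("example.Rnw")  # writes example.R (only the R code)

tex_exists <- file.exists("example.tex")
r_code <- readLines("example.R")
setwd(old_wd)

print(r_code)
```

The extracted example.R contains the code chunk, so a reader can rerun the analysis without copying anything from the PDF.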

Slide 42

Slide 42 text

Why R?: SessionInfo

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] tools     stats     graphics
[4] grDevices utils     datasets
[7] methods   base

Slide 43

Slide 43 text

Basic EDA: EDA

What do you do when you are given a data set?
Well, the first step is to do an exploratory data analysis (EDA).
How? By plotting the data in different ways.

Slide 44

Slide 44 text

Basic EDA: Which plots should I know how to do?

The key plotting functions in R are:
plot: great for plotting x-values vs y-values
hist: short for histogram
boxplot: useful for large data sets (n > 25 or so)
qqplot: what is this? It helps you compare the quantiles of your data set against those of a given distribution (or of another data set). qqnorm is the special case for the Normal distribution.
points, lines, abline and legend: for adding elements to an existing plot

Also take a look at this site, Producing Simple Graphs with R: http://www.harding.edu/fmccown/R/
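Since qqplot is the least familiar function in the list, here is a small sketch of comparing data against a non-Normal distribution. The data are simulated from an exponential distribution with rate 2; that choice, the seed, and the variable names are assumptions made only for this example.

```r
# Sketch: qqplot of data against a given (exponential) distribution.
set.seed(44)                # assumed seed
obs <- rexp(200, rate = 2)  # hypothetical data set

# Theoretical quantiles of the Exp(2) distribution at evenly spread probabilities
theo <- qexp(ppoints(length(obs)), rate = 2)

qq <- qqplot(theo, obs,
             xlab = "Theoretical Exp(2) quantiles",
             ylab = "Sample quantiles")
# If the data follow the distribution, the points fall near the line y = x
abline(0, 1, col = "red")
```

If you replace qexp with another quantile function (for example qunif), the points bend away from the line, which is how a Q-Q plot flags a poor distributional fit.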

Slide 45

Slide 45 text

Basic EDA: Example: plot, lines, abline, legend

> plot(1:5, 1:5, type = "o", col = "blue")
> lines(5:1, 1:5, lty = 2, type = "o",
+     col = "orange")
> # Horizontal line: intercept a = 3, slope b = 0
> abline(a = 3, b = 0, col = "red")
> # Legend colors match the three lines drawn above
> legend("top", legend = c("Line 1",
+     "Line 2", "Line 3"), col = c("blue",
+     "orange", "red"), lty = c(1,
+     2, 1), bty = "n")

Slide 46

Slide 46 text

Basic EDA: Example: plot, lines, abline, legend

[Figure: the plot produced by the code on the previous slide. x-axis: 1:5, y-axis: 1:5, both from 1 to 5; legend entries: Line 1, Line 2, Line 3.]

Slide 47

Slide 47 text

Basic EDA: Example: hist

> # Sum of 100 standard Normal and 100 uniform(0, 1) values
> hist.data <- rnorm(100) + runif(100)
> hist(hist.data, col = "light blue")

Slide 48

Slide 48 text

Basic EDA: Example: hist

[Figure: "Histogram of hist.data". x-axis: hist.data, from -2 to 3; y-axis: Frequency, from 0 to 15.]

Slide 49

Slide 49 text

Basic EDA: Example: hist 2

> # freq = FALSE plots densities instead of counts
> hist(hist.data, col = "light blue",
+     freq = FALSE)

Slide 50

Slide 50 text

Basic EDA: Example: hist 2

[Figure: "Histogram of hist.data". x-axis: hist.data, from -2 to 3; y-axis: Density, from 0.00 to 0.30.]

Slide 51

Slide 51 text

Basic EDA: Boxplot: using y from the CLT

> boxplot(y, col = "orange")

Slide 52

Slide 52 text

Basic EDA: Boxplot: using y from the CLT

[Figure: boxplot of y; y-axis from 0.3 to 0.8.]

Slide 53

Slide 53 text

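Basic EDA: qqnorm: using y from the CLT

> qqnorm(y)
> # Reference line: intercept 0.5 (the mean) and slope 0.1, which is
> # close to the theoretical sd, sqrt(1/12)/sqrt(10), about 0.091
> abline(0.5, 0.1, col = "red")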

Slide 54

Slide 54 text

Basic EDA: qqnorm: using y from the CLT

[Figure: "Normal Q-Q Plot" of y. x-axis: Theoretical Quantiles, from -3 to 3; y-axis: Sample Quantiles, from 0.3 to 0.8.]
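The reference line in a Normal Q-Q plot has the mean as intercept and the standard deviation as slope. As a hedged variation on the slide’s code, the sketch below regenerates y (with an assumed seed), draws the line from the theoretical values, and adds the data-driven line from qqline, which passes through the first and third quartiles by default.

```r
# Sketch: two reference lines for the Normal Q-Q plot of y.
set.seed(53)  # assumed seed; y is regenerated as in the CLT example
y <- rowMeans(matrix(runif(1000 * 10), nrow = 1000))

theory_sd <- sqrt(1 / 12) / sqrt(10)  # theoretical sd of the sample mean

qqnorm(y)
# Theoretical line: intercept = mean (0.5), slope = sd (about 0.091)
abline(a = 0.5, b = theory_sd, col = "red")
# Data-driven line through the first and third quartiles
qqline(y, col = "blue", lty = 2)
```

When the CLT approximation is good, the two lines nearly coincide and the points follow them except in the far tails.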

Slide 55

Slide 55 text

Setting up your toolbox: Choose a way to interact with R

Native R in a terminal: it’s OK for some quick tests, but not for writing code.
R GUI: not something you will want to stick to. It doesn’t have syntax highlighting, and it is not good for writing Rnw files.
Emacs: the shortcuts are not easy to learn, but it’s nice to have your code and R in the same program. It also provides syntax highlighting for other languages, including LaTeX. I personally use Aquamacs (an Emacs version) on my Mac, but GNU Emacs and XEmacs are good distributions too. The last two work on Windows, Mac, and Linux.

Slide 56

Slide 56 text

Setting up your toolbox: Choose a way to interact with R

Notepad++: with the R plugin it’s very easy to use. It only works on Windows though. You’ll be using a syntax-highlighting text editor together with the R GUI.
RStudio: seems interesting. Look at this post.

Slide 57

Slide 57 text

Setting up your toolbox: What is the latest new thing in R? BioC? etc.

It’s not easy to stay updated manually.
One way is to read the posts you find interesting from R-bloggers: http://www.r-bloggers.com/
A very interesting blog, in my opinion, is SimplyStatistics: http://simplystatistics.tumblr.com/
For visualizing data, the R Gallery provides very neat plots made with R: http://addictedtor.free.fr/graphiques/
Mailing lists are an option too, like the Bioconductor one and BioC-hts, but they are harder to follow.

Slide 58

Slide 58 text

Setting up your toolbox: So many blogs

As in many aspects of life, we want to be efficient with our time.
One way to read blogs easily is using Google Reader: http://www.google.com/reader/
You won’t have to remember to visit a specific site, and you can quickly skim through new posts.
You can add your favorite comics, like www.phdcomics.com, or more miscellaneous ones like Salmoblog and Ciencia Explicada.

Slide 59

Slide 59 text

Setting up your toolbox: Organizing your work files

Ever heard of CVS? Mercurial, git, and svn are version control systems.
They are highly useful for simple text files like R code.
Great for any coding project you’ll do, especially when collaborating with others :)
I recommend Mercurial because it’s easy and Bitbucket offers unlimited space to those of us in academia. (Bitbucket also works with git nowadays.)

Slide 60

Slide 60 text

Setting up your toolbox: In short

We need Biostatistics to analyze data in our modern study of Biology.
Understanding some biostats and being able to use R allows you to perform analyses on your own data, starting with the most common one: EDA.
Spend some time setting up your R interface and find an efficient way to stay updated in R (and/or anything of your choice!). Bioinformaticians need a toolbox that they can use easily :)
R is free, widely used, and highly reproducible. Why not start with it?

Slide 61

Slide 61 text

SessionInfo

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] tools     stats     graphics
[4] grDevices utils     datasets
[7] methods   base