Big data and reproducibility

Jeff L.
June 28, 2014

Talk at JHU summer institute.

Transcript

  1. Big data and reproducibility

  2. None
  3. N = SAMPLE SIZE

  4. N = ($ YOU HAVE) / ($ PER SAMPLE)

  5. Year $ per (human) Genome

  6. RNA-seq: 2008 N≈2; 2010 N≈70; 2013 N≈900. PMIDs: 19056941, 20220758, 24092820
  7. www.geni.com

  8. http://erlichlab.wi.mit.edu/familinx/index.html

  9. None
  10. None
  11. None
  12. None
  13. what went wrong? 2 things

  14. what went wrong? transparency The data/code weren’t reproducible

  15. what went wrong? transparency There was a lack of cooperation

  16. what went wrong? expertise They used silly prediction rules (Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4)
  17. what went wrong? expertise They had study design problems (batch effects)
  18. what went wrong? expertise Their predictions weren’t locked down. Today: Pr(FEC) = 0.8. Tomorrow: Pr(FEC) = 0.1
  19. At the end of the day the Potti analysis was fully reproducible. The problem is that the analysis was wrong.
  20. 1st Discussion Point: What is reproducibility?

  21. The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
  22. The goal: a result that is reproducible (the code and data can be used to recreate the results) and replicable (you can perform the experiment again and get the same answer)
  23. Who Reproduces Research? The truth is A; I don’t care; The truth is B; The truth is not A; Original Investigator; Reproducers; The truth is A; Scientists; General Public; ??? (Slide courtesy R. Peng)
  24. https://github.com/jtleek/datasharing

  25. 2nd Discussion Point: Statistical modeling is only part of the process
  26. What is Data Analysis? Raw Data; Cleaning / Validation; Pre-processing; Exploratory data analysis; Statistical model development; Sensitivity analysis; Finalize results / report. Statistics! (Slide courtesy R. Peng)
  27. 3rd Discussion Point: Analysis is (often) an afterthought

  28. http://bit.ly/OgW3xv

  29. None
  30. 4th Discussion Point: Traditional statistics & epidemiology ideas still matter for big data
  31. association between shoe size and literacy

  32. None
  33. None
  34. None
  35. 1. Reproducibility by data sharing 2. Big data is not just statistics 3. Analysis is often an afterthought 4. Traditional ideas still matter
  36. jhudatascience.org

  37. 9 classes, 1 month long, every month

  38. Cumulative Enrollment

  39. jtleek.com/talks
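
Slide 16 calls the rule Pr(FEC) = 5/8 [Pr(F) + Pr(E) + Pr(C)] - 1/4 "silly" without spelling out why. One possible reading, offered here as an illustration and not as the speaker's stated argument, is that the rule does not always return a valid probability. The minimal Python sketch below evaluates the rule exactly as quoted; the function names pr_fec and is_probability are hypothetical and chosen only for this example.

```python
def pr_fec(pr_f: float, pr_e: float, pr_c: float) -> float:
    """Combination rule as quoted on slide 16:
    Pr(FEC) = 5/8 * [Pr(F) + Pr(E) + Pr(C)] - 1/4.
    """
    return 5 / 8 * (pr_f + pr_e + pr_c) - 1 / 4


def is_probability(value: float) -> bool:
    """True when the value lies in the closed interval [0, 1]."""
    return 0.0 <= value <= 1.0


if __name__ == "__main__":
    # Evaluate the rule when all three inputs sit at an extreme.
    low = pr_fec(0.0, 0.0, 0.0)   # -0.25, below 0
    high = pr_fec(1.0, 1.0, 1.0)  # 1.625, above 1
    for label, value in [("all inputs 0", low), ("all inputs 1", high)]:
        print(f"{label}: Pr(FEC) = {value}, valid probability: {is_probability(value)}")
```

Under this reading, a rule whose output can fall below 0 or above 1 cannot be a probability for every combination of inputs, which is one concrete sense in which it might be considered a poor prediction rule.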