Phoenix Data Conference 2014 - Ken Buetow

teamclairvoyant

October 25, 2014

Transcript

  1. COMPLEX ADAPTIVE SYSTEMS
      Using a 1st generation Data Science research platform, the Next Generation Cyber Capability, to transform Big Data into biomedical insight.
      Ken Buetow, Ph.D.
      Director, Computation Science and Informatics, Complex Adaptive Systems @ ASU
      Professor, School of Life Science
      Kenneth.Buetow@ASU.edu
  2. Whole Genome Analysis
      Primary Data (fastq): 440GB/sample*
      “mapping” (BWA, Bowtie): 24 cores, 72 hours
      Assembly (BAM): 200GB/sample
      “variant calling” (GATK, Bambino): 16 cores, 24 hours
      Variant Identification (VCF): 100MB/sample
      Functional Annotation (ANNOVAR): 8 cores, 15 minutes
      * Assuming 60X coverage
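The per-sample figures on slide 2 can be turned into a rough compute and storage budget. The sketch below is illustrative only. The pairing of each size and runtime with a pipeline stage is a reading of the slide's layout, the class and function names are hypothetical, the VCF size is taken as 0.1 GB (100 MB, decimal units), and the annotation output size, which the slide does not give, is counted as zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline step, with the resources quoted on slide 2."""
    name: str
    cores: int
    hours: float
    output_gb: float  # size of the file this step produces, per sample


# Assumption: numbers are matched to stages by their position on the slide.
STAGES = [
    Stage("mapping (BWA, Bowtie) -> BAM", cores=24, hours=72.0, output_gb=200.0),
    Stage("variant calling (GATK, Bambino) -> VCF", cores=16, hours=24.0, output_gb=0.1),
    # The slide gives no output size for annotation; counted as 0 here.
    Stage("functional annotation (ANNOVAR)", cores=8, hours=0.25, output_gb=0.0),
]

PRIMARY_FASTQ_GB = 440.0  # primary data per sample at 60X coverage


def core_hours(stages):
    """Total core-hours for one sample across all stages."""
    return sum(s.cores * s.hours for s in stages)


def storage_gb(stages, primary_gb):
    """Storage for one sample if every intermediate file is kept."""
    return primary_gb + sum(s.output_gb for s in stages)


if __name__ == "__main__":
    for s in STAGES:
        print(f"{s.name}: {s.cores * s.hours:.0f} core-hours")
    print(f"total: {core_hours(STAGES):.0f} core-hours per sample")
    print(f"storage: {storage_gb(STAGES, PRIMARY_FASTQ_GB):.1f} GB per sample")
```

Under those assumptions a single 60X sample costs about 2,114 core-hours and about 640 GB of storage, and the mapping step accounts for most of the compute.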
  3. Cancer… …it’s complicated
  4. Cancer is a Complex Adaptive System
      base state, alteration, selection, alteration, selection, alteration, selection, alteration, selection, malignant state
  5. Cancer is a Complex Adaptive System
      base state, malignant state
      mutation, copy number, gene expression, micro RNA, epigenetic, glycolysis
  6. Cancer is a Complex Adaptive System
      base state, malignant state
      evade apoptosis, telomere maintenance, cell cycle regulation, angiogenesis, growth factor independence
  7. Cancer is a Complex Adaptive System
      base state, malignant state
      evade apoptosis, telomere maintenance, cell cycle regulation, angiogenesis, growth factor independence
  8. Cancer is a Complex Adaptive System
      base state, malignant state
      evade apoptosis, telomere maintenance, cell cycle regulation, angiogenesis, growth factor independence
  9. Cancer is a Complex Adaptive System
      base state, malignant state
      genetic constitution
      • angiogenesis
      • cellular matrix
      • immune response
      • chemical
      • virus
      • hormone
      • nutrition
  10. OK… …it’s VERY complicated
  11. “Big Data”
      volume, variety, velocity
      Laney: Gartner 2001, 2012
      NSF/NIH 2012
  12. “big data”
      genome, phenome, exposome
  13. Phenome Data
      • Diverse types
        – Clinical Observation
        – Clinical Laboratory
        – Imaging
        – Registry
        – Biospecimens
        – Reference
      • Distributed sources
        – Research Center
        – Care Delivery Setting
          • Hospital
          • Practice
          • Laboratory
        – Registry
        – Industry
        – Consumer
  14. Real time consumer data: the next Big Data Challenge
      Shipments of telehealth devices grow to about 7 million by 2018
      cardiovascular, diabetes, fitness
      Source: www.ihs.com World Market for Telehealth – 2014 Edition
  15. 4th Paradigm Science
      • A new method of pushing forward the frontiers of knowledge, enabled by new technologies for gathering, manipulating, analyzing and displaying data.
      • Complementing data-generating science with data-driven science
      • Ecumenical
        – Astronomy
        – Physics
        – Economics
        – Climate
        – Genomics
      • Transdisciplinary
  16. Data Scientist: The Sexiest Job of the 21st Century
      October 2012
      by Thomas H. Davenport and D.J. Patil
  17. [John Wanamaker] (July 11, 1838 – December 12, 1922) was a United States merchant, religious leader, civic and political figure, considered by some to be the father of modern advertising and a “pioneer in marketing.” (Wikipedia 2014)
  18. ‘The best minds of my generation are thinking about how to make people click ads… …that sucks.’
      Jeff Hammerbacher
      In 2006 (at 23), as one of Facebook’s 1st 100 employees, his assignment was uncovering why Facebook took off at some universities and flopped at others (Bloomberg Business Week 2011).
      Currently Assistant Professor, Genetics and Genomic Sciences, Mount Sinai Hospital, New York, New York
      Photo: New York Times, Big Data, June 19, 2013
  19. Creating a new Data Science “Instrument”: A Next Generation Cyber Capability (NGCC)
  20. A Biomedical Informatics e-Ecosystem
      Diagram labels: big data resources, big data analytics, datamarts, apps, Metadata Resources, Services Registry, Security Services, ultra-high speed connectivity, high speed connectivity
  21. The ASU NGCC Data Science “Instrument”: an elemental whole composed of:
      • Physical Capacity
        – Ultra-high bandwidth Networks
        – Large-scale storage
        – Multiple “flavors” of computation
      • Logical Capabilities
        – Software
        – Metadata
        – Semantics
      • Human Resources
        – Transdisciplinary Teams
  22. NGCC Physical Infrastructure
      Data Reservoir, Scratch Space, Big Data, Transactional, HPC SMA, HPC parallel, High Speed Connectivity
  23. TransCORE Framework Knowledge Engine
      Capacity
      Context
        • Ontologies
        • Data Elements/Information Models
        • Middleware
      Transact
        • Clinical Research
        • Life Science Research
        • Qualitative Research
      Data Resources
        • File System
        • Relational
        • Key/Value
      Analytic
        • General Purpose
        • Genomic
        • Big Data
      Physical infrastructure: Data Reservoir, Scratch Space, Big Data, Transactional, HPC SMA, HPC parallel, High Speed Connectivity
  24. NGCC Data Science “Instrument”
      Capacity
      Context
        • Ontologies
        • Data Elements/Information Models
        • Middleware
      Transact
        • Clinical Research
        • Life Science Research
        • Qualitative Research
      Data Resources
        • File System
        • Relational
        • Key/Value
      Analytic
        • General Purpose
        • Genomic
        • Big Data
      Content
        • CRF
        • Instrument
        • EHR
        • Document
        • Filing
        • Practice Guidelines
        • Physician Experience
        • Web site
        • Wiki
        • Media
        • Social media
      Physical infrastructure: Data Reservoir, Scratch Space, Big Data, Transactional, HPC SMA, HPC parallel, High Speed Connectivity
  25. Next Generation Cyber Infrastructure Elastic Capabilities
      logical, physical, staff
      Standing Capabilities
      On-demand Capabilities
  26. NGCC Business Architecture “Collaboratory”
      • Participants
        – Academia
        – Government
        – Industry
      • Contributions
        – Resources
        – Capabilities
  27. Cancer, Liver Disease, Cardiovascular disease, Type II Diabetes, Obesity
  28. BMI associated with higher risk of:
      • Uterus (1.62)
      • Gall Bladder (1.31)
      • Kidney (1.25)
      • Liver (1.19)
      • Colon (1.10)
      • (six additional)
  29. HCC Incidence Trends in the U.S. 1973-2007
      • Worldwide: 6th most common cancer and 3rd most common cause of cancer mortality
      • Hepatocellular carcinoma (HCC) is the most common histologic type of liver cancer
      Chart: HCC SEER 9 1973-2007, Number/100,000 (axis 0 to 12) by period (1973-77, 1983-87, 1993-97, 2003-07); series: All races, White, Black, AIAN, API (McGlynn et al. 2010)
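  30. U.S. Risk of HCC
      • HCV: U.S. prevalence 1.6%, odds ratio 44.2, attributable risk 21%
      • HBV: U.S. prevalence 0.5%, odds ratio 13.4, attributable risk 6%
      • Alcohol: U.S. prevalence 5.0%, odds ratio 4.4, attributable risk 24%
      • Diabetes: U.S. prevalence 8.0%, odds ratio 2.4
      • Obesity: U.S. prevalence 66.6%, odds ratio 1.5
      • Diabetes, Obesity: attributable risk 35% (a single value is shown for the pair)
      (McGlynn et al. 2010)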
  31. Development of HCC
      Stages shown: Healthy liver, Acute hepatitis, Liver fibrosis, Liver cirrhosis, HCC
      Risk factors:
      • HBV
      • HCV
      • BMI
      • Diabetes
      • Alcohol
      • AFB1
  32. “mapping” phenotype to underlying process
      • Transmission maps
        – Family studies
        – Association
      • Gene-based signatures
        – Altered gene expression associated with phenotype
      • Network Interaction signatures
        – Altered interactions associated with phenotype
          • Active/inactive
          • Consistent/inconsistent
  33. “mapping” phenotype to underlying process
      • Transmission maps
        – Family studies
        – Association
      • Gene-based signatures
        – Altered gene expression associated with phenotype
      • Network Interaction signatures
        – Altered interactions associated with phenotype
          • Active/inactive
          • Consistent/inconsistent
  34. HCC genetic analysis
      • Study Population (Clifford et al., Hepatology 2010)
        – 386 Korean HBV/HCV positive HCC cases
        – 100 Korean HBV/HCV associated cirrhosis cases
        – 587 Korean controls
        – 100 Chinese HBV positive controls
      • Affymetrix 6.0 platform
        – More than 906,600 SNPs
        – More than 946,000 copy number probes
          • 202,000 probes targeting 5,677 CNV regions from the Toronto Database of Genomic Variants
          • 744,000 probes, evenly spaced along the genome
  35. Biological Pathways Associated with HCC/Cirrhosis
      Each entry gives the pathway, its P-value, and the significant genes in the pathway.
      • Antigen processing and presentation (Kegg), P = 1x10^-11: HLA-A, HLA-B, HLA-C, HLA-DOA, HLA-DOB, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRA, HLA-DRB1, HLA-DRB5, HLA-E, HLA-F, HLA-G, LTA, TAP1, TAP2
      • Cell adhesion molecules (CAMs) (Kegg), P = 4x10^-10: CD58, HLA-A, HLA-B, HLA-C, HLA-DOA, HLA-DOB, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRA, HLA-DRB1, HLA-DRB5, HLA-E, HLA-F, HLA-G, NRCAM, NRXN1, NRXN2
      • Antigen processing and presentation (Biocarta), P = 6x10^-6: HLA-A, HLA-DRA, HLA-DRB1, TAP1, TAP2
      • Classical complement pathway (Biocarta), P = 2x10^-5: C1QB, C2, C4A, C8A, C8B
      • Lectin induced complement pathway (Biocarta), P = 2x10^-4: C2, C4A, C8A, C8B
  36. Attacking Diabesity: Published Genetic Study Data
      • Type 2 Diabetes
        – 54 data sets
      • Obesity
        – 13 data sets
      • Liver Disease
        – 12 data sets
      TOTAL: 79
  37. “Available” Obesity-related Data
      • eMERGE Genome-Wide Association Studies of Obesity
      • Population Architecture Using Genomics and Epidemiology (PAGE): Multiethnic Cohort
      • The Thrifty Microbiome: The Role of the Gut Microbiota in Obesity in the Amish
      • Whole Genome Association Study of Visceral Adiposity in the Health, Aging and Body Composition (Health ABC) Study
      • Northwestern Nugene Project: Type 2 Diabetes
      • Starr County Health Studies’ Genetics of Diabetes Study
      • A Whole Genome Association Search for Type 2 Diabetes Genes in African Americans
      • T2D-GENES Project 2: San Antonio Mexican American Family Studies
      • GENEVA Genes and Environmental Initiatives in Type 2 Diabetes (Nurses’ Health Study/Health Professionals Follow-Up Study)
      Number of Participants (as listed on the slide, in order): 21086, 982, 486, 2802, 3563, 1980, 2004, 2802, 6033
      TOTAL: 9
  38. Summary
      • Personalized Medicine combines new abilities to characterize the molecular state of the individual and the disease with diverse clinical and lifestyle information
      • Personalized Medicine is a Big Data Problem
      • Data Science is a 4th Paradigm pursuit that extracts insight from Big Data
      • New Computational Research Platforms will enable Data Science
      • ASU is constructing a first generation Data Science research platform: the NGCC