Slide 1

Slide 1 text

Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University Work Log 06/12 Speaker: Liang Bo Wang 2014.06 Slides by Liang Bo Wang

Slide 2

Slide 2 text

BioCloud
•  Architecture overview
•  Development/project progress
•  Possible paper/poster submissions

Slide 3

Slide 3 text

Architecture overview
Technical detail
•  VM for each user
•  or Hadoop cluster … don’t care
•  Communicate by defined API
Web Frontend
•  Report generator (our part)
•  and user/analysis management (FXN)

Slide 4

Slide 4 text

Workflow (viewed by function)
Specifically, we are working on this part

Slide 5

Slide 5 text

Objectives for report generator
•  From the viewpoint of either an NGS service provider or a web developer, this report generator should
    –  Generate a static/local/portable analysis report for the service user
    –  Show a summary report on the web after a submitted job finishes
•  Therefore our generator first takes local file input and produces a local report
•  The report is then hosted on the web (basically)
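A minimal sketch of the "local input, local report" idea, assuming a hypothetical dictionary of summary metrics and a made-up page layout. The names `render_report` and `write_report` and the HTML template are illustrative only and are not taken from the actual generator.

```python
import html
from pathlib import Path
from string import Template

# Hypothetical page layout; the real generator's template is not shown in the slides.
PAGE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$title</title></head>
<body>
<h1>$title</h1>
<table>
$rows
</table>
</body>
</html>
""")


def render_report(title, summary):
    """Render a self-contained HTML page from a dict of metric name -> value."""
    rows = "\n".join(
        "<tr><th>{}</th><td>{}</td></tr>".format(
            html.escape(str(name)), html.escape(str(value))
        )
        for name, value in summary.items()
    )
    return PAGE.substitute(title=html.escape(title), rows=rows)


def write_report(path, title, summary):
    """Write the rendered page to a local file so it can be opened offline or hosted as-is."""
    out = Path(path)
    out.write_text(render_report(title, summary), encoding="utf-8")
    return out
```

Because the output is a single static file, the same page could be handed to a service user or served by the web frontend without changes.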

Slide 6

Slide 6 text

Manual for report generator
•  A manual for result interpretation
•  Use Sphinx for manual generation
    –  Converts plain text (reStructuredText, rst) into HTML pages
    –  Easier to maintain than Word
•  How/who/when to fill in all the contents?
Link to detailed manual page
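A minimal `conf.py` sketch for such a manual, assuming a standard Sphinx source directory that contains an `index.rst`. The project title and theme below are placeholders chosen for illustration, not the settings actually used.

```python
# conf.py -- minimal Sphinx configuration (placeholder values, assumed for illustration)

project = "Report Manual"        # hypothetical manual title
author = "Bioinformatics and Biostatistics Core"
master_doc = "index"             # root document: index.rst
source_suffix = ".rst"           # manual pages are written in reStructuredText
extensions = []                  # no extra Sphinx extensions needed for plain pages
html_theme = "alabaster"         # theme bundled with current Sphinx; any installed theme would work
```

With this file next to `index.rst`, running `sphinx-build -b html . _build/html` should produce the HTML pages.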

Slide 7

Slide 7 text

How Sphinx works
(Figure: pages generated by Sphinx and docutils from RST files)

Slide 8

Slide 8 text

Development progress
•  Result page status for two pipelines
    –  Tuxedo: Cufflinks and Cuffdiff remain
    –  VarScan: almost done, changing lib to jsGrid
•  STAR and GATK are still in progress
•  Rewriting the generator to reuse the same result subpage, such as FastQC, TopHat or BWA
•  Writing the parser for real result data (generated last week)
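One way a parser for a reusable FastQC subpage could start is sketched below. It assumes FastQC's `summary.txt` layout of tab-separated status, module name and input file name; the function names and the example rows are hypothetical and do not come from the actual parser.

```python
from collections import namedtuple

# One row of FastQC's summary.txt: status (PASS/WARN/FAIL), module name, input file.
ModuleResult = namedtuple("ModuleResult", ["status", "module", "filename"])


def parse_fastqc_summary(text):
    """Parse the text of a FastQC summary.txt into a list of ModuleResult rows."""
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, filename = line.split("\t")
        results.append(ModuleResult(status, module, filename))
    return results


def count_by_status(results):
    """Count modules per status, e.g. for a one-line summary on the result page."""
    counts = {}
    for row in results:
        counts[row.status] = counts.get(row.status, 0) + 1
    return counts


# Made-up example input, for illustration only.
example = (
    "PASS\tBasic Statistics\tsample_R1.fastq\n"
    "WARN\tPer base sequence content\tsample_R1.fastq\n"
    "FAIL\tPer sequence GC content\tsample_R1.fastq\n"
)
rows = parse_fastqc_summary(example)
print(count_by_status(rows))
```

Keeping the parser separate from the page rendering is what would let the Tuxedo and VarScan pipelines share one FastQC subpage.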

Slide 9

Slide 9 text

Project progress
•  Co-IP contract modification
    –  Received review (draft) from NTU consultant
    –  Expect to get advice on contract modification today
•  Midterm report (due Jun. 27)
    –  Received template from Dr. Dai
    –  Cover NGS pipelines in use
    –  Reuse the content in the manual for the result page
    –  Most people here are expected to be involved

Slide 10

Slide 10 text

•  DNA-Seq pipeline documentation and script request from FXN
    –  Granted?
•  1st poster at APCMBE (Asia-Pacific biomedical engineering annual conference) (due Jun. 22)
    –  Subject: NGS data reading and QC processing
    –  Python package Nextbiopy
    –  With an example use case
    –  ARI co-author?
•  Finished the survey of further poster submissions

Slide 11

Slide 11 text

Submission: GIW / ISCB Asia 2014
•  Dec 15-17, Tokyo (http://www.jsbi.org/giw2014)
    –  ISCB = International Society for Computational Biology
    –  GIW = Genome Informatics Workshop
•  Proceedings published in journals such as Bioinformatics, BMC Genomics, JBCB and so on
•  Deadline
    –  Jul 7 paper/oral
    –  Aug 25 poster submission

Slide 12

Slide 12 text


Slide 13

Slide 13 text

Submission: AYRCOB
•  Jan 19-20, 2015, Hsinchu (http://2015.ayrcob.org/)
    –  AYRCOB = Asian Young Researchers Conference on Computational and Omics Biology
•  Jul 31 submission deadline (not sure whether for poster or paper)
•  Not sure about the date of the acceptance announcement
•  Too late?

Slide 14

Slide 14 text

Future Plan
•  Make sure the midterm report meets the deadline
    –  Fill in the content collaboratively
•  Continue report generator / result parser development
•  Abstract for APCMBE poster
•  Initiate the structure for the report manual

Slide 15

Slide 15 text

MRSA Research
•  Possible goal
•  Available large datasets in lab
•  ICGC related cancer project intro

Slide 16

Slide 16 text

Goal
•  Identify a diagnosis that humans or current technology can do well but that is difficult to scale
    –  Ex. Pathology analysis on biopsy
    –  Ex. Some somatic mutations confirmed to develop cancer based on SNP microarray
•  Boost the prediction rate or speed up the prediction process by
    –  Distributed computation
    –  Multiple sources of data to do multiple instance learning

Slide 17

Slide 17 text

Multiple Instance Learning
•  A family of algorithms in which Microsoft has held a leading position for years, used in video pattern recognition and linguistic analysis
•  Requires data of the same observation from multiple sources
    –  Multiple sources of data (SNP, CNV, RNA-Seq, ChIP-Seq data)
    –  Large data size for model training (this will be a complex model anyway)
•  Asked whether the lab has such data sets (>100 samples)
    –  Replied: small sample size in NGS data but not sure about microarray data
    –  Better if accompanied by clinical data (concern about privacy issues)
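As a rough illustration only, bag-level prediction under the standard multiple instance assumption (a bag is positive if at least one of its instances is positive) can be sketched as below. Treating each sample as a bag and each data source as an instance is one possible reading of the slide; the linear scorer, weights and toy data are all made up and this is not the model under discussion.

```python
def instance_score(features, weights, bias=0.0):
    """Linear score for a single instance (hypothetical scorer for illustration)."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def bag_score(bag, weights, bias=0.0):
    """Score a bag by its highest-scoring instance (max pooling)."""
    return max(instance_score(inst, weights, bias) for inst in bag)


def predict_bag(bag, weights, bias=0.0, threshold=0.0):
    """Label a bag positive when any instance scores above the threshold."""
    return bag_score(bag, weights, bias) > threshold


# Toy data: each bag is one sample; each instance is a 2-feature vector.
weights = [1.0, -1.0]
positive_bag = [[0.1, 0.9], [2.0, 0.5]]   # second instance scores 1.5
negative_bag = [[0.1, 0.9], [0.2, 0.4]]   # both instances score below zero
print(predict_bag(positive_bag, weights), predict_bag(negative_bag, weights))
```

Only bag-level labels are needed for training in this setting, which is why the approach is attractive when per-instance labels are unavailable.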

Slide 18

Slide 18 text

Workaround for large datasets
•  Search for large public datasets, but finding multiple-source data in public is hard
•  Take a look at projects like TCGA
    –  Data policy has changed
    –  For level I/II, an application for data access is required
•  Anyway, after some survey of such datasets, here is a summary of cancer genomics projects

Slide 19

Slide 19 text


Slide 20

Slide 20 text


Slide 21

Slide 21 text

The following content is directly extracted from slides made by bioinformatics.ca, which are shared under a CC BY 2.5 license

Slide 22

Slide 22 text

ICGC BAM/FASTQ TCGA BAM/FASTQ ICGC Open Data (includes

Slide 23

Slide 23 text

Module 1: Cancer Genomic Databases bioinformatics.ca ICGC Map – November 2013

Slide 24

Slide 24 text

Module 1: Cancer Genomic Databases bioinformatics.ca
ICGC datasets to date
(Figure: ICGC Data Portal Cumulative Donor Count for Member Projects; x-axis: Release 7 to Release 14, Dec 2011 to Sept 2013; y-axis: Number of Donors, 1000 to 10,000. Credit: Hardeep Nahal)

Slide 25

Slide 25 text

ICGC dataset version 14
•  Cancer types: 41
•  Donors: 8,532 (18,056 specimens)
•  Simple somatic mutations: 1,995,134
•  Copy number mutations: 18,526,593
•  Structural rearrangements: 18,614
•  Genes affected* by simple somatic mutations: 22,074
•  Genes affected* by non-synonymous coding mutations: 19,150
•  Genes affected* by copy number mutations: 20,341
•  Genes affected* by structural rearrangements: 1,884
•  *out of 22,259 protein coding genes annotated in Ensembl Human release 69
•  Open tier and controlled data currently available

Slide 26

Slide 26 text

End of the extraction

Slide 27

Slide 27 text


Slide 28

Slide 28 text


Slide 29

Slide 29 text

ICGC Data Portal: Lung Cancer - KR (LUSC-KR)
Summary
•  Code: LUSC-KR
•  Name: Lung Cancer - KR
•  Primary Site: Lung
•  Tumour Type: Lung cancer
•  Tumour Subtype: Squamous cell carcinoma
•  Countries: South Korea
•  Total number of donors: 111
Experimental Analyses
•  WXS: 111 samples from 111 donors
•  Raw data is available at the European Genome-phenome Archive; an approved data access request is required
Available Data Types
•  Clinical Data: 111 donors
•  Simple Somatic Mutations (SSM): 111 donors
•  Copy Number Somatic Mutations (CNSM): --
•  Structural Somatic Mutations (StSM): --
•  Simple Germline Variants (SGV): --
•  Array-based DNA Methylation (METH-A): --
•  Sequence-based DNA Methylation (METH-S): --
•  Array-based Gene Expression (EXP-A): --
•  Sequence-based Gene Expression (EXP-S): --
•  Protein Expression (PEXP): --
•  Sequence-based miRNA Expression (miRNA): --
•  Exon junction (JCN): --

Slide 30

Slide 30 text

Most Frequently Mutated Genes (LUSC-KR), showing 10 of 27,347 genes

Symbol | Name | Location | Type | Donors affected across all projects | # Mutations
TTN | titin | chr2:179390716-179695529 | protein_coding | 2,122 / 6,590 (32.20%) | 189
TTN-AS1 | TTN antisense RNA 1 | chr2:179385910-179639402 | antisense | 2,029 / 6,590 (30.79%) | 178
TP53 | tumor protein p53 | chr17:7565097-7590856 | protein_coding | 2,020 / 6,590 (30.65%) | 59
SNHG14 | small nucleolar RNA host gene 14 (non-protein coding) | chr15:25223730-25664609 | processed_transcript | 778 / 6,590 (11.81%) | 92

(Chart: % of donors affected in LUSC-KR; genes shown: TTN, TTN-AS1, TP53, SNHG14, RYR2, USH2A, MUC16, ZFHX4, MT-CO1, CSMD3; bar labels: 61 / 111 (54.95%), 60 / 111 (54.05%), 57 / 111 (51.35%), 53 / 111 (47.75%))
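The percentages in the portal listing are the number of affected donors divided by the number of donors in scope (111 for LUSC-KR, 6,590 across all projects). A small sketch that reproduces the displayed format is shown below; the helper name `donor_fraction` is hypothetical.

```python
def donor_fraction(affected, total):
    """Format affected/total in the portal's style, e.g. '61 / 111 (54.95%)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    return "{:,} / {:,} ({:.2%})".format(affected, total, affected / total)


print(donor_fraction(61, 111))      # a value from the LUSC-KR chart
print(donor_fraction(2122, 6590))   # TTN across all projects
```

A result above 100% would indicate that the numerator and denominator come from different scopes, which is a quick sanity check when copying such values into a report.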

Slide 31

Slide 31 text

Most Frequent Mutations (LUSC-KR), showing 10 of 60,288 mutations

ID | DNA change | Type | Consequences | Donors affected across all projects
MU5219 | chr3:g.178936091G>A | single base substitution | Missense: PIK3CA E545K; Upstream: PIK3CA | 144 / 6,590 (2.19%)
MU24637 | chr17:g.7577120C>T | single base substitution | Missense: TP53 R141H, R273H; NC Exon: TP53; Upstream: TP53; Downstream: TP53; Intron: TP53 | 72 / 6,590 (1.09%)
MU5286 | chr17:g.7577121G>A | single base substitution | Missense: TP53 R273C, R141C; NC Exon: TP53 | 65 / 6,590 (0.99%)

(Chart: donors affected per mutation; mutations shown: MU5219, MU24637, MU5286, MU55099, MU64353, MU69856, MU17943, MU67642, MU66992, MU64201)

Slide 32

Slide 32 text


Slide 33

Slide 33 text

Misc.
•  VM server tool update
•  Paper reading recap last week

Slide 34

Slide 34 text

Misc.
•  VM image Debian Jessie is added
•  RDP connection (Windows Remote Desktop) is now possible
•  FreeNX -> X2go
    –  FreeNX is outdated
    –  X2go is based on the NoMachine NX3 protocol (limit of 2 concurrent connections?)
    –  Some connection latency and failures encountered
    –  Still resolving problems

Slide 35

Slide 35 text

Misc. (cont’d)
•  Recap of last week's paper
    –  Depending on the event outcome, gene features can be more useful
•  CCRT miRNA reanalysis
    –  Find differentially expressed miRNA in different conditions
    –  Still discussing methods

Slide 36

Slide 36 text

Thank You!
Q&A Time