Taiwan University
Architecture overview
Technical detail
• VM for each user
• or Hadoop cluster … don’t care
• Communicate by defined API
Web Frontend
• Report generator (our part)
• and user/analysis management (FXN)
2014.06 Slides by Liang Bo Wang

Objectives for report generator
• From the viewpoint of either the NGS service provider or the web developer, the report generator should
 – Generate a static/local/portable analysis report for the service user
 – Let the user view a summary report on the web after a submitted job finishes
• Therefore our generator first takes local file input and produces a local report
• Then host the report on the web (basically)

Manual for report generator
• A manual for result interpretation
• Use Sphinx for manual generation
 – Converts plain text (reStructuredText, rst) into HTML pages
 – Easier to maintain than Word
• How/who/when to fill in all the contents?
Link to detailed manual page
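
As a starting point for the manual, a minimal Sphinx `conf.py` might look like the sketch below. The project title, author string, and theme are placeholders rather than the project's actual settings, and the rst sources (for example an `index.rst`) are assumed to sit next to this file.

```python
# conf.py -- minimal Sphinx configuration (a sketch; all values are placeholders)

# Project metadata shown in the generated HTML pages.
project = "Report generator manual"   # hypothetical title
author = "NGS report team"            # hypothetical author string

# No extensions are needed for plain reStructuredText pages.
extensions = []

# Paths (relative to this file) that Sphinx should skip when collecting sources.
exclude_patterns = ["_build"]

# Built-in HTML theme; any installed theme name would work here.
html_theme = "alabaster"

# Build the HTML pages from the directory containing this file with:
#   sphinx-build -b html . _build/html
```

Keeping the manual as rst files plus one small configuration file is what makes it easy to version and review alongside the generator code.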

Development progress
• Result page status for two pipelines
 – Tuxedo: Cufflinks and Cuffdiff remain
 – VarScan: almost done, changing the library to jsGrid
• STAR and GATK are still in progress
• Rewriting the generator so that pipelines reuse the same result subpages, such as FastQC, TopHat or BWA
• Writing the parser for real result data (generated last week)
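
The subpage reuse mentioned above can be sketched as a small registry of renderers shared by all pipelines. This is only an illustration under stated assumptions: the function names, the HTML layout, and the pipeline compositions (which tools each pipeline includes) are hypothetical and not taken from the actual generator.

```python
from html import escape


def make_subpage(title):
    """Return a renderer that turns a dict of parsed metrics into an HTML fragment."""
    def render(metrics):
        rows = "".join(
            "<tr><td>{}</td><td>{}</td></tr>".format(escape(str(k)), escape(str(v)))
            for k, v in sorted(metrics.items())
        )
        return "<section><h2>{}</h2><table>{}</table></section>".format(
            escape(title), rows
        )
    return render


# Shared subpage renderers, registered once and reused by every pipeline.
SUBPAGES = {
    "fastqc": make_subpage("FastQC"),
    "tophat": make_subpage("TopHat"),
    "bwa": make_subpage("BWA"),
}

# Assumed pipeline layouts (illustrative only): both reuse the FastQC subpage.
PIPELINES = {
    "tuxedo": ["fastqc", "tophat"],
    "varscan": ["fastqc", "bwa"],
}


def build_report(pipeline, results):
    """Assemble a static HTML report for one pipeline from parsed results."""
    sections = "".join(
        SUBPAGES[name](results.get(name, {})) for name in PIPELINES[pipeline]
    )
    return "<html><body><h1>{}</h1>{}</body></html>".format(
        escape(pipeline), sections
    )


if __name__ == "__main__":
    demo = {"fastqc": {"total_reads": 1000}, "bwa": {"mapped_reads": 950}}
    print(build_report("varscan", demo))
```

With this layout, adding a pipeline such as STAR or GATK would mean registering only the subpages that are new and listing the existing ones by name.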

Project progress
• Co-IP contract modification
 – Received review (draft) from NTU consultant
 – Expect to get advice on contract modification today
• Midterm report (due Jun. 27)
 – Received template from Dr. Dai
 – Covers the NGS pipelines in use
 – Reuse the content in the manual for the result pages
 – Most people here are expected to be involved

• DNA-Seq pipeline documentation and script request from FXN
 – Granted?
• 1st poster at APCMBE (亞太醫工年會, the Asia-Pacific medical engineering annual meeting) (due Jun. 22)
 – Topic: NGS data reading and QC processing
 – Python package Nextbiopy
 – With an example use case
 – ARI co-author?
• Finished a survey of further poster submission opportunities

Submission: GIW / ISCB Asia 2014
• Dec 15-17, Tokyo (http://www.jsbi.org/giw2014)
 – ISCB = International Society for Computational Biology
 – GIW = Genome Informatics
• Proceedings are published in journals such as Bioinformatics, BMC Genomics, JBCB and so on
• Deadlines
 – Jul 7: paper/oral
 – Aug 25: poster submission

Submission: AYRCOB
• Jan 19-20, 2015, Hsinchu (http://2015.ayrcob.org/)
 – AYRCOB = Asian Young Researchers Conference on Computational and Omics Biology
• Jul 31 submission deadline (not sure whether for poster or paper)
• Not sure about the date of the acceptance announcement
• Too late?

Future Plan
• Make sure the midterm report meets the deadline
 – Fill in the content collaboratively
• Continue report generator / result parser development
• Write the abstract for the APCMBE poster
• Set up the structure of the report manual

Goal
• Identify a diagnostic task that humans or current technology do well but that is difficult to scale
 – Ex. Pathology analysis on biopsy
 – Ex. Somatic mutations confirmed to lead to cancer, based on SNP microarray
• Boost the prediction rate or speed up the prediction process by
 – Distributed computation
 – Multiple instance learning on multiple sources of data

Multiple Instance Learning
• A class of algorithms in which Microsoft has held a leading position for years, used in video pattern recognition and linguistic analysis
• Requires data on the same observation from multiple sources
 – Multiple sources of data (SNP, CNV, RNA-Seq, ChIP-Seq data)
 – Large data size for model training (this will be a complex model anyway)
• Asked whether the lab has such datasets (>100 samples)
 – Reply: small sample size in NGS data, but not sure about microarray data
 – Better if accompanied by clinical data (concern about privacy issues)
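
The slide does not say which multiple instance learning formulation is intended. Under the standard assumption, labels are attached to bags of instances rather than to individual instances, and a bag counts as positive if at least one of its instances is positive. The sketch below illustrates only that bag-level rule; the linear scorer, the weights, and the idea of treating one sample's feature vectors as a bag are assumptions made for illustration, and nothing here is trained.

```python
def instance_score(features, weights):
    """Hypothetical linear instance scorer: dot product of features and weights."""
    return sum(f * w for f, w in zip(features, weights))


def bag_score(bag, weights):
    """Standard MIL assumption: a bag is as positive as its most positive instance."""
    return max(instance_score(instance, weights) for instance in bag)


def predict_bag(bag, weights, threshold=0.0):
    """Label the bag positive (True) if any instance scores above the threshold."""
    return bag_score(bag, weights) > threshold


if __name__ == "__main__":
    weights = [1.0, -1.0]                      # made-up weights, not learned
    bag_a = [[0.1, 0.9], [0.8, 0.2]]           # second instance scores above 0
    bag_b = [[0.1, 0.9], [0.2, 0.5]]           # no instance scores above 0
    print(predict_bag(bag_a, weights), predict_bag(bag_b, weights))
```

Because only the bag label is needed for training, this setting suits data where a whole sample has a known outcome but individual measurements do not.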

Workaround for large datasets
• Search for large public datasets, but finding multiple data sources in public data is hard
• Take a look at projects like TCGA
 – Data policy has changed
 – Level I/II data require an application for access
• Anyway, after some survey of such datasets, here is a summary of cancer genomics projects

The following content is directly extracted from slides made by bioinformatics.ca, which are shared under CC 2.5 BY license

[Figure: ICGC Data Portal, cumulative donor count for member projects. x-axis: Dec 2011 to Sept 2013, marked with Release 7 through Release 14; y-axis: number of donors, 1000 to 10,000. Credit: Hardeep Nahal]

ICGC dataset version 14
• Simple somatic mutations: 1,995,134
• Copy number mutations: 18,526,593
• Structural rearrangements: 18,614
• Genes affected* by simple somatic mutations: 22,074
• Genes affected* by non-synonymous coding mutations: 19,150
• Genes affected* by copy number mutations: 20,341
• Genes affected* by structural rearrangements: 1,884
• *out of 22,259 protein coding genes annotated in Ensembl Human release 69
• Open tier and controlled data currently available

ICGC Data Portal: Lung Cancer - KR (project summary)
• Code: LUSC-KR
• Name: Lung Cancer - KR
• Primary site: Lung
• Tumour type: Lung cancer
• Tumour subtype: Squamous cell carcinoma
• Countries: South Korea
• Total number of donors: 111
• Experimental analyses: WXS, 111 samples from 111 donors
• Raw data is available at the European Genome-phenome Archive; an approved data access request is required
• Available data types
 – Clinical Data: 111 donors
 – Simple Somatic Mutations (SSM): 111 donors
 – Copy Number Somatic Mutations (CNSM): --
 – Structural Somatic Mutations (StSM): --
 – Simple Germline Variants (SGV): --
 – Array-based DNA Methylation (METH-A): --
 – Sequence-based DNA Methylation (METH-S): --
 – Array-based Gene Expression (EXP-A): --
 – Sequence-based Gene Expression (EXP-S): --
 – Protein Expression (PEXP): --
 – Sequence-based miRNA Expression (miRNA): --
 – Exon junction (JCN): --
[Panel: Most Frequently Mutated Genes]

Most Frequent Mutations (LUSC-KR), showing 10 of 60,288 mutations
• MU5219: chr3:g.178936091G>A, single base substitution
 – Consequences: Missense: PIK3CA E545K; Upstream: PIK3CA
 – Donors affected across all projects: 144 / 6,590 (2.19%)
• MU24637: chr17:g.7577120C>T, single base substitution
 – Consequences: Missense: TP53 R141H, R273H; NC Exon: TP53; Upstream: TP53; Downstream: TP53; Intron: TP53
 – Donors affected across all projects: 72 / 6,590 (1.09%)
• MU5286: chr17:g.7577121G>A, single base substitution
 – Consequences: Missense: TP53 R273C, R141C; NC Exon: TP53
 – Donors affected across all projects: 65 / 6,590 (0.99%)
[Chart: donors affected per mutation, axis 0 to 200, for MU5219, MU24637, MU5286, MU55099, MU64353, MU69856, MU17943, MU67642, MU66992, MU64201. Labels shown: 144 / 111 (129.73%), 72 / 111 (64.86%), 65 / 111 (58.56%)]
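
The percentages on this slide can be checked with a few lines of arithmetic. The sketch below recomputes the across-all-projects frequencies from the listed counts; the function and variable names are made up for illustration. It also flags counts larger than the 111-donor LUSC-KR cohort: the chart label 144 / 111 (129.73%) exceeds 100%, which suggests (an inference, not something stated on the slide) that the chart paired the across-all-projects count with the project's donor total.

```python
def donor_percentage(affected, total):
    """Percentage of donors carrying a mutation, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total donor count must be positive")
    return round(100.0 * affected / total, 2)


# (mutation ID, donors affected across all projects) as listed on the slide.
across_all_projects = [("MU5219", 144), ("MU24637", 72), ("MU5286", 65)]
TOTAL_DONORS_ALL_PROJECTS = 6590
LUSC_KR_DONORS = 111

for mutation_id, affected in across_all_projects:
    pct = donor_percentage(affected, TOTAL_DONORS_ALL_PROJECTS)
    # A count larger than the cohort cannot be a within-cohort frequency.
    fits_cohort = affected <= LUSC_KR_DONORS
    print(mutation_id, pct, fits_cohort)
```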

Misc.
• Added a Debian Jessie VM image
• RDP connection (Windows Remote Desktop) is now possible
• FreeNX -> X2go
 – FreeNX is outdated
 – X2go is based on the NoMachine NX3 protocol (2 concurrent connection limit?)
 – Some connection latency and failures encountered
 – Still resolving problems

Misc. (cont’d)
• Recap of last week's paper
 – Depending on the event outcome, gene features can be more useful
• CCRT miRNA reanalysis
 – Find differentially expressed miRNAs under different conditions
 – Still discussing methods