Slide 1

Slide 1 text

Work Log 08/16

•  NGS Cloud Platform Survey
•  Lung Cancer miRNA Dataset
•  More on CCRT Analysis
•  Tutorial Plan

2013.08 Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University
Slides by Liang Bo Wang

Slide 2

Slide 2 text

Available NGS Cloud Platforms
•  GenomeSpace, Broad Inst. http://www.genomespace.org/
•  DNAnexus, Google https://www.dnanexus.com/
•  Galaxy, UCSC http://genome.ucsc.edu/
Survey still in progress …

Slide 3

Slide 3 text

Lung Cancer Dataset

Slide 4

Slide 4 text

Collect Public sRNA-Seq Dataset
•  InSilicoDB (https://insilicodb.org/)
   –  curated datasets
   –  manage and upload one’s own samples
   –  edit samples’ clinical information
   –  share data
   –  public data is also available

Slide 5

Slide 5 text

Collected Dataset
•  InSilicoDB’s search interface is not designed for broad searches without specific keywords
•  But it is good for managing one’s own data

ID | Title | Sample Size | Platform
GSE37764 | A high dimensional deep sequencing study of non-small cell lung adenocarcinoma in never-smoker Korean females [Seq] | 24 (6x2N2T) | GAIIx

Slide 6

Slide 6 text

Collect Public sRNA-Seq Dataset
•  GEOmetadb
   –  also available for NGS data
   –  filter results by various custom fields
   –  previous results can be re-used

[Screenshot: GEOmetadb home page, "GEOmetadb: GEO Microarray Search Tool" (Meltzerlab/GB/CCR/NCI/NIH)]
•  GEOmetadb makes the metadata associated with NCBI Gene Expression Omnibus (GEO) samples, platforms, and datasets easier to access
•  Paper: Bioinformatics 2008 24(23):2798-2800; doi:10.1093/bioinformatics/btn520
•  Version 2.0 web interface: search by individual data types or across GSE-GPL-GSM, multiple field query, query within results, list creation, export or view details
•  Distributions: BioConductor package GEOmetadb (for use together with GEOquery in R) and SQLite3 database GEOmetadb.sqlite.gz (176.9 MB, August 10 2013); MATLAB GEOtools and FileMaker distributions are also listed

Slide 7

Slide 7 text

GPL Platform to Query

GEO Accession | Title | Organism | GSE Count | GSM Count
GPL9115 | Illumina Genome Analyzer II | Homo sapiens | 375 | 3680
GPL10999 | Illumina Genome Analyzer IIx | Homo sapiens | 258 | 3286
GPL16791 | Illumina HiSeq 2500 | Homo sapiens | 3 | 11
GPL11154 | Illumina HiSeq 2000 | Homo sapiens | 268 | 3390
GPL15433 | Illumina HiSeq 1000 | Homo sapiens | 6 | 10
GPL15456 | Illumina HiScanSQ | Homo sapiens | 4 | 50
GPL15520 | Illumina MiSeq | Homo sapiens | 4 | 9
GPL10329 | Illumina Genome Analyzer | Homo sapiens; Mus musculus | 1 | 2
GPL16061 | Illumina Genome Analyzer IIx | Homo sapiens; Mus musculus | 2 | 10
GPL17232 | Illumina Genome Analyzer IIx | Homo sapiens | 1 | 6
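A per-platform count like the one in this table could be produced with a single SQL query against the GEOmetadb SQLite file. The sketch below is only an illustration: the table and column names (gpl, gse_gpl, gsm) are assumptions about the GEOmetadb layout and should be checked against the real database, and the rows loaded by demo_connection are toy placeholders, not GEO records.

```python
import sqlite3

# Assumed miniature of the GEOmetadb layout. Table and column names are
# hypothetical here; verify them against the real GEOmetadb.sqlite file.
SCHEMA = """
CREATE TABLE gpl (gpl TEXT PRIMARY KEY, title TEXT, organism TEXT);
CREATE TABLE gse_gpl (gse TEXT, gpl TEXT);
CREATE TABLE gsm (gsm TEXT PRIMARY KEY, gpl TEXT);
"""

# Count series (GSE) and samples (GSM) for every platform (GPL) whose
# title and organism match the given LIKE patterns.
QUERY = """
SELECT p.gpl, p.title, p.organism,
       (SELECT COUNT(DISTINCT gse) FROM gse_gpl WHERE gse_gpl.gpl = p.gpl) AS gse_count,
       (SELECT COUNT(*) FROM gsm WHERE gsm.gpl = p.gpl) AS gsm_count
FROM gpl AS p
WHERE p.title LIKE ? AND p.organism LIKE ?
ORDER BY gsm_count DESC
"""


def platform_counts(conn, title_pattern="Illumina%", organism_pattern="%Homo sapiens%"):
    """Return (gpl, title, organism, gse_count, gsm_count) rows."""
    return conn.execute(QUERY, (title_pattern, organism_pattern)).fetchall()


def demo_connection():
    """Build an in-memory database filled with toy placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO gpl VALUES (?, ?, ?)", [
        ("GPL_A", "Illumina Example Sequencer", "Homo sapiens"),
        ("GPL_B", "Illumina Example Sequencer 2", "Homo sapiens; Mus musculus"),
        ("GPL_C", "Other Array", "Homo sapiens"),
    ])
    conn.executemany("INSERT INTO gse_gpl VALUES (?, ?)", [
        ("GSE_1", "GPL_A"), ("GSE_2", "GPL_A"), ("GSE_3", "GPL_B"),
    ])
    conn.executemany("INSERT INTO gsm VALUES (?, ?)", [
        ("GSM_1", "GPL_A"), ("GSM_2", "GPL_A"), ("GSM_3", "GPL_A"),
        ("GSM_4", "GPL_B"), ("GSM_5", "GPL_C"),
    ])
    return conn


if __name__ == "__main__":
    for row in platform_counts(demo_connection()):
        print(" | ".join(str(field) for field in row))
```

To run it against real data, replace demo_connection() with sqlite3.connect on the unpacked GEOmetadb.sqlite file, after confirming the schema.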

Slide 8

Slide 8 text

More on CCRT Analysis

Slide 9

Slide 9 text

We want
•  Detailed numbers of
   –  how many reads are mapped to miRBase 20?
   –  how many reads are not mapped to miRBase but still map to the genome reference (hg19)?
   –  how many reads are unmapped?
•  We dropped the temp files of the previous run
   –  requires a re-run of the analysis
   –  verify that the result remains the same

Slide 10

Slide 10 text


Slide 11

Slide 11 text

New ID | R1 | R2 | R3 | R4
Sample Name | 878T | Eco-Ca-246 | 838T | 818T
miRBase 20 | 1,079,369 | 177,113 | 484,026 | 868,084
Genome | 3,515,167 | 1,499,610 | 2,283,319 | 3,634,012
Unmapped | 11,497,325 | 12,756,920 | 9,553,949 | 11,720,879
Total reads | 15,012,492 | 14,256,530 | 11,837,268 | 15,354,891

New ID | N1 | N2 | N3 | N4
Sample Name | 870T | 884T | Eco-Ca-373 | 65T
miRBase 20 | 100,060 | 1,971,123 | 309,174 | 332,712
Genome | 757,564 | 5,183,532 | 1,659,204 | 2,911,938
Unmapped | 7,807,853 | 15,623,583 | 17,202,518 | 11,709,561
Total reads | 8,565,417 | 20,807,115 | 18,861,722 | 14,621,499
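The counts above are easier to compare as fractions of the total. The short sketch below copies the numbers from the table, checks whether Genome + Unmapped adds up to Total reads for each sample, and reports each category as a percentage. The identity does hold for all eight samples, which suggests the miRBase 20 reads are not a separate partition of the total; that reading is an inference from the numbers, not something the slides state. The dictionary layout and function name are illustrative choices.

```python
# Read counts copied from the table; keys are the new sample IDs.
COUNTS = {
    "R1": {"mirbase": 1_079_369, "genome": 3_515_167, "unmapped": 11_497_325, "total": 15_012_492},
    "R2": {"mirbase": 177_113, "genome": 1_499_610, "unmapped": 12_756_920, "total": 14_256_530},
    "R3": {"mirbase": 484_026, "genome": 2_283_319, "unmapped": 9_553_949, "total": 11_837_268},
    "R4": {"mirbase": 868_084, "genome": 3_634_012, "unmapped": 11_720_879, "total": 15_354_891},
    "N1": {"mirbase": 100_060, "genome": 757_564, "unmapped": 7_807_853, "total": 8_565_417},
    "N2": {"mirbase": 1_971_123, "genome": 5_183_532, "unmapped": 15_623_583, "total": 20_807_115},
    "N3": {"mirbase": 309_174, "genome": 1_659_204, "unmapped": 17_202_518, "total": 18_861_722},
    "N4": {"mirbase": 332_712, "genome": 2_911_938, "unmapped": 11_709_561, "total": 14_621_499},
}


def summarize(counts):
    """Per sample: does genome + unmapped equal total, and what share is each category?"""
    summary = {}
    for sample, c in counts.items():
        total = c["total"]
        summary[sample] = {
            "consistent": c["genome"] + c["unmapped"] == total,
            "mirbase_pct": 100 * c["mirbase"] / total,
            "genome_pct": 100 * c["genome"] / total,
            "unmapped_pct": 100 * c["unmapped"] / total,
        }
    return summary


if __name__ == "__main__":
    for sample, row in summarize(COUNTS).items():
        print(
            f"{sample}: consistent={row['consistent']} "
            f"miRBase={row['mirbase_pct']:.2f}% "
            f"genome={row['genome_pct']:.2f}% "
            f"unmapped={row['unmapped_pct']:.2f}%"
        )
```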

Slide 12

Slide 12 text

Tutorial Plan

Slide 13

Slide 13 text

Tutorial Plan
Covers
•  Python 3 syntax
•  Python Standard Library
•  Useful Python packages: IPython, pandas, NumPy
If involved in the next project
•  Markdown, RST documentation
•  Version Control – Git

Slide 14

Slide 14 text

$ Start NOW
initialize environment ... done
setup fundamental tools ... done
initialize first mission ... [y/N]?

Slide 15

Slide 15 text

Server OS Upgrade

Slide 16

Slide 16 text

It’s time for CentOS 6
•  Software dependencies are complicated
•  Seriously, the main reason to upgrade the OS is the ancient gcc and glibc versions
•  They do not provide some essential features

$ sudo upgrade OS

Slide 17

Slide 17 text

•  which means we need to use older versions of most software if we stick to CentOS 5.x
•  It is possible to install newer versions of these libraries, but the dependency tree becomes tangled and hard to maintain, and doing so is not easy.

$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)
$ ldd --version
(GNU libc) 2.5
Copyright (C) 2006 Free Software Foundation, Inc.
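To make the gap concrete, the version strings above can be parsed and compared with a target. This is a rough sketch: the comparison targets (gcc 4.4 and glibc 2.12, the series shipped with CentOS 6) are an assumption added for illustration and are not taken from the slides, and the function names are hypothetical.

```python
import re

# Version strings reported by the current machine, copied from the slide.
CURRENT = {
    "gcc": "gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)",
    "glibc": "(GNU libc) 2.5",
}

# Assumed comparison targets (not from the slides): the gcc and glibc
# series shipped with CentOS 6.
TARGET = {"gcc": (4, 4), "glibc": (2, 12)}


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def outdated(current=CURRENT, target=TARGET):
    """Map each tool name to True when its parsed version is below the target."""
    return {name: parse_version(current[name]) < target[name] for name in target}


if __name__ == "__main__":
    for name, is_old in outdated().items():
        status = "older than target" if is_old else "meets target"
        print(f"{name}: {parse_version(CURRENT[name])} {status}")
```

Comparing tuples of integers avoids the string-comparison trap where "2.5" would sort after "2.12".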

Slide 18

Slide 18 text

Upgrading to CentOS 6.4
•  Start with an old, less-used machine
   –  maybe 172.16.0.15x
•  If possible, also upgrade 171 and 173
   –  take advantage of the period when the senior master’s students are graduating