
Work Log - 11/30

Liang Bo Wang
November 30, 2012


Transcript

  1. Work Progress – 11/30

    Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine
    Verification workflow, Parallel computing, Verification result, Automation
  2. Verification workflow

    •  No collapsing, for disk usage and BLAST function
    •  Results are summarized by my own script
    •  Quality control will be added in the near future
    •  NGS QC Toolkit (from 建樂, a senior labmate) may be used
    •  Automation & Parallelization

    Workflow (tool: action, then resulting files):
    •  Input: original datasets on GEO (SRRxxx.sra, SRRyyy.sra, ...)
    •  NCBI SRA-Toolkit: dump to fasta/q format. Fastq format with QC and sequencing details (SRRxxx.fastq, SRRyyy.fastq, ...)
    •  FastX Toolkit: clip off 3' adapter. Only clipped sequences left (SRRxxx_clipped.fastq, SRRyyy_clipped.fastq, ...)
    •  Quality Control: discard low score reads
    •  Fasta Converter: file format conversion. Simpler file format: fasta (SRRxxx_clipped.fasta, SRRyyy_clipped.fasta, ...)
    •  BLAST+: make blast database. Original datasets on GEO (SRRxxx.sra, SRRyyy.sra, ...)
    •  BLAST+ blastn: query for every candidate on every dataset. Novel miRNA candidates (candidate01.fa, candidate02.fa, ...)
    •  BLAST detail results for every query (candidate01_xxx.csv, candidate02_xxx.csv, …; candidate01_yyy.csv, candidate02_yyy.csv, …; …)
    •  Handmade Script: summarize all queries. Summarized read count for all candidates (candidate01-10_xxx-zzz.csv), Excel table output
    •  Script Automation
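The workflow above can be sketched as a list of command lines for one dataset and one candidate. This is only an illustration, not the script used for the slides: the adapter sequence is a hypothetical placeholder, the function name `pipeline_commands` is made up, and building the BLAST database from the clipped fasta file is an assumption. The quality-control step is left out because the slides say it has not been added yet. The commands are built but not executed.

```python
from pathlib import Path

# Hypothetical placeholder: the slides do not give the real 3' adapter sequence.
ADAPTER = "NNNNNNNN"


def pipeline_commands(sra_file, candidate_fasta, adapter=ADAPTER, evalue=0.001):
    """Return the command lines for one dataset and one candidate (not executed)."""
    stem = Path(sra_file).stem                      # e.g. "SRRxxx"
    fastq = f"{stem}.fastq"
    clipped_fastq = f"{stem}_clipped.fastq"
    clipped_fasta = f"{stem}_clipped.fasta"
    result_csv = f"{Path(candidate_fasta).stem}_{stem}.csv"
    return [
        # SRA-Toolkit: dump the .sra archive to fastq
        ["fastq-dump", sra_file],
        # FastX Toolkit: clip the 3' adapter; -c keeps only clipped reads
        ["fastx_clipper", "-a", adapter, "-c", "-i", fastq, "-o", clipped_fastq],
        # FastX Toolkit: convert fastq to the simpler fasta format
        ["fastq_to_fasta", "-i", clipped_fastq, "-o", clipped_fasta],
        # BLAST+: build a nucleotide database (assumed: from the clipped reads)
        ["makeblastdb", "-in", clipped_fasta, "-dbtype", "nucl", "-out", stem],
        # BLAST+: query the candidate against the dataset, CSV output (outfmt 10)
        ["blastn", "-query", candidate_fasta, "-db", stem,
         "-evalue", str(evalue), "-outfmt", "10", "-out", result_csv],
    ]


if __name__ == "__main__":
    for cmd in pipeline_commands("SRRxxx.sra", "candidate01.fa"):
        print(" ".join(cmd))
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping the commands as data makes it easy to swap a step in or out.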
  3. Parallel Computing

    [Diagram: in one panel, CPU 1, CPU 2 and CPU 3 each run a different task (Task A, Task B, Task C); in the other panel, each CPU runs Task A on a different file (01.file, 02.file, 03.file).]
  4. Tasks with heavy I/O may not benefit from parallel computing

    [Diagram: timelines for CPU 2 (Task A, 02.file) and CPU 3 (Task A, 03.file), contrasting a CPU-bound task, which is almost all "run" time, with an I/O-bound task, which alternates between "run" and I/O system calls.]
  5. Parallel Computing + Automation

    •  NGS computing takes an extra long time
    •  Use scripts to perform tasks
    •  A Python script is used here
    •  Ex. 60 (total 1800) tasks are pooled for blasting
    •  General input
    •  The task performed can be changed easily
    •  Code usability
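A minimal sketch of pooling tasks in Python, assuming each task is one (candidate, dataset) pair and that the task itself is passed in as a function so it can be changed easily. The names `run_pooled` and `describe` are hypothetical, and the worker count is a free parameter; the slides do not say how the 60 pooled tasks map to workers. Threads are used here on the assumption that each task mostly waits on an external program such as blastn.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_pooled(task, candidates, datasets, max_workers=4):
    """Run task(candidate, dataset) for every pair, a few at a time.

    Returns a dict mapping (candidate, dataset) to the task's result.
    """
    pairs = list(product(candidates, datasets))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as the input pairs
        results = list(pool.map(lambda pair: task(*pair), pairs))
    return dict(zip(pairs, results))


def describe(candidate, dataset):
    """Stand-in task: only names the result file a real BLAST run would write."""
    return f"{candidate}_{dataset}.csv"


if __name__ == "__main__":
    out = run_pooled(describe, ["candidate01", "candidate02"], ["xxx", "yyy"])
    for pair, filename in out.items():
        print(pair, filename)
```

Replacing `describe` with a function that calls `subprocess.run` on a blastn command would turn this into a real batch runner without changing `run_pooled`.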
  6. in silico verification

    •  GSE33858 lung cancer small RNA-seq
    •  platform: Illumina Genome Analyzer IIx
    •  tissue type: NT + T from the same patient
    •  62 samples in total
    •  about 1,500,000,000 reads (after clipping) per sample
    •  originally 2,000,000,000 reads per sample
    •  With the new code, the whole workflow completes within 8 hours
    •  about 4x faster!
    •  auto
  7. Verification result

    ID           score  normal  tumor  total  total in previous verification  original
    chr7_8791    32.8   104     49     153    5                               69
    chr11_13342  30.6   76      164    240    9                               49
    chr22_20736  17.5   240     152    392    4                               32
    chr13_14817  15.7   4       4      8      9                               18
    chr17_17828  14     2       0      2      10                              26
    chr20_19494  13.9   32      29     61     2                               22
    chr20_19450  12.6   40      17     57     1                               20
    chr18_18769  12.5   0       0      0      9                               21
    chr11_13709  12     1       0      1      0                               15
    chr2_2356    9.8    412     266    678    4                               6
    chr4_5692    9      95      75     170    2                               11
    chr11_12760  8.6    6       31     37     0                               14
    chr3_3910    7.5    0       0      0      131                             19
    chr1_692     6.8    0       4      4      1                               9
    chr10_12452  6.4    10      17     27     1                               6
    chr6_7548    6      58      206    264    2                               24
    chr1_944     6      1       2      3      1                               3
    chr6_8151    6      58      206    264    2                               18
    chr11_13239  5.8    150     152    302    0                               1
    chr7_8991    5.6    39      37     76     1                               14
    chr7_8849    5.4    17      25     42     4                               26
    chr22_20809  5.4    2115    1846   3961   2392                            13
    chr2_3487    5.4    1       1      2      170                             19
    chr7_8673    5.3    0       3      3      0                               15
    chr14_15459  5.3    180     206    386    16                              11
    chr20_19463  5.2    10      21     31     0                               11
    chr6_8250    5.2    0       5      5      211                             199
    chr19_19992  5.1    7       3519   3526   4531                            15
    chr1_689     5.1    1       0      1      1                               17
    chr17_17785  5      0       22     22     2326                            483

    Tissue labels accompanying the table: lung, breast, breast

    •  expression should be normalized in RPKM
    •  low expression
    •  threshold for BLAST was set too high
    •  E Value < 0.001
    •  previous verification will be performed again to follow the same criteria
    •  find novel candidates from other datasets?
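The two points about the E-value threshold and RPKM normalization can be illustrated with a short sketch. It assumes the BLAST results are in the default 12-column CSV layout of `-outfmt 10` (E-value in the eleventh column) and that each hit passing the threshold counts as one read, which the slides do not confirm. The function names `count_hits` and `rpkm` and the sample rows are made up for illustration.

```python
import csv
import io


def count_hits(blast_csv_text, max_evalue=0.001):
    """Count BLAST hits whose E-value is below max_evalue.

    Expects the default 12 columns of outfmt 10:
    qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    count = 0
    for row in csv.reader(io.StringIO(blast_csv_text)):
        if len(row) >= 12 and float(row[10]) < max_evalue:
            count += 1
    return count


def rpkm(read_count, length_bp, total_reads):
    """Reads per kilobase of sequence per million reads in the sample."""
    return read_count * 1e9 / (length_bp * total_reads)


if __name__ == "__main__":
    # Made-up rows: the first passes the E-value threshold, the second does not.
    sample = (
        "candidate01,read1,100.000,22,0,0,1,22,1,22,1e-05,44.1\n"
        "candidate01,read2,90.000,20,2,0,1,20,1,20,0.5,20.3\n"
    )
    hits = count_hits(sample)
    print(hits, rpkm(hits, length_bp=22, total_reads=1_000_000))
```

Normalizing by the total read count of each sample is what would make counts comparable across datasets of different sequencing depth.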
  8. To do

    •  Automation for miRDeep
    •  Continue verification
    •  Study Galaxy extension
    •  Version control – Git