
Work Log 12/21

Liang Bo Wang
December 20, 2012

Transcript

  1. OOP + Parallel Computing, Clipping Tool, Dataset Statistics

    Work Progress 12/21. Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine
  2. Topics – Parallel Tool

    The parallel tool can now:
    •  automatically distribute new tasks when some CPUs are idle
    •  report errors found in task output
    •  automatically dump output to both CSV and HTML files, so there is no more “$ xxxxx > yyyy.log”
    •  represent all tasks as a table
    •  use an OOP hierarchy: one can override methods and customize the output format
    •  support multiple task types, run in their initial order: one can “extract A, run something on A, compress the output of A, then delete A”, then process B in the same way …
    •  combine many parallel tasks easily: format conversion -> clip -> blast …
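
The first few features above (handing new work to idle workers, reporting errors, and writing both CSV and HTML instead of redirecting to a log file) can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names `run_command`, `run_all`, `dump_csv`, and `dump_html` are hypothetical, a non-zero exit code is assumed to be what counts as an error, and a thread pool that launches external processes stands in for whatever mechanism the tool really uses.

```python
import csv
import html
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

FIELDS = ["command", "status", "stdout", "stderr"]


def run_command(cmd):
    """Run one command and capture its output instead of redirecting it to a log."""
    row = {"command": " ".join(cmd), "status": "error", "stdout": "", "stderr": ""}
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:  # e.g. the executable does not exist
        row["stderr"] = str(exc)
        return row
    # Assumption: a non-zero exit code is what should be reported as an error.
    row["status"] = "ok" if proc.returncode == 0 else "error"
    row["stdout"] = proc.stdout.strip()
    row["stderr"] = proc.stderr.strip()
    return row


def run_all(commands, max_process=4):
    """Run the commands with at most max_process of them active at a time.

    The executor hands the next pending command to whichever worker becomes
    free, so no worker stays idle while tasks remain. Results keep input order.
    """
    with ThreadPoolExecutor(max_workers=max_process) as pool:
        return list(pool.map(run_command, commands))


def dump_csv(rows, path):
    """Write all task results as one table."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def dump_html(rows, path):
    """Write the same table as HTML; <pre> keeps multi-line output readable."""
    parts = ["<table>",
             "<tr>" + "".join("<th>%s</th>" % f for f in FIELDS) + "</tr>"]
    for row in rows:
        cells = "".join("<td><pre>%s</pre></td>" % html.escape(row[f])
                        for f in FIELDS)
        parts.append("<tr>" + cells + "</tr>")
    parts.append("</table>")
    with open(path, "w") as fh:
        fh.write("\n".join(parts))


if __name__ == "__main__":
    demo = [[sys.executable, "-c", "print('hello')"],
            [sys.executable, "-c", "raise SystemExit(1)"]]
    for result in run_all(demo, max_process=2):
        print(result["status"], result["command"])
```

Threads are enough here because each worker only waits on an external process; the CPU-heavy work happens in the child processes.
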
  3. OOP hierarchy

    •  myparallel
        •  attributes: max_process = 1, name, output_filename
        •  run: handles all the parallel work
        •  dump2csv: saves results directly
        •  setupTaskPool, write2html, parse2csv: functions that the user can override
    •  FileConversion
        •  attributes: max_process = 8, name = 'SRA to FASTA', output_filename
        •  run, dump2csv
        •  setupTaskPool: adds *.sra files to the list and calls fastq-dump
        •  write2html
        •  parse2csv: may contain info such as the total seq. ...
    •  BLAST
        •  attributes: max_process = 4, name = 'blast all candidates', output_filename
        •  run, dump2csv
        •  setupTaskPool: adds all candidate sequences and calls blastn (with many parameters)
        •  write2html
        •  parse2csv: saves counts grouped by tissue type
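
A base class with overridable hooks, as outlined in this hierarchy, might look roughly like the sketch below. The attribute and method names `max_process`, `name`, `output_filename`, `run`, `setupTaskPool`, `parse2csv`, `dump2csv`, and `write2html` come from the slide; everything else is an assumption made for illustration, including the method bodies, the `runTask` hook, the class name spelling `MyParallel`, and the toy subclass `SequenceLength`, which stands in for real subclasses such as FileConversion or BLAST.

```python
import csv
import html
from concurrent.futures import ThreadPoolExecutor


class MyParallel:
    """Base class: owns the parallel machinery; subclasses describe the work."""

    max_process = 1
    name = "unnamed task"
    output_filename = "result"

    def setupTaskPool(self):
        """Return the list of tasks to run. Subclasses override this."""
        return []

    def runTask(self, task):
        """Hypothetical hook (not on the slide): do the work for one task."""
        raise NotImplementedError

    def parse2csv(self, task, raw):
        """Turn one raw result into a table row. Subclasses may override."""
        return [str(task), str(raw)]

    def dump2csv(self, rows):
        """Save all rows to <output_filename>.csv."""
        with open(self.output_filename + ".csv", "w", newline="") as fh:
            csv.writer(fh).writerows(rows)

    def write2html(self, rows):
        """Save all rows to <output_filename>.html. Subclasses may override."""
        body = "".join(
            "<tr>" + "".join("<td>%s</td>" % html.escape(c) for c in row) + "</tr>"
            for row in rows)
        with open(self.output_filename + ".html", "w") as fh:
            fh.write("<h1>%s</h1><table>%s</table>"
                     % (html.escape(self.name), body))

    def run(self):
        """Handle all the parallel work, then write both output files."""
        tasks = self.setupTaskPool()
        with ThreadPoolExecutor(max_workers=self.max_process) as pool:
            raw = list(pool.map(self.runTask, tasks))
        rows = [self.parse2csv(t, r) for t, r in zip(tasks, raw)]
        self.dump2csv(rows)
        self.write2html(rows)
        return rows


class SequenceLength(MyParallel):
    """Toy subclass: one task per sequence, the result is its length."""

    max_process = 4
    name = "sequence lengths"

    def __init__(self, sequences):
        self.sequences = list(sequences)

    def setupTaskPool(self):
        return self.sequences

    def runTask(self, task):
        return len(task)
```

With this layout a new kind of job only needs to say what its tasks are and how to format a result; the pooling and file output stay in one place.
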
  4. Example with two types of task

    •  1st type of task: echo a word
    •  2nd type of task: open the result
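
Running several types of task in their initial order for each item, while different items proceed in parallel, could be sketched as below. The two toy steps mirror the example on this slide (write a word, then open the result); the names `run_chain`, `run_chains`, `echo_word`, and `open_result` are hypothetical, and stopping a chain at the first failing step is an assumption rather than documented behaviour of the tool.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_chain(item, steps):
    """Apply the steps to one item in order; stop at the first failure."""
    log = []
    value = item
    for step in steps:
        try:
            value = step(value)
            log.append((step.__name__, "ok"))
        except Exception as exc:
            log.append((step.__name__, "error: %s" % exc))
            break
    return item, value, log


def run_chains(items, steps, max_process=4):
    """Run the whole chain for each item; different items run in parallel."""
    with ThreadPoolExecutor(max_workers=max_process) as pool:
        return list(pool.map(lambda item: run_chain(item, steps), items))


def echo_word(word):
    """1st type of task: write a word to a temporary file, return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as fh:
        fh.write(word)
    return path


def open_result(path):
    """2nd type of task: read the file back, then delete it."""
    with open(path) as fh:
        text = fh.read()
    os.remove(path)
    return text
```

Calling `run_chains(["A", "B"], [echo_word, open_result])` processes A and B independently, which is the same shape as "extract A, run something on A, compress the output of A, then delete A" with B handled the same way.
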
  5. Result (html)

    [Screenshot: HTML result table showing the 1st task and the 2nd task.] HTML supports multi-line output better.
  6. Share Experience

    •  Python: the language used in this script
    •  Git: version control
    •  After the final exam …
    •  More on Git
  7. [Figure: two charts of read count (unknown unit) per run, with series N reads, non-clipped, adapter-only, too-short, and output. First chart: SRR372611 to SRR372672, axis from 0.E+00 to 3.E+09. Second chart: SRR372611 to SRR372616, axis from 0.E+00 to 4.E+07.]

    •  Dataset sizes vary by two orders of magnitude
    •  Different sequencing machines were used:
        •  SRR372611 to SRR372616: HWI-EAS438_42AHVAAXX
        •  SRR372617 to SRR372672: ILLUMINA-053F9F
  8. Further into these statistics

    •  At first I thought that the tool logged the reads in terms of their length.
    •  I then computed the ratio of the logged input (and output) counts to the number of sequences.
    •  All ratios are exactly 53.
    •  I still don’t know why, since all runs used the same arguments.
    •  It is probably due to the different platforms used.
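
The ratio check described on this slide amounts to dividing each logged count by the actual number of sequences and testing whether the quotient is the same for every run. A minimal sketch, with hypothetical function names and made-up numbers (none of these values come from the slides):

```python
from fractions import Fraction


def count_ratios(logged, actual):
    """Map each run ID to logged_count / actual_sequence_count, exactly."""
    return {run: Fraction(logged[run], actual[run]) for run in logged}


def constant_ratio(ratios):
    """Return the shared ratio if every run has the same one, else None."""
    values = set(ratios.values())
    return values.pop() if len(values) == 1 else None


# Made-up numbers for illustration only.
logged_counts = {"runA": 5300, "runB": 106}
sequence_counts = {"runA": 100, "runB": 2}
shared = constant_ratio(count_ratios(logged_counts, sequence_counts))
```

Using `Fraction` instead of floating-point division keeps the comparison exact, which matters when the claim is that the ratios are exactly equal.
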
  9. Read count mismatch among different tools

    •  It has been reported that ‘FASTX-Toolkit’ may output wrong read counts.
    •  This is fine, since it still produces correctly clipped sequences.
    •  However, some have even questioned the accuracy of its adapter clipping.
    •  The test was not finished this week (most of the time went to the previous work); the results will be verified with other tools.
    •  Alternatives:
        •  CASAVA – Flickers
        •  FastqMcf
        •  cutadapt
        •  NGS QC Toolkit
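
One tool-independent way to verify a reported read count is to count the records in the FASTQ file directly. The sketch below assumes the common four-line FASTQ layout (header starting with `@`, sequence, `+` line, quality) with no wrapped sequences and no zero-length reads; the function names are hypothetical and this is not part of any of the tools listed above.

```python
def count_fastq_reads(lines):
    """Count records in four-line FASTQ input given as an iterable of lines."""
    n_lines = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        if n_lines % 4 == 0 and not line.startswith("@"):
            raise ValueError("record %d does not start with '@'"
                             % (n_lines // 4 + 1))
        n_lines += 1
    if n_lines % 4:
        raise ValueError("truncated record at end of input")
    return n_lines // 4


def check_reported_count(lines, reported):
    """Return (actual_count, matches) for a count reported by another tool."""
    actual = count_fastq_reads(lines)
    return actual, actual == reported
```

Only every fourth line is checked for `@`, because a quality string may itself begin with that character. For a real file, the open file object can be passed in directly, for example `count_fastq_reads(open(path))`, where `path` is a placeholder for the file being checked.
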