Work Log 12/21

Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine
Work Progress 12/21: OOP + Parallel Computing, Clipping Tool, Dataset Statistics

OOP + Parallel Computing

Topics – Parallel Tool
•  The parallel tool can now:
    •  automatically distribute new tasks when some CPUs are idle
    •  report it when a task outputs an error
    •  automatically dump output to both CSV and HTML files
        •  no more “$ xxxxx > yyyy.log”
        •  represent all tasks as a table
•  OOP hierarchy
    •  override and customize the output format if one wants
•  Support multiple types of tasks, run in their initial order
    •  one can “extract A, run something on A, compress output of A, then delete A”, then run B in the same way …
•  Combine many parallel tasks easily
    •  format-conversion -> clip -> blast …
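The features listed above (hand the next task to whichever worker is free, report failures, dump everything as a table) can be sketched roughly as follows. This is only an illustration, not the actual tool: the function names `run_command`, `run_all`, `failed_tasks` and `dump_to_csv` are hypothetical, and a thread pool that launches external commands is assumed as the scheduling mechanism.

```python
import csv
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

FIELDS = ["command", "returncode", "stdout", "stderr"]


def run_command(cmd):
    """Run one external command and capture its output (no '> yyyy.log' needed)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": " ".join(cmd),
        "returncode": proc.returncode,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }


def run_all(commands, max_process=4):
    """Run commands on a pool; a worker that finishes picks up the next pending task."""
    with ThreadPoolExecutor(max_workers=max_process) as pool:
        # map() returns the results in the initial task order.
        return list(pool.map(run_command, commands))


def failed_tasks(rows):
    """Select the tasks that should be reported (non-zero exit status)."""
    return [row for row in rows if row["returncode"] != 0]


def dump_to_csv(rows, path):
    """Write one row per task so the whole run reads as a table."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Threads are enough here because the heavy work happens in the child processes; each worker only waits for its command to finish.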

OOP = Object-Oriented Programming

OOP Hierarchy

myparallel (base class)
•  attributes: max_process = 1, name, output_filename
•  run: handles all parallel stuff
•  dump2csv: saves results directly
•  setupTaskPool, write2html, parse2csv: functions that the user can override

FileConversion
•  attributes: max_process = 8, name = 'SRA to FASTA', output_filename
•  run, dump2csv
•  setupTaskPool: add *.sra into the list, call fastq-dump
•  write2html
•  parse2csv: the output here contains a lot of info, like total seq. ...

BLAST
•  attributes: max_process = 4, name = 'blast all candidates', output_filename
•  run, dump2csv
•  setupTaskPool: add all candidate sequences, call blastn (with many parameters)
•  write2html
•  parse2csv: save counts grouped by different tissue types
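A minimal sketch of how such a hierarchy might look in Python. The names `max_process`, `name`, `run`, `setupTaskPool` and `parse2csv` come from the slide, but their signatures and bodies here are assumptions, and `EchoWords` is a hypothetical toy subclass standing in for FileConversion or BLAST.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


class MyParallel:
    """Base class: owns the parallel plumbing; subclasses describe the tasks."""

    max_process = 1
    name = "unnamed"

    def setupTaskPool(self):
        """Return a list of commands (argument lists). Meant to be overridden."""
        raise NotImplementedError

    def parse2csv(self, cmd, stdout):
        """Turn one task's raw output into a table row. May be overridden."""
        return [" ".join(cmd), stdout.strip()]

    def _run_one(self, cmd):
        # Assumed helper: run one command and parse its captured output.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        return self.parse2csv(cmd, proc.stdout)

    def run(self):
        """Run every task on at most max_process workers, keeping task order."""
        with ThreadPoolExecutor(max_workers=self.max_process) as pool:
            return list(pool.map(self._run_one, self.setupTaskPool()))


class EchoWords(MyParallel):
    """Toy subclass (hypothetical): each task prints one word."""

    max_process = 2
    name = "echo a word"

    def __init__(self, words):
        self.words = words

    def setupTaskPool(self):
        return [[sys.executable, "-c", f"print({w!r})"] for w in self.words]

    def parse2csv(self, cmd, stdout):
        # Custom output format: the word and its length.
        word = stdout.strip()
        return [word, len(word)]
```

Calling `EchoWords(["alpha", "be"]).run()` returns one parsed row per task; the subclass never touches the pool itself, which is the point of putting `run` in the base class.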

•  1st type of task: echo a word
•  2nd type of task: open the result

Result (html)
•  1st task
•  2nd task
•  HTML handles multi-line output better

Share Experience
•  Python
    •  language used in this script
•  Git – Version control
•  After final exam …
    •  More on Git

Clipping Tool – Problem Encountered Recently

[Figure: two bar charts of read counts per run, one covering SRR372611 to SRR372672 and one covering SRR372611 to SRR372616 only. Y-axis: read count (unknown unit). Series: N reads, non-clipped, adapter-only, too-short, output.]
•  The sizes of the datasets differ by two orders of magnitude
•  Different sequencing machines were used:
    •  SRRXXXX [11-16]: HWI-EAS438_42AHVAAXX
    •  SRRXXXX [17-72]: ILLUMINA-053F9F

Further into these statistics
•  At first I thought that the tool logged the reads in terms of their length.
•  I then computed the ratio of the input (and output) count to the number of sequences:
    •  all ratios are exactly 53.
•  I still don’t know why, since all runs used the same arguments.
•  Probably due to the different platforms used.
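The ratio check described above can be sketched as follows. The function names and the numbers in the usage note are hypothetical placeholders, not values from the actual datasets.

```python
def count_ratios(reported, n_sequences):
    """Ratio of the tool-reported count to the number of sequences, per run."""
    return {run: reported[run] / n_sequences[run] for run in reported}


def common_ratio(ratios, tol=1e-9):
    """Return the shared ratio if every run has the same one, otherwise None."""
    values = list(ratios.values())
    first = values[0]
    if all(abs(value - first) <= tol for value in values):
        return first
    return None
```

For example, with made-up counts `{"runA": 5300, "runB": 106}` and sequence numbers `{"runA": 100, "runB": 2}`, `common_ratio(count_ratios(...))` gives 53.0, which is the kind of constant factor observed here.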

Read count mismatch among different tools
•  It has been reported that ‘FASTX-Toolkit’ may output wrong read counts
    •  This is fine since it still produces correctly clipped sequences.
    •  However, some have even questioned the accuracy of its adapter clipping
•  Test not finished this week (most of the time went to the previous work)
    •  will be verified with other tools
•  Alternatives
    •  CASAVA – Flickers
    •  FastqMcf
    •  cutadapt
    •  NGS QC Toolkit
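One tool-independent way to cross-check a reported read count is to count the records in the FASTQ file directly. The sketch below assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines, no trailing blank lines); `count_fastq_records` is a hypothetical helper, not part of any of the tools listed.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly four lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Tiny in-memory example with two reads.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n")
n_reads = count_fastq_records(example)
```

With a real file, the same function can be called on `open(path)`, and the result compared with the input and output counts logged by the clipping tool.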

Watch the sky!


Liang Bo Wang
