Work log 11/23

Work Progress 11/23
Wang Liang Bo
Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine

Workflow Automation

Progress 1 (Conversion)
    input : sample00.a sample01.a sample02.a … (total 60)
    output : sample00.b sample01.b sample02.b … (total 60)
    for i in range(60): process sample_i.a
Progress 2 (Conversion)
    input : sample00.b sample01.b sample02.b … (total 60)
    output : sample00.c sample01.c sample02.c … (total 60)
    for i in range(60): process sample_i.b
Progress 3 (Merge)
    input : sample00.c sample01.c sample02.c … (total 60)
    output : merged_sample_db
Progress 4 (e.g. Mapping...)

CPU Usage: 100%
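The sequential workflow on this slide can be sketched in Python as follows. This is a minimal illustration, not the actual script: `convert`, `merge`, and `run_pipeline` are hypothetical names, and the "conversion" simply copies file content as a stand-in for the real tools, which the slides do not name.

```python
from pathlib import Path


def convert(src: Path, dst_suffix: str) -> Path:
    """Hypothetical stand-in for one conversion tool.

    It copies the input to a file with the new suffix; a real
    step would call the actual conversion program instead.
    """
    dst = src.with_suffix(dst_suffix)
    dst.write_text(src.read_text())
    return dst


def merge(paths, db_path: Path) -> Path:
    """Hypothetical merge step: concatenate all inputs into one file."""
    with open(db_path, "w") as db:
        for path in paths:
            db.write(path.read_text())
    return db_path


def run_pipeline(workdir: Path, n_samples: int = 60) -> Path:
    """Run the steps one sample at a time, so only one CPU is busy."""
    a_files = [workdir / f"sample{i:02d}.a" for i in range(n_samples)]
    b_files = [convert(a, ".b") for a in a_files]   # Progress 1
    c_files = [convert(b, ".c") for b in b_files]   # Progress 2
    return merge(c_files, workdir / "merged_sample_db")  # Progress 3
```

Calling `run_pipeline(Path("some_dir"))` expects `sample00.a` through `sample59.a` to exist in that directory.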

Workflow Automation (Multiplexing)

Progress 1 (Conversion)
    input : sample00.a sample01.a sample02.a … (total 60)
    output : sample00.b sample01.b sample02.b … (total 60)
    for i in range(60): process sample_i.a
Progress 2 (Conversion) x 6 in parallel
    Total 10 each
Progress 3 (Merge)
    input : sample00.c sample01.c sample02.c … (total 60)
    output : merged_sample_db
Progress 4 (e.g. Mapping...)

CPU Usage: 600%
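One way to run six conversions at once is sketched below, assuming the conversion tool is an external command-line program. The names `convert_b_to_c` and `run_step2` are hypothetical, and the command is a placeholder (a Python one-liner that echoes the output name) because the slides do not say which tool is used. Note that this sketch hands out samples to whichever worker is free, which differs from the fixed batches of 10 shown on the slide.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

N_WORKERS = 6  # six concurrent jobs, matching CPU usage of 600%


def convert_b_to_c(b_name: str) -> str:
    """Run one conversion as an external process; return the output name.

    The command below is a placeholder that only echoes its argument.
    A real script would invoke the actual conversion tool here.
    """
    c_name = b_name[:-2] + ".c"
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", c_name]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_step2(b_names, workers: int = N_WORKERS):
    """Convert all samples with at most `workers` jobs running at once."""
    # Threads suffice: each thread only waits for its child process,
    # and the CPU-heavy work happens inside the external tool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_b_to_c, b_names))  # keeps input order


if __name__ == "__main__":
    b_files = [f"sample{i:02d}.b" for i in range(60)]
    c_files = run_step2(b_files)
    print(len(c_files), "samples converted")
```

With 24 CPUs available, `workers` could be raised further, provided memory and disk throughput allow it.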

Reasons for Automation, Multiplexing
•  Long computing time (est. 1-3 hr/sample)
•  The tools used don’t use all the computing resources
    •  24 CPUs, only 1 or 2 of them used
    •  144 GB RAM, less than 1% used
•  I use Python for automation scripting

Reasons for Using Python
•  Higher-level programming language
    •  simple to write everything
•  Popular
    •  ranks about 5th to 10th among programming languages
    •  used by Google, YouTube, NASA
•  Galaxy uses the same programming language

Problems Encountered
•  Large disk usage
    •  size of final output is approximately 0.5 - 1 TB per dataset
    •  intermediate files are of a similar order of size
    •  too large
•  Automation
    •  in previous work, I did everything by hand
    •  fast for only 50 miRNAs
    •  exhausting and error-prone when processing so many datasets
    •  will be ported to the Galaxy framework in the future

