Work log 11/23

Liang Bo Wang
November 23, 2012

Transcript

  1. Workflow Automation

    •  Progress 1 (Conversion)
       •  input: sample00.a sample01.a sample02.a … (total 60)
       •  output: sample00.b sample01.b sample02.b … (total 60)
       •  for i in range(60): process sample_i.a
    •  Progress 2 (Conversion)
       •  input: sample00.b sample01.b sample02.b … (total 60)
       •  output: sample00.c sample01.c sample02.c … (total 60)
       •  for i in range(60): process sample_i.b
    •  Progress 3 (Merge)
       •  input: sample00.c sample01.c sample02.c … (total 60)
       •  output: merged_sample_db
    •  Progress 4 (e.g. Mapping...)
    •  CPU Usage: 100%

    Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine
  2. Workflow Automation (Multiplexing)

    •  Progress 1 (Conversion)
       •  input: sample00.a sample01.a sample02.a … (total 60)
       •  output: sample00.b sample01.b sample02.b … (total 60)
       •  for i in range(60): process sample_i.a
    •  Progress 2 (Conversion), split into parallel copies, total 10 each
    •  Progress 3 (Merge)
       •  input: sample00.c sample01.c sample02.c … (total 60)
       •  output: merged_sample_db
    •  Progress 4 (e.g. Mapping...)
    •  CPU Usage: 600%
  3. Reasons for Automation, Multiplexing

    •  Long computing time (est. 1-3 hr/sample)
    •  The tools used don’t use all the computing resources
       •  24 CPUs, only 1 or 2 of them used
       •  144 GB RAM, less than 1% used
    •  I use Python for automation scripting
  4. Reasons to Use Python

    •  Higher-level programming language
       •  simple to write everything
    •  popular
       •  ranks about 5th to 10th in popularity
       •  used by Google, YouTube, NASA
    •  Galaxy uses the same programming language
  5. Problems Encountered

    •  Large disk usage
       •  size of final output is approximately 0.5-1 TB per dataset
       •  intermediate files are of a similar order of size
       •  too large
    •  Automation
       •  in previous work, I did everything by hand
       •  fast for only 50 miRNAs
       •  exhausting and error-prone when processing so many datasets
       •  will be ported to the Galaxy framework in the future
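
The staged workflow on slides 1 and 2 (two conversions, then a merge, run one sample at a time or several at once) can be sketched roughly as follows. This is only an illustration, not the script used in the work log: `convert`, `merge`, `run_stage` and `run_pipeline` are hypothetical names, the conversion is a placeholder file copy standing in for the real tools, and the worker count of 6 is an assumption suggested by the "CPU Usage: 600%" figure. A thread pool is used here on the assumption that each worker would mostly wait on an external program; for brevity the sketch applies the same worker count to both conversion stages.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile

N_SAMPLES = 60  # from the slides: 60 samples in total
N_WORKERS = 6   # assumption, suggested by "CPU Usage: 600%"


def convert(src: Path, new_suffix: str) -> Path:
    """Hypothetical stand-in for one conversion tool (e.g. .a -> .b)."""
    dst = src.with_suffix(new_suffix)
    # Placeholder: a real step would launch the external tool here.
    dst.write_text(src.read_text())
    return dst


def run_stage(inputs, new_suffix, workers=1):
    """Convert every input; workers > 1 runs several conversions at once."""
    if workers == 1:
        # Sequential version: the plain "for i in range(60)" loop.
        return [convert(p, new_suffix) for p in inputs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(lambda p: convert(p, new_suffix), inputs))


def merge(inputs, db_path: Path) -> Path:
    """Hypothetical merge step: concatenate all inputs into one file."""
    with db_path.open("w") as out:
        for p in inputs:
            out.write(p.read_text())
    return db_path


def run_pipeline(workdir: Path, workers: int = N_WORKERS) -> Path:
    a_files = sorted(workdir.glob("sample*.a"))
    b_files = run_stage(a_files, ".b", workers)  # Progress 1 (Conversion)
    c_files = run_stage(b_files, ".c", workers)  # Progress 2 (Conversion)
    return merge(c_files, workdir / "merged_sample_db")  # Progress 3 (Merge)


if __name__ == "__main__":
    # Demo on dummy files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        for i in range(N_SAMPLES):
            (work / f"sample{i:02d}.a").write_text(f"sample {i}\n")
        db = run_pipeline(work)
        print(db.name, len(db.read_text().splitlines()), "records")
```

Passing `workers=1` gives the sequential behaviour of slide 1, while a larger value corresponds to the multiplexed run of slide 2. If the conversion were pure-Python, CPU-bound code rather than an external tool, a process pool would be the more suitable choice.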