
Pwrake: Distributed Workflow Engine based on Rake

Presentation for RubyKaigi2016@Kyoto
http://rubykaigi.org/2016/presentations/masa16tanaka.html

Masahiro Tanaka 田中昌宏

September 09, 2016

Transcript

  1. Pwrake: Distributed Workflow Engine based on Rake Masahiro TANAKA -

    田中昌宏 Center for Computational Sciences, University of Tsukuba 筑波大学 計算科学研究センター Sep.9, 2016 RubyKaigi2016@Kyoto 1 Japan Science and Technology Agency
  2. Masahiro Tanaka ▶ Research Fellow at ◦ Center for Computational

    Sciences, University of Tsukuba ▶ Majored in Astronomy ▶ The author of Ruby/NArray since 1999 ◦ Equivalent of Numpy ◦ Presentation at RubyKaigi 2010 at Tsukuba • http://rubykaigi.org/2010/ja/events/83 Sep.9, 2016 RubyKaigi2016@Kyoto 2
  3. NArray Progress ▶ The name of new version: Numo::NArray. ◦

    https://github.com/ruby-numo/narray ▶ Basic functionality work is almost complete. ▶ Future work: ◦ binding to numerical libraries. ◦ binding to plotting libraries. ◦ interface to I/O. ◦ speedup with SIMD etc. ◦ binding to GPU APIs. ◦ use cases such as machine learning ◦ … ▶ Contributions are welcome. Sep.9, 2016 RubyKaigi2016@Kyoto 3
  4. Today’s Topic: Pwrake ▶ Parallel Workflow extension for Rake ▶

    Purpose: ◦ Scientific data processing on a computer cluster. ▶ Run Rake tasks concurrently. ▶ Execute shell command lines on remote compute nodes. Sep.9, 2016 RubyKaigi2016@Kyoto 4 (Diagram: from the Rakefile below, Pwrake dispatches cc -o a.o -c a.c, cc -o b.o -c b.c, cc -o c.o -c c.c to worker nodes.) Rakefile (https://github.com/masa16/pwrake):
    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end
  5. Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶

    Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 5
  6. Contents ▶Background: Scientific Workflow ▶ Workflow Definition Language ▶ Pwrake

    Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 6
  7. Workflow Example: Montage. Combine multiple shots of astronomical images and produce a custom

    mosaic image. http://montage.ipac.caltech.edu/ Sep.9, 2016 RubyKaigi2016@Kyoto 7
  9. Example of scientific workflow: Montage (astronomy image processing) Sep.9, 2016 RubyKaigi2016@Kyoto 9

    A workflow is expressed as a DAG (Directed Acyclic Graph) in which each task (process) reads input files and writes output files. Montage stages: input images … → mProjectPP → mDiff+mFitplane → mBGModel → mBackground → mShrink → mAdd → mAdd → mJPEG → output image.
  10. Contents ▶ Background: Scientific Workflow ▶Workflow Definition Language ▶ Pwrake

    Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 10
  11. ▶ Use Markup Language (e.g. XML) ◦ e.g. DAX (for

    Pegasus workflow system) ◦ Necessary to write a script to define many tasks. ▶ Design a new language ◦ e.g. Swift (Wilde et al. 2011) (Different language from Apple’s) ◦ Learning cost, small user community. ▶ Use an existing language ◦ e.g. GXP Make (Taura et al. 2013) Workflow Definition Language Sep.9, 2016 RubyKaigi2016@Kyoto 11
  12. Workflow to Build Program Sep.9, 2016 13 RubyKaigi2016@Kyoto

    DAG: a.c → a.o, b.c → b.o, c.c → c.o; a.o, b.o, c.o → foo. Makefile (GNU make):
    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))
    all: foo
    %.o : %.c
        cc -o $@ -c $<
    foo: $(OBJS)
        cc -o $@ $^
  13. Workflow to Build Program Sep.9, 2016 14 RubyKaigi2016@Kyoto

    Makefile (GNU make):
    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))
    all: foo
    %.o : %.c
        cc -o $@ -c $<
    foo: $(OBJS)
        cc -o $@ $^
    Rakefile (a Rakefile is a Ruby script):
    SRCS = FileList["*.c"]
    OBJS = SRCS.ext("o")
    task :default => "foo"
    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end
    file "foo" => OBJS do |x|
      sh "cc -o #{x} #{OBJS}"
    end
  14. ▶ For-Loop: Ruby Scripting enabled by Internal DSL Sep.9, 2016

    RubyKaigi2016@Kyoto 16
    INPUT = FileList["r/*.fits"]
    OUTPUT = []
    for src in INPUT
      OUTPUT << dst = "p/"+File.basename(src)
      file dst => src do |t|
        sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
      end
    end
    task :default => OUTPUT
  15. Pathmap ▶ Replaces a %-format with the specified part of the path

    name. ▶ Applicable to FileList, String, and Prerequisites. Sep.9, 2016 RubyKaigi2016@Kyoto 17
    INPUT = FileList["r/*.fits"]
    OUTPUT = INPUT.pathmap("p/%f")
    rule /^p\/.*\.fits$/ => "r/%n.fits" do |t|
      sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
    end
    task :default => OUTPUT
  16. Pathmap Examples ▶ See the Rake manual for details. Sep.9, 2016 RubyKaigi2016@Kyoto 18

    p 'a/b/c/file.txt'.pathmap("%p")    #=> "a/b/c/file.txt"
    p 'a/b/c/file.txt'.pathmap("%f")    #=> "file.txt"
    p 'a/b/c/file.txt'.pathmap("%n")    #=> "file"
    p 'a/b/c/file.txt'.pathmap("%x")    #=> ".txt"
    p 'a/b/c/file.txt'.pathmap("%X")    #=> "a/b/c/file"
    p 'a/b/c/file.txt'.pathmap("%d")    #=> "a/b/c"
    p 'a/b/c/file.txt'.pathmap("%2d")   #=> "a/b"
    p 'a/b/c/file.txt'.pathmap("%-2d")  #=> "b/c"
    p 'a/b/c/file.txt'.pathmap("%d%s%{file,out}f")             #=> "a/b/c/out.txt"
    p 'a/b/c/file.txt'.pathmap("%X%{.*,*}x"){|ext| ext.upcase} #=> "a/b/c/file.TXT"
  17. Prerequisite Map by Block ▶ Requires two files as prerequisites. ▶ Useful to define

    complex workflows. Sep.9, 2016 RubyKaigi2016@Kyoto 19
    FILEMAP = {"d/d00.fits" => ["p/p00.fits", "p/p01.fits"], ...}
    rule /^d\/.*\.fits$/ => proc{|x| FILEMAP[x]} do |t|
      p1, p2 = t.prerequisites
      sh "mDiff #{p1} #{p2} #{t.name} region.hdr"
    end
  18. Rake as a WfDL ▶ Rake is a powerful WfDL

    to define complex, many-task scientific workflows. ◦ Rule ◦ Pathmap ◦ Internal DSL • For-Loop • Prerequisite map by block ▶ We use Rake as the WfDL for the Pwrake system. Sep.9, 2016 RubyKaigi2016@Kyoto 20
  19. Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶Pwrake

    Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 21
  20. Pwrake Structure Sep.9, 2016 22 RubyKaigi2016@Kyoto

    (Diagram.) Master node: the Pwrake master holds the Task Graph and the Task Queue (enq/deq), performs Scheduling, and runs a fiber pool; each fiber drives one sh process on a worker node through a communicator over SSH. Worker nodes: each pwrake worker's communicator spawns the sh processes that execute the tasks; all files are shared through the Gfarm file system.
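    As a rough illustration of such a master-to-worker channel, here is a minimal sketch (Pwrake's real communicator, worker protocol, and class names differ; node01 below is a hypothetical host):
    require "open3"

    # Open a shell on a remote host over SSH; stdin/stdout act as the message channel.
    class Communicator
      def initialize(host)
        @stdin, @stdout, @wait_thr = Open3.popen2("ssh", host, "/bin/sh")
        @stdin.sync = true
      end

      # Send one command line and collect its output until a sentinel line arrives.
      def run(cmdline)
        @stdin.puts "#{cmdline}; echo __DONE__$?"
        output = []
        while (line = @stdout.gets)
          return [output.join, $1.to_i] if line =~ /^__DONE__(\d+)/   # [output, exit status]
          output << line
        end
      end

      def close
        @stdin.close
        @wait_thr.join
      end
    end

    # Usage: comm = Communicator.new("node01"); out, status = comm.run("cc -o a.o -c a.c")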
  21. Task Queueing ▶ Rake ◦ Depth-First Search • Same as

    Topological Sort • No parallelization ▶ Pwrake ◦ Task Queue • Search for ready-to-execute tasks • Enqueue them • Scheduling = select a task on deq Sep.9, 2016 RubyKaigi2016@Kyoto 23 (Diagram: a workflow DAG with tasks A-F; topological sort order A B D C E F; the ready tasks A, B, C are enq-ed into the Task Queue and deq-ed by workers.)
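    A minimal sketch of the enq/deq idea (not Pwrake's implementation; the DAG below is a made-up example): a task is enq-ed as soon as all of its prerequisites have finished, and the worker loop deq-s tasks one by one.
    # task name => prerequisites
    dag = {
      "A" => [], "B" => [], "C" => [],
      "D" => ["A", "B"], "E" => ["C"], "F" => ["D", "E"],
    }

    finished = {}
    queue = []                       # the task queue (enq / deq)

    enq_ready = lambda do
      dag.each_key do |t|
        next if finished[t] || queue.include?(t)
        queue << t if dag[t].all? { |p| finished[p] }   # enq when every prerequisite is done
      end
    end

    enq_ready.call
    until finished.size == dag.size
      task = queue.shift             # deq; a locality-aware scheduler would choose here
      puts "run #{task}"
      finished[task] = true
      enq_ready.call                 # newly ready tasks enter the queue
    end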
  22. Thread vs. Fiber ▶ Pwrake was initially implemented using Thread.

    ▶ Thread is annoying… ◦ Limited by max user processes (ulimit -u) ◦ Hard to find the cause of a deadlock. ◦ Which part of the code should be synchronized??? ◦ Even puts needs synchronization. ▶ Fiber is currently used. ◦ Most of the time is spent waiting for I/O from worker nodes. ◦ Easier coding thanks to explicit context switches. ▶ But it requires asynchronous I/O. Sep.9, 2016 RubyKaigi2016@Kyoto 24
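    The explicit context switch can be shown in a few lines of plain Ruby (a generic example, not Pwrake code): a fiber gives up control only at Fiber.yield, so shared state needs no locking.
    # A fiber runs until it explicitly yields; the caller decides when to resume it.
    worker = Fiber.new do
      puts "command sent, now waiting for I/O"
      Fiber.yield          # hand control back until the I/O result is available
      puts "result received, task finished"
    end

    worker.resume          # runs up to Fiber.yield
    # ... the scheduler would wait here for data on the worker's connection ...
    worker.resume          # continues after the (simulated) I/O completes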
  23. Asynchronous I/O ▶ Bartender (Asynchronous I/O) by Seki-san ◦ https://github.com/seki/bartender

    ◦ Single Fiber for one I/O ▶ Pwrake Asynchronous I/O ◦ Multiple Fibers for one I/O ◦ Timeout handling Sep.9, 2016 RubyKaigi2016@Kyoto 25
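    One common way to drive many fibers over asynchronous I/O is an IO.select loop that resumes a fiber when its IO becomes readable or its deadline passes. The following is only a sketch of that idea, not Bartender's or Pwrake's actual code:
    require "fiber"

    waiting = {}   # io => { fiber:, deadline: }

    # Called from inside a fiber: register interest in io and yield to the loop.
    wait_readable = lambda do |io, timeout|
      waiting[io] = { fiber: Fiber.current, deadline: Time.now + timeout }
      Fiber.yield
    end

    # Event loop: resume fibers whose IO is readable or whose deadline has passed.
    run_loop = lambda do
      until waiting.empty?
        timeout = waiting.values.map { |w| w[:deadline] - Time.now }.min
        ready, = IO.select(waiting.keys, nil, nil, [timeout, 0].max)
        now = Time.now
        (ready || []).each { |io| waiting.delete(io)[:fiber].resume(:ready) }
        waiting.select { |_, w| w[:deadline] <= now }
               .each { |io, w| waiting.delete(io); w[:fiber].resume(:timeout) }
      end
    end

    # Usage sketch: a local pipe stands in for a worker connection.
    rd, wr = IO.pipe
    wr.sync = true
    f = Fiber.new do
      result = wait_readable.call(rd, 5.0)
      puts(result == :ready ? rd.gets : "timed out")
    end
    f.resume
    wr.puts "ok from worker"
    run_loop.call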
  24. Other Features ▶ Task options defined with desc ◦ ncore,

    allow, deny, … ▶ Logging ▶ Report statistics as an HTML page. ▶ Output the DAG in Graphviz form. Sep.9, 2016 RubyKaigi2016@Kyoto 26
  25. File Sharing ▶ File sharing is necessary for multi-node workflows

    ▶ File Staging by Workflow Systems ◦ Transfer files to/from worker nodes. ◦ Managed by Workflow Systems. ▶ File Sharing with Distributed File System (DFS) ◦ NFS, Lustre, GPFS, Gluster, … ◦ We choose Gfarm file system for Pwrake. Sep.9, 2016 RubyKaigi2016@Kyoto 27
  26. Comparison of Network File Systems Sep.9, 2016 RubyKaigi2016@Kyoto 28

    (Diagram of three configurations.) NFS: all CPUs access a single storage server holding file1-file3 (concentration of storage, limited by the network). Distributed file systems (Lustre, GPFS, etc.): files are spread over separate storage servers, but every access still goes over the network. Gfarm file system: storage is attached to the compute nodes themselves (scalable performance with local access).
  27. Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶

    Pwrake Structure ▶Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 29
  28. Gfarm File System ▶ http://oss-tsukuba.org/software/gfarm ▶ Distributed File System constructed

    from the local storage of compute nodes. ▶ Designed for wide-area file sharing ◦ across institutes connected through the Internet. ▶ Open-source project by Prof. Tatebe ◦ Since 2000. ◦ Gfarm ver. 2 since 2007. ◦ Current version: 2.6.12. ▶ Reference: ◦ Osamu Tatebe, Kohei Hiraga, Noriyuki Soda, "Gfarm Grid File System", New Generation Computing, 2010, Vol. 28, Issue 3, p. 257. Sep.9, 2016 RubyKaigi2016@Kyoto 30
  29. Gfarm File System Components Sep.9, 2016 RubyKaigi2016@Kyoto 31

    (Diagram.) Metadata Server (MDS): manages inodes and file locations, and serves clients' directory lookups over a global directory tree (e.g. / with /dir1 holding file1, file2 and /dir2 holding file3, file4). File System Nodes (FSN): store file content on their local storage; clients access files on them directly. An FSN is also a compute node, so a compute process can read and write files with local access.
  30. Use Cases of Gfarm ▶ HPCI (High Performance Computing Infrastructure)

    ◦ http://www.hpci-office.jp/ ◦ Computational environment connecting the K computer and other supercomputers of research institutions in Japan by SINET5. ▶ NICT Science Cloud ◦ http://sc-web.nict.go.jp/ ▶ Commercial Uses ◦ Active! mail by QUALITIA • http://www.qualitia.co.jp/product/am/ Sep.9, 2016 RubyKaigi2016@Kyoto 32
  31. Gfarm Features ▶ Scalable Capacity ◦ By adding FSN ◦

    Commodity hardware ▶ Fault Tolerance ◦ Standby slave MDS ◦ Automatic file replication (mirroring) ▶ High Performance ◦ Parallel access scales ◦ Local access Sep.9, 2016 RubyKaigi2016@Kyoto 33
  32. Gfarm Issues ▶ The MDS is a single server, not scalable. ◦

    File creation speed is limited by DB performance. ◦ Use an SSD for the MDS DB storage. ▶ Sequential access performance does not scale. ◦ Gfarm does not support network RAID other than level 1 (replication). ◦ Use RAID 0/5 for the FSN spool. ▶ These may be improved in the future. Sep.9, 2016 RubyKaigi2016@Kyoto 34
  33. Gfarm Information Source ▶ NPO - OSS Tsukuba ◦ http://oss-tsukuba.org/

    ▶ Gfarm Symposium/Workshop ◦ http://oss-tsukuba.org/event ◦ Next Workshop: Oct 21, 2016 @Kobe • http://oss-tsukuba.org/event/gw2016 ▶ Mailing List ◦ https://sourceforge.net/p/gfarm/mailman/ ▶ Paid Support ◦ http://oss-tsukuba.org/support Sep.9, 2016 RubyKaigi2016@Kyoto 35
  34. Supporting Gfarm in Pwrake Sep.9, 2016 RubyKaigi2016@Kyoto 36

    (Diagram.) The Pwrake master on the master node checks whether the workflow directory is on the Gfarm file system and asks the Gfarm MDS which FSN stores each file (gfwhere-pipe); on the master, Gfarm (containing Rakefile, file01.dat, file02.dat) is mounted at /tmp/john/. On each worker node, the pwrake worker mounts the Gfarm file system for each core with gfarm2fs (e.g. /tmp/pwrake_john_000/, /tmp/pwrake_john_001/), so every worker process sees the same files under its own mount point. Master and worker communicate through the communicator.
  35. Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶

    Pwrake Structure ▶ Gfarm Distributed File System ▶Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 37
  36. Locality in Gfarm File System ▶ Large Scientific Data ◦

    File I/O is a bottleneck ▶ Data locality is the key ◦ Writing a file • Gfarm selects local storage to write the output file ◦ Reading a file • The workflow system assigns a task to the node where its input file exists Sep.9, 2016 38 RubyKaigi2016@Kyoto (Diagram: tasks writing output files to, and reading input files from, the local storage of their own node.)
  37. Locality-Aware Task Queue ▶ NodeQueue in TaskQueue: one queue assigned to each worker node. ▶ enq:

    put a task into the NodeQueues assigned to its candidate nodes. ▶ deq: get a task from the NodeQueue assigned to the worker thread's node. ▶ Load balancing by deq-ing from another NodeQueue (task stealing); a sketch follows below. Sep.9, 2016 RubyKaigi2016@Kyoto 39 (Diagram: TaskQueue containing NodeQueues for Node 1-3 plus a RemoteQueue; worker threads enq and deq tasks.)
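    A stripped-down sketch of the per-node queues and task stealing (the class and method names here are invented for illustration; Pwrake's real TaskQueue also keeps a RemoteQueue and uses more careful policies):
    # One queue per node; enq puts a task on its candidate nodes' queues,
    # deq prefers the caller's node and steals from the longest queue when empty.
    class LocalityQueue
      def initialize(nodes)
        @q = nodes.map { |n| [n, []] }.to_h   # node => waiting tasks
      end

      def enq(task, candidate_nodes)
        nodes = candidate_nodes.empty? ? @q.keys : candidate_nodes
        nodes.each { |n| @q[n] << task }      # a task may sit on several node queues
      end

      def deq(node)
        task = @q[node].shift
        task ||= @q.values.max_by(&:size).shift              # task stealing for load balancing
        @q.each_value { |list| list.delete(task) } if task   # drop other copies of the task
        task
      end
    end

    # Usage with hypothetical node names:
    # tq = LocalityQueue.new(%w[node1 node2 node3])
    # tq.enq("mProjectPP r/r01.fits p/r01.fits region.hdr", ["node2"])
    # tq.deq("node1")   # steals the task from node2's queue if node1 has nothing local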
  38. 1. Naïve Locality Scheduling ◦ Define “candidate nodes” where input

    files are stored. ◦ Default of Pwrake 2. Scheduling based on Graph Partitioning ◦ Method using MCGP (Multi-Constraint Graph Partitioning) ◦ Publication: CCGrid 2012 • M. Tanaka and O. Tatebe, "Workflow Scheduling to Minimize Data Movement Using Multi-constraint Graph Partitioning," in CCGrid 2012, p.65. Locality-aware Scheduling Methods Sep.9, 2016 RubyKaigi2016@Kyoto 40
  39. Naïve Locality Scheduling ▶ Find candidate nodes for a task based on input file

    locations ◦ Note: the input files of one task can be stored on multiple nodes. ▶ Method to define candidate nodes: ◦ For each node, calculate the total size of the task's input files stored there. ◦ Take as candidates the nodes whose total is more than half of the maximum; a sketch of this rule is given below. Sep.9, 2016 RubyKaigi2016@Kyoto 41 (Diagram: task t with input files A, B, C stored across Node 1-3; nodes reaching half of the maximum total file size are marked as candidates.)
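    The half-of-maximum rule can be written as a short helper. This is a sketch under the slide's definition; the locations and sizes hashes stand in for information that Pwrake obtains from Gfarm (e.g. via gfwhere), and the file and node names are made up:
    # For each node, sum the bytes of the task's input files stored there,
    # then keep every node whose total is more than half of the maximum.
    def candidate_nodes(input_files, locations, sizes)
      total = Hash.new(0)
      input_files.each do |f|
        locations.fetch(f, []).each { |node| total[node] += sizes.fetch(f, 0) }
      end
      return [] if total.empty?
      threshold = total.values.max / 2.0
      total.select { |_node, bytes| bytes > threshold }.keys
    end

    # Example with made-up data:
    # locations = { "p/p00.fits" => ["node1"], "p/p01.fits" => ["node1", "node2"] }
    # sizes     = { "p/p00.fits" => 2_100_000, "p/p01.fits" => 1_700_000 }
    # candidate_nodes(["p/p00.fits", "p/p01.fits"], locations, sizes)  #=> ["node1"]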
  40. Graph Partitioning on DAG Sep.9, 2016 RubyKaigi2016@Kyoto 42

    (Diagram.) Standard graph partitioning assigns whole regions of the DAG to Node-A through Node-D (e.g. former tasks and latter tasks end up on different nodes) and is not aware of task parallelization. The proposed method using Multi-Constraint Graph Partitioning partitions the tasks so that every stage of the workflow is parallelized across Node-A through Node-D.
  41. Platform for Evaluation ▶ Cluster used for evaluation: CPU Xeon E5410

    (2.3 GHz); main memory 16 GB; network GbE; # of nodes 8; total # of cores 32. ▶ Input files: 2MASS images; data size of each file 2.1 MB or 1.7 MB; # of input files 607; total data size of input files 1270 MB. ▶ Workflow: data I/O size during the workflow ~24 GB; total # of tasks (= # of DAG vertices) 3090. At first, all input files are stored on a single node. Sep.9, 2016 43 RubyKaigi2016@Kyoto
  42. Data Transfer between Nodes Sep.9, 2016 RubyKaigi2016@Kyoto 44

    (Bar chart: data size ratio (%) of data transferred between nodes.) A (Unconcern): 87.9%, B (Naïve locality): 47.4%, C (MCGP): 14.0%.
  43. Workflow Execution Time Sep.9, 2016 RubyKaigi2016@Kyoto 45

    (Bar chart: elapsed time (sec) for A (Unconcern), B (Naïve locality), C (MCGP), with annotations "31% down" and "22% down".) The MCGP result includes the time to solve MCGP (30 ms).
  44. Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶

    Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 46
  45. Fault Tolerance in Pwrake ▶ Master Failure: ◦ Rerun Pwrake

    and resume the interrupted workflow. • based on the timestamps of input/output files (see the Rakefile fragment below). ▶ Worker Failure: ◦ Policy: • The workflow does not stop even after one of the worker nodes fails. ◦ Approaches: • Automatic file replication by the Gfarm FS • Task retry and worker dropout by Pwrake (ver 2.1) Sep.9, 2016 RubyKaigi2016@Kyoto 47
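    The resume relies on ordinary Rake semantics: a file task is re-run only when its target is missing or older than its prerequisites, so re-running Pwrake on an interrupted workflow redoes only the unfinished tasks. A minimal Rakefile fragment in the style of the earlier examples:
    file "out.fits" => "in.fits" do |t|
      # skipped on re-run if out.fits already exists and is newer than in.fits
      sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
    end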
  46. Experiment of Worker Failure ▶ Kill worker process and gfsd

    at 20 sec. ▶ The # of cores drops from 64 to 56, and that node's storage is unavailable after the kill. ▶ The workflow continued successfully and the final result was correct. Sep.9, 2016 RubyKaigi2016@Kyoto 48 (Plot: # of running processes vs. time (sec); the processes and gfsd on one worker node are killed at 20 s.)
  47. Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶

    Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶Science Data Processing with Pwrake & Gfarm ◦ NICT Science Cloud ◦ HSC in Subaru Telescope Sep.9, 2016 RubyKaigi2016@Kyoto 49
  48. NICT Science Cloud Sep.9, 2016 RubyKaigi2016@Kyoto 50 Himawari-8 realtime web

    http://himawari8.nict.go.jp/ http://sc-web.nict.go.jp/ Presentation at Gfarm Symposium 2015 - http://oss-tsukuba.org/event/gs2015
  49. Hyper Suprime-Cam (HSC) on the Subaru Telescope Sep.9, 2016 RubyKaigi2016@Kyoto 51 (Image

    credit: NAOJ / HSC project) (Photos: HSC exterior, HSC focal plane CCDs, the Subaru Telescope.) Field of view: 1.5 degrees (3× that of Suprime-Cam). # of CCDs: 116. CCD pixels: 4272×2272. Generates ~300 GB of data per night. One of HSC's targets: discovery of supernovae.
  50. Conclusion ▶ A Scientific Workflow System is required for processing

    science data on a multi-node cluster. ▶ Rake is powerful as a Workflow Definition Language. ▶ The Pwrake workflow system is developed based on Rake and the Gfarm file system. ▶ Study on Locality-Aware Task Scheduling. ▶ Fault Tolerance features. ▶ Pwrake & Gfarm use cases: ◦ NICT Science Cloud ◦ Subaru HSC Sep.9, 2016 RubyKaigi2016@Kyoto 52