
Pwrake: Distributed Workflow Engine based on Rake


Masahiro Tanaka 田中昌宏

September 09, 2016

Transcript

  1. Pwrake: Distributed Workflow Engine based on Rake
    Masahiro TANAKA - 田中昌宏
    Center for Computational Sciences, University of Tsukuba
    筑波大学 計算科学研究センター
    Sep. 9, 2016
    Japan Science and Technology Agency


  2. Masahiro Tanaka
    ▶ Research Fellow at
    ○ Center for Computational Sciences, University of Tsukuba
    ▶ Majored in Astronomy
    ▶ The author of Ruby/NArray since 1999
    ○ Equivalent of NumPy
    ○ Presentation at RubyKaigi 2010 at Tsukuba
    • http://rubykaigi.org/2010/ja/events/83

  3. NArray Progress
    ▶ The name of the new version: Numo::NArray.
    ○ https://github.com/ruby-numo/narray
    ▶ Basic functionality work is almost complete.
    ▶ Future work:
    ○ bindings to numerical libraries.
    ○ bindings to plotting libraries.
    ○ I/O interfaces.
    ○ speedup with SIMD, etc.
    ○ bindings to GPU APIs.
    ○ use cases such as machine learning.
    ○ …
    ▶ Contributions are welcome.

  4. Today's Topic: Pwrake
    ▶ Parallel Workflow extension for Rake
    ▶ Purpose:
    ○ Scientific data processing on a computer cluster.
    ▶ Run Rake tasks concurrently.
    ▶ Execute sh command lines on remote computing nodes.
    Rakefile:
    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end
    Commands executed in parallel by Pwrake:
    cc -o a.o -c a.c
    cc -o b.o -c b.c
    cc -o c.o -c c.c
    https://github.com/masa16/pwrake


  5. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  6. Contents
    ▶Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  7. Workflow Example: Montage
    Combine multiple shots of astronomical images and produce a custom mosaic image.
    http://montage.ipac.caltech.edu/



  9. Example of Scientific Workflow: Montage (astronomy image processing)
    [Diagram: the workflow DAG from the input images through mProjectPP, mDiff+mFitplane, mBGModel, mBackground, mShrink, mAdd, and mJPEG to the output image; each vertex is a task (process) and each edge is an input or output file.]
    A workflow is expressed as a DAG (Directed Acyclic Graph).


  10. Contents
    ▶ Background: Scientific Workflow
    ▶Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  11. Workflow Definition Language
    ▶ Use a markup language (e.g. XML)
    ○ e.g. DAX (for the Pegasus workflow system)
    ○ Necessary to write a script to define many tasks.
    ▶ Design a new language
    ○ e.g. Swift (Wilde et al. 2011) (a different language from Apple's Swift)
    ○ Learning cost, small user community.
    ▶ Use an existing language
    ○ e.g. GXP Make (Taura et al. 2013)


  12. Workflow to Build a Program
    [Diagram of the build DAG: a.c → a.o, b.c → b.o, c.c → c.o; a.o, b.o, and c.o → foo]


  13. Workflow to Build a Program
    [Diagram: the same build DAG, shown next to the Makefile that defines it]
    Makefile (GNU make):
    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))
    all: foo
    %.o : %.c
    	cc -o $@ -c $<
    foo: $(OBJS)
    	cc -o $@ $^


  14. Workflow to Build a Program
    Makefile (GNU make):
    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))
    all: foo
    %.o : %.c
    	cc -o $@ -c $<
    foo: $(OBJS)
    	cc -o $@ $^
    Rakefile:
    SRCS = FileList["*.c"]
    OBJS = SRCS.ext("o")
    task :default => "foo"
    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end
    file "foo" => OBJS do |x|
      sh "cc -o #{x} #{OBJS}"
    end
    A Rakefile is a Ruby script.


  15. Useful features of Rake
    ▶ Ruby Scripting
    ▶ Pathmap

  16. Ruby Scripting enabled by Internal DSL
    ▶ For-Loop
    INPUT = FileList["r/*.fits"]
    OUTPUT = []
    for src in INPUT
      OUTPUT << dst = "p/" + File.basename(src)
      file dst => src do |t|
        sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
      end
    end
    task :default => OUTPUT


  17. Pathmap
    ▶ Replaces a %-format with the specified part of the path name.
    ▶ Applicable to FileList, String, and prerequisites.
    INPUT = FileList["r/*.fits"]
    OUTPUT = INPUT.pathmap("p/%f")
    rule /^p\/.*\.fits$/ => "r/%n.fits" do |t|
      sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
    end
    task :default => OUTPUT


  18. Pathmap Examples
    ▶ See the Rake manual for details.
    p 'a/b/c/file.txt'.pathmap("%p") #=> "a/b/c/file.txt"
    p 'a/b/c/file.txt'.pathmap("%f") #=> "file.txt"
    p 'a/b/c/file.txt'.pathmap("%n") #=> "file"
    p 'a/b/c/file.txt'.pathmap("%x") #=> ".txt"
    p 'a/b/c/file.txt'.pathmap("%X") #=> "a/b/c/file"
    p 'a/b/c/file.txt'.pathmap("%d") #=> "a/b/c"
    p 'a/b/c/file.txt'.pathmap("%2d") #=> "a/b"
    p 'a/b/c/file.txt'.pathmap("%-2d") #=> "b/c"
    p 'a/b/c/file.txt'.pathmap("%d%s%{file,out}f")
    #=> "a/b/c/out.txt"
    p 'a/b/c/file.txt'.pathmap("%X%{.*,*}x"){|ext| ext.upcase}
    #=> "a/b/c/file.TXT"


  19. Prerequisite Map by a Block
    ▶ A task may require two files as prerequisites.
    ▶ Useful for defining complex workflows.
    FILEMAP = {"d/d00.fits" => ["p/p00.fits", "p/p01.fits"], ...}
    rule /^d\/.*\.fits$/ => proc{|x| FILEMAP[x]} do |t|
      p1, p2 = t.prerequisites
      sh "mDiff #{p1} #{p2} #{t.name} region.hdr"
    end


  20. Rake as a WfDL
    ▶ Rake is a powerful WfDL for defining complex, many-task scientific workflows.
    ○ Rules
    ○ Pathmap
    ○ Internal DSL
    • For-loops
    • Prerequisite map by a block
    ▶ We use Rake as the WfDL for the Pwrake system.

  21. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  22. Pwrake Structure
    [Diagram: on the master node, the Pwrake master holds the task graph and the task queue; tasks are enq'd and deq'd by a fiber pool according to the scheduling policy, and each fiber drives a communicator connected over SSH to a pwrake worker on a worker node; the workers run sh processes that read and write files on the Gfarm file system.]


  23. Task Queueing
    ▶ Rake
    ○ Depth-first search
    • Equivalent to a topological sort
    • No parallelization
    ▶ Pwrake
    ○ Task queue
    • Search for ready-to-execute tasks and enqueue them.
    • Scheduling = selecting which task to deq (see the sketch below).
    [Diagram: a workflow DAG of tasks A to F, its topological sort A B D C E F, and the task queue holding the ready tasks A, B, C.]
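    A minimal, hypothetical Ruby sketch of this idea (the class and method names are illustrative, not Pwrake's real API): tasks whose prerequisites have all finished are enq'd, and scheduling amounts to choosing which queued task to deq.

    # Hypothetical sketch of ready-task queueing (not Pwrake's actual classes).
    class ToyTaskQueue
      def initialize
        @queue = []
      end

      # enq: add tasks whose prerequisites have all finished
      def enq_ready(dag, finished)
        dag.each do |name, prereqs|
          next if finished.include?(name) || @queue.include?(name)
          @queue << name if prereqs.all? { |p| finished.include?(p) }
        end
      end

      # deq: scheduling = choosing which ready task to run next
      def deq
        @queue.shift   # FIFO here; Pwrake can apply smarter policies
      end
    end

    # An assumed DAG similar to the slide: A, B, C are ready at the start.
    dag = { "A"=>[], "B"=>[], "C"=>[], "D"=>["A","B"], "E"=>["B","C"], "F"=>["D","E"] }
    q = ToyTaskQueue.new
    q.enq_ready(dag, [])   # enqueues A, B, C
    p q.deq                #=> "A"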


  24. Thread vs. Fiber
    ▶ Pwrake was initially implemented using Thread.
    ▶ Threads are annoying:
    ○ Limited by the max user processes (ulimit -u).
    ○ Hard to find the cause of a deadlock.
    ○ Which parts of the code should be synchronized?
    ○ Even puts needs synchronization.
    ▶ Fiber is used now.
    ○ Most of the time is spent waiting for I/O from worker nodes.
    ○ Easier coding thanks to explicit context switches (see the sketch below).
    ▶ But it requires asynchronous I/O.
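    A tiny, generic illustration (plain Ruby, not Pwrake code) of why fibers simplify this: context switches happen only at explicit Fiber.yield / resume points, so no locking is needed around shared state or puts.

    # Cooperative scheduling with Fiber: switches only at explicit points.
    communicators = 3.times.map do |i|
      Fiber.new do
        puts "communicator #{i}: command sent"
        Fiber.yield                  # give control back until the reply is ready
        puts "communicator #{i}: result received"
      end
    end

    # The master decides exactly when each fiber runs; no locks required.
    communicators.each(&:resume)     # each fiber runs up to its yield
    communicators.each(&:resume)     # resume each one to completion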

  25. Asynchronous I/O
    ▶ Bartender (asynchronous I/O) by Seki-san
    ○ https://github.com/seki/bartender
    ○ A single fiber per I/O.
    ▶ Pwrake asynchronous I/O
    ○ Multiple fibers per I/O.
    ○ Timeout handling (see the sketch below).
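    A minimal sketch of fiber-driven asynchronous reads with a timeout, using only core Ruby (Fiber, IO.pipe, IO.select); this illustrates the pattern, not Pwrake's implementation.

    # Fibers suspended on I/O, woken by a reactor step with a timeout.
    r, w = IO.pipe
    waiting = Hash.new { |h, k| h[k] = [] }     # io => fibers waiting on that io

    reader = Fiber.new do
      data = r.read_nonblock(4096, exception: false)
      while data == :wait_readable              # nothing to read yet:
        waiting[r] << reader                    # register and suspend this fiber
        Fiber.yield
        data = r.read_nonblock(4096, exception: false)
      end
      p data                                    #=> "hello"
    end
    reader.resume                               # runs until it suspends on the pipe

    w.write("hello"); w.close

    # One reactor step: wait up to 5 s for any registered IO to become readable.
    ready, = IO.select(waiting.keys, nil, nil, 5)
    if ready
      ready.each { |io| waiting.delete(io).each(&:resume) }
    else
      warn "timeout: no I/O within 5 s"         # timeout handling
    end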

  26. Other Features
    ▶ Task options defined with desc
    ○ ncore, allow, deny, …
    ▶ Logging
    ▶ Report statistics as an HTML page.
    ▶ Output the DAG in Graphviz form.

  27. File Sharing
    ▶ File sharing is necessary for multi-node workflows.
    ▶ File staging by workflow systems
    ○ Transfer files to/from worker nodes.
    ○ Managed by the workflow system.
    ▶ File sharing with a distributed file system (DFS)
    ○ NFS, Lustre, GPFS, Gluster, …
    ○ We chose the Gfarm file system for Pwrake.

  28. Comparison of Network File Systems
    [Diagram: NFS concentrates all files on a single storage server; distributed file systems (Lustre, GPFS, etc.) spread files over separate storage servers but are limited by the network; the Gfarm file system stores files on the local storage of the compute nodes, giving scalable performance through local access.]


  29. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  30. Gfarm File System
    ▶ http://oss-tsukuba.org/software/gfarm
    ▶ A distributed file system built from the local storage of compute nodes.
    ▶ Designed for wide-area file sharing across institutes connected through the Internet.
    ▶ Open-source project led by Prof. Tatebe
    ○ Since 2000.
    ○ Gfarm ver. 2 since 2007.
    ○ Current version: 2.6.12
    ▶ Reference:
    ○ Osamu Tatebe, Kohei Hiraga, Noriyuki Soda, "Gfarm Grid File System", New Generation Computing, 2010, Vol. 28, Issue 3, p. 257.

  31. Gfarm File System Components
    [Diagram: a client looks up the global directory tree (e.g. /dir1/file1, /dir2/file3) through the Metadata Server (MDS), which manages inodes and file locations; file contents are stored on the local storage of the File System Nodes (FSN); an FSN is also a compute node, so a compute process accesses files locally.]


  32. Use Cases of Gfarm
    ▶ HPCI (High Performance Computing Infrastructure)
    ○ http://www.hpci-office.jp/
    ○ A computational environment connecting the K computer and other supercomputers at research institutions in Japan via SINET5.
    ▶ NICT Science Cloud
    ○ http://sc-web.nict.go.jp/
    ▶ Commercial Uses
    ○ Active! mail by QUALITIA
    • http://www.qualitia.co.jp/product/am/

  33. Gfarm Features
    ▶ Scalable Capacity
    ○ By adding FSN
    ○ Commodity hardware
    ▶ Fault Tolerance
    ○ Standby slave MDS
    ○ Automatic file replication (mirroring)
    ▶ High Performance
    ○ Parallel access scales
    ○ Local access

  34. Gfarm Issues
    ▶ The MDS is standalone, not scalable.
    ○ File creation speed is limited by DB performance.
    ○ Use an SSD for the MDS DB storage.
    ▶ Sequential access performance does not increase.
    ○ Gfarm supports no network RAID level other than mirroring (RAID 1 style replication).
    ○ Use RAID 0/5 for the FSN spool.
    ▶ These may be improved in the future.

  35. Gfarm Information Source
    ▶ NPO - OSS Tsukuba
    ○ http://oss-tsukuba.org/
    ▶ Gfarm Symposium/Workshop
    ○ http://oss-tsukuba.org/event
    ○ Next Workshop: Oct 21, 2016 @Kobe
    • http://oss-tsukuba.org/event/gw2016
    ▶ Mailing List
    ○ https://sourceforge.net/p/gfarm/mailman/
    ▶ Paid Support
    ○ http://oss-tsukuba.org/support

  36. Supporting Gfarm by Pwrake
    [Diagram: the Pwrake master node mounts the Gfarm file system (e.g. under /tmp/john/) and uses gfwhere-pipe to find the FSN where each file is stored; on each worker node, the pwrake worker mounts Gfarm with gfarm2fs once per core (e.g. /tmp/pwrake_john_000/, /tmp/pwrake_john_001/), so every process sees the same Rakefile and data files (file01.dat, file02.dat) through its own mount point; the Gfarm MDS provides the shared namespace, and the master's communicator checks whether the worker is on the Gfarm file system.]


  37. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  38. Locality in the Gfarm File System
    ▶ Large scientific data
    ○ File I/O is the bottleneck.
    ▶ Data locality is the key
    ○ Writing a file: select local storage for the output file.
    ○ Reading a file: assign the task to the node where the input file exists (the workflow system's job).
    [Diagram: writes go to the local storage of the node running the task; reads stay local because the task is scheduled onto a node that holds its input file.]


  39. Locality-Aware Task Queue
    ▶ NodeQueue in the TaskQueue: one queue assigned to each worker node.
    ▶ enq: put a task into the NodeQueues assigned to its candidate nodes.
    ▶ deq: a worker thread gets a task from the NodeQueue assigned to its node.
    ▶ Load balancing by deq-ing from another NodeQueue (task stealing; see the sketch below).
    [Diagram: the TaskQueue is composed of a NodeQueue per node (Node 1 to 3) plus a RemoteQueue; tasks are enq'd into candidate NodeQueues and deq'd by the worker threads.]
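    A hypothetical Ruby sketch of per-node queues with task stealing (the class and method names are illustrative, not Pwrake's real API):

    # Illustrative per-node queues with task stealing.
    class ToyLocalityQueue
      def initialize(nodes)
        @nodes = nodes
        @node_q = Hash.new { |h, k| h[k] = [] }   # one NodeQueue per worker node
        @remote_q = []                            # tasks with no candidate node
      end

      # enq: push the task onto the queue of every candidate node
      def enq(task, candidate_nodes)
        if candidate_nodes.empty?
          @remote_q << task
        else
          candidate_nodes.each { |n| @node_q[n] << task }
        end
      end

      # deq: prefer the caller's own NodeQueue, then the remote queue,
      # then steal from the longest other queue for load balancing
      def deq(node)
        task =  @node_q[node].shift
        task ||= @remote_q.shift
        task ||= @nodes.map { |n| @node_q[n] }.max_by(&:size)&.shift
        @node_q.each_value { |q| q.delete(task) } if task   # drop other copies
        task
      end
    end

    q = ToyLocalityQueue.new(["node1", "node2"])
    q.enq("mProjectPP r00.fits", ["node1"])
    p q.deq("node2")   #=> "mProjectPP r00.fits" (stolen from node1's queue)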


  40. Locality-Aware Scheduling Methods
    1. Naïve locality scheduling
    ○ Define "candidate nodes" as the nodes where the input files are stored.
    ○ The default in Pwrake.
    2. Scheduling based on graph partitioning
    ○ A method using MCGP (Multi-Constraint Graph Partitioning).
    ○ Publication: M. Tanaka and O. Tatebe, "Workflow Scheduling to Minimize Data Movement Using Multi-constraint Graph Partitioning," CCGrid 2012, p. 65.


  41. Naïve Locality Scheduling
    ▶ Find the candidate nodes of a task based on its input file locations.
    ○ Note: the input files of one task can be stored on multiple nodes.
    ▶ Method to define the candidate nodes (see the sketch below):
    ○ Calculate the total size of the task's input files stored on each node.
    ○ A node is a candidate if its total exceeds half of the maximum total.
    [Diagram: task t reads files A and B stored on Node 1 and file C replicated on Node 2 and Node 3; nodes whose total input size exceeds half of the maximum are marked as candidates, the others are not.]
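    A small Ruby sketch of this rule, assuming we already know the size and hosting nodes of each input file (the data layout and file sizes below are hypothetical):

    # Candidate nodes: those holding more than half of the maximum
    # per-node total input size.
    def candidate_nodes(input_files)
      total = Hash.new(0)
      input_files.each do |f|
        f[:nodes].each { |n| total[n] += f[:size] }
      end
      threshold = total.values.max / 2.0
      total.select { |_node, size| size > threshold }.keys
    end

    inputs = [
      { name: "A", size: 100, nodes: ["node1"] },
      { name: "B", size: 90,  nodes: ["node2"] },
      { name: "C", size: 30,  nodes: ["node2", "node3"] },
    ]
    p candidate_nodes(inputs)   #=> ["node1", "node2"]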


  42. Graph Partitioning on a DAG
    [Diagram: standard graph partitioning splits the DAG into former tasks and latter tasks across Node-A to Node-D and is not aware of task parallelization; the proposed method using Multi-Constraint Graph Partitioning parallelizes every stage across Node-A to Node-D.]


  43. Platform for Evaluation
    ▶ Cluster used for the evaluation:
    ○ CPU: Xeon E5410 (2.3 GHz)
    ○ Main memory: 16 GB
    ○ Network: GbE
    ○ # of nodes: 8
    ○ Total # of cores: 32
    ▶ Input files: 2MASS images
    ○ Data size of each file: 2.1 MB or 1.7 MB
    ○ # of input files: 607
    ○ Total data size of input files: 1270 MB
    ○ Data I/O size during the workflow: ~24 GB
    ○ Total # of tasks (= # of vertices): 3090
    ▶ At first, all the input files are stored on a single node.


  44. Data Transfer between Nodes
    [Bar chart: ratio of data size transferred between nodes: A (Unconcern) 87.9%, B (Naïve locality) 47.4%, C (MCGP) 14.0%.]


  45. Workflow Execution Time
    [Bar chart: elapsed time (sec) for A (Unconcern), B (Naïve locality), and C (MCGP); the locality-aware schedulers cut the execution time by 22% and 31%, and the MCGP result includes the time to solve MCGP (30 ms).]


  46. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  47. Fault Tolerance in Pwrake
    ▶ Master failure:
    ○ Rerun Pwrake and resume the interrupted workflow,
    • based on the timestamps of input/output files (see the sketch below).
    ▶ Worker failure:
    ○ Policy:
    • The workflow does not stop even after one of the worker nodes fails.
    ○ Approaches:
    • Automatic file replication by the Gfarm FS.
    • Task retry and worker dropout by Pwrake (ver 2.1).
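    The resume check is, in essence, the timestamp rule Rake uses for file tasks; here is a minimal stand-alone Ruby sketch (the file names are only examples):

    # A task is skipped when its output exists and is newer than every input.
    def up_to_date?(output, inputs)
      return false unless File.exist?(output)
      inputs.all? { |i| File.mtime(i) <= File.mtime(output) }
    end

    # After a master failure, rerunning the workflow only executes
    # tasks whose outputs are missing or stale.
    if up_to_date?("p/p00.fits", ["r/r00.fits"])
      puts "skip: p/p00.fits is up to date"
    else
      system("mProjectPP r/r00.fits p/p00.fits region.hdr")
    end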

  48. Experiment of Worker Failure
    ▶ Kill the worker processes and gfsd at 20 sec.
    ▶ The number of cores drops from 64 to 56; the killed node's storage is unavailable afterwards.
    ▶ The final result was correct; the workflow continued successfully.
    [Plot: number of running processes vs. time (sec); the processes and gfsd on one worker node are killed at 20 sec, and the remaining workers keep running.]


  49. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶Science Data Processing with Pwrake & Gfarm
    ○ NICT Science Cloud
    ○ HSC in Subaru Telescope

  50. NICT Science Cloud
    http://sc-web.nict.go.jp/
    Himawari-8 real-time web: http://himawari8.nict.go.jp/
    Presentation at Gfarm Symposium 2015: http://oss-tsukuba.org/event/gs2015


  51. Hyper Suprime-Cam (HSC) on the Subaru Telescope
    [Images: the HSC instrument, the HSC focal-plane CCDs, and the Subaru Telescope. Image credit: NAOJ / HSC project]
    Field of view: 1.5 degrees (3x that of Suprime-Cam)
    # of CCDs: 116
    CCD pixels: 4272×2272
    Generates ~300 GB of data per night
    One of HSC's targets: discovery of supernovae


  52. Conclusion
    ▶ A scientific workflow system is required for processing science data on a multi-node cluster.
    ▶ Rake is powerful as a Workflow Definition Language.
    ▶ The Pwrake workflow system is developed based on Rake and the Gfarm file system.
    ▶ Study on locality-aware task scheduling.
    ▶ Fault-tolerance features.
    ▶ Pwrake & Gfarm use cases:
    ○ NICT Science Cloud
    ○ Subaru HSC