Slide 1

Slide 1 text

Pwrake: Distributed Workflow Engine based on Rake
Masahiro TANAKA (田中昌宏), Center for Computational Sciences, University of Tsukuba (筑波大学 計算科学研究センター)
Sep.9, 2016, RubyKaigi2016@Kyoto
Japan Science and Technology Agency

Slide 2

Slide 2 text

Masahiro Tanaka
▶ Research Fellow at the Center for Computational Sciences, University of Tsukuba
▶ Majored in Astronomy
▶ The author of Ruby/NArray since 1999
  ○ Equivalent of NumPy
  ○ Presentation at RubyKaigi 2010 at Tsukuba
    • http://rubykaigi.org/2010/ja/events/83

Slide 3

Slide 3 text

NArray Progress
▶ The name of the new version: Numo::NArray
  ○ https://github.com/ruby-numo/narray
▶ Basic functionality is almost complete.
▶ Future work:
  ○ bindings to numerical libraries
  ○ bindings to plotting libraries
  ○ I/O interfaces
  ○ speedup with SIMD etc.
  ○ bindings to GPU APIs
  ○ use cases such as machine learning
  ○ …
▶ Contributions are welcome.
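As a quick taste of the Numo::NArray API (a minimal sketch; it assumes the numo-narray gem is installed, and the values shown are what the released gem returns):

    require "numo/narray"

    a = Numo::DFloat.new(5).seq   # 0.0, 1.0, 2.0, 3.0, 4.0
    b = a * 2 + 1                 # element-wise arithmetic, NumPy-style
    p b.sum                       # => 25.0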

Slide 4

Slide 4 text

Today’s Topic: Pwrake
▶ Parallel Workflow extension for Rake
▶ Purpose: scientific data processing on computer clusters.
▶ Runs Rake tasks concurrently.
▶ Executes sh command lines on remote computing nodes.
▶ https://github.com/masa16/pwrake

Rakefile:

    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end

Pwrake turns this into command lines such as:

    cc -o a.o -c a.c
    cc -o b.o -c b.c
    cc -o c.o -c c.c
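For readers new to Rake: the rule above synthesizes one file task per source file, so it is equivalent to writing, for each .c file, something like the following (plain Rake semantics; Pwrake walks the same task graph but dispatches each sh command to a remote node):

    file "a.o" => "a.c" do |t|
      sh "cc -o #{t.name} -c #{t.prerequisites[0]}"
    end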

Slide 5

Slide 5 text

Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶ Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 5

Slide 6

Slide 6 text

Contents ▶Background: Scientific Workflow ▶ Workflow Definition Language ▶ Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 6

Slide 7

Slide 7 text

Workflow Example: Montage
Combine multiple shots of astronomical images to produce a custom mosaic image.
http://montage.ipac.caltech.edu/

Slide 8

Slide 8 text

Workflow Example: Montage
Combine multiple shots of astronomical images to produce a custom mosaic image.
http://montage.ipac.caltech.edu/

Slide 9

Slide 9 text

Example of a scientific workflow: Montage (astronomy image processing)
The workflow is expressed as a DAG (Directed Acyclic Graph) of tasks (processes) and their input/output files.
(Diagram: input images → mProjectPP → mDiff+mFitplane → mBGModel → mBackground → mShrink → mAdd → mAdd → mJPEG → output image.)

Slide 10

Slide 10 text

Contents ▶ Background: Scientific Workflow ▶Workflow Definition Language ▶ Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 10

Slide 11

Slide 11 text

Workflow Definition Language
▶ Use a markup language (e.g. XML)
  ○ e.g. DAX (for the Pegasus workflow system)
  ○ Requires writing a script to define many tasks.
▶ Design a new language
  ○ e.g. Swift (Wilde et al. 2011) (a different language from Apple's Swift)
  ○ Learning cost, small user community.
▶ Use an existing language
  ○ e.g. GXP Make (Taura et al. 2013)

Slide 12

Slide 12 text

Workflow to Build a Program
(DAG: a.c → a.o, b.c → b.o, c.c → c.o; a.o, b.o, c.o → foo)

Slide 13

Slide 13 text

Workflow to Build a Program
The same DAG written as a Makefile (GNU make):

    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))

    all: foo

    %.o : %.c
    	cc -o $@ -c $<

    foo: $(OBJS)
    	cc -o $@ $^

Slide 14

Slide 14 text

Workflow to Build a Program
Makefile (GNU make):

    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))

    all: foo

    %.o : %.c
    	cc -o $@ -c $<

    foo: $(OBJS)
    	cc -o $@ $^

Rakefile (a Rakefile is a Ruby script):

    SRCS = FileList["*.c"]
    OBJS = SRCS.ext("o")

    task :default => "foo"

    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end

    file "foo" => OBJS do |x|
      sh "cc -o #{x} #{OBJS}"
    end

Slide 15

Slide 15 text

Useful Features of Rake
▶ Ruby scripting
▶ Pathmap

Slide 16

Slide 16 text

Ruby Scripting enabled by Internal DSL
▶ For-loop

    INPUT = FileList["r/*.fits"]
    OUTPUT = []
    for src in INPUT
      OUTPUT << dst = "p/" + File.basename(src)
      file dst => src do |t|
        sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
      end
    end
    task :default => OUTPUT

Slide 17

Slide 17 text

Pathmap
▶ Replaces each %-format with the specified part of the path name.
▶ Applicable to FileList, String, and rule prerequisites.

    INPUT = FileList["r/*.fits"]
    OUTPUT = INPUT.pathmap("p/%f")
    rule /^p\/.*\.fits$/ => "r/%n.fits" do |t|
      sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
    end
    task :default => OUTPUT

Slide 18

Slide 18 text

Pathmap Examples
▶ See the Rake manual for details.

    p 'a/b/c/file.txt'.pathmap("%p")    #=> "a/b/c/file.txt"
    p 'a/b/c/file.txt'.pathmap("%f")    #=> "file.txt"
    p 'a/b/c/file.txt'.pathmap("%n")    #=> "file"
    p 'a/b/c/file.txt'.pathmap("%x")    #=> ".txt"
    p 'a/b/c/file.txt'.pathmap("%X")    #=> "a/b/c/file"
    p 'a/b/c/file.txt'.pathmap("%d")    #=> "a/b/c"
    p 'a/b/c/file.txt'.pathmap("%2d")   #=> "a/b"
    p 'a/b/c/file.txt'.pathmap("%-2d")  #=> "b/c"
    p 'a/b/c/file.txt'.pathmap("%d%s%{file,out}f")             #=> "a/b/c/out.txt"
    p 'a/b/c/file.txt'.pathmap("%X%{.*,*}x"){|ext| ext.upcase} #=> "a/b/c/file.TXT"

Slide 19

Slide 19 text

Prerequisite Map by Block
▶ The task requires two files as prerequisites.
▶ Useful for defining complex workflows.

    FILEMAP = {"d/d00.fits" => ["p/p00.fits", "p/01.fits"], ...}
    rule /^d\/.*\.fits$/ => proc{|x| FILEMAP[x]} do |t|
      p1, p2 = t.prerequisites
      sh "mDiff #{p1} #{p2} #{t.name} region.hdr"
    end

Slide 20

Slide 20 text

Rake as a WfDL
▶ Rake is a powerful WfDL for defining complex, many-task scientific workflows:
  ○ Rules
  ○ Pathmap
  ○ Internal DSL
    • For-loop
    • Prerequisite map by block
▶ We use Rake as the WfDL for the Pwrake system.

Slide 21

Slide 21 text

Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 21

Slide 22

Slide 22 text

Pwrake Structure
▶ Master node: the Pwrake master holds the task graph and the task queue (enq/deq with scheduling) and runs a fiber pool; each fiber drives one sh process on a worker.
▶ Worker nodes: a pwrake worker process on each node receives commands from the master over SSH (communicator) and spawns the sh processes.
▶ Files are shared among the nodes through the Gfarm file system.

Slide 23

Slide 23 text

Task Queueing
▶ Rake
  ○ Depth-first search
    • Equivalent to a topological sort
    • No parallelization
▶ Pwrake
  ○ Task queue
    • Search for ready-to-execute tasks and enqueue them.
    • Scheduling = selecting which task to deq.
(Diagram: the workflow DAG feeds ready tasks into the task queue via enq; workers take them via deq. A toy sketch follows.)
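A toy sketch of the idea (not Pwrake's actual code): a task becomes ready when all of its prerequisites have finished, and scheduling amounts to choosing which ready task to deq next. The `dag` argument and the sequential loop are illustrative simplifications.

    # dag maps each task name to the list of its prerequisite tasks.
    def simulate_queue(dag)
      remaining = Hash[dag.map { |t, pre| [t, pre.dup] }]
      queue = remaining.select { |_t, pre| pre.empty? }.keys  # ready tasks (enq)
      order = []
      until remaining.empty?
        t = queue.shift or raise "cyclic dependency"          # deq (scheduling point)
        remaining.delete(t)
        order << t                                            # pretend the task ran here
        remaining.each do |u, pre|                            # tasks that just became ready
          queue << u if pre.delete(t) && pre.empty?
        end
      end
      order
    end

    p simulate_queue("A" => [], "B" => [], "C" => ["A", "B"], "D" => ["C"])
    # => ["A", "B", "C", "D"]  (one possible order)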

Slide 24

Slide 24 text

Thread vs. Fiber
▶ Pwrake was initially implemented using Threads.
▶ Threads are annoying…
  ○ Limited by the maximum number of user processes (ulimit -u).
  ○ Hard to find the cause of a deadlock.
  ○ Which parts of the code should be synchronized???
  ○ Even puts needs synchronization.
▶ Fibers are used now.
  ○ Most of the time is spent waiting for I/O from worker nodes.
  ○ Easier coding thanks to explicit context switches (see the sketch below).
▶ But they require asynchronous I/O.
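A minimal illustration of the explicit context switch (plain Ruby, not Pwrake code): a fiber only gives up control where it calls Fiber.yield, so no locks are needed around shared state.

    handler = Fiber.new do
      puts "send command to worker"
      reply = Fiber.yield            # park here until the master resumes us
      puts "process reply: #{reply}"
    end

    handler.resume                   # runs until Fiber.yield
    # ... the master waits on I/O from workers here ...
    handler.resume("exit status 0")  # continues after Fiber.yield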

Slide 25

Slide 25 text

Asynchronous I/O
▶ Bartender (asynchronous I/O library) by Seki-san
  ○ https://github.com/seki/bartender
  ○ A single fiber per I/O
▶ Pwrake's asynchronous I/O
  ○ Multiple fibers per I/O
  ○ Timeout handling
A rough sketch of the idea follows.
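Illustrative sketch only, not Pwrake's I/O layer: fibers park themselves on an IO object, an IO.select loop resumes them when data arrives, and the select timeout provides timeout handling. A real implementation also has to hand each fiber its own line when several fibers share one connection.

    require "fiber"

    WAITERS = Hash.new { |h, io| h[io] = [] }   # io => fibers waiting on it

    def wait_readable(io)
      WAITERS[io] << Fiber.current
      Fiber.yield                    # parked until the event loop resumes us
    end

    def event_loop(timeout = 10)
      until WAITERS.empty?
        ready, = IO.select(WAITERS.keys, nil, nil, timeout)
        raise "timeout waiting for workers" unless ready   # timeout handling
        ready.each do |io|
          line = io.gets
          WAITERS.delete(io).each { |f| f.resume(line) }
        end
      end
    end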

Slide 26

Slide 26 text

Other Features
▶ Task options defined with desc
  ○ ncore, allow, deny, …
▶ Logging
▶ Report statistics as an HTML page.
▶ Output the DAG in Graphviz form.

Slide 27

Slide 27 text

File Sharing
▶ File sharing is necessary for multi-node workflows.
▶ Option 1: file staging by the workflow system
  ○ Transfer files to/from worker nodes.
  ○ Managed by the workflow system.
▶ Option 2: file sharing with a distributed file system (DFS)
  ○ NFS, Lustre, GPFS, Gluster, …
  ○ We chose the Gfarm file system for Pwrake.

Slide 28

Slide 28 text

Comparison of Network File Systems
▶ NFS: a single storage server shared by all compute nodes; concentration of storage.
▶ Distributed file systems (Lustre, GPFS, etc.): storage servers are separate from the compute nodes; limited by the network.
▶ Gfarm file system: storage is attached to the compute nodes themselves; scalable performance thanks to local access.

Slide 29

Slide 29 text

Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶ Pwrake Structure ▶Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 29

Slide 30

Slide 30 text

Gfarm File System
▶ http://oss-tsukuba.org/software/gfarm
▶ A distributed file system built from the local storage of compute nodes.
▶ Designed for wide-area file sharing
  ○ across institutes connected through the Internet.
▶ Open-source project led by Prof. Tatebe
  ○ Since 2000.
  ○ Gfarm ver. 2 since 2007.
  ○ Current version: 2.6.12.
▶ Reference:
  ○ Osamu Tatebe, Kohei Hiraga, Noriyuki Soda, "Gfarm Grid File System", New Generation Computing, 2010, Vol. 28, Issue 3, p. 257.

Slide 31

Slide 31 text

Gfarm File System Components
▶ Metadata Server (MDS): manages the global directory tree, inodes, and file locations; clients perform directory lookups against it.
▶ File System Nodes (FSN): store file contents on their local storage; an FSN is also a compute node, so a compute process can access files locally.
▶ Client: looks up directories via the MDS and accesses file contents on the FSNs.

Slide 32

Slide 32 text

Use Cases of Gfarm
▶ HPCI (High Performance Computing Infrastructure)
  ○ http://www.hpci-office.jp/
  ○ A computational environment connecting the K computer and other supercomputers at research institutions in Japan via SINET5.
▶ NICT Science Cloud
  ○ http://sc-web.nict.go.jp/
▶ Commercial uses
  ○ Active! mail by QUALITIA
    • http://www.qualitia.co.jp/product/am/

Slide 33

Slide 33 text

Gfarm Features
▶ Scalable capacity
  ○ By adding FSNs
  ○ Commodity hardware
▶ Fault tolerance
  ○ Standby slave MDS
  ○ Automatic file replication (mirroring)
▶ High performance
  ○ Parallel access scales
  ○ Local access

Slide 34

Slide 34 text

Gfarm Issues
▶ The MDS is a single server and does not scale.
  ○ File creation speed is limited by DB performance.
  ○ Workaround: use an SSD for the MDS DB storage.
▶ Sequential access performance does not increase.
  ○ Gfarm supports no network RAID level other than mirroring (level 1).
  ○ Workaround: use RAID 0/5 for the FSN spool.
▶ These may be improved in the future.

Slide 35

Slide 35 text

Gfarm Information Sources
▶ NPO OSS Tsukuba
  ○ http://oss-tsukuba.org/
▶ Gfarm Symposium/Workshop
  ○ http://oss-tsukuba.org/event
  ○ Next workshop: Oct 21, 2016 @ Kobe
    • http://oss-tsukuba.org/event/gw2016
▶ Mailing list
  ○ https://sourceforge.net/p/gfarm/mailman/
▶ Paid support
  ○ http://oss-tsukuba.org/support

Slide 36

Slide 36 text

Gfarm Support in Pwrake
▶ Master node: the Pwrake master checks whether the working directory is on the Gfarm FS and finds the FSN where each file is stored (gfwhere-pipe).
▶ Worker node: the Pwrake worker mounts the Gfarm FS for each core with gfarm2fs (e.g. /tmp/pwrake_john_000/, /tmp/pwrake_john_001/) and runs the processes inside those mount points.
▶ Both sides see the same files (Rakefile, file01.dat, file02.dat, …) through the Gfarm MDS.
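As an illustration of the "find the FSN where a file is stored" step, one could shell out to Gfarm's gfwhere command. This is a hypothetical sketch: the output parsing is an assumption, and Pwrake itself keeps a persistent gfwhere-pipe process rather than spawning one per file.

    # Returns the host names of the file system nodes holding a Gfarm file.
    def hosts_of(gfarm_path)
      out = IO.popen(["gfwhere", gfarm_path], &:read)
      out.split(/\s+/)   # assumption: whitespace-separated host names
    end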

Slide 37

Slide 37 text

Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶ Pwrake Structure ▶ Gfarm Distributed File System ▶Locality-Aware Task Scheduling ▶ Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 37

Slide 38

Slide 38 text

Locality in the Gfarm File System
▶ Large scientific data
  ○ File I/O is the bottleneck.
▶ Data locality is the key
  ○ Writing a file: select local storage for the output file.
  ○ Reading a file: the workflow system assigns the task to the node where the input file exists.

Slide 39

Slide 39 text

Locality-Aware Task Queue
▶ The TaskQueue contains one NodeQueue per worker node (plus a RemoteQueue).
▶ enq: put a task into the NodeQueue(s) of its candidate nodes.
▶ deq: a worker thread takes a task from the NodeQueue of its own node.
▶ Load balancing: if its own NodeQueue is empty, deq from another NodeQueue (task stealing).
A simplified sketch follows.
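Illustrative sketch only; Pwrake's TaskQueue also handles the RemoteQueue per connection and tasks with several candidate nodes, which this sketch reduces to a single NodeQueue per task.

    class LocalityQueue
      def initialize(nodes)
        @nodes  = nodes
        @queues = Hash.new { |h, n| h[n] = [] }   # one NodeQueue per worker node
        @remote = []                              # tasks without a candidate node
      end

      # enq: register the task on (one of) its candidate nodes.
      def enq(task, candidate_nodes)
        node = candidate_nodes.first
        (node ? @queues[node] : @remote) << task
      end

      # deq: a worker on `node` prefers its own queue, then the remote
      # queue, then steals from the most loaded node (load balancing).
      def deq(node)
        @queues[node].shift || @remote.shift || steal(node)
      end

      private

      def steal(node)
        victim = (@nodes - [node]).max_by { |n| @queues[n].size }
        victim && @queues[victim].shift
      end
    end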

Slide 40

Slide 40 text

Locality-Aware Scheduling Methods
1. Naïve locality scheduling
  ○ Define "candidate nodes" as the nodes where the input files are stored.
  ○ The default in Pwrake.
2. Scheduling based on graph partitioning
  ○ A method using MCGP (Multi-Constraint Graph Partitioning).
  ○ Publication: M. Tanaka and O. Tatebe, "Workflow Scheduling to Minimize Data Movement Using Multi-constraint Graph Partitioning," in CCGrid 2012, p. 65.

Slide 41

Slide 41 text

Naïve Locality Scheduling
▶ Find candidate nodes for a task based on the locations of its input files.
  ○ Note: the input files of a task may be stored on multiple nodes.
▶ Method for choosing candidate nodes (see the sketch below):
  ○ For each node, calculate the total size of the task's input files stored there.
  ○ Candidate nodes are those holding more than half of the maximum total size.
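A sketch of that selection rule (not Pwrake's actual code; `file_hosts` and `file_size` stand in for lookups backed by Gfarm metadata such as gfwhere and file stat):

    # input_files : array of input file names for one task
    # file_hosts  : { file => [hosts holding a replica] }
    # file_size   : { file => size in bytes }
    def candidate_nodes(input_files, file_hosts, file_size)
      bytes = Hash.new(0)
      input_files.each do |f|
        file_hosts[f].each { |host| bytes[host] += file_size[f] }
      end
      return [] if bytes.empty?
      threshold = bytes.values.max / 2.0
      bytes.select { |_host, size| size > threshold }.keys
    end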

Slide 42

Slide 42 text

Graph Partitioning on the DAG
▶ Standard graph partitioning: splits the DAG into former tasks and latter tasks; not aware of task parallelization.
▶ Proposed method using Multi-Constraint Graph Partitioning: assigns tasks to Node-A…Node-D so that every stage is parallelized.

Slide 43

Slide 43 text

Platform for Evaluation
▶ Cluster used for evaluation:
  ○ CPU: Xeon E5410 (2.3 GHz)
  ○ Main memory: 16 GB
  ○ Network: GbE
  ○ # of nodes: 8
  ○ Total # of cores: 32
▶ Input files: 2MASS images
  ○ Size of each file: 2.1 MB or 1.7 MB
  ○ # of input files: 607
  ○ Total size of input files: 1270 MB
  ○ Data I/O size during the workflow: ~24 GB
  ○ Total # of tasks (= # of DAG vertices): 3090
▶ Initially, all input files are stored on a single node.

Slide 44

Slide 44 text

Data Transfer between Nodes (data size ratio, %):
  A (Unconcern): 87.9%
  B (Naïve locality): 47.4%
  C (MCGP): 14.0%

Slide 45

Slide 45 text

Workflow Execution Time (elapsed time, sec) for A (Unconcern), B (Naïve locality), C (MCGP):
  B (Naïve locality): 22% reduction vs. A
  C (MCGP): 31% reduction vs. A, including the 30 ms needed to solve the MCGP problem

Slide 46

Slide 46 text

Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶ Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶Fault Tolerance ▶ Science Data Processing with Pwrake & Gfarm Sep.9, 2016 RubyKaigi2016@Kyoto 46

Slide 47

Slide 47 text

Fault Tolerance in Pwrake
▶ Master failure:
  ○ Rerun Pwrake and resume the interrupted workflow
    • based on the timestamps of input/output files (see the sketch below).
▶ Worker failure:
  ○ Policy: the workflow does not stop even if one of the worker nodes fails.
  ○ Approaches:
    • Automatic file replication by the Gfarm FS
    • Task retry and worker dropout in Pwrake (ver. 2.1)
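Resuming comes largely from Rake's own file-task semantics: a file task reruns only when its output is missing or older than one of its prerequisites. A simplified sketch of that check (in the spirit of Rake::FileTask#needed?, not Pwrake's actual code):

    def out_of_date?(target, prerequisites)
      return true unless File.exist?(target)
      t = File.mtime(target)
      prerequisites.any? { |pre| File.mtime(pre) > t }
    end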

Slide 48

Slide 48 text

Experiment: Worker Failure
▶ Kill the worker processes and gfsd on one node at 20 sec.
▶ The number of cores drops from 64 to 56, and that node's storage becomes unavailable after the kill.
▶ The workflow continued successfully and the final result was correct.
(Chart: number of running processes vs. time (sec); the drop marks where the processes and gfsd were killed on a worker node.)

Slide 49

Slide 49 text

Contents ▶ Background: Scientific Workflow ▶ Workflow Definition Language ▶ Pwrake Structure ▶ Gfarm Distributed File System ▶ Locality-Aware Task Scheduling ▶ Fault Tolerance ▶Science Data Processing with Pwrake & Gfarm ○ NICT Science Cloud ○ HSC in Subaru Telescope Sep.9, 2016 RubyKaigi2016@Kyoto 49

Slide 50

Slide 50 text

NICT Science Cloud
▶ http://sc-web.nict.go.jp/
▶ Himawari-8 real-time web: http://himawari8.nict.go.jp/
▶ Presentation at Gfarm Symposium 2015: http://oss-tsukuba.org/event/gs2015

Slide 51

Slide 51 text

Hyper Suprime-Cam (HSC) at the Subaru Telescope
(Image credit: NAOJ・HSC project)
▶ Field of view: 1.5 degrees (3× that of Suprime-Cam)
▶ # of CCDs: 116
▶ CCD pixels: 4272 × 2272
▶ Generates ~300 GB of data per night
▶ One of HSC's targets: discovery of supernovae

Slide 52

Slide 52 text

Conclusion
▶ A scientific workflow system is required for processing science data on a multi-node cluster.
▶ Rake is powerful as a workflow definition language.
▶ The Pwrake workflow system is developed on top of Rake and the Gfarm file system.
▶ We studied locality-aware task scheduling.
▶ Pwrake provides fault tolerance features.
▶ Pwrake & Gfarm use cases:
  ○ NICT Science Cloud
  ○ Subaru HSC