Sciences, University of Tsukuba
▶ Majored in Astronomy
▶ Author of Ruby/NArray since 1999
  ◦ Equivalent of NumPy
  ◦ Presentation at RubyKaigi 2010 in Tsukuba
    • http://rubykaigi.org/2010/ja/events/83
https://github.com/ruby-numo/narray
▶ Basic functionality work is almost complete.
▶ Future work:
  ◦ bindings to numerical libraries
  ◦ bindings to plotting libraries
  ◦ interface to I/O
  ◦ speedup with SIMD etc.
  ◦ bindings to GPU APIs
  ◦ use cases such as machine learning
  ◦ …
▶ Contributions are welcome.
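For readers new to Numo::NArray, a short NumPy-like usage example; this is my own sketch assuming the current ruby-numo/narray API, not taken from the slides.

    require "numo/narray"

    a = Numo::DFloat.new(3, 3).seq   # 3x3 float array filled with 0..8
    b = a * 2 + 1                    # element-wise arithmetic
    puts b.sum                       # reduction over all elements
    p b[0, true]                     # first row ("true" selects the whole axis)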
Pwrake
https://github.com/masa16/pwrake
▶ Purpose:
  ◦ Scientific data processing on a computer cluster.
▶ Run Rake tasks concurrently.
▶ Execute sh command lines on remote computing nodes.

Rakefile:

    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end

Commands executed concurrently:

    cc -o a.o -c a.c
    cc -o b.o -c b.c
    cc -o c.o -c c.c
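To make the rule above fire, the Rakefile also needs tasks that depend on the .o files. A minimal sketch using Rake's FileList; this companion snippet is my illustration, not from the slides:

    # Build every .c file in the directory; the ".o" rule above generates
    # each compile command, and Pwrake can run them concurrently.
    SRCS = FileList["*.c"]     # e.g. a.c, b.c, c.c
    OBJS = SRCS.ext("o")       # => a.o, b.o, c.o

    task :default => OBJS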
Workflow Definition Language
… Pegasus workflow system)
  ◦ Necessary to write a script to define many tasks.
▶ Design a new language
  ◦ e.g. Swift (Wilde et al. 2011) (a different language from Apple's Swift)
  ◦ Learning cost, small user community.
▶ Use an existing language
  ◦ e.g. GXP Make (Taura et al. 2013)
Rake's String#pathmap:

    #=> "a/b/c/file.txt"
    p 'a/b/c/file.txt'.pathmap("%f")    #=> "file.txt"
    p 'a/b/c/file.txt'.pathmap("%n")    #=> "file"
    p 'a/b/c/file.txt'.pathmap("%x")    #=> ".txt"
    p 'a/b/c/file.txt'.pathmap("%X")    #=> "a/b/c/file"
    p 'a/b/c/file.txt'.pathmap("%d")    #=> "a/b/c"
    p 'a/b/c/file.txt'.pathmap("%2d")   #=> "a/b"
    p 'a/b/c/file.txt'.pathmap("%-2d")  #=> "b/c"
    p 'a/b/c/file.txt'.pathmap("%d%s%{file,out}f")              #=> "a/b/c/out.txt"
    p 'a/b/c/file.txt'.pathmap("%X%{.*,*}x"){|ext| ext.upcase}  #=> "a/b/c/file.TXT"
… to define complex and many-task scientific workflows:
  ◦ Rule
  ◦ Pathmap
  ◦ Internal DSL
    • For-loop
    • Prerequisite mapped by a block
▶ We use Rake as the WfDL for the Pwrake system (a sketch of these features follows).
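A minimal sketch of the Rake features listed above; the dataset names and shell commands are hypothetical, not from the slides:

    # Internal DSL with a for-loop: define one file task per dataset
    SUBJECTS = %w[m101 m51 m81]            # hypothetical dataset names

    SUBJECTS.each do |s|
      file "#{s}.fits" => "#{s}.raw" do |t|
        sh "convert_raw #{t.source} #{t.name}"    # hypothetical command
      end
    end

    # Prerequisite mapped by a block: each .out depends on the matching .fits
    rule ".out" => ->(f){ f.pathmap("%X.fits") } do |t|
      sh "analyze #{t.source} > #{t.name}"        # hypothetical command
    end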
… Topological Sort
  • No parallelization
▶ Pwrake
  ◦ Task Queue
    • Search for ready-to-execute tasks and enqueue them.
    • Scheduling = selecting which task to deq (see the sketch below).
[Figure: a workflow DAG (tasks A–F) linearized by topological sort vs. fed to the Pwrake Task Queue via enq/deq]
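A minimal sketch (hypothetical class, not Pwrake's implementation) of the queue-based scheduling idea: a task is enqueued once all of its prerequisites have finished, and workers pick ready tasks off the queue.

    require "set"

    class ReadyTaskQueue
      # dag: {task_name => [prerequisite task names]}
      def initialize(dag)
        @dag      = dag
        @finished = Set.new
        @issued   = Set.new
        @queue    = []
      end

      # search for ready-to-execute tasks and enqueue them
      def refill
        @dag.each do |task, prereqs|
          next if @issued.include?(task)
          if prereqs.all? { |p| @finished.include?(p) }
            @issued << task
            @queue  << task
          end
        end
      end

      # scheduling = which task to pick here (plain FIFO in this sketch)
      def deq
        @queue.shift
      end

      def finish(task)
        @finished << task
        refill
      end
    end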
▶ Thread is annoying…
  ◦ Limited by max user processes (ulimit -u).
  ◦ Hard to find the cause of a deadlock.
  ◦ Which part of the code should be synchronized???
  ◦ Even puts needs to be synchronized.
▶ Fiber is currently used.
  ◦ Most of the time is spent waiting for I/O from worker nodes.
  ◦ Easier coding due to explicit context switches.
▶ But it requires asynchronous I/O (see the sketch below).
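A minimal sketch (an assumption on my part, not Pwrake's actual code) of the Fiber approach: one Fiber per worker connection, resumed from a single IO.select loop, so context switches are explicit and no locking is needed.

    require "fiber"   # for Fiber#alive? on older Rubies

    # io: a worker socket, e.g. opened with TCPSocket.new(host, port)
    def reader_fiber(io)
      Fiber.new do
        until io.eof?
          puts io.gets            # handle one line of output from a worker
          Fiber.yield             # give control back to the select loop
        end
      end
    end

    def run(ios)
      fibers = ios.map { |io| [io, reader_fiber(io)] }.to_h
      until fibers.empty?
        ready, = IO.select(fibers.keys)   # wait until some worker has output
        ready.each do |io|
          f = fibers[io]
          f.resume                        # run the fiber until it yields or ends
          fibers.delete(io) unless f.alive?
        end
      end
    end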
▶ File staging by workflow systems
  ◦ Transfer files to/from worker nodes.
  ◦ Managed by the workflow system.
▶ File sharing with a Distributed File System (DFS)
  ◦ NFS, Lustre, GPFS, Gluster, …
  ◦ We chose the Gfarm file system for Pwrake.
[Figure: three storage architectures compared — NFS, distributed file systems (Lustre, GPFS, etc.), and the Gfarm file system; the captions note concentration of storage, network limitation, and (for Gfarm) scalable performance with local access.]
… by local storage of compute nodes.
▶ Designed for wide-area file sharing
  ◦ across institutes connected through the Internet.
▶ Open-source project by Prof. Tatebe
  ◦ Since 2000.
  ◦ Gfarm ver. 2 since 2007.
  ◦ Current version: 2.6.12
▶ Reference:
  ◦ Osamu Tatebe, Kohei Hiraga, Noriyuki Soda, "Gfarm Grid File System", New Generation Computing, Vol. 28, Issue 3, p. 257, 2010.
Gfarm File System Components
▶ Metadata Server (MDS)
  ◦ Manages inodes and file locations; handles clients' directory lookups.
▶ File System Nodes (FSN)
  ◦ Store file content on local storage.
  ◦ An FSN is also a compute node, so a compute process can access files on local storage.
▶ Client
  ◦ Directory lookup via the MDS, file access to the FSNs.
▶ HPCI
  ◦ http://www.hpci-office.jp/
  ◦ Computational environment connecting the K computer and other supercomputers of research institutions in Japan via SINET5.
▶ NICT Science Cloud
  ◦ http://sc-web.nict.go.jp/
▶ Commercial Uses
  ◦ Active! mail by QUALITIA
    • http://www.qualitia.co.jp/product/am/
▶ File creation speed is limited by DB performance.
  ◦ Use an SSD for the MDS DB storage.
▶ Sequential access performance does not increase.
  ◦ Gfarm does not support network RAID other than level 1 (replication).
  ◦ Use RAID 0/5 for the FSN spool.
▶ These may be improved in the future.
▶ File I/O is a bottleneck.
▶ Data locality is the key.
  ◦ Writing a file: select local storage for the output file.
  ◦ Reading a file: the workflow system assigns the task to the node where the input file exists.
[Figure: tasks writing output files to, and reading input files from, the local storage of their own node]
Locality-Aware Task Queue
▶ enq: put a task into the NodeQueues assigned to its candidate nodes.
▶ deq: get a task from the NodeQueue assigned to the worker thread's node.
▶ Load balancing by deq-ing from another NodeQueue (task stealing); see the sketch below.
[Figure: TaskQueue made of per-node NodeQueues (Node 1–3) plus a RemoteQueue; worker threads enq and deq tasks]
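A minimal sketch (hypothetical class, not Pwrake's implementation) of the locality-aware queue: enq puts a task on the queue of every candidate node; deq prefers the worker's own node and steals from the longest other queue when its own queue is empty.

    class LocalityAwareQueue
      def initialize(nodes)
        @node_queues = nodes.map { |n| [n, []] }.to_h
      end

      # task responds to #candidate_nodes (the nodes holding its input files)
      def enq(task)
        task.candidate_nodes.each { |n| @node_queues[n] << task }
      end

      def deq(node)
        task = @node_queues[node].shift ||
               @node_queues.values.max_by(&:size).shift   # task stealing
        return nil unless task
        # a task may sit in several NodeQueues; remove it everywhere once taken
        @node_queues.each_value { |q| q.delete(task) }
        task
      end
    end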
Locality-aware Scheduling Methods
1. … files are stored.
  ◦ Default of Pwrake.
2. Scheduling based on Graph Partitioning
  ◦ Method using MCGP (Multi-Constraint Graph Partitioning).
  ◦ Publication: M. Tanaka and O. Tatebe, "Workflow Scheduling to Minimize Data Movement Using Multi-constraint Graph Partitioning," CCGrid 2012, p. 65.
Naïve Locality Scheduling
… Location
  ◦ Note: the input files for a task can be stored on multiple nodes.
▶ Method to determine candidate nodes (see the sketch below):
  ◦ Calculate the total size of the task's input files stored on each node.
  ◦ Candidate nodes are those holding more than half of the maximum total size.
[Figure: task t with input files A, B, C spread over Nodes 1–3; nodes whose total input-file size exceeds ½ of the maximum become candidates]
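A minimal sketch (an illustration, not Pwrake's code) of the candidate-node rule above: sum the input-file sizes per node, then keep the nodes holding more than half of the maximum total.

    # file_locations: {filename => [nodes holding a replica]}
    # file_sizes:     {filename => size in bytes}
    def candidate_nodes(input_files, file_locations, file_sizes)
      total = Hash.new(0)
      input_files.each do |f|
        file_locations[f].each { |node| total[node] += file_sizes[f] }
      end
      threshold = total.values.max / 2.0
      total.select { |_node, size| size > threshold }.keys
    end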
Experimental setup:
  File                                2MASS image
  CPU                                 Xeon E5410 (2.3 GHz)
  Main memory                         16 GB
  Network                             GbE
  # of nodes                          8
  Total # of cores                    32
  Data size of each file              2.1 MB or 1.7 MB
  # of input files                    607
  Total data size of input files      1270 MB
  Data I/O size during workflow       ~24 GB
  Total # of tasks (= # of vertices)  3090
At first, all the input files are stored on a single node.
[Figure: elapsed time (sec) for A (Unconcern, i.e. locality-unaware), B (Naïve locality), and C (MCGP); reductions of 31% and 22% are marked. The MCGP result includes the time to solve MCGP (30 ms).]
… and resume an interrupted workflow
  • based on the timestamps of input/output files (see the sketch below).
▶ Worker failure:
  ◦ Policy:
    • The workflow does not stop even after one of the worker nodes fails.
  ◦ Approaches:
    • Automatic file replication by the Gfarm FS.
    • Task retry and worker dropout by Pwrake (ver 2.1).
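The timestamp-based resume matches Rake's file-task semantics; a sketch with hypothetical file names and command:

    # A file task runs only when its output is missing or older than its
    # inputs, so re-running an interrupted workflow redoes only the
    # unfinished tasks.
    file "result.dat" => ["input1.dat", "input2.dat"] do |t|
      sh "combine #{t.prerequisites.join(' ')} > #{t.name}"   # hypothetical command
    end
    # After an interruption, invoking the workflow again skips "result.dat"
    # if it already exists and is newer than both inputs.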
Kill processes and gfsd in a worker node
▶ Killed at 20 sec.
▶ # of cores reduced: 64 → 56; the storage of that node is unavailable after the kill.
▶ The workflow continued successfully and the final result was correct.
[Figure: # of running processes (0–70) vs. time (0–80 sec)]
Subaru Telescope / Hyper Suprime-Cam (HSC)
[Images: HSC exterior and the HSC focal-plane CCDs; credit: NAOJ / HSC project]
▶ Field of view: 1.5 degrees (3× that of Suprime-Cam)
▶ # of CCDs: 116
▶ CCD pixels: 4272 × 2272
▶ Generates ~300 GB of data per night
▶ One of the HSC targets: discovery of supernovae
… science data on a multi-node cluster.
▶ Rake is powerful as a Workflow Definition Language.
▶ The Pwrake workflow system is developed based on Rake and the Gfarm file system.
▶ Study on locality-aware task scheduling.
▶ Fault-tolerance features.
▶ Pwrake & Gfarm use cases:
  ◦ NICT Science Cloud
  ◦ Subaru HSC