
Pwrake: Distributed Workflow Engine based on Rake


Masahiro Tanaka 田中昌宏

September 09, 2016

Transcript

  1. Pwrake: Distributed Workflow Engine based on Rake
    Masahiro TANAKA - 田中昌宏
    Center for Computational Sciences, University of Tsukuba
    筑波大学 計算科学研究センター
    Sep. 9, 2016
    Japan Science and Technology Agency


  2. Masahiro Tanaka
    ▶ Research Fellow at
    ○ Center for Computational Sciences, University of Tsukuba
    ▶ Majored in Astronomy
    ▶ The author of Ruby/NArray since 1999
    ○ Equivalent of NumPy
    ○ Presentation at RubyKaigi 2010 at Tsukuba
    • http://rubykaigi.org/2010/ja/events/83

  3. NArray Progress
    ▶ The name of the new version: Numo::NArray.
    ○ https://github.com/ruby-numo/narray
    ▶ Basic functionality work is almost complete.
    ▶ Future work:
    ○ bindings to numerical libraries.
    ○ bindings to plotting libraries.
    ○ I/O interfaces.
    ○ speedup with SIMD, etc.
    ○ bindings to GPU APIs.
    ○ use cases such as machine learning.
    ○ …
    ▶ Contributions are welcome.

  4. Today's Topic: Pwrake
    ▶ Parallel Workflow extension for Rake
    ▶ Purpose:
    ○ Scientific data processing on a computer cluster.
    ▶ Run Rake tasks concurrently.
    ▶ Execute sh command lines on remote computing nodes.
    Rakefile:
    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end
    Commands executed in parallel by Pwrake:
    cc -o a.o -c a.c
    cc -o b.o -c b.c
    cc -o c.o -c c.c
    https://github.com/masa16/pwrake


  5. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  6. Contents
    ▶Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  7. Workflow Example: Montage
    Combine multiple shots of astronomical images and produce a custom mosaic image.
    http://montage.ipac.caltech.edu/



  9. Example of Scientific Workflow: Montage (astronomy image processing)
    [Diagram: the workflow DAG from the input images through mProjectPP, mDiff+mFitplane, mBGModel, mBackground, mShrink, mAdd, and mJPEG to the output image; each vertex is a task (process) and each edge is an input or output file.]
    A workflow is expressed as a DAG (Directed Acyclic Graph).


  10. Contents
    ▶ Background: Scientific Workflow
    ▶Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  11. Workflow Definition Language
    ▶ Use a markup language (e.g. XML)
    ○ e.g. DAX (for the Pegasus workflow system)
    ○ Necessary to write a script to define many tasks.
    ▶ Design a new language
    ○ e.g. Swift (Wilde et al. 2011) (a different language from Apple's Swift)
    ○ Learning cost, small user community.
    ▶ Use an existing language
    ○ e.g. GXP Make (Taura et al. 2013)


  12. Workflow to Build a Program
    [Diagram of the build DAG: a.c → a.o, b.c → b.o, c.c → c.o; a.o, b.o, and c.o → foo]


  13. Workflow to Build a Program
    [Diagram: the same build DAG, shown next to the Makefile that defines it]
    Makefile (GNU make):
    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))
    all: foo
    %.o : %.c
    	cc -o $@ -c $<
    foo: $(OBJS)
    	cc -o $@ $^


  14. Workflow to Build a Program
    Makefile (GNU make):
    SRCS := $(wildcard *.c)
    OBJS := $(subst .c,.o,$(SRCS))
    all: foo
    %.o : %.c
    	cc -o $@ -c $<
    foo: $(OBJS)
    	cc -o $@ $^
    Rakefile:
    SRCS = FileList["*.c"]
    OBJS = SRCS.ext("o")
    task :default => "foo"
    rule ".o" => ".c" do |x|
      sh "cc -o #{x} -c #{x.source}"
    end
    file "foo" => OBJS do |x|
      sh "cc -o #{x} #{OBJS}"
    end
    A Rakefile is a Ruby script.


  15. Useful features of Rake
    ▶ Ruby Scripting
    ▶ Pathmap

  16. Ruby Scripting enabled by Internal DSL
    ▶ For-Loop
    INPUT = FileList["r/*.fits"]
    OUTPUT = []
    for src in INPUT
      OUTPUT << dst = "p/" + File.basename(src)
      file dst => src do |t|
        sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
      end
    end
    task :default => OUTPUT


  17. Pathmap
    ▶ Replaces a %-format with the specified part of the path name.
    ▶ Applicable to FileList, String, and prerequisites.
    INPUT = FileList["r/*.fits"]
    OUTPUT = INPUT.pathmap("p/%f")
    rule /^p\/.*\.fits$/ => "r/%n.fits" do |t|
      sh "mProjectPP #{t.prerequisites[0]} #{t.name} region.hdr"
    end
    task :default => OUTPUT


  18. Pathmap Examples
    ▶ See the Rake manual for details.
    p 'a/b/c/file.txt'.pathmap("%p") #=> "a/b/c/file.txt"
    p 'a/b/c/file.txt'.pathmap("%f") #=> "file.txt"
    p 'a/b/c/file.txt'.pathmap("%n") #=> "file"
    p 'a/b/c/file.txt'.pathmap("%x") #=> ".txt"
    p 'a/b/c/file.txt'.pathmap("%X") #=> "a/b/c/file"
    p 'a/b/c/file.txt'.pathmap("%d") #=> "a/b/c"
    p 'a/b/c/file.txt'.pathmap("%2d") #=> "a/b"
    p 'a/b/c/file.txt'.pathmap("%-2d") #=> "b/c"
    p 'a/b/c/file.txt'.pathmap("%d%s%{file,out}f")
    #=> "a/b/c/out.txt"
    p 'a/b/c/file.txt'.pathmap("%X%{.*,*}x"){|ext| ext.upcase}
    #=> "a/b/c/file.TXT"


  19. Prerequisite Map by a Block
    ▶ A task may require two files as prerequisites.
    ▶ Useful for defining complex workflows.
    FILEMAP = {"d/d00.fits" => ["p/p00.fits", "p/p01.fits"], ...}
    rule /^d\/.*\.fits$/ => proc{|x| FILEMAP[x]} do |t|
      p1, p2 = t.prerequisites
      sh "mDiff #{p1} #{p2} #{t.name} region.hdr"
    end


  20. Rake as a WfDL
    ▶ Rake is a powerful WfDL for defining complex, many-task scientific workflows.
    ○ Rules
    ○ Pathmap
    ○ Internal DSL
    • For-loops
    • Prerequisite map by a block
    ▶ We use Rake as the WfDL for the Pwrake system.

  21. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  22. Pwrake Structure
    [Diagram: on the master node, the Pwrake master holds the task graph and the task queue; tasks are enq'd and deq'd by a fiber pool according to the scheduling policy, and each fiber drives a communicator connected over SSH to a pwrake worker on a worker node; the workers run sh processes that read and write files on the Gfarm file system.]


  23. Task Queueing
    ▶ Rake
    ○ Depth-first search
    • Equivalent to a topological sort
    • No parallelization
    ▶ Pwrake
    ○ Task queue
    • Search for ready-to-execute tasks and enqueue them.
    • Scheduling = selecting which task to deq (see the sketch below).
    [Diagram: a workflow DAG of tasks A to F, its topological sort A B D C E F, and the task queue holding the ready tasks A, B, C.]
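    A minimal, hypothetical Ruby sketch of this idea (the class and method names are illustrative, not Pwrake's real API): tasks whose prerequisites have all finished are enq'd, and scheduling amounts to choosing which queued task to deq.

    # Hypothetical sketch of ready-task queueing (not Pwrake's actual classes).
    class ToyTaskQueue
      def initialize
        @queue = []
      end

      # enq: add tasks whose prerequisites have all finished
      def enq_ready(dag, finished)
        dag.each do |name, prereqs|
          next if finished.include?(name) || @queue.include?(name)
          @queue << name if prereqs.all? { |p| finished.include?(p) }
        end
      end

      # deq: scheduling = choosing which ready task to run next
      def deq
        @queue.shift   # FIFO here; Pwrake can apply smarter policies
      end
    end

    # An assumed DAG similar to the slide: A, B, C are ready at the start.
    dag = { "A"=>[], "B"=>[], "C"=>[], "D"=>["A","B"], "E"=>["B","C"], "F"=>["D","E"] }
    q = ToyTaskQueue.new
    q.enq_ready(dag, [])   # enqueues A, B, C
    p q.deq                #=> "A"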


  24. Thread vs. Fiber
    ▶ Pwrake was initially implemented using Thread.
    ▶ Threads are annoying:
    ○ Limited by the max user processes (ulimit -u).
    ○ Hard to find the cause of a deadlock.
    ○ Which parts of the code should be synchronized?
    ○ Even puts needs synchronization.
    ▶ Fiber is used now.
    ○ Most of the time is spent waiting for I/O from worker nodes.
    ○ Easier coding thanks to explicit context switches (see the sketch below).
    ▶ But it requires asynchronous I/O.
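    A tiny, generic illustration (plain Ruby, not Pwrake code) of why fibers simplify this: context switches happen only at explicit Fiber.yield / resume points, so no locking is needed around shared state or puts.

    # Cooperative scheduling with Fiber: switches only at explicit points.
    communicators = 3.times.map do |i|
      Fiber.new do
        puts "communicator #{i}: command sent"
        Fiber.yield                  # give control back until the reply is ready
        puts "communicator #{i}: result received"
      end
    end

    # The master decides exactly when each fiber runs; no locks required.
    communicators.each(&:resume)     # each fiber runs up to its yield
    communicators.each(&:resume)     # resume each one to completion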

  25. Asynchronous I/O
    ▶ Bartender (asynchronous I/O) by Seki-san
    ○ https://github.com/seki/bartender
    ○ A single fiber per I/O.
    ▶ Pwrake asynchronous I/O
    ○ Multiple fibers per I/O.
    ○ Timeout handling (see the sketch below).
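    A minimal sketch of fiber-driven asynchronous reads with a timeout, using only core Ruby (Fiber, IO.pipe, IO.select); this illustrates the pattern, not Pwrake's implementation.

    # Fibers suspended on I/O, woken by a reactor step with a timeout.
    r, w = IO.pipe
    waiting = Hash.new { |h, k| h[k] = [] }     # io => fibers waiting on that io

    reader = Fiber.new do
      data = r.read_nonblock(4096, exception: false)
      while data == :wait_readable              # nothing to read yet:
        waiting[r] << reader                    # register and suspend this fiber
        Fiber.yield
        data = r.read_nonblock(4096, exception: false)
      end
      p data                                    #=> "hello"
    end
    reader.resume                               # runs until it suspends on the pipe

    w.write("hello"); w.close

    # One reactor step: wait up to 5 s for any registered IO to become readable.
    ready, = IO.select(waiting.keys, nil, nil, 5)
    if ready
      ready.each { |io| waiting.delete(io).each(&:resume) }
    else
      warn "timeout: no I/O within 5 s"         # timeout handling
    end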

  26. Other Features
    ▶ Task options defined with desc
    ○ ncore, allow, deny, …
    ▶ Logging
    ▶ Report statistics as an HTML page.
    ▶ Output the DAG in Graphviz form.

  27. File Sharing
    ▶ File sharing is necessary for multi-node workflows.
    ▶ File staging by workflow systems
    ○ Transfer files to/from worker nodes.
    ○ Managed by the workflow system.
    ▶ File sharing with a distributed file system (DFS)
    ○ NFS, Lustre, GPFS, Gluster, …
    ○ We chose the Gfarm file system for Pwrake.

  28. Comparison of Network File Systems
    [Diagram: NFS concentrates all files on a single storage server; distributed file systems (Lustre, GPFS, etc.) spread files over separate storage servers but are limited by the network; the Gfarm file system stores files on the local storage of the compute nodes, giving scalable performance through local access.]


  29. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  30. Gfarm File System
    ▶ http://oss-tsukuba.org/software/gfarm
    ▶ A distributed file system built from the local storage of compute nodes.
    ▶ Designed for wide-area file sharing across institutes connected through the Internet.
    ▶ Open-source project led by Prof. Tatebe
    ○ Since 2000.
    ○ Gfarm ver. 2 since 2007.
    ○ Current version: 2.6.12
    ▶ Reference:
    ○ Osamu Tatebe, Kohei Hiraga, Noriyuki Soda, "Gfarm Grid File System", New Generation Computing, 2010, Vol. 28, Issue 3, p. 257.

  31. Gfarm File System Components
    [Diagram: a client looks up the global directory tree (e.g. /dir1/file1, /dir2/file3) through the Metadata Server (MDS), which manages inodes and file locations; file contents are stored on the local storage of the File System Nodes (FSN); an FSN is also a compute node, so a compute process accesses files locally.]


  32. Use Cases of Gfarm
    ▶ HPCI (High Performance Computing Infrastructure)
    ○ http://www.hpci-office.jp/
    ○ A computational environment connecting the K computer and other supercomputers at research institutions in Japan via SINET5.
    ▶ NICT Science Cloud
    ○ http://sc-web.nict.go.jp/
    ▶ Commercial Uses
    ○ Active! mail by QUALITIA
    • http://www.qualitia.co.jp/product/am/

  33. Gfarm Features
    ▶ Scalable Capacity
    ○ By adding FSN
    ○ Commodity hardware
    ▶ Fault Tolerance
    ○ Standby slave MDS
    ○ Automatic file replication (mirroring)
    ▶ High Performance
    ○ Parallel access scales
    ○ Local access

  34. Gfarm Issues
    ▶ The MDS is standalone, not scalable.
    ○ File creation speed is limited by DB performance.
    ○ Use an SSD for the MDS DB storage.
    ▶ Sequential access performance does not increase.
    ○ Gfarm supports no network RAID level other than mirroring (RAID 1 style replication).
    ○ Use RAID 0/5 for the FSN spool.
    ▶ These may be improved in the future.

  35. Gfarm Information Source
    ▶ NPO - OSS Tsukuba
    ○ http://oss-tsukuba.org/
    ▶ Gfarm Symposium/Workshop
    ○ http://oss-tsukuba.org/event
    ○ Next Workshop: Oct 21, 2016 @Kobe
    • http://oss-tsukuba.org/event/gw2016
    ▶ Mailing List
    ○ https://sourceforge.net/p/gfarm/mailman/
    ▶ Paid Support
    ○ http://oss-tsukuba.org/support

  36. Supporting Gfarm by Pwrake
    [Diagram: the Pwrake master node mounts the Gfarm file system (e.g. under /tmp/john/) and uses gfwhere-pipe to find the FSN where each file is stored; on each worker node, the pwrake worker mounts Gfarm with gfarm2fs once per core (e.g. /tmp/pwrake_john_000/, /tmp/pwrake_john_001/), so every process sees the same Rakefile and data files (file01.dat, file02.dat) through its own mount point; the Gfarm MDS provides the shared namespace, and the master's communicator checks whether the worker is on the Gfarm file system.]


  37. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  38. Locality in the Gfarm File System
    ▶ Large scientific data
    ○ File I/O is the bottleneck.
    ▶ Data locality is the key
    ○ Writing a file: select local storage for the output file.
    ○ Reading a file: assign the task to the node where the input file exists (the workflow system's job).
    [Diagram: writes go to the local storage of the node running the task; reads stay local because the task is scheduled onto a node that holds its input file.]


  39. Locality-Aware Task Queue
    ▶ NodeQueue in the TaskQueue: one queue assigned to each worker node.
    ▶ enq: put a task into the NodeQueues assigned to its candidate nodes.
    ▶ deq: a worker thread gets a task from the NodeQueue assigned to its node.
    ▶ Load balancing by deq-ing from another NodeQueue (task stealing; see the sketch below).
    [Diagram: the TaskQueue is composed of a NodeQueue per node (Node 1 to 3) plus a RemoteQueue; tasks are enq'd into candidate NodeQueues and deq'd by the worker threads.]
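    A hypothetical Ruby sketch of per-node queues with task stealing (the class and method names are illustrative, not Pwrake's real API):

    # Illustrative per-node queues with task stealing.
    class ToyLocalityQueue
      def initialize(nodes)
        @nodes = nodes
        @node_q = Hash.new { |h, k| h[k] = [] }   # one NodeQueue per worker node
        @remote_q = []                            # tasks with no candidate node
      end

      # enq: push the task onto the queue of every candidate node
      def enq(task, candidate_nodes)
        if candidate_nodes.empty?
          @remote_q << task
        else
          candidate_nodes.each { |n| @node_q[n] << task }
        end
      end

      # deq: prefer the caller's own NodeQueue, then the remote queue,
      # then steal from the longest other queue for load balancing
      def deq(node)
        task =  @node_q[node].shift
        task ||= @remote_q.shift
        task ||= @nodes.map { |n| @node_q[n] }.max_by(&:size)&.shift
        @node_q.each_value { |q| q.delete(task) } if task   # drop other copies
        task
      end
    end

    q = ToyLocalityQueue.new(["node1", "node2"])
    q.enq("mProjectPP r00.fits", ["node1"])
    p q.deq("node2")   #=> "mProjectPP r00.fits" (stolen from node1's queue)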


  40. Locality-Aware Scheduling Methods
    1. Naïve locality scheduling
    ○ Define "candidate nodes" as the nodes where the input files are stored.
    ○ The default in Pwrake.
    2. Scheduling based on graph partitioning
    ○ A method using MCGP (Multi-Constraint Graph Partitioning).
    ○ Publication: M. Tanaka and O. Tatebe, "Workflow Scheduling to Minimize Data Movement Using Multi-constraint Graph Partitioning," CCGrid 2012, p. 65.


  41. Naïve Locality Scheduling
    ▶ Find the candidate nodes of a task based on its input file locations.
    ○ Note: the input files of one task can be stored on multiple nodes.
    ▶ Method to define the candidate nodes (see the sketch below):
    ○ Calculate the total size of the task's input files stored on each node.
    ○ A node is a candidate if its total exceeds half of the maximum total.
    [Diagram: task t reads files A and B stored on Node 1 and file C replicated on Node 2 and Node 3; nodes whose total input size exceeds half of the maximum are marked as candidates, the others are not.]
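    A small Ruby sketch of this rule, assuming we already know the size and hosting nodes of each input file (the data layout and file sizes below are hypothetical):

    # Candidate nodes: those holding more than half of the maximum
    # per-node total input size.
    def candidate_nodes(input_files)
      total = Hash.new(0)
      input_files.each do |f|
        f[:nodes].each { |n| total[n] += f[:size] }
      end
      threshold = total.values.max / 2.0
      total.select { |_node, size| size > threshold }.keys
    end

    inputs = [
      { name: "A", size: 100, nodes: ["node1"] },
      { name: "B", size: 90,  nodes: ["node2"] },
      { name: "C", size: 30,  nodes: ["node2", "node3"] },
    ]
    p candidate_nodes(inputs)   #=> ["node1", "node2"]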


  42. Graph Partitioning on a DAG
    [Diagram: standard graph partitioning splits the DAG into former tasks and latter tasks across Node-A to Node-D and is not aware of task parallelization; the proposed method using Multi-Constraint Graph Partitioning parallelizes every stage across Node-A to Node-D.]


  43. Platform for Evaluation
    ▶ Cluster used for the evaluation:
    ○ CPU: Xeon E5410 (2.3 GHz)
    ○ Main memory: 16 GB
    ○ Network: GbE
    ○ # of nodes: 8
    ○ Total # of cores: 32
    ▶ Input files: 2MASS images
    ○ Data size of each file: 2.1 MB or 1.7 MB
    ○ # of input files: 607
    ○ Total data size of input files: 1270 MB
    ○ Data I/O size during the workflow: ~24 GB
    ○ Total # of tasks (= # of vertices): 3090
    ▶ At first, all the input files are stored on a single node.


  44. Data Transfer between Nodes
    [Bar chart: ratio of data size transferred between nodes: A (Unconcern) 87.9%, B (Naïve locality) 47.4%, C (MCGP) 14.0%.]


  45. Workflow Execution Time
    [Bar chart: elapsed time (sec) for A (Unconcern), B (Naïve locality), and C (MCGP); the locality-aware schedulers cut the execution time by 22% and 31%, and the MCGP result includes the time to solve MCGP (30 ms).]


  46. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶Fault Tolerance
    ▶ Science Data Processing with Pwrake & Gfarm

  47. Fault Tolerance in Pwrake
    ▶ Master failure:
    ○ Rerun Pwrake and resume the interrupted workflow,
    • based on the timestamps of input/output files (see the sketch below).
    ▶ Worker failure:
    ○ Policy:
    • The workflow does not stop even after one of the worker nodes fails.
    ○ Approaches:
    • Automatic file replication by the Gfarm FS.
    • Task retry and worker dropout by Pwrake (ver 2.1).
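    The resume check is, in essence, the timestamp rule Rake uses for file tasks; here is a minimal stand-alone Ruby sketch (the file names are only examples):

    # A task is skipped when its output exists and is newer than every input.
    def up_to_date?(output, inputs)
      return false unless File.exist?(output)
      inputs.all? { |i| File.mtime(i) <= File.mtime(output) }
    end

    # After a master failure, rerunning the workflow only executes
    # tasks whose outputs are missing or stale.
    if up_to_date?("p/p00.fits", ["r/r00.fits"])
      puts "skip: p/p00.fits is up to date"
    else
      system("mProjectPP r/r00.fits p/p00.fits region.hdr")
    end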

  48. Experiment of Worker Failure
    ▶ Kill the worker processes and gfsd at 20 sec.
    ▶ The number of cores drops from 64 to 56; the killed node's storage is unavailable afterwards.
    ▶ The final result was correct; the workflow continued successfully.
    [Plot: number of running processes vs. time (sec); the processes and gfsd on one worker node are killed at 20 sec, and the remaining workers keep running.]


  49. Contents
    ▶ Background: Scientific Workflow
    ▶ Workflow Definition Language
    ▶ Pwrake Structure
    ▶ Gfarm Distributed File System
    ▶ Locality-Aware Task Scheduling
    ▶ Fault Tolerance
    ▶Science Data Processing with Pwrake & Gfarm
    ○ NICT Science Cloud
    ○ HSC in Subaru Telescope

  50. NICT Science Cloud
    http://sc-web.nict.go.jp/
    Himawari-8 real-time web: http://himawari8.nict.go.jp/
    Presentation at Gfarm Symposium 2015: http://oss-tsukuba.org/event/gs2015


  51. Hyper Suprime-Cam (HSC) on the Subaru Telescope
    [Images: the HSC instrument, the HSC focal-plane CCDs, and the Subaru Telescope. Image credit: NAOJ / HSC project]
    Field of view: 1.5 degrees (3x that of Suprime-Cam)
    # of CCDs: 116
    CCD pixels: 4272×2272
    Generates ~300 GB of data per night
    One of HSC's targets: discovery of supernovae


  52. Conclusion
    ▶ A scientific workflow system is required for processing science data on a multi-node cluster.
    ▶ Rake is powerful as a Workflow Definition Language.
    ▶ The Pwrake workflow system is developed based on Rake and the Gfarm file system.
    ▶ Study on locality-aware task scheduling.
    ▶ Fault-tolerance features.
    ▶ Pwrake & Gfarm use cases:
    ○ NICT Science Cloud
    ○ Subaru HSC