A Ramdisk Provisioning Service for High Performance Data Analysis

Presented at the end of the SIParCS 2011 Internship program at NCAR https://www2.cisl.ucar.edu/siparcs/calendar/2011-07-29/allan-espinosa/ramdisk-provisioning-service-high-performance-data-analys

Data-intensive postprocessing analysis is an important component of climate simulation science. For example, a high-resolution CESM atmosphere simulation workflow produces vast datasets at computing centers on TeraGrid. As the simulation progresses, scientists need to transfer data back to their home institutions to perform data-intensive diagnostics and analysis. Doing so, however, incurs significant disk I/O overhead in transferring and reading the data. Previous work demonstrated that running the analysis with the dataset on a RAM-based file system, rather than a spinning-disk file system, significantly decreases analysis time. We build on this result by adding storage provisioning and data transfer steps to the analysis workflow and integrating these concepts as services on a Linux cluster. We configured the Torque resource manager and the Maui scheduler to provide a special queue that allocates RAM-disk space for users. The analysis cluster’s resource manager can then be used to manage temporary allocations, data transfer, and postprocessing tasks. The final result is an end-to-end data postprocessing pipeline that orchestrates data transfer from TeraGrid supercomputers to NCAR, executes the standard diagnostics, and transfers the data to the tape archive system, all without placing the data on spinning disk.

Allan Espinosa

July 29, 2011

Transcript

  1. A RAM-disk provisioning service for high
    performance data analysis
    Allan Espinosa† ([email protected])
    Mentors: M. Woitaszek and J. Dennis
    †University of Chicago, National Center for Atmospheric Research
    July 29, 2011

  2. Outline
    1 Motivation: data analysis
    2 Approach and challenges
    3 Implementation
    4 Target applications
    5 Conclusions

  3. Motivation: data-intensive post-processing
    [Diagram: simulation results at a computing center move through transfer
    nodes onto the analysis cluster's spinning disk-based parallel file system]

  4. Motivation: data-intensive post-processing
    [Diagram: the transferred data sits on the spinning disk-based parallel
    file system; Analysis 1 reads it and sends results to the tape archive]

  5. Motivation: data-intensive post-processing
    [Diagram: same as the previous slide]

  6. Motivation: data-intensive post-processing
    [Diagram: Analysis 2 joins Analysis 1, each reading from the spinning-disk
    file system and writing to the tape archive]

  7. Motivation: data-intensive post-processing
    [Diagram: Analyses 1 through n each make their own trip through the
    spinning-disk file system and tape archive]

  8. Motivation: data-intensive post-processing
    [Diagram: Analyses 1 through n each round-trip through the spinning-disk
    file system and tape archive]
    Multiple trips to disk are slow

  9. Approach: Run analysis on RAM
    Fast I/O access

  10. Approach: Run analysis on RAM
    Fast I/O access
    tmpfs or formatted /dev/ram
    [Diagram: a single analysis node whose CPUs share one RAM-based disk]
    Problem: Restricted parallelism

  11. Approach: Run analysis on RAM
    Fast I/O access
    tmpfs or formatted /dev/ram
    NFS-exported RAM
    [Diagram: one node's RAM-based disk exported over NFS to CPUs on other nodes]
    Problem: Restricted data size

  12. Approach: Run analysis on RAM
    Fast I/O access
    tmpfs or formatted /dev/ram
    NFS-exported RAM
    Split data over multiple nodes
    [Diagram: two analysis nodes, each holding part of the data on its own RAM-based disk]
    Problem: Requires thorough I/O management

  13. Approach: Run analysis on RAM
    Fast I/O access
    tmpfs or formatted /dev/ram
    NFS-exported RAM
    Split data over multiple nodes
    Lustre parallel RAM file system
    [Diagram: CPUs across all nodes share a single Lustre parallel RAM file system]
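The two single-node variants named on these slides can be sketched as shell commands. This is only an illustration: the size and mount point are made up, not the cluster's actual configuration; `/dev/ram0` requires the kernel ramdisk driver, and all of these commands need root.

```shell
# tmpfs: allocated on demand, capped at the given size; no mkfs needed
mount -t tmpfs -o size=25g tmpfs /mnt/ramdisk

# Formatted /dev/ram: a fixed-size RAM block device carrying a real
# file system (XFS here, matching the later benchmark slides)
mkfs.xfs /dev/ram0
mount /dev/ram0 /mnt/ramdisk
```

A formatted `/dev/ram` pins its full size in memory up front, while tmpfs consumes pages only as files are written; both are limited to one node, which is the restricted-parallelism problem the next approaches address.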

  14. Solution: Automatically-provisioned parallel file system
    [Diagram: a user client submits jobs to the scheduler of the Polynya analysis cluster]

  15. Solution: Automatically-provisioned parallel file system
    [Diagram: a control node joins the scheduler, managing a parallel RAM file system on Polynya]

  16. Solution: Automatically-provisioned parallel file system
    [Diagram: a transfer node attaches to the parallel RAM file system and reaches
    Kraken's transfer node and file system across the WAN]

  17. Solution: Automatically-provisioned parallel file system
    [Diagram: analysis nodes mount the parallel RAM file system alongside the
    control and transfer nodes]

  18. Solution: Automatically-provisioned parallel file system
    [Diagram: an archive node links the parallel RAM file system to the tape
    archive, completing the Polynya cluster: control, transfer, analysis, and
    archive nodes, with a WAN link to Kraken]

  19. Remotely triggering the workflow
    [Diagram: the finished simulation on Kraken triggers the workflow on Polynya]

  20. Remotely triggering the workflow
    [Diagram: the triggered workflow first requests RAM-disk space on Polynya]

  21. Remotely triggering the workflow
    [Diagram: after requesting space, the workflow transfers datasets from Kraken]

  22. Remotely triggering the workflow
    [Diagram: the workflow then runs the analysis and archives the datasets]

  23. Remotely triggering the workflow
    [Diagram: finally, the workflow triggers cleanup of the allocated space]

  24. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource

  25. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource
    Parameters:
    amount of space
    #PBS -W x="GRES:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster ...
    sleep 3h

  26. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource
    Parameters:
    amount of space
    duration of allocation
    #PBS -W x="GRES:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster ...
    sleep 3h

  27. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource
    Parameters:
    amount of space
    duration of allocation
    1 Route to control node
    #PBS -W x="GRES:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster ...
    sleep 3h

  28. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource
    Parameters:
    amount of space
    duration of allocation
    1 Route to control node
    2 Prepare space
    #PBS -W x="GRES:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster ...
    sleep 3h

  29. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource
    Parameters:
    amount of space
    duration of allocation
    1 Route to control node
    2 Prepare space
    3 Sleep until allocation
    expiration
    #PBS -W x="GRES:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster ...
    sleep 3h

  30. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource
    Parameters:
    amount of space
    duration of allocation
    1 Route to control node
    2 Prepare space
    3 Sleep until allocation
    expiration
    4 Email notice before
    expiration
    #PBS -W x="GRES:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster ...
    sleep 3h

  31. Requesting RAM-based disk space
    Implementation: PBS Torque+Maui scheduler generic resource
    Parameters:
    amount of space
    duration of allocation
    1 Route to control node
    2 Prepare space
    3 Sleep until allocation
    expiration
    4 Email notice before
    expiration
    5 Clean up space
    #PBS -W x="GRES:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster ...
    sleep 3h
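Read as a whole, the request job built up across these slides is an ordinary PBS script. The annotation below is our reading of it; whether the 25 in `GRES:ramdisk@25` denotes gigabytes is an assumption, and `allocate.sh`/`cleanup.sh` are the site's prologue and epilogue hooks.

```shell
#!/bin/sh
#PBS -W x="GRES:ramdisk@25"      # amount of space (25 units, presumably GB)
#PBS -l walltime="48:00:00"      # duration of the allocation
#PBS -q ramdisk_service          # special queue, routed to the control node
#PBS -l prologue=allocate.sh     # prologue prepares the RAM-disk space
#PBS -l epilogue=cleanup.sh      # epilogue cleans the space up when the job ends

sleep 45h                        # hold the allocation
mail user@cluster ...            # email notice 3 hours before expiration
sleep 3h                         # grace period, then walltime expires and cleanup runs
```

The two sleeps bracket the mail notice so it fires 3 hours before the 48-hour walltime ends; deleting the job early likewise triggers the epilogue and releases the space.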

  32. Transferring datasets
    Implementation: Route request to transfer nodes
    Striped GridFTP data nodes

  33. Transferring datasets
    Implementation: Route request to transfer nodes
    Striped GridFTP data nodes
    Co-located as RAM-based disk space provider

  34. Transferring datasets
    Implementation: Route request to transfer nodes
    Striped GridFTP data nodes
    Co-located as RAM-based disk space provider
    Other administrative components:
    GridFTP control channel server

  35. Transferring datasets
    Implementation: Route request to transfer nodes
    Striped GridFTP data nodes
    Co-located as RAM-based disk space provider
    Other administrative components:
    GridFTP control channel server
    Key-authenticated SSH∗
    ∗Remote trigger mechanism

  36. Transferring datasets
    Implementation: Route request to transfer nodes
    Striped GridFTP data nodes
    Co-located as RAM-based disk space provider
    Other administrative components:
    GridFTP control channel server
    Key-authenticated SSH∗
    X.509-authenticated GRAM5∗
    ∗Remote trigger mechanism
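A striped transfer into the RAM-based space would be driven with globus-url-copy. The endpoints below are hypothetical placeholders, but the flags are standard and echo the tuning reported on the benchmark slides (32 MB TCP buffer, 16 MB block size, parallel streams):

```shell
# -stripe  uses the striped GridFTP data nodes
# -p       parallel TCP streams per stripe
# -bs      block size; -tcp-bs TCP buffer size
globus-url-copy -stripe -p 4 -bs 16M -tcp-bs 32M \
  gsiftp://gridftp.kraken.example.org/lustre/scratch/user/run1/ \
  gsiftp://gridftp.polynya.example.org/ramdisk/alloc/
```

The control channel goes to the GridFTP control channel server while the data channels land directly on the co-located data nodes, so the bytes never leave RAM-backed storage on the receiving side.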

  37. Example application: AMWG diagnostics
    Compares CESM
    simulation data,
    observational data,
    reanalysis data

  38. Example application: AMWG diagnostics
    Compares CESM
    simulation data,
    observational data,
    reanalysis data
    Parallel implementation in
    Swift∗
    ∗Parallel scripting engine http://www.ci.uchicago.edu/swift

  39. Example application: AMWG diagnostics
    Compares CESM
    simulation data,
    observational data,
    reanalysis data
    Parallel implementation in
    Swift∗
    Parameters:
    dataset name
    number of time
    segments (years)
    ∗Parallel scripting engine http://www.ci.uchicago.edu/swift

  40. Example application: AMWG diagnostics
    Compares CESM
    simulation data,
    observational data,
    reanalysis data
    Parallel implementation in
    Swift∗
    Parameters:
    dataset name
    number of time
    segments (years)
    Dataset volume: 2.8 GB
    per year (1◦ data)
    ∗Parallel scripting engine http://www.ci.uchicago.edu/swift

  41. Data movement benchmarks∗
                   IOR-8    GridFTP to Polynya
    File system    Write†   from Frost  from Kraken
    /dev/null      3,190
    Lustre disk    111
    tmpfs RAM      2,983
    XFS RAM        2,296
    Lustre RAM     2,881
    ∗units in MB/s
    †from D. Duplyakin’s experiments

  42. Data movement benchmarks∗
                   IOR-8    GridFTP to Polynya
    File system    Write†   from Frost  from Kraken
    /dev/null      3,190    139
    Lustre disk    111      113
    tmpfs RAM      2,983    117
    XFS RAM        2,296    125
    Lustre RAM     2,881    134
    ∗units in MB/s
    †from D. Duplyakin’s experiments
    32 MB TCP buffer, 16 MB block size, 4 streams

  43. Data movement benchmarks∗
                   IOR-8    GridFTP to Polynya
    File system    Write†   from Frost  from Kraken
    /dev/null      3,190    139         28
    Lustre disk    111      113         35
    tmpfs RAM      2,983    117         34
    XFS RAM        2,296    125         35
    Lustre RAM     2,881    134         36
    ∗units in MB/s
    †from D. Duplyakin’s experiments
    32 MB TCP buffer, 16 MB block size, 16 streams

  44. Data movement benchmarks∗
                   IOR-8    GridFTP to Polynya
    File system    Write†   from Frost  from Kraken
    /dev/null      3,190    139         28
    Lustre disk    111      113         35
    tmpfs RAM      2,983    117         34
    XFS RAM        2,296    125         35
    Lustre RAM     2,881    134         36
    GridFTP from Kraken to Frost: 216 MB/s
    ∗units in MB/s
    †from D. Duplyakin’s experiments
    32 MB TCP buffer, 16 MB block size, 16 streams

  45. Application performance
    Ran on a 64-CPU node with a 2-year time segment (8.2 GB total)
    File system    Runtime (s)
    Lustre disk    213
    tmpfs RAM      29
    XFS RAM        29
    Lustre RAM     70

  46. Application performance
    [Chart: stacked data-transfer and AMWG-analysis times from Frost for
    Lustre RAM, XFS RAM, tmpfs RAM, and Lustre disk; time axis 0 to 250 s]

  47. End-to-end workflow
    [Timeline chart: the "Request space" step plotted against time]

  48. End-to-end workflow
    [Timeline chart: "Transfer" follows "Request space"]

  49. End-to-end workflow
    [Timeline chart: same as the previous slide]

  50. End-to-end workflow
    [Timeline chart: "Analysis 1" through "Analysis n" run after the transfer,
    followed by "Archive"]

  51. End-to-end workflow
    [Timeline chart: same as the previous slide]

  52. End-to-end workflow
    [Timeline chart: "Cleanup" concludes the workflow after archiving]

  53. End-to-end workflow
    [Timeline chart: same as the previous slide]

  54. Other use case: Interactive jobs
    Automated workflow split component-wise

  55. Other use case: Interactive jobs
    Automated workflow split component-wise
    Each step is run by the user manually

  56. Other use case: Interactive jobs
    Automated workflow split component-wise
    Each step is run by the user manually
    Steps:
    1 Request space
    2 Transfer data to the allocated space (globus-url-copy or Globus Online)
    3 Run analysis on the allocated space
    4 Receive email notice before expiration
    5 Clean up by deleting the request job
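Under the queue and resource names from the earlier slides (job scripts, paths, and the job ID below are illustrative), the manual sequence might look like:

```shell
# 1. Request space: submit the allocation job and note its ID
qsub -q ramdisk_service -W x="GRES:ramdisk@25" -l walltime=48:00:00 request.sh

# 2. Stage data into the allocated space (globus-url-copy or Globus Online)
globus-url-copy -p 4 \
  gsiftp://remote.example.org/data/run1/ \
  gsiftp://polynya.example.org/ramdisk/alloc/

# 3. Run the analysis against the allocated space
# 4. An email notice arrives before the allocation expires
# 5. Clean up early by deleting the request job
qdel 12345.polynya
```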

  57. Conclusions
    End-to-end analysis platform without touching spinning disk

  58. Conclusions
    End-to-end analysis platform without touching spinning disk
    Access through the familiar PBS interface

  59. Conclusions
    End-to-end analysis platform without touching spinning disk
    Access through the familiar PBS interface
    Workflow automation to drive analysis

  60. Conclusions
    End-to-end analysis platform without touching spinning disk
    Access through the familiar PBS interface
    Workflow automation to drive analysis
    Network bandwidth critical to performance

  61. Conclusions
    End-to-end analysis platform without touching spinning disk
    Access through the familiar PBS interface
    Workflow automation to drive analysis
    Network bandwidth critical to performance
    Future work:
    Tune network for high-performance data movement

  62. Conclusions
    End-to-end analysis platform without touching spinning disk
    Access through the familiar PBS interface
    Workflow automation to drive analysis
    Network bandwidth critical to performance
    Future work:
    Tune network for high-performance data movement
    Application-perspective file system scalability

  63. Conclusions
    End-to-end analysis platform without touching spinning disk
    Access through the familiar PBS interface
    Workflow automation to drive analysis
    Network bandwidth critical to performance
    Future work:
    Tune network for high-performance data movement
    Application-perspective file system scalability
    Explore the framework on other resources: disk, bandwidth, etc.

  64. Questions?
    A RAM-disk provisioning service for high
    performance data analysis
    Allan Espinosa† ([email protected])
    Mentors: M. Woitaszek and J. Dennis
    †University of Chicago, National Center for Atmospheric Research
    July 29, 2011