
A Ramdisk Provisioning Service for High Performance Data Analysis


Presented at the end of the SIParCS 2011 internship program at NCAR. https://www2.cisl.ucar.edu/siparcs/calendar/2011-07-29/allan-espinosa/ramdisk-provisioning-service-high-performance-data-analys

Data-intensive postprocessing analysis is an important component of climate simulation science. For example, a high-resolution CESM atmosphere simulation workflow produces vast datasets on TeraGrid computing centers. As the simulation progresses, scientists need to transfer data back to their home institutions to perform data-intensive diagnostics and analysis, but doing so incurs significant disk I/O overhead in transferring and reading the data. Previous work demonstrated that running the analysis with the dataset on a RAM-based file system, rather than a spinning-disk file system, significantly decreases analysis time. We build on this result by adding storage provisioning and data transfer steps to the analysis workflow and integrating these concepts as services on a Linux cluster. We configured the Torque resource manager and the Maui scheduler to provide a special queue that allocates RAM-disk space for users. The analysis cluster's resource manager can then be used to manage temporary allocations, data transfers, and postprocessing tasks. The final result is an end-to-end data postprocessing pipeline that orchestrates data transfer from TeraGrid supercomputers to NCAR, executes the standard diagnostics, and transfers the data to the tape archive system, all without placing the data on spinning disk.


Allan Espinosa

July 29, 2011

Transcript

  1. A RAM-disk provisioning service for high performance data analysis

     Allan Espinosa† (aespinosa@cs.uchicago.edu) Mentors: M. Woitaszek and J. Dennis †University of Chicago, National Center for Atmospheric Research July 29, 2011 1 / 64
  2. Outline

     1. Motivation: data analysis
     2. Approach and challenges
     3. Implementation
     4. Target applications
     5. Conclusions 2 / 64
  8. Motivation: data-intensive post-processing

     [Diagram: simulation results at the computing center pass through transfer nodes onto a spinning disk-based parallel file system, which feeds Analysis 1 through Analysis n on the analysis cluster and the tape archive.] Multiple trips to disk are slow. 8 / 64
  13. Approach: Run analysis on RAM for fast I/O access

     Candidate designs: tmpfs or a formatted /dev/ram on a single analysis node (problem: restricted parallelism); NFS-exported RAM (problem: restricted data size); data split over multiple nodes' RAM disks (problem: requires thorough I/O management); and finally a Lustre parallel RAM file system spanning the analysis nodes. 13 / 64
  18. Solution: Automatically-provisioned parallel file system

     [Diagram: on the Polynya analysis cluster, the user's client submits jobs to the scheduler; a control node, a transfer node, analysis nodes, and an archive node share the parallel RAM file system; the transfer node reaches Kraken's file system over the WAN via Kraken's transfer node, and the archive node writes to the tape archive.] 18 / 64
  23. Remotely triggering the workflow

     When the simulation finishes on Kraken, it triggers the workflow on Polynya; the workflow requests space, transfers the datasets, runs the analysis, archives the datasets, and triggers cleanup. 23 / 64
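The remote trigger can be sketched as a key-authenticated SSH call made at the end of the simulation job on Kraken. This is only a sketch: the hostnames, key path, workflow script name, and dataset path below are all hypothetical, since the slides name only the mechanism.

```shell
#!/bin/sh
# Sketch of the remote trigger, run as the last step of the simulation
# job on Kraken. Hostname, key path, script name, and dataset path are
# assumptions; the slides specify only "key-authenticated SSH".
ssh -i "$HOME/.ssh/trigger_key" user@polynya.example.edu \
    "~/ramdisk-workflow/run_workflow.sh /lustre/scratch/user/cesm_run_01"
```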
  31. Requesting RAM-based disk space

     Implementation: a PBS Torque + Maui scheduler generic resource. Parameters: amount of space and duration of the allocation. The service (1) routes the job to the control node, (2) prepares the space, (3) sleeps until the allocation expires, (4) emails a notice before expiration, and (5) cleans up the space.

     #PBS -W x="GRES:ramdisk@25"
     #PBS -l walltime="48:00:00"
     #PBS -q ramdisk_service
     #PBS -l prologue=allocate.sh
     #PBS -l epilogue=cleanup.sh
     sleep 45h
     mail user@cluster ...
     sleep 3h 31 / 64
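The slides name `allocate.sh` and `cleanup.sh` as the prologue/epilogue pair but show no contents, so the following is purely a sketch of what they might do on a storage node: everything here (tmpfs as the backing store, the mount point, and the meaning of `ramdisk@25` as a size in GB) is an assumption. Torque passes the job id and user name as the first two prologue arguments.

```shell
#!/bin/sh
# allocate.sh (sketch): prologue for the ramdisk_service queue.
# $1 = job id, $2 = job owner (Torque prologue argument convention).
ALLOC_DIR=/ramdisk/$1
SIZE_GB=25   # assumed interpretation of the GRES:ramdisk@25 request

mkdir -p "$ALLOC_DIR"
mount -t tmpfs -o size=${SIZE_GB}g tmpfs "$ALLOC_DIR"
chown "$2" "$ALLOC_DIR"

# cleanup.sh (sketch): epilogue; tear the allocation back down.
# umount "/ramdisk/$1" && rmdir "/ramdisk/$1"
```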
  36. Transferring datasets

     Implementation: requests are routed to the transfer nodes, which run striped GridFTP data nodes co-located with the RAM-based disk space providers. Other administrative components: a GridFTP control channel server, key-authenticated SSH∗, and X509-authenticated GRAM5∗. ∗Remote trigger mechanism 36 / 64
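A manual transfer using the tuned parameters reported in the benchmark slides (32 MB TCP buffer, 16 MB block size, 16 streams) might look like the following; the endpoint hostnames and paths are hypothetical.

```shell
# Striped, parallel GridFTP transfer into the RAM file system (sketch;
# endpoint hostnames and paths are hypothetical).
# -stripe: use the striped data nodes; -p 16: 16 parallel streams;
# -tcp-bs 33554432: 32 MB TCP buffer; -bs 16777216: 16 MB block size.
globus-url-copy -stripe -p 16 -tcp-bs 33554432 -bs 16777216 \
    gsiftp://gridftp.kraken.example.org/lustre/scratch/user/run01/ \
    gsiftp://transfer.polynya.example.edu/ramdisk/alloc_1234/run01/
```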
  40. Example application: AMWG diagnostics

     Compares CESM simulation data against observational and reanalysis data. Parallel implementation in Swift∗. Parameters: dataset name and number of time segments (years). Dataset volume: 2.8 GB per year (1° data). ∗Parallel scripting engine, http://www.ci.uchicago.edu/swift 40 / 64
  44. Data movement benchmarks∗

     File system   IOR-8 Write†   GridFTP to Polynya
                                  from Frost   from Kraken
     /dev/null     3,190          139          28
     Lustre disk   111            113          35
     tmpfs RAM     2,983          117          34
     XFS RAM       2,296          125          35
     Lustre RAM    2,881          134          36

     GridFTP from Kraken to Frost: 216 MB/s
     ∗units in MB/s   †from D. Duplyakin's experiments
     32 MB TCP buffer, 16 MB block size, 16 streams 44 / 64
  45. Application performance

     Ran on a 64-CPU node with a 2-year time segment (8.2 GB total).

     File system   Runtime (s)
     Lustre disk   213
     tmpfs RAM     29
     XFS RAM       29
     Lustre RAM    70
     45 / 64
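To put the runtimes in perspective, the effective data rates and speedups implied by the table can be computed directly from the figures on the slides (assuming 1 GB = 1024 MB):

```shell
# Effective analysis data rate and speedup over spinning disk for the
# 2-year, 8.2 GB AMWG run, using the runtimes from the table above.
awk 'BEGIN {
  gb_mb = 8.2 * 1024          # dataset size in MB
  printf "Lustre disk: %6.1f MB/s\n", gb_mb / 213
  printf "tmpfs RAM:   %6.1f MB/s (%.1fx speedup)\n", gb_mb / 29, 213 / 29
  printf "Lustre RAM:  %6.1f MB/s (%.1fx speedup)\n", gb_mb / 70, 213 / 70
}'
```

So even the slowest RAM option (Lustre RAM) triples throughput over the spinning-disk Lustre run, while tmpfs delivers roughly a 7x speedup.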
  46. Application performance

     [Chart: for data from Frost, stacked data transfer + AMWG analysis time (0–250 s) for Lustre disk, tmpfs RAM, XFS RAM, and Lustre RAM.] 46 / 64
  53. End-to-end workflow

     [Timeline over time (s): request space, then transfer, Analysis 1 … Analysis n, archive, and cleanup.] 53 / 64
  56. Other use case: Interactive jobs

     The automated workflow is split component-wise, and each step is run by the user manually: (1) request space; (2) transfer data to the allocated space (globus-url-copy or Globus Online); (3) run the analysis on the allocated space; (4) receive an email notice before expiration; (5) clean up by deleting the request job. 56 / 64
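An interactive session might look like the following sketch. The queue name and PBS directives come from the slides; the endpoint hostnames, paths, and analysis script name are hypothetical.

```shell
# 1. Request RAM-disk space (request.pbs holds the #PBS directives
#    shown earlier); keep the returned job id for cleanup.
JOBID=$(qsub request.pbs)

# 2. Transfer data into the allocated space (endpoints and paths are
#    hypothetical).
globus-url-copy -p 16 \
    gsiftp://gridftp.kraken.example.org/scratch/user/run01/ \
    "file:///ramdisk/$USER/run01/"

# 3. Run the analysis against the RAM file system (script name is an
#    assumption).
./run_amwg_diagnostics.sh "/ramdisk/$USER/run01"

# 4. An email notice arrives before the allocation expires.

# 5. Clean up by deleting the request job, which fires the epilogue.
qdel "$JOBID"
```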
  63. Conclusions

     An end-to-end analysis platform without touching spinning disk; access through the familiar PBS interface; workflow automation to drive the analysis; network bandwidth is critical to performance. Future work: tune the network for high performance data movement; application-perspective file system scalability; explore the framework on other resources (disk, bandwidth, etc.). 63 / 64
  64. Questions? 64 / 64