
A Ramdisk Provisioning Service for High Performance Data Analysis


Presented at the end of the SIParCS 2011 internship program at NCAR. https://www2.cisl.ucar.edu/siparcs/calendar/2011-07-29/allan-espinosa/ramdisk-provisioning-service-high-performance-data-analys

Data-intensive postprocessing analysis is an important component of climate simulation science. For example, a high-resolution CESM atmosphere simulation workflow produces vast datasets on TeraGrid computing centers. As the simulation progresses, scientists need to transfer data back to their home institutions to perform data-intensive diagnostics and analysis, but doing so incurs significant disk I/O overhead in transferring and reading the data. Previous work demonstrated that running the analysis with the dataset on a RAM-based file system, rather than a spinning-disk file system, significantly decreases analysis time. We build on this result by adding storage provisioning and data transfer steps to the analysis workflow and integrating these concepts as services on a Linux cluster. We configured the Torque resource manager and the Maui scheduler to provide a special queue that allocates RAM-disk space for users. The analysis cluster's resource manager can then be used to manage temporary allocations, data transfers, and postprocessing tasks. The final result is an end-to-end data postprocessing pipeline that orchestrates data transfer from TeraGrid supercomputers to NCAR, executes the standard diagnostics, and transfers the data to the tape archive system, all without placing the data on spinning disk.


Allan Espinosa

July 29, 2011

Transcript

  1. A RAM-disk provisioning service for high performance data analysis

     Allan Espinosa† (aespinosa@cs.uchicago.edu) Mentors: M. Woitaszek and J. Dennis †University of Chicago, National Center for Atmospheric Research July 29, 2011 1 / 64
  2. Outline

     1. Motivation: data analysis
     2. Approach and challenges
     3. Implementation
     4. Target applications
     5. Conclusions 2 / 64
  8. Motivation: data-intensive post-processing

     [Diagram: simulation results at the computing center pass through transfer nodes onto a spinning disk-based parallel file system, which feeds Analysis 1 through Analysis n on the analysis cluster and the tape archive.] Multiple trips to disk are slow. 8 / 64
  13. Approach: Run analysis on RAM for fast I/O access

     Candidate designs: tmpfs or a formatted /dev/ram on a single analysis node (problem: restricted parallelism); NFS-exported RAM (problem: restricted data size); data split over multiple nodes' RAM disks (problem: requires thorough I/O management); and finally a Lustre parallel RAM file system spanning the analysis nodes. 13 / 64
  18. Solution: Automatically-provisioned parallel file system

     [Diagram: on the Polynya analysis cluster, the user's client submits jobs to the scheduler; a control node, a transfer node, analysis nodes, and an archive node share the parallel RAM file system; the transfer node reaches Kraken's file system over the WAN via Kraken's transfer node, and the archive node writes to the tape archive.] 18 / 64
  23. Remotely triggering the workflow

     When the simulation finishes on Kraken, it triggers the workflow on Polynya; the workflow requests space, transfers the datasets, runs the analysis, archives the datasets, and triggers cleanup. 23 / 64
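The remote trigger can be sketched as a key-authenticated SSH call made at the end of the simulation job on Kraken. This is only a sketch: the hostnames, key path, workflow script name, and dataset path below are all hypothetical, since the slides name only the mechanism.

```shell
#!/bin/sh
# Sketch of the remote trigger, run as the last step of the simulation
# job on Kraken. Hostname, key path, script name, and dataset path are
# assumptions; the slides specify only "key-authenticated SSH".
ssh -i "$HOME/.ssh/trigger_key" user@polynya.example.edu \
    "~/ramdisk-workflow/run_workflow.sh /lustre/scratch/user/cesm_run_01"
```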
  31. Requesting RAM-based disk space

     Implementation: a PBS Torque + Maui scheduler generic resource. Parameters: amount of space and duration of the allocation. The service (1) routes the job to the control node, (2) prepares the space, (3) sleeps until the allocation expires, (4) emails a notice before expiration, and (5) cleans up the space.

     #PBS -W x="GRES:ramdisk@25"
     #PBS -l walltime="48:00:00"
     #PBS -q ramdisk_service
     #PBS -l prologue=allocate.sh
     #PBS -l epilogue=cleanup.sh
     sleep 45h
     mail user@cluster ...
     sleep 3h 31 / 64
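The slides name `allocate.sh` and `cleanup.sh` as the prologue/epilogue pair but show no contents, so the following is purely a sketch of what they might do on a storage node: everything here (tmpfs as the backing store, the mount point, and the meaning of `ramdisk@25` as a size in GB) is an assumption. Torque passes the job id and user name as the first two prologue arguments.

```shell
#!/bin/sh
# allocate.sh (sketch): prologue for the ramdisk_service queue.
# $1 = job id, $2 = job owner (Torque prologue argument convention).
ALLOC_DIR=/ramdisk/$1
SIZE_GB=25   # assumed interpretation of the GRES:ramdisk@25 request

mkdir -p "$ALLOC_DIR"
mount -t tmpfs -o size=${SIZE_GB}g tmpfs "$ALLOC_DIR"
chown "$2" "$ALLOC_DIR"

# cleanup.sh (sketch): epilogue; tear the allocation back down.
# umount "/ramdisk/$1" && rmdir "/ramdisk/$1"
```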
  36. Transferring datasets

     Implementation: requests are routed to the transfer nodes, which run striped GridFTP data nodes co-located with the RAM-based disk space providers. Other administrative components: a GridFTP control channel server, key-authenticated SSH∗, and X509-authenticated GRAM5∗. ∗Remote trigger mechanism 36 / 64
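A manual transfer using the tuned parameters reported in the benchmark slides (32 MB TCP buffer, 16 MB block size, 16 streams) might look like the following; the endpoint hostnames and paths are hypothetical.

```shell
# Striped, parallel GridFTP transfer into the RAM file system (sketch;
# endpoint hostnames and paths are hypothetical).
# -stripe: use the striped data nodes; -p 16: 16 parallel streams;
# -tcp-bs 33554432: 32 MB TCP buffer; -bs 16777216: 16 MB block size.
globus-url-copy -stripe -p 16 -tcp-bs 33554432 -bs 16777216 \
    gsiftp://gridftp.kraken.example.org/lustre/scratch/user/run01/ \
    gsiftp://transfer.polynya.example.edu/ramdisk/alloc_1234/run01/
```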
  40. Example application: AMWG diagnostics

     Compares CESM simulation data against observational and reanalysis data. Parallel implementation in Swift∗. Parameters: dataset name and number of time segments (years). Dataset volume: 2.8 GB per year (1° data). ∗Parallel scripting engine, http://www.ci.uchicago.edu/swift 40 / 64
  44. Data movement benchmarks∗

     File system   IOR-8 Write†   GridFTP to Polynya
                                  from Frost   from Kraken
     /dev/null     3,190          139          28
     Lustre disk   111            113          35
     tmpfs RAM     2,983          117          34
     XFS RAM       2,296          125          35
     Lustre RAM    2,881          134          36

     GridFTP from Kraken to Frost: 216 MB/s
     ∗units in MB/s   †from D. Duplyakin's experiments
     32 MB TCP buffer, 16 MB block size, 16 streams 44 / 64
  45. Application performance

     Ran on a 64-CPU node with a 2-year time segment (8.2 GB total).

     File system   Runtime (s)
     Lustre disk   213
     tmpfs RAM     29
     XFS RAM       29
     Lustre RAM    70
     45 / 64
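To put the runtimes in perspective, the effective data rates and speedups implied by the table can be computed directly from the figures on the slides (assuming 1 GB = 1024 MB):

```shell
# Effective analysis data rate and speedup over spinning disk for the
# 2-year, 8.2 GB AMWG run, using the runtimes from the table above.
awk 'BEGIN {
  gb_mb = 8.2 * 1024          # dataset size in MB
  printf "Lustre disk: %6.1f MB/s\n", gb_mb / 213
  printf "tmpfs RAM:   %6.1f MB/s (%.1fx speedup)\n", gb_mb / 29, 213 / 29
  printf "Lustre RAM:  %6.1f MB/s (%.1fx speedup)\n", gb_mb / 70, 213 / 70
}'
```

So even the slowest RAM option (Lustre RAM) triples throughput over the spinning-disk Lustre run, while tmpfs delivers roughly a 7x speedup.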
  46. Application performance

     [Chart: for data from Frost, stacked data transfer + AMWG analysis time (0–250 s) for Lustre disk, tmpfs RAM, XFS RAM, and Lustre RAM.] 46 / 64
  53. End-to-end workflow

     [Timeline over time (s): request space, then transfer, Analysis 1 … Analysis n, archive, and cleanup.] 53 / 64
  56. Other use case: Interactive jobs

     The automated workflow is split component-wise, and each step is run by the user manually: (1) request space; (2) transfer data to the allocated space (globus-url-copy or Globus Online); (3) run the analysis on the allocated space; (4) receive an email notice before expiration; (5) clean up by deleting the request job. 56 / 64
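An interactive session might look like the following sketch. The queue name and PBS directives come from the slides; the endpoint hostnames, paths, and analysis script name are hypothetical.

```shell
# 1. Request RAM-disk space (request.pbs holds the #PBS directives
#    shown earlier); keep the returned job id for cleanup.
JOBID=$(qsub request.pbs)

# 2. Transfer data into the allocated space (endpoints and paths are
#    hypothetical).
globus-url-copy -p 16 \
    gsiftp://gridftp.kraken.example.org/scratch/user/run01/ \
    "file:///ramdisk/$USER/run01/"

# 3. Run the analysis against the RAM file system (script name is an
#    assumption).
./run_amwg_diagnostics.sh "/ramdisk/$USER/run01"

# 4. An email notice arrives before the allocation expires.

# 5. Clean up by deleting the request job, which fires the epilogue.
qdel "$JOBID"
```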
  63. Conclusions

     An end-to-end analysis platform without touching spinning disk; access through the familiar PBS interface; workflow automation to drive the analysis; network bandwidth is critical to performance. Future work: tune the network for high performance data movement; application-perspective file system scalability; explore the framework on other resources (disk, bandwidth, etc.). 63 / 64
  64. Questions? 64 / 64