Presented at the end of the SIParCS 2011 internship program at NCAR: https://www2.cisl.ucar.edu/siparcs/calendar/2011-07-29/allan-espinosa/ramdisk-provisioning-service-high-performance-data-analys
Data-intensive postprocessing analysis is an important component of climate simulation science. For example, a high-resolution CESM atmosphere simulation workflow produces vast datasets at computing centers on TeraGrid. As the simulation progresses, scientists need to transfer data back to their home institutions to perform data-intensive diagnostics and analysis. However, transferring and reading the data incurs significant disk I/O overhead. Previous work demonstrated that running the analysis with the dataset on a RAM-based file system, rather than a spinning-disk file system, significantly decreases analysis time. We build on this result by adding storage provisioning and data transfer steps to the analysis workflow and integrating these concepts as services on a Linux cluster. We configured the Torque resource manager and the Maui scheduler to provide a special queue that allocates RAM-disk space for users. The analysis cluster’s resource manager can then be used to manage temporary allocations, data transfer, and postprocessing tasks. The final result is an end-to-end data postprocessing pipeline that orchestrates data transfer from TeraGrid supercomputers to NCAR, executes the standard diagnostic, and transfers the data to the tape archive system, all without placing the data on spinning disk.
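The sequence of steps in the pipeline (allocate temporary space, stage the data, run the diagnostic, archive, release the space) can be illustrated with a minimal Python sketch. This is not the Torque/Maui configuration described above. The function `run_pipeline` and its parameters `ramdisk_root`, `archive`, and `analyze` are hypothetical names introduced for illustration, and plain directory copies stand in for the wide-area transfer and the tape archive transfer.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_pipeline(source: Path, ramdisk_root: Path, archive: Path,
                 analyze: Callable[[Path], str]) -> str:
    """Stage `source` into a temporary allocation under `ramdisk_root`,
    run `analyze` on the staged copy, archive it, then release the space.

    All names here are hypothetical; copies stand in for real transfers.
    """
    # Temporary allocation: a private directory on the RAM-backed file system.
    workdir = Path(tempfile.mkdtemp(prefix="alloc-", dir=ramdisk_root))
    try:
        staged = workdir / source.name
        # Stand-in for the data transfer from the remote computing center.
        shutil.copytree(source, staged)
        # Stand-in for the standard diagnostic, reading only the staged copy.
        result = analyze(staged)
        # Stand-in for the transfer to the tape archive system.
        shutil.copytree(staged, archive / source.name)
        return result
    finally:
        # Release the allocation even if a step above failed.
        shutil.rmtree(workdir, ignore_errors=True)
```

On a Linux system, `ramdisk_root` could point at a tmpfs mount (for example `/dev/shm`, where available) so that the staged copy never touches spinning disk. In the setup described above, the allocation and cleanup steps would be handled by the resource manager's special queue rather than by the script itself.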