Overview of LSST computing activities at CC-IN2P3
Fabio Hernandez, [email protected]
LSST2015 Project and Community Workshop, Bremerton, August 2015
IN2P3, the National Institute of Nuclear Physics and Particle Physics
◦ Part of CNRS, the largest publicly funded research organisation in France
  2500 researchers, engineers and technicians
  25 laboratories and research platforms in France, 16 international laboratories
  40 major international projects
  218 M€ annual budget
◦ Mission: to promote and unify research activities in the field of subatomic physics
◦ Research topics: matter's most elementary constituents and fundamental interactions, the structure of nuclear matter, the Universe's composition and behaviour; theory, instrumentation, accelerator R&D, computing
◦ IN2P3 is contributing to two subsystems of LSST: camera construction and data management
◦ I will focus on data management
CC-IN2P3: shared computing facility supporting the institute's research program
◦ ~70 experiments in high energy physics, nuclear physics and astroparticle physics
◦ 81 people, 7.5 M€ annual equipment budget (~11 M€ overall)
◦ 2 machine rooms, 1.6 MW, 1700 m²
◦ High throughput computing, scientific data center
◦ Well connected to national and international networks
◦ Operated in partnership with Irfu, the research institute of France's Atomic Energy Commission
Memorandum of Agreement signed between LSST Corp., NCSA and IN2P3
◦ Level 2 data processing to be jointly performed by NCSA and CC-IN2P3
  NCSA has lead responsibility for LSST data release processing
  CC-IN2P3 will process 50% of the data and store the full dataset, both raw and derived data
  NCSA and CC-IN2P3 will each validate the data produced by the other party, and each will end up with a full data release
◦ First meeting between NCSA and CC-IN2P3 held in Lyon last May: https://indico.in2p3.fr/event/11723
Background
◦ Working on computing for high energy physics research for 20+ years: software development for scientific data management (transfer, cataloguing, mass storage, …) and operation of IT services for scientific research
◦ Spent 10 years helping plan, prototype, deploy and operate the worldwide grid computing infrastructure for the Large Hadron Collider (LHC); technical leader of the French contribution to the LHC computing grid
◦ Spent 5 years as deputy director of CC-IN2P3
◦ Spent 4 years in Beijing (China) as senior visiting scientist at the Institute of High Energy Physics and as technical expert for the French embassy in China
◦ Now LSST CC-IN2P3 project leader
◦ Goals
  provide Qserv developers with a realistic testbed platform for developing, integrating, operating and validating the software
  exercise the system with realistic use cases (e.g. CFHT data processing using LSST software) and provide feedback to developers
◦ Aggregated capacity: 400 CPU cores, 800 GB memory, 500 TB disk
  composed of 50 boxes provided by Dell in the framework of an institutional partnership with CC-IN2P3
  plus virtual machine(s) for building, packaging and deploying
◦ The Qserv development team now routinely works with this platform
◦ Goal
  provide a ready-to-use, cloud-based distribution of LSST stable releases, to be used by both individual users and computing centres on compute nodes, virtual machines and application containers
◦ Benefits
  lowers the barrier to entry for end users trying the LSST software: no need to install every new release
  allows both individuals and computing centres to use an identical software distribution, for reproducibility purposes
  self-contained software stack: all the required software is included in each release
  available to the community worldwide
◦ Experimentation started in October 2014 using CERN's CernVM FS
  used routinely in production by all LHC experiments across the world: 161 sites, 140k CPUs
◦ Principles of operation of CernVM FS
  client-server architecture: the server hosts the set of files composing each LSST official release; the client exposes a synthesised file system with the contents of the release (e.g. LSST v10.1 contains 130k files and directories, 5 GB)
◦ Client-server interaction uses standard HTTP
  standard HTTP proxies can be used to reduce latency
◦ Details (illustrated by the toy model below)
  the client downloads and caches the entire preprocessed file catalogue of each release at mount time, and the contents of each file on demand (i.e. on open)
  all metadata requests are served by the client using the file catalogue cached on local disk
  the client exposes files and directories in read-only mode
  the file system can be mounted at boot time or on demand (via autofs on Linux)
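The following toy model is not CernVM FS code; it only illustrates, under an assumed repository URL, proxy address and catalogue layout, the access pattern described above: metadata lookups are answered from a locally cached catalogue, while file contents are fetched over plain HTTP (optionally through a proxy) when a file is opened.

```python
# Toy illustration only: the repository URL, proxy address and catalogue layout
# are assumptions made up for this sketch, not part of CernVM FS.
import json
import urllib.request

REPO_URL = "http://cvmfs-server.example.org/lsst"   # hypothetical server
PROXY = {"http": "http://squid.example.org:3128"}   # optional local HTTP proxy

class ToyCvmfsClient:
    def __init__(self, repo_url, proxy=None):
        self.repo_url = repo_url
        handlers = [urllib.request.ProxyHandler(proxy)] if proxy else []
        self.opener = urllib.request.build_opener(*handlers)
        # Downloaded once at "mount" time: maps file paths to content hashes.
        with self.opener.open(f"{repo_url}/catalog.json") as resp:
            self.catalog = json.load(resp)

    def stat(self, path):
        # Metadata requests never hit the network: the cached catalogue is enough.
        return self.catalog[path]

    def open(self, path):
        # File contents are fetched over HTTP on demand, addressed by content hash.
        entry = self.catalog[path]
        return self.opener.open(f"{self.repo_url}/data/{entry['hash']}")
```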
◦ Goals
  understand the file I/O patterns induced by the LSST software framework
  identify the requirements a storage platform needs to satisfy to support LSST workflows
  provide feedback to developers on the intrinsic limitations of the available storage platforms
◦ Current status
  implemented clueFS, a tool for collecting machine-parseable data on I/O activity at the file system level: https://github.com/airnandez/cluefs
    a FUSE-based file system that intercepts, traces and forwards I/O operations to the underlying file system (see the sketch below)
  developed a Python notebook for interactive exploration of the data collected by clueFS
  developed a (very crude) trace replay tool on top of the SimGrid simulation toolkit
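As a rough illustration of the pass-through tracing approach (this is not clueFS itself; see its repository for the real tool and trace format), a minimal FUSE loopback file system that logs a few operations could look like this, assuming the fusepy Python package on a Linux host:

```python
# Minimal pass-through FUSE file system that logs I/O calls before forwarding
# them to the underlying directory. Illustrative sketch only, not clueFS.
import os
import sys
import time
from fuse import FUSE, Operations  # fusepy

class TraceFS(Operations):
    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def _log(self, op, path, **details):
        print(f"{time.time():.6f} {op} {path} {details}")

    def getattr(self, path, fh=None):
        self._log("getattr", path)
        st = os.lstat(self._full(path))
        keys = ("st_mode", "st_nlink", "st_size", "st_uid", "st_gid",
                "st_atime", "st_mtime", "st_ctime")
        return {k: getattr(st, k) for k in keys}

    def readdir(self, path, fh):
        self._log("readdir", path)
        return [".", ".."] + os.listdir(self._full(path))

    def open(self, path, flags):
        self._log("open", path, flags=flags)
        return os.open(self._full(path), flags)

    def read(self, path, size, offset, fh):
        self._log("read", path, size=size, offset=offset)
        return os.pread(fh, size, offset)

    def release(self, path, fh):
        self._log("release", path)
        return os.close(fh)

if __name__ == "__main__":
    # usage: python tracefs.py <source-dir> <mount-point>
    FUSE(TraceFS(sys.argv[1]), sys.argv[2], foreground=True, ro=True)
```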
◦ Examples of findings from running the LSST v10.1 demo
  a single index file of the astrometry.net package is opened 190 times when executing the stack demo; other files are opened 80 times
    not necessarily an anomaly, but if this is intended we should exploit it
  4 KB reads (because of cfitsio), small writes (4 KB, 20 KB)
  details: https://github.com/airnandez/cluefs-tools
◦ Ongoing work
  improve the current simulator by including more realistic I/O-related behaviour
  develop tools for visualising the I/O activity from the trace data (a minimal analysis example follows)
  a PhD student from IHEP (Beijing, China), currently visiting CC-IN2P3, is working on this
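Findings such as the repeated opens above fall out of simple aggregations over the trace data. A minimal example of that kind of exploration is sketched below; the record layout (CSV rows with operation and path fields) is an assumption for the sketch, the actual clueFS output format is documented in its repository.

```python
# Count how many times each file is opened in an I/O trace.
# The CSV layout and field names below are assumptions for this sketch.
import csv
from collections import Counter

def opens_per_file(trace_path):
    """Return a Counter mapping file paths to the number of open operations."""
    counts = Counter()
    with open(trace_path, newline="") as f:
        for record in csv.DictReader(f):           # assumed: one row per I/O operation
            if record.get("operation") == "open":  # assumed field names
                counts[record["path"]] += 1
    return counts

# hypothetical usage:
# for path, n in opens_per_file("cluefs-trace.csv").most_common(10):
#     print(n, path)
```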
◦ Investigating a suitable configuration of GPFS for serving FITS-formatted data
  our tests with processing tasks on CFHT data showed excessive network traffic between the GPFS file servers and the compute nodes
  I/O is performed via cfitsio, which reads in units of the block size (1 MiB in our case); this causes GPFS cache thrashing, in particular when reading FITS HDUs, which are typically a few tens of KiB (see the read-amplification estimate below)
  the situation improved significantly after configuring GPFS pools to use a block size of 256 KiB, but there are limitations on configuring file systems with different block sizes
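A back-of-the-envelope estimate of the effect, under the assumption stated above that reads are issued in block-sized units; the 40 KiB HDU size is purely illustrative of "a few tens of KiB":

```python
# Rough read-amplification estimate: bytes transferred per byte of HDU data
# requested, when reads are issued in whole file-system blocks (as described above).
def amplification(hdu_size_kib, block_size_kib):
    blocks = -(-hdu_size_kib // block_size_kib)   # ceiling division
    return blocks * block_size_kib / hdu_size_kib

for block_kib in (1024, 256):                     # 1 MiB vs 256 KiB block size
    print(f"{block_kib:4d} KiB blocks: x{amplification(40, block_kib):.1f} "
          f"traffic for a 40 KiB HDU")
# prints roughly x25.6 for 1 MiB blocks and x6.4 for 256 KiB blocks
```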
◦ Will look at pre-staging data using built-in Linux functionality, without modifying cfitsio
  posix_fadvise(2) allows an application to announce its intention to access file data with a specific pattern (e.g. sequential, random, once, will need, don't need); see the sketch below
  data pre-staged in this way is managed by Linux's built-in cache mechanism
◦ Will investigate the use of NFS v4 as the protocol for accessing data served by GPFS
  expected to provide more flexibility for configuring the block size of different file systems, for instance based on the type of stored data (e.g. block sizes of 4 KiB or 8 KiB)
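A minimal sketch of the pre-staging idea, assuming Linux and Python 3 (where posix_fadvise is exposed in the os module); the file path in the usage comment is hypothetical:

```python
# Ask the kernel to pull a file into the page cache ahead of processing,
# so that cfitsio's subsequent small reads are served from memory.
import os

def prestage(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # POSIX_FADV_WILLNEED initiates a non-blocking read-ahead of the byte range.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)

# hypothetical usage, before launching a cfitsio-based processing task:
# prestage("/data/cfht/exposure-123456.fits")
```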
◦ We are very interested in understanding, in detail, how LSST intends to store its data
  file format: FITS, ASDF, HDF5, …
  file granularity: CCD, exposure, …
  data processing workflows
  these aspects will likely have a substantial impact on the platform we will need to deploy
◦ File I/O may be the limiting factor of the data processing platform at the scale needed for the foreseen LSST data release processing
  we want to discover issues early on and iterate promptly
◦ Could we consider using different storage strategies for different goals?
  for bulk processing, we could explore databases for metadata and files for binary data
  for archival and data exchange, we could use an appropriate file format
◦ Goal
  explore a system for caching data files (e.g. immutable images) in memory for bulk processing
  the system would aggregate the memory of a cluster of machines and expose it as a synthesised file system with a single namespace
◦ Principle of operation
  applications open files as usual; file chunks enter the cache on demand (i.e. at open) from the disk-based repository
  subsequent operations on the same file would exploit the cluster memory cache: read operations would retrieve data from the memory of another machine in the cluster
  inspired by Apache Spark's TachyonFS
  furthermore, the system could exploit known file formats: for instance, opening a FITS file would automatically load all the HDUs of that file into the cache, or store the image metadata (FITS HDUs) in a dedicated in-memory database (see the single-node sketch below)
◦ Use case
  bulk processing of a set of immutable files requiring repeated read-only access to the same data
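A single-node sketch of the HDU-level caching idea (the system described above would spread this cache over the memory of a whole cluster); astropy is used here purely for illustration, and the file path is hypothetical.

```python
# Keep the HDUs of immutable FITS files in memory after the first open,
# so repeated read-only accesses never go back to disk. Single-node sketch.
from astropy.io import fits

class HduCache:
    def __init__(self):
        self._cache = {}

    def open(self, path):
        if path not in self._cache:
            with fits.open(path, memmap=False) as hdul:
                # Eagerly load headers and pixel data so later reads are memory-only.
                self._cache[path] = [(hdu.header.copy(), hdu.data) for hdu in hdul]
        return self._cache[path]

cache = HduCache()
# hdus = cache.open("/data/immutable/exposure-000001.fits")  # hypothetical path
```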
◦ Goal: explore object storage as a repository for LSST data, as an alternative to conventional networked storage systems
  the object storage model seems well suited to serving large quantities of immutable files
◦ Potential benefits
  LSST data could be accessible using standard protocols: no confinement of the data to a single site
  makes the LSST stack cloud-aware, which allows the separation of processing clusters from storage clusters
◦ Status
  previous encouraging results evaluating object storage for LHC data: https://speakerdeck.com/airnandez/files-without-borders
  an evaluation instance of OpenStack Swift is deployed at CC-IN2P3; a CEPH evaluation instance is currently being deployed
◦ Ways to go forward
  1) make the Butler aware of protocols other than POSIX I/O, for both image metadata (HDUs) and binary data (i.e. pixels); the sketch below illustrates the idea
  2) implement a file system to expose objects stored in Swift/S3 back ends as normal files: no modification to the Butler needed, but less flexible
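As a hedged illustration of option 1), the snippet below fetches a FITS image directly from an S3-compatible object store (such as a Swift or CEPH endpoint exposing the S3 API) rather than through POSIX I/O. The endpoint, credentials, bucket and key are invented for the example; boto3 and astropy are one possible tooling choice, not something prescribed here.

```python
# Read a FITS image straight from an S3-compatible object store.
# Endpoint, credentials, bucket and key names below are placeholders.
import io
import boto3
from astropy.io import fits

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.in2p3.fr",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

obj = s3.get_object(Bucket="lsst-images", Key="raw/exposure-000001.fits")
with fits.open(io.BytesIO(obj["Body"].read())) as hdul:
    # Both metadata (headers) and pixels now come from the object store.
    print(hdul[0].header)
```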
◦ Goal
  explore object storage back ends as a mechanism for inter-site data exchange
  decouple the storage system used for inter-site data exchange from the system used for on-site data processing
◦ Idea
  use Swift (or CEPH) as a reception/emission buffer for inter-site data exchange
  use standard protocols for bulk data transfer over the WAN, in particular HTTP-based ones, leveraging the work of the huge community working on tools for the web (a minimal sketch follows)
◦ Comment
  experience with the LHC project shows that inter-site data exchange is a topic that must be addressed early on and requires extensive preparation and rehearsal
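A minimal sketch of the HTTP-based exchange idea: a file produced at one site is pushed with a plain HTTP PUT into the receiving site's buffer and later pulled by the local workflow. The URL in the usage comment is invented, and authentication (e.g. Swift temporary URLs or tokens) is left out of the sketch.

```python
# Push/pull a file to/from a remote reception buffer over plain HTTP.
# URLs are placeholders; authentication is omitted for brevity.
import requests

def push(local_path, url):
    with open(local_path, "rb") as f:
        requests.put(url, data=f).raise_for_status()

def pull(url, local_path):
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)

# hypothetical usage:
# push("exposure-000001.fits", "https://buffer.example.org/v1/lsst/inbox/exposure-000001.fits")
```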
◦ Goal
  explore the suitability of new cluster management tools for the LSST level 2 data processing use case
◦ Idea
  could we build a single cluster, composed of compute nodes physically located at NCSA and CC-IN2P3, for data release processing?
  would recent tools such as Apache Mesos be useful for us?
◦ Potential benefits
  be prepared for alternative modes of cluster-based data processing in the 2020-2030 era, beyond the conventional batch system
  the time difference between Urbana-Champaign and Lyon could be exploited for round-the-clock operations
◦ Constraints
  this would require the processing workflows to be aware of the physical location of the data (at the site level), to avoid unnecessary access to the data over the network across the Atlantic
◦ Explore the suitability and benefits of map-reduce-based execution platforms such as Apache Spark for LSST workflows (a toy sketch follows)
◦ Explore containerisation of specific LSST workflows and how to orchestrate them to build an LSST application-specific data processing service
  e.g. Kubernetes, Mesosphere
◦ Understand what it takes, in the current state of affairs, to run LSST workflows on a cloud service
  e.g. Amazon Web Services, OpenStack
  both for computing and storage
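To make the first exploration item concrete, here is a toy PySpark sketch distributing a placeholder per-exposure processing function over a list of image files; the function body and file paths are invented for the example and do not represent an established LSST workflow.

```python
# Toy example: fan a placeholder per-exposure task out over a Spark cluster.
# The processing function and input paths are placeholders for this sketch.
from pyspark import SparkContext

def process_exposure(path):
    # stand-in for an LSST stack processing step applied to a single file
    return (path, "ok")

sc = SparkContext(appName="lsst-workflow-exploration")
paths = ["/data/cfht/exp-%06d.fits" % i for i in range(100)]  # hypothetical inputs
results = sc.parallelize(paths, numSlices=20).map(process_exposure).collect()
sc.stop()
```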