ENVRIplus: Review of existing Research Infrastructures: requirements, technologies, achievements and gaps leading to characterization

SciTech
March 28, 2016

ENVRIplus is a Horizon 2020 project that brings together Environmental and Earth System Research Infrastructures (RIs), projects and networks with technical specialist partners to create a more coherent, interdisciplinary and interoperable cluster of Environmental Research Infrastructures across Europe. It is driven by three overarching goals: 1) promoting cross-fertilization between RIs, 2) implementing innovative concepts and devices across RIs, and 3) facilitating research and innovation in the field of environment for an increasing number of users outside the RIs. To achieve these goals, the first task is to review the information about the research infrastructures and available technologies in order to clarify requirements, identify issues and highlight opportunities. This presentation gives an overview of the methodology followed for gathering the RIs' requirements, some preliminary analyses, and future actions.

Transcript

  1. H2020 Project, Project Number: 654182

     • Funded by EU Horizon2020
     • 2015-2019 (Rosa: Sept. 2015 to April 2017)
     • Goal: Bring together 21 RIs with technical specialist partners to create a more coherent, interdisciplinary and interoperable cluster of Environmental RIs across Europe.
     • DIR task (Rosa & Cristina & Malcolm): Leading the requirement capture for delivering common data functionalities:
         • Interviews and questionnaires
         • Digest and analyse answers
     • Data Intensive Federation
     • Which are the participating Research Infrastructures?
     • PROVIDING SHARED SOLUTIONS FOR SCIENCE AND SOCIETY
  2. WHAT ARE RESEARCH INFRASTRUCTURES

     • Research infrastructures (RIs) are facilities, resources and related services used by the scientific community to conduct top-level research in their fields (e.g. CERN).
     • ENVRIplus works with 21 RIs from all domains of environmental science: Atmospheric, Marine, Biosphere and Solid Earth.
     • Goal: Multidisciplinary Earth system science
     • Avoiding duplication of efforts
     • Making RIs' products easier to use with each other
     • Improving their innovation potential and cost/benefit ratio
  3. DATA IDENTIFICATION AND CITATION

     • Diagram labels: Marine and Seafloor data (DOI?); Sea data (DOI?); Marine biology, Bio-diversity (DOI?)
     • A digital object identifier (DOI) is a serial code used to uniquely identify objects.
     • What we understand by identification and citation: mechanisms to provide durable references to data objects and collections of data objects.
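As a rough illustration of a "durable reference", the sketch below normalises a DOI string and builds a link through the public doi.org resolver. The regular expression is a loose shape check chosen for this example (an assumption, not the full DOI syntax), the function names are hypothetical, and the DOI shown is made up.

```python
import re

# Loose check for the common "10.<registrant>/<suffix>" shape.
# This is an assumption for illustration, not the full DOI syntax.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste in front of a DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip a resolver or 'doi:' prefix and return the bare DOI."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_SHAPE.match(doi):
        raise ValueError(f"not a DOI-shaped string: {raw!r}")
    return doi


def resolver_url(raw: str) -> str:
    """Build a reference that goes through the doi.org resolver."""
    return "https://doi.org/" + normalize_doi(raw)


if __name__ == "__main__":
    # Hypothetical DOI, used only as an example.
    print(resolver_url("doi:10.1234/example-dataset"))
```

Storing the bare DOI and deriving the URL on demand keeps the reference independent of any one resolver address.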
  4. CURATION: QUALITY CONTROL, ANNOTATION ...

     • Diagram labels: Nearly real-time data handling; Raw, Level 1, Level 2
     • What we understand by curation: processes to assure the availability and quality of data over the long term.
  5. DATA CATALOGUING

     • Diagram labels: Marine and Seafloor data; Carbon, ocean, eco-sys; Cloud particles, atmosphere composition; Gas-sphere species, etc.
     • Interoperable cataloguing?
     • What we understand by cataloguing: catalogues are built to accelerate access to data subsets/methods/services/scripts/workflows that can be delimited by queries over a catalogue.
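To make "delimited by queries over a catalogue" concrete, here is a minimal in-memory sketch. The entry fields, identifiers and the `query` function are all hypothetical and are not taken from any RI's actual catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """One catalogued dataset; the fields are illustrative assumptions."""
    identifier: str
    domain: str
    variables: tuple
    data_format: str


# Made-up entries, loosely inspired by the domains named on the slide.
CATALOGUE = [
    CatalogueEntry("ds-001", "marine", ("temperature", "salinity"), "NetCDF"),
    CatalogueEntry("ds-002", "atmosphere", ("aerosol", "trace_gas"), "NetCDF"),
    CatalogueEntry("ds-003", "atmosphere", ("cloud_particles",), "text"),
]


def query(catalogue, domain=None, variable=None):
    """Return identifiers of entries matching every filter that is given."""
    hits = []
    for entry in catalogue:
        if domain is not None and entry.domain != domain:
            continue
        if variable is not None and variable not in entry.variables:
            continue
        hits.append(entry.identifier)
    return hits


if __name__ == "__main__":
    print(query(CATALOGUE, domain="atmosphere"))
    print(query(CATALOGUE, variable="salinity"))
```

An interoperable catalogue would additionally need the field names and vocabularies to be agreed across RIs, which this sketch does not attempt.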
  6. PROCESSING, MONITORING AND DIAGNOSIS

     • Tools: processing, monitoring, and diagnosis, virtual research environments
     • What we understand by processing: includes every computational transformation of data, e.g., pre-processing raw data, signal processing, data quality assurance (QA), analysis of data, data simulation and comparison.
  7. PERFORMANCE OPTIMIZATION

     • Tools: processing, monitoring, and diagnosis, virtual research environments
     • How to optimize: data discovery, access, delivery and processing.
  8. DATA PROVENANCE

     • Diagram labels: Provenance (three times); PROV-O?
     • What we understand by provenance: recording information about how data, code and working practices were created and were transformed to their current form.
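The slide raises PROV-O as a question rather than a decision, so the following is only a sketch of what recording one transformation step could look like. It uses class and property names that exist in the W3C PROV-O vocabulary (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`) as plain dictionary keys; the `record_step` helper and the `ex:` identifiers are hypothetical.

```python
from datetime import datetime, timezone


def record_step(activity_id, used_ids, generated_id, started, ended):
    """Describe one processing step using PROV-O names as plain keys.

    Returns two records: the activity that ran and the entity it produced.
    """
    activity = {
        "@id": activity_id,
        "@type": "prov:Activity",
        "prov:startedAtTime": started.isoformat(),
        "prov:endedAtTime": ended.isoformat(),
        "prov:used": list(used_ids),
    }
    entity = {
        "@id": generated_id,
        "@type": "prov:Entity",
        "prov:wasGeneratedBy": activity_id,
        "prov:wasDerivedFrom": list(used_ids),
    }
    return [activity, entity]


if __name__ == "__main__":
    # Hypothetical identifiers for a raw-to-Level-1 quality assurance run.
    start = datetime(2016, 3, 28, 9, 0, tzinfo=timezone.utc)
    end = datetime(2016, 3, 28, 9, 5, tzinfo=timezone.utc)
    for record in record_step("ex:qa-run-1", ["ex:raw-file"], "ex:level1-file", start, end):
        print(record)
```

Keeping the records as plain dictionaries makes them easy to serialise; a real deployment would more likely use an RDF or PROV library.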
  9. COMMON DATA CHALLENGES

     • Diagram labels: SIOS; Curation, Cataloguing, Processing, Identification, Citation, Optimization, Provenance
     • Adoption/Customization/Integration
  10. OUR APPROACH

     • RI-specific requirements and common abstractions
     • 1. A common understanding for smart RI data solutions
     • 2. Prototype prioritized common RI data solutions
     • 3. Bring common RI data solutions into practice
     • Common operations customized to specific RIs.
  11. REQUIREMENTS AND STATE OF THE ART REVIEW

     • Diagram labels: SIOS; Use cases; workflows; Metadata; Prioritized services; Community Standards
  12. REQUIREMENTS GATHERING METHOD

     • Dialogue between
         • RI representatives and
         • requirement gathering go-betweens
     • Steered by a questionnaire
         • with ethical approval and consent forms
     • Topics covered
         • Generic aspects of each RI
         • Identification and citation, Curation, Cataloguing, Processing, Provenance, Optimization, Community Support
     • Results accumulated in Wiki (public) and ActiveCollab (private): https://envrireferencemodel.atlassian.net/wiki/display/ERR/ENVRI+R+Requirements
         • Topic requirements detailed by RI
         • Analyses by topic
     • Main topics: 1. Identification and citation 2. Curation 3. Cataloguing 4. Processing 5. Provenance 6. Optimization 7. Community Support
  13. GENERIC REQUIREMENTS

     • They show the high-level:
         • Commonalities
         • Differences
         • potential interoperability between RIs.
     • They covered the following areas of interest:
         • RI's main goal
         • High-level questions spanning the main 7 topics
         • Data lifecycle, data/services offered, data standards/software used
         • Data management plan, data security/access, non-functional constraints
         • Optimisation plans, interactions with other RIs
         • What objectives and services the RI expects from ENVRIplus.
  14. EXAMPLE OF GENERIC REQUIREMENTS - ACTRIS

     • RI Goal:
         • Free access to atmospheric aerosols, clouds, and trace gases data from observations
         • Free access to data products and tools for quality assurance (QA), data analysis and research
     • Data lifecycle:
         • Data from stations are transferred to a computational resource for QA and then stored in one of the topic databases
         • Through the ACTRIS portal: visualise and access data
     • Identification and citation:
         • DOI + station code
  15. EXAMPLE OF GENERIC REQUIREMENTS - ACTRIS

     • Data and services offered:
         • Data: Free and open access to all data and data products.
         • Software for: quality assurance (QA) and data analysis.
         • Instrumentation: TNA to different calibration centres and laboratories.
         • Expertise: Calibration centres offer training and specific advice to users.
         • Training: Training of operators and users in the field of atmospheric science.
     • Data standards used: NetCDF, CF 1.5-compliant format, NASA-Ames
     • Software used: Linux servers, relational databases
     • Data management: Covers all the topics except optimization
  16. EXAMPLE OF GENERIC REQUIREMENTS - ACTRIS

     • Data security and access:
         • General open data access without login, but some communities require a password/login.
         • Different timing to publish data based on the type of data.
         • No embargo period
     • Non-functional constraints: Computational environment cost
     • Optimization plans / Issues / Challenges: Data visualization, data provision, interoperability between data centre nodes
     • Interaction with other RIs and Initiatives: IAGOS, ICOS, AeroCom
     • ACTRIS expects that ENVRIplus will provide technology/advice for:
         • Activity of sensors, how instruments work in extreme conditions
         • Small sensor capabilities
  17. PROCESSING REQUIREMENTS: 4 MAIN ASPECTS

     • Input, i.e. what are the characteristics of the dataset(s) to be processed.
     • Analytics, i.e. what are the characteristics of the processing tasks to be enacted.
     • Output, i.e. what are the characteristics of the products resulting from the processing.
     • Statistics, i.e. what are the scientific motivations leading to the identification of the specific data processing envisaged by the community.
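The four aspects can be read as the fields of one record per RI. The sketch below is an assumption about how such answers might be held for analysis: the class name, field names and the `unanswered` helper are hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass, field

ASPECTS = ("inputs", "analytics", "outputs", "statistics")


@dataclass
class ProcessingRequirements:
    """Answers given by one RI, grouped by the four aspects."""
    ri_name: str
    inputs: list = field(default_factory=list)
    analytics: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    statistics: list = field(default_factory=list)

    def unanswered(self):
        """Names of the aspects for which no answer was recorded."""
        return [name for name in ASPECTS if not getattr(self, name)]


if __name__ == "__main__":
    # Made-up answers for a hypothetical RI.
    example = ProcessingRequirements(
        ri_name="ExampleRI",
        inputs=["tabular data"],
        analytics=["batch processing"],
        outputs=["similar to the input data"],
    )
    print(example.unanswered())
```

A helper like `unanswered` makes it easy to see which aspects an RI left blank when comparing responses across RIs.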
  18. PROCESSING REQUIREMENTS - ACTRIS

     https://envrireferencemodel.atlassian.net/wiki/display/ERR/Processing+in+ACTRIS
     • Input:
         • Tabular data, SDF files, numbers, matrix
         • Size varies: from 100 GB per year to TB per year, GB per day and per instrument
         • Data can be heterogeneous or not
     • Analytics:
         • Preprocessing and processing steps
         • Programming languages: Python, C, Pascal
         • Uses batch and interactive processing modes
     • Output: similar to the input data
     • Statistics: varies
  19. PROCESSING - INPUTS

     • RIs' needs with respect to dataset(s) to be analysed are quite diverse, both across RIs and within the same RI.
         • Data access: ftp, email, database
         • Data formats: multiple, e.g. NetCDF, text.
         • Data typology: time series and tabular data.
         • Volume: varies from a few KBs to GBs and TBs.
     • There is a need to homogenise and promote state-of-the-art practices for data description and discovery
         • to easily analyse dataset(s) across RIs.
  20. PROCESSING ANALYTICS

     • Languages: from Python, Matlab and R to C, C++, Java, and Fortran.
     • Platforms: Linux servers, HPC clusters, Cloud.
     • Free and Open Source Software
     • No organised approach to making the data processing tools available, either within the RI or outside the RI.
     • Future data processing platform should be "open" and "flexible" to allow:
         • scientists to easily plug in their algorithms without bothering with the computing platform
         • service managers to configure the platform to exploit diverse computing infrastructures
         • third-party service providers to programmatically invoke the analytics methods
         • scientists to run analytic tasks without requiring them to install any software.
  21. PROCESSING OUTPUTS

     • Same variety as the input data
     • The need to make these data available is less understood.
         • Some RIs publish them by paper or catalogue.
     • Output resulting from a data processing task should be "published"
         • Open Science practices.
     • Future data processing platform should offer access:
         • to the datasets resulting from a data processing task
         • to the metadata characterising the task, to enable the scientist to properly use the results.
  22. PROCESSING - STATISTICAL

     • A minority of the RIs responded to the statistics questions. Why?
         • Data collection is the primary aim of many of the RIs
         • Testing of the hypotheses underlying data collection is not undertaken.
     • Many RIs collect data under general hypotheses set when the data collection programmes/instruments were designed.
     • RIs collect multiple streams of data in time series, so there is the potential to undertake multivariate analysis of the data, but there is no consistency in approaches.
     • Most analysers will be engaging in formal testing of hypotheses rather than data mining, although the latter was not necessarily ruled out.
     • Many RIs have implemented or are going to implement outlier/anomaly detection on their data.
     • A variety of statistical methods can be used, within a frequentist or Bayesian framework.
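The slide does not say which outlier/anomaly detection methods the RIs use. As one common choice, the sketch below flags points in a time series with the modified z-score based on the median absolute deviation (MAD). The function name, the 3.5 threshold and the sample readings are assumptions for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations are (nearly) identical; nothing can be ranked.
        return []
    return [
        index
        for index, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


if __name__ == "__main__":
    # Made-up sensor readings with one obvious spike.
    readings = [10.1, 10.3, 9.9, 10.0, 10.2, 25.0, 10.1]
    print(mad_outliers(readings))
```

A median-based score is less distorted by the spike itself than a mean and standard deviation would be, which is why it is often preferred for short sensor series.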
  23. PRELIMINARY GENERAL ANALYSIS

     • General thoughts
         • RIs are typically federations of diverse autonomous organisations
         • These organisations
             • Have established roles, cultures, working practices & resources
             • These roles must remain as they are their primary business
             • ENVRIplus will change their ways of digitally engaging
             • They have internal diversity that may be relevant
         • They need to incrementally engage in their federations
             • The benefits must outweigh internal costs at each step
         • They may be engaged in many federations
             • Significant benefit from using the same framework for each federation
         • Federating for multi-domain science - the ENVRIplus goal
             • Needs all of the above
  24. WORKFLOWS RELEVANT INFORMATION

     • EMBRC would like to explore new data workflows that make use of marine biological and ecological data
     • LTER needs to integrate a data repository into their workflow system, and develop an integrated data portal (e.g. with a time series viewer)
     • EPOS uses the dispel4py workflow engine in VERCE, which is based on and able to export to PROV-O; in future it plans to use the CERIF data model and ontology instead.
  25. SHORT TERM ANALYSIS OF THE STATE OF THE ART AND TRENDS

     • Virtualisation: the user neither knows nor cares where and how the information processing is done, as long as their requirements are respected in service level agreements, quality of service agreements etc.
     • Interoperation: to satisfy the desire for end-users to be able to access not only resources in their domain of interest but across domains
     • Re-use of standard components of software as building blocks joined together like LEGO; this has implications for APIs and messaging interfaces;
     • The definition of data structures and semantics separately from software in order to be able to use generic software components;
  26. SHORT TERM ANALYSIS OF THE STATE OF THE ART AND TRENDS

     • The need for systems to be distributed, partitioned, parallel and (mobile) client-device-independent;
     • The need for systems to handle data streams from instruments/detectors and for users to be able to control the parameters of data-taking;
     • Composition of software components linked to datasets as workflows with parallel/sequential, distributed/centralised control and exception management;
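The last point, composing components into workflows with sequential or parallel control and exception management, can be sketched with the Python standard library alone. This is an illustration of the idea, not the API of dispel4py or of any other workflow engine; the function names and sample steps are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(steps, data):
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data


def run_parallel(branches, data):
    """Run independent branches on the same input and collect outcomes.

    A failing branch is reported as ("error", message) instead of
    aborting the other branches.
    """
    results = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(branch, data) for branch in branches]
        for future in futures:  # submission order, so results are ordered
            try:
                results.append(("ok", future.result()))
            except Exception as exc:
                results.append(("error", str(exc)))
    return results


def add_one(x):
    return x + 1


def double(x):
    return x * 2


def failing(x):
    raise ValueError("sensor offline")


if __name__ == "__main__":
    print(run_sequential([add_one, double], 3))
    print(run_parallel([add_one, failing], 1))
```

Reporting failures per branch rather than raising lets a workflow decide later whether a partial result is acceptable.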
  27. WHAT ARE WE DOING NOW

     • Reviewing the initial topic analyses provided by topic leaders
     • Performing technology review for each topic
         • Responsible for processing - Workflows
     • Deliverable
     • Next
         • Give all this information to the Reference Model team