Better Software for Better Science
The 12th Iberian Grid Conference will take place in Benasque, Spain, from Monday 25th to Friday 29th of September.
This edition is dedicated to the memory of our dear Vicente Hernández, a reference for the IBERGRID community both academically and on a human level.
Zoom Ibergrid 2023
Id: 846 2964 6790
Passwd: 044413
URL: https://us06web.zoom.us/j/84629646790?pwd=5qakht6CbPHd0hm5aKtGaRsdkCHB5Y.1
IBERGRID was created under the agreement on Scientific and Technological Cooperation signed by Spain and Portugal in 2003, 20 years ago. This presentation will provide an overview of the IBERGRID infrastructure status and related activities.
This is a 40-minute plenary presentation by Ignacio Blanquer on the status of EOSC.
The DT-GEO project (2022-2025), funded under the Horizon Europe topic call INFRA-2021-TECH-01-01, is implementing an interdisciplinary digital twin for modelling and simulating geophysical extremes at the service of research infrastructures and related communities. The digital twin consists of interrelated Digital Twin Components (DTCs) dealing with geohazards ranging from earthquakes and volcanoes to tsunamis, and harnessing world-class computational (FENIX, EuroHPC) and data (EPOS) Research Infrastructures, operational monitoring networks, and leading-edge research and academic partnerships in various fields of geophysics. The project is merging and assembling the latest developments from other European projects and EuroHPC Centres of Excellence to deploy 12 DTCs, intended as self-contained containerised entities embedding flagship simulation codes, artificial intelligence layers, large volumes of (real-time) data streams from and into data lakes, data assimilation methodologies, and overarching workflows for the deployment and execution of single or coupled DTCs on centralised HPC and virtual cloud computing Research Infrastructures (RIs). Each DTC addresses specific scientific questions and circumvents technical challenges related to hazard assessment, early warning, forecasts, urgent computing, or geo-resource prospection. This presentation summarises the results from the first year of the project, including the digital twin architecture and the (meta)data structures enabling (semi-)automatic discovery, contextualisation, and orchestration of software (services) and data assets. This is a preliminary step before verifying the DTCs at 13 Site Demonstrators, and the start of a long-term community effort towards a digital twin on Geophysical Extremes integrated in the Destination Earth (DestinE) initiative.
The BigHPC project brings together innovative solutions to improve the monitoring of heterogeneous HPC infrastructures and applications, the deployment of applications, and the management of HPC computational and storage resources. It also aims to alleviate the current storage performance bottleneck of HPC services when dealing with data-intensive applications, which are growing into one of the major workloads in HPC environments.
The largest companies, government research centres and academic computing centres aggregate computing power in response to the Big Data trend. Typical jobs aim to extract value from data characterised by the four Vs (volume, variety, velocity and veracity). But to use those resources there are several different implementations, which makes it difficult for a user to exploit those infrastructures. The BigHPC platform provides a way to start using those resources with a common job definition (BigHPC job) adapted to the requirements of the largest BigHPC infrastructures, making life easier for system administrators.
In this presentation we will show the services involved, how they can be deployed in any infrastructure, and the job workflow supported on top of the GitLab platform.
udocker is a tool to execute containers in HPC resources. It can pull container images from any registry, be it Docker Hub, GitHub Container Registry (https://ghcr.io), GitLab Container Registry (https://registry.gitlab.com) or others. udocker is a run-time tool that enables the execution of applications encapsulated in containers on both HPC and Cloud resources.
This presentation will describe the developments introduced in the latest version of udocker (1.3.10). These include: support for pulling, importing, loading and executing images for architectures possibly different from that of the executing host; support for QEMU in the PRoot modes; experimental support for native Fakechroot execution on arm64 and ppc64le for some guest operating systems such as CentOS 7, AlmaLinux 8, AlmaLinux 9 and Ubuntu 22; and improved support for OCI images.
The presentation will also address ongoing developments such as:
* Further improvements for the OCI image format.
* New installation capabilities allowing udocker installations tailored to the host OS and/or hardware architecture and enabling the user to customise which tools and libraries to install.
* Further improvements for MPI applications, towards automation of discovery and mapping of host libraries, devices and drivers.
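As a point of reference for how these capabilities are typically exercised, the following is a minimal sketch of the pull/create/run cycle driven from Python; the image and container names are arbitrary examples, and udocker is assumed to be installed and available on the PATH of the login or worker node.

```python
# Minimal sketch: driving udocker's basic pull/create/run cycle from Python.
# The image and container names are arbitrary examples; udocker must already
# be installed and available on the PATH.
import subprocess

def run(cmd):
    """Run a udocker command and fail loudly if it returns non-zero."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["udocker", "pull", "ubuntu:22.04"])                    # fetch image from a registry
run(["udocker", "create", "--name=ub22", "ubuntu:22.04"])   # create a local container
run(["udocker", "run", "ub22", "cat", "/etc/os-release"])   # execute without root privileges
```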
The EuroCC project aims to boost knowledge and use of high-performance computing (HPC) across Europe, through a network of national competence centres (NCC).
The Portuguese Competence Centre (NCC Portugal) coordinates dissemination, training, knowledge and technology transfer activities, as well as the promotion of the use of HPC. It has been a point of contact for potential users - whether from industry, science or public administration (among others).
The EuroCC project began in 2020 and is being followed by the EuroCC 2 project that started in 2023 and will last 3 years.
In Portugal, supercomputing is under development, and the arrival of Deucalion - the country's latest supercomputer - is expected to continue paving the way. Dissemination activities promoted by NCC Portugal have also been important in educating and attracting the public, namely by publishing news and use cases in more accessible language.
There are still difficulties - both in internal and external communication - and we have planned new strategies to overcome them over the next three years.
This presentation will showcase the EuroCC project and the activities carried out by NCC Portugal, particularly those of dissemination, as well as the achievements, the challenges of internal and external communication and the plans to address them.
The field of Complex Systems has rapidly gained prominence over recent decades as a result of its capacity to explore the intricate and interdependent behaviors exhibited by a wide range of natural, social, and technological systems. Driven by advances in data availability, computational methods, and interdisciplinary collaboration, Complex Systems research has become a burgeoning field of study with broad cross-disciplinary appeal.
Community detection theory is vital for the structural analysis of many types of complex collaboration systems, especially for human-like collaboration networks. Within a network, a community is understood as a group of nodes among which interactions are more frequent (or of greater weight) than would be expected if interactions were completely random. The detection and analysis of this type of groups give us relevant information on the characteristics of the structure of interactions at a mesoscopic scale, halfway between the global and local scale.
Usually, complex systems are formed by a great number of interacting agents which implies huge quantities of data when modelling these systems. Because of this, the problem of community detection requires the development of powerful and optimized algorithms that fit the requirements of each problem. In this work, we present a new community detection algorithm, the Targeted Community Merging (TCM) algorithm, based on a well-known and widely used algorithm in the literature, which allows obtaining proper community partitions with a small number of communities.
We then perform an analysis and comparison between the departmental and community structure of scientific collaboration networks within the University of Zaragoza. To construct the scientific collaboration networks and perform this analysis, we use data from the University of Zaragoza consisting of a database of published articles and researcher affiliations, covering the period from January 2002 to January 2021. Our analysis focuses on three macro-areas of knowledge of the University of Zaragoza: Science, Health Sciences, and Engineering and Architecture.
Thus, we draw valuable conclusions from the inter- and intra-departmental collaboration structure that could be useful for decisions on a possible departmental restructuring. The algorithm and methods can be easily generalised to other collaboration systems where data of similar characteristics and a native partition of the agents that make up the system are available.
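The TCM algorithm itself is not reproduced here; purely as a generic illustration of modularity-based community detection on a small weighted collaboration network (with made-up data), a sketch using networkx could look as follows.

```python
# Generic illustration of community detection on a small weighted
# co-authorship network using networkx. This is NOT the TCM algorithm
# described above; it only shows the kind of input and output involved.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy collaboration network: edge weight = number of joint papers (assumed data).
edges = [
    ("A", "B", 5), ("A", "C", 3), ("B", "C", 4),   # one tightly knit group
    ("D", "E", 6), ("E", "F", 2), ("D", "F", 3),   # a second group
    ("C", "D", 1),                                  # weak inter-group link
]
G = nx.Graph()
G.add_weighted_edges_from(edges)

communities = greedy_modularity_communities(G, weight="weight")
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```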
Software engineering best practices favour the creation of better quality projects, where similar projects should originate from a similar pre-defined layout, also called a software template. This approach greatly enhances project comprehension without the need for extensive documentation. Additionally, it allows the pre-setting of certain functionalities, simplifying further code development. There exist various tools to create such templates and then routinely generate projects from them. One such Open Source tool is cookiecutter [1], a cross-platform command-line utility. The templates, or cookiecutters, can be re-used and freely hosted on software version control platforms, e.g. GitHub.
In this lightning talk, we present a new (pre-production) platform that enables the collection of various templates in a marketplace/hub and their use to generate new projects on the fly through a web interface, without requiring the installation of the cookiecutter tool on the client side. The platform features a GitHub repository to collect metadata about templates, a Python-based backend, and a JavaScript web GUI with authentication via EGI Check-In.
[1] https://github.com/cookiecutter/cookiecutter
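For illustration, cookiecutter also exposes a Python API that a backend such as the one described above could invoke; in this hedged sketch the template URL is a well-known public example rather than one taken from the presented hub.

```python
# Sketch: generating a project from a cookiecutter template via the Python API.
# The template URL is a public example; the hub presented in this talk would
# pass a template reference collected from its GitHub metadata repository.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/audreyfeldroy/cookiecutter-pypackage.git",
    no_input=True,                                   # take defaults instead of prompting
    extra_context={"project_name": "demo-project"},  # pre-seed template variables
    output_dir="./generated",                        # where the new project is written
)
```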
PypKa is a tool developed by the Machuqueiro Lab at the University of Lisbon (UL), Portugal. It is a Poisson-Boltzmann-based pKa predictor for proteins that uses 3D structures as input. The tool also predicts isoelectric points and can process PDB structures to assign the correct protonation states to all residues. The impact of the PypKa cloud service is to predict pKa values of titratable sites in proteins (Reis et al. 2020). The team from UL applied to the EGI-ACE Call in April 2021 with the intention of taking the PypKa tool to the next level by porting it onto a scalable and easy-to-use cloud service that allows for fast pKa and isoelectric point calculations using user-provided protein structures or those obtained from the Protein Data Bank.
To better support this mission, and to make PypKa the go-to solution for fast pKa and isoelectric point calculations, the EGI-ACE user support team allocated resources and services and provided consultancy in the framework of a dedicated Competence Centre (CC) led by a technical expert from CNRS. The Competence Centre was composed of members from the appropriate service, resource and technology providers (UPV, CESGA, CNRS, and LIP). From a technical point of view, considering that the performance of the PypKa cloud service scales almost linearly with the number of vCPU cores, the PypKa cloud server was deployed on the resources of the EGI cloud infrastructure to support this challenge. Specifically:
- Universitat Politècnica de València (UPV) was involved as service provider of the Infrastructure Manager (IM) to facilitate the configuration and the set-up of the virtual and elastic cluster, whose resources scale dynamically, taking into account the number of users' requests to be served.
- LIP was involved as a cloud resource provider because of the close proximity to the user (the national cloud in Portugal).
- CESGA was involved as a cloud resource provider to complement the pool of resource capacity allocated by LIP.
The result is a cloud-based Thematic Service that is available as a web portal at https://pypka.org/ and offers the functionality to predict Poisson-Boltzmann-based pKa values of biomolecules. Since late 2021, the portal with the scalable setup has been used by ~1,000 researchers from 15 European countries.
The technical support offered to the University of Lisbon to operate the PypKa Thematic Service will continue after the end of the EGI-ACE project with the following reallocation:
- LIP resource provider will continue to support the Thematic Service as a national cloud. The principal investigator has already received a national grant to continue using the LIP resources.
- CESGA cannot guarantee support and resources, so it will be replaced by local resources provided by UL.
ICSC, established in September 2022 and run by the ICSC Foundation, is the Italian High-Performance Computing, Big Data and Quantum Computing Research Centre. It is one of the five National Centres funded by the Italian National Recovery and Resilience Plan (NRRP) with about 320 million euro each.
The ICSC activities focus, on the one hand, on the maintenance and upgrade of the Italian HPC and Big Data infrastructure with the aim of building a national, cloud-native data lake, and, on the other hand, on the development of advanced methods, numerical applications and software tools to integrate computing, simulation, collection, and analysis of data of interest for research, manufacturing, and society. Its overall goal is to support both scientific research and industry.
This contribution will discuss the current status of ICSC, its governance and activities, as well as some collaboration opportunities.
Finally, it will give some perspectives on the sustainability of this key National infrastructure beyond the NRRP funding period.
EUCAIM (https://cancerimage.eu/) is a pan-European federated infrastructure for cancer images, fueling AI innovations. A federated infrastructure supports EUCAIM, including a set of core services that comprise a public metadata catalogue, a federated search service following a common hyperontology, an access negotiation system, a coherent AAI and a distributed processing service. EUCAIM will permit users to discover, search, request, access and process medical imaging and associated clinical data in a flexible manner, supporting federated providers with different access levels and a centralised catalogue. EUCAIM is based on cloud and container technologies and it will be linked to intensive computing infrastructures such as EGI and supercomputing centres.
The EUCAIM project is the cornerstone of the European Cancer Imaging Initiative, one of the flagships of Europe's Beating Cancer Plan (EBCP). EUCAIM is building a federated European infrastructure for cancer image data, starting with 21 clinical sites from 12 countries. IFCA (CSIC) participates in this project and oversees the Data FAIRification sub-task, as well as collaborating in others.
The project will provide a central hub that will link EU-level and national initiatives, hospital networks and research repositories holding cancer image data. Clinicians, researchers and innovators will have cross-border access to an interoperable, privacy-preserving and secure infrastructure for federated, distributed analysis of cancer imaging data.
Data FAIRness in AI4HI projects
EUCAIM builds upon the work of the "AI for Health Imaging" (AI4HI) projects, namely Chaimeleon, EuCanImage, ProCancer-I, Incisive and Primage. These projects are developing Artificial Intelligence algorithms to detect cancer from imaging and are establishing federated repositories of cancer images.
We reviewed the data FAIRification practices of these projects to inform the best practices to be adopted in EUCAIM. Among them, the one with the most comprehensive approach was Chaimeleon, and its approach will help establish the work in EUCAIM.
FAIR EVA
The EUCAIM project also offers a comprehensive suite of tools and services designed to streamline data preprocessing. Among these tools, one is required to check the FAIRness of datasets.
Compliance with the FAIR principles implies considering multiple dimensions. The EUCAIM approach is based on the RDA recommendations, but during the project we will also define further FAIR attributes related specifically to Cancer Imaging data.
The EOSC-Synergy H2020 project developed a tool called FAIR EVA (evaluator, validator & advisor), which has been selected for deployment in the EUCAIM infrastructure. Alternatives, such as F-UJI, were also considered, but EUCAIM is adopting FAIR EVA.
FAIR EVA has been developed to check the FAIRness level of digital objects from different repositories or data portals. It requires the object identifier (preferably a persistent and unique identifier) and the repository to check, and provides a generic, technology-agnostic way to check digital objects. FAIR EVA is a service that runs over the web. It can be deployed as a stand-alone application or in a Docker container. It implements different web services: an API that manages the evaluation and a web interface that makes it easier to access and use.
FAIR EVA implements a modular architecture that allows data services and repositories to develop new plugins to access its services. Some parameters can also be configured, such as the metadata terms to check, controlled vocabularies, etc. For the initial iteration the vanilla version of the tool will be deployed, but during the project a plugin will be developed to include the newly agreed FAIR attributes to be checked.
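As an illustration of the identifier-plus-repository interaction described above, a client request to a FAIR EVA instance might look roughly like the following sketch; the endpoint path and payload fields are assumptions based on the tool's documentation and should be adjusted to the instance actually deployed.

```python
# Illustrative only: how a client might ask a FAIR EVA instance to evaluate a
# digital object given its identifier and the repository/plugin to use.
# The endpoint path and payload fields below are assumptions based on the
# tool's documentation; adjust them to the deployed instance.
import requests

FAIR_EVA_URL = "http://localhost:9090"  # assumed local deployment (e.g. Docker container)

payload = {
    "id": "10.1234/example-dataset",        # persistent identifier of the digital object
    "repo": "oai-pmh",                      # plugin / repository type used for harvesting
    "oai_base": "https://example.org/oai",  # metadata endpoint of the repository
}

response = requests.post(f"{FAIR_EVA_URL}/v1.0/rda/rda_all", json=payload, timeout=60)
response.raise_for_status()
print(response.json())  # per-indicator FAIRness scores
```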
OIDC (OpenID Connect) is widely used for transforming our digital infrastructures (e-Infrastructures, HPC, Storage, Cloud, ...) into the token-based world.
OIDC is an authentication protocol that allows users to be authenticated with an external, trusted identity provider. Although typically meant for web-based applications, there is an increasing need for integrating shell-based services.
This contribution delivers an overview and the current state of several tools, each of which provides a solution to a specific aspect of using tokens on the command line in production services:
oidc-agent is the tool for obtaining OIDC access tokens on the command line. It focuses on security while providing ease of use at the same time. The agent operates on a user's workstation or laptop and is well integrated with the graphical user interfaces of several operating systems, such as Linux, MacOS, and Windows. Advanced features include agent forwarding, which allows users to securely obtain access tokens from remote machines to which they are logged in.
mytoken is both a server software and a new token type. Mytokens allow obtaining access tokens for long time spans, of up to multiple years. It introduces the concept of "capabilities" and "restrictions" to limit the power of long-lived tokens. It is designed to solve difficult use cases such as computing jobs that are queued for hours before they run for days. Running (and storing the output of) such a job is straightforward, reasonably secure, and fully automatable using mytoken.
pam-ssh-oidc is a PAM module that allows accepting access tokens in the Unix pluggable authentication system. This allows using access tokens, for example, in SSH sessions or in other Unix applications such as su. Our PAM module allows verification of the access token via OIDC or via third-party REST interfaces.
motley-cue is a REST-based service that works together with pam-ssh-oidc to validate access tokens. Along with the validation of access tokens, motley-cue may - depending on the enabled features - perform additional useful steps in the "SSH via OIDC" use case. These include:
* Authorisation (based on VO membership)
* Authorisation (based on identity assurance)
* Dynamic user creation
* One-time password generation (in case the access token is too long for the SSH client used)
* Account provisioning via a plugin-based system (interfaces with local Unix accounts, LDAP accounts, and external REST interfaces)
mccli is a client-side tool that enables clients that normally do not support OIDC access tokens to use them. Currently, ssh, sftp and scp are the supported protocols.
The oidc-plugin for PuTTY makes use of the new PuTTY plugin interface to use access tokens for authentication whenever an ssh server supports it. The plugin interfaces with oidc-agent for Windows to obtain tokens.
The combination of the tools presented allows creative new ways of using the new token-based AAIs with old and new tools. Given enough time, this contribution will include live demos for all of the presented tools.
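As a small example of scripted usage, the sketch below requests a token from a running oidc-agent through its Python bindings (liboidcagent) and uses it as a Bearer credential; the account short name and the target endpoint are placeholders, and the call shown follows the bindings' documented interface.

```python
# Minimal sketch: requesting an access token from a running oidc-agent using
# its Python bindings (liboidcagent). Assumes the agent is running and an
# account named "egi" has been configured with `oidc-gen`; account name and
# target endpoint are placeholders.
import liboidcagent as agent
import requests

token = agent.get_access_token("egi", min_valid_period=60)  # token valid for at least 60 s

# Use the token as a Bearer credential against a token-protected service.
response = requests.get(
    "https://example.org/protected/resource",  # placeholder endpoint
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code)
```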
Computing and data management workflows are increasingly demanding access to S3 storage services with POSIX capabilities by locally mounting a file system from a remote site to directly perform operations on files and directories.
To address this requirement in distributed environments, various service integrations and needs must be considered.
In the context of this activity, solutions based on S3 (for object storage) and HTTP WebDAV (for hierarchical storage) protocols have been carefully examined and put into operation.
In both cases, the access to the data must be regulated by standard, federated authentication and authorization mechanisms, such as OpenID Connect (OIDC), which is already adopted as authentication/authorization mechanism within WLCG and the European Open Science Cloud (EOSC).
Starting from this assumption, the possibility of managing data access by integrating JSON Web Token (JWT) authentication, provided by INDIGO-IAM as Identity Provider (IdP), with both the CEPH RADOS Gateway (the object storage interface for CEPH) and StoRM WebDAV with Rclone has been evaluated, and a comparison between the performance yielded by the S3 and WebDAV protocols has been carried out within the same distributed environment.
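As an illustration of the S3 side of such an integration, the following hedged sketch exchanges an IAM-issued JWT for temporary S3 credentials via the STS AssumeRoleWithWebIdentity flow of the RADOS Gateway and then lists a bucket with boto3; the endpoint, role ARN and bucket name are placeholders, and the exact role/OIDC configuration is site-specific.

```python
# Sketch of the STS AssumeRoleWithWebIdentity flow against a CEPH RADOS Gateway:
# an OIDC/JWT token issued by INDIGO-IAM is exchanged for temporary S3
# credentials, which are then used for normal S3 calls. Endpoint URL, role ARN
# and bucket name are placeholders.
import boto3

RGW_ENDPOINT = "https://rgw.example.org"          # RADOS Gateway endpoint (placeholder)
IAM_TOKEN = open("iam_token.txt").read().strip()  # JWT obtained from INDIGO-IAM

# Dummy credentials and region: the AssumeRoleWithWebIdentity call itself is unsigned,
# but boto3 requires a region and credentials to build the client.
sts = boto3.client("sts", endpoint_url=RGW_ENDPOINT, region_name="us-east-1",
                   aws_access_key_id="dummy", aws_secret_access_key="dummy")
creds = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam:::role/s3-access",  # role pre-created on the gateway (placeholder)
    RoleSessionName="ibergrid-demo",
    WebIdentityToken=IAM_TOKEN,
    DurationSeconds=3600,
)["Credentials"]

s3 = boto3.client(
    "s3",
    endpoint_url=RGW_ENDPOINT,
    region_name="us-east-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print([o["Key"] for o in s3.list_objects_v2(Bucket="demo-bucket").get("Contents", [])])
```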
One of the main obstacles in collecting data for life science today is compliance with the GDPR. Among others, the requirement to manage informed consent in a lawful, transparent and auditable way is one of the open issues that Trusted Research Environments must address. Today many hospitals and research organizations use some kind of Consent Management System (CMS), usually purchased as a SaaS product from a commercial provider. This solution suffers from some drawbacks, such as the need to trust a third party with the correct management of consent data.
To overcome the trustworthiness issue, we have designed and developed an open-source, blockchain-based Informed Consent Management System which offers patients and data subjects the possibility to provide, modify or deny their consent to the use of their personal datasets in a given research activity, without the need for any intermediate authority.
In this talk we’ll describe the rationale and the main design choices we made. We’ll provide details about the technology exploited and about our proof-of-concept deployment on INFN DataCloud environment.
Scientific computing benefits from the automated provisioning of virtualized infrastructures from multiple Infrastructure as a Service (IaaS) clouds. An abstraction on how computational resources are provisioned, configured and delivered to the end user is required to widespread the adoption of Cloud computing. During the last decade, the development of the open-source Infrastructure Manager (IM) tool has provided the ability to interact with multiple on-premises Cloud Management Platforms (CMPs), such as OpenStack and OpenNebula, many public Cloud providers, such as Amazon Web Services (AWS), Microsoft Azure or Google Cloud Platform, including European public Cloud providers such as Cloud & Heat, Open Telekom Cloud or Orange Cloud and also Federated providers as EGI Cloud Compute or FogBow.
The IM leverages open-source technologies for Infrastructure as Code (IaC) such as Ansible and standards such as TOSCA (Topology and Open Specification for Cloud Applications), to describe complex application architectures to be deployed on the cloud. The IM supports multiple integration paths for different user profiles, including a fully-featured REST API, alongside an easy-to-use command-line interface. In addition, the web-based IM Dashboard facilitates the usage by less savvy users by offering pre-packaged tested popular application architectures that can seamlessly be deployed on multiple Cloud back-ends. This is the case of Kubernetes clusters, JupyterLab instances, Hadoop clusters, etc. By providing a wizard-like TOSCA-based composition approach, users can further customize their deployment specifying the resources to allocate, the software configuration or the number of nodes to deploy in the cluster. It also provides the ability to scale the deployed infrastructures both horizontally (adding or removing nodes) or vertically (resizing a particular VM).
This contribution looks back into a decade of innovation in the field of automated provisioning of virtualised infrastructures using the Infrastructure Manager, highlighting its usage in multiple European projects (e.g. INDIGO-DataCloud, EOSC-HUB, EGI-ACE, AI-SPRINT, InterTwin, DT-GEO, etc.) and scientific communities (e.g. PanGeo, ENES, etc.). As a result, the Infrastructure Manager is being offered in production as one of the EGI services for research.
This contribution will also address the evolution of the Infrastructure Manager to become an orchestration component for the edge-to-cloud continuum, by allowing the definition of event-driven functions to be deployed on on-premises serverless computing platforms such as OSCAR and public FaaS offerings such as AWS Lambda.
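As a sketch of the programmatic path mentioned above, a TOSCA template could be submitted to an IM server through its REST API roughly as follows; the server URL, credentials and template are placeholders, and the authorization-line syntax should be checked against the IM documentation for each cloud back-end.

```python
# Illustrative sketch: submitting a TOSCA template to an Infrastructure Manager
# (IM) server through its REST API. Server URL, credentials and template are
# placeholders; consult the IM documentation for the exact auth-line syntax
# required by each cloud back-end.
import requests

IM_URL = "https://im.example.org:8800"  # placeholder IM endpoint

# IM expects one authorization line per provider, separated by a literal "\n".
auth = (
    "id = im; type = InfrastructureManager; username = user; password = pass\\n"
    "id = site; type = OpenStack; host = https://keystone.example.org:5000; "
    "username = demo; password = secret; tenant = demo"
)

tosca_template = open("simple-node.yaml").read()  # placeholder TOSCA description

response = requests.post(
    f"{IM_URL}/infrastructures",
    headers={"Authorization": auth, "Content-Type": "text/yaml"},
    data=tosca_template,
)
response.raise_for_status()
print("Infrastructure created:", response.text)  # returns the new infrastructure URL/ID
```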
The composition of workflows using visual environments can significantly benefit AI scientists in leveraging the Function-as-a-Service (FaaS) paradigm for the execution of inference pipelines. With this goal, we have designed, in the context of the AI4EOSC project, AI4Compose [https://github.com/AI4EOSC/ai4-compose], an approach to perform low-code composition of AI inference pipelines. It leverages Node-RED [https://nodered.org/] and Elyra [https://elyra.readthedocs.io/en/latest/index.html], two widely used open-source tools for the graphical composition of pipelines based on a drag-and-drop approach. On the one hand, Node-RED is a flow-based programming tool, originally developed by IBM's Emerging Technology Services team and now part of the OpenJS Foundation. It is a powerful tool for connecting hardware and services in a fast and easy way. On the other hand, Elyra is a set of AI-focused extensions for JupyterLab Notebooks. It provides a visual Notebook Pipeline editor to build notebook-based AI pipelines, simplifying the conversion of multiple notebooks into batch jobs or workflows.
The FaaS model enables scientists to efficiently manage application components, executed on-demand as functions. To exploit this model, AI4Compose is integrated with the OSCAR serverless framework [https://oscar.grycap.net/] to run the AI models for inference. OSCAR is an open-source platform that supports the event-driven serverless computing model for data-processing applications and can run on top of multi-clouds thanks to the Infrastructure Manager (IM) [https://www.grycap.upv.es/im/index.php]. Its functionality flow is mainly based on the monitoring of an object storage solution: users upload files to a bucket and this automatically triggers the execution of parallel invocations to a function responsible for processing each file (asynchronous mode). It also supports synchronous invocations through highly scalable HTTP-based endpoints (based on Knative). The integration with OSCAR is made through flow implementations offered as reusable components inside both the Node-RED and Elyra visual pipeline compositors.
With AI4Compose, users will gain agility and resource efficiency as they can delegate the management of the computing platform to OSCAR, which provides a highly scalable infrastructure to support complex computational tasks. Also, AI scientists can easily design, deploy and manage their workflows using an intuitive visual environment, reducing the time and effort required for the maintenance of inference pipelines. Lastly, our platform aims to lower the learning curve for researchers implementing AI and FaaS pipelines.
This work was supported by the project AI4EOSC ‘‘Artificial Intelligence for the European Open Science Cloud’’ that has received funding from the European Union’s Horizon Europe Research and Innovation Programme under Grant 101058593. Also, Project PDC2021-120844-I00 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR and Grant PID2020-113126RB-I00 funded by MCIN/AEI/10.13039/501100011033.
OSCAR is an open-source platform for serverless event-driven data-processing applications built on Kubernetes. The event-driven architecture allows for flexible and scalable applications which execute in response to events coming from different sources, such as object storage systems (e.g. MinIO or dCache).
An OSCAR cluster can be dynamically deployed by the Infrastructure Manager (IM) [2] on any major public (e.g. AWS), on-premises (e.g. OpenStack) or federated Cloud (e.g. EGI Cloud Compute), either using web-based user interfaces (IM Dashboard) or programmatically. OSCAR supports ARM-based computer architectures deployed in low-powered devices such as clusters of Raspberry Pis.
An OSCAR service is created by specifying a Docker image, which can be in an image container registry (e.g. Docker Hub), certain computing requirements (e.g. vCPUs, RAM, GPUs) and a user-provided shell script that will be executed inside a dynamically created container on a horizontally scalable Kubernetes cluster which grows and shrinks depending on the workload. An OSCAR service can also support synchronous invocations to create highly-scalable HTTP endpoints via Knative. It also provides the ability to expose load-balanced services accessed via HTTP requests, a more performant approach when deploying AI models for inference, where the weights need to be pre-loaded in memory to be reused for subsequent inference requests.
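For illustration, a synchronous invocation of an OSCAR service over HTTP might look like the sketch below; the cluster URL, service name and token are placeholders, and whether the payload must be base64-encoded depends on how the service is defined.

```python
# Illustrative sketch: invoking an OSCAR service synchronously over HTTP.
# The cluster URL, service name and access token are placeholders; the /run
# path is the synchronous-invocation endpoint described in the OSCAR docs,
# and the payload encoding depends on the service definition.
import base64
import requests

OSCAR_URL = "https://oscar.example.org"   # placeholder OSCAR cluster endpoint
SERVICE = "plant-classifier"              # placeholder service name
TOKEN = "…"                               # OIDC access token or service token

payload = base64.b64encode(open("leaf.jpg", "rb").read()).decode()

response = requests.post(
    f"{OSCAR_URL}/run/{SERVICE}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=payload,
    timeout=300,
)
response.raise_for_status()
print(response.text)  # output produced by the user-provided script inside the container
```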
OSCAR services can be chained in a Functions Definition Language (FDL) to create data-driven pipelines, even across multiple OSCAR clusters, so that the output of one service is uploaded to the input object storage of another service. By chaining these services, data-processing pipelines along the cloud-to-edge continuum can be created.
OSCAR is integrated with EGI Notebooks and Elyra to support the composition of AI inference pipelines from Jupyter Notebooks. It is also integrated with scientific object storage systems such as dCache to react upon file uploads to a certain folder. This functionality, when coupled with Apache Nifi for scalable event-driven ingestion, provides the ability to support data-driven processing in a scalable Kubernetes-based platform.
In AI-SPRINT, OSCAR supports the scalable inference of pre-trained AI models in use cases related to agriculture 4.0, personalised healthcare and maintenance and inspection. In AI4EOSC, OSCAR is also used to deploy AI models, extending its support to create visual AI pipelines using both Node-RED and Elyra for the AI4Compose service. In InterTwin, OSCAR performs data-driven ingestion and processing via dCache and Apache Nifi.
In this contribution, we want to showcase some benefits of OSCAR as a serverless platform for scientific computing.
Acknowledgements
Project PDC2021-120844-I00 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. Grant PID2020-113126RB-I00 funded by MCIN/AEI/10.13039/501100011033. This work was supported by the project AI-SPRINT ‘‘AI in Secure Privacy-Preserving Computing Continuum’’ that has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant 101016577. Also, by the project AI4EOSC ‘‘Artificial Intelligence for the European Open Science Cloud’’ that has received funding from the European Union’s Horizon Europe Research and Innovation Programme under Grant 101058593.
References
[1] OSCAR. https://oscar.grycap.net
[2] Infrastructure Manager. https://im.egi.eu
The EGI Federated Cloud (FedCloud) is a multinational cloud system that seamlessly integrates community, private, and public clouds into a scalable computing platform dedicated to research. Each cloud within this federated infrastructure configures its Cloud Management Framework (CMF) based on its preferences and constraints. The inherent heterogeneity among cloud sites can pose challenges when attempting to use native cloud tools such as Terraform at the federation level.
The FedCloud client, the official client of the EGI Federated Cloud, plays an important role in simplifying the use of these tools at the federation level.
In essence, the FedCloud client serves as a valuable bridge, simplifying the use of native cloud tools within the EGI Federated Cloud environment. Its features contribute to a more user-friendly and efficient cloud computing experience, particularly when dealing with the diverse cloud infrastructure found within the federation.
Secret management stands as an important security service within the EGI Cloud federation. This service encompasses the management of various types of secrets, including tokens and certificates, and their secure delivery to the target cloud environment. Historically, accessing secrets from virtual machines (VMs) has relied on OIDC access tokens, a method that harbors potential security vulnerabilities. In the event of VM compromise, these access tokens can be pilfered, enabling attackers to gain access to all user secrets.
The Locker mechanism introduces an innovative and robust approach to securely deliver secrets to VMs. Users can effortlessly create a locker, deposit their secrets within it, and then furnish the locker's token to their VMs. Key security attributes of the locker system include:
* Temporary and autoclean: Lockers have a limited lifespan and quantity. Upon expiration, lockers are automatically purged, along with all the secrets contained within them.
* Isolation: Access to the secrets within a locker is exclusively through its associated token, which can solely be used for accessing the locker's secrets - nothing more. This isolation allows users to store tokens in Continuous Integration/Continuous Deployment (CI/CD) pipelines and similar tools, mitigating the risk of exposing personal secrets.
* Malfeasance detection: The locker mechanism possesses the capability to detect if a token has been compromised and is being misused.
By adopting the locker approach, users can securely deliver secrets to VMs within the EGI Cloud federation, all while safeguarding their personal credentials from exposure. This innovative solution enhances the overall security posture of the cloud infrastructure, providing a robust foundation for secret management.
The Dynamic DNS service offered by IISAS plays a pivotal role in providing comprehensive, federation-wide Dynamic DNS support for virtual machines within the EGI Cloud infrastructure. This service allows users to register their preferred host names within designated domains (e.g., my-server.vo.fedcloud.eu) and associate them with public IP addresses of their servers.
The Dynamic DNS service brings about a significant enhancement in both the usability and security of services within the Cloud environment. Users can conveniently access services or virtual machines deployed in Clouds using pre-registered, meaningful, and memorable hostnames instead of raw IP addresses. Furthermore, with appropriately configured hostnames, users can obtain valid SSL certificates for the services hosted in the Cloud.
However, the true power of the Dynamic DNS service lies in its ability to facilitate service migration and ensure high availability. It can seamlessly redirect the service endpoint from one service instance to another in a mere minute, all without causing disruptions to end-users. By integrating with a monitoring service or utilizing a cron script, the Dynamic DNS service can automatically shift the service endpoint from a faulty instance to a healthy one, thus ensuring uninterrupted availability.
In summary, the Dynamic DNS service not only simplifies accessibility and security but also serves as a robust tool for achieving service continuity and high availability, making it an invaluable asset within the EGI Cloud infrastructure.
Edge-to-Cloud computing is expected to provide the means for workload execution and data processing both at the Edge and in the Cloud. In this presentation we describe different efforts addressing the challenges of achieving the next generation of continuum management, illustrated through two projects: illuMINEation and ICOS.
ICOS is proposing a high-level meta operating system (metaOS) to realize the continuum. The use case targeted to exploit these technologies relates to agriculture and robotics: the Agriculture Operational Robotic Platform (AORP) is an agro robot that can execute different tasks and missions, such as sowing and tending crops, removing weeds, monitoring crop development, and identifying threats. The platform moves autonomously through the field, performing the assigned missions. The robotic platform consists of control and driving modules. In addition, it is equipped with interchangeable tools - a seeder and a sprayer. The AORP is equipped with cameras, sensors and Edge computational devices that can be connected to the Cloud directly, via the transport platform, or via farm connectivity.
The core objective of illuMINEation is to improve the efficiency as well as the health & safety of European mining operations and their personnel. The project developed a multi-level distributed IIoT platform for improved decision-making processes, fostering the evolution of a virtual mining environment (including interfaces, Edge analytics, cyber security, fog & cloud infrastructure, and data transfer & communication). The illuMINEation project has received funding from the EU H2020 R&I Programme under grant No. 869379.
RedIRIS, in collaboration with the CCN, provides cybersecurity services to the Unique Scientific and Technical Infrastructures (ICTS) that depend on the Ministry of Science and Innovation in Spain. These services focus on the prevention of security incidents and on active defense. A key aspect of these services is to facilitate the complete process of adaptation to the ENS (adaptation, implementation, audit and certification) for the ICTSs. Additionally, other cybersecurity services are provided to help the organizations in the prevention of and defense against security incidents.
This presentation will review the current challenges in the design and implementation of Digital Twins from the point of view of model simulation.
Several scenarios will be presented diving into the different sets of requirements of the flagship applications in the projects DT-GEO and interTwin, and the different technical solutions available to tackle them.
An analysis of similarities and possible collaboration paths will be made.
The interTwin project is designing and building a Digital Twin Engine (DTE) to support interdisciplinary Digital Twins.
In particular, the interTwin DTE aims to support both end users (e.g. scientists, policymakers) and DT developers who would like an easy way to build and model their Digital Twins.
The talk will present the general project status, focusing on the first version of the Blueprint architecture released in June 2023, the use cases supported, and the contributions from the Spanish and Portuguese partner institutions.
The content of the project's first software release, due at the end of 2023, and the plan for our Open-Source Community will also be discussed, together with the activities to align the project architecture with what Destination Earth is designing.
interTwin is a project that started in September 2022, funded by the EU for the development of an open-source platform, called the Digital Twin Engine (DTE), to support the digital twins of selected communities and to be reusable across multiple scientific fields. For this reason, interTwin was designed to develop the platform involving both scientific domain experts and computational resource providers.
From the infrastructural perspective, one of the main challenges is to federate a set of highly heterogeneous and disparate providers. We envision coping with this by enabling a "transparent offloading" capability, which will be embedded in the DTE infrastructural layer by exploiting the Virtual Kubelet technology. In this talk we present the API layer developed by interTwin to guarantee a single interface, and thus a standard way for a cloud-deployed service to communicate with any external system for the actual payload offloading. The status of the current testbeds at the Vega and Juelich supercomputing centers will also be presented.
The itwinAI framework represents a comprehensive solution developed by CERN and the Julich Supercomputing Center (JSC) to facilitate the development, training, and maintenance of AI-based methods for scientific applications. It serves as a core module within the interTwin project, aimed at co-designing and implementing an interdisciplinary Digital Twin Engine. itwinAI streamlines the entire AI lifecycle, offering user-friendly core functionalities such as distributed training, hyperparameter optimization, and model registry.
Distributed Training: itwinAI simplifies the process of distributing existing code across multiple GPUs and nodes, automating the training workflow. It leverages industry-standard backends, including PyTorch Distributed Data Parallel (DDP), TensorFlow distributed strategies, and Horovod.
Hyperparameter Optimization: Enhancing model accuracy is made more efficient with itwinAI's hyperparameter optimization functionality. Researchers can intelligently explore hyperparameter spaces, eliminating the need for manual parameter tuning. This functionality is empowered by RayTune.
Model Registry: itwinAI offers a robust model registry for logging and storing models and associated performance metrics, enabling comprehensive analysis. The backend leverages MLFlow for seamless model management.
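To give an idea of the boilerplate that the distributed-training functionality abstracts away, the following is a generic, minimal PyTorch DDP example (not the itwinAI API), intended to be launched with torchrun.

```python
# Generic PyTorch DistributedDataParallel (DDP) boilerplate of the kind that
# itwinAI abstracts away. This is NOT the itwinAI API; it is a minimal sketch
# meant to be launched with `torchrun --nproc_per_node=N train.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(32, 1).to(device)  # toy model
    ddp_model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for _ in range(10):  # toy training loop on random data
        x = torch.randn(64, 32, device=device)
        y = torch.randn(64, 1, device=device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)  # gradients synchronised across ranks
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```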
itwinAI has undergone successful deployment and testing on JSC's HDFML cluster and has been integrated with interLink, a framework within interTwin designed to seamlessly offload compute-intensive tasks from cloud to high-performance computing (HPC) resources.
The versatility of itwinAI is evident in its application across various scientific domains, including its contributions to Detector Simulation in High-Energy Physics and Fire Risk Modeling in climate research. This framework stands as a valuable resource for researchers and data scientists.
Dataverse is an open source data repository solution with increased adoption by research organizations and user communities for data sharing and preservation. Datasets stored in Dataverse are cataloged, described with metadata, and can be easily shared and downloaded. After having dedicated one year to the development and integration of a Dataverse-based repository for research data, we realized the lack of tools, benchmarks and reference information regarding performance testing of Dataverse-based repositories.
In this presentation we will share our process of testing the application’s performance, the issues we came across and the process of debugging some of the bottlenecks we encountered.
Using some of the most common tools bundled with all Linux distributions, together with Apache JMeter to stress-test the service, we were able to discover some bottlenecks and create a series of interactions that can be used to benchmark the service's performance and behavior under load.
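As a simple illustration of the kind of interaction being measured, the sketch below times repeated calls to Dataverse's public search API with plain Python; the base URL is a placeholder, and the actual benchmarks in this work were driven with JMeter test plans.

```python
# Sketch of a simple load probe against a Dataverse installation: time repeated
# calls to the public search API. The base URL is a placeholder; real benchmark
# runs were driven with Apache JMeter, this only illustrates the interaction.
import time
import requests

BASE_URL = "https://dataverse.example.org"  # placeholder Dataverse instance

latencies = []
for _ in range(20):
    start = time.perf_counter()
    r = requests.get(f"{BASE_URL}/api/search", params={"q": "*", "per_page": 10}, timeout=30)
    r.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"mean latency: {sum(latencies) / len(latencies):.3f} s, max: {max(latencies):.3f} s")
```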
In the framework of the EOSC Association, several Task Forces have been created to study and report on Open Science, Open Data and, in particular, Quality for Research Software.
Research Software (RS) is defined as software that is produced by researchers and used as an enabler for scientific activities. A major objective of the EOSC Task Force on Infrastructures for Quality Research Software is to improve the quality of RS, both from the technical and the organizational point of view, for RS in general and in particular for the software used in the services offered through EOSC. The Task Force is subdivided into three subgroups. In this presentation we will describe the first results and the first deliverable from the "Ensure Software Quality" subgroup.
The subgroup has conducted a review of the state of the art of Software Quality models. The presentation will explain the strategy and the first deliverable, which contains a significant number of Quality Models with corresponding Quality Attributes. The models and attributes have been compared and were merged where appropriate. The deliverable has been published in Zenodo (https://zenodo.org/record/8221384) and is currently under review by parties external to the subgroup.
This first document constitutes a reference to the ongoing work towards identifying the most appropriate quality attributes, not necessarily from the same Quality Model, for each category of identified RS. Thus the objective of the subgroup is to recommend a set (or several sets) of Quality Attributes for RS validation.
Issues such as Open Source, FAIR for RS, and citation, as well as testing, will also be tackled in the presentation.
The EOSC-Synergy project is developing a toolset to bring mainstream practices closer to researchers throughout the development life cycle of the EOSC software and services. The objective is twofold: on the one hand, streamlining the adoption of such practices in the scope of the EOSC, and on the other hand, providing a software-quality assessment tool to promote, measure and reward quality.
The Software Quality Assurance as a Service (SQAaaS[1]) platform tackles both objectives. The web-based interface ensures that no previous expertise is required for composing the fundamental building blocks, the pipelines supported by JePL library, which define the workflow that drives the validation and verification of the software and services.
The JePL library facilitates the creation of Jenkins pipelines by using a YAML description to define the several stages that compose a CI/CD pipeline. The actions in the YAML configuration file are aligned with the criteria compiled in the software and service quality baselines [2][3], and support popular deployment tools to orchestrate the required set of services needed during the quality assessment process. A minimal (single-stage) Jenkins CI/CD pipeline definition (Jenkinsfile) is required to dynamically compose the required set of stages defined as actions in the YAML description.
Each step in a SQAaaS pipeline generated using JePL addresses a well-defined quality criterion according to the baseline criteria the EOSC-Synergy project has adhered to (and contributed to) [2][3]. The Pipeline as a Service module allows the researcher to compose ad hoc pipelines that can be readily used when added to code repositories. As a complement, the Quality Assessment & Awarding module conducts a comprehensive analysis of the quality attributes of a given software release and recognizes its achievements by issuing digital badges. The badges' metadata, compliant with the Open Badges specification [4], contain all the pointers and associated data that have resulted from the quality assessment process.
The SQAaaS platform has already been used by multiple use cases [5], and the first prototype, featuring the Pipeline as Code module, was closed in May 2022. As a proof of concept, this new release already provides support for issuing digital badges. The validation of each incremental release is actively performed by the thematic services that take part in the EOSC-Synergy project. The ultimate version, which will include the full coverage of the two aforementioned modules, will be available at the end of the project.
[1] https://digital.csic.es/handle/10261/296555
[2] http://dx.doi.org/10.20350/digitalCSIC/12543
[3] http://dx.doi.org/10.20350/digitalCSIC/12533
[4] https://www.imsglobal.org/spec/ob/v2p1/
[5] https://sqaaas.eosc-synergy.eu
The Software Quality Assurance as a Service (SQAaaS[1]) platform tackles both objectives. The web-based interface ensures that no previous expertise is required for composing the fundamental building blocks, the pipelines supported by JePL library, which define the workflow that drives the validation and verification of the software and services.
This demonstration will show how the SQAaaS Quality assessment & Awarding module can be easily used to assess and validate the quality of research software.
The Software Quality Assurance as a Service (SQAaaS[1]) platform tackles both objectives. The web-based interface ensures that no previous expertise is required for composing the fundamental building blocks, the pipelines supported by JePL library, which define the workflow that drives the validation and verification of the software and services.
This demonstration will show how the Pipeline as a service module can be easily used to create and execute quality pipelines for research software.
In the context of the EOSC-Synergy SQAaaS platform, the JePL library will be used to enable the on-demand dynamic composition of Jenkins pipelines that will perform the several steps of the envisaged quality assurance. These steps will implement the quality validation actions defined in the EOSC-synergy software and services quality criteria.
The demonstration will highlight the features and capabilities of the library in practice, showing how to easily create pipelines that implement and comply with the good practices expected during the software lifecycle, from development to production. Starting from the SQAaaS web interface, the configuration files for an open-source repository will be created. In more detail, it will be shown how to adjust the generated configurations, with some practical use cases using some of the available tools.
This is particularly relevant to developers and managers of research services both at the infrastructure and thematic levels.
The protection and sustainable management of marine ecosystem services is the main challenge addressed by the Galician Marine Sciences Program (Programa de Ciencias Mariñas de Galicia, CCMM) through three main lines of action: observation and monitoring of the marine environment and the coast; sustainable, smart and precision aquaculture; and innovation, knowledge and opportunities to adapt to change in the marine economy.
These three main lines are developed with the participation and drive of 250 researchers belonging to 89 research groups from multiple disciplines, and all the public institutions involved in the generation of marine science in Galicia. The Galician Marine Sciences Program is part of the State Marine Sciences Program, which in turn is part of the Complementary R&D&I Plans with the Autonomous Communities, through Investment Line 1 of Component 17 of the Recovery, Transformation and Resilience Plan.
CESGA participates in the project in the activity related to the data and integration platform, which uses CESGA's BigData platform. It will provide quick access to ready-to-use Big Data solutions and allow users to take advantage of modern data processing tools, covering a wide range of use cases that include the timely processing of large volumes of information in parallel, high-speed processing of data streams in real time, and the processing of heterogeneous data from different sources (structured and unstructured). There is no need to learn how to deploy complex Big Data services: users connect and start using the platform. BD|CESGA provides a scalable infrastructure whose capacity can grow with demand by adding additional resources.
https://ccmmbigdata.cesga.es/#features
https://cetmar.org/projects/programa-de-ciencias-marinas-de-galicia/?lang=en
In this contribution I will present the R&D projects and activities that are active in WLCG to evolve the infrastructure towards the HL-LHC era.
The Square Kilometre Array Observatory (SKAO) is an international collaborative effort focused on constructing and operating the world's most advanced radio telescope. SKA data (~700PB/year) will be delivered to a Global Network of SKA Regional Centres (SRCNet) that will provide the global scientific community with access to SKA Observatory data with analysis tools and services, as well as the processing and storage capacity to fully exploit its scientific potential, making the SRCs the place where SKA science will be done.
Five prototypes have been proposed within the SRCNet to be implemented in order to provide each of the building blocks that will shape the SRCNet. These prototypes cover a) the deployment of global data distribution, replication and scientific archiving platforms between SRCNet nodes, b) federated authentication, authorisation and auditing infrastructure, c) distributed data processing between SRCs, d) data visualisation, and finally e) software delivery and distribution. To tackle the work with these prototypes, different Agile teams have been established, consisting of members from several national SRCs of the SRCNet who collaborate in the development, implementation and deployment of these services for SRCNet.
The Instituto de Astrofísica de Andalucía (IAA-CSIC) is leading the development of the Spanish SKA Regional Centre (ESPSRC). The ESPSRC members work in an Agile team called the Coral Team (an Agile stream-aligned team) that also integrates members from the United Kingdom (UKSRC), Switzerland (SRCCH) and Sweden (SWESRC). The Coral team is involved in the deployment and testing of a scaled-down version of the international SRCNet platform (mini-SRCNet), key to the evaluation of the technologies to be used in its implementation.
Under this scenario, and to support the work with the prototypes, the ESPSRC provides a flexible infrastructure model governed by OpenStack where computing resources and storage are enabled for collaboration on the development of the prototypes proposed in the team, such as the deployment of data distribution platforms like Rucio and the CADC Storage Inventory, orchestration of container services, science platforms, Virtual Machines (VMs), on-demand clusters and software distribution, among others.
The ESPSRC plays an important role in providing computing resources for research, development and training/testing projects, fostering a transparent and collaborative environment aligned with FAIR and Open Science principles. In this contribution we present the ESPSRC, detailing our work within the SRCNet collaboration on hardware infrastructure and cloud computing, data distribution and archiving, software delivery and science services, as well as our collaborations with other SRCs.
The H2020 C-SCALE (Copernicus - eoSC AnaLytics Engine, https://c-scale.eu/) project has created services to unlock the vast potential of Copernicus data for advanced Earth Observation analytics by providing a pan-European federated data and computing infrastructure through the EOSC Portal. As the project comes to an end, this session aims to present its main outcomes, with a focus on how the Iberian e-infrastructure community has contributed to them.
During the conference session, attendees will be presented with an overview of the resulting C-SCALE service portfolio, which has been made accessible through the EOSC Marketplace, including the FedEarthData, EO-MQS and openEO platform services and the Workflow Solutions, designed to enhance the efficiency of data processing and analysis, thereby streamlining both research and application development in Earth Observation. The session is tailored to the needs of both data and compute providers: for both, comprehensive information will be shared on how to become part of the C-SCALE federated ecosystem. Real-world examples of how INCD has been contributing to and leveraging C-SCALE services will be provided, such as the deployment of a self-hosted STAC catalogue supporting both openEO and the Agriculture of Data use case.
Finally, the sustainability of the project results and their early and expected future impact will be discussed, including how to engage with the services and foster potential follow-up collaborations.
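As a small illustration of how such a self-hosted STAC catalogue can be queried, here is a hedged sketch using the pystac-client library; the catalogue URL, collection name and search window are placeholders, not the actual INCD endpoint.

```python
# Sketch: querying a STAC catalogue (e.g. a self-hosted one) for Sentinel-2 scenes.
# URL, collection id and area of interest are placeholders for illustration only.
from pystac_client import Client

catalog = Client.open("https://stac.example.org")   # hypothetical endpoint
search = catalog.search(
    collections=["sentinel-2-l2a"],                 # assumed collection id
    bbox=[-9.5, 36.0, 3.3, 43.8],                   # roughly the Iberian Peninsula
    datetime="2023-06-01/2023-06-30",
    max_items=5,
)
for item in search.items():
    print(item.id, list(item.assets))
```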
The International Lattice Data Grid (ILDG) is a community effort of physicists working on Lattice Field Theory to coordinate and enable the sharing of their large and valuable datasets from numerical simulations. ILDG started around 20 years ago and is organized as a world-wide federation of regional grids which use interoperable services (e.g. catalogues) and unified standards (e.g. for APIs and metadata) following the FAIR principles. We report on the status, progress and plans to extend and modernize the ILDG framework and services, including the new implementation of the metadata and file catalogues, extensions of the metadata schema, and the transition to INDIGO IAM and token-based technologies.
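To illustrate the kind of token-based access that the transition to INDIGO IAM implies, the sketch below obtains an OAuth2 access token via the client-credentials flow and uses it against a catalogue API; the endpoints, client identifiers and scopes are purely illustrative and are not actual ILDG values.

```python
# Sketch: obtaining a bearer token from an INDIGO IAM instance (client-credentials
# flow) and calling a metadata catalogue with it. All endpoints, client ids and
# scopes below are illustrative placeholders.
import requests

token_resp = requests.post(
    "https://iam.example.org/token",                 # hypothetical IAM token endpoint
    data={"grant_type": "client_credentials", "scope": "catalogue.read"},
    auth=("my-client-id", "my-client-secret"),       # registered client credentials
    timeout=30,
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# Query the (hypothetical) metadata catalogue REST API with the bearer token
resp = requests.get(
    "https://catalogue.example.org/api/ensembles",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
print(resp.json())
```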
Climate data analysis often entails downloading datasets of several terabytes in size from various sources and employing local workstations or HPC computing infrastructures for analysis. However, this approach becomes inefficient in the era of big data due to the considerable expenses linked with transferring substantial volumes of raw data over the Internet from diverse sources, encompassing observations and model simulations. Climate data analysis tasks involve routine procedures like subsetting, regridding, and bias adjustment. These processes can be effectively executed using existing packages that adhere to best practices, thereby curtailing redundancy. Recent strides in web-based computing frameworks and cloud computing have emerged as feasible alternatives, furnishing collaborative computing infrastructures that improve code reproducibility and reusability. Cloud systems are frequently established on top of object storage, accompanied by the development of pioneering data formats and libraries that harness the potential of this innovative storage paradigm. Consequently, web-based virtual research environments have arisen, grounded in cloud infrastructures. These infrastructures not only refine data analysis workflows but also improve overall productivity. This work provides an overview of these research environments.
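As an example of the analysis-ready, cloud-native pattern described above, the sketch below opens a Zarr store on object storage with xarray and performs a typical subsetting step followed by a simple spatial coarsening; the store URL, variable and coordinate names are hypothetical and depend on the dataset in question.

```python
# Sketch of a cloud-native climate workflow: open a Zarr dataset directly from
# object storage, subset it, and coarsen it spatially. The store URL, variable
# and coordinate names are hypothetical placeholders.
import fsspec
import xarray as xr

store = fsspec.get_mapper("s3://climate-bucket/cmip6-tas.zarr", anon=True)  # hypothetical store
ds = xr.open_zarr(store)

# Subset in time and space instead of downloading the full archive
subset = ds["tas"].sel(
    time=slice("2000-01-01", "2010-12-31"),
    lat=slice(35, 45),
    lon=slice(-10, 5),
)

# A simple spatial coarsening as a stand-in for a full regridding step
coarse = subset.coarsen(lat=2, lon=2, boundary="trim").mean()
print(coarse)
```

Working this way, only the reduced result ever leaves the storage system, which is the main efficiency argument made above.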
Reproducibility is a cornerstone of scientific research, ensuring the reliability and validity of results by allowing independent verification of findings. During the EGI-ACE project, EGI has developed EGI Replay, a service that allows researchers to reproduce and share custom computing environments effortlessly. With Replay, researchers can replicate the execution of an analysis in a notebook-based platform, ensuring that others can easily access and interact with the content.
EGI Replay is based on Binder technology, which builds computing environments on the fly from a code repository containing the code to run, together with a set of configuration files that determine the exact computing environment in which to run it. Replay also generates shareable links so that others can interact with the content from any browser. This means other researchers can easily reproduce an analysis and access data available in EGI's infrastructure, making it easier than ever to collaborate and share work with others.
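For readers unfamiliar with the Binder workflow, the sketch below shows how a BinderHub-style build endpoint can be triggered programmatically and how the resulting session URL is obtained; the base URL is a placeholder, and the exact API exposed by EGI Replay may differ from this generic BinderHub pattern.

```python
# Sketch: programmatically triggering a Binder-style build of a repository and
# waiting for the interactive session URL. The base URL is a placeholder; the
# actual EGI Replay API may differ from this generic BinderHub pattern.
import json
import requests

BASE = "https://replay.example.org"            # hypothetical Replay/BinderHub endpoint
repo_spec = "gh/my-org/my-analysis/main"       # GitHub org/repo/ref to build

with requests.get(f"{BASE}/build/{repo_spec}", stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                            # server-sent events: keep only data lines
        event = json.loads(line[len("data:"):])
        print(event.get("phase"), event.get("message", "").strip())
        if event.get("phase") == "ready":
            print("session available at:", event["url"])
            break
```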
EGI Replay is powered by the EGI distributed infrastructure and integrated with additional EGI services:
Check-in provides federated authentication to Replay and integration with the EOSC AAI. Replay automatically generates and refreshes access tokens for accessing any Check-in enabled service from EGI or third-party providers.
DataHub provides simple and scalable access to distributed data for Replay. Users' DataHub spaces are automatically mounted and visible on Replay's interface, making it simple to perform analyses of the available datasets. DataHub currently hosts more than 1,000 public and private datasets, amounting to more than 1.3 PB.
Software Distribution provides access to software based on CVMFS. Replay mounts selected CVMFS repositories on the environment for even simpler access to community-specific software.
With the recent introduction of the EOSC Data Transfer in the EOSC portal, new workflows for data analytics are enabled in Replay: users can trigger the transfer of available EOSC datasets to EGI infrastructure and then perform their reproducible analysis with Replay accessing the data.
This presentation will provide an overview of the EGI Replay service and how it can support the reproducibility of Open Science using data from the EOSC portal.
OpenStack, known for its open-source cloud capabilities, offers a wide range of services for creating and managing various types of clouds. However, setting up and maintaining an OpenStack cloud can be a complex and time-consuming task. In recent years, Kubernetes has gained popularity as a platform for managing containers, making it an attractive choice for simplifying the infrastructure behind OpenStack.
At CESNET we have decided to build a next-generation OpenStack cloud on top of Kubernetes. In this presentation, we introduce our innovative approach to building the CESNET OpenStack cloud using Kubernetes as the foundation, combined with an OpenStack-Helm-Infra-based distribution. We will discuss the benefits of this approach, which enables simplified deployment and management of OpenStack services by leveraging the automation and scalability of Kubernetes.
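As a small, hedged illustration of what running OpenStack as Kubernetes workloads looks like operationally, the sketch below uses the official Kubernetes Python client to list the OpenStack control-plane pods; the "openstack" namespace follows a common openstack-helm convention and is an assumption here, not necessarily CESNET's layout.

```python
# Sketch: inspecting OpenStack control-plane services running as Kubernetes pods.
# The "openstack" namespace follows a common openstack-helm convention; an actual
# deployment may organise its namespaces differently.
from kubernetes import client, config

config.load_kube_config()                      # use the local kubeconfig credentials
core = client.CoreV1Api()

for pod in core.list_namespaced_pod("openstack").items:
    print(f"{pod.metadata.name:60s} {pod.status.phase}")
```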
In today's world of high-throughput bioinformatics and advanced experimental techniques, researchers are generating enormous datasets. Consider cryo-electron microscopy data or the complex images produced by nuclear magnetic resonance and light microscopy: they are all rich in scientific value, not just for the researchers who generate them but for the entire scientific community. The key to unlocking their potential lies in making them accessible, and that is where metadata comes into play.
This contribution dives into the techniques and steps needed to handle metadata effectively. We'll explore how to extract metadata from the data sources themselves, organize it using ontologies (structured frameworks for knowledge), and seamlessly incorporate it into commonly recognized metadata standards. We'll also look at the development of a powerful system and language that can unify metadata from different sources, making it clear and easy to work with.
In response to these challenges, our team is in the final stages of creating a groundbreaking toolset. This toolset will empower scientists to add and manage metadata according to specific guidelines (FAIR principles and EOSC recommendations).
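To make the idea of extraction-plus-mapping concrete, here is a deliberately simplified sketch: it reads a few acquisition parameters from a hypothetical instrument header and maps them onto generic metadata fields annotated with ontology term identifiers. The field names, term IDs and input format are illustrative only and are not the toolset's actual schema.

```python
# Simplified sketch: extract raw acquisition parameters and map them onto a
# standards-oriented metadata record. Field names, ontology term IDs and the
# input format are illustrative placeholders.
import json

def extract_raw_header(path: str) -> dict:
    """Toy extractor: read key=value pairs from a plain-text instrument header."""
    header = {}
    with open(path) as fh:
        for line in fh:
            if "=" in line:
                key, value = line.split("=", 1)
                header[key.strip()] = value.strip()
    return header

# Mapping from instrument-specific keys to community metadata fields,
# each annotated with an (illustrative) ontology term identifier.
FIELD_MAP = {
    "Voltage":   ("acceleration_voltage_kV", "TERM:0001"),
    "PixelSize": ("pixel_size_angstrom",     "TERM:0002"),
    "Detector":  ("detector_model",          "TERM:0003"),
}

def to_standard_record(raw: dict) -> dict:
    """Translate raw header keys into annotated, standard-style metadata fields."""
    record = {}
    for raw_key, (field, term_id) in FIELD_MAP.items():
        if raw_key in raw:
            record[field] = {"value": raw[raw_key], "ontology_term": term_id}
    return record

if __name__ == "__main__":
    raw = extract_raw_header("acquisition_header.txt")   # hypothetical input file
    print(json.dumps(to_standard_record(raw), indent=2))
```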
Instruct is the pan-European Research Infrastructure for structural biology, centered on bringing high-end technologies and methods to European researchers. The Instruct Image Processing Center (I2PC) has been actively promoting FAIR practices for cryoEM data acquisition and processing workflows. Efforts carried out in the scope of projects such as EOSC-Life, BY-COVID and EOSC-Synergy are driving the implementation of the Data Management Plan of Instruct-ERIC and defining the future of an integrated cryoEM workflow. Image acquisitions at the microscope facilities are controlled by SmartScope (an open-source software package used in the field) and registered in a Laboratory Information Management System (LIMS), while the project details and metadata will be made FAIR-ly available through ARIA (the project management application developed within Instruct) and the information is stored in a data space (with distributed tools such as iRODS). The processing can easily be deployed on federated computing infrastructures (such as EGI Cloud Compute or the IFCA cloud) by using ScipionCloud. Eventually, the data is made publicly available through EMPIAR (the electron microscopy public archive). The effort will culminate in a fully connected and integrated system, offering a solution for the whole pipeline in cryoEM (and other) techniques, respecting FAIR principles and ensuring the availability of information.
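The data-space side of this pipeline can be illustrated with a short, hedged sketch using the python-irodsclient: it uploads a processed map to an iRODS collection and attaches a few metadata attributes. Host, zone, paths and attribute names are placeholders; a real Instruct deployment would use its own conventions.

```python
# Sketch: registering a processed cryoEM result in an iRODS data space and
# attaching metadata to it. Connection details, paths and attribute names are
# placeholders for illustration only.
from irods.session import iRODSSession

with iRODSSession(host="irods.example.org", port=1247,
                  user="researcher", password="secret", zone="demoZone") as session:
    local_file = "volume_map.mrc"
    irods_path = "/demoZone/home/researcher/project42/volume_map.mrc"

    session.data_objects.put(local_file, irods_path)      # upload the data object
    obj = session.data_objects.get(irods_path)

    # Attach a few descriptive attributes so the object is findable later
    obj.metadata.add("acquisition_id", "EMP-0042")
    obj.metadata.add("pixel_size", "1.06", "angstrom")
```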
The “Digital Science and Innovation” Interdisciplinary Thematic Platform (PTI) was launched by the Spanish National Research Council (CSIC) in June 2022, with the aim to innovate in all areas of digital science and data lifecycle management, from planning, acquisition, and processing to publication and preservation.
The platform groups its activity into the following 4 strategic areas and 2 cross-cutting areas:
Data Science
Sensors and integration of intelligent systems
Cross-cutting software and tools
Digital Security
Open Science
Innovation
In addition, the platform integrates infrastructures such as clean rooms and data processing centres. The idea is that PTI integrates all the capabilities, technologies, and knowledge of CSIC research groups working on digitization, to tackle projects with high levels of technological development, and integrate the results into the industrial sector.
During its first year the platform has established links with many different groups and now includes about 150 researchers from more than 40 different CSIC institutes; it also has external participants from industry and other institutions. We have established links with GAIA-X Spain through the FGCSIC and made contacts in industry.
All of this was presented during the PTI's first Annual Meeting last May at the CCHS in Madrid, with the participation of representatives from research, governmental institutions and industry.
Data Science
One of our aims is to enable collaboration in Data Science projects for researchers across the whole spectrum, and in particular to encourage the use of digital technologies in all areas of science and the humanities, helping to establish links among different research groups and communities and fostering collaboration and participation in funding calls for projects related to digital technologies at all levels, from local to European.
For instance, three of the PTI's groups have collaborated in the preparation of a European project proposal (SIESTA) that has been funded by the European Commission and will start on the first of January 2024.
We also aim to facilitate access to, and knowledge of, these technologies for all types of researchers (from wet lab to digital humanities). To this end we propose collaborations to facilitate access to computational infrastructures and are preparing several training activities.
Open Science
Open Science is a cross-cutting area in the platform, meaning that it is necessary to address it in any project developed in the platform. Our objectives in this area of the PTI are:
We have organised a summer course at UIMP, “Pilares para el avance de la Ciencia Abierta”, held in Cuenca in September 2023, which covered different aspects of Open Science, from Open Access and FAIR data to infrastructures (such as EOSC) and policies.
Kampal IC is an application to host networked "collective intelligence" brainstorms with an evolution model inspired by physics, limiting interactions to 2^D "nearest neighbors".
We describe its expansion to include artificial agents that can interact either with humans or with one another. Due to the computational requirements of modern AI agents, mostly based on Large Language Models, this architecture includes an API to deploy agents in a distributed way and activate them when needed.
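To illustrate what deploying agents in a distributed way and activating them on demand might look like, here is a minimal, hypothetical HTTP activation endpoint; it is a sketch of the pattern, not Kampal IC's actual API, and the model call is stubbed out.

```python
# Hypothetical sketch of a distributed-agent activation endpoint: a lightweight
# HTTP service that wakes an artificial agent and returns its contribution.
# This illustrates the pattern only; it is not Kampal IC's actual API.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_contribution(topic: str, neighbour_ideas: list[str]) -> str:
    """Stub for the model-backed agent; a real deployment would call an LLM here."""
    return f"On '{topic}', building on {len(neighbour_ideas)} neighbouring ideas..."

@app.route("/activate", methods=["POST"])
def activate_agent():
    # Activation request carries the topic and the ideas of neighbouring nodes
    payload = request.get_json(force=True)
    idea = generate_contribution(payload["topic"], payload.get("neighbours", []))
    return jsonify({"agent_id": payload.get("agent_id"), "idea": idea})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```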
Many FAIR and Open Data concepts are being implemented with more collaborative tools than ever before. The VISA (Virtual Infrastructure for Scientific Analysis) Portal, developed at the Institut Laue-Langevin (ILL) and adopted by partners in the EU projects PaNOSC and ExPaNDS, is the most promising concept for us at DESY.
A group of research institutes that participated in these projects agreed to advance the development, integration and subsequent use of VISA at their respective facilities. The portal is in production at ILL and the European Synchrotron Radiation Facility (ESRF) for beamline control and data access as well as analysis.
VISA allows scientists to start their preferred set of interactive analysis tools and gives them access to scientific data repositories.
VISA's inherent access control mechanism synchronized with scientific metadata catalogues allows users to access not only Open Data but also their private data that might still be under embargo.
Concrete topics in the presentation at IBERGRID will be our roadmap for integrating VISA into the DESY infrastructure and our planned changes to the external user account management, targeting improved FAIR Data compliance.
INCD (www.incd.pt) provides computing and data services to the Portuguese national scientific and academic community in all areas of knowledge. The infrastructure is especially oriented to provide scientific computing and data oriented services, supporting researchers and their participation in national and international projects.
INCD operates an integrated infrastructure with services provided from multiple geographic locations, interconnected by a state-of-the-art data network. The INCD services are integrated in international infrastructures with which it shares computing resources for the benefit of projects of national and international relevance. In this context, INCD participates in the European Grid Infrastructure (EGI), the Iberian computing infrastructure (IBERGRID), the Worldwide LHC Computing Grid (WLCG), the European Open Science Cloud (EOSC) and the Portuguese Advanced Computing Network (RNCA).
This presentation will provide an overview of the current status of the INCD infrastructure and its expected developments.
A review of the current upgrade plans and future endeavours of PSNC will be presented.
The Portuguese Distributed Computing Infrastructure (INCD) is a digital research e-infrastructure with operational centers at different geographic locations within the country. The catalogue of services includes both generic services like HPC, HTC and cloud as well as tailored services developed for specific purposes and/or scientific domains.
This ecosystem represents a challenge regarding deploying and managing software and services across several locations. To address this challenge, INCD has over the years been developing tools and practices to maintain this software, which is then made available across sites through a central software distribution service based on Squid and the CernVM File System (CVMFS).
In this talk, we will present the architecture, workflow and software tools used at INCD to install, test, deploy and publish software for services and user applications.
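As an illustration of the publication step in such a CVMFS-based distribution workflow, the sketch below wraps the standard `cvmfs_server` transaction/publish cycle from Python; the repository name and target path are placeholders, not INCD's actual configuration.

```python
# Sketch: publishing a new software release to a CVMFS repository by wrapping the
# standard cvmfs_server transaction/publish cycle (run on the Stratum 0 server).
# The repository name and target path are placeholders for illustration.
import shutil
import subprocess

REPO = "sw.example.org"                                   # hypothetical repository
TARGET = f"/cvmfs/{REPO}/apps/mytool/1.2.0"               # where the release will live

subprocess.run(["cvmfs_server", "transaction", REPO], check=True)
try:
    shutil.copytree("build/mytool-1.2.0", TARGET)         # stage the new release
    subprocess.run(["cvmfs_server", "publish", REPO], check=True)
except Exception:
    subprocess.run(["cvmfs_server", "abort", "-f", REPO], check=True)
    raise
```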
INCD is a Portuguese digital infrastructure that provides computing and data oriented services for research and education in all scientific domains. INCD has been part of the national roadmap of research infrastructures and is a member of the Portuguese Advanced Computing Network (RNCA).
As a national infrastructure, INCD needs to fulfil statutory, legal and funding-related obligations. For these purposes INCD needs to collect, manage and report on a wide range of present and past information regarding access requests, supported projects, usage, research results and impact. This information is also essential for the management of users and projects, and it has so far been kept across several separate systems and databases.
In order to streamline and improve the management of the required information, INCD is developing a new information system. With this system INCD aims to facilitate both the administrative and technical aspects of its operation, including: user and project management, helpdesk and support, keeping track of results such as scientific publications, patents, theses and dissemination materials, and, finally, making it easier to produce statistics and reports. This presentation will focus on the system requirements, challenges and design.
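As a purely illustrative sketch of the kind of unified data model such an information system implies, the dataclasses below link users, projects and research outputs for reporting purposes; the field names are hypothetical and do not reflect INCD's actual design.

```python
# Purely illustrative sketch of a unified data model linking users, projects and
# research outputs for reporting purposes. Field names are hypothetical and do
# not reflect INCD's actual system design.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class User:
    username: str
    institution: str
    email: str

@dataclass
class Publication:
    doi: str
    title: str
    year: int

@dataclass
class Project:
    project_id: str
    title: str
    start: date
    members: list[User] = field(default_factory=list)
    publications: list[Publication] = field(default_factory=list)
    cpu_hours_used: float = 0.0

    def report_line(self) -> str:
        """One-line summary suitable for a usage/impact report."""
        return (f"{self.project_id}: {len(self.members)} members, "
                f"{len(self.publications)} publications, "
                f"{self.cpu_hours_used:.0f} CPU hours")
```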
CESGA