28–30 Oct 2024
Porto
Europe/Lisbon timezone

Galician Marine Sciences Program Data Lake

28 Oct 2024, 17:10
20m
Auditório (Centro de Investigação Médica (CIM-FMUP))

Development of innovative software and services IBERGRID

Speakers

Ms Cecilia Grela Llerena (CESGA), Mr Pablo Prieto Rúa (CESGA), Dr Javier Cacheiro López (CESGA)

Description

Over the last two years, the Galician Marine Sciences Program (CCMM) has developed a Data Lake to support the collection and analysis of data on Galicia’s marine ecosystem. The Data Lake architecture handles both structured and unstructured data and already integrates diverse datasets: ocean current velocity maps, species distribution data, upwelling indices, buoy-derived marine conditions, marine carbon-related datasets, SOCAT data for coastal and North Atlantic waters, and atmospheric models.
For the storage layer, the Data Lake uses Apache Hadoop’s distributed filesystem (HDFS) together with the Apache Parquet columnar file format, a combination suited to efficient distributed and parallel processing.
For the analysis layer, Apache Spark enables high-performance, scalable data processing, combining multiple datasets to advance marine ecosystem research and support sustainable resource management.
Interactive processing is available through a web portal whose JupyterLab notebooks are tightly integrated with the Data Lake and customized for marine science use.
The Data Lake not only accelerates data-driven insights but also provides a scalable infrastructure for future research, fostering collaboration and innovation in the sustainable management of Galicia’s marine resources.
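
As an illustration of the analysis layer described above, the following is a minimal PySpark sketch of how two Parquet datasets stored on HDFS might be combined. It is not taken from the CCMM Data Lake: the HDFS root, the dataset names (`buoy_observations`, `upwelling_index`), and the column names (`obs_time`, `sea_temp_c`, `date`, `upwelling_index`) are all hypothetical and chosen only to make the example concrete.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; pyspark is imported lazily below
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical HDFS location; the real Data Lake layout is not given in the abstract.
LAKE_ROOT = "hdfs:///datalake/ccmm"


def dataset_path(name: str, root: str = LAKE_ROOT) -> str:
    """Return the (assumed) Parquet directory for a named dataset."""
    return f"{root.rstrip('/')}/{name}.parquet"


def monthly_temperature_vs_upwelling(spark: SparkSession) -> DataFrame:
    """Join buoy observations with an upwelling index and average per month."""
    from pyspark.sql import functions as F

    # Parquet is columnar, so Spark reads only the columns referenced below.
    buoys = spark.read.parquet(dataset_path("buoy_observations"))
    upwelling = spark.read.parquet(dataset_path("upwelling_index"))

    # Reduce buoy readings to one mean sea temperature per day.
    buoys_daily = (
        buoys.withColumn("date", F.to_date("obs_time"))
        .groupBy("date")
        .agg(F.avg("sea_temp_c").alias("mean_sea_temp_c"))
    )

    # Combine both datasets on the shared date, then aggregate by month.
    return (
        buoys_daily.join(upwelling, on="date", how="inner")
        .groupBy(F.date_trunc("month", F.col("date")).alias("month"))
        .agg(
            F.avg("mean_sea_temp_c").alias("mean_sea_temp_c"),
            F.avg("upwelling_index").alias("mean_upwelling_index"),
        )
        .orderBy("month")
    )
```

In a notebook where a `SparkSession` named `spark` is already available, calling `monthly_temperature_vs_upwelling(spark).show()` would print one row per month, provided datasets with the assumed names and columns exist.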

Primary authors

Ms Cecilia Grela Llerena (CESGA), Mr Pablo Prieto Rúa (CESGA), Dr Javier Cacheiro López (CESGA), Dr Carlos Fernandez Sanchez (CESGA), Mr Francisco Landeira Vega (CESGA)
