FAIRfication of PANGAEA datasets: Recent Developments and Lessons Learned
Related papers
PANGAEA—an information system for environmental sciences
Computers & Geosciences, 2002
PANGAEA is an information system for processing, long-term storage, and publication of georeferenced data related to the earth sciences. Essential services supplied by PANGAEA are project data management and the distribution of visualization and analysis software. Organization of data management includes quality control and publication of data and the dissemination of metadata according to international standards. Data managers are responsible for acquisition and maintenance of data. The data model used reflects the information processing steps in the earth science fields and can handle any related analytical data. The basic technical structure corresponds to a three-tiered client/server architecture with a number of comprehensive clients and middleware components controlling the information flow and quality. On the server side, a relational database management system (RDBMS) is used for information storage. The web-based clients include a simple search engine (PangaVista) and a data mining tool (ART). The client used for maintenance of information contents is optimized for data management purposes. Analysis and visualization of metainformation and analytical data are supported by a number of software tools, which can either be used as 'plug-ins' of the PANGAEA clients or as standalone applications, distributed as freeware from the PANGAEA website. Established and well-documented software tools are the mini-GIS PanMap, the plotting tool PanPlot, and Ocean Data View (ODV) for the exploration of oceanographic data. PANGAEA operates on a long-term basis. The available resources are sufficient not only for the acquisition of new data and the maintenance of the system but also for further technical and organizational developments.
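The client/server access pattern the abstract describes survives today in PANGAEA's web services and the pangaeapy Python client (neither is mentioned in this 2002 paper). A minimal sketch, assuming pangaeapy is installed and using a placeholder DOI; the PanDataSet attribute names follow pangaeapy's documented interface but should be verified against the installed release:

```python
# Minimal sketch of programmatic access to a PANGAEA dataset via the
# pangaeapy client (pip install pangaeapy). The DOI below is a
# placeholder; .title and .data follow pangaeapy's documented interface.
from pangaeapy.pandataset import PanDataSet

# Load a dataset by DOI; metadata and tabular data are fetched from
# the PANGAEA servers and parsed into Python objects.
ds = PanDataSet("10.1594/PANGAEA.000000")  # placeholder DOI

print(ds.title)        # citation-level metadata
print(ds.data.head())  # measurements as a pandas DataFrame
```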
Concurrency and Computation: Practice and Experience
EarthCube Data Discovery Studio (DDStudio) is a cross-domain geoscience data discovery and exploration portal. It indexes over 1.65 million metadata records harvested from more than 40 sources and utilizes a configurable metadata augmentation pipeline to enhance metadata content, using text analytics and an integrated geoscience ontology. Metadata enhancers add keywords with identifiers that map resources to science domains, geospatial features, measured variables, and other characteristics. The pipeline extracts spatial location and temporal references from metadata to generate structured spatial and temporal extents, maintaining provenance of each metadata enhancement and allowing user validation. The semantically enhanced metadata records are accessible as standard ISO 19115/19139 XML documents via standard search interfaces. A search interface supports spatial, temporal, and text-based search, as well as functionality for users to contribute, standardize, and update resource descriptions, and to organize search results into shareable collections. DDStudio bridges resource discovery and exploration by letting users launch Jupyter notebooks residing on several platforms for any discovered dataset or dataset collection. DDStudio demonstrates how linking search results from the catalog directly to software tools and environments reduces time to science in a series of examples from several geoscience domains. URL: datadiscoverystudio.org
Keywords: data discovery, Jupyter notebooks, metadata, metadata augmentation
Introduction
Finding data using commercial search engines or numerous domain-specific data portals, then downloading the files or accessing the data via services to explore their content and determine their fitness for use, are common components of research projects. Increases in data volumes, variety of data types, incomplete or poorly structured metadata, implicit assumptions, and unfamiliar terminology complicate interpretation of discovered resources and their reuse in research workflows, especially in multidisciplinary studies. In the geosciences, finding data is a well-articulated challenge due to heterogeneous data models, semantic conventions, access protocols, and other practices of data description and access across geoscience disciplines. The quality, completeness, and standards compliance of available metadata catalogs vary dramatically, while metadata curation remains mostly manual and labor-intensive. As a result, traditional metadata management and data discovery models become increasingly inadequate. While large-scale web search is supported by several commercial search engines, finding data across domains remains a challenge.
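Catalogues that expose ISO 19115/19139 records through "standard search interfaces" can typically be queried over OGC CSW. A minimal sketch using the owslib library; the endpoint URL is a placeholder, not DDStudio's documented service address:

```python
# Hedged sketch: full-text query against a CSW catalogue with owslib
# (pip install owslib). The endpoint URL is a placeholder.
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

csw = CatalogueServiceWeb("https://example.org/csw")  # placeholder endpoint

# Constrain on all queryable text fields; '%' is the CSW wildcard.
query = PropertyIsLike("csw:AnyText", "%hydrology%")
csw.getrecords2(constraints=[query], maxrecords=5)

for identifier, record in csw.records.items():
    print(identifier, "-", record.title)
```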
PANGAEA information system for glaciological data management
1998
Specific parameters determined from continental ice sheet or glacier cores can be used to reconstruct former climate. To use this scientific resource effectively, an information system is needed which guarantees consistent long-term data storage and provides easy access. Such a system, archiving any data of paleoclimatic relevance together with the related metadata, raw data, and evaluated paleoclimatic data, is presented. It is based on a relational database and provides standardized import and export routines, easy access with uniform retrieval functions, and tools for visualizing the data. The network is designed as a client-server system, providing access through the Internet either via proprietary client software with high functionality or via read-only access to published data on the World Wide Web (www.pangaea.de).
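The relational model the abstract describes (core metadata, parameter definitions, raw and evaluated values, uniform retrieval) can be caricatured in a few tables. The schema below is an illustrative simplification of that idea, not PANGAEA's actual model:

```python
# Illustrative simplification of a relational model for ice-core data.
# Not PANGAEA's actual schema; table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE core (
    core_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,         -- drill-site label
    latitude  REAL, longitude REAL   -- georeference
);
CREATE TABLE parameter (
    param_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,          -- e.g. 'delta O-18'
    unit     TEXT                    -- e.g. 'per mil'
);
CREATE TABLE measurement (
    core_id  INTEGER REFERENCES core(core_id),
    param_id INTEGER REFERENCES parameter(param_id),
    depth_m  REAL,                   -- sample depth in metres
    value    REAL
);
INSERT INTO core VALUES (1, 'DEMO-1', 75.1, -42.3);
INSERT INTO parameter VALUES (1, 'delta O-18', 'per mil');
INSERT INTO measurement VALUES (1, 1, 0.5, -35.2), (1, 1, 1.0, -34.8);
""")

# Uniform retrieval: the same query shape serves any parameter of any core.
rows = conn.execute("""
    SELECT c.name, p.name, m.depth_m, m.value
    FROM measurement m
    JOIN core c      ON c.core_id  = m.core_id
    JOIN parameter p ON p.param_id = m.param_id
    WHERE p.name = ?
    ORDER BY m.depth_m
""", ("delta O-18",)).fetchall()
print(rows)
```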
An examination of scientific data repositories, data reusability, and the incorporation of FAIR
Proceedings of the Association for Information Science and Technology, 2020
Scientific data repositories (SDRs) provide a way for scientists to share data through data deposition and reuse of deposited data. Over the last twenty-plus years, hundreds of SDRs have become available. This research examines 132 SDRs, assessing whether the information they make available aligns with what scientists need to determine data reusability and whether the SDRs enforce the FAIR principles.
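The study's assessment rubric is not reproduced in the abstract. For flavour, the sketch below shows what automated FAIR-style spot checks on a repository's dataset DOI can look like, using DOI content negotiation (supported by Crossref and DataCite); these proxies are our own illustration, not the criteria used in the paper:

```python
# Illustrative FAIR-style spot checks on a dataset DOI. The DOI is a
# placeholder, and these proxies are not the study's rubric.
import requests

doi_url = "https://doi.org/10.1594/PANGAEA.000000"  # placeholder DOI

# Findable/Accessible: does the persistent identifier resolve?
landing = requests.get(doi_url, allow_redirects=True, timeout=30)
print("resolves:", landing.ok)

# Interoperable: is machine-readable metadata served on request?
meta = requests.get(
    doi_url,
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
record = meta.json() if meta.ok else {}
print("machine-readable metadata:", meta.ok)

# Reusable (weak proxy): does the record state a publisher of record?
print("publisher:", record.get("publisher"))
```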
Publication and Curation of Large-Scale Shared Scientific Data
Many environmental scientists today need to assemble, use, share, and save data from a diverse set of sources. These "synthesis" efforts are often interdisciplinary and blend data from ground-based sensors, satellites, field observations, and the literature. At even moderate scales of data size and diversity, the cost and time required to find, gather, collate, normalize, and customize data in order to build a synthesis dataset can be daunting at best. By explicitly identifying and addressing the different requirements for each data role (author, curator, data valet, publisher, and consumer; see the sketch below), our data management architecture for large-scale shared environmental data enables the creation of synthesis datasets that continue to grow and evolve with new data, data annotations, participants, and use rules. We show the effectiveness of our approach in the context of the FLUXNET Carbon-Climate Synthesis Dataset, one of the largest ongoing biogeophysical field experiments.
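One way to make the five data roles concrete is as a capability table; the mapping below is an assumed illustration, not taken from the FLUXNET system:

```python
# Minimal sketch of the five data roles named in the abstract with an
# assumed capability mapping (illustrative, not from the paper).
from enum import Enum, auto

class Role(Enum):
    AUTHOR = auto()      # produces original data
    CURATOR = auto()     # reviews and annotates
    DATA_VALET = auto()  # normalizes and collates on others' behalf
    PUBLISHER = auto()   # releases versioned snapshots
    CONSUMER = auto()    # reads under the dataset's use rules

CAPABILITIES = {
    Role.AUTHOR:     {"submit"},
    Role.CURATOR:    {"annotate", "approve"},
    Role.DATA_VALET: {"normalize", "collate"},
    Role.PUBLISHER:  {"release"},
    Role.CONSUMER:   {"read"},
}

def allowed(role: Role, action: str) -> bool:
    """Check an action against the role's capability set."""
    return action in CAPABILITIES[role]

assert allowed(Role.CURATOR, "annotate")
assert not allowed(Role.CONSUMER, "release")
```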
Research Ideas and Outcomes
Natural science collections are vast repositories of bio- and geodiversity specimens. These collections, originating from natural history cabinets or expeditions, are increasingly becoming unparalleled sources of data facilitating multidisciplinary research (Meineke et al. 2018, Heberling et al. 2019, Cook et al. 2020, Thompson et al. 2021). Due to various global data mobilization and digitisation efforts (Blagoderov et al. 2012, Nelson and Ellis 2018), this digitised information about specimens includes database records along with two/three-dimensional images, sonograms, sound or video recordings, computerised tomography scans, machine-readable texts from labels on the specimens as well as media items and notes related to the discovery sites and acquisition (Hedrick et al. 2020, Phillipson 2022). The scope and practice of specimen gathering are also evolving. The term extended specimen was coined to refer to the specimen and associated data extending beyond the singular physical obje...
The FAIR Guiding Principles for scientific data management and stewardship
Scientific Data, 2016
There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles and includes the rationale behind them and some exemplar implementations in the community.
Supporting discovery through good data management
Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process. Unfortunately, the existing digital ecosystem surrounding scholarly data publication prevents us from extracting maximum benefit from our research investments (e.g., ref. 1). Partially in response to this, science funders, publishers, and governmental agencies are beginning to require data management and stewardship plans for data generated in publicly funded experiments. Beyond proper collection, annotation, and archival, data stewardship includes the notion of 'long-term care' of valuable digital assets, with the goal that they should be discovered and reused for downstream investigations, either alone or in combination with newly generated data. The outcomes of good data management and stewardship, therefore, are high-quality digital publications that facilitate and simplify this ongoing process of discovery, evaluation, and reuse in downstream studies. What constitutes 'good data management' is, however, largely undefined and is generally left as a decision for the data or repository owner. Therefore, bringing some clarity to the goals and desiderata of good data management and stewardship, and defining simple guideposts to inform those who publish and/or preserve scholarly data, would be of great utility. This article describes four foundational principles (Findability, Accessibility, Interoperability, and Reusability) that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added value gained by contemporary, formal scholarly digital publishing. Importantly, it is our intent that the principles apply not only to 'data' in the conventional sense, but also to the algorithms, tools, and workflows that led to those data. All scholarly digital research objects (ref. 2), from data to analytical pipelines, benefit from application of these principles, since all components of the research process must be available to ensure transparency, reproducibility, and reusability.
There are numerous and diverse stakeholders who stand to benefit from overcoming these obstacles: researchers wanting to share, get credit for, and reuse each other's data and interpretations; professional data publishers offering their services; software and tool builders providing data analysis and processing services such as reusable workflows; and funding agencies (private and public) increasingly concerned with long-term data stewardship.
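The principles' emphasis on machine-actionability is concrete in practice: many repositories embed schema.org Dataset metadata as JSON-LD in dataset landing pages, which is what allows machines to find and interpret the data without human mediation. A sketch with standard-library parsing; the URL is a placeholder:

```python
# Hedged sketch: extracting schema.org JSON-LD metadata embedded in a
# dataset landing page. The URL is a placeholder.
import json
import re
import requests

page = requests.get("https://example.org/dataset/123", timeout=30).text

# JSON-LD blocks sit in <script type="application/ld+json"> elements.
blocks = re.findall(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    page,
    flags=re.DOTALL,
)

for block in blocks:
    try:
        meta = json.loads(block)
    except ValueError:
        continue  # skip malformed blocks
    if meta.get("@type") == "Dataset":
        print(meta.get("name"), "|", meta.get("license"))
```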
Data Management in Astrobiology: Challenges and Opportunities for an Interdisciplinary Community
Astrobiology, 2014
Data management and sharing are growing concerns for scientists and funding organizations throughout the world. Funding organizations are implementing requirements for data management plans, while scientists are establishing new infrastructures for data sharing. One of the difficulties is sharing data among a diverse set of research disciplines. Astrobiology is a unique community of researchers, spanning over 110 different disciplines. The current study reports the results of a survey of data management practices among scientists involved in the astrobiology community and the NASA Astrobiology Institute (NAI) in particular. The survey was administered over a two-month period in the first half of 2013. Fifteen percent of the NAI community responded (n = 114), and an additional 80 responses were collected from members of an astrobiology Listserv. The results of the survey show that the astrobiology community shares many of the same concerns for data sharing as other groups. The benefits of data sharing are acknowledged by many respondents, but barriers to data sharing remain, including lack of acknowledgement, citation, time, and institutional rewards. Overcoming technical, institutional, and social barriers to data sharing will be a challenge into the future.