EarthCube Data Discovery Studio: A gateway into geoscience data discovery and exploration with Jupyter notebooks
Concurrency and Computation: Practice and Experience
EarthCube Data Discovery Studio (DDStudio) is a cross-domain geoscience data discovery and exploration portal. It indexes over 1.65 million metadata records harvested from 40+ sources and utilizes a configurable metadata augmentation pipeline to enhance metadata content, using text analytics and an integrated geoscience ontology. Metadata enhancers add keywords with identifiers that map resources to science domains, geospatial features, measured variables, and other characteristics. The pipeline extracts spatial location and temporal references from metadata to generate structured spatial and temporal extents, maintaining provenance of each metadata enhancement, and allowing user validation. The semantically enhanced metadata records are accessible as standard ISO 19115/19139 XML documents via standard search interfaces. A search interface supports spatial, temporal, and text-based search, as well as functionality for users to contribute, standardize, and update resource descriptions, and to organize search results into shareable collections. DDStudio bridges resource discovery and exploration by letting users launch Jupyter notebooks residing on several platforms for any discovered datasets or dataset collection. DDStudio demonstrates how linking search results from the catalog directly to software tools and environments reduces time to science in a series of examples from several geoscience domains.

URL: datadiscoverystudio.org

KEYWORDS
data discovery, Jupyter notebooks, metadata, metadata augmentation

1 INTRODUCTION

Finding data using commercial search engines or numerous domain-specific data portals, then downloading the files or accessing the data via services to explore its content and determine its fitness for use, are common components of research projects.
Growing data volumes, the variety of data types, incomplete or poorly structured metadata, implicit assumptions, and unfamiliar terminology complicate the interpretation of discovered resources and their reuse in research workflows, especially in multidisciplinary studies. In the geosciences, finding data is a well-articulated challenge because data models, semantic conventions, access protocols, and other practices of data description and access are heterogeneous across disciplines. The quality, completeness, and standards compliance of available metadata catalogs vary dramatically, while metadata curation remains mostly manual and labor-intensive. As a result, traditional metadata management and data discovery models are becoming increasingly inadequate. While large-scale web search is supported by several commercial search engines, finding data across domains remains a challenge.