Editorial: Data publication – ESSD goals, practices and recommendations

To enable this overall goal of reliable durable data access, ESSD recommends that providers, reviewers and users adhere to the following recommendations.

3.1 Emphatic open access

For ESSD, easy free open access to data applies to data providers as well as users. Data providers must have easy access to no-cost mechanisms and services that will curate their data. Curation includes reliable long-term storage and backup, minting and maintenance of permanent identifiers, and appropriate metadata services that facilitate search, identification and download. Users, following identifier links embedded in an ESSD paper, should enjoy fast free reliable “two-click” access: one click to a relevant landing page and a second click to download. An ideal repository will include topical and geographic browsing. Users should not encounter registration steps, password requests, access agreements, or other log-in barriers or tracking mechanisms. From the start of open discussion, ESSD data products should exist in full public access without proprietary protection periods or other restrictions. Data publication as practised by ESSD depends on free bilateral unrestricted access. Most data repositories used by ESSD providers promote and support exactly this level of open access. New-to-ESSD data repositories can usually provide identical levels of access service. When authors lack information about or access to appropriate data centres, ESSD provides guidance and recommendations.

ESSD likewise insists on easily accessible non-proprietary databases, data products, data processing codes and other software tools necessary to process and use published data. Positive examples include comma-separated value (.csv) files (which, if skilfully prepared, can contain abundant metadata), netCDF files (enabled by well-documented manuals and freely available netCDF libraries), MySQL databases, QGIS-compatible shapefiles, open-source script codes (R, Python), etc. Proprietary software products such as ArcGIS, MATLAB and Microsoft Access fail to support the open access and exchange necessary for ESSD data publication; products in these formats require conversion to non-proprietary formats for data sharing. Because researchers can generally use Excel and because many free translators exist, ESSD accepts Excel files as a special case.
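As one illustration of how a .csv file can carry metadata alongside its data, the following minimal Python sketch writes descriptive lines prefixed with "#" ahead of the table and reads them back. The "#" prefix, the function names and the example values are assumptions made for illustration; ESSD does not prescribe this layout, and a repository may recommend a different convention.

```python
import csv
import io


def write_csv_with_metadata(handle, metadata, header, rows):
    """Write '#'-prefixed 'key: value' metadata lines, then a plain CSV table."""
    for key, value in metadata.items():
        handle.write(f"# {key}: {value}\n")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_csv_with_metadata(handle):
    """Return (metadata dict, header list, data rows) from a file written as above."""
    metadata, table_lines = {}, []
    for line in handle:
        if line.startswith("#"):
            # Split on the first colon only, so values may themselves contain colons.
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        else:
            table_lines.append(line)
    reader = csv.reader(table_lines)
    header = next(reader)
    return metadata, header, list(reader)


# Hypothetical example values, for illustration only.
buffer = io.StringIO()
write_csv_with_metadata(
    buffer,
    {"title": "Example station temperatures", "units": "degC", "missing_value": "-999"},
    ["date", "temperature"],
    [["2020-01-01", "3.4"], ["2020-01-02", "-999"]],
)
buffer.seek(0)
metadata, header, rows = read_csv_with_metadata(buffer)
```

This simple reader treats every line that begins with "#" as metadata, so it suits tables whose data rows never start with that character.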

3.2 Mandatory permanent identifiers

ESSD emerged synchronously with the application of digital object identifiers (DOI) to research data. The use of DOI for data identification and tracking and for version control remains critical to data publication processes; all ESSD data sets and data products must carry a DOI from the time of manuscript submission. Application of the DOI system to these products, whether flat files, databases or algorithms, serves to protect and inform both users and providers. Changes implemented as a consequence of the ESSD review process should in all cases result in a new DOI for the revised data product. The final published product will carry two DOI: one for the final data product as reviewed and perhaps revised and a second for the published description.

3.3 Accurate useful data descriptions including source attribution

An ESSD data description provides a unique complete recipe covering original data sources, data collection methods where applicable, tools and overall preparation of the data product. To help users – particularly users interested in, but perhaps unfamiliar with, the product – ESSD authors must provide accurate documentation of sources, algorithms, codes, models, etc., sufficiently to allow new users to develop subsequent or alternate analyses or conclusions. Ideally, detailed ESSD data descriptions prevent or at least minimise subsequent data misinterpretations or misuse. A good ESSD paper will include attribution tables that summarise data used, data sources (with URL) and journal citations so that readers and users can easily follow the same links to the same sources. Formats for all data links and attributions should follow current best practices (find examples and links to formal data citation principles under ESSD's data policy, https://www.earth-system-science-data.net/about/data_policy.html) and include a full accurate citation in the paper's reference list. A carefully prepared rigorously reviewed description represents a strong value-added feature of data publication as practised by ESSD.

3.4 Inclusive lists of codes and tools

To meet the goal of providing complete data preparation documentation, ESSD data products and data descriptions should include all codes, libraries, statistical or interpolation routines, model versions, etc. For example, when authors develop or use processing schemes in R, they must provide the specific names and URLs of those R codes. When they have validated their product through use of or comparison with models, they must provide exact details of model configurations, reliable links to model versions, etc., sufficiently to allow readers and users to replicate the analysis. Often, ESSD authors provide a flow chart of sources, processing steps and outcomes, accompanied by a table listing sources with necessary details. Data providers, who typically carry this information informally, generally benefit from the effort needed to formally record and document these procedures.
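For authors who process data in Python rather than R, one way to record library versions formally is to generate a small manifest at processing time. The sketch below is a minimal example under stated assumptions: the function name `build_manifest`, the manifest layout and the package list are hypothetical, and the manifest supplements rather than replaces the names and URLs that the data description must still provide.

```python
import json
import platform
from importlib import metadata


def build_manifest(package_names):
    """Collect the interpreter version and the installed version of each named package."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of failing.
            packages[name] = "not installed"
    return {"python": platform.python_version(), "packages": packages}


# Hypothetical package list; replace with the libraries a product actually uses.
manifest = build_manifest(["numpy", "pandas"])
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

The resulting JSON text can be deposited with the data product so that readers see which versions produced it.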

3.5 Extensive validations

Authors will need to demonstrate, first to reviewers and later to a wide range of users, the validity and applicability of their datasets and data products. Exact mechanisms and options for validation will vary substantially among and across data products. Because ESSD serves to ensure the suitability of published data for future research, each ESSD paper should demonstrate skill and utility of the submitted data product by some form of comparison to prior products, alternate data sources, similar products at different time or space resolution, model outcomes, initial short records of recent sensors, etc. For some data products, full validation with independent source materials may prove scientifically or technologically infeasible; comparisons to prior or alternate products may not offer quantitative validation. Community-wide compilations such as global budgets might only allow validation of specific components or comparisons to earlier versions. In all cases, however, authors must have made and reported best efforts at intercomparison and validation.
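Where a prior or alternate product offers paired values, a quantitative comparison can start from a few summary statistics. The following Python sketch computes mean bias, root-mean-square error and Pearson correlation; these particular metrics, the function name `comparison_metrics` and the example numbers are illustrative assumptions, not a validation procedure required by ESSD.

```python
import math


def comparison_metrics(product, reference):
    """Return mean bias, RMSE and Pearson correlation for paired values."""
    if len(product) != len(reference) or len(product) < 2:
        raise ValueError("need two equal-length series with at least two values")
    n = len(product)
    diffs = [p - r for p, r in zip(product, reference)]
    bias = sum(diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mean_p = sum(product) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(product, reference))
    var_p = sum((p - mean_p) ** 2 for p in product)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant series")
    correlation = cov / math.sqrt(var_p * var_r)
    return {"bias": bias, "rmse": rmse, "correlation": correlation}


# Hypothetical paired values, for illustration only.
new_product = [1.0, 2.0, 3.0, 4.0]
prior_product = [1.5, 2.5, 3.5, 4.5]
metrics = comparison_metrics(new_product, prior_product)
```

In this example the new product sits a constant 0.5 below the prior one, so the bias is negative while the correlation remains perfect; reporting several metrics together helps users distinguish offset from scatter.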

3.6 Explicit uncertainty accounting and analysis

Each ESSD data set, database, data product or data processing algorithm contains and perhaps induces uncertainties. ESSD products will also carry uncertainties inherited from source data. Authors must explicitly and extensively describe and document those uncertainties. Exact expressions of and standards for uncertainty will vary depending on types and sources of data, but as a service to and courtesy to subsequent users, every ESSD data product must include uncertainty documentation. Authors may need to rely on and cite their own expert judgement, but such conclusions must appear explicitly within an overall uncertainty assessment in terms of percent, standard deviation, or other accepted metrics. Future users, including modellers, require careful, explicit and quantitative uncertainty analysis that will allow them to choose or avoid subsequent use based on documented uncertainties of the ESSD product.
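As a simple illustration of an overall uncertainty assessment, the sketch below combines an inherited source uncertainty with a processing uncertainty as a root sum of squares and expresses the result both as a standard deviation and in percent. It assumes that the components are independent and share the same units; that assumption, the function names and the example numbers are hypothetical, and correlated or non-Gaussian uncertainties call for other treatments.

```python
import math


def combine_in_quadrature(*std_devs):
    """Combine independent standard uncertainties (same units) as a root sum of squares."""
    if any(s < 0 for s in std_devs):
        raise ValueError("standard deviations must be non-negative")
    return math.sqrt(sum(s * s for s in std_devs))


def relative_percent(value, std_dev):
    """Express a standard deviation as a percentage of the reported value."""
    if value == 0:
        raise ValueError("relative uncertainty is undefined for a zero value")
    return 100.0 * std_dev / abs(value)


# Hypothetical numbers, for illustration only.
inherited = 0.3   # uncertainty carried over from source data
processing = 0.4  # uncertainty introduced by the processing algorithm
total = combine_in_quadrature(inherited, processing)
percent = relative_percent(10.0, total)
```

Stating each component separately, as well as the combined value, lets later users judge which source of uncertainty dominates.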

3.7 User guidance in a data availability section

Authors must describe access to their data product(s) in an explicit data availability section (another value-added feature of ESSD). This section must list current primary and alternate data repository links, explain any versions, include links to open-access source files, etc. Where a user will encounter multiple files, authors must explain the contents and expected uses of each file; the availability section should not point to an FTP site full of raw text (.txt) files. When, for convenience, authors provide smaller-size (e.g. monthly) files as surrogates for larger higher-resolution (e.g. daily) files available offline, the data availability section should provide explicit descriptions and access guidance for both the small and large files. “Contact the author” does not represent useful or appropriate guidance for data availability. As mentioned above, all links to third-party data sources should appear in the reference list in citable accessible formats; the authors may wish to add notations, guidance, version information, etc. in this section. Data availability sections should also describe plans and schedules for future updates, when applicable. All ESSD papers should include their specific data link as the final sentence of the abstract and should repeat that link, accompanied by all necessary explanation and assistance, in the data availability section.

3.8 Interest and utility

Ideally, and to justify efforts of reviewers, editors and publication staff, an ESSD data product will prove interesting and useful to a wide range of users. Authors should know that, to ensure that ESSD products enable substantial advances in future research, editors must apply dual criteria in all cases: does the data product as submitted demonstrate sufficient quality, and will it interest a sufficient number of users? Clearly, a small data set collected over a short time at a single location generally does not qualify (e.g. an emission factor measured for a short time at one location might represent a significant data-gathering effort but with limited impact), while a community compilation of a global product covering 6 or 7 decades should qualify. Between those end points lies a wide range of plausible intermediary products. Over its short time of existence ESSD has tried to adopt an inclusive and expansive view of potential data impact. By close adherence to the guidelines above, data providers will help ESSD editors assess interest and utility.