Deborah Agarwal - Academia.edu

Papers by Deborah Agarwal

International Soil Carbon Network (ISCN) Database v3-1

OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information), 2016

The ISCN is an international scientific community devoted to the advancement of soil carbon research. The ISCN manages an open-access, community-driven soil carbon database. This is version 3-1 of the ISCN Database, released in December 2015. It gathers 38 separate dataset contributions, totalling 67,112 sites with data from 71,198 soil profiles and 431,324 soil layers. For more information about the ISCN, its scientific community and resources, data policies, and partner networks, visit: http://iscn.fluxdata.org/.

Data from: "A guide to using GitHub for developing and versioning data standards and reporting formats"

OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information), 2021

These data are the results of a systematic review that investigated how data standards and reporting formats are documented on the version control platform GitHub. Our systematic review identified 32 data standards in earth science, environmental science, and ecology that use GitHub for version control of data standard documents. In our analysis, we characterized the documents and content within each of the 32 GitHub repositories to identify common practices for groups that version control their documents on GitHub. In this data package, there are 8 CSV files that contain data that we characterized from each repository, according to the location within the repository. For example, in 'readme_pages.csv' we characterize the content that appears across the 32 GitHub repositories included in our systematic review. Each of the 8 CSV files has an associated data dictionary file (names appended with '_dd.csv'), in which we describe each content category within the CSV files. There is one file-level metadata file (flmd.csv) that provides a description of each file within the data package.
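
As a concrete illustration of the package layout described above, the sketch below pairs each data CSV with its '_dd.csv' data dictionary. The local directory name is hypothetical; only 'readme_pages.csv', the '_dd.csv' suffix, and 'flmd.csv' come from the abstract.

```python
# Minimal sketch: pair each data CSV in the package with its data
# dictionary ('_dd.csv' suffix), skipping the dictionaries themselves
# and the file-level metadata file. Directory name is an assumption.
from pathlib import Path

package_dir = Path("data_package")  # hypothetical local copy of the package

for csv_path in sorted(package_dir.glob("*.csv")):
    name = csv_path.name
    if name.endswith("_dd.csv") or name == "flmd.csv":
        continue  # dictionaries and file-level metadata are not data files
    dd_path = csv_path.with_name(csv_path.stem + "_dd.csv")
    status = "found" if dd_path.exists() else "missing"
    print(f"{name}: data dictionary {dd_path.name} {status}")
```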

Templates for developing and versioning data standards and reporting formats using GitHub

OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information), 2021

This data package contains three templates that can be used for creating README files and issue templates, written in the Markdown language, that support community-led data reporting formats. We created these templates based on the results of a systematic review (see related references) that explored how groups developing data standard documentation use the version control platform GitHub to collaborate on supporting documents. Based on our review of 32 GitHub repositories, we make recommendations for the content of README files (e.g., provide a user license, indicate how users can contribute), and so 'README_template.md' includes headings for each section. The two issue templates we include ('issue_template_for_all_other_changes.md' and 'issue_template_for_documentation_change.md') can be used in a GitHub repository to help structure user-submitted issues, or can be modified to suit the needs of data standard developers. We used these templates when establishing ESS-DIVE's community space on GitHub (https://github.com/ess-dive-community), which includes documentation for community-led data reporting formats. We also include a file-level metadata file, 'flmd.csv', that describes the contents of each file within this data package. Lastly, the temporal range that we indicate in our metadata is the time range during which we searched for data standards documented on GitHub.
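
To make the recommendation concrete, here is a minimal sketch that writes a README skeleton with sections of the kind recommended above (a user license, contribution instructions). The exact headings are assumptions, not the contents of the published 'README_template.md'.

```python
# Minimal sketch: generate a README skeleton with section headings of the
# kind the abstract recommends. Headings are illustrative assumptions.
readme_skeleton = """# <Reporting Format Name>

## About
One-paragraph description of the reporting format and its intended use.

## How to Contribute
Explain how users can open issues or pull requests against this repository.

## License
State the user license that applies to this documentation.
"""

with open("README_template.md", "w") as fh:
    fh.write(readme_skeleton)
```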

Enabling FAIR data in Earth and environmental science with community-centric (meta)data reporting formats

Scientific Data, Nov 14, 2022

Research can be more transparent and collaborative by using Findable, Accessible, Interoperable, and Reusable (FAIR) principles to publish Earth and environmental science data. Reporting formats (instructions, templates, and tools for consistently formatting data within a discipline) can help make data more accessible and reusable. However, the immense diversity of data types across Earth science disciplines makes development and adoption challenging. Here, we describe 11 community reporting formats for a diverse set of Earth science (meta)data, including cross-domain metadata (dataset metadata, location metadata, sample metadata), file-formatting guidelines (file-level metadata, CSV files, terrestrial model data archiving), and domain-specific reporting formats for some biological, geochemical, and hydrological data (amplicon abundance tables, leaf-level gas exchange, soil respiration, water and sediment chemistry, sensor-based hydrologic measurements). More broadly, we provide guidelines that communities can use to create new (meta)data formats that integrate with their scientific workflows. Such reporting formats have the potential to accelerate scientific discovery and predictions by making it easier for data contributors to provide (meta)data that are more interoperable and reusable.
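
As a sketch of the file-level metadata idea mentioned above, the snippet below writes one descriptive row per file in a data package. The column names are illustrative assumptions rather than the published reporting format specification.

```python
# Minimal sketch of file-level metadata: one row per file in a package.
# Column names and the package directory are illustrative assumptions.
import csv
from pathlib import Path

package = Path("my_data_package")  # hypothetical package directory

with open("flmd.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["File_Name", "File_Description", "File_Size_Bytes"])
    for path in sorted(package.glob("*")):
        if path.is_file():
            writer.writerow([path.name, "TODO: describe contents",
                             path.stat().st_size])
```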

Sample Identifiers and Metadata to Support Data Management and Reuse in Multidisciplinary Ecosystem Sciences

Data Science Journal, Mar 18, 2021

Physical samples are foundational entities for research across biological, Earth, and environmental sciences. Data generated from sample-based analyses are not only the basis of individual studies, but can also be integrated with other data to answer new and broader-scale questions. Ecosystem studies increasingly rely on multidisciplinary team science to study climate and environmental changes. While there are widely adopted conventions within certain domains to describe sample data, these have gaps when applied in a multidisciplinary context. In this study, we reviewed existing practices for identifying, characterizing, and linking related environmental samples. We then tested the practicalities of assigning persistent identifiers to samples, with standardized metadata, in a pilot field test involving eight United States Department of Energy projects. Participants collected a variety of sample types, with analyses conducted across multiple facilities. We address terminology gaps for multidisciplinary research and make recommendations for assigning identifiers and metadata that support sample tracking, integration, and reuse. Our goal is to provide a practical approach to sample management, geared towards ecosystem scientists who contribute and reuse sample data.
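
A minimal sketch of the kind of standardized, identifier-keyed sample record the paper argues for appears below. The field names and the IGSN-style identifier are illustrative assumptions, not the paper's exact metadata schema.

```python
# Minimal sketch: a sample metadata record keyed by a persistent
# identifier, with a parent link for subsamples. Fields are assumptions.
from dataclasses import dataclass, asdict

@dataclass
class SampleRecord:
    sample_id: str         # persistent identifier, e.g. an IGSN-style ID
    parent_id: str | None  # links a subsample to its parent sample
    sample_type: str       # e.g. "soil core", "water"
    collection_date: str   # ISO 8601 date
    latitude: float
    longitude: float

record = SampleRecord("IEXXX000A1", None, "soil core",
                      "2020-07-15", 38.92, -106.95)
print(asdict(record))
```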

Imputation of Contiguous Gaps and Extremes of Subhourly Groundwater Time Series Using Random Forests

Journal of Machine Learning for Modeling and Computing, 2022

Machine learning can provide sustainable solutions to gap-fill groundwater (GW) data needed to adequately constrain watershed models. However, imputing missing extremes is more challenging than other parts of a hydrograph. To impute missing subhourly data, including extremes, within GW time-series data collected at multiple wells in the East River watershed, located in southwestern Colorado, we consider a single-well imputation (SWI) and a multiple-well imputation (MWI) approach. SWI gap-fills missing GW entries in a well using the same well's time-series data; MWI gap-fills a specific well's missing GW entry using the time series of neighboring wells. SWI takes advantage of linear interpolation and random forest (RF) approaches, whereas MWI exploits only the RF approach. We also use an information entropy framework to develop insights into how missing data patterns impact imputation. We discovered that if gaps were at random intervals, SWI could accurately impute up to 90% of missing data over an approximately two-year period. Contiguous gaps constituted more complex scenarios for imputation and required the use of MWI. Information entropy suggested that if gaps were contiguous, up to 50% of missing GW data could be estimated accurately over an approximately two-year period. The RF feature importance suggested that a time feature (months) and a space feature (neighboring wells) were the most important predictors in SWI and MWI. We also noted that neither the SWI nor the MWI method could capture the missing extremes of a hydrograph. To counter this, we developed a new sequential approach and demonstrated the imputation of missing extremes in a GW time series with high accuracy.
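
The sketch below illustrates the single-well imputation idea on synthetic data: a random forest learns a well's groundwater level from time features and predicts the missing entries. The features, data, and model settings are assumptions, not the study's configuration.

```python
# Minimal SWI-style sketch: an RF predicts a well's missing groundwater
# levels from time features. Synthetic data; settings are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical subhourly series with a contiguous gap (NaNs).
idx = pd.date_range("2020-01-01", periods=5000, freq="30min")
gw = pd.Series(np.sin(np.arange(5000) / 200.0), index=idx)
gw.iloc[1000:1200] = np.nan

features = pd.DataFrame(
    {"month": idx.month, "day": idx.dayofyear, "hour": idx.hour}, index=idx)

train = gw.notna()
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(features[train], gw[train])

gw_filled = gw.copy()
gw_filled[~train] = rf.predict(features[~train])  # impute the gap
```

An MWI-style variant would replace the time features with the concurrent levels of neighboring wells.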

Tools for Building Virtual Laboratories

Long-term missing value imputation for time series data using deep neural networks

Neural Computing and Applications, Dec 23, 2022

We present an approach that uses a deep learning model, in particular a multilayer perceptron (MLP), for estimating the missing values of a variable in multivariate time series data. We focus on filling a long continuous gap (e.g., multiple months of missing daily observations) rather than on individual randomly missing observations. Our proposed gap-filling algorithm uses an automated method for determining the optimal MLP model architecture, thus allowing for optimal prediction performance for the given time series. We tested our approach by filling gaps of various lengths (three months to three years) in three environmental datasets with different time series characteristics, namely daily groundwater levels, daily soil moisture, and hourly net ecosystem exchange. We compared the accuracy of the gap-filled values obtained with our approach to the widely used R-based time series gap-filling methods imputeTS and mtsdi. The results indicate that using an MLP for filling a large gap leads to better results, especially when the data behave nonlinearly. Thus, our approach enables the use of datasets that have a large gap in one variable, which is common in many long-term environmental monitoring observations.
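
Below is a minimal sketch of the core idea on synthetic data: an MLP trained on co-located variables predicts the gapped variable across a long contiguous gap. The data, predictors, and architecture are assumptions; the paper additionally automates the architecture search.

```python
# Minimal sketch: fill a long gap in one variable using an MLP trained on
# the other variables. Synthetic data; architecture is an assumption.
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(2000)
df = pd.DataFrame({
    "precip": rng.gamma(2.0, 1.0, 2000),
    "temp": 15 + 10 * np.sin(2 * np.pi * t / 365),
})
df["gw_level"] = (0.3 * df["precip"] - 0.1 * df["temp"]
                  + rng.normal(0, 0.2, 2000))
df.loc[500:800, "gw_level"] = np.nan  # a multi-month contiguous gap

known = df["gw_level"].notna()
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                   random_state=0)
mlp.fit(df.loc[known, ["precip", "temp"]], df.loc[known, "gw_level"])
df.loc[~known, "gw_level"] = mlp.predict(df.loc[~known, ["precip", "temp"]])
```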

Surrogate Optimization of Deep Neural Networks for Groundwater Predictions

arXiv (Cornell University), Aug 28, 2019

Sustainable management of groundwater resources under changing climatic conditions requires reliable and accurate predictions of groundwater levels. Mechanistic multi-scale, multi-physics simulation models are often too hard to use for this purpose, especially for groundwater managers who do not have access to the complex compute resources and data. Therefore, we analyzed the applicability and performance of four modern deep learning computational models for predictions of groundwater levels. We compare three methods for optimizing the models' hyperparameters, including two surrogate model-based algorithms and a random sampling method. The models were tested using predictions of the groundwater level in Butte County, California, USA, taking into account the temporal variability of streamflow, precipitation, and ambient temperature. Our numerical study shows that optimization of the hyperparameters can lead to reasonably accurate performance of all models (root mean squared errors of groundwater predictions of 2 meters or less), but the "simplest" network, namely a multilayer perceptron (MLP), performs better overall for learning and predicting groundwater data than the more advanced long short-term memory or convolutional neural networks, in terms of both prediction accuracy and time-to-solution, making the MLP a suitable candidate for groundwater prediction.
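
The random sampling baseline compared in the paper can be sketched as follows: draw hyperparameter configurations at random, train, and keep the lowest-RMSE model. The search space and data are illustrative assumptions; the surrogate-based algorithms would replace the random draws with model-guided proposals.

```python
# Minimal sketch of random-sampling hyperparameter search for an MLP,
# keeping the configuration with the lowest test RMSE. Illustrative data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=5.0,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
best = (np.inf, None)
for _ in range(10):  # random sampling of hyperparameter configurations
    cfg = {"hidden_layer_sizes": (int(rng.integers(8, 128)),),
           "alpha": 10.0 ** rng.uniform(-5, -1)}
    model = MLPRegressor(max_iter=2000, random_state=0, **cfg).fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    if rmse < best[0]:
        best = (rmse, cfg)

print(f"best RMSE {best[0]:.2f} with {best[1]}")
```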

Network Aware Distributed Applications

Lawrence Berkeley National Laboratory, Feb 4, 2001

Most distributed applications today manage to utilize only a small percentage of the needed and available network bandwidth. Often application developers are not aware of the potential bandwidth of the network, and therefore do not know what to expect. Even when application developers are aware of the specifications of the machines and network links, they have few resources that can help determine why the expected performance was not achieved. What is needed is a ubiquitous and easy-to-use service that provides reliable, accurate, secure, and timely estimates of dynamic network properties. This service will help advise applications on how to make use of the network's increasing bandwidth and capabilities for traffic shaping and engineering. When fully implemented, this service will make building currently unrealizable levels of network awareness into distributed applications a relatively mundane task. For example, a remote data visualization application could choose between sending a wireframe, a pre-rendered image, or a 3-D representation, based on forecasts of CPU availability and power, compression options, and available bandwidth. The same service will provide on-demand performance information so that applications can compare predicted with actual results, and allow detailed queries about the end-to-end path for application and network tuning and debugging.
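
The abstract's remote-visualization example can be sketched as a simple policy that maps a bandwidth forecast to a payload choice. The thresholds and the forecast value are assumptions; the real service would supply live estimates.

```python
# Minimal sketch: choose a visualization payload from a forecast of
# available bandwidth. Thresholds are illustrative assumptions.
def choose_representation(forecast_mbps: float) -> str:
    if forecast_mbps < 1.0:
        return "wireframe"           # cheapest payload to ship
    if forecast_mbps < 20.0:
        return "pre-rendered image"  # moderate payload
    return "3-D representation"      # full-fidelity payload

for bw in (0.5, 8.0, 100.0):  # hypothetical forecasts in Mb/s
    print(f"{bw:6.1f} Mb/s -> {choose_representation(bw)}")
```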

A Data-Centered Collaboration Portal to Support Global Carbon-Flux Analysis

Lawrence Berkeley National Laboratory, Jan 14, 2010

Carbon-climate science, like other environmental sciences, has been changing. Large-scale synthesis studies are becoming more common. These synthesis studies are often conducted by science teams that are geographically distributed and on datasets that are global in scale. A broad array of collaboration and data analytics tools are now available that could support these science teams. However, building tools that scientists actually use is hard. Also, moving scientists from an informal collaboration structure to one mediated by technology often exposes inconsistencies in the understanding of the rules of engagement between collaborators. We have developed a scientific collaboration portal, called fluxdata.org, which serves the community of scientists providing and analyzing the global FLUXNET carbon-flux synthesis dataset. Key things we learned or re-learned during our portal development include: minimize the barrier to entry, provide features on a just-in-time basis, development of requirements is an ongoing process, provide incentives to change leaders and leverage the opportunity they represent, automate as much as possible, and you can only learn how to make it better if people depend on it enough to give you feedback. In addition, we also learned that splitting the portal roles between scientists and computer scientists improved user adoption and trust. The fluxdata.org portal has now been in operation for ~1.5 years and has become central to the FLUXNET synthesis efforts.

Securing collaborative environments

OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information), May 16, 2002

The diverse set of organizations and software components involved in a typical collaboratory makes providing a seamless security solution difficult. In addition, users need support for accessing the collaboratory from a broad range of locations and with widely varying frequency. A collaboratory security solution needs to be robust enough to ensure that valid participants are not denied access because of its failure. There are many tools that can be applied to the task of securing collaborative environments, including public key infrastructure, secure sockets layer, Kerberos, virtual and real private networks, grid security infrastructure, and username/password. A combination of these mechanisms can provide effective secure collaboration capabilities. In this paper, we discuss the requirements of typical collaboratories and some proposals for applying various security mechanisms to collaborative environments.
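
As one concrete instance of the mechanisms listed above, the sketch below opens a certificate-verified SSL/TLS connection to a collaboratory service. The host name is hypothetical, and a real deployment would layer this with PKI credentials, Kerberos, or other mechanisms from the list.

```python
# Minimal sketch: secure a client connection with SSL/TLS, one of the
# mechanisms the abstract lists. Host and port are hypothetical.
import socket
import ssl

context = ssl.create_default_context()  # verifies the server certificate
host = "collab.example.org"             # hypothetical collaboratory host

with socket.create_connection((host, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=host) as tls:
        print(tls.version(), tls.getpeercert()["subject"])
```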

Overview of the InterGroup protocols

OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information), Mar 1, 2001

Existing reliable ordered group communication protocols have been developed for local-area networks and do not, in general, scale well to large numbers of nodes and wide-area networks. The InterGroup suite of protocols is a scalable group communication system that introduces a novel approach to handling group membership, and supports a receiver-oriented selection of service. The protocols are intended for a wide-area network, with a large number of nodes, that has highly variable delays and a high message loss rate, such as the Internet. The levels of the message delivery service range from unreliable unordered to reliable group timestamp ordered.
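
A rough sketch of the receiver-oriented selection of service appears below as an ordered set of delivery levels. Only the two endpoint levels are named in the abstract, so the intermediate level and the join API are assumptions for illustration.

```python
# Minimal sketch: a receiver requests its own delivery service level when
# joining a group. Only the endpoint levels come from the abstract.
from enum import IntEnum

class DeliveryService(IntEnum):
    UNRELIABLE_UNORDERED = 0              # named in the abstract
    RELIABLE_SOURCE_ORDERED = 1           # assumed intermediate level
    RELIABLE_GROUP_TIMESTAMP_ORDERED = 2  # named in the abstract

def receiver_join(group: str, service: DeliveryService) -> None:
    # A real implementation would negotiate membership and delivery here.
    print(f"join {group} requesting {service.name}")

receiver_join("sensor-feed", DeliveryService.RELIABLE_GROUP_TIMESTAMP_ORDERED)
```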

Deriving Daytime Variables From the AmeriFlux Standard Eddy Covariance Data Set

A gap-filled, quality-assessed eddy covariance dataset has recently become available for the AmeriFlux network. This dataset uses standard processing and produces commonly used science variables. This shared dataset enables robust comparisons across different analyses. Of course, there are many remaining questions. One of those is how to define "during the day," which is an important concept for many analyses. Some studies have used local time, for example 9 am to 5 pm; others have used thresholds on photosynthetically active radiation (PAR). A related question is how to derive quantities such as the Bowen ratio. Most studies compute the ratio of the averages of the latent heat (LE) and sensible heat (H). In this study, we use different methods of defining "during the day" for GPP, LE, and H. We evaluate the differences between methods in two ways. First, we look at a number of statistics of GPP. Second, we look at differences in the derived Bowen ratio. Our goal is not science per se, but rather informatics in support of the science.
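
The two "during the day" definitions and the ratio-of-averages Bowen ratio can be sketched on synthetic half-hourly data as below. The PAR threshold and the synthetic fluxes are illustrative assumptions.

```python
# Minimal sketch: compare a local-time daytime definition with a
# PAR-threshold definition, then compute the Bowen ratio as the ratio of
# the daytime averages of H and LE. All data here are synthetic.
import numpy as np
import pandas as pd

idx = pd.date_range("2021-06-01", periods=48 * 30, freq="30min")
hour = idx.hour + idx.minute / 60
par = np.clip(np.sin((hour - 6) / 12 * np.pi), 0, None) * 1500  # umol/m2/s
le = 100 + 200 * par / 1500  # latent heat, W/m2 (synthetic)
h = 50 + 150 * par / 1500    # sensible heat, W/m2 (synthetic)
df = pd.DataFrame({"PAR": par, "LE": le, "H": h}, index=idx)

clock_day = (hour >= 9) & (hour < 17)  # local-time definition, 9am-5pm
par_day = df["PAR"] > 10               # PAR threshold (assumed value)

for name, mask in [("9am-5pm", clock_day), ("PAR>10", par_day)]:
    bowen = df.loc[mask, "H"].mean() / df.loc[mask, "LE"].mean()
    print(f"{name}: Bowen ratio = {bowen:.2f}")
```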

A Radiocarbon Database for Improved Understanding of Global Soil Carbon Dynamics: Part II

AGU Fall Meeting Abstracts, Dec 1, 2011


Community use of persistent sample identifiers and metadata standards: supporting efficient data management in the field, laboratory, and online

The Alaska Soil Carbon Database: A Powerful Database for Soil Carbon Synthesis and Modeling

AGU Fall Meeting Abstracts, Dec 1, 2009


Catalyzing continental-scale carbon cycle science with NEON's first data and software release

Initial results of the CD-1 reliable multicast experiment

A new version of the CD-1 continuous data protocol has been developed to support a multicasting experiment. This new version was developed to study the use of reliable multicast for sending data to multiple receivers in a single operation. The purpose of the experiment is to evaluate the existing multicasting technology for possible use in the Global Communication Infrastructure. This paper describes the initial results of the experiment.

Overview of the U.S. DOE's Carbon Capture Simulation Initiative for Accelerating the Commercialization of CCS Technology
