GitHub - IQSS/dataverse-client-r: R Client for Dataverse Repositories

R Client for Dataverse Repositories

The dataverse package provides access to Dataverse APIs (versions 4+), enabling data search, retrieval, and deposit, thus allowing R users to integrate public data sharing into the reproducible research workflow.

Getting Started

You can find a stable release on CRAN, or install the latest development version from GitHub:

Install from CRAN

install.packages("dataverse")

Install from GitHub

install.packages("remotes")

remotes::install_github("iqss/dataverse-client-r")

API Access Keys

Many features of the Dataverse API are public and require no authentication. This means that in many cases you can search for and retrieve data without a Dataverse account or API key.

Other features require a Dataverse account on the specific server installation of the Dataverse software, and an API key linked to that account. Instructions for obtaining an account and setting up an API key are available in the Dataverse User Guide. (Note: if your key is compromised, it can be regenerated to preserve security.) Once you have an API key, store it as an environment variable called DATAVERSE_KEY. It can be set as a default by adding

DATAVERSE_KEY="examplekey12345"

in your .Renviron file, where examplekey12345 should be replaced with your own key. The environment file can be opened by usethis::edit_r_environ().
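To confirm that a key is visible to the current R session, a small check like the one below may help. It is only a sketch: has_dataverse_key() is a hypothetical helper written for this example, not a function exported by the dataverse package, and the key shown is the placeholder from above.

```r
# Hypothetical helper (not part of the dataverse package): report whether
# an API key is available without printing the key itself.
has_dataverse_key <- function() {
  nzchar(Sys.getenv("DATAVERSE_KEY", unset = ""))
}

# Set a placeholder key for the current session only.
# A real key belongs in .Renviron, not in a script.
Sys.setenv(DATAVERSE_KEY = "examplekey12345")
has_dataverse_key()
```

Checking with nzchar() rather than printing the value keeps the key out of logs and rendered documents.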

Server

Because there are many Dataverse installations, all functions in the R client require specifying what server installation you are interacting with. There are multiple ways to specify the server:

  1. Set the server argument in each function, e.g., server = "dataverse.harvard.edu" in the get_dataframe_by_name() function.
  2. Set the environment variable DATAVERSE_SERVER in the script, to be used throughout the session, e.g.,

Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

  3. Hard-code a default server in your own environment. Edit your .Renviron file directly or open it with usethis::edit_r_environ(), then enter DATAVERSE_SERVER = "dataverse.harvard.edu". However, doing this may make your scripts hard to replicate for other people who do not have access to your environment.

In all cases, values should be the Dataverse server, without the “https” prefix or the “/api” URL path.
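If a server value is pasted from a browser address bar, it may still carry the scheme or the API path. The following is a minimal sketch of cleaning such a value before use; normalize_server() is a hypothetical helper for illustration and is not provided by the dataverse package.

```r
# Hypothetical helper (not part of the dataverse package): reduce a URL-like
# string to the bare server name expected by the server argument.
normalize_server <- function(server) {
  server <- trimws(server)
  server <- sub("^https?://", "", server, ignore.case = TRUE)  # drop the scheme
  server <- sub("/api(/.*)?$", "", server)                     # drop "/api" and anything after it
  server <- sub("/+$", "", server)                             # drop trailing slashes
  server
}

# Use the cleaned value for the rest of the session.
Sys.setenv(DATAVERSE_SERVER = normalize_server("https://dataverse.harvard.edu/api/"))
Sys.getenv("DATAVERSE_SERVER")
```

A value that is already in the expected form, such as "demo.dataverse.org", passes through unchanged.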

Data Download

The dataverse package provides multiple interfaces for loading data into R. Users can supply a file DOI, a dataset DOI combined with a filename, or a dataverse object. The file can be returned either as a raw binary or as a dataset read in with the appropriate R function.
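The relationship between the two return forms can be illustrated without a network connection. The sketch below is not the package's implementation: read_raw_with() is a hypothetical helper, and the bytes are a small CSV built in memory as a stand-in for a downloaded file. It shows the general idea of turning raw bytes into a data frame by passing them to a reader function.

```r
# Hypothetical helper, not the package's implementation: write raw bytes to a
# temporary file and hand that file to a reader function.
read_raw_with <- function(raw_bytes, .f, fileext = "") {
  stopifnot(is.raw(raw_bytes), is.function(.f))
  tmp <- tempfile(fileext = fileext)
  on.exit(unlink(tmp), add = TRUE)  # clean up the temporary file afterwards
  writeBin(raw_bytes, tmp)
  .f(tmp)
}

# Stand-in for downloaded bytes: a small CSV held in memory.
csv_bytes <- charToRaw("id,age\n1,37\n2,42\n")
df <- read_raw_with(csv_bytes, utils::read.csv, fileext = ".csv")
str(df)
```

The reader function decides how the bytes are interpreted, which is why the choice of function matters when a file is in a format such as Stata or Rds.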

Reading data as R objects

Use the get_dataframe_*() functions, depending on the input you have. For example, we will read a survey dataset on Dataverse, nlsw88.dta (doi:10.70122/FK2/PPKHI1/ZYATZZ), originally in Stata dta form.

With a file DOI, we can use the get_dataframe_by_doi function:

nlsw <- get_dataframe_by_doi(
  filedoi = "10.70122/FK2/PPIAXE/MHDB0O",
  server = "demo.dataverse.org"
)

## Downloading ingested version of data with readr::read_tsv. To download the original version and remove this message, set original = TRUE.

## Rows: 2246 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (17): idcode, age, race, married, never_married, grade, collgrad, south, smsa, c_city, industry, occupation, uni...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

which by default reads in the ingested file (not the original dta) with the readr::read_tsv function.

Alternatively, we can download the same file by specifying the filename and the DOI of the “dataset” (in Dataverse, a collection of files is called a dataset).

nlsw_tsv <- get_dataframe_by_name(
  filename = "nlsw88.tab",
  dataset = "10.70122/FK2/PPIAXE",
  server = "demo.dataverse.org"
)

Now, Dataverse often translates rectangular data into an ingested, or “archival”, version, which is application-neutral and easily readable. get_dataframe_*() defaults to taking this ingested version rather than the original, through the argument original = FALSE.

This default is safe because you may not have the proprietary software that was originally used. On the other hand, the data may have lost information in the process of ingestion.

To read the original version of the same file instead, specify original = TRUE and set an .f argument. In this case, we know that nlsw88.tab is a Stata .dta dataset, so we will use the haven::read_dta function.

nlsw_original <- get_dataframe_by_name(
  filename = "nlsw88.tab",
  dataset = "10.70122/FK2/PPIAXE",
  .f = haven::read_dta,
  original = TRUE,
  server = "demo.dataverse.org"
)

Note that even though the file extension is “.tab”, we use haven::read_dta.

Of course, when the file is not ingested (such as an Rds file), users always need to specify an .f argument for the specific file.

Note the difference between nlsw_tsv and nlsw_original. nlsw_original preserves data attributes like value labels, whereas nlsw_tsv has dropped them or left them in the file metadata.

class(nlsw_tsv$race) # tab ingested version only has numeric data

attr(nlsw_original$race, "labels") # original dta has value labels

## white black other 
##     1     2     3
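The same loss can be reproduced locally without any download. The sketch below uses base R only and a made-up four-row data frame as a stand-in for the survey data; it shows that a "labels" attribute does not survive a round trip through a tab-delimited file, which is broadly what happens when a labelled variable is stored in a plain tabular format.

```r
# Stand-in for a labelled Stata variable: numeric codes plus a "labels" attribute.
race <- c(1, 2, 3, 1)
attr(race, "labels") <- c(white = 1, black = 2, other = 3)
original <- data.frame(idcode = 1:4)
original$race <- race

# Write to a tab-delimited file and read it back.
tmp <- tempfile(fileext = ".tab")
write.table(original, tmp, sep = "\t", row.names = FALSE, quote = FALSE)
round_trip <- read.delim(tmp)
unlink(tmp)

attr(original$race, "labels")    # named vector of value labels
attr(round_trip$race, "labels")  # NULL: the labels did not survive
```

The codes themselves are intact after the round trip; only the mapping from codes to labels is gone.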

Data Upload and Archiving

Note: There are known issues with the dataverse creation and dataset addition functionality described here. add_dataset_file() appears stable again as of v0.3.11. One possible workaround is to mix the two workflows described below (see e.g. this comment).

Dataverse provides two basically unrelated workflows for managing (adding, documenting, and publishing) datasets. The first workflow is called the “native” API. It uses create_dataset() to make an empty dataset and add_dataset_file() to add files, taking a path to a file located on your local machine. Through the native API it is possible to update a dataset by modifying its metadata with update_dataset() or its file contents with update_dataset_file(), and then republish a new version using publish_dataset().

# create the dataset, e.g.
ds <- create_dataset("mydataverse") # pick a name of dataset

# add files
tmp <- tempfile() # In this example, we write to a temporary destination
write.csv(iris, file = tmp)
add_dataset_file(file = tmp, dataset = ds)

# publish dataset
publish_dataset(ds)

# dataset will now be published
get_dataverse("mydataverse")

The second workflow is built on SWORD (v2.0). This means that to create a new dataset listing, you will have to first initialize a dataset entry with some metadata, add one or more files to the dataset, and then publish it. This looks something like the following:

# After setting appropriate dataverse server and environment, obtain SWORD
# service doc
d <- service_document()

# create a list of metadata for a file
metadat <- list(
  title = paste0("My-Study_", format(Sys.time(), '%Y-%m-%d_%H:%M')),
  creator = "Doe, John",
  description = "An example study"
)

# create the dataset, where "mydataverse" is to be replaced by the name
# of the already-created dataverse as shown in the URL
ds <- initiate_sword_dataset("mydataverse", body = metadat)

# add files to dataset
readr::write_csv(iris, file = "iris.csv")

# Search the initiated dataset and give a DOI and version of the dataverse as an identifier
mydoi <- "doi:10.70122/FK2/BMZPJZ&version=DRAFT"

# add dataset
add_dataset_file(file = "iris.csv", dataset = mydoi)

# publish new dataset
publish_sword_dataset(ds)

# dataset will now be published
list_datasets("mydataverse")

Limitations

The R client is currently stable for data search and download. For more extensive features of uploading and maintaining data, see the issues reported in the GitHub repository. You may need to use alternative methods, such as working in the Dataverse GUI directly or using pyDataverse.

Functions related to user management and permissions are currently not exported in the package (but are drafted in the source code).

dataverse is the next-generation iteration of the now removed dvn package, which works with Dataverse 3 (“Dataverse Network”) applications.

Dataverse clients in other programming languages include pyDataverse for Python and the Java client. For more information, see the Dataverse API page.

Users interested in downloading metadata from archives other than Dataverse may be interested in Kurt Hornik’s OAIHarvester and Scott Chamberlain’s oai, which offer metadata download from any web repository that is compliant with the Open Archives Initiative standards. Additionally, rdryad uses OAIHarvester to interface with Dryad. The rfigshare package works in a similar spirit to dataverse with https://figshare.com/.

More Information

A 2021 talk demonstrating the dataverse package is available at https://www.youtube.com/watch?v=-J-eiPnmoNE.