GitHub - mgirlich/tibblify: Rectangle Nested Lists

tibblify

Lifecycle: experimental

The goal of tibblify is to provide an easy way of converting a nested list into a tibble.

Installation

You can install the released version of tibblify from CRAN with:

install.packages("tibblify")

Or install the development version from GitHub with:

install.packages("devtools")

devtools::install_github("mgirlich/tibblify")

Introduction

With tibblify() you can rectangle deeply nested lists into a tidy tibble. These lists might come from an API in the form of JSON or from scraping XML. The reasons to use tibblify() over other tools like jsonlite::fromJSON() or tidyr::hoist() are:

Example

Let’s start with gh_users, which is a list containing information about four GitHub users.

library(tibblify)

gh_users_small <- purrr::map(gh_users, ~ .x[c("followers", "login", "url", "name", "location", "email", "public_gists")])

names(gh_users_small[[1]])
#> [1] "followers"    "login"        "url"          "name"         "location"
#> [6] "email"        "public_gists"

Quickly rectangling gh_users_small is as easy as applying tibblify() to it:

tibblify(gh_users_small)
#> The spec contains 1 unspecified field:
#> • email
#> # A tibble: 4 × 7
#>   followers login      url                    name  location email public_gists
#>
#> 1       780 jennybc    https://api.github.co… Jenn… Vancouv…                 54
#> 2      3958 jtleek     https://api.github.co… Jeff… Baltimo…                 12
#> 3       115 juliasilge https://api.github.co… Juli… Salt La…                  4
#> 4       213 leeper     https://api.github.co… Thom… London,…                 46

We can now look at the specification tibblify() used for rectangling

guess_tspec(gh_users_small)
#> The spec contains 1 unspecified field:
#> • email
#> tspec_df(
#>   tib_int("followers"),
#>   tib_chr("login"),
#>   tib_chr("url"),
#>   tib_chr("name"),
#>   tib_chr("location"),
#>   tib_unspecified("email"),
#>   tib_int("public_gists"),
#> )

If we are only interested in some of the fields we can easily adapt the specification

spec <- tspec_df(
  login_name = tib_chr("login"),
  tib_chr("name"),
  tib_int("public_gists")
)

tibblify(gh_users_small, spec)
#> # A tibble: 4 × 3
#>   login_name name                   public_gists
#>
#> 1 jennybc    Jennifer (Jenny) Bryan           54
#> 2 jtleek     Jeff L.                          12
#> 3 juliasilge Julia Silge                       4
#> 4 leeper     Thomas J. Leeper                 46
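A guessed specification can also be stored and passed back to tibblify(), so that later calls use a fixed spec rather than guessing again. The following is a minimal sketch on a made-up two-element list; the data and the variable names (users, guessed_spec, users_df) are illustrative assumptions, not part of the package.

```r
library(tibblify)

# Hypothetical input: two objects with the same fields.
users <- list(
  list(id = 1L, login = "alice"),
  list(id = 2L, login = "bob")
)

# Guess the specification once and keep it around ...
guessed_spec <- guess_tspec(users)

# ... then pass it explicitly when rectangling.
users_df <- tibblify(users, guessed_spec)
users_df
```

Printing guessed_spec shows code that can be copied into a script and edited by hand, as done with tspec_df() above.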

Objects

We refer to lists like gh_users_small as collections, and objects are the elements of such lists. Objects and collections are the typical input for tibblify().

Basically, an object is simply something that can be converted to a one row tibble. This boils down to a condition on the names of the object:

In other words, the names must fulfill vec_as_names(repair = "check_unique"). The name-value pairs of an object are the fields.

For example, list(x = 1, y = "a") is an object with the fields (x, 1) and (y, "a"), but list(1, z = 3) is not an object because it is not fully named.

A collection is basically just a list of similar objects so that the fields can become the columns in a tibble.
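The name condition for objects can be checked directly. The sketch below defines a hypothetical helper, is_object(), which is not part of tibblify; it assumes the vctrs and rlang packages are installed and simply reports whether the names of a list pass the "check_unique" check.

```r
# Hypothetical helper, not part of tibblify: returns TRUE if `x` is a list
# whose names pass the "check_unique" name check.
is_object <- function(x) {
  if (!is.list(x)) {
    return(FALSE)
  }
  tryCatch(
    {
      # names2() returns "" for missing names, so unnamed elements are caught.
      vctrs::vec_as_names(rlang::names2(x), repair = "check_unique")
      TRUE
    },
    error = function(cnd) FALSE
  )
}

is_object(list(x = 1, y = "a")) # fully named with unique names
is_object(list(1, z = 3))       # first element is unnamed
is_object(list(x = 1, x = 2))   # duplicated name
```

Only the first call describes an object; the other two lists fail the name check.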

Specification

Providing an explicit specification has a couple of advantages:

As seen before, the specification for a collection is done with tspec_df(). The columns of the output tibble are described with the tib_*() functions. They describe the path to the field to extract and the output type of the field. There are the following five types of functions:

For convenience there are shortcuts for tib_scalar() and tib_vector() for the most common prototypes:

Scalar Elements

Scalar elements are the most common case and result in a normal vector column

tibblify(
  list(
    list(id = 1, name = "Peter"),
    list(id = 2, name = "Lilly")
  ),
  tspec_df(
    tib_int("id"),
    tib_chr("name")
  )
)
#> # A tibble: 2 × 2
#>      id name
#>
#> 1     1 Peter
#> 2     2 Lilly

With tib_scalar() you can also provide your own prototype

Let’s say you have a list with durations

x <- list(
  list(id = 1, duration = vctrs::new_duration(100)),
  list(id = 2, duration = vctrs::new_duration(200))
)
x
#> [[1]]
#> [[1]]$id
#> [1] 1
#>
#> [[1]]$duration
#> Time difference of 100 secs
#>
#>
#> [[2]]
#> [[2]]$id
#> [1] 2
#>
#> [[2]]$duration
#> Time difference of 200 secs

and then use it in tib_scalar()

tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_scalar("duration", ptype = vctrs::new_duration())
  )
)
#> # A tibble: 2 × 2
#>      id duration
#>
#> 1     1 100 secs
#> 2     2 200 secs

Vector Elements

If an element does not always have size one, then it is a vector element. If it still always has the same type ptype, then it produces a list of ptype column:

x <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)

tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)
#> # A tibble: 3 × 2
#>      id children
#>         <list>
#> 1     1 [2]
#> 2     2 [1]
#> 3     3 [3]

You can use tidyr::unnest() or tidyr::unnest_longer() to flatten these columns to regular columns.
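As a minimal sketch of that flattening step, the example below rebuilds the children data from above and passes the result to tidyr::unnest_longer(). It assumes the tidyr package is installed; the variable names (families, nested, long) are illustrative.

```r
library(tibblify)

# Same data as in the vector example above.
families <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)

# One row per object, with `children` as a list column.
nested <- tibblify(
  families,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)

# One row per child; `id` is repeated for each child of the same object.
long <- tidyr::unnest_longer(nested, children)
long
```

The flattened tibble has one row per element of children, so an object with three children contributes three rows.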

Object Elements

For example in gh_repos_small

gh_repos_small <- purrr::map(gh_repos, ~ .x[c("id", "name", "owner")])
gh_repos_small <- purrr::map(
  gh_repos_small,
  function(repo) {
    repo$owner <- repo$owner[c("login", "id", "url")]
    repo
  }
)

gh_repos_small[[1]]
#> $id
#> [1] 61160198
#>
#> $name
#> [1] "after"
#>
#> $owner
#> $owner$login
#> [1] "gaborcsardi"
#>
#> $owner$id
#> [1] 660288
#>
#> $owner$url
#> [1] "https://api.github.com/users/gaborcsardi"

the field owner is an object itself. The specification to extract it uses tib_row()

spec <- guess_tspec(gh_repos_small)
spec
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   tib_row(
#>     "owner",
#>     tib_chr("login"),
#>     tib_int("id"),
#>     tib_chr("url"),
#>   ),
#> )

and results in a tibble column

tibblify(gh_repos_small, spec)
#> # A tibble: 30 × 3
#>          id name        owner$login    $id $url
#>
#>  1 61160198 after       gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  2 40500181 argufy      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  3 36442442 ask         gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  4 34924886 baseimports gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  5 61620661 citest      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  6 33907457 clisymbols  gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  7 37236467 cmaker      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  8 67959624 cmark       gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  9 63152619 conditions  gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 10 24343686 crayon      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> # ℹ 20 more rows

If you don’t like the tibble column, you can unpack it with tidyr::unpack(). Alternatively, if you only want to extract some of the fields in owner, you can use a nested path

spec2 <- tspec_df(
  id = tib_int("id"),
  name = tib_chr("name"),
  owner_id = tib_int(c("owner", "id")),
  owner_login = tib_chr(c("owner", "login"))
)
spec2
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   owner_id = tib_int(c("owner", "id")),
#>   owner_login = tib_chr(c("owner", "login")),
#> )

tibblify(gh_repos_small, spec2)
#> # A tibble: 30 × 4
#>          id name        owner_id owner_login
#>
#>  1 61160198 after         660288 gaborcsardi
#>  2 40500181 argufy        660288 gaborcsardi
#>  3 36442442 ask           660288 gaborcsardi
#>  4 34924886 baseimports   660288 gaborcsardi
#>  5 61620661 citest        660288 gaborcsardi
#>  6 33907457 clisymbols    660288 gaborcsardi
#>  7 37236467 cmaker        660288 gaborcsardi
#>  8 67959624 cmark         660288 gaborcsardi
#>  9 63152619 conditions    660288 gaborcsardi
#> 10 24343686 crayon        660288 gaborcsardi
#> # ℹ 20 more rows
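For comparison with the nested-path approach, the other option mentioned above, flattening the tibble column with tidyr::unpack(), might look like the following sketch. It uses a small made-up collection (repos) rather than gh_repos_small so that it runs on its own, and it assumes tidyr is installed. The names_sep argument is used because the outer id and the owner's id would otherwise clash.

```r
library(tibblify)

# Hypothetical, shortened stand-in for gh_repos_small.
repos <- list(
  list(id = 1L, name = "after", owner = list(login = "gaborcsardi", id = 660288L)),
  list(id = 2L, name = "argufy", owner = list(login = "gaborcsardi", id = 660288L))
)

repo_spec <- tspec_df(
  tib_int("id"),
  tib_chr("name"),
  tib_row(
    "owner",
    tib_chr("login"),
    tib_int("id")
  )
)

# `owner` is a tibble column in the result.
nested <- tibblify(repos, repo_spec)

# Spread the inner columns into the outer tibble, prefixed with "owner_".
flat <- tidyr::unpack(nested, owner, names_sep = "_")
names(flat)
```

Unpacking keeps every field of owner, whereas a nested path extracts only the fields that are listed in the specification.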

Required and Optional Fields

Objects usually have some fields that always exist and some that are optional. By default tib_*() demands that a field exists

x <- list(
  list(x = 1, y = "a"),
  list(x = 2)
)

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y")
)

tibblify(x, spec)
#> Error in tibblify():
#> ! Field y is required but does not exist in x[[2]].
#> ℹ Use required = FALSE if the field is optional.

You can mark a field as optional with the argument required = FALSE:

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE)
)

tibblify(x, spec)
#> # A tibble: 2 × 2
#>       x y
#>
#> 1     1 a
#> 2     2

You can specify the value to use for a missing optional field with the fill argument

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE, fill = "missing")
)

tibblify(x, spec)
#> # A tibble: 2 × 2
#>       x y
#>
#> 1     1 a
#> 2     2 missing

Converting a Single Object

To rectangle a single object you have two options: tspec_object(), which produces a list, or tspec_row(), which produces a tibble with one row.

While tibbles are great, for a single object it often makes more sense to convert it to a list.

For example a typical API response might be something like

api_output <- list(
  status = "success",
  requested_at = "2021-10-26 09:17:12",
  data = list(
    list(x = 1),
    list(x = 2)
  )
)

To convert to a one row tibble

row_spec <- tspec_row(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_df <- tibblify(api_output, row_spec)
api_output_df
#> # A tibble: 1 × 2
#>   status  data
#>           <list<tibble[,1]>>
#> 1 success [2 × 1]

it is necessary to wrap data in a list. To access data one has to use api_output_df$data[[1]], which is not very nice. Using tspec_object() instead avoids this:

object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_list <- tibblify(api_output, object_spec)
api_output_list
#> $status
#> [1] "success"
#>
#> $data
#> # A tibble: 2 × 1
#>       x
#>
#> 1     1
#> 2     2

Now accessing data does not require an extra subsetting step

api_output_list$data
#> # A tibble: 2 × 1
#>       x
#>
#> 1     1
#> 2     2
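Since such responses usually arrive as JSON, as noted in the introduction, the same conversion can be sketched starting from a JSON string. This assumes the jsonlite package is installed; the payload below is made up to mirror api_output, and the variable names (json, parsed, result) are illustrative.

```r
library(tibblify)

# Made-up JSON payload with the same structure as `api_output`.
json <- '{
  "status": "success",
  "requested_at": "2021-10-26 09:17:12",
  "data": [{"x": 1}, {"x": 2}]
}'

# simplifyVector = FALSE keeps the parsed result as nested lists,
# which is the kind of input tibblify() expects.
parsed <- jsonlite::fromJSON(json, simplifyVector = FALSE)

object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

result <- tibblify(parsed, object_spec)
result$data
```

Fields that are not listed in the specification, such as requested_at, are left out of the result, just as in the api_output example.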

Code of Conduct

Please note that the tibblify project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.