GitHub - mgirlich/tibblify: Rectangle Nested Lists

tibblify

The goal of tibblify is to provide an easy way of converting a nested list into a tibble.

Installation

You can install the released version of tibblify from CRAN with:

install.packages("tibblify")

Or install the development version from GitHub with:

install.packages("devtools")

devtools::install_github("mgirlich/tibblify")

Introduction

With tibblify() you can rectangle deeply nested lists into a tidy tibble. These lists might come from an API in the form of JSON or from scraping XML. The reasons to use tibblify() over other tools like jsonlite::fromJSON() or tidyr::hoist() are:

It can guess the output format like jsonlite::fromJSON().
You can also provide a specification of how to rectangle.
The specification is easy to understand.
You can bring most inputs into the shape you want in a single step.
Rectangling is much faster than with jsonlite::fromJSON().

Example

Let’s start with gh_users, which is a list containing information about four GitHub users.

library(tibblify)

gh_users_small <- purrr::map(gh_users, ~ .x[c("followers", "login", "url", "name", "location", "email", "public_gists")])

names(gh_users_small[[1]])
#> [1] "followers" "login" "url" "name" "location"
#> [6] "email" "public_gists"

Quickly rectangling gh_users_small is as easy as applying tibblify() to it:

tibblify(gh_users_small)
#> The spec contains 1 unspecified field:
#> • email
#> # A tibble: 4 × 7
#>   followers login      url                    name  location email public_gists
#> 1       780 jennybc    https://api.github.co… Jenn… Vancouv…                 54
#> 2      3958 jtleek     https://api.github.co… Jeff… Baltimo…                 12
#> 3       115 juliasilge https://api.github.co… Juli… Salt La…                  4
#> 4       213 leeper     https://api.github.co… Thom… London,…                 46

We can now look at the specification tibblify() used for rectangling:

guess_tspec(gh_users_small)
#> The spec contains 1 unspecified field:
#> • email
#> tspec_df(
#>   tib_int("followers"),
#>   tib_chr("login"),
#>   tib_chr("url"),
#>   tib_chr("name"),
#>   tib_chr("location"),
#>   tib_unspecified("email"),
#>   tib_int("public_gists"),
#> )

If we are only interested in some of the fields, we can easily adapt the specification:

spec <- tspec_df(
  login_name = tib_chr("login"),
  tib_chr("name"),
  tib_int("public_gists")
)

tibblify(gh_users_small, spec)
#> # A tibble: 4 × 3
#>   login_name name                   public_gists
#> 1 jennybc    Jennifer (Jenny) Bryan           54
#> 2 jtleek     Jeff L.                          12
#> 3 juliasilge Julia Silge                       4
#> 4 leeper     Thomas J. Leeper                 46
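
The guess-then-reuse workflow shown above can also be tried on a small hand-built list. The sketch below is only an illustration: the `people` data and the variable names are made up for this example, and it assumes tibblify is installed.

```r
library(tibblify)

# Hypothetical input: two objects with the same fields
people <- list(
  list(id = 1L, name = "Peter"),
  list(id = 2L, name = "Lilly")
)

# Step 1: let tibblify guess a specification once
guessed_spec <- guess_tspec(people)
guessed_spec

# Step 2: pass the specification explicitly, so later runs
# do not depend on guessing
people_df <- tibblify(people, guessed_spec)
people_df
```

In a script you would typically copy the printed specification into the code and edit it by hand, rather than calling guess_tspec() on every run.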

Objects

We refer to lists like gh_users_small as collections, and objects are the elements of such lists. Objects and collections are the typical input for tibblify().

Basically, an object is simply something that can be converted to a one row tibble. This boils down to a condition on the names of the object:

the object must have names (the names attribute must not be NULL),
every element must be named (no name can be NA or ""),
and the names must be unique.

In other words, the names must fulfill vec_as_names(repair = "check_unique"). The name-value pairs of an object are the fields.

For example, list(x = 1, y = "a") is an object with the fields (x, 1) and (y, "a"), but list(1, z = 3) is not an object because it is not fully named.

A collection is basically just a list of similar objects so that the fields can become the columns in a tibble.
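
The naming rules above can be written down as a small check in base R. This is a sketch, not part of tibblify: `is_object()` and `is_collection()` are hypothetical helper names, and the collection check only verifies that every element is an object, not that the objects are similar.

```r
# Hypothetical helper: TRUE if `x` satisfies the three naming rules
is_object <- function(x) {
  nms <- names(x)
  is.list(x) &&
    !is.null(nms) &&          # the names attribute must exist
    !anyNA(nms) &&            # no name may be NA
    all(nms != "") &&         # no name may be empty
    anyDuplicated(nms) == 0   # names must be unique
}

# Hypothetical helper: TRUE if every element of `x` is an object
is_collection <- function(x) {
  is.list(x) && all(vapply(x, is_object, logical(1)))
}

is_object(list(x = 1, y = "a"))
#> [1] TRUE
is_object(list(1, z = 3))
#> [1] FALSE
is_collection(list(list(x = 1), list(x = 2, y = "a")))
#> [1] TRUE
```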

Specification

Providing an explicit specification has a couple of advantages:

you can ensure type and shape stability of the resulting tibble in automated scripts.
you can give the columns different names.
you can restrict to parsing only the fields you need.
you can specify what happens if a value is missing.

As seen before, the specification for a collection is created with tspec_df(). The columns of the output tibble are described with the tib_*() functions. They describe the path to the field to extract and the output type of the field. There are the following five types of functions:

tib_scalar(ptype): a length one vector with type ptype
tib_vector(ptype): a vector of arbitrary length with type ptype
tib_variant(): a vector of arbitrary length and type; you should barely ever need this
tib_row(...): an object with the fields ...
tib_df(...): a collection where the objects have the fields ...
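
As an illustration of tib_df(), the sketch below rectangles a field that is itself a collection. The `orders` data and its field names are invented for this example.

```r
library(tibblify)

# Hypothetical input: each order has a nested collection of items
orders <- list(
  list(id = 1L, items = list(
    list(sku = "a", qty = 2L),
    list(sku = "b", qty = 1L)
  )),
  list(id = 2L, items = list(
    list(sku = "c", qty = 5L)
  ))
)

spec <- tspec_df(
  tib_int("id"),
  # `items` is a collection whose objects have the fields sku and qty
  tib_df(
    "items",
    tib_chr("sku"),
    tib_int("qty")
  )
)

orders_df <- tibblify(orders, spec)

# Each element of the `items` column is a tibble of its own
orders_df$items[[1]]
```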

For convenience there are shortcuts for tib_scalar() and tib_vector() for the most common prototypes:

logical(): tib_lgl() and tib_lgl_vec()
integer(): tib_int() and tib_int_vec()
double(): tib_dbl() and tib_dbl_vec()
character(): tib_chr() and tib_chr_vec()
Date: tib_date() and tib_date_vec()
Date encoded as character: tib_chr_date() and tib_chr_date_vec()
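
To make the relationship between the shortcuts and the general functions concrete, the sketch below writes the same specification twice, once with shortcuts and once with an explicit ptype. The `scores` data is made up for this example.

```r
library(tibblify)

# Hypothetical input with one scalar field and one vector field
scores <- list(
  list(n = 1L, tags = c("a", "b")),
  list(n = 2L, tags = "c")
)

# Shortcut form
short_spec <- tspec_df(
  tib_int("n"),
  tib_chr_vec("tags")
)

# Long form with explicit prototypes
long_spec <- tspec_df(
  tib_scalar("n", ptype = integer()),
  tib_vector("tags", ptype = character())
)

short_df <- tibblify(scores, short_spec)
long_df <- tibblify(scores, long_spec)

# Both specifications should extract the same values
identical(short_df$n, long_df$n)
```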

Scalar Elements

Scalar elements are the most common case and result in a normal vector column:

tibblify(
  list(
    list(id = 1, name = "Peter"),
    list(id = 2, name = "Lilly")
  ),
  tspec_df(
    tib_int("id"),
    tib_chr("name")
  )
)
#> # A tibble: 2 × 2
#>      id name
#> 1     1 Peter
#> 2     2 Lilly

With tib_scalar() you can also provide your own prototype

Let’s say you have a list with durations

x <- list(
  list(id = 1, duration = vctrs::new_duration(100)),
  list(id = 2, duration = vctrs::new_duration(200))
)
x
#> [[1]]
#> [[1]]$id
#> [1] 1
#>
#> [[1]]$duration
#> Time difference of 100 secs
#>
#>
#> [[2]]
#> [[2]]$id
#> [1] 2
#>
#> [[2]]$duration
#> Time difference of 200 secs

and then use it in tib_scalar()

tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_scalar("duration", ptype = vctrs::new_duration())
  )
)
#> # A tibble: 2 × 2
#>      id duration
#> 1     1 100 secs
#> 2     2 200 secs

Vector Elements

If an element does not always have size one, then it is a vector element. If it still always has the same type ptype, then it produces a list of ptype column:

x <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)

tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)
#> # A tibble: 3 × 2
#>      id children
#>         <list>
#> 1     1 [2]
#> 2     2 [1]
#> 3     3 [3]

You can use tidyr::unnest() or tidyr::unnest_longer() to flatten these columns to regular columns.
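
A minimal sketch of that flattening step, assuming tidyr is installed. The `families` data repeats the children example with integer ids.

```r
library(tibblify)

# Hypothetical input: same shape as the children example
families <- list(
  list(id = 1L, children = c("Peter", "Lilly")),
  list(id = 2L, children = "James"),
  list(id = 3L, children = c("Emma", "Noah", "Charlotte"))
)

families_df <- tibblify(
  families,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)

# One row per child; `id` is repeated for each element of `children`
families_long <- tidyr::unnest_longer(families_df, children)
families_long
```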

Object Elements

For example in gh_repos_small

gh_repos_small <- purrr::map(gh_repos, ~ .x[c("id", "name", "owner")])
gh_repos_small <- purrr::map(
  gh_repos_small,
  function(repo) {
    repo$owner <- repo$owner[c("login", "id", "url")]
    repo
  }
)

gh_repos_small[[1]]
#> $id
#> [1] 61160198
#>
#> $name
#> [1] "after"
#>
#> $owner
#> $owner$login
#> [1] "gaborcsardi"
#>
#> $owner$id
#> [1] 660288
#>
#> $owner$url
#> [1] "https://api.github.com/users/gaborcsardi"

the field owner is an object itself. The specification to extract it uses tib_row():

spec <- guess_tspec(gh_repos_small)
spec
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   tib_row(
#>     "owner",
#>     tib_chr("login"),
#>     tib_int("id"),
#>     tib_chr("url"),
#>   ),
#> )

and results in a tibble column:

tibblify(gh_repos_small, spec)
#> # A tibble: 30 × 3
#>          id name        owner$login    $id $url
#>  1 61160198 after       gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  2 40500181 argufy      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  3 36442442 ask         gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  4 34924886 baseimports gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  5 61620661 citest      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  6 33907457 clisymbols  gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  7 37236467 cmaker      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  8 67959624 cmark       gaborcsardi 660288 https://api.github.com/users/gaborcs…
#>  9 63152619 conditions  gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> 10 24343686 crayon      gaborcsardi 660288 https://api.github.com/users/gaborcs…
#> # ℹ 20 more rows

If you don’t like the tibble column, you can unpack it with tidyr::unpack(). Alternatively, if you only want to extract some of the fields in owner, you can use a nested path:

spec2 <- tspec_df(
  id = tib_int("id"),
  name = tib_chr("name"),
  owner_id = tib_int(c("owner", "id")),
  owner_login = tib_chr(c("owner", "login"))
)
spec2
#> tspec_df(
#>   tib_int("id"),
#>   tib_chr("name"),
#>   owner_id = tib_int(c("owner", "id")),
#>   owner_login = tib_chr(c("owner", "login")),
#> )

tibblify(gh_repos_small, spec2)
#> # A tibble: 30 × 4
#>          id name        owner_id owner_login
#>  1 61160198 after         660288 gaborcsardi
#>  2 40500181 argufy        660288 gaborcsardi
#>  3 36442442 ask           660288 gaborcsardi
#>  4 34924886 baseimports   660288 gaborcsardi
#>  5 61620661 citest        660288 gaborcsardi
#>  6 33907457 clisymbols    660288 gaborcsardi
#>  7 37236467 cmaker        660288 gaborcsardi
#>  8 67959624 cmark         660288 gaborcsardi
#>  9 63152619 conditions    660288 gaborcsardi
#> 10 24343686 crayon        660288 gaborcsardi
#> # ℹ 20 more rows
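
For comparison, here is a sketch of the tidyr::unpack() route on a small hand-built collection. The `repos` data is invented for this example. Because the outer object and owner both have an id field, the sketch passes names_sep so that the unpacked column names stay unique.

```r
library(tibblify)

# Hypothetical input: same shape as gh_repos_small, but tiny
repos <- list(
  list(id = 1L, name = "after", owner = list(login = "gaborcsardi", id = 10L)),
  list(id = 2L, name = "argufy", owner = list(login = "gaborcsardi", id = 10L))
)

spec <- tspec_df(
  tib_int("id"),
  tib_chr("name"),
  tib_row(
    "owner",
    tib_chr("login"),
    tib_int("id")
  )
)

repos_df <- tibblify(repos, spec)

# Turn the tibble column `owner` into the regular columns
# owner_login and owner_id
repos_flat <- tidyr::unpack(repos_df, owner, names_sep = "_")
names(repos_flat)
#> [1] "id"          "name"        "owner_login" "owner_id"
```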

Required and Optional Fields

Objects usually have some fields that always exist and some that are optional. By default, tib_*() demands that a field exists:

x <- list(
  list(x = 1, y = "a"),
  list(x = 2)
)

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y")
)

tibblify(x, spec)
#> Error in tibblify():
#> ! Field y is required but does not exist in x[[2]].
#> ℹ Use required = FALSE if the field is optional.

You can mark a field as optional with the argument required = FALSE:

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE)
)

tibblify(x, spec)
#> # A tibble: 2 × 2
#>       x y
#> 1     1 a
#> 2     2 <NA>

You can specify the value to use when the field is missing with the fill argument:

spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE, fill = "missing")
)

tibblify(x, spec)
#> # A tibble: 2 × 2
#>       x y
#> 1     1 a
#> 2     2 missing
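
The same pattern applies to the other tib_*() shortcuts. The sketch below uses a made-up `downloads` list in which an optional integer count defaults to 0.

```r
library(tibblify)

# Hypothetical input: the second object has no `n` field
downloads <- list(
  list(pkg = "a", n = 5L),
  list(pkg = "b")
)

spec <- tspec_df(
  tib_chr("pkg"),
  # optional field; use 0 when it is absent
  tib_int("n", required = FALSE, fill = 0L)
)

downloads_df <- tibblify(downloads, spec)
downloads_df$n
#> [1] 5 0
```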

Converting a Single Object

To rectangle a single object you have two options: tspec_object(), which produces a list, or tspec_row(), which produces a tibble with one row.

While tibbles are great for collections, for a single object it often makes more sense to convert it to a list.

For example a typical API response might be something like

api_output <- list(
  status = "success",
  requested_at = "2021-10-26 09:17:12",
  data = list(
    list(x = 1),
    list(x = 2)
  )
)

To convert to a one row tibble

row_spec <- tspec_row(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_df <- tibblify(api_output, row_spec)
api_output_df
#> # A tibble: 1 × 2
#>   status  data
#>           <list<tibble[,1]>>
#> 1 success [2 × 1]

it is necessary to wrap data in a list. To access data one has to use api_output_df$data[[1]], which is not very nice. With tspec_object() the wrapping is not needed:

object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)

api_output_list <- tibblify(api_output, object_spec)
api_output_list
#> $status
#> [1] "success"
#>
#> $data
#> # A tibble: 2 × 1
#>       x
#> 1     1
#> 2     2

Now accessing data does not require an extra subsetting step:

api_output_list$data
#> # A tibble: 2 × 1
#>       x
#> 1     1
#> 2     2
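
To see the two options side by side, the sketch below applies both kinds of specification to the same input and compares how the nested data is reached. The `response` list is a made-up, shortened version of the API output above.

```r
library(tibblify)

# Hypothetical API response
response <- list(
  status = "success",
  data = list(
    list(x = 1L),
    list(x = 2L)
  )
)

row_spec <- tspec_row(
  tib_chr("status"),
  tib_df("data", tib_int("x"))
)

object_spec <- tspec_object(
  tib_chr("status"),
  tib_df("data", tib_int("x"))
)

as_row <- tibblify(response, row_spec)
as_object <- tibblify(response, object_spec)

# Row form: `data` is a list column, so one extra [[1]] is needed
x_from_row <- as_row$data[[1]]$x

# Object form: `data` is the tibble itself
x_from_object <- as_object$data$x

identical(x_from_row, x_from_object)
```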

Code of Conduct

Please note that the tibblify project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.