Example 1: Basic usage

Use tidyfst just like dplyr

This part of the vignette follows dplyr’s vignette at https://dplyr.tidyverse.org/articles/dplyr.html. We’ll try to reproduce all of its results. First, load the needed packages.
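
The setup might look like the following sketch. It assumes that tidyfst, data.table and nycflights13 are installed; converting flights to a data.table up front is a convenience chosen for this sketch, not a requirement.

```r
library(tidyfst)
library(data.table)
library(nycflights13)

# Convert the tibble shipped with nycflights13 to a data.table once,
# so that later calls do not have to repeat the conversion.
flights <- as.data.table(nycflights13::flights)
dim(flights)
```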

Filter rows with filter_dt()
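
A minimal sketch of a filter call, assuming flights comes from nycflights13 (the variable name jan1 is illustrative). Both conditions are combined into a single logical expression with &:

```r
library(tidyfst)
library(nycflights13)

# Keep only the flights that departed on January 1st.
jan1 <- filter_dt(flights, month == 1 & day == 1)
nrow(jan1)
```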

Note that commas cannot be used to separate conditions, which means filter_dt(flights, month == 1, day == 1) would return an error.

Arrange rows with arrange_dt()

Use - (minus symbol) to order a column in descending order:
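
As a sketch, assuming flights comes from nycflights13 (the result names are illustrative), ascending and descending orderings could be written as:

```r
library(tidyfst)
library(nycflights13)

# Ascending order by year, then month, then day.
by_date <- arrange_dt(flights, year, month, day)

# Descending order by arrival delay, using the minus symbol.
by_delay <- arrange_dt(flights, -arr_delay)
head(by_delay$arr_delay)
```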

Select columns with select_dt()

select_dt(flights, year:day) and select_dt(flights, -(year:day)) are not supported. However, I have added a feature for selecting columns by regular expression, which means you can:
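
A short sketch of both styles, assuming flights comes from nycflights13. The pattern "_delay" is chosen for illustration; in this data set it matches dep_delay and arr_delay:

```r
library(tidyfst)
library(nycflights13)

# Select columns by bare name.
ymd <- select_dt(flights, year, month, day)

# Select every column whose name matches a regular expression.
delays <- select_dt(flights, "_delay")
names(delays)
```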

The rename process is almost the same as that in dplyr:
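
For instance, assuming flights comes from nycflights13, a rename with rename_dt() takes the form new_name = old_name (the new name tail_num is illustrative):

```r
library(tidyfst)
library(nycflights13)

# Rename tailnum to tail_num; all other columns are kept as they are.
renamed <- rename_dt(flights, tail_num = tailnum)
"tail_num" %in% names(renamed)
```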

Add new columns with mutate_dt()

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

However, if a new column depends on another column created in the same call, please split them into separate calls. The following code would not work:

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

Instead, use:
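
One way to split the call, sketched here under the assumption that flights comes from nycflights13, is to create gain first and derive gain_per_hour in a second mutate_dt() call (the result name with_gain is illustrative):

```r
library(tidyfst)
library(nycflights13)

with_gain <- flights %>%
  # First call: create the column that the next step depends on.
  mutate_dt(gain = arr_delay - dep_delay) %>%
  # Second call: gain now exists, so it can be used here.
  mutate_dt(gain_per_hour = gain / (air_time / 60))
```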

If you only want to keep the new variables, use transmute_dt():
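
A minimal sketch, assuming flights comes from nycflights13 and reusing the two columns computed with mutate_dt() earlier in this section:

```r
library(tidyfst)
library(nycflights13)

# Only the newly computed columns are returned.
new_only <- transmute_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
names(new_only)
```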

Summarise values with summarise_dt()
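
As a sketch, assuming flights comes from nycflights13, an ungrouped summary collapses the whole table to a single row (the column name delay is illustrative):

```r
library(tidyfst)
library(nycflights13)

# Mean departure delay over all flights, ignoring missing values.
overall <- summarise_dt(flights, delay = mean(dep_delay, na.rm = TRUE))
overall
```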

Randomly sample rows with sample_n_dt() and sample_frac_dt()
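
A brief sketch of both functions, assuming flights comes from nycflights13. The sample sizes and the seed are arbitrary choices for illustration:

```r
library(tidyfst)
library(nycflights13)

set.seed(1)  # only to make this sketch reproducible

# A fixed number of rows.
ten_rows <- sample_n_dt(flights, 10)

# A fixed fraction of rows (here 1%).
one_pct <- sample_frac_dt(flights, 0.01)
nrow(ten_rows)
```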

Grouped operations

For the dplyr code below:

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

We could get it via:

flights %>% 
  summarise_dt(count = .N,
               dist = mean(distance, na.rm = TRUE),
               delay = mean(arr_delay, na.rm = TRUE),
               by = tailnum) %>% 
  filter_dt(count > 20 & dist < 2000)

summarise_dt (or summarize_dt) has a parameter “by” that specifies the grouping variable(s). We could find the number of planes and the number of flights that go to each possible destination:

# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

summarise_dt(flights,planes = uniqueN(tailnum),flights = .N,by = dest) %>% 
  arrange_dt(dest)

If you need to group by many variables, use:

# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N)

# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights))

# (per_year  <- summarise(per_month, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights)) %>% 
  summarise_dt(by = .(year),flights = sum(flights))

Comparison with data.table syntax

tidyfst provides a tidy syntax for data.table. By design, tidyfst never runs faster than the analogous data.table code. Nevertheless, it helps dplyr users gain computation performance in no time and guides them to learn more about data.table for speed. Below, we’ll compare the syntax of tidyfst and data.table (referring to Introduction to data.table). This shows how the two differ and lets users choose whichever they prefer. Ideally, tidyfst will lead even more users to learn about data.table and its wonderful features, so as to design more extensions for tidyfst in the future.

Data

Because we want a more stable data source, here we’ll use the flight data from the above nycflights13 package.

Subset rows
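
A side-by-side sketch of row subsetting, assuming flights is the nycflights13 table converted to a data.table. The particular conditions and result names are illustrative:

```r
library(tidyfst)
library(data.table)
library(nycflights13)

flights <- as.data.table(nycflights13::flights)

# data.table
jfk_june_dt <- flights[origin == "JFK" & month == 6L]
first_two_dt <- flights[1:2]
ordered_dt <- flights[order(origin, -dest)]

# tidyfst
jfk_june_tf <- flights %>% filter_dt(origin == "JFK" & month == 6L)
first_two_tf <- flights %>% slice_dt(1:2)
ordered_tf <- flights %>% arrange_dt(origin, -dest)
```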

Select column(s)

# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)

Mixed computation

The examples above show that in tidyfst you can still use data.table’s special symbols, such as .N.
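
For instance, the following sketch counts the June flights from JFK in both syntaxes. It assumes flights is the nycflights13 table converted to a data.table; the column name n is illustrative:

```r
library(tidyfst)
library(data.table)
library(nycflights13)

flights <- as.data.table(nycflights13::flights)

# data.table: .N is the number of rows in the subset.
n_dt <- flights[origin == "JFK" & month == 6L, .N]

# tidyfst: the same symbol can be used inside summarise_dt().
n_tf <- flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(n = .N)
```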

Refer to columns by names

# data.table
flights[, c("arr_delay", "dep_delay")]

select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]

flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]

# returns year,month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))

select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)

flights %>% select_dt(-arr_delay,-dep_delay)

flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))

Aggregations

# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin,dest)
flights %>% filter_dt(carrier == "AA") %>% 
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))

Note that currently keyby is not used in tidyfst. This feature might be included in the future for better performance in order-independent tasks. Moreover, count_dt sorts automatically by the counted number; this can be controlled by the parameter “sort”.

# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  count_dt(origin,dest,sort = FALSE) %>% 
  arrange_dt(origin,-dest)
flights %>% 
  summarise_dt(.N,by = .(dep_delay>0, arr_delay>0))

Now let’s try a more complex example:

# data.table
flights[carrier == "AA", 
        lapply(.SD, mean), 
        by = .(origin, dest, month), 
        .SDcols = c("arr_delay", "dep_delay")] 

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean)
           )

Let me explain what happens here, especially in group_dt. First, filter by the condition carrier == "AA", then group by three variables: origin, dest and month. Last, summarise the columns with “_delay” in their names by taking the mean of each such variable. This is a very creative design, utilizing .SD in data.table and upgrading the group_by function in dplyr (because you never need to ungroup now, just put the grouped operations in group_dt). You can also pipe inside the group_dt function. Let’s play with it a little bit further:
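
As one possible way to pipe inside group_dt, the sketch below replaces at_dt with an explicit summarise_dt call and then adds a column within each group. The column names arr, dep and total are illustrative, and loading flights from nycflights13 is an assumption of the sketch:

```r
library(tidyfst)
library(nycflights13)

aa_delays <- flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    # Everything below runs once per group.
    summarise_dt(arr = mean(arr_delay, na.rm = TRUE),
                 dep = mean(dep_delay, na.rm = TRUE)) %>%
      mutate_dt(total = arr + dep)
  )
```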

However, I don’t recommend using it if you don’t actually need it for group computation (just start another pipe after group_dt). Now let’s end with some easy examples:

# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>% 
  group_dt(by = month,head(2))

Deep inside, tidyfst is born from dplyr and data.table, and uses stringr to make flexible APIs, so as to bring their strengths into full play.