Alternatives for rowwise jobs

Article overview

There are many ways to perform rowwise jobs in R. In this Article, we consider these alternatives in turn. We stick to the drug usage example shown in the introduction: the idea is to compare alternative ways to create a new variable named everused, which indicates whether each respondent has ever used any of the considered pain relievers for non-medical purposes.

Loading packages

This Article requires you to load the following packages:

library(lay)        ## for lay() and the data
library(dplyr)      ## for many things
library(tidyr)      ## for pivot_longer() and pivot_wider()
library(purrr)      ## for pmap_lgl()
library(slider)     ## for slide_vec()
library(data.table) ## for an alternative to base and dplyr

Please install them if they are not present on your system.

Alternative 1: vectorized solution

One solution is to simply do the following:

drugs_full |>
  mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)
#> # A tibble: 55,271 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 55,261 more rows

It is certainly very efficient from a computational point of view, but coding this way presents two main limitations:

you need to name all columns explicitly, which can be problematic when dealing with many columns
you can only express your task with logical and arithmetic operators, which is not always sufficient
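The first limitation can be softened in base R by folding the operator over the columns. The following is a minimal sketch, not taken from the original example: `toy` is a hypothetical three-row data frame that only mimics the column names used here, and `Reduce()` is one possible way to avoid typing each column name.

```r
# Toy data (hypothetical values; column names mimic the article's dataset)
toy <- data.frame(
  caseid  = c("1", "2", "3"),
  codeine = c(0L, 0L, 1L),
  morphin = c(0L, 0L, 0L),
  vicolor = c(0L, 1L, 0L)
)

# Explicit version: every column has to be typed out
everused_explicit <- toy$codeine | toy$morphin | toy$vicolor

# Same elementwise OR, folded over all columns except the identifier
drug_cols <- setdiff(names(toy), "caseid")
everused_reduce <- Reduce(`|`, toy[drug_cols])

# Both approaches give the same logical vector
identical(everused_explicit, everused_reduce)
```

Note that this only addresses the first limitation: the job still has to be expressible as a chain of binary operators.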

Alternative 2: 100% {dplyr}

drugs |>
  rowwise() |>
  mutate(everused = any(c_across(-caseid))) |>
  ungroup()
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows

It is easy to use, as c_across() turns its input into a vector and rowwise() ensures that this vector represents only one row at a time. Yet, for now, it remains quite slow on large datasets (see Efficiency below).
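To make the "one vector per row" idea concrete, here is a rough base R sketch of what happens conceptually. It is not how {dplyr} is implemented: `toy` and `row_vectors` are hypothetical names introduced for illustration only.

```r
# Toy data (hypothetical values; column names mimic the article's dataset)
toy <- data.frame(
  caseid  = c("1", "2", "3"),
  codeine = c(0L, 0L, 1L),
  morphin = c(0L, 0L, 0L),
  vicolor = c(0L, 1L, 0L)
)
drug_cols <- setdiff(names(toy), "caseid")

# For each row, collapse the selected columns into a single named vector
row_vectors <- lapply(
  seq_len(nrow(toy)),
  function(i) unlist(toy[i, drug_cols])
)

# The third row, seen as a vector of three values
row_vectors[[3]]

# Applying any() to each per-row vector yields one result per row
everused <- vapply(row_vectors, any, logical(1))
everused
```

Building one small vector per row is also a plausible reason why this style of computation gets slow as the number of rows grows.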

Alternative 3: {tidyr}

library(tidyr)  ## requires {tidyr} to be installed

drugs |>
  pivot_longer(-caseid) |>
  group_by(caseid) |>
  mutate(everused = any(value)) |>
  ungroup() |>
  pivot_wider() |>
  relocate(everused, .after = last_col())
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows

Here the trick is to turn the rowwise problem into a columnwise problem by pivoting the values and then pivoting the results back. Many find that this involves a little too much intellectual gymnastics. It is also not particularly efficient on large datasets, both in terms of computation time and of the memory required to pivot the tables.
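The pivoting logic, and the reason it costs memory, may be easier to see in a small base R sketch. This is an illustration under assumptions rather than what {tidyr} does internally: `toy`, `long` and `any_by_id` are hypothetical names, and `tapply()` stands in for the grouped computation.

```r
# Toy data (hypothetical values; column names mimic the article's dataset)
toy <- data.frame(
  caseid  = c("1", "2", "3"),
  codeine = c(0L, 0L, 1L),
  morphin = c(0L, 0L, 0L),
  vicolor = c(0L, 1L, 0L)
)
drug_cols <- setdiff(names(toy), "caseid")

# "Pivot longer": one row per (caseid, drug) pair
long <- data.frame(
  caseid = rep(toy$caseid, times = length(drug_cols)),
  name   = rep(drug_cols, each = nrow(toy)),
  value  = unlist(toy[drug_cols], use.names = FALSE)
)

# The long table has nrow(toy) * length(drug_cols) rows
nrow(long)

# Grouped computation: the rowwise job is now a per-group job
any_by_id <- tapply(long$value, long$caseid, any)

# "Pivot back": attach the per-id result to the original wide table
toy$everused <- as.vector(any_by_id[toy$caseid])
toy
```

The intermediate table grows with the number of rows times the number of pivoted columns, which is where the extra memory goes.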

Alternative 4: {purrr}

library(purrr)  ## requires {purrr} to be installed

drugs |>
  mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows

This is a perfectly fine solution and actually part of what one implementation of lay() relies on (if .method = "tidy"), but from a user perspective it is a little too geeky-scary.
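The scary part is mostly the `~ any(...)` shorthand, which in {purrr} stands for `function(...) any(...)`: the columns are walked in parallel and the values of each row arrive as separate arguments. A base R analogue using mapply() may help to see this. It is only a sketch: `toy` and `any_of_row` are hypothetical names, and mapply() is used here as a stand-in, not as what pmap_lgl() does internally.

```r
# Toy data (hypothetical values; column names mimic the article's dataset)
toy <- data.frame(
  caseid  = c("1", "2", "3"),
  codeine = c(0L, 0L, 1L),
  morphin = c(0L, 0L, 0L),
  vicolor = c(0L, 1L, 0L)
)
drug_cols <- setdiff(names(toy), "caseid")

# Explicit spelling of the lambda: all values of a row land in `...`
any_of_row <- function(...) any(...)

# Iterate over the columns in parallel, one call per row
everused <- unname(
  do.call(mapply, c(list(FUN = any_of_row), toy[drug_cols]))
)
everused
```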

Alternative 5: {slider}

library(slider)   ## requires {slider} to be installed

drugs |>
  mutate(everused = slide_vec(pick(-caseid), any))
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows

The package {slider} is a powerful package which provides several sliding window functions. It can be used to perform rowwise operations and is quite similar to {lay} in terms of syntax. It is, however, not as efficient as {lay}, and I am not sure it supports the automatic splicing demonstrated above.

Alternative 6: {data.table}

library(data.table)  ## requires {data.table} to be installed

drugs_dt <- data.table(drugs)

drugs_dt[, ..I := .I]
drugs_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]
drugs_dt[, ..I := NULL]
as_tibble(drugs_dt)
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows

This is a solution for those using {data.table}. It is not particularly efficient, nor particularly easy to remember for those who do not program frequently using {data.table}.

Alternative 7: apply()

drugs |>
  mutate(everused = apply(pick(-caseid), 1L, any))
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows

This is the base R solution. It is very efficient and is actually part of the default method used in lay(). Our implementation of lay() removes the need to specify the margin (the 1L above) and benefits from the automatic splicing and the lambda syntax as shown above.
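For readers unfamiliar with the margin argument, the following small sketch shows what 1L and 2L select. It uses plain base R on a hypothetical data frame `toy` rather than the data used above.

```r
# Toy data (hypothetical values; column names mimic the article's dataset)
toy <- data.frame(
  caseid  = c("1", "2", "3"),
  codeine = c(0L, 0L, 1L),
  morphin = c(0L, 0L, 0L),
  vicolor = c(0L, 1L, 0L)
)
drug_cols <- setdiff(names(toy), "caseid")

# MARGIN = 1L: any() is called once per row (the rowwise job)
everused_rows <- apply(toy[drug_cols], 1L, any)

# MARGIN = 2L: any() is called once per column (a different question:
# "has this drug been used by at least one respondent?")
used_by_anyone <- apply(toy[drug_cols], 2L, any)

everused_rows
used_by_anyone
```

Note that apply() first converts the data frame to a matrix, so it is best suited to columns that share a common type, as is the case here.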

Alternative 8: for (i in ...) {...}

drugs$everused <- NA

columns_in <- !colnames(drugs) %in% c("caseid", "everused")

for (i in seq_len(nrow(drugs))) {
  drugs$everused[i] <- any(drugs[i, columns_in])
}

drugs
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor everused
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int> <lgl>   
#>  1 1            0       0       0       0       0       0       0 FALSE   
#>  2 2            0       0       0       0       0       0       0 FALSE   
#>  3 3            0       0       0       0       0       0       0 FALSE   
#>  4 4            0       0       0       0       0       0       0 FALSE   
#>  5 5            0       0       0       0       0       0       0 FALSE   
#>  6 6            0       0       0       0       0       0       0 FALSE   
#>  7 7            0       0       0       0       0       0       0 FALSE   
#>  8 8            0       0       0       0       0       0       0 FALSE   
#>  9 9            0       0       0       0       0       0       1 TRUE    
#> 10 10           0       0       0       0       0       0       0 FALSE   
#> # ℹ 90 more rows

This is another base R solution, which does not involve any external package. It is not very pretty, nor particularly efficient.

Other alternatives?

There are probably other ways. If you think of a nice one, please open an issue and we will add it here!

Efficiency

The results of benchmarks comparing alternative implementations for our simple rowwise job are shown in another Article (see benchmarks). As you will see, lay() is not just simple and powerful, it is also quite efficient!