Benchmarks • lay

Article overview

The goal of this Article is to compare the performance of lay() to the alternatives described here. As you will see, the code using lay() is quite efficient. The only alternative that is clearly more efficient is the one labeled “vectorized” below. Unfortunately, such a vectorized approach requires referring explicitly to every column whose data are used. Furthermore, it is not generally applicable: it can only deal with logical and arithmetic operators and does not allow the use of other types of functions.
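To make this limitation concrete, here is a minimal base R sketch on a made-up data frame (the object toy and its columns a, b and c are hypothetical and not part of {lay}). Operators such as | work column against column, so every column has to be named, whereas a function such as median() has no operator form and has to be applied row by row.

```r
## Toy data (hypothetical): 0/1 indicators stored as integers
toy <- data.frame(a = c(0L, 1L, 0L), b = c(0L, 0L, 0L), c = c(0L, 1L, 1L))

## Vectorized: each column must be spelled out
any_vectorized <- toy$a | toy$b | toy$c

## Rowwise: the function receives one row at a time, no column names needed
any_rowwise <- unname(apply(toy, 1, any))

## Both approaches give the same answer for any()
identical(any_vectorized, any_rowwise)

## A function such as median() cannot be written with operators alone
median_rowwise <- unname(apply(toy, 1, median))
```

The rowwise form is the kind of job lay() is meant for; the vectorized form is only available when the task can be reduced to operators.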

Loading packages

This Article requires you to load the following packages:

library(lay)        ## for lay() and the data
library(dplyr)      ## for many things
library(tidyr)      ## for pivot_longer() and pivot_wider()
library(purrr)      ## for pmap_lgl()
library(slider)     ## for slide()
library(data.table) ## for an alternative to base and dplyr
library(bench)      ## for running the benchmarks
library(forcats)    ## for sorting levels in plot with fct_reorder()

Please install them if they are not present on your system.
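A small sketch for checking which of these packages are missing is given below. It only reports what is missing; the install.packages() call is left commented out and assumes that all packages are available from the repository configured on your system.

```r
## Packages used in this Article
pkgs <- c("lay", "dplyr", "tidyr", "purrr", "slider", "data.table", "bench", "forcats")

## system.file() returns "" for a package that is not installed
installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  ## Uncomment to install them (assumes they are available from your repository):
  ## install.packages(missing_pkgs)
} else {
  message("All packages are installed.")
}
```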

An example of a rowwise task

Consider the dataset drugs_full from our package {lay}:

drugs_full
#> # A tibble: 55,271 × 8
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int>
#>  1 1            0       0       0       0       0       0       0
#>  2 2            0       0       0       0       0       0       0
#>  3 3            0       0       0       0       0       0       0
#>  4 4            0       0       0       0       0       0       0
#>  5 5            0       0       0       0       0       0       0
#>  6 6            0       0       0       0       0       0       0
#>  7 7            0       0       0       0       0       0       0
#>  8 8            0       0       0       0       0       0       0
#>  9 9            0       0       0       0       0       0       1
#> 10 10           0       0       0       0       0       0       0
#> # ℹ 55,261 more rows

In this dataset, all columns but caseid record the use of pain relievers for non-medical purposes.

For each drug there is a certain number of users and non-users:

drugs_full |>
  pivot_longer(-caseid, names_to = "drug", values_to = "used") |>
  count(drug, used) |>
  mutate(used = if_else(used == 1, "have_used", "have_not_used")) |>
  pivot_wider(names_from = used, values_from = n)
#> # A tibble: 7 × 3
#>   drug    have_not_used have_used
#>   <chr>           <int>     <int>
#> 1 codeine         53642      1629
#> 2 hydrocd         52633      2638
#> 3 methdon         54685       586
#> 4 morphin         54326       945
#> 5 oxycodp         51461      3810
#> 6 tramadl         54480       791
#> 7 vicolor         50358      4913

In this Article, we compare the efficiency of alternative ways to create a new variable named everused, which indicates whether each respondent has used any of the considered pain relievers for non-medical purposes.

We will run benchmarks on the dataset drugs_full and its 55271 rows, as well as on a subset of this data called drugs that only contains 100 rows.

Benchmarks on the full dataset (55271 rows)

Let’s compare the running time of different methods to do this job on the full dataset:

drugs_full_dt <- data.table(drugs_full) ## coercion to data.table

benchmark1 <- mark(
  vectorized = {
    drugs_full |>
      mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)},
  lay = {
    drugs_full |>
      select(-caseid) |>
      mutate(everused = lay(pick(everything()), any))},
  lay_alternative = {
    drugs_full |>
      mutate(everused = lay(pick(-caseid), any, .method = "tidy"))},
  c_across = {
    drugs_full |>
      rowwise() |>
      mutate(everused = any(c_across(-caseid))) |>
      ungroup()},
  pivot_pivot = {
    drugs_full |>
      pivot_longer(-caseid) |>
      group_by(caseid) |>
      mutate(everused = any(value)) |>
      ungroup() |>
      pivot_wider()},
  pmap = {
    drugs_full |>
      mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))},
  slider = {
    drugs_full |>
      mutate(everused = slide_vec(pick(-caseid), any))},
  data.table = {
    drugs_full_dt[, ..I := .I]
    drugs_full_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]},
  apply = {
    drugs_full |>
      mutate(everused = apply(pick(-caseid), 1, any))},
  'for' = {
    everused <- logical(nrow(drugs_full))
    columns_in <- colnames(drugs_full) != "caseid"
    for (i in seq_len(nrow(drugs_full))) everused[i] <- any(drugs_full[i, columns_in])},
  iterations = 5,
  time_unit = "ms",
  check = FALSE
  )
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
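Since check = FALSE turns off the comparison of results that mark() would otherwise perform, it can be worth confirming separately that the strategies agree on the new variable. The following base R sketch does so for three of them on a made-up data frame (toy, drug_a and drug_b are hypothetical names chosen to mimic the layout of drugs_full); it is an illustration, not part of the benchmark itself.

```r
## Toy data (hypothetical) mimicking the layout of drugs_full
toy <- data.frame(
  caseid = c("1", "2", "3", "4"),
  drug_a = c(0L, 1L, 0L, 0L),
  drug_b = c(0L, 0L, 0L, 1L)
)
cols <- colnames(toy) != "caseid"

## Three of the benchmarked strategies, written in base R
res_vectorized <- toy$drug_a | toy$drug_b
res_apply      <- unname(apply(toy[, cols], 1, any))
res_for        <- logical(nrow(toy))
for (i in seq_len(nrow(toy))) res_for[i] <- any(unlist(toy[i, cols]))

## All strategies should agree on the new variable
stopifnot(identical(res_vectorized, res_apply), identical(res_apply, res_for))
res_for
```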

Here are the results of this first series of benchmarks (times are in milliseconds):

benchmark1 |>
  arrange(median)
#> # A tibble: 10 × 6
#>    expression           min   median `itr/sec` mem_alloc `gc/sec`
#>    <bch:expr>         <dbl>    <dbl>     <dbl> <bch:byt>    <dbl>
#>  1 vectorized          3.14     3.26  299.        2.74MB    0    
#>  2 apply             116.     117.      8.50      4.01MB    3.40 
#>  3 lay               144.     147.      6.67      5.49MB    1.33 
#>  4 pivot_pivot       321.     329.      3.03     62.38MB    1.21 
#>  5 pmap              383.     385.      2.60    251.29KB    2.60 
#>  6 lay_alternative  1566.    1584.      0.633     15.3MB    1.65 
#>  7 data.table       4075.    4095.      0.238     3.04MB    2.05 
#>  8 slider           4705.    4728.      0.211   915.15KB    1.98 
#>  9 for              9269.    9390.      0.106   271.73KB    2.16 
#> 10 c_across        93000.   93327.      0.0107  424.32MB    0.932

benchmark1 |>
  mutate(expression = fct_reorder(as.character(expression), median, .desc = TRUE)) |>
  plot()

Note that the x-axis of the plot is on a logarithmic scale.
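To read these results relative to lay(), the medians can be divided by the median obtained for lay(). The sketch below copies the median values (in milliseconds) from the table above by hand; the object names median_ms and relative_to_lay are only illustrative.

```r
## Median running times (ms) copied from the table above
median_ms <- c(
  vectorized = 3.26, apply = 117, lay = 147, pivot_pivot = 329, pmap = 385,
  lay_alternative = 1584, data.table = 4095, slider = 4728, "for" = 9390,
  c_across = 93327
)

## Ratio to lay(): values above 1 mean slower than lay(), below 1 faster
relative_to_lay <- sort(median_ms / median_ms[["lay"]])
round(relative_to_lay, 2)
```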

Benchmarks on a subset of the data (100 rows)

Let’s repeat our benchmarks using only a subset of the original dataset:

drugs_dt <- data.table(drugs) ## coercion to data.table

benchmark2 <- mark(
  vectorized = {
    drugs |>
      mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)},
  lay = {
    drugs |>
      select(-caseid) |>
      mutate(everused = lay(pick(everything()), any))},
  lay_alternative = {
    drugs |>
      mutate(everused = lay(pick(-caseid), any, .method = "tidy"))},
  c_across = {
    drugs |>
      rowwise() |>
      mutate(everused = any(c_across(-caseid))) |>
      ungroup()},
  pivot_pivot = {
    drugs |>
      pivot_longer(-caseid) |>
      group_by(caseid) |>
      mutate(everused = any(value)) |>
      ungroup() |>
      pivot_wider()},
  pmap = {
    drugs |>
      mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))},
  slider = {
    drugs |>
      mutate(everused = slide_vec(pick(-caseid), any))},
  data.table = {
    drugs_dt[, ..I := .I]
    drugs_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]},
  apply = {
    drugs |>
      mutate(everused = apply(pick(-caseid), 1, any))},
  'for' = {
    everused <- logical(nrow(drugs))
    columns_in <- colnames(drugs) != "caseid"
    for (i in seq_len(nrow(drugs))) everused[i] <- any(drugs[i, columns_in])},
  iterations = 30,
  time_unit = "ms",
  check = FALSE
  )

Here are the results of this second series of benchmarks (times are in milliseconds):

benchmark2 |>
  arrange(median)
#> # A tibble: 10 × 6
#>    expression         min median `itr/sec` mem_alloc `gc/sec`
#>    <bch:expr>       <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#>  1 vectorized        1.29   1.35    729.      7.33KB     0   
#>  2 apply             3.61   3.74    264.     16.92KB     0   
#>  3 pmap              4.09   4.22    235.      9.91KB     8.11
#>  4 lay               5.95   6.09    163.     27.12KB     0   
#>  5 lay_alternative   6.23   6.35    157.     38.05KB     0   
#>  6 data.table        8.06   8.48    118.     67.66KB     4.06
#>  7 slider           11.9   12.4      80.3    11.18KB     0   
#>  8 for              20.4   21.0      47.7    17.74KB     3.41
#>  9 pivot_pivot      21.9   22.6      44.2   189.47KB     1.52
#> 10 c_across        170.   174.        5.75  794.53KB     1.75

benchmark2 |>
  mutate(expression = fct_reorder(as.character(expression), median, .desc = TRUE)) |>
  plot(type = "violin")

Note again that the x-axis of the plot is on a logarithmic scale.
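One possible way to read the two tables together is to look at how much each median grew when the number of rows went from 100 to 55271, that is, by a factor of about 553. The sketch below copies the medians (in milliseconds) from both tables by hand; the object names are illustrative. A growth well below that factor may suggest that fixed overhead dominates on the small dataset, but this is only a rough reading based on a few iterations on a single machine.

```r
## Median running times (ms) copied from the two tables above
full_ms <- c(
  vectorized = 3.26, apply = 117, lay = 147, pivot_pivot = 329, pmap = 385,
  lay_alternative = 1584, data.table = 4095, slider = 4728, "for" = 9390,
  c_across = 93327
)
subset_ms <- c(
  vectorized = 1.35, apply = 3.74, pmap = 4.22, lay = 6.09,
  lay_alternative = 6.35, data.table = 8.48, slider = 12.4, "for" = 21.0,
  pivot_pivot = 22.6, c_across = 174
)

## Growth of the median when going from 100 to 55271 rows, matched by name
growth <- full_ms / subset_ms[names(full_ms)]
rows_ratio <- 55271 / 100

round(sort(growth), 1)
rows_ratio
```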

Benchmarks’ environment

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_1.0.0     bench_1.1.3       data.table_1.14.8 slider_0.3.1     
#> [5] purrr_1.0.2       tidyr_1.3.0       dplyr_1.1.3       lay_0.1.3        
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.7        utf8_1.2.4        generics_0.1.3    stringi_1.7.12   
#>  [5] digest_0.6.33     magrittr_2.0.3    evaluate_0.22     grid_4.3.1       
#>  [9] fastmap_1.1.1     rprojroot_2.0.3   jsonlite_1.8.7    fansi_1.0.5      
#> [13] scales_1.2.1      textshaping_0.3.7 jquerylib_0.1.4   cli_3.6.1        
#> [17] rlang_1.1.1       munsell_0.5.0     withr_2.5.1       cachem_1.0.8     
#> [21] yaml_2.3.7        ggbeeswarm_0.7.2  tools_4.3.1       memoise_2.0.1    
#> [25] colorspace_2.1-0  ggplot2_3.4.4     profmem_0.6.0     vctrs_0.6.4      
#> [29] R6_2.5.1          lifecycle_1.0.3   stringr_1.5.0     fs_1.6.3         
#> [33] vipor_0.4.5       ragg_1.2.6        beeswarm_0.4.0    pkgconfig_2.0.3  
#> [37] desc_1.4.2        warp_0.2.0        pkgdown_2.0.7     pillar_1.9.0     
#> [41] bslib_0.5.1       gtable_0.3.4      glue_1.6.2        systemfonts_1.0.5
#> [45] xfun_0.40         tibble_3.2.1      tidyselect_1.2.0  knitr_1.44       
#> [49] farver_2.1.1      htmltools_0.5.6.1 rmarkdown_2.25    compiler_4.3.1