Skip to contents

Create efficiently new column(s) in data frame (including tibble) by applying a function one row at a time.

Usage

lay(.data, .fn, ..., .method = c("apply", "tidy"))

Arguments

.data

A data frame or tibble (or other data frame extensions).

.fn

A function to apply to each row of .data. Possible values are:

  • A function, e.g. mean

  • An anonymous function, .e.g. function(x) mean(x, na.rm = TRUE)

  • An anonymous function with shorthand, .e.g. \(x) mean(x, na.rm = TRUE)

  • A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)

    (wrap the output in a data frame to apply several functions at once, e.g. ~ tibble(min = min(.x), max = max(.x)))

...

Additional arguments for the function calls in .fn (must be named!).

.method

This is an experimental argument that allows you to control which internal method is used to apply the rowwise job:

  • "apply", the default internally uses the function apply().

  • "tidy", internally uses purrr::pmap() and is stricter with respect to class coercion across columns.

The default has been chosen based on these benchmarks.

Value

A vector with one element per row of .data, or a data frame (or tibble) with one row per row of .data. The class of the output is determined by .fn.

Details

lay() create a vector or a data frame (or tibble), by considering in turns each row of a data frame (.data) as the vector input of some function(s) .fn.

This makes the creation of new columns based on a rowwise operation both simple (see Examples; below) and efficient (see the Article benchmarks).

The function should be fully compatible with {dplyr}-based workflows and follows a syntax close to dplyr::across().

Yet, it takes .data instead of .cols as a main argument, which makes it possible to also use lay() outside dplyr verbs (see Examples).

The function lay() should work in a wide range of situations, provided that:

  • The input .data should be a data frame (including tibble) with columns of same class, or of classes similar enough to be easily coerced into a single class. Note that .method = "apply" also allows for the input to be a matrix and is more permissive in terms of data coercion.

  • The output of .fn should be a scalar (i.e., vector of length 1) or a 1 row data frame (or tibble).

If you use lay() within dplyr::mutate(), make sure that the data used by dplyr::mutate() contain no row-grouping, i.e., what is passed to .data in dplyr::mutate() should not be of class grouped_df or rowwise_df. If it is, lay() will be called multiple times, which will slow down the computation despite not influencing the output.

Examples


# usage without dplyr -------------------------------------------------------------------------

# lay can return a vector
lay(drugs[1:10, -1], any)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

# lay can return a data frame
## using the shorthand function syntax \(x) .fn(x)
lay(drugs[1:10, -1],
   \(x) data.frame(drugs_taken = sum(x), drugs_not_taken = sum(x == 0)))
#>    drugs_taken drugs_not_taken
#> 1            0               7
#> 2            0               7
#> 3            0               7
#> 4            0               7
#> 5            0               7
#> 6            0               7
#> 7            0               7
#> 8            0               7
#> 9            1               6
#> 10           0               7

## using the rlang lambda syntax ~ fn(.x)
lay(drugs[1:10, -1],
   ~ data.frame(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0)))
#>    drugs_taken drugs_not_taken
#> 1            0               7
#> 2            0               7
#> 3            0               7
#> 4            0               7
#> 5            0               7
#> 6            0               7
#> 7            0               7
#> 8            0               7
#> 9            1               6
#> 10           0               7

# lay can be used to augment a data frame
cbind(drugs[1:10, ],
      lay(drugs[1:10, -1],
         ~ data.frame(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0))))
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor drugs_taken
#> 1       1       0       0       0       0       0       0       0           0
#> 2       2       0       0       0       0       0       0       0           0
#> 3       3       0       0       0       0       0       0       0           0
#> 4       4       0       0       0       0       0       0       0           0
#> 5       5       0       0       0       0       0       0       0           0
#> 6       6       0       0       0       0       0       0       0           0
#> 7       7       0       0       0       0       0       0       0           0
#> 8       8       0       0       0       0       0       0       0           0
#> 9       9       0       0       0       0       0       0       1           1
#> 10     10       0       0       0       0       0       0       0           0
#>    drugs_not_taken
#> 1                7
#> 2                7
#> 3                7
#> 4                7
#> 5                7
#> 6                7
#> 7                7
#> 8                7
#> 9                6
#> 10               7


# usage with dplyr ----------------------------------------------------------------------------

if (require("dplyr")) {

  # apply any() to each row
  drugs |>
    mutate(everused = lay(pick(-caseid), any))

  # apply any() to each row using all columns
  drugs |>
    select(-caseid) |>
    mutate(everused = lay(pick(everything()), any))

  # a workaround would be to use `rowSums`
  drugs |>
    mutate(everused = rowSums(pick(-caseid)) > 0)

  # but we can lay any function taking a vector as input, e.g. median
  drugs |>
    mutate(used_median = lay(pick(-caseid), median))

  # you can pass arguments to the function
  drugs_with_NA <- drugs
  drugs_with_NA[1, 2] <- NA

  drugs_with_NA |>
    mutate(everused = lay(pick(-caseid), any))
  drugs_with_NA |>
    mutate(everused = lay(pick(-caseid), any, na.rm = TRUE))

  # you can lay the output into a 1-row tibble (or data.frame)
  # if you want to apply multiple functions
  drugs |>
    mutate(lay(pick(-caseid),
             ~ tibble(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0))))

  # note that naming the output prevent the automatic splicing and you obtain a df-column
  drugs |>
    mutate(usage = lay(pick(-caseid),
              ~ tibble(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0))))

  # if your function returns a vector longer than a scalar, you should turn the output
  # into a tibble, which is the job of as_tibble_row()
  drugs |>
    mutate(lay(pick(-caseid), ~ as_tibble_row(quantile(.x))))

  # note that you could also wrap the output in a list and name it to obtain a list-column
  drugs |>
    mutate(usage_quantiles = lay(pick(-caseid), ~ list(quantile(.x))))
}
#> Loading required package: dplyr
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int>
#>  1 1            0       0       0       0       0       0       0
#>  2 2            0       0       0       0       0       0       0
#>  3 3            0       0       0       0       0       0       0
#>  4 4            0       0       0       0       0       0       0
#>  5 5            0       0       0       0       0       0       0
#>  6 6            0       0       0       0       0       0       0
#>  7 7            0       0       0       0       0       0       0
#>  8 8            0       0       0       0       0       0       0
#>  9 9            0       0       0       0       0       0       1
#> 10 10           0       0       0       0       0       0       0
#> # ℹ 90 more rows
#> # ℹ 1 more variable: usage_quantiles <list>