Efficiently create new column(s) in a data frame (including tibbles) by applying a function one row at a time.
Usage
lay(.data, .fn, ..., .method = c("apply", "tidy"))
Arguments
- .data
A data frame or tibble (or other data frame extensions).
- .fn
A function to apply to each row of .data. Possible values are:
  - A function, e.g. mean
  - An anonymous function, e.g. function(x) mean(x, na.rm = TRUE)
  - An anonymous function with shorthand, e.g. \(x) mean(x, na.rm = TRUE)
  - A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE) (wrap the output in a data frame to apply several functions at once, e.g. ~ tibble(min = min(.x), max = max(.x)))
- ...
Additional arguments for the function calls in .fn (must be named!).
- .method
This is an experimental argument that allows you to control which internal method is used to apply the rowwise job:
  - "apply", the default, internally uses the function apply().
  - "tidy" internally uses purrr::pmap() and is stricter with respect to class coercion across columns.
The default has been chosen based on these benchmarks.
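To make the coercion point concrete, the sketch below uses only base R's apply(), the function that the "apply" method is said to rely on. The toy data frames same_class and mixed_class are invented for illustration, and this is not lay()'s internal code.

```r
# Toy data (hypothetical): two numeric columns, then a numeric and a character column
same_class  <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
mixed_class <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))

# apply() converts the data frame to a matrix first, so each row reaches
# the function as a single atomic vector of one common type
row_means <- apply(same_class, 1, mean)

# With mixed columns, every value in the row is coerced to character
row_types <- apply(mixed_class, 1, function(x) class(x))

unname(row_means)  # 2.5 3.5 4.5
unname(row_types)  # "character" "character" "character"
```

This silent coercion is why columns of the same (or easily reconcilable) class give the most predictable results.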
Value
A vector with one element per row of .data, or a data frame (or tibble) with one row per row of .data. The class of the output is determined by .fn.
Details
lay() creates a vector or a data frame (or tibble) by considering, in turn, each row of a data frame (.data) as the vector input of some function(s) .fn.
This makes the creation of new columns based on a rowwise operation both simple (see Examples below) and efficient (see the Article benchmarks).
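As a rough mental model of the rowwise behaviour described here, the base R sketch below feeds each row to a function as a plain vector and combines the results. The helper rowwise_sketch() and the data frame toy are hypothetical names introduced for illustration; the real lay() is implemented differently and is faster.

```r
# Hypothetical helper: NOT lay()'s implementation, only a mental model
rowwise_sketch <- function(data, fn, ...) {
  # Call fn once per row, passing the row as a plain vector
  out <- lapply(seq_len(nrow(data)), function(i) {
    fn(unlist(data[i, ], use.names = FALSE), ...)
  })
  # 1-row data frames are stacked; scalars are combined into a vector
  if (is.data.frame(out[[1]])) do.call(rbind, out) else unlist(out)
}

toy <- data.frame(a = c(0, 1, 0), b = c(0, 0, 1))

# Scalar output per row gives a vector with one element per row
as_vector <- rowwise_sketch(toy, sum)

# 1-row data frame output per row gives a data frame with one row per row
as_frame <- rowwise_sketch(
  toy,
  function(x) data.frame(total = sum(x), zeros = sum(x == 0))
)
```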
The function should be fully compatible with {dplyr}-based workflows and follows a syntax close to dplyr::across().
Yet, it takes .data instead of .cols as its main argument, which makes it possible to also use lay() outside dplyr verbs (see Examples).
The function lay() should work in a wide range of situations, provided that:
- The input .data is a data frame (including tibble) with columns of the same class, or of classes similar enough to be easily coerced into a single class. Note that .method = "apply" also allows the input to be a matrix and is more permissive in terms of data coercion.
- The output of .fn is a scalar (i.e., a vector of length 1) or a 1-row data frame (or tibble).
If you use lay() within dplyr::mutate(), make sure that the data used by dplyr::mutate() contain no row-grouping, i.e., what is passed to .data in dplyr::mutate() should not be of class grouped_df or rowwise_df. If it is, lay() will be called multiple times, which will slow down the computation despite not influencing the output.
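One way to follow this advice is to drop any grouping with dplyr::ungroup() before calling mutate(). The sketch below assumes that both {dplyr} (version 1.1.0 or later, for pick()) and {lay} are installed; the tibble and its columns g, x and y are made up for illustration.

```r
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("lay", quietly = TRUE)) {
  library(dplyr)
  library(lay)

  # Toy grouped data (hypothetical columns g, x, y)
  grouped <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(4, 5, 6)) |>
    group_by(g)

  # Remove the grouping first so that lay() is called once on the whole data
  result <- grouped |>
    ungroup() |>
    mutate(total = lay(pick(x, y), sum))
}
```

If the grouping is still needed afterwards, it can be restored with group_by() once the new column has been created.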
Examples
# usage without dplyr -------------------------------------------------------------------------
# lay can return a vector
lay(drugs[1:10, -1], any)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# lay can return a data frame
## using the shorthand function syntax \(x) .fn(x)
lay(drugs[1:10, -1],
    \(x) data.frame(drugs_taken = sum(x), drugs_not_taken = sum(x == 0)))
#>    drugs_taken drugs_not_taken
#> 1            0               7
#> 2            0               7
#> 3            0               7
#> 4            0               7
#> 5            0               7
#> 6            0               7
#> 7            0               7
#> 8            0               7
#> 9            1               6
#> 10           0               7
## using the rlang lambda syntax ~ fn(.x)
lay(drugs[1:10, -1],
    ~ data.frame(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0)))
#>    drugs_taken drugs_not_taken
#> 1            0               7
#> 2            0               7
#> 3            0               7
#> 4            0               7
#> 5            0               7
#> 6            0               7
#> 7            0               7
#> 8            0               7
#> 9            1               6
#> 10           0               7
# lay can be used to augment a data frame
cbind(drugs[1:10, ],
      lay(drugs[1:10, -1],
          ~ data.frame(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0))))
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor drugs_taken
#> 1       1       0       0       0       0       0       0       0           0
#> 2       2       0       0       0       0       0       0       0           0
#> 3       3       0       0       0       0       0       0       0           0
#> 4       4       0       0       0       0       0       0       0           0
#> 5       5       0       0       0       0       0       0       0           0
#> 6       6       0       0       0       0       0       0       0           0
#> 7       7       0       0       0       0       0       0       0           0
#> 8       8       0       0       0       0       0       0       0           0
#> 9       9       0       0       0       0       0       0       1           1
#> 10     10       0       0       0       0       0       0       0           0
#>    drugs_not_taken
#> 1                7
#> 2                7
#> 3                7
#> 4                7
#> 5                7
#> 6                7
#> 7                7
#> 8                7
#> 9                6
#> 10               7
# usage with dplyr ----------------------------------------------------------------------------
if (require("dplyr")) {

  # apply any() to each row
  drugs |>
    mutate(everused = lay(pick(-caseid), any))

  # apply any() to each row using all columns
  drugs |>
    select(-caseid) |>
    mutate(everused = lay(pick(everything()), any))

  # a workaround would be to use `rowSums`
  drugs |>
    mutate(everused = rowSums(pick(-caseid)) > 0)

  # but we can lay any function taking a vector as input, e.g. median
  drugs |>
    mutate(used_median = lay(pick(-caseid), median))

  # you can pass arguments to the function
  drugs_with_NA <- drugs
  drugs_with_NA[1, 2] <- NA

  drugs_with_NA |>
    mutate(everused = lay(pick(-caseid), any))
  drugs_with_NA |>
    mutate(everused = lay(pick(-caseid), any, na.rm = TRUE))

  # you can lay the output into a 1-row tibble (or data.frame)
  # if you want to apply multiple functions
  drugs |>
    mutate(lay(pick(-caseid),
               ~ tibble(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0))))

  # note that naming the output prevents the automatic splicing and you obtain a df-column
  drugs |>
    mutate(usage = lay(pick(-caseid),
                       ~ tibble(drugs_taken = sum(.x), drugs_not_taken = sum(.x == 0))))

  # if your function returns a vector longer than a scalar, you should turn the output
  # into a tibble, which is the job of as_tibble_row()
  drugs |>
    mutate(lay(pick(-caseid), ~ as_tibble_row(quantile(.x))))

  # note that you could also wrap the output in a list and name it to obtain a list-column
  drugs |>
    mutate(usage_quantiles = lay(pick(-caseid), ~ list(quantile(.x))))
}
#> Loading required package: dplyr
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
#> # A tibble: 100 × 9
#>    caseid hydrocd oxycodp codeine tramadl morphin methdon vicolor
#>    <chr>    <int>   <int>   <int>   <int>   <int>   <int>   <int>
#>  1 1            0       0       0       0       0       0       0
#>  2 2            0       0       0       0       0       0       0
#>  3 3            0       0       0       0       0       0       0
#>  4 4            0       0       0       0       0       0       0
#>  5 5            0       0       0       0       0       0       0
#>  6 6            0       0       0       0       0       0       0
#>  7 7            0       0       0       0       0       0       0
#>  8 8            0       0       0       0       0       0       0
#>  9 9            0       0       0       0       0       0       1
#> 10 10           0       0       0       0       0       0       0
#> # ℹ 90 more rows
#> # ℹ 1 more variable: usage_quantiles <list>