sample <- function(x, size, replace = FALSE, prob = NULL) {
if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 1) {
if (missing(size))
size <- x
sample.int(x, size, replace, prob)
} else {
if (missing(size))
size <- length(x)
x[sample.int(length(x), size, replace, prob)]
}
}
Three functions in a trench coat
What’s the problem?
Sometimes a function that implements multiple strategies might be better off as independent functions. Two signs that this might be the case:
- You’re struggling to document how the arguments interact. Maybe you can set
a
andb
anda
andc
but notb
andc
. - The implementation of your function has a couple of big if branches that share relatively little code.
Splitting one complex multi-strategy function into multiple simpler functions can make maintenance and testing easier, while also improving the user experience. I think of this the three (or more generally \(n\)) functions in a trench coat1 problem, because the reality of the separate functions tends to become very obvious in time.
What are some examples?
forcats::fct_lump()
chooses between one of three lumping strategies depending on whether you supply justn
, justprop
, or neither, while supplying bothn
andprop
is an error. Additionally, theties.method
argument only does anything if you supply onlyn
.fct_lump()
is hard to understand and document because it’s really three smaller functions.Depending on the arguments used
library()
can load a package, it can list all installed packages, or display the help for a package.diag()
is used to both extract the diagonal of a matrix and construct a matrix from a vector of diagonal values. This combination of purposes makes its arguments hard to understand: ifx
is a matrix, you cannames
but notnrow
andncol
; ifx
is a vector, you can usenrow
andncol
but notnames
.sample()
is used to both randomly reorder a vector and generate a random vector of specified length. This function is particularly troublesome because it picks between the two strategies based on the length of the first argument.rep()
is used to both repeat each element of a vector and to repeat the complete vector. I discuss this more in Chapter 16.
These functions all also suffer from the problem that the strategies are implicit, not explicit. That’s because they use either the presence or absence of different arguments or the type of an argument pick strategies. This combination tends to produce particularly opaque code.
How do I identify the problem?
Typically this problem arises as the scope of your function grows over time, and because the growth tends to be gradual it’s hard to notice exactly when it becomes an issue. One way to splot this problem is notice that your function consists of a big if statement where the branches share very little code. An extreme example of this is the base sample()
function. As of R 4.3.0 it looks something like this:
You can see that there are two branches that share very little code, and each branch uses a different default value for size
. This suggests it might be better to have two functions:
sample_vec <- function(x, size = length(x), replace = FALSE, prob = NULL) {
# check_vector(x)
# check_number_whole(size)
x[sample.int(length(x), size, replace, prob)]
}
sample_int <- function(x, size = x, replace = FALSE, prob = NULL) {
# check_number_whole(x)
# check_number_whole(size)
x[sample.int(length(x), size, replace, prob)]
}
In other cases you might spot the problem because you’re having trouble explaining the arguments in the documentation. If it feels like
How do I remediate past mistakes?
Remediating past mistakes is straightforward: define, document, and export one function for each strategy. Then rewrite the original function to use those strategies, deprecating that entire function if desired. For example, this is what fct_lump()
looked like after we realised it was really the combination of three simpler functions:
fct_lump <- function(f,
n,
prop,
w = NULL,
other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max")) {
if (missing(n) && missing(prop)) {
fct_lump_lowfreq(f, w = w, other_level = other_level)
} else if (missing(prop)) {
fct_lump_n(f, n, w = w, other_level = other_level, ties.method = ties.method)
} else if (missing(n)) {
fct_lump_prop(f, prop, w = w, other_level = other_level)
} else {
cli::cli_abort("Must supply only one of {.arg n} and {.arg prop}.")
}
}
We decided to supersede fct_lump()
rather than deprecating it, so we kept old function around and working. If we wanted to deprecate it, we’d need to add one deprecation for each branch:
fct_lump <- function(f,
n,
prop,
w = NULL,
other_level = "Other",
ties.method = c("min", "average", "first", "last", "random", "max")) {
if (missing(n) && missing(prop)) {
lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_lowfreq()")
fct_lump_lowfreq(f, w = w, other_level = other_level)
} else if (missing(prop)) {
lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_n()")
fct_lump_n(f, n, w = w, other_level = other_level, ties.method = ties.method)
} else if (missing(n)) {
lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_prop()")
fct_lump_prop(f, prop, w = w, other_level = other_level)
} else {
cli::cli_abort("Must supply only one of {.arg n} and {.arg prop}.")
}
}