12  Make strategies explicit

12.1 What’s the problem?

If your function exposes multiple implementation strategies, make those explicit through an argument. This makes it clear how to control the operation of your function and extends gracefully if different strategies expose different options.

One sign that this pattern is needed is a complex dependency pattern between arguments: maybe you can set a and b, or a and c, but not b and c. This tends to make the function harder to use because it suggests more viable input combinations than actually exist. You have to learn and then remember the set of allowed combinations, rather than seeing the underlying strategies.

12.2 What are some examples?

Some functions that express this pattern well include:

  • rank() exposes six different methods for handling ties with the ties.method argument.
  • p.adjust() exposes eight strategies for adjusting P values to account for multiple comparisons through the method argument (the possible values are stored in p.adjust.methods).
  • quantile() exposes nine different approaches to computing a quantile through the type argument.
  • dplyr::left_join() uses a very advanced form of this pattern where the different strategies for joining two data frames together are expressed in a mini-DSL provided by dplyr::join_by().
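
As a quick illustration of what these strategy arguments look like in use, here is a small sketch using two of the base R functions above. The input vectors are made up for the example.

```r
x <- c(10, 20, 10)

# Same function, different strategies for handling ties
rank(x, ties.method = "average")
#> [1] 1.5 3.0 1.5
rank(x, ties.method = "min")
#> [1] 1 3 1
rank(x, ties.method = "first")
#> [1] 1 3 2

p <- c(0.01, 0.02, 0.03)

# Same function, different strategies for adjusting p-values
p.adjust(p, method = "bonferroni")
#> [1] 0.03 0.06 0.09
p.adjust(p, method = "none")
#> [1] 0.01 0.02 0.03
```

In each call the strategy is visible by name at the call site, so you don't need to infer it from which other arguments happen to be present.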

Some functions that could do better are:

  • forcats::fct_lump() exposes three lumping strategies through the presence or absence of the n and prop arguments. We remediated this issue in forcats 0.5.0 by splitting fct_lump() into three separate functions: fct_lump_prop(), fct_lump_n(), and fct_lump_lowfreq(). This allows the function name to hint at the purpose, prevents you from supplying both n and prop through the design of the functions, and only has the ties.method argument where it makes sense.

  • ggplot2::geom_histogram() has three main strategies for generating the bins: you can supply the number of bins (bins), the width of each bin (binwidth), or the exact breaks (breaks).

  • grepl() has perl and fixed which can be either TRUE or FALSE, but you’re not really toggling two independent values, you’re picking from one of three regular expression engines (the default, the engine used by Perl, and fixed matches). Additionally, the ignore.case argument only applies to two of the strategies. Learn more in Chapter 17.

  • rep() exposes two basic strategies: repeat each element of the vector (by setting each to a scalar or times to a vector) or repeat the entire vector (by setting times to a scalar). Learn more in Chapter 13.
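
To make the grepl() point concrete, here is a minimal sketch of what an explicit interface might look like. The detect() function and its engine argument are hypothetical; they are not part of base R and only wrap grepl() for illustration.

```r
# Hypothetical wrapper: one `engine` argument replaces the `perl` and
# `fixed` flags, so the three strategies are listed in one place.
detect <- function(x, pattern, engine = c("default", "perl", "fixed")) {
  engine <- match.arg(engine)
  switch(
    engine,
    default = grepl(pattern, x),
    perl = grepl(pattern, x, perl = TRUE),
    fixed = grepl(pattern, x, fixed = TRUE)
  )
}

detect("abc", "a.c")                    # "." matches any character
#> [1] TRUE
detect("abc", "a.c", engine = "fixed")  # "." is matched literally
#> [1] FALSE
```

With this shape it's impossible to ask for perl = TRUE and fixed = TRUE at the same time, because the choice is a single value rather than two independent flags.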

12.3 How do I use this pattern?

There are two ways to use this pattern, depending on whether or not the different strategies have different arguments.

12.3.1 Name and enumerate

The simplest approach to exposing different strategies is to name each option and then use a character vector of the possible options, as described in Chapter 10. This approach typically pairs well with switch(). For example, take stringr::str_trim() which looks something like this:

str_trim <- function (string, side = c("both", "left", "right")) 
{
  switch(
    arg_match(side),
    left = stri_trim_left(string),
    right = stri_trim_right(string),
    both = stri_trim_both(string)
  )
}

This is particularly simple because stringr relies on the stringi package for implementation. But I think it’s still straightforward even if we implement more ourselves:

str_trim <- function (string, side = c("both", "left", "right")) 
{
  pattern <- switch(
    arg_match(side),
    left = "^\\s+",
    right = "\\s+$",
    both = "^\\s+|\\s+$"
  )
  str_replace_all(string, pattern, "")
}
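
If you want to experiment with this pattern without installing anything, the following is a self-contained sketch that uses only base R. The trim_side() name is made up for illustration; match.arg() stands in for arg_match() and gsub() for str_replace_all().

```r
# Hypothetical base R variant of the function above
trim_side <- function(string, side = c("both", "left", "right")) {
  # match.arg() picks the first value by default and errors on unknown values
  pattern <- switch(
    match.arg(side),
    left = "^\\s+",
    right = "\\s+$",
    both = "^\\s+|\\s+$"
  )
  gsub(pattern, "", string)
}

trim_side("  a  ")
#> [1] "a"
trim_side("  a  ", side = "left")
#> [1] "a  "
trim_side("  a  ", side = "right")
#> [1] "  a"
```

Calling trim_side("a", side = "middle") fails with an error that lists the allowed values, which is one of the practical benefits of enumerating the strategies.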

12.3.2 Only two strategies

If your function provides only two strategies it’s tempting to use a logical to switch between them. I recommend against this unless you’re REALLY SURE there won’t ever be another strategy. In many cases, it’s also easier to remember what an enumeration value means compared to a Boolean argument. For example compare these equivalent calls to sort() and vctrs::vec_sort():

x <- sample(10)
sort(x, decreasing = TRUE)
vctrs::vec_sort(x, direction = "desc")

sort(x, decreasing = FALSE)
vctrs::vec_sort(x, direction = "asc")

It’s unlikely that you’ll ever need to implement another sorting direction, but I think it’s easier to understand direction = "asc" than decreasing = FALSE at a glance. Another example where I think an enumeration is easier to understand is the cols_vary argument to pivot_longer(). This argument can be either fastest or slowest, and it’s hard to imagine how you might encode that into a Boolean flag.

12.3.3 Varying arguments

In more complicated cases, different strategies will require different arguments, so you’ll need a bit more infrastructure. The basic idea is to build on the options object described in Chapter 11, but instead of providing just one helper function, you’ll provide one function per strategy. A good example of this approach is stringr, which provides regex(), boundary(), coll(), and fixed() to pick between four different strategies for matching text. You can learn more about why we picked that interface in Chapter 17, so here I want to focus on the implementation.

If you take a look at one of these functions, you’ll see it’s a wrapper around a stringi function that performs a similar job. But fixed() does two extra things compared to stri_opts_fixed(): it more aggressively checks the input arguments and it combines the stringi options with the pattern and adds a class.

fixed <- function(pattern, ignore_case = FALSE) {
  pattern <- as_bare_character(pattern)
  check_bool(ignore_case)

  options <- stri_opts_fixed(case_insensitive = ignore_case)

  structure(
    pattern,
    options = options,
    class = c("stringr_fixed", "stringr_pattern", "character")
  )
}

This class is important because it allows us to check that the user has provided the expected input type and give a useful error message if not. Since pretty much every stringr function needs to do this, stringr provides an internal function called type() that looks something like this [1]:

type <- function(x) {
  if (inherits(x, "stringr_boundary")) {
    "bound"
  } else if (inherits(x, "stringr_regex")) {
    "regex"
  } else if (inherits(x, "stringr_coll")) {
    "coll"
  } else if (inherits(x, "stringr_fixed")) {
    "fixed"
  } else if (is.character(x)) {
    if (identical(x, "")) "empty" else "regex"
  } else {
    cli::cli_abort("Must be a string or stringr pattern object")
  }
}

Then individual stringr functions can use type() plus a switch statement:

str_detect <- function(string, pattern, negate = FALSE) {
  check_lengths(string, pattern)
  check_bool(negate)

  switch(type(pattern),
    empty = no_empty(),
    bound = no_boundary(),
    fixed = stri_detect_fixed(string, pattern, negate = negate, opts_fixed = opts(pattern)),
    coll  = stri_detect_coll(string,  pattern, negate = negate, opts_collator = opts(pattern)),
    regex = stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern))
  )
}

Two sets of helpers in this code deserve a note:

  • no_empty() and no_boundary() are helper functions that generate errors when a stringr function doesn’t support a specific engine.
  • opts() is a helper function for extracting the stringi options back out of the stringr wrapper object.

You can implement this same strategy using if or OOP, but here I particularly like the switch pattern because it keeps the stringi function calls close together, which makes it easier to keep them in sync.
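
The pieces above depend on stringr internals, so here is a stripped-down sketch of the same structure using only base R. All of the names (fixed_pattern(), regex_pattern(), pattern_type(), detect_pattern()) are hypothetical, and grepl() stands in for the stringi functions.

```r
# One constructor per strategy; each takes only the options that make
# sense for that strategy and tags the result with a class.
fixed_pattern <- function(pattern) {
  structure(list(pattern = pattern), class = c("my_fixed", "my_pattern"))
}
regex_pattern <- function(pattern, ignore_case = FALSE) {
  stopifnot(is.logical(ignore_case), length(ignore_case) == 1, !is.na(ignore_case))
  structure(
    list(pattern = pattern, ignore_case = ignore_case),
    class = c("my_regex", "my_pattern")
  )
}

# Convert a pattern object into the name of its strategy
pattern_type <- function(x) {
  if (inherits(x, "my_fixed")) {
    "fixed"
  } else if (inherits(x, "my_regex")) {
    "regex"
  } else {
    stop("`pattern` must be a string or a pattern object.", call. = FALSE)
  }
}

detect_pattern <- function(string, pattern) {
  if (is.character(pattern)) {
    pattern <- regex_pattern(pattern)  # bare strings default to regex
  }
  switch(
    pattern_type(pattern),
    fixed = grepl(pattern$pattern, string, fixed = TRUE),
    regex = grepl(pattern$pattern, string, ignore.case = pattern$ignore_case)
  )
}

detect_pattern("ABC", "b")
#> [1] FALSE
detect_pattern("ABC", regex_pattern("b", ignore_case = TRUE))
#> [1] TRUE
detect_pattern("a.c", fixed_pattern("."))
#> [1] TRUE
```

Note that ignore_case only exists on regex_pattern(), so there is no way to request a case-insensitive fixed match in this sketch; the constraint lives in the function signatures rather than in the documentation.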

12.3.4 Escape hatches

It’s sometimes useful to build in an escape hatch from canned strategies. This allows users to access alternative strategies, and allows for experimentation that can later turn into official strategies. One example of such an escape hatch is in name repair, which occurs in many places throughout the tidyverse. One place you might encounter it is in tibble():

tibble::tibble(a = 1, a = 2)
#> Error in `tibble::tibble()`:
#> ! Column name `a` must not be duplicated.
#> Use `.name_repair` to specify repair.
#> Caused by error in `repaired_names()`:
#> ! Names must be unique.
#> ✖ These names are duplicated:
#>   * "a" at locations 1 and 2.

Beneath the surface all tidyverse functions that expose some sort of name repair eventually end up calling vctrs::vec_as_names():

vctrs::vec_as_names(c("a", "a"), repair = "check_unique")
#> Error:
#> ! Names must be unique.
#> ✖ These names are duplicated:
#>   * "a" at locations 1 and 2.
vctrs::vec_as_names(c("a", "a"), repair = "unique")
#> New names:
#> • `a` -> `a...1`
#> • `a` -> `a...2`
#> [1] "a...1" "a...2"
vctrs::vec_as_names(c("a", "a"), repair = "unique_quiet")
#> [1] "a...1" "a...2"

vec_as_names() exposes six strategies, but it also allows you to supply a function:

vctrs::vec_as_names(c("a", "a"), repair = toupper)
#> [1] "A" "A"

The implementation looks something like this:

vec_as_names <- function(
    names,
    repair = c(
      "minimal",
      "unique",
      "universal",
      "check_unique",
      "unique_quiet",
      "universal_quiet"
    )
) {
  if (is.character(repair)) {
    # Convert the name of a known strategy into the function that implements it
    repair <- switch(
      arg_match(repair), 
      minimal = minimal_names, 
      universal = function(names) as_universal_names(names, quiet = FALSE),
      universal_quiet = function(names) as_universal_names(names, quiet = TRUE),
      ...
    )
  } else if (!is.function(repair)) {
    cli::cli_abort("{.arg repair} must be a function or string.")
  }

  # By this point `repair` is always a function
  repair(names)
}

Notice we use switch to convert a string to a function from a set of known values.
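
Here is a small runnable sketch of the same idea in base R. The repair_names() function and its two named strategies are hypothetical and far simpler than what vctrs does; make.unique() is a base R function.

```r
# Hypothetical example: accept either the name of a canned strategy or a
# user-supplied function (the escape hatch).
repair_names <- function(names, repair = c("minimal", "unique")) {
  if (is.character(repair)) {
    repair <- switch(
      match.arg(repair),
      minimal = identity,
      unique = function(names) make.unique(names, sep = "_")
    )
  } else if (!is.function(repair)) {
    stop("`repair` must be a function or a string.", call. = FALSE)
  }
  repair(names)
}

repair_names(c("a", "a"))
#> [1] "a" "a"
repair_names(c("a", "a"), repair = "unique")
#> [1] "a"   "a_1"
repair_names(c("a", "a"), repair = toupper)
#> [1] "A" "A"
```

Because every branch ends with a function stored in repair, the rest of the body doesn't need to know whether the strategy came from the canned set or from the user.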

12.4 How do I remediate past mistakes?

It’s very easy to violate this pattern because your function often begins by implementing a single strategy, and then later you discover a new strategy. It’s easy to implement this as a Boolean flag or with some argument magic, which leads to problems when you later discover a third strategy. In this section, you’ll see a few ways that you can fix these problems if you discover later that you’ve made a mistake.

12.4.1 Boolean to enumeration

One sign that you’re stretching the limit of a Boolean flag is that you’ve given an additional meaning to NA so instead of selecting between two possible options, your argument now selects between three. Take sort() for example: its na.last argument exposes three strategies for handling missing values:

  • na.last = TRUE means put them last.
  • na.last = FALSE means put them first.
  • na.last = NA means to drop them.

I think we could make this function more clear by using an enumeration that takes one of three values: last, first, or drop. In this case, we can change both the argument name and its interface making remediation relatively simple: we add a new argument called na and deprecate the old argument.

sort <- function(na.last = deprecated(), na = c("drop", "first", "last")) {
  
  if (lifecycle::is_present(na.last)) {
    lifecycle::deprecate_warn("1.0.0", "sort(na.last)", "sort(na)")

    if (!is.logical(na.last) || length(na.last) != 1) {
      cli::cli_abort("{.arg na.last} must be a single TRUE, FALSE, or NA.")
    }
    
    if (isTRUE(na.last)) {
      na <- "last"
    } else if (isFALSE(na.last)) {
      na <- "first"
    } else {
      na <- "drop"
    }
  } else {
    na <- arg_match(na)
  }
}

Note that the full signature of sort() is sort(x, decreasing = FALSE, na.last = NA, …), so when doing this it would be good practice to put the new argument after the … (Chapter 8) so that it doesn’t change the meaning of existing calls that rely on partial matching, e.g. sort(x, na = TRUE).

It would also be nice to make the default value "last" since it’s very unusual for an R function to silently remove missing values. However, that change is much harder to do because it’s much more likely to affect existing code.
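
To see how the enumerated interface reads in practice, here is a runnable sketch that layers it on top of base sort(). The sort_na() wrapper is hypothetical; it leaves out the deprecation handling and simply translates the new values into the existing na.last argument.

```r
# Hypothetical wrapper around base::sort() with an enumerated `na` argument
sort_na <- function(x, na = c("drop", "first", "last")) {
  na <- match.arg(na)
  # Translate the named strategy into the TRUE/FALSE/NA that sort() expects
  na_last <- switch(na, drop = NA, first = FALSE, last = TRUE)
  sort(x, na.last = na_last)
}

x <- c(2, NA, 1)
sort_na(x)
#> [1] 1 2
sort_na(x, na = "first")
#> [1] NA  1  2
sort_na(x, na = "last")
#> [1]  1  2 NA
```

The default here is "drop" only to match the existing behaviour of sort().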

A similar case arose in haven::write_sav() where originally the compress argument could be TRUE or FALSE. But then SPSS introduced a new way of compressing, expanding the set of options to three (which we have called byte, none, and zsav). Since the compress argument name is still good, we chose to accept either TRUE or FALSE or one of the new enumeration values:

write_sav <- function(data, path, compress = c("byte", "none", "zsav"), adjust_tz = TRUE) {
  if (isTRUE(compress)) {
    compress <- "zsav"
  } else if (isFALSE(compress)) {
    compress <- "none"
  } else {
    compress <- arg_match(compress)
  }

  ...
}

You could also imagine deprecating the old logical options, but here we chose to keep them since it’s a small amount of extra code, and means that existing users never need to worry about it. See ?haven::read_sav for how we communicated this in the docs.

12.4.2 Using a strategy function

Sometimes the strategy will be tangled in with many other arguments, or there might be multiple strategies used simultaneously. In these situations you want to avoid creating a combinatorial explosion of functions, and instead might want to use a strategy object.

For example, generating the bins for a histogram is a surprisingly complex topic. ggplot2::stat_bin(), which powers ggplot2::geom_histogram(), has a total of 5 arguments that control where the bins are placed:

  • You can supply either binwidth or bins to specify either the width or the number of evenly spaced bins. Alternatively, you can supply breaks to specify the exact bin locations yourself (which allows you to create unevenly sized bins [2]).
  • If you use binwidth or bins, you’re specifying the width of each bin, but not where the bins start. So additionally you can use either boundary or center [3] to specify the location of a side (boundary) or the middle (center) of a bin [4]. boundary and center are mutually exclusive; you can only specify one (see Chapter 15 for more).
  • Regardless of the way that you specify the locations of the bins, you need to choose whether a bin from a to b is [a, b) or (a, b], which is the job of the closed argument.

One way to resolve this problem would be to encapsulate the three basic strategies into three functions:

  • bin_width(width, center, boundary, closed)
  • bin_number(bins, center, boundary, closed)
  • bin_breaks(breaks, closed)

That immediately makes the relationship between the arguments and the strategies more clear.

Note that these functions create “strategies”; i.e. they don’t take the data needed to actually perform the operation: none of these functions takes the range of the data. This makes these functions function factories, which is a relatively complex technique.

bin_width <- function(width, center, boundary, closed = c("left", "right")) {
  # https://adv-r.hadley.nz/function-factories.html#forcing-evaluation
  list(width, center, boundary, closed)
  
  function(range) {
    
  }
}

Argument checking

As in Chapter 11, you may want to give these functions custom classes so that the function that uses them can provide better error messages if the user supplies the wrong type of object.

Alternatively, you might want to just check that the input is a function with the correct formals; that allows the user to supply their own strategy function. It’s probably something that few people will take advantage of, but it’s a nice escape hatch.
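
The following is a minimal sketch of how these pieces could fit together, under some loud assumptions: the strategy functions here ignore center, boundary, and closed entirely, compute_breaks() is a made-up stand-in for the function that consumes the strategy, and none of this is how ggplot2 is implemented.

```r
# Hypothetical strategy constructors: each returns a function of `range`
bin_breaks <- function(breaks) {
  force(breaks)  # evaluate now, not when the strategy is later called
  function(range) breaks
}
bin_number <- function(bins) {
  force(bins)
  function(range) seq(range[1], range[2], length.out = bins + 1)
}

# Accept any function whose only argument is `range`
check_strategy <- function(f) {
  if (!is.function(f) || !identical(names(formals(f)), "range")) {
    stop("`strategy` must be a function with a single `range` argument.", call. = FALSE)
  }
  invisible(f)
}

# Hypothetical consumer: the data only enters here, not in the constructors
compute_breaks <- function(x, strategy = bin_number(4)) {
  check_strategy(strategy)
  strategy(range(x, na.rm = TRUE))
}

x <- c(0, 3, 7, 8)
compute_breaks(x)
#> [1] 0 2 4 6 8
compute_breaks(x, bin_breaks(c(0, 5, 10)))
#> [1]  0  5 10
compute_breaks(x, function(range) mean(range))  # user-supplied strategy
#> [1] 4
```

Checking the formals rather than a class is what makes the last call work: any function of range is accepted, which is the escape hatch described above.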

12.4.3 Enumeration to strategy

If your function used an enumeration, but now you realise that some strategies use different arguments, you can remediate like this:

# OLD 
my_fun <- function(strategy = c("a", "b")) {
  strategy <- arg_match(strategy)
}

check_strategy <- function(f) {
  if (!is.function(f) || !identical(names(formals(f)), "range")) {
    cli::cli_abort("{.arg strategy} must be a function with a single {.arg range} argument.")
  }
}


# NEW
my_fun <- function(strategy = my_strategy_a()) {
  if (is.character(strategy)) {
    # Keep accepting the old strings. The default is no longer a character
    # vector, so the allowed values have to be listed explicitly.
    strategy <- switch(
      arg_match(strategy, c("a", "b")),
      a = my_strategy_a(),
      b = my_strategy_b()
    )
  } else {
    check_strategy(strategy)
  }
}
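
To try this remediation end to end, here is a self-contained sketch in base R. Everything in it is hypothetical: the two strategies do arbitrary arithmetic on the range of the data, my_fun() gains an x argument so there is something to compute, and match.arg() and stop() stand in for arg_match() and cli::cli_abort().

```r
# Hypothetical strategies; only strategy "a" has an option
my_strategy_a <- function(scale = 1) {
  force(scale)
  function(range) diff(range) * scale
}
my_strategy_b <- function() {
  function(range) mean(range)
}

check_strategy <- function(f) {
  if (!is.function(f) || !identical(names(formals(f)), "range")) {
    stop("`strategy` must be a function with a single `range` argument.", call. = FALSE)
  }
}

my_fun <- function(x, strategy = my_strategy_a()) {
  if (is.character(strategy)) {
    # Old interface: strings keep working and map to the default strategies
    strategy <- switch(
      match.arg(strategy, c("a", "b")),
      a = my_strategy_a(),
      b = my_strategy_b()
    )
  } else {
    check_strategy(strategy)
  }
  strategy(range(x))
}

my_fun(c(1, 5))                            # new default
#> [1] 4
my_fun(c(1, 5), "b")                       # old string interface
#> [1] 3
my_fun(c(1, 5), my_strategy_a(scale = 2))  # strategy-specific option
#> [1] 8
```

Existing calls that pass "a" or "b" continue to work unchanged, while new code can reach the options that only one strategy supports.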

12.5 See also

  • The original strategy pattern defined in Design Patterns. This pattern has a rather different implementation in a classic OOP language.
  • Chapter 14 discusses the related problem of one argument affecting the interpretation of another argument.
  • Chapter 11 is about the general problem of moving unimportant arguments to another function.

There are also two exceptions to this pattern that we’ll come back to in future chapters:

  • Chapter 15: sometimes you need a pair of mutually exclusive arguments.
  • Chapter 16: sometimes you want to provide all the data in a single argument, and other times it’s useful to spread the same data across multiple arguments.

  1. The actual function is more complicated because it takes more care to generate an informative error message, and it uses S3 instead of nested if statements. But the overall strategy is the same.

  2. One nice application of this principle is to create a histogram where each bin contains (approximately) the same number of points, as implemented in https://github.com/eliocamp/ggpercentogram/.

  3. center is also a little problematic as an argument name, because UK English would prefer centre. It’s probably OK here since it’s a very rarely used argument, but middle would be a good alternative that doesn’t have the same US/UK problem. Alternatively, the pair could be endpoint and midpoint, which perhaps suggests a tighter pairing than center and boundary.

  4. It can be any bin; stat_bin() will automatically adjust all the other bins.