Required args shouldn’t have defaults

What’s the pattern?

Required arguments shouldn’t have defaults; optional arguments should have defaults. In other words, an argument should have a default if and only if it’s optional.

This simple convention ensures that a glance at the function signature tells you which arguments are required and which are optional; otherwise you have to rely on a careful reading of the documentation. Additionally, if you don’t follow this convention, a missing required argument won’t trigger R’s built-in “argument ... is missing, with no default” error, so you’ll need to implement a helpful error message yourself.

This pattern raises the question of when an argument should be required and when it should have a default. The answer usually seems obvious, but I want to discuss a few functions that arguably get it wrong:

  • rnorm() and runif() are interesting cases because they provide defaults for mean/sd and min/max. Those defaults make the arguments feel less important, and make the functions inconsistent with the other RNGs, which generally require that you specify the parameters of the distribution. But both the normal and uniform distributions have very high-profile “standard” versions that make sense as defaults.
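
    You can see the inconsistency by comparing signatures: rnorm() gives every distribution parameter a default, while rgamma() requires shape:

    args(rnorm)
    #> function (n, mean = 0, sd = 1) 
    #> NULL
    args(rgamma)
    #> function (n, shape, rate = 1, scale = 1/rate) 
    #> NULL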

  • You can use predict() directly on a model and it gives predictions for the data used to fit the model:

    mod <- lm(Employed ~ ., data = longley)
    head(predict(mod))
    #>     1947     1948     1949     1950     1951     1952 
    #> 60.05566 61.21601 60.12471 61.59711 62.91129 63.88831

    In my opinion, predict() should always require a dataset because prediction is primarily about applying the model to new situations.
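
    Under that philosophy, you’d pass the data explicitly even when predicting for the data used to fit the model:

    predict(mod, newdata = head(longley))
    #>     1947     1948     1949     1950     1951     1952 
    #> 60.05566 61.21601 60.12471 61.59711 62.91129 63.88831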

  • stringr::str_sub() has default values for start and end. This allows you to do clever things like str_sub(x, end = 3) or str_sub(x, -3) to select the first or last three characters, but I now believe this leads to code that is harder to read; it would have been better to make start and end required arguments.
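
    For example, both of these calls work, but neither makes it obvious which argument was omitted:

    library(stringr)
    x <- "tidyverse"
    str_sub(x, end = 3)
    #> [1] "tid"
    str_sub(x, -3)
    #> [1] "rse"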

What are some examples?

This is a straightforward convention that the vast majority of functions follow. A few exceptions exist in base R, mostly for historical reasons. Here are a couple of examples:

  • In sample() neither x nor size has a default value:

    args(sample)
    #> function (x, size, replace = FALSE, prob = NULL) 
    #> NULL

    This suggests that size is required, but it’s actually optional:

    sample(1:4)
    #> [1] 2 1 4 3
    sample(4)
    #> [1] 1 2 4 3
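
    Internally, sample() uses missing() to implement this. Here’s a sketch of how the same behaviour could be expressed with a visible default (simplified: it ignores sample()’s special handling of a scalar x):

    sample2 <- function(x, size = length(x), replace = FALSE, prob = NULL) {
      x[sample.int(length(x), size, replace, prob)]
    }
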
  • lm() does not have defaults for formula, data, subset, weights, na.action, or offset.

    args(lm)
    #> function (formula, data, subset, weights, na.action, method = "qr", 
    #>     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    #>     contrasts = NULL, offset, ...) 
    #> NULL

    But only formula is actually required:

    x <- 1:5
    y <- 2 * x + 1 + rnorm(length(x))
    lm(y ~ x)
    #> 
    #> Call:
    #> lm(formula = y ~ x)
    #> 
    #> Coefficients:
    #> (Intercept)            x  
    #>       0.611        2.161

In the tidyverse, one function that fails to follow this pattern is ggplot2::geom_abline(): slope and intercept don’t have defaults, but they are not required. If you don’t supply them, they default to slope = 1 and intercept = 0, or are taken from aes() if they’re provided there. This is a mistake caused by trying to make geom_abline() do too much: it can be used both as an annotation (i.e. with a single slope and intercept) and to draw multiple lines from data (i.e. one line for each row).
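
Here’s a sketch of the two uses (the coefficients are made up for illustration):

    library(ggplot2)

    # As an annotation: a single line with an explicit slope and intercept
    ggplot(mtcars, aes(wt, mpg)) +
      geom_point() +
      geom_abline(intercept = 37, slope = -5)

    # From data: one line per row, with slope and intercept mapped in aes()
    lines <- data.frame(slope = c(-4, -6), intercept = c(34, 40))
    ggplot(mtcars, aes(wt, mpg)) +
      geom_point() +
      geom_abline(aes(slope = slope, intercept = intercept), data = lines)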

How do I use the pattern?

This pattern is generally easy to follow: as long as you don’t use missing(), it’s hard to violate by accident.
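
The main trap is using missing(), which hides the default from the signature. Here’s a sketch with a hypothetical function:

    # Anti-pattern: the signature implies `digits` is required, but it isn't
    round_pct <- function(x, digits) {
      if (missing(digits)) {
        digits <- 2
      }
      round(100 * x, digits)
    }

    # Following the convention: the default is visible in the signature
    round_pct <- function(x, digits = 2) {
      round(100 * x, digits)
    }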

How do I remediate past mistakes?

If an argument is required, remove its default. If an argument is optional, give it a default: either supply the default value directly, or, if computing it is complicated, default to NULL and compute the actual value inside the body of the function.
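
Here’s a sketch of the NULL pattern, using a hypothetical function:

    # `n` is optional, but its default is too complicated for the signature,
    # so we use NULL as a sentinel and compute the real value in the body
    first_rows <- function(df, n = NULL) {
      if (is.null(n)) {
        n <- min(10, nrow(df))
      }
      df[seq_len(n), , drop = FALSE]
    }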