Make inputs explicit

What’s the problem?

A function is easier to understand if its output depends only on its inputs (i.e. its arguments). If a function returns different results with the same inputs, then some inputs must be implicit, typically because the function relies on an option or some locale setting. Implicit inputs are not always bad, as some functions like Sys.time(), read.csv(), and the random number generators, fundamentally depend on them. But they should be used as sparingly as possible, and never when not related to the core purpose of the function.

Explicit arguments make code easier to understand because you can see what will affect the outputs just by reading the code; you don’t need to run it. Implicit arguments can lead to code that returns different results on different computers, and the differences are usually hard to track down.

What are some examples?

One common source of hidden arguments is the use of global options:

  • Historically, the worst offender was the stringsAsFactors option which changed how a number of functions1 treated character vectors. This option was part of a multi-year procedure to move R away toward character vectors and away from vectors. You can learn more in stringsAsFactors: An unauthorized biography by Roger Peng and stringsAsFactors = <sigh> by Thomas Lumley.

  • lm()’s handling of missing values depends on the global option of na.action. The default is na.omit which drops the missing values prior to fitting the model (which is inconvenient because then the results of predict() don’t line up with the input data.

Another common source of subtle bugs is relying on the system locale, i.e. the country and language specific settings controlled by your operating system. Relying on the system locale is always done with the best of intentions (you want your code to respect the user’s preferences) but can lead to subtle differences when the same code is run by different people. Here are a few examples:

  • strptime() relies on the names of weekdays and months in the current locale. That means strptime("1 Jan 2020", "%d %b %Y") will work on computers with an English locale, and fail elsewhere.

  • as.POSIXct() depends on the current timezone. The following code returns different underlying times when run on different computers:

    as.POSIXct("2020-01-01 09:00")
    #> [1] "2020-01-01 09:00:00 UTC"
  • toupper() and tolower() depend on the current locale. It is fairly uncommon for this to cause problems because most languages either use their own character set, or use the same rules for capitalisation as English. However, this behaviour did cause a bug in ggplot2 because internally it takes geom = "identity" and turns it into GeomIdentity to find the object that actually does computation. In Turkish, however, the upper case version of i is İ, and Geomİdentity does not exist. This meant that for some time ggplot2 did not work on Turkish computers.

  • sort() and order() rely on the lexicographic order (i.e. how different alphabets sort their letters) defined by the current locale. lm() automatically converts character vectors to factors with factor(), which uses order(), which means that it’s possible for the coefficients to vary2 if your code is run in a different country!

How can I remediate the problem?

At some level, implicit inputs are easy to avoid when creating new functions: just don’t use the locale or global options! But it’s easy for such problems to creep in indirectly, when you call a function not knowing that it has hidden inputs. The best way to prevent that is to consult the list of common offenders provided above.

Make an option explicit

If you want depend on an option or locale, make sure it’s an explicit argument. Such arguments generally should not affect computation (Chapter 22), just side-effects like printed output or status messages. If they do affect results, follow Chapter 21 to make sure the user knows what’s happening. For example, lets take as.POSIXct() which basically looks something like this:

as.POSIXct <- function(x, tz = "") {
  base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
#> [1] "2020-01-01 09:00:00 UTC"

The tz argument is present, but it’s not obvious that "" means the current time zone. Let’s first make that explicit:

as.POSIXct <- function(x, tz = Sys.timezone()) {
  base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
#> [1] "2020-01-01 09:00:00 UTC"

Since this is an important default whose value can change, we also print it out if the user hasn’t explicitly set it:

as.POSIXct <- function(x, tz = Sys.timezone()) {
  if (missing(tz)) {
    message("Using `tz = \"", tz, "\"`")
  }
  base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
#> Using `tz = "UTC"`
#> [1] "2020-01-01 09:00:00 UTC"

Since most people don’t like lots of random output this provides a subtle incentive to supply the timezone:

as.POSIXct("2020-01-01 09:00", tz = "America/Chicago")
#> [1] "2020-01-01 09:00:00 CST"

Temporarily adjust global state

If you’re calling a function with implicit arguments and those implicit arguments are causing problems with your code, you can always work around them by temporarily changing the global state which it uses. The easiest way to do so is to use the withr package, which provides a variety of tools to change temporarily change global state.

See also


  1. Such as data.frame(), as.data.frame(), and read.csv()↩︎

  2. Predictions and other diagnostics won’t be affected, but you’re likely to be surprised that your coefficients are different.↩︎