as.POSIXct("2020-01-01 09:00")
#> [1] "2020-01-01 09:00:00 UTC"
Make inputs explicit
What’s the problem?
A function is easier to understand if its output depends only on its inputs (i.e. its arguments). If a function returns different results with the same inputs, then some inputs must be implicit, typically because the function relies on an option or some locale setting. Implicit inputs are not always bad, as some functions like Sys.time()
, read.csv()
, and the random number generators, fundamentally depend on them. But they should be used as sparingly as possible, and never when not related to the core purpose of the function.
Explicit arguments make code easier to understand because you can see what will affect the outputs just by reading the code; you don’t need to run it. Implicit arguments can lead to code that returns different results on different computers, and the differences are usually hard to track down.
What are some examples?
One common source of hidden arguments is the use of global options:
Historically, the worst offender was the
stringsAsFactors
option which changed how a number of functions1 treated character vectors. This option was part of a multi-year procedure to move R away toward character vectors and away from vectors. You can learn more in stringsAsFactors: An unauthorized biography by Roger Peng and stringsAsFactors = <sigh> by Thomas Lumley.lm()
’s handling of missing values depends on the global option ofna.action
. The default isna.omit
which drops the missing values prior to fitting the model (which is inconvenient because then the results ofpredict()
don’t line up with the input data.
Another common source of subtle bugs is relying on the system locale, i.e. the country and language specific settings controlled by your operating system. Relying on the system locale is always done with the best of intentions (you want your code to respect the user’s preferences) but can lead to subtle differences when the same code is run by different people. Here are a few examples:
strptime()
relies on the names of weekdays and months in the current locale. That meansstrptime("1 Jan 2020", "%d %b %Y")
will work on computers with an English locale, and fail elsewhere.-
as.POSIXct()
depends on the current timezone. The following code returns different underlying times when run on different computers: toupper()
andtolower()
depend on the current locale. It is fairly uncommon for this to cause problems because most languages either use their own character set, or use the same rules for capitalisation as English. However, this behaviour did cause a bug in ggplot2 because internally it takesgeom = "identity"
and turns it intoGeomIdentity
to find the object that actually does computation. In Turkish, however, the upper case version of i is İ, andGeomİdentity
does not exist. This meant that for some time ggplot2 did not work on Turkish computers.sort()
andorder()
rely on the lexicographic order (i.e. how different alphabets sort their letters) defined by the current locale.lm()
automatically converts character vectors to factors withfactor()
, which usesorder()
, which means that it’s possible for the coefficients to vary2 if your code is run in a different country!
How can I remediate the problem?
At some level, implicit inputs are easy to avoid when creating new functions: just don’t use the locale or global options! But it’s easy for such problems to creep in indirectly, when you call a function not knowing that it has hidden inputs. The best way to prevent that is to consult the list of common offenders provided above.
Make an option explicit
If you want depend on an option or locale, make sure it’s an explicit argument. Such arguments generally should not affect computation (Chapter 22), just side-effects like printed output or status messages. If they do affect results, follow Chapter 21 to make sure the user knows what’s happening. For example, lets take as.POSIXct()
which basically looks something like this:
as.POSIXct <- function(x, tz = "") {
base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
#> [1] "2020-01-01 09:00:00 UTC"
The tz
argument is present, but it’s not obvious that ""
means the current time zone. Let’s first make that explicit:
as.POSIXct <- function(x, tz = Sys.timezone()) {
base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
#> [1] "2020-01-01 09:00:00 UTC"
Since this is an important default whose value can change, we also print it out if the user hasn’t explicitly set it:
as.POSIXct <- function(x, tz = Sys.timezone()) {
if (missing(tz)) {
message("Using `tz = \"", tz, "\"`")
}
base::as.POSIXct(x, tz = tz)
}
as.POSIXct("2020-01-01 09:00")
#> Using `tz = "UTC"`
#> [1] "2020-01-01 09:00:00 UTC"
Since most people don’t like lots of random output this provides a subtle incentive to supply the timezone:
as.POSIXct("2020-01-01 09:00", tz = "America/Chicago")
#> [1] "2020-01-01 09:00:00 CST"
Temporarily adjust global state
If you’re calling a function with implicit arguments and those implicit arguments are causing problems with your code, you can always work around them by temporarily changing the global state which it uses. The easiest way to do so is to use the withr package, which provides a variety of tools to change temporarily change global state.
See also
- Chapter 22 and Chapter 21: how to make an option as explicit as possible.
- Chapter 34: where a function changes global state in a surprising way.
Such as
data.frame()
,as.data.frame()
, andread.csv()
↩︎Predictions and other diagnostics won’t be affected, but you’re likely to be surprised that your coefficients are different.↩︎