Spooky action

What’s the problem?

There are no limits to what an function or script can do. After you call draw_plot() or source("analyse-data.R"), you could discover that all the variables in your global environment have been deleted, or that 1000 new files have been created on your desktop. But these actions would be surprising, because generally you expect the impact of a function (or script) to be as limited as possible. Collectively, we call such side-effects “spooky actions” because the connection between action (calling a function or sourcing a script) and result (deleting objects or upgrading packages) is surprising. It’s like flipping a light-switch and discovering that the shower starts running, or having a poltergeist that rearranges the contents of your kitchen cupboards when you’re not looking.

Deleting variables and creating files on your desktop are obviously surprising even if you’ve only just started using R. But there are other actions that are less obviously destructive, and only start to become surprising as your mental model of R matures. These include actions like:

  • Attaching packages with library(). For example, ggplot2::geom_map() used to call library(maps) in order to make map data available to the function. This seems harmless, but if you were using purrr, it would break map() map() would now refer to maps::map() rather than purrr::map(). Because of functions in different packages can have the same name, attaching a package can change the behaviour of existing code.

  • Installing packages with install.packages(). If a script needs dplyr to work, and it’s not installed, it seems polite to install it on behalf of the user. But installing a new package can upgrade existing packages, which might break code in other projects. Install a package is a potentially destructive operation which should be done with care.

  • Deleting objects in the global environment with rm(list = ls()). This might seem like a good way to reset the environment so that your script can run cleanly. But if someone else source()s your script, it will delete objects that might be important to them. (Of course, you’d hope that all of those objects could easily be recreated from another script, but that is not always the case).

Because R doesn’t constrain the potential scope of functions and scripts, you have to. By avoiding these actions, you will create code that is less surprising to other R users. At first, this might seem like tedious busywork. You might find that spooky action is convenient in the moment, and you might convince yourself that it’s necessary or a good idea. But as you share your code with more people and run more code that has been shared with you1, you’ll find spooky action to get more and more surprising and frustrating.

What precisely is a spooky action?

We can make the notion of spooky action precise by thinking about trees. Code should only affect the tree beneath where it lives, so any action that reaches up, or across, the tree is a spooky action.

There are two important types of trees to consider:

  • The tree formed by files and directories. A script should only read from and write to directories beneath the directory where it lives. This explains why you shouldn’t install packages (because the package library usually lives elsewhere), and also explains why you shouldn’t create files on the desktop.

    This rule can be relaxed in two small ways. Firstly, if the script lives in a project, it’s ok to read from and write to anywhere in the project (i.e. a file in R/ can read from data-raw/ and write to data/). Secondly, it’s always ok to write to the session specific temporary directory, tempdir(). This directory is automatically deleted when R closes, so does not have any lasting effects.

  • The tree of environments created by function calls. A function should only create and modify variables in its own environment or environments that it creates (typically by calling other functions). This explains why you shouldn’t attach packages (because that changes the search path), why you shouldn’t delete variables with rm(list = ls()), or assign to variables that you didn’t create with <<-.

How can I remediate spooky actions?

If you have read the above cautions, and still want to proceed, there are three ways you can make the spooky action as safe as possible:

  • Allow the user to control the scope.

  • Make the action less spooky by giving it a name that clearly describes what it will do.

  • Explicitly check with the user before proceeding with the action.

  • Advertise what’s happening, so while the action might still be spooky, at least it isn’t surprising.

Parameterise the action

The first technique is to allow the user to control where the action will occur. For example, instead of save_output_desktop(), you would write save_output(path), and require that the user provide the path.

Ask for confirmation

If you can’t parameterise the operation, and need to perform it from somewhere deep within the cope, make sure to confirm with the user before performing the action. The code below shows how you might do so when installing a package:

install_if_needed <- function(package) {
  if (requireNamespace(package, quietly = TRUE)) {
    return(invisible(TRUE))
  }
  
  if (!interactive()) {
    stop(package, " is not installed", call. = FALSE)
  }
  
  title <- paste0(package, " is not installed. Do you wish to install now?")
  if (menu(c("Yes", "No"), title = title) != 1) {
    stop("Confirmation not received", call. = FALSE)
  }
  
  invisible(TRUE)
}

Note the use of interactive() here: if the user is not in an interactive setting (i.e. the code is being run with Rscript) and we can not get explicit confirmation, we shouldn’t make any changes. Also that all failures are errors: this ensures that the remainder of the function or script does not run if the user doesn’t confirm.

Ideally this function would also clearly describe the consequences of your decision. For example, it would be nice to know if it will download a significant amount of data (since you might want to wait until your at a fast connection if downloading a 1 Gb data package), or if it will upgrade existing packages (since that might break other code).

Writing code that checks with the user requires some care, and it’s easy to get the details wrong. That’s why it’s better to prefer one of the prior techniques.

Case studies

save() and load()

load() has a spooky action because it modifies variables in the current environment:

x <- 1
load("spooky-action.rds")
x
#> [1] 10

You can make it less spooky by supplying verbose = TRUE. Here we learn that it also loaded a y object:

load("spooky-action.rds", verbose = TRUE)
#> Loading objects:
#>   x
#>   y
y
#> [1] 100

(In an ideal world verbose = TRUE would be default)

But generally, I’d avoid save() and load() altogether, and instead use saveRDS() and readRDS(), which read and write individual R objects to individual R files and work with <-. This eliminates all spooky action:

saveRDS(x, "x.rds")
x <- readRDS("x.rds")
unlink("x.rds")

(readr provides readr::read_rds() and readr::write_rds() if the inconsistent naming conventions bother you like they bother me.)

usethis

The usethis package is designed to support the process of developing a package of R code. It automates many of tedious setup steps by providing function like use_r() or use_test(). Many usethis functions modify the DESCRIPTION and create other files. usethis makes these actions as pedestrian as possible by:

  • Making it clear that the entire package is designed for the purpose of creating and modifying files, and the purpose of each function is clearly encoded in its named.

  • For any potential risky operation, e.g. overwriting an existing file, usethis explicitly asks for confirmation from the user. To make it harder to “click” through prompts without reading them, usethis uses random prompts in a random ordering.

    usethis::ui_yeah("Do you want to proceed?")
    #> Do you want to proceed?
    #> 
    #> 1: Absolutely not
    #> 2: Not now
    #> 3: I agree

    usethis also works in concert with git to make sure that change are captured in a way that can easily be undone.

  • When you call it, every usethis function describes what it is doing as it it doing it:

    usethis::create_package("mypackage", open = FALSE)
    #> ✔ Creating 'mypackage'
    #> ✔ Setting active project to 'mypackage'
    #> ✔ Creating 'R/'
    #> ✔ Writing 'DESCRIPTION'
    #> ✔ Writing 'NAMESPACE'
    #> ✔ Writing 'mypackage.Rproj'
    #> ✔ Adding '.Rproj.user' to '.gitignore'
    #> ✔ Adding '^mypackage\\.Rproj$', '^\\.Rproj\\.user$' to '.Rbuildignore'

    This is important, but it’s not clear how impactful it is because many functions produce enough output that reading through it all seems onerous and so generally most people don’t read it.

<<-

If you haven’t heard of <<-, the super-assignment operator, before, feel free to skip this section as it’s an advanced technique that has relatively limited applications. They’re most important in the context of functionals, which you can read more about in Advanced R.

<<- is safe if you use it to modify a variable in an environment that you control. For example, the following code creates a function that counts the number of times it is called. The use of <<- is safe because it only affects the environment created by make_counter(), not an external environment.

make_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}
c1 <- make_counter()
c2 <- make_counter()
c1()
#> [1] 1
c1()
#> [1] 2
c2()
#> [1] 1

A more common use of <<- is to break one of the limitations of map()2 and use it like a for loop to iteratively modify input. For example, imagine you want to compute a cumulative sum. That’s straightforward to write with a for loop:

x <- rpois(10, 10)

out <- numeric(length(x))
for (i in seq_along(x)) {
  if (i == 1) {
    out[[i]] <- x[[i]]
  } else {
    out[[i]] <- x[[i]] + out[[i - 1]]
  }
}

rbind(x, out)
#>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> x      8    5   12   10   11    4   15   11   10     8
#> out    8   13   25   35   46   50   65   76   86    94

A simple transformation from to use map() doesn’t work:

library(purrr)
out <- numeric(length(x))
map_dbl(seq_along(x), function(i) {
  if (i == 1) {
    out[[i]] <- x[[i]]
  } else {
    out[[i]] <- x[[i]] + out[[i - 1]]
  }
})
#>  [1]  8  5 12 10 11  4 15 11 10  8

Because the modification of out is happening inside of a function, R creates a copy of out (this is called copy-on-modify principle). Instead we need to use <<- to reach outside of the function to modify the outer out:

map_dbl(seq_along(x), function(i) {
  if (i == 1) {
    out[[i]] <<- x[[i]]
  } else {
    out[[i]] <<- x[[i]] + out[[i - 1]]
  }
})
#>  [1]  8 13 25 35 46 50 65 76 86 94

This use of <<- is a spooky action because we’re reaching up the tree of environments to modify an object created outside of the function. In this case, however, there’s no point in using map(): the point of those functions is to restrict what you can do compared to a for loop so that your code is easier to understand. R for data science has other examples of for loops that could be rewritten with map(), but shouldn’t be.

Note that we could still wrap this code into a function to eliminate the spooky action:

cumsum2 <- function(x) {
  out <- numeric(length(x))
  map_dbl(seq_along(x), function(i) {
    if (i == 1) {
      out[[i]] <<- x[[i]]
    } else {
      out[[i]] <<- x[[i]] + out[[i - 1]]
    }
  })
  out
}

This eliminates the spooky action because it’s now modifying an object that the function “owns”, but I still wouldn’t recommend it, as the use of map() and <<- only increases complexity for no gain compared to the use of a for loop.

assign() in a for loop

It’s not uncommon for people to ask how to create multiple objects from for a loop. For example, maybe you have a vector of file names, and want to read each file into an individual object. With some effort you typically discover assign():

paths <- c("a.csv", "b.csv", "c.csv")
names(paths) <- c("a", "b", "c")

for (i in seq_along(paths)) {
  assign(names(paths)[[i]], read.csv(paths[[i]]))
}

The main problem with this approach is that it does not facilitate composition. For example, imagine that you now wanted to figure out how many rows are in each of the data frames. Now you need to learn how to loop over a series of objects, where the name of the object is stored in a character vector. With some work, you might discover get() and write:

lengths <- numeric(length(names))
for (i in seq_along(paths)) {
  lengths[[i]] <- nrow(get(names(paths)[[i]]))
}

This approach is not necessarily bad in and of itself3, but it tends to lead you down a high-friction path. Instead, if you learn a little about lists and functional programming techniques (e.g. with purrr), you’ll be able to write code like this:

library(purrr)

files <- map(paths, read.csv)
lengths <- map_int(files, nrow)

This obviously requires that you learn some new tools - but learning about map() and map_int() will pay off in many more situations than learning about assign() and get(). And because you can reuse map() and friends in many places, you’ll find that they get easier and easier to use over time.

It would certainly be possible to build tools in purrr to avoid having to learn about assign() and get() and to provide a polished interface for working with character vectors containing object names. But such functions would need to reach up the tree of environments, so would violate the “spooky action” principle, and thus I believe are best avoided.


  1. Spooky actions tend to be particularly frustrating to those who teach R, because they have to run scripts from many students, and those scripts can end up doing wacky things to their computers.↩︎

  2. purrr::map() is basically interchangeable with base::lapply() so if you’re more familiar with lapply(), you can mentally substitute it for map() in all the code here.↩︎

  3. However, if you take this approach in multiple places in your code, you’ll need to make sure that you don’t use the same name in multiple loops because assign() will silently overwrite an existing variable. This might not happen commonly but because it’s silent, it will create a bug that is hard to detect.↩︎