Put the most important arguments first

What’s the pattern?

In a function call, the most important arguments should come first. As a general rule, the most important arguments will be the ones that are used most often, but that’s often hard to tell until your function has existed in the wild for a while. Fortunately, there are a few rules of thumb that can help:

  • If the output is a transformation of an input (e.g. log(), stringr::str_replace(), dplyr::left_join()) then that argument the most important.
  • Other arguments that determine the type or shape of the output are typically very important.
  • Optional arguments (i.e. arguments with a default) are the least important, and should come last.

This convention makes it easy to understand the structure of a function at a glance: the more important an argument is, the earlier you’ll see it. When the output is very strongly tied to an input, putting that argument first also ensures that your function works well with the pipe, leading to code that focuses on the transformations rather than the object being transformed.

What are some examples?

The vast majority of functions get this right, so we’ll pick on a few examples which I think get it wrong:

  • I think the arguments to base R string functions (grepl(), gsub(), etc) are in the wrong order because they consistently make the regular expression (pattern) the first argument, rather than the character vector being manipulated (x).

  • The first two arguments to lm() are formula and data. I’d argue that data should be the first argument; while it doesn’t affect the shape of the output which is always an lm S3 object, it does affect the shape of many important functions like predict(). However, the designers of lm() wanted data to be optional, so you could still fit models even if you hadn’t collected the individual variables into a data frame. Because formula is required and data is not, this means that formula had to come first.

  • The first two arguments to ggplot() are data and mapping. Both data and mapping are required for every plot, so why make data first? I picked this ordering because in most plots there’s one dataset shared across all layers and only the mapping changes.

    On the other hand, the layer functions, like geom_point(), flip the order of these arguments because in an individual layer you’re more likely to specify mapping than data, and in many cases if you do specify data you’ll want mapping as well. This makes these the argument order inconsistent with ggplot(), but overall supports the most common use cases.

  • ggplot2 functions work by creating an object that is then added on to a plot, so the plot, which is really the most important argument, is not obvious at all. ggplot2 works this way in part because it was written before the pipe was discovered, and the best way I came up to define plots from left to right was to rely on + (so-called operator overloading). As an interesting historical fact, ggplot (the precursor to ggplot2) actually works great with the pipe, and a couple of years ago I bought it back to life as ggplot1.

How do I remediate past mistakes?

Generally, it is not possible to change the order of the first few arguments because it will break existing code (since these are the arguments that are mostly likely to be used unnamed). This means that the only real solution is to dperecate the entire function and replace it with a new one. Because this is invasive to the user, it’s best to do sparingly: if the mistake is minor, you’re better off waiting until you’ve collected other problems before fixing it. For example, take tidyr::gather(). It has a number of problems with its design, including the argument order, that makes it harder to use. Because it wasn’t possible to easily fix this mistake, we accumulated other gather() problems for several years before fixing them all at once in pivot_longer().

See also

  • Chapter 8: If the function uses , it should come in between the required and optional arguments.