Extract strategies into objects

What’s the problem?

Sometimes different strategies need different arguments. In this case, instead of using an enum, you’ll need to use richer objects capable of storing optional values as well as the strategy name.

This pattern is similar to combining the enumerations of Chapter 10 with the options objects of Chapter 11.

What are some examples?

  • grepl() has Boolean perl and fixed arguments, but you’re not really toggling two independent settings: you’re picking one of three regular expression engines (the default, the engine used by Perl, and fixed matches). Additionally, the ignore.case argument only applies to two of the strategies.

    In stringr, however, you use helper functions like regex() and fixed() to wrap around the pattern, and supply optional arguments that only apply to that strategy.

  • ggplot2::geom_histogram() has three main strategies for defining the bins: you can supply the number of bins, the width of each bin (the binwidth), or the exact breaks. But it’s currently difficult to derive this from the function specification, and there are complex argument dependencies (e.g. you can only supply one of boundary and center, and neither applies if you use breaks).

  • dplyr::left_join() uses an advanced form of this pattern where the different strategies for joining two data frames together are expressed in a mini-DSL provided by dplyr::join_by().

How do you use the pattern?

In more complicated cases, different strategies will require different arguments, so you’ll need a bit more infrastructure. The basic idea is to build on the options object described in Chapter 11, but instead of providing just one helper function, you’ll provide one function per strategy. This is the way stringr works: you can select a different matching engine by wrapping the pattern in one of regex(), boundary(), coll(), or fixed(). We’ll explore how stringr ended up with this design, and how you can implement something similar yourself, by looking at the base regular expression functions.

Selecting a pattern engine

The basic regular expression functions (grep(), grepl(), sub(), gsub(), regexpr(), gregexpr(), regexec(), and gregexec()) all have fixed and perl arguments that allow you to select the regular expression engine that’s used:

  • perl = FALSE, fixed = FALSE, the default, uses POSIX 1003.2 extended regular expressions.
  • perl = TRUE, fixed = FALSE uses Perl-style regular expressions.
  • perl = FALSE, fixed = TRUE uses fixed matching.
  • perl = TRUE, fixed = TRUE is not a meaningful combination: R ignores perl = TRUE and issues a warning.
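To see how the three engines differ in practice, here is a small base R sketch. The input vector and the patterns are made up for illustration; only grepl() and its documented perl and fixed arguments are used.

```r
x <- c("a.c", "abc", "ABC")

# Default engine (POSIX extended regular expressions): "." matches any character.
posix_hits <- grepl("a.c", x)

# Perl-style engine: enables Perl-only syntax such as lookaheads.
perl_hits <- grepl("a(?=b)", x, perl = TRUE)

# Fixed matching: "." is a literal dot, so only the exact substring "a.c" matches.
fixed_hits <- grepl("a.c", x, fixed = TRUE)
```

The same pattern string, "a.c", matches different elements depending on two Boolean flags that appear after it in the call.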

You could make this choice clearer by using an enumeration (Chapter 10), maybe something like engine = c("regex", "perl", "fixed"), where "regex" stands for the default POSIX engine. That might look something like this:

grepl(pattern, string, engine = "regex")
grepl(pattern, string, engine = "fixed")
grepl(pattern, string, engine = "perl")

But there’s an additional argument that throws a spanner in the works: ignore.case = TRUE only works with two of the three engines, POSIX and perl. Additionally, it’s a bit unfortunate that the engine argument, which is likely to come later in the call, affects the pattern, the first argument. That means you have to read the call until you see the engine argument before you can understand precisely what the pattern means.
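To make that argument dependency concrete, here is a minimal sketch of what an enum-based wrapper might have to do. grepl_engine() is a hypothetical name, not a base R function, and the error message is my own wording.

```r
# Hypothetical enum-based interface; grepl_engine() is not part of base R.
grepl_engine <- function(pattern, x,
                         engine = c("regex", "perl", "fixed"),
                         ignore.case = FALSE) {
  engine <- match.arg(engine)

  # The dependency between the two arguments has to be checked by hand.
  if (engine == "fixed" && ignore.case) {
    stop("`ignore.case = TRUE` is not supported when `engine = \"fixed\"`.",
         call. = FALSE)
  }

  grepl(
    pattern, x,
    ignore.case = ignore.case,
    perl = engine == "perl",
    fixed = engine == "fixed"
  )
}

grepl_engine("A.C", "abc", engine = "regex", ignore.case = TRUE)
grepl_engine("a.c", "abc", engine = "fixed")
```

The check works, but nothing in the function signature tells the reader that ignore.case and engine interact; they have to find out from the documentation or from the error.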

An alternative approach, as used by stringr, is to provide some helper functions that encode the engine as an attribute of the pattern:

grepl(regex(pattern), string)
grepl(fixed(pattern), string)
grepl(perl(pattern), string)

And because these are separate functions, they can take different arguments:

regex <- function(pattern, ignore.case = FALSE) {}
perl <- function(pattern, ignore.case = FALSE) {}
fixed <- function(pattern) {}

This gives a very flexible interface which is particularly nice in stringr because it means there’s an easy way to support boundary matching, which doesn’t even take a pattern:

library(stringr)
str_view("This is a sentence.", boundary("word"))
#> [1] │ <This> <is> <a> <sentence>.
str_view("This is a sentence.", boundary("sentence"))
#> [1] │ <This is a sentence.>

Implementation

Let’s flesh this interface out into an implementation. First we write the pattern engine wrappers. These need to return an object that has the name of the engine, the pattern, and any other arguments:

regex <- function(pattern, ignore.case = FALSE) {
  list(pattern = pattern, engine = "regex", ignore.case = ignore.case)
}
perl <- function(pattern, ignore.case = FALSE) {
  list(pattern = pattern, engine = "perl", ignore.case = ignore.case)
}
fixed <- function(pattern) {
  list(pattern = pattern, engine = "fixed")
}

Then you could create a new grepl() variant that might look something like this:

my_grepl <- function(pattern, x, useBytes = FALSE) {
  switch(pattern$engine, 
    regex = grepl(pattern$pattern, x, ignore.case = pattern$ignore.case, useBytes = useBytes),
    perl = grepl(pattern$pattern, x, perl = TRUE, ignore.case = pattern$ignore.case, useBytes = useBytes),
    fixed = grepl(pattern$pattern, x, fixed = TRUE, useBytes = useBytes)
  )
}

Or, if you wanted to make it clearer how the engines differ, you could factor the repeated code out into a helper function:

my_grepl <- function(pattern, x, useBytes = FALSE) {
  grepl_wrapper <- function(...) {
    grepl(pattern$pattern, x, ..., useBytes = useBytes)
  }
  
  switch(pattern$engine, 
    regex = grepl_wrapper(ignore.case = pattern$ignore.case),
    perl = grepl_wrapper(perl = TRUE, ignore.case = pattern$ignore.case),
    fixed = grepl_wrapper(fixed = TRUE)
  )
}

Here I’m just wrapping around the existing grepl() implementation because I don’t want to go into the details of its implementation; for your own code you’d probably inline the implementation.

I particularly like the switch pattern here and in stringr because it keeps the function calls close together, which makes it easier to keep them in sync. You could also implement the same strategy using if or S7 generic functions, depending on your needs.
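As a rough illustration of the generic-function alternative, here is a sketch that swaps in base R’s S3 system rather than S7, so that it runs without extra packages. The names pat_regex(), pat_perl(), pat_fixed(), new_pat(), and my_detect() are all hypothetical and chosen only for this example.

```r
# Hypothetical constructors: the engine is encoded in the S3 class
# instead of an `engine` field.
new_pat <- function(pattern, cls, ...) {
  structure(list(pattern = pattern, ...), class = cls)
}
pat_regex <- function(pattern, ignore.case = FALSE) {
  new_pat(pattern, "pat_regex", ignore.case = ignore.case)
}
pat_perl <- function(pattern, ignore.case = FALSE) {
  new_pat(pattern, "pat_perl", ignore.case = ignore.case)
}
pat_fixed <- function(pattern) {
  new_pat(pattern, "pat_fixed")
}

# One generic, one method per engine.
my_detect <- function(pattern, x, ...) UseMethod("my_detect")

my_detect.pat_regex <- function(pattern, x, ...) {
  grepl(pattern$pattern, x, ignore.case = pattern$ignore.case)
}
my_detect.pat_perl <- function(pattern, x, ...) {
  grepl(pattern$pattern, x, perl = TRUE, ignore.case = pattern$ignore.case)
}
my_detect.pat_fixed <- function(pattern, x, ...) {
  grepl(pattern$pattern, x, fixed = TRUE)
}
my_detect.default <- function(pattern, x, ...) {
  stop("`pattern` must be created by pat_regex(), pat_perl(), or pat_fixed().",
       call. = FALSE)
}

my_detect(pat_fixed("a.c"), c("a.c", "abc"))
```

Compared with switch(), the methods can live far apart, which makes it easier to add an engine later but harder to see at a glance how the engines differ.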

This implementation is a sketch that gives you the basic ideas. For a real implementation you’d also need to consider:

  • Are fixed(), perl(), and regex() the right names? Would it be useful to give them a common prefix?
  • It would be better for the engines to return an S7 object instead of a list, so we could provide a print method to make them display more nicely.
  • my_grepl() needs some error checking to ensure that pattern is generated by one of the engines, and probably should have a default path to handle bare character vectors as regular expressions (the current default).
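Here is one way those three points might be addressed together. This is a sketch under my own assumptions, not how stringr does it: the engine_ prefix, the "pattern_engine" class, the print format, and the error message are all hypothetical, and it uses a base R S3 class rather than S7 to stay dependency-free.

```r
# Hypothetical constructor shared by all engines (common `engine_` prefix).
new_engine <- function(pattern, engine, ...) {
  stopifnot(is.character(pattern))
  structure(
    list(pattern = pattern, engine = engine, ...),
    class = "pattern_engine"
  )
}
engine_regex <- function(pattern, ignore.case = FALSE) {
  new_engine(pattern, "regex", ignore.case = ignore.case)
}
engine_perl <- function(pattern, ignore.case = FALSE) {
  new_engine(pattern, "perl", ignore.case = ignore.case)
}
engine_fixed <- function(pattern) {
  new_engine(pattern, "fixed")
}

# A compact print method so the objects don't display as raw lists.
print.pattern_engine <- function(x, ...) {
  cat("<", x$engine, "> ", paste(x$pattern, collapse = ", "), "\n", sep = "")
  invisible(x)
}

my_grepl <- function(pattern, x, useBytes = FALSE) {
  # Default path: treat a bare character vector as a regular expression.
  if (is.character(pattern)) {
    pattern <- engine_regex(pattern)
  }
  # Error checking: anything else must come from one of the engine helpers.
  if (!inherits(pattern, "pattern_engine")) {
    stop(
      "`pattern` must be a character vector or created by ",
      "engine_regex(), engine_perl(), or engine_fixed().",
      call. = FALSE
    )
  }

  switch(pattern$engine,
    regex = grepl(pattern$pattern, x, ignore.case = pattern$ignore.case, useBytes = useBytes),
    perl = grepl(pattern$pattern, x, perl = TRUE, ignore.case = pattern$ignore.case, useBytes = useBytes),
    fixed = grepl(pattern$pattern, x, fixed = TRUE, useBytes = useBytes)
  )
}

my_grepl("a.c", "abc")               # bare string, handled as a regex
my_grepl(engine_fixed("a.c"), "abc") # literal match
```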

You can see these details worked out in the stringr package if you look at the source code, particularly that of fixed(), type(), opts(), and then str_detect().

How do I remediate past problems?

Changing from a complex dependency of individual arguments to a stra