Extract strategies into objects
What’s the problem?
Sometimes different strategies need different arguments. In this case, instead of using an enum, you’ll need to use richer objects capable of storing optional values as well as the strategy name.
This pattern is similar to combining the options object of Chapter 11 with the enumeration of Chapter 10.
What are some examples?
- grepl() has Boolean perl and fixed arguments, but you’re not really toggling two independent settings, you’re picking from one of three regular expression engines (the default, the engine used by Perl, and fixed matches). Additionally, the ignore.case argument only applies to two of the strategies.

  In stringr, however, you use helper functions like regex() and fixed() to wrap around the pattern, and supply optional arguments that only apply to that strategy.
- ggplot2::geom_histogram() has three main strategies for defining the bins: you can supply the number of bins, the width of each bin (the binwidth), or the exact breaks. But it’s currently difficult to derive this from the function specification, and there are complex argument dependencies (e.g. you can only supply one of boundary and center, and neither applies if you use breaks).
- dplyr::left_join() uses an advanced form of this pattern where the different strategies for joining two data frames together are expressed in a mini-DSL provided by dplyr::join_by().
How do you use the pattern?
In more complicated cases, different strategies will require different arguments, so you’ll need a bit more infrastructure. The basic idea is to build on the options object described in Chapter 11, but instead of providing just one helper function, you’ll provide one function per strategy. This is the way stringr works: you can select a different matching engine by wrapping the pattern in one of regex(), boundary(), coll(), or fixed(). We’ll explore how stringr ended up with this design and how you can implement something similar yourself by looking at the base regular expression functions.
Selecting a pattern engine
The basic regular expression functions (grep(), grepl(), sub(), gsub(), regexpr(), gregexpr(), regexec(), and gregexec()) all have fixed and perl arguments that allow you to select the regular expression engine that’s used:
- perl = FALSE, fixed = FALSE, the default, uses POSIX 1003.2 extended regular expressions.
- perl = TRUE, fixed = FALSE uses Perl-style regular expressions.
- perl = FALSE, fixed = TRUE uses fixed matching.
- perl = TRUE, fixed = TRUE is not a meaningful combination: R ignores perl = TRUE and issues a warning.
You could make this choice clearer by using an enumeration (Chapter 10), maybe something like engine = c("POSIX", "perl", "fixed"). That might look something like this:
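As a minimal sketch, assume a hypothetical wrapper called grepl_engine() (neither it nor an engine argument exists in base R) that translates the enumeration back into the existing perl and fixed flags:

```r
# Hypothetical wrapper: `grepl_engine()` and its `engine` argument are
# illustrative names, not part of base R.
grepl_engine <- function(pattern, x, engine = c("POSIX", "perl", "fixed")) {
  # match.arg() picks the first value by default and rejects unknown engines
  engine <- match.arg(engine)
  grepl(pattern, x, perl = engine == "perl", fixed = engine == "fixed")
}

x <- c("abc", "a.c")
posix_res <- grepl_engine("a.c", x)                   # "." matches any character
fixed_res <- grepl_engine("a.c", x, engine = "fixed") # "." is a literal dot
```

Because the engine is a single argument, the impossible combination of perl = TRUE and fixed = TRUE can no longer be expressed.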
But there’s an additional argument that throws a spanner in the works: ignore.case = TRUE only works with two of the three engines: POSIX and perl. Additionally, it’s a bit unfortunate that the engine argument, which is likely to come later in the call, affects the pattern, the first argument. That means you have to read the call until you see the engine argument before you can understand precisely what the pattern means.
An alternative approach, as used by stringr, is to provide some helper functions that encode the engine as an attribute of the pattern:
And because these are separate functions, they can take different arguments:
regex <- function(pattern, ignore.case = FALSE) {}
perl <- function(pattern, ignore.case = FALSE) {}
fixed <- function(pattern) {}
This gives a very flexible interface which is particularly nice in stringr because it means there’s an easy way to support boundary matching, which doesn’t even take a pattern:
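For instance, stringr’s boundary() takes the type of boundary rather than a pattern. A small illustration, assuming stringr is installed (the input string is made up for the example):

```r
# Guarded so the snippet is a no-op when stringr is not available.
if (requireNamespace("stringr", quietly = TRUE)) {
  # boundary("word") asks for word boundaries; there is no pattern to supply
  words <- stringr::str_split(
    "This is a sentence.",
    stringr::boundary("word")
  )[[1]]
  print(words)
}
```

The helper occupies the same argument slot as a pattern, so the calling function needs no extra arguments to support it.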
Implementation
Let’s flesh this interface out into an implementation. First, we flesh out the pattern engine wrappers. These need to return an object that has the name of the engine, the pattern, and any other arguments:
regex <- function(pattern, ignore.case = FALSE) {
  list(pattern = pattern, engine = "regex", ignore.case = ignore.case)
}
perl <- function(pattern, ignore.case = FALSE) {
  list(pattern = pattern, engine = "perl", ignore.case = ignore.case)
}
fixed <- function(pattern) {
  list(pattern = pattern, engine = "fixed")
}
Then you could create a new grepl() variant that might look something like this:
my_grepl <- function(pattern, x, useBytes = FALSE) {
  switch(pattern$engine,
    regex = grepl(pattern$pattern, x, ignore.case = pattern$ignore.case, useBytes = useBytes),
    perl = grepl(pattern$pattern, x, perl = TRUE, ignore.case = pattern$ignore.case, useBytes = useBytes),
    fixed = grepl(pattern$pattern, x, fixed = TRUE, useBytes = useBytes)
  )
}
Or if you wanted to make it clearer how the engines differ, you could factor the repeated code out into a helper function:
my_grepl <- function(pattern, x, useBytes = FALSE) {
  grepl_wrapper <- function(...) {
    grepl(pattern$pattern, x, ..., useBytes = useBytes)
  }
  switch(pattern$engine,
    regex = grepl_wrapper(ignore.case = pattern$ignore.case),
    perl = grepl_wrapper(perl = TRUE, ignore.case = pattern$ignore.case),
    fixed = grepl_wrapper(fixed = TRUE)
  )
}
Here I’m just wrapping around the existing grepl() implementation because I don’t want to go into the details of its implementation; for your own code you’d probably inline the implementation.
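To see the pieces working together, here is a self-contained sketch that repeats the wrappers and my_grepl() from above and runs them on a small input vector (the vector and the result names are made up for illustration):

```r
# Engine wrappers: each returns a list naming the engine plus its options.
regex <- function(pattern, ignore.case = FALSE) {
  list(pattern = pattern, engine = "regex", ignore.case = ignore.case)
}
perl <- function(pattern, ignore.case = FALSE) {
  list(pattern = pattern, engine = "perl", ignore.case = ignore.case)
}
fixed <- function(pattern) {
  list(pattern = pattern, engine = "fixed")
}

# Dispatch on the engine name stored in the pattern object.
my_grepl <- function(pattern, x, useBytes = FALSE) {
  grepl_wrapper <- function(...) {
    grepl(pattern$pattern, x, ..., useBytes = useBytes)
  }
  switch(pattern$engine,
    regex = grepl_wrapper(ignore.case = pattern$ignore.case),
    perl = grepl_wrapper(perl = TRUE, ignore.case = pattern$ignore.case),
    fixed = grepl_wrapper(fixed = TRUE)
  )
}

x <- c("abc", "a.c", "ABC")
res_regex <- my_grepl(regex("a.c"), x)                     # TRUE TRUE FALSE
res_icase <- my_grepl(regex("a.c", ignore.case = TRUE), x) # TRUE TRUE TRUE
res_fixed <- my_grepl(fixed("a.c"), x)                     # FALSE TRUE FALSE
res_perl  <- my_grepl(perl("a\\.c"), x)                    # FALSE TRUE FALSE
```

Note that fixed() has no ignore.case argument at all, so the combination that base grepl() silently ignores simply can’t be written.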
I particularly like the switch pattern here and in stringr because it keeps the function calls close together, which makes it easier to keep them in sync. You could also implement the same strategy using if or S7 generic functions, depending on your needs.
This implementation is a sketch that gives you the basic ideas. For a real implementation you’d also need to consider:
- Are fixed(), perl(), and regex() the right names? Would it be useful to give them a common prefix?
- It would be better for the engines to return an S7 object instead of a list, so we could provide a print method to make them display more nicely.
- grepl() needs some error checking to ensure that pattern is generated by one of the engines, and probably should have a default path to handle bare character vectors as regular expressions (the current default).
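One way these considerations might play out is sketched below. To stay dependency-free it uses a plain S3 class rather than S7, and the engine_ prefix, the pattern_engine class name, and the new_engine() constructor are all hypothetical names chosen for illustration:

```r
# Hypothetical shared constructor: tags every engine object with one class.
new_engine <- function(pattern, engine, ...) {
  stopifnot(is.character(pattern))
  structure(
    list(pattern = pattern, engine = engine, ...),
    class = "pattern_engine"
  )
}

# A common prefix groups the helpers together in autocomplete.
engine_regex <- function(pattern, ignore.case = FALSE) {
  new_engine(pattern, "regex", ignore.case = ignore.case)
}
engine_perl <- function(pattern, ignore.case = FALSE) {
  new_engine(pattern, "perl", ignore.case = ignore.case)
}
engine_fixed <- function(pattern) {
  new_engine(pattern, "fixed")
}

# Print method so engine objects display compactly instead of as raw lists.
print.pattern_engine <- function(x, ...) {
  cat("<", x$engine, "> ", paste(x$pattern, collapse = ", "), "\n", sep = "")
  invisible(x)
}

my_grepl <- function(pattern, x, useBytes = FALSE) {
  if (is.character(pattern)) {
    # Default path: treat a bare character vector as a regular expression.
    pattern <- engine_regex(pattern)
  } else if (!inherits(pattern, "pattern_engine")) {
    stop("`pattern` must be a string or the result of an engine_*() helper.", call. = FALSE)
  }
  switch(pattern$engine,
    regex = grepl(pattern$pattern, x, ignore.case = pattern$ignore.case, useBytes = useBytes),
    perl = grepl(pattern$pattern, x, perl = TRUE, ignore.case = pattern$ignore.case, useBytes = useBytes),
    fixed = grepl(pattern$pattern, x, fixed = TRUE, useBytes = useBytes)
  )
}
```

With this in place, my_grepl("a.c", x) and my_grepl(engine_regex("a.c"), x) behave identically, while passing anything else (say, a number) fails early with a clear message.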
You can see these details worked out in the stringr package if you look at the source code, particularly that of fixed(), type(), opts(), then str_detect().
How do I remediate past problems?
Changing from a complex dependency of individual arguments to a stra