17  Case study: stringr

This chapter explores some of the considerations designing the stringr package.

17.1 Function names

When the base regular expression functions were written, most R users were familiar with the command line and tools like grepl. This made naming R’s string manipulation functions after these tools seem natural. When I started work on stringr, the majority of R users were not familiar with linux or the command line, so it made more sense to start afresh.

I think there were successes and failures here. On the whole, I think str_replace_all(), str_locate(), and str_detect() are easier to remember than gsub(), regexpr(), and grepl(). However, it’s harder to remember what makes str_subset() and str_which() different. If I was to do stringr again, I would make more of an effort to distinguish between functions that operated on individual matches and individual strings as str_locate() and str_which() seem like their names should be more closely related as str_locate() returns the location of a match within each string in the vector, and str_subset() returns the matching locations within a vector.

17.2 Argument order and names

Base R string functions mostly have pattern as the first argument, with the chief exception being strsplit(). stringr functions always have string as the first argument.

I regret using string; I now think x would be a more appropriate name.

17.3 Selecting a pattern engine

grepl(), has three arguments that take either FALSE or TRUE: ignore.case, perl, fixed, which might suggest that there are 2 ^ 3 = 8 possible options. But fixed = TRUE overrides perl = TRUE, and ignore.case = TRUE only works if fixed = FALSE so there are only 5 valid combinations.

x <- grepl("a", letters, fixed = TRUE, ignore.case = TRUE)
#> Warning in grepl("a", letters, fixed = TRUE, ignore.case = TRUE): argument
#> 'ignore.case = TRUE' will be ignored
x <- grepl("a", letters, fixed = TRUE, perl = TRUE)
#> Warning in grepl("a", letters, fixed = TRUE, perl = TRUE): argument 'perl =
#> TRUE' will be ignored

It’s easier to understand fixed and perl once you realise their combination is used to pick from one of three engines for matching text:

  • The default is POSIX 1003.2 extended regular expressions.
  • perl = TRUE uses Perl-style regular expressions.
  • fixed = TRUE uses fixed matching.

This makes it clear why perl = TRUE and fixed = TRUE isn’t permitted: you’re trying to pick two conflicting engines.

An alternative interface that makes this choice more clear would be to use Chapter 10 and create a new argument called something like engine = c("POSIX", "perl", "fixed"). This also has the nice feature of making it easier to extend in the future. That might look something like this:

grepl(pattern, string, engine = "regex")
grepl(pattern, string, engine = "fixed")
grepl(pattern, string, engine = "perl")

But stringr takes a different approach, because of a problem hinted at in grepl() and friends: ignore.case only works with two of the three engines: POSIX and perl. Additionally, having an engine argument that affects the meaning of the pattern argument is a little unfortunate — that means you have to read the call until you see the engine argument before you can understand precisely what the pattern means.

stringr takes a different approach, encoding the engine as an attribute of the pattern:

x <- str_detect(letters, "a")
# short for:
x <- str_detect(letters, regex("a"))

# Which is where you supply additional arguments
x <- str_detect(letters, regex("a", ignore_case = TRUE))

This has the advantage that each engine can take different arguments. In base R, the only argument of this nature of ignore.case, but stringr’s regex() has arguments like multiline, comments, and dotall which change how some components of the pattern are matched.

Using an engine argument also wouldn’t work in stringr because of the boundary() engine which rather than matching specific patterns uses matches based on boundaries between things like letters or words or sentences.

str_view("This is a sentence.", boundary("word"))
str_view("This is a sentence.", boundary("sentence"))

This is more appealing than creating a separate function for each engine because there are many other functions in the same family as grepl(). If we created grepl_fixed(), we’d also need gsub_fixed(), regexp_fixed() etc.

17.4 str_flatten()

str_flatten() was a relatively recent addition to stringr. It took me a long time to realise that one of the challenges of understanding paste() was that depending on the presence or absence of the collapse argument it could either transform a string (i.e. return something the same length) or summarise a string (i.e. always return a single string).

Once str_flatten() existed it become more clear that it would be useful to have str_flatten_comma() which made it easier to use the Oxford comma (which seems to be something that’s only needed for English, and ironically the Oxford comma is more common in US English than UK English).

17.5 Recycling rules

stringr implements recycling rules so that you can either supply a vector of strings or a vector of patterns:

alphabet <- str_flatten(letters, collapse = "")
vowels <- c("a", "e", "i", "o", "u")
grepl(vowels, alphabet)
#> Warning in grepl(vowels, alphabet): argument 'pattern' has length > 1 and only
#> the first element will be used
#> [1] TRUE
str_detect(alphabet, vowels)
#> [1] TRUE TRUE TRUE TRUE TRUE

On the whole I regret this. It’s generally not that useful (since you typically have more than one string, not more than one pattern), most people don’t use it, and now it feels overly clever.

17.6 Redundant functions

There are a couple of stringr functions that were very useful at the time, but are now less important.

  • nchar(NA) used to return 2, and nchar(factor("abc")) used to return 1. str_length() fixed both of these problems, but those fixes also migrated to base R, leaving str_length() as less useful.
  • paste0() did not exist so str_c() was very useful. But now str_c() primarily only useful for its recycling logic.