Names attribute
Coverage in tidyverse style guide
Existing name-related topics in http://style.tidyverse.org
The names attribute of an object
Here we address how to manage the names attribute of an object. Our initial thinking was motivated by how to handle the column or variable names of a tibble, but is evolving into a name-handling strategy for vectors, in general.
The name repair described below is exposed to users via the .name_repair argument of tibble::tibble(), tibble::as_tibble(), readxl::read_excel(), and, eventually, other packages in the tidyverse. This work was initiated in the tibble package, but is migrating to the vctrs package. Name repair was first introduced in tibble v2.0.0 and this write-up is being rendered with tibble v3.2.1 and vctrs v0.6.4.
These are the kind of names we’re talking about:
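As a small illustration in base R (the objects vec, lst, and df are made up for this sketch), names() reads the same attribute for atomic vectors, lists, and data frames:

```r
# Hypothetical objects, used only for illustration.
vec <- c(a = 1, b = 2)
lst <- list(first = "x", second = TRUE)
df  <- data.frame(id = 1:2, value = c(0.5, 1.5))

names(vec)  # names of an atomic vector
names(lst)  # names of a list
names(df)   # column (variable) names of a data frame

# In each case the names live in the "names" attribute of the object.
vec_names <- attr(vec, "names")
df_names  <- attr(df, "names")
```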
Minimal, unique, universal
We identify three nested levels of naminess that are practically useful:
- Minimal: The names attribute is not NULL. The name of an unnamed element is "" (the empty string) and never NA.
- Unique: No element of names appears more than once. A couple specific names are also forbidden in unique names, such as "" (the empty string).
  - All columns can be accessed by name via df[["name"]] and, more generally, by quoting with backticks: df$`name`, subset(df, select = `name`), and dplyr::select(df, `name`).
- Universal: The names are unique and syntactic.
  - Names work everywhere, without quoting: df$name and lm(name1 ~ name2, data = df) and dplyr::select(df, name) all work.
Below we give more details and describe implementation.
Minimal names
Minimal names exist. The names attribute is not NULL. The name of an unnamed element is "" (the empty string) and never NA.
Consider an unnamed vector, i.e. it has names attribute of NULL.
x <- letters[1:3]
names(x)
#> NULL
This means that the names of x are sometimes a character vector the same length as x, and sometimes NULL. rlang papers over this problem by providing names2(), which always returns a character vector:
rlang::names2(x)
#> [1] "" "" ""
And you can also use this to ensure a vector has minimal names:
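A sketch of that idea in base R follows. The helper minimal_names() is hypothetical, written only for this example; it approximates what rlang::names2() returns and is not part of rlang. The pattern to notice is the assignment names(x) <- ...

```r
# Hypothetical helper: a base R approximation of rlang::names2().
minimal_names <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    return(rep("", length(x)))  # no names attribute at all
  }
  nms[is.na(nms)] <- ""         # an NA name is treated as a missing name
  nms
}

x <- letters[1:3]
names(x) <- minimal_names(x)
names(x)
#> [1] "" "" ""

# Partially named input keeps its real names; the gaps become "".
minimal_names(c(a = 1, 2))
#> [1] "a" ""
```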
Minimal names appear to be a useful baseline requirement, if the names attribute of an object is going to be actively managed. Why? General name handling and repair can be implemented more simply if the baseline strategy guarantees that names(x) returns a character vector of the correct length with no NAs.
This is also a reasonable interpretation of base R’s intent for named vectors, based on the docs for names(), although base R’s implementation/enforcement of this is uneven. From ?names:
The name "" is special: it is used to indicate that there is no name associated with an element of a (atomic or generic) vector. Subscripting by "" will match nothing (not even elements which have no name).
A name can be character NA, but such a name will never be matched and is likely to lead to confusion.
tbl_df objects created by tibble::tibble() and tibble::as_tibble() have variable names that are minimal, at the very least.
Unique names
Unique names meet the requirements for minimal and have no duplicates. In the tidyverse, we go further and repair a few specific names: "" (the empty string), ... (R’s ellipsis or “dots” construct), and ..j where j is a number. They are basically all treated like "", which is always repaired.
Example of unique-ified names:
## original   unique-ified
## ""         ...1
## x          x...2
## ""         ...3
## ...        ...4
## y          y
## x          x...6
This augmented definition of unique has a specific motivation: it ensures that each element can be identified by name, at least when protected by backtick quotes. Literally, all of these work:
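As a hedged illustration, here is a base R data frame (not a tibble) whose names look like the result of unique repair. The data frame and its values are made up for this sketch; the access patterns are the ones listed earlier.

```r
# Hypothetical data frame with names shaped like repaired unique names.
df <- data.frame(1, 2, 3)
names(df) <- c("...1", "x...2", "y")

by_string  <- df[["...1"]]                  # index by name as a string
by_dollar  <- df$`x...2`                    # $ with backtick quoting
by_with    <- with(df, `...1`)              # evaluated inside the data frame
by_subset  <- subset(df, select = `x...2`)  # non-standard evaluation in base R
```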
This has practical significance for variable names inside a data frame, because so many workflows rely on indexing by name. Note that uniqueness refers implicitly to a vector of names.
Let’s explore a few edge cases: A single dot followed by a number, .j, does not need repair.
Two dots followed by a number, ..j, does need repair. The same goes for three dots, ..., the ellipsis or “dots” construct. These can’t function as names, even if quoted with backticks, so they have to be repaired.
df <- tibble(`..1` = "not ok")
#> Error in `tibble()`:
#> ! Column 1 must not have names of the form ... or ..j.
#> Use `.name_repair` to specify repair.
#> Caused by error in `repaired_names()`:
#> ! Names can't be of the form `...` or `..j`.
#> ✖ These names are invalid:
#> * "..1" at location 1.
with(df, `..1`)
#> Error in eval(substitute(expr), data, enclos = parent.frame()): ..1 used in an incorrect context, no ... to look in
dplyr::select(df, `..1`)
#> Error in dot_call(capture_dots, frame_env = frame_env, named = named, : '...' used in an incorrect context
df <- tibble(`...` = "not ok")
#> Error in `tibble()`:
#> ! Column 1 must not have names of the form ... or ..j.
#> Use `.name_repair` to specify repair.
#> Caused by error in `repaired_names()`:
#> ! Names can't be of the form `...` or `..j`.
#> ✖ These names are invalid:
#> * "..." at location 1.
subset(df, select = `...`)
#> Error in eval(expr, envir, enclos): '...' used in an incorrect context
dplyr::select(df, `...`)
#> Error in eval(expr, envir, enclos): '...' used in an incorrect context
Both are repaired as if they were "".
Making names unique
There are many ways to make names unique. We append a suffix of the form ...j to any name that is a duplicate or "" or ..., where j is the position. Why?
- An absolute position j is more helpful than numbering within the elements that share a name. Context: troubleshooting data import with lots of columns and dysfunctional names.
- We hypothesize that it’s better to have a “level playing field” when repairing names, i.e. if foo appears twice, both instances get repaired, not just the second occurrence.
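The rule can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the tibble or vctrs implementation: unique_names_sketch() is a hypothetical function, and it leaves out the suffix stripping that the real repair performs.

```r
# Hypothetical sketch of the suffix rule; not the tibble/vctrs code.
unique_names_sketch <- function(nms) {
  nms[is.na(nms)] <- ""                            # start from minimal names
  nms[grepl("^\\.\\.(\\.|[0-9]+)$", nms)] <- ""    # "..." and "..j" act like ""
  # Every member of a duplicated group is repaired, not just the later ones.
  fix <- nms == "" | duplicated(nms) | duplicated(nms, fromLast = TRUE)
  nms[fix] <- paste0(nms[fix], "...", which(fix))  # j is the absolute position
  nms
}

unique_names_sketch(c("", "x", "", "...", "y", "x"))
#> [1] "...1"  "x...2" "...3"  "...4"  "y"     "x...6"
```

Using which(fix) for the suffix is what makes j an absolute position rather than a counter within each group of duplicates.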
The unique level of naminess is regarded as normative for a tibble and a user must expressly request a tibble with names that violate this (but that is possible).
Base R’s function for this is make.unique(). We revisit the example above, comparing the tidyverse strategy for making names unique vs. what make.unique() does.
## Original   Unique names   Result of
## names      (tidyverse)    make.unique()
## ""         ...1           ""
## x          x...2          x
## ""         ...3           .1
## ...        ...4           ...
## y          y              y
## x          x...6          x.1
Roundtrips
When unique-ifying names, we assume that the input names have been repaired by the same strategy, i.e. that we are consuming dogfood. Therefore, pre-existing suffixes of the form ...j are stripped, prior to (re-)constructing the suffixes. If this interacts poorly with your names, you need to take control of name repair.
Example of re-unique-ified names:
## original    unique-ified
## ...5        ...1
## x           x...2
## x...3       x...3
## ""          ...4
## x...1...5   x...5
JB: it is conceivable that this should be under the control of an argument, e.g. dogfood = TRUE, in the (currently unexported) function that does this.
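To make the roundtrip behaviour concrete, here is a hedged base R sketch. Both helpers are hypothetical and the stripping pattern is our assumption about what "a suffix of the form ...j" means; the actual tibble/vctrs code may differ in detail.

```r
# Hypothetical helpers for illustration; not the tibble/vctrs implementation.
strip_suffix <- function(nms) {
  gsub("(\\.\\.\\.[0-9]+)+$", "", nms)  # drop one or more trailing ...j
}

reunique_sketch <- function(nms) {
  nms[is.na(nms)] <- ""
  nms <- strip_suffix(nms)                         # assume input is dogfood
  nms[grepl("^\\.\\.(\\.|[0-9]+)$", nms)] <- ""    # "..." and "..j" act like ""
  fix <- nms == "" | duplicated(nms) | duplicated(nms, fromLast = TRUE)
  nms[fix] <- paste0(nms[fix], "...", which(fix))
  nms
}

once  <- reunique_sketch(c("...5", "x", "x...3", "", "x...1...5"))
twice <- reunique_sketch(once)
once
#> [1] "...1"  "x...2" "x...3" "...4"  "x...5"
identical(once, twice)
#> [1] TRUE
```

Because old suffixes are removed before new ones are built, repairing already-repaired names is a no-op in this sketch.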
When is minimal better than unique?
Why would you ever want to import a tibble and enforce only minimal names, instead of unique? Sometimes the first row of a data source – allegedly variable names – actually contains data and the resulting tibble will be reshaped with, e.g., tidyr::gather(). In this case, it is better to not munge the names at import. This is a common special case of the “data stored in names” phenomenon.
In general, you may want to tolerate minimal names when the dysfunctional names are just an awkward phase that an object is passing through and a more definitive solution is applied downstream.
Ugly, with a purpose
You might say that names like x...5 are ugly and you would be right. We’re calling this a feature, not a bug! Names that have been automatically unique-ified by the tidyverse should catch the eye and give the user strong encouragement to take charge of the situation.
Why so many dots?
The suffix of ...j, with 3 leading dots, is the result of jointly satisfying multiple requirements. It is important to anticipate a missing name, where the suffix becomes the entire name. We have elected to make the suffix a syntactic name (more below), because non-syntactic names are a frequent cause of unexpected friction for users. This means the suffix can’t be j, .j, or ..j, because all are non-syntactic. It must be ...j.
Why dot(s) in the first place?
The underscore _ was also considered when choosing the suffix strategy, but was rejected. Why? Because syntactic names can’t start with an underscore and we want the suffix itself to be syntactic. Also, the dot . is already used by base R’s make.names() to replace invalid characters. It seems simpler and, therefore, better to use the same character, in the same way, as much as possible in name repair. We use the dot ., we put it at the front, as many times as necessary.
Universal names
Universal names are unique, in the sense described above, and syntactic, in the normal R sense. Universal names are appealing because they play nicely with base R and tidyverse functions that accept unquoted variable names.
Syntactic names
A syntactic name in R:
- Consists of letters, numbers, and the dot . or underscore _ characters.
- Starts with a letter or starts with a dot . followed by anything but a number.
- Is not a reserved word, such as if or function or TRUE.
- Is not ..., R’s special ellipsis or “dots” construct.
- Is not of the form ..j, where j is a number.
See R’s documentation for Reserved words and Quotes, specifically the section on names and identifiers.
A syntactic name can be used “as is” in code. For example, it does not require quoting in order to work with non-standard evaluation, such as list indexing via $, in a formula, or in packages like dplyr and ggplot2.
Note that being syntactic is a property of an individual name.
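One way to test the definition above for a single name is sketched below. The function is_syntactic_sketch() is hypothetical: it leans on base::make.names() for most rules and adds an explicit check for ... and ..j, which make.names() leaves unchanged.

```r
# Hypothetical predicate; a sketch of the rules listed above.
is_syntactic_sketch <- function(x) {
  unchanged <- x == make.names(x)              # characters, start, reserved words
  dots <- grepl("^\\.\\.(\\.|[0-9]+)$", x)     # "..." or "..j"
  unchanged & !dots
}

candidates <- c("abc", "a_b.c", ".x", "_x", ".2x", "if", "...", "..1", "...1")
result <- is_syntactic_sketch(candidates)
candidates[result]
#> [1] "abc"   "a_b.c" ".x"    "...1"
```

The last candidate is the shape of the repair suffix: ...1 passes, while ..1 does not.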
Making an individual name syntactic
There are many ways to fix a non-syntactic name. Here’s how our logic compares to base::make.names() for a single name:
- Same: Definition of what is syntactically valid.
  - Claim: If syn_name is a name that we have made syntactic, then syn_name == make.names(syn_name). If you find a counterexample, tell us!
- Same: An invalid character is replaced with a dot (.).
- Different: We always fix a name by prepending a dot (.). base::make.names() sometimes prefixes with X and at other times appends a dot (.).
  - This means we turn ... into .... and ..j into ...j, where j is a number. base::make.names() does not modify ... or ..j, which could be regarded as a bug (?).
- Different: We treat NA and "" the same: both become a single dot (.). This is because we first make names minimal. base::make.names() turns NA into "NA." and "" into "X".
Examples of the tidyverse approach to making individual names syntactic versus base::make.names():
## Original   Syntactic name   Result of
## name       (tidyverse)      make.names()
## ""         .                X
## NA         .                NA.
## (y)        .y.              X.y.
## _z         ._z              X_z
## .2fa       ..2fa            X.2fa
## FALSE      .FALSE           FALSE.
## ...        ....             ...
## ..3        ...3             ..3
Currently implemented in the unexported function tibble:::make_syntactic().
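The rules above (replace invalid characters with a dot, then prepend dots until the name is syntactic) can be approximated in base R. The function make_syntactic_sketch() below is a hypothetical illustration, not the code of tibble:::make_syntactic(), and it assumes ASCII-only names, whereas make.names() follows the locale's notion of a letter.

```r
# Hypothetical, ASCII-only sketch; not tibble:::make_syntactic().
make_syntactic_sketch <- function(nms) {
  nms[is.na(nms)] <- ""                       # make names minimal first
  out <- gsub("[^A-Za-z0-9._]", ".", nms)     # invalid character -> dot
  needs_fix <- function(x) {
    x != make.names(x) | grepl("^\\.\\.(\\.|[0-9]+)$", x)
  }
  # Always repair by prepending a dot, as many times as necessary.
  while (any(todo <- needs_fix(out))) {
    out[todo] <- paste0(".", out[todo])
  }
  out
}

original <- c("", NA, "(y)", "_z", ".2fa", "FALSE", "...", "..3")
repaired <- make_syntactic_sketch(original)
repaired
#> [1] "."      "."      ".y."    "._z"    "..2fa"  ".FALSE" "...."   "...3"

# The claim above, checked on this example only:
all(repaired == make.names(repaired))
#> [1] TRUE
```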
Why universal?
Now we can state the motivation for universal names, which have the group-wise property of being unique and the element-wise property of being syntactic.
In practice, if you want syntactic names, you probably also want them to be unique. You need both in order to refer to individual elements easily, without ambiguity and without quoting.
Universal names can be requested in the tidyverse via .name_repair = "universal", in functions that expose name repair.
Making names universal
Universal names are implemented as a variation on unique names. Basically, suffixes are stripped and ... is replaced with "". These draft names are transformed with tibble:::make_syntactic() (this step is omitted for unique names). Then ...j suffixes are appended as necessary.
Note that suffix stripping and the substitution of "" for ... happen before the draft names are made syntactic. So, although tibble:::make_syntactic() turns ... into ...., universal or unique name repair will turn ... into something of the form ...j.
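The order of operations described above can be sketched as follows. All helpers are hypothetical, ASCII-only approximations written for this example; in particular, leaving empty draft names out of the syntactic step is our reading of the text, not a statement about the tibble or vctrs source.

```r
# Hypothetical, ASCII-only sketch of the pipeline; not the tibble/vctrs code.
dots_rx <- "^\\.\\.(\\.|[0-9]+)$"   # matches "..." and "..j"

syntactic_sketch <- function(nms) {
  out <- gsub("[^A-Za-z0-9._]", ".", nms)
  needs_fix <- function(x) x != make.names(x) | grepl(dots_rx, x)
  while (any(todo <- needs_fix(out))) out[todo] <- paste0(".", out[todo])
  out
}

universal_sketch <- function(nms) {
  nms[is.na(nms)] <- ""
  nms <- gsub("(\\.\\.\\.[0-9]+)+$", "", nms)      # 1. strip old ...j suffixes
  nms[grepl(dots_rx, nms)] <- ""                   # 2. "..." and "..j" become ""
  empty <- nms == ""
  nms[!empty] <- syntactic_sketch(nms[!empty])     # 3. make drafts syntactic
  fix <- empty | duplicated(nms) | duplicated(nms, fromLast = TRUE)
  nms[fix] <- paste0(nms[fix], "...", which(fix))  # 4. append ...j where needed
  nms
}

universal_sketch(c("x", "x", "a1:", "_x_y}"))
#> [1] "x...1"  "x...2"  "a1."    "._x_y."
universal_sketch(c("...", "if"))
#> [1] "...1" ".if"
```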
Messaging user about name repair
Name repair should be communicated to the user. Here’s how tibble messages:
x <- tibble::tibble(
x = 1, x = 2, `a1:` = 3, `_x_y}` = 4,
.name_repair = "universal"
)
#> New names:
#> • `x` -> `x...1`
#> • `x` -> `x...2`
#> • `a1:` -> `a1.`
#> • `_x_y}` -> `._x_y.`