[feature request] Allow setnames() to skip column names that don't exist #3030

M-YD · 2018-09-04T10:52:52Z

There have been a number of similar posts to this, but nothing that seems to directly address the issue that I am proposing and which I think would be a very useful update.

The issue I am having is that I have implemented setnames() as part of a large and complicated function (500 lines) that I have built and to be presented with an error after having spent quite some time waiting for it to execute is unhelpful and also annoying because it means that the time I had spent waiting for it to complete was for nothing.

What would have been really helpful in this scenario is an if/else-type condition when checking old columns and skipping them if they didn't exist.

Something along the lines of:

if(!(old %in% colnames)){ # Checks if old columns names do not exist in existing column names
    next # Skip to the next item in the list if TRUE
}

The reason I would check it with a ! (NOT) is because the very nature of setnames() checks to see if the names do exist; if so, it continues as normal because that is literally what it is designed for. Therefore, checking for existing names within setnames() is futile because that is what setnames() is doing in the first instance.

The trouble with the current setup of setnames() is that it works on the assumption that I know precisely which columns exist at any given point.

I don't.

I have a good idea, but it isn't always correct because there are literally dozens of potential scenarios that could occur based on numerous variations of how the data I am extracting are gathered.

Approximately 90.00%-95.00% of the time, the columns that I am working with are typically the same ones, but there are those odd occasions (the other 5.00%-10.00%) where this is not the case and all of a sudden, something breaks unexpectedly and I am left trying to figure out a new way to hack another version of this function together to handle this single use case just one time, which is neither a good thing to have to do nor good practice.

All of this could be avoided with a simple check inside setnames() to see if the columns exist or not.

Referring to columns by number might work in some cases, but I think it's neither here nor there because different people will need to refer to columns in different ways; names work best in some scenarios and numbers in others.

In my case, names work best because the columns aren't always in the same order and as such using column numbers could break my data frame and it's possible that I wouldn't know until the end.

Also, in my case if the column doesn't exist then it isn't a problem for me and I am happy to proceed onto the next name in the list. I'm sure that some people would like a warning notification if this happens, and I agree that it is a helpful feature to have, which is why I'm not advocating for it to be removed - not at all. Perhaps a more condensed version of the warning can appear on the fly, or even a summary at the end would be good to have.

An if/else-type condition like this would most certainly make things easier for me, at least, and I'm sure several others would agree.

Update:

@HughParsonage suggested an additional argument and this is precisely what I was thinking of.

Something along the lines of:

setnames(DT, old, new, skip_absent = FALSE)

It would include an additional argument (skip_absent) which is set to FALSE by default and which won't affect that the way that setnames() functions for anybody unless the skip_absent flag is specifically set to TRUE at the time of calling.

The text was updated successfully, but these errors were encountered:

jangorecki · 2018-09-04T10:57:24Z

It would be much easier to follow your request if you would include minimal reproducible example and expected output.
also...

in my case if the column doesn't exist then it isn't a problem for me and I am happy to proceed onto the next name in the list.

I think your case is not very common one.

M-YD · 2018-09-04T11:07:41Z

@jangorecki I can't give a reproducible example because it isn't fitting in this case; the function I have built gathers data from several external backend systems and then several other operations are carried out from there and the parameters used in the function/queries can affect the outcome in several ways, most of which can't be predicted accurately.

The problem I have is how setnames() handles columns that don't exist; I know for certain that I am not the only person to experience this due to the other issues and questions that have been posted online over the last few years.

In terms of an expected outcome, what I want isn't an outcome, per se; my issue revolves more around the behaviour/execution of setnames() when dealing with column names that don't exist.

In that case, my expected outcome is for setnames() to ignore names that don't exist and to proceed onto the next.

In terms of how common my case is, I think it's more common than you realise because people have been posting issues about very similar things for a number of years, the flavour of which is typically focused on wanting setnames() to skip columns that don't exist rather than throwing an error.

franknarf1 · 2018-09-04T20:38:43Z

If your workflow is complicated enough that its behavior cannot be predicted, you can afford a one-liner to define your own setnames, I guess:

setfoundnames = function(x, old, new){ 
  w <- match(old, names(x), nomatch=0); setnames(x, old[w], new[w])
}

This is a reproducible example (illustrating what I assume you're after):

library(data.table)
DT = data.table(a = 1, b = 2, d = 3)
setfoundnames(DT, c("a", "A"), c("z", "Z"))

It is fitting to make an example to be clear about the change in behavior that you're requesting. Within your wall of text, it may be a chore to parse out which parts are describing the request and which are arguing for it.

HughParsonage · 2018-09-05T17:34:13Z

Probably an extra argument to setnames could be sufficient.

setnames(DT, old, new, halt_if_absent = TRUE)

library(data.table)
DT = data.table(a = 1, b = 2, d = 3)
setnames(DT, "A", "B") 
#> Error ...
setnames(DT, "A", "B", halt_if_not_found = FALSE)
#> (Returns DT)

Note that the current behaviour of setnames has saved my bacon a few times. And it has probably made debugging much easier than I appreciate. I've reviewed code that used plyr::rename on non-existent names. The warnings were simply ignored; but when the names were corrected, the conclusion was the opposite of that originally calculated.

I appreciate that it sucks when a long-running script is halted by a seemingly inconsequential error, but it's hard to see that skipping in the absence of a direction to change names is a useful default.

In a similar vein, one could imagine someone writing

DT[, x := y / 1000]

when y does not exist in DT. In the philosophy of 'no errors', we could just leave DT unchanged. It's not difficult to construct an example where this would be disastrous.

M-YD · 2018-09-06T09:19:56Z

@HughParsonage I concur; this is precisely the sort of functionality I had in mind.

setnames(DT, old, new, skip_absent = FALSE)

An additional argument (skip_absent) which is set to FALSE by default would be ideal here because it would mean that the way that setnames() functions won't change for anybody by default and people in my scenario who need this extra functionality can simply set the skip_absent flag to TRUE at the time of calling the function.

MichaelChirico · 2018-09-06T09:31:30Z

I'm really not sold on the utility of such an argument.

IMHO it's an error to try and supply non-existent names.

The workaround only takes one line of code for those, so the convenience factor is next to 0 as well.

M-YD · 2018-09-06T09:53:28Z

@MichaelChirico

IMHO it's an error to try and supply non-existent names.

I disagree because people won't be trying to supply non-existent names deliberately.

In my specific case, all of the names in old could exist, and I would expect them to almost every time, but there are some edge cases where they won't always exist, and it is those times that I need to be able to define within the function that it is okay to automatically skip and move onto the next name in the list.

I appreciate that not everybody will want this, which is why the argument should be set to FALSE by default; as such, it won't affect the majority of users in the way that it currently operates in that it won't skip missing values by default.

Also, I anticipate that there will be lots of people using setnames() for the first time over the coming years and some of those people will encounter this error without knowing what is wrong. When they look at ?setnames they will see the skip_absent argument and also see that it can be set to TRUE. From there, they will be able to make a decision based on their specific needs as to whether or not they want to continue stopping on missing old values or if they want to skip those instances and move onto the next.

Ultimately, I expect that this would provide a vehicle for a much more self-sufficient resolution because the whole investigative process for newcomers can all happen in a shorter space of time compared to having to trawl through Google/StackOverflow/GitHub in search of a solution.

MichaelChirico · 2018-09-06T09:59:35Z

people won't be trying to supply non-existent names deliberately.

R is not really a language like Python where you might try to make errors deliberately...

Adding an argument that will benefit a tiny % of users is not really "for free" as it adds another part of the API for us to potentially maintain.

It might be another thing if the workaround was particularly data.table-flavored; it's not. It's extremely standard base R to use match like in Frank's suggestion.

slin30 · 2018-09-06T11:46:13Z

IMO, the implication of setnames is that the target columns are the reference, and therefore I expect an error if I attempt to set a name on a non-existent column. The possibility that this expectation could be violated, even when requiring a bespoke additional arg to be changed from a hypothetical default, means I would no longer be able to take the current explicit behavior for granted.

I'd much rather the base function remain explicit, which allows me to write a custom version that is less explicit if needed. Especially when, as mentioned by others, doing so is quite trivial in this case.

M-YD · 2018-09-06T12:12:33Z

...means I would no longer be able to take the current explicit behavior for granted.

Of course you could because the default behaviour will remain exactly the same unless the additional argument is specifically changed to TRUE (from a default of FALSE).

In which scenario(s) do you imagine the default FALSE setting being violated?

mattdowle · 2018-09-07T02:11:23Z

It may help that I write down that on this project we've always tried to lean on the side of 'yes'. If a user wants something, especially this strongly, then why not? If 5% or even 1% of users would find it useful, that's good enough for me. Unless it could harm the other 95%, or it's particularly difficult or time-consuming to implement. In this case though, adding skip_absent with a default of FALSE would cause no harm and is easy and quick to implement. The extra match() call Frank proposed, while one extra call, is not quite as nice as a built-in skip_absent. In this case, Mus is claiming others raise the same issue, so that should tilt us to say 'yes' even more.
But Mus - the original post was way too long and set everyone off on the wrong foot, right?
@MusTheDataGuy Are you able to submit a PR? The changes would be to data.table.R, man/setattr.Rd, tests.Rraw and NEWS. More info here: https://github.com/Rdatatable/data.table/wiki/Contributing

M-YD · 2018-09-07T11:10:02Z

@mattdowle I'm glad that you agree and that you see value in this.

Yes, admittedly the original post was too long; I wanted to make a good case is all.

I would be delighted to submit a PR; before doing so I will first take a look at the current codebase to familiarise myself with it.

UweBlock · 2019-03-18T06:56:29Z

For the sake of completeness, there is a question on SO, Rename variables based on values in another dataframe, which was answered using this new feature.

M-YD changed the title ~~Allow setnames() to skip column names that don't exist and move onto the next.~~ Allow setnames() to skip column names that don't exist and move onto the next Sep 4, 2018

M-YD changed the title ~~Allow setnames() to skip column names that don't exist and move onto the next~~ [feature request] Allow setnames() to skip column names that don't exist and move onto the next Sep 7, 2018

mattdowle changed the title ~~[feature request] Allow setnames() to skip column names that don't exist and move onto the next~~ [feature request] Allow setnames() to skip column names that don't exist Sep 7, 2018

M-YD mentioned this issue Oct 17, 2018

Added skip_absent() feature to setnames() #3111

Merged

mattdowle added this to the 1.12.0 milestone Oct 17, 2018

mattdowle closed this as completed in #3111 Nov 14, 2018

MichaelChirico mentioned this issue Nov 9, 2020

Allow setnames to include old columns not in x #1099

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature request] Allow setnames() to skip column names that don't exist #3030

[feature request] Allow setnames() to skip column names that don't exist #3030

M-YD commented Sep 4, 2018 •

edited

Loading

jangorecki commented Sep 4, 2018 •

edited

Loading

M-YD commented Sep 4, 2018 •

edited

Loading

franknarf1 commented Sep 4, 2018

HughParsonage commented Sep 5, 2018

M-YD commented Sep 6, 2018

MichaelChirico commented Sep 6, 2018

M-YD commented Sep 6, 2018 •

edited

Loading

MichaelChirico commented Sep 6, 2018

slin30 commented Sep 6, 2018

M-YD commented Sep 6, 2018

mattdowle commented Sep 7, 2018 •

edited

Loading

M-YD commented Sep 7, 2018

UweBlock commented Mar 18, 2019

[feature request] Allow setnames() to skip column names that don't exist #3030

[feature request] Allow setnames() to skip column names that don't exist #3030

Comments

M-YD commented Sep 4, 2018 • edited Loading

jangorecki commented Sep 4, 2018 • edited Loading

M-YD commented Sep 4, 2018 • edited Loading

franknarf1 commented Sep 4, 2018

HughParsonage commented Sep 5, 2018

M-YD commented Sep 6, 2018

MichaelChirico commented Sep 6, 2018

M-YD commented Sep 6, 2018 • edited Loading

MichaelChirico commented Sep 6, 2018

slin30 commented Sep 6, 2018

M-YD commented Sep 6, 2018

mattdowle commented Sep 7, 2018 • edited Loading

M-YD commented Sep 7, 2018

UweBlock commented Mar 18, 2019

M-YD commented Sep 4, 2018 •

edited

Loading

jangorecki commented Sep 4, 2018 •

edited

Loading

M-YD commented Sep 4, 2018 •

edited

Loading

M-YD commented Sep 6, 2018 •

edited

Loading

mattdowle commented Sep 7, 2018 •

edited

Loading