Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added skip_absent() feature to setnames() #3111

Merged
merged 27 commits into from
Nov 14, 2018
Merged

Conversation

M-YD
Copy link
Contributor

@M-YD M-YD commented Oct 17, 2018

Closes #3030

Added skip_absent() feature to setnames() so that elements not present in old won't cause the function to exit but will instead skip onto the next element.

skip_absent() is set to FALSE by default and thus will not affect the default behaviour of setnames().

When building this element, I became aware of some issues when building the package on a Windows 10 machine that related to .dll file conflicts, and as such updated .gitignore and .Rbuildignore which rectified the problem:

  • Added *.dll to .gitignore
  • Added ^.*\.dll$ to .Rbuildignore

I initially intended to add the isTrue(skip_absent) check as a standalone element within setnames() but found that it kept failing due to i being NA, which subsequently halted the setnames() function.

From there it was clear that the solution was to incorporate the isTrue(skip_absent) check within the if(anyNA(i)) { ... } check that occurs so that it is called when i is NA (i.e. absent). This in turn indicates that said element is not present, which logically implies that this is the correct place to check for any missing elements of old and to skip accordingly.

I would also like to thank @mattdowle for his assistance throughout this process.

.gitignore Outdated
.DS_Store
.idea
*.sw[op]
inst/tests/winallquoted.csv.bz2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these being ignored? aren't they part of the test suite?

NEWS.md Outdated
**If you are viewing this file on CRAN, please check [latest news on GitHub](https://github.com/Rdatatable/data.table/blob/master/NEWS.md) where the formatting is also better.**

### Changes in v1.11.9 (to be v1.12.0)

#### NEW FEATURES

1. `fread()` can now read a remote compressed file in one step; `fread("https://domain.org/file.csv.bz2")`. The `file=` argument now supports `.gz` and `.bz2` too; i.e. `fread(file="file.csv.gz")` works now where only `fread("file.csv.gz")` worked in 1.11.8.
1. In those cases where you need to rename columns in a `DT` but the columns aren't always known, `setnames()` now contains an additional argument (`skip_absent`) to skip them if they aren't present. For example, if you know that columns `a`, `b` and `d` are present in `DT`, but you don't know if column `c` is or isn't, then you can include `c` in `old` and if it isn't found, `setnames()` will simply skip to the next item of `old` rather than exit the function. **Note: The default behaviour of `setnames()` has not been altered as `skip_absent` is set to `FALSE` by default.** [#3030](https://github.com/Rdatatable/data.table/issues/3030)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add new NEWS items at the bottom

R/data.table.R Outdated
@@ -2532,7 +2531,7 @@ setnames <- function(x,old,new) {
if (!is.character(old)) stop("'old' is type ",typeof(old)," but should be integer, double or character")
if (any(duplicated(old))) stop("Some duplicates exist in 'old': ", paste(old[duplicated(old)],collapse=","))
i = chmatch(old,names(x))
if (anyNA(i)) stop("Items of 'old' not found in column names: ",paste(old[is.na(i)],collapse=","))
if (anyNA(i)){ if (skip_absent == TRUE){ w <- old %chin% names(x); old = old[w]; new = new[w]; i = i[w] } else { stop("Items of 'old' not found in column names: ",paste(old[is.na(i)],collapse=",")) } }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please split the branch logic onto new lines, it's a bit hard to read at the moment

DT <- data.table(a = 1, b = 2, d = 3)
old <- c("a", "b", "c", "d")
new <- c("A", "B", "C", "D")
test(1953, setnames(DT, old, new, skip_absent = TRUE), DT <- data.table(A = 1, B = 2, D = 3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also test potential erroneous behavior/mis-use of skip_absent

Copy link
Member

@mattdowle mattdowle Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree need more tests please. skip_absent should only accept TRUE and FALSE and fail with error on anything else (use stopifnot). There should be a test when TRUE and all are missing, none are missing, two are missing, and when there are missings and the non-missings in old don't correspond to names(DT) in the same order. The DT <- in the y part of test 1953 can be removed too.

@MichaelChirico
Copy link
Member

Thanks for the PR!

@codecov
Copy link

codecov bot commented Oct 17, 2018

Codecov Report

Merging #3111 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3111      +/-   ##
==========================================
+ Coverage   91.95%   91.96%   +<.01%     
==========================================
  Files          61       61              
  Lines       11437    11444       +7     
==========================================
+ Hits        10517    10524       +7     
  Misses        920      920
Impacted Files Coverage Δ
R/data.table.R 92.62% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9935f2d...bf8db81. Read the comment docs.

@MichaelChirico
Copy link
Member

MichaelChirico commented Oct 17, 2018

I'm seeing a big whitespace diff on .gitignore... you're on Windows right? I suspect your editor is inserting \r\n newlines instead of the standard \n... could you try and confirm and, if so, stick to \n? 😅

screen shot 2018-10-17 at 8 55 20 pm

@MichaelChirico
Copy link
Member

Could you remove the new NEWS.html?

R/data.table.R Outdated
@@ -2532,7 +2531,13 @@ setnames <- function(x,old,new) {
if (!is.character(old)) stop("'old' is type ",typeof(old)," but should be integer, double or character")
if (any(duplicated(old))) stop("Some duplicates exist in 'old': ", paste(old[duplicated(old)],collapse=","))
i = chmatch(old,names(x))
if (anyNA(i)) stop("Items of 'old' not found in column names: ",paste(old[is.na(i)],collapse=","))
if (anyNA(i)){ if (skip_absent == TRUE){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either if (skip_absent) or (generally more preferred) if (isTRUE(skip_absent))

R/data.table.R Outdated Show resolved Hide resolved
@mattdowle mattdowle added this to the 1.12.0 milestone Oct 17, 2018
@mattdowle mattdowle changed the title [Closes #3030] Added skip_absent() feature to setnames() Added skip_absent() feature to setnames() Oct 17, 2018
DT <- data.table(a = 1, b = 2, d = 3)
old <- c("a", "b", "c", "d")
new <- c("A", "B", "C", "D")
test(1953, setnames(DT, old, new, skip_absent = TRUE), DT <- data.table(A = 1, B = 2, D = 3))
Copy link
Member

@mattdowle mattdowle Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree need more tests please. skip_absent should only accept TRUE and FALSE and fail with error on anything else (use stopifnot). There should be a test when TRUE and all are missing, none are missing, two are missing, and when there are missings and the non-missings in old don't correspond to names(DT) in the same order. The DT <- in the y part of test 1953 can be removed too.

man/setattr.Rd Outdated Show resolved Hide resolved
man/setattr.Rd Outdated Show resolved Hide resolved
DESCRIPTION Outdated
@@ -12,7 +12,8 @@ Authors@R: c(
person("Eduard","Antonyan", role="ctb"),
person("Markus","Bonsch", role="ctb"),
person("Hugh","Parsonage", role="ctb"),
person("Scott","Ritchie", role="ctb"))
person("Scott","Ritchie", role="ctb"),
person("Mus","Yaramaz-David", role="ctb"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People don't normally add themselves here. There are 53 contributors who have done something, no matter how small: https://github.com/Rdatatable/data.table/graphs/contributors. You would be listed there, and in NEWS. When a contributor has made sustained contributions, or a few big ones, I add them to DESCRIPTION which is displayed on CRAN. The contributors listed in DESCRIPTION are roughly listed at the top of https://github.com/Rdatatable/data.table/graphs/contributors.
At least, this is the way this has been done to date. We could change so that every single contributor is listed in DESCRIPTION if people want to.

Copy link
Contributor Author

@M-YD M-YD Oct 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that something midway between the two would be ideal.

For example, those who make small, non-functional changes (such as correcting typos) wouldn't be included here but those who extend the functionality of data.table in some meaningful and/or useful way should be included.

That way it would prevent this file from becoming too large, whilst still retaining its relevance and meaning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the author field should be limited to those who have contributed code that the authors would not have ordinarily come up with. This PR is an enhancement but its implementation is quite elementary so should not qualify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this addition for now in this PR but have created #3144 to discuss further. If #3144 goes ahead, Mus would be added then along with other names. Trying to be utterly fair to everyone, past and present.

NEWS.md Outdated

4. In those cases where you need to rename columns in a `DT` but the columns aren't always known, `setnames()` now contains an additional argument (`skip_absent`) to skip them if they aren't present. For example, if you know that columns `a`, `b` and `d` are present in `DT`, but you don't know if column `c` is or isn't, then you can include `c` in `old` and if it isn't found, `setnames()` will simply skip to the next item of `old` rather than exit the function. **Note: The default behaviour of `setnames()` has not been altered as `skip_absent` is set to `FALSE` by default.** [#3030](https://github.com/Rdatatable/data.table/issues/3030)

3. In those cases where you need to rename columns in a `DT` but the columns aren't always known, `setnames()` now contains an additional argument (`skip_absent`) to skip them if they aren't present. For example, if you know that columns `a`, `b` and `d` are present in `DT`, but you don't know if column `c` is or isn't, then you can include `c` in `old` and if it isn't found, `setnames()` will simply skip to the next item of `old` rather than exit the function. **Note: The default behaviour of `setnames()` has not been altered as `skip_absent` is set to `FALSE` by default.** [#3030](https://github.com/Rdatatable/data.table/issues/3030)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a duplicate new item here. Please also add "Thanks to @MusTheDataGuy for the suggestion (#3030) and the PR (#3111)." (i.e. item 2 in Contributing guide mentions this.)

@mattdowle
Copy link
Member

Looks great. Just a few minor tweaks and then it can be merged (see comments inline). Thanks!

@M-YD
Copy link
Contributor Author

M-YD commented Oct 18, 2018

I'm seeing a big whitespace diff on .gitignore... you're on Windows right? I suspect your editor is inserting \r\n newlines instead of the standard \n... could you try and confirm and, if so, stick to \n? 😅

screen shot 2018-10-17 at 8 55 20 pm

That's right, I'm running Windows 10. I have updated the file now and will do this for any other affected files also.

For those who are interested, the way I fixed the problem was to change the setting of my code editor (VS Code). It is set to CRLF by default; I have switched it to LF and the problem is now fixed.

@M-YD M-YD closed this Oct 18, 2018
@M-YD M-YD reopened this Oct 18, 2018
R/data.table.R Outdated
@@ -18,7 +17,7 @@ setPackageName("data.table",.global)

.SD = .N = .I = .GRP = .BY = .EACHI = NULL
# These are exported to prevent NOTEs from R CMD check, and checkUsage via compiler.
# But also exporting them makes it clear (to users and other packages) that data.table uses these as symbols.
# But also exporting them makes it clear (to users and other packages) that data.table uses these as symbols.`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete.

@mattdowle mattdowle merged commit 41b63ea into Rdatatable:master Nov 14, 2018
@Atrebas
Copy link

Atrebas commented Apr 5, 2019

Minor remark, in ?setnames, the comments on the lines below are inaccurate:

setnames(DT,"b","B")               # by name, no match() needed (warning if "b" is missing)
setnames(DT,c("a","E"),c("A","F")) # multiple by name (warning if either "a" or "E" is missing)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants