Here's a simple reproducible example that gets to the point quickly. Let's start with a very simple data.table:
library(data.table)
DT <- data.table(
  col1 = c(1, 1, 1),
  col2 = c("a", "b", "a"),
  col3 = c("A", "B", "A"),
  col4 = c(2, 2, 2)
)
print(DT)
col1 col2 col3 col4
1: 1 a A 2
2: 1 b B 2
3: 1 a A 2
Note that rows 1 & 3 are identical, with a differing row (2) between them. This "interrupting" row is key to the bug that follows.
If we run a simple grouping on columns 1 through 4 using two different syntaxes, we get the same (correct) result:
DT[, .N, by = c("col1", "col2", "col3", "col4")]
col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1
DT[, .N, by = col1:col4]
col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1
Now, let's set a key, using columns 1 & 4, and re-run the above grouping commands:
setkey(DT, col1, col4)
key(DT)
[1] "col1" "col4"
DT[, .N, by = c("col1", "col2", "col3", "col4")]
col1 col2 col3 col4 N
1: 1 a A 2 2
2: 1 b B 2 1
DT[, .N, by = col1:col4]
col1 col2 col3 col4 N
1: 1 a A 2 1
2: 1 b B 2 1
3: 1 a A 2 1
Notice that the "by = col1:col4" now produces a different result.
Removing the key -- or setting some key other than ("col1", "col4") -- restores the correct results for both syntaxes. (Not shown)
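For anyone who wants to check the "not shown" part, here is a minimal sketch of the key-removal test. It assumes `setkey(DT, NULL)` is used to drop the key; the variable names `by_names` and `by_range` are just illustrative, and the comparison is printed rather than assumed, since the result may depend on the data.table version.

```r
library(data.table)

DT <- data.table(
  col1 = c(1, 1, 1),
  col2 = c("a", "b", "a"),
  col3 = c("A", "B", "A"),
  col4 = c(2, 2, 2)
)
setkey(DT, col1, col4)

# Drop the key again; the rows themselves are left as they are.
setkey(DT, NULL)
stopifnot(is.null(key(DT)))

# Group with both syntaxes on the now-unkeyed table.
by_names <- DT[, .N, by = c("col1", "col2", "col3", "col4")]
by_range <- DT[, .N, by = col1:col4]

# Report whether the two syntaxes agree once the key is gone.
print(all.equal(by_names, by_range))
```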
It's as though the presence of the key ("col1", "col4") induces the "by = col1:col4" syntax to assume that the data.table is already sorted by (col1, col2, col3, col4). As a result, the intervening row (2) causes the grouping to miss the later matching row.
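A quick way to see why that assumption would be wrong for this table: the key only covers (col1, col4), and the rows are not in sorted order across all four columns. This is only a sketch supporting the guess above, not a statement about data.table internals; `ord_all` and `sorted_by_all` are names made up for the check.

```r
library(data.table)

DT <- data.table(
  col1 = c(1, 1, 1),
  col2 = c("a", "b", "a"),
  col3 = c("A", "B", "A"),
  col4 = c(2, 2, 2)
)
setkey(DT, col1, col4)

# Row order that a full sort on all four columns would produce.
ord_all <- order(DT$col1, DT$col2, DT$col3, DT$col4)

# TRUE only if the table is already sorted by all four columns.
sorted_by_all <- identical(ord_all, seq_len(nrow(DT)))

print(key(DT))
print(sorted_by_all)
```

Since col1 and col4 are constant here, setting the key does not move any rows, so the "b" row stays between the two "a" rows.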
So far, I have noticed this bug in only one case: when the key is ("colB", "colG") and the same two columns are named as endpoints in the by ":" syntax ("by = colB:colG").
FWIW, today is my first time ever using GitHub, so please forgive me if I've missed something. (I joined today so that I could report what I noticed.) I searched the NEWS, the development version, open issues, and Stack Overflow, but I found nothing similar. Perhaps I don't know the correct search terms, or perhaps this is an edge case.
As a mitigation for now, I've resorted to using only the by = c("colA", "colB", ...) syntax. That's a pity, because the colB:colG syntax is very convenient for ad-hoc analysis, which is a good share of my daily work.
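To keep some of the convenience of the range syntax while sticking to the character-vector form, something like the following could work. `cols_between()` is a hypothetical helper written for this report, not part of data.table; it just expands two endpoint names into the column names between them.

```r
library(data.table)

DT <- data.table(
  col1 = c(1, 1, 1),
  col2 = c("a", "b", "a"),
  col3 = c("A", "B", "A"),
  col4 = c(2, 2, 2)
)
setkey(DT, col1, col4)

# Hypothetical helper: names of all columns from `from` to `to`, inclusive.
cols_between <- function(dt, from, to) {
  nm <- names(dt)
  i <- match(from, nm)
  j <- match(to, nm)
  stopifnot(!is.na(i), !is.na(j))  # both endpoints must exist
  nm[i:j]
}

grp_cols <- cols_between(DT, "col1", "col4")

# Group by the expanded character vector instead of col1:col4.
res <- DT[, .N, by = grp_cols]
print(res)
```

This relies on `by` accepting a variable that holds a character vector of column names, so `grp_cols` must not itself be the name of a column in the table.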
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS