Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inaccurate sum of integer64 returned using by clause #1647

Closed
mlandry22-h2o opened this issue Apr 13, 2016 · 7 comments · Fixed by #3737
Closed

Inaccurate sum of integer64 returned using by clause #1647

mlandry22-h2o opened this issue Apr 13, 2016 · 7 comments · Fixed by #3737
Assignees
Labels
bit64 bug GForce issues relating to optimized grouping calculations (GForce)
Milestone

Comments

@mlandry22-h2o
Copy link

Attached is a data.table with an integer64 column. Data.table returns the correct sum when the by clause is not used. However, when the by clause is used, the result is incorrect. The sum returned is interestingly the sum of the results except the final row.

Related scripts:

> setnames(DT,c("key","value"))
> DT
               key   value
1: 7355 1100047023 -168067
2: 7355 1100047023  168067
3: 7355 1100047023  141091
4: 7355 1100047023 -141091
> DT[,sum(value)]
integer64
[1] 0
> DT[,sum(value),by=key]
               key      V1
1: 7355 1100047023 -141091
> save(DT,file="integer64_sum_with_group_example.Rdata")

integer64_sum_with_group_example.zip

@mattdowle mattdowle added the bug label Apr 14, 2016
@mattdowle mattdowle added this to the v1.9.8 milestone Apr 14, 2016
@mattdowle
Copy link
Member

Yes confirmed. GForce needs to be integer64 aware. Thanks for reporting!

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Jul 21, 2016
@stanmart
Copy link

stanmart commented Nov 2, 2017

Would it be possible to disable GForce if an integer64 column is detected until there is a proper fix? It is dangerous that data.table returns an incorrect result without failing or warning.

@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@eutwt
Copy link

eutwt commented Oct 29, 2018

@jangorecki jangorecki added this to the 1.12.0 milestone Oct 29, 2018
@t-morrison
Copy link

From the SO question (asked by myself), I will add some detail to this post. I also found the summing to be incorrect when negative integer64 class values are in the data.table, but I think it plays out differently than OP has described:

The sum returned is interestingly the sum of the results except the final row.

[this does not appear to be what is happening in OPs example]

By playing around with the data.table, I found that dt[, sum(col, na.rm=FALSE), by=group] is returning the negative row nearest to 0 in the data.table group as the sum.

With na.rm=TRUE, it returns the sum by group excluding the negative values.

Example:

library(data.table); library(bit64)
z <- data.table(
  group = c("A","A","A","A","B","B","B","B","B","B"),
  int64 = as.integer64(c(10,-2,20,-7,5,-8,-2,-1,-3,0))
)

z[, sum(int64, na.rm=FALSE), by=group]
#group  V1
#    A  -2
#    B  -1

z[, sum(int64, na.rm=TRUE), by=group]
#group V1
#    A 30
#    B  5

@franknarf1
Copy link
Contributor

@moman822 Fyi, you see those results because gsum uses the core data (that you can see with unclass(z$int64)), which contains some NaNs.

> z[, v := replace(int64, is.nan(unclass(int64)), NA)]
> z
    group int64  v
 1:     A    10 10
 2:     A    -2 NA
 3:     A    20 20
 4:     A    -7 NA
 5:     B     5  5
 6:     B    -8 NA
 7:     B    -2 NA
 8:     B    -1 NA
 9:     B    -3 NA
10:     B     0  0

To avoid using gsum till this is supported, besides what akrun pointed out, you can wrap in parentheses:

z[, (sum(int64, na.rm = TRUE)), by=group]
z[, (sum(int64, na.rm = FALSE)), by=group]

@t-morrison
Copy link

@franknarf1

That makes sense to me for when na.rm=TRUE, but not for when na.rm=FALSE.

@franknarf1
Copy link
Contributor

@moman822 The same idea applies:

library(integer64) # no need for data.table to see it
x = as.integer64(c(10,-2,20,-7))
structure(sum(unclass(x)), class="integer64")
# integer64
# [1] -2

We could take it to StackOverflow if it's still unclear (eg this chat room) since the package maintainers here on github seem to understand the problem already (from the second comment).

@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019
@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019
@MichaelChirico MichaelChirico added the GForce issues relating to optimized grouping calculations (GForce) label Feb 25, 2019
@jangorecki jangorecki self-assigned this Jul 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bit64 bug GForce issues relating to optimized grouping calculations (GForce)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants