Obtaining different object from same file #5346
That certainly sounds like bad news! I don't know of any sources of randomness off the top of my head. The only thing I can think of is threading? Can you try again with `nThread = 1`? Beyond that it will be very tough for us to solve the problem without a reproducible example. Please share the data if you can, or scrub out details as much as possible if there are privacy/proprietary concerns.
If the data is not shareable, the output of `verbose = TRUE` alone would be interesting for the failing case. Since the problem appears with `colClasses = 'numeric'` but not `'character'`, it points at the numeric parser.
Thank you for your replies @MichaelChirico and @ben-schwen. Anyway, I had the chance to work a bit on these files, and what happens is that, for example, in a 25000 x 24 table I have 45 'errors' that do not happen in the same cells in consecutive iterations. These errors seem to behave this way: the character representation could be something like "5.760602", and the numeric 'visible' representation would be the same 5.760602 for two consecutive `fread` calls with `colClasses = 'numeric'`, but doing `dump()` one would be 5.7606020000000004 and the other 5.7606019999999996. I also tried other similar tables exported from the same software (imaging mass cytometry), but this problem affects only random files.
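As an aside, the mismatch between the printed value and the underlying bits is easy to demonstrate (a minimal sketch in Python for convenience; R's default 7-digit printing behaves the same way): two doubles can differ in their last bit yet print identically at seven significant digits.

```python
# The two values reported by dump(); both display as "5.760602".
a = 5.7606020000000004
b = 5.7606019999999996

print(f"{a:.7g}", f"{b:.7g}")    # 5.760602 5.760602
print(a == b)                    # False: they are distinct doubles
print(f"{a:.17g}", f"{b:.17g}")  # the full 17-digit values differ
```

This is why the tables look identical when printed but fail `identical()`.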
@MichaelChirico, yes, `nThread = 1` seems to solve the problem. Thank you
What apparently happens is that parsing doubles is dependent on the thread and not deterministic.

| Decimal | Sign | Exponent | Mantissa |
| --- | --- | --- | --- |
| 5.7606020000000004 | 0 | 10000000001 | 0111000010101101101101000000001011010001011010111010 |
| 5.7606019999999996 | 0 | 10000000001 | 0111000010101101101101000000001011010001011010111001 |
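The bit patterns in the table can be reproduced with a few lines of code (sketched in Python for convenience, using `struct` to reinterpret each double's raw bytes):

```python
import struct

def ieee754_fields(x):
    # Reinterpret the double as a 64-bit big-endian unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = f"{bits:064b}"
    return s[0], s[1:12], s[12:]  # sign, exponent, mantissa

for v in (5.7606020000000004, 5.7606019999999996):
    sign, exponent, mantissa = ieee754_fields(v)
    print(f"{v!r} | {sign} | {exponent} | {mantissa}")
```

The two mantissas come out exactly one unit apart, as in the table above.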
What does the standard say about which 64-bit double is "correct" in this case? I.e., if 5.760602 is exactly halfway between the two nearest representable doubles, is there any heuristic for which is preferred?
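For what it's worth, a quick check (again sketched in Python for convenience; CPython's float parsing is correctly rounded) suggests 5.760602 is not an exact tie after all: the lower neighbour is strictly closer, so IEEE 754 round-to-nearest has a unique answer and the ties-to-even rule never comes into play here. (`math.nextafter` requires Python 3.9+.)

```python
from decimal import Decimal
import math

lo = 5.7606019999999996  # the two candidate doubles from the thread
hi = 5.7606020000000004
assert math.nextafter(lo, math.inf) == hi  # they are adjacent doubles

d = Decimal("5.760602")                   # the exact decimal value
print(d - Decimal(lo) < Decimal(hi) - d)  # True: lo is strictly closer
print(float("5.760602") == lo)            # True: correct rounding picks lo
```

A conforming parser should therefore always produce the same double; any variation is a parser bug, not a rounding ambiguity.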
@MichaelChirico you actually fixed this in #4463. Unfortunately, this hasn't made it onto CRAN yet because Matt only pushed the OMP patch #5172 for 1.14.2. Some code to reproduce and check that it's fixed on dev:

```r
library('data.table')
setDTthreads(2)

width <- 12
length <- 25000
numbers <- paste(
  c(paste(as.character(1:width), collapse = ','),
    rep(paste(rep('5.760602', width), collapse = ','), length)),
  collapse = '\n'
)

d1 <- fread(text = numbers, header = TRUE, verbose = TRUE)
d2 <- fread(text = numbers, header = TRUE, verbose = TRUE)
identical(d1, d2)
# [1] TRUE
# ! This is usually TRUE, only rarely FALSE

e <- d1[[1]]
for (i in 2:length) {
  if (!identical(e[i - 1], e[i])) {
    print(i - 1)
  }
}
# Prints the following on CRAN and nothing on dev:
# [1] 12500
```

The last bit prints where the parsed numbers change. I'm only 95% sure about fread's logic on how many threads to use, so make sure you see it using two; this only shows up with at least two threads. Using more threads, or larger data which gets broken up into more chunks, introduces some randomness in the output (hence why the OP even noticed it) based on, I presume, which chunk gets read by which thread. I don't really understand why the rounding would be different with the old lookup table based on thread, but, well, now it's not.

@luigidolcetti could you confirm that this is indeed fixed for your data with the latest development version?
@tlapak I'm unable to reproduce the problem with your example.
I can confirm @tlapak's example on my Windows machine.
I have been able to test my example on an Ubuntu machine now, where it does indeed not reproduce. So this issue seems to be Windows-specific. Did this occur on a Windows machine for you, @luigidolcetti?
@tlapak Yes, a relatively recent Windows 10.
@luigidolcetti Would you mind sharing the output of `sessionInfo()`? Does the issue still appear if you upgrade to 1.14.3?
@ben-schwen, with 1.14.3 it works fine with @tlapak's example on a 12 x 250000 table iterated some 20 times. Here is the session info for the PC where I tried (sorry, I cannot access at the moment the PC where I first noticed the issue): R version 4.0.3 (2020-10-10) (remaining `sessionInfo()` output omitted).
@luigidolcetti could you also try updating to 1.14.3 on the original PC and retry with the original data?
Hi @ben-schwen, so I tried on the first PC with my original dataset, and unfortunately it seems that the problem persists even with version 1.14.3. R version 4.1.0 (2021-05-18) (remaining `sessionInfo()` output omitted). With `nThread = 1` I get no problems, but with any other value there are differences. The data is a 'data.frame' of 250000 obs. of 54 variables, and I get 91 discrepancies of the kind described above.
I can confirm that if you use my example with 2.414798 instead, it shows the differing results on 1.14.3. Bizarrely, for me, it does not on 1.14.2. Still only on Windows and not on Ubuntu.
@luigidolcetti could you test again with R 4.2 and a current build of data.table 1.14.3? I can no longer reproduce any issues since upgrading to R 4.2. I strongly suspect that there was an issue with the compiler/OpenMP in the toolchain, which has now moved from gcc 8 to gcc 10.
Hi,
probably a very simple issue to fix, but I am struggling to solve it:
I have a txt numeric table with a column header.
```r
identical(
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'numeric'),
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'numeric'))
```

returns FALSE most of the time, while

```r
identical(
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'character'),
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'character'))
```

always returns TRUE.
On the other hand, base read.table() does not have this issue (but it's way slower).
I would prefer to avoid loading the table as character and coercing it to numeric later (because of speed; otherwise I would have used read.table).
Any suggestion on how to read the same file twice and obtain identical objects (and why this is happening)?
Thank you in advance for your help,
Luigi