Obtaining different object from same file #5346
That certainly sounds like bad news! I don't know of any sources of randomness off the top of my head. The only thing I can think of is threading? Can you try again with `nThread = 1`? Beyond that it will be very tough for us to solve the problem without a reproducible example. Please share the data if you can, or scrub out details as much as possible if there are privacy/proprietary concerns.
If the data is not shareable, the output of `verbose = TRUE` alone would be interesting for the failing case. Since the problem appears with `colClasses = 'numeric'` but not `'character'`, it points at the numeric parser.
Thank you for your replies @MichaelChirico and @ben-schwen. Anyway, I had the chance to work a bit on these files, and what happens is that, for example, in a 25000 x 24 table I have 45 'errors' that do not happen in the same cells in consecutive iterations. These errors seem to behave this way: the character representation could be something like "5.760602", and the numeric 'visible' representation would be the same 5.760602 for two consecutive `fread` calls with `colClasses = 'numeric'`, but doing `dump()` one would be 5.7606020000000004 and the other 5.7606019999999996. I also tried other similar tables exported from the same software (imaging mass cytometry), but this problem affects only random files.
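As an aside, the mismatch between the printed value and the underlying bits is easy to demonstrate (a minimal sketch in Python for convenience; R's default 7-digit printing behaves the same way): two doubles can differ in their last bit yet print identically at seven significant digits.

```python
# The two values reported by dump(); both display as "5.760602".
a = 5.7606020000000004
b = 5.7606019999999996

print(f"{a:.7g}", f"{b:.7g}")    # 5.760602 5.760602
print(a == b)                    # False: they are distinct doubles
print(f"{a:.17g}", f"{b:.17g}")  # the full 17-digit values differ
```

This is why the tables look identical when printed but fail `identical()`.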
@MichaelChirico, yes, `nThread = 1` seems to solve the problem. Thank you
What apparently happens is that parsing doubles is dependent on the thread and not deterministic.

| Decimal | Sign | Exponent | Mantissa |
| --- | --- | --- | --- |
| 5.7606020000000004 | 0 | 10000000001 | 0111000010101101101101000000001011010001011010111010 |
| 5.7606019999999996 | 0 | 10000000001 | 0111000010101101101101000000001011010001011010111001 |
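The bit patterns in the table can be reproduced with a few lines of code (sketched in Python for convenience, using `struct` to reinterpret each double's raw bytes):

```python
import struct

def ieee754_fields(x):
    # Reinterpret the double as a 64-bit big-endian unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = f"{bits:064b}"
    return s[0], s[1:12], s[12:]  # sign, exponent, mantissa

for v in (5.7606020000000004, 5.7606019999999996):
    sign, exponent, mantissa = ieee754_fields(v)
    print(f"{v!r} | {sign} | {exponent} | {mantissa}")
```

The two mantissas come out exactly one unit apart, as in the table above.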
What does the standard say about which 64-bit double is "correct" in this case? I.e., if 5.760602 is exactly halfway between the two nearest representable doubles, is there any heuristic for which is preferred?
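For what it's worth, a quick check (again sketched in Python for convenience; CPython's float parsing is correctly rounded) suggests 5.760602 is not an exact tie after all: the lower neighbour is strictly closer, so IEEE 754 round-to-nearest has a unique answer and the ties-to-even rule never comes into play here. (`math.nextafter` requires Python 3.9+.)

```python
from decimal import Decimal
import math

lo = 5.7606019999999996  # the two candidate doubles from the thread
hi = 5.7606020000000004
assert math.nextafter(lo, math.inf) == hi  # they are adjacent doubles

d = Decimal("5.760602")                   # the exact decimal value
print(d - Decimal(lo) < Decimal(hi) - d)  # True: lo is strictly closer
print(float("5.760602") == lo)            # True: correct rounding picks lo
```

A conforming parser should therefore always produce the same double; any variation is a parser bug, not a rounding ambiguity.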
@MichaelChirico you actually fixed this in #4463. Unfortunately, this hasn't made it onto CRAN yet because Matt only pushed the OMP patch #5172 for 1.14.2. Some code to reproduce and check that it's fixed on dev:

```r
library('data.table')
setDTthreads(2)

width <- 12
length <- 25000
numbers <- paste(
  c(paste(as.character(1:width), collapse = ','),
    rep(paste(rep('5.760602', width), collapse = ','), length)),
  collapse = '\n'
)

d1 <- fread(text = numbers, header = TRUE, verbose = TRUE)
d2 <- fread(text = numbers, header = TRUE, verbose = TRUE)
identical(d1, d2)
# [1] TRUE
# ! This is usually TRUE, only rarely FALSE

e <- d1[[1]]
for (i in 2:length) {
  if (!identical(e[i - 1], e[i])) {
    print(i - 1)
  }
}
# Prints the following on CRAN and nothing on dev:
# [1] 12500
```

The last bit prints where the parsed numbers change. I'm only 95% sure about fread's logic on how many threads to use, so make sure you see it using two; this only shows up with at least two threads. Using more threads, or larger data which gets broken up into more chunks, introduces some randomness in the output (hence why the OP even noticed it) based on, I presume, which chunk gets read by which thread. I don't really understand why the rounding would be different with the old lookup table based on thread, but, well, now it's not.

@luigidolcetti could you confirm that this is indeed fixed for your data with the latest development version?
@tlapak I'm unable to reproduce the problem with your example.
I can confirm @tlapak's example on my Windows machine.
I have been able to test my example on an Ubuntu machine now, where it does indeed not reproduce. So this issue seems to be Windows-specific. Did this occur on a Windows machine for you, @luigidolcetti?
@tlapak Yes, a relatively recent Windows 10.
@luigidolcetti Would you mind sharing the output of `sessionInfo()`? Does the issue still appear if you upgrade to 1.14.3?
@ben-schwen, with 1.14.3 it works fine with @tlapak's example on a 12 x 250000 table iterated some 20 times. Here is the session info for the PC where I tried (sorry, I cannot access at the moment the PC where I first noticed the issue): R version 4.0.3 (2020-10-10) (remaining `sessionInfo()` output omitted).
@luigidolcetti could you also try updating to 1.14.3 on the original PC and retry with the original data?
Hi @ben-schwen, so I tried on the first PC with my original dataset, and unfortunately it seems that the problem persists even with version 1.14.3. R version 4.1.0 (2021-05-18) (remaining `sessionInfo()` output omitted). With `nThread = 1` I get no problems, but with any other value there are differences. The data is a 'data.frame' of 250000 obs. of 54 variables, and I get 91 discrepancies of the kind described above.
I can confirm that if you use my example with 2.414798 instead, it shows the differing results on 1.14.3. Bizarrely, for me, it does not on 1.14.2. Still only on Windows and not on Ubuntu.
@luigidolcetti could you test again with R 4.2 and a current build of data.table 1.14.3? I can no longer reproduce any issues since upgrading to R 4.2. I strongly suspect that there was an issue with the compiler/OpenMP in the toolchain, which has now moved from gcc 8 to gcc 10.
Hi,
probably a very simple issue to fix, but I am struggling to solve it:
I have a txt numeric table with a column header.
```r
identical(
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'numeric'),
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'numeric'))
```

returns FALSE most of the time, while

```r
identical(
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'character'),
  data.table::fread(fileName, sep = '\t', header = TRUE, check.names = FALSE, colClasses = 'character'))
```

always returns TRUE.
On the other hand, base read.table() does not have this issue (but it's way slower).
I would prefer to avoid loading the table as character and coercing it to numeric later (because of speed; otherwise I would have used read.table).
Any suggestion on how to read the same file twice and obtain identical objects (and why this is happening)?
Thank you in advance for your help,
Luigi