fixes #504 fread now handles all kind of NAs without coercion to char #1236

dselivanov · 2015-07-21T07:57:13Z

Tests added. Test 882 adjusted to match correct behaviour.
I tested this a lot (also test.data.table() helps! ), but may be we need to add more tests for this. I believe I need some help with writing good tests.

Now na.strings = NULL means no coercion at all. Old hardcoded handling for "NA" string removed from everywhere.
@arunsrinivasan, please review code before merging. This PR touches a lot of fread.c internals. It affects Strto*() functions which are workhorses for data reading and conversion.
Also It would be great if @mattdowle also will take a look, because as I see he wrote most of the fread code.

dselivanov · 2015-07-21T09:23:24Z

strange behaviour - CI tests 908, 1343, 1344 were failed, however all tests were passed on my machine. I'll check.

jangorecki · 2015-07-21T10:59:19Z

It happens sometimes to fail the tests while on the same code next built will pass them. In that case, all mentioned tests (908, 1343, 1344) are related to fread.
I did run the the test against your branch and it failed on my machine, it may be related to OS.
Reproducible script below.

library(devtools)
install_github("dselivanov/data.table@fread_improvements")
library(data.table)
test <- function(num, x, y) all.equal(x,y)
test(908, fread("A,B,C\n1,3,\n2,4,\n"), data.table(A=1:2,B=3:4,C=NA)) # where NA is type logical
# [1] "Component “C”: Modes: numeric, logical"               "Component “C”: target is numeric, current is logical"
test(1343, fread("A,B\n1,TRUE\n2,\n3,F"), data.table(A=1:3, B=c(TRUE,NA,FALSE)))
# [1] "Component “B”: Modes: character, logical"               "Component “B”: target is character, current is logical"
test(1344, fread("A,B\n1,T\n2,NA\n3,"), data.table(A=1:3, B=c(TRUE,NA,NA)))
# [1] "Component “B”: Modes: character, logical"               "Component “B”: target is character, current is logical"
# Warning message:
# In fread("A,B\n1,T\n2,NA\n3,") :
#   Bumped column 2 to type character on data row 1, field contains 'T'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.5 devtools_1.8.0  

loaded via a namespace (and not attached):
 [1] httr_1.0.0      R6_2.1.0        magrittr_1.5    rversions_1.0.2 tools_3.2.1     curl_0.9.1      Rcpp_0.11.6     memoise_0.2.1   xml2_0.1.1      stringi_0.5-5  
[11] knitr_1.10.5    git2r_0.10.1    stringr_1.0.0   digest_0.6.8    chron_2.3-47

arunsrinivasan · 2015-07-21T16:06:06Z

@dselivanov thanks. I'll leave it to @mattdowle for this one.

What'd be very useful is to run a sufficiently large file (both with, and without using this argument) using fread() before and after your fix and report the timings.

dselivanov · 2015-07-21T20:54:28Z

@jangorecki, I'm on mac os

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.5
loaded via a namespace (and not attached):
[1] tools_3.2.1 chron_2.3-47

But I'll try these test on ubuntu.

@arunsrinivasan Seems, timings for master and fread_improvements branches are quite similar.

Benchmark:
generate data

library(data.table)
K <- 1e7
DT <- data.table(int = 1:K, 
                 char = sample(letters, size = K, replace = T), 
                 float = 1:K + 0.1, 
                 bool = sample( c(T, F), K, replace = T))

fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for( fr in seq_along(fractions) ) {
  DT_NA <- copy(DT)
  for (j in seq_len( ncol(DT) )) {
    i <- sample(K, ceiling(fractions[[fr]] * K) )
    set(x = DT_NA, i = i, j = j, value = NA)
  }
  write.table(DT_NA, file = files[[fr]], quote = F, sep = ',', row.names = F, col.names = T)
}

NEW fread branch

library(devtools)
install_github("dselivanov/data.table@fread_improvements")
library(data.table)
fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for (f in files) {
  # this copy prevents reading from already mmaped file?
  f1 <- '~/tempfile'
  system(paste('cp', f, f1))
  #*****************************************************
  timing <- system.time(DT_fread <- fread(f1, na.strings = "NA", verbose = F))
  print(f)
  print(timing)
}

[1] "/fr_0.001"
user system elapsed
1.546 0.146 1.692
[1] "/fr_0.01"
user system elapsed
1.554 0.156 1.708
[1] "/fr_0.1"
user system elapsed
1.559 0.104 1.663
[1] "/fr_0.3"
user system elapsed
1.823 0.086 1.910

OLD fread branch

library(devtools)
install_github('rdatatable/data.table')
library(data.table)
fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for (f in files) {
  # this copy prevents reading from already mmaped file?
  f2 <- '~/tempfile'
  system(paste('cp', f, f2))
  #*****************************************************
  system(paste('cp', f, f2))
  timing <- system.time(DT_fread <- fread(f2, na.strings = "NA", verbose = F))
  print(f)
  print(timing)
}

[1] "/fr_0.001"
user system elapsed
1.456 0.149 1.605
[1] "/fr_0.01"
user system elapsed
1.586 0.137 1.724
[1] "/fr_0.1"
user system elapsed
1.490 0.092 1.582
[1] "/fr_0.3"
user system elapsed
1.528 0.088 1.617

dselivanov · 2015-07-22T07:36:30Z

I fixed - variable na_found_flag was uninitialised. Compile with -pedantic flag helps to locate issue.

But how do you think, guys, is it right expected behaviour, that in these cases empty string at the end of line converted into NAs, even if we don't set na.strings=""?

arunsrinivasan · 2015-08-18T20:38:08Z

@dselivanov thanks for your time. I'll have time this weekend to take a look at this. I don't foresee any issues.. so it's very likely I'll merge this. Thanks for the benchmarks and code to reproduce it. I ran it on 1e8 rows, and things weren't much different. That's great!

dselivanov · 2015-08-19T08:53:31Z

@arunsrinivasan, nice! If you will need any assistance with code, feel free to ask.

arunsrinivasan · 2015-08-23T16:35:01Z

@dselivanov had a look. Once again, great work. Here are some thoughts.

cases like these aren't handled correctly at the moment:

require(data.table)
DT = data.table(a=9:10, b=9:10 + 0.1, c=as.logical(0:1))
text = do.call("paste", c(DT, collapse="\n", sep=","))
ans1 = fread(text, na.strings=c("9", "9.1", "FALSE"))
#    V1   V2    V3
# 1:  9  9.1 FALSE
# 2: 10 10.1  TRUE

sapply(ans1, class)
#         V1        V2        V3 
#  "integer" "numeric" "logical" 

# whereas read.table() gives
ans2 = read.table(text=text, na.strings=c("9", "9.1", "FALSE"), sep=",", header=FALSE)
#   V1   V2   V3
# 1 NA   NA   NA
# 2 10 10.1 TRUE

sapply(ans2, class)
#         V1        V2        V3 
#  "integer" "numeric" "logical"

Do you think you could take a look at it? The NA checks has to come first, AFAICT.

I think you should store the string length of each of the NAs so as to avoid calling strlen() each time here. And you can do that by using the already existing NA_MASK which is just boolean at the moment. I think you could store the lengths instead, and set it to 0 accordingly instead. Dint look at the code carefully, but I think, at the moment, for each field, strlen() is called at least once?
contains_non_zero() function call can also be avoided (I understand it's inline'd) by having an int and initialising it to the length(na.strings) and decrementing that counter when doing this, isn't it?

In general, I'd try and avoid function calls (especially repetitive ones) when it can be avoided just as easily...
I think we all understand what global variables bring to the table. This comment is really not necessary :-). We understand and accept the responsibility.

Here are the benchmarks on 2e8 rows (thanks for sharing the code again):

#   NA%     PR   stable
# 0.001    44.1    45.2
#  0.01    43.9    46.8
#   0.1    45.3    41.6
#   0.3    44.3    39.4

The timings are absolutely fine for normal scenarios.

I ran it with excessive amount of NULL and null, and tested with and without strlen() repetitive calls.. (quick and dirty testing):

require(data.table)
set.seed(1L)
DT = data.table(x=sample(c(1:3, "NULL", "null"), 2e8, TRUE))[, y := x]
write.table(DT, file = "~/fr_na_rich", quote = F, sep = ',', row.names = F, col.names = T)

Runtimes were ~25s and ~18s (~28% time spent on strlen()). Also tested with Instruments -> Time profiler and saw that this time was spent on strlen() calls.

It's a very welcoming fix, and It'd be great to wrap these things up before merging.

dselivanov · 2015-08-23T17:56:32Z

@arunsrinivasan , thank you very much for such detailed review! Agreed with your comments, and will have a deeper look into 1 and 3 points.
As I understand, point 1 (start parsing with NA checks) will involve a lot of overhead, so I expect performance will degrade considerably. But I will try to check.

arunsrinivasan · 2015-08-23T18:29:22Z

@dselivanov right, it might be more tricky. We'll probably need to optimise for default cases.. but maybe we can discuss that part at a later point (when Matt has some time to offer his thoughts as well). For now, fixing the function calls should be enough to merge, I think.

…on to character. Tests added.

dselivanov · 2015-08-29T10:37:02Z

@arunsrinivasan plz, review updated PR.

arunsrinivasan · 2015-09-07T13:34:31Z

Excellent! Re-ran the benchmark on the excessive NULL/null data and this is what I get now (even faster):

system.time(ans <- fread("~/fr_na_rich"))
# Read 200000000 rows and 2 (of 2) columns from 1.192 GB file in 00:00:16
#    user  system elapsed 
#  15.093   1.226  16.475 
system.time(ans <- fread("~/fr_na_rich", na.strings=c("NULL", "null")))
# Read 200000000 rows and 2 (of 2) columns from 1.192 GB file in 00:00:14
#    user  system elapsed 
#  11.900   0.873  12.870

IIUC this doesn't yet address the point (1) (which is fine btw)? Merging now..

dselivanov · 2019-10-26T07:35:46Z

@mattdowle just watched you video from h2o world. Seems I also should be in contributors in DESCRIPTION? :-)

MichaelChirico · 2019-10-26T12:52:47Z

@dselivanov thanks for bringing it up! I would say yes :)

Strange that you don't show up on the Contributors page though? Any idea why?

https://github.com/Rdatatable/data.table/graphs/contributors

dselivanov · 2019-10-26T12:59:01Z

@MichaelChirico might be that at some point my git was configured with email which was different to what I used on the github.

mattdowle · 2019-10-26T22:29:44Z

@dselivanov That's odd. Thanks for mentioning it. Yes, as Michael thought, it's because you're missing from https://github.com/Rdatatable/data.table/graphs/contributors. Seems like a GitHub bug to me. Even if you've changed email it should still link everything up. And if there are commits that are not associated with any user for any reason, then why aren't those displayed so at least we know they exist? I've raised a support request...

jangorecki · 2019-10-27T06:49:57Z

GitLab seems to deal a little better with that

Dmitry Selivanov
[email protected]

https://gitlab.com/Rdatatable/data.table/-/graphs/master

MichaelChirico · 2019-10-27T06:58:32Z

did GL catch anyone else we may have missed?

mattdowle · 2019-10-27T21:23:28Z

I heard back from GitHub support. One problem is that "[email protected]" is an invalid email address. If it were valid then Dmitry could add that email address to his profile and then that email address would be associated with his GitHub account :
https://help.github.com/articles/adding-an-email-address-to-your-github-account
But it's not a valid email address so that can't be done in GitHub.

To know if any more have been missed, they suggest :
git log --pretty="%an %ae%n%cn %ce" | sort | uniq
and then comparing the result against what we know. I just tried that and it returns 103 lines.

Indeed the GitLab page seems to do a better job. I let GitHub support know that and sent them a link.

So we need to add Dmitry to contributors list, and ask him retrospectively if he would have been ok with license change from GPL to MPL. Same for any others missed. First thought is that I should do that formally as a follow up PR I'll make, linked to the original, including Dmitry and any others all together.

mattdowle · 2019-12-27T04:52:46Z

Results of follow-up review are that 4 contributors need to be added to DESCRIPTION and their retrospective permission needs to be sought for the license change. PR for that linking here is on its way.

The command suggested by GitHub support (above) returns the most names (109) :
git log --pretty="%an %ae%n%cn %ce" | sort | uniq
That includes many variations for the same person; e.g. I have 4 lines there and Arun has 7.

GitLab contributors page shows 95. Whether it resolves variations seems a bit hit and miss. It includes Dmitry though.

GitHub history shows 79. Variations for the same person seem to be merged better, but some are missing, like Dmitry, as we saw before.

Named in DESCRIPTION: 48. Lower than 79 because there are 31 non-code contributors; e.g. fixes to documentation, or whitespace changes to code files.

So, I started from the biggest number (109) and did a full review including same-person variations. I show the first 8 characters of names+emails here just to avoid robot harvesting. But if anyone interested runs the git command above themselves, it should correspond exactly to the 109 lines shown here, in the same order.

Already in DESCRIPTION?	Non-code?	Retro needed?	Name first 8 chars		PR
Yes			2005m mo
		Yes	Andrey R	Andrey Riabushenko	1434	@cooldome
		Yes	Andrey R
Yes			Anthony
Yes			Arun ara
Yes			Arun aru
Yes			Arunkuma
Yes			arunsrin
Yes			Arun Sri
Yes			arunsrin
Yes			Arun Sri
Yes			Ayappan
	Yes		Bela Hau
	Yes		Ben Tupp
	Yes		Brett J.
	Yes: line endings don't count		chalabi
		Yes	Cheng H.	Cheng Lee	1457	@chenghlee
Yes			Daniel P
Yes			David Ar
Yes			DavidAre
Yes			Davis Va
Yes			Dénes Tó
Yes			DexGrove
		Yes	Dmitry S	Dmitry Selivanov	1236	@dselivanov
		Yes	Dmitry S
Yes			Eduard A
Yes			eduard e
	Yes: wording in a warning message string		Elio Cam
	Yes		erikdf e
Yes			Ethan Sm
	Yes		ew-git 3
	Yes		eyherabh
Yes			Felipe P
Yes			Francois
Yes			François
	Yes		Frank fr
	Yes		Gerhard
	Yes		Giovanni
Artifact of 2 line git output with %n in middle: `git log --pretty="%an %ae%n%cn %ce"`			GitHub n
	Yes		Hadley W
	Yes		Henrik-P
Yes			HughPars
	Yes		JackieMe
	Yes		James La
Yes			James Sa
	Yes		Jan Glei
Yes			Jan Gore
Yes			jangorec
Yes			javruceb
Yes			Jim Hest
	Yes		JoshOBri
	Yes		jwijffel
Yes			Karl Bro
Yes			Kun Ren
	Yes		Kyle Hay
Yes			Leonardo
	Yes		Mallick
Yes			Manish S
	Yes		Marco Co
Yes			marc-out
	Yes (travis.yml)		Mark Kli
Yes			MarkusBo
Yes			MarkusBo
Yes			Martin M
Yes			mattdowl
Yes			Matt Dow
Yes			mattdowl
Yes			Matt Dow
	Yes		matthewh
Yes			Matthieu
	Yes		Maxim Na
	Yes		mGalarny
Yes			Michael
Yes			Michael
Yes			Michael
Yes			Michael
Yes			Michael
Yes			Michel L
Yes			MusTheDa
		Yes	Nello Bl	Nello Blaser	1155	@blasern
	Yes		Oliver K
Yes			Otto Sei
Yes			Pasha St
Yes			Philippe
	Yes		psychome
Yes			Rick Sap
Yes			Roy Stor
	Yes		Russ Hyd
Yes			Scott Ri
Yes			Sebastia
Yes			Seth Wen
Yes			shrektan
	Yes		Simon Be
	Yes		stefan7t
Yes			Steve Li
Yes			Tobias S
Yes			Toby Dyl
Yes			Toby Dyl
Yes			Tom Shor
Yes			Tyson Ba
	Yes		Uwe Bloc
	Yes		Vadim Kh
	Yes		Vijay Lu
	Yes		Ville Vä
Yes			Watal M.
	Yes		wligtenb
Yes			xianghui
Yes			Xianying
	Yes		Yossi Sy

dselivanov force-pushed the fread_improvements branch 2 times, most recently from 15d3cb7 to 972f1c0 Compare July 21, 2015 08:42

dselivanov force-pushed the fread_improvements branch from 972f1c0 to ddf270f Compare July 22, 2015 07:16

dselivanov force-pushed the fread_improvements branch from ddf270f to d9fac21 Compare August 29, 2015 10:32

fixes Rdatatable#504 fread now handles all kind of NAs without coerci…

6c302f5

…on to character. Tests added.

dselivanov force-pushed the fread_improvements branch from d9fac21 to 6c302f5 Compare August 29, 2015 10:34

arunsrinivasan added the fread label Sep 4, 2015

arunsrinivasan merged commit 6c302f5 into Rdatatable:master Sep 7, 2015

arunsrinivasan mentioned this pull request Sep 7, 2015

fread's na.strings argument should handle values like "-999" #1314

Closed

dselivanov mentioned this pull request Sep 16, 2015

autodetect json fields and use jsonlite to parse them into lists #1240

Closed

mattdowle mentioned this pull request Dec 27, 2019

clarify error when rbindlist(l) is not a list() #3638

Merged

mattdowle mentioned this pull request Dec 27, 2019

Fix for Intel compiler bug #3648

Merged

mattdowle mentioned this pull request Dec 27, 2019

Retrospective license change permission #4140

Merged

jangorecki mentioned this pull request Oct 25, 2023

Formal governance document #5676

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixes #504 fread now handles all kind of NAs without coercion to char #1236

fixes #504 fread now handles all kind of NAs without coercion to char #1236

dselivanov commented Jul 21, 2015

dselivanov commented Jul 21, 2015

jangorecki commented Jul 21, 2015

arunsrinivasan commented Jul 21, 2015

dselivanov commented Jul 21, 2015

dselivanov commented Jul 22, 2015

arunsrinivasan commented Aug 18, 2015

dselivanov commented Aug 19, 2015

arunsrinivasan commented Aug 23, 2015

dselivanov commented Aug 23, 2015

arunsrinivasan commented Aug 23, 2015

dselivanov commented Aug 29, 2015

arunsrinivasan commented Sep 7, 2015

dselivanov commented Oct 26, 2019

MichaelChirico commented Oct 26, 2019

dselivanov commented Oct 26, 2019

mattdowle commented Oct 26, 2019 •

edited

Loading

jangorecki commented Oct 27, 2019

MichaelChirico commented Oct 27, 2019

mattdowle commented Oct 27, 2019

mattdowle commented Dec 27, 2019 •

edited

Loading

fixes #504 fread now handles all kind of NAs without coercion to char #1236

fixes #504 fread now handles all kind of NAs without coercion to char #1236

Conversation

dselivanov commented Jul 21, 2015

dselivanov commented Jul 21, 2015

jangorecki commented Jul 21, 2015

arunsrinivasan commented Jul 21, 2015

dselivanov commented Jul 21, 2015

dselivanov commented Jul 22, 2015

arunsrinivasan commented Aug 18, 2015

dselivanov commented Aug 19, 2015

arunsrinivasan commented Aug 23, 2015

dselivanov commented Aug 23, 2015

arunsrinivasan commented Aug 23, 2015

dselivanov commented Aug 29, 2015

arunsrinivasan commented Sep 7, 2015

dselivanov commented Oct 26, 2019

MichaelChirico commented Oct 26, 2019

dselivanov commented Oct 26, 2019

mattdowle commented Oct 26, 2019 • edited Loading

jangorecki commented Oct 27, 2019

MichaelChirico commented Oct 27, 2019

mattdowle commented Oct 27, 2019

mattdowle commented Dec 27, 2019 • edited Loading

mattdowle commented Oct 26, 2019 •

edited

Loading

mattdowle commented Dec 27, 2019 •

edited

Loading