Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #504 fread now handles all kind of NAs without coercion to char #1236

Merged
merged 1 commit into from
Sep 7, 2015

Conversation

dselivanov
Copy link

Tests added. Test 882 adjusted to match correct behaviour.
I tested this a lot (also test.data.table() helps! ), but may be we need to add more tests for this. I believe I need some help with writing good tests.

Now na.strings = NULL means no coercion at all. Old hardcoded handling for "NA" string removed from everywhere.
@arunsrinivasan, please review code before merging. This PR touches a lot of fread.c internals. It affects Strto*() functions which are workhorses for data reading and conversion.
Also It would be great if @mattdowle also will take a look, because as I see he wrote most of the fread code.

@dselivanov dselivanov force-pushed the fread_improvements branch 2 times, most recently from 15d3cb7 to 972f1c0 Compare July 21, 2015 08:42
@dselivanov
Copy link
Author

strange behaviour - CI tests 908, 1343, 1344 were failed, however all tests were passed on my machine. I'll check.

@jangorecki
Copy link
Member

It happens sometimes to fail the tests while on the same code next built will pass them. In that case, all mentioned tests (908, 1343, 1344) are related to fread.
I did run the the test against your branch and it failed on my machine, it may be related to OS.
Reproducible script below.

library(devtools)
install_github("dselivanov/data.table@fread_improvements")
library(data.table)
test <- function(num, x, y) all.equal(x,y)
test(908, fread("A,B,C\n1,3,\n2,4,\n"), data.table(A=1:2,B=3:4,C=NA)) # where NA is type logical
# [1] "Component “C”: Modes: numeric, logical"               "Component “C”: target is numeric, current is logical"
test(1343, fread("A,B\n1,TRUE\n2,\n3,F"), data.table(A=1:3, B=c(TRUE,NA,FALSE)))
# [1] "Component “B”: Modes: character, logical"               "Component “B”: target is character, current is logical"
test(1344, fread("A,B\n1,T\n2,NA\n3,"), data.table(A=1:3, B=c(TRUE,NA,NA)))
# [1] "Component “B”: Modes: character, logical"               "Component “B”: target is character, current is logical"
# Warning message:
# In fread("A,B\n1,T\n2,NA\n3,") :
#   Bumped column 2 to type character on data row 1, field contains 'T'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.5 devtools_1.8.0  

loaded via a namespace (and not attached):
 [1] httr_1.0.0      R6_2.1.0        magrittr_1.5    rversions_1.0.2 tools_3.2.1     curl_0.9.1      Rcpp_0.11.6     memoise_0.2.1   xml2_0.1.1      stringi_0.5-5  
[11] knitr_1.10.5    git2r_0.10.1    stringr_1.0.0   digest_0.6.8    chron_2.3-47   

@arunsrinivasan
Copy link
Member

@dselivanov thanks. I'll leave it to @mattdowle for this one.

What'd be very useful is to run a sufficiently large file (both with, and without using this argument) using fread() before and after your fix and report the timings.

@dselivanov
Copy link
Author

@jangorecki, I'm on mac os

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.5
loaded via a namespace (and not attached):
[1] tools_3.2.1 chron_2.3-47

But I'll try these test on ubuntu.

@arunsrinivasan Seems, timings for master and fread_improvements branches are quite similar.

Benchmark:
generate data

library(data.table)
K <- 1e7
DT <- data.table(int = 1:K, 
                 char = sample(letters, size = K, replace = T), 
                 float = 1:K + 0.1, 
                 bool = sample( c(T, F), K, replace = T))

fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for( fr in seq_along(fractions) ) {
  DT_NA <- copy(DT)
  for (j in seq_len( ncol(DT) )) {
    i <- sample(K, ceiling(fractions[[fr]] * K) )
    set(x = DT_NA, i = i, j = j, value = NA)
  }
  write.table(DT_NA, file = files[[fr]], quote = F, sep = ',', row.names = F, col.names = T)
}

NEW fread branch

library(devtools)
install_github("dselivanov/data.table@fread_improvements")
library(data.table)
fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for (f in files) {
  # this copy prevents reading from already mmaped file?
  f1 <- '~/tempfile'
  system(paste('cp', f, f1))
  #*****************************************************
  timing <- system.time(DT_fread <- fread(f1, na.strings = "NA", verbose = F))
  print(f)
  print(timing)
}

[1] "/fr_0.001"
user system elapsed
1.546 0.146 1.692
[1] "
/fr_0.01"
user system elapsed
1.554 0.156 1.708
[1] "/fr_0.1"
user system elapsed
1.559 0.104 1.663
[1] "
/fr_0.3"
user system elapsed
1.823 0.086 1.910

OLD fread branch

library(devtools)
install_github('rdatatable/data.table')
library(data.table)
fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for (f in files) {
  # this copy prevents reading from already mmaped file?
  f2 <- '~/tempfile'
  system(paste('cp', f, f2))
  #*****************************************************
  system(paste('cp', f, f2))
  timing <- system.time(DT_fread <- fread(f2, na.strings = "NA", verbose = F))
  print(f)
  print(timing)
}

[1] "/fr_0.001"
user system elapsed
1.456 0.149 1.605
[1] "
/fr_0.01"
user system elapsed
1.586 0.137 1.724
[1] "/fr_0.1"
user system elapsed
1.490 0.092 1.582
[1] "
/fr_0.3"
user system elapsed
1.528 0.088 1.617

@dselivanov
Copy link
Author

I fixed - variable na_found_flag was uninitialised. Compile with -pedantic flag helps to locate issue.

But how do you think, guys, is it right expected behaviour, that in these cases empty string at the end of line converted into NAs, even if we don't set na.strings=""?

@arunsrinivasan
Copy link
Member

@dselivanov thanks for your time. I'll have time this weekend to take a look at this. I don't foresee any issues.. so it's very likely I'll merge this. Thanks for the benchmarks and code to reproduce it. I ran it on 1e8 rows, and things weren't much different. That's great!

@dselivanov
Copy link
Author

@arunsrinivasan, nice! If you will need any assistance with code, feel free to ask.

@arunsrinivasan
Copy link
Member

@dselivanov had a look. Once again, great work. Here are some thoughts.

  1. cases like these aren't handled correctly at the moment:

    require(data.table)
    DT = data.table(a=9:10, b=9:10 + 0.1, c=as.logical(0:1))
    text = do.call("paste", c(DT, collapse="\n", sep=","))
    ans1 = fread(text, na.strings=c("9", "9.1", "FALSE"))
    #    V1   V2    V3
    # 1:  9  9.1 FALSE
    # 2: 10 10.1  TRUE
    
    sapply(ans1, class)
    #         V1        V2        V3 
    #  "integer" "numeric" "logical" 
    
    # whereas read.table() gives
    ans2 = read.table(text=text, na.strings=c("9", "9.1", "FALSE"), sep=",", header=FALSE)
    #   V1   V2   V3
    # 1 NA   NA   NA
    # 2 10 10.1 TRUE
    
    sapply(ans2, class)
    #         V1        V2        V3 
    #  "integer" "numeric" "logical" 

    Do you think you could take a look at it? The NA checks has to come first, AFAICT.

  2. I think you should store the string length of each of the NAs so as to avoid calling strlen() each time here. And you can do that by using the already existing NA_MASK which is just boolean at the moment. I think you could store the lengths instead, and set it to 0 accordingly instead. Dint look at the code carefully, but I think, at the moment, for each field, strlen() is called at least once?

  3. contains_non_zero() function call can also be avoided (I understand it's inline'd) by having an int and initialising it to the length(na.strings) and decrementing that counter when doing this, isn't it?

    In general, I'd try and avoid function calls (especially repetitive ones) when it can be avoided just as easily...

  4. I think we all understand what global variables bring to the table. This comment is really not necessary :-). We understand and accept the responsibility.

  5. Here are the benchmarks on 2e8 rows (thanks for sharing the code again):

    #   NA%     PR   stable
    # 0.001    44.1    45.2
    #  0.01    43.9    46.8
    #   0.1    45.3    41.6
    #   0.3    44.3    39.4

    The timings are absolutely fine for normal scenarios.


I ran it with excessive amount of NULL and null, and tested with and without strlen() repetitive calls.. (quick and dirty testing):

require(data.table)
set.seed(1L)
DT = data.table(x=sample(c(1:3, "NULL", "null"), 2e8, TRUE))[, y := x]
write.table(DT, file = "~/fr_na_rich", quote = F, sep = ',', row.names = F, col.names = T)

Runtimes were ~25s and ~18s (~28% time spent on strlen()). Also tested with Instruments -> Time profiler and saw that this time was spent on strlen() calls.

It's a very welcoming fix, and It'd be great to wrap these things up before merging.

@dselivanov
Copy link
Author

@arunsrinivasan , thank you very much for such detailed review! Agreed with your comments, and will have a deeper look into 1 and 3 points.
As I understand, point 1 (start parsing with NA checks) will involve a lot of overhead, so I expect performance will degrade considerably. But I will try to check.

@arunsrinivasan
Copy link
Member

@dselivanov right, it might be more tricky. We'll probably need to optimise for default cases.. but maybe we can discuss that part at a later point (when Matt has some time to offer his thoughts as well). For now, fixing the function calls should be enough to merge, I think.

@dselivanov
Copy link
Author

@arunsrinivasan plz, review updated PR.

@arunsrinivasan
Copy link
Member

Excellent! Re-ran the benchmark on the excessive NULL/null data and this is what I get now (even faster):

system.time(ans <- fread("~/fr_na_rich"))
# Read 200000000 rows and 2 (of 2) columns from 1.192 GB file in 00:00:16
#    user  system elapsed 
#  15.093   1.226  16.475 
system.time(ans <- fread("~/fr_na_rich", na.strings=c("NULL", "null")))
# Read 200000000 rows and 2 (of 2) columns from 1.192 GB file in 00:00:14
#    user  system elapsed 
#  11.900   0.873  12.870 

IIUC this doesn't yet address the point (1) (which is fine btw)? Merging now..

@dselivanov
Copy link
Author

@mattdowle just watched you video from h2o world. Seems I also should be in contributors in DESCRIPTION? :-)

@MichaelChirico
Copy link
Member

@dselivanov thanks for bringing it up! I would say yes :)

Strange that you don't show up on the Contributors page though? Any idea why?

https://github.com/Rdatatable/data.table/graphs/contributors

@dselivanov
Copy link
Author

@MichaelChirico might be that at some point my git was configured with email which was different to what I used on the github.

@mattdowle
Copy link
Member

mattdowle commented Oct 26, 2019

@dselivanov That's odd. Thanks for mentioning it. Yes, as Michael thought, it's because you're missing from https://github.com/Rdatatable/data.table/graphs/contributors. Seems like a GitHub bug to me. Even if you've changed email it should still link everything up. And if there are commits that are not associated with any user for any reason, then why aren't those displayed so at least we know they exist? I've raised a support request...

@jangorecki
Copy link
Member

GitLab seems to deal a little better with that

Dmitry Selivanov
[email protected]

https://gitlab.com/Rdatatable/data.table/-/graphs/master

@MichaelChirico
Copy link
Member

did GL catch anyone else we may have missed?

@mattdowle
Copy link
Member

I heard back from GitHub support. One problem is that "[email protected]" is an invalid email address. If it were valid then Dmitry could add that email address to his profile and then that email address would be associated with his GitHub account :
https://help.github.com/articles/adding-an-email-address-to-your-github-account
But it's not a valid email address so that can't be done in GitHub.

To know if any more have been missed, they suggest :
git log --pretty="%an %ae%n%cn %ce" | sort | uniq
and then comparing the result against what we know. I just tried that and it returns 103 lines.

Indeed the GitLab page seems to do a better job. I let GitHub support know that and sent them a link.

So we need to add Dmitry to contributors list, and ask him retrospectively if he would have been ok with license change from GPL to MPL. Same for any others missed. First thought is that I should do that formally as a follow up PR I'll make, linked to the original, including Dmitry and any others all together.

@mattdowle
Copy link
Member

mattdowle commented Dec 27, 2019

Results of follow-up review are that 4 contributors need to be added to DESCRIPTION and their retrospective permission needs to be sought for the license change. PR for that linking here is on its way.

The command suggested by GitHub support (above) returns the most names (109) :
git log --pretty="%an %ae%n%cn %ce" | sort | uniq
That includes many variations for the same person; e.g. I have 4 lines there and Arun has 7.

GitLab contributors page shows 95. Whether it resolves variations seems a bit hit and miss. It includes Dmitry though.

GitHub history shows 79. Variations for the same person seem to be merged better, but some are missing, like Dmitry, as we saw before.

Named in DESCRIPTION: 48. Lower than 79 because there are 31 non-code contributors; e.g. fixes to documentation, or whitespace changes to code files.

So, I started from the biggest number (109) and did a full review including same-person variations. I show the first 8 characters of names+emails here just to avoid robot harvesting. But if anyone interested runs the git command above themselves, it should correspond exactly to the 109 lines shown here, in the same order.

Already in DESCRIPTION? Non-code? Retro needed? Name first 8 chars   PR  
Yes     2005m mo      
    Yes Andrey R Andrey Riabushenko 1434 @cooldome
    Yes Andrey R      
Yes     Anthony      
Yes     Arun ara      
Yes     Arun aru      
Yes     Arunkuma      
Yes     arunsrin      
Yes     Arun Sri      
Yes     arunsrin      
Yes     Arun Sri      
Yes     Ayappan      
  Yes   Bela Hau      
  Yes   Ben Tupp      
  Yes   Brett J.      
  Yes: line endings don't count   chalabi      
    Yes Cheng H. Cheng Lee 1457 @chenghlee
Yes     Daniel P      
Yes     David Ar      
Yes     DavidAre      
Yes     Davis Va      
Yes     Dénes Tó      
Yes     DexGrove      
    Yes Dmitry S Dmitry Selivanov 1236 @dselivanov
    Yes Dmitry S      
Yes     Eduard A      
Yes     eduard e      
  Yes: wording in a warning message string   Elio Cam      
  Yes   erikdf e      
Yes     Ethan Sm      
  Yes   ew-git 3      
  Yes   eyherabh      
Yes     Felipe P      
Yes     Francois      
Yes     François      
  Yes   Frank fr      
  Yes   Gerhard      
  Yes   Giovanni      
Artifact of 2 line git output with %n in middle: git log --pretty="%an %ae%n%cn %ce"     GitHub n      
  Yes   Hadley W      
  Yes   Henrik-P      
Yes     HughPars      
  Yes   JackieMe      
  Yes   James La      
Yes     James Sa      
  Yes   Jan Glei      
Yes     Jan Gore      
Yes     jangorec      
Yes     javruceb      
Yes     Jim Hest      
  Yes   JoshOBri      
  Yes   jwijffel      
Yes     Karl Bro      
Yes     Kun Ren      
  Yes   Kyle Hay      
Yes     Leonardo      
  Yes   Mallick      
Yes     Manish S      
  Yes   Marco Co      
Yes     marc-out      
  Yes (travis.yml)   Mark Kli      
Yes     MarkusBo      
Yes     MarkusBo      
Yes     Martin M      
Yes     mattdowl      
Yes     Matt Dow      
Yes     mattdowl      
Yes     Matt Dow      
  Yes   matthewh      
Yes     Matthieu      
  Yes   Maxim Na      
  Yes   mGalarny      
Yes     Michael      
Yes     Michael      
Yes     Michael      
Yes     Michael      
Yes     Michael      
Yes     Michel L      
Yes     MusTheDa      
    Yes Nello Bl Nello Blaser 1155 @blasern
  Yes   Oliver K      
Yes     Otto Sei      
Yes     Pasha St      
Yes     Philippe      
  Yes   psychome      
Yes     Rick Sap      
Yes     Roy Stor      
  Yes   Russ Hyd      
Yes     Scott Ri      
Yes     Sebastia      
Yes     Seth Wen      
Yes     shrektan      
  Yes   Simon Be      
  Yes   stefan7t      
Yes     Steve Li      
Yes     Tobias S      
Yes     Toby Dyl      
Yes     Toby Dyl      
Yes     Tom Shor      
Yes     Tyson Ba      
  Yes   Uwe Bloc      
  Yes   Vadim Kh      
  Yes   Vijay Lu      
  Yes   Ville Vä      
Yes     Watal M.      
  Yes   wligtenb      
Yes     xianghui      
Yes     Xianying      
  Yes   Yossi Sy      

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants