fixes #504 fread now handles all kind of NAs without coercion to char #1236
15d3cb7 to 972f1c0
Strange behaviour: CI tests 908, 1343 and 1344 failed, although all tests passed on my machine. I'll check. |
Tests sometimes fail on CI and then pass on the next build of the same code. In this case, however, all the mentioned tests (908, 1343, 1344) are related to fread.
library(devtools)
install_github("dselivanov/data.table@fread_improvements")
library(data.table)
test <- function(num, x, y) all.equal(x,y)
test(908, fread("A,B,C\n1,3,\n2,4,\n"), data.table(A=1:2,B=3:4,C=NA)) # where NA is type logical
# [1] "Component “C”: Modes: numeric, logical" "Component “C”: target is numeric, current is logical"
test(1343, fread("A,B\n1,TRUE\n2,\n3,F"), data.table(A=1:3, B=c(TRUE,NA,FALSE)))
# [1] "Component “B”: Modes: character, logical" "Component “B”: target is character, current is logical"
test(1344, fread("A,B\n1,T\n2,NA\n3,"), data.table(A=1:3, B=c(TRUE,NA,NA)))
# [1] "Component “B”: Modes: character, logical" "Component “B”: target is character, current is logical"
# Warning message:
# In fread("A,B\n1,T\n2,NA\n3,") :
# Bumped column 2 to type character on data row 1, field contains 'T'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
|
@dselivanov thanks. I'll leave it to @mattdowle for this one. What'd be very useful is to run a sufficiently large file (both with, and without using this argument) using |
@jangorecki, I'm on Mac OS, but I'll try these tests on Ubuntu. @arunsrinivasan It seems that timings for the master and fread_improvements branches are quite similar. Benchmark:
library(data.table)
K <- 1e7
# One column per basic type: integer, character, double and logical
DT <- data.table(int = 1:K,
                 char = sample(letters, size = K, replace = TRUE),
                 float = 1:K + 0.1,
                 bool = sample(c(TRUE, FALSE), K, replace = TRUE))
fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
# For each fraction, set that share of rows to NA in every column and write a CSV
for (fr in seq_along(fractions)) {
  DT_NA <- copy(DT)
  for (j in seq_len(ncol(DT))) {
    i <- sample(K, ceiling(fractions[[fr]] * K))
    set(x = DT_NA, i = i, j = j, value = NA)
  }
  write.table(DT_NA, file = files[[fr]], quote = FALSE, sep = ',', row.names = FALSE, col.names = TRUE)
}
NEW fread branch:
library(devtools)
install_github("dselivanov/data.table@fread_improvements")
library(data.table)
fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for (f in files) {
  # this copy prevents reading from already mmapped file?
  f1 <- '~/tempfile'
  system(paste('cp', f, f1))
  #*****************************************************
  timing <- system.time(DT_fread <- fread(f1, na.strings = "NA", verbose = FALSE))
  print(f)
  print(timing)
}
OLD fread branch:
library(devtools)
install_github('rdatatable/data.table')
library(data.table)
fractions <- c(0.001, 0.01, 0.1, 0.3)
files <- paste0("~/fr_", fractions)
for (f in files) {
  # this copy prevents reading from already mmapped file?
  f2 <- '~/tempfile'
  system(paste('cp', f, f2))
  #*****************************************************
  timing <- system.time(DT_fread <- fread(f2, na.strings = "NA", verbose = FALSE))
  print(f)
  print(timing)
}
|
972f1c0 to ddf270f
I fixed - variable But what do you think, guys: is it really the expected behaviour that, in these cases, an empty string at the end of a line is converted into |
@dselivanov thanks for your time. I'll have time this weekend to take a look at this. I don't foresee any issues.. so it's very likely I'll merge this. Thanks for the benchmarks and code to reproduce it. I ran it on 1e8 rows, and things weren't much different. That's great! |
@arunsrinivasan, nice! If you need any assistance with the code, feel free to ask. |
@dselivanov had a look. Once again, great work. Here are some thoughts.
I ran it with excessive amount of
require(data.table)
set.seed(1L)
DT = data.table(x=sample(c(1:3, "NULL", "null"), 2e8, TRUE))[, y := x]
write.table(DT, file = "~/fr_na_rich", quote = FALSE, sep = ',', row.names = FALSE, col.names = TRUE)
Runtimes were ~25s and ~18s (~28% time spent on It's a very welcome fix, and it'd be great to wrap these things up before merging. |
@arunsrinivasan, thank you very much for such a detailed review! I agree with your comments and will take a deeper look at points 1 and 3. |
@dselivanov right, it might be more tricky. We'll probably need to optimise for default cases.. but maybe we can discuss that part at a later point (when Matt has some time to offer his thoughts as well). For now, fixing the function calls should be enough to merge, I think. |
ddf270f to d9fac21
…on to character. Tests added.
d9fac21 to 6c302f5
@arunsrinivasan please review the updated PR. |
Excellent! Re-ran the benchmark on the excessive NULL/null data and this is what I get now (even faster):
system.time(ans <- fread("~/fr_na_rich"))
# Read 200000000 rows and 2 (of 2) columns from 1.192 GB file in 00:00:16
#    user  system elapsed
#  15.093   1.226  16.475
system.time(ans <- fread("~/fr_na_rich", na.strings=c("NULL", "null")))
# Read 200000000 rows and 2 (of 2) columns from 1.192 GB file in 00:00:14
#    user  system elapsed
#  11.900   0.873  12.870
IIUC this doesn't yet address point (1) (which is fine btw)? Merging now.. |
@mattdowle I just watched your video from H2O World. It seems I should also be listed among the contributors in DESCRIPTION? :-) |
@dselivanov thanks for bringing it up! I would say yes :) Strange that you don't show up on the Contributors page though? Any idea why? https://github.com/Rdatatable/data.table/graphs/contributors |
@MichaelChirico It might be that at some point my git was configured with an email address different from the one I used on GitHub. |
@dselivanov That's odd. Thanks for mentioning it. Yes, as Michael thought, it's because you're missing from https://github.com/Rdatatable/data.table/graphs/contributors. Seems like a GitHub bug to me. Even if you've changed email it should still link everything up. And if there are commits that are not associated with any user for any reason, then why aren't those displayed so at least we know they exist? I've raised a support request... |
GitLab seems to deal a little better with that
|
Did GitLab catch anyone else we may have missed? |
I heard back from GitHub support. One problem is that "[email protected]" is an invalid email address. If it were valid then Dmitry could add that email address to his profile and then that email address would be associated with his GitHub account : To know if any more have been missed, they suggest : Indeed the GitLab page seems to do a better job. I let GitHub support know that and sent them a link. So we need to add Dmitry to contributors list, and ask him retrospectively if he would have been ok with license change from GPL to MPL. Same for any others missed. First thought is that I should do that formally as a follow up PR I'll make, linked to the original, including Dmitry and any others all together. |
Results of follow-up review are that 4 contributors need to be added to DESCRIPTION and their retrospective permission needs to be sought for the license change. PR for that linking here is on its way. The command suggested by GitHub support (above) returns the most names (109) : GitLab contributors page shows 95. Whether it resolves variations seems a bit hit and miss. It includes Dmitry though. GitHub history shows 79. Variations for the same person seem to be merged better, but some are missing, like Dmitry, as we saw before. Named in DESCRIPTION: 48. Lower than 79 because there are 31 non-code contributors; e.g. fixes to documentation, or whitespace changes to code files. So, I started from the biggest number (109) and did a full review including same-person variations. I show the first 8 characters of names+emails here just to avoid robot harvesting. But if anyone interested runs the
|
Tests added. Test 882 adjusted to match correct behaviour.
I tested this a lot (test.data.table() also helps!), but maybe we need to add more tests for this. I believe I need some help with writing good tests.
Now na.strings = NULL means no coercion at all. The old hardcoded handling of the "NA" string has been removed everywhere.
@arunsrinivasan, please review the code before merging. This PR touches a lot of fread.c internals. It affects the Strto*() functions, which are the workhorses for data reading and conversion.
It would also be great if @mattdowle took a look, because as far as I can see he wrote most of the fread code.
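A minimal sketch of the behaviour this PR targets, assuming a data.table build that includes the change. The inline CSV and the names csv, raw and clean are made up for illustration; only fread() and its header and na.strings arguments come from data.table. It shows that placeholder strings declared in na.strings become NA without forcing the column to character.

```r
library(data.table)

# Small inline CSV; "NULL" and "null" stand in for missing values.
csv <- "A,B\n1,NULL\n2,3\nnull,4\n"

# Without custom NA strings the placeholders are ordinary text,
# so both columns are read as character.
raw <- fread(csv, header = TRUE)
sapply(raw, class)

# Declaring the placeholders as NA strings lets both columns stay integer,
# with NA where a placeholder appeared.
clean <- fread(csv, header = TRUE, na.strings = c("NULL", "null"))
sapply(clean, class)
is.na(clean$B)
```

header = TRUE is passed explicitly so that the first call does not depend on automatic header detection when every column is character.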