fread converting empty string to NA if all rows have empty string for that column #4579

Closed
ethanbsmith opened this issue Jun 28, 2020 · 8 comments

@ethanbsmith
Contributor

The docs say ,"", is unambiguous and read as an empty string, so the second example below looks like an error. I'm not sure how to work around this at the moment.

> fread(input = 'A,B\n1,foo\n2,""')
   A   B
1: 1 foo
2: 2    
> fread(input = 'A,B\n1,""\n2,""')
   A  B
1: 1 NA
2: 2 NA
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rollRegres_0.1.2  rio_0.5.16        rvest_0.3.4       xml2_1.2.2        data.table_1.12.9 curl_4.2          quantmod_0.4-15   TTR_0.23-5        xts_0.11-2        zoo_1.8-6        
[11] RODBC_1.3-16      doParallel_1.0.15 iterators_1.0.12  foreach_1.4.7     plotrix_3.7-6     checkpoint_0.4.7 

loaded via a namespace (and not attached):
 [1] zip_2.0.4        Rcpp_1.0.2       pillar_1.4.2     compiler_3.6.3   cellranger_1.1.0 forcats_0.4.0    tools_3.6.3      zeallot_0.1.0    checkmate_1.9.4  jsonlite_1.6     tibble_2.1.3    
[12] lattice_0.20-38  pkgconfig_2.0.3  rlang_0.4.0      openxlsx_4.1.0.1 rstudioapi_0.10  haven_2.1.1      stringr_1.4.0    httr_1.4.1       vctrs_0.2.0      hms_0.5.1        grid_3.6.3      
[23] R6_2.4.0         readxl_1.3.1     foreign_0.8-75   selectr_0.4-1    magrittr_1.5     codetools_0.2-16 backports_1.1.4  stringi_1.4.3    crayon_1.3.4
@ethanbsmith changed the title from "fread converting empty string to NA if all rows have empty string" to "fread converting empty string to NA if all rows have empty string for that column" on Jun 28, 2020
@jangorecki
Member

The column was detected as logical. If we force it to character, you get what you need.

fread(input = 'A,B\n1,""\n2,""', colClasses=c("integer","character"))
#       A      B
#   <int> <char>
#1:     1       
#2:     2       

Feel free to close this issue, turn it into FR, or a documentation update request.
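
A minimal sketch of the same workaround, assuming data.table is installed: a named colClasses vector overrides only column B and leaves column A to fread's type detection. The variable name dt is illustrative.

```r
library(data.table)

# Override only column B; column A is still auto-detected.
dt <- fread(input = 'A,B\n1,""\n2,""', colClasses = c(B = "character"))

# Inspect the resulting column classes and check whether B contains any NA.
print(sapply(dt, class))
print(anyNA(dt$B))
```

Naming the column keeps the call valid if more columns are added to the file later, since the positions of the other columns no longer matter.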

@ethanbsmith
Contributor Author

thx for the quick response! using colClasses to force this to load as character works. (i should have thought of that)

I'm not sure if this is a functionality or a documentation issue. It does seem to me that the docs are correct here and that "" is always a literal empty string. However, there may well be a compelling reason to treat this as logical that I'm not seeing.

I'll leave it as your call whether it's a functionality issue or a doc issue, but I suspect it has to be one or the other.

@ethanbsmith
Contributor Author

One last thought: if "" is intended to be treated as a logical, how would one represent (and roundtrip) a column of empty strings?
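
A rough sketch of the roundtrip in question, assuming data.table is installed. The names original, detected, forced and path are illustrative. It writes a column of empty strings with fwrite, prints the raw file, and reads it back with and without an explicit class for B.

```r
library(data.table)

original <- data.table(A = 1:2, B = c("", ""))
path <- tempfile(fileext = ".csv")
fwrite(original, path)

# Show the raw file so the on-disk representation of "" is visible.
writeLines(readLines(path))

# Without a type hint, B is subject to fread's type detection.
detected <- fread(path)
# With an explicit class, B is read as character.
forced <- fread(path, colClasses = c(B = "character"))

print(sapply(detected, class))
print(sapply(forced, class))
unlink(path)
```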

@jangorecki
Member

jangorecki commented Jun 28, 2020

That sounds like a FR :)
On the other hand, it introduces a minor inconsistency: column classes would depend on whether values are quoted. We could assume that quotes should force a column to be character, and then proceed with this issue as an FR for that.
Logical type may also come from 0/1 integers, or in the future possibly from Y/N characters (#4564).
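
To illustrate the 0/1 case, here is a small sketch using fread's logical01 argument, assuming data.table is installed and the datatable.logical01 option has not been changed. The names csv, default_read and logical_read are illustrative.

```r
library(data.table)

csv <- 'A,B\n1,0\n2,1'

# Default read of a column that contains only 0s and 1s.
default_read <- fread(input = csv)
# logical01 = TRUE asks fread to read such a column as logical.
logical_read <- fread(input = csv, logical01 = TRUE)

print(sapply(default_read, class))
print(sapply(logical_read, class))
```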

@ethanbsmith
Contributor Author

ok, now my head hurts ;)

str(fread(input = 'A,B\n1,"2"\n3,4'))
Classes ‘data.table’ and 'data.frame':	2 obs. of  2 variables:
 $ A: int  1 3
 $ B: int  2 4
 - attr(*, ".internal.selfref")=<externalptr> 

wow, I did not expect that! here I was thinking " implies character string. Before I ask for this as an FR, I will have to stew on this for a bit and get back to you, as existing functionality is not what I expected. Thx for the patience

Are there known use cases where treating something surrounded by " as character causes problems?

@jangorecki
Member

No idea. When I use fread, I usually do it on good-quality CSV files, so I don't have good insight into such problems.

@ethanbsmith
Contributor Author

Having thought about this a bit, the most compelling argument was about round-tripping empty strings through fwrite and fread. This led me to #2524, where some of these ideas have been explored.

I have concluded that roundtripping via CSV is probably not the way to think about things. if one wants to persist state with data-type integrity, a database or binary format is probably the way to go.

That largely leaves this as an edge case that can currently be handled with an explicit colClasses, and I am happy to accept whatever gets concluded in #2524.

@jangorecki
Member

I think CSVY is what is meant to address the fact that CSV doesn't have a schema.
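
A hedged sketch of the CSVY idea using the yaml argument of fwrite and fread, assuming a data.table version that provides it and that the yaml package is installed (the code skips the roundtrip otherwise). The names original, restored and path are illustrative, and the result is printed rather than asserted here because it may vary by version.

```r
library(data.table)

original <- data.table(A = 1:2, B = c("", ""))
path <- tempfile(fileext = ".csvy")
restored <- NULL

# The yaml argument of fwrite/fread relies on the 'yaml' package.
if (requireNamespace("yaml", quietly = TRUE)) {
  # Write a YAML header describing the columns ahead of the CSV body.
  fwrite(original, path, yaml = TRUE)
  writeLines(readLines(path))

  # Ask fread to parse the header and use it when reading.
  restored <- fread(path, yaml = TRUE)
  print(sapply(restored, class))
}
unlink(path)
```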
