[R-Forge #5360] Add fill=T to fread #536

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 7 comments

@arunsrinivasan
Member

Submitted by: Michele Carriero; Assigned to: Nobody; R-Forge link

Since the fill option is being added to rbind, I wonder if it could be added to fread too, to mirror the read.table feature.
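For readers unfamiliar with the read.table feature referred to here, the following is a minimal base R sketch of what fill = TRUE does. The ragged input string is made up for illustration; it is not taken from any file discussed in this thread.

```r
# Hypothetical ragged input: a 3-field header, then rows with 2 and 1 fields.
ragged <- "a,b,c\n1,2\n3"

# With fill = TRUE, rows that are too short are padded with NA
# instead of causing read.table() to stop with an error.
df <- read.table(text = ragged, sep = ",", header = TRUE, fill = TRUE)
print(df)
```

Note that read.table() sizes the table from the first five lines of input unless col.names is supplied, so a wider row appearing later in a file can still cause trouble.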

@mattdowle
Member

Requested here as well (on a file from CBOE):
http://stackoverflow.com/questions/25339552/how-to-read-cboe-csv-file-using-data-table

It seems likely that Excel could be generating such files. They could potentially be quite large, so this is worth fread supporting: http://stackoverflow.com/questions/25339552/how-to-read-cboe-csv-file-using-data-table/25341502?noredirect=1#comment39526821_25341502

@markdanese

For what it is worth, our problematic dataset is data from the Clinical Practice Research Datalink in the UK (the additional clinical details file where things like blood pressure, cholesterol, body weight, etc. are stored). It is very commonly used in epidemiology and health services research. That one is not Excel-based.

@mpearmain

I often have data dumps taken from ad server data in a list format.
The list is a KEY:VALUE setup where the VALUE is itself a tuple.
This is a very good setup for storing large amounts of data (and one I see implemented a lot).

Reading with a fill=T flag would create a binary matrix,
e.g.

USER1, [a, b, c, d]
USER2, [b,e,f]
USER3, [a,b,c]

becomes

      a   b   c   d   e   f
USER1 1   1   1   1   0   0
USER2 0   1   0   0   1   1
USER3 1   1   1   0   0   0

Now, with a quick awk script, I can transform these:

awk -F '[ ,\\[\\]]+' '{for (i=2; i<NF; i++) print $1,$i}' $1 >> "transformed_$1"

I am then able to use fread and post-process (I personally read into a sparse data matrix).
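As a rough illustration of that post-processing step, here is a minimal base R sketch that turns "user item" pairs, as the awk script above would print them, into the binary matrix shown earlier. The inline text stands in for the transformed file, and read.table() is used only to keep the sketch dependency-free; it is not the poster's actual code, and a sparse matrix class would be the more likely choice for large data.

```r
# Hypothetical output of the awk step: one "user item" pair per line.
pairs_txt <- paste(
  "USER1 a", "USER1 b", "USER1 c", "USER1 d",
  "USER2 b", "USER2 e", "USER2 f",
  "USER3 a", "USER3 b", "USER3 c",
  sep = "\n"
)

pairs <- read.table(text = pairs_txt, col.names = c("user", "item"),
                    stringsAsFactors = FALSE)

# Cross-tabulate users against items, then cap counts at 1 so that
# repeated pairs still give a 0/1 indicator.
counts <- table(pairs$user, pairs$item)
m <- matrix(as.integer(counts > 0), nrow = nrow(counts),
            dimnames = dimnames(counts))
print(m)
```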

But the use case is obviously much more about not having to awk data files prior to reading and then converting.

The awk-then-fread route proves to be significantly faster than something like:

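# Read a ragged delimited file by sizing the table to its widest row.
ReadMaxCSVCols <- function(f, sep = ",", quote = "\"'", header = FALSE, ...) {
  # na.rm guards against NA counts, which count.fields() can return
  # for lines that belong to a multi-line quoted field.
  nc <- max(count.fields(f, sep = sep, quote = quote), na.rm = TRUE)
  read.table(f,
             sep = sep,
             quote = quote,
             header = header,
             fill = TRUE,
             col.names = paste0("V", seq_len(nc)),
             ...)
}
library(data.table)
foo <- data.table(ReadMaxCSVCols("myfile.txt"))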

@mattdowle
Member

@mpearmain Thanks, really useful. Tuple columns like VALUE were what sep2= was intended for. The VALUE column would be read into a list column. Would that work for you? The original use case for sep2= was columns 11 and 12 of BED files in genomics (they are vectors of integers iirc, separated within a field by a different separator than between fields). If your VALUE field really is wrapped with [ ] like that (or similar), then that could be coded in fread as an option where sep==sep2, i.e. both comma.
Could do fill=T as well; it's just that reading VALUE into a list column might be better. It really depends on what operations you need to do on it afterwards.
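To make the list-column idea concrete, here is a minimal base R sketch of the shape such a result could take. It only emulates the outcome with sub() and strsplit() on made-up lines in the KEY, [v1, v2, ...] layout; it is not how fread implements or would implement sep2=, and the names key, tuple and value are illustrative.

```r
# Hypothetical lines in the KEY, [v1, v2, ...] layout from the comment above.
lines <- c("USER1, [a, b, c, d]", "USER2, [b,e,f]", "USER3, [a,b,c]")

# Everything before the first comma is the key.
key <- sub(",.*$", "", lines)

# Everything between the square brackets is the tuple.
tuple <- sub("^[^,]*, *\\[(.*)\\] *$", "\\1", lines)

# Split each tuple on commas (with optional spaces): one vector per row.
value <- strsplit(tuple, " *, *")

# I() keeps the list intact as a single list column of a data.frame.
df <- data.frame(key = key, value = I(value), stringsAsFactors = FALSE)
print(df$value[[2]])
```

Each row then carries a vector of varying length, so no padding is needed, at the cost of a further step if a flat table or matrix is wanted afterwards.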

@mattdowle
Member

@markdanese Great, yes, very useful to know, thanks. Could you post a link to a sample file perhaps? (A made-up example of 3 or 4 lines that's close would be great too.) I had a look at http://www.cprd.com/ and it seems huge and varied ... and interesting. We could do fill=TRUE, but might sep2= into a list column be better and work for you? See new comments above.

@markdanese

The list probably won't help. It is a simple flat file and would probably be easiest as columns, to create a complete table.

I took a small file and changed individual digits randomly so that it is not identifiable. This Dropbox link should allow you to get the .txt file:
https://www.dropbox.com/s/tkz4ofbqxis41w4/PET_Additional001%20copy.txt

Thanks for your help, and let me know if this file doesn't work.

@mpearmain

Hi Matt,

I think you've hit the nail on the head with what you want to do afterwards. To me the main use is to load as fast as possible and with a structure that is consistent; the list mechanism would allow for this.

I'm looking to do binary matrix factorization, so a full or sparse matrix is the end point. The list isn't ideal for that, but it adds structure if I am given a list of cols.

I can of course transform this into a DT or matrix; my concern is the overhead of the transform operation, which means running a few awk or sed scripts beforehand may still be the best option in my situation.
