[R-Forge #5360] Add fill=T to fread #536

arunsrinivasan · 2014-06-08T13:15:35Z

Submitted by: Michele Carriero; Assigned to: Nobody; R-Forge link

Since this option is being added to rbind I wonder if it could be added to fread too, in order to reflect the read.table feature.
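For reference, a minimal base R sketch of the read.table behaviour the request refers to. The inline sample data is made up for illustration; with fill = TRUE, rows that have fewer fields than the widest row are padded with NA instead of raising an error.

```r
# Made-up ragged input: the first data row is one field short.
ragged <- "a,b,c\n1,2\n3,4,5\n"

# fill = TRUE pads the short row with NA rather than stopping with an error.
df <- read.table(text = ragged, sep = ",", header = TRUE, fill = TRUE)
print(df)
```

The request is for fread to accept an equivalent option so that such files can be read without falling back to read.table.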

mattdowle · 2014-08-17T14:13:49Z

Requested here as well (on a file from CBOE) :
http://stackoverflow.com/questions/25339552/how-to-read-cboe-csv-file-using-data-table

Seems likely it could be Excel generating such files. They could potentially be quite large, so worth fread supporting: http://stackoverflow.com/questions/25339552/how-to-read-cboe-csv-file-using-data-table/25341502?noredirect=1#comment39526821_25341502

markdanese · 2014-08-18T04:15:46Z

For what it is worth, our problematic dataset is data from Clinical Practice Research Datalink in the UK (the additional clinical details file where things like blood pressure, cholesterol, body weight, etc. are stored). Very commonly used in epidemiology and health services research. That one is not excel-based.

mpearmain · 2014-08-18T08:08:43Z

I often have data dumps taken from ad server data in a list format.
This list is a KEY:VALUE setup where the VALUE is itself a tuple.
This is a very good setup for storing large amounts of data (and one I see implemented a lot).

Reading with a fill=T flag would create a binary matrix,
e.g.

USER1, [a, b, c, d]
USER2, [b,e,f]
USER3, [a,b,c]

becomes

      a   b   c   d   e   f
USER1 1   1   1   1   0   0
USER2 0   1   0   0   1   1
USER3 1   1   1   0   0   0
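As a rough illustration of the transformation shown above, here is a base R sketch that builds the same indicator matrix from the three sample lines. It is only one possible approach, not what a fill option in fread would do by itself; the variable names are hypothetical.

```r
# The three sample lines from the example above.
lines <- c("USER1, [a, b, c, d]", "USER2, [b,e,f]", "USER3, [a,b,c]")

# Drop the square brackets, then split on commas (with optional spaces).
stripped <- gsub("\\[|\\]", "", lines)
tokens <- strsplit(stripped, ", *")

# Long format: one (user, item) pair per row.
long <- data.frame(
  user = rep(vapply(tokens, `[`, character(1), 1), lengths(tokens) - 1),
  item = unlist(lapply(tokens, `[`, -1)),
  stringsAsFactors = FALSE
)

# Cross-tabulate into a users x items matrix of 0/1 counts.
m <- unclass(table(long$user, long$item))
print(m)
```

For large inputs a sparse matrix would be the more sensible target, since most entries are zero.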

Now with a quick awk script I can transform these:

awk -F '[ ,\\[\\]]+' '{for (i=2; i<NF; i++) print $1,$i}' $1 >> "transformed_$1"

I am then able to use fread and post-process (I personally read into a sparse data matrix).

But the use case is obviously much more about avoiding having to run awk over data files before reading and then converting them.

This proves to be significantly faster than something like:

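library(data.table)

# Read a ragged delimited file by sizing the table to its widest row.
ReadMaxCSVCols <- function(f, sep = ",", quote = "\"'", header = FALSE, ...) {
  # count.fields() can return NA when a quoted field spans lines, hence na.rm
  nc <- max(count.fields(f, sep = sep, quote = quote), na.rm = TRUE)
  read.table(f,
             sep = sep,
             quote = quote,
             header = header,
             fill = TRUE,  # pad short rows with NA
             col.names = paste0("V", seq_len(nc)),
             ...)
}
foo <- data.table(ReadMaxCSVCols("myfile.txt"))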

mattdowle · 2014-08-18T10:13:55Z

@mpearmain Thanks, really useful. Tuple columns like VALUE are what sep2= was intended for. The VALUE column would be read into a list column. Would that work for you? The original use case for sep2= was columns 11 and 12 of BED files in genomics (they are vectors of integers iirc, separated within a field by a different separator than between fields). Is your VALUE field really wrapped with [ ] like that (or similar)? If so, that could be coded in fread as an option where sep==sep2, i.e. both comma.
Could do fill=T as well, just that reading VALUE into a list column might be better. It really depends on what operations you need to do on it afterwards.

mattdowle · 2014-08-18T10:21:40Z

@markdanese Great, yes very useful to know, thanks. Could you post a link to a sample file perhaps? (A made-up example of 3 or 4 lines that's close would be great too.) I had a look at http://www.cprd.com/ and it seems huge and varied ... and interesting. We could do fill=TRUE, but might sep2= into a list column be better and work for you? See new comments above.

markdanese · 2014-08-19T06:37:18Z

The list probably won't help. It is a simple flat file and would probably be easiest as columns -- to create a complete table.

I took a small file and changed individual digits randomly so that this is not identifiable. This dropbox link should allow you to get the .txt file:
https://www.dropbox.com/s/tkz4ofbqxis41w4/PET_Additional001%20copy.txt

Thanks for your help, and let me know if this file doesn't work.

mpearmain · 2014-08-19T09:21:56Z

Hi Matt,

I think you've hit the nail on the head with the question of what you want to do afterwards. To me the main use is to load as fast as possible and with a structure that is consistent; the list mechanism would allow for this.

I'm looking to do binary matrix factorization, so a full or sparse matrix is the end point. The list isn't ideal for that, but it adds structure if I am given a list of cols.

I can of course transform this into a DT or matrix; my concern is the overhead of the transform operation, which means running a few AWK or SED scripts beforehand may still be the best option in my situation.

mattdowle mentioned this issue Aug 17, 2014

Fread autofill NA or NULL for empty columns #766

Closed

arunsrinivasan added the fread label Sep 4, 2015

arunsrinivasan added this to the v1.9.8 milestone Dec 17, 2015

arunsrinivasan self-assigned this Dec 21, 2015

arunsrinivasan closed this as completed in ba84a4c Dec 21, 2015

st-pasha mentioned this issue Apr 18, 2018

Add text argument to fread #2753

Merged
