Chunked fread #1721
If I understand correctly, this FR asks for a new fread.apply-style function. The only complication that I foresee is that […]
It would be easier to implement emitting DataTables of approximate size (i.e. not exactly 1,000,000 rows, but approximately that many). Then the logic could be implemented in pushBuffers.
FWIW, this is no longer relevant for me.
To me, CSV is a one-off on the way to a binary format or database. If the data is so large that it won't fit in memory and chunking is needed, then it should be in a database or binary format.
Just came across this Q & A, which suggests a simple (?) way to accomplish this would be to allow fread to read from connections.
@MichaelChirico I'm not sure we are all on the same page about what it means to "support file connections". For example, python's […]. However, this wouldn't allow chunking the input. I mean, if you ask fread to read only part of the input, it may still need the column re-read: types detected early in the file can be bumped by rows later on.
What I have in mind is not for fread to handle the chunking automatically, but for the user to be able to rig something up themselves that amounts to chunking. So perhaps the user won't have access to all the bells & whistles (e.g. the column re-read, as you said), but, e.g., if they can supply colClasses, that goes away. I might not be fully grokking the engineering requirements, though.

fread(paste(readLines(f, n = 1e5), collapse = '\n')) is a way to accomplish it now, quite inefficiently of course. Is there no way to imitate that without substantial code refactoring (conditional on fread accepting connections)?
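To make that rigging concrete, here is a minimal sketch of the idea, assuming a file with a header row and fread's text= argument (available in data.table >= 1.11.0); the file name is a placeholder:

library(data.table)

con <- file("big.csv", "r")          # placeholder path; open a text connection
header <- readLines(con, n = 1L)     # keep the header to prepend to each chunk
chunks <- list()
repeat {
  batch <- readLines(con, n = 1e5)   # next 100,000 raw lines (fewer at EOF)
  if (length(batch) == 0L) break
  # parse the batch in memory; re-attaching the header keeps column names
  chunks[[length(chunks) + 1L]] <- fread(text = c(header, batch))
}
close(con)
result <- rbindlist(chunks)

Note this inherits the usual line-based caveat: quoted fields containing embedded newlines would be split across batches incorrectly.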
@mattdowle - I disagree with your reason for closing this. There are lots of examples of huge datasets that have no business in a database, and lots of streaming algorithms to work with them. Why not keep your hat in this arena? DT excels in so many ways, and it already interoperates with non-seekable inputs. I expect @st-pasha's suggestion that this "could be implemented in pushBuffers" is probably spot on. And clearly, hope springs eternal: see Reading in chunks at a time using fread in package data.table on StackOverflow. Just sayin'.
Please provide one example so I can understand.
Please provide one example. If I understand correctly, you're asking for chunked reading. I've reopened so we can discuss further. The examples will help a lot.
Just to clarify: databases are wrong for analytics, but for transaction processing they make perfect sense.
@mattdowle - Next-gen sequencing commonly produces tens of millions (or more) of rows in Sequence Alignment Map (SAM) files, which are then further scanned/tabulated, often using fread (at least by me). I have wished for "chunked" reading due to memory considerations, and also the desire to dispatch chunks to worker processes. These SAM files are often themselves ephemeral and intermediate to an analysis; databasing them would waste resources. They are produced as flat files, have a few binary variants that facilitate selected retrieval operations (which are also probably best considered ephemeral), and regardless, they still occasionally require linear scanning. Perhaps I am misunderstanding your considerations, or am overlooking an approach I might be taking.

On a similar/related topic, would you entertain a request for fread to produce a list of data.tables, one for each section within a .csv, where section delimiters would be identified during the scan via regular expressions? This could easily be accomplished in two passes - one to find sections based on regular expressions, and the next passing line numbers and counts to fread - however, I thought it might make a natural extension to fread's inner loop; a sketch of the two-pass approach follows. Thanks for your consideration of this.
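For illustration only, a rough user-space sketch of that two-pass idea, assuming sections are delimited by lines matching a regex, each section is non-empty, and the whole file fits through readLines (the pattern and names are hypothetical):

library(data.table)

read_sections <- function(file, pattern = "^#SECTION") {
  all_lines <- readLines(file)            # pass 1: scan once for delimiter lines
  starts <- grep(pattern, all_lines)      # assumes at least one delimiter exists
  ends <- c(starts[-1] - 1L, length(all_lines))
  # pass 2: hand each section's lines to fread as in-memory text
  Map(function(s, e) fread(text = all_lines[(s + 1L):e]), starts, ends)
}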
@malcook Thanks for the info and links. If there's enough demand for it, then I should reconsider.
I'd just like to chime in as another programmer who'd appreciate this. I'd be the first to agree that a single CSV file is a poor choice of format for a massive dataset. Unfortunately, one might still get massive CSV files from other people. Two examples I've encountered so far are US voter-registration data, which can comprise tens of millions of observations per state but which state governments may provide as single CSV files, and daily or hourly observations from Mexican weather stations. Such files would be easy to handle with linewise Unix tools if it weren't for quoted fields with embedded newlines. My group will probably use […]
Just to throw in an example I have: I've got a csv that's too big to fit into memory on my laptop. However, I have some data munging I want to do on that dataset that amounts to: […]

After this operation, the file size is small enough to fit into memory, but I can't load it all at once to do it. Right now I'm using a function like this (it isn't perfect yet):

library(pbapply)
library(data.table)
chunked_fread <- function(file, chunk_size, FUN = NULL, verbose = TRUE, ...) {
  # count rows up front (requires a Unix-like system with wc);
  # assumes the file has no header row, so every line is a data row
  lines <- as.integer(system(paste("wc -l <", file), intern = TRUE))
  n_chunks <- ceiling(lines / chunk_size)
  if (verbose) {
    print(paste(lines, 'lines,', n_chunks, 'chunks, chunk size of', chunk_size))
    lapply <- pblapply  # shadow lapply locally to get a progress bar
  }
  out <- lapply(seq_len(n_chunks), function(idx) {
    # skip whole chunks, not single lines (the original skip = idx was a bug)
    out <- fread(file, skip = (idx - 1) * chunk_size, nrows = chunk_size,
                 header = FALSE, ...)
    if (!is.null(FUN)) {
      out <- FUN(out)
    }
    return(out)
  })
  return(rbindlist(out))
}
chunked_fread(my_file, chunk_size = 100, FUN = function(x) x[, list(new = V1 + V2)])
@zachmayer I am also using a similar function. But I like readr's chunked functions, which call a user-supplied callback on each chunk.
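Roughly, that pattern looks like the following sketch; the file path, column name, and filter are made up, and the result is converted back to a data.table at the end:

library(readr)
library(data.table)

# read_csv_chunked() invokes the callback on each chunk as it is parsed;
# DataFrameCallback$new() row-binds whatever the callback returns, so only
# the filtered rows accumulate in memory.
filtered <- read_csv_chunked(
  "big.csv",                                   # placeholder path
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk[chunk$value > 0, ]                   # hypothetical per-chunk filter
  }),
  chunk_size = 1e5
)
dt <- as.data.table(filtered)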
I'd like to chime in with an example as well. I am just analysing a genetic variant file with 421 million rows, and long rows at that. I've got a few of these files that are 100s of GBs, so I encounter memory issues when trying to deal with them on the server. It is common in genetics to work with very large tabular data. Often what I would like to be able to do is to stream and filter the file to reduce its size. I usually do this filtering in bash, but it would be a lot smoother if I could do it all in R with data.table.
Related to @adamwaring's use case: #583
{disk.frame} provides a method for chunked reading (and manipulation) of too-large-for-RAM tabular data. Notably, it imports {data.table}.
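For instance, a rough sketch (function names are from {disk.frame}'s API as I recall it, so treat them as approximate; the path and column are placeholders):

library(disk.frame)
library(dplyr)
setup_disk.frame()   # start background workers

# convert the CSV to an on-disk chunked format, reading in_chunk_size rows
# at a time, then operate on it chunk-wise with dplyr verbs
big <- csv_to_disk.frame("big.csv", outdir = "big.df", in_chunk_size = 1e6)
small <- big %>% filter(value > 0) %>% collect()   # hypothetical filter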
Oh awesome, disk.frame totally solves my problem. Nice that it imports data.table!
Consider a case when there's a large csv file, but it can be processed in chunks. It would be nice if fread could read the file in chunks. See also Reading in chunks at a time using fread in package data.table on StackOverflow.

The interface would be something like fread.apply(input, fun, chunk.size = 1000, ...), where fun would be applied (similarly to lapply) to successive data.table chunks read from input, each of size at most chunk.size. A user-space sketch of this interface follows.

If there's a consensus, I could work on a PR.
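For concreteness, here is a user-space approximation of that interface built on fread's skip/nrows arguments. This is a sketch, not a proposed implementation: it assumes a Unix-like system for wc and a single header row, and it rescans the file from the top for every chunk, which is exactly the inefficiency native support would remove:

library(data.table)

fread.apply <- function(input, fun, chunk.size = 1000, ...) {
  # total data rows, excluding the header line
  n_rows <- as.integer(system(paste("wc -l <", shQuote(input)), intern = TRUE)) - 1L
  header <- names(fread(input, nrows = 0L))   # column names from row 1
  n_chunks <- ceiling(n_rows / chunk.size)
  out <- lapply(seq_len(n_chunks), function(i) {
    chunk <- fread(input,
                   skip = 1L + (i - 1L) * chunk.size,  # skip header + earlier chunks
                   nrows = chunk.size,
                   header = FALSE, col.names = header, ...)
    fun(chunk)                                # apply fun to each chunk, lapply-style
  })
  rbindlist(out)
}

# usage sketch (hypothetical column x):
# fread.apply("big.csv", function(d) d[x > 0], chunk.size = 1e5)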