Chunked fread #1721
If I understand correctly, this FR asks for a new fread.apply-style function. The only complication that I foresee is that […]
It would be easier to implement emitting DataTables of approximate size (i.e. not exactly 1,000,000 rows, but approximately that many). Then the logic could be implemented in pushBuffers.
FWIW, this is no longer relevant for me.
To me, CSV is a one-off on the way to a binary format or database. If the data is so large that it won't fit in memory and chunking is needed, then it should be in a database or binary format.
Just came across this Q & A, which suggests a simple (?) way to accomplish this would be to allow fread to read from connections.
@MichaelChirico I'm not sure we are all on the same page about what it means to "support file connections". For example, python's […]. However, this wouldn't allow chunking the input. I mean, if you ask fread to read only part of the input, it may still need the column re-read: types detected early in the file can be bumped by rows later on.
What I have in mind is not for fread to handle the chunking automatically, but for the user to be able to rig something up themselves that amounts to chunking. So perhaps the user won't have access to all the bells & whistles (e.g. the column re-read, as you said), but, e.g., if they can supply colClasses, that goes away. I might not be fully grokking the engineering requirements, though.

fread(paste(readLines(f, n = 1e5), collapse = '\n')) is a way to accomplish it now, quite inefficiently of course. Is there no way to imitate that without substantial code refactoring (conditional on fread accepting connections)?
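To make that rigging concrete, here is a minimal sketch of the idea, assuming a file with a header row and fread's text= argument (available in data.table >= 1.11.0); the file name is a placeholder:

library(data.table)

con <- file("big.csv", "r")          # placeholder path; open a text connection
header <- readLines(con, n = 1L)     # keep the header to prepend to each chunk
chunks <- list()
repeat {
  batch <- readLines(con, n = 1e5)   # next 100,000 raw lines (fewer at EOF)
  if (length(batch) == 0L) break
  # parse the batch in memory; re-attaching the header keeps column names
  chunks[[length(chunks) + 1L]] <- fread(text = c(header, batch))
}
close(con)
result <- rbindlist(chunks)

Note this inherits the usual line-based caveat: quoted fields containing embedded newlines would be split across batches incorrectly.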
@mattdowle - I disagree with your reason for closing this. There are lots of examples of huge datasets that have no business in a database, and lots of streaming algorithms to work with them. Why not keep your hat in this arena? DT excels in so many ways, and it already interoperates with non-seekable inputs. I expect @st-pasha's suggestion that this "could be implemented in pushBuffers" is probably spot on. And clearly, hope springs eternal: see Reading in chunks at a time using fread in package data.table on StackOverflow. Just sayin'.
Please provide one example so I can understand.
Please provide one example. If I understand correctly, you're asking for chunked reading. I've reopened so we can discuss further. The examples will help a lot.
Just to clarify: databases are wrong for analytics, but for transaction processing they make perfect sense.
@mattdowle - Next-gen sequencing commonly produces tens of millions (or more) of rows in Sequence Alignment Map (SAM) files, which are then further scanned/tabulated, often using fread (at least by me). I have wished for "chunked" reading due to memory considerations, and also the desire to dispatch chunks to worker processes. These SAM files are often themselves ephemeral and intermediate to an analysis; databasing them would waste resources. They are produced as flat files, have a few binary variants that facilitate selected retrieval operations (which are also probably best considered ephemeral), and regardless, they still occasionally require linear scanning. Perhaps I am misunderstanding your considerations, or am overlooking an approach I might be taking.

On a similar/related topic, would you entertain a request for fread to produce a list of data.tables, one for each section within a .csv, where section delimiters would be identified during the scan via regular expressions? This could easily be accomplished in two passes - one to find sections based on regular expressions, and the next passing line numbers and counts to fread - however, I thought it might make a natural extension to fread's inner loop; a sketch of the two-pass approach follows. Thanks for your consideration of this.
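For illustration only, a rough user-space sketch of that two-pass idea, assuming sections are delimited by lines matching a regex, each section is non-empty, and the whole file fits through readLines (the pattern and names are hypothetical):

library(data.table)

read_sections <- function(file, pattern = "^#SECTION") {
  all_lines <- readLines(file)            # pass 1: scan once for delimiter lines
  starts <- grep(pattern, all_lines)      # assumes at least one delimiter exists
  ends <- c(starts[-1] - 1L, length(all_lines))
  # pass 2: hand each section's lines to fread as in-memory text
  Map(function(s, e) fread(text = all_lines[(s + 1L):e]), starts, ends)
}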
@malcook Thanks for the info and links. If there's enough demand for it, then I should reconsider.
I'd just like to chime in as another programmer who'd appreciate this. I'd be the first to agree that a single CSV file is a poor choice of format for a massive dataset. Unfortunately, one might still get massive CSV files from other people. Two examples I've encountered so far are US voter-registration data, which can comprise tens of millions of observations per state but which state governments may provide as single CSV files, and daily or hourly observations from Mexican weather stations. Such files would be easy to handle with linewise Unix tools if it weren't for quoted fields with embedded newlines. My group will probably use […]
Just to throw in an example I have: I've got a csv that's too big to fit into memory on my laptop. However, I have some data munging I want to do on that dataset that amounts to: […]

After this operation, the file size is small enough to fit into memory, but I can't load it all at once to do it. Right now I'm using a function like this (it isn't perfect yet):

library(pbapply)
library(data.table)
chunked_fread <- function(file, chunk_size, FUN = NULL, verbose = TRUE, ...) {
  # count rows up front (requires a Unix-like system with wc);
  # assumes the file has no header row, so every line is a data row
  lines <- as.integer(system(paste("wc -l <", file), intern = TRUE))
  n_chunks <- ceiling(lines / chunk_size)
  if (verbose) {
    print(paste(lines, 'lines,', n_chunks, 'chunks, chunk size of', chunk_size))
    lapply <- pblapply  # shadow lapply locally to get a progress bar
  }
  out <- lapply(seq_len(n_chunks), function(idx) {
    # skip whole chunks, not single lines (the original skip = idx was a bug)
    out <- fread(file, skip = (idx - 1) * chunk_size, nrows = chunk_size,
                 header = FALSE, ...)
    if (!is.null(FUN)) {
      out <- FUN(out)
    }
    return(out)
  })
  return(rbindlist(out))
}
chunked_fread(my_file, chunk_size = 100, FUN = function(x) x[, list(new = V1 + V2)])
@zachmayer I am also using a similar function. But I like readr's chunked functions, which call a user-supplied callback on each chunk.
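Roughly, that pattern looks like the following sketch; the file path, column name, and filter are made up, and the result is converted back to a data.table at the end:

library(readr)
library(data.table)

# read_csv_chunked() invokes the callback on each chunk as it is parsed;
# DataFrameCallback$new() row-binds whatever the callback returns, so only
# the filtered rows accumulate in memory.
filtered <- read_csv_chunked(
  "big.csv",                                   # placeholder path
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk[chunk$value > 0, ]                   # hypothetical per-chunk filter
  }),
  chunk_size = 1e5
)
dt <- as.data.table(filtered)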
I'd like to chime in with an example as well. I am just analysing a genetic variant file with 421 million rows, and long rows at that. I've got a few of these files that are 100s of GBs, so I encounter memory issues when trying to deal with them on the server. It is common in genetics to work with very large tabular data. Often what I would like to be able to do is to stream and filter the file to reduce its size. I usually do this filtering in bash, but it would be a lot smoother if I could do it all in R with data.table.
Related to @adamwaring's use case: #583
{disk.frame} provides a method for chunked reading (and manipulation) of too-large-for-RAM tabular data. Notably, it imports {data.table}.
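For instance, a rough sketch (function names are from {disk.frame}'s API as I recall it, so treat them as approximate; the path and column are placeholders):

library(disk.frame)
library(dplyr)
setup_disk.frame()   # start background workers

# convert the CSV to an on-disk chunked format, reading in_chunk_size rows
# at a time, then operate on it chunk-wise with dplyr verbs
big <- csv_to_disk.frame("big.csv", outdir = "big.df", in_chunk_size = 1e6)
small <- big %>% filter(value > 0) %>% collect()   # hypothetical filter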
Oh awesome, disk.frame totally solves my problem. Nice that it imports data.table!
Consider a case when there's a large csv file, but it can be processed in chunks. It would be nice if fread could read the file in chunks. See also Reading in chunks at a time using fread in package data.table on StackOverflow.

The interface would be something like fread.apply(input, fun, chunk.size = 1000, ...), where fun would be applied (similarly to lapply) to successive data.table chunks read from input, each of size at most chunk.size. A user-space sketch of this interface follows.

If there's a consensus, I could work on a PR.
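For concreteness, here is a user-space approximation of that interface built on fread's skip/nrows arguments. This is a sketch, not a proposed implementation: it assumes a Unix-like system for wc and a single header row, and it rescans the file from the top for every chunk, which is exactly the inefficiency native support would remove:

library(data.table)

fread.apply <- function(input, fun, chunk.size = 1000, ...) {
  # total data rows, excluding the header line
  n_rows <- as.integer(system(paste("wc -l <", shQuote(input)), intern = TRUE)) - 1L
  header <- names(fread(input, nrows = 0L))   # column names from row 1
  n_chunks <- ceiling(n_rows / chunk.size)
  out <- lapply(seq_len(n_chunks), function(i) {
    chunk <- fread(input,
                   skip = 1L + (i - 1L) * chunk.size,  # skip header + earlier chunks
                   nrows = chunk.size,
                   header = FALSE, col.names = header, ...)
    fun(chunk)                                # apply fun to each chunk, lapply-style
  })
  rbindlist(out)
}

# usage sketch (hypothetical column x):
# fread.apply("big.csv", function(d) d[x > 0], chunk.size = 1e5)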