fread with "text" argument is very slow #4919

Open
jeanlain opened this issue Mar 2, 2021 · 4 comments

@jeanlain

jeanlain commented Mar 2, 2021

I haven't found this issue reported.

To reproduce:

library(data.table)
library(microbenchmark)
file <- "path/to/large/file"
string <- readLines(file)
microbenchmark(fread(file), fread(text = string), times = 1)

Here are my results (confirmed on different files and macOS + Linux), with a ~150MB file.

Unit: milliseconds
                 expr        min         lq       mean     median         uq        max neval
          fread(file)   598.7479   598.7479   598.7479   598.7479   598.7479   598.7479     1
 fread(text = string) 41160.0737 41160.0737 41160.0737 41160.0737 41160.0737 41160.0737     1

Expected result: the second command should be at least as fast as the first.

On macOS 11.2, I noticed that the slower command involves some write activity to disk. I haven't checked this on Linux (that test ran on a remote server). In any case, both commands use 100% of one CPU.

@ColeMiller1
Contributor

Thanks for the report, I can reproduce on Windows. Here is a self-contained reprex:

library(data.table)
tmp = tempfile()
fwrite(data.table(id = 1:10000), tmp)

string = readLines(tmp)
microbenchmark::microbenchmark(fread(tmp), fread(text = string), times = 1)
#> Unit: milliseconds
#>                  expr        min         lq       mean     median         uq
#>            fread(tmp)   2.495301   2.495301   2.495301   2.495301   2.495301
#>  fread(text = string) 217.293502 217.293502 217.293502 217.293502 217.293502

Here is the relevant source:

data.table/R/fread.R

Lines 37 to 39 in 3fa8b20

if (length(text) > 1L) {
cat(text, file=(tmpFile<-tempfile(tmpdir=tmpdir)), sep="\n") # avoid paste0() which could create a new very long single string in R's memory
file = tmpFile

#4572 is loosely related. #4805 would largely close this issue, as its proposed implementation is much faster, although it would still be slower than reading directly from the .csv. Could you expand on the use case?

microbenchmark::microbenchmark(
  old = cat(string, file = (tmpFile <- tempfile(tmpdir = tempdir())), sep = "\n"),
  new = writeLines(string, (tmpFile <- tempfile(tmpdir = tempdir()))),
  times = 10)
## Unit: milliseconds
##  expr      min       lq      mean   median       uq      max neval
##   old 173.6504 181.8860 194.84379 187.7294 201.8676 246.2789    10
##   new  19.4778  30.4866  30.32118  31.4903  33.5262  35.5296    10

@jeanlain
Author

jeanlain commented Mar 2, 2021

Thanks for the reply.
My use case is this: I would like to import tabular output generated by certain programs (like blast or samtools) directly into data.tables without writing anything to disk. I noticed that fread(cmd = command) writes the output of command to disk before importing it, which can be inefficient. By comparison, system(command, intern = TRUE) captures the output of command directly, without writing anything to disk, so I would use that instead. But the result is a character vector rather than a parsed table, so fread(text = …) would have been useful, as fread is very fast at parsing.
But as I understand it, fread(text = string) involves writing string to a file, which looks inefficient. Why does it do that?

@ColeMiller1
Contributor

The example passes a character vector of length greater than one to fread(). With a length-one character vector, no writing to file would be needed. E.g.:

library(data.table)
fread("
id var 
1 a")
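
To illustrate the difference between the two input shapes, here is a minimal sketch, assuming made-up comma-separated data (the column names and values are not from the report). Both calls should return the same table; per the source excerpt above, only the second (length greater than one) goes through the temporary-file branch.

```r
library(data.table)

# Made-up sample data for illustration only
one_string = "id,var\n1,a\n2,b"           # length-one character vector
many_lines = c("id,var", "1,a", "2,b")    # length-three character vector

one  = fread(text = one_string)   # parsed directly from memory
many = fread(text = many_lines)   # written to a temporary file first, then read

# The parsed results are the same; only the route taken differs
stopifnot(isTRUE(all.equal(one, many)))
```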

To accept a character vector of length greater than one without writing to disk, significant changes would be needed. AFAIU fread() points to the start and end of the text and then parses until it reaches EOF. While there may be nice use cases for accepting large character vectors, it would be a non-trivial amount of work.

I am not very familiar with system(). With fread() we end up with a data.table. AFAIU, system() is not really designed for that use case, which is: do some system processing to produce an intermediate result, then read that intermediate result into RAM as a data.table.

@addnox

addnox commented Feb 8, 2022

My workaround is to use paste() (or stringi's stri_c()) to collapse the character vector into a single string:

library(data.table)
tmp = tempfile()
fwrite(data.table(id = 1:10000), tmp)

string = readLines(tmp)
microbenchmark::microbenchmark(
  file = fread(tmp), 
  charVector = fread(text = string), 
  singleStr = fread(text = paste(string, collapse = "\n")),
  times = 1)
##Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##       file   3.2711   3.2711   3.2711   3.2711   3.2711   3.2711     1
##  charVector 228.0831 228.0831 228.0831 228.0831 228.0831 228.0831     1
##  singleStr   2.2994   2.2994   2.2994   2.2994   2.2994   2.2994     1
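
Applied to the use case described earlier in the thread (parsing captured command output without a temporary file), the same workaround might be wrapped as below. This is only a sketch: `fread_lines` is a hypothetical helper, not part of data.table, and the `lines` vector stands in for whatever `system(command, intern = TRUE)` would return. It assumes the input has at least two lines. As far as I understand, a single string with no newline may be interpreted as a file name rather than as text.

```r
library(data.table)

# Hypothetical helper (not part of data.table): collapse the lines into a
# single string so that fread(text = ) receives a length-one character vector
fread_lines = function(lines, ...) {
  stopifnot(is.character(lines), length(lines) >= 2L)
  fread(text = paste(lines, collapse = "\n"), ...)
}

# Stand-in for: lines = system(command, intern = TRUE)
lines = c("id,var", "1,a", "2,b")
dt = fread_lines(lines)
```

Note that paste() builds one long string in memory, which is the cost the comment in fread.R says the temporary file was meant to avoid, so this trades memory for speed.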
