CSV.Rows with CodecZlib Out-of-Memory #476

Closed
cpfiffer opened this issue Aug 2, 2019 · 5 comments
cpfiffer commented Aug 2, 2019

I'm reading some very large gzip files, where I want an iterator for each row. I have something similar to this:

using CSV
using CodecZlib

path = "/some/giant/file.csv.gz"

# `colnames` is a vector of column names, defined elsewhere.
open(GzipDecompressorStream, path, "r") do stream
    rows = CSV.Rows(stream, header=colnames, datarow=2)
    for row in rows
        # Do stuff.
    end
end

I get an OOM error for fairly large files:

reading /home/cameron/Dropbox/Research/data/oil_data_sample/WISC_TAS_20160104.csv.gz                                                                         
ERROR: LoadError: LoadError: OutOfMemoryError()
Stacktrace:
 [1] _growend! at ./array.jl:812 [inlined]
 [2] ensureroom at ./iobuffer.jl:320 [inlined]
 [3] unsafe_write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Ptr{UInt8}, ::UInt64) at ./iobuffer.jl:409                                                       
 [4] unsafe_write at ./io.jl:509 [inlined]
 [5] macro expansion at ./gcutils.jl:87 [inlined]                                                                                                            
 [6] write at ./io.jl:532 [inlined]                                                                                                                          
 [7] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}) at ./io.jl:579                         
 [8] getsource(::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}, ::Bool) at /home/cameron/.julia/packages/CSV/9II7K/src/utils.jl:167     
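The trace points at `getsource` in CSV.jl's utils.jl copying the entire decompressed stream into an in-memory `IOBuffer` before parsing, so memory grows with the full decompressed size rather than the row size. A minimal sketch of that pattern (hypothetical names, not the actual CSV.jl source):

# Hypothetical sketch of the pattern implicated in the stack trace above:
# writing an arbitrary IO into an IOBuffer reads until EOF, so the whole
# decompressed payload must fit in RAM at once.
function slurp(io::IO)
    buf = IOBuffer()
    write(buf, io)    # `ensureroom`/`_growend!` grow the buffer as bytes arrive
    return take!(buf) # full contents as a single Vector{UInt8}
end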

Any thoughts on what to do here?

cpfiffer (Author) commented Aug 2, 2019

Sorry, I hit the submit issue button too early. I've updated the main text.

quinnj (Member) commented Aug 2, 2019

Thanks for the report; we can be smarter about what we're doing here.

quinnj added a commit that referenced this issue Aug 6, 2019
… using mmapped buffers to slurp IO objects, which should be a little easier on overall memory. Now, this is an ok short-term fix, and is definitely smarter for the CSV.File case, but we really should find a true buffering solution for CSV.Rows, since the whole zen there is a low-memory footprint
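A sketch of the slurp-to-mmap idea described in that commit message (an assumed shape, not the actual CSV.jl implementation): stream the decompressed bytes to a temporary file and memory-map it, so the buffer lives in pageable, file-backed memory instead of the Julia heap.

using Mmap

# Assumed sketch of the mmapped-buffer approach, not the actual CSV.jl code:
# spill the IO to a temp file, then memory-map the file as a byte vector.
function slurp_to_mmap(io::IO)
    path, out = mktemp()   # temporary file path + open IOStream
    try
        write(out, io)     # stream decompressed bytes to disk
    finally
        close(out)
    end
    return Mmap.mmap(path) # Vector{UInt8} backed by the file, paged by the OS
end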
quinnj (Member) commented Aug 6, 2019

@cpfiffer, I have a PR up (#477); it'd be great if you could try it out for your use case (you can get that branch by doing ] add CSV#jq/476). Hopefully it solves the OutOfMemoryError, but as I noted on that PR, we really need a better buffering solution unique to CSV.Rows. I'll keep noodling on what we can do there.
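For reference, the same branch checkout through the Pkg API, equivalent to the REPL command above:

using Pkg
Pkg.add(PackageSpec(name="CSV", rev="jq/476"))  # same as `] add CSV#jq/476`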

quinnj added a commit that referenced this issue Aug 7, 2019
… using mmapped buffers to slurp IO objects, which should be a little easier on overall memory. Now, this is an ok short-term fix, and is definitely smarter for the CSV.File case, but we really should find a true buffering solution for CSV.Rows, since the whole zen there is a low-memory footprint (#477)
cpfiffer (Author) commented Aug 7, 2019

Seems to be working on my side with master, thanks! I'm good to close this if you are.

quinnj (Member) commented Aug 7, 2019

Great!
