CSV.Rows with CodecZlib Out-of-Memory #476

Closed
cpfiffer opened this issue Aug 2, 2019 · 5 comments
cpfiffer commented Aug 2, 2019

I'm reading some very large gzip files, where I want an iterator for each row. I have something similar to this:

using CSV
using CodecZlib

path = "/some/giant/file.csv.gz"

# `colnames` is a vector of column names, defined elsewhere.
open(GzipDecompressorStream, path, "r") do stream
    rows = CSV.Rows(stream, header=colnames, datarow=2)
    for row in rows
        # Do stuff.
    end
end

I get an OOM error for fairly large files:

reading /home/cameron/Dropbox/Research/data/oil_data_sample/WISC_TAS_20160104.csv.gz                                                                         
ERROR: LoadError: LoadError: OutOfMemoryError()
Stacktrace:
 [1] _growend! at ./array.jl:812 [inlined]
 [2] ensureroom at ./iobuffer.jl:320 [inlined]
 [3] unsafe_write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Ptr{UInt8}, ::UInt64) at ./iobuffer.jl:409                                                       
 [4] unsafe_write at ./io.jl:509 [inlined]
 [5] macro expansion at ./gcutils.jl:87 [inlined]                                                                                                            
 [6] write at ./io.jl:532 [inlined]                                                                                                                          
 [7] write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}) at ./io.jl:579                         
 [8] getsource(::TranscodingStreams.TranscodingStream{GzipDecompressor,IOStream}, ::Bool) at /home/cameron/.julia/packages/CSV/9II7K/src/utils.jl:167     
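The trace points at `getsource` in CSV.jl's utils.jl copying the entire decompressed stream into an in-memory `IOBuffer` before parsing, so memory grows with the full decompressed size rather than the row size. A minimal sketch of that pattern (hypothetical names, not the actual CSV.jl source):

# Hypothetical sketch of the pattern implicated in the stack trace above:
# writing an arbitrary IO into an IOBuffer reads until EOF, so the whole
# decompressed payload must fit in RAM at once.
function slurp(io::IO)
    buf = IOBuffer()
    write(buf, io)    # `ensureroom`/`_growend!` grow the buffer as bytes arrive
    return take!(buf) # full contents as a single Vector{UInt8}
end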

Any thoughts on what to do here?

cpfiffer (Author) commented Aug 2, 2019

Sorry, I hit the submit issue button too early. I've updated the main text.

quinnj (Member) commented Aug 2, 2019

Thanks for the report; we can be smarter about what we're doing here.

quinnj added a commit that referenced this issue Aug 6, 2019
… using mmapped buffers to slurp IO objects, which should be a little easier on overall memory. Now, this is an ok short-term fix, and is definitely smarter for the CSV.File case, but we really should find a true buffering solution for CSV.Rows, since the whole zen there is a low-memory footprint
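A sketch of the slurp-to-mmap idea described in that commit message (an assumed shape, not the actual CSV.jl implementation): stream the decompressed bytes to a temporary file and memory-map it, so the buffer lives in pageable, file-backed memory instead of the Julia heap.

using Mmap

# Assumed sketch of the mmapped-buffer approach, not the actual CSV.jl code:
# spill the IO to a temp file, then memory-map the file as a byte vector.
function slurp_to_mmap(io::IO)
    path, out = mktemp()   # temporary file path + open IOStream
    try
        write(out, io)     # stream decompressed bytes to disk
    finally
        close(out)
    end
    return Mmap.mmap(path) # Vector{UInt8} backed by the file, paged by the OS
end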
quinnj (Member) commented Aug 6, 2019

@cpfiffer, I have a PR up (#477); it'd be great if you could try it out for your use case (you can get that branch by doing ] add CSV#jq/476). Hopefully it solves the OutOfMemoryError, but as I noted on that PR, we really need a better buffering solution unique to CSV.Rows. I'll keep noodling on what we can do there.
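For reference, the same branch checkout through the Pkg API, equivalent to the REPL command above:

using Pkg
Pkg.add(PackageSpec(name="CSV", rev="jq/476"))  # same as `] add CSV#jq/476`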

quinnj added a commit that referenced this issue Aug 7, 2019
… using mmapped buffers to slurp IO objects, which should be a little easier on overall memory. Now, this is an ok short-term fix, and is definitely smarter for the CSV.File case, but we really should find a true buffering solution for CSV.Rows, since the whole zen there is a low-memory footprint (#477)
cpfiffer (Author) commented Aug 7, 2019

Seems to be working on my side with master, thanks! I'm good to close this if you are.

quinnj (Member) commented Aug 7, 2019

Great!
