I do not remember whether this was reported before (similar issues have been reported), but reading the file instagram_posts.csv from https://www.kaggle.com/datasets/shmalex/instagram-dataset using 4 threads leaves roughly 15 GB of memory leaked (even after destroying all visible variables that reference the parsed data), and `GC.gc()` remains very slow from then on.
When doing the same on a single thread everything is OK, i.e. after removing references to the parsed data and calling `GC.gc()`, memory usage returns to its previous level.
Configuration: Win11, Julia 1.8.2, CSV.jl 0.10.4.
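For reference, the workflow described above can be sketched as follows. This is an assumed reconstruction (the issue does not include the original script); the filename and thread count come from the report, and `Sys.free_memory()` is just one way to eyeball the before/after memory levels:

```julia
# Reproduction sketch (assumptions labeled; requires the Kaggle dataset above).
# Start Julia with 4 threads, e.g. `julia -t 4`.
using CSV, DataFrames  # CSV.jl 0.10.4 as in the report

baseline = Sys.free_memory()  # rough memory level before parsing

# CSV.read parses with multiple tasks by default; `ntasks=4` makes the
# reported 4-thread configuration explicit (ntasks=1 is the single-thread case).
df = CSV.read("instagram_posts.csv", DataFrame; ntasks=4)

# Drop every visible reference to the parsed data...
df = nothing

# ...then force a full collection. Per the report, with 4 tasks ~15 GB stays
# resident here, while with ntasks=1 memory returns to the baseline.
GC.gc()
@show baseline - Sys.free_memory()
```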
@bkamins, can you check your script/workflow on #1046? I believe we're probably also running into a similar issue in Arrow.jl w/ multithreaded reading/writing. It's probably also worth considering for DataFrames.jl and any other packages utilizing Threads.@spawn.