Loading multiple .csv files uses ~double the memory it should #952

Closed
krynju opened this issue Dec 18, 2021 · 2 comments

krynju commented Dec 18, 2021

I'm on Julia master/1.8 on Windows.

I tried CSV.File, CSV.read with NamedTuples, DTable, DataFrame, etc.; all of them have the same issue.
There's some sticky memory left over after loading the .csv files.

The generated table is about 1.6 GB.

# generate
using DataFrames
d = DataFrame((;[Symbol("a$i") => rand(Int32(1):Int32(1000), Int(1e8)) for i in 1:4]...));
# run GC.gc() a few times, memory usage settles at ~

# prep
using CSV
genchunk = () -> (; [Symbol("a$i") => rand(Int32(1):Int32(1000), Int(1e7)) for i = 1:4]...)
mkpath("data")

for i = 1:10
    CSV.write(joinpath("data", "datapart_$i.csv"), genchunk())
end

# load from multiple files
files = [joinpath("data", "datapart_$i.csv") for i in 1:10]  # the paths written in the prep step
d = CSV.read(files, DataFrame)

[Screenshot: memory usage, generated]

[Screenshot: memory usage, loaded from files]

krynju commented Dec 19, 2021

d = CSV.read(files, DataFrame, types=Int32)
I forgot that it parses integers as Int64 by default, and that's where my doubled memory usage was coming from.
The sticky memory related to the glibc issue is still observable on my end, but that's a different issue.
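
For reference, the factor of two follows from element size alone. Below is a minimal sketch of the arithmetic, assuming dense columns with no missing values and ignoring any per-column overhead. The names `nrows`, `ncols`, `bytes_int32` and `bytes_int64` are illustrative only and are not part of CSV.jl or DataFrames.jl.

```julia
# Back-of-the-envelope size estimate; all names here are illustrative.
nrows = 10^8   # 10 files x 10^7 rows each, as in the prep step
ncols = 4

bytes_int32 = nrows * ncols * sizeof(Int32)   # columns stored as Int32
bytes_int64 = nrows * ncols * sizeof(Int64)   # columns parsed as Int64

println("Int32 columns: ", bytes_int32 / 1e9, " GB")
println("Int64 columns: ", bytes_int64 / 1e9, " GB")

# The same ratio checked on real (smaller) vectors
small32 = rand(Int32(1):Int32(1000), 10^6)
small64 = Int64.(small32)
println("ratio: ", sizeof(small64) / sizeof(small32))
```

Under those assumptions the Int32 table comes to about 1.6 GB and the Int64 version to about 3.2 GB, which matches the "~double" observation without any leak being involved.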

@krynju krynju closed this as completed Dec 19, 2021