
How does one make DTable construction lazy? #7

Open
salbert83 opened this issue Jan 5, 2022 · 1 comment

Comments

@salbert83

I tried

using Dagger, Parquet
tbl = Dagger.DTable(Parquet.read_parquet, my_files)

where "my_files" is an array of paths to Parquet files that were written out from a Dask dataframe. This seems to load everything into memory. I'd like a way to process the data out-of-core, similar to Dask; I was under the impression this was a goal for DTable. Thanks.

@jpsamaroo
Member

You can give JuliaData/MemPool.jl#60 a try, which is my new WIP approach to swap-to-disk (just set the environment variable JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR=1 to enable it; see the sketch after this comment). I will warn you that it's not ready yet:

  • Performance of reads from swapped-out data is currently bad, because data is not properly migrated back to memory (it is re-read from disk on every access)
  • The memory usage limit is not yet tunable, and defaults to 8GB
  • The disk usage limit is currently unbounded, and will use all of your disk space if you allocate too much (everything is stored in .mempool relative to your current working directory, in case you need to manually delete those files)

I plan to begin DTable testing of that PR soon but haven't yet had the chance, so do feel free to give it a spin! I'll let you know once I've fixed the above issues.
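A minimal sketch of trying the experimental allocator, assuming the environment variable must be set before Dagger/MemPool is loaded, and reusing "my_files" from the question above:

# Enable the experimental swap-to-disk allocator from MemPool.jl#60.
# Assumption: this must be set before Dagger/MemPool is loaded.
ENV["JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR"] = "1"

using Dagger, Parquet

# Build the table as before; data exceeding the in-memory limit should
# now be swapped out to the .mempool directory on disk.
tbl = Dagger.DTable(Parquet.read_parquet, my_files)

# The .mempool directory (relative to the current working directory) can
# be deleted manually once you are done:
# rm(".mempool"; recursive=true, force=true)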

@krynju krynju transferred this issue from JuliaParallel/Dagger.jl Jun 19, 2022