
How does one make DTable construction lazy? #7

Open
salbert83 opened this issue Jan 5, 2022 · 1 comment

Comments

@salbert83

I tried

using Dagger, Parquet
tbl = Dagger.DTable(Parquet.read_parquet, my_files)

where "my_files" is an array of paths to Parquet files that were written out from a Dask dataframe. This seems to load everything into memory. I'd like a way to process the data out-of-core, similar to Dask; I was under the impression this was a goal for DTable. Thanks.

@jpsamaroo
Member

You can give JuliaData/MemPool.jl#60 a try, which is my new WIP approach to swap-to-disk (just set the environment variable JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR=1 to enable it; see the sketch after this comment). I will warn you that it's not ready yet:

  • Performance of reads from swapped-out data is currently bad, because data is not properly migrated back to memory (it is re-read from disk on every access)
  • The memory usage limit is not yet tunable, and defaults to 8GB
  • The disk usage limit is currently unbounded, and will use all of your disk space if you allocate too much (everything is stored in .mempool relative to your current working directory, in case you need to manually delete those files)

I plan to begin DTable testing of that PR soon but haven't yet had the chance, so do feel free to give it a spin! I'll let you know once I've fixed the above issues.
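A minimal sketch of trying the experimental allocator, assuming the environment variable must be set before Dagger/MemPool is loaded, and reusing "my_files" from the question above:

# Enable the experimental swap-to-disk allocator from MemPool.jl#60.
# Assumption: this must be set before Dagger/MemPool is loaded.
ENV["JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR"] = "1"

using Dagger, Parquet

# Build the table as before; data exceeding the in-memory limit should
# now be swapped out to the .mempool directory on disk.
tbl = Dagger.DTable(Parquet.read_parquet, my_files)

# The .mempool directory (relative to the current working directory) can
# be deleted manually once you are done:
# rm(".mempool"; recursive=true, force=true)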

@krynju krynju transferred this issue from JuliaParallel/Dagger.jl Jun 19, 2022