-
Hi @min-mwei, thanks for raising this! Could you run:
We do have documentation around memory usage, but it's geared towards running on the Ray runner. Daft does not perform out-of-core processing when running on the PyRunner (the default single-node backend used when you run Daft without explicitly switching runners). cc @samster25 as well
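For context, switching off the default PyRunner is a configuration step. A minimal sketch, assuming Daft's documented `DAFT_RUNNER` environment variable (the script name is hypothetical):

```shell
# Sketch: select the Ray runner instead of the default single-node PyRunner
# before starting Python, via Daft's DAFT_RUNNER environment variable.
export DAFT_RUNNER=ray
python my_query_script.py  # hypothetical script containing the Daft query
```

The same switch can be made programmatically at the top of a script with `daft.context.set_runner_ray()`, before any DataFrames are created.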
-
I just played with some real data. My test is a simple count over 1000+ parquet files (250+ GB total), running on a VM with 370 GB of memory and 48 cores.
The line below finishes the count with ~100 GB of DRAM consumption, in about ~20 minutes:
daft.read_parquet("az://...", io_config=io_config).select('some_id', 'some_field').groupby('some_id').agg([('some_field', 'count')]).collect()
I thought splitting into two partitions, as below, would speed things up:
daft.read_parquet("az://...", io_config=io_config).into_partitions(2).select('some_id', 'some_field').groupby('some_id').agg([('some_field', 'count')]).collect()
Yet it runs out of memory in under a minute, and there is no obvious configuration setting to control memory usage:
RuntimeError: Requested 4056071093404 bytes of memory but found only 405476225024 available
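A rough even-split estimate (a sketch using the numbers above; actual memory use also depends on decompression and encoding expansion, so these are lower bounds) shows how much data each partition would hold after `into_partitions(n)`:

```python
# Back-of-envelope: data per partition if ~250 GB of Parquet splits evenly.
# Hypothetical numbers from the report above; real in-memory size is larger
# once the compressed Parquet is decoded.
TOTAL_GB = 250

def gb_per_partition(total_gb: float, num_partitions: int) -> float:
    """Naive even-split estimate of data held by each partition."""
    return total_gb / num_partitions

for n in (1000, 48, 2):
    print(f"{n:>4} partitions -> ~{gb_per_partition(TOTAL_GB, n):.2f} GB each")
```

Coalescing into only 2 partitions asks each to materialize on the order of 125 GB at once, versus a fraction of a gigabyte per partition when the 1000+ files map to many small partitions.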
My package versions:
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
daft.__version__ '0.2.14'
Thanks for any suggestion.