
[TPC-H] Workers get restarted after running out of memory during multiple queries at scale 1000 #1367

Open
6 tasks
hendrikmakait opened this issue Feb 8, 2024 · 4 comments

Comments

@hendrikmakait
Member

hendrikmakait commented Feb 8, 2024

At scale 1000, all of these queries cause workers to get restarted after running out of memory. We should investigate the cause and see whether we're missing optimizations, have chosen a poor join order, or whether there are other issues with these queries.
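
For reference, a minimal sketch (not from this issue; the scheduler address is a placeholder) of how one could confirm that the restarts are memory-driven by polling worker memory against the configured limit while a query runs:

```python
from distributed import Client

# Hypothetical scheduler address; substitute the benchmark cluster's address.
client = Client("tcp://scheduler:8786")

# Compare each worker's memory limit against its current usage, to see how
# close workers get to the limit before the nanny restarts them.
for addr, info in client.scheduler_info()["workers"].items():
    used_gib = info["metrics"]["memory"] / 2**30
    limit_gib = info["memory_limit"] / 2**30
    print(f"{addr}: {used_gib:.1f} GiB used of {limit_gib:.1f} GiB")
```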

@phofl
Contributor

phofl commented Feb 9, 2024

Query 18 most likely dies because our source dataset is weird. We have files that take up 50 MB in memory and files that take up 380 MB in memory. The latter is relatively big for our small machines (8 GB of RAM). This gets worse with our strategy of combining multiple partitions when we drop columns: we end up combining a few of the large ones, which makes them even bigger.

I don't know exactly how we want to proceed, but the varying partition sizes are probably not very good for what we want to do here.

Edit: This is not related to compression; the same difference shows up in the compressed file sizes.
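
A minimal sketch (not from the thread; the dataset path is hypothetical) of how the per-partition memory spread described above can be measured:

```python
import dask.dataframe as dd

# Hypothetical path; the real TPC-H scale-1000 dataset location isn't given here.
df = dd.read_parquet("s3://bucket/tpch/scale-1000/lineitem/")

# In-memory size of each partition in MiB, one value per partition.
sizes_mib = df.map_partitions(
    lambda part: part.memory_usage(deep=True).sum() / 2**20
).compute()
print(sizes_mib.describe())  # min/max should reveal the ~50 MB vs ~380 MB spread
```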

@fjetter
Member

fjetter commented Feb 12, 2024

Varying partition sizes are very realistic, and we shouldn't micro-optimize our code to only run on extremely homogeneous datasets.

@phofl
Contributor

phofl commented Feb 12, 2024

I agree, but this is hard to change with the current read_parquet implementation.
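
One possible workaround, sketched under the assumption that an extra pass over the data is acceptable (the path is hypothetical), is to rebalance the uneven partitions after the read rather than changing read_parquet itself:

```python
import dask.dataframe as dd

# Hypothetical path; read_parquet maps partitions to the source files as they are.
df = dd.read_parquet("s3://bucket/tpch/scale-1000/lineitem/")

# Rebalance partitions to a roughly uniform in-memory size.
# Note: this computes partition sizes, so it adds a pass over the data.
df = df.repartition(partition_size="128MB")
```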

@phofl
Contributor

phofl commented Feb 12, 2024

See #1376 for queries 17 and 18.
