
GNN training runs out of GPU memory when dealing with many-stroke scenes #1201

Open

spigo900 opened this issue Oct 20, 2022 · 1 comment

@spigo900 (Collaborator)

GNN training runs out of GPU memory when dealing with many-stroke scenes. For example, this happens when training on the M5 objects curriculum using STEGO segmentations with stroke merging enabled and no color segmentation refinement. That setup yields 599 train inputs, a significant minority of which are "many-stroke" scenes; the largest input has s_max = 42 strokes. GNN training can't fit the activations for a dataset like this into GPU memory, so it crashes during the forward pass.

Specifically, we run into trouble with the message-passing part of the GNN (the MPNN): the number of edge outputs scales quadratically with s_max, and the edge outputs alone end up using about 8 GiB of memory. This is one of the very first tensors computed, so it doesn't leave much room for any further tensors. I haven't tested yet whether this also happens at decode time; it's possible we can scrape by when we don't also have to hold on to tensors for the backward pass. That's something I plan to test.
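For a rough sense of the scaling, here is a minimal back-of-the-envelope sketch (not the project's actual MPNN code; `hidden_dim` and the single-tensor framing are assumptions for illustration) showing how a fully-connected edge tensor grows quadratically with s_max when every scene is padded to the same stroke count:

```python
# Rough estimate of edge-message tensor size in a fully-connected MPNN
# where every scene is padded to s_max strokes.
# The numbers below are illustrative assumptions, not values from the repo.

def edge_activation_gib(num_scenes: int, s_max: int, hidden_dim: int,
                        bytes_per_elem: int = 4) -> float:
    """Memory (GiB) for one (num_scenes, s_max, s_max, hidden_dim) float32 tensor."""
    n_elems = num_scenes * s_max * s_max * hidden_dim
    return n_elems * bytes_per_elem / 2**30

# 599 inputs padded to s_max = 42 strokes, as in the failing run.
# hidden_dim = 512 is an assumed width; the real model may differ.
print(edge_activation_gib(599, 42, 512))  # ~2.0 GiB for a single edge tensor
```

Because the footprint is quadratic in s_max, one 42-stroke outlier is enough to blow the budget: a few tensors of this shape (messages, gate activations, plus whatever is retained for the backward pass) can plausibly add up to the ~8 GiB observed for edge outputs.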

This isn't a terrible problem, for two reasons. First, it only affects these two variants; it's unlikely to affect either the segmentation experiments or the spatial relations experiments. Second, I've found a good-enough hack/workaround for color segmentation: run the GNN training for the two affected variants on ephemeral-lg. With 48 GiB of GPU memory, we don't even have to worry about adjusting the code. :)

@spigo900 (Collaborator, Author) commented Oct 20, 2022

Okay, from a quick test it seems we don't need this much memory at decode time. The GNN is happy with a standard plain old ephemeral GPU. Phew. :)

ETA: I spoke too soon. Decoding the eval/test data requires ephemeral-lg. Oh well. 🙃
