
GNN training runs out of GPU memory when dealing with many-stroke scenes #1201

Open

spigo900 opened this issue Oct 20, 2022 · 1 comment

@spigo900 (Collaborator)

GNN training runs out of GPU memory when dealing with many-stroke scenes. For example, this happens when training on the M5 objects curriculum using STEGO segmentations with stroke merging enabled and no color segmentation refinement. That setup yields 599 train inputs, a significant minority of which are "many-stroke" scenes; the largest input has s_max = 42 strokes. GNN training can't fit the activations for a dataset like this into GPU memory, so it crashes during the forward pass.

Specifically, we run into trouble with the message-passing part of the GNN (the MPNN): the number of edge outputs scales quadratically with s_max, and the edge outputs alone end up using about 8 GiB of memory. This is one of the very first tensors computed, so it doesn't leave much room for any further tensors. I haven't tested yet whether this also happens at decode time; it's possible we can scrape by when we don't also have to hold on to tensors for the backward pass. That's something I plan to test.
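For a rough sense of the scaling, here is a minimal back-of-the-envelope sketch (not the project's actual MPNN code; `hidden_dim` and the single-tensor framing are assumptions for illustration) showing how a fully-connected edge tensor grows quadratically with s_max when every scene is padded to the same stroke count:

```python
# Rough estimate of edge-message tensor size in a fully-connected MPNN
# where every scene is padded to s_max strokes.
# The numbers below are illustrative assumptions, not values from the repo.

def edge_activation_gib(num_scenes: int, s_max: int, hidden_dim: int,
                        bytes_per_elem: int = 4) -> float:
    """Memory (GiB) for one (num_scenes, s_max, s_max, hidden_dim) float32 tensor."""
    n_elems = num_scenes * s_max * s_max * hidden_dim
    return n_elems * bytes_per_elem / 2**30

# 599 inputs padded to s_max = 42 strokes, as in the failing run.
# hidden_dim = 512 is an assumed width; the real model may differ.
print(edge_activation_gib(599, 42, 512))  # ~2.0 GiB for a single edge tensor
```

Because the footprint is quadratic in s_max, one 42-stroke outlier is enough to blow the budget: a few tensors of this shape (messages, gate activations, plus whatever is retained for the backward pass) can plausibly add up to the ~8 GiB observed for edge outputs.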

This isn't a terrible problem, for two reasons. First, it only affects these two variants; it's unlikely to affect either the segmentation experiments or the spatial relations experiments. Second, I've found a good-enough hack/workaround for color segmentation: run the GNN training for the two affected variants on ephemeral-lg. With 48 GiB of GPU memory, we don't even have to worry about adjusting the code. :)

@spigo900 (Collaborator, Author) commented Oct 20, 2022

Okay, from a quick test it seems we don't need this much memory at decode time. The GNN is happy with a standard plain old ephemeral GPU. Phew. :)

ETA: I spoke too soon. Decoding the eval/test data requires ephemeral-lg. Oh well. 🙃
