MOM_hor_visc: Variables moved to stack #1362

Merged

Conversation

marshallward
Collaborator

New diagnostics added to horizontal_viscosity were causing issues with stack
memory on some platforms, more than doubling the runtime.

Two of the diagnostics were allocatables and the other two were local
variables.  By redefining the two allocatables as locals (and presumably
moving them to the stack), the faster performance was restored.

While the underlying cause is unclear, this is almost certainly due to
stack spill in this function, which happens to have a large number of
local arrays - including many 3d arrays used to gather diagnostics - and
any new variable is going to have volatile consequences.

This should be seen as a short term fix.  In the future, we need better
tools to detect this problem and better guidance on how to responsibly
use stack.

Also note that two variables were removed: `max_diss_rate_[qh]`.
Neither variable was used in the function.
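
For concreteness, the change amounts to something like the sketch below.  The
names `diag_a`/`diag_b` and the index arguments are placeholders rather than
the actual MOM_hor_visc variables, and whether a local array really lands on
the stack remains compiler- and flag-dependent.

```fortran
! Minimal sketch of the kind of change described above; diag_a, diag_b and the
! index arguments are illustrative placeholders, not the real MOM_hor_visc names.
subroutine hor_visc_diag_sketch(is, ie, js, je, nz)
  implicit none
  integer, intent(in) :: is, ie, js, je, nz

  ! Before: allocatable diagnostics, explicitly allocated (heap storage).
  !   real, allocatable :: diag_a(:,:,:), diag_b(:,:,:)
  !   allocate(diag_a(is:ie,js:je,nz), diag_b(is:ie,js:je,nz))

  ! After: automatic local arrays, which most compilers place on the stack
  ! (although flags such as Intel's -heap-arrays can move them to the heap).
  real, dimension(is:ie,js:je,nz) :: diag_a, diag_b

  diag_a(:,:,:) = 0.0 ; diag_b(:,:,:) = 0.0
end subroutine hor_visc_diag_sketch
```
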
@klindsay28
Contributor

@marshallward, it seems like you're saying there were issues with stack memory and that you've resolved it by placing more variables on the stack. This seems backwards to me, so I think that I must be misunderstanding.

Based on your description of the code's performance, it smells like the original was slower because of the now removed allocate statements. But you're attributing the performance to stack spills.

I'm not following your explanation. Could you clarify this?

@marshallward
Collaborator Author

@klindsay28 I agree! There isn't an explanation here, just a vague assertion that we're using too much stack, and that changing how we use it somehow(!) resolved the problem.

What I can say is that this function suffers from many odd symptoms which suggest stack spill:

  • If I retain three of the four diagnostics, the speed is restored. Adding the fourth doubles the runtime. It doesn't matter if it's allocatable or local.

  • Lots of memory-swapping `lea` instructions start to appear whenever the slowdown happens, another sign of stack spill.

  • The weirdest sign: Reordering the variable declarations can change runtime! Almost as if the position on the stack affects performance.

  • If I profile under Intel, it explicitly tags certain variables as stack spill! (AMD has been less explicit)

  • Fortran has always blurred the line between heap and stack; the allocatables here might be stack anyway (or perhaps they never were).

I don't mean to assert that this is reducing the stack spill and fixing the underlying problem. Only that there is a problem - caused by the huge number of local arrays, probably on stack - and that any little tweak can either fix or exacerbate the problem.

As for why adding two more variables magically fixes the issue that I saw? I don't know. Could be some odd interaction with how allocatable and locals are managed.

This should not be seen as anything more than a short-term fix, and it would be good to resolve the bigger problem of excessive locals.
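
On the point about needing better tools: one rough way to measure per-procedure
stack demand (a general suggestion, not something used in this PR) is
gfortran's `-fstack-usage` option.  The standalone toy below, which is not
MOM6 code, shows an automatic array of the kind that drives such usage.

```fortran
! Standalone toy, not MOM6 code: a single automatic array that consumes stack.
! Compiling with "gfortran -c -fstack-usage stack_demo.f90" writes stack_demo.su,
! which reports the stack usage of each procedure and is one way to spot
! procedures (like horizontal_viscosity) with unusually heavy stack demands.
program stack_demo
  implicit none
  call big_local(500)
contains
  subroutine big_local(n)
    integer, intent(in) :: n
    real(kind=8) :: work(n, n)   ! automatic array: typically placed on the stack
    work(:,:) = 1.0d0
    print *, 'sum =', sum(work)  ! 500*500*8 bytes, roughly 2 MB of stack
  end subroutine big_local
end program stack_demo
```

The resulting `.su` file lists a stack figure per procedure (flagged as
`dynamic` when the size is only known at run time), which makes stack-heavy
procedures easy to identify.
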

@marshallward
Collaborator Author

Another point which I forgot to make: I was not even testing these blocks of code! None of these diagnostics were enabled. Yet their presence was causing the runtime to more than double.

So it was not the actual allocation (or lack thereof) causing the issue. It was the presence of these particular variables and loops, and the resulting object code, which was causing a problem for the rest of the function.

To say it another way: the mere possibility of such variables may have forced the compiler to make some awful compromises in order to accommodate the worst case scenario.

I'm asserting (without much evidence) that if we had fewer locals, then we'd have more stack to spare; the compiler would not feel so trapped, would not be forced into awkward choices regarding additional heap and stack variables, and the whole problem would just go away.

So perhaps moving the two diagnostics from heap to stack was enough to keep the memory under control.

I apologize for not providing this context, but I felt it was far too speculative to put in the commit log.

@klindsay28
Contributor

Thanks for the explanatory comments @marshallward.

@Hallberg-NOAA
Collaborator

I am convinced that these changes are correct and improve model performance. In addition to passing the TC testing, this PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/12386.

@Hallberg-NOAA Hallberg-NOAA merged commit 2b6d3e1 into mom-ocean:dev/gfdl Apr 1, 2021
@marshallward marshallward deleted the horvisc_diag_to_stack branch May 7, 2021 02:51