[FIXED] Also recover on old index.db when not using MaxMsgsPer #5901

Merged (2 commits) on Sep 18, 2024


@MauriceVanVeen MauriceVanVeen commented Sep 18, 2024

Extension to #5893

If we can't update index.db on shutdown, for example during a hard kill, we'd enter this condition if MaxMsgsPer was set:
https://github.com/nats-io/nats-server/pull/5893/files#diff-384c189826934c9a6fc3554dafc63dab2076245010e3d6fce5c71a93e15e9877R1752

However, all limits-based fields have this issue, not just MaxMsgsPer.
I ran similar tests, where the output of nats str info before a hard kill should equal its output after the hard kill:

  • MaxMsgsPer: −7,877 diff (fixed with addition of above condition/PR)
  • MaxMsgs: +2,123 diff
  • MaxAge: no diff (correct messages, but still [WRN] Filestore [stream] loadBlock error: message block data missing)
  • MaxBytes: +3,567 diff (MaxBytes was set to 1016 MiB, but after restart the state has more messages and Bytes: 1020 MiB)

I think we shouldn't only target MaxMsgsPer, since other fields can also trigger this. Extending the condition to list those other fields explicitly would come back to bite us if we add more limits-based fields in the future and forget to add them to this condition.
We need to detect that index.db was not written during shutdown, or that index.db and our msg blocks differ. Once we detect this, we can no longer rely on index.db being correct, so I'd propose to simplify: upon detection, defer to rebuilding.
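
The proposal can be sketched roughly as follows. This is a minimal illustration and not the nats-server implementation: the types `indexState` and `blockInfo` and the function `needsRebuild` are hypothetical names, and the real filestore tracks far more than is shown here. The point is only that the check looks at index.db versus the blocks on disk and never consults a specific limit.

```go
package main

import "fmt"

// blockInfo is a hypothetical summary of one message block found on disk.
type blockInfo struct {
	Index   uint32 // block number
	LastSeq uint64 // last sequence stored in the block
}

// indexState is a hypothetical summary of what index.db recorded when it
// was last written.
type indexState struct {
	LastBlock uint32 // last block known to index.db
	LastSeq   uint64 // last sequence known to index.db
}

// needsRebuild reports whether the index snapshot disagrees with the blocks
// on disk. No stream limit (MaxMsgsPer, MaxMsgs, MaxBytes, MaxAge) is
// consulted: any mismatch means the snapshot is stale.
func needsRebuild(idx indexState, onDisk []blockInfo) bool {
	if len(onDisk) == 0 {
		// The index claims state, but no blocks are present.
		return idx.LastSeq != 0
	}
	last := onDisk[len(onDisk)-1]
	return last.Index != idx.LastBlock || last.LastSeq != idx.LastSeq
}

func main() {
	idx := indexState{LastBlock: 3, LastSeq: 300}
	// After a hard kill, disk may hold a block that index.db never saw.
	disk := []blockInfo{{Index: 3, LastSeq: 300}, {Index: 4, LastSeq: 350}}
	if needsRebuild(idx, disk) {
		fmt.Println("index.db is stale, rebuilding state from message blocks")
	}
}
```

Because the check is independent of which limit is configured, a limits-based field added later would be covered without touching the condition.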

Signed-off-by: Maurice van Veen [email protected]

@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review September 18, 2024 10:02
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner September 18, 2024 10:02

@derekcollison derekcollison left a comment


For MaxMsgs and MaxBytes, and MaxAge (for now), we can try a different approach: when we detect that we did not match on the last block, we whip through the blocks recovered via index.db and os.Stat() each file.

If the file is not present, remove it from the top-level accounting; if it exists, recover that one from disk and break, since that will be all that is needed.

I could try to take your updated test (awesome) and see if that approach would work.

@MauriceVanVeen

I could play with that tomorrow for sure.
(But also feel free to have a look yourself if you'd want to look at it sooner than that)

Last week I already tried the os.Stat approach when initially adding the block higher up, but that would break lost data accounting. Maybe doing it lower down here would work better.

@derekcollison

Actually, thinking some more about this, we would need to redo the PSIM layer, since we do not know what we lost, and hence we would need to load those blocks anyway.
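
To illustrate why the blocks have to be loaded anyway, here is a minimal sketch, assuming PSIM refers to the filestore's per-subject tracking. The `msg` type and `rebuildPerSubject` function are hypothetical and are not the nats-server code. Block-level totals say how many messages a block holds, not which subjects they belong to, so per-subject counts can only be recomputed by scanning the messages themselves.

```go
package main

import "fmt"

// msg is a hypothetical decoded message record.
type msg struct {
	Subject string
	Seq     uint64
}

// rebuildPerSubject recomputes per-subject message counts by scanning every
// message in every surviving block. There is no shortcut from block-level
// totals, because those do not record the subject of each message.
func rebuildPerSubject(blocks [][]msg) map[string]uint64 {
	counts := make(map[string]uint64)
	for _, blk := range blocks {
		for _, m := range blk {
			counts[m.Subject]++
		}
	}
	return counts
}

func main() {
	// Two surviving blocks with illustrative subjects.
	blocks := [][]msg{
		{{Subject: "orders.us", Seq: 1}, {Subject: "orders.eu", Seq: 2}},
		{{Subject: "orders.us", Seq: 3}},
	}
	fmt.Println(rebuildPerSubject(blocks))
}
```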


@derekcollison derekcollison left a comment


LGTM

@derekcollison derekcollison merged commit a2ff03e into main Sep 18, 2024
5 checks passed
@derekcollison derekcollison deleted the recover-stale-indexdb-with-any-limits branch September 18, 2024 21:06
@derekcollison

Thanks @MauriceVanVeen !

neilalexander pushed a commit that referenced this pull request Sep 20, 2024
neilalexander added a commit that referenced this pull request Sep 20, 2024
Includes the following:

- #5901
- #5904
- #5900
- #5906
- #5908
- #5907

Signed-off-by: Neil Twigg <[email protected]>