Improve behavior on corrupted checkpoint. #4025

Open
fulmicoton opened this issue Oct 25, 2023 · 5 comments
Labels: bug, low-priority

Comments

@fulmicoton
Contributor

fulmicoton commented Oct 25, 2023

If a checkpoint contains an invalid position for an ingest source (for instance,
a value that is not a valid u64), we currently panic.

Ideally we should:

  • log an error
  • repair the checkpoint by removing the corrupted partition
  • start indexing the faulty partition from the beginning.

This is very defensive and hence low priority.
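
For illustration only, the three steps above could be sketched roughly as follows. This is not Quickwit's actual checkpoint API: `RawCheckpoint` and `repair_checkpoint` are hypothetical names, and the checkpoint is assumed to be a simple map from partition ID to a position stored as a string. Removing an entry stands in for "start from the beginning", on the assumption that a partition with no recorded position is indexed from its start.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory view of a checkpoint: partition ID -> raw position.
/// (An assumption for this sketch, not the real Quickwit type.)
type RawCheckpoint = BTreeMap<String, String>;

/// Removes every partition whose position is not a valid u64 and returns
/// the IDs of the removed partitions, logging one error per removal.
fn repair_checkpoint(checkpoint: &mut RawCheckpoint) -> Vec<String> {
    // Collect the corrupted partition IDs first, so the map is not mutated
    // while it is being iterated.
    let corrupted: Vec<String> = checkpoint
        .iter()
        .filter(|(_, position)| position.parse::<u64>().is_err())
        .map(|(partition_id, _)| partition_id.clone())
        .collect();

    for partition_id in &corrupted {
        // Step 1: log an error instead of panicking.
        eprintln!("corrupted position for partition `{partition_id}`, removing it");
        // Steps 2 and 3: drop the entry; with no recorded position, the
        // partition is assumed to be indexed from the beginning.
        checkpoint.remove(partition_id);
    }
    corrupted
}
```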

fulmicoton added the bug label Oct 25, 2023
@jmintb
Contributor

jmintb commented Oct 26, 2023

Can I work on this? :)

@guilload
Member

Yes, you can.

Logging an error and skipping the partition is enough. I don't think self-cleanup is helpful because it's likely that the issue comes from either a bug in the source or a user manually editing a checkpoint. In the first case, the bug will reoccur, and self-cleanup will be less helpful. In the second, I'd rather have users clean up their mess themselves and decide from which point they want to resume indexing the partition. Always restarting from the beginning will yield duplicates.
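
A minimal sketch of the log-and-skip behavior, under stated assumptions: `PartitionStart` and `plan_partitions` are hypothetical names, and the checkpoint is modeled as a map from partition ID to a raw position string rather than Quickwit's real types. The checkpoint is only read, never modified, so a corrupted entry stays in place for the user to fix.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-partition decision (not a Quickwit type).
#[derive(Debug, PartialEq)]
enum PartitionStart {
    /// The position parsed correctly; resume from this offset.
    ResumeFrom(u64),
    /// The position is invalid; do not index this partition.
    Skip,
}

/// Decides how to start each partition without mutating the checkpoint.
fn plan_partitions(checkpoint: &BTreeMap<String, String>) -> BTreeMap<String, PartitionStart> {
    checkpoint
        .iter()
        .map(|(partition_id, raw_position)| {
            let start = match raw_position.parse::<u64>() {
                Ok(offset) => PartitionStart::ResumeFrom(offset),
                Err(error) => {
                    // Log and skip rather than calling `expect` and panicking.
                    eprintln!(
                        "skipping partition `{partition_id}`: invalid position `{raw_position}`: {error}"
                    );
                    PartitionStart::Skip
                }
            };
            (partition_id.clone(), start)
        })
        .collect()
}
```

Returning an explicit `Skip` value, instead of silently dropping the partition, keeps the decision visible to the caller, which can then surface it in metrics or logs.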

@jmintb
Contributor

jmintb commented Oct 26, 2023

Perfect, do you know whereabouts in the codebase the panic(s) occur? Sounds like it is in quickwit-ingest.

@guilload
Member

If you run rg -e 'expect\(.*offset.*' in quickwit/quickwit-indexing/src/source, you should find them all.

@jmintb
Copy link
Contributor

jmintb commented Oct 27, 2023

Are there any existing tests or test data that would simulate this scenario?
