[ENH] CIP: Write-Ahead Log Pruning & Vacuuming #2498
The WAL is extremely valuable: more than once, users have corrupted their DBs or accidentally deleted their binary index dirs. Let's tread lightly here and make this opt-in. Furthermore, let's make users aware of the ramifications of WAL pruning and potentially offer ways or processes to back up the WAL, which can help with recovery.
I feel somewhat strongly that the WAL is not intended to be the primary backup solution and that the two should not be conflated. We should absolutely offer an easy way to back up and restore databases (through a CLI command?), but imo that's separate from cleaning the WAL. Postgres and MySQL both automatically clean the WAL. Even SQLite itself will automatically truncate the WAL by default when in WAL mode.
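To make the SQLite point concrete, here is a minimal sketch using Python's standard `sqlite3` module. It writes some rows in WAL mode and then measures the `-wal` file before and after a checkpoint. The helper name `wal_sizes` and the `log` table are made up for illustration. Note that the sketch issues an explicit `PRAGMA wal_checkpoint(TRUNCATE)` so the effect is observable on disk; SQLite's default automatic checkpoint (every 1000 pages) restarts the log for reuse rather than shrinking the file.

```python
import os
import sqlite3


def wal_sizes(db_path: str, rows: int = 100) -> tuple[int, int]:
    """Return the -wal file size (bytes) before and after a TRUNCATE checkpoint.

    Hypothetical helper for illustration only; not part of Chroma.
    """
    # isolation_level=None gives autocommit, so each INSERT is its own commit
    # and the journal mode can be changed outside of a transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS log (id INTEGER PRIMARY KEY, body TEXT)"
        )
        for i in range(rows):
            conn.execute("INSERT INTO log (body) VALUES (?)", (f"entry {i}",))

        wal_path = db_path + "-wal"
        before = os.path.getsize(wal_path)

        # Move all WAL frames into the main database file and truncate the log.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        after = os.path.getsize(wal_path)
        return before, after
    finally:
        conn.close()
```

The sizes are read while the connection is still open, because SQLite normally deletes the `-wal` file when the last connection closes cleanly.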
I'm not saying to use the WAL as a backup. All I'm saying is that the WAL is currently the only recovery option in Chroma, and some users have come to expect that Chroma will automatically rebuild binary indices if they happen to delete them. Enabling auto-pruning by default breaks an existing behavior and (again, some) users' expectations.
This is a valid point on some levels. I view this as an instance of https://www.hyrumslaw.com/.
While this may be true, and people may be in the habit of deleting database files and expecting them to just get rebuilt from the WAL, I would argue this is absolutely pathological behavior that we should not encourage or support. Manually deleting files and expecting the database to recreate them seems ripe for disaster. If the flow we want to support is rebuilding indices, we should productize that and promote it to the top level.
Yes, it is an undocumented fact that the WAL currently grows without bound, but this should not be relied upon or preserved. I think we shouldn't prevent ourselves from doing the right thing (pruning the WAL and preventing unbounded WAL growth) in order to preserve users doing the wrong thing (relying on the WAL for backup/restore behavior as opposed to crash recovery).
Maybe as part of 0.6.0, we can document recommended backup/recovery flows (probably: shut down the Chroma server, make a copy of the directory, unless we have bandwidth to add a `chroma backup` command?)
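As a rough sketch of what the manual flow could look like, assuming the server has already been stopped and that the persist directory is a plain folder on disk: the function name `backup_persist_dir`, the timestamped naming scheme, and the example paths are all hypothetical and not an existing Chroma API or CLI.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_persist_dir(persist_dir: str, backup_root: str) -> Path:
    """Copy a persist directory into a timestamped folder under backup_root.

    Hypothetical helper: assumes the Chroma server is NOT running, so the
    SQLite file and index files are not being written during the copy.
    """
    src = Path(persist_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"persist directory not found: {src}")

    # UTC timestamp keeps successive backups side by side and sortable.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(backup_root) / f"{src.name}-{stamp}"

    # copytree creates dest (and missing parents) and fails if dest exists.
    shutil.copytree(src, dest)
    return dest
```

Usage would be something like `backup_persist_dir("./chroma_data", "./backups")`; restoring is the reverse copy, again with the server stopped.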
@tazarov please see the updated proposal, which addresses this.