Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] CIP: Write-Ahead Log Pruning & Vacuuming #2498

Merged
merged 11 commits into from
Jul 23, 2024
Merged

Conversation

codetheweb
Copy link
Contributor

@codetheweb codetheweb commented Jul 10, 2024

See #2288.

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Contributor

@tazarov tazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more points to consider.


Two additional things will be done after write transactions:

1. The `embeddings_queue` table will be pruned to remove rows that are no longer needed. Specifically, rows with a sequence ID less than the minimum sequence ID of any active subscriber will be deleted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAL is extremely valuable, as more than once, users have corrupted DBs or accidentally deleted their binary index dirs. Let's thread lightly here and make this an opt-in. Furthermore, let's make users aware of the ramifications of WAL pruning and potentially offer ways or processes to back up the WAL, which can help recovery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel somewhat strongly that WAL is not intended to be the primary backup solution and that the two should not be conflated. We should absolutely offer an easy way to backup and restore databases (though a CLI command?) but imo that's separate from cleaning the WAL. Postgres and MySQL both automatically clean the WAL. Even SQLite itself will automatically truncate the WAL by default when in WAL mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying to use the WAL as a backup. All I'm saying is that the WAL is currently the only recovery option in Chroma, and users (some) have come to expect that Chroma will automatically rebuild binary indices if they happen to delete them. Enabling auto-pruning by default circumvents an existing behavior and user (again, some) user expectations.

Copy link
Collaborator

@HammadB HammadB Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid point on some levels. I view this an instance of https://www.hyrumslaw.com/.

and users (some) have come to expect that Chroma will automatically rebuild binary indices if they happen to delete them.

While this may be true, and people may be in the habit of deleting database files and expecting them to just get rebuilt from the WAL - I would argue this is absolutely pathological behavior that we should not encourage or support. Manually deleting files and expecting the database to recreate them seems ripe for disaster. If the flow we want to support is rebuilding indices, we should productize that and promote it to the top level.

Enabling auto-pruning by default circumvents an existing behavior

Yes there is some undocumented fact that the WAL is infinite, but this should not be relied upon and preserved. I think we shouldn't prevent ourselves from doing the right thing - pruning the wal and preventing unbounded wal growth - in order to preserve users doing the wrong thing - relying on the WAL for backup/restore behavior as opposed to crash recovery.

Copy link
Contributor Author

@codetheweb codetheweb Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe as part of 0.6.0, we can document recommended backup/recovery flows (probably shut down chroma server, make copy of directory--unless we have bandwidth to add a chroma backup command?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tazarov please see updated proposal that addresses this

docs/cip/CIP-07102024_Write_Ahead_Log_Pruning_Vacuuming.md Outdated Show resolved Hide resolved

If the system's embedding queue is empty (a fresh system), `automatically_prune` will default to `true`.

This configuration object will be stored in a new table, `embeddings_queue_config`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after playing around with it I believe it makes more sense to store this as a property of the embeddings queue rather than as a property of the system
e.g. this setting is meaningless when not using the SQLite implementation of the embeddings queue (aka distributed), so at the system level the setting is only meaningful in some contexts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

draft implementation of what this looks like here: #2557

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my only question. I'm happy with this implementation.

Copy link
Contributor

@atroyn atroyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Looking forward to prototype.


If the system's embedding queue is empty (a fresh system), `automatically_prune` will default to `true`.

This configuration object will be stored in a new table, `embeddings_queue_config`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my only question. I'm happy with this implementation.

@codetheweb codetheweb merged commit 1c0fb13 into main Jul 23, 2024
10 checks passed
@codetheweb codetheweb deleted the cip-wal-pruning branch July 23, 2024 19:41
codetheweb added a commit that referenced this pull request Jul 30, 2024
Implementation of the CLI command described in this CIP:
#2498. Terminal UI demo:


https://github.com/user-attachments/assets/9f902655-d6c9-4227-bd8b-c16e232b062a

Warning that's logged when automatic pruning is disabled:

<img width="1001" alt="Screenshot 2024-07-24 at 5 37 32 PM"
src="https://github.com/user-attachments/assets/741bd62d-b9f7-4730-9429-1bc7a5017911">
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants