
[ENH] add .clean_log() to Producers #2549

Merged
merged 15 commits into from
Jul 29, 2024

Conversation

Contributor

@codetheweb commented Jul 19, 2024

Depends on #2545.

Changes:

  • Adds a clean_log() method to producers (not called automatically in this PR).
  • The existing table max_seq_id is now used to track the maximum seen sequence ID for both metadata and vector segments (formerly only used by metadata segments).
  • Segments are expected to update the max_seq_id table themselves.
  • Vector segments will automatically migrate the max_seq_id field from the old pickled metadata file source into the database upon init.

In this PR, log entries are deleted on a per-collection basis. The next PR in this stack deletes entries globally.
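The per-collection pruning described above can be sketched roughly as follows. This is a simplified illustration, not Chroma's actual code: the schema, the `clean_log` signature, and the segment/topic names are all hypothetical stand-ins.

```python
import sqlite3

# Simplified illustration of per-collection log pruning (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE embeddings_queue (seq_id INTEGER PRIMARY KEY, topic TEXT);
    CREATE TABLE max_seq_id (segment_id TEXT PRIMARY KEY, seq_id INTEGER);
    """
)

def clean_log(conn, topic, segment_ids):
    # A log entry is only safe to delete once *every* segment of the
    # collection has seen it, so prune up to the minimum max_seq_id.
    placeholders = ",".join("?" * len(segment_ids))
    count, min_seen = conn.execute(
        f"SELECT COUNT(seq_id), MIN(seq_id) FROM max_seq_id "
        f"WHERE segment_id IN ({placeholders})",
        segment_ids,
    ).fetchone()
    if count < len(segment_ids):
        # Some segment has no recorded position yet; delete nothing.
        return
    conn.execute(
        "DELETE FROM embeddings_queue WHERE topic = ? AND seq_id <= ?",
        (topic, min_seen),
    )

# Metadata segment has seen up to seq 5, vector segment only up to 3.
conn.executemany(
    "INSERT INTO embeddings_queue VALUES (?, ?)",
    [(i, "coll-1") for i in range(1, 7)],
)
conn.executemany(
    "INSERT INTO max_seq_id VALUES (?, ?)",
    [("metadata-seg", 5), ("vector-seg", 3)],
)
clean_log(conn, "coll-1", ["metadata-seg", "vector-seg"])
remaining = [
    r[0]
    for r in conn.execute("SELECT seq_id FROM embeddings_queue ORDER BY seq_id")
]
# Entries 1-3 are pruned; 4-6 remain until the vector segment catches up.
```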


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (readability, modularity, intuitiveness)?


Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas

@codetheweb codetheweb changed the base branch from main to feat-simplify-persist-trigger-2 July 19, 2024 23:33
@codetheweb codetheweb changed the title feat wal clean [ENH] add .clean_log() to Producers Jul 19, 2024
    @@ -243,6 +291,28 @@ def unsubscribe(self, subscription_id: UUID) -> None:
            del self._subscriptions[topic_name]
            return

        @trace_method("SqlEmbeddingsQueue.ack", OpenTelemetryGranularity.ALL)
        @override
        def ack(self, subscription_id: UUID, up_to_seq_id: SeqId) -> None:
Contributor

This feels weird, as we're inverting the responsibility for who maintains the max_seq_id table. In fact, this duplicates what the metadata segment already does. Doesn't it make sense to let segments maintain their own max_seq_ids and have just an in-memory representation held by the embedding queue?

Contributor Author

Sorry, not quite understanding--are you proposing that segments maintain max_seq_id themselves, and that the embeddings queue calls something like max_seq_id() on them during clean_log()?

Contributor

Segments already maintain their max_seq_id:

  • Metadata -

        with self._db.tx() as cur:
            for record in records:
                q = (
                    self._db.querybuilder()
                    .into(Table("max_seq_id"))
                    .columns("segment_id", "seq_id")
                    .insert(
                        ParameterValue(self._db.uuid_to_db(self._id)),
                        ParameterValue(_encode_seq_id(record["log_offset"])),
                    )
                )
                sql, params = get_sql(q)
                sql = sql.replace("INSERT", "INSERT OR REPLACE")

  • HNSW with the pickled metadata -

        self._persist_data.max_seq_id = self._max_seq_id
        # TODO: This should really be stored in sqlite, the index itself, or a better
        # storage format
        self._persist_data.id_to_label = self._id_to_label
        self._persist_data.label_to_id = self._label_to_id
        self._persist_data.id_to_seq_id = self._id_to_seq_id
        with open(self._get_metadata_file(), "wb") as metadata_file:
            pickle.dump(self._persist_data, metadata_file, pickle.HIGHEST_PROTOCOL)

My concern was whether it is the embedding queue's responsibility to update the max_seq_id table, or whether that should be left to each segment. Instead, let segments report their max sequence IDs either via ack() or simply via the return value of notify_one(), and keep these in memory (i.e. as attributes on each subscriber).

Contributor Author

Right, I see. There are two main issues with keeping max sequence IDs in memory on the embeddings queue:

  • It assumes all segments have subscribed to the embeddings queue before clean_log() is called. This may be true in the vast majority of cases but it feels like that's a correctness bug waiting to happen. You could add an assertion that len(subscribers) == 2 but imo that's not materially better than just persisting the max sequence ID.
  • It requires segments to be loaded into memory for clean_log() to work. In most cases the segments will probably already be in-memory, but this could make chroma vacuum slower and implementation a little annoying.

Contributor

You don't have to keep max_seq_id in memory; just query it and let segments update the max_seq_id table on their own (i.e. responsibility for persisting max_seq_id lies with the owner of the counter: each segment).

Here's a hypothetical situation:

  • Add a large batch (one that overflows the threshold)
  • The vector segment updates its metadata to store the max_seq_id
  • The embedding queue fails to update the max_seq_id - not very likely, but also not impossible. We roll back the whole shenanigan, but you still have vectors added to the index. This, too, is a correctness bug waiting to happen :)

I hope you see my point about the responsibility of segments to manage their own max_seq_id, making the embedding queue only a consumer of (i.e. only querying) the state upon clean_log().

Contributor Author

@codetheweb Jul 23, 2024

Embedding queue fails to update the max_seq_id - not very likely, but also not impossible. We roll back the whole shenanigan, but you still have vectors added to the index. This, too, is a correctness bug waiting to happen :)

Sorry, I'm not sure how this could result in a correctness bug--at worst, the embeddings queue won't prune processed WAL records?

I hope you see my point about the responsibility of segments to manage their own max_seq_id and make the embedding queue only a consumer (aka only query) the state upon clean_log().

To clarify, in this implementation would clean_log() directly inspect the max_seq_id table and/or the pickled metadata file?

Contributor Author

@codetheweb Jul 23, 2024

After thinking through this some more, I'd be fine having segments be responsible for updating the max_seq_id table themselves. We then don't need the ack() method. The only thing that's a little weird then is that clean_log() will be directly accessing the max_seq_id table.

Maybe we have a new component, LogPosition, that owns reads/writes for max_seq_id and we put clean_log() there? That seems fairly clean.
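A minimal sketch of such a component might look like the following. The class and method names are hypothetical (only the INSERT OR REPLACE upsert mirrors what the metadata segment already does today):

```python
import sqlite3

# Hypothetical sketch of a LogPosition component that owns all reads/writes
# of the max_seq_id table; names and interfaces are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE max_seq_id (segment_id TEXT PRIMARY KEY, seq_id INTEGER)"
)

class LogPosition:
    """Single owner of reads/writes to max_seq_id (hypothetical)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def update(self, segment_id: str, seq_id: int) -> None:
        # Segments call this to persist their own position; the upsert
        # mirrors the INSERT OR REPLACE the metadata segment already uses.
        self._conn.execute(
            "INSERT OR REPLACE INTO max_seq_id VALUES (?, ?)",
            (segment_id, seq_id),
        )

    def get(self, segment_id: str):
        row = self._conn.execute(
            "SELECT seq_id FROM max_seq_id WHERE segment_id = ?",
            (segment_id,),
        ).fetchone()
        return row[0] if row else None

log_position = LogPosition(conn)
log_position.update("vector-seg", 3)
log_position.update("vector-seg", 5)  # a later ack supersedes the earlier one
position = log_position.get("vector-seg")  # → 5
```

With this shape, clean_log() could live on the same component and never touch segment internals; segments depend only on the narrow update() surface.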

Contributor Author

...segments be responsible for updating the max_seq_id table themselves. We then don't need the ack() method. The only thing that's a little weird then is that clean_log() will be directly accessing the max_seq_id table.

I think this is a slight improvement. Comparison in 41a7131. The end of this comment lists some of the considerations I've had; I think this approach probably balances them slightly better.

Maybe we have a new component, LogPosition, that owns reads/writes for max_seq_id and we put clean_log() there? That seems fairly clean.

Spent a few minutes prototyping this:

  1. Added a new SqliteDB mixin, SqlSegmentLogPosition.
  2. The mixin requires clean_log() and ack() to be defined on the abstract SqlDB to avoid a circular import (👎)

For the sake of writing things down, some of the things I've been considering throughout the iterations of this implementation:

  • clean_log() assumes that max_seq_id entries are kept up-to-date (this suggests that mutation logic for max_seq_id should be co-located with clean_log())
  • clean_log() and max_seq_id are only applicable for single node (suggests that Consumer/Producer interfaces should not be modified)
  • "subscribers" are ephemeral, while "segments" are persistent--max_seq_id has one row per segment, not one row per subscriber
  • we should be able to get the max_seq_id of segments without having to load them into memory

I'd really like to just commit to a path at this point and stop bikeshedding. It's not solely the fault of this thread; this feature has just been very delayed after chasing down the various bugs that fell out and the multiple permutations of the CIP (although I do think the user-facing side of this is far better after going through those permutations!).

Contributor Author

codetheweb commented Jul 22, 2024

@codetheweb codetheweb force-pushed the feat-wal-clean branch 2 times, most recently from a42d715 to be69dde Compare July 24, 2024 21:49
@codetheweb codetheweb force-pushed the feat-simplify-persist-trigger-2 branch from d84489d to 7c1d6d0 Compare July 24, 2024 22:54
@codetheweb codetheweb marked this pull request as ready for review July 24, 2024 23:14
embeddings_queue, system.instance(SegmentManager)
)

embeddings_queue.clean_log(coll.id)
Contributor Author

This exercises the max_seq_id migration--if the migration doesn't happen or fails, .clean_log() will have no effect (because the max_seq_id for one of the segments is missing) and the invariant check immediately below will fail if the number of embeddings added is greater than the sync threshold.
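The migration being exercised can be sketched like this. The file layout, dict keys, and function name are illustrative, not the actual persisted format:

```python
import os
import pickle
import sqlite3
import tempfile

# Hypothetical sketch: on init, a vector segment lifts max_seq_id out of its
# legacy pickled metadata file into the sqlite max_seq_id table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE max_seq_id (segment_id TEXT PRIMARY KEY, seq_id INTEGER)"
)

def migrate_max_seq_id(conn, segment_id, metadata_path):
    if not os.path.exists(metadata_path):
        return  # fresh segment, nothing to migrate
    with open(metadata_path, "rb") as f:
        persist_data = pickle.load(f)
    legacy = persist_data.get("max_seq_id")
    if legacy is None:
        return
    # INSERT OR IGNORE: if the segment already has a row, it was migrated
    # previously (or has newer acks) and must not be overwritten.
    conn.execute(
        "INSERT OR IGNORE INTO max_seq_id VALUES (?, ?)",
        (segment_id, legacy),
    )

# Simulate an old-format pickled metadata file holding max_seq_id = 42.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as f:
    pickle.dump({"max_seq_id": 42}, f)
    path = f.name

migrate_max_seq_id(conn, "vector-seg", path)
os.unlink(path)
migrated = conn.execute(
    "SELECT seq_id FROM max_seq_id WHERE segment_id = ?", ("vector-seg",)
).fetchone()[0]
```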

@codetheweb codetheweb force-pushed the feat-simplify-persist-trigger-2 branch from 7c1d6d0 to baa1850 Compare July 25, 2024 23:44
@codetheweb codetheweb force-pushed the feat-wal-clean branch 2 times, most recently from 4e05b68 to 0d87067 Compare July 26, 2024 00:05
@codetheweb
Contributor Author

rename clean_log -> purge_log

@codetheweb codetheweb changed the base branch from feat-simplify-persist-trigger-2 to main July 29, 2024 21:11
@codetheweb codetheweb merged commit 7edbb5b into main Jul 29, 2024
70 checks passed
Contributor Author

Merge activity
