refactor: Allow db keys to handle multiple schema versions #1026
Codecov Report
@@            Coverage Diff             @@
##           develop    #1026    +/-   ##
===========================================
- Coverage    57.79%   57.76%   -0.04%
===========================================
  Files          174      174
  Lines        19521    19563     +42
===========================================
+ Hits         11283    11300     +17
- Misses        7248     7263     +15
- Partials       990     1000     +10
Need to rebase 👀 due to the 0.4 re-release
Force-pushed from 19fa240 to f12a366.
I'm approving in the spirit that this seems to work and does move the needle in the direction I think we want to go. I have so many questions that have been generated by looking at these 129 new lines. Here are a few just to keep it short.
- Do we plan on being able to walk back the schema history?
- If so, did you think of what the mechanics of that will look like?
- Is it via some kind of block store or a reference to the previous version in the description of the new version?
- Will the document's fields have a reference to the schema version they belong to?
I'm unsure what you mean by walk back here. The current state of the schema at that version will be stored, not the patch. If you meant a post-transaction-commit rollback, the thought hadn't crossed my mind - it sounds like a nice idea, but with a whole minefield's worth of gotchas and well out of scope from anything I plan on doing in 0.5.
I don't understand what you are referring to here, although hopefully it is answered in the reply to Q1-2.
I have no plan to do so, and don't see any immediate benefit in doing so. Why?
I don't mean rollback no. What I mean is that if we have multiple schema updates, unless we have the exact
With the documents, we have a block store that links the updates. I was wondering if that's what you had in mind.
If we inspect the history of a document and we have fields that were deleted from the schema over time, I'm thinking we might need a way to know which schema version it was to get the right field name and type and so on.
Ah, I think I understand you: the commit queries will expose the schemaVersionId. There will also likely be a function on the client ABI that allows the schema of a given version to be fetched (noted in the almanac doc). At some point in the future we may want to allow the joining of the schema onto the commit in a query, but I consider that to be well out of scope atm. Likewise, the querying of deleted fields etc. in a commit query is explicitly out of scope of phase 1 (and probably 0.5). It gets very complicated, as John noted in the standup yesterday, and whilst very nice to have I don't think it should block or slow down progress on the core [Schema Update] feature.
Ah I forgot the block store links to the prior commit (for composites). Doing so isn't a requirement for phase 1-3, and would not provide anything useful to the user. Being able to query a (visibly) ordered schema history would be nice at some point but is out of scope for now.
Ah okay - this is to be held on the commit (and probably only the composite*), not the field's head as I thought you meant. The requirement that you just stated is a large part of why this PR exists (so that commit/time-travelling queries don't stop making sense). *It will likely only be stored on the composite, but exposed as a requestable field on any commit query (with field-level commits sourcing it from their composite). Because time-travelling queries are just normal queries with a cid param, this will almost certainly mean that it will also be made available on all normal queries too (again probably sourcing it from the composite commit).
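The arrangement described in the comment above (the schemaVersionId stored only on the composite commit, with field-level commits sourcing it from their composite) could look roughly like the following. This is only a sketch: the `commit` struct, its fields, and `schemaVersionIDFor` are hypothetical names invented for illustration and are not taken from the codebase.

```go
package main

import "fmt"

// commit is a hypothetical, simplified commit record (not the project's real type).
type commit struct {
	cid             string
	compositeCID    string // empty for composite commits; set for field-level commits
	schemaVersionID string // assumed to be populated on composite commits only
}

// schemaVersionIDFor returns the schema version ID for any commit.
// Composite commits answer directly; field-level commits defer to their composite.
func schemaVersionIDFor(commits map[string]commit, cid string) (string, error) {
	c, ok := commits[cid]
	if !ok {
		return "", fmt.Errorf("commit %q not found", cid)
	}
	if c.compositeCID == "" {
		return c.schemaVersionID, nil
	}
	composite, ok := commits[c.compositeCID]
	if !ok {
		return "", fmt.Errorf("composite commit %q not found", c.compositeCID)
	}
	return composite.schemaVersionID, nil
}

func main() {
	commits := map[string]commit{
		"cid-composite": {cid: "cid-composite", schemaVersionID: "v2"},
		"cid-field":     {cid: "cid-field", compositeCID: "cid-composite"},
	}
	id, err := schemaVersionIDFor(commits, "cid-field")
	fmt.Println(id, err)
}
```

Keeping the value in one place (the composite) means there is a single source of truth per document update, at the cost of one extra lookup for field-level commits.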
Storing it in the composite makes a lot of sense. I said field but the point was just to give a reference. Your answers all make sense. When I ask the questions I'm not implying that it should be done now. But they are all real questions I had to ask myself while looking at the code. I find that the situation supports what we were saying in the standup regarding documentation. These kinds of questions and interactions could be had before we get to code. This way we reduce the chance that everyone will "waste" time going through the same thought process I did. Do you see what I mean?
I do see where you are coming from, but a lot of your questions were regarding new ideas to me :) The time would be "wasted" either here or earlier. And it is impossible to explicitly exclude an infinite range of possibilities from a design doc (and doing so would reduce the information density of what is in scope). I still would like to encourage you to think along the lines of 'If it isn't in the spec, it is not in scope'. That said, I do think questions 1, 2, and 4 were explicitly mentioned in the almanac doc (with 4 not mentioning how it is achieved).
Mentioned but not answering the questions.
We would first need a spec for that to be true :)
Ah, we have a document that specifies and scopes out the problem to solve. It is not a tech spec, but it definitely covers this topic area (it is almost the only thing it covers).
Really? Do you mind sharing that document, please? I don't think I've seen it.
Can you document somewhere, in a note, the difference between schema version id and schema id?
Otherwise lgtm.
I think it is already: In the body of
And the version key struct is documented with:
And the collection key struct is documented with:
And the schema key struct is documented with:
Where were you thinking-of-putting/expecting the note?
LGTM. Minor suggestions (the private/concrete suggestion should be done tho :) ).
db/collection.go
// getCollectionByVersionId returns the [client.Collection] at the given [schemaVersionId] version.
//
// Will return an error if the given key is empty, or not found.
func (db *db) getCollectionByVersionId(ctx context.Context, schemaVersionId string) (client.Collection, error) {
suggestion: Internal funcs/methods should return private concrete types.
I don't think that is very important here as this is not a utils/helper-like function, but will change
- change func sig
I'd assert it's semi-important. It gives a cleaner API surface, so that if you need to use it internally, there's no need to cast from the interface to access collection type info.
Compile time safety 😅
Internal methods/funcs in a package should always handle concrete internal types if possible
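To illustrate the suggestion above, here is a minimal sketch of an internal function returning a private concrete type while the public entry point widens to the interface. The `Collection` interface here is a stand-in for `client.Collection`, and the `collection` struct, its fields, and the `versions` map are hypothetical simplifications, not the repository's actual definitions.

```go
package main

import "fmt"

// Collection is a hypothetical public interface, standing in for client.Collection.
type Collection interface {
	Name() string
}

// collection is the package-private concrete type behind the interface.
type collection struct {
	name            string
	schemaVersionID string
}

func (c *collection) Name() string { return c.name }

// getCollectionByVersionID is internal, so it returns the concrete type.
// Callers in the same package can read fields without a type assertion.
func getCollectionByVersionID(versions map[string]string, schemaVersionID string) (*collection, error) {
	if schemaVersionID == "" {
		return nil, fmt.Errorf("schema version ID must not be empty")
	}
	name, ok := versions[schemaVersionID]
	if !ok {
		return nil, fmt.Errorf("schema version %q not found", schemaVersionID)
	}
	return &collection{name: name, schemaVersionID: schemaVersionID}, nil
}

// GetCollection is the public entry point; it widens to the interface at the boundary.
func GetCollection(versions map[string]string, schemaVersionID string) (Collection, error) {
	col, err := getCollectionByVersionID(versions, schemaVersionID)
	if err != nil {
		// Return an untyped nil so callers never see a non-nil interface wrapping a nil pointer.
		return nil, err
	}
	return col, nil
}

func main() {
	versions := map[string]string{"v1": "users"}
	col, err := getCollectionByVersionID(versions, "v1")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Direct field access, checked at compile time, with no cast from the interface.
	fmt.Println(col.Name(), col.schemaVersionID)
}
```

The explicit `return nil, err` in the public wrapper matters: returning a nil `*collection` through the interface would yield a non-nil `Collection` value.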
key := core.NewCollectionKey(name)
buf, err := db.systemstore().Get(ctx, key.ToDS())
if err != nil {
	return nil, err
}

schemaVersionId := string(buf)
suggestion(non-blocking): Going from name to schemaVersionId
can prob be scoped to a private method.
func (db *db) GetCollectionByName(ctx context.Context, name string) (client.Collection, error) {
// ...
schemaVersionId, err := db.getSchemaVersionId(name)
return db.getCollectionByVersionId(ctx, schemaVersionId)
}
I'll have a look - if it would only be called once (I can't remember), then this will probably remain as-is.
- name=>vId private func?
I'm going to leave this as is for now; there is no significantly useful way of doing this without changing (adding to) the public interface in client. That work is already planned in the epic, and I would rather do this as part of that work, where proper time can be allocated to it. I'll make a note of these locations in that ticket though, as a reminder.
collectionKey := core.NewCollectionKey(name)
var desc client.CollectionDescription
buf, err := p.txn.Systemstore().Get(p.ctx, key.ToDS())
schemaVersionIdBytes, err := p.txn.Systemstore().Get(p.ctx, collectionKey.ToDS())
if err != nil {
	return desc, errors.Wrap("failed to get collection description", err)
}

schemaVersionId := string(schemaVersionIdBytes)
schemaVersionKey := core.NewCollectionSchemaVersionKey(schemaVersionId)
buf, err := p.txn.Systemstore().Get(p.ctx, schemaVersionKey.ToDS())
if err != nil {
	return desc, err
}
suggestion(non-blocking): Would be nice (but not required) to link the two implementations (here and above) to handle schema name => versionId
Ah I understand your other comment better now - thanks :)
collectionKey := core.NewCollectionKey(name)
var desc client.CollectionDescription
schemaVersionIdBytes, err := r.txn.Systemstore().Get(r.ctx, collectionKey.ToDS())
if err != nil {
	return desc, errors.Wrap("failed to get collection description", err)
}

key := core.NewCollectionKey(name)
buf, err := r.txn.Systemstore().Get(r.ctx, key.ToDS())
schemaVersionId := string(schemaVersionIdBytes)
schemaVersionKey := core.NewCollectionSchemaVersionKey(schemaVersionId)
buf, err := r.txn.Systemstore().Get(r.ctx, schemaVersionKey.ToDS())
suggestion(non-blocking): Again :)
I was trying to find a central place where these different ids are documented. Now that you have pointed out this location, I can come here anytime to read these and remember the distinctions. Another non-blocking question: how do you distinguish the use of the word "key" vs "id"?
Yeah the key.go file is probably your best bet for stuff like that.
I am not sure I do in most cases - if I'm explicitly talking about datastore keys I'll use "key", id is more open.
Force-pushed from dded9c2 to 60cf433.
* Rename collectionKey variable
  Is clearer
* Add schema version key
  Name will likely be improved with #1025
* Remove duplicate variable
* Deduplicate business logic
  These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly
* Document global-local distinction of schema elements
* Move schema id evaluation to before saving of collectionKey
  SchemaId will be used first shortly and I don't want to pollute the logic-change commit with move stuff
* Rename collectionSchemaKey variable
  Is clearer
* Persist schemaVersionKey
* Persist schemaVersionId at schemaKey
  SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion.
* Persist schemaVersionId at collectionKey
  Uses the getCollectionByVersionId defined in an earlier commit. getCollectionDesc is an existing duplicate function that, as well as duplicating itself, also duplicates much of the logic in getCollectionByVersionId. It should be refactored at some point, but correcting this is IMO out of scope at the moment.
Relevant issue(s)
Resolves #1022
Epic level context doc
https://source.almanac.io/docs/N2o87ZyxzsruNHSlAErwHMVnqNBPOZFn
Description
Reworks the database keys Collection and CollectionSchemaKey so that they become pointers to a CollectionSchemaVersionKey. This should allow multiple schema versions to be persisted in a manner consistent with the current 'head'/Collection key.
The current state of Collection-related database keys does not readily facilitate multiple versions. Supporting multiple versions is required because, when introducing updateable schema, we need to preserve the schema history, for example for aspects described in issues in the range #1006-#1012.
The names are not fantastic at the moment, in part due to the duplication noted in #1025; simplifying the names will be done in that ticket (alongside the deduplication).