refactor: Allow db keys to handle multiple schema versions #1026

AndrewSisley · 2023-01-04T21:05:16Z

Relevant issue(s)

Resolves #1022

Epic level context doc

https://source.almanac.io/docs/N2o87ZyxzsruNHSlAErwHMVnqNBPOZFn

Description

Reworks the database keys Collection and CollectionSchemaKey so that they become pointers to a CollectionSchemaVersionKey. this should allow multiple schema versions to be persisted in a manner consistent with the current 'head'/Collection key.

The current state of Collection related database keys does not readily facilitate multiple versions. This is required because when introducing updateable schema we need to preserve the schema history, for example for aspects described in issues in the range #1006-#1012

The names are not fantastic at the moment, in part due to the duplication noted in #1025, simplifying the names would be done in that ticket (alongside the deduplication).

codecov · 2023-01-04T21:05:57Z

Codecov Report

Merging #1026 (60cf433) into develop (9a98755) will decrease coverage by 0.04%.
The diff coverage is 76.62%.

@@             Coverage Diff             @@
##           develop    #1026      +/-   ##
===========================================
- Coverage    57.79%   57.76%   -0.04%     
===========================================
  Files          174      174              
  Lines        19521    19563      +42     
===========================================
+ Hits         11283    11300      +17     
- Misses        7248     7263      +15     
- Partials       990     1000      +10

Impacted Files	Coverage Δ
db/errors.go	`0.00% <ø> (ø)`
planner/mapper/descriptions.go	`62.50% <60.00%> (-10.23%)`	⬇️
planner/datasource.go	`59.45% <62.50%> (-1.84%)`	⬇️
db/collection.go	`65.87% <80.43%> (-0.75%)`	⬇️
core/key.go	`87.45% <84.61%> (-0.15%)`	⬇️
net/client.go	`78.04% <0.00%> (-4.88%)`	⬇️
net/peer.go	`44.01% <0.00%> (-2.16%)`	⬇️
net/pb/net.pb.go	`15.66% <0.00%> (+0.16%)`	⬆️
... and 1 more

jsimnz · 2023-01-09T07:35:40Z

Need to rebase 👀 due to the 0.4 re-release

fredcarle

I'm approving in the spirit that this seems to work and does move the needle in the direction I think we want to go. I have so many questions that have been generated by looking at these 129 new lines. Here are a few just to keep it short.

Do we plan on being able to walk back the schema history?
If so, did you think of what the mechanics of that will look like?
Is it via some kind of block store or a reference to the previous version in the description of the new version?
Will the document's fields have a reference to the schema version they belong to?

db/collection.go

planner/datasource.go

planner/mapper/descriptions.go

AndrewSisley · 2023-01-11T20:26:13Z

Do we plan on being able to walk back the schema history? (and Q 2.)

I'm unsure what you mean by walk back here. The current state of the schema at that version will be stored, not the patch. If you meant a post-transaction-commit rollback, the thought hadn't crossed my mind - it sounds like a nice idea, but with a whole minefield's worth of gotchas and well out of scope from anything I plan on doing in 0.5.

Is it via some kind of block store or a reference to the previous version in the description of the new version?

I dont understand to what you are referring to here, although hopefully it is answered in reply to Q1-2.

Will the document's fields have a reference to the schema version they belong to?

I have no plan on doing so, and dont see any immediate benefit of doing so. Why?

fredcarle · 2023-01-11T20:38:30Z

I'm unsure what you mean by walk back here. The current state of the schema at that version will be stored, not the patch. If you meant a post-transaction-commit rollback, the thought hadn't crossed my mind - it sounds like a nice idea, but with a whole minefield's worth of gotchas and well out of scope from anything I plan on doing in 0.5.

I don't mean rollback no. What I mean is that if we have multiple schema updates, unless we have the exact schemaVersionId from somewhere to get those previous versions, I don't see how we can inspect the change history.

I dont understand to what you are referring to here, although hopefully it is answered in reply to Q1-2.

With the documents, we have a block store that links the updates. I was wondering if that's what you had in mind.

I have no plan on doing so, and dont see any immediate benefit of doing so. Why?

If we inspect the history of a document and we have fields that were deleted from the schema over time, I'm thinking we might need a way to know which schema version it was to get the right field name and type and so on.

AndrewSisley · 2023-01-11T21:08:41Z

I don't mean rollback no. What I mean is that if we have multiple schema updates, unless we have the exact schemaVersionId from somewhere to get those previous versions, I don't see how we can inspect the change history.

Ah I think I understand you, the commit queries will expose the schemaVersionId. There will also likely be a function on the client ABI that allows the schema of a given version to be fetched (noted in the almanac doc).

At some point in the future we may want to allow the joining of the schema onto the commit in a query - but I consider that to be well out of scope atm. Likewise, the querying of deleted fields etc in a commit query is explicitly out of scope of phase 1 (and probably 0.5) - it gets very complicated as John noted in the standup yesterday, and whilst very nice to have I dont think it should block or slow down progress on the core [Schema Update] feature.

With the documents, we have a block store that links the updates. I was wondering if that's what you had in mind.

Ah I forgot the block store links to the prior commit (for composites). Doing so isn't a requirement for phase 1-3, and would not provide anything useful to the user. Being able to query a (visibly) ordered schema history would be nice at some point but is out of scope for now.

If we inspect the history of a document and we have fields that were deleted from the schema over time, I'm thinking we might need a way to know which schema version it was to get the right field name and type and so on.

Ah okay - this is to be held on the commit (and probably only the composite*), not the field's head as I thought you meant. The requirement that you just stated is a large reason as to why this PR exists (so that commit/time-travelling querys dont stop making sense).

*It will likely only be stored on the composite, but exposed as a requestable field on any commit query (with field-level commits sourcing it from their composite). Due to how time-travelling queries are just normal queries with a cid param this will almost certainly mean that it will also be made available on all normal queries too (again probably sourcing it from the composite commit).

fredcarle · 2023-01-11T21:21:02Z

Storing it in the composite makes a lot of sense. I said field but the point was just to give a reference.

Your answers all make sense. When I ask the questions I'm not inferring that it should be done now. But they are all real questions I had to ask myself while looking at the code. I find that the situation supports what we were saying in the standup regarding documentation. These kinds of questions and interaction could be had before we get to code. This way we alleviate the possibility that everyone will "waste" time going through the same thought process I did. Do you see what I mean?

AndrewSisley · 2023-01-12T17:36:32Z

Storing it in the composite makes a lot of sense. I said field but the point was just to give a reference.

Your answers all make sense. When I ask the questions I'm not inferring that it should be done now. But they are all real questions I had to ask myself while looking at the code. I find that the situation supports what we were saying in the standup regarding documentation.

These kinds of questions and interaction could be had before we get to code. This way we alleviate the possibility that everyone will "waste" time going through the same thought process I did. Do you see what I mean?

I do see where you are coming from, but a lot of your questions were regarding new ideas to me :) The time would be "wasted" either here or earlier. And it is impossible to explicitly exclude an infinite range of possibilities from a design doc (and doing so would reduce the information density of what is in scope). I still would like to encourage you to think along the lines of 'If it isn't in the spec, it is not in scope'. That said, I do think questions 1, 2, and 4 were explicitly mentioned in the almanac doc (with 4 not mentioning how it is achieved).

fredcarle · 2023-01-12T17:43:18Z

That said, I do think questions 1, 2, and 4 were explicitly mentioned

Mentioned but not answering the questions.

I still would like to encourage you to think along the lines of 'If it isn't in the spec, it is not in scope'.

We would first need a spec for that to be true :)

AndrewSisley · 2023-01-12T17:46:57Z

We would first need a spec for that to be true :)

Ah we have document the specifies and scopes out the problem to solve, it is not a tech spec, but it definitely covers this topic-area (It is almost the only thing it covers).

fredcarle · 2023-01-12T18:07:17Z

Ah we have document the specifies and scopes out the problem to solve, it is not a tech spec, but it definitely covers this topic-area (It is almost the only thing it covers).

Really? Do you mind sharing that document please. I don't think I've seen it.

shahzadlone

Can you document somewhere in a note, the difference between schema version id vs schema id.

Otherwise lgtm.

AndrewSisley · 2023-01-12T21:45:40Z

Can you document somewhere in a note, the difference between schema version id vs schema id.

I think it is already:

In the body of CreateCollection it is noted on the var declaration:

// For new schemas the initial version id will match the schema id

And the version key struct is documented with:

// CollectionSchemaVersionKey points to schema of a collection at a given
// version.

And the collection key struct is documented with:

// CollectionKey points to the current/'head' SchemaVersionId for
// the collection of the given name.

And the schema key struct is documented with:

// CollectionSchemaKey points to the current/'head' SchemaVersionId for
// the collection of the given schema id.

Where were you thinking-of-putting/expecting the note?

jsimnz

LGTM. Minor suggestions (the private/concrete suggestion should be done tho :) ).

jsimnz · 2023-01-17T11:51:27Z

db/collection.go

+// getCollectionByVersionId returns the [client.Collection] at the given [schemaVersionId] version.
+//
+// Will return an error if the given key is empty, or not found.
+func (db *db) getCollectionByVersionId(ctx context.Context, schemaVersionId string) (client.Collection, error) {


suggestion: Internal funcs/methods should return private concrete types.

I don't think that is very important here as this is not a utils/helper-like function, but will change

change func sig

Id assert it's semi important. Has a cleaner API surface so that if you need to use it internally, there's no need for casting from the interface to access collection type info.

Compile time safety 😅

Internal methods/funcs in a package should always handle concrete internal types if possible

jsimnz · 2023-01-17T12:02:31Z

db/collection.go

+	key := core.NewCollectionKey(name)
+	buf, err := db.systemstore().Get(ctx, key.ToDS())
+	if err != nil {
+		return nil, err
+	}
+
+	schemaVersionId := string(buf)


suggestion(non-blocking): Going from name to schemaVersionId can prob be scoped to a private method.

func (db *db) GetCollectionByName(ctx context.Context, name string) (client.Collection, error) { // ... schemaVersionId, err := db.getSchemaVersionId(name) return db.getCollectionByVersionId(ctx, schemaVersionId) }

I'll have a look - if it would only be called once (I cant remember), then this will probably remain as-is.

name=>vId private func?

I'm going to leave this as now, there is no significantly useful way of doing this without changing (adding to) the public interface in client. That work is already planned in the epic - I would rather this as part of that work where proper time can be allocated to it. I'll make a note of these locations in that ticket though as a reminder.

jsimnz · 2023-01-17T12:06:51Z

planner/datasource.go

+	collectionKey := core.NewCollectionKey(name)
 	var desc client.CollectionDescription
-	buf, err := p.txn.Systemstore().Get(p.ctx, key.ToDS())
+	schemaVersionIdBytes, err := p.txn.Systemstore().Get(p.ctx, collectionKey.ToDS())
 	if err != nil {
 		return desc, errors.Wrap("failed to get collection description", err)
 	}

+	schemaVersionId := string(schemaVersionIdBytes)
+	schemaVersionKey := core.NewCollectionSchemaVersionKey(schemaVersionId)
+	buf, err := p.txn.Systemstore().Get(p.ctx, schemaVersionKey.ToDS())
+	if err != nil {
+		return desc, err
+	}


suggestion(non-blocking): Would be nice (but not required) to link the two implementations (here and above) to handle schema name => versionId

Ah I understand your other comment better now - thanks :)

jsimnz · 2023-01-17T12:07:11Z

planner/mapper/descriptions.go

+	collectionKey := core.NewCollectionKey(name)
+	var desc client.CollectionDescription
+	schemaVersionIdBytes, err := r.txn.Systemstore().Get(r.ctx, collectionKey.ToDS())
+	if err != nil {
+		return desc, errors.Wrap("failed to get collection description", err)
 	}

-	key := core.NewCollectionKey(name)
-	buf, err := r.txn.Systemstore().Get(r.ctx, key.ToDS())
+	schemaVersionId := string(schemaVersionIdBytes)
+	schemaVersionKey := core.NewCollectionSchemaVersionKey(schemaVersionId)
+	buf, err := r.txn.Systemstore().Get(r.ctx, schemaVersionKey.ToDS())


suggestion(non-blocking): Again :)

shahzadlone · 2023-01-17T17:47:34Z

Can you document somewhere in a note, the difference between schema version id vs schema id.

I think it is already:

In the body of CreateCollection it is noted on the var declaration:

// For new schemas the initial version id will match the schema id

And the version key struct is documented with:

// CollectionSchemaVersionKey points to schema of a collection at a given
// version.

And the collection key struct is documented with:

// CollectionKey points to the current/'head' SchemaVersionId for
// the collection of the given name.

And the schema key struct is documented with:

// CollectionSchemaKey points to the current/'head' SchemaVersionId for
// the collection of the given schema id.

Where were you thinking-of-putting/expecting the note?

I was trying to find a central place where these different ids are documented, now that you pointed this location I can come here anytime to read these to remember the distinctions.

Another non-blocking question, how do you distinguish the use of the word "key" vs "id".

AndrewSisley · 2023-01-17T18:00:01Z

I was trying to find a central place where these different ids are documented, now that you pointed this location I can come here anytime to read these to remember the distinctions.

Yeah the key.go file is probably your best bet for stuff like that.

Another non-blocking question, how do you distinguish the use of the word "key" vs "id".

I am not sure I do in most cases - if I'm explicitly talking about datastore keys I'll use "key", id is more open.

Is clearer

Name will likely be improved with #1025

These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly

SchemaId will be used first shortly and I dont want to polute the logic-change commit with move stuff

Is clearer

SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion.

Uses the getCollectionByVersionId defined in an earlier commit. getCollectionDesc is an existing duplicate function that as well as duplicating itself, also duplicates much of the logic in getCollectionByVersionId - it should be refactored at somepoint, but correcting this is IMO out of scope at the moment.

* Rename collectionKey variable Is clearer * Add schema version key Name will likely be improved with #1025 * Remove duplicate variable * Deduplicate business logic These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly * Document global-local distinction of schema elements * Move schema id evaluation to before saving of collectionKey SchemaId will be used first shortly and I dont want to polute the logic-change commit with move stuff * Rename collectionSchemaKey variable Is clearer * Persist schemaVersionKey * Persist schemaVersionId at schemaKey SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion. * Persist schemaVersionId at collectionKey Uses the getCollectionByVersionId defined in an earlier commit. getCollectionDesc is an existing duplicate function that as well as duplicating itself, also duplicates much of the logic in getCollectionByVersionId - it should be refactored at somepoint, but correcting this is IMO out of scope at the moment.

…work#1026) * Rename collectionKey variable Is clearer * Add schema version key Name will likely be improved with sourcenetwork#1025 * Remove duplicate variable * Deduplicate business logic These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly * Document global-local distinction of schema elements * Move schema id evaluation to before saving of collectionKey SchemaId will be used first shortly and I dont want to polute the logic-change commit with move stuff * Rename collectionSchemaKey variable Is clearer * Persist schemaVersionKey * Persist schemaVersionId at schemaKey SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion. * Persist schemaVersionId at collectionKey Uses the getCollectionByVersionId defined in an earlier commit. getCollectionDesc is an existing duplicate function that as well as duplicating itself, also duplicates much of the logic in getCollectionByVersionId - it should be refactored at somepoint, but correcting this is IMO out of scope at the moment.

AndrewSisley added area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components action/no-benchmark Skips the action that runs the benchmark. labels Jan 4, 2023

AndrewSisley added this to the DefraDB v0.5 milestone Jan 4, 2023

AndrewSisley requested a review from a team January 4, 2023 21:05

AndrewSisley self-assigned this Jan 4, 2023

AndrewSisley changed the title ~~refactor: Rework collection keys to permit the storage of multiple schema versions~~ refactor: Allow collection keys to handle multiple schema versions Jan 4, 2023

AndrewSisley changed the title ~~refactor: Allow collection keys to handle multiple schema versions~~ refactor: Allow db keys to multiple schema versions Jan 4, 2023

AndrewSisley changed the title ~~refactor: Allow db keys to multiple schema versions~~ refactor: Allow db keys to handle multiple schema versions Jan 4, 2023

jsimnz force-pushed the develop branch from a615303 to d87bae8 Compare January 8, 2023 00:16

AndrewSisley force-pushed the sisley/refactor/I1022-schema-version-id branch from 19fa240 to f12a366 Compare January 10, 2023 20:40

fredcarle approved these changes Jan 11, 2023

View reviewed changes

db/collection.go Show resolved Hide resolved

db/collection.go Show resolved Hide resolved

planner/datasource.go Outdated Show resolved Hide resolved

planner/mapper/descriptions.go Outdated Show resolved Hide resolved

shahzadlone approved these changes Jan 12, 2023

View reviewed changes

jsimnz approved these changes Jan 17, 2023

View reviewed changes

AndrewSisley mentioned this pull request Jan 18, 2023

Allow getting collection by version id #1007

Closed

AndrewSisley force-pushed the sisley/refactor/I1022-schema-version-id branch from dded9c2 to 60cf433 Compare January 18, 2023 17:25

Rename collectionKey variable

e59adef

Is clearer

AndrewSisley added 12 commits January 18, 2023 12:25

Add schema version key

c6443d9

Name will likely be improved with #1025

Remove duplicate variable

af3edbd

Deduplicate business logic

d163144

These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly

Document global-local distinction of schema elements

7c4ecbc

Move schema id evaluation to before saving of collectionKey

40f3255

SchemaId will be used first shortly and I dont want to polute the logic-change commit with move stuff

Rename collectionSchemaKey variable

f8cdf20

Is clearer

Persist schemaVersionKey

712e395

Persist schemaVersionId at schemaKey

27c3d0d

SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion.

PR FIXUP - Rename buffer

7a5d3f8

PR FIXUP - Remove duplicate lines

7251a31

PR FIXUP - Return concrete type

60cf433

AndrewSisley merged commit db76f2d into develop Jan 18, 2023

AndrewSisley deleted the sisley/refactor/I1022-schema-version-id branch January 18, 2023 18:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Allow db keys to handle multiple schema versions #1026

refactor: Allow db keys to handle multiple schema versions #1026

AndrewSisley commented Jan 4, 2023

codecov bot commented Jan 4, 2023 •

edited

Loading

jsimnz commented Jan 9, 2023

fredcarle left a comment

AndrewSisley commented Jan 11, 2023

fredcarle commented Jan 11, 2023

AndrewSisley commented Jan 11, 2023

fredcarle commented Jan 11, 2023

AndrewSisley commented Jan 12, 2023

fredcarle commented Jan 12, 2023

AndrewSisley commented Jan 12, 2023 •

edited

Loading

fredcarle commented Jan 12, 2023

shahzadlone left a comment

AndrewSisley commented Jan 12, 2023

jsimnz left a comment

jsimnz Jan 17, 2023

AndrewSisley Jan 17, 2023 •

edited

Loading

jsimnz Jan 18, 2023

jsimnz Jan 17, 2023

AndrewSisley Jan 17, 2023 •

edited

Loading

AndrewSisley Jan 18, 2023

jsimnz Jan 17, 2023

AndrewSisley Jan 17, 2023

jsimnz Jan 17, 2023

shahzadlone commented Jan 17, 2023

AndrewSisley commented Jan 17, 2023

refactor: Allow db keys to handle multiple schema versions #1026

refactor: Allow db keys to handle multiple schema versions #1026

Conversation

AndrewSisley commented Jan 4, 2023

Relevant issue(s)

Epic level context doc

Description

codecov bot commented Jan 4, 2023 • edited Loading

Codecov Report

jsimnz commented Jan 9, 2023

fredcarle left a comment

Choose a reason for hiding this comment

AndrewSisley commented Jan 11, 2023

fredcarle commented Jan 11, 2023

AndrewSisley commented Jan 11, 2023

fredcarle commented Jan 11, 2023

AndrewSisley commented Jan 12, 2023

fredcarle commented Jan 12, 2023

AndrewSisley commented Jan 12, 2023 • edited Loading

fredcarle commented Jan 12, 2023

shahzadlone left a comment

Choose a reason for hiding this comment

AndrewSisley commented Jan 12, 2023

jsimnz left a comment

Choose a reason for hiding this comment

jsimnz Jan 17, 2023

Choose a reason for hiding this comment

AndrewSisley Jan 17, 2023 • edited Loading

Choose a reason for hiding this comment

jsimnz Jan 18, 2023

Choose a reason for hiding this comment

jsimnz Jan 17, 2023

Choose a reason for hiding this comment

AndrewSisley Jan 17, 2023 • edited Loading

Choose a reason for hiding this comment

AndrewSisley Jan 18, 2023

Choose a reason for hiding this comment

jsimnz Jan 17, 2023

Choose a reason for hiding this comment

AndrewSisley Jan 17, 2023

Choose a reason for hiding this comment

jsimnz Jan 17, 2023

Choose a reason for hiding this comment

shahzadlone commented Jan 17, 2023

AndrewSisley commented Jan 17, 2023

codecov bot commented Jan 4, 2023 •

edited

Loading

AndrewSisley commented Jan 12, 2023 •

edited

Loading

AndrewSisley Jan 17, 2023 •

edited

Loading

AndrewSisley Jan 17, 2023 •

edited

Loading