Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Allow db keys to handle multiple schema versions #1026

Merged
merged 13 commits into from
Jan 18, 2023

Conversation

AndrewSisley
Copy link
Contributor

Relevant issue(s)

Resolves #1022

Epic level context doc

https://source.almanac.io/docs/N2o87ZyxzsruNHSlAErwHMVnqNBPOZFn

Description

Reworks the database keys Collection and CollectionSchemaKey so that they become pointers to a CollectionSchemaVersionKey. this should allow multiple schema versions to be persisted in a manner consistent with the current 'head'/Collection key.

The current state of Collection related database keys does not readily facilitate multiple versions. This is required because when introducing updateable schema we need to preserve the schema history, for example for aspects described in issues in the range #1006-#1012

The names are not fantastic at the moment, in part due to the duplication noted in #1025, simplifying the names would be done in that ticket (alongside the deduplication).

@AndrewSisley AndrewSisley added area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components action/no-benchmark Skips the action that runs the benchmark. labels Jan 4, 2023
@AndrewSisley AndrewSisley added this to the DefraDB v0.5 milestone Jan 4, 2023
@AndrewSisley AndrewSisley requested a review from a team January 4, 2023 21:05
@AndrewSisley AndrewSisley self-assigned this Jan 4, 2023
@codecov
Copy link

codecov bot commented Jan 4, 2023

Codecov Report

Merging #1026 (60cf433) into develop (9a98755) will decrease coverage by 0.04%.
The diff coverage is 76.62%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1026      +/-   ##
===========================================
- Coverage    57.79%   57.76%   -0.04%     
===========================================
  Files          174      174              
  Lines        19521    19563      +42     
===========================================
+ Hits         11283    11300      +17     
- Misses        7248     7263      +15     
- Partials       990     1000      +10     
Impacted Files Coverage Δ
db/errors.go 0.00% <ø> (ø)
planner/mapper/descriptions.go 62.50% <60.00%> (-10.23%) ⬇️
planner/datasource.go 59.45% <62.50%> (-1.84%) ⬇️
db/collection.go 65.87% <80.43%> (-0.75%) ⬇️
core/key.go 87.45% <84.61%> (-0.15%) ⬇️
net/client.go 78.04% <0.00%> (-4.88%) ⬇️
net/peer.go 44.01% <0.00%> (-2.16%) ⬇️
net/pb/net.pb.go 15.66% <0.00%> (+0.16%) ⬆️
... and 1 more

@AndrewSisley AndrewSisley changed the title refactor: Rework collection keys to permit the storage of multiple schema versions refactor: Allow collection keys to handle multiple schema versions Jan 4, 2023
@AndrewSisley AndrewSisley changed the title refactor: Allow collection keys to handle multiple schema versions refactor: Allow db keys to multiple schema versions Jan 4, 2023
@AndrewSisley AndrewSisley changed the title refactor: Allow db keys to multiple schema versions refactor: Allow db keys to handle multiple schema versions Jan 4, 2023
@jsimnz
Copy link
Member

jsimnz commented Jan 9, 2023

Need to rebase 👀 due to the 0.4 re-release

@AndrewSisley AndrewSisley force-pushed the sisley/refactor/I1022-schema-version-id branch from 19fa240 to f12a366 Compare January 10, 2023 20:40
Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving in the spirit that this seems to work and does move the needle in the direction I think we want to go. I have so many questions that have been generated by looking at these 129 new lines. Here are a few just to keep it short.

  1. Do we plan on being able to walk back the schema history?
  2. If so, did you think of what the mechanics of that will look like?
  3. Is it via some kind of block store or a reference to the previous version in the description of the new version?
  4. Will the document's fields have a reference to the schema version they belong to?

db/collection.go Show resolved Hide resolved
db/collection.go Show resolved Hide resolved
planner/datasource.go Outdated Show resolved Hide resolved
planner/mapper/descriptions.go Outdated Show resolved Hide resolved
@AndrewSisley
Copy link
Contributor Author

  1. Do we plan on being able to walk back the schema history? (and Q 2.)

I'm unsure what you mean by walk back here. The current state of the schema at that version will be stored, not the patch. If you meant a post-transaction-commit rollback, the thought hadn't crossed my mind - it sounds like a nice idea, but with a whole minefield's worth of gotchas and well out of scope from anything I plan on doing in 0.5.

  1. Is it via some kind of block store or a reference to the previous version in the description of the new version?

I dont understand to what you are referring to here, although hopefully it is answered in reply to Q1-2.

Will the document's fields have a reference to the schema version they belong to?

I have no plan on doing so, and dont see any immediate benefit of doing so. Why?

@fredcarle
Copy link
Collaborator

I'm unsure what you mean by walk back here. The current state of the schema at that version will be stored, not the patch. If you meant a post-transaction-commit rollback, the thought hadn't crossed my mind - it sounds like a nice idea, but with a whole minefield's worth of gotchas and well out of scope from anything I plan on doing in 0.5.

I don't mean rollback no. What I mean is that if we have multiple schema updates, unless we have the exact schemaVersionId from somewhere to get those previous versions, I don't see how we can inspect the change history.

I dont understand to what you are referring to here, although hopefully it is answered in reply to Q1-2.

With the documents, we have a block store that links the updates. I was wondering if that's what you had in mind.

I have no plan on doing so, and dont see any immediate benefit of doing so. Why?

If we inspect the history of a document and we have fields that were deleted from the schema over time, I'm thinking we might need a way to know which schema version it was to get the right field name and type and so on.

@AndrewSisley
Copy link
Contributor Author

I don't mean rollback no. What I mean is that if we have multiple schema updates, unless we have the exact schemaVersionId from somewhere to get those previous versions, I don't see how we can inspect the change history.

Ah I think I understand you, the commit queries will expose the schemaVersionId. There will also likely be a function on the client ABI that allows the schema of a given version to be fetched (noted in the almanac doc).

At some point in the future we may want to allow the joining of the schema onto the commit in a query - but I consider that to be well out of scope atm. Likewise, the querying of deleted fields etc in a commit query is explicitly out of scope of phase 1 (and probably 0.5) - it gets very complicated as John noted in the standup yesterday, and whilst very nice to have I dont think it should block or slow down progress on the core [Schema Update] feature.

With the documents, we have a block store that links the updates. I was wondering if that's what you had in mind.

Ah I forgot the block store links to the prior commit (for composites). Doing so isn't a requirement for phase 1-3, and would not provide anything useful to the user. Being able to query a (visibly) ordered schema history would be nice at some point but is out of scope for now.

If we inspect the history of a document and we have fields that were deleted from the schema over time, I'm thinking we might need a way to know which schema version it was to get the right field name and type and so on.

Ah okay - this is to be held on the commit (and probably only the composite*), not the field's head as I thought you meant. The requirement that you just stated is a large reason as to why this PR exists (so that commit/time-travelling querys dont stop making sense).

*It will likely only be stored on the composite, but exposed as a requestable field on any commit query (with field-level commits sourcing it from their composite). Due to how time-travelling queries are just normal queries with a cid param this will almost certainly mean that it will also be made available on all normal queries too (again probably sourcing it from the composite commit).

@fredcarle
Copy link
Collaborator

Storing it in the composite makes a lot of sense. I said field but the point was just to give a reference.

Your answers all make sense. When I ask the questions I'm not inferring that it should be done now. But they are all real questions I had to ask myself while looking at the code. I find that the situation supports what we were saying in the standup regarding documentation. These kinds of questions and interaction could be had before we get to code. This way we alleviate the possibility that everyone will "waste" time going through the same thought process I did. Do you see what I mean?

@AndrewSisley
Copy link
Contributor Author

Storing it in the composite makes a lot of sense. I said field but the point was just to give a reference.

Your answers all make sense. When I ask the questions I'm not inferring that it should be done now. But they are all real questions I had to ask myself while looking at the code. I find that the situation supports what we were saying in the standup regarding documentation.

These kinds of questions and interaction could be had before we get to code. This way we alleviate the possibility that everyone will "waste" time going through the same thought process I did. Do you see what I mean?

I do see where you are coming from, but a lot of your questions were regarding new ideas to me :) The time would be "wasted" either here or earlier. And it is impossible to explicitly exclude an infinite range of possibilities from a design doc (and doing so would reduce the information density of what is in scope). I still would like to encourage you to think along the lines of 'If it isn't in the spec, it is not in scope'. That said, I do think questions 1, 2, and 4 were explicitly mentioned in the almanac doc (with 4 not mentioning how it is achieved).

@fredcarle
Copy link
Collaborator

That said, I do think questions 1, 2, and 4 were explicitly mentioned

Mentioned but not answering the questions.

I still would like to encourage you to think along the lines of 'If it isn't in the spec, it is not in scope'.

We would first need a spec for that to be true :)

@AndrewSisley
Copy link
Contributor Author

AndrewSisley commented Jan 12, 2023

We would first need a spec for that to be true :)

Ah we have document the specifies and scopes out the problem to solve, it is not a tech spec, but it definitely covers this topic-area (It is almost the only thing it covers).

@fredcarle
Copy link
Collaborator

Ah we have document the specifies and scopes out the problem to solve, it is not a tech spec, but it definitely covers this topic-area (It is almost the only thing it covers).

Really? Do you mind sharing that document please. I don't think I've seen it.

Copy link
Member

@shahzadlone shahzadlone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document somewhere in a note, the difference between schema version id vs schema id.

Otherwise lgtm.

@AndrewSisley
Copy link
Contributor Author

Can you document somewhere in a note, the difference between schema version id vs schema id.

I think it is already:

In the body of CreateCollection it is noted on the var declaration:

// For new schemas the initial version id will match the schema id

And the version key struct is documented with:

// CollectionSchemaVersionKey points to schema of a collection at a given
// version.

And the collection key struct is documented with:

// CollectionKey points to the current/'head' SchemaVersionId for
// the collection of the given name.

And the schema key struct is documented with:

// CollectionSchemaKey points to the current/'head' SchemaVersionId for
// the collection of the given schema id.

Where were you thinking-of-putting/expecting the note?

Copy link
Member

@jsimnz jsimnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor suggestions (the private/concrete suggestion should be done tho :) ).

db/collection.go Outdated
// getCollectionByVersionId returns the [client.Collection] at the given [schemaVersionId] version.
//
// Will return an error if the given key is empty, or not found.
func (db *db) getCollectionByVersionId(ctx context.Context, schemaVersionId string) (client.Collection, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Internal funcs/methods should return private concrete types.

Copy link
Contributor Author

@AndrewSisley AndrewSisley Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is very important here as this is not a utils/helper-like function, but will change

  • change func sig

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Id assert it's semi important. Has a cleaner API surface so that if you need to use it internally, there's no need for casting from the interface to access collection type info.

Compile time safety 😅

Internal methods/funcs in a package should always handle concrete internal types if possible

Comment on lines +246 to +252
key := core.NewCollectionKey(name)
buf, err := db.systemstore().Get(ctx, key.ToDS())
if err != nil {
return nil, err
}

schemaVersionId := string(buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion(non-blocking): Going from name to schemaVersionId can prob be scoped to a private method.

func (db *db) GetCollectionByName(ctx context.Context, name string) (client.Collection, error) {
    // ...
    schemaVersionId, err := db.getSchemaVersionId(name)
    return db.getCollectionByVersionId(ctx, schemaVersionId)
}

Copy link
Contributor Author

@AndrewSisley AndrewSisley Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look - if it would only be called once (I cant remember), then this will probably remain as-is.

  • name=>vId private func?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave this as now, there is no significantly useful way of doing this without changing (adding to) the public interface in client. That work is already planned in the epic - I would rather this as part of that work where proper time can be allocated to it. I'll make a note of these locations in that ticket though as a reminder.

Comment on lines +63 to +72
collectionKey := core.NewCollectionKey(name)
var desc client.CollectionDescription
buf, err := p.txn.Systemstore().Get(p.ctx, key.ToDS())
schemaVersionIdBytes, err := p.txn.Systemstore().Get(p.ctx, collectionKey.ToDS())
if err != nil {
return desc, errors.Wrap("failed to get collection description", err)
}

schemaVersionId := string(schemaVersionIdBytes)
schemaVersionKey := core.NewCollectionSchemaVersionKey(schemaVersionId)
buf, err := p.txn.Systemstore().Get(p.ctx, schemaVersionKey.ToDS())
if err != nil {
return desc, err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion(non-blocking): Would be nice (but not required) to link the two implementations (here and above) to handle schema name => versionId

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I understand your other comment better now - thanks :)

Comment on lines +46 to +55
collectionKey := core.NewCollectionKey(name)
var desc client.CollectionDescription
schemaVersionIdBytes, err := r.txn.Systemstore().Get(r.ctx, collectionKey.ToDS())
if err != nil {
return desc, errors.Wrap("failed to get collection description", err)
}

key := core.NewCollectionKey(name)
buf, err := r.txn.Systemstore().Get(r.ctx, key.ToDS())
schemaVersionId := string(schemaVersionIdBytes)
schemaVersionKey := core.NewCollectionSchemaVersionKey(schemaVersionId)
buf, err := r.txn.Systemstore().Get(r.ctx, schemaVersionKey.ToDS())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion(non-blocking): Again :)

@shahzadlone
Copy link
Member

Can you document somewhere in a note, the difference between schema version id vs schema id.

I think it is already:

In the body of CreateCollection it is noted on the var declaration:

// For new schemas the initial version id will match the schema id

And the version key struct is documented with:

// CollectionSchemaVersionKey points to schema of a collection at a given
// version.

And the collection key struct is documented with:

// CollectionKey points to the current/'head' SchemaVersionId for
// the collection of the given name.

And the schema key struct is documented with:

// CollectionSchemaKey points to the current/'head' SchemaVersionId for
// the collection of the given schema id.

Where were you thinking-of-putting/expecting the note?

I was trying to find a central place where these different ids are documented, now that you pointed this location I can come here anytime to read these to remember the distinctions.

Another non-blocking question, how do you distinguish the use of the word "key" vs "id".

@AndrewSisley
Copy link
Contributor Author

I was trying to find a central place where these different ids are documented, now that you pointed this location I can come here anytime to read these to remember the distinctions.

Yeah the key.go file is probably your best bet for stuff like that.

Another non-blocking question, how do you distinguish the use of the word "key" vs "id".

I am not sure I do in most cases - if I'm explicitly talking about datastore keys I'll use "key", id is more open.

Name will likely be improved with #1025
These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly
SchemaId will be used first shortly and I dont want to polute the logic-change commit with move stuff
SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion.
Uses the getCollectionByVersionId defined in an earlier commit. getCollectionDesc is an existing duplicate function that as well as duplicating itself, also duplicates much of the logic in getCollectionByVersionId - it should be refactored at somepoint, but correcting this is IMO out of scope at the moment.
@AndrewSisley AndrewSisley merged commit db76f2d into develop Jan 18, 2023
@AndrewSisley AndrewSisley deleted the sisley/refactor/I1022-schema-version-id branch January 18, 2023 18:08
shahzadlone pushed a commit that referenced this pull request Apr 13, 2023
* Rename collectionKey variable

Is clearer

* Add schema version key

Name will likely be improved with #1025

* Remove duplicate variable

* Deduplicate business logic

These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly

* Document global-local distinction of schema elements

* Move schema id evaluation to before saving of collectionKey

SchemaId will be used first shortly and I dont want to polute the logic-change commit with move stuff

* Rename collectionSchemaKey variable

Is clearer

* Persist schemaVersionKey

* Persist schemaVersionId at schemaKey

SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion.

* Persist schemaVersionId at collectionKey

Uses the getCollectionByVersionId defined in an earlier commit. getCollectionDesc is an existing duplicate function that as well as duplicating itself, also duplicates much of the logic in getCollectionByVersionId - it should be refactored at somepoint, but correcting this is IMO out of scope at the moment.
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
…work#1026)

* Rename collectionKey variable

Is clearer

* Add schema version key

Name will likely be improved with sourcenetwork#1025

* Remove duplicate variable

* Deduplicate business logic

These absolutely have to have the same value, the descriptive new variable name will also come in handy shortly

* Document global-local distinction of schema elements

* Move schema id evaluation to before saving of collectionKey

SchemaId will be used first shortly and I dont want to polute the logic-change commit with move stuff

* Rename collectionSchemaKey variable

Is clearer

* Persist schemaVersionKey

* Persist schemaVersionId at schemaKey

SchemaKey is now a pointer to the current version, essentially becoming schema.head. getCollectionByVersionId is a copy-paste of the existing GetCollectionByName function. GetCollectionByName will be reworked shortly to work in a similar fashion.

* Persist schemaVersionId at collectionKey

Uses the getCollectionByVersionId defined in an earlier commit. getCollectionDesc is an existing duplicate function that as well as duplicating itself, also duplicates much of the logic in getCollectionByVersionId - it should be refactored at somepoint, but correcting this is IMO out of scope at the moment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/no-benchmark Skips the action that runs the benchmark. area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework CollectionKeys to allow versioning of schema
4 participants