Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Deprecate CollectionDescription.Schema #1939

Merged

Conversation

AndrewSisley
Copy link
Contributor

@AndrewSisley AndrewSisley commented Oct 6, 2023

Relevant issue(s)

Resolves #1955

Description

Deprecate CollectionDescription.Schema. Schema is not a sub-property of collection.

Removes as many references to CollectionDescription.Schema as possible without making any breaking changes. Breaking changes will be made in a later PR. An exception has been made to the http API, which does have a breaking change in this PR - it can be pulled out of here if people want it more obvious in the develop commit history/change log.

This should minimise the conflicts generated by getting the large broad changes sorted out early, instead of being bogged down behind the more complex, breaking, changes coming later.

Commits should be clean, and should bundle up the work into more readable chunks, would recommend reviewing them one by one.

@AndrewSisley AndrewSisley added area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components labels Oct 6, 2023
@AndrewSisley AndrewSisley added this to the DefraDB v0.8 milestone Oct 6, 2023
@AndrewSisley AndrewSisley requested a review from a team October 6, 2023 21:18
@AndrewSisley AndrewSisley self-assigned this Oct 6, 2023
@codecov
Copy link

codecov bot commented Oct 6, 2023

Codecov Report

Attention: 33 lines in your changes are missing coverage. Please review.

Comparison is base (1e255e7) 74.94% compared to head (ae291b3) 74.81%.

❗ Current head ae291b3 differs from pull request most recent head 1fc3e0d. Consider uploading reports for the commit 1fc3e0d to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1939      +/-   ##
===========================================
- Coverage    74.94%   74.81%   -0.13%     
===========================================
  Files          241      240       -1     
  Lines        23616    23645      +29     
===========================================
- Hits         17699    17689      -10     
- Misses        4715     4751      +36     
- Partials      1202     1205       +3     
Flag Coverage Δ
all-tests 74.81% <91.27%> (-0.13%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
cli/collection_describe.go 91.43% <100.00%> (ø)
db/collection_get.go 72.22% <100.00%> (-0.51%) ⬇️
db/collection_index.go 97.25% <100.00%> (+0.01%) ⬆️
db/collection_update.go 73.30% <100.00%> (+0.14%) ⬆️
db/errors.go 78.23% <ø> (-2.21%) ⬇️
db/fetcher/fetcher.go 77.17% <100.00%> (+0.63%) ⬆️
db/fetcher/indexer.go 78.89% <100.00%> (ø)
db/index.go 97.20% <100.00%> (-0.05%) ⬇️
lens/fetcher.go 66.51% <100.00%> (ø)
net/process.go 88.24% <100.00%> (ø)
... and 19 more

... and 8 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e255e7...1fc3e0d. Read the comment docs.

AndrewSisley added a commit to AndrewSisley/defradb that referenced this pull request Oct 10, 2023
@@ -16,6 +16,11 @@ import (
"github.com/sourcenetwork/defradb/client"
)

type CollectionDefinition struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: does it need to be a public definition of cli package?

Seems like it doesn't belong here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the public output type of the command. And this is the only place that outputs it.

Where were you thinking it should be?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct doesn't need to be public. Only its fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct doesn't need to be public. Only its fields.

That is technically correct, but, IMO conceptually incorrect. This is a public return type, it is just that it is returned as json - declaring it as private would be misleading.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's specific to this CLI command and not used anywhere else. I really think this struct shouldn't be public. If you absolutely want it to be public, please at some documentation to the type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is returned from a public endpoint but that public endpoint isn't a public Go function. That endpoint won't show up in the godocs and the function itself is returning an error. The type/shape should be documented elsewhere but not with a public Go type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be to define it in the client package and reference it in the cli and http. That would also get rid of the duplicate definition in the http client.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be happy with that option :)

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice okay - I'll try and figure something out. Some other stuff might get re-jigged slightly as we already have a CollectionDefinition (interface) in client, but hopefully everything will be the better for it.

  • Bring http and cli CollectionDefinitions into client

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorted, client.CollectionDefinition is now a struct

if field.ID == id {
return field, true
}
func (col CollectionDescription) GetFieldByID(id FieldID, schema *SchemaDescription) (FieldDescription, bool) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: This and the couple other methods bellow no longer need the CollectionDescription. Do you intend on turning those into methods of SchemaDescription in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this and the others remain here as whilst they currently only access stuff on the schema, that is because the stuff they access is incorrectly on the schema.

For example, in this case we are getting fields by their ID, FieldID is a local concept that should exist only on collection, not on schema.

The function is located in the correct place, but for historical reasons the data structures are incorrect and need to be changed along with the implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying

db/collection.go Outdated
Comment on lines 209 to 211
existingDescriptionsByName map[string]client.CollectionDescription,
proposedDescriptionsByName map[string]client.CollectionDescription,
existingSchemaByName map[string]client.SchemaDescription,
proposedDescriptionsByName map[string]client.SchemaDescription,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought(out of scope): I would like to point out that this in a really confusing set of params. We are updateing one collection but we are passing all collection, all schemas and all proposed schemas. In my opinion, only what is directly relevant should be passed to this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is fiddly, and I agree that it is a bit of an eyesore. However they are all used and all need to be there (or within this func, we could probably (re)fetch them) due to relational fields.

It does get improved slightly in a later PR when we are able to change all their types to SchemaDescriptions (allowing us to only pass in two maps instead of 3). Might be worth re-visiting this in #1965

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a todo in that PR, as one of the params still makes no sense there: https://github.com/sourcenetwork/defradb/pull/1965/files#r1358767728

Removing it leaves us with:

func (db *db) updateCollection(
	ctx context.Context,
	txn datastore.Txn,
	existingSchemaByName map[string]client.SchemaDescription,
	proposedDescriptionsByName map[string]client.SchemaDescription,
	schema client.SchemaDescription,
	setAsDefaultVersion bool,
) (client.Collection, error) {

existingSchemaByName could be re-fetched inside updateCollection, but that feels fairly wasteful atm (although a planned repository mechanic would remove that issue).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are used but in a wasteful way as we only care about one in each of the maps. The only proposedDescription that we care about is already passed as schema so that one can be removed completely. It should probably look like this:

func (db *db) updateCollection(
    ctx context.Context,
    txn datastore.Txn,
    existingSchema client.SchemaDescription,
    schema client.SchemaDescription,
    desc client.CollectionDescription
    setAsDefaultVersion bool,
) (client.Collection, error) {

Although the checks on the schema should probably be done before calling updateCollection and we would end up with just:

func (db *db) updateCollection(
    ctx context.Context,
    txn datastore.Txn,
    schema client.SchemaDescription,
    desc client.CollectionDescription
    setAsDefaultVersion bool, // if we call updateCollection, this should be implicit and removed from params.
) (client.Collection, error) {

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only proposedDescription that we care about is already passed as schema so that one can be removed completely

This is not true, we must validate that all relational fields point to a valid schema. That schema is rarely the one being updated.

Although the checks on the schema should probably be done before calling updateCollection

I'me very much against that, as it moves the validation out of a request-language agnostic part of the codebase (db.collection.go) into a request-language specific area (patchSchema) of the codebase. It would mean that if we add an alterative to patchSchema (including a typed embedded Go call) we'd have to rework the validation.

Copy link
Collaborator

@fredcarle fredcarle Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the problem then is that the name of the function is updateCollection but it handles both schema and collection in the same function. Maybe once we get further in cleaning this up it won't be a problem anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for-reference: We spoke about this last Friday, and there is a task on a later PR to rename this function (when it stops being updateCollection)

@@ -24,9 +24,32 @@ import (
"github.com/graphql-go/graphql/language/source"
)

type collectionDefinition struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: You'll be able to use the one in the client package after you move it there instead of redefining it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: you removed so many tests. Are they already covered by some other tests?

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, our integration tests cover all of this, in significantly more depth.

@@ -324,28 +304,6 @@ func TestCreateIndex_IfFieldHasNoDirection_DefaultToAsc(t *testing.T) {
assert.Equal(t, client.Ascending, newDesc.Fields[0].Direction)
}

func TestCreateIndex_IfNameIsNotSpecified_Generate(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why did you remove this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is documented in the commit message:

One test was removed, TestCreateIndex_IfNameIsNotSpecified_Generate - this is because it only pretends to test what it tests - the name is set later by the unit test framework, and as such is identical to other tests here.

@@ -22,6 +22,11 @@ import (

type storeHandler struct{}

type CollectionDefinition struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: whatever solution is for identical struct in cli/collection_describe.go that we discussed should also apply here I believe

It adds no value in any location, and will always return false anyway.  Function likely originates from a time where schemaless collections were still being considered a possibility.
Unrelated to main body of work, but might as well fix it here.
Note: The only place that calls this only wants the field name, which long term will be available on the CollectionFieldDescription anyway.  I hope this param can be removed again soon. Additionally all properties of Schema used in that calling location will soon be on Collection itself, so the reference to Schema there is hopefully a very temporary thing too.
This will be temporary, as the properties required by this function will medium-long term on the Collection field set, not the schema.
This was validating stuff on the schema.  This stuff has already been validated on the schema before reaching this code block.
It should no longer be accessed via [collection] desc
Just get it from col.Schema instead of having to maintain the same prop in multiple places - this caused a bug briefly.
Just get it from col.Schema instead of having to maintain the same prop in multiple places.
The private funtion is not the one under test, and so we should not be forced to change the P2P tests every time we wish to change this private function. May be a historical artifact.
The main change in this commit is the removal of the call to `db.createCollection`, which is a private function used as a utility _only_ in these tests, it is not a function under test, but a means to an end.  Using it coupled the index tests to the function, whose only reason-to-be is from within the `db.addSchema` (GQL SDL stuff), this coupling is unwanted and needs to be removed in order to change the behaviour/signature of that private function in a later commit.  The unit tests now use the public `db.AddSchema` and `db.GetCollectionByName` functions.

One test was removed, TestCreateIndex_IfNameIsNotSpecified_Generate - this is because it only pretends to test what it tests - the name _is_ set later by the unit test framework, and as such is identical to other tests here.
Similar to the changes to the fetcher(s), this commit replaces CollectionDescription objects with Collections.

It also removes the DescriptionsRepo, replacing calls to it with calls to `store.GetCollectionByName`.  At some point the repo was caching descriptions to avoid fetching them from storage multiple times, but this has since been removed, making the type fairly pointless.
Splits the collection definition/metadata out from the interface that declares what a Collection can _do_.  This allows the trimmed down metadata to be passed around to code-locations which should not have access to database operations, this will be used in later commit(s).
Uses the new(er) client.CollectionDefinition interface to minimize the number of references to collectionDescription.Schema when creating new collections/schema from a GQL SDL.

I do not believe the gql parser code should have access to the database, so CollectionDefinition is passed around instead of Collection.

Some more work will need to be made to this function and its children as the splitting of collection and schema description is completed, however this should get us a lot of the way there without changing public behaviour/signatures.
Previously the return values would include schema within collectionDescription.  This commit makes them siblings.  This is a public behaviour change (I think the only one in the PR).
This function will be broken up further outside of this PR, the remaining changes will be breaking and will change both storage and the public interfaces.
@AndrewSisley AndrewSisley force-pushed the 1913-rm-schema-from-col branch 3 times, most recently from 09f67a4 to 95edff0 Compare October 16, 2023 16:54
@@ -16,24 +16,34 @@ import (
"github.com/sourcenetwork/defradb/datastore"
)

// CollectionDefinition contains the metadata defining what a Collection is.
type CollectionDefinition struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: I'm a little worried that the name is so close to CollectionDescription that it will be a source of confusion. What do you think of CollectionInfo?

Copy link
Contributor Author

@AndrewSisley AndrewSisley Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike CollectionInfo as it suffers from being a little ambiguous - could be read as containing info about what data is in the collection too (e.g. doc count), CollectionDefinition suffers less from that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably think of something better then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably think of something better then.

I tried, and CollectionDefinition was the best I could come up with without spending an unreasonable amount of time on it. I'm fairly happy with it given the problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting here for future reference:

I feel like client.SchemaDescription should just be client.Schema and client.CollectionDescription should be client.Collection. The interface should have a different name. Probably something like client.CollectionAPI or client.CollectionHandler.

This way, client.CollectionDefinition would work.

Copy link
Contributor

@islamaliev islamaliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work Andy!

Comment on lines +99 to +100
schema := def.Schema
desc := def.Description
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why reallocate when you've already passed a copy of the definition? Using def directly would be clearer and you could assign to col.def directly from it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a struct, and its children are structs, mutating them does not mutate the one on the parent, the parent must instead later be overwritten with a new inner.

// foo.bar.foobar is false
foo.bar.foobar = true
// foo.bar.foobar is still false, a new (and unreachable) bar instance has value true

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's true: https://go.dev/play/p/FHce-ajgFyr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆 Maybe not, but something was very broken before I made that change. It is very deliberate and it doesnt work without it. Could be something similar in the more complex real-world code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested it and all tests are passing. Not sure what would have been broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot see what you have changed, most likely it was schema that was important here, as that is what is being mutated most.

UpdateCollection already does this, it is safer, and we dont care about the performance cost here.
Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks Andy!

@AndrewSisley AndrewSisley merged commit 5c1b21e into sourcenetwork:develop Oct 16, 2023
26 checks passed
@AndrewSisley AndrewSisley deleted the 1913-rm-schema-from-col branch October 16, 2023 20:55
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
## Relevant issue(s)

Resolves sourcenetwork#1955

## Description

Deprecate CollectionDescription.Schema. Schema is not a sub-property of
collection.

Removes as many references to CollectionDescription.Schema as possible
without making any breaking changes. Breaking changes will be made in a
later PR. An exception has been made to the http API, which does have a
breaking change in this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/collections Related to the collections system refactor This issue specific to or requires *notable* refactoring of existing codebases and components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deprecate CollectionDescription.Schema
5 participants