
feat: Allow new fields to be added locally to schema #1139

Merged: 12 commits, Mar 6, 2023

Conversation

AndrewSisley
Contributor

@AndrewSisley AndrewSisley commented Feb 20, 2023

Relevant issue(s)

Resolves #1004

Description

Allows new simple (non-relational) fields to be added locally to schema.

The following items are out of scope of this PR and will be done later:

  • GQL Introspection tests
  • P2P tests on updated schema
  • Hiding the Field indexes
  • Reworking the client interfaces to provide a much cleaner experience regarding functions exposing transactions
  • Hiding the FieldDescription.Kind number value
  • Hiding the FieldDescription.Typ (CRDT) number value and renaming that field
  • Concurrency tests
  • Wrapping/hiding json patch lib errors (e.g. add operation does not apply: doc is missing path)

I very strongly recommend reviewing commit by commit; there are (hopefully) useful explanations of what has been done in each commit body. The feature itself is introduced in the final commit; everything else is preparation for it.

@AndrewSisley AndrewSisley added feature New feature or request area/schema Related to the schema system area/collections Related to the collections system action/no-benchmark Skips the action that runs the benchmark. labels Feb 20, 2023
@AndrewSisley AndrewSisley added this to the DefraDB v0.5 milestone Feb 20, 2023
@AndrewSisley AndrewSisley requested a review from a team February 20, 2023 18:56
@AndrewSisley AndrewSisley self-assigned this Feb 20, 2023
@AndrewSisley AndrewSisley force-pushed the sisley/feat/I1004-add-fields-to-schema branch 4 times, most recently from f0f552c to 05a2f0e Compare February 20, 2023 20:25
@codecov

codecov bot commented Feb 20, 2023

Codecov Report

Merging #1139 (27093dd) into develop (92a7f89) will increase coverage by 0.05%.
The diff coverage is 69.76%.


@@             Coverage Diff             @@
##           develop    #1139      +/-   ##
===========================================
+ Coverage    68.23%   68.28%   +0.05%     
===========================================
  Files          181      181              
  Lines        16617    17007     +390     
===========================================
+ Hits         11338    11613     +275     
- Misses        4337     4426      +89     
- Partials       942      968      +26     
Impacted Files Coverage Δ
request/graphql/schema/generate.go 84.26% <ø> (+0.33%) ⬆️
net/peer.go 43.90% <21.56%> (-2.15%) ⬇️
net/server.go 59.87% <25.00%> (ø)
db/collection_update.go 72.17% <50.00%> (ø)
db/db.go 70.24% <50.00%> (-2.73%) ⬇️
db/p2p_collection.go 55.55% <57.14%> (-4.45%) ⬇️
db/schema.go 64.51% <67.53%> (+18.68%) ⬆️
db/collection.go 67.65% <75.64%> (+1.45%) ⬆️
request/graphql/parser.go 87.14% <78.57%> (-2.69%) ⬇️
db/sequence.go 64.28% <90.00%> (-0.84%) ⬇️
... and 10 more

@AndrewSisley AndrewSisley force-pushed the sisley/feat/I1004-add-fields-to-schema branch 18 times, most recently from 075b49f to dc79dc9 Compare February 20, 2023 21:47
@AndrewSisley AndrewSisley force-pushed the sisley/feat/I1004-add-fields-to-schema branch 3 times, most recently from 5c7ddef to 319aa94 Compare February 24, 2023 21:45
@AndrewSisley AndrewSisley marked this pull request as ready for review February 24, 2023 22:12
Collaborator

@fredcarle fredcarle left a comment

I did a quick overview but didn't spend enough time on the details to approve it and I'm not sure how much time I'll have to spend on this in the coming days. As far as I can tell though, it looks good.

client/db.go Outdated
Comment on lines 54 to 55
// collection. Will return false and an error if it fails validation.
ValidateUpdateCollectionTxn(context.Context, datastore.Txn, CollectionDescription) (bool, error)
Collaborator

suggestion: "Will return false and an error" is a bit misleading here, as the doc string makes it sound as if we either get true and no error, or false and an error. But looking at the function I see that it can also return false and no error. To avoid that confusion I would simply say "Will return an error if it fails validation".

Contributor Author

@AndrewSisley AndrewSisley Mar 1, 2023

agreed, cheers Fred - I'll change this.

  • reword bool return documentation

Member

Does this need to be publicly exposed on the client interface? Can't it just be private within UpdateCollection? What's the benefit of publicly calling Validate then Update?

Member

Seems like it's only used internally anyway?

Contributor Author

@AndrewSisley AndrewSisley Mar 6, 2023

What's the benefit of publicly calling Validate then Update?

It is quite common to want to validate something without immediate execution, and many places consider it best practice to expose validation functions by default. I'd probably want to use this as an app dev, and it seems reasonably likely that there'd be others too.

EDIT: Just noting that Orpheus also wanted this via the CLI and there is an open ticket for it.

Member

In that case, should this not be done from the perspective of the Patch DDL for external use? At the moment they would need to somehow (not that hard since it's a patch, but we are delegating that responsibility to the dev in this API) produce the new description objects, and then call this API.

Compared to calling something like db.ValidateUpdatePatch(ctx, patch), which would essentially apply the patch in a temp object and call validateDescription anyway.

This would also put this API alongside the schema-related APIs instead of the collection APIs.

Contributor Author

We'd want both, as updates can be either a JSON patch or a CollectionDescription update. This (the CollectionDescription variant) already existed as a function, so adding it to the interface was all the effort required.

Adding a ValidatePatch function would be slightly more effort and a scope expansion (ValidateUpdateCollectionTxn is already tested via UpdateCollectionTxn, whereas ValidatePatch would be untested).

Do you really dislike the adding of just this one now?

db/collection.go Outdated
// ValidateUpdateCollectionTxn validates that the given collection description is a valid update.
//
// Will return true if the given description differs from the current persisted state of the
// collection. Will return false and an error if it fails validation.
Collaborator

suggestion: Same as above :)

{ "op": "add", "path": "/Users/Schema/Fields/-", "value": {"Name": "Foo", "Kind": 2, "Typ":3} }
]
`,
ExpectedError: "only default or LWW (last writer wins) CRDT types are supported. Name: Foo, CRDTType: 3",
Contributor

Using an exact string to assert certain behaviour is in general a bad practice.
Is it our code convention, or more of a let's-stick-to-it-for-now approach?

Contributor Author

Here I wish to match the whole string, as it is all relevant to whether the error is the expected one.

ExpectedError does work on partials, and we use it that way, but here I do want the whole string.

Contributor

In this case the problem is not entire vs partial string matching, but string matching in general.
The test should assert that a certain error occurred, and should fail only if no error occurred or if some other error occurred instead of the expected one.
With the current change the test might fail for absolutely unrelated changes, like text message formatting.
For this we would have to change how we go about error handling.
But I guess this is a discussion for a bigger round.

Contributor Author

Ah, you mean protection against line breaks etc?

I think that would be a really easy change - should just be a couple of lines in tests/integration/utils2.go (~ln 848).

Shouldn't matter in the short term, as the error text/formatting is managed by devs atm, and if they are deliberately changing the formatting they can also deliberately change the expected error (or just make that tweak to errors.Is).

{
"_key": "bae-43deba43-f2bc-59f4-9056-fef661b22832",
"Name": "John",
"Email": nil,
Contributor

looks like this test consumed the previous one.


func TestSchemaUpdatesAddFieldWithCreateWithUpdateAfterSchemaUpdateAndVersionJoin(t *testing.T) {
initialSchemaVersionId := "bafkreicg3xcpjlt3ecguykpcjrdx5ogi4n7cq2fultyr6vippqdxnrny3u"
updatedSchemaVersionId := "bafkreicnj2kiq6vqxozxnhrc4mlbkdp5rr44awaetn5x5hcdymk6lxrxdy"
Contributor

why not const?

Contributor Author

why bother with the extra word?

Contributor

It's semantics. If it is const, I, the reader of the code, will note straight away that the values aren't supposed to change. Otherwise I have to go through the code to see why it is a variable.

Contributor Author

:) I don't think that is something to worry about in our tests; the possibilities for mutation are quite small here, and if anyone found a reason to mutate anything in a test declaration then that would probs/hopefully be flagged and removed in review anyway.

If you feel strongly about this, I'd suggest we add a linter (outside this PR). In most other langs I'd consider it a no-brainer, as it is a conscious choice between var and const; however, Go's := makes it a bit awkward and a poor use of time and mental energy.

Request: `query {
Users {
Name
Foo
Contributor

I don't see anything bool-array-related here

Contributor Author

Kind: 3 represents a field of type boolean array.

We plan on allowing string-based definition of these values in the near future, and when that happens we'll get tests that use the more descriptive strings, but for the short term we (and any users) will need to deal with the uint8 representations.

type Users {
Name: String
}
`,
Contributor

This schema is repeated several times, as are some other things. Why not extract them to helper functions?
The test could look like this:

	test := testUtils.TestCase{
		Description: "Test schema update, add field with kind datetime (10)",
		Actions: []any{
			testUtils.SchemaUpdate{
				Schema: getTestSchema(),
			},
			testUtils.SchemaPatch{
				Patch: getTestPatchWithKind(10),
			},
			testUtils.Request{
				Request: getTestQueryWithFoo(),
				Results: []map[string]any{},
			},
		},
	}
	testUtils.ExecuteTestCase(t, []string{"Users"}, test)

or even like this:

	test := testUtils.TestCase{
		Description: "Test schema update, add field with kind datetime (10)",
		Actions: []any{
			getTestSchema(),
			getTestPatchWithKind(10),
			testUtils.Request{
				Request: getTestQueryWithFoo(),
				Results: []map[string]any{},
			},
		},
	}
	testUtils.ExecuteTestCase(t, []string{"Users"}, test)

Contributor Author

@AndrewSisley AndrewSisley Mar 2, 2023

I have quite a strong preference for not using helpers like that in tests; whilst they reduce the number of lines, they often obscure what is under test.

EDIT: To Expand: Here I want to see the actions that will be used by the users. Not only do I think it makes it much easier to see what is actually being done in the test, I have a much better idea as to how it might look and feel to the users. It also reduces the complexity of the test, something I consider particularly important for part-timers/newcomers to the test - including Defra users - the documentation aspect of these tests is very important to me, not just what the host machine executes.

EDIT2: Helper functions and over-sharing of code often results in tests that do far more than they need to, further obscuring what is actually under test. For example we currently do this a lot with test schema - most of the fields are not relevant to most of the tests - it is a tolerable trade-off for just schema perhaps, but I really don't want it spreading (and e.g. standardising test queries so they return 10 fields when each test only cares about 1).

Contributor

@islamaliev islamaliev Mar 3, 2023

I see your points. They are absolutely valid concerns. And, as you mentioned earlier in one of the comments, there are always trade-offs.

I just would like to bring up some other points:

One of the (often underrated) properties of tests is low-level documentation, which means that in order to understand how a system under test works one should just read the tests. To do this effectively the documentation should be as concise and informative as possible. For tests to be concise they should contain only information relevant to the test itself; all the boilerplate should be hidden away in some test set-up stage.
In our specific case there are many repetitive parts that are not necessarily related to the test, hence the tests look very similar. As a result, it takes some effort and mental text-diff skills to spot what exactly makes one test special in comparison to the others.

I think your approach would fit better in systems with less complexity and therefore fewer test cases.

But maybe it's more general discussion.

},
},
}
testUtils.ExecuteTestCase(t, []string{"Users"}, test)
Contributor

Again the same test. Looks like there are many of them here.

testUtils.SchemaPatch{
Patch: `
[
{ "op": "add", "path": "/Users/Schema/Fields/-", "value": {"Name": "Foo", "Kind": 19} }
Contributor

Why not use a const or enum for the Kind values?

Contributor Author

As in fmt.Sprintf them in (the enum does already exist in the prod code)?

The users can supply raw 19s, so that is what we should test. Hiding it behind a test const hides the ugliness of the production interface, and using the production const changes the concept under test (as well as hiding the ugliness).

@@ -49,5 +50,5 @@ type Parser interface {
ParseSDL(ctx context.Context, schemaString string) ([]client.CollectionDescription, error)

// Adds the given schema to this parser's model.
Member

question: Is the documentation still valid for SetSchema?

Contributor Author

@AndrewSisley AndrewSisley Mar 3, 2023

yes, what it does conceptually has not changed. The only difference is that changes will only be applied if/until the txn is successfully committed.

The doc could probably be expanded to mention this (although it is somewhat a universal implicit-given for anything transaction-based), but it is internal, not public, and if left up to me laziness would probably win out. Do you want it expanded?

Comment on lines +26 to +27
errCollectionIDDoesntMatch string = "CollectionID does not match existing"
errSchemaIDDoesntMatch string = "SchemaID does not match existing"
Member

question: Why do some error messages start with a capitalized char?

Contributor Author

These are property names; the name of the property is CollectionID, and collectionID does not exist.

Contributor Author

The alternative would be to describe the property instead of naming it: collection ID does not match existing.

Member

@jsimnz jsimnz left a comment

Couple of little things. Overall I like the direction and functionality.


db/collection.go Outdated
var hasChanged bool
existingCollection, err := db.GetCollectionByNameTxn(ctx, txn, proposedDesc.Name)
if err != nil {
if err.Error() == "datastore: key not found" {
Member

todo: Similar to the standup convo on tests, but let's avoid doing direct string comparison. This particular error comes from ds.ErrNotFound.

Contributor Author

@AndrewSisley AndrewSisley Mar 6, 2023

😆 Cheers lol - I spent a while looking for that err const as I thought it existed but never found it.

  • replace with ds.ErrNotFound

Member

👍

Comment on lines +458 to +459
err = txn.Commit(ctx)
return col, err
Member

nitpick: this can be merged into a single line

Contributor Author

Not sure how I feel about that; it reduces the importance of the Commit call, making it appear as something of an afterthought. May or may not change.

Member

fair :)

Comment on lines +422 to +423
err = txn.Commit(ctx)
return col, err
Member

nitpick: merge lines

@@ -161,32 +161,46 @@ func (db *db) initialize(ctx context.Context) error {
db.glock.Lock()
defer db.glock.Unlock()

txn, err := db.NewTxn(ctx, false)
Member

question: Does initialize really need a txn? Not against it, but it's run once on startup before anything else. Just a thought.

Contributor Author

This was documented in the commit message:

DB init was also covered almost as a side-effect, as the sequence needed protecting, and TBH it is probably a very good thing to protect against mutating the database state in the case of a failed init.

Member

Thanks for pointing that out. Realized it afterwards as well 👍

// The collections (including the schema version ID) will only be updated if any changes have actually
// been made, if the net result of the patch matches the current persisted description then no changes
// will be applied.
func (db *db) PatchSchema(ctx context.Context, patchString string) error {
Member

question: Obviously we have a transaction in place to handle the PatchSchema call. But should there be any additional locks for safety? E.g. an RWLock that is read-locked on basically all ops but write-locked on PatchSchema, acting as a schema lock.

Contributor Author

What safety do you think you would gain?

Member

A more explicit semantic that is enforced beyond the transactions. It makes it clear that schema updates are intended to be a protected action that locks the DB until completed.

Since at the moment, other transactions can run alongside the schema update. The only thing the transaction protects us from is shared keysets, which is unlikely.

Most databases don't support "online" or "lock-free" schema updates as they're rather complex to get right. We're not going to lose anything by adding a lock as it's not intended to be a highly concurrent/perf-sensitive action. And it gives us a lot of safety regarding the semantics of other ongoing transactions.

I know we descoped the concurrency tests, but this feels independent of those tests, since it's a global lock. Overall it provides much clearer and stronger semantics and guarantees about the nature of ongoing events.

Member

Especially in lieu of concurrency tests, tbh. This would ensure that there are basically no concurrency tests to worry about.

We can look at online/lock-free schema updates in the future without breaking anything.

Contributor Author

This would ensure that there are basically no concurrency tests to worry about.

Adding a global lock won't change the testing required. The tests will not be designed around the current implementation, only the public behaviour/surface area.

Most databases don't support "online" or "lock-free" schema updates as they're rather complex to get right.

I am very confident that the SQLs that I have worked with do not lock the entire database on the addition of a single field. At least some of them lock the full table on creation of an index, but I think there are very, very few operations that lock the entire database.

I can add a lock here if you really want, but I do see it as an ugly, untested (i.e. dead-code) bottleneck that should be removed as soon as we have the bandwidth and tests.

Member

@jsimnz jsimnz Mar 6, 2023

Actually, a lock within this function would offer very little/no protection at all. If stale data is able to make it through past the txn, then a lock here will do nothing to help.

I can see how our setup could keep things safe, as we load the collection description from the system store on queries, which would be tracked for conflicts that a schema update would trigger.

I believe we talked about it else where, but we wanted to verify schemaVersionID on transactions (data reads/writes) correct?

Contributor Author

but we wanted to verify schemaVersionID on transactions (data reads/writes) correct?

That is a different issue, related to the use of stale client.Collections for doc writes. Not relevant here.

I can't tell what you are asking for in your other comment starting with Yes, my reference to.... A mutex in this location will add no/almost-no protection against concurrent schema update clashes; any mutex-based protection needs to align with the transaction scope. Wrapping this function behind a mutex will do pretty much nothing, as it still leaves massive room for stale data to accumulate if the transaction itself does not catch everything.

Member

The potential flow would be

// col will reference CollectionDescription with schema v1
col := db.GetCollectionByName("users")

// then another request (another goroutine for example)
// now users description has schema v2
db.UpdateSchema(desc) // users desc

// col variable still holding old description
col.UpdateWithkey(key, update)

So here the update is going to apply things relative to schema v1, even though there is now schema v2.

My suggested lock won't help this situation regardless, which is why I asked about comparing/verifying schemaVersionIDs as previously discussed (can't remember when).

Member

tl;dr: I'm OK dropping the suggested lock as described in this thread if we are comparing/validating schemaVersionIDs on the transaction. This would A) manually enforce a check, and B) cause the schemaVersionID key to be read from the transaction, ensuring we protect against stale (inconsistent) data.

Member

That is a different issue, related to the use of stale client.Collections for doc writes. Not relevant here.

Fairly relevant imo. My initial hope for the lock was to prevent the stale accumulation of CollectionDescriptions, which has the potential to get past the transaction semantics.

I realize now, as noted, that the lock won't actually give us my intended desire, but verifying the schemaVersionID would, and I don't believe it's implemented in this PR, correct?

// The collection (including the schema version ID) will only be updated if any changes have actually
// been made, if the given description matches the current persisted description then no changes will be
// applied.
UpdateCollectionTxn(context.Context, datastore.Txn, CollectionDescription) (Collection, error)
Member

suggestion: UpdateCollection should take a name (string) as a parameter to make it clear which collection you're trying to update.

Obviously the description itself has the name too, but it should be more explicit from an API perspective.

Contributor Author

I disagree with this quite strongly; it creates additional points of failure and confusion when the name parameter does not pair up with the description. Such a param can also be added later if we feel the need.

Member

Not really convinced by that argument, but looking at the current system, CreateCollection also doesn't take a named value, so it's good for now 👍.


txn.OnSuccess(
func() {
p.schemaManager = schemaManager
Member

question: I'm wondering if there's any race condition potential here?

Contributor Author

The transaction should protect against any race, I think, and I do not wish to go down some kind of static-analysis hellhole when the concurrency tests that would cover stuff like this were pushed out of scope. If those tests are not worth doing now, then spending significant amounts of time staring at this line for the sake of potential concurrency bugs is also out of scope.

Member

That's fair, we did descope the concurrency tests. It just jumped out at me in case there was an assumption that the txn protects things here, but the hook functions don't necessarily inherit that semantic. Good for now.

Contributor Author

cheers :)

@AndrewSisley AndrewSisley requested a review from jsimnz March 6, 2023 16:09
Member

@jsimnz jsimnz left a comment

LGTM - excited for this to land

@AndrewSisley AndrewSisley force-pushed the sisley/feat/I1004-add-fields-to-schema branch from 319aa94 to 12fb03e Compare March 6, 2023 22:49
Member

@shahzadlone shahzadlone left a comment

Thanks for answering the questions, sorry for not being as involved in this review (did follow the other conversations). LGTM overall

@AndrewSisley AndrewSisley merged commit 799a242 into develop Mar 6, 2023
@AndrewSisley AndrewSisley deleted the sisley/feat/I1004-add-fields-to-schema branch March 6, 2023 23:26
shahzadlone pushed a commit that referenced this pull request Apr 13, 2023
* Correct P2P error message

Spotted and quickly corrected.  Has nothing to do with this PR though, but is small enough to include.

* Remove field kind decimal

It is not supported and does not work.  Leaving it here will just confuse users, especially as they start to use the schema update system.

* Remove field kind bytes

It is not supported and does not work.  Leaving it here will just confuse users, especially as they start to use the schema update system.

* Remove field embedded object kinds

They are not supported and do not work.  Leaving them here will just confuse users, especially as they start to use the schema update system.

* Add documentation to FieldDesc.Kind

I'm not sure any more than this is really needed, especially given that we will largely be hiding this from users shortly after merge.

* Assert for collection name uniqueness within SDL

Previously this would partially succeed - the GQL types would be updated, but the saving of the collection descriptions would fail (as there is another uniqueness check there), leaving the database in an invalid state until restart.
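
The up-front check this commit describes amounts to detecting duplicate collection names in the parsed SDL before any state is touched. A minimal sketch (the function name and plain-string input are illustrative, not DefraDB's actual code):

```go
package main

import "fmt"

// duplicateNames reports which collection names appear more than once.
// Validating this before applying GQL type changes avoids the
// partial-success failure mode: types updated, description save failing.
func duplicateNames(names []string) []string {
	seen := map[string]bool{}
	var dups []string
	for _, n := range names {
		if seen[n] {
			dups = append(dups, n)
		}
		seen[n] = true
	}
	return dups
}

func main() {
	fmt.Println(duplicateNames([]string{"User", "Book", "User"})) // [User]
}
```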

* Correctly query all existing collections

Was incorrectly querying the version history, returning a collection per version instead of just the current version.  This went unnoticed as previously each collection could only have a single version.

* Make collection persistence transactional

Collection stuff needs to be protected by transactions.

Partial success of either a create or a mutate cannot be permitted and the use of transactions protects against this.

The transactions are also needed to protect against the use of stale data; by including the collection in the transaction used for P2P and query/planner work we ensure that work is done against a single, complete version of the collection, and that it is not possible for the collection to mutate whilst something is using it. At the moment such concurrent use should result in a transaction conflict error - this is not ideal, and the logic here should probably grow to permit the queuing of such clashes (e.g. through use of a mutex) instead of making users retry until the operation succeeds - such a change will very likely need to be done within the scope declared by the transaction anyway, so I see no wasted code/time in the changes in this commit - it is a start, and prevents odd/damaging behaviour.

DB init was also covered, almost as a side-effect, as the sequence needed protecting, and TBH it is probably a very good thing to protect against mutating the database state in the case of a failed init.

* Make SetSchema (GQL) transactional

Renames and changes AddSchema to SetSchema.  SetSchema is now transactional: GQL type changes will now only be 'committed' on transaction commit, whilst allowing SetSchema to be called safely at any point during the lifetime of the transaction - allowing the schema to be validated against GQL constraints before any changes have been persisted elsewhere, and allowing those other changes to be executed/validated before any changes have been made to the GQL types.

* Add support for adding fields to schema

Please note that the client interfaces will be reworked in the near future so that the transaction related items clutter the primary interface less, and are more consistent. For now I have just followed the existing `Txn` suffix naming.
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
Labels
action/no-benchmark Skips the action that runs the benchmark.
area/collections Related to the collections system
area/schema Related to the schema system
feature New feature or request
Development

Successfully merging this pull request may close these issues.

Allow new non-relational fields to be added to existing schema locally
5 participants