I'd like to ground the proposed change with a concrete example. Suppose these entities already exist and are running:
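A minimal hypothetical sketch; the `acmeCo/*` names and connector images below are illustrative stand-ins, not real specs:

```yaml
collections:
  acmeCo/orders:
    schema:
      type: object
      properties:
        id: { type: integer }
      required: [id]
    key: [/id]

captures:
  acmeCo/source-orders:
    endpoint:
      connector:
        image: ghcr.io/estuary/source-postgres:dev
        config: {} # connector configuration elided
    bindings:
      - resource: { stream: orders } # connector-specific resource config
        target: acmeCo/orders
```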
A user submits a draft with the following content:
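Continuing the hypothetical, a draft that touches only the collection, adding a field:

```yaml
collections:
  acmeCo/orders:
    schema:
      type: object
      properties:
        id: { type: integer }
        status: { type: string } # newly added field
      required: [id]
    key: [/id]
```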
Control plane receives the request, adds the missing entities, and ultimately submits a build with:
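That is, the union of the draft and the entities that reference it (again, purely illustrative):

```yaml
collections:
  acmeCo/orders: # the updated definition from the draft
    schema:
      type: object
      properties:
        id: { type: integer }
        status: { type: string }
      required: [id]
    key: [/id]

captures:
  acmeCo/source-orders: # added by the control plane, because it writes to acmeCo/orders
    endpoint:
      connector:
        image: ghcr.io/estuary/source-postgres:dev
        config: {} # connector configuration elided
    bindings:
      - resource: { stream: orders } # connector-specific resource config
        target: acmeCo/orders
```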
This works for task -> collection relationships, too. Given:
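Sticking with the same hypothetical collection:

```yaml
collections:
  acmeCo/orders:
    schema:
      type: object
      properties:
        id: { type: integer }
      required: [id]
    key: [/id]
```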
If a user submits:
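Say, a materialization that reads from that collection (connector image and names are again illustrative):

```yaml
materializations:
  acmeCo/orders-views:
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-postgres:dev
        config: {} # connector configuration elided
    bindings:
      - resource: { table: orders }
        source: acmeCo/orders
```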
Then the control plane would end up submitting a build with:
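It would contain the task from the draft, plus the current definition of the collection it references:

```yaml
collections:
  acmeCo/orders: # resolved by the control plane from its stored definition
    schema:
      type: object
      properties:
        id: { type: integer }
      required: [id]
    key: [/id]

materializations:
  acmeCo/orders-views: # from the user's draft
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-postgres:dev
        config: {} # connector configuration elided
    bindings:
      - resource: { table: orders }
        source: acmeCo/orders
```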
---
Control plane needs to model the relationships between entity models.
The data plane is primarily concerned with task specifications, which are self-contained. We want users to think in terms of entity models, though, which are not self-contained. Entity models are the individual collections, captures, etc. that are described by a `models::Catalog`. The CLI workflow, with models expressed as YAML files, is able to effectively model relationships between entities, and I think it's a good example of what that means. Say you have a bunch of catalog YAML files with a collection, a capture, and a materialization. You would have a single `--source` that pulls in all three of those entities (plus a `storageMapping`, which we'll ignore for a little bit). When you change something about the collection and then re-deploy, both the capture and the materialization will be updated (assuming you're using `flowctl deploy`).

This is really important for making the system usable. We've implemented a ton of validation to help ensure that the tasks you've defined will all actually work together at runtime. That validation is a major feature that helps address a problem endemic to data pipelines: one thing changes, and an untold number of downstream stages either break or start producing incorrect results. The fact that Flow is able to validate all these things up front is one of its killer features.
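As a sketch of that layout (file names hypothetical), the single `--source` file just imports the other specs:

```yaml
# flow.yaml, passed as the single --source
import:
  - orders.flow.yaml          # the collection
  - source-orders.flow.yaml   # the capture
  - orders-views.flow.yaml    # the materialization
```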
But that's a feature we could easily lose in the control plane, if we're not careful, given how the current `/builds` endpoint works.

I don't think this is a situation that we can easily address in the browser with something like a more advanced client library. That's a recent realization for me, as I suggested yesterday to @jgraettinger that we could potentially address this in a front-end library. But consider the case where a user updates a collection, and there's a materialization of that collection which the user doesn't have access to. We can't send that materialization model to the front end, since the user isn't allowed to see it. But then how does the control plane know that the collection is used in that materialization, so it can re-build it against the new collection definition? If the control plane is going to help resolve these scenarios, then it needs to be aware that the currently deployed materialization references the collection that the user wants to update.
Of course, we don't really know that re-building the materialization is actually the best way to handle that scenario. I think a big part of why we've implemented the `/builds` endpoint the way we have is that we didn't want to take a position on precisely how to handle situations like that, because they seem really tricky. For example, if we re-build the materialization, then we'd also need to deal with the fact that the new materialization might fail validation for some reason, and could thus fail the whole build. Then what do we do? Should we prevent you from updating your collection just because somebody materialized it using a very restrictive connector? These situations are indeed pretty tricky, because they lie at the intersection of the (as yet purely conceptual) authorization system and the relational modeling of catalog entities. I don't have conviction about precisely how the control plane should handle these scenarios. But I do have conviction that the control plane does need to handle them somehow.

Path forward?
What I see as a path forward is to have the control plane begin to model specific Flow entities. For example, it might have separate tables for `collections`, `captures`, and `materializations`, the gist being that each row would contain the model definition for just that entity. When entities are modified, we build up a graph of all the affected entities and marshal that into a Flow catalog, which then gets built. This provides a clear way to handle foreign entity resolution, which would be done up front and thus doesn't need to be accounted for by `flowctl api build`. The catalog that's submitted to the build would already include the latest definition of each collection referenced by other entities. It would also include all of the tasks that reference any collections in the build. (See the sketch below.)

We would then be free to iterate on how to handle the various edge cases, like when a collection that's referenced by other materializations gets updated. But we'd also have an easy out to avoid needing to figure all of that out prior to releasing an MVP. As long as we retain the simple authorization system that only allows access to entities owned by the user, the (conceptually) simple behavior of "always rebuild everything that references a modified collection" will work without running afoul of any weird edge cases.
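As a rough sketch of what those rows might hold, with hypothetical column names (`catalog_name`, `reads_from`), shown as YAML rather than SQL for consistency with the other examples here:

```yaml
# Hypothetical rows, one per entity:

collections_row:
  catalog_name: acmeCo/orders
  spec:
    schema:
      type: object
      properties:
        id: { type: integer }
      required: [id]
    key: [/id]

materializations_row:
  catalog_name: acmeCo/orders-views
  reads_from: [acmeCo/orders] # the edge used to build the graph of affected entities
  spec:
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-postgres:dev
        config: {} # connector configuration elided
    bindings:
      - resource: { table: orders }
        source: acmeCo/orders
```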
We can also continue to use the Flow catalog model as a representation that allows for the simultaneous update of multiple entities. An endpoint could accept that catalog, break it up into its constituent parts, and re-assemble the full catalog, including all referenced entities. This type of thing is generally pretty easy to do, and the actual entity relationships we'd need to deal with are fairly simple and don't require any recursion. Re-using the catalog model as a representation here is also nice because it would allow this "new" endpoint to be essentially compatible with the existing one.
Errata: Schemas are weird
Schemas are another area where the CLI provides an arguably superior experience to the UI. When a user updates a schema file and then runs `flowctl deploy`, they're almost certainly not confused about the effects of that update: they would expect the schema to be updated in every entity that references it by name. This works whether the schema is referenced from a collection, a derivation, or even recursively through other schemas. I think it's totally reasonable for the control plane to include JSON schemas as named entities with relations to collections and derivations. I don't think we have to do this initially, though.

Initially, we can just continue to consider JSON schemas to be wholly contained within collections. This can essentially be a UI concern, and we can just ensure that all schemas are passed "in-line" as part of the collection model. If we later want the control plane to model relationships between schemas and other entities, then we can do so by making schemas themselves named entities that participate in the authorization system.
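To make that distinction concrete, here's a hypothetical collection whose schema is referenced by name, followed by the same schema passed in-line:

```yaml
# Referenced by name: updating orders.schema.yaml affects every entity that references it.
collections:
  acmeCo/orders:
    schema: orders.schema.yaml
    key: [/id]
```

```yaml
# Passed in-line: the schema is wholly contained within the collection model.
collections:
  acmeCo/orders:
    schema:
      type: object
      properties:
        id: { type: integer }
      required: [id]
    key: [/id]
```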