Split Field model into distinct Feature and Entity objects #655

Merged
merged 11 commits into from
May 5, 2020

zhilingc
Collaborator

@zhilingc zhilingc commented Apr 26, 2020

What this PR does / why we need it:
This is a split-off of #612 that introduces the model changes made in that PR in a more digestible chunk.

This PR includes:

  • Removal of Field object
  • Addition of distinct Feature and Entity objects
  • Removal of TFX fields on entities

SQL changes:

  • Surrogate long ids for feature sets, features and entities
  • Drop TFX constraints from the entities table

Does this PR introduce a user-facing change?:

Model changes to FeatureSets, Features and Entities. Requires Migration.
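
As a rough illustration of the split described above, here is a minimal plain-Java sketch. The class names, fields, and constructor shapes are assumptions made for illustration; the PR's real classes live under core/src/main/java/feast/core/model and carry persistence annotations that are not shown here. The sketch only reflects what the description states: both objects get a surrogate long id, and only features keep TFX-style constraint fields.

```java
import java.util.Arrays;

// Hypothetical sketch, not the PR's actual classes: names and fields are assumptions.
final class EntitySketch {
    final long id;       // surrogate key; generated by the database in practice
    final String name;
    final String type;   // value type, for example "INT64"

    EntitySketch(long id, String name, String type) {
        this.id = id;
        this.name = name;
        this.type = type;
    }
}

final class FeatureSketch {
    final long id;       // surrogate key
    final String name;
    final String type;
    // One TFX-style constraint kept as serialized bytes; entities no longer carry these.
    private final byte[] presence;

    FeatureSketch(long id, String name, String type, byte[] presence) {
        this.id = id;
        this.name = name;
        this.type = type;
        this.presence = presence == null ? null : Arrays.copyOf(presence, presence.length);
    }

    boolean hasPresenceConstraint() {
        return presence != null;
    }
}
```

For example, `new FeatureSketch(2L, "trips_today", "INT32", null).hasPresenceConstraint()` is false, while an entity has no constraint fields at all.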

@zhilingc
Collaborator Author

/test test-end-to-end

@ches ches added this to the v0.5.0 milestone Apr 26, 2020
@zhilingc
Collaborator Author

/test test-end-to-end

@zhilingc
Collaborator Author

/test test-end-to-end-batch

@woop woop mentioned this pull request Apr 29, 2020
@ches
Member

ches commented Apr 29, 2020

Could you please summarize in the description what the SQL schema changes are that are implied with this branch? Should the PR include a SQL migration script?

@woop
Member

woop commented Apr 29, 2020

> Could you please summarize in the description what the SQL schema changes are that are implied with this branch? Should the PR include a SQL migration script?

For the script, should it perhaps be at the release level, or is there an advantage to having it at the PR level?

@ches
Member

ches commented Apr 29, 2020

> For the script, should it perhaps be at the release level, or is there an advantage to having it at the PR level?

Could be at release. With the PR, reviewers could run it on an existing development environment with data in it and a.) see that the migration script works, and b.) experiment with the PR with some existing data. Manual of course, but it's something.

This would be out of scope for this PR for sure, but separately maybe we could consider integrating something like Flyway into the project, both for development convenience and for shipping migration scripts with Feast releases that operators can have a process to apply. Could be used to load seed data for automated integration tests too.

@woop
Member

woop commented Apr 30, 2020

> This would be out of scope for this PR for sure, but separately maybe we could consider integrating something like Flyway into the project, both for development convenience and for shipping migration scripts with Feast releases that operators can have a process to apply. Could be used to load seed data for automated integration tests too.

Yea I really like Flyway. I think it makes sense. I have only used it as an external tool in the past in non-JVM projects. It seems like integration here would mean that migrations could be triggered manually using the CLI or mvn.

The value add would be mostly in the migration scripts themselves, so perhaps we could start there I think, and add documentation and Flyway around it.
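
To make the "start with the scripts" idea concrete: Flyway's default convention names versioned migrations `V<version>__<description>.sql` and applies them in version order. The sketch below is plain Java that only mimics that ordering for a list of file names. It does not call Flyway, and the file names used with it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics version ordering of Flyway-style file names.
final class MigrationOrderSketch {

    // Extracts numeric version parts from a name such as "V0_5_0__split_field_model.sql".
    static int[] versionOf(String fileName) {
        int sep = fileName.indexOf("__");
        if (!fileName.startsWith("V") || sep < 2) {
            throw new IllegalArgumentException("Not a versioned migration: " + fileName);
        }
        String[] parts = fileName.substring(1, sep).split("[._]");
        int[] version = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            version[i] = Integer.parseInt(parts[i]);
        }
        return version;
    }

    // Compares versions part by part, treating missing trailing parts as zero.
    static int compareVersions(int[] a, int[] b) {
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = i < a.length ? a[i] : 0;
            int y = i < b.length ? b[i] : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    // Returns a new list sorted in the order the scripts would be applied.
    static List<String> inApplyOrder(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        sorted.sort((a, b) -> compareVersions(versionOf(a), versionOf(b)));
        return sorted;
    }
}
```

The comparison is numeric rather than lexical, so a hypothetical `V0_10_0__later.sql` sorts after `V0_5_0__split_field_model.sql`.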

@zhilingc
Collaborator Author

zhilingc commented Apr 30, 2020

@ches @woop The necessary migration between the 0.4.7 schema and 0.5 is the following:

create table entity
(
	feature_set varchar(255) not null,
	name varchar(255) not null,
	project varchar(255) not null,
	version integer not null,
	type varchar(255),
	constraint entity_pkey
		primary key (feature_set, name, project, version)
);

create table feature
(
	feature_set varchar(255) not null,
	name varchar(255) not null,
	project varchar(255) not null,
	version integer not null,
	bool_domain bytea,
	domain varchar(255),
	float_domain bytea,
	group_presence bytea,
	image_domain bytea,
	int_domain bytea,
	mid_domain bytea,
	natural_language_domain bytea,
	presence bytea,
	shape bytea,
	string_domain bytea,
	struct_domain bytea,
	time_domain bytea,
	time_of_day_domain bytea,
	type varchar(255),
	url_domain bytea,
	value_count bytea,
	constraint feature_pkey
		primary key (feature_set, name, project, version)
);

create table feature_set_entities
(
	feature_set_id varchar(255) not null
		constraint fkbax98goqv127qc5su483xjhq9
			references feature_sets,
	entities_name varchar(255) not null,
	entities_project varchar(255) not null,
	entities_feature_set_id varchar(255) not null,
	entities_version integer not null,
	constraint feature_set_entities_pkey
		primary key (feature_set_id, entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint uk_byydfwkove4rtygt5pydandx4
		unique (entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint fktm3p2cp241tbqfb7r1ptfegdx
		foreign key (entities_name, entities_project, entities_feature_set_id, entities_version) references entity
);

create index idx_jobs_feature_set_entities_feature_set_id
	on feature_set_entities (feature_set_id);

create table feature_set_features
(
	feature_set_id varchar(255) not null
		constraint fk2w40q1cwd6pnqsbv5rvjkx5wj
			references feature_sets,
	features_name varchar(255) not null,
	features_project varchar(255) not null,
	features_feature_set_id varchar(255) not null,
	features_version integer not null,
	constraint feature_set_features_pkey
		primary key (feature_set_id, features_name, features_project, features_feature_set_id, features_version),
	constraint uk_tngc6xfm924rl6wxhqkb97jdl
		unique (features_name, features_project, features_feature_set_id, features_version),
	constraint fk1qcxf8joepru1qpyg1wu4lqfb
		foreign key (features_name, features_project, features_feature_set_id, features_version) references feature
);

create index idx_jobs_feature_set_features_feature_set_id
	on feature_set_features (feature_set_id);

create index idx_jobs_feature_sets_job_id
	on jobs_feature_sets (job_id);

create index idx_jobs_feature_sets_feature_sets_id
	on jobs_feature_sets (feature_sets_id);

drop table features;

drop table entities;

drop table metrics;

I can add migration scripts to this PR if necessary; the only difficulty here seems to be the entities and features tables.

Alternatively we can alter the existing features and entities tables but I'm not sure if column orders can be altered :/

@woop
Member

woop commented Apr 30, 2020

Thanks @zhilingc

From an upgrade perspective I think people care about retaining their existing data.

> I'm not sure if column orders can be altered :/

Are any of our systems (like Spring) picky about column order?

@zhilingc
Collaborator Author

> Are any of our systems (like Spring) picky about column order?

It's not Spring; Spring doesn't care. Column reordering just isn't directly supported in Postgres.

@woop
Member

woop commented Apr 30, 2020

> Are any of our systems (like Spring) picky about column order?
>
> It's not Spring; Spring doesn't care. Column reordering just isn't directly supported in Postgres.

Ok, so where does the failure happen if we lose column ordering? I am trying to understand the comment you made above.

@zhilingc
Collaborator Author

Oh, actually I suppose it's not necessary to reorder columns.

drop sequence hibernate_sequence;

alter table entities
	add feature_set varchar(255) not null;

-- column reordering is not supported entities.feature_set

create unique index entities_pkey
	on entities (feature_set, name, project, version);

alter table entities
	add constraint entities_pkey
		primary key (feature_set, name, project, version);

create table feature_set_entities
(
	feature_set_id varchar(255) not null
		constraint fkbax98goqv127qc5su483xjhq9
			references feature_sets,
	entities_name varchar(255) not null,
	entities_project varchar(255) not null,
	entities_feature_set_id varchar(255) not null,
	entities_version integer not null,
	constraint feature_set_entities_pkey
		primary key (feature_set_id, entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint uk_byydfwkove4rtygt5pydandx4
		unique (entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint fkqr5sj1yp7uunob8nf8fuxwfmw
		foreign key (entities_name, entities_project, entities_feature_set_id, entities_version) references entities
);

create index idx_jobs_feature_set_entities_feature_set_id
	on feature_set_entities (feature_set_id);

alter table features
	add feature_set varchar(255) not null;

-- column reordering is not supported features.feature_set

alter table features
	add constraint features_pkey
		primary key (feature_set, name, project, version);

create table feature_set_features
(
	feature_set_id varchar(255) not null
		constraint fk2w40q1cwd6pnqsbv5rvjkx5wj
			references feature_sets,
	features_name varchar(255) not null,
	features_project varchar(255) not null,
	features_feature_set_id varchar(255) not null,
	features_version integer not null,
	constraint feature_set_features_pkey
		primary key (feature_set_id, features_name, features_project, features_feature_set_id, features_version),
	constraint uk_tngc6xfm924rl6wxhqkb97jdl
		unique (features_name, features_project, features_feature_set_id, features_version),
	constraint fk9dltpwl1lu6w7cqxyhod9sk7t
		foreign key (features_name, features_project, features_feature_set_id, features_version) references features
);

create index idx_jobs_feature_set_features_feature_set_id
	on feature_set_features (feature_set_id);

alter table entities drop column bool_domain;

alter table entities drop column domain;

alter table entities drop column float_domain;

alter table entities drop column group_presence;

alter table entities drop column image_domain;

alter table entities drop column int_domain;

alter table entities drop column mid_domain;

alter table entities drop column natural_language_domain;

alter table entities drop column presence;

alter table entities alter column project set not null;

alter table entities drop column shape;

alter table entities drop column string_domain;

alter table entities drop column struct_domain;

alter table entities drop column time_domain;

alter table entities drop column time_of_day_domain;

alter table entities drop column url_domain;

alter table entities drop column value_count;

-- column reordering is not supported entities.version

alter table entities alter column version set not null;

alter table entities alter column type drop not null;

alter table entities drop constraint fkhyblh5sfunv00a8ums8ms9otq;

alter table entities drop column feature_set_id;

alter table features alter column project set not null;

alter table features alter column type drop not null;

-- column reordering is not supported features.version

alter table features alter column version set not null;

-- Postgres refuses to drop an index that backs a constraint;
-- dropping the unique constraint removes the index as well
alter table features drop constraint uk5trg3fpcjbjw5w2dpuuqswdxh;

alter table features drop constraint fkfxcpsscvj0g89o4p5dx4insb1;

alter table features drop column feature_set_id;

drop table metrics;

@zhilingc zhilingc force-pushed the split-fields branch 2 times, most recently from 481ab83 to 030d6e6 on April 30, 2020 09:15
@zhilingc
Collaborator Author

/test test-end-to-end

ches
ches previously approved these changes Apr 30, 2020
Member

@ches ches left a comment

I'm not going to love the migration, but I think that I'll be happier after it's done.

Functionally LGTM, I think it's a good move for internals of Feast's design, as the rest of your work based on it probably will further prove. My remarks are just doc nits.

An aside about SQL data model in general, not specific to this PR: any particular reasons for the heavy use of string keys? There are several cases where a string PK is entirely made up of data in other columns in the table, they could possibly be composite keys instead of surrogates. If there's going to be a surrogate anyway, I feel like auto-generated numeric IDs would get us into less trouble down the road than "project/feature_set_name:42" when versions are going away and projects may have changes coming…

core/src/main/java/feast/core/model/EntityReference.java (review thread: outdated, resolved)
core/src/main/java/feast/core/model/Entity.java (review thread: outdated, resolved)
core/src/main/java/feast/core/model/Entity.java (review thread: outdated, resolved)
@zhilingc
Collaborator Author

zhilingc commented May 4, 2020

/test test-end-to-end

@zhilingc
Collaborator Author

zhilingc commented May 4, 2020

Removed the embedded ids in favor of surrogate ids. I've retained the child -> parent relationship because bidirectional relationships are more efficient than unidirectional ones.
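
A minimal sketch of what a bidirectional parent/child relationship looks like in plain Java follows. The class names are hypothetical and no persistence annotations are shown. In JPA terms, the back-reference is typically what lets the child side own the foreign key column, rather than the association being managed through a separate join table or extra UPDATE statements; that is presumably the efficiency referred to above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: plain Java, no persistence annotations.
final class FeatureSetNode {
    private final String name;
    private final List<FeatureNode> features = new ArrayList<>();

    FeatureSetNode(String name) {
        this.name = name;
    }

    // Adding through the parent keeps both sides of the relationship in sync.
    void addFeature(FeatureNode feature) {
        features.add(feature);
        feature.setFeatureSet(this);
    }

    String getName() {
        return name;
    }

    List<FeatureNode> getFeatures() {
        return Collections.unmodifiableList(features);
    }
}

final class FeatureNode {
    private final String name;
    private FeatureSetNode featureSet; // child -> parent back-reference

    FeatureNode(String name) {
        this.name = name;
    }

    void setFeatureSet(FeatureSetNode featureSet) {
        this.featureSet = featureSet;
    }

    FeatureSetNode getFeatureSet() {
        return featureSet;
    }

    String getName() {
        return name;
    }
}
```

Routing every addition through `addFeature` is a design choice in this sketch: it avoids a state where the parent lists a child whose back-reference still points elsewhere.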

@woop
Member

woop commented May 4, 2020

/test test-end-to-end-batch

@woop
Member

woop commented May 4, 2020

This looks good to me. @ches I'll leave this open until EOD or until you greenlight it, whichever comes first :)

@ches
Member

ches commented May 4, 2020

I'm still not crystal clear on what tradeoffs we've just made in the last commit. Could we update examples of the SQL schema and/or migration?

Roughly speaking, I get that with the prior @EmbeddedId approach you get a join model, a Java entity class representing join table records. This can have value especially if that model deserves first-class status in the object model—concept + behavior of its own, potential of new non-join columns in the join table in the future that enrich the join model, etc.—as references probably do, viz #674 and its cousins. This is part of what @zhilingc is getting at in #655 (comment) I think, although also an efficiency/access pattern angle.

I'm not sure whether they need/warrant a table though (objectively neutral statement here pending pros and cons to weigh, not an "I'm not sure" to be read as "I don't think they do"). Or if there is a join table, should it also use surrogate rather than composite keys because of open questions for the future about reference components, in #479?

Assuming we have simpler schema as per the most recent change, and assuming we need/want references to have first class types so we soon define canonical domain models for them without SQL backing, what have we lost or gained with the resulting design?

Feel somewhat as if we exhausted @zhilingc into submission without hearing some valid considerations articulated.

Sorry to drag out any further. This has begged for a whiteboard session.

@zhilingc
Collaborator Author

zhilingc commented May 4, 2020

The join tables have been removed. The generated SQL for the migration is as follows:

alter sequence hibernate_sequence no minvalue;

alter table entities
	add id bigint not null;

-- column reordering is not supported entities.id

create unique index entities_pkey
	on entities (id);

create unique index uk4hredqqfh86prhp1hf08nofvk
	on entities (name, feature_set_id);

alter table entities
	add constraint entities_pkey
		primary key (id);

alter table entities
	add constraint uk4hredqqfh86prhp1hf08nofvk
		unique (name, feature_set_id);

alter table feature_sets
	add labels text;

create unique index ukei1j8q7sfkjlhuxstxs7s3c6
	on feature_sets (name, version, project_name);

alter table feature_sets
	add constraint ukei1j8q7sfkjlhuxstxs7s3c6
		unique (name, version, project_name);

alter table features
	add id bigint not null;

-- column reordering is not supported features.id

alter table features
	add labels text;

-- column reordering is not supported features.labels

alter table features
	add constraint features_pkey
		primary key (id);

alter table features
	add constraint ukedouxmpcoev743cmstfwq25yp
		unique (name, feature_set_id);

-- Removal of constraints from entities
alter table entities drop column bool_domain;

alter table entities drop column domain;

alter table entities drop column float_domain;

alter table entities drop column group_presence;

alter table entities drop column image_domain;

alter table entities drop column int_domain;

alter table entities drop column mid_domain;

alter table entities alter column name drop not null;

alter table entities drop column natural_language_domain;

alter table entities drop column presence;

alter table entities drop column project;

alter table entities drop column shape;

alter table entities drop column string_domain;

alter table entities drop column struct_domain;

alter table entities drop column time_domain;

alter table entities drop column time_of_day_domain;

alter table entities alter column type drop not null;

alter table entities drop column url_domain;

alter table entities drop column value_count;

-- column reordering is not supported entities.feature_set_id

alter table entities alter column feature_set_id type integer using feature_set_id::integer;

alter table entities alter column feature_set_id drop not null;

alter table entities drop column version;

alter table feature_sets alter column id type integer using id::integer;

-- column reordering is not supported features.feature_set_id

alter table features alter column feature_set_id type integer using feature_set_id::integer;

alter table features alter column feature_set_id drop not null;

alter table features alter column name drop not null;

alter table features alter column type drop not null;

-- Postgres refuses to drop an index that backs a constraint;
-- dropping the unique constraint removes the index as well
alter table features drop constraint uk5trg3fpcjbjw5w2dpuuqswdxh;

alter table features drop column project;

alter table features drop column version;

alter table jobs_feature_sets alter column feature_sets_id type integer using feature_sets_id::integer;

drop table metrics;
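
One caveat for anyone applying a script like this to a database that already holds rows: `add id bigint not null` needs a value for every existing row, and the `::integer` casts only succeed if the existing string keys are numeric. If the stored feature set ids look like the "project/feature_set_name:42" form mentioned earlier in this thread, a data-preserving migration would presumably have to assign numeric ids first and rewrite the child foreign keys to match. The plain-Java sketch below shows only that remapping logic under those assumptions; the class and method names are hypothetical and it is not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the id remapping a data-preserving migration would need.
final class SurrogateBackfillSketch {

    // Assigns a new numeric id to each distinct legacy string id, in encounter order, starting at 1.
    static Map<String, Long> assignIds(List<String> legacyFeatureSetIds) {
        Map<String, Long> mapping = new LinkedHashMap<>();
        long next = 1;
        for (String legacyId : legacyFeatureSetIds) {
            if (!mapping.containsKey(legacyId)) {
                mapping.put(legacyId, next++);
            }
        }
        return mapping;
    }

    // Translates a child's legacy foreign key; fails loudly on orphaned rows.
    static long remap(Map<String, Long> mapping, String legacyForeignKey) {
        Long id = mapping.get(legacyForeignKey);
        if (id == null) {
            throw new IllegalStateException("No feature set for key: " + legacyForeignKey);
        }
        return id;
    }
}
```

The same two steps could be expressed in SQL (populate the new column, then update the children through a join on the old key) before the old string columns are dropped.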

@woop
Member

woop commented May 4, 2020

Disclaimer: My gripe is mostly on the join tables and their value in a one-to-many relationship, and not so much whether surrogates or composite keys are preferable. We don't need join tables to use composite keys, but I understand how they help when you want to propagate all the fields down to children.

In favour of Surrogate/Hierarchical/Normalized approach

  • Surrogates allow the data model to be more flexible and less brittle by allowing the composite fields to change independently of the relationship.
  • Surrogates are more efficient to store and index
  • They don't require each table with a foreign key reference to have all the columns in the EmbeddedId, which currently stands at four. This bloats the data model and gives me anxiety.
  • Specifically if we only store a field once and don't use EmbeddedIds, then moving that relationship around seems a bit easier. For example we can make features globally unique by removing the unique constraint on the feature set (and possibly removing the column altogether). This would be harder with the composite key approach since we’d need to drop and create new join tables and move the feature set out of the embeddedId.

In favour of Composite/EmbeddedId approach

  • Migrations are harder with surrogates (at least in my SQL Server experience where the numbers don't matter and the rules are made up)
  • Possibly one less query to the database required since Ids can be composed outside of the database.
  • EmbeddedIds/Composites are easier to read since you can find the natural keys in each table that reference them.
  • EmbeddedIds in our case can map naturally to our Refs. FeatureRef, EntityRef, FeatureSetRef.
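
To illustrate the first point in favour of surrogates, here is a small in-memory comparison in plain Java. All class names and row shapes are hypothetical stand-ins for table rows, not the real schema. It counts how many rows have to be rewritten when a project is renamed under each layout.

```java
import java.util.List;

// Hypothetical in-memory stand-ins for table rows under the two key styles.
final class KeyStyleComparison {

    // Composite style: every child row repeats the parent's natural key columns.
    static final class CompositeFeatureRow {
        String project;
        final String featureSet;
        final int version;
        final String name;

        CompositeFeatureRow(String project, String featureSet, int version, String name) {
            this.project = project;
            this.featureSet = featureSet;
            this.version = version;
            this.name = name;
        }
    }

    // Surrogate style: the parent row owns the natural key fields.
    static final class FeatureSetRow {
        final long id;
        String project;
        final String name;

        FeatureSetRow(long id, String project, String name) {
            this.id = id;
            this.project = project;
            this.name = name;
        }
    }

    // Surrogate style: a child row stores only the parent's numeric id.
    static final class SurrogateFeatureRow {
        final long featureSetId;
        final String name;

        SurrogateFeatureRow(long featureSetId, String name) {
            this.featureSetId = featureSetId;
            this.name = name;
        }
    }

    // Composite layout: returns how many child rows had to be rewritten.
    static int renameProjectComposite(List<CompositeFeatureRow> features, String from, String to) {
        int touched = 0;
        for (CompositeFeatureRow row : features) {
            if (row.project.equals(from)) {
                row.project = to;
                touched++;
            }
        }
        return touched;
    }

    // Surrogate layout: only parent rows change; child rows keep the same featureSetId.
    static int renameProjectSurrogate(List<FeatureSetRow> featureSets, String from, String to) {
        int touched = 0;
        for (FeatureSetRow row : featureSets) {
            if (row.project.equals(from)) {
                row.project = to;
                touched++;
            }
        }
        return touched;
    }
}
```

In the composite layout the rewrite count grows with the number of features; in the surrogate layout it grows only with the number of feature sets.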

> assuming we need/want references to have first class types so we soon define canonical domain models for them without SQL backing

I take it you mean we could have FeatureRefs, FeatureSetRefs, and EntityRefs floating around without each one of them being persisted. Something like GetOnlineFeatures where no persistence layer is touched for these references, but they are still functionally used?

If that is the case then it seems like we have three possible implementation areas:

  1. Proto implementation used during cross service communication
  2. Canonical domain model used for getting shit done
  3. Spring data model used for persistence

It seems like one risk that you are flagging is a potential disconnect between (2) and (3) that may not exist if we followed the EmbeddedId approach where they could be the same thing?

Whether or not we follow the EmbeddedId (FeatureRef, EntityRef) approach, we will still have to make separate classes for FeatureRef and EntityRef. I like the idea of these classes being used for querying Features/Entities with Spring’s Specification interface. So we basically can get the benefits of composite keys without the horrible (in my opinion) data model, and we’d also not need to query for surrogates.

This also keeps the FeatureRef and EntityRef pure (annotation free). Although the same won't apply to Features, Entities, Jobs, etc. I expect the domain model and persistence models to be the same classes.
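
A rough sketch of that idea in plain Java: an annotation-free reference type that can produce a filter for looking up persisted features by their natural key. `java.util.function.Predicate` is used here purely as a stand-in for a Spring Data `Specification`, and every class and field name is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Annotation-free reference type; fields are hypothetical.
final class FeatureRefSketch {
    final String project;
    final String featureSet;
    final String name;

    FeatureRefSketch(String project, String featureSet, String name) {
        this.project = project;
        this.featureSet = featureSet;
        this.name = name;
    }

    // Stand-in for a Specification: a filter over stored features derived from the reference.
    Predicate<StoredFeature> asFilter() {
        return f -> project.equals(f.project)
            && featureSet.equals(f.featureSet)
            && name.equals(f.name);
    }
}

// Stand-in for a persisted feature with a surrogate id plus its natural key fields.
final class StoredFeature {
    final long id;
    final String project;
    final String featureSet;
    final String name;

    StoredFeature(long id, String project, String featureSet, String name) {
        this.id = id;
        this.project = project;
        this.featureSet = featureSet;
        this.name = name;
    }
}

final class FeatureLookupSketch {
    // Finds stored features matching the reference without the caller knowing any surrogate id.
    static List<StoredFeature> find(List<StoredFeature> all, FeatureRefSketch ref) {
        Predicate<StoredFeature> filter = ref.asFilter();
        List<StoredFeature> matches = new ArrayList<>();
        for (StoredFeature feature : all) {
            if (filter.test(feature)) {
                matches.add(feature);
            }
        }
        return matches;
    }
}
```

The caller only ever handles the reference; the surrogate id stays an internal detail of the stored row.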

It's clear that there are unknowns here, so we should probably err on the side of simplicity and delay unnecessary complexity. I think the normalized/surrogate approach has the edge here, but perhaps I am missing some obvious advantage of the composite/embedded approach.

Useful links

@woop
Member

woop commented May 4, 2020

Composite/EmbeddedId (d579424)
[image: ERD of the composite/EmbeddedId model]

Surrogate/Normalized (935602e)
[image: ERD of the surrogate/normalized model]

@woop
Member

woop commented May 5, 2020

> Sorry to drag out any further. This has begged for a whiteboard session.

Always happy with good feedback, but we are under time pressure to cut this release in order to roll it out. We still need to get through version removal as well, which is based on this PR.

Async communication seems to be working well here, but if you think jumping on a quick call to discuss would be better then we could do that as well.

@ches
Member

ches commented May 5, 2020

Thanks a lot for the detail @woop, that's the analysis I was after and the ERDs are essentially the whiteboard session I was looking for.

I'm sold on the model without the join tables.

> assuming we need/want references to have first class types so we soon define canonical domain models for them without SQL backing
>
> I take it you mean we could have FeatureRefs, FeatureSetRefs, and EntityRefs floating around without each one of them being persisted. Something like GetOnlineFeatures where no persistence layer is touched for these references, but they are still functionally used?

Yes, exactly, and that's precisely a case where looking them up from persistence would be unnecessary, and undesirable unless maybe for a SQL online store.

> 1. Proto implementation used during cross service communication
> 2. Canonical domain model used for getting shit done
> 3. Spring data model used for persistence
>
> It seems like one risk that you are flagging is a potential disconnect between (2) and (3) that may not exist if we followed the EmbeddedId approach where they could be the same thing?

I hadn't thought very deeply about it yet, but no, so far I'm not worried that maintaining appropriate invariants in code will be a problem here.

I suggest being conservative about committing to proto schemas until we're more certain about the further outlook of #479; I think that's a discussion for #674.

> Whether or not we follow the EmbeddedId (FeatureRef, EntityRef) approach, we will still have to make separate classes for FeatureRef and EntityRef… This also keeps the FeatureRef and EntityRef pure (annotation free).

Indeed, this is part of what I had in mind in raising discussion of whether join models would have value or not—if they did, it would still likely be cumbersome because of how widely these are going to be needed, beyond SQL persistence contexts as above.

@ches
Member

ches commented May 5, 2020

/lgtm

4 participants