Split Field model into distinct Feature and Entity objects #655

zhilingc · 2020-04-26T16:31:11Z

What this PR does / why we need it:
This is a split-off of #612 that introduces the model changes made in that PR in a more digestible chunk.

This PR includes:

Removal of Field object
Addition of distinct Feature and Entity objects
Removal of TFX fields on entities

SQL changes:

Surrogate long ids for feature sets, features and entities
Drop TFX constraints from entities table

Does this PR introduce a user-facing change?:

Model changes to FeatureSets, Features and Entities. Requires Migration.

zhilingc · 2020-04-26T16:32:34Z

/test test-end-to-end

zhilingc · 2020-04-27T03:39:21Z

/test test-end-to-end

zhilingc · 2020-04-27T04:27:27Z

/test test-end-to-end-batch

ches · 2020-04-29T05:46:16Z

Could you please summarize in the description what the SQL schema changes are that are implied with this branch? Should the PR include a SQL migration script?

woop · 2020-04-29T06:01:24Z

> Could you please summarize in the description what the SQL schema changes are that are implied with this branch? Should the PR include a SQL migration script?

For the script, should it perhaps be at the release level, or is there an advantage to having it at the PR level?

ches · 2020-04-29T09:42:26Z

> For the script, should it perhaps be at the release level, or is there an advantage to having it at the PR level?

Could be at release. With the PR, reviewers could run it on an existing development environment with data in it and a.) see that the migration script works, and b.) experiment with the PR with some existing data. Manual of course, but it's something.

This would be out of scope for this PR for sure, but separately maybe we could consider integrating something like Flyway into the project, both for development convenience and for shipping migration scripts with Feast releases that operators can have a process to apply. Could be used to load seed data for automated integration tests too.

woop · 2020-04-30T00:44:22Z

> This would be out of scope for this PR for sure, but separately maybe we could consider integrating something like Flyway into the project, both for development convenience and for shipping migration scripts with Feast releases that operators can have a process to apply. Could be used to load seed data for automated integration tests too.

Yea I really like Flyway. I think it makes sense. I have only used it as an external tool in the past in non-JVM projects. It seems like integration here would mean that migrations could be triggered manually using the CLI or mvn.

The value add would be mostly in the migration scripts themselves, so perhaps we could start there, and then add documentation and Flyway around them.
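To make the idea of shipping ordered migration scripts concrete, here is a minimal sketch of how versioned script names could be sorted before being applied. It assumes Flyway's documented file naming convention (`V<version>__<description>.sql`) but does not call Flyway itself; the class name `MigrationOrder` and the example file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Feast or Flyway: orders migration scripts
// named in the style "V0_5_0__split_field_model.sql" by their version.
final class MigrationOrder {

    // Extracts the numeric version parts, e.g. "V0_5_0__x.sql" -> [0, 5, 0].
    static int[] version(String fileName) {
        int sep = fileName.indexOf("__");
        if (!fileName.startsWith("V") || sep < 2) {
            throw new IllegalArgumentException("Not a versioned migration: " + fileName);
        }
        String[] parts = fileName.substring(1, sep).split("[._]");
        int[] v = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Integer.parseInt(parts[i]);
        }
        return v;
    }

    // Compares part by part, treating missing trailing parts as zero.
    static int compare(String a, String b) {
        int[] va = version(a);
        int[] vb = version(b);
        for (int i = 0; i < Math.max(va.length, vb.length); i++) {
            int x = i < va.length ? va[i] : 0;
            int y = i < vb.length ? vb[i] : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sorted(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(MigrationOrder::compare);
        return copy;
    }
}
```

Numeric comparison matters here: a plain string sort would place a hypothetical `V0_10_0` script before `V0_5_0`.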

zhilingc · 2020-04-30T03:41:19Z

@ches @woop The necessary migration between the 0.4.7 schema and 0.5 is the following:

create table entity
(
	feature_set varchar(255) not null,
	name varchar(255) not null,
	project varchar(255) not null,
	version integer not null,
	type varchar(255),
	constraint entity_pkey
		primary key (feature_set, name, project, version)
);

create table feature
(
	feature_set varchar(255) not null,
	name varchar(255) not null,
	project varchar(255) not null,
	version integer not null,
	bool_domain bytea,
	domain varchar(255),
	float_domain bytea,
	group_presence bytea,
	image_domain bytea,
	int_domain bytea,
	mid_domain bytea,
	natural_language_domain bytea,
	presence bytea,
	shape bytea,
	string_domain bytea,
	struct_domain bytea,
	time_domain bytea,
	time_of_day_domain bytea,
	type varchar(255),
	url_domain bytea,
	value_count bytea,
	constraint feature_pkey
		primary key (feature_set, name, project, version)
);

create table feature_set_entities
(
	feature_set_id varchar(255) not null
		constraint fkbax98goqv127qc5su483xjhq9
			references feature_sets,
	entities_name varchar(255) not null,
	entities_project varchar(255) not null,
	entities_feature_set_id varchar(255) not null,
	entities_version integer not null,
	constraint feature_set_entities_pkey
		primary key (feature_set_id, entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint uk_byydfwkove4rtygt5pydandx4
		unique (entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint fktm3p2cp241tbqfb7r1ptfegdx
		foreign key (entities_name, entities_project, entities_feature_set_id, entities_version) references entity
);

create index idx_jobs_feature_set_entities_feature_set_id
	on feature_set_entities (feature_set_id);

create table feature_set_features
(
	feature_set_id varchar(255) not null
		constraint fk2w40q1cwd6pnqsbv5rvjkx5wj
			references feature_sets,
	features_name varchar(255) not null,
	features_project varchar(255) not null,
	features_feature_set_id varchar(255) not null,
	features_version integer not null,
	constraint feature_set_features_pkey
		primary key (feature_set_id, features_name, features_project, features_feature_set_id, features_version),
	constraint uk_tngc6xfm924rl6wxhqkb97jdl
		unique (features_name, features_project, features_feature_set_id, features_version),
	constraint fk1qcxf8joepru1qpyg1wu4lqfb
		foreign key (features_name, features_project, features_feature_set_id, features_version) references feature
);

create index idx_jobs_feature_set_features_feature_set_id
	on feature_set_features (feature_set_id);

create index idx_jobs_feature_sets_job_id
	on jobs_feature_sets (job_id);

create index idx_jobs_feature_sets_feature_sets_id
	on jobs_feature_sets (feature_sets_id);

drop table features;

drop table entities;

drop table metrics;

I can add migration scripts to this PR if necessary; the only difficulty here seems to be the entities and features.

Alternatively we can alter the existing features and entities tables but I'm not sure if column orders can be altered :/

woop · 2020-04-30T04:09:38Z

Thanks @zhilingc

From an upgrade perspective I think people care about retaining their existing data.

> I'm not sure if column orders can be altered :/

Are any of our systems (like Spring) picky about column order?

zhilingc · 2020-04-30T04:48:39Z

> Are any of our systems (like Spring) picky about column order?

It's not Spring; Spring doesn't care. But reordering columns is not directly supported in Postgres.

woop · 2020-04-30T06:08:39Z

> > Are any of our systems (like Spring) picky about column order?
>
> It's not spring, spring doesn't care, but it's not directly supported in postgres.

Ok, so where does the failure happen if we lose column ordering? I am trying to understand the comment you made above.

zhilingc · 2020-04-30T06:42:37Z

Oh, actually I suppose it's not necessary to reorder columns.

drop sequence hibernate_sequence;

alter table entities
	add feature_set varchar(255) not null;

-- column reordering is not supported entities.feature_set

create unique index entities_pkey
	on entities (feature_set, name, project, version);

alter table entities
	add constraint entities_pkey
		primary key (feature_set, name, project, version);

create table feature_set_entities
(
	feature_set_id varchar(255) not null
		constraint fkbax98goqv127qc5su483xjhq9
			references feature_sets,
	entities_name varchar(255) not null,
	entities_project varchar(255) not null,
	entities_feature_set_id varchar(255) not null,
	entities_version integer not null,
	constraint feature_set_entities_pkey
		primary key (feature_set_id, entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint uk_byydfwkove4rtygt5pydandx4
		unique (entities_name, entities_project, entities_feature_set_id, entities_version),
	constraint fkqr5sj1yp7uunob8nf8fuxwfmw
		foreign key (entities_name, entities_project, entities_feature_set_id, entities_version) references entities
);

create index idx_jobs_feature_set_entities_feature_set_id
	on feature_set_entities (feature_set_id);

alter table features
	add feature_set varchar(255) not null;

-- column reordering is not supported features.feature_set

alter table features
	add constraint features_pkey
		primary key (feature_set, name, project, version);

create table feature_set_features
(
	feature_set_id varchar(255) not null
		constraint fk2w40q1cwd6pnqsbv5rvjkx5wj
			references feature_sets,
	features_name varchar(255) not null,
	features_project varchar(255) not null,
	features_feature_set_id varchar(255) not null,
	features_version integer not null,
	constraint feature_set_features_pkey
		primary key (feature_set_id, features_name, features_project, features_feature_set_id, features_version),
	constraint uk_tngc6xfm924rl6wxhqkb97jdl
		unique (features_name, features_project, features_feature_set_id, features_version),
	constraint fk9dltpwl1lu6w7cqxyhod9sk7t
		foreign key (features_name, features_project, features_feature_set_id, features_version) references features
);

create index idx_jobs_feature_set_features_feature_set_id
	on feature_set_features (feature_set_id);

alter table entities drop column bool_domain;

alter table entities drop column domain;

alter table entities drop column float_domain;

alter table entities drop column group_presence;

alter table entities drop column image_domain;

alter table entities drop column int_domain;

alter table entities drop column mid_domain;

alter table entities drop column natural_language_domain;

alter table entities drop column presence;

alter table entities alter column project set not null;

alter table entities drop column shape;

alter table entities drop column string_domain;

alter table entities drop column struct_domain;

alter table entities drop column time_domain;

alter table entities drop column time_of_day_domain;

alter table entities drop column url_domain;

alter table entities drop column value_count;

-- column reordering is not supported entities.version

alter table entities alter column version set not null;

alter table entities alter column type drop not null;

alter table entities drop constraint fkhyblh5sfunv00a8ums8ms9otq;

alter table entities drop column feature_set_id;

alter table features alter column project set not null;

alter table features alter column type drop not null;

-- column reordering is not supported features.version

alter table features alter column version set not null;

drop index uk5trg3fpcjbjw5w2dpuuqswdxh;

alter table features drop constraint uk5trg3fpcjbjw5w2dpuuqswdxh;

alter table features drop constraint fkfxcpsscvj0g89o4p5dx4insb1;

alter table features drop column feature_set_id;

drop table metrics;

zhilingc · 2020-04-30T10:07:16Z

/test test-end-to-end

ches

I'm not going to love the migration, but I think that I'll be happier after it's done.

Functionally LGTM, I think it's a good move for internals of Feast's design, as the rest of your work based on it probably will further prove. My remarks are just doc nits.

An aside about SQL data model in general, not specific to this PR: any particular reasons for the heavy use of string keys? There are several cases where a string PK is entirely made up of data in other columns in the table, they could possibly be composite keys instead of surrogates. If there's going to be a surrogate anyway, I feel like auto-generated numeric IDs would get us into less trouble down the road than "project/feature_set_name:42" when versions are going away and projects may have changes coming…

core/src/main/java/feast/core/model/EntityReference.java

core/src/main/java/feast/core/model/Entity.java

core/src/main/java/feast/core/model/Feature.java

zhilingc · 2020-05-04T06:14:44Z

/test test-end-to-end

zhilingc · 2020-05-04T06:27:51Z

Removed the embedded ids in favor of surrogate ids. I've retained the child -> parent relationship because bidirectional relationships are more efficient than unidirectional ones.
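As a rough illustration of the shape described above, the sketch below shows a surrogate id on each object plus a child -> parent back-reference. It is a simplified stand-in, not the actual Feast classes: the names `FeatureSetSketch` and `FeatureSketch` are hypothetical, and the JPA annotations appear only as comments so the code compiles without a persistence library. The usual Hibernate argument for this mapping is that the child side owns the `feature_set_id` foreign key, so the parent's collection needs neither a join table nor extra update statements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical parent object with a database-assigned surrogate key.
class FeatureSetSketch {
    // @Id @GeneratedValue  (surrogate key, assigned on persist)
    private Long id;
    private final String name;
    // @OneToMany(mappedBy = "featureSet", cascade = CascadeType.ALL)
    private final List<FeatureSketch> features = new ArrayList<>();

    FeatureSetSketch(String name) {
        this.name = name;
    }

    // Adds the child and sets its back-reference so both sides stay in sync.
    void addFeature(FeatureSketch feature) {
        features.add(feature);
        feature.setFeatureSet(this);
    }

    Long getId() { return id; }
    String getName() { return name; }
    List<FeatureSketch> getFeatures() { return Collections.unmodifiableList(features); }
}

// Hypothetical child object; it holds the reference that maps to feature_set_id.
class FeatureSketch {
    // @Id @GeneratedValue
    private Long id;
    private final String name;
    // @ManyToOne(fetch = FetchType.LAZY) @JoinColumn(name = "feature_set_id")
    private FeatureSetSketch featureSet;

    FeatureSketch(String name) {
        this.name = name;
    }

    void setFeatureSet(FeatureSetSketch featureSet) { this.featureSet = featureSet; }
    FeatureSetSketch getFeatureSet() { return featureSet; }
    Long getId() { return id; }
    String getName() { return name; }
}
```

Routing all additions through a helper like `addFeature` is one common way to avoid a child whose parent reference was never set.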

core/src/main/java/feast/core/model/FeatureSet.java

woop · 2020-05-04T06:40:13Z

/test test-end-to-end-batch

woop · 2020-05-04T06:42:39Z

This looks good to me. @ches I'll leave this open until EOD or until you greenlight it, whichever comes first :)

ches · 2020-05-04T08:29:58Z

I'm still not crystal clear on what tradeoffs we've just made in the last commit. Could we update examples of the SQL schema and/or migration?

Roughly speaking, I get that with the prior @EmbeddedId approach you get a join model, a Java entity class representing join table records. This can have value especially if that model deserves first-class status in the object model—concept + behavior of its own, potential of new non-join columns in the join table in the future that enrich the join model, etc.—as references probably do, viz #674 and its cousins. This is part of what @zhilingc is getting at in #655 (comment) I think, although also an efficiency/access pattern angle.

I'm not sure whether they need/warrant a table though (objectively neutral statement here pending pros and cons to weigh, not an "I'm not sure" to be read as "I don't think they do"). Or if there is a join table, should it also use surrogate rather than composite keys because of open questions for the future about reference components, in #479?

Assuming we have simpler schema as per the most recent change, and assuming we need/want references to have first class types so we soon define canonical domain models for them without SQL backing, what have we lost or gained with the resulting design?

Feel somewhat as if we exhausted @zhilingc into submission without hearing some valid considerations articulated.

Sorry to drag out any further. This has begged for a whiteboard session.

zhilingc · 2020-05-04T09:19:32Z

The join tables have been removed. The generated SQL for the migration is as follows:

alter sequence hibernate_sequence nominvalue;

alter table entities
	add id bigint not null;

-- column reordering is not supported entities.id

create unique index entities_pkey
	on entities (id);

create unique index uk4hredqqfh86prhp1hf08nofvk
	on entities (name, feature_set_id);

alter table entities
	add constraint entities_pkey
		primary key (id);

alter table entities
	add constraint uk4hredqqfh86prhp1hf08nofvk
		unique (name, feature_set_id);

alter table feature_sets
	add labels text;

create unique index ukei1j8q7sfkjlhuxstxs7s3c6
	on feature_sets (name, version, project_name);

alter table feature_sets
	add constraint ukei1j8q7sfkjlhuxstxs7s3c6
		unique (name, version, project_name);

alter table features
	add id bigint not null;

-- column reordering is not supported features.id

alter table features
	add labels text;

-- column reordering is not supported features.labels

alter table features
	add constraint features_pkey
		primary key (id);

alter table features
	add constraint ukedouxmpcoev743cmstfwq25yp
		unique (name, feature_set_id);

-- Removal of constraints from entities
alter table entities drop column bool_domain;

alter table entities drop column domain;

alter table entities drop column float_domain;

alter table entities drop column group_presence;

alter table entities drop column image_domain;

alter table entities drop column int_domain;

alter table entities drop column mid_domain;

alter table entities alter column name drop not null;

alter table entities drop column natural_language_domain;

alter table entities drop column presence;

alter table entities drop column project;

alter table entities drop column shape;

alter table entities drop column string_domain;

alter table entities drop column struct_domain;

alter table entities drop column time_domain;

alter table entities drop column time_of_day_domain;

alter table entities alter column type drop not null;

alter table entities drop column url_domain;

alter table entities drop column value_count;

-- column reordering is not supported entities.feature_set_id

alter table entities alter column feature_set_id type integer using feature_set_id::integer;

alter table entities alter column feature_set_id drop not null;

alter table entities drop column version;

alter table feature_sets alter column id type integer using id::integer;

-- column reordering is not supported features.feature_set_id

alter table features alter column feature_set_id type integer using feature_set_id::integer;

alter table features alter column feature_set_id drop not null;

alter table features alter column name drop not null;

alter table features alter column type drop not null;

drop index uk5trg3fpcjbjw5w2dpuuqswdxh;

alter table features drop constraint uk5trg3fpcjbjw5w2dpuuqswdxh;

alter table features drop column project;

alter table features drop column version;

alter table jobs_feature_sets alter column feature_sets_id type integer using feature_sets_id::integer;

drop table metrics;

woop · 2020-05-04T13:55:02Z

Disclaimer: My gripe is mostly on the join tables and their value in a one-to-many relationship, and not so much whether surrogates or composite keys are preferable. We don't need join tables to use composite keys, but I understand how they help when you want to propagate all the fields down to children.

In favour of Surrogate/Hierarchical/Normalized approach

Surrogates allow the data model to be more flexible and less brittle by allowing the composite fields to change independently of the relationship.
Surrogates are more efficient to store and index
They don't require each table with a foreign key reference to have all the columns in the EmbeddedId, which currently stands at four. This bloats the data model and gives me anxiety.
Specifically if we only store a field once and don't use EmbeddedIds, then moving that relationship around seems a bit easier. For example we can make features globally unique by removing the unique constraint on the feature set (and possibly removing the column altogether). This would be harder with the composite key approach since we’d need to drop and create new join tables and move the feature set out of the embeddedId.

In favour of Composite/EmbeddedId approach

Migrations are harder with surrogates (at least in my SQL Server experience where the numbers don't matter and the rules are made up)
Possibly one less query to the database required since Ids can be composed outside of the database.
EmbeddedIds/Composites are easier to read since you can find the natural keys in each table that reference them.
EmbeddedIds in our case can map naturally to our Refs. FeatureRef, EntityRef, FeatureSetRef.
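The flexibility argument can be sketched in plain Java. The example below assumes a hypothetical four-part key, mirroring the (feature_set, name, project, version) primary key from the earlier migration, and contrasts it with a numeric surrogate. The maps stand in for child rows that reference a feature; none of these class names come from the Feast codebase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical composite key: identity is made up of the natural fields.
final class CompositeFeatureKey {
    final String project;
    final String featureSet;
    final String name;
    final int version;

    CompositeFeatureKey(String project, String featureSet, String name, int version) {
        this.project = project;
        this.featureSet = featureSet;
        this.name = name;
        this.version = version;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeFeatureKey)) return false;
        CompositeFeatureKey k = (CompositeFeatureKey) o;
        return version == k.version
            && project.equals(k.project)
            && featureSet.equals(k.featureSet)
            && name.equals(k.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(project, featureSet, name, version);
    }
}

final class KeyComparison {
    // Composite key: renaming the project yields a different key, so rows
    // that reference the old key no longer match and must be rewritten.
    static boolean compositeSurvivesRename() {
        Map<CompositeFeatureKey, String> children = new HashMap<>();
        children.put(new CompositeFeatureKey("proj_a", "drivers", "trips", 1), "child row");
        CompositeFeatureKey renamed = new CompositeFeatureKey("proj_b", "drivers", "trips", 1);
        return children.containsKey(renamed);
    }

    // Surrogate key: children reference a number that is independent of the
    // natural fields, so a rename on the parent leaves the reference intact.
    static boolean surrogateSurvivesRename() {
        Map<Long, String> children = new HashMap<>();
        long featureId = 42L;
        children.put(featureId, "child row");
        // The parent's project is renamed here; featureId does not change.
        return children.containsKey(featureId);
    }
}
```

The same property is what makes composite keys easier to read: the referencing row carries the natural fields itself, at the cost of repeating them in every table that points at the feature.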

> assuming we need/want references to have first class types so we soon define canonical domain models for them without SQL backing

I take it you mean we could have FeatureRefs, FeatureSetRefs, and EntityRefs floating around without each one of them being persisted. Something like GetOnlineFeatures where no persistence layer is touched for these references, but they are still functionally used?

If that is the case then it seems like we have three possible implementation areas:

1. Proto implementation used during cross service communication
2. Canonical domain model used for getting shit done
3. Spring data model used for persistence

It seems like one risk that you are flagging is a potential disconnect between (2) and (3) that may not exist if we followed the EmbeddedId approach where they could be the same thing?

Whether or not we follow the EmbeddedId (FeatureRef, EntityRef) approach, we will still have to make separate classes for FeatureRef and EntityRef. I like the idea of these classes being used for querying Features/Entities with Spring’s Specifications interface. So we basically get the benefits of composite keys without the horrible (in my opinion) data model, and we’d also not need to query for surrogates.

This also keeps the FeatureRef and EntityRef pure (annotation free). The same won't apply to Features, Entities, Jobs, etc., though; for those I expect the domain model and persistence model to be the same classes.
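A minimal sketch of that idea, under stated assumptions: `FeatureRef` here is a hypothetical annotation-free value class, `StoredFeature` stands in for the persisted Feature, and a `java.util.function.Predicate` stands in for a Spring Data `Specification` so the example runs without Spring on the classpath. In a real repository the predicate would be translated into a query on the natural key columns.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical pure reference type: no persistence annotations, no surrogate id.
final class FeatureRef {
    private final String featureSetName;
    private final String featureName;

    FeatureRef(String featureSetName, String featureName) {
        this.featureSetName = featureSetName;
        this.featureName = featureName;
    }

    // Stand-in for a Specification: matches rows by their natural key fields.
    Predicate<StoredFeature> toPredicate() {
        return row -> row.featureSetName.equals(featureSetName)
            && row.name.equals(featureName);
    }
}

// Stand-in for the persisted Feature, which would carry the surrogate id.
final class StoredFeature {
    final long id;
    final String featureSetName;
    final String name;

    StoredFeature(long id, String featureSetName, String name) {
        this.id = id;
        this.featureSetName = featureSetName;
        this.name = name;
    }
}

final class FeatureLookup {
    // Resolves a reference to a stored row without the caller knowing the id.
    static Optional<StoredFeature> find(List<StoredFeature> rows, FeatureRef ref) {
        return rows.stream().filter(ref.toPredicate()).findFirst();
    }
}
```

Callers pass around only the reference; the surrogate id stays an internal detail of the persistence layer.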

It's clear that there are unknowns here, so we should probably err on the side of simplicity and delay unnecessary complexity. I think the normalized/surrogate approach has the edge here, but perhaps I am missing some obvious advantage to going the composite/embedded approach.

woop · 2020-05-04T14:00:45Z

[ERD image: Composite/EmbeddedId (d579424)]

[ERD image: Surrogate/Normalized (935602e)]

woop · 2020-05-05T00:01:56Z

> Sorry to drag out any further. This has begged for a whiteboard session.

Always happy with good feedback, but we are under time pressure to cut this release in order to roll it out. We still need to get through version removal as well, which is based on this PR.

Async communication seems to be working well here, but if you think jumping on a quick call to discuss would be better then we could do that as well.

ches · 2020-05-05T02:40:45Z

Thanks a lot for the detail @woop, that's the analysis I was after and the ERDs are essentially the whiteboard session I was looking for.

I'm sold on the model without the join tables.

> > assuming we need/want references to have first class types so we soon define canonical domain models for them without SQL backing
>
> I take it you mean we could have FeatureRefs, FeatureSetRefs, and EntityRefs floating around without each one of them being persisted. Something like GetOnlineFeatures where no persistence layer is touched for these references, but they are still functionally used?

Yes, exactly, and that's precisely a case where looking them up from persistence would be unnecessary, and undesirable unless maybe for a SQL online store.

> 1. Proto implementation used during cross service communication
> 2. Canonical domain model used for getting shit done
> 3. Spring data model used for persistence
>
> It seems like one risk that you are flagging is a potential disconnect between (2) and (3) that may not exist if we followed the EmbeddedId approach where they could be the same thing?

I hadn't thought very deeply about it yet, but no, so far I'm not worried that maintaining appropriate invariants in code will be a problem here.

I suggest being conservative about committing to proto schemas until we're more certain about further outlook of #479, I think that's discussion for #674.

> Whether or not we follow the EmbeddedId (FeatureRef, EntityRef) approach, we will still have to make separate classes for FeatureRef and EntityRef… This also keeps the FeatureRef and EntityRef pure (annotation free).

Indeed, this is part of what I had in mind in raising discussion of whether join models would have value or not—if they did, it would still likely be cumbersome because of how widely these are going to be needed, beyond SQL persistence contexts as above.

ches · 2020-05-05T02:40:51Z

/lgtm

zhilingc requested review from davidheryanto, khorshuheng, pradithya and woop as code owners April 26, 2020 16:31

feast-ci-bot added the approved label Apr 26, 2020

zhilingc assigned ches Apr 26, 2020

feast-ci-bot added the size/XXL label Apr 26, 2020

zhilingc force-pushed the split-fields branch from 6152bee to ca93dd2 Compare April 26, 2020 16:42

ches added this to the v0.5.0 milestone Apr 26, 2020

woop mentioned this pull request Apr 29, 2020

Feast 0.5 release #527

Closed

ches mentioned this pull request Apr 29, 2020

Add feature and feature set labels, for metadata #536

Merged

zhilingc force-pushed the split-fields branch from 6c21dec to 42b1973 Compare April 29, 2020 13:00

zhilingc force-pushed the split-fields branch from eac0833 to c95a2c2 Compare April 30, 2020 05:57

zhilingc force-pushed the split-fields branch 2 times, most recently from 481ab83 to 030d6e6 Compare April 30, 2020 09:15

ches previously approved these changes Apr 30, 2020

zhilingc dismissed ches’s stale review via caa9184 May 1, 2020 01:54

mrzzy mentioned this pull request May 4, 2020

Validate feature reference and project functionality #631

Closed

woop reviewed May 4, 2020

zhilingc mentioned this pull request May 4, 2020

Add support for feature set updates and remove versions #676

Merged

zhilingc added 11 commits May 4, 2020 19:10

Split Field model into distinct Feature and Entity objects

3c08158

Remove TFX fields for entities in testdata

17904da

Split Field model into distinct Feature and Entity objects

9bcbabe

Index jointables

2c0a86f

Explicitly name tables, remove redundant constructor

4939b6c

Integrate labels

25b8da9

Fix code comments

fc6c498

Change FeatureSetId to int

7031753

Retrieve featuresets from repository so that ids are consistent

fd9a9d7

Add uniqueness constraint to FeatureSets, fix tests

d579424

Remove feature and entity references

935602e

zhilingc force-pushed the split-fields branch from 0964635 to 935602e Compare May 4, 2020 11:10

feast-ci-bot added the lgtm label May 5, 2020

feast-ci-bot merged commit 6764630 into feast-dev:master May 5, 2020

zhilingc mentioned this pull request May 5, 2020

Fix DataflowJobManager to update existing job instance instead of creating new one #678

Merged

pmjacinto mentioned this pull request May 27, 2020

Better support for Feast database upgrades #745

Closed
