Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MultipleTenancy Support to the FeatureTable #1514

Closed
qiuleiSFDC opened this issue Apr 28, 2021 · 14 comments
Closed

Add MultipleTenancy Support to the FeatureTable #1514

qiuleiSFDC opened this issue Apr 28, 2021 · 14 comments
Labels
wontfix This will not be worked on

Comments

@qiuleiSFDC
Copy link

qiuleiSFDC commented Apr 28, 2021

Is your feature request related to a problem? Please describe.
As a SaaS company, our infrastructure is entirely multi-tenant to ensure the separation of different customers’ data. We authorize API calls and store data on a per-tenant basis.
On the surface, we'd need something like the following in Feast:
(1). Ability to define an optional TenantSpec tenant attribute in FeatureTableSpec that defines the name + type of a tenant field for feature table keys (e.g., name = 'TenantId'; type = 'string')
(2). Adding an optional tenant attribute to ingestion and serving APIs (+SDK methods)
(3). Transparently adding the tenant to storage operations when the tenant attribute is set (e.g., including 'TenantId=foo' in row keys)
More details please refer to discussion in feast-dev/feast-java-old#6.

Describe the solution you'd like
In a high level, for (1) above, we would update the FeatureTable proto as below by adding an optional TenantSpec attribute.

message FeatureTableSpec {
    ...
    // Tenant identifier for data in this feature table.
    TenantSpec tenant = 9;
}

message TenantSpec {
    // Identifier name for tenants (e.g., TenantId). Not updatable.
    string name = 1;

    // Value type of the tenant identifier. Not updatable.
    feast.types.ValueType.Enum value_type = 2;
}

For (2) and (3), we would like to update the feature data retriever and writer API + SDK method accordingly. For example, for retriever, we would add an optional tenant attribute to the retriever method GetOnlineFeaturesRequestV2 as below:

message GetOnlineFeaturesRequestV2 {
    ...
    // Optional tenant to retrieve features for
    feast.types.Value tenant = 6;
}

Similarly, for the writer PushOnlineFeaturesRequest proposed in #1461, we would like to also add a tenant attribute feast.types.Value tenant = 4; as below

message PushOnlineFeaturesRequest {
    // Name of the Feature Table to write the features to.
    string feature_table = 1;
    
    // Optional field to specify project name override.
    string project = 2;
    
    // List of entity and feature values to write
    repeated FieldValues field_values = 3;

     // Optional tenant identifier for values
    feast.types.Value tenant =  4;
    
    message FieldValues {
        // Map of feature or entity name to feature/entity values.
        map<string, feast.types.Value> fields = 1;
    }
}

Please share your thoughts. Thank you.

@woop
Copy link
Member

woop commented Apr 28, 2021

@qiuleiSFDC thanks for the detailed write-up!

Would you mind sharing how tenant would be different from our existing project field? They seem very similar to me.

@aaraujo
Copy link

aaraujo commented Apr 28, 2021

The project field is more-analogous to use case whereas the tenant field specifies who owns specific rows in a feature table.

We'd like to use project as use case and not change its associated access control, sharing, discovery, etc. feature semantics. Adding tenant simply means optionally defining a feature table as "able to store data for multiple tenants", and encoding a tenant id in each row. Then we can scope tenant specific read/write/delete operations to only a single tenant's data. This ensures, for example, that tenantA's read/write/delete requests only access or modify that tenant's data.

Treating project as tenant would mean that we lose the semantics and features of project, since you flatten the relationship between project and feature table. And since each feature store use case can store data for multiple tenants, we'd end up with up to numberOfUseCases * numberOfTenants projects and feature tables in the system.

@woop
Copy link
Member

woop commented Apr 28, 2021

@aaraujo my primary concern is that this API adds complexity to our APIs. We want Feast to be simple for teams to get started with, which includes a data scientist or engineer just deploying it for a single use case. project already complicates the API, and tenant will do the same, even if it's optional.

What if we had separate "advanced" RPCs that contain functionality as above?

@aaraujo are you imagining that a data scientist would add tenant to their feature views/tables during the creation of these objects?

@qiuleiSFDC
Copy link
Author

are you imagining that a data scientist would add tenant to their feature views/tables during the creation of these objects?

I think it depends. If the data scientist(DS) from a company with multiple tenant/org data, the DS would add tenant to the view/table during the creation. If not, they could forget the tenant, which assumes all the data would belong to same organization.

@rohita5l
Copy link

rohita5l commented Apr 28, 2021

Hi @qiuleiSFDC Im a platform engineer at Tecton and was at Salesforce previously :)
What do you think about making the features have entity as the tenantID. If the feature packages have that defined as an entity then all the reads and writes of the features will need to have the entity field set (in your case with tenantID). This would also provide tenant isolation when doing reads and writes. So when you are doing training/inference and want to retreive features you would only get back the tenant's data you wish via passing it as entity. Let me know your thoughts. @woop Please correct me if Im wrong.

@woop
Copy link
Member

woop commented Apr 28, 2021

Yea, your understanding is correct. It seems like entity could work. It's not clear whether there are other concerns like isolation around a tenant entity.

@aaraujo
Copy link

aaraujo commented Apr 28, 2021

What if we had separate "advanced" RPCs that contain functionality as above?

We'd need to duplicate: ApplyFeatureTable, GetOnlineFeaturesV2, PushOnlineFeatures, DeleteOnlineFeatures (proposal forthcoming), etc. and related SDK plumbing. Would it be possible/preferable to have a "basic" SDK that only exposes a minimum set of functionality, and an "advanced" one that exposes all of it?

are you imagining that a data scientist would add tenant to their feature views/tables during the creation of these objects?

As @qiuleiSFDC mentioned it depends on the entities/data that will be served. If it's customer owned entities/data, yes a data scientist would create a multi-tenant table/view (specify a TenantSpec when defining the object) and ML applications that use the table/view would then need to specify a tenant in serving requests. If it's internal (company owned) data the table/view would not be defined this way.

What do you think about making the features have entity as the tenantID

We proposed this initially to end users and the consensus was that while this could technically work, the notion of tenant and entity are fundamentally different. The latter describes the type of data stored whereas the former describes ownership. So from a usability perspective defining a tenant "entity" for a feature table did not seem right, and neither did encoding a request scope attribute into every row. For example, if a request comes in with data for tenantA and tenantB how would the service know which tenant is authorized to make the request if the request itself does not have that defined as a top level request attribute?

@aaraujo
Copy link

aaraujo commented Apr 29, 2021

@woop @qiuleiSFDC one way we might be able to limit the scope of this change is to accept the tenant as a gRPC Metadata key/attribute in serving requests. FeatureTableSpec.TenantSpec then becomes FeatureTableSpec.TenantMetadataSpec:

message TenantMetadataSpec {
    // gRPC Metadata key for tenant (e.g., TenantId). Not updatable.
    string tenant_key = 1;

    // Value type of the tenant identifier. Not updatable.
    feast.types.ValueType.Enum value_type = 2;
}

With that approach we'd only need to change the ApplyFeatureTable API. At a minimum we can simply indicate that a table is multi-tenant by adding a bool multitenant (or similar) property to FeatureTableSpec and get the tenant "by convention" during serving requests (e.g. pull a String typed TenantId from the gRPC metadata). For our purposes we can get by with a hard-coded 'TenantId' attribute that is a String.

@woop
Copy link
Member

woop commented May 6, 2021

Would it be possible/preferable to have a "basic" SDK that only exposes a minimum set of functionality, and an "advanced" one that exposes all of it?

We'd like to make our APIs more discoverable. You can imagine us implementing a REST API with swagger like documentation. So ideally whatever the user sees there isn't overwhelming. I like the idea of having a simple/advanced SDK, but we'd need to make sure that the underlying implementations are 99% the same.

@woop @qiuleiSFDC one way we might be able to limit the scope of this change is to accept the tenant as a gRPC Metadata key/attribute in serving requests. FeatureTableSpec.TenantSpec then becomes FeatureTableSpec.TenantMetadataSpec

I like the idea of using gRPC metadata for accepting the tenant.

With that approach we'd only need to change the ApplyFeatureTable API. At a minimum we can simply indicate that a table is multi-tenant by adding a bool multitenant (or similar) property to FeatureTableSpec and get the tenant "by convention" during serving requests (e.g. pull a String typed TenantId from the gRPC metadata). For our purposes we can get by with a hard-coded 'TenantId' attribute that is a String.

What if we simplified things further and captured tenant configuration as part of labels https://github.com/feast-dev/feast/blob/master/protos/feast/core/FeatureTable.proto#L54? This is the approach that Gojek followed in order to add data quality monitoring into Feast in a way that didn't change the core API.

@aaraujo-sfdc
Copy link

I like the idea of using gRPC metadata for accepting the tenant.

Passing tenant as gRPC metadata would significantly reduce the API blast radius for this feature. We'll explore this a bit more.

What if we simplified things further and captured tenant configuration as part of labels https://github.com/feast-dev/feast/blob/master/protos/feast/core/FeatureTable.proto#L54?

This could also work, but requires adding "special labels" that are well documented and immutable once set. Are we okay with adding such labels?

@qiuleiSFDC
Copy link
Author

qiuleiSFDC commented May 6, 2021

This is the approach that Gojek followed in order to add data quality monitoring into Feast in a way that didn't change the core API.

Is there any open source examples that how people take advantage of the label like this data quality monitoring scenario? Specifically, do the engineers in Gojek use the label mainly in the client side to filter/monitor the data quality? In that sense, I guess the server side(feast service) doesn't care about what labels are out there, and the feast service will just store those labels in the DB? Or does Gojek implement their customized feast API?

What if we simplified things further and captured tenant configuration as part of labels https://github.com/feast-dev/feast/blob/master/protos/feast/core/FeatureTable.proto#L54?

I assume this is doable. But in that sense, I image we would need piece of logic like for reach label {do something} for almost every feast API implementation. Not sure this is ok.?

@aaraujo @woop

@woop
Copy link
Member

woop commented May 11, 2021

This could also work, but requires adding "special labels" that are well documented and immutable once set. Are we okay with adding such labels?

Yes. We're ok with that.

@woop
Copy link
Member

woop commented May 11, 2021

This is the approach that Gojek followed in order to add data quality monitoring into Feast in a way that didn't change the core API.

Is there any open source examples that how people take advantage of the label like this data quality monitoring scenario? Specifically, do the engineers in Gojek use the label mainly in the client side to filter/monitor the data quality? In that sense, I guess the server side(feast service) doesn't care about what labels are out there, and the feast service will just store those labels in the DB? Or does Gojek implement their customized feast API?

Have a look over here https://github.com/feast-dev/feast-spark/blob/master/python/feast_spark/contrib/validation/ge.py#L189

Basically they use a label to attach validation logic to a feature table. The label is then available to the streaming ingestion job which applies the validation on batches of data. Feast has an awareness of the label (it's hardcoded on the label name within the Spark ingestion job). However, if the label is not found then no logic is applied. Meaning you have to "opt in" to the functionality by setting the label.

I assume this is doable. But in that sense, I image we would need piece of logic like for reach label {do something} for almost every feast API implementation. Not sure this is ok.?
@aaraujo @woop

Wouldn't it be possible to isolate the tenant specific logic to within the storage implementation? We don't have to support tenant ids in all storage implementations from the start. Or did you have something else in mind?

@stale
Copy link

stale bot commented Sep 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 8, 2021
@stale stale bot closed this as completed Sep 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

5 participants