Add computation and retrieval of batch feature statistics #612
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: zhilingc. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Will try to review it tomorrow morning @zhilingc
Force-pushed 1ef6767 to d714016
@@ -908,6 +987,20 @@ def _build_feature_references(
    return features

def _generate_dataset_id(feature_set: FeatureSet) -> str:
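The diff only shows the signature of the helper under discussion. A minimal sketch of what such a helper might do, assuming the id combines the feature set name with a per-invocation timestamp; the import path and the id scheme are assumptions for illustration, not the PR's actual implementation:

```python
import time

from feast.feature_set import FeatureSet  # assumed SDK import path


def _generate_dataset_id(feature_set: FeatureSet) -> str:
    """Return an id that is unique to this invocation of ingest().

    Illustrative sketch only: the PR may derive the id differently.
    """
    return f"{feature_set.name}_{int(time.time() * 1000)}"
```

Because the id incorporates the invocation time, two calls to ingest() with the same feature set produce distinct ids, which is the behavior debated in the thread below.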
The term dataset still seems a bit off to me. The id that we are returning is a job_id or an ingestion_id, since it is unique to every invocation of ingest(). If we refer to it as dataset_id, then I think users would expect it to be associated with their dataset, which means they provide the value, or at the very least the value doesn't change in subsequent runs.
I was wary of letting users set this value since there is a chance they would assign the same value twice. And running the ingestion twice SHOULD have distinct ids because users probably don't want to treat the data as belonging to the same set... I can add the option to override the dataset id, but it really opens up the possibility of users breaking the system.
I don't think it should be a job_id
, which is already an extremely overloaded term, and retrieval of data by ingestion_id
doesn't seem as intuitive as something like a dataset_id
The reason I brought this up is because the current implementation does not fit the typical SOP for our users, who build idempotent data systems. They expect to be able to create a dataset and ingest it based on a name, and have the data system upsert. So basically they just run ingestion until it succeeds and then they are done. They don't care, and often don't know, that we are running Kafka.
And our users definitely don't want to maintain a list of UUIDs. They probably just want to query their datasets by the names they are using, like `2020417-mydataset`.
Regardless, I think your current approach is fine for now, because nothing prevents us from eventually allowing them to set this id if we allow idempotency. I would probably reuse this same dataset_id though.
OK. I've added the option for people to supply their own dataset id, and fixed the SQL so that it only retrieves the latest value for each dataset/entity/event timestamp. This should ensure idempotency.
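A hedged sketch of what the user-supplied id could look like from the SDK side; the dataset_id parameter, endpoint, feature set name, and columns here are assumptions for illustration, not the PR's confirmed API:

```python
import pandas as pd

from feast import Client

# Hypothetical client setup: endpoint is illustrative.
client = Client(core_url="localhost:6565")

df = pd.DataFrame(
    {
        "customer_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(["2020-04-17", "2020-04-17"]),
        "total_purchases": [12, 7],
    }
)

# Re-running with the same dataset id should behave like an upsert:
# retrieval keeps only the latest value per dataset / entity /
# event timestamp, so a retried ingestion does not duplicate data.
client.ingest("customer_purchases", df, dataset_id="2020417-mydataset")
```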
...rc/main/java/feast/storage/connectors/bigquery/statistics/FeatureSetStatisticsQueryInfo.java (outdated, resolved)
...ery/src/main/java/feast/storage/connectors/bigquery/statistics/FieldStatisticsQueryInfo.java (outdated, resolved)
...ery/src/main/java/feast/storage/connectors/bigquery/statistics/FieldStatisticsQueryInfo.java (outdated, resolved)
I made a first run through. The only big thing that stands out is the duplication between entity-level statistics and feature-level statistics. I'll think about how we can simplify this implementation, because as written it is a lot of code to merge in and a bit scary.
Other than that, I am finding it a bit slow going to review the PR because some of the methods require comments. I'd rather not ding the methods and ask for JavaDocs or comments, so would you mind just giving the PR a scan through and adding them? It would be especially useful for
Force-pushed 584c89c to 750b2e0
I'll talk about the data model first, because that is the most important part to get right now.
Assumptions
Thoughts
Recommendation
Alternatively, we could also add support for entity statistics by making all data in a row a "feature", which would then map a bit better to the TensorFlow schema. Essentially
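To make the "everything is a feature" alternative concrete, here is a hedged sketch of how a row's entity and feature columns might both be emitted as FeatureNameStatistics entries in the TensorFlow Metadata statistics proto; the column names and values are invented for illustration:

```python
from tensorflow_metadata.proto.v0 import statistics_pb2

stats = statistics_pb2.DatasetFeatureStatistics()
stats.num_examples = 1000

# Under this alternative, entity columns (e.g. customer_id) get the
# same treatment as feature columns (e.g. total_purchases): one
# FeatureNameStatistics entry each, with no separate entity concept.
for name, mean in [("customer_id", 5500.0), ("total_purchases", 9.4)]:
    feature_stats = stats.features.add()
    feature_stats.name = name
    feature_stats.type = statistics_pb2.FeatureNameStatistics.INT
    feature_stats.num_stats.mean = mean
```

The appeal of this framing is that the proto already has a single repeated features field, so collapsing entities into features avoids inventing a parallel structure for entity statistics.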
Force-pushed 750b2e0 to b430a9b
I think that's a very strong assumption to make. I've been back and forth about entity statistics the past few weeks; my first iteration only involved feature statistics, but I ended up adding support for entity statistics later because I think it is something that is truly valuable to users, for checking id distributions and bounds, etc. I don't think it's
To your point:
This doesn't make sense when you consider the user experience. Do they double up on entity columns so that they can be ingested as features JUST to retrieve statistics? Do they ingest them as a separate dataset? Why do users have to perform hacks to transform a logical unit (the dataset) into something else just to do something that is expectedly a basic requirement?
I'm not sure what sort of complexity you mean here. I think if we do have statistics for entities, it's a lot easier to maintain compatibility with TFDV fields via inheritance, which is the main reason why the class was introduced in the first place. Additional fields that differ between features and entities can be defined in their concrete classes.
Do you mean the fact that they use the same object? I guess it's fine to separate them for the sake of forward compatibility…
Do you mean have FeatureReference as a distinct entity, and have feature sets, features, and statistics reference that entity? I'm not sure if that's the best idea. FeatureReference is part of the definition of Feature, and FeatureStatistics cannot form a relation directly to it. To be exact, with the current implementation, the table definition for Features looks like:
And FeatureStatistics has the following constraint:
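The table definition and constraint referenced above did not survive extraction. As a loose illustration of the idea being described, statistics referencing the Feature table by its composite key rather than a free-standing FeatureReference, here is a rough SQLAlchemy analogue; the real implementation is Java/JPA, and every table and column name below is a guess:

```python
from sqlalchemy import Column, ForeignKeyConstraint, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Feature(Base):
    __tablename__ = "features"
    # Assumed composite key: a feature is identified by its feature set
    # plus its own name, so there is no standalone FeatureReference row.
    feature_set_id = Column(String, primary_key=True)
    name = Column(String, primary_key=True)


class FeatureStatistics(Base):
    __tablename__ = "feature_statistics"
    id = Column(Integer, primary_key=True)
    feature_set_id = Column(String)
    feature_name = Column(String)
    # The statistics row points at a concrete Feature row via the
    # composite key, mirroring the constraint described in the comment.
    __table_args__ = (
        ForeignKeyConstraint(
            ["feature_set_id", "feature_name"],
            ["features.feature_set_id", "features.name"],
        ),
    )
```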
OK, I polled a few data scientists and they acknowledged that there is no need to have entity statistics except for occasional once-off checks, which can be done directly against the DB anyway. I'll remove the entity statistics, but does that mean I remove the entity field schemas introduced in #438 as well?
/test test-end-to-end-batch
/test test-end-to-end
/test test-end-to-end-batch
Force-pushed 1d7cd68 to 7a7a5b5
/test test-end-to-end-batch
@zhilingc: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Would it be feasible to break this pull request up into a few somehow? At a first glance there's a lot that I think is generally beneficial independent of the feature addition of batch statistics, like entity/domain model additions and updates like
I have little doubt that the batch statistics feature addition in itself will necessarily be a large PR no matter how we slice it, but I see the 5,000-line diffstat here and my spirits get deflated that I will need to reserve the majority of a weekend for it sometime in the next few weeks, by which point it will be merged (which is totally fine, just that it feels like a drag to bring people back to it later if there are questions & comments).
Some products alternate technical and feature releases, and I'm starting to feel Feast 0.5 could be split on its two roadmap categories that way… Speaking from the perspective of someone who will upgrade a deployment at some point (i.e. merging down a branch and rolling it out to operation), the scope is getting pretty huge (and that's not yet including this, #502, #554, #533…). It's great work, it's just creating some anxiety for me about the effort and risks of upgrading. Anyway, I'm off topic for this PR, I'll take this elsewhere…
@ches That's fair. I'll do what I can.
Yeah, the scope of 0.5 has increased a lot, which is partly good and partly bad. By splitting, do you mean based on the existing master branch versus the additional functionality we are introducing with open PRs, or would you move out some functionality also merged into master?
My intuition is to put our efforts towards making 0.5 as stable and well tested as possible, so we would probably deploy it side by side and migrate folks over slowly. I would want to include all breaking changes for that matter. I think it's possible to have 0.5 and 0.6 be technical and feature releases, but we will probably roll up to 0.6 quite quickly, which means we won't have much time to maintain the 0.5 branch.
If you feel strongly about splitting this, then let me know. Otherwise we will try to merge these into 0.5 and take extra time in manual functional testing prior to rolling it out.
What this PR does / why we need it:
This PR adds support for retrieval of batch statistics over data ingested into a Feast warehouse store, as proposed in M2 of the Feature Validation RFC.
Note that it deviates from the RFC in the following ways:
This is a bit of a chonky PR, and the code itself requires a bit of cleaning up, hence the WIP status, but refer to the attached notebook for what this implementation looks like for a user of Feast.
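In lieu of the notebook, which is not reproduced here, a hedged sketch of what retrieving batch feature statistics might look like from the Python SDK; the method name, parameters, endpoints, and return type are all assumptions based on the discussion above, not the PR's confirmed API:

```python
from feast import Client

# Hypothetical usage: names and signature are illustrative only.
client = Client(core_url="localhost:6565")

# Ask the warehouse (batch) store for statistics over one ingested
# dataset, identified by the user-supplied dataset id from earlier.
stats = client.get_statistics(
    feature_set_id="customer_purchases",
    store="warehouse",
    features=["total_purchases"],
    dataset_ids=["2020417-mydataset"],
)

# The result is assumed to be a TFDV-compatible statistics proto,
# so it can be fed into TFDV visualization and validation tooling.
print(stats)
```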
Does this PR introduce a user-facing change?: