Optimize historical retrieval with BigQuery offline store #1602
Conversation
Hi @MattDelac. Thanks for your PR. I'm waiting for a feast-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from dfeb89a to e08ae6b
This is great, thanks @MattDelac
These results are very promising. One thing that isn't clear from your test is whether all your entities occur on a single timestamp, or whether a range of timestamps is being used. The pathological case that I would want to test for is a table with super granular events (like real-time features being updated every second) being joined onto a table with comparatively static data (like daily customer data). I think it may make sense to add a proper benchmark covering that case.
At that point it would be trivial to assess whether an optimization in our BQ template has led to improvements. We'd be happy to extend the tests if you don't have bandwidth, but it might take a week or two given our backlog.
The tests with my data were on a single timestamp. But I believe the integration tests cover a range of timestamps.
Yes, I don't see why it should not be supported here ... 🤔 Definitely, if you think it's not covered by the integration tests, we should add those first.
I think they are covered to some degree, but the problem is that we don't use a very large amount of data, nor do we have proper benchmarking. So it's more a matter of being able to see the performance difference between tests than of functional testing.
Force-pushed from e03d519 to fecae33
Force-pushed from fecae33 to 7951926
Codecov Report
@@ Coverage Diff @@
## master #1602 +/- ##
==========================================
- Coverage 83.94% 77.56% -6.38%
==========================================
Files 67 66 -1
Lines 5911 5791 -120
==========================================
- Hits 4962 4492 -470
- Misses 949 1299 +350
/kind housekeeping
/ok-to-test
Force-pushed from 7951926 to 33dcc63
Force-pushed from 69d28c2 to 5d47795
Signed-off-by: Matt Delacour <[email protected]>
Force-pushed from 5d47795 to 5430c46
PR ready for a final round of reviews (and for approval 😉)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MattDelac, woop. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
Signed-off-by: Matt Delacour <[email protected]>
What this PR does / why we need it:
The current SQL template to perform historical retrieval of BigQuery datasets could be more efficient.
Which issue(s) this PR fixes:
As a former data scientist, I have often found that big data engines (Spark, BigQuery, Presto, etc.) are much more efficient with a series of GROUP BYs than with a window function. Moreover, we should avoid ORDER BY operations as much as possible.
So this PR comes with a new approach to computing the point-in-time join, composed solely of JOINs and GROUP BYs.
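To make the idea concrete, here is a minimal BigQuery-style sketch of a point-in-time join written only with JOINs and GROUP BYs. This is an illustration rather than the actual template introduced by this PR: the table names (entity_df, feature_view_a) and column names (user_id, event_timestamp, feature_timestamp, feature_value) are hypothetical, and the real template also has to handle entities with no matching feature rows, duplicates, and TTLs.

```sql
-- Hypothetical inputs:
--   entity_df(user_id, event_timestamp)                       -- rows we want features for
--   feature_view_a(user_id, feature_timestamp, feature_value) -- one row per feature update

-- Step 1: for each entity row, find the latest feature timestamp at or
-- before the entity's event_timestamp, using only a JOIN and a GROUP BY.
WITH latest AS (
    SELECT
        e.user_id,
        e.event_timestamp,
        MAX(f.feature_timestamp) AS feature_timestamp
    FROM entity_df AS e
    JOIN feature_view_a AS f
      ON f.user_id = e.user_id
     AND f.feature_timestamp <= e.event_timestamp
    GROUP BY e.user_id, e.event_timestamp
)

-- Step 2: join back to recover the feature values at that timestamp.
SELECT
    l.user_id,
    l.event_timestamp,
    f.feature_value
FROM latest AS l
JOIN feature_view_a AS f
  ON f.user_id = l.user_id
 AND f.feature_timestamp = l.feature_timestamp
```

A window-function version of the same join would typically compute ROW_NUMBER() OVER (PARTITION BY user_id, event_timestamp ORDER BY feature_timestamp DESC) and keep only the first row, which forces a sort within each partition; the GROUP BY form above avoids that ORDER BY entirely.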
Here are the results of my benchmark:
Context
The API call is the following:
And some idea of the scale of this historical retrieval
- feature_view_A contains ~5B rows and ~3.6B unique "user_id"
- feature_view_B contains ~5B rows and ~3.6B unique "user_id"
- feature_view_C contains ~1.7B rows and ~1.1B unique "user_id"
- feature_view_D contains ~42B rows and ~3.5B unique "user_id"
Results
On the original SQL template
With the SQL template of this PR
So as we can see, the proposed SQL template consumes half the resources of the one currently implemented when we ask for 100M "user_id" values.
Also, because this new SQL template is composed only of JOINs and GROUP BYs, it should scale "indefinitely", unless the data is highly skewed (e.g. a single "user_id" represents 20% of the dataset).
Does this PR introduce a user-facing change?: