Avoid skewed join between entity_df & feature views #1712

MattDelac · 2021-07-14T21:13:01Z

What this PR does / why we need it:

When we request historical features for multiple entities that have a 1:many relationship between them, we run into a skewed join.

That's basically how the JOIN is performed with the current template
Imagine that driver_id=1 contains millions of rides

And that's what this PR is proposing

Let's look at the statistics of the SQL templates on our use case (4 FeatureViews, 2 entities, entity_dataframe containing 100M rows)

SQL template currently in production
I cancelled the query because it was still running after 25 min.

9:14 AM Query has been running for 25 min

SQL template of this PR
Elapsed time: 2 min 34 sec
Slot time consumed: 1 day 21 hr
Bytes shuffled: 3.27 TB
Bytes spilled to disk: 0 B

Note: On our full entity_dataframe (3B rows), the current SQL template was still running after 45 min, while the SQL template of this PR finished after 15 min.
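
For readers who want to see why a 1:many relationship skews the join, here is a minimal sketch in plain Python. It is not the SQL template from this PR: the data, the `driver_id` lists, and the `join_rows_per_key` helper are hypothetical and only count how many rows an equi-join would emit per key. Deduplicating on the join key is shown as one common mitigation; in a real point-in-time join the timestamp would also be part of the key, and the result would still have to be joined back to the entity rows.

```python
from collections import Counter


def join_rows_per_key(left_keys, right_keys):
    """Return how many output rows an equi-join produces for each key."""
    left, right = Counter(left_keys), Counter(right_keys)
    return {key: left[key] * right[key] for key in left if key in right}


# Hypothetical entity dataframe: one row per ride, keyed by driver_id.
# driver_id=1 has far more rides than the other drivers.
entity_driver_ids = [1] * 1000 + [2] * 3 + [3] * 2

# Hypothetical driver-level feature view: 10 historical rows per driver.
feature_driver_ids = [driver for driver in (1, 2, 3) for _ in range(10)]

# Joining the raw entity rows: the hot key dominates the join output.
naive = join_rows_per_key(entity_driver_ids, feature_driver_ids)

# Joining distinct keys first: every key produces the same amount of work.
deduplicated = join_rows_per_key(set(entity_driver_ids), feature_driver_ids)

print(naive)
print(deduplicated)
```

In this toy data the raw join emits 10,000 rows for `driver_id=1` against 30 and 20 for the other two drivers, while the deduplicated join emits 10 rows for each driver. With millions of rides on one driver, the worker handling that key would become the bottleneck.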

Which issue(s) this PR fixes:

Fixes None

Does this PR introduce a user-facing change?:

NONE

feast-ci-bot · 2021-07-14T21:13:04Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

sdk/python/feast/infra/offline_stores/bigquery.py

codecov-commenter · 2021-07-14T21:17:49Z

Codecov Report

Merging #1712 (c2e08d3) into master (703c4be) will increase coverage by 1.14%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #1712      +/-   ##
==========================================
+ Coverage   83.32%   84.47%   +1.14%     
==========================================
  Files          76       79       +3     
  Lines        6794     7071     +277     
==========================================
+ Hits         5661     5973     +312     
+ Misses       1133     1098      -35

| Flag | Coverage Δ | |
|---|---|---|
| integrationtests | `84.40% <66.66%> (+1.15%)` | ⬆️ |
| unittests | `69.45% <11.11%> (-0.34%)` | ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ | |
|---|---|---|
| sdk/python/feast/infra/offline_stores/bigquery.py | `80.60% <66.66%> (+4.45%)` | ⬆️ |
| sdk/python/feast/entity.py | `88.28% <0.00%> (-8.11%)` | ⬇️ |
| sdk/python/feast/infra/offline_stores/file.py | `93.49% <0.00%> (-3.54%)` | ⬇️ |
| sdk/python/feast/feature_view.py | `84.25% <0.00%> (-0.88%)` | ⬇️ |
| sdk/python/feast/registry.py | `80.82% <0.00%> (-0.48%)` | ⬇️ |
| sdk/python/feast/repo_operations.py | `31.06% <0.00%> (-0.31%)` | ⬇️ |
| sdk/python/tests/test_historical_retrieval.py | `99.09% <0.00%> (-0.01%)` | ⬇️ |
| ...dk/python/tensorflow_metadata/proto/v0/path_pb2.py | `100.00% <0.00%> (ø)` | |
| .../python/tensorflow_metadata/proto/v0/schema_pb2.py | `100.00% <0.00%> (ø)` | |
| ...hon/tensorflow_metadata/proto/v0/statistics_pb2.py | `100.00% <0.00%> (ø)` | |
| ... and 16 more | | |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 703c4be...c2e08d3. Read the comment docs.

woop · 2021-07-14T23:22:34Z

What isn't clear to me is what the before and after of this test is. What is the problem that we are seeing and how do we know we are solving it? I realize it has to do with one-to-many relationships. Can we add a test that uses a 1:many relationship and shows how this test actually fixes the response? Or could we just extend our existing historical retrieval to have one of these relationships?

MattDelac · 2021-07-15T13:20:49Z

> What is the problem that we are seeing and how do we know we are solving it?

This is an optimization problem. Besides running some benchmarks on my side to show you that this new template scales better, I don't have a good idea for a unit test.

> Can we add a test that uses a 1:many relationship and shows how this test actually fixes the response? Or could we just extend our existing historical retrieval to have one of these relationships?

As I was saying, this is an optimization problem. We can extend the current test if we think that the coverage is not enough. A dedicated test for it does not seem like a good option IMO.

Also, I am going to spend time benchmarking the two templates on our use case and will publish as many details as possible in this PR.

woop · 2021-07-16T22:54:24Z

> > What is the problem that we are seeing and how do we know we are solving it?
>
> This is an optimization problem. Besides running some benchmarks on my side to show you that this new template scales better, I don't have a good idea for a unit test.
>
> > Can we add a test that uses a 1:many relationship and shows how this test actually fixes the response? Or could we just extend our existing historical retrieval to have one of these relationships?
>
> As I was saying, this is an optimization problem. We can extend the current test if we think that the coverage is not enough. A dedicated test for it does not seem like a good option IMO.
>
> Also, I am going to spend time benchmarking the two templates on our use case and will publish as many details as possible in this PR.

Thanks. As long as it's purely an optimization change then I don't see a need for a new test. Let me know when/if you feel comfortable merging after your analysis.

Signed-off-by: Matt Delacour <[email protected]>

woop · 2021-07-19T18:39:01Z

/lgtm

feast-ci-bot · 2021-07-19T18:39:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MattDelac, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [woop]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

feast-ci-bot added do-not-merge/work-in-progress do-not-merge/release-note-label-needed labels Jul 14, 2021

feast-ci-bot added the needs-kind label Jul 14, 2021

MattDelac added the ok-to-test label Jul 14, 2021

feast-ci-bot added the size/M label Jul 14, 2021

MattDelac force-pushed the optimize_sql_template branch from 7812c84 to 4ed6e5d Compare July 14, 2021 21:15

MattDelac commented Jul 14, 2021

sdk/python/feast/infra/offline_stores/bigquery.py

feast-ci-bot added release-note-none and removed do-not-merge/release-note-label-needed labels Jul 15, 2021

Avoid skewed join between entity_df and feature views

c2e08d3

Signed-off-by: Matt Delacour <[email protected]>

MattDelac force-pushed the optimize_sql_template branch from 4ed6e5d to c2e08d3 Compare July 19, 2021 13:45

MattDelac added the kind/housekeeping label Jul 19, 2021

feast-ci-bot removed the needs-kind label Jul 19, 2021

MattDelac marked this pull request as ready for review July 19, 2021 13:50

MattDelac requested review from achals, tsotnet, woop and a team as code owners July 19, 2021 13:50

feast-ci-bot removed the do-not-merge/work-in-progress label Jul 19, 2021

MattDelac changed the title ~~Attempt to optimize potential skewed join~~ Optimize potential skewed join between entity_df & feature views Jul 19, 2021

MattDelac changed the title ~~Optimize potential skewed join between entity_df & feature views~~ Avoid skewed join between entity_df & feature views Jul 19, 2021

feast-ci-bot assigned woop Jul 19, 2021

feast-ci-bot added the lgtm label Jul 19, 2021

woop approved these changes Jul 19, 2021

feast-ci-bot added the approved label Jul 19, 2021

feast-ci-bot merged commit 8cfe914 into feast-dev:master Jul 19, 2021

MattDelac deleted the optimize_sql_template branch July 19, 2021 18:50

woop mentioned this pull request Jul 19, 2021

Implement Redshift historical retrieval #1720

Merged

MattDelac mentioned this pull request Jul 27, 2021

Specify unique-row-id column in get_historical_features #1736

Closed
