Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34039][SQL] ReplaceTable should invalidate cache #31081

Closed
wants to merge 1 commit into from

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Jan 7, 2021

What changes were proposed in this pull request?

This changes ReplaceTableExec/AtomicReplaceTableExec, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the uncacheTable method to DataSourceV2Strategy so that we don't need to pass a Spark session to the v2 exec.

Why are the changes needed?

Similar to SPARK-33492 (#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results.

Does this PR introduce any user-facing change?

Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.

How was this patch tested?

Added a new unit test.

@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38370/

@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38370/

@github-actions github-actions bot added the SQL label Jan 7, 2021
@SparkQA
Copy link

SparkQA commented Jan 7, 2021

Test build #133782 has finished for PR 31081 at commit d8ea7b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @sunchao .
According to the affected version of SPARK-34039, this is merged to master.

cc @cloud-fan , @aokolnychyi

@dongjoon-hyun
Copy link
Member

@sunchao . I tested this case with Apache Spark 3.1.0 RC1 with Iceberg 0.10. This is a correctness issue. How do you think about that? If you agree, please make a backport. (Also, cc @HyukjinKwon )

{code}
spark-sql> SELECT * FROM local.db.table;
4	5

spark-sql> CACHE TABLE v AS SELECT * FROM local.db.table;

spark-sql> SELECT * FROM v;
4	5

spark-sql> REPLACE TABLE local.db.table AS SELECT 100 id, '100' string;

spark-sql> SELECT * FROM v;
4	5

spark-sql> SELECT * FROM local.db.table;
100	100

@sunchao
Copy link
Member Author

sunchao commented Jan 8, 2021

Thanks @dongjoon-hyun . Did the above include this PR? There is another issue SPARK-34052 on handling temporary view in DSv2 which could cause the above as well.

@dongjoon-hyun
Copy link
Member

No~ As I wrote, I used Apache Spark 3.1.0 RC1 with Iceberg 0.10 to see the existing behavior.

@sunchao
Copy link
Member Author

sunchao commented Jan 8, 2021

Got you. Yes I think we should backport this (I'll create one soon). In addition, we may need SPARK-34052 to completely address the issue.

In the second:

spark-sql> SELECT * FROM v;
4	5

IMO, the temporary view v should become invalid once the source table local.db.table is replaced.

@aokolnychyi
Copy link
Contributor

It would be great to see this fixed in 3.1.0. Thanks for the work!

@dongjoon-hyun
Copy link
Member

I agree with you, @sunchao .

sunchao added a commit to sunchao/spark that referenced this pull request Jan 9, 2021
This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec.

Similar to SPARK-33492 (apache#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results.

Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.

Added a new unit test.

Closes apache#31081 from sunchao/SPARK-34039.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@HyukjinKwon
Copy link
Member

Sure, I think RC1 is failed. Thanks for letting me know @dongjoon-hyun. +1 for porting it back.

HyukjinKwon pushed a commit that referenced this pull request Jan 10, 2021
### What changes were proposed in this pull request?

This is a backport of #31081 to branch-3.1.

This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec.

### Why are the changes needed?

Similar to SPARK-33492 (#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results.

### Does this PR introduce _any_ user-facing change?

Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.

### How was this patch tested?

Added a new unit test.

Closes #31100 from sunchao/SPARK-34039-branch-3.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon
Copy link
Member

BTW, do we have more issues to fix like this? I have seen similar cache/uncache fixes often lately. If we have some issues to fix already, it might be best to delay the next RC until we have the fixes.

@sunchao
Copy link
Member Author

sunchao commented Jan 10, 2021

@HyukjinKwon I'm working on #31107 which also seems to be a bug on caching. Yes it seems there are many issues related to this area recently, and I feel we haven't identified all of them yet.

@sunchao
Copy link
Member Author

sunchao commented Jan 10, 2021

BTW some of the cache fixes are not in branch-3.0, @HyukjinKwon @dongjoon-hyun let me know if it's necessary to make the backports.

@dongjoon-hyun
Copy link
Member

Thanks, @sunchao . Yes. It would be great if we have it if the UT fails there in the same way.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 11, 2021

@sunchao and @MaxGekk, can we have an umbrella ticket or epic ticket (if an umbrella ticket is not possible) to group these caching / uncaching issues? Let's block RC until we feel sure these issues are fixed then.

@sunchao
Copy link
Member Author

sunchao commented Jan 11, 2021

@HyukjinKwon yes I've tried to put all these under https://issues.apache.org/jira/browse/SPARK-33507. @MaxGekk do you mind adding yours to here too?

@HyukjinKwon
Copy link
Member

@MaxGekk
if your goal is different from SPARK-33507, you can also file another umbrella ticket and link SPARK-33507 with incooperates link.
If the goal looks similar or a subset, please move and file JIRAs under that one.

@MaxGekk
Copy link
Member

MaxGekk commented Jan 11, 2021

I have converted my tickets to sub-tasks of SPARK-33507

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants