[HUDI-3884] Adding support to let archival proceed beyond savepointed commits #5350
cc @umehrot2
As discussed, there can be caveats when leaving holes in the active commits timeline, i.e., completed commits are archived between savepoints in the active timeline, causing some file groups to be ignored when getting the latest file groups/slices. Let's think through the scenarios before landing this.
Yes. While loading file groups, we deduce whether a commit is valid based on the logic below.
So, after the first entry in the active timeline, no holes are expected. There may be other places where we make such assumptions, so I am reverting the patch to WIP for now, as this needs more investigation. Since this is very much in the hot path, we want a thorough analysis before putting it up for review.
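The check referred to in this comment is not reproduced in the thread. As a rough illustration of the assumption being discussed, here is a minimal sketch of a validity check that relies on a hole-free active timeline. The class ContiguousTimelineSketch, the method isValidCommit, and the zero-padded instant times are hypothetical stand-ins, not Hudi's actual classes or data.

```java
import java.util.List;

// Hypothetical, simplified model; these are not Hudi's actual classes.
final class ContiguousTimelineSketch {
    // Completed instant times in the active timeline, sorted ascending.
    // Zero-padded strings so that lexicographic order matches time order.
    private final List<String> activeInstants;

    ContiguousTimelineSketch(List<String> activeInstants) {
        this.activeInstants = List.copyOf(activeInstants);
    }

    // True when the instant is older than the first active entry,
    // i.e. it is assumed to have been archived and therefore completed.
    boolean isBeforeTimelineStarts(String instant) {
        return !activeInstants.isEmpty()
                && instant.compareTo(activeInstants.get(0)) < 0;
    }

    // A commit counts as valid if it is in the active timeline or predates it.
    boolean isValidCommit(String instant) {
        return activeInstants.contains(instant) || isBeforeTimelineStarts(instant);
    }

    public static void main(String[] args) {
        // Assumed scenario: 004 was archived while 003 stayed behind, leaving a hole.
        ContiguousTimelineSketch timeline =
                new ContiguousTimelineSketch(List.of("003", "005", "006"));
        System.out.println(timeline.isValidCommit("002")); // before the timeline starts
        System.out.println(timeline.isValidCommit("005")); // present in the timeline
        System.out.println(timeline.isValidCommit("004")); // falls into the hole
    }
}
```

In this sketch the first two calls print true and the last prints false: an archived commit that sits after the first active entry is neither found in the timeline nor treated as "before the timeline starts", so file groups written by it would be ignored.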
@bvaradar @n3nash: I have touched one of the core methods (isBeforeTimelineStarts(String instant)) in this patch. Prior to this patch, we did not allow any holes in the active timeline; this patch loosens that constraint. I have added full context in the description. I would appreciate it if you could review it once to see if there is anything else we need to consider. Thanks in advance for your time and inputs.
@nsivabalan @yihua Closing this in favor of #5837
What is the purpose of the pull request
For example, consider the timeline C1, C2, C3, Savepoint_C3, C4, C5, Savepoint_C5, C6, C7, C8, C9.
Let's say C1, C2, C4, and C6 are archived (with the fix in this patch; without it, archival will not proceed after C2).
So the active timeline is: C3, Savepoint_C3, C5, Savepoint_C5, C7, C8, C9.
If a file group committed with C4 is checked with isBeforeTimelineStarts(String instant), we would return false, because C4 is later than C3, the first entry in the active timeline. So the fix is to find the first unsavepointed commit in the active timeline and treat that as the first entry in the active timeline. Any instant earlier than this first unsavepointed commit will be considered a valid completed commit.
In the above case, that commit is C7 (C6 has been archived), so isBeforeTimelineStarts(C4) will return true.
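The example above can be traced with a small sketch that compares the old check (first entry of the active timeline) with the proposed one (first unsavepointed commit). Everything here is an illustrative assumption: the class SavepointAwareTimelineSketch, its method names, the instant times 001 to 009 standing in for C1 to C9, and the choice to return false when every active commit is savepointed are not taken from the patch itself.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified model of the proposed fix; not Hudi's actual classes.
final class SavepointAwareTimelineSketch {
    private final List<String> activeInstants; // completed commits, sorted ascending
    private final Set<String> savepointed;     // commits that have a savepoint

    SavepointAwareTimelineSketch(List<String> activeInstants, Set<String> savepointed) {
        this.activeInstants = List.copyOf(activeInstants);
        this.savepointed = Set.copyOf(savepointed);
    }

    // First commit in the active timeline that has no savepoint, if any.
    Optional<String> firstUnsavepointedCommit() {
        return activeInstants.stream()
                .filter(instant -> !savepointed.contains(instant))
                .findFirst();
    }

    // Old behavior: compare against the very first active entry.
    boolean isBeforeTimelineStartsOld(String instant) {
        return !activeInstants.isEmpty()
                && instant.compareTo(activeInstants.get(0)) < 0;
    }

    // Proposed behavior: compare against the first unsavepointed commit.
    // Assumption: if every active commit is savepointed, return false.
    boolean isBeforeTimelineStarts(String instant) {
        return firstUnsavepointedCommit()
                .map(start -> instant.compareTo(start) < 0)
                .orElse(false);
    }

    public static void main(String[] args) {
        // Active timeline from the example: C3, C5 (both savepointed), C7, C8, C9.
        SavepointAwareTimelineSketch timeline = new SavepointAwareTimelineSketch(
                List.of("003", "005", "007", "008", "009"),
                Set.of("003", "005"));
        System.out.println(timeline.isBeforeTimelineStartsOld("004")); // old check for C4
        System.out.println(timeline.isBeforeTimelineStarts("004"));    // proposed check for C4
    }
}
```

Under these assumptions the old check prints false for C4 and the proposed check prints true, which matches the behavior described above.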
Brief change log
Adds the config hoodie.archive.proceed.beyond.savepoints, which guards this behavior. The default value is false, to retain the old behavior.
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.