Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-3884] Adding support to let archival proceed beyond savepointed commits #5350

Closed
wants to merge 2 commits into from

Conversation

nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Apr 18, 2022

What is the purpose of the pull request

  • So far, archival will stop at the first savepointed commit and will not proceed further. For users who may not be interested in incremental queries, but just "as of instant", we can let them proceed further by skipping just the savepointed commit. This opens up new opportunities for hudi users. For eg, one can retain commits for years, by adding one savepoint per day for older commits (say > 30 days old). And they can query hudi using "as of instant" for very old data. If not, one has to retain every commit and let archival stop at first commit which may not be a good experience for users.
  • Had to fix one the core methods in HoodieTimeline with this change boolean isBeforeTimelineStarts(String instant); Please do check the changes here. Prior to this patch, we don't allow any holes in the timeline. As instant is considered committed if its part of active timeline or if its < first entry in active timeline. But since we are letting archival go beyond savepoints, there could be holes in the active timeline.
    For eg; C1, C2, C3, Savepoint_C3, C4, C5, Savepoint_C5, C6, C7, C8, C9.
    Lets say, C1, C2, C4, and C6 are archived (with the fix in this patch. If not, archival will not proceed after C2).

so, active line is : C3, savepoint_C3, C5, Savepoint_C5, C7, C8, C9.
So, if a Filegroup committed with C4 is checked for isBeforeTimelineStarts(String instant), we might return false. So, the fix is to find the first unsavepointed commit in active timeline and treat that as the first entry in active timeline. Any instant < this first unsavepointed commit will be considered a valid completed commit.
So, in above case, its C6. so, isBeforeTimelineStarts(C4) will return true.

Brief change log

  • Added a new config named hoodie.archive.proceed.beyond.savepoints which will guard this behavior. Have set the default value to false to retain old behavior.
  • When this config is enabled, archival will skip savepointed commits and will proceed further to archive other commits. when disabled archival will stop at first savepointed commit.
  • Fixed implementation of isBeforeTimelineStarts(String instant) to cater to holes in active timeline to deduce valid completed commits.

Verify this pull request

This change added tests and can be verified as follows:

  • TestHoodietTimelineArchiver.testSavepointsWithArchival

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@vinothchandar
Copy link
Member

cc @umehrot2

Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, there can be caveats when leaving holes in the active commits timeline, i.e., completed commits are archived between savepoints in the active timeline, causing some file groups to be ignored when getting the latest file groups/slices. Let's think through the scenarios before landing this.

@nsivabalan
Copy link
Contributor Author

nsivabalan commented Apr 19, 2022

yes. While loading file groups, we deduce whether a commit is valid or not based on below logic

  /**
   * A FileSlice is considered committed, if one of the following is true - There is a committed data file - There are
   * some log files, that are based off a commit or delta commit.
   */
  private boolean isFileSliceCommitted(FileSlice slice) {
    String maxCommitTime = lastInstant.get().getTimestamp();
    return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime())
        && HoodieTimeline.compareTimestamps(slice.getBaseInstantTime(), HoodieTimeline.LESSER_THAN_OR_EQUALS, maxCommitTime);
  }

So, after first entry in active timeline, there are no holes expected. May be there are other places where we make such assumptions. So, reverting the patch to WIP for now as this needs some more investigation. Since this is very much in the hot path, we may want to do a thorough analysis before we can put it up for review.

@nsivabalan nsivabalan added the pr:wip Work in Progress/PRs label Apr 19, 2022
@nsivabalan nsivabalan changed the title [HUDI-3884] Adding support to let archival proceed beyond savepointed commits [HUDI-3884][WIP] Adding support to let archival proceed beyond savepointed commits Apr 19, 2022
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Copy link
Contributor Author

@bvaradar @n3nash : I have touched one of the core methods (isBeforeTimelineStarts(String instant)) in this patch. Prior to this patch, we don't allow any holes in the active timeline and in this patch, we are loosening that constraint. I have added full context in the description. Would appreciate if you can review it once to see if there is anything else we need to consider.

appreciate your time and inputs in advance.

@nsivabalan nsivabalan changed the title [HUDI-3884][WIP] Adding support to let archival proceed beyond savepointed commits [HUDI-3884] Adding support to let archival proceed beyond savepointed commits May 4, 2022
@nsivabalan nsivabalan removed the pr:wip Work in Progress/PRs label May 4, 2022
@nsivabalan nsivabalan added the priority:critical production down; pipelines stalled; Need help asap. label May 4, 2022
@codope
Copy link
Member

codope commented Jun 10, 2022

@nsivabalan @yihua Closing this in favor of #5837
Please review.

@codope codope closed this Jun 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:critical production down; pipelines stalled; Need help asap.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants