[#1276] improvment(core): Optimize logic about dropping old version of data in KvGcCollector #2918

Merged: 8 commits, Apr 16, 2024

yuqi1129 commented Apr 12, 2024

What changes were proposed in this pull request?

Introduce a variable that records the last transaction ID already covered by GC. The next GC run starts from that transaction ID instead of from the beginning, which makes the GC incremental.
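
A minimal sketch of the idea, not the actual KvGcCollector code. The class, the `lastCollectedId` field, and the in-memory `TreeMap` standing in for the KV backend are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for the collector; names here are not from the PR.
class IncrementalGcSketch {
  // Transaction IDs mapped to a description of the change they made.
  private final NavigableMap<Long, String> commits = new TreeMap<>();
  // Plays the role of the "last transaction ID" marker described above.
  private long lastCollectedId = 1L;

  void commit(long txId, String change) {
    commits.put(txId, change);
  }

  // Visit only the commits in [lastCollectedId, currentId], then advance the marker.
  List<String> collect(long currentId) {
    List<String> visited =
        new ArrayList<>(commits.subMap(lastCollectedId, true, currentId, true).values());
    lastCollectedId = currentId;
    return visited;
  }

  public static void main(String[] args) {
    IncrementalGcSketch gc = new IncrementalGcSketch();
    gc.commit(10L, "a");
    gc.commit(20L, "b");
    System.out.println(gc.collect(25L)); // first run scans everything so far
    gc.commit(30L, "c");
    System.out.println(gc.collect(35L)); // second run scans only the new commit
  }
}
```

The second call never revisits transactions 10 and 20, which is the saving over a full GC.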

Why are the changes needed?

A full GC over all old versions of the data takes a long time, so we should avoid running it on every cycle.

Fix: #1276

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

Existing tests and local testing.

@yuqi1129 marked this pull request as draft April 12, 2024 09:53
@yuqi1129 self-assigned this Apr 15, 2024
@yuqi1129 marked this pull request as ready for review April 15, 2024 03:36

long lastGCId = getTransactionId(getBinaryTransactionId(commitIdHasBeenCollected));
LOG.info(
"Start to collect data which is modified between '{}({})' and '{}({})'",
Contributor

What is the definition of "between", is it [a, b) or [a, b]?

Contributor Author

Assume the last GC collected all data modified up to and including time a, and the current time is b when the next GC is triggered. This run will only collect data that was modified between time a and time b.

Contributor

So, what is the actual definition of "between"? You still don't explain anything.

Contributor Author

"Between" means a time range with a lower bound and an upper bound. "Between a and b" means we only collect data that was modified in the closed range [a, b]. For example, if the first run happens at '2024-04-16 08:00:00', we do a full GC starting from the minimum time 1, so the range is [1, 1713225600 (2024-04-16 08:00:00)]. If the next run is triggered at '2024-04-16 09:00:00', the range is [1713225600 (2024-04-16 08:00:00), 1713229200 (2024-04-16 09:00:00)].
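
To make the boundaries concrete, here is a small sketch that converts the epoch values 1713225600 and 1713229200 to wall-clock time and checks membership in a closed range. It assumes the wall-clock times are in UTC+8; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class GcRangeSketch {
  // Assumption: the wall-clock times in the example are in UTC+8.
  static final ZoneOffset ZONE = ZoneOffset.ofHours(8);

  // Convert epoch seconds to local date-time in the assumed zone.
  static LocalDateTime toLocal(long epochSeconds) {
    return Instant.ofEpochSecond(epochSeconds).atOffset(ZONE).toLocalDateTime();
  }

  // Closed range [low, high]: both boundaries are included.
  static boolean inRange(long t, long low, long high) {
    return low <= t && t <= high;
  }

  public static void main(String[] args) {
    long firstRun = 1713225600L;
    long secondRun = 1713229200L;
    System.out.println(toLocal(firstRun));
    System.out.println(toLocal(secondRun));
    // The shared boundary belongs to both the first and the second range.
    System.out.println(inRange(firstRun, 1L, firstRun));
    System.out.println(inRange(firstRun, firstRun, secondRun));
  }
}
```

Because both ends are inclusive, data modified exactly at the shared boundary is examined by two consecutive runs. That is only harmless if examining the same data twice has no side effects, which this sketch does not verify.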


LogHelper logHelper = decodeKey(kv.getKey());
LOG.info(
"Physically delete key that has marked deleted: name identifier: '{}', entity type: '{}', createTime: '{}({})', key: '{}'",
Contributor

This line might be longer than 100 characters?

Contributor Author

It seems that Spotless does not handle this case. I'll make the necessary changes.

Bytes.wrap(kv.getKey()));
kvBackend.delete(transactionKey);
}
}
Contributor

What happens here if any of the storage operations (scan, get, delete) fails?

Contributor Author

If any step fails, it throws an exception and the work is repeated on the next run. Consider the following sequence:
1. Scan the transaction IDs.
2. For each transaction ID, check the data in the transaction.
3. Drop the data if it needs to be removed (it is marked deleted or has a newer version).
4. Remove the transaction marks.
5. Done.

If a failure occurs anywhere in steps 1 to 5, running collectAndRemoveOldVersionData again resolves it, because the value of commitIdHasBeenCollected does not move forward if anything unexpected happens.
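
The retry behavior can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `RetryableGcSketch`, `lastCollectedId`, and the `Runnable` that bundles the steps are hypothetical names.

```java
// Hypothetical sketch: the marker advances only after every step has succeeded.
class RetryableGcSketch {
  // Stands in for commitIdHasBeenCollected.
  private long lastCollectedId = 1L;

  long lastCollectedId() {
    return lastCollectedId;
  }

  // 'steps' bundles scan, check, drop, and mark removal; any of them may throw.
  void collect(long currentId, Runnable steps) {
    steps.run();                 // an exception here skips the assignment below
    lastCollectedId = currentId; // reached only when all steps succeeded
  }

  public static void main(String[] args) {
    RetryableGcSketch gc = new RetryableGcSketch();
    try {
      gc.collect(50L, () -> {
        throw new IllegalStateException("simulated storage failure");
      });
    } catch (IllegalStateException e) {
      System.out.println("run failed, marker still at " + gc.lastCollectedId());
    }
    gc.collect(50L, () -> {});   // the retry covers the same range again
    System.out.println("retry succeeded, marker at " + gc.lastCollectedId());
  }
}
```

This relies on the individual steps being safe to repeat, since a failed run may already have deleted part of the range before the exception was thrown.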

DateFormatUtils.format(timestamp, TIME_STAMP_FORMAT),
timestamp,
Bytes.wrap(kv.getKey()));
kvBackend.delete(transactionKey);
Contributor

We should print the log only if the "delete" succeeds, so this line should be moved above the log call.
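
A small sketch of the suggested ordering, with hypothetical names and an in-memory map and list standing in for the KV backend and the logger:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: delete first, log only after the delete has returned normally.
class DeleteThenLogSketch {
  final Map<String, String> store = new HashMap<>();
  final List<String> log = new ArrayList<>();
  boolean failDeletes = false; // toggled to simulate a storage failure

  void delete(String key) {
    if (failDeletes) {
      throw new IllegalStateException("simulated storage failure");
    }
    store.remove(key);
  }

  void physicallyDelete(String key) {
    delete(key);                                // may throw; nothing is logged then
    log.add("Physically deleted key: " + key);  // reached only on success
  }

  public static void main(String[] args) {
    DeleteThenLogSketch s = new DeleteThenLogSketch();
    s.store.put("k1", "v1");
    s.physicallyDelete("k1");
    System.out.println(s.log);
  }
}
```

With this ordering, a log entry implies the key is really gone, which keeps the log trustworthy when a delete fails.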

@jerryshao merged commit 6b9a47b into apache:main Apr 16, 2024
22 checks passed
Successfully merging this pull request may close these issues.

[Improvement] Improve logic about dropping old version of KV data
2 participants