[#1276] improvement(core): Optimize logic about dropping old version of data in KvGcCollector #2918
core/src/main/java/com/datastrato/gravitino/storage/kv/KvGarbageCollector.java
long lastGCId = getTransactionId(getBinaryTransactionId(commitIdHasBeenCollected));
LOG.info(
  "Start to collect data which is modified between '{}({})' and '{}({})'",
What is the definition of "between": is it [a, b) or [a, b]?
Yeah. Assume the last GC collected data modified up to and including time a, and the current time is b when we trigger another GC. This time, we only collect data that was modified between time a and time b.
So, what is the actual definition of "between"? You still don't explain anything.
"Between" means a specific time range with a lower boundary and an upper boundary. "Between a and b" means we only collect data that was modified in the time range [a, b]. For example, on the first run the current time is '2024-04-16 08:00:00'; we do a full GC and collect data starting from the minimum time 1, so the range is [1, 1713225600(2024-04-16 08:00:00)]. The next trigger time is '2024-04-16 09:00:00', so the range is [1713225600(2024-04-16 08:00:00), 1713229200(2024-04-16 09:00:00)].
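The two windows in this example can be reproduced with a small sketch. It is illustrative only: the class and method names are hypothetical and not part of KvGarbageCollector, and it assumes the wall-clock times quoted above are in UTC+8, which is the offset at which epoch second 1713225600 reads as 08:00:00.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class GcWindowExample {
  // Assumption: the times quoted in the comment are rendered in UTC+8.
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.ofHours(8));

  // Render an epoch-second timestamp as a wall-clock string.
  static String format(long epochSeconds) {
    return FORMAT.format(Instant.ofEpochSecond(epochSeconds));
  }

  // Render a closed window [low, high] in the same "id(time)" style as the log line.
  static String window(long low, long high) {
    return String.format("[%d(%s), %d(%s)]", low, format(low), high, format(high));
  }

  public static void main(String[] args) {
    long firstRun = 1713225600L;       // first GC trigger
    long secondRun = firstRun + 3600L; // next trigger, one hour later

    // First run: full GC from the minimum time 1 up to the first trigger.
    System.out.println(window(1L, firstRun));
    // Second run: incremental GC, starting where the first run stopped.
    System.out.println(window(firstRun, secondRun));
  }
}
```

Because both windows are closed, the boundary value 1713225600 belongs to both of them, which is the point the reviewer's [a, b) versus [a, b] question is probing.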
core/src/main/java/com/datastrato/gravitino/storage/kv/KvGarbageCollector.java
LogHelper logHelper = decodeKey(kv.getKey());
LOG.info(
  "Physically delete key that has marked deleted: name identifier: '{}', entity type: '{}', createTime: '{}({})', key: '{}'",
This line might be longer than 100 characters?
It seems that spotless does not catch it. I'll make the necessary changes.
  Bytes.wrap(kv.getKey()));
kvBackend.delete(transactionKey);
}
}
What will happen here if any of the storage operations (scan, get, delete) fails?
If any step fails, it throws an exception and we repeat the work the next time. Consider the following scenario:
1. Scan the transaction IDs.
2. For each transaction ID, check the data in the transaction.
3. Drop the data if it needs to be deleted (it is marked deleted or has a newer version).
4. Remove the transaction marks.
5. Done.
If anything fails in steps 1 to 5, running collectAndRemoveOldVersionData again will solve it, because the value of commitIdHasBeenCollected does not move forward if anything unexpected happens.
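The retry behaviour described in this reply can be sketched as follows. This is a minimal in-memory model, not the actual KvGarbageCollector code: the `TreeMap` stands in for the KV backend, `failOnce` is a hypothetical hook for injecting a storage failure, and only the names collectAndRemoveOldVersionData and commitIdHasBeenCollected come from the discussion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class IncrementalGcSketch {
  // Hypothetical stand-in for the KV backend: transaction id -> data written by it.
  final TreeMap<Long, String> transactions = new TreeMap<>();
  // Watermark: everything up to and including this id has been collected.
  long commitIdHasBeenCollected = 0L;
  // Hypothetical test hook: fail once when this transaction id is reached.
  Long failOnce = null;

  void collectAndRemoveOldVersionData(long currentId) {
    // Scan transaction ids in the closed range [watermark, currentId].
    List<Long> ids = new ArrayList<>(
        transactions.subMap(commitIdHasBeenCollected, true, currentId, true).keySet());
    for (Long txId : ids) {
      if (txId.equals(failOnce)) {
        failOnce = null;
        throw new IllegalStateException("simulated storage failure at " + txId);
      }
      // Drop the data and remove the transaction mark.
      transactions.remove(txId);
    }
    // Reached only when every step succeeded, so a failure leaves the watermark untouched.
    commitIdHasBeenCollected = currentId;
  }

  public static void main(String[] args) {
    IncrementalGcSketch gc = new IncrementalGcSketch();
    gc.transactions.put(1L, "a");
    gc.transactions.put(2L, "b");
    gc.transactions.put(3L, "c");
    gc.failOnce = 2L;
    try {
      gc.collectAndRemoveOldVersionData(3L);
    } catch (IllegalStateException e) {
      System.out.println("first run failed, watermark = " + gc.commitIdHasBeenCollected);
    }
    gc.collectAndRemoveOldVersionData(3L); // the retry rescans the same range
    System.out.println("retry done, watermark = " + gc.commitIdHasBeenCollected);
  }
}
```

The design relies on the cleanup being idempotent: work already done by the failed run is simply not found again, and the remaining work is picked up because the scan range still starts at the old watermark.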
  DateFormatUtils.format(timestamp, TIME_STAMP_FORMAT),
  timestamp,
  Bytes.wrap(kv.getKey()));
kvBackend.delete(transactionKey);
We should print the log only if the "delete" succeeds, so this line should move above the log line.
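A minimal sketch of the suggested ordering, assuming the delete call signals failure by throwing. The `Backend` interface and the `LOG_LINES` list are hypothetical stand-ins for the real KV backend and logger.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DeleteThenLogSketch {
  // Hypothetical stand-in for the KV backend's delete operation.
  interface Backend {
    void delete(String key) throws IOException;
  }

  // Hypothetical stand-in for the logger, so the effect is observable.
  static final List<String> LOG_LINES = new ArrayList<>();

  static void physicallyDelete(Backend backend, String key) throws IOException {
    // Delete first: if this throws, the log line below is never written.
    backend.delete(key);
    LOG_LINES.add("Physically deleted key: " + key);
  }

  public static void main(String[] args) {
    try {
      physicallyDelete(k -> { throw new IOException("simulated failure"); }, "k1");
    } catch (IOException e) {
      System.out.println("delete failed, log lines: " + LOG_LINES.size());
    }
    try {
      physicallyDelete(k -> { }, "k2");
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
    System.out.println("delete succeeded, log lines: " + LOG_LINES.size());
  }
}
```

With this ordering the log never claims a deletion that did not happen.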
What changes were proposed in this pull request?
Introduce a variable that records the last collected transaction ID, and start the next GC from that transaction ID to achieve incremental GC.
Why are the changes needed?
A full GC of the old versions of the data takes a lot of time, so we should avoid it.
Fix: #1276
Does this PR introduce any user-facing change?
N/A.
How was this patch tested?
Existing tests and local testing.