-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove empty/dup digest in findMissingBlobs calculation #1342
Conversation
Can you provide a rationale for this? We ignore empty blobs because, by definition, all empty blobs are not missing. |
Thanks for looking into. Updated diff summary. I converted this diff to draft as I am still running few experiments. |
It seems that you already knew about this issue 3 years back. https://github.com/bazelbuild/bazel-buildfarm/blob/ad2b823a093a3b1f048e959b5a01f2540e932f28/src/main/java/build/buildfarm/instance/server/AbstractServerInstance.java#L274 :) |
@werkt can you please review it. Thanks in advance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of quick fixes and looks good.
src/main/java/build/buildfarm/instance/shard/ShardInstance.java
Outdated
Show resolved
Hide resolved
### Problem When workers die, their stored references are not removed from the backplane. This creates the possibility that new workers may come up with the same IP address or use an IP address previously used by another terminated host. As a result, the backplane becomes unreliable, requiring us to query each worker individually to find missing blobs. Clearly, this approach is not scalable since any problems encountered by a single worker can significantly impact the performance of the buildfarm. ### Past Work We made code modifications for the `findMissingBlobs` function to exclusively query the backplane, prs: #1310, #1333, and #1342. This update implemented the `findMissingViaBackplane` flag. However, the above issues made the `findMissingViaBackplane` flag ineffective. ### Solution To address the issue of imposter workers, updated code to compare the start time of each worker (first_registered_at) with the insertion time of the digest. Any worker whose start time is later than the digest insertion time is considered an imposter worker. Also, the code removes imposter workers associated with the digest in the same function call. **first_registered_at**: Added new field first_registered_at to the worker data type. This field stores the initial start time of the worker. Worker informs the backplane about its start time, which is the same as the creation time of the cache directory (where all digests are stored) on the worker's disk. **digest insert time**: The digest insertion time is calculated using the Time to Live (TTL) of the digest and the casExpire time. The formula for determining the digest insertion time is now() - configured casExpire + remaining ttl. In the current implementation, each worker updates the TTL of the digest upon completing the write operation. This means that the cas insert time in the backplane corresponds to the time when the last worker finished writing the digest on its disk. ### Testing Deployed the change to our buildfarm staging, and ran full monorepo build. To make sure that the code change solve terminated worker problem, terminated bunch of workers in the middle of build. This caused temporary not_found `error`, which eventually faded away (fmb call autocorrect blob location). <img width="1385" alt="Screenshot 2023-06-21 at 12 36 47 PM" src="https://github.com/bazelbuild/bazel-buildfarm/assets/119983081/62fcf8e0-847a-4632-b49e-cef2c17321dc"> In the above graph terminated workers during first build. ### Future Improvement The above solution might not work if user updates `cas_expire` time between two deployments as algorithm to calculate `digest_insert_time` depends to `cas_expire` time. closes #1371
…update-bazel-ios-fork-2024-07-25 * commit '33ea575934c4c00f5a539ce79fafe90728e912d3': Provide lombok compatible DMS Platform (buildfarm#1352) Restore DequeueMatchSettings Platform (buildfarm#1351) Provide Readonly GRPC Fallback Specification (buildfarm#1347) Remove empty/dup digest in findMissingBlobs calculation (buildfarm#1342) Use a centos version with newer toolchains (buildfarm#1341) as-nobody: don't segfault when run with no arguments (buildfarm#1336) Update tools_remote commit for bazel 6.2 support (buildfarm#1339) fix: FindMissingblobs via backplane introduced in buildfarm#1310 (buildfarm#1333) Giving names to worker threads (buildfarm#1331) Update jekyll-deploy-action and other actions Clean vuln scan for buildfarm-server (buildfarm#1232) [bugfix] [containers] Disable logback to avoid server crashes on startup (buildfarm#1329) Restore storage-only worker registration. [deps] remove uneeded ui_resources (buildfarm#1327) [bugfix] check null for wrapper arguments in executionPolicies (buildfarm#1326) [UI] Add web frontend to buildfarm servers (buildfarm#1325)
Issue: While testing
ensureOutputPresent
flag for buildfarm I found that thegetActionResult
most of the time returns empty digests under keystdout
andstderr
. ThegetActionResult
returnsnot_found
error iffindMissingBlobs
don't return empty list. As empty digest are never stored it will be always present in the missing digest list.The old implementation (querying each worker) also filters out empty digests.
Solution: Query only nonEmptyDigests for findMissingBlobs api.
Issue: I found another problem in the ensureOutputPresent function. When creating a list of digests from the getActionResult output, duplicates are not being removed. This issue leads to failures when trying to retrieve workers for these digests from the backplane. The ImmutableMap class throws an IllegalArgumentException when duplicate keys are encountered, resulting in an "unknown" error. Code related to this issue in the following locations:
Solution: Make sure list of digests are unique while querying missing blobs.