forked from buildfarm/buildfarm
Upgrade bazel-ios-fork to v2.9.0 #8
Draft
chenj-hub wants to merge 218 commits into bazel-ios-fork from jackies/upgrade-bazel-buildfarm-to-v2.9.0
Throwable indicates that the response to getMessage() may be null.
Cleaned up some links and language
Removed short circuit for executeWorkers, which should never inspire publish. Added memoized recentExecuteWorkers, which will be delayed by the currently const workerSetMaxAge. Cleaned up getStorageWorkers, used grpc Deadline as premium, though already expired is awkward. Storage removal is overzealous and will remove matching execute workers. Received worker changes continue to only affect storage.
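A minimal sketch of the memoization idea in this commit, assuming a loader that returns the current set of execute workers and a maximum age after which the cached set is reloaded. `ExpiringMemo`, `load`, and `max_age_s` are hypothetical names for illustration, not buildfarm identifiers.

```python
import time
from typing import Callable, Optional, Set


class ExpiringMemo:
    """Caches the result of `load` and reuses it until `max_age_s` has elapsed."""

    def __init__(
        self,
        load: Callable[[], Set[str]],
        max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._max_age_s = max_age_s
        self._clock = clock
        self._value: Optional[Set[str]] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> Set[str]:
        now = self._clock()
        # Reload on first use, or once the cached value is at least max_age_s old.
        if self._loaded_at is None or now - self._loaded_at >= self._max_age_s:
            self._value = self._load()
            self._loaded_at = now
        return self._value
```

Callers see a worker set that can lag the true membership by up to `max_age_s`, which is the delay the commit message mentions.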
Adjustments will make use of options. Ensure that they have the specified values from the commandline, and they have no checked exception throws.
FMBs over a specified size limit (4MB in practice from grpc limits) will be split log2n until under the limit to make requests. Tests added to verify this. Coverage of split behavior confirmed. Fixes buildfarm#1375
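The halving described above can be sketched as follows. This is an illustration only: `split_until_under_limit` and `estimated_size` are hypothetical helpers, and the size estimate stands in for whatever serialized request size the real code measures.

```python
from typing import Callable, List, Sequence


def split_until_under_limit(
    digests: Sequence[str],
    size_of: Callable[[Sequence[str]], int],
    limit: int,
) -> List[List[str]]:
    """Halve the batch recursively until every piece fits under `limit`."""
    # A single digest cannot be split further, so it is returned as-is.
    if len(digests) <= 1 or size_of(digests) <= limit:
        return [list(digests)]
    mid = len(digests) // 2
    return (
        split_until_under_limit(digests[:mid], size_of, limit)
        + split_until_under_limit(digests[mid:], size_of, limit)
    )


def estimated_size(digests: Sequence[str]) -> int:
    # Hypothetical estimate: total characters across all digests.
    return sum(len(d) for d in digests)
```

Each level of recursion halves the batch, so a request that is k times over the limit needs roughly log2(k) levels of splitting when digests are of similar size.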
A directory which is missing during the course of validation should not be identified as missing. This prevents a NPE and inspires validation to emit one MISSING violation per path to a missing directory for preconditions. Fixes buildfarm#1374
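A rough sketch of the behavior described above, assuming a directory index keyed by digest. The index layout and the violation dictionaries are hypothetical and do not reproduce buildfarm's actual types.

```python
from typing import Dict, List, Tuple

# Hypothetical index: directory digest -> list of (child name, child digest).
DirectoryIndex = Dict[str, List[Tuple[str, str]]]


def missing_directory_violations(root: str, index: DirectoryIndex) -> List[dict]:
    """Walk the tree and emit one MISSING violation per path to an absent directory."""
    violations = []
    pending = [("", root)]
    while pending:
        path, digest = pending.pop()
        children = index.get(digest)
        if children is None:
            # The directory is absent from the index: record it and do not
            # descend, rather than dereferencing a missing entry.
            violations.append({"type": "MISSING", "path": path, "digest": digest})
            continue
        for name, child_digest in children:
            child_path = f"{path}/{name}" if path else name
            pending.append((child_path, child_digest))
    return violations
```

If the same missing digest is reachable through two paths, the walk reports it twice, once per path. The order of the returned violations depends on traversal order and should not be relied on.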
Directories reevaluated only under the enumeration hierarchy must still be guarded against empty child directories in their checks, and must handle child directories missing in the index safely, with precondition failures matching their outputs. Order is not guaranteed in precondition output, but tests now guard this case. Fixes buildfarm#1299
Include Action Mnemonic, Target Id, and Configuration Id in the bf-cat output suite for RequestMetadata.
An invocation of app.run which fails for any reason in the spring framework will exit silently. Ensure that errors are presented before exiting the application.
* Enable custom latency buckets
* run formatter
* Remove unused load from build
* Run buildifier
* update example config with infinity bucket

---------

Co-authored-by: Trevor Hickey <[email protected]>
Avoid fileStore recalculation for entire trees to be deleted, instead expect the callers to provide a fileStore, and that the entire tree exists within it.
ExecutionExceptions wrap the actual exceptions of futures experienced during putDirectory, and add no tracing capacity. Unwrap these when thrown from failed futures.
Bleed grpc exposure into a retrier for copyExternalInput invocations, and ensure that enough bytes have been provided from the requested blob before returning.
The only presence of arbitrary symlinks in the CAS Filesystem is under directories. Symlinks are explicitly identified as non-readonly-executables. Prevent a dead symlink from throwing NSFE due to the readonly check.
Files are delivered via readdir in utf8 encoding (on linux for xfs at least), assume that posix will mandate this.
* Guard against fetchBlobFromWorker orphanization

  Any exception thrown by fetchBlobFromWorker will leave the blobObserver hanging. Ensure that the observer sees a failure and does not hang.

* Restore FMB functionality included in revert.
Handle ExecDirExceptions in InputFetcher such that the client can observe a FAILED_PRECONDITION with PreconditionFailures that include Violations with VIOLATION_TYPE_MISSING to inspire virtuous loop reestablishment. A ViolationException acts as a container for this, and can be extended to include the components of a putDirectory. Interpret PutDirectoryExceptions with their imposed violations. Missing inputs at the time of execute, whether linked as immediate files or through the directory cache, will be interpreted as VIOLATION_TYPE_MISSING.
### Problem

When workers die, their stored references are not removed from the backplane. This creates the possibility that new workers may come up with the same IP address or use an IP address previously used by another terminated host. As a result, the backplane becomes unreliable, requiring us to query each worker individually to find missing blobs. This approach is not scalable, since any problem encountered by a single worker can significantly impact the performance of the buildfarm.

### Past Work

We modified the `findMissingBlobs` function to exclusively query the backplane (PRs: buildfarm#1310, buildfarm#1333, and buildfarm#1342). This update implemented the `findMissingViaBackplane` flag. However, the above issues made the `findMissingViaBackplane` flag ineffective.

### Solution

To address the issue of imposter workers, the code now compares the start time of each worker (first_registered_at) with the insertion time of the digest. Any worker whose start time is later than the digest insertion time is considered an imposter worker. The code also removes imposter workers associated with the digest in the same function call.

**first_registered_at**: Added a new field, first_registered_at, to the worker data type. This field stores the initial start time of the worker. The worker informs the backplane about its start time, which is the same as the creation time of the cache directory (where all digests are stored) on the worker's disk.

**digest insert time**: The digest insertion time is calculated using the Time to Live (TTL) of the digest and the casExpire time. The formula for determining the digest insertion time is `now() - configured casExpire + remaining ttl`. In the current implementation, each worker updates the TTL of the digest upon completing the write operation. This means that the CAS insert time in the backplane corresponds to the time when the last worker finished writing the digest to its disk.
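The imposter check described under Solution can be sketched as below. This is an illustration under stated assumptions: all times are in seconds, and `digest_insert_time` and `partition_workers` are hypothetical helper names rather than functions from the change itself.

```python
from typing import Dict, Set, Tuple


def digest_insert_time(now_s: float, cas_expire_s: float, remaining_ttl_s: float) -> float:
    """now() - configured casExpire + remaining ttl, as stated in the description."""
    return now_s - cas_expire_s + remaining_ttl_s


def partition_workers(
    first_registered_at: Dict[str, float], insert_time_s: float
) -> Tuple[Set[str], Set[str]]:
    """Split workers recorded for a digest into (genuine, imposters)."""
    # A worker that first registered after the digest was inserted cannot
    # be the host that wrote it, so it is treated as an imposter.
    imposters = {w for w, t in first_registered_at.items() if t > insert_time_s}
    genuine = set(first_registered_at) - imposters
    return genuine, imposters
```

For example, with `now_s=1000`, `cas_expire_s=600`, and `remaining_ttl_s=500`, the digest insert time is 900, so a worker with `first_registered_at=950` would be classified as an imposter for that digest.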
### Testing

Deployed the change to our buildfarm staging and ran a full monorepo build. To make sure that the code change solves the terminated-worker problem, we terminated a bunch of workers in the middle of the build. This caused temporary `not_found` errors, which eventually faded away (the FMB call autocorrects the blob location).

<img width="1385" alt="Screenshot 2023-06-21 at 12 36 47 PM" src="https://github.com/bazelbuild/bazel-buildfarm/assets/119983081/62fcf8e0-847a-4632-b49e-cef2c17321dc">

In the above graph, workers were terminated during the first build.

### Future Improvement

The above solution might not work if the user updates the `cas_expire` time between two deployments, as the algorithm to calculate `digest_insert_time` depends on the `cas_expire` time.

closes buildfarm#1371
When a server cannot acquire a transform token for an extended period of time, assume that it is malfunctioning and initiate a shutdown.
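One way to picture this, as a sketch only: model the transform token as a semaphore permit and treat a timed-out acquire as the signal to shut down. The function name, the semaphore model, and the `shutdown` callback are assumptions for illustration, not the server's actual mechanism.

```python
import threading
from typing import Callable


def acquire_token_or_shutdown(
    tokens: threading.Semaphore,
    timeout_s: float,
    shutdown: Callable[[], None],
) -> bool:
    """Try to take a token; if none arrives within `timeout_s`, start shutdown."""
    if tokens.acquire(timeout=timeout_s):
        return True
    # No token for the whole timeout window: assume the server is
    # malfunctioning and initiate shutdown instead of waiting forever.
    shutdown()
    return False
```

The caller is expected to release the token after a successful acquire. The timeout should be long enough that ordinary contention does not trigger a shutdown.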
Worker loss can signal cascading failure and shutdown for execute-only peers. Ensure that a ReportResultStage seeing an SRE does not close the stage, and that there are basic retries for remote uploads.
Works with current redis-py-cluster 2.1.3
* Guard against writeObserver null race
* Avoid cancellation log for StubWriteOutputStream

  Cancels will happen for all server->worker uploads on context cancels initiated by clients, and are normal behaviors.

* Guarantee null write response for onCompleted

  Avoid a complaint by gRPC that a client-streaming request was completed without a response

* Reset remote CAS write on initial

  Prevents the StubWriteOutputStream from issuing an unnecessary initial queryWriteStatus.
These instances of `format(...)` do not have placeholders and there's nothing to format.
* chore: Update proto file styling
* fix path
* move proto styling to file
* add new line at eof
…dfarm#1549) * fix: Periodically Refresh Active Storage Workers With startTime
…" (buildfarm#1603) This reverts commit 413021d.
Node name strings provided via `cluster slots` will be byte arrays that require decoding. Use the inbuilt SafeEncoder from jedis which is used to decode all other strings.
* build: start adopting bzlmod

  Only four bazel dependencies were found in the existing bzlmod registry (https://registry.bazel.build/)

* build: swap io_bazel_rules_go for bzlmod
* build: swap gazelle for bzlmod
* tests: support bzlmod

  I don't know why. But it works.

* build: leave breadcrumbs for bzlmod migration
* build(chore): MODULE.bazel.lock
* build: swap buildtools for buildifier_prebuilt

  There are conflicts with go tooling between buildtools and protobuf.
The awk was pulling out the denominator of the metric, not the percent. Adjust the field.

Before:
```
current line coverage: 1625%
current function coverage: 340%
```
After:
```
current line coverage: 42%
current function coverage: 51%
```
… entire repo with tags: `helm/X.Y.Z-b1` (buildfarm#1602)

* add helm chart linting stage, bundle chart on helm/* tags
* fix triggers
* try try again
* always lint before bundling
* extract the helm chart version from the git tag ref
* fix name
* tickle the beast
* too much stack overflow
* better names
* that's output
* gussy up readme
* simplify
* stage 0.2.2
* put extraVolumeMounts under .Values.shardWorker
* omit defaults
* leave out emacs gitignore
* wrong example
…ildfarm#1606)

* Publish storage worker and execute worker pool size in prometheus
* run formatter
* add doc

---------

Co-authored-by: Yuriy Belenitsky <[email protected]>
Avoid stripping the existingHash, used to calculate the slot with suffix modifiers for queue balancing. Client presentation may also move to presenting each balanced queue name, removing the coordinated calculation on the client. The name could only have been used by a client prior for name presentation or determination of queue names in redis from slot arrangement.
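For background on why the hash portion of a queue name matters, here is a small sketch of how Redis Cluster maps a key to a slot: CRC16 (XMODEM) of the key modulo 16384, where only the text inside the first non-empty `{...}` hashtag is hashed if one is present. The `key_slot` helper and the example key names are illustrative and not taken from buildfarm.

```python
import binascii


def key_slot(key: str) -> int:
    """Redis Cluster slot for `key`, honoring a `{hashtag}` if present."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty hashtag replaces the full key as the hashed input.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is CRC16/XMODEM.
    return binascii.crc_hqx(data, 0) % 16384
```

Because only the hashtag is hashed, names such as `{foo}queue` and `{foo}queue_1` land in the same slot as `foo` itself. Stripping or altering the hashtag would change the input to the slot calculation.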
* set expire and drop invocationId
* fix format
* patch configuration doc
* fix typo
* re-run ci use luxe's patch
* set expire when every active operation insert
* fix return code
* fix ci if statement

---------

Co-authored-by: wangpengfei.pfwang <[email protected]>
Co-authored-by: George Gensure <[email protected]>
* feat: add OSSF Scorecards workflow

  Adding OSSF Scorecards workflow as github action. It runs periodically. Output goes into Code Scanning alerts of GitHub.

* docs(README): add some badges
  - OSSF Scorecard
  - License (from GitHub)
  - latest Release (from GitHub)
Remove rules_oss_audit and accompanying documentation
* fix template bugs with with
* stage chart v0.2.3
* rework for bitnami redis helm chart
* still doesn't work tho less specified
* disable redis pwd -- not for production use
* chart/bitnami-redis:18.14.2
* and later
* stage chart v0.3.0
…upgrade-bazel-buildfarm-to-v2.9.0
chenj-hub force-pushed the jackies/upgrade-bazel-buildfarm-to-v2.9.0 branch 3 times, most recently from 1141041 to 3977b90 on August 14, 2024 18:02
chenj-hub changed the title from "Jackies/upgrade bazel buildfarm to v2.9.0" to "Upgrade bazel-ios-fork to v2.9.0" on Aug 14, 2024
Resolved git conflicts between v2.9.0 and the current bazel-ios-fork; the resolution can be viewed by running `git show d6c1859 --remerge-diff`.
chenj-hub force-pushed the jackies/upgrade-bazel-buildfarm-to-v2.9.0 branch from e438475 to 9938118 on August 15, 2024 16:03
No description provided.