Save CI test reports to S3. #19262
Conversation
@@ -545,8 +573,38 @@ def upload_log_artifacts(self, name: str) -> Step:
         "if": "always()",
         "continue-on-error": True,
         "with": {
-            "name": f"pants-log-{name.replace('/', '_')}-{self.platform_name()}",
-            "path": ".pants.d/pants.log",
+            "name": f"logs-{name.replace('/', '_')}-{self.platform_name()}",
While I'm here, renamed to omit the superfluous pants- prefix, for brevity.
"name": f"pants-log-{name.replace('/', '_')}-{self.platform_name()}", | ||
"path": ".pants.d/pants.log", | ||
"name": f"logs-{name.replace('/', '_')}-{self.platform_name()}", | ||
"path": ".pants.d/*.log", |
While I'm here, this will upload exception.log as well as the pants.log, which we should always have been doing.
Force-pushed from a82535a to 71e83ab.
Right now the path on S3 is rather long. But I think this is fine: in practice it's easy to navigate, and very easy to script up a download via the aws cli's wildcarding.

The date is there so it's easy to download files for a certain date range. We lead with the platform because it's usually good to lead with a low-cardinality dimension, and because we typically investigate tests on a single platform.
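For illustration, here is a hypothetical key and a matching download command. The layout (platform/date/ref/run id/attempt/job) comes from the step shown later in this thread, but the concrete values are invented:

# Hypothetical object key:
#   s3://logs.pantsbuild.org/Linux-x86_64/2023-06-07/main/1234567890/1/test_python/...
# Pulling down a platform/date slice via the aws CLI's include/exclude wildcarding:
aws s3 cp s3://logs.pantsbuild.org/ ./reports/ --recursive \
  --exclude "*" --include "Linux-x86_64/2023-06-*/*"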
@@ -65,7 +70,8 @@ def perform_copy(
     src_prefix: str,
     dst_prefix: str,
     path: str,
-    dst_region: str,
+    region: str,
+    acl: str | None = None,
The logs bucket is entirely private and has ACLs disabled, so we must not set an ACL in this case, while still setting one for objects in the binaries bucket.
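A rough sketch of the distinction, assuming the script ultimately shells out to aws s3 cp (the bucket names, paths, and flags here are illustrative, not lifted from the script):

# Binaries bucket: objects must be readable, so an ACL is passed explicitly.
aws s3 cp dist/release/foo s3://binaries.pantsbuild.org/foo --acl public-read
# Logs bucket: ACLs are disabled on the bucket, so no --acl flag may be passed at all.
aws s3 cp dist/test/reports/report.xml s3://logs.pantsbuild.org/some/prefix/report.xml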
Given it's private, if someone (e.g. me) wants to explore the data, how would they do it?
We can create a user for you under the pants account!
Force-pushed from 71e83ab to 57be6ed.
I've confirmed that the files are written to S3 as expected.
Still TODO: how to gather data on tests that time out at the Pants level, where we kill the pytest process so we won't get a pytest-level report.
Why not just use the artifacts from GitHub itself? https://github.com/pantsbuild/pants/actions/runs/5200836905
Discussed here. We could, but those are harder to work with, because it's not easy to tie them back to a time period, a platform, or even a SHA or PR #. It can be done with a lot of GitHub API tennis, but it's easier to drop them into S3, and potentially from there index or further process them using AWS resources (DynamoDB, for example).
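For illustration, the kind of lookup this refers to, via the gh CLI (the SHA and run id are placeholders, not real values):

# Find the run ids for a commit, then list each run's artifacts.
gh api "repos/pantsbuild/pants/actions/runs?head_sha=<SHA>" --jq '.workflow_runs[].id'
gh api "repos/pantsbuild/pants/actions/runs/<run-id>/artifacts" --jq '.artifacts[].name'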
Hmmm OK. In a world where we don't yet have explicit dependable funding, I worry about not using the off-the-shelf free thing. But that's also FUD which doesn't have grounding in any numbers. So I'll leave it up to those who understand AWS better. I'm happy we're taking strides to collect metrics/logs. ...Speaking of which, I have a Pants plugin to upload metrics/logs to AWS CloudWatch. Is that something we might want to explore?
It's a teeny tiny amount of data/outgoing bandwidth compared to our binary storage on S3. I would like to move away from that first. I agree with cutting AWS costs, but we'd best do that in cost order.
Oooh! What sort of metrics?
Anything gatherable from a
Looks good overall.
I think the cost aspect of S3 is fine, but there's a larger downside: we can't write to it from PRs from forks. This means we'll be missing a large batch of data, and, in particular, the flakes that affect humans most embarrassingly (discouraging new/irregular contributors: we maintainers/people with push access have already overcome that hurdle). I assume the rate of flakes is similar/the distribution is the same, but it still seems unfortunate to drop that data...
However, the alternative is GHA artifacts, which are hard to query and have a retention time, after which they're deleted. Neither of these are complete blockers... but they certainly don't make GHA a slam-dunk.
Something is better than nothing, and neither option is obviously better than the other, so more than happy to go with S3 and iterate.
For timeouts, I wonder if it's a generally-useful feature for pants to be generating its own junit reports that capture them, for real users to put through their own test dashboards? Either that or zipkin/opentelemetry/... traces for whole builds, that report the work units as timed-out spans? (I'm assuming that metrics reported to CloudWatch might not be saying "test_foo.py timed out" specifically?)
./build-support/bin/copy_to_s3.py \
  --src-prefix=dist/test/reports \
  --dst-prefix=s3://logs.pantsbuild.org/{s3_path} \
  --path=
This empty --path= arg looks suspicious. Just confirming it's intentional?
It is. I could use --path='' to emphasize.
Hmm yeah, this not working for forks is not great. But at least this is some good data to be starting with. Maybe we can have a cron job that copies the GHA artifacts to S3...
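A very rough sketch of what such a cron job might do (the destination prefix and the selection of runs are invented, not part of this PR):

# Download recent runs' artifacts with the gh CLI and mirror them into the logs bucket.
gh run list --repo pantsbuild/pants --limit 50 --json databaseId --jq '.[].databaseId' |
while read -r run_id; do
  gh run download "$run_id" --repo pantsbuild/pants --dir "artifacts/$run_id" || true
  aws s3 cp --recursive "artifacts/$run_id" "s3://logs.pantsbuild.org/gha-artifacts/$run_id/"
done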
I had in fact thought of Pants generating its own JUnit XML for timed-out tests, since they won't have one generated by pytest.
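Purely as a sketch of that idea (the filename and schema are made up; whatever Pants actually emits would be decided separately), a minimal placeholder report for a timed-out test file could be as simple as:

cat > dist/test/reports/timeout-placeholder.xml <<'EOF'
<testsuite name="tests/python/example_test.py" tests="1" errors="1">
  <testcase classname="tests.python.example_test" name="WHOLE_FILE">
    <error message="Killed by the Pants-level timeout before pytest could produce a report."/>
  </testcase>
</testsuite>
EOF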
So reinventing BuildSense ;-)
- Replace the long REF with the short one, and replace any slashes in that with underscores.
- Add the job id to the path.
- Set --path="" explicitly.
Awesome
+ self.platform_name()
+ "/"
+ "$(git show --no-patch --format=%cd --date=format:%Y-%m-%d)/"
+ "${GITHUB_REF_NAME//\\//_}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}/${GITHUB_JOB}"
Just double checking: does this still give sensible keys for this PR?
In particular, from the GHA docs I'm not at all sure what the value of GITHUB_REF_NAME is in a pull request?
(As a broader point, maybe this step (or the copy-to-S3 script) could print the fully substituted bucket/key-prefix for easier debugging and finding of the artefacts, when looking at the CI logs?)
It's 12345/merge, which we write as 12345_merge.
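For reference, that rewrite is plain bash parameter expansion (REF_NAME here is just a stand-in for GITHUB_REF_NAME):

REF_NAME="12345/merge"
echo "${REF_NAME//\//_}"   # prints 12345_merge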
Cool, seems good. Thanks for confirming!
@@ -975,12 +1027,25 @@ jobs:

          '
      - continue-on-error: true
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
FWIW - Switching to use GH OIDC w/ AWS is more secure (done in the TC repo so you can refer to it for examples)
https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services
AWS keys can leak and need to be rotated from time to time; by using OIDC you can avoid this.
Good to know! Will look into that for a followup for all our AWS use in GHA.
This will let us gather data on test performance/flakiness/timeouts. The first commit is the interesting one. The second is just the regenerated yaml files.