
Save CI test reports to S3. #19262

Merged
merged 5 commits into from
Jun 8, 2023

Conversation

benjyw
Contributor

@benjyw benjyw commented Jun 6, 2023

This will let us gather data on test performance/flakiness/timeouts.

The first commit is the interesting one. The second is just the regenerated yaml files.

@benjyw benjyw added the category:internal CI, fixes for not-yet-released features, etc. label Jun 6, 2023
@@ -545,8 +573,38 @@ def upload_log_artifacts(self, name: str) -> Step:
     "if": "always()",
     "continue-on-error": True,
     "with": {
-        "name": f"pants-log-{name.replace('/', '_')}-{self.platform_name()}",
-        "path": ".pants.d/pants.log",
+        "name": f"logs-{name.replace('/', '_')}-{self.platform_name()}",
Contributor Author

@benjyw benjyw Jun 6, 2023

While I'm here, renamed to omit the superfluous pants- prefix, for brevity.

"name": f"pants-log-{name.replace('/', '_')}-{self.platform_name()}",
"path": ".pants.d/pants.log",
"name": f"logs-{name.replace('/', '_')}-{self.platform_name()}",
"path": ".pants.d/*.log",
Contributor Author

@benjyw benjyw Jun 6, 2023

While I'm here, this will upload exception.log as well as the pants.log, which we should always have been doing.

@benjyw benjyw force-pushed the benjyw_save_test_reports_to_s3 branch 5 times, most recently from a82535a to 71e83ab Compare June 7, 2023 13:17
@benjyw
Contributor Author

benjyw commented Jun 7, 2023

Right now the path on S3 is rather long:

s3://logs.pantsbuild.org/test/reports/<platform>/<date>/<ref>/<workflow run id>/<workflow run attempt>/

And the <ref> itself can have 4 "path" components: e.g., for a PR it would be refs/pull/19262/merge

For example:

s3://logs.pantsbuild.org/test/reports/Linux-x86_64/2023-06-07/refs/pull/19262/merge/5200484861/1/

But I think this is fine. In practice it's easy to navigate, and very easy to script up a download via the aws cli's wildcarding:

$ aws s3 sync s3://logs.pantsbuild.org/test/reports/Linux-x86_64/2023-06-07/ . --exclude "*" --include "*.xml" 

The date is there so it's easy to download files for a certain date range. We lead with the platform because it's usually good to lead with a low-cardinality dimension, and because we typically investigate tests on a single platform.
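For illustration, the prefix scheme described above can be sketched as a small Python helper. This is hypothetical, not the PR's actual code: in CI the path is assembled from GitHub Actions environment variables, and the commit date (not the wall-clock date used here) is what appears in the key.

```python
import datetime


def s3_report_prefix(platform_name: str, ref: str, run_id: str, run_attempt: str) -> str:
    """Build the test-report key prefix: <platform>/<date>/<ref>/<run id>/<run attempt>/.

    Hypothetical helper mirroring the layout described above.
    """
    date = datetime.date.today().isoformat()  # CI uses the commit date instead
    return f"test/reports/{platform_name}/{date}/{ref}/{run_id}/{run_attempt}/"


prefix = s3_report_prefix("Linux-x86_64", "refs/pull/19262/merge", "5200484861", "1")
```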

@@ -65,7 +70,8 @@ def perform_copy(
     src_prefix: str,
     dst_prefix: str,
     path: str,
-    dst_region: str,
+    region: str,
+    acl: str | None = None,
Contributor Author

@benjyw benjyw Jun 7, 2023

The logs bucket is entirely private and has ACLs disabled, so we must not set an ACL in this case, while still setting one for objects in the binaries bucket.

Contributor

Given it's private, if someone (eg. me) wants to explore the data, how would they do it?

Contributor Author

We can create a user for you under the pants account!

@benjyw benjyw force-pushed the benjyw_save_test_reports_to_s3 branch from 71e83ab to 57be6ed Compare June 7, 2023 13:51
@benjyw
Contributor Author

benjyw commented Jun 7, 2023

I've confirmed that the files are written to S3 as expected.

@benjyw
Contributor Author

benjyw commented Jun 7, 2023

Still TODO: how to gather data on tests that time out at the Pants level, where we kill the pytest process so we won't get a pytest-level report.

@thejcannon
Member

Why not just use the artifacts from GitHub itself? https://github.com/pantsbuild/pants/actions/runs/5200836905

@benjyw
Contributor Author

benjyw commented Jun 7, 2023

Why not just use the artifacts from GitHub itself? https://github.com/pantsbuild/pants/actions/runs/5200836905

Discussed here. We could, but those are harder to work with, because it's not easy to tie them back to a time period, a platform, or even a SHA or PR #. It can be done with a lot of GitHub API tennis, but it's easier to drop into S3, and potentially from there index or further process them using AWS resources (dynamodb for example).
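To illustrate the "API tennis": tying a GHA artifact back to its run's SHA, branch, or PR requires at least one extra REST call per run. A rough sketch, using GitHub's documented "list workflow run artifacts" endpoint; the helper names are mine, not code from this PR:

```python
import json
import urllib.request


def run_artifacts_url(owner: str, repo: str, run_id: int) -> str:
    # Documented endpoint: GET /repos/{owner}/{repo}/actions/runs/{run_id}/artifacts
    return f"https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id}/artifacts"


def list_run_artifacts(owner: str, repo: str, run_id: int, token: str) -> list:
    """Fetch artifact metadata for one workflow run (network call; sketch only)."""
    req = urllib.request.Request(
        run_artifacts_url(owner, repo, run_id),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["artifacts"]


url = run_artifacts_url("pantsbuild", "pants", 5200836905)
```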

@thejcannon
Member

Hmmm OK. In a world where we don't yet have explicit dependable funding I worry about not using the off-the-shelf free thing. But that's also FUD which doesn't have grounding in any numbers. So I'll leave it up to those who understand AWS better. I'm happy we're taking strides to collect metrics/logs.

...speaking of I have a Pants plugin to upload metrics/logs to AWS CloudWatch. Is that something we might want to explore?

@benjyw
Contributor Author

benjyw commented Jun 7, 2023

It's teeny tiny amounts of data/outgoing bandwidth compared to our binary storage on S3. I would like to move away from that first. I agree with cutting AWS costs, but we'd best do that in cost order.

@benjyw
Contributor Author

benjyw commented Jun 7, 2023

...speaking of I have a Pants plugin to upload metrics/logs to AWS CloudWatch. Is that something we might want to explore?

Oooh! What sort of metrics?

@thejcannon
Member

...speaking of I have a Pants plugin to upload metrics/logs to AWS CloudWatch. Is that something we might want to explore?

Oooh! What sort of metrics?

Anything gatherable from a WorkunitsCallback.

Contributor

@huonw huonw left a comment

Looks good overall.

I think the cost aspect of S3 is fine, but there's a larger downside: we can't write to it from PRs from forks. This means we miss a large batch of data, and, in particular, the flakes that affect humans most embarrassingly (discouraging new/irregular contributors: us maintainers/people with push access have already overcome that hurdle). I assume the rate of flakes is similar/the distribution is the same, but it still seems unfortunate to drop that data...

However, the alternative is GHA artifacts, which are hard to query and have a retention time, after which they're deleted. Neither of these are complete blockers... but they certainly don't make GHA a slam-dunk.

Something is better than nothing, and neither option is obviously better than the other, so more than happy to go with S3 and iterate.


For timeouts, I wonder if it's a generally-useful feature for pants to be generating its own junit reports that capture them, for real users to put through their own test dashboards? Either that or zipkin/opentelemetry/... traces for whole builds, that report the work units as timed-out spans? (I'm assuming that metrics reported to CloudWatch might not be saying "test_foo.py timed out" specifically?)

./build-support/bin/copy_to_s3.py \
--src-prefix=dist/test/reports \
--dst-prefix=s3://logs.pantsbuild.org/{s3_path} \
--path=
Contributor

This --path= empty arg looks suspicious. Just confirming it's intentional?

Contributor Author

It is. I could use --path='' to emphasize.

@@ -65,7 +70,8 @@ def perform_copy(
     src_prefix: str,
     dst_prefix: str,
     path: str,
-    dst_region: str,
+    region: str,
+    acl: str | None = None,
Contributor

Given it's private, if someone (eg. me) wants to explore the data, how would they do it?

build-support/bin/generate_github_workflows.py (outdated; resolved)
build-support/bin/generate_github_workflows.py (outdated; resolved)
@benjyw
Contributor Author

benjyw commented Jun 7, 2023

Hmm yeah, this not working for forks is not great. But at least this is some good data to be starting with.

Maybe we can have a cron job that copies the gha artifacts to S3...

@benjyw
Contributor Author

benjyw commented Jun 7, 2023

I had in fact thought of Pants generating its own junit xml for timed-out tests, since they won't have one generated by pytest.
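A minimal sketch of what such a synthesized report could look like, assuming the common junit-xml schema; the helper and the point where Pants would invoke it are hypothetical:

```python
import xml.etree.ElementTree as ET


def timeout_junit_xml(suite: str, test: str, timeout_secs: float) -> str:
    """Synthesize a one-testcase JUnit report marking `test` as errored by timeout."""
    testsuite = ET.Element("testsuite", name=suite, tests="1", errors="1", failures="0")
    testcase = ET.SubElement(testsuite, "testcase", classname=suite, name=test)
    error = ET.SubElement(testcase, "error", message=f"Timed out after {timeout_secs}s")
    error.text = f"Pants killed the test process after {timeout_secs}s; no pytest report exists."
    return ET.tostring(testsuite, encoding="unicode")


report = timeout_junit_xml("tests.python.foo_test", "test_bar", 600.0)
```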

@benjyw
Contributor Author

benjyw commented Jun 7, 2023

...speaking of I have a Pants plugin to upload metrics/logs to AWS CloudWatch. Is that something we might want to explore?

Oooh! What sort of metrics?

Anything gatherable from a WorkunitsCallback.

So reinventing BuildSense ;-)

- Replace the long REF with the short one, and replace any
  slashes in that with underscores.
- Add the job id to the path.
- Set --path="" explicitly.
@benjyw benjyw requested a review from huonw June 7, 2023 20:35
Contributor

@huonw huonw left a comment

Awesome

+ self.platform_name()
+ "/"
+ "$(git show --no-patch --format=%cd --date=format:%Y-%m-%d)/"
+ "${GITHUB_REF_NAME//\\//_}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}/${GITHUB_JOB}"
Contributor

Just double checking: does this still give sensible keys for this PR?

In particular, I'm not at all sure of the value of GITHUB_REF_NAME in a pull request from the GHA docs?

(As a broader point, maybe this step (or the copy-to-S3 script) could print the fully substituted bucket/key-prefix for easier debugging and finding of the artefacts, when looking at the CI logs?)

Contributor Author

It's 12345/merge, which we write as 12345_merge

Contributor

Cool, seems good. Thanks for confirming!

@@ -975,12 +1027,25 @@ jobs:

'
- continue-on-error: true
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
Member

FWIW - Switching to use GH OIDC w/ AWS is more secure (done in the TC repo so you can refer to it for examples)
https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services
AWS keys can leak and need to be rotated from time to time, by using OIDC you can avoid this.
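In generate_github_workflows.py's dict-based style, an OIDC credentials step might look roughly like this. The role ARN is a placeholder and the exact step shape is an assumption, not code from this PR:

```python
def configure_aws_oidc_step(role_arn: str, region: str) -> dict:
    """Sketch of a credentials step using GitHub OIDC instead of long-lived keys."""
    return {
        "name": "Configure AWS credentials via OIDC",
        "uses": "aws-actions/configure-aws-credentials@v2",
        "with": {"role-to-assume": role_arn, "aws-region": region},
    }


# The enclosing job would also need permissions: {"id-token": "write", "contents": "read"}.
step = configure_aws_oidc_step("arn:aws:iam::123456789012:role/gha-ci", "us-east-1")
```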

Contributor Author

Good to know! Will look into that for a followup for all our AWS use in GHA.

@benjyw benjyw merged commit ad20ba4 into main Jun 8, 2023
@benjyw benjyw deleted the benjyw_save_test_reports_to_s3 branch June 8, 2023 02:59
alonsodomin pushed a commit to alonsodomin/pants that referenced this pull request Jun 8, 2023
This will let us gather data on test performance/flakiness/timeouts. 

The first commit is the interesting one. The second is just the
regenerated yaml files.