Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add annotations attribute to job spans #3

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tcsullens
Copy link

@tcsullens tcsullens commented Oct 24, 2024

Add span attribute for annotation messages from failed jobs to better attribute GHA job failures. This will allow us to collect and analyze root causes of failed jobs.
This implementation collects annotations messages from only expected GitHub/Infra errors and joins each failure message into a ;-separated string as the attribute value. We expect to first collect some data and then determine how best to setup an SLI to track/alert on GitHub/Infra failures.

The actual code changes are in the src/ folder

An example using this test PR, https://github.com/wrapbook/app/pull/30740, sets a new span attribute github.job.annotations to the message of the annotation (Trace Link):
Screenshot 2024-10-24 at 3 09 36 PM

Additional traces from testing: https://ui.honeycomb.io/wrapbook/environments/gha-ci/datasets/ci-v2/result/ANkTYWgHYXi?hideCompare

sc-126750

@tcsullens tcsullens force-pushed the feature/sc-126750/collect-job-annotation-in-github-actions branch from dc192fc to bae9dab Compare October 24, 2024 17:23
@tcsullens tcsullens force-pushed the feature/sc-126750/collect-job-annotation-in-github-actions branch from 469650d to c6452af Compare October 27, 2024 21:00
Comment on lines +173 to +178
// We only care about failure annotations generated by github files,
// excluding excluding annotations from other tools like rspec failures
const annotations = annotationsResponse.data.filter(
(a: JobAnnotation) =>
a.annotation_level === "failure" && a.path === ".github",
);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on investigation of some historical jobs, GitHub and Infra-related failures originate from a path of .github, while an rspec failure originates from a path of the ruby spec file.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. For the infra-related failures, is it possible to perform additional attribution e.g. which failures originated from GitHub vs. our own infrastructure?
  2. Based on the commentary (in docstrings) above, does this mean that we have an infrastructure failure when github.conclusion == failure and github.job.annotations is not null?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the infra-related failures, is it possible to perform additional attribution e.g. which failures originated from GitHub vs. our own infrastructure?

This is the hope! However, the messages in these annotations vary and I think we'll need to gather some data and work to classify distinct messages into buckets, eg. the PR description example, Could not assume role with OIDC... I think we'd classify as a infra-type failure.

Based on the commentary (in docstrings) above, does this mean that we have an infrastructure failure when github.conclusion == failure and github.job.annotations is not null?

Following on my clarification above, that's not necessarily the case as we'd need to inspect the annotation message to really determine root cause; the comments/logic here is just to exclude capturing other types of Job failures, like a rspec failure — does that make sense?

Comment on lines +226 to +228
const annotationMessages = jobAnnotations
.map((a: JobAnnotation) => a.message)
.join(";");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joining several failure messages together isn't ideal, but currently don't see a better way to ensure we're capturing any/all failures (there can be more than one). This still gives us the ability to create an SLI based on matching specific keywords to this string.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, Honeycomb doesn't have support for arrays. As for capturing these failure messages and possibly using these to created an SLI (based on a derived field), we can feature flag this using the ci-experiments action to gather some data on a subset of the failures

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can feature flag this using the ci-experiments action to gather some data on a subset of the failures

We can do that!

@tcsullens tcsullens marked this pull request as ready for review October 28, 2024 14:26
@tcsullens tcsullens changed the title add annotations attribute to job spans feat: add annotations attribute to job spans Oct 28, 2024
@mrobinson513
Copy link

@tcsullens when able could you pls update this repository so that:

  • platform team are all admins
  • main branch is protected

Much thanks

@tcsullens
Copy link
Author

@tcsullens when able could you pls update this repository so that:

  • platform team are all admins
  • main branch is protected

Much thanks

@brett-anderson Do you have permissions to help out here?

@Nulluminati
Copy link

@tcsullens when able could you pls update this repository so that:

  • platform team are all admins
  • main branch is protected

Much thanks

I have added all platform team members as admins on the repo and protected main. There are no required CI checks right now.

Copy link

@brett-anderson brett-anderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks @tcsullens!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants