Error reporting: deduplicate by bot name #4359

Open
oliverchang opened this issue Oct 28, 2024 · 3 comments

@oliverchang
Collaborator

Sometimes, a single machine can cause an error to bubble up to the top of our error reporting dashboard.

For example, https://pantheon.corp.google.com/errors/detail/CNTsq_Sb7qfXSw;locations=global?e=-13802955&inv=1&invt=Abf8Rw&mods=logs_tg_prod&project=clusterfuzz-external occurs 100k+ times a day, but every occurrence comes from a single bot with clock skew issues.

We should investigate whether there is a way to reduce this noise by deduplicating error reporting entries by machine / origin.
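
To make the idea concrete, here is a minimal sketch of what grouping by origin could look like on our side. It assumes we already have error events as `(error_group, bot_name)` pairs; the function name `bot_share` and the input shape are hypothetical, and nothing here calls the GCP Error Reporting API.

```python
from collections import Counter


def bot_share(events):
    """Summarize how concentrated each error group is on a single bot.

    `events` is an iterable of (error_group, bot_name) tuples. This input
    shape is an assumption made for illustration only.

    Returns {error_group: (total_count, distinct_bots, top_bot, top_bot_share)}.
    """
    per_group = {}
    for group, bot in events:
        per_group.setdefault(group, Counter())[bot] += 1

    summary = {}
    for group, counts in per_group.items():
        total = sum(counts.values())
        top_bot, top_count = counts.most_common(1)[0]
        summary[group] = (total, len(counts), top_bot, top_count / total)
    return summary


if __name__ == "__main__":
    sample = [("clock_skew", "bot-1")] * 9 + [("clock_skew", "bot-2")]
    for group, stats in bot_share(sample).items():
        print(group, stats)
```

Ranking error groups by the number of distinct bots instead of the raw count would keep a single misbehaving machine from dominating the top of the list.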

@oliverchang
Collaborator Author

@vitorguidi @alhijazi any thoughts here?

@vitorguidi
Collaborator

It does not seem like we have that flexibility: the knobs for deduplication are not exposed to the end user in GCP (ref). Grouping takes exception and stack trace info into account.

I opened a YAQS question for the GCP logging folks to see whether there is anything we can explore that is not evident in the documentation.

@jonathanmetzman
Collaborator

How about we put something on the bots to keep a single bot from polluting the error reports? Some options:

  1. Exit after hitting a certain number of errors. On Linux the container will be restarted; on Windows we can reboot.
  2. Rate-limit error reporting.
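
Both options could be combined in one small bot-side helper. The following is only a sketch under assumptions: the class name `ErrorReportLimiter`, its parameters, and the thresholds are hypothetical and are not existing ClusterFuzz code. It uses a sliding window to cap reports per time period and a running total to decide when the bot should exit.

```python
import time
from collections import deque


class ErrorReportLimiter:
    """Hypothetical per-bot limiter: caps reports per window, counts all errors."""

    def __init__(self, max_reports, window_seconds, exit_threshold,
                 clock=time.monotonic):
        self.max_reports = max_reports        # reports allowed per window
        self.window_seconds = window_seconds  # sliding window length
        self.exit_threshold = exit_threshold  # total errors before exiting
        self._clock = clock                   # injectable for testing
        self._sent = deque()                  # timestamps of reports sent
        self.total_errors = 0

    def should_report(self):
        """Record one error and return True if it may be reported."""
        now = self._clock()
        self.total_errors += 1
        # Drop timestamps that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_reports:
            self._sent.append(now)
            return True
        return False

    def should_exit(self):
        """True once the bot has seen enough errors that it should restart."""
        return self.total_errors >= self.exit_threshold


if __name__ == "__main__":
    limiter = ErrorReportLimiter(max_reports=2, window_seconds=60,
                                 exit_threshold=5)
    for i in range(5):
        print(i, limiter.should_report(), limiter.should_exit())
```

The bot would call `should_report()` before sending each error and check `should_exit()` afterwards. Suppressed errors still count toward the exit threshold, so a bot stuck in an error loop eventually restarts even while its reports are throttled.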
