ClusterFuzz monitoring improvements #4271

Open
vitorguidi opened this issue Sep 25, 2024 · 0 comments

This issue serves as an umbrella for the monitoring initiative.

vitorguidi added a commit that referenced this issue Sep 25, 2024
### Motivation

Kubernetes signals that a cronjob has failed, after exhausting its retries, through
[events](https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/event-v1/).

GKE does not make it easy to alert on this failure mode. For cronjobs
without retries, the failure is evident in the GCP panel. For cronjobs
with retries, the panel shows a success indicator for the last
successful run, and the failure is only registered under the events
panel.

### Alternatives considered

The options available for monitoring failing cronjobs are:

- The container/restart_count metric from
[GKE](https://cloud.google.com/monitoring/api/metrics_kubernetes). This
would be flaky, since a job might succeed on the third attempt. It is
also not easy to pinpoint the cronjob, since we only get the container
name as a label.
- Alerting on a log-based metric derived from the cluster events. The
output of `kubectl get events` gets dumped to Cloud Logging, so we can
create a metric on events of the form "Saw completed job:
oss-fuzz-apply-ccs-28786460, status: Failed". However, this requires
regex manipulation, has to be manually applied across all projects, and
makes it hard to derive the failing cronjob from the container name. It
also adds a hard dependency on Kubernetes.

### Solution

The proposed solution is to reuse the built-in ClusterFuzz metrics
implementation and add a gauge metric, CLUSTERFUZZ_CRON_EXIT_CODE, with
the cron name as a label.

If the metric is at 1, the cronjob is unhealthy; otherwise it is
healthy. An alert must be set to fire whenever any label reaches 1.
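The idea can be sketched as follows; the `GaugeMetric` class and the label/metric wiring here are illustrative stand-ins, not ClusterFuzz's actual monitoring API:

```python
# Minimal sketch of the proposed gauge (class and helper names are assumed,
# not ClusterFuzz's real monitoring API).
class GaugeMetric:
  def __init__(self, name):
    self.name = name
    self._values = {}

  def set(self, value, labels=None):
    key = tuple(sorted((labels or {}).items()))
    self._values[key] = value

  def get(self, labels=None):
    key = tuple(sorted((labels or {}).items()))
    return self._values.get(key)


CLUSTERFUZZ_CRON_EXIT_CODE = GaugeMetric('cron/exit_code')


def run_cron(cron_name, main):
  """Runs a cron entrypoint and records 0 on success, 1 on failure."""
  try:
    exit_code = 0 if main() in (0, None) else 1
  except Exception:
    exit_code = 1
  CLUSTERFUZZ_CRON_EXIT_CODE.set(exit_code, labels={'cron': cron_name})
  return exit_code
```

An alerting policy would then fire whenever the gauge reads 1 for any `cron` label value.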
 
Since cronjobs are ephemeral, there is no need for a thread to
continuously flush metrics. The option to use monitoring without a
flushing thread was added. The same approach can be used to fix metrics
for swarming/batch.
 
Also, the flusher thread was changed to make sure that leftover metrics
are flushed before it stops.
 
Note: this PR is part of this
[initiative](#4271)
vitorguidi added a commit that referenced this issue Oct 7, 2024
As things currently stand, metrics are flushed by a background thread
every 10 minutes. However, for the batch/swarming use case, we would
lose metrics for jobs that finish before this interval elapses. To
handle that, monitor.stop is now called before the bot exits.

Finally, SIGTERM is handled in run_bot to avoid losing metrics when
instances get preempted.
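The SIGTERM handling can be sketched like this; `stop_monitor` is a placeholder for the real monitor shutdown call, and the wiring is an assumption rather than the actual run_bot code:

```python
import signal
import sys


def install_sigterm_flush(stop_monitor):
  """Flushes leftover metrics before exiting when the instance is preempted.

  `stop_monitor` stands in for the monitor-stopping call; this is a sketch,
  not the real run_bot implementation.
  """
  def _handler(signum, frame):
    stop_monitor()  # flush any buffered metrics before dying
    sys.exit(0)

  signal.signal(signal.SIGTERM, _handler)
```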

This PR is part of #4271.
vitorguidi added a commit that referenced this issue Oct 10, 2024
Allowlisting a service account is not enough to get GCP Uptime checks
working to verify that the endpoints on the ClusterFuzz frontend are
available. This change annotates GET handlers with OAuth so the
healthcheck does not get stuck at the login page.

This PR is part of #4271
vitorguidi added a commit that referenced this issue Oct 11, 2024
### Motivation

In order to get healthchecks on App Engine handlers, we need to use GCP
Uptime checks. They authenticate through an [OAuth ID
token](https://cloud.google.com/monitoring/uptime-checks#create). As
things currently stand, OAuth on App Engine only supports access tokens,
so this PR adds ID token support.


### Alternatives considered

It would be ideal to make only one API call, validating either an ID
token or an access token. However, the
[specification](https://auth0.com/docs/secure/tokens/id-tokens/id-token-structure)
for OAuth2 does not define a standard way of differentiating between the
two, given only the token. For that reason, two API calls to GCP are
made.
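The fallback can be sketched as below; the verifier callables are placeholders for the two GCP tokeninfo calls, not the real ClusterFuzz helpers:

```python
# Sketch of the two-call fallback. The verifier callables stand in for the
# real GCP tokeninfo requests; their names and error behavior are assumed.
def authenticate(token, verify_id_token, verify_access_token):
  """Tries ID-token verification first, then falls back to access tokens."""
  try:
    return verify_id_token(token)       # first GCP call
  except ValueError:
    return verify_access_token(token)   # second GCP call, only on failure
```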

Part of #4271
vitorguidi added a commit that referenced this issue Oct 11, 2024
### Motivation

Uptime healthchecks are breaking for the fuzzer/jobs/corpora pages. This
happens because the check_user_access annotation is placed BEFORE the
oauth one, which leads to authorization being
[asserted](https://github.com/google/clusterfuzz/blob/master/src/appengine/libs/access.py#L89)
before credentials are fetched.

This PR fixes the annotation order, making authentication happen before
authorization.
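A toy model of the ordering issue; the decorator names mirror the real ones, but the bodies are illustrative, not the actual ClusterFuzz implementations:

```python
# Toy model: authentication must run before authorization, so the oauth
# decorator must be outermost. Bodies are illustrative only.
def oauth(func):
  def wrapper(self, *args, **kwargs):
    self.user = 'verified-user'  # authentication: fetch credentials
    return func(self, *args, **kwargs)
  return wrapper


def check_user_access(func):
  def wrapper(self, *args, **kwargs):
    # authorization: asserts the credentials that oauth must have fetched
    if getattr(self, 'user', None) is None:
      raise PermissionError('no credentials')
    return func(self, *args, **kwargs)
  return wrapper


class Handler:
  @oauth               # runs first (outermost), so credentials exist...
  @check_user_access   # ...by the time authorization is checked
  def get(self):
    return 'ok'
```

With the order swapped, `check_user_access` would run before any credentials exist and reject the request, which is the bug this PR fixes.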

Part of #4271
vitorguidi added a commit that referenced this issue Oct 14, 2024
### Motivation

Opening and closing bugs is the last step in the clusterfuzz user
journey, and we currently do not have that instrumentation. This PR:

* Enables monitoring in the kubernetes environment
* Enables monitoring on cron jobs
* Moves the wrap_with_monitoring context manager from run_bot.py to
monitoring.py, so it can get reused in run_cron.py
* Collects bugs opened metrics from the triage cronjob
* Collects bugs closed metrics from the cleanup cronjob

The only relevant label for these metrics is the fuzzer name.

For bug filing, we measure how many attempts:
* Succeeded
* Failed
* Got throttled

For bug closing, we measure how many attempts:
* Succeeded
* Failed
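A minimal sketch of such an outcome counter; the metric object and label names are assumptions, not the real ClusterFuzz metric definitions:

```python
# Illustrative counter for bug-filing outcomes, labeled by fuzzer name.
# The metric name and label scheme are assumed, not ClusterFuzz's own.
from collections import Counter

BUG_FILING_RESULT = Counter()


def record_filing_attempt(fuzzer, outcome):
  """Records one attempt; outcome is 'success', 'failed' or 'throttled'."""
  BUG_FILING_RESULT[(fuzzer, outcome)] += 1
```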

Part of #4271
vitorguidi added a commit that referenced this issue Oct 15, 2024
### Motivation

Sync admins only replicates admins to Datastore if they conform to
'user:{email}', but for monitoring purposes (Uptime) it is useful to
also allow service accounts of the form 'serviceAccount:{email}'.

This PR fixes that.

Part of #4271
vitorguidi added a commit that referenced this issue Oct 16, 2024
### Motivation

At the request of the Chrome team, it would be nice to know which:
* OS type
* OS version
* Release (prod/candidate)

a bot corresponds to. This PR implements that.

Part of #4271
vitorguidi added a commit that referenced this issue Oct 23, 2024
### Motivation

We currently have no way to tell whether the Chrome test syncer is
running successfully. This PR implements a counter metric that can be
alerted on if it is missing for 12 hours (or some other arbitrary
threshold).

The monitoring wrapper must be applied in main for this script, since it
is the entrypoint.
Reference:
https://github.com/google/clusterfuzz/blob/master/docker/chromium/tests-syncer/Dockerfile#L17

Part of #4271
vitorguidi added a commit that referenced this issue Oct 23, 2024
### Motivation 

#4312 added the oauth handler to several GET endpoints, in order for GCP
Uptime to probe them. However, the decorator assumed that all handlers
would be of the form func(self), declaring no args or kwargs.

This is not true for the following signatures:
```
coverage_report.py
...
  def get(self, report_type=None, argument=None, date=None, extra=None):
  
fuzzer_stats.py
...
  def get(self, extra=None):
```

This PR adds *args and **kwargs to the wrapper, so it can work for these
endpoints.
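The fix can be sketched as follows; the decorator body is illustrative (the real credential check is elided), but the forwarding of `*args`/`**kwargs` is the point of the change:

```python
import functools


# Sketch of the fixed decorator: forwarding *args/**kwargs lets it wrap
# handlers such as get(self, extra=None). The credential check itself is
# elided here.
def oauth(func):
  @functools.wraps(func)
  def wrapper(self, *args, **kwargs):
    # ...verify the caller's token here...
    return func(self, *args, **kwargs)
  return wrapper


class FuzzerStatsHandler:
  @oauth
  def get(self, extra=None):
    return extra
```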

Part of #4271 

Error groups:

[coverage](https://pantheon.corp.google.com/errors/detail/CKrE1Jfd88vKIQ;service=;version=;filter=%5B%22handler%22%5D;time=P7D;locations=global?e=-13802955&inv=1&invt=AbfeYw&mods=logs_tg_prod&project=clusterfuzz-external)

[stats](https://pantheon.corp.google.com/errors/detail/CMiEwKaYs4DfEA;service=;version=;filter=%5B%22handler%22%5D;time=P7D;locations=global?e=-13802955&inv=1&invt=AbfeYw&mods=logs_tg_prod&project=clusterfuzz-external)
vitorguidi added a commit that referenced this issue Oct 29, 2024
### Motivation

ClusterFuzz only tracks fuzzing time for blackbox fuzzers at the moment.
This PR extends the tracking to engine fuzzers as well, by emitting the
JOB_TOTAL_FUZZ_TIME and FUZZER_TOTAL_FUZZ_TIME metrics.

Since all engine fuzzing is single-process/single-threaded, it suffices
to track the start and end time of each test case run. The only
difference in behavior is that only libFuzzer indicates a timeout, so
all other engines are expected to concentrate their metrics on
timeout=False.
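The start/end bookkeeping can be sketched as a context manager; `emit` is a placeholder for the metric write, not the real `_TrackFuzzTime` implementation:

```python
import contextlib
import time


# Sketch of per-run fuzz time tracking. `emit` stands in for writing the
# JOB_TOTAL_FUZZ_TIME / FUZZER_TOTAL_FUZZ_TIME metrics.
@contextlib.contextmanager
def track_fuzz_time(emit, fuzzer, job, timeout=False):
  start = time.monotonic()
  try:
    yield
  finally:
    emit(fuzzer=fuzzer, job=job,
         duration=time.monotonic() - start,
         timeout=timeout)
```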

### Testing strategy

Ran a fuzz task locally and verified the code path for _TrackFuzzTime is
reached and produces sane output.
Command used:
```
fuzz libFuzzer libfuzzer_asan_log4j2
```

![image](https://github.com/user-attachments/assets/908895e2-16a5-4cf5-843c-d1e57412ff19)

Part of #4271
vitorguidi added a commit that referenced this issue Oct 29, 2024
### Motivation

As part of the initiative to improve clusterfuzz monitoring, it is
necessary to enrich some metrics regarding fuzzing outcomes.

This PR adds the platform field to the following metrics:

- FUZZER_KNOWN_CRASH_COUNT
- FUZZER_NEW_CRASH_COUNT
- FUZZER_RETURN_CODE_COUNT
- FUZZER_TOTAL_FUZZ_TIME
- JOB_KNOWN_CRASH_COUNT
- JOB_TOTAL_FUZZ_TIME
- JOB_NEW_CRASH_COUNT

It also adds the job_type field to the FUZZER_RETURN_CODE_COUNT metric.

Part of #4271
vitorguidi added a commit that referenced this issue Oct 29, 2024
### Motivation

We currently lack metrics for fuzzing session duration. This PR adds
that as a histogram metric, with granularity by fuzzer, job and
platform.

Part of #4271
vitorguidi added a commit that referenced this issue Nov 4, 2024
### Motivation

The Chrome team has no easy visibility into how many manually uploaded
test cases flake or successfully reproduce. This PR implements a counter
metric to track that.

There are three possible outcomes, each represented by a string label:
'reproduces', 'one_timer' and 'does_not_reproduce'.
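One plausible classification rule is sketched below; the exact rule ClusterFuzz uses is an assumption here, not taken from the source:

```python
# One possible mapping from reproduction attempts to the three labels.
# The precise rule (e.g. what counts as a one_timer) is assumed.
def classify_outcome(attempt_results):
  """attempt_results: list of booleans, True where the attempt crashed."""
  crashes = sum(attempt_results)
  if crashes == 0:
    return 'does_not_reproduce'
  if crashes == 1 and len(attempt_results) > 1:
    return 'one_timer'
  return 'reproduces'
```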

Part of #4271
vitorguidi added a commit that referenced this issue Nov 6, 2024
### Motivation

We currently lack metrics for build retrieval and unpacking times. This
PR adds that, with granularity by fuzz target and job type.

There are two different implementations for build downloading/unpacking:

- In the Build class, from which RegularBuild, SplitTargetBuild,
FuchsiaBuild and SymbolizedBuild inherit the downloading/unpacking
behavior
- In the CustomBuild class, which implements its own logic

There are two possible cases for downloading/unpacking: ClusterFuzz
either downloads the whole build and unpacks it locally, or unpacks it
remotely. Remote unpacking is supported by all build types except
CustomBuild.

For build retrieval over http, we do not track download time. For all
the other cases, it suffices to keep track of start/finish time for
download and unpacking.
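The bookkeeping can be sketched like this; `emit` is a placeholder for the metric write, and the downloader/unpacker callables are illustrative:

```python
import time


# Sketch of the start/finish timing around build setup. `emit` stands in
# for the metric write; the callables and label names are assumptions.
def timed_setup(download, unpack, emit, fuzz_target, job_type):
  start = time.monotonic()
  download()
  downloaded = time.monotonic()
  unpack()
  unpacked = time.monotonic()
  emit(fuzz_target=fuzz_target, job_type=job_type,
       download_seconds=downloaded - start,
       unpack_seconds=unpacked - downloaded)
```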

Finally, a _build_type field is added to the constructor of the Build
class, from which all the others inherit. It is used to track the build
type (debug or release), and is only mutated by SymbolizedBuild when
attempting to fetch a debug build.

Part of #4271
vitorguidi added a commit that referenced this issue Nov 8, 2024
### Motivation

Adding a metric to keep track of rate limits

Part of #4271
vitorguidi added a commit that referenced this issue Nov 13, 2024
…4381)

### Motivation


Once a testcase is generated (or manually uploaded), followup tasks
(analyze/progression) are started. This happens by publishing to a
pubsub queue, both for the manually uploaded case, and for the fuzzer
generated case.

If for any reason the messages are not processed, the testcase gets
stuck. To get better visibility into these stuck testcases, the
UNTRIAGED_TESTCASE_AGE metric is introduced, pinpointing how old the
testcases that have not yet been triaged are (more precisely, have not
yet gone through the analyze/regression/impact/progression tasks).


### Attention points

Testcase.timestamp mutates in analyze task:


https://github.com/google/clusterfuzz/blob/6ed80851ad0f6f624c5b232b0460c405f0a018b5/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py#L589

This makes it unreliable as a source of truth for testcase creation
time. To circumvent that, a new ```created``` field is added to the
Testcase entity, from which we can derive the correct creation time.

Since this new field will only apply for testcases created after this PR
is merged, Testcase.timestamp will be used instead to calculate the
testcase age when the new field is missing.
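The age computation with the fallback described above can be sketched as follows; the entity shape is assumed for illustration:

```python
import datetime


# Sketch of the age calculation: prefer the new `created` field and fall
# back to the (mutable) timestamp for pre-existing testcases. The entity
# shape here is assumed.
def untriaged_testcase_age_hours(testcase, now=None):
  now = now or datetime.datetime.utcnow()
  created = getattr(testcase, 'created', None) or testcase.timestamp
  return (now - created).total_seconds() / 3600.0
```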

### Testing strategy

Ran the triage cron locally, and verified the codepath for the metric is
hit and produces sane output (reference testcase: 4505741036158976).

![image](https://github.com/user-attachments/assets/6281b44f-768a-417e-8ec1-763f132c8181)


Part of #4271
vitorguidi added a commit that referenced this issue Nov 13, 2024
### Motivation

We currently lack awareness of how old builds are during fuzz tasks.
This PR implements that, under the assumption that the Last Update Time
metadata field in GCS is a good proxy for build age. [Documentation
reference](https://cloud.google.com/storage/docs/json_api/v1/objects#resource)

### Approach

Symbolized and custom builds do not matter here, so all builds of
interest will be fetched from ```build_manager.setup_regular_build```.
The logic for collecting all bucket paths and the latest revision was
refactored so that ```setup_regular_build``` can also figure out the
latest revision for a given build and conditionally emit the proposed
metric.

### Testing strategy


Tasks were run locally, following the instructions from #4343 and
#4345, and it was verified that the _emmit_build_age_metric function
gets invoked and produces sane output.

Commands used:
```
fuzz libFuzzer libfuzzer_asan_log4j2
```

![image](https://github.com/user-attachments/assets/66937297-20ec-44cf-925e-0004a905c92e)

```
progression 4992158360403968 libfuzzer_asan_qt
```

![image](https://github.com/user-attachments/assets/0e1f1199-d1d8-4da5-814e-8d8409d1f806)

```
analyze 4992158360403968 libfuzzer_asan_qt
```

(Disclaimer: the build revision was overridden mid-flight to force a trunk build, since this testcase was already tied to a crash revision.)

![image](https://github.com/user-attachments/assets/dd3d5a60-36a1-4a9e-a21b-b72177ffdecd)


Part of #4271
vitorguidi added a commit that referenced this issue Nov 13, 2024
### Motivation

Chrome security shepherds manually upload testcases through App Engine,
triggering the analyze task and, in case of a legitimate crash, the
follow-up tasks:
* Minimize
* Analyze
* Impact
* Regression
* The cleanup cronjob, when it updates a bug to inform the user that all
the above stages have finished

This PR adds instrumentation to track the time elapsed between the user
upload, and the completion of the above events.

### Attention points

* TestcaseUploadMetadata.timestamp was being mutated in the preprocess
stage of the analyze task. This mutation was removed, so that this
entity can be the source of truth for when a testcase was in fact
uploaded by the user.

* The job name could be retrieved from the JOB_NAME env var within the
uworker; however, this does not work for the cleanup use case. For this
reason, the job name is fetched from Datastore instead.

* The ```query_testcase_upload_metadata``` method was moved from
analyze_task.py to a helpers file, so it can be reused across tasks and
in the cleanup cronjob.

### Testing strategy

Every task mentioned was executed locally, with a valid uploaded
testcase. The codepath for the metric emission was hit and produced the
desired output, both for the tasks and the cronjob.


Part of #4271