Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix autohealing policy update logic. #4264

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

letitz
Copy link
Collaborator

@letitz letitz commented Sep 20, 2024

There are two logic fixes:

  1. API field names are in camelCase, not snake_case.
  2. Previously, we would treat trying to update a single field as an error and delete the policy.
    It is only an error to supply a single field in the config, but we may legitimately find a single field needs updating.

Also moved config validation to loading time. It would likely be best to explode if the auto-healing policy in the config has a single field set, instead of ignoring it with a warning, but this preserves behavior for now.

@letitz letitz marked this pull request as ready for review September 23, 2024 13:39
@jonathanmetzman
Copy link
Collaborator

Awesome thanks! Do you think some tests can be added, or would you like me to just merge as-is.

@letitz
Copy link
Collaborator Author

letitz commented Sep 24, 2024

Thanks for asking! Looking into this made me realize that I need to update the instance group creation path as well as the update path, now that Cluster.auto_healing_policy has a different type:

instance_group.create(
resource_name,
resource_name,
size=cpu_count,
auto_healing_policy=cluster.auto_healing_policy,

We do have tests for config parsing:

def test_load_test_project(self):
"""Test that test config (project test-clusterfuzz) loads without any
exceptions."""
self.assertIsNotNone(
compute_engine_projects.load_project('test-clusterfuzz'))

But they do not cover auto-healing policies:

clusters:
# Regular bots run all task types (e.g fuzzing, minimize, regression,
# impact, progression, etc).
clusterfuzz-linux:
gce_zone: gce-zone # Change to actual GCE zone (e.g. us-central1-a).
instance_count: 1 # Change to actual number of instances needed.
instance_template: clusterfuzz-linux
distribute: False
# Pre-emptible bots must have '-pre-' in name. They only run fuzzing tasks.
clusterfuzz-linux-pre:
gce_zone: gce-zone # Change to actual GCE zone (e.g. us-central1-a).
instance_count: 2 # Change to actual number of instances needed.
instance_template: clusterfuzz-linux-pre
distribute: False

I can fix that.

Now, there exists a manage_vms_test.py file: https://github.com/google/clusterfuzz/blob/677920bd4331ea8b654955f86bf8459c37571c5b/src/clusterfuzz/_internal/tests/core/infra/cron/manage_vms_test.py

Sadly I could not get it to run on my machine.

$ python butler.py run_unittests -t appengine -p 'manage_vms_test.py'
$ python butler.py run_unittests -t core -p 'manage_vms_test.py'

Both ran 0 tests.

It seems like it covers applying the auto-healing policy:

AUTO_HEALING_POLICY = {
'healthCheck': 'global/healthChecks/example-check',
'initialDelaySec': 300
}
INSTANCE_GROUPS = {
'oss-fuzz-linux-zone2-pre-proj2': {
'targetSize': 1,
'autoHealingPolicies': [AUTO_HEALING_POLICY],
},

Which is then checked later:

mock_bot_manager.instance_group(
'oss-fuzz-linux-zone2-pre-proj1').create.assert_called_with(
'oss-fuzz-linux-zone2-pre-proj1',
'oss-fuzz-linux-zone2-pre-proj1',
size=100,
auto_healing_policy=AUTO_HEALING_POLICY,
wait_for_instances=False)

Overall the tests are for OSS-Fuzz code, which relies ClustersManager code too.

I think I should indeed add tests for the non-OSS-Fuzz case, but I'll need some help figuring out how to run the tests and it will take me a while to get to doing it.

@jonathanmetzman
Copy link
Collaborator

OK I suppose this is fine to merge now.
@vitorguidi Can you merge and deploy this change since I am waiting for my workstation to work again.

@vitorguidi
Copy link
Collaborator

OK I suppose this is fine to merge now. @vitorguidi Can you merge and deploy this change since I am waiting for my workstation to work again.

Will do as soon as I stabilize the cronjob metrics

@letitz
Copy link
Collaborator Author

letitz commented Sep 26, 2024

Please don't merge this yet!

Looking into this made me realize that I need to update the instance group creation path as well as the update path, now that Cluster.auto_healing_policy has a different type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants