Fix autohealing policy update logic. #4264

letitz · 2024-09-20T14:51:38Z

There are two logic fixes:

1. API field names are in camelCase, not snake_case.
2. Previously, we would treat trying to update a single field as an error and delete the policy. It is only an error to supply a single field in the config, but we may legitimately find a single field needs updating.

Also moved config validation to loading time. It would likely be best to explode if the auto-healing policy in the config has a single field set, instead of ignoring it with a warning, but this preserves behavior for now.
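
A minimal sketch of the behavior described above, not the actual ClusterFuzz code. The function names `load_auto_healing_policy` and `policy_needs_update`, and the snake_case config keys `health_check` and `initial_delay_sec`, are hypothetical and chosen only for illustration. The camelCase API fields `healthCheck` and `initialDelaySec` are the ones that appear in the test data for this change.

```python
import logging

# Hypothetical config keys mapped to the camelCase field names used by the API.
_CONFIG_TO_API = {
    'health_check': 'healthCheck',
    'initial_delay_sec': 'initialDelaySec',
}


def load_auto_healing_policy(config):
  """Validates the policy at load time and returns it with API field names.

  Returns an empty dict when the config sets no policy, or when it sets only
  one of the two fields (ignored with a warning, preserving current behavior).
  """
  present = {k: config[k] for k in _CONFIG_TO_API if config.get(k) is not None}
  if not present:
    return {}
  if len(present) != len(_CONFIG_TO_API):
    logging.warning('Ignoring incomplete auto-healing policy: %s', present)
    return {}
  return {_CONFIG_TO_API[k]: v for k, v in present.items()}


def policy_needs_update(current, desired):
  """Compares whole dicts, so a single differing field triggers an update."""
  return current != desired
```

With this split, a config that sets only one field is rejected once at load time, while the update path only compares the complete desired policy against what the API reports, so a change in just `initialDelaySec` is a normal update rather than an error.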

jonathanmetzman · 2024-09-23T13:41:20Z

Awesome, thanks! Do you think some tests can be added, or would you like me to just merge as-is?

letitz · 2024-09-24T09:58:04Z

Thanks for asking! Looking into this made me realize that I need to update the instance group creation path as well as the update path, now that Cluster.auto_healing_policy has a different type:

clusterfuzz/src/clusterfuzz/_internal/cron/manage_vms.py

Lines 308 to 312 in d5d4865

    
    instance_group.create(
        resource_name,
        resource_name,
        size=cpu_count,
        auto_healing_policy=cluster.auto_healing_policy,

We do have tests for config parsing:

clusterfuzz/src/clusterfuzz/_internal/tests/core/google_cloud_utils/compute_engine_projects_test.py

Lines 24 to 28 in d5d4865

    
    def test_load_test_project(self):
      """Test that test config (project test-clusterfuzz) loads without any
      exceptions."""
      self.assertIsNotNone(
          compute_engine_projects.load_project('test-clusterfuzz'))

But they do not cover auto-healing policies:

clusterfuzz/configs/test/gce/clusters.yaml

Lines 16 to 30 in d5d4865

    
    clusters:
      # Regular bots run all task types (e.g fuzzing, minimize, regression,
      # impact, progression, etc).
      clusterfuzz-linux:
        gce_zone: gce-zone  # Change to actual GCE zone (e.g. us-central1-a).
        instance_count: 1   # Change to actual number of instances needed.
        instance_template: clusterfuzz-linux
        distribute: False
      # Pre-emptible bots must have '-pre-' in name. They only run fuzzing tasks.
      clusterfuzz-linux-pre:
        gce_zone: gce-zone  # Change to actual GCE zone (e.g. us-central1-a).
        instance_count: 2   # Change to actual number of instances needed.
        instance_template: clusterfuzz-linux-pre
        distribute: False

I can fix that.

Now, there exists a manage_vms_test.py file: https://github.com/google/clusterfuzz/blob/677920bd4331ea8b654955f86bf8459c37571c5b/src/clusterfuzz/_internal/tests/core/infra/cron/manage_vms_test.py

Sadly I could not get it to run on my machine.

$ python butler.py run_unittests -t appengine -p 'manage_vms_test.py'
$ python butler.py run_unittests -t core -p 'manage_vms_test.py'

Both ran 0 tests.

It seems like it covers applying the auto-healing policy:

clusterfuzz/src/clusterfuzz/_internal/tests/core/infra/cron/manage_vms_test.py

Lines 30 to 39 in 677920b

    
    AUTO_HEALING_POLICY = {
        'healthCheck': 'global/healthChecks/example-check',
        'initialDelaySec': 300
    }
    INSTANCE_GROUPS = {
        'oss-fuzz-linux-zone2-pre-proj2': {
            'targetSize': 1,
            'autoHealingPolicies': [AUTO_HEALING_POLICY],
        },

Which is then checked later:

clusterfuzz/src/clusterfuzz/_internal/tests/core/infra/cron/manage_vms_test.py

Lines 710 to 716 in 677920b

    
    mock_bot_manager.instance_group(
        'oss-fuzz-linux-zone2-pre-proj1').create.assert_called_with(
            'oss-fuzz-linux-zone2-pre-proj1',
            'oss-fuzz-linux-zone2-pre-proj1',
            size=100,
            auto_healing_policy=AUTO_HEALING_POLICY,
            wait_for_instances=False)

Overall, the tests are for OSS-Fuzz code, which relies on ClustersManager code too.

I think I should indeed add tests for the non-OSS-Fuzz case, but I'll need some help figuring out how to run the tests and it will take me a while to get to doing it.
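
As a rough idea of what such a test could look like, here is a self-contained sketch using `unittest.mock`. It is only an illustration under stated assumptions: `create_group` is a hypothetical stand-in for the creation path in manage_vms.py, not the real function, and the group name and size are taken from the test config excerpt rather than from any existing test.

```python
import unittest
from unittest import mock

AUTO_HEALING_POLICY = {
    'healthCheck': 'global/healthChecks/example-check',
    'initialDelaySec': 300,
}


def create_group(bot_manager, name, size, auto_healing_policy):
  """Hypothetical stand-in for the instance group creation path."""
  bot_manager.instance_group(name).create(
      name, name, size=size, auto_healing_policy=auto_healing_policy)


class CreateGroupTest(unittest.TestCase):
  """Checks that the policy dict is forwarded unchanged on creation."""

  def test_policy_is_forwarded(self):
    # MagicMock returns the same child mock for every instance_group() call,
    # so the assertion below inspects the call made inside create_group.
    bot_manager = mock.MagicMock()
    create_group(bot_manager, 'clusterfuzz-linux', 1, AUTO_HEALING_POLICY)
    bot_manager.instance_group('clusterfuzz-linux').create.assert_called_with(
        'clusterfuzz-linux',
        'clusterfuzz-linux',
        size=1,
        auto_healing_policy=AUTO_HEALING_POLICY)
```

A real test would call the actual ClustersManager code instead of `create_group`, and would need whatever mocks the existing manage_vms_test.py sets up.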

jonathanmetzman · 2024-09-25T18:31:22Z

OK I suppose this is fine to merge now.
@vitorguidi Can you merge and deploy this change, since I am waiting for my workstation to work again?

vitorguidi · 2024-09-25T19:09:29Z

> OK I suppose this is fine to merge now. @vitorguidi Can you merge and deploy this change since I am waiting for my workstation to work again.

Will do as soon as I stabilize the cronjob metrics

letitz · 2024-09-26T08:15:47Z

Please don't merge this yet!

Looking into this made me realize that I need to update the instance group creation path as well as the update path, now that Cluster.auto_healing_policy has a different type

letitz added 4 commits September 20, 2024 16:32

Fix auto-healing policy setting logic.

13edc73

Compare dicts.

a971d9a

Fixes.

98d01f6

Fix lint errors.

fe60741

letitz marked this pull request as ready for review September 23, 2024 13:39

letitz requested a review from jonathanmetzman September 23, 2024 13:39
