Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[autoscaler] Consolidate CloudWatch agent/dashboard/alarm support; Add unit tests for AWS autoscaler CloudWatch integration #22070

Merged

Conversation

Zyiqin-Miranda
Copy link
Member

Why are these changes needed?

This PR mainly adds two improvements:

  1. We have introduced three CloudWatch Config support in previous PRs: Agent, Dashboard and Alarm. In this PR, we generalize the logic of all three config types by using enum CloudwatchConfigType.
  2. Adds unit tests to ensure the correctness of Ray autoscaler CloudWatch integration behavior.

Related issue number

"Closes #9644 #8967"

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@stale
Copy link

stale bot commented Mar 6, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 6, 2022
@Zyiqin-Miranda Zyiqin-Miranda removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 7, 2022
@pdames
Copy link
Member

pdames commented Mar 9, 2022

As a high-level comment, there's a number of typing mismatches in the existing cloudwatch_helper.py code (e.g. Dict[str, bool] returned here instead of bool] and other relevant warnings that you should be able to see from an IDE like PyCharm configured w/ a Python 3.7+ interpreter. Can we get these resolved as part of this PR to cleanup the code?

@Zyiqin-Miranda Zyiqin-Miranda force-pushed the cloudwatch-pt3-test branch 3 times, most recently from 7d309d6 to 3dfe6bb Compare March 11, 2022 04:04
Copy link
Member

@pdames pdames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a lot of places here where we seem to be confusing types and, in particular, mixing CloudwatchConfigType enum types with their string values, which are introducing breaking changes.

I've added some suggestions here under the premise that it's probably easiest to just convert from enum-to-string as early as possible and then just pass strings around everywhere for backwards compatibility, but there's probably more - can we take another pass to try to clean this up, and make sure integration tests pass for all supported config types?

node_id: str,
retry_failed: bool = True,
) -> bool:
) -> Dict[str, any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Dict[str, any]:
) -> Dict[str, Any]:

@@ -281,19 +303,19 @@ def _replace_config_variables(

def _replace_all_config_variables(
self,
collection: Union[dict, list],
collection: Union[Dict[str, any], str],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
collection: Union[Dict[str, any], str],
collection: Union[Dict[str, Any], List[Dict[str, Any]]],

node_id: str,
cluster_name: str,
region: str,
) -> Tuple[(Union[dict, list], int)]:
) -> Union[str, Dict[str, any]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Union[str, Dict[str, any]]:
) -> Union[str, Dict[str, Any]]:

The input collection must be either a dict or list.
Returns a tuple consisting of the output collection and the number of
modified strings in the collection (which is not necessarily equal to
the number of variables replaced).
"""

for key in collection:
if type(collection) is dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a final else statement that catches unexpected collection types and raises a ValueError like:

raise ValueError(f"Can't replace config variables in unsupported collection type: {type(collection)}"

As it is, you'll wind up with a NameError at line 326 if the collection isn't a dict or list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Patrick, added ValueError check.

@@ -311,7 +333,7 @@ def _replace_all_config_variables(
)
return collection

def _load_config_file(self, config_type: str) -> Dict[str, Any]:
def _load_config_file(self, config_type: CloudwatchConfigType) -> Dict[str, Any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _load_config_file(self, config_type: CloudwatchConfigType) -> Dict[str, Any]:
def _load_config_file(self, config_type: str) -> Union[Dict[str, Any], List[Dict[str, Any]]]:

Comment on lines 43 to 54
CloudwatchConfigType.AGENT: self._replace_cwa_config_variables,
CloudwatchConfigType.DASHBOARD: self._replace_dashboard_config_variables,
CloudwatchConfigType.ALARM: self._load_config_file,
}
self.CLOUDWATCH_CONFIG_TYPE_TO_UPDATE_FUNC_HEAD_NODE = {
CloudwatchConfigType.AGENT: self._restart_cloudwatch_agent,
CloudwatchConfigType.DASHBOARD: self._put_cloudwatch_dashboard,
CloudwatchConfigType.ALARM: self._put_cloudwatch_alarm,
}
self.CLOUDWATCH_CONFIG_TYPE_TO_UPDATE_FUNC_WORKER_NODE = {
CloudwatchConfigType.AGENT: self._restart_cloudwatch_agent,
CloudwatchConfigType.ALARM: self._put_cloudwatch_alarm,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CloudwatchConfigType.AGENT: self._replace_cwa_config_variables,
CloudwatchConfigType.DASHBOARD: self._replace_dashboard_config_variables,
CloudwatchConfigType.ALARM: self._load_config_file,
}
self.CLOUDWATCH_CONFIG_TYPE_TO_UPDATE_FUNC_HEAD_NODE = {
CloudwatchConfigType.AGENT: self._restart_cloudwatch_agent,
CloudwatchConfigType.DASHBOARD: self._put_cloudwatch_dashboard,
CloudwatchConfigType.ALARM: self._put_cloudwatch_alarm,
}
self.CLOUDWATCH_CONFIG_TYPE_TO_UPDATE_FUNC_WORKER_NODE = {
CloudwatchConfigType.AGENT: self._restart_cloudwatch_agent,
CloudwatchConfigType.ALARM: self._put_cloudwatch_alarm,
CloudwatchConfigType.AGENT.value: self._replace_cwa_config_variables,
CloudwatchConfigType.DASHBOARD.value: self._replace_dashboard_config_variables,
CloudwatchConfigType.ALARM.value: self._load_config_file,
}
self.CLOUDWATCH_CONFIG_TYPE_TO_UPDATE_FUNC_HEAD_NODE = {
CloudwatchConfigType.AGENT.value: self._restart_cloudwatch_agent,
CloudwatchConfigType.DASHBOARD.value: self._put_cloudwatch_dashboard,
CloudwatchConfigType.ALARM.value: self._put_cloudwatch_alarm,
}
self.CLOUDWATCH_CONFIG_TYPE_TO_UPDATE_FUNC_WORKER_NODE = {
CloudwatchConfigType.AGENT.value: self._restart_cloudwatch_agent,
CloudwatchConfigType.ALARM.value: self._put_cloudwatch_alarm,

self._update_cloudwatch_config(is_head_node, "alarm")
for config_type in CloudwatchConfigType:
if CloudwatchHelper.cloudwatch_config_exists(
self.provider_config, config_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.provider_config, config_type
self.provider_config, config_type.value

@@ -118,7 +134,9 @@ def _put_cloudwatch_dashboard(self) -> Dict[str, Any]:
dashboard_name_cluster = dashboard_config.get("name", self.cluster_name)
dashboard_name = self.cluster_name + "-" + dashboard_name_cluster

widgets = self._replace_dashboard_config_variables()
widgets = self._replace_dashboard_config_variables(
CloudwatchConfigType.DASHBOARD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CloudwatchConfigType.DASHBOARD
CloudwatchConfigType.DASHBOARD.value

@@ -190,7 +208,9 @@ def _ssm_command_waiter(
command_id = response["Command"]["CommandId"]

cloudwatch_config = self.provider_config["cloudwatch"]
agent_retryer_config = cloudwatch_config.get("agent").get("retryer", {})
agent_retryer_config = cloudwatch_config.get(CloudwatchConfigType.AGENT).get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
agent_retryer_config = cloudwatch_config.get(CloudwatchConfigType.AGENT).get(
agent_retryer_config = cloudwatch_config.get(CloudwatchConfigType.AGENT.value).get(

if CloudwatchHelper.cloudwatch_config_exists(
self.provider_config, config_type
):
self._update_cloudwatch_config(config_type, is_head_node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._update_cloudwatch_config(config_type, is_head_node)
self._update_cloudwatch_config(config_type.value, is_head_node)

@Zyiqin-Miranda Zyiqin-Miranda force-pushed the cloudwatch-pt3-test branch 2 times, most recently from 3d93915 to d91ce72 Compare March 14, 2022 22:58
@stale
Copy link

stale bot commented Apr 16, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Apr 16, 2022
@wuisawesome
Copy link
Contributor

@Zyiqin-Miranda do we still need this pr?

@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Apr 26, 2022
@Zyiqin-Miranda
Copy link
Member Author

@wuisawesome, Yes, sorry for the late response. This PR mainly adds tests for CloudWatch code and add some fixes to typing hints. I think it's helpful to ensure correct behavior of CloudWatch part.

@DmitriGekhtman
Copy link
Contributor

@pdames @wuisawesome could you help review and merge this?

Copy link
Contributor

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, @pdames do you want to take a final look before we merge?

@wuisawesome
Copy link
Contributor

@pdames gentle bump

@stale
Copy link

stale bot commented Jul 7, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jul 7, 2022
@Zyiqin-Miranda Zyiqin-Miranda removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jul 7, 2022
@Zyiqin-Miranda Zyiqin-Miranda force-pushed the cloudwatch-pt3-test branch 2 times, most recently from 7a15fbe to 8f39e03 Compare August 3, 2022 18:39
…d unit tests for AWS autoscaler CloudWatch integration
Copy link
Member

@pdames pdames left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for all of the tests and the code cleanup, @Zyiqin-Miranda!

@pdames
Copy link
Member

pdames commented Aug 8, 2022

@wuisawesome @DmitriGekhtman This should be ready to merge!

@DmitriGekhtman
Copy link
Contributor

@wuisawesome @DmitriGekhtman This should be ready to merge!

Test failures look unrelated. I'm going to push the button.

@DmitriGekhtman DmitriGekhtman merged commit b3f06d9 into ray-project:master Aug 8, 2022
scottsun94 pushed a commit to scottsun94/ray that referenced this pull request Aug 9, 2022
…d unit tests for AWS autoscaler CloudWatch integration (ray-project#22070)

This PR mainly adds two improvements:

We have introduced three CloudWatch Config support in previous PRs: Agent, Dashboard and Alarm. In this PR, we generalize the logic of all three config types by using enum CloudwatchConfigType.
Adds unit tests to ensure the correctness of Ray autoscaler CloudWatch integration behavior.

Signed-off-by: Huaiwei Sun <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…d unit tests for AWS autoscaler CloudWatch integration (ray-project#22070)

This PR mainly adds two improvements:

We have introduced three CloudWatch Config support in previous PRs: Agent, Dashboard and Alarm. In this PR, we generalize the logic of all three config types by using enum CloudwatchConfigType.
Adds unit tests to ensure the correctness of Ray autoscaler CloudWatch integration behavior.

Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants