
[autoscaler] AWS Autoscaler CloudWatch Dashboard support #20266

Merged

1 commit merged into ray-project:master on Jan 10, 2022

Conversation

@Zyiqin-Miranda (Member) commented Nov 11, 2021

Why are these changes needed?

These changes add a set of improvements to enable automatic creation and update of CloudWatch dashboards when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to:

  1. Get rapid insights into their cluster state via CloudWatch dashboards.
  2. Update their CloudWatch dashboard JSON configuration files at `ray up` execution time (see the sketch below).
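For context, CloudWatch dashboards are created or replaced through the PutDashboard API. The following is a minimal boto3 sketch of the kind of call that gets automated during `ray up`; the dashboard name, widget definition, cluster name, and region are illustrative placeholders rather than the exact values the autoscaler uses.

import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cluster_name = "example-cluster"  # illustrative cluster name

# In practice the widgets would come from the user's dashboard JSON config file,
# with placeholders such as the cluster name substituted at `ray up` time.
dashboard_body = {
    "widgets": [{
        "type": "metric",
        "properties": {
            "metrics": [["AWS/EC2", "CPUUtilization"]],
            "title": f"{cluster_name} CPU utilization",
            "period": 300,
        },
    }]
}

# PutDashboard creates the dashboard if it does not exist and replaces it
# otherwise, so re-running it on every `ray up` keeps the dashboard in sync.
cloudwatch.put_dashboard(
    DashboardName=f"{cluster_name}-example-dashboard",
    DashboardBody=json.dumps(dashboard_body))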

Notes:

  1. This PR is a follow-up to [autoscaler] AWS Autoscaler CloudWatch Integration (#18619) and adds dashboard support.

Related issue number

Closes #9644, #8967

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Zyiqin-Miranda changed the title from [autoscaler] AWS Autoscaler CloudWatch Dashboard/Alarm support to [autoscaler] AWS Autoscaler CloudWatch Dashboard and Alarm support on Nov 11, 2021
@Zyiqin-Miranda force-pushed the cloudwatch branch 2 times, most recently from e61d068 to 64e273a on November 11, 2021 21:43
@Zyiqin-Miranda force-pushed the cloudwatch branch 3 times, most recently from 2aa6ff7 to 28ea734 on November 12, 2021 03:25
@DmitriGekhtman (Contributor)

I'm mostly just reviewing to check that existing Ray functionality is unaffected, which looks to be the case.

Will defer to @pdames for a more thorough review, given the domain knowledge involved.

@Zyiqin-Miranda force-pushed the cloudwatch branch 2 times, most recently from ed9deab to e98c50e on November 24, 2021 21:39
@pdames (Member) previously requested changes Nov 24, 2021 and left a comment

This looks pretty good overall - thanks for making the updates to support cluster-level dashboards! I think if we now just make a few updates to the default/example dashboard config then we'll be in good shape!

@Zyiqin-Miranda force-pushed the cloudwatch branch 2 times, most recently from 63fd476 to cbe5f96 on December 3, 2021 01:13
@Zyiqin-Miranda (Member, Author) commented Dec 3, 2021

This looks pretty good overall - thanks for making the updates to support cluster-level dashboards! I think if we now just make a few updates to the default/example dashboard config then we'll be in good shape!

Thanks Patrick for the review! I've updated the dashboard configuration JSON file so that it now auto-generates a cluster-level example dashboard.

@pdames (Member) commented Dec 9, 2021

The latest updates LGTM. @wuisawesome, @DmitriGekhtman - could one of you take a look?

@DmitriGekhtman (Contributor)

Looks like the CI hit some (hopefully transient) problem. Could you rebase/merge master?

@DmitriGekhtman dismissed pdames’s stale review on December 9, 2021 23:51

Seems like requested changes were made.

@Zyiqin-Miranda force-pushed the cloudwatch branch 4 times, most recently from 36056fe to b32469a on December 13, 2021 02:48
@Zyiqin-Miranda added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Dec 13, 2021
@wuisawesome (Contributor) left a comment

Would it be possible to split this PR up into pieces? If I'm understanding it correctly, it seems like it should be possible to split into 3 or 4 pieces.

  1. Agent related changes (the changes related to updating the head/worker nodes)
  2. The refactor which generalizes the logic and introduces CloudwatchConfigType
  3. Dashboard feature
  4. Alarm feature

Also please let me know if I'm missing something here and it can't be split up.

@Zyiqin-Miranda (Member, Author) commented Dec 14, 2021

Would it be possible to split this PR up into pieces? If I'm understanding it correctly, it seems like it should be possible to split into 3 or 4 pieces.

  1. Agent related changes (the changes related to updating the head/worker nodes)
  2. The refactor which generalizes the logic and introduces CloudwatchConfigType
  3. Dashboard feature
  4. Alarm feature

Also please let me know if I'm missing something here and it can't be split up.

Thanks Alex! The agent-related updates are already included in the first PR, which has been merged. I can split this PR into:
1. dashboard support
2. alarm support
3. generalizing and consolidating the agent, dashboard, and alarm logic, with all unit tests.
Does that look good to you?

@Zyiqin-Miranda force-pushed the cloudwatch branch 2 times, most recently from fa6021e to 3486a26 on December 15, 2021 21:35
@Zyiqin-Miranda changed the title from [autoscaler] AWS Autoscaler CloudWatch Dashboard and Alarm support to [autoscaler] AWS Autoscaler CloudWatch Dashboard support on Dec 15, 2021
@pdames (Member) commented Dec 15, 2021

The refactor to include only the changes relevant to the CloudWatch dashboard here looks good to me. One thing to note is that we'll want to quickly follow this with another PR that adds unit tests to guard against regressions. Thanks @Zyiqin-Miranda!

@Zyiqin-Miranda added and then removed the tests-ok label on Dec 15, 2021
@Zyiqin-Miranda force-pushed the cloudwatch branch 2 times, most recently from c1d494a to edc823a on December 16, 2021 18:45
@Zyiqin-Miranda deleted the cloudwatch branch on December 17, 2021 05:27
@Zyiqin-Miranda restored the cloudwatch branch on December 17, 2021 05:32
@Zyiqin-Miranda force-pushed the cloudwatch branch 2 times, most recently from 0714ece to 948ddaf on December 20, 2021 19:19
@wuisawesome (Contributor)

@DmitriGekhtman do you mind taking a look at this too?

@@ -14,8 +14,9 @@ provider:
# We depend on AWS Systems Manager (SSM) to deploy CloudWatch configuration updates to your cluster,
# with relevant configuration created or updated in the SSM Parameter Store during `ray up`.

# The `AmazonCloudWatch-ray_agent_config_{cluster_name}` SSM Parameter Store Config Key is used to
# store a remote cache of the last Unified CloudWatch Agent config applied.
# We support three CloudWatch related config types under this cloudwatch section: agent, dashboard and alarm.
Contributor:

nit: we don't support alarm yet in this PR, right?

Member Author:

Nice catch, removed alarm, thanks!

        except botocore.exceptions.WaiterError as e:
            logger.error(
                "Failed while waiting for EC2 instance checks to complete: {}".
                format(e.message))
            raise e

-    def _update_cloudwatch_agent_config(self, is_head_node: bool) -> None:
+    def _update_cloudwatch_config(self, is_head_node: bool,
+                                  config_type: str) -> None:
        """ check whether update operations are needed.
Contributor:

Can we make this description a bit more detailed (it seems like this applies update operations in addition to checking?), and include a description of the args?

Member Author:

Sure, added a more detailed description and an args docstring.

""" check whether update operations are needed.
"""
cwa_installed = self._setup_cwa()
param_name = self._get_ssm_param_name()
param_name = self._get_ssm_param_name(config_type)
if cwa_installed:
if is_head_node:
Contributor:

Can we dedupe some of this logic? It seems like both cases are checking if the hashes match and then restarting the CloudWatch agent?

Btw just checking, is

elif config_type == "dashboard":
                        self._put_cloudwatch_dashboard()

not valid on a non-head node?

Member Author:

Yes, only the head node is responsible for putting the cluster-level dashboard.

Member Author:

For the hash comparison, the head node compares the hash of its local file with the hash of the remote file stored in the AWS Systems Manager Parameter Store; worker nodes compare their own EC2 hash tag value with the head node's EC2 hash tag value.
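A rough sketch of that comparison pattern follows, with illustrative helper names and an assumed tag/parameter naming scheme (this is not the exact implementation in the PR):

import hashlib

def _sha1_of_file(path: str) -> str:
    # Hash the local CloudWatch config file (agent or dashboard JSON).
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def head_node_needs_update(ssm_client, param_name: str, local_path: str) -> bool:
    # Head node: compare the local config hash with the hash of the remote
    # copy cached in the SSM Parameter Store.
    try:
        remote = ssm_client.get_parameter(Name=param_name)["Parameter"]["Value"]
    except ssm_client.exceptions.ParameterNotFound:
        return True  # nothing cached yet, so an update is needed
    remote_hash = hashlib.sha1(remote.encode("utf-8")).hexdigest()
    return _sha1_of_file(local_path) != remote_hash

def worker_node_needs_update(ec2_client, worker_id: str, head_id: str,
                             hash_tag_key: str) -> bool:
    # Worker node: compare its own EC2 hash tag value with the head node's.
    def tag_value(instance_id: str):
        tags = ec2_client.describe_tags(Filters=[
            {"Name": "resource-id", "Values": [instance_id]},
            {"Name": "key", "Values": [hash_tag_key]},
        ])["Tags"]
        return tags[0]["Value"] if tags else None

    return tag_value(worker_id) != tag_value(head_id)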

            DocumentName=document_name,
            Parameters=parameters,
-            MaxConcurrency=str(min(len(node_ids), 100)),
+            MaxConcurrency=str(min(len(node_id), 100)),
Contributor:

should this just be 1 now?

Member Author:

Yes, updated to 1, thanks Alex!
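For reference, the SSM SendCommand call that this argument feeds looks roughly like the following when it targets a single node; the document name and command parameters here are illustrative only, not the documents this PR registers.

import boto3

ssm = boto3.client("ssm", region_name="us-west-2")

node_id = "i-0123456789abcdef0"  # illustrative EC2 instance id
response = ssm.send_command(
    InstanceIds=[node_id],
    DocumentName="AWS-RunShellScript",  # illustrative SSM document
    Parameters={"commands": ["echo restart-cloudwatch-agent"]},
    # With a single target instance there is nothing to parallelize,
    # hence the suggestion to pass MaxConcurrency="1".
    MaxConcurrency="1",
)
command_id = response["Command"]["CommandId"]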

@wuisawesome (Contributor)

Test failure (client proxy) is known flaky and looks unrelated, merging

@wuisawesome merged commit 71fae21 into ray-project:master on Jan 10, 2022
wuisawesome pushed a commit that referenced this pull request Feb 2, 2022
These changes add a set of improvements to enable automatic creation and update of CloudWatch alarms when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to:

  1. Set up alarms against Ray CloudWatch metrics to get notified about increased load or service outages.
  2. Update their CloudWatch alarm JSON configuration files at `ray up` execution time.

Notes:

  1. This PR is a follow-up to #20266 and adds CloudWatch alarm support.
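For context, CloudWatch alarms are created or updated through the PutMetricAlarm API; a minimal boto3 sketch of that kind of call follows, with an illustrative alarm name, metric, and threshold rather than the values used by the follow-up PR.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Alarm name, metric, and threshold below are illustrative only.
cloudwatch.put_metric_alarm(
    AlarmName="example-cluster-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    # An SNS topic ARN would typically go here so users get notified.
    AlarmActions=[],
)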
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.