[autoscaler] AWS Autoscaler CloudWatch Dashboard support #20266
Conversation
I'm mostly just reviewing to check that existing Ray functionality is unaffected, which looks to be the case. Will defer to @pdames for a more thorough review based on the domain knowledge.
This looks pretty good overall - thanks for making the updates to support cluster-level dashboards! I think if we now just make a few updates to the default/example dashboard config then we'll be in good shape!
Thanks Patrick for the review; updated the dashboard configuration JSON file to auto-generate a cluster-level example dashboard now.
The latest updates LGTM. @wuisawesome, @DmitriGekhtman - could one of you take a look?
Looks like the CI hit some (hopefully transient) problem. Could you rebase/merge master?
Seems like requested changes were made.
Would it be possible to split this PR up into pieces? If I'm understanding it correctly, it seems like it should be possible to split into 3 or 4 pieces.
- Agent related changes (the changes related to updating the head/worker nodes)
- The refactor which generalizes the logic and introduces CloudwatchConfigType
- Dashboard feature
- Alarm feature
Also please let me know if I'm missing something here and it can't be split up.
Thanks Alex, the agent-related updates are already included in the first PR (already merged), and I can split this PR into:
The refactor to include only the changes relevant to the CloudWatch dashboard here looks good to me. One thing to note is that we'll want to quickly follow this with another PR that adds unit tests to guard against regressions. Thanks @Zyiqin-Miranda!
@DmitriGekhtman do you mind taking a look at this too?
```diff
@@ -14,8 +14,9 @@ provider:
 # We depend on AWS Systems Manager (SSM) to deploy CloudWatch configuration updates to your cluster,
 # with relevant configuration created or updated in the SSM Parameter Store during `ray up`.

 # The `AmazonCloudWatch-ray_agent_config_{cluster_name}` SSM Parameter Store Config Key is used to
 # store a remote cache of the last Unified CloudWatch Agent config applied.
+# We support three CloudWatch related config type under this cloudwatch section: agent, dashboard and alarm.
```
nit: we don't support alarm yet in this PR, right?
Nice catch, removed alarm, thanks!
```diff
         except botocore.exceptions.WaiterError as e:
             logger.error(
                 "Failed while waiting for EC2 instance checks to complete: {}".
                 format(e.message))
             raise e

-    def _update_cloudwatch_agent_config(self, is_head_node: bool) -> None:
+    def _update_cloudwatch_config(self, is_head_node: bool,
+                                  config_type: str) -> None:
         """ check whether update operations are needed.
```
Can we make this description a bit more detailed (it seems like this applies update operations in addition to checking?), and include a description of the args?
Sure, added a description and an args docstring.
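The updated docstring itself isn't shown in this excerpt; a minimal sketch of what a description-plus-args docstring for this method might look like, inferred from the diff and the discussion below (the exact wording in the PR may differ):

```python
def _update_cloudwatch_config(self, is_head_node: bool,
                              config_type: str) -> None:
    """Check whether CloudWatch config updates are needed, and apply them.

    Compares the hash of the local CloudWatch config against the remote
    hash (SSM Parameter Store for the head node, EC2 tags for workers)
    and, on a mismatch, pushes the new config: restarting the Unified
    CloudWatch Agent for "agent" configs, or putting a cluster-level
    dashboard for "dashboard" configs (head node only).

    Args:
        is_head_node: whether this node is the cluster head node.
        config_type: the CloudWatch config type to update, e.g.
            "agent" or "dashboard".
    """
```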
""" check whether update operations are needed. | ||
""" | ||
cwa_installed = self._setup_cwa() | ||
param_name = self._get_ssm_param_name() | ||
param_name = self._get_ssm_param_name(config_type) | ||
if cwa_installed: | ||
if is_head_node: |
Can we dedupe some of this logic? It seems like both cases are checking if the hashes match and then restarting the CloudWatch agent?

Btw, just checking: is

```python
elif config_type == "dashboard":
    self._put_cloudwatch_dashboard()
```
Yes, only the head node is responsible for putting the cluster-level dashboard.
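For reference, putting a cluster-level dashboard from the head node boils down to a single CloudWatch API call; a minimal boto3 sketch, where the dashboard name, namespace, metric, and widget contents are illustrative placeholders rather than the PR's actual values:

```python
import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# A dashboard body is a JSON document containing a list of widgets.
dashboard_body = {
    "widgets": [{
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            # Placeholder namespace/metric, not the PR's config.
            "metrics": [["example-ray-namespace", "example_cpu_usage"]],
            "period": 300,
            "stat": "Average",
            "region": "us-west-2",
            "title": "Ray cluster CPU",
        },
    }]
}

# One put_dashboard call creates or replaces the entire dashboard,
# which is why only the head node needs to make it.
cloudwatch.put_dashboard(
    DashboardName="example-ray-dashboard",
    DashboardBody=json.dumps(dashboard_body))
```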
For the hash comparison, the head node compares the hash of the local file with the hash of the remote file stored in the AWS Systems Manager Parameter Store; each worker node compares its own EC2 hash tag value with the head node's EC2 hash tag value.
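A sketch of the two comparison paths just described; the parameter name, tag key, and helper names here are assumptions for illustration, not the PR's actual identifiers:

```python
import hashlib

import boto3


def _sha1_of_file(path: str) -> str:
    # Hash the local CloudWatch config file.
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def head_node_config_changed(local_path: str, param_name: str) -> bool:
    # Head node: compare the local file hash against the remote copy
    # cached in the SSM Parameter Store.
    ssm = boto3.client("ssm")
    remote = ssm.get_parameter(Name=param_name)["Parameter"]["Value"]
    remote_hash = hashlib.sha1(remote.encode("utf-8")).hexdigest()
    return _sha1_of_file(local_path) != remote_hash


def worker_config_changed(worker_id: str, head_id: str,
                          tag_key: str) -> bool:
    # Worker: compare this node's hash tag against the head node's.
    ec2 = boto3.client("ec2")

    def tag_value(instance_id: str) -> str:
        tags = ec2.describe_tags(Filters=[
            {"Name": "resource-id", "Values": [instance_id]},
            {"Name": "key", "Values": [tag_key]},
        ])["Tags"]
        return tags[0]["Value"] if tags else ""

    return tag_value(worker_id) != tag_value(head_id)
```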
```diff
             DocumentName=document_name,
             Parameters=parameters,
-            MaxConcurrency=str(min(len(node_ids), 100)),
+            MaxConcurrency=str(min(len(node_id), 100)),
```
should this just be 1 now?
Yes, updated to 1, thanks Alex!
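For context, `MaxConcurrency` is a parameter of SSM's `send_command`, expressed as a string; with a single target instance it can simply be `"1"`. A hedged sketch, where the document name and commands are placeholders rather than the SSM document the autoscaler actually uses:

```python
import boto3

ssm = boto3.client("ssm")

node_id = "i-0123456789abcdef0"  # the single target instance

response = ssm.send_command(
    InstanceIds=[node_id],
    # Placeholder document; the autoscaler uses its own SSM document.
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["echo restart-cloudwatch-agent"]},
    # Only one instance is targeted per call now, so "1" suffices.
    MaxConcurrency="1",
)
command_id = response["Command"]["CommandId"]
```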
Test failure (client proxy) is known flaky and looks unrelated, merging.
These changes add a set of improvements to enable automatic creation and update of CloudWatch alarms when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to: set up alarms against Ray CloudWatch metrics to get notified about increased load or service outages, and update their CloudWatch alarm JSON configuration files during `ray up` execution. Notes: this PR is a follow-up to #20266 and adds CloudWatch alarm support.
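As an illustration of the alarm capability described above, creating an alarm against a Ray CloudWatch metric is one boto3 call; a minimal sketch, where the namespace, metric name, and threshold are placeholders rather than the PR's configuration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when a (hypothetical) cluster CPU metric stays above 90%
# for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="example-ray-high-cpu",
    Namespace="example-ray-namespace",   # placeholder namespace
    MetricName="example_cpu_usage",      # placeholder metric
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
)
```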
Why are these changes needed?
These changes add a set of improvements to enable automatic creation and update of CloudWatch dashboards when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to:
Notes:
Related issue number
Closes #9644, #8967
Checks
I've run `scripts/format.sh` to lint the changes in this PR.