Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[autoscaler] Add support for EC2 launch templates. #17236

Merged
merged 4 commits into from
Jul 30, 2021

Conversation

pdames
Copy link
Member

@pdames pdames commented Jul 21, 2021

These changes add support for EC2 Launch Templates as part of a user's AWS Autoscaler node config. Any parameters specified in node_config override the same parameters in the launch template, in compliance with the behavior of EC2's create_instances API. These changes were previously proposed in: #9336.

Related issue number

Closes #9334
Resolves milestone [8] of #8420.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@pdames
Copy link
Member Author

pdames commented Jul 21, 2021

Opening this PR with required changes for EC2 launch template support and an example autoscaler config for review. Will update with unit tests.

Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fairly good (and quite a reasonably sized PR :) )!

Comment on lines 770 to 778
def _configure_from_launch_template(config):
for node_type in config["available_node_types"].values():
config = _configure_node_type_from_launch_template(config, node_type)
return config


def _configure_node_type_from_launch_template(config, node_type):
node_cfg = node_type["node_config"]
if "LaunchTemplate" not in node_cfg:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add docstrings & type hints for these functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docstrings and type hints for both this PR and the related changes in #14080. The new unit test also exercises the E2E happy-path for these changes in config.py and node_provider.py.

Comment on lines 783 to 784
template_version = kwargs.pop("Version", "$Default")
kwargs["Versions"] = [template_version] if template_version else []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be setdefault? (as in kwargs.setdefault("Versions", ...)). I'm wondering if we should avoid adding a field Version (no s) in the LaunchTemplate dict.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! My thinking here is that we should keep the launch template autoscaler config model and behavior consistent with the boto3 ec2.create_instances API (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances) that this LaunchTemplate is ultimately forwarded to inside of node_provider.py:

The ec2.create_instances input launch template model is:

LaunchTemplate={
        'LaunchTemplateId': 'string',
        'LaunchTemplateName': 'string',
        'Version': 'string'
 }

Since this API only takes a singular Version argument, and since we in turn require only a single input version to describe for correct E2E behavior, I wanted to keep our internal dependency on the Versions argument used by ec2.describe_launch_template_versions hidden from the end-user writing the autoscaler config. This also gives clients one less opportunity to shoot themselves in the foot by trying to specify more than 1 launch template version in their config.

However, one usability enhancement I caught on this line yesterday is that kwargs.pop("Version", "$Default") should be changed to str(kwargs.pop("Version", "$Default")) so that users can simply specify an integer version number like Version: 2 in their autoscaler config YAML (as-written, they would be forced to specify integer strings like Version: "2"). 🙂

@pdames pdames requested a review from ijrsvt July 24, 2021 08:32
Comment on lines 770 to 794
def _configure_from_launch_template(config: Dict[str, Any]) -> Dict[str, Any]:
"""
Merges any launch template data referenced by the node config of all
available node type's into their parent node config. Any parameters
specified in node config override the same parameters in the launch
template, in compliance with the behavior of the ec2.create_instances
API. The config to bootstrap is modified in place.

Args:
config (Dict[str, Any]): config to bootstrap
Returns:
config (Dict[str, Any]): The input config with all launch template
data merged into the node config of all available node types. If no
launch template data is found, then the config is returned
unchanged.
Raises:
ValueError: If no launch template is found for any launch
template [name|id] and version, or more than one launch template is
found.
"""

# iterate over sorted node types to support deterministic unit test stubs
for _, node_type in sorted(config["available_node_types"].items()):
config = _configure_node_type_from_launch_template(config, node_type)
return config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this not modify the config in place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Note that this also required patching a couple areas that depend on bootstrap_aws to invalidate their assumption of in-place config modification. I've also introduced a copy.deepcopy(config) line at the top of bootstrap_aws to both prevent in-place modification of the input config to bootstrap by any existing code path, and to make this behavior more immediately clear to readers.

Cleaning up in-place modification of each individual section of the copied config (e.g. provider config, node type config, node config, etc.) seems like a good thing to target across subsequent PRs.

Comment on lines 797 to 830
def _configure_node_type_from_launch_template(
config: Dict[str, Any], node_type: Dict[str, Any]) -> Dict[str, Any]:
"""
Merges any launch template data referenced by the given node type's
node config into the parent node config. Any parameters specified in
node config override the same parameters in the launch template. The
config to bootstrap is modified in place.

Note that this merge is simply a bidirectional dictionary update, from
the node config to the launch template data, and from the launch
template data to the node config. Thus, the final result captures the
relative complement of launch template data with respect to node config,
and allows all subsequent config bootstrapping code paths to act as
if the complement was explicitly specified in the user's node config. A
deep merge of nested elements like tag specifications isn't required
here, since the AWSNodeProvider's ec2.create_instances call will do this
for us after it fetches the referenced launch template data.

Args:
config (Dict[str, Any]): config to bootstrap
node_type (Dict[str, Any]): node type config to bootstrap
Returns:
config (Dict[str, Any]): The input config with all launch template
data merged into the node config of the input node type. If no
launch template data is found, then the config is returned
unchanged.
Raises:
ValueError: If no launch template is found for the given launch
template [name|id] and version, or more than one launch template is
found.
"""
node_cfg = node_type["node_config"]
if "LaunchTemplate" not in node_cfg:
return config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the modification of node_type not in-place and return the modified node_type instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +317 to +332
"""
Merges user-provided node config tag specifications into a base
list of node provider tag specifications. The base list of
node provider tag specs is modified in-place.

This allows users to add tags and override values of existing
tags with their own, and only applies to the resource type
"instance". All other resource types are appended to the list of
tag specs.

Args:
tag_specs (List[Dict[str, Any]]): base node provider tag specs
user_tag_specs (List[Dict[str, Any]]): user's node config tag specs
"""

for user_tag_spec in user_tag_specs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this a separate function!
Modification in place seems fine here :)

Comment on lines 59 to 62
"""
Applies default updates made by AWSNodeProvider to node_cfg during node
creation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention that this is only used in tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the only outstanding thing to change here!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ijrsvt
Copy link
Contributor

ijrsvt commented Jul 30, 2021

Windows Failure is not related (object_spilling) and this PR is strictly autoscaler!

@ijrsvt ijrsvt merged commit 131710f into ray-project:master Jul 30, 2021
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Jul 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[autoscaler] Support LaunchTemplate/NetworkInterfaces in Node Config
5 participants