AWS EC2 - Automatically chosen subnet does not match parametrized AZ #428

Open
cmillani opened this issue Jun 9, 2024 · 1 comment

cmillani commented Jun 9, 2024

Describe the issue:

ec2.py chooses the first subnet of the specified VPC (or of the default VPC, if none is specified), ignoring the availability_zone (AZ) parameter.

Some instance types are not available in every AZ, so it is sometimes necessary to specify one, but doing so can conflict with the subnet selected in the step described above.
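
To make the mismatch concrete, here is a minimal sketch on made-up subnet records. The subnet IDs, the VPC ID, and the `first_subnet` / `az_matches` helpers are hypothetical and not part of dask-cloudprovider; the sketch only mimics taking index 0 from the VPC's subnet list and checks the result against the requested AZ.

```python
# Illustrative records shaped like entries of describe_subnets()["Subnets"].
# The IDs and zones are invented for this example.
subnets = [
    {"SubnetId": "subnet-aaa", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1e"},
    {"SubnetId": "subnet-bbb", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1a"},
]


def first_subnet(subnets, vpc_id):
    """Mimic the reported behaviour: take the first subnet of the VPC."""
    return [s for s in subnets if s["VpcId"] == vpc_id][0]


def az_matches(subnet, availability_zone):
    """True when the subnet lives in the requested AZ."""
    return subnet["AvailabilityZone"] == availability_zone


chosen = first_subnet(subnets, "vpc-1")
requested_az = "us-east-1a"
print(chosen["SubnetId"], az_matches(chosen, requested_az))
```

With these records the first subnet sits in a different AZ than the one requested, which is the combination RunInstances rejects.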

Minimal Complete Verifiable Example:

At the time of writing, the m5.large instance type is not supported in us-east-1e, which in my case is the AZ of the subnet returned at index 0 when listing the subnets of the default VPC.
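
For anyone who wants to check which AZs offer a given instance type, a rough sketch follows. It assumes an aiobotocore EC2 client like the one used in this issue and relies on the EC2 `describe_instance_type_offerings` call with `LocationType="availability-zone"`. The `zones_offering` helper and the `FakeEC2Client` stand-in (with invented data) are hypothetical; the stand-in only exists so the sketch runs without AWS credentials.

```python
import asyncio


async def zones_offering(client, instance_type):
    """List the AZs in the client's region that offer `instance_type`."""
    zones, extra = [], {}
    while True:
        response = await client.describe_instance_type_offerings(
            LocationType="availability-zone",
            Filters=[{"Name": "instance-type", "Values": [instance_type]}],
            **extra,
        )
        zones.extend(o["Location"] for o in response["InstanceTypeOfferings"])
        token = response.get("NextToken")
        if not token:
            return sorted(zones)
        extra = {"NextToken": token}  # fetch the next page


class FakeEC2Client:
    """Stand-in for a real client so the sketch runs offline; data is invented."""

    async def describe_instance_type_offerings(self, **kwargs):
        wanted = kwargs["Filters"][0]["Values"]
        offerings = [
            {"InstanceType": "m5.large", "Location": "us-east-1b"},
            {"InstanceType": "m5.large", "Location": "us-east-1a"},
            {"InstanceType": "t3.micro", "Location": "us-east-1e"},
        ]
        matching = [o for o in offerings if o["InstanceType"] in wanted]
        return {"InstanceTypeOfferings": matching}


supported = asyncio.run(zones_offering(FakeEC2Client(), "m5.large"))
print(supported)
```

Against a real account, the same coroutine would be awaited with the client created by `get_session().create_client("ec2", ...)` instead of the stand-in.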

To reproduce this reliably, we can force the use of the second subnet's AZ with this code:

from aiobotocore.session import get_session
from botocore.config import Config
from dask_cloudprovider.aws.ec2 import EC2Cluster
from dask_cloudprovider.aws.helper import get_default_vpc

# Top-level `async with` / `await` needs an async-aware shell such as Jupyter or IPython.
boto_config = Config(retries=dict(max_attempts=10))
region = "us-east-1"
async with get_session().create_client("ec2", region_name=region, config=boto_config) as client:
    vpc = await get_default_vpc(client)
    # Keep only the default VPC's subnets, the same list ec2.py picks from.
    subnets = [s for s in (await client.describe_subnets())["Subnets"] if s["VpcId"] == vpc]
    # `dask_cloudprovider.aws.ec2` takes the subnet at [0], so asking for the AZ of [1] forces the issue.
    az = subnets[1]["AvailabilityZone"]
    EC2Cluster(
        region=region,
        availability_zone=az,
        security=False,  # Simply to avoid requiring the cryptography package
        scheduler_instance_type="m5.large",
        worker_instance_type="m5.large",
    )

This outputs:

2024-06-09 14:46:30,639 - distributed.deploy.spec - WARNING - Cluster closed without starting up

Inspecting the stack trace shows the following error:

[...]
ClientError: An error occurred (InvalidParameterValue) when calling the RunInstances operation: Value (us-east-1a) for parameter availabilityZone is invalid. Subnet '<REDACTED>' is in the availability zone us-east-1e

During handling of the above exception, another exception occurred:
[...]

Anything else we need to know?:

Changing dask_cloudprovider.aws.helper.get_vpc_subnets to accept the availability zone and filter subnets by it should fix the issue. If this makes sense I could open a PR! :)

async def get_vpc_subnets(client, vpc, availability_zone=None):
    vpcs = (await client.describe_vpcs())["Vpcs"]
    [vpc] = [x for x in vpcs if x["VpcId"] == vpc]
    subnets = (await client.describe_subnets())["Subnets"]
    # With no AZ given, keep every subnet of the VPC so callers that pass no AZ still get results.
    return [
        subnet["SubnetId"]
        for subnet in subnets
        if subnet["VpcId"] == vpc["VpcId"]
        and availability_zone in (None, subnet["AvailabilityZone"])
    ]
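
As a possible variation on the helper above, the filtering could be pushed to the EC2 API through the `vpc-id` and `availability-zone` filters of `describe_subnets`. This is only a sketch, not the library's implementation: the name `get_vpc_subnets_in_az`, the optional `availability_zone` argument, and the `FakeEC2Client` stand-in with invented data are all assumptions made for illustration.

```python
import asyncio


async def get_vpc_subnets_in_az(client, vpc_id, availability_zone=None):
    """Return subnet IDs in `vpc_id`, optionally restricted to one AZ."""
    filters = [{"Name": "vpc-id", "Values": [vpc_id]}]
    if availability_zone is not None:
        filters.append({"Name": "availability-zone", "Values": [availability_zone]})
    response = await client.describe_subnets(Filters=filters)
    return [subnet["SubnetId"] for subnet in response["Subnets"]]


class FakeEC2Client:
    """Stand-in that applies the two filters locally; data is invented."""

    _subnets = [
        {"SubnetId": "subnet-aaa", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1e"},
        {"SubnetId": "subnet-bbb", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1a"},
        {"SubnetId": "subnet-ccc", "VpcId": "vpc-2", "AvailabilityZone": "us-east-1a"},
    ]

    async def describe_subnets(self, Filters=()):
        keys = {"vpc-id": "VpcId", "availability-zone": "AvailabilityZone"}
        matches = [
            s for s in self._subnets
            if all(s[keys[f["Name"]]] in f["Values"] for f in Filters)
        ]
        return {"Subnets": matches}


in_az = asyncio.run(get_vpc_subnets_in_az(FakeEC2Client(), "vpc-1", "us-east-1a"))
all_ids = asyncio.run(get_vpc_subnets_in_az(FakeEC2Client(), "vpc-1"))
print(in_az, all_ids)
```

Filtering server-side avoids listing every subnet in the account, and leaving the AZ optional keeps the no-AZ case returning all subnets of the VPC.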

Environment:

  • Dask version:
    • dask==2024.4.1
    • dask-cloudprovider==2022.10.0
    • distributed==2024.4.1
  • Python version: 3.10.12
  • Operating System: Ubuntu 20.04 LTS (Linux 5.15.146.1-microsoft-standard-WSL2)
  • Install method (conda, pip, source): pip
@jacobtomlinson
Member

A PR to do this would be very welcome!
