
ECS stability improvements #4326

Closed
wants to merge 1 commit into from

Conversation


@hngkr hngkr commented Dec 15, 2015

We've experienced issues with deleting ECS cluster services and ECS clusters. It's mostly due to peculiarities within AWS:

  1. It's possible to look up a service using DescribeServices(servicename, clustername) and get a "DRAINING" response even if the cluster is already deleted. The fix seems to be to use ListServices(clustername) and then check whether the service is gone before calling DescribeServices (see the sketch after this list).

  2. For the services, we've also seen the status go from DRAINING to MISSING, while the wait condition in Terraform has the target INACTIVE before it exits. That made the deletion of the service hang even when the service was already gone.

  3. We've seen issues with deleting a cluster that has running services, or services that have DesiredCount > 0 (for instance, if services have been started in the cluster from outside Terraform). I acknowledge that this might be controversial - and a corner case regarding the philosophy of "not deleting anything that isn't started by Terraform itself".
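
To make point 1 concrete, here is a rough sketch of the kind of check I mean, using the same aws-sdk-go ECS client the resource already uses. The helper name and variables are made up, and ListServices pagination is omitted:

// serviceStillListed reports whether the service ARN still shows up in
// ListServices for the cluster, so we don't trust a stale "DRAINING"
// answer from DescribeServices after the cluster/service is gone.
// Pagination of ListServices is omitted for brevity.
func serviceStillListed(conn *ecs.ECS, clusterName, serviceArn string) (bool, error) {
    out, err := conn.ListServices(&ecs.ListServicesInput{
        Cluster: aws.String(clusterName),
    })
    if err != nil {
        return false, err
    }
    for _, arn := range out.ServiceArns {
        if aws.StringValue(arn) == serviceArn {
            return true, nil
        }
    }
    return false, nil
}

The delete path would only call DescribeServices (and wait on its status) when this returns true; otherwise the service can be treated as already removed.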

@radeksimko
Member

Hi @hngkr ,

  1. It's possible to look up a service using DescribeServices(servicename, clustername) and get a "DRAINING" response even if the cluster is already deleted. The fix seems to be to use ListServices(clustername) and then check whether the service is gone before calling DescribeServices.

Have you also tried reporting this back to AWS? I wouldn't say that calling ecs:ListServices instead of ecs:DescribeServices is a solution; it's rather a workaround, and we should also be pushing Amazon to fix it. I'm OK with merging this as a short-term solution, but before we do so, I want to be sure it has been reported, so that we can remove this workaround at some point.

  2. For the services, we've also seen the status go from DRAINING to MISSING, while the wait condition in Terraform has the target INACTIVE before it exits. That made the deletion of the service hang even when the service was already gone.

Good catch! I'm totally happy to merge this change (see the sketch at the end of this comment for the wait pattern in question).

  3. We've seen issues with deleting a cluster that has running services, or services that have DesiredCount > 0 (for instance, if services have been started in the cluster from outside Terraform). I acknowledge that this might be controversial - and a corner case regarding the philosophy of "not deleting anything that isn't started by Terraform itself".

I think the philosophy you mentioned is quite an essential part of Terraform and I'd like to keep it that way. The preferred approach is to error out or wait until services have been drained.
If you choose to use two or more tools, each managing a different part of your infrastructure, you will likely be choosing between tools that don't stamp on each other's toes, and I'd like Terraform to be one of those choices, hence I'm not inclined to merge this part of the PR.

I have noticed recently that some ECS acceptance tests are intermittently failing because ECS services launched by Terraform take time to drain, and that's something we can definitely fix (I'm working on that in a separate PR).
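
For anyone following along, the wait in question follows the helper/resource StateChangeConf pattern used elsewhere in the provider; a rough sketch is below. This is illustrative only, not the exact code in the resource: conn, clusterName and serviceArn are placeholders, the timeout is arbitrary, and "MISSING" is the synthetic state the refresh function would have to report when DescribeServices returns no matching service.

// Illustrative only: wait for the ECS service to be deleted, treating a
// service that has disappeared as terminal instead of hanging until the
// status reaches "INACTIVE".
func waitForServiceDeleted(conn *ecs.ECS, clusterName, serviceArn string) error {
    wait := resource.StateChangeConf{
        Pending: []string{"ACTIVE", "DRAINING"},
        Target:  []string{"INACTIVE", "MISSING"},
        Timeout: 10 * time.Minute,
        Refresh: func() (interface{}, string, error) {
            out, err := conn.DescribeServices(&ecs.DescribeServicesInput{
                Cluster:  aws.String(clusterName),
                Services: []*string{aws.String(serviceArn)},
            })
            if err != nil {
                return nil, "", err
            }
            if len(out.Services) == 0 {
                // The service no longer exists; nothing left to wait for.
                return out, "MISSING", nil
            }
            return out, aws.StringValue(out.Services[0].Status), nil
        },
    }
    _, err := wait.WaitForState()
    return err
}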

// Check if it's not already gone
resp, err := conn.DescribeServices(&ecs.DescribeServicesInput{
-	Services: []*string{aws.String(d.Id())},
 	Cluster: aws.String(d.Get("cluster").(string)),
+	Services: []*string{aws.String(serviceName)},
@radeksimko
Member

ARNs are AFAIK the most specific IDs and may also help us while debugging issues related to region or account ID - imagine you used the wrong AWS credentials and then wondered why the ECS service is/isn't there or why it has different settings.

For those reasons I'd be personally inclined to keep ARNs where possible, especially in logs.
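
For illustration (region and account ID made up), a service ARN like arn:aws:ecs:us-east-1:123456789012:service/sample-webapp carries the region and the account ID that a bare service name drops - which is exactly the context you want in a log line when something looks off.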

@radeksimko radeksimko added the waiting-response label Dec 17, 2015
@hngkr
Author

hngkr commented Dec 17, 2015

Hi Radek,

  1. It's possible to look up a service using DescribeServices(servicename, clustername) and get a "DRAINING" response even if the cluster is already deleted. The fix seems to be to use ListServices(clustername) and then check whether the service is gone before calling DescribeServices.

Have you also tried reporting this back to AWS? I wouldn't say that calling ecs:ListServices instead of ecs:DescribeServices is a solution; it's rather a workaround, and we should also be pushing Amazon to fix it. I'm OK with merging this as a short-term solution, but before we do so, I want to be sure it has been reported, so that we can remove this workaround at some point.

I haven't reported it. I found a forum post that seemed to indicate that I might not be the only one, and then I went around fixing it instead. I believe that it was this post: https://forums.aws.amazon.com/thread.jspa?messageID=621911

  2. For the services, we've also seen the status go from DRAINING to MISSING, while the wait condition in Terraform has the target INACTIVE before it exits. That made the deletion of the service hang even when the service was already gone.

Good catch! I'm totally happy to merge this change.

  3. We've seen issues with deleting a cluster that has running services, or services that have DesiredCount > 0 (for instance, if services have been started in the cluster from outside Terraform). I acknowledge that this might be controversial - and a corner case regarding the philosophy of "not deleting anything that isn't started by Terraform itself".

I think the philosophy you mentioned is quite an essential part of Terraform and I'd like to keep it that way. The preferred approach is to error out or wait until services have been drained.
If you choose to use two or more tools, each managing a different part of your infrastructure, you will likely be choosing between tools that don't stamp on each other's toes, and I'd like Terraform to be one of those choices, hence I'm not inclined to merge this part of the PR.

I have noticed recently that some ECS acceptance tests are intermittently failing because ECS services launched by Terraform take time to drain, and that's something we can definitely fix (I'm working on that in a separate PR).

I certainly respect and understand your point.

In this specific case, we've actually made an application that runs as an ECS service and can create and start new ECS services in the same cluster. It works pretty well - except when bringing the cluster down. I acknowledge that it's certainly a corner case and that we probably should just move the fix to our own custom Terraform provider (which I just figured out how to do).

I'm probably side-tracking, but I would like to describe the environment and processes that we're working within. Often, we're more on the Dev than the Ops side of things. Setting up the system is only the first step, and Terraform works fine for that. Running the systems in production usually means transferring the day-to-day responsibility to operators, who'll do "operations" stuff in reaction to customer usage.

Ops duties might include tuning values for read/write capacity on DynamoDB tables, upping the minimum number of instances in an autoscaling group in anticipation of a spike in usage, and a lot more - I'm sure you get the point. If Ops didn't do this, then production would be affected. At our site, we can't really expect the Ops guys to start loving (or just learning) Terraform either.

The problem then arises when we as infrastructure developers have to come back and extend the system with new components - and re-run Terraform on production systems that might have deviated, even if only in a non-structural, strictly "tunable parameters" way. We can run "terraform plan" to determine the differences, and we can change the state file to reflect the current realities (even if that's a bit of a dirty thing to do - and incredibly tedious when there are 40+ DynamoDB tables in an environment).

If you have any insight into how to handle a "multi-tool" environment, then I would certainly like to know, because I/we haven't cracked that problem.

I'll try to update the PR to exclude 3) and look at your other comments.

@radeksimko
Member

@hngkr Friendly ping. Do you mind updating the PR as mentioned, so we can merge it?

@radeksimko
Member

I wanted to pick this up, so I checked the documentation and I don't see MISSING as an expected status of an ECS service (http://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Service.html#ECS-Type-Service-status), and I was not able to reproduce the other bug you mentioned with Describe/List either.

The forum thread you mentioned only describes services kept in the DRAINING state, not MISSING. I have noticed that bug myself a couple of times; that is also why we suggest using depends_on in the docs:

https://www.terraform.io/docs/providers/aws/r/ecs_service.html

[screenshot of the aws_ecs_service documentation note recommending depends_on]

While I appreciate the effort you have invested into this PR, I'm afraid I cannot confidently merge either of those two changes.

I will keep this open until the end of next week and then I'll close it, unless you give me a reason not to.

@radeksimko radeksimko closed this Feb 25, 2016
@ghost

ghost commented Apr 27, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 27, 2020
Labels
bug, provider/aws, waiting-response
3 participants