
ECS stability improvements #4326

Closed
wants to merge 1 commit into from

Conversation


@hngkr hngkr commented Dec 15, 2015

We've experienced issues with deleting ECS cluster services and ECS clusters. It's mostly due to peculiarities within AWS:

  1. It's possible to look up a service using DescribeServices(servicename, clustername) and get a "DRAINING" response even if the cluster is already deleted. The fix seems to be to use ListServices(clustername) and then check whether the service is gone before calling DescribeServices (see the sketch after this list).

  2. For the services, we've also seen the status go from DRAINING to MISSING, while the wait condition in Terraform has the target INACTIVE before it exits. That made the deletion of the service hang even when the service was already gone.

  3. We've seen issues with deleting a cluster that has running services, or services that have DesiredCount > 0 (for instance, if services have been started in the cluster from outside Terraform). I acknowledge that this might be controversial - and a corner case regarding the philosophy of "not deleting anything that isn't started by Terraform itself".
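
To make point 1 concrete, here is a rough sketch of the kind of check I mean, using the same aws-sdk-go ECS client the resource already uses. The helper name and variables are made up, and ListServices pagination is omitted:

// serviceStillListed reports whether the service ARN still shows up in
// ListServices for the cluster, so we don't trust a stale "DRAINING"
// answer from DescribeServices after the cluster/service is gone.
// Pagination of ListServices is omitted for brevity.
func serviceStillListed(conn *ecs.ECS, clusterName, serviceArn string) (bool, error) {
    out, err := conn.ListServices(&ecs.ListServicesInput{
        Cluster: aws.String(clusterName),
    })
    if err != nil {
        return false, err
    }
    for _, arn := range out.ServiceArns {
        if aws.StringValue(arn) == serviceArn {
            return true, nil
        }
    }
    return false, nil
}

The delete path would only call DescribeServices (and wait on its status) when this returns true; otherwise the service can be treated as already removed.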

@radeksimko
Member

Hi @hngkr ,

  1. It's possible to look up a service using DescribeServices(servicename, clustername) and get a "DRAINING" response even if the cluster is already deleted. The fix seems to be to use ListServices(clustername) and then check whether the service is gone before calling DescribeServices.

Have you also tried reporting this back to AWS? I wouldn't say that calling ecs:ListServices instead of ecs:DescribeServices is a solution; it's rather a workaround, and we should also be pushing Amazon to fix it. I'm OK with merging this as a short-term solution, but before we do so, I want to be sure it has been reported, so that we can remove this workaround at some point.

  2. For the services, we've also seen the status go from DRAINING to MISSING, while the wait condition in Terraform has the target INACTIVE before it exits. That made the deletion of the service hang even when the service was already gone.

Good catch! I'm totally happy to merge this change (see the sketch at the end of this comment for the wait pattern in question).

  3. We've seen issues with deleting a cluster that has running services, or services that have DesiredCount > 0 (for instance, if services have been started in the cluster from outside Terraform). I acknowledge that this might be controversial - and a corner case regarding the philosophy of "not deleting anything that isn't started by Terraform itself".

I think the philosophy you mentioned is quite an essential part of Terraform and I'd like to keep it that way. The preferred approach is to error out or wait until services have been drained.
If you choose to use two or more tools, each managing a different part of your infrastructure, you will likely be choosing between tools that don't stamp on each other's toes, and I'd like Terraform to be one of those choices, hence I'm not inclined to merge this part of the PR.

I have noticed recently that some ECS acceptance tests are intermittently failing because ECS services launched by Terraform take time to drain, and that's something we can definitely fix (I'm working on that in a separate PR).
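
For anyone following along, the wait in question follows the helper/resource StateChangeConf pattern used elsewhere in the provider; a rough sketch is below. This is illustrative only, not the exact code in the resource: conn, clusterName and serviceArn are placeholders, the timeout is arbitrary, and "MISSING" is the synthetic state the refresh function would have to report when DescribeServices returns no matching service.

// Illustrative only: wait for the ECS service to be deleted, treating a
// service that has disappeared as terminal instead of hanging until the
// status reaches "INACTIVE".
func waitForServiceDeleted(conn *ecs.ECS, clusterName, serviceArn string) error {
    wait := resource.StateChangeConf{
        Pending: []string{"ACTIVE", "DRAINING"},
        Target:  []string{"INACTIVE", "MISSING"},
        Timeout: 10 * time.Minute,
        Refresh: func() (interface{}, string, error) {
            out, err := conn.DescribeServices(&ecs.DescribeServicesInput{
                Cluster:  aws.String(clusterName),
                Services: []*string{aws.String(serviceArn)},
            })
            if err != nil {
                return nil, "", err
            }
            if len(out.Services) == 0 {
                // The service no longer exists; nothing left to wait for.
                return out, "MISSING", nil
            }
            return out, aws.StringValue(out.Services[0].Status), nil
        },
    }
    _, err := wait.WaitForState()
    return err
}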

// Check if it's not already gone
resp, err := conn.DescribeServices(&ecs.DescribeServicesInput{
-	Services: []*string{aws.String(d.Id())},
 	Cluster: aws.String(d.Get("cluster").(string)),
+	Services: []*string{aws.String(serviceName)},
@radeksimko
Member

ARNs are AFAIK the most specific IDs and may also help us while debugging issues related to region or account ID - imagine you used the wrong AWS credentials and then wondered why the ECS service is/isn't there or why it has different settings.

For those reasons I'd be personally inclined to keep ARNs where possible, especially in logs.
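
For illustration (region and account ID made up), a service ARN like arn:aws:ecs:us-east-1:123456789012:service/sample-webapp carries the region and the account ID that a bare service name drops - which is exactly the context you want in a log line when something looks off.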

@radeksimko radeksimko added the waiting-response label Dec 17, 2015
@hngkr
Author

hngkr commented Dec 17, 2015

Hi Radek,

  1. It's possible to look up a service using DescribeServices(servicename, clustername) and get a "DRAINING" response even if the cluster is already deleted. The fix seems to be to use ListServices(clustername) and then check whether the service is gone before calling DescribeServices.

Have you also tried reporting this back to AWS? I wouldn't say that calling ecs:ListServices instead of ecs:DescribeServices is a solution; it's rather a workaround, and we should also be pushing Amazon to fix it. I'm OK with merging this as a short-term solution, but before we do so, I want to be sure it has been reported, so that we can remove this workaround at some point.

I haven't reported it. I found a forum post that seemed to indicate that I might not be the only one, and then I went around fixing it instead. I believe that it was this post: https://forums.aws.amazon.com/thread.jspa?messageID=621911

  2. For the services, we've also seen the status go from DRAINING to MISSING, while the wait condition in Terraform has the target INACTIVE before it exits. That made the deletion of the service hang even when the service was already gone.

Good catch! I'm totally happy to merge this change.

  3. We've seen issues with deleting a cluster that has running services, or services that have DesiredCount > 0 (for instance, if services have been started in the cluster from outside Terraform). I acknowledge that this might be controversial - and a corner case regarding the philosophy of "not deleting anything that isn't started by Terraform itself".

I think the philosophy you mentioned is quite an essential part of Terraform and I'd like to keep it that way. The preferred approach is to error out or wait until services have been drained.
If you choose to use two or more tools, each managing a different part of your infrastructure, you will likely be choosing between tools that don't stamp on each other's toes, and I'd like Terraform to be one of those choices, hence I'm not inclined to merge this part of the PR.

I have noticed recently that some ECS acceptance tests are intermittently failing because ECS services launched by Terraform take time to drain, and that's something we can definitely fix (I'm working on that in a separate PR).

I certainly respect and understand your point.

In this specific case, we've actually made an application that runs as an ECS service and can create and start new ECS services in the same cluster. It works pretty well - except when bringing the cluster down. I acknowledge that it's certainly a corner case and that we probably should just move the fix to our own custom Terraform provider (which I just figured out how to do).

I'm probably side-tracking, but I would like to describe the environment and processes that we're working within. Often, we're more on the Dev than the Ops side of things. Setting up the system is only the first step, and Terraform works fine for that. Running the systems in production usually means transferring the day-to-day responsibility to operators, who'll do "operations" stuff in reaction to customer usage.

Ops duties might include tuning values for read/write capacity on DynamoDB tables, upping the minimum number of instances in an autoscaling group in anticipation of a spike in usage, and a lot more - I'm sure you get the point. If Ops didn't do this, then production would be affected. At our site, we can't really expect the Ops guys to start loving (or just learning) Terraform either.

The problem then arises when we as infrastructure developers have to come back and extend the system with new components - and re-run Terraform on production systems that might have deviated, even if only in a non-structural, strictly "tunable parameters" way. We can run "terraform plan" to determine the differences, and we can change the state file to reflect the current realities (even if that's a bit of a dirty thing to do - and incredibly tedious when there are 40+ DynamoDB tables in an environment).

If you have any insight into how to handle a "multi-tool" environment, then I would certainly like to know, because I/we haven't cracked that problem.

I'll try to update the PR to exclude 3) and look at your other comments.

@radeksimko
Member

@hngkr Friendly ping. Do you mind updating the PR as mentioned, so we can merge it?

@radeksimko
Member

I wanted to pick this up, so I checked the documentation and I don't see MISSING as an expected status of an ECS service (http://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Service.html#ECS-Type-Service-status), and I was not able to reproduce the other bug you mentioned with Describe/List either.

The forum thread you mentioned only describes services kept in the DRAINING state, not MISSING. I have noticed that bug myself a couple of times; that is also why we suggest using depends_on in the docs:

https://www.terraform.io/docs/providers/aws/r/ecs_service.html

[screenshot of the aws_ecs_service documentation note recommending depends_on]

While I appreciate the effort you have invested into this PR, I'm afraid I cannot confidently merge either of those two changes.

I will keep this open until the end of next week and then I'll close it, unless you give me a reason not to.

@radeksimko radeksimko closed this Feb 25, 2016
@ghost

ghost commented Apr 27, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 27, 2020
Labels
bug, provider/aws, waiting-response
3 participants